Sample Prep & Sequencing: Data Release

Sequencing Data Quality Control

All sequence data undergo a QC protocol before they are released. For exomes, this includes an assessment of:

  • Total reads – exome completion typically requires a minimum of 40 million paired-end 75 bp reads
  • Library complexity – the ratio of unique reads to total reads mapped to target
  • Capture efficiency – the ratio of reads mapped to target to reads mapped to the human genome
  • Coverage distribution – > 90% of the exome target at > 8X coverage required for completion
  • Capture uniformity
  • Raw error rates
  • Transition/Transversion ratio (Ti/Tv) – typically ~3 for known sites and ~2.5 for novel sites
  • Distribution of known and novel variants relative to dbSNP – typically < 7% novel using dbSNP build 138 in samples of European ancestry
  • Fingerprint concordance > 99%
  • Sample homozygosity and heterozygosity
  • Sample contamination
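
Two of the metrics above, the Ti/Tv ratio and the fraction of novel variants, reduce to simple counts. The following is a minimal Python sketch of how they could be computed, not the center's actual pipeline. The function names `ti_tv_ratio` and `novel_fraction` and their input shapes (pairs of single-base REF/ALT alleles, and one boolean per variant indicating dbSNP membership) are assumptions made for illustration.

```python
# Illustrative sketch only; names and input formats are hypothetical.
BASES = set("ACGT")
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snvs):
    """Return transitions / transversions for an iterable of (ref, alt) pairs.

    Pairs that are not single-base substitutions (indels, identical alleles,
    non-ACGT symbols) are skipped.
    """
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("nan")


def novel_fraction(in_dbsnp_flags):
    """Return the fraction of variants NOT present in dbSNP.

    `in_dbsnp_flags` holds one boolean per variant (True = known in dbSNP).
    """
    flags = list(in_dbsnp_flags)
    if not flags:
        return float("nan")
    return sum(1 for known in flags if not known) / len(flags)
```

For example, `ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")])` counts two transitions and one transversion, and `novel_fraction([True, True, True, False])` reports one novel variant out of four.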

All QC metrics for both single-lane and merged data are reviewed by a sequence data analyst to identify deviations from known or historical norms.

Lanes/samples that fail sequencing data QC are flagged in the system and can be re-queued for library prep (< 5% failure rate) or further sequencing (< 2% failure rate), depending on the QC issue.

Requirements for Passing Sequencing Data QC:

Exome completion is defined as having > 90% of the exome target at > 8X coverage and > 80% of the exome target at > 20X coverage. This typically yields a mean target coverage of 50-60X.
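
The completion rule can be expressed as a short check over per-base depths across the exome target. The sketch below is an illustration under stated assumptions, not the center's implementation: the function names are hypothetical, the input is assumed to be one integer depth per target base, and the thresholds are applied as strict inequalities because the text writes them with ">".

```python
# Illustrative sketch only; function names and input format are hypothetical.

def fraction_above(depths, min_depth):
    """Return the fraction of target bases with depth strictly above min_depth."""
    if not depths:
        raise ValueError("depths must not be empty")
    return sum(1 for d in depths if d > min_depth) / len(depths)


def exome_complete(depths):
    """Apply the stated rule: > 90% of target at > 8X and > 80% at > 20X."""
    return fraction_above(depths, 8) > 0.90 and fraction_above(depths, 20) > 0.80


# Toy example: 100 target bases, not real data.
passing = [30] * 85 + [10] * 10 + [2] * 5   # 95% above 8X, 85% above 20X
failing = [30] * 70 + [10] * 25 + [2] * 5   # 95% above 8X, only 70% above 20X
print(exome_complete(passing), exome_complete(failing))
```

If the intended thresholds are inclusive (at least 8X and 20X), the comparison `d > min_depth` would become `d >= min_depth`.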

Data Release

All projects will have data returned in the following file types:

  • BAM file (*.bam) – a binary SAM file containing the sequence alignment data for a single sample, created using BWA
  • BAM index file (*.bam.bai) – the index file for the corresponding BAM file
  • Multi-sample VCF file (*.vcf) – contains calls for all samples in a project at every site that is variant in at least one of the samples, as generally recommended by GATK best practices. For more information on the VCF format, go to: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
  • Sample ID lookup table (*.xls) – a cross-reference between the UW-CMG LIMS IDs generated at the sequencing center and the investigator sample IDs. It also lists the family information for the project.
  • Genotype data – not all projects will have high-density genotyping data generated. If the project was genotyped, the investigator will receive a PLINK-formatted file containing the genotypes. For more information on PLINK, go to: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml
  • SeattleSeq annotation files – for additional information on this annotation, see the SeattleSeq website (http://snp.gs.washington.edu/SeattleSeqAnnotation138/)
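
When working with the multi-sample VCF and the sample ID lookup table together, a first step is often to list the sample columns in the VCF and confirm that each has an entry in the lookup. The sketch below uses only the Python standard library and relies on the VCF convention that sample names follow the FORMAT column on the `#CHROM` header line. The function names, the example IDs, and the assumption that the lookup table has already been loaded into a dictionary keyed by the ID used in the VCF are all hypothetical.

```python
import io

# Illustrative sketch only; function names and example IDs are hypothetical.

def vcf_sample_ids(handle):
    """Return the sample names from the #CHROM header line of a VCF stream."""
    for line in handle:
        if line.startswith("#CHROM"):
            fields = line.rstrip("\r\n").split("\t")
            # Columns 1-8 are fixed, column 9 is FORMAT, samples follow.
            return fields[9:]
    raise ValueError("no #CHROM header line found")


def unmapped_samples(sample_ids, lookup):
    """Return the VCF sample IDs that have no entry in the lookup mapping."""
    return [s for s in sample_ids if s not in lookup]


# Toy VCF text, not real data.
example_vcf = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tLIMS_001\tLIMS_002\n"
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\n"
)
samples = vcf_sample_ids(io.StringIO(example_vcf))
lookup = {"LIMS_001": "investigator_A"}  # assumed to come from the lookup table
print(samples, unmapped_samples(samples, lookup))
```

For a file on disk, the same function can be called with `open(path)` in place of `io.StringIO`; a compressed VCF would need to be opened with `gzip.open(path, "rt")` instead.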

Additional analysis for projects is available at an hourly rate.