Sample Prep & Sequencing: Data Release
Sequencing Data Quality Control
All sequence data undergo a QC protocol before they are released. For exomes, this includes an assessment of:
- Total reads – exome completion typically requires a minimum of 40 million paired-end 75 bp reads
- Library complexity – the ratio of unique reads to total reads mapped to target
- Capture efficiency – the ratio of reads mapped to target versus reads mapped to human
- Coverage distribution – 90% of the exome target covered at 8X is required for completion
- Capture uniformity
- Raw error rates
- Transition/Transversion ratio (Ti/Tv) – typically ~3 for known sites and ~2.5 for novel sites
- Distribution of known and novel variants relative to dbSNP – typically < 7% of variants are novel relative to dbSNP build 138 in samples of European ancestry
- Fingerprint concordance > 99%
- Sample homozygosity and heterozygosity
- Sample contamination
All QC metrics for both single-lane and merged data are reviewed by a sequence data analyst to identify deviations from known or historical norms.
Lanes/samples that fail sequencing data QC are flagged in the system and, depending on the QC issue, can be re-queued for library prep (< 5% failure) or for further sequencing (< 2% failure).
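As an illustration of two of the variant-level metrics listed above, the following Python sketch computes a Ti/Tv ratio and a novel-variant fraction. The function names and the (ref, alt) input format are assumptions made for this example; it is not the sequencing center's actual QC implementation.

```python
# Illustrative sketch only: function names and input format are assumptions.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """Return True if ref->alt stays within purines or within pyrimidines."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snvs):
    """Compute transitions / transversions from an iterable of (ref, alt) pairs.

    Entries that are not single-base A/C/G/T substitutions are skipped.
    """
    ti = 0
    tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue
        if ref not in "ACGT" or alt not in "ACGT":
            continue
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; Ti/Tv is undefined")
    return ti / tv


def novel_fraction(n_known, n_novel):
    """Fraction of variants that are absent from the reference database."""
    total = n_known + n_novel
    if total == 0:
        raise ValueError("no variants supplied")
    return n_novel / total
```

For example, `ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")])` counts two transitions and one transversion. Computing the ratio separately for known and novel sites would allow comparison against the typical values of ~3 and ~2.5 quoted above.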
Requirements for Passing Sequencing Data QC:
Exome completion is defined as having > 90% of the exome target at > 8X coverage and > 80% of the exome target at > 20X coverage. Typically this yields a mean target coverage of 50-60X.
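The completion rule above can be sketched in Python as follows. This is a minimal example under stated assumptions: the input is a hypothetical list of per-base depths over the exome target, and a base at exactly the threshold depth is counted as covered, which is one possible reading of the boundary.

```python
# Illustrative sketch only: input format and boundary handling are assumptions.


def fraction_at_or_above(depths, threshold):
    """Fraction of target bases whose depth is at least `threshold`."""
    if not depths:
        raise ValueError("no depth values supplied")
    covered = sum(1 for d in depths if d >= threshold)
    return covered / len(depths)


def exome_complete(depths):
    """Apply the two-part rule: > 90% of target at 8X and > 80% at 20X."""
    return (
        fraction_at_or_above(depths, 8) > 0.90
        and fraction_at_or_above(depths, 20) > 0.80
    )


def mean_coverage(depths):
    """Mean depth across the target bases."""
    if not depths:
        raise ValueError("no depth values supplied")
    return sum(depths) / len(depths)
```

A sample with 95% of target bases at 8X or more but only 70% at 20X or more would fail this check, because both conditions must hold.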
All projects will have data returned in the following file types:
- bam file (*.bam) – a binary SAM file containing the sequence alignment data for a single sample, created using BWA
- bam index file (*.bam.bai) – the index file for the corresponding bam file
- Multi-sample vcf file (.vcf) – contains all samples in a project, called at every site that is variant in at least one of the samples, as generally recommended by GATK best practices. For more information on VCF format go to: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
- Sample ID Lookup table (.xls) – a cross-reference between the UW-CMG LIMS IDs generated at the sequencing center and the investigator sample IDs. It also lists the family information for the project.
- Genotype data – not all projects will have high-density genotyping data generated. If the project was genotyped, the investigator will receive a PLINK-formatted file containing the genotypes. For more information on PLINK, go to: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml
- SeattleSeq annotation files – for additional information on this annotation, see the SeattleSeq website (http://snp.gs.washington.edu/SeattleSeqAnnotation138/)
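As a starting point for working with the multi-sample vcf file, the sketch below reads the sample IDs from the `#CHROM` header line and translates them to investigator IDs. The VCF column layout follows the format specification; the function names and the dictionary standing in for the Sample ID Lookup table are hypothetical, since the lookup table's actual columns are not described here.

```python
# Illustrative sketch only: helper names and the lookup dict are hypothetical.


def vcf_sample_ids(lines):
    """Return the sample IDs from the #CHROM header line of a VCF.

    `lines` is any iterable of text lines, such as an open file handle.
    The first nine tab-separated columns are fixed by the VCF format
    (CHROM through FORMAT); everything after them is a sample column.
    """
    for line in lines:
        if line.startswith("#CHROM"):
            fields = line.rstrip("\r\n").split("\t")
            return fields[9:]
    raise ValueError("no #CHROM header line found")


def to_investigator_ids(sample_ids, lookup):
    """Map sequencing-center IDs to investigator IDs.

    IDs missing from `lookup` are returned unchanged so that nothing
    is silently dropped.
    """
    return [lookup.get(sample_id, sample_id) for sample_id in sample_ids]
```

With an uncompressed file, the first function could be called as `vcf_sample_ids(open(path))`, where `path` is the location of the delivered .vcf file.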
Additional analysis for projects is available at an hourly rate.