# Quality control of germline variant calling results using a Chinese Quartet family
> Author: Run Luyao
>
> E-mail: 18110700050@fudan.edu.cn
>
> Git: http://47.103.223.233/renluyao/quartet_dna_quality_control_wgs_big_pipeline
>
> Last updated: 2021/7/5
## Install
```
open-choppy-env
choppy install renluyao/quartet_dna_quality_control_big_pipeline
```
## Introduction to the Chinese Quartet DNA reference materials
With the rapid development of sequencing technology and the dramatic decrease in sequencing costs, DNA sequencing has been widely used in scientific research and in the diagnosis of, and treatment selection for, human diseases. However, because the generation and analysis of high-throughput omics data lack effective quality assessment and control, variant calling results are seriously inconsistent among technical replicates, batches, laboratories, sequencing platforms, and analysis pipelines. This leads to irreproducible scientific results and conclusions, a huge waste of resources, and even risks to the life and health of patients. Reference materials for quality control of the whole process, from omics data generation to data analysis, are therefore urgently needed.

We first established genomic DNA reference materials from four immortalized B-lymphoblastoid cell lines of a Chinese Quartet family, comprising the parents and monozygotic twin daughters, to assess the performance of germline variant calling. To establish small variant benchmark calls and regions, we generated whole-genome sequencing data in nine batches, with depths ranging from 30x to 60x, using PCR-free and PCR libraries on four popular short-read sequencing platforms (Illumina HiSeq XTen, Illumina NovaSeq, MGISEQ-2000, and DNBSEQ-T7) with three replicates in each batch, giving 108 libraries in total and 27 libraries for each Quartet DNA reference material. We then selected variants that were concordant across multiple call sets and Mendelian consistent among Quartet family members as small variant benchmark calls, resulting in 4.2 million high-confidence variants (SNVs and indels) and 2.66 Gb of high-confidence genomic regions, covering 87.8% of the human reference genome (GRCh38, chr1-22 and X). Two orthogonal technologies were used to verify the high-confidence variants: the consistency rate with PMRA (Axiom Precision Medicine Research Array) was 99.6%, and 95.9% of high-confidence variants were validated by 10X Genomics whole-genome sequencing data.

The genetic built-in truth of the Quartet family design is another kind of “truth” within the four Quartet samples. In addition to comparison with benchmark calls in the benchmark regions, which identifies false-positive and false-negative variants, pedigree information among the Quartet DNA reference materials, i.e., the reproducibility rate of variants between the twins and the Mendelian concordance rate among family members, provides a complementary way to comprehensively estimate genome-wide variant calling performance.

Finally, we developed a whole-genome sequencing data quality assessment pipeline and demonstrated its utility with two examples that use the Quartet reference materials and datasets to evaluate data generation performance in three sequencing labs and the performance of different data analysis pipelines.
## Software and parameters
![workflow](./pictures/workflow.png)
### 1. Pre-alignment QC
#### [FastQC](<https://www.bioinformatics.babraham.ac.uk/projects/fastqc/>) v0.11.5
[FastQC](<https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/>) is used to assess the quality of FASTQ files.
```bash
fastqc -t <threads> -o <output_directory> <fastq_file>
```
#### [FastQ Screen](<https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/>) 0.12.0
FastQ Screen is used to check whether a library is contaminated. For example, we expect about 99% of reads to align to the human genome and about 10% to align to the mouse genome, which is partly homologous to the human genome. If too many reads align to E. coli or yeast, the libraries or cell lines are probably contaminated.
```bash
fastq_screen --aligner <aligner> --conf <config_file> --top <number_of_reads> --threads <threads> <fastq_file>
```
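As a rough illustration of the contamination check described above, the sketch below flags genomes whose mapped-read percentage looks too high. The function name, the 1% cut-off, and the genome labels are assumptions made for this example; they are not values used by this pipeline or by FastQ Screen.

```python
def flag_contaminants(percent_mapped, max_percent=1.0, expected=("Human", "Mouse")):
    """Return the unexpected genomes whose mapped-read percentage exceeds max_percent.

    percent_mapped: dict of genome label -> % of reads aligned to that genome.
    max_percent:    hypothetical cut-off; choose a value suited to your own data.
    expected:       genomes where a high percentage is normal for human samples.
    """
    return sorted(
        genome
        for genome, percent in percent_mapped.items()
        if genome not in expected and percent > max_percent
    )


# Illustrative numbers only, not real results.
example = {"Human": 99.0, "Mouse": 10.0, "EColi": 0.1, "Yeast": 3.5}
print(flag_contaminants(example))
```

Human and mouse are excluded from the check because a high mouse percentage is expected from homology alone.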
### 2. Post-alignment QC
#### [Qualimap](<http://qualimap.bioinfo.cipf.es/>) 2.0.0
Qualimap is used to check the quality of BAM files.
```bash
qualimap bamqc -bam <bam_file> -outformat PDF:HTML -nt <threads> -outdir <output_directory> --java-mem-size=32G
```
### 3. Variant calling QC
![performance](./pictures/performance.png)
#### 3.1 Performance assessment based on reference datasets
#### [hap.py](<https://github.com/Illumina/hap.py>) v0.3.9
```bash
hap.py <truth_vcf> <query_vcf> -f <bed_file> --threads <threads> -o <output_filename>
```
#### 3.2 Performance assessment based on Quartet genetic built-in truth
#### [Mendelian Concordance Rate](https://github.com/sbg/VBT-TrioAnalysis) (vbt v1.1)
We split the Quartet family into two trios (F7, M8, D5 and F7, M8, D6) and perform Mendelian analysis on each. A Quartet Mendelian concordant variant has the same genotype in both twins (D5 and D6) and follows Mendelian inheritance from the parents (F7 and M8). The Mendelian concordance rate is the number of Mendelian concordant variants divided by the total number of variants detected in a Quartet family.
```bash
vbt mendelian -ref <fasta_file> -mother <family_merged_vcf> -father <family_merged_vcf> -child <family_merged_vcf> -pedigree <ped_file> -outDir <output_directory> -out-prefix <output_directory_prefix> --output-violation-regions -thread-count <threads>
```
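The definition above can be sketched in a few lines of Python. This is a simplified illustration, not how VBT works internally: it assumes unphased diploid genotypes given as pairs of allele indices (for example `(0, 1)`), ignores missing calls, and does not handle the hemizygous X chromosome in the father. All function names are made up for this example.

```python
from itertools import product


def trio_consistent(father, mother, child):
    """True if the child's genotype can be formed from one paternal and one maternal allele."""
    child_alleles = sorted(child)
    return any(sorted((f, m)) == child_alleles for f, m in product(father, mother))


def quartet_concordant(f7, m8, d5, d6):
    """True if the twins share a genotype and both trios (F7, M8, D5 and F7, M8, D6) are consistent."""
    return (
        sorted(d5) == sorted(d6)
        and trio_consistent(f7, m8, d5)
        and trio_consistent(f7, m8, d6)
    )


def mendelian_concordance_rate(variants):
    """Concordant variants divided by all detected variants; 0.0 when there are none."""
    variants = list(variants)
    if not variants:
        return 0.0
    concordant = sum(quartet_concordant(*genotypes) for genotypes in variants)
    return concordant / len(variants)


# Each entry is (F7, M8, D5, D6); the genotypes are illustrative only.
example_variants = [
    ((0, 1), (0, 0), (0, 1), (0, 1)),  # concordant
    ((0, 0), (0, 0), (0, 1), (0, 1)),  # twins agree, but allele 1 is absent from both parents
    ((0, 1), (0, 1), (1, 1), (0, 1)),  # twins disagree
]
print(mendelian_concordance_rate(example_variants))
```

Here one of the three example variants is concordant, so the rate is one third.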
## Input files
```bash
choppy samples renluyao/quartet_dna_quality_control_wgs_big_pipeline-latest --output samples
```
#### Samples CSV file
#### 1. Start from FASTQ files
```bash
sample_id,project,fastq_1_D5,fastq_2_D5,fastq_1_D6,fastq_2_D6,fastq_1_F7,fastq_2_F7,fastq_1_M8,fastq_2_M8
# sample_id in the choppy system
# project name
# OSS path of the D5 FASTQ read 1 file
# OSS path of the D5 FASTQ read 2 file
# OSS path of the D6 FASTQ read 1 file
# OSS path of the D6 FASTQ read 2 file
# OSS path of the F7 FASTQ read 1 file
# OSS path of the F7 FASTQ read 2 file
# OSS path of the M8 FASTQ read 1 file
# OSS path of the M8 FASTQ read 2 file
```
#### 2. Start from VCF files
```bash
sample_id,project,vcf_D5,vcf_D6,vcf_F7,vcf_M8
# sample_id in the choppy system
# project name
# OSS path of the D5 VCF file
# OSS path of the D6 VCF file
# OSS path of the F7 VCF file
# OSS path of the M8 VCF file
```
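Before submitting, it can help to check which of the two layouts a samples CSV follows. The sketch below compares the header row with the two column lists shown above. It assumes the columns must appear in exactly that order, which this README does not state, and `detect_input_mode` is a hypothetical helper rather than part of choppy.

```python
import csv
import io

SAMPLES = ("D5", "D6", "F7", "M8")
# Column layouts copied from the two templates above.
FASTQ_HEADER = ["sample_id", "project"] + [
    f"fastq_{read}_{sample}" for sample in SAMPLES for read in (1, 2)
]
VCF_HEADER = ["sample_id", "project"] + [f"vcf_{sample}" for sample in SAMPLES]


def detect_input_mode(csv_text):
    """Return "fastq" or "vcf" according to the header row; raise ValueError otherwise."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    header = [column.strip() for column in header]
    if header == FASTQ_HEADER:
        return "fastq"
    if header == VCF_HEADER:
        return "vcf"
    raise ValueError(f"unrecognised samples CSV header: {header}")


print(detect_input_mode("sample_id,project,vcf_D5,vcf_D6,vcf_F7,vcf_M8\n"))
```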
## Output Files
#### 1. extract_tables.wdl/extract_tables_vcf.wdl
(FASTQ) Pre-alignment QC: pre_alignment.txt
(FASTQ) Post-alignment QC: post_alignment.txt
(FASTQ/VCF) Variants calling QC: variants.calling.qc.txt
#### 2. quartet_mendelian.wdl
(FASTQ/VCF) Mendelian concordance rate: mendelian.txt
## Results and interpretation
#### 1. pre_alignment.txt
| Column name | Description |
| ------------------------- | ------------------------------------ |
| Sample | Sample name; names ending in R1 are read 1, names ending in R2 are read 2 |
| %Dup | % Duplicate reads |
| %GC | Average % GC content |
| Total Sequences (million) | Total sequences |
| %Human | % of reads aligned to the human genome |
| %EColi | % of reads aligned to the E. coli genome |
| %Adapter | % of reads aligned to adapter sequences |
| %Vector | % of reads aligned to vector sequences |
| %rRNA | % of reads aligned to rRNA sequences |
| %Virus | % of reads aligned to viral genomes |
| %Yeast | % of reads aligned to the yeast genome |
| %Mitoch | % of reads aligned to mitochondrial sequences |
| %No hits | % of reads not aligned to any of the genomes above |
#### 2. post_alignment.txt
| Column name | Description |
| --------------------- | --------------------------------------------- |
| Sample | Sample name |
| %Mapping | % mapped reads |
| %Mismatch Rate | Mapping error rate |
| Mendelian Insert Size | Median insert size (bp) |
| %Q20 | % bases >Q20 |
| %Q30 | % bases >Q30 |
| Mean Coverage | Mean deduped coverage |
| Median Coverage | Median deduped coverage |
| PCT_1X | Fraction of genome with at least 1x coverage |
| PCT_5X | Fraction of genome with at least 5x coverage |
| PCT_10X | Fraction of genome with at least 10x coverage |
| PCT_30X | Fraction of genome with at least 30x coverage |
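
To make the PCT_1X to PCT_30X columns concrete, the sketch below computes the fraction of positions covered at or above a given depth from a list of per-position depths. It only illustrates the definition in the table; the pipeline takes these values from its QC tools, and the function name and toy depths are assumptions for this example.

```python
def fraction_at_least(depths, threshold):
    """Fraction of positions whose depth is at least `threshold`; 0.0 for empty input."""
    depths = list(depths)
    if not depths:
        return 0.0
    return sum(depth >= threshold for depth in depths) / len(depths)


# Toy per-position depths, illustrative only.
depths = [0, 1, 5, 10, 30, 40, 3, 12]
for threshold in (1, 5, 10, 30):
    print(f"PCT_{threshold}X", fraction_at_least(depths, threshold))
```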
#### 3. variants.calling.qc.txt
| Column name | Description |
| --------------- | ------------------------------ |
| Sample | Sample name |
| SNV number | Number of SNVs detected |
| INDEL number | Number of INDELs detected |
| SNV query | Number of SNVs within the high-confidence genomic regions |
| INDEL query | Number of INDELs within the high-confidence genomic regions |
| SNV TP | True-positive SNVs |
| INDEL TP | True-positive INDELs |
| SNV FP | False-positive SNVs |
| INDEL FP | False-positive INDELs |
| SNV FN | False-negative SNVs |
| INDEL FN | False-negative INDELs |
| SNV precision | SNV precision relative to the benchmark set |
| INDEL precision | INDEL precision relative to the benchmark set |
| SNV recall | SNV recall relative to the benchmark set |
| INDEL recall | INDEL recall relative to the benchmark set |
| SNV F1 | SNV F1-score relative to the benchmark set |
| INDEL F1 | INDEL F1-score relative to the benchmark set |
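
The precision, recall, and F1 columns are related to the TP, FP, and FN counts by the standard definitions, sketched below. This is a simplification that uses a single TP count, as in the table; hap.py reports truth-side and query-side TP counts separately, so the values in the pipeline output may differ slightly from this calculation. The function name and the example counts are assumptions for this illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only, not real results.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```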
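#### 4. mendelian.txt
| Column name | Description |
| ----------------------------- | ------------------------------------------------------------ |
| Family | Family name. The current design has four Quartet samples with three technical replicates each; family_1 is the family unit formed by the four rep1 samples, and so on. |
| Total_Variants | Total number of variant sites detected across the four Quartet samples |
| Mendelian_Concordant_Variants | Number of variant sites consistent with Mendelian inheritance |
| Mendelian_Concordance_Quartet | Proportion of variant sites consistent with Mendelian inheritance |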