選択できるのは25トピックまでです。 トピックは、先頭が英数字で、英数字とダッシュ('-')を使用した35文字以内のものにしてください。

3年前
1年前
1年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
1年前
3年前
1年前
3年前
1年前
3年前
1年前
3年前
1年前
1年前
3年前
1年前
3年前
1年前
3年前
3年前
1年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
3年前
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266
  1. # Quality control of germline variants calling results using a Chinese Quartet family
  2. > Author: Ren Luyao, Chen Haonan
  3. >
  4. > E-mail:18110700050@fudan.edu.cn, haonanchen0815@163.com
  5. >
  6. > Git: http://choppy.3steps.cn/chenhaonan/quartet_dna_quality_control_wgs_big_pipeline.git
  7. >
  8. > Last Updates: 2023/7/20
  9. ## Install
  10. ```
  11. open-choppy-env
  12. choppy install chenhaonan/quartet_dna_quality_control_big_pipeline
  13. ```
  14. ## Introduction
  15. ###Chinese Quartet DNA reference materials
  16. With the rapid development of sequencing technology and the dramatic decrease of sequencing costs, DNA sequencing has been widely used in scientific research, diagnosis of and treatment selection for human diseases. However, due to the lack of effective quality assessment and control of the high-throughput omics data generation and analysis processes, variants calling results are seriously inconsistent among different technical replicates, batches, laboratories, sequencing platforms, and analysis pipelines, resulting in irreproducible scientific results and conclusions, huge waste of resources, and even endangering the life and health of patients. Therefore, reference materials for quality control of the whole process from omics data generation to data analysis are urgently needed.
  17. We first established genomic DNA reference materials from four immortalized B-lymphoblastoid cell lines of a Chinese Quartet family including parents and monozygotic twin daughters to make performance assessment of germline variants calling results. To establish small variant benchmark calls and regions, we generated whole-genome sequencing data in nine batches, with depth ranging from 30x to 60x, by employing PCR-free and PCR libraries on four popular short-read sequencing platforms (Illumina HiSeq XTen, Illumina NovaSeq, MGISEQ-2000, and DNBSEQ-T7) with three replicates at each batch, resulting in 108 libraries in total and 27 libraries for each Quartet DNA reference material. Then, we selected variants concordant in multiple call sets and in Mendelian consistency within Quartet family members as small variant benchmark calls, resulting in 4.2 million high-confidence variants (SNV and Indel) and 2.66 G high confidence genomic region, covering 87.8% of the human reference genome (GRCh38, chr1-22 and X). Two orthogonal technologies were used for verifying the high-confidence variants. The consistency rate with PMRA (Axiom Precision Medicine Research Array) was 99.6%, and 95.9% of high-confidence variants were validated by 10X Genomics whole-genome sequencing data. Genetic built-in truth of the Quartet family design is another kind of “truth” within the four Quartet samples. Apart from comparison with benchmark calls in the benchmark regions to identify false-positive and false-negative variants, pedigree information among the Quartet DNA reference materials, i.e., reproducibility rate of variants between the twins and Mendelian concordance rate among family members, are complementary approaches to comprehensively estimate genome-wide variants calling performance. Finally, we developed a whole-genome sequencing data quality assessment pipeline and demonstrated its utilities with two examples of using the Quartet reference materials and datasets to evaluate data generation performance in three sequencing labs and different data analysis pipelines.
  18. ### Quality control pipeline for WGS
  19. This Quartet quality control pipeline evaluate the performance of reads quality and variant calling quality. This pipeline accepts FASTQ format input files or VCF format input files. If the users input FASTQ files, this APP will output the results of pre-alignment quality control from FASTQ files, post-alignment quality control from BAM files and variant calling quality control from VCF files. [GATK best practice pipelines](https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-) (implemented by [SENTIEON software](https://support.sentieon.com/manual/)) were used to map reads to the reference genome and call variants. If the users input VCF files, this APP will only output the results of variant calling quality control.
  20. Quartet quality control analysis pipeline started from FASTQ files is implemented across seven main procedures:
  21. - Pre-alignment QC of FASTQ files
  22. - Genome alignment
  23. - Post-alignment QC of BAM files
  24. - Germline variant calling
  25. - Variant calling QC depended on benchmark sets of VCF files
  26. - Check Mendelian ingeritance states across four Quartet samples of every variants
  27. - Variant calling QC depended on Quartet genetic relationship of VCF files
  28. Quartet quality control analysis pipeline started from VCF files is implemented across three main procedures:
  29. - Variant calling QC depended on benchmark sets of VCF files
  30. - Check Mendelian ingeritance states across four Quartet samples of every variants
  31. - Variant calling QC depended on Quartet genetic relationship of VCF files
  32. ![workflow](./pictures/workflow.png)
  33. Results generated from this APP can be visualized by Choppy report.
  34. ## Data Processing Steps
  35. ### 1. Pre-alignment QC of FASTQ files
  36. #### [Fastqc](<https://www.bioinformatics.babraham.ac.uk/projects/fastqc/>) v0.11.5
  37. [FastQC](<https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/>) is used to investigate the quality of fastq files
  38. ```bash
  39. fastqc -t <threads> -o <output_directory> <fastq_file>
  40. ```
  41. #### [Fastq Screen](<https://www.bioinformatics.babraham.ac.uk/projects/fastq_screen/>) 0.12.0
  42. Fastq Screen is used to inspect whether the library were contaminated. For example, we expected 99% reads aligned to human genome, 10% reads aligned to mouse genome, which is partly homologous to human genome. If too many reads are aligned to E.Coli or Yeast, libraries or cell lines are probably comtminated.
  43. ```bash
  44. fastq_screen --aligner <aligner> --conf <config_file> --top <number_of_reads> --threads <threads> <fastq_file>
  45. ```
  46. ### 2. Genome alignment
  47. Reads were mapped to the human reference genome GRCh38 using BWA-MEM.SAMTools is a tool used for SAM/BAM file conversion and BAM file sorting.
  48. ####[BWA-MEM](https://github.com/lh3/bwa):v0.7.17
  49. ```bash
  50. # Mapping to reference genome, converting sam to bam, sorting bam file
  51. bwa mem -M -R "@RG\tID:${group}\tSM:${sample}\tPL:${pl}" -t $(nproc) -K 10000000 ${ref_dir}/${fasta} ${fastq_1} ${fastq_2} | samtools view -bS -@ $(nproc) -
  52. | samtools sort -@ $(nproc) -o ${user_define_name}_${project}_${sample}.sorted.bam -
  53. ```
  54. ####[SAMTools](https://github.com/samtools/samtools):v1.17
  55. ```bash
  56. # Building an index for sorted bam file
  57. samtools index -@ $(nproc) -o ${user_define_name}_${project}_${sample}.sorted.bam.bai ${user_define_name}_${project}_${sample}.sorted.bam
  58. ```
  59. ### 3. Post-alignment QC
  60. Qualimap and Picard Tools are used to check the quality of BAM files. Deduplicated BAM files are used in this step.
  61. #### [Qualimap](<http://qualimap.bioinfo.cipf.es/>) 2.0.0
  62. ```bash
  63. # BAM QC by qualimap
  64. qualimap bamqc -bam <bam_file> -outformat PDF:HTML -nt <threads> -outdir <output_directory> --java-mem-size=32G
  65. ```
  66. ####[Picard](https://github.com/broadinstitute/picard/):v3.0.0
  67. ```bash
  68. # Remove duplicates
  69. java -jar /usr/local/picard.jar MarkDuplicates -I ${sorted_bam} -O ${sample}.sorted.deduped.bam -M ${sample}_dedup_metrics.txt --REMOVE_DUPLICATES
  70. # Building an index for the sorted and deduplicated bam file
  71. samtools index -@ $(nproc) -o ${sample}.sorted.deduped.bam.bai ${sample}.sorted.deduped.bam
  72. ```
  73. ### 4. Germline variant calling
  74. HaplotyperCaller implemented by Google DeepVariant is used to identify germline variants.
  75. #### [DeepVariant](<https://github.com/google/deepvariant>) 1.5.0
  76. ```bash
  77. # Calling variant
  78. deepvariant/bin/run_deepvariant --model_type=WGS --ref=${ref_dir}/${fasta} --reads=${recaled_bam} --output_vcf=${sample}_hc.vcf --num_shards=$(nproc)
  79. # Building an index for the vcf file
  80. gatk IndexFeatureFile -I ${sample}_hc.vcf -O ${sample}_hc.vcf.idx
  81. ```
  82. ### 5. Variants Calling QC
  83. ![performance](./pictures/performance.png)
  84. #### 5.1 Performance assessment based on benchmark sets
  85. #### [Hap.py](<https://github.com/Illumina/hap.py>) v0.3.9
  86. Variants were compared with benchmark calls in benchmark regions.
  87. ```bash
  88. hap.py <truth_vcf> <query_vcf> -f <bed_file> --threads <threads> -o <output_filename>
  89. ```
  90. #### 5.2 Performance assessment based on Quartet genetic built-in truth
  91. #### [VBT](https://github.com/sbg/VBT-TrioAnalysis) v1.1
  92. We splited the Quartet family to two trios (F7, M8, D5 and F7, M8, D6) and then do the Mendelian analysis. A Quartet Mendelian concordant variant is the same between the twins (D5 and D6) , and follow the Mendelian concordant between parents (F7 and M8). Mendelian concordance rate is the Mendelian concordance variant divided by total detected variants in a Quartet family. Only variants on chr1-22,X are included in this analysis.
  93. ```bash
  94. vbt mendelian -ref <fasta_file> -mother <family_merged_vcf> -father <family_merged_vcf> -child <family_merged_vcf> -pedigree <ped_file> -outDir <output_directory> -out-prefix <output_directory_prefix> --output-violation-regions -thread-count <threads>
  95. ```
  96. ## Input files
  97. ```bash
  98. choppy samples renluyao/quartet_dna_quality_control_wgs_big_pipeline-latest --output samples
  99. ```
  100. ####Samples CSV file
  101. #### 1. Start from Fastq files
  102. ```BASH
  103. sample_id,project,fastq_1_D5,fastq_2_D5,fastq_1_D6,fastq_2_D6,fastq_1_F7,fastq_2_F7,fastq_1_M8,fastq_2_M8
  104. # sample_id in choppy system
  105. # project name
  106. # oss path of D5 fastq read1 file
  107. # oss path of D5 fastq read2 file
  108. # oss path of D6 fastq read1 file
  109. # oss path of D6 fastq read2 file
  110. # oss path of F7 fastq read1 file
  111. # oss path of F7 fastq read2 file
  112. # oss path of M8 fastq read1 file
  113. # oss path of M8 fastq read2 file
  114. ```
  115. #### 2. Start from VCF files
  116. ```BASH
  117. sample_id,project,vcf_D5,vcf_D6,vcf_F7,vcf_M8
  118. # sample_id in choppy system
  119. # project name
  120. # oss path of D5 VCF file
  121. # oss path of D6 VCF file
  122. # oss path of F7 VCF file
  123. # oss path of M8 VCF file
  124. ```
  125. ## Output Files
  126. #### 1. extract_tables.wdl/extract_tables_vcf.wdl
  127. (FASTQ) Pre-alignment QC: pre_alignment.txt
  128. (FASTQ) Post-alignment QC: post_alignment.txt
  129. (FASTQ/VCF) Variants calling QC: variants.calling.qc.txt
  130. ####2. quartet_mendelian.wdl
  131. (FASTQ/VCF) Variants calling QC: mendelian.txt
  132. ## Output files format
  133. ####1. pre_alignment.txt
  134. | Column name | Description |
  135. | ------------------------- | --------------------------------------------------------- |
  136. | Sample | Sample name |
  137. | %Dup | Percentage duplicate reads |
  138. | %GC | Average GC percentage |
  139. | Total Sequences (million) | Total sequences |
  140. | %Human | Percentage of reads mapped to human genome |
  141. | %EColi | Percentage of reads mapped to Ecoli |
  142. | %Adapter | Percentage of reads mapped to adapter |
  143. | %Vector | Percentage of reads mapped to vector |
  144. | %rRNA | Percentage of reads mapped to rRNA |
  145. | %Virus | Percentage of reads mapped to virus |
  146. | %Yeast | Percentage of reads mapped to yeast |
  147. | %Mitoch | Percentage of reads mapped to mitochondrion |
  148. | %No hits | Percentage of reads not mapped to genomes mentioned above |
  149. #### 2. post_alignment.txt
  150. | Column name | Description |
  151. | --------------------- | --------------------------------------------- |
  152. | Sample | Sample name |
  153. | %Mapping | Percentage of mapped reads |
  154. | %Mismatch Rate | Mapping error rate |
  155. | Mendelian Insert Size | Median insert size(bp) |
  156. | %Q20 | Percentage of bases >Q20 |
  157. | %Q30 | Percentage of bases >Q30 |
  158. | Mean Coverage | Mean deduped coverage |
  159. | Median Coverage | Median deduped coverage |
  160. | PCT_1X | Fraction of genome with at least 1x coverage |
  161. | PCT_5X | Fraction of genome with at least 5x coverage |
  162. | PCT_10X | Fraction of genome with at least 10x coverage |
  163. | PCT_30X | Fraction of genome with at least 30x coverage |
  164. ####3. variants.calling.qc.txt
  165. | Column name | Description |
  166. | --------------- | ------------------------------------------------------------ |
  167. | Sample | Sample name |
  168. | SNV number | Total SNV number (chr1-22,X) |
  169. | INDEL number | Total INDEL number (chr1-22,X) |
  170. | SNV query | SNV number in benchmark region |
  171. | INDEL query | INDEL number in benchmark region |
  172. | SNV TP | True positive SNV |
  173. | INDEL TP | True positive INDEL |
  174. | SNV FP | False positive SNV |
  175. | INDEL FP | True positive INDEL |
  176. | SNV FN | False negative SNV |
  177. | INDEL FN | False negative INDEL |
  178. | SNV precision | Precision of SNV calls when compared with benchmark calls in benchmark regions |
  179. | INDEL precision | Precision of INDEL calls when compared with benchmark calls in benchmark regions |
  180. | SNV recall | Recall of SNV calls when compared with benchmark calls in benchmark regions |
  181. | INDEL recall | Recall of INDEL calls when compared with benchmark calls in benchmark regions |
  182. | SNV F1 | F1 score of SNV calls when compared with benchmark calls in benchmark regions |
  183. | INDEL F1 | F1 score of INDEL calls when compared with benchmark calls in benchmark regions |
  184. ####4 {project}.summary.txt
  185. | Column name | Description |
  186. | ----------------------------- | ----------------------------------------------------------- |
  187. | Family | Family name defined by inputed project name |
  188. | Reproducibility_D5_D6 | Percentage of variants were shared by the twins (D5 and D6) |
  189. | Mendelian_Concordance_Quartet | Percentage of variants were Mendelian concordance |