从RNAseq数据中call突变
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

6 年之前
6 年之前
6 年之前
6 年之前
6 年之前
6 年之前
6 年之前
6 年之前
6 年之前
6 年之前
6 年之前
6 年之前
6 年之前
6 年之前
6 年之前
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134
  1. # RNAseq variant calling
  2. ####RNAseq variant calling pipeline:
  3. ![RNA vairant calling pipeline](./pictures/Screen Shot 2019-06-14 at 9.56.40 AM.png)
  4. **(1) Mapping to the reference genome (STAR)**
  5. Aligning RNAseq data to a reference genome is complicates by RNA splicing, [GATK RNAseq variant calling best practice](<https://software.broadinstitute.org/gatk/documentation/article.php?id=3891>) recommends [STAR](<https://github.com/alexdobin/STAR>).
  6. Generate genome index files:
  7. The STAR you use
  8. ```bash
  9. genomeDir=/path/to/GRCh38
  10. mkdir $genomeDir
  11. STAR --runMode genomeGenerate\
  12. --genomeDir $genomeDir\
  13. --genomeFastaFiles hg19.fa\
  14. --runThreadN <n>
  15. ```
  16. Mapping reads to the genome:
  17. The author recommend running STAR in the 2-pass mode for the most sensitive novel junction discovery. It does not increase the number of detected novel junctions, but allows to detect more splices reads mapping to novel junctions. The basic idea is to run 1st pass of STAR mapping with the usual paraments, then collect the junctions detected inthe first pass, and use them as "annotated" junctions for the 2nd pass mapping
  18. ```bash
  19. # 1. 1st pass
  20. runDir=/path/to/1pass
  21. mkdir $runDir
  22. cd $runDir
  23. STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>
  24. # For the 2-pass STAR, a new index is then created using splice junstion information contained inthe file SJ.out.tab from the first pass
  25. genomeDir=/path/to/hg19_2pass
  26. mkdir $genomeDir
  27. STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa \
  28. --sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>
  29. # 3. 2nd pass
  30. runDir=/path/to/2pass
  31. mkdir $runDir
  32. cd $runDir
  33. STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>
  34. ```
  35. **(2) Convert alignment output SAM to BAM, and add read groups ([picard](<https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_sam_AddOrReplaceReadGroups.php>)) and index bam ([samtools](<http://www.htslib.org/doc/samtools.html>))**
  36. ```bash
  37. java -jar picard.jar AddOrReplaceReadGroups I=star_output.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample
  38. # index bam and get .bai
  39. samtools index rg_added_sorted.bam
  40. ```
  41. **(3) Remove deplicates**
  42. ```bash
  43. sentieon driver -t NUMBER_THREADS -i SORTED_BAM \
  44. --algo LocusCollector --fun score_info SCORE.gz
  45. sentieon driver -t NUMBER_THREADS -i SORTED_BAM \
  46. --algo Dedup --rmdup --score_info SCORE.gz \
  47. --metrics DEDUP_METRIC_TXT DEDUPED_BAM
  48. ```
  49. **(4) Split reads at junction**
  50. This step slits the RNA reads into exon segments by getting rid of Ns while maintaining grouping information, and hard-clips any sequences overhanging into the intron regions. Additionally, the step will reassign the mapping qualities from STAR to be consistent with what is expected in subsequent steps by converting from quality 255 to 60.
  51. ```bash
  52. sentieon driver -t NUMBER_THREADS -r REFERENCE -i DEDUPED_BAM \
  53. --algo RNASplitReadsAtJunction --reassign_mapq 255:60 SPLIT_BAM
  54. ```
  55. **(5) Base quality score recalibration**
  56. ```bash
  57. sentieon driver -r $fasta -t $nt -i SPLIT_BAM --algo QualCal -k $dbsnp -k $known_Mills_indels -k $known_1000G_indels recal_data.table
  58. sentieon driver -r $fasta -t $nt -i SPLIT_BAM -q recal_data.table --algo QualCal -k $dbsnp -k $known_Mills_indels -k $known_1000G_indels recal_data.table.post
  59. sentieon driver -t $nt --algo QualCal --plot --before recal_data.table --after recal_data.table.post recal.csv
  60. sentieon plot QualCal -o recal_plots.pdf recal.csv
  61. ```
  62. **(6) Variant calling**
  63. ```bash
  64. sentieon driver -t NUMBER_THREADS -r REFERENCE -i SPLIT_BAM \
  65. -q RECAL_DATA.TABLE --algo Haplotyper --trim_soft_clip \
  66. --call_conf 20 --emit_conf 20 [-d dbSNP] VARIANT_VCF
  67. ```
  68. **(7) Variant Filtering**
  69. Hard filtration is applied, for GATK do not yet have the RNAseq training/truth resources that would be needed to run VQSR.
  70. - `-window 35 -cluster 2` Remove clusters of at least 3 SNPs that are within a window of 35 bases between them
  71. - `FS > 30.0` Fisher Strand values
  72. - `QD < 2.0`Qual By Depth values
  73. ```bash
  74. # GATK3, if you are using gatk4, please pay attention to the new arguments
  75. java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg_19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf
  76. ```
  77. **one sample running time**
  78. Less than 4 hours
  79. Settings:
  80. - Disksize 500
  81. - Cluster OnDemand ecs.sn1ne.8xlarge img-ubuntu-vpc
  82. | Module | Time |
  83. | :------------: | :-----: |
  84. | Mapping | 1h20min |
  85. | SamToBam | 1h |
  86. | indexBam | 5min |
  87. | Dedup | 13min |
  88. | SplitReads | 6min |
  89. | BQSR | 21min |
  90. | Haplotyper | 15min |
  91. | Hardfiltration | 3min |
  92. **Reference**
  93. 1. GATK best practice https://software.broadinstitute.org/gatk/documentation/article.php?id=3891
  94. 2. Sentieon manual https://support.sentieon.com/manual/RNA_call/rna/
  95. 3. STAR github https://github.com/alexdobin/STAR
  96. 4. Picard github https://github.com/broadinstitute/picard
  97. 5. Picard manual https://broadinstitute.github.io/picard/
  98. 6. samtools manual http://www.htslib.org/doc/samtools.html