You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

5 年之前
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126
  1. #高置信位点的整合
  2. ## 一、流程
  3. ## 二、模块
  4. ###1. 选择被所有callset支持的SNVs和Indels
  5. http://choppy.3steps.cn/renluyao/HCC_Select_HF_mutations.git
  6. ### 2.从VCF文件和bam文件中提取SNVs和Indel信息
  7. ## 4. Extract SNPs and Indels information
  8. (1) Separate the SNVs and Indels
  9. ```bash
  10. vcftools --gzvcf sample.filtered.normed.vcf.gz --remove-indels --recode --recode-INFO-all --out sample.filtered.normed.snv
  11. vcftools --gzvcf sample.filtered.normed.vcf.gz --keep-only-indels --recode --recode-INFO-all --out sample.filtered.normed.indel
  12. rtg bgzip sample.filtered.normed.snv.vcf
  13. rtg index sample.filtered.normed.snv.vcf.gz
  14. rtg bgzip sample.filtered.normed.indel.vcf
  15. rtg index sample.filtered.normed.indel.vcf.gz
  16. ```
  17. (2) Separate training and testing vcf files
  18. ```bash
  19. rtg vcffilter -i sample.filtered.normed.snv.vcf.gz -o sample.filtered.normed.snv.train.vcf.gz --include-vcf=all.filtered.vcf.gz
  20. rtg vcffilter -i sample.filtered.normed.snv.vcf.gz -o sample.filtered.normed.snv.test.vcf.gz --exclude-vcf=all.filtered.vcf.gz
  21. rtg vcffilter -i sample.filtered.normed.indel.vcf.gz -o sample.filtered.normed.indel.train.vcf.gz --include-vcf=all.filtered.vcf.gz
  22. rtg vcffilter -i sample.filtered.normed.indel.vcf.gz -o sample.filtered.normed.indel.test.vcf.gz --exclude-vcf=all.filtered.vcf.gz
  23. ```
  24. (3) Extract the information from VCF file
  25. ```bash
  26. gzip -d sample.filtered.normed.snv.train.vcf.gz
  27. gzip -d sample.filtered.normed.snv.test.vcf.gz
  28. gzip -d sample.filtered.normed.indel.train.vcf.gz
  29. gzip -d sample.filtered.normed.indel.test.vcf.gz
  30. python extract_vcf_information.py -i sample.filtered.normed.snv.train.vcf -o sample.filtered.normed.snv.train.txt
  31. python extract_vcf_information.py -i sample.filtered.normed.snv.test.vcf -o sample.filtered.normed.snv.test.txt
  32. python extract_vcf_information.py -i sample.filtered.normed.indel.train.vcf -o sample.filtered.normed.indel.train.txt
  33. python extract_vcf_information.py -i sample.filtered.normed.indel.train.vcf -o sample.filtered.normed.indel.train.txt
  34. ```
  35. (4)Extract the information from bam-readcount
  36. ```bash
  37. ## get bed
  38. # snv
  39. cat sample.filtered.normed.snv.train.vcf | sed '/^#/d' | awk '{print $1"\t"$2"\t"$2"\t"$4"\t"$5}' > sample.filtered.normed.snv.train.bed
  40. cat sample.filtered.normed.snv.test.vcf | sed '/^#/d' | awk '{print $1"\t"$2"\t"$2"\t"$4"\t"$5}' > sample.filtered.normed.snv.test.bed
  41. #indel
  42. python bed_for_bamReadcount.py -i sample.filtered.normed.indel.train.vcf -o sample.filtered.normed.indel.train
  43. python bed_for_bamReadcount.py -i sample.filtered.normed.indel.test.vcf -o sample.filtered.normed.indel.test
  44. ## bam-readcount
  45. #snv
  46. bam-readcount -f reference.fa -l sample.filtered.normed.snv.train.bed sample.bam -b 20> sample.filtered.normed.snv.train.readcount
  47. bam-readcount -f reference.fa -l sample.filtered.normed.snv.test.bed sample.bam -b 20 > sample.filtered.normed.snv.test.readcount
  48. #indel
  49. bam-readcount -f reference.fa -l sample.filtered.normed.indel.train.bed sample.bam -i -b 20 > sample.filtered.normed.indel.train.readcount
  50. bam-readcount -f reference.fa -l sample.filtered.normed.indel.test.bed sample.bam -i -b 20 > sample.filtered.normed.indel.test.readcount
  51. ```
  52. (5) Parse bam-readcount information and combine with vcf information
  53. ```bash
  54. # add alt in to bam-readcount
  55. awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.indel.train.bed sample.filtered.normed.indel.train.readcount > sample.filtered.normed.indel.train.bamReadcount
  56. awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.indel.test.bed sample.filtered.normed.indel.test.readcount > sample.filtered.normed.indel.test.bamReadcount
  57. awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.snv.train.bed sample.filtered.normed.snv.train.readcount > sample.filtered.normed.snv.train.bamReadcount
  58. awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.snv.test.bed sample.filtered.normed.snv.test.readcount > sample.filtered.normed.snv.test.bamReadcount
  59. ```
  60. ## 5. Train machine-learning models on every call sets
  61. Novelty detection <https://scikit-learn.org/stable/modules/outlier_detection.html#outlier-detection>
  62. Example <https://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html#sphx-glr-auto-examples-svm-plot-oneclass-py>
  63. Setting tips <https://scikit-learn.org/stable/modules/svm.html#svm-outlier-detection>
  64. (1) Train SNVs model
  65. (2) Train Indels model
  66. ## 6. Intergrate the information
  67. add info indicates which platform support the variants, how many sequencing sites, replicates, mapper and callers support the variants
  68. <https://github.com/brentp/vcfanno/>
  69. ## 7. WDL
  70. <https://gatkforums.broadinstitute.org/wdl/discussion/6716/scatter-gather-parallelism>q