|
- # Integration of high-confidence sites
-
- ## I. Workflow
-
-
-
- ## II. Modules
-
- ### 1. Select SNVs and Indels supported by all callsets
-
- http://choppy.3steps.cn/renluyao/HCC_Select_HF_mutations.git
-
- ### 2. Extract SNV and Indel information from VCF and BAM files
-
-
-
- ## 4. Extract SNV and Indel information
-
- (1) Separate the SNVs and Indels
-
- ```bash
- vcftools --gzvcf sample.filtered.normed.vcf.gz --remove-indels --recode --recode-INFO-all --out sample.filtered.normed.snv
-
- vcftools --gzvcf sample.filtered.normed.vcf.gz --keep-only-indels --recode --recode-INFO-all --out sample.filtered.normed.indel
-
- rtg bgzip sample.filtered.normed.snv.vcf
- rtg index sample.filtered.normed.snv.vcf.gz
-
- rtg bgzip sample.filtered.normed.indel.vcf
- rtg index sample.filtered.normed.indel.vcf.gz
- ```
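
The SNV/indel split above relies on the vcftools definition of an indel: any variant that alters the length of the REF allele. A minimal Python sketch of that rule follows, for checking individual records by hand. The function name `classify_variant` is hypothetical, and symbolic alleles such as `<DEL>` or `*` are not handled.

```python
def classify_variant(ref, alt):
    """Classify a VCF record as 'snv' or 'indel' from its REF and ALT columns.

    A record counts as an indel when any ALT allele differs in length from REF,
    mirroring the length-based rule used by vcftools. Records that pass this
    check are called 'snv' here, although same-length multi-base substitutions
    (MNPs) would also land in that class.
    """
    alt_alleles = alt.split(",")  # multi-allelic sites list ALT alleles comma-separated
    if any(len(allele) != len(ref) for allele in alt_alleles):
        return "indel"
    return "snv"


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("A", "AT"), ("AT", "A"), ("A", "G,AT")]:
        print(ref, alt, classify_variant(ref, alt))
```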
-
- (2) Separate training and testing vcf files
-
- ```bash
- rtg vcffilter -i sample.filtered.normed.snv.vcf.gz -o sample.filtered.normed.snv.train.vcf.gz --include-vcf=all.filtered.vcf.gz
-
- rtg vcffilter -i sample.filtered.normed.snv.vcf.gz -o sample.filtered.normed.snv.test.vcf.gz --exclude-vcf=all.filtered.vcf.gz
-
- rtg vcffilter -i sample.filtered.normed.indel.vcf.gz -o sample.filtered.normed.indel.train.vcf.gz --include-vcf=all.filtered.vcf.gz
-
- rtg vcffilter -i sample.filtered.normed.indel.vcf.gz -o sample.filtered.normed.indel.test.vcf.gz --exclude-vcf=all.filtered.vcf.gz
- ```
-
- (3) Extract the information from VCF file
-
- ```bash
- gzip -d sample.filtered.normed.snv.train.vcf.gz
- gzip -d sample.filtered.normed.snv.test.vcf.gz
- gzip -d sample.filtered.normed.indel.train.vcf.gz
- gzip -d sample.filtered.normed.indel.test.vcf.gz
-
- python extract_vcf_information.py -i sample.filtered.normed.snv.train.vcf -o sample.filtered.normed.snv.train.txt
- python extract_vcf_information.py -i sample.filtered.normed.snv.test.vcf -o sample.filtered.normed.snv.test.txt
- python extract_vcf_information.py -i sample.filtered.normed.indel.train.vcf -o sample.filtered.normed.indel.train.txt
- python extract_vcf_information.py -i sample.filtered.normed.indel.test.vcf -o sample.filtered.normed.indel.test.txt
- ```
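
The script `extract_vcf_information.py` is not shown here. As a rough illustration of what extracting per-variant information from a VCF involves, the sketch below flattens the fixed columns and the INFO key/value pairs of each record into a dictionary. The function names and the `INFO_` key prefix are assumptions made for this example, not the behaviour of the actual script.

```python
def parse_vcf_line(line):
    """Turn one VCF data line into a flat dict of fixed columns and INFO keys."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, qual, flt, info = fields[:8]
    record = {
        "CHROM": chrom,
        "POS": int(pos),
        "REF": ref,
        "ALT": alt,
        "QUAL": qual,
        "FILTER": flt,
    }
    for entry in info.split(";"):
        if entry in ("", "."):
            continue  # empty INFO column
        key, sep, value = entry.partition("=")
        # Flag-type INFO keys (no '=') are stored as True.
        record["INFO_" + key] = value if sep else True
    return record


def iter_vcf_records(lines):
    """Yield parsed records, skipping header and blank lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)
```

A typical use would be `with open("sample.filtered.normed.snv.train.vcf") as fh: records = list(iter_vcf_records(fh))`, after which the dictionaries can be written out as a tab-separated table.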
-
- (4) Extract the information from bam-readcount
-
- ```bash
- ## get bed
- # snv
- cat sample.filtered.normed.snv.train.vcf | sed '/^#/d' | awk '{print $1"\t"$2"\t"$2"\t"$4"\t"$5}' > sample.filtered.normed.snv.train.bed
-
- cat sample.filtered.normed.snv.test.vcf | sed '/^#/d' | awk '{print $1"\t"$2"\t"$2"\t"$4"\t"$5}' > sample.filtered.normed.snv.test.bed
- #indel
- python bed_for_bamReadcount.py -i sample.filtered.normed.indel.train.vcf -o sample.filtered.normed.indel.train
-
- python bed_for_bamReadcount.py -i sample.filtered.normed.indel.test.vcf -o sample.filtered.normed.indel.test
-
- ## bam-readcount
- #snv
- bam-readcount -f reference.fa -l sample.filtered.normed.snv.train.bed sample.bam -b 20 > sample.filtered.normed.snv.train.readcount
-
- bam-readcount -f reference.fa -l sample.filtered.normed.snv.test.bed sample.bam -b 20 > sample.filtered.normed.snv.test.readcount
-
- #indel
- bam-readcount -f reference.fa -l sample.filtered.normed.indel.train.bed sample.bam -i -b 20 > sample.filtered.normed.indel.train.readcount
-
- bam-readcount -f reference.fa -l sample.filtered.normed.indel.test.bed sample.bam -i -b 20 > sample.filtered.normed.indel.test.readcount
- ```
-
- (5) Parse bam-readcount information and combine with vcf information
-
- ```bash
- # append the ALT allele from the BED file to each bam-readcount line
- awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.indel.train.bed sample.filtered.normed.indel.train.readcount > sample.filtered.normed.indel.train.bamReadcount
-
- awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.indel.test.bed sample.filtered.normed.indel.test.readcount > sample.filtered.normed.indel.test.bamReadcount
-
- awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.snv.train.bed sample.filtered.normed.snv.train.readcount > sample.filtered.normed.snv.train.bamReadcount
-
- awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.snv.test.bed sample.filtered.normed.snv.test.readcount > sample.filtered.normed.snv.test.bamReadcount
- ```
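
Each line of the `.bamReadcount` files produced above holds the bam-readcount columns (chromosome, position, reference base, depth, then one colon-separated entry per base starting with `base:count`) followed by the appended ALT allele. The sketch below shows one way to pull the ALT read count and an allele fraction out of such a line. The function name and the returned keys are hypothetical. For indels it assumes the appended ALT is written the way bam-readcount reports it (for example `+AC` or `-T`), which depends on how the BED file was built.

```python
def parse_readcount_line(line):
    """Parse one bam-readcount line that has the ALT allele appended as last column."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, ref, depth = cols[0], int(cols[1]), cols[2], int(cols[3])
    alt = cols[-1]

    # Per-base entries look like "G:4:60.00:..."; only base and count are used here.
    counts = {}
    for entry in cols[4:-1]:
        parts = entry.split(":")
        counts[parts[0]] = int(parts[1])

    alt_count = counts.get(alt, 0)
    return {
        "chrom": chrom,
        "pos": pos,
        "ref": ref,
        "alt": alt,
        "depth": depth,
        "alt_count": alt_count,
        # Fraction of reads supporting ALT; 0.0 when there is no coverage.
        "alt_fraction": alt_count / depth if depth else 0.0,
    }
```

The resulting dictionaries share `chrom` and `pos` with the VCF-derived table, so the two can be joined on those columns.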
-
-
-
- ## 5. Train machine-learning models on every call set
-
- Novelty detection <https://scikit-learn.org/stable/modules/outlier_detection.html#outlier-detection>
-
- Example <https://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html#sphx-glr-auto-examples-svm-plot-oneclass-py>
-
- Setting tips <https://scikit-learn.org/stable/modules/svm.html#svm-outlier-detection>
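
The novelty-detection approach linked above could be sketched with scikit-learn's `OneClassSVM` as follows. This is only an illustration under assumptions: the three features (depth, mapping quality, allele fraction), the synthetic data, and the `nu` value are made up for the example and are not the features or settings of this pipeline.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic "training" variants: columns are depth, mapping quality, allele fraction.
X_train = rng.normal(loc=[30.0, 60.0, 0.5], scale=[5.0, 2.0, 0.05], size=(500, 3))

# Synthetic "testing" variants: five similar to training plus one obvious outlier.
X_test = np.vstack([
    rng.normal(loc=[30.0, 60.0, 0.5], scale=[5.0, 2.0, 0.05], size=(5, 3)),
    [[3.0, 20.0, 0.05]],
])

# Scale features first, since the RBF kernel is sensitive to feature ranges.
# nu is an upper bound on the fraction of training points treated as outliers.
model = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
)
model.fit(X_train)

# predict() returns +1 for points that resemble the training set and -1 otherwise.
pred = model.predict(X_test)
print(pred)
```

In this setting the model is fitted on the training variants only, and testing variants predicted as `+1` are the ones that look like the training set.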
-
- (1) Train SNVs model
-
-
-
- (2) Train Indels model
-
-
-
- ## 6. Integrate the information
-
- Add INFO annotations that indicate which platforms support each variant, and how many sequencing sites, replicates, mappers, and callers support it.
-
- <https://github.com/brentp/vcfanno/>
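
Before writing such annotations with a tool like vcfanno, the support summary itself has to be computed. A minimal sketch of that bookkeeping follows, assuming each callset comes with a metadata dictionary and a set of variant keys. The function name, the metadata keys, and the `(chrom, pos, ref, alt)` tuple are hypothetical choices for this example.

```python
from collections import defaultdict

FIELDS = ("platform", "site", "replicate", "mapper", "caller")


def summarize_support(callsets):
    """Collect, per variant, the distinct metadata values of the callsets containing it.

    callsets: iterable of (metadata, variants) pairs, where metadata maps each
    name in FIELDS to a value and variants is a set of (chrom, pos, ref, alt) keys.
    """
    seen = defaultdict(lambda: {field: set() for field in FIELDS})
    for metadata, variants in callsets:
        for variant in variants:
            for field in FIELDS:
                seen[variant][field].add(metadata[field])
    # Sorted lists make the result stable and easy to serialise.
    return {
        variant: {field: sorted(values) for field, values in fields.items()}
        for variant, fields in seen.items()
    }
```

The number of supporting sites, replicates, mappers, or callers for a variant is then the length of the corresponding list, for example `len(summary[variant]["caller"])`.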
-
-
-
- ## 7. WDL
-
- <https://gatkforums.broadinstitute.org/wdl/discussion/6716/scatter-gather-parallelism>
-
|