# Integration of high-confidence sites

## I. Workflow

## II. Modules

### 1. Select SNVs and Indels supported by all callsets

<http://choppy.3steps.cn/renluyao/HCC_Select_HF_mutations.git>

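As a rough illustration of the selection rule, the sketch below keeps only variants whose (CHROM, POS, REF, ALT) key appears in every callset. The function names and the in-memory line lists are assumptions made for this example; the repository linked above may implement the step differently.

```python
from typing import Iterable, List, Set, Tuple

VariantKey = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


def variant_keys(vcf_lines: Iterable[str]) -> Set[VariantKey]:
    """Collect (CHROM, POS, REF, ALT) keys from VCF text lines, skipping headers."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


def variants_in_all_callsets(callsets: List[Iterable[str]]) -> Set[VariantKey]:
    """Return the variants present in every callset (empty set if there are none)."""
    key_sets = [variant_keys(lines) for lines in callsets]
    if not key_sets:
        return set()
    return set.intersection(*key_sets)


# Toy data (hypothetical): three callsets sharing one SNV and one deletion.
callset_a = ["##fileformat=VCFv4.2", "chr1\t100\t.\tA\tG", "chr1\t200\t.\tCT\tC", "chr2\t50\t.\tG\tT"]
callset_b = ["chr1\t100\t.\tA\tG", "chr1\t200\t.\tCT\tC"]
callset_c = ["chr1\t100\t.\tA\tG", "chr1\t200\t.\tCT\tC", "chr3\t7\t.\tT\tTA"]

shared = variants_in_all_callsets([callset_a, callset_b, callset_c])
print(sorted(shared))
```

Exact key matching only works when all callsets have been normalized the same way, so that equivalent indels share one representation.
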
### 2. Extract SNV and Indel information from VCF and BAM files

## 4. Extract SNV and Indel information

(1) Separate the SNVs and Indels

```bash
# Split the normalized VCF into an SNV-only file and an indel-only file.
vcftools --gzvcf sample.filtered.normed.vcf.gz --remove-indels --recode --recode-INFO-all --out sample.filtered.normed.snv
vcftools --gzvcf sample.filtered.normed.vcf.gz --keep-only-indels --recode --recode-INFO-all --out sample.filtered.normed.indel

# vcftools writes <prefix>.recode.vcf, so rename the outputs before compressing.
mv sample.filtered.normed.snv.recode.vcf sample.filtered.normed.snv.vcf
mv sample.filtered.normed.indel.recode.vcf sample.filtered.normed.indel.vcf

# Compress and index both files.
rtg bgzip sample.filtered.normed.snv.vcf
rtg index sample.filtered.normed.snv.vcf.gz
rtg bgzip sample.filtered.normed.indel.vcf
rtg index sample.filtered.normed.indel.vcf.gz
```

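The split rule can be sketched in a few lines of Python. This is only an illustration of the length-based definition that vcftools documents for `--remove-indels` and `--keep-only-indels` (a site counts as an indel when an ALT allele changes the length of REF). The function names are hypothetical, and symbolic alleles such as `<DEL>` or `*` are not handled.

```python
def is_indel(ref, alt_field):
    """True if any comma-separated ALT allele differs in length from REF."""
    return any(len(alt) != len(ref) for alt in alt_field.split(","))


def split_records(vcf_lines):
    """Split VCF lines into (header, same_length, indels) lists.

    The same_length list holds SNVs, but MNPs would land there as well.
    """
    header, same_length, indels = [], [], []
    for line in vcf_lines:
        if line.startswith("#"):
            header.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alt = fields[3], fields[4]
        (indels if is_indel(ref, alt) else same_length).append(line)
    return header, same_length, indels


# Toy records (hypothetical).
lines = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\tG",
    "chr1\t200\t.\tCT\tC",
    "chr1\t300\t.\tA\tG,AT",
]
header, snv_lines, indel_lines = split_records(lines)
```

Note that a multi-allelic site with one SNV allele and one insertion allele is treated as an indel site under this rule.
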
(2) Separate training and testing VCF files

```bash
# Training sets: variants that fall within the regions of all.filtered.vcf.gz.
rtg vcffilter -i sample.filtered.normed.snv.vcf.gz -o sample.filtered.normed.snv.train.vcf.gz --include-vcf=all.filtered.vcf.gz
rtg vcffilter -i sample.filtered.normed.indel.vcf.gz -o sample.filtered.normed.indel.train.vcf.gz --include-vcf=all.filtered.vcf.gz

# Test sets: the remaining variants.
rtg vcffilter -i sample.filtered.normed.snv.vcf.gz -o sample.filtered.normed.snv.test.vcf.gz --exclude-vcf=all.filtered.vcf.gz
rtg vcffilter -i sample.filtered.normed.indel.vcf.gz -o sample.filtered.normed.indel.test.vcf.gz --exclude-vcf=all.filtered.vcf.gz
```

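The train/test split above can be pictured as a partition by region overlap. The sketch below assumes that a variant "falls within" the reference VCF when the span of its REF allele overlaps the span of any variant in that file; this is a simplification for illustration, and the exact region logic of `rtg vcffilter` may differ. All names are hypothetical.

```python
from collections import defaultdict


def reference_span(pos, ref):
    """1-based inclusive span covered by the REF allele."""
    return pos, pos + len(ref) - 1


def partition_by_overlap(records, consensus):
    """Split (chrom, pos, ref, alt) records into (train, test) by overlap with consensus."""
    regions = defaultdict(list)
    for chrom, pos, ref, _alt in consensus:
        regions[chrom].append(reference_span(pos, ref))

    train, test = [], []
    for record in records:
        chrom, pos, ref, _alt = record
        start, end = reference_span(pos, ref)
        overlaps = any(start <= r_end and r_start <= end
                       for r_start, r_end in regions.get(chrom, []))
        (train if overlaps else test).append(record)
    return train, test


# Toy data (hypothetical).
consensus = [("chr1", 100, "A", "G"), ("chr1", 200, "CTT", "C")]
records = [
    ("chr1", 100, "A", "G"),   # same site as a consensus variant
    ("chr1", 201, "T", "G"),   # inside the span of the chr1:200 deletion
    ("chr1", 300, "G", "A"),   # no overlap
    ("chr2", 100, "A", "G"),   # different chromosome
]
train, test = partition_by_overlap(records, consensus)
```
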
(3) Extract the information from the VCF files

```bash
# Decompress the training and test VCF files.
gzip -d sample.filtered.normed.snv.train.vcf.gz
gzip -d sample.filtered.normed.snv.test.vcf.gz
gzip -d sample.filtered.normed.indel.train.vcf.gz
gzip -d sample.filtered.normed.indel.test.vcf.gz

# Extract the per-variant information into text tables.
python extract_vcf_information.py -i sample.filtered.normed.snv.train.vcf -o sample.filtered.normed.snv.train.txt
python extract_vcf_information.py -i sample.filtered.normed.snv.test.vcf -o sample.filtered.normed.snv.test.txt
python extract_vcf_information.py -i sample.filtered.normed.indel.train.vcf -o sample.filtered.normed.indel.train.txt
python extract_vcf_information.py -i sample.filtered.normed.indel.test.vcf -o sample.filtered.normed.indel.test.txt
```

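The contents of `extract_vcf_information.py` are not shown here, so the following is only a guess at the kind of flattening such a script performs: one VCF record becomes one flat row holding the fixed columns, the INFO entries, and the FORMAT values of the first sample. The function name and the `INFO_`/`FORMAT_` prefixes are assumptions made for this sketch.

```python
def parse_vcf_record(line):
    """Flatten one VCF data line into a dict (first sample only)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, qual, filt, info = fields[:8]
    row = {"CHROM": chrom, "POS": int(pos), "REF": ref, "ALT": alt,
           "QUAL": qual, "FILTER": filt}

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    for entry in info.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        row["INFO_" + key] = value if sep else True

    # FORMAT keys and sample values are ':'-separated and positionally paired.
    if len(fields) >= 10:
        for key, value in zip(fields[8].split(":"), fields[9].split(":")):
            row["FORMAT_" + key] = value
    return row


# Toy record (hypothetical).
record = "chr1\t100\trs1\tA\tG\t50.0\tPASS\tDP=30;MQ=60.0;DB\tGT:AD:DP\t0/1:14,16:30"
row = parse_vcf_record(record)
```

Values are kept as strings because the types of INFO and FORMAT fields are declared in the VCF header, which this sketch does not read.
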
(4) Extract the information from bam-readcount

```bash
## Build the site lists (BED-like: chrom, start, end, REF, ALT)

# SNVs
sed '/^#/d' sample.filtered.normed.snv.train.vcf | awk '{print $1"\t"$2"\t"$2"\t"$4"\t"$5}' > sample.filtered.normed.snv.train.bed
sed '/^#/d' sample.filtered.normed.snv.test.vcf | awk '{print $1"\t"$2"\t"$2"\t"$4"\t"$5}' > sample.filtered.normed.snv.test.bed

# Indels
python bed_for_bamReadcount.py -i sample.filtered.normed.indel.train.vcf -o sample.filtered.normed.indel.train
python bed_for_bamReadcount.py -i sample.filtered.normed.indel.test.vcf -o sample.filtered.normed.indel.test

## Run bam-readcount (-b 20: minimum base quality; -i: insertion-centric mode)

# SNVs (note the space before ">": "20>" would be parsed as a file-descriptor redirect)
bam-readcount -f reference.fa -l sample.filtered.normed.snv.train.bed -b 20 sample.bam > sample.filtered.normed.snv.train.readcount
bam-readcount -f reference.fa -l sample.filtered.normed.snv.test.bed -b 20 sample.bam > sample.filtered.normed.snv.test.readcount

# Indels
bam-readcount -f reference.fa -l sample.filtered.normed.indel.train.bed -i -b 20 sample.bam > sample.filtered.normed.indel.train.readcount
bam-readcount -f reference.fa -l sample.filtered.normed.indel.test.bed -i -b 20 sample.bam > sample.filtered.normed.indel.test.readcount
```

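For readers less familiar with awk, the SNV site-list step corresponds roughly to the Python sketch below: drop header lines and emit chrom, start, end, REF, ALT with start equal to end. The function name is hypothetical, and the indel conversion done by `bed_for_bamReadcount.py` is not reproduced because that script is not shown.

```python
def vcf_to_site_list(vcf_lines):
    """Convert VCF data lines to tab-separated 'chrom, pos, pos, REF, ALT' rows."""
    rows = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines, like sed '/^#/d'
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        rows.append("\t".join([chrom, pos, pos, ref, alt]))
    return rows


# Toy input (hypothetical).
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=30",
    "chr2\t2500\trs9\tC\tT\t99\tPASS\tDP=41",
]
site_rows = vcf_to_site_list(vcf_lines)
```
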
(5) Parse the bam-readcount information and combine it with the VCF information

```bash
# Append the ALT allele (column 5 of the BED file) to each bam-readcount line.
# The first file fills a lookup keyed on (chrom, pos); the second file is joined against it.

# Indels
awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.indel.train.bed sample.filtered.normed.indel.train.readcount > sample.filtered.normed.indel.train.bamReadcount
awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.indel.test.bed sample.filtered.normed.indel.test.readcount > sample.filtered.normed.indel.test.bamReadcount

# SNVs
awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.snv.train.bed sample.filtered.normed.snv.train.readcount > sample.filtered.normed.snv.train.bamReadcount
awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.snv.test.bed sample.filtered.normed.snv.test.readcount > sample.filtered.normed.snv.test.bamReadcount
```

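Once the ALT allele has been appended, each `.bamReadcount` line can be parsed into per-variant features. The sketch below assumes the per-base layout described in the bam-readcount README (`base:count:avg_mapping_quality:avg_basequality:...`) and reads only the first four values of each block; check the layout against the version you run. The function name and the returned keys are hypothetical, and for indels the lookup only works if the appended ALT uses the same `+SEQ`/`-SEQ` notation as bam-readcount.

```python
def parse_readcount_line(line):
    """Parse one bam-readcount line that has the ALT allele appended as its last column."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, depth = fields[0], int(fields[1]), fields[2], int(fields[3])
    alt = fields[-1]

    per_base = {}
    for block in fields[4:-1]:
        parts = block.split(":")
        per_base[parts[0]] = {
            "count": int(parts[1]),
            "avg_mapq": float(parts[2]),
            "avg_baseq": float(parts[3]),
        }

    alt_count = per_base.get(alt, {}).get("count", 0)
    return {
        "chrom": chrom, "pos": pos, "ref": ref, "alt": alt, "depth": depth,
        "alt_count": alt_count,
        "alt_fraction": alt_count / depth if depth else 0.0,
    }


# Toy line (hypothetical); `tail` stands in for the ten per-base values not used here.
tail = ":0.00:7:7:0.50:0.01:20.00:0:0.00:150.00:0.50"
example = "\t".join([
    "chr1", "100", "A", "30",
    "A:14:60.00:35.00" + tail,
    "G:16:59.50:34.00" + tail,
    "G",  # ALT appended by the awk join
])
features = parse_readcount_line(example)
```
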
## 5. Train machine-learning models on every call set

Novelty detection: <https://scikit-learn.org/stable/modules/outlier_detection.html#outlier-detection>

Example: <https://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html#sphx-glr-auto-examples-svm-plot-oneclass-py>

Setting tips: <https://scikit-learn.org/stable/modules/svm.html#svm-outlier-detection>

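A minimal sketch of novelty detection with scikit-learn's `OneClassSVM`, in the spirit of the pages linked above: fit on the training variants only, then label unseen variants as inliers (`+1`) or outliers (`-1`). The three synthetic features (depth, allele fraction, mapping quality), the scaling step, and the `nu`/`gamma` values are assumptions for illustration, not settings taken from this pipeline.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-variant features: depth, allele fraction, mapping quality.
center = [30.0, 0.5, 60.0]
spread = [5.0, 0.05, 1.0]
train_features = rng.normal(loc=center, scale=spread, size=(500, 3))
typical = rng.normal(loc=center, scale=spread, size=(50, 3))
atypical = np.array([[3.0, 0.05, 20.0], [200.0, 0.95, 10.0]])

# Scale the features first: the RBF kernel is sensitive to feature magnitudes.
# nu is an upper bound on the fraction of training points treated as outliers.
model = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
)
model.fit(train_features)

pred_typical = model.predict(typical)    # +1 = inlier, -1 = outlier
pred_atypical = model.predict(atypical)
print("fraction of typical variants accepted:", (pred_typical == 1).mean())
print("labels for atypical variants:", pred_atypical)
```

In this setting the model is trained on one class only, so SNVs and Indels would each need their own model fitted on their own training table.
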
(1) Train the SNV model

(2) Train the Indel model

## 6. Integrate the information

Add INFO fields that indicate which platforms support each variant and how many sequencing sites, replicates, mappers, and callers support it.

<https://github.com/brentp/vcfanno/>

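The support annotation described above amounts to counting distinct values across the callsets that contain a variant. The sketch below builds such an INFO string in plain Python; the `Callset` fields and the tag names (`PLATFORMS`, `N_SITES`, `N_REPLICATES`, `N_MAPPERS`, `N_CALLERS`) are hypothetical, and it does not show how vcfanno itself would be configured.

```python
from collections import namedtuple

# Hypothetical description of where a callset came from.
Callset = namedtuple("Callset", "platform site replicate mapper caller")


def support_info(supporting):
    """Build an INFO-style string summarising the callsets that support one variant."""
    platforms = sorted({c.platform for c in supporting})
    return ";".join([
        "PLATFORMS=" + ",".join(platforms),
        "N_SITES=%d" % len({c.site for c in supporting}),
        "N_REPLICATES=%d" % len({c.replicate for c in supporting}),
        "N_MAPPERS=%d" % len({c.mapper for c in supporting}),
        "N_CALLERS=%d" % len({c.caller for c in supporting}),
    ])


# Toy example (hypothetical labels): three callsets support the same variant.
supporting = [
    Callset("platformA", "site1", "rep1", "mapper1", "caller1"),
    Callset("platformA", "site2", "rep1", "mapper1", "caller2"),
    Callset("platformB", "site2", "rep2", "mapper2", "caller1"),
]
info = support_info(supporting)
print(info)
```
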
## 7. WDL

<https://gatkforums.broadinstitute.org/wdl/discussion/6716/scatter-gather-parallelism>