http://choppy.3steps.cn/renluyao/HCC_Select_HF_mutations.git
(1) Separate the SNVs and Indels
vcftools --gzvcf sample.filtered.normed.vcf.gz --remove-indels --recode --recode-INFO-all --out sample.filtered.normed.snv
vcftools --gzvcf sample.filtered.normed.vcf.gz --keep-only-indels --recode --recode-INFO-all --out sample.filtered.normed.indel
rtg bgzip sample.filtered.normed.snv.vcf
rtg index sample.filtered.normed.snv.vcf.gz
rtg bgzip sample.filtered.normed.indel.vcf
rtg index sample.filtered.normed.indel.vcf.gz
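For readers who want to see the rule behind the split, here is a minimal Python sketch. It assumes the VCFtools definition of an indel (any ALT allele whose length differs from REF), so multi-nucleotide substitutions stay on the SNV side, as they do with --remove-indels. The function names are illustrative and are not part of this repository.

```python
def is_indel(ref, alts):
    """True if any ALT allele changes the length of REF."""
    return any(len(alt) != len(ref) for alt in alts)


def split_records(vcf_lines):
    """Split VCF data lines into (non-indel, indel) lists; header lines are skipped."""
    snvs, indels = [], []
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        (indels if is_indel(ref, alts) else snvs).append(line)
    return snvs, indels
```

Symbolic alleles such as <DEL> are not handled here; the sketch only covers normalized records with explicit REF and ALT sequences.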
(2) Separate training and testing VCF files
rtg vcffilter -i sample.filtered.normed.snv.vcf.gz -o sample.filtered.normed.snv.train.vcf.gz --include-vcf=all.filtered.vcf.gz
rtg vcffilter -i sample.filtered.normed.snv.vcf.gz -o sample.filtered.normed.snv.test.vcf.gz --exclude-vcf=all.filtered.vcf.gz
rtg vcffilter -i sample.filtered.normed.indel.vcf.gz -o sample.filtered.normed.indel.train.vcf.gz --include-vcf=all.filtered.vcf.gz
rtg vcffilter -i sample.filtered.normed.indel.vcf.gz -o sample.filtered.normed.indel.test.vcf.gz --exclude-vcf=all.filtered.vcf.gz
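The split above can be pictured as a set-membership test: variants also present in all.filtered.vcf.gz go to the training set, the rest go to the testing set. The sketch below is a simplification. It assumes an exact match on (CHROM, POS, REF, ALT), whereas rtg vcffilter decides by overlap with the variants in the other file, so results may differ for overlapping indels. The function names are hypothetical.

```python
def variant_key(line):
    """Identify a variant by chromosome, position, REF and ALT."""
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return (chrom, int(pos), ref, alt)


def split_train_test(sample_lines, reference_lines):
    """Return (train, test): sample records found / not found in the reference set."""
    reference = {variant_key(l) for l in reference_lines if not l.startswith("#")}
    train, test = [], []
    for line in sample_lines:
        if line.startswith("#"):
            continue
        (train if variant_key(line) in reference else test).append(line)
    return train, test
```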
(3) Extract information from the VCF files
gzip -d sample.filtered.normed.snv.train.vcf.gz
gzip -d sample.filtered.normed.snv.test.vcf.gz
gzip -d sample.filtered.normed.indel.train.vcf.gz
gzip -d sample.filtered.normed.indel.test.vcf.gz
python extract_vcf_information.py -i sample.filtered.normed.snv.train.vcf -o sample.filtered.normed.snv.train.txt
python extract_vcf_information.py -i sample.filtered.normed.snv.test.vcf -o sample.filtered.normed.snv.test.txt
python extract_vcf_information.py -i sample.filtered.normed.indel.train.vcf -o sample.filtered.normed.indel.train.txt
python extract_vcf_information.py -i sample.filtered.normed.indel.test.vcf -o sample.filtered.normed.indel.test.txt
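The contents of extract_vcf_information.py are not shown here, so the following is only a rough sketch of what flattening a VCF record into named fields could look like. Which INFO and FORMAT fields the real script keeps is not stated; the INFO_ and FORMAT_ prefixes and the function names are assumptions made for illustration.

```python
def parse_info(info):
    """Turn 'DP=30;MQ=60.0;DB' into {'DP': '30', 'MQ': '60.0', 'DB': True}."""
    out = {}
    if info in ("", "."):
        return out
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True  # flag fields carry no value
    return out


def parse_record(line):
    """Flatten one single-sample VCF data line into a dict of features."""
    f = line.rstrip("\n").split("\t")
    record = {"CHROM": f[0], "POS": int(f[1]), "REF": f[3], "ALT": f[4],
              "QUAL": f[5], "FILTER": f[6]}
    record.update({"INFO_" + k: v for k, v in parse_info(f[7]).items()})
    if len(f) > 9:  # FORMAT keys paired with the first sample column
        pairs = zip(f[8].split(":"), f[9].split(":"))
        record.update({"FORMAT_" + k: v for k, v in pairs})
    return record
```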
(4) Extract the information from bam-readcount
## get bed
# snv
cat sample.filtered.normed.snv.train.vcf | sed '/^#/d' | awk '{print $1"\t"$2"\t"$2"\t"$4"\t"$5}' > sample.filtered.normed.snv.train.bed
cat sample.filtered.normed.snv.test.vcf | sed '/^#/d' | awk '{print $1"\t"$2"\t"$2"\t"$4"\t"$5}' > sample.filtered.normed.snv.test.bed
# indel
python bed_for_bamReadcount.py -i sample.filtered.normed.indel.train.vcf -o sample.filtered.normed.indel.train
python bed_for_bamReadcount.py -i sample.filtered.normed.indel.test.vcf -o sample.filtered.normed.indel.test
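The SNV site list built by the awk commands above uses the same position for start and end, because bam-readcount expects 1-based, tab-separated chromosome/start/end regions. A Python equivalent of that SNV step is sketched below; the function name is illustrative. The indel conversion done by bed_for_bamReadcount.py is not shown in this document and is not reproduced here.

```python
def snv_bed_rows(vcf_lines):
    """Build 'chrom, pos, pos, ref, alt' rows from VCF data lines."""
    rows = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        rows.append("\t".join([chrom, pos, pos, ref, alt]))
    return rows
```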
## bam-readcount
# snv
bam-readcount -f reference.fa -l sample.filtered.normed.snv.train.bed sample.bam -b 20 > sample.filtered.normed.snv.train.readcount
bam-readcount -f reference.fa -l sample.filtered.normed.snv.test.bed sample.bam -b 20 > sample.filtered.normed.snv.test.readcount
# indel
bam-readcount -f reference.fa -l sample.filtered.normed.indel.train.bed sample.bam -i -b 20 > sample.filtered.normed.indel.train.readcount
bam-readcount -f reference.fa -l sample.filtered.normed.indel.test.bed sample.bam -i -b 20 > sample.filtered.normed.indel.test.readcount
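Each line of bam-readcount output holds the chromosome, position, reference base and depth, followed by one colon-separated block per allele. The sketch below parses one such line, assuming the per-allele field order described in the bam-readcount README (base followed by 13 metrics). The function name and the dictionary layout are illustrative choices.

```python
FIELDS = [
    "count", "avg_mapping_quality", "avg_basequality", "avg_se_mapping_quality",
    "num_plus_strand", "num_minus_strand", "avg_pos_as_fraction",
    "avg_num_mismatches_as_fraction", "avg_sum_mismatch_qualities",
    "num_q2_containing_reads", "avg_distance_to_q2_start_in_q2_reads",
    "avg_clipped_length", "avg_distance_to_effective_3p_end",
]


def parse_readcount_line(line):
    """Parse one bam-readcount line into a dict with per-allele metrics."""
    cols = line.rstrip("\n").split("\t")
    site = {"chrom": cols[0], "pos": int(cols[1]), "ref": cols[2],
            "depth": int(cols[3]), "alleles": {}}
    for entry in cols[4:]:
        base, *values = entry.split(":")
        # Metrics are stored as floats, keyed by the assumed field order.
        site["alleles"][base] = dict(zip(FIELDS, map(float, values)))
    return site
```

With -i, indel alleles appear as extra blocks whose base starts with + or -, so they land in the same "alleles" dictionary under keys such as "+AG" or "-T".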
(5) Parse bam-readcount information and combine with vcf information
# add the ALT allele to the bam-readcount output
awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.indel.train.bed sample.filtered.normed.indel.train.readcount > sample.filtered.normed.indel.train.bamReadcount
awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.indel.test.bed sample.filtered.normed.indel.test.readcount > sample.filtered.normed.indel.test.bamReadcount
awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.snv.train.bed sample.filtered.normed.snv.train.readcount > sample.filtered.normed.snv.train.bamReadcount
awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.snv.test.bed sample.filtered.normed.snv.test.readcount > sample.filtered.normed.snv.test.bamReadcount
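The awk join above keys both files on chromosome and position, then appends the ALT allele from the BED file as a last column. A Python sketch of the same join follows, together with a hypothetical helper, alt_fraction, that shows one way the appended ALT could be used afterwards. That helper is an assumption for illustration and only works when the ALT string matches an allele label in the readcount line, which is the case for SNVs.

```python
def add_alt(bed_lines, readcount_lines):
    """Append the BED ALT column to readcount lines sharing (chrom, pos)."""
    alt_by_site = {}
    for line in bed_lines:
        cols = line.rstrip("\n").split("\t")
        alt_by_site[(cols[0], cols[1])] = cols[4]  # last row wins, as in awk
    joined = []
    for line in readcount_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        key = (cols[0], cols[1])
        if key in alt_by_site:
            joined.append(line + "\t" + alt_by_site[key])
    return joined


def alt_fraction(joined_line):
    """Fraction of reads supporting ALT, or None if ALT is not reported."""
    cols = joined_line.split("\t")
    depth, alt = int(cols[3]), cols[-1]
    for entry in cols[4:-1]:
        base, count = entry.split(":")[:2]
        if base == alt and depth > 0:
            return int(count) / depth
    return None
```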
Novelty detection: https://scikit-learn.org/stable/modules/outlier_detection.html#outlier-detection
Tips on settings: https://scikit-learn.org/stable/modules/svm.html#svm-outlier-detection
(1) Train the SNV model
(2) Train the indel model
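The training code itself is not included here. Given the two scikit-learn links above, a plausible reading is a one-class SVM fitted on the training variants and then used to flag testing variants that look unlike them. The sketch below assumes that reading. The estimator choice, the feature scaling, the nu and gamma values, and the synthetic feature matrix are all assumptions for illustration, not the settings used in this project.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM


def train_novelty_model(features, nu=0.1):
    """Fit a scaled one-class SVM on training features only."""
    model = make_pipeline(
        StandardScaler(),
        OneClassSVM(kernel="rbf", gamma="scale", nu=nu),
    )
    model.fit(features)
    return model


# Synthetic stand-in for the per-variant feature table (rows = variants).
rng = np.random.default_rng(0)
train_features = rng.normal(size=(200, 4))
model = train_novelty_model(train_features)

# predict() returns +1 for points resembling the training set, -1 otherwise.
candidates = np.vstack([np.zeros((1, 4)), np.full((1, 4), 100.0)])
labels = model.predict(candidates)
```

One model would be fitted per variant type, for example one on the SNV training table and one on the indel training table, since the two tables carry different features.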
Add INFO fields indicating which platforms support each variant, and how many sequencing sites, replicates, mappers, and callers support it.
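As a rough illustration of the annotation described above, the sketch below summarizes the call sets that contain a variant into a single INFO string. The INFO key names (PLATFORM, N_SITE, N_REP, N_MAPPER, N_CALLER), the dictionary keys, and the rule that a replicate is counted per platform and site are all hypothetical; the actual tags are not defined in this document.

```python
def support_info(call_sets):
    """Build an INFO string from the call sets that contain a variant.

    Each call set is a dict with the keys platform, site, replicate,
    mapper and caller (a hypothetical layout).
    """
    platforms = sorted({c["platform"] for c in call_sets})
    sites = {c["site"] for c in call_sets}
    # A replicate is identified by platform + site + replicate label.
    replicates = {(c["platform"], c["site"], c["replicate"]) for c in call_sets}
    mappers = {c["mapper"] for c in call_sets}
    callers = {c["caller"] for c in call_sets}
    return ";".join([
        "PLATFORM=" + ",".join(platforms),
        "N_SITE=%d" % len(sites),
        "N_REP=%d" % len(replicates),
        "N_MAPPER=%d" % len(mappers),
        "N_CALLER=%d" % len(callers),
    ])
```

A string in this form could be appended to the INFO column directly, or written to a separate annotation file and merged with a tool such as vcfanno.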
https://github.com/brentp/vcfanno/
https://gatkforums.broadinstitute.org/wdl/discussion/6716/scatter-gather-parallelismq