# Integration of high-confidence variant sites

## I. Workflow


## II. Modules

### 1. Select SNVs and Indels supported by all call sets

http://choppy.3steps.cn/renluyao/HCC_Select_HF_mutations.git

### 2. Extract SNV and Indel information from VCF and BAM files


## 4. Extract SNV and Indel information

(1) Separate the SNVs and Indels

```bash
vcftools --gzvcf sample.filtered.normed.vcf.gz --remove-indels --recode --recode-INFO-all --out sample.filtered.normed.snv

vcftools --gzvcf sample.filtered.normed.vcf.gz --keep-only-indels  --recode --recode-INFO-all --out sample.filtered.normed.indel

rtg bgzip sample.filtered.normed.snv.vcf
rtg index sample.filtered.normed.snv.vcf.gz

rtg bgzip sample.filtered.normed.indel.vcf
rtg index sample.filtered.normed.indel.vcf.gz
```
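
As a rough illustration of what the two `vcftools` filters select, the sketch below classifies each record by comparing REF and ALT lengths. The function names `is_indel_site` and `split_records` are hypothetical and not part of this repository. The sketch assumes a site counts as an indel site when any ALT allele differs in length from REF, which should approximate the `vcftools` behaviour but is not a replacement for it.

```python
def is_indel_site(ref: str, alt: str) -> bool:
    """Return True if any ALT allele changes the length of REF (assumed indel rule)."""
    alleles = [a for a in alt.split(",") if a != "."]
    return any(len(a) != len(ref) for a in alleles)


def split_records(lines):
    """Split VCF data lines into (snv_lines, indel_lines); header lines are skipped."""
    snvs, indels = [], []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alt = fields[3], fields[4]  # VCF columns 4 and 5
        (indels if is_indel_site(ref, alt) else snvs).append(line)
    return snvs, indels
```

A multi-allelic site with one SNV allele and one indel allele lands in the indel group under this rule.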

(2) Separate training and testing vcf files

```bash
rtg vcffilter -i sample.filtered.normed.snv.vcf.gz -o sample.filtered.normed.snv.train.vcf.gz --include-vcf=all.filtered.vcf.gz

rtg vcffilter -i sample.filtered.normed.snv.vcf.gz -o sample.filtered.normed.snv.test.vcf.gz --exclude-vcf=all.filtered.vcf.gz

rtg vcffilter -i sample.filtered.normed.indel.vcf.gz -o sample.filtered.normed.indel.train.vcf.gz --include-vcf=all.filtered.vcf.gz

rtg vcffilter -i sample.filtered.normed.indel.vcf.gz -o sample.filtered.normed.indel.test.vcf.gz --exclude-vcf=all.filtered.vcf.gz
```
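
The split above puts calls that are also present in `all.filtered.vcf.gz` into the training set and the remaining calls into the testing set. A minimal sketch of that idea follows. It is a simplification: it matches variants exactly on (CHROM, POS, REF, ALT), whereas `rtg vcffilter` works on overlapping variants, and the names `variant_key` and `split_train_test` are hypothetical.

```python
def variant_key(line):
    """Build an exact-match key from the first five VCF columns."""
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return (chrom, int(pos), ref, alt)


def split_train_test(sample_lines, consensus_lines):
    """Return (train, test): sample records found / not found in the consensus set."""
    consensus = {
        variant_key(line) for line in consensus_lines if not line.startswith("#")
    }
    train, test = [], []
    for line in sample_lines:
        if line.startswith("#"):
            continue
        (train if variant_key(line) in consensus else test).append(line)
    return train, test
```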

(3) Extract the information from the VCF files

```bash
gzip -d sample.filtered.normed.snv.train.vcf.gz
gzip -d sample.filtered.normed.snv.test.vcf.gz
gzip -d sample.filtered.normed.indel.train.vcf.gz
gzip -d sample.filtered.normed.indel.test.vcf.gz

python extract_vcf_information.py -i sample.filtered.normed.snv.train.vcf -o sample.filtered.normed.snv.train.txt
python extract_vcf_information.py -i sample.filtered.normed.snv.test.vcf -o sample.filtered.normed.snv.test.txt
python extract_vcf_information.py -i sample.filtered.normed.indel.train.vcf -o sample.filtered.normed.indel.train.txt
python extract_vcf_information.py -i sample.filtered.normed.indel.test.vcf -o sample.filtered.normed.indel.test.txt
```
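
The source of `extract_vcf_information.py` is not shown here, so the following is only a sketch of the kind of per-record extraction such a script might perform. The function name `parse_vcf_record` and the `INFO_` / `FMT_` key prefixes are assumptions made for illustration; the sketch reads the first sample column only.

```python
def parse_vcf_record(line):
    """Flatten one VCF data line into a dict of site, INFO and sample fields."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, qual, flt, info = fields[:8]
    record = {
        "CHROM": chrom, "POS": int(pos), "REF": ref,
        "ALT": alt, "QUAL": qual, "FILTER": flt,
    }
    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    for item in info.split(";"):
        if item in ("", "."):
            continue
        key, sep, value = item.partition("=")
        record["INFO_" + key] = value if sep else True
    # FORMAT (column 9) names the ':'-separated values of the sample column.
    if len(fields) >= 10:
        for key, value in zip(fields[8].split(":"), fields[9].split(":")):
            record["FMT_" + key] = value
    return record
```

Prefixing the keys keeps fields such as `DP`, which can appear in both INFO and FORMAT, from overwriting each other.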

(4) Extract the information with bam-readcount

```bash
## get bed
# snv
cat sample.filtered.normed.snv.train.vcf | sed '/^#/d' | awk '{print $1"\t"$2"\t"$2"\t"$4"\t"$5}' > sample.filtered.normed.snv.train.bed

cat sample.filtered.normed.snv.test.vcf | sed '/^#/d' | awk '{print $1"\t"$2"\t"$2"\t"$4"\t"$5}' > sample.filtered.normed.snv.test.bed
#indel
python bed_for_bamReadcount.py -i sample.filtered.normed.indel.train.vcf -o sample.filtered.normed.indel.train

python bed_for_bamReadcount.py -i sample.filtered.normed.indel.test.vcf -o sample.filtered.normed.indel.test

## bam-readcount
#snv
bam-readcount -f reference.fa -l sample.filtered.normed.snv.train.bed sample.bam -b 20 > sample.filtered.normed.snv.train.readcount

bam-readcount -f reference.fa -l sample.filtered.normed.snv.test.bed sample.bam -b 20 > sample.filtered.normed.snv.test.readcount

#indel
bam-readcount -f reference.fa -l sample.filtered.normed.indel.train.bed sample.bam -i -b 20 > sample.filtered.normed.indel.train.readcount

bam-readcount -f reference.fa -l sample.filtered.normed.indel.test.bed sample.bam -i -b 20 > sample.filtered.normed.indel.test.readcount
```
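
Each line of the `.readcount` files holds the chromosome, position, reference base and depth, followed by one colon-separated entry per observed base or indel. The sketch below parses such a line. The metric names in `METRICS` follow the bam-readcount README as best recalled and should be checked against the installed version; `parse_readcount_line` is a hypothetical helper, not part of this repository.

```python
# Per-base metrics that follow the base symbol (assumed order; verify against
# the bam-readcount documentation for your version).
METRICS = (
    "count", "avg_mapping_quality", "avg_basequality", "avg_se_mapping_quality",
    "num_plus_strand", "num_minus_strand", "avg_pos_as_fraction",
    "avg_num_mismatches_as_fraction", "avg_sum_mismatch_qualities",
    "num_q2_containing_reads", "avg_distance_to_q2_start_in_q2_reads",
    "avg_clipped_length", "avg_distance_to_effective_3p_end",
)


def parse_readcount_line(line):
    """Parse one tab-separated bam-readcount output line into a nested dict."""
    chrom, pos, ref, depth, *per_base = line.rstrip("\n").split("\t")
    bases = {}
    for entry in per_base:
        base, *values = entry.split(":")
        # zip truncates silently if a version emits fewer or more fields.
        bases[base] = dict(zip(METRICS, (float(v) for v in values)))
    return {"chrom": chrom, "pos": int(pos), "ref": ref,
            "depth": int(depth), "bases": bases}
```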

(5) Parse bam-readcount information and combine with vcf information

```bash
# append the ALT allele (BED column 5) to each bam-readcount line
awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.indel.train.bed sample.filtered.normed.indel.train.readcount > sample.filtered.normed.indel.train.bamReadcount

awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.indel.test.bed sample.filtered.normed.indel.test.readcount > sample.filtered.normed.indel.test.bamReadcount

awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}'  OFS="\t" sample.filtered.normed.snv.train.bed sample.filtered.normed.snv.train.readcount > sample.filtered.normed.snv.train.bamReadcount

awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}'  OFS="\t" sample.filtered.normed.snv.test.bed sample.filtered.normed.snv.test.readcount > sample.filtered.normed.snv.test.bamReadcount
```
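
After the join, every `.bamReadcount` line ends with the ALT allele from the BED file. One feature that can be derived from it is the ALT allele fraction. The helper `alt_allele_fraction` below is a hypothetical sketch; it assumes the ALT string matches a base key in the bam-readcount entries exactly, which holds for SNVs but depends on how `bed_for_bamReadcount.py` encodes indel alleles.

```python
def alt_allele_fraction(joined_line):
    """Return ALT read count / depth for one joined bam-readcount line."""
    fields = joined_line.rstrip("\n").split("\t")
    depth, alt = int(fields[3]), fields[-1]  # depth is column 4, ALT is last
    for entry in fields[4:-1]:               # per-base entries sit in between
        base, count = entry.split(":")[:2]
        if base == alt:
            return int(count) / depth if depth else 0.0
    return 0.0  # ALT allele not observed at this site
```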


## 5. Train machine-learning models on every call set

Novelty detection <https://scikit-learn.org/stable/modules/outlier_detection.html#outlier-detection>

Example <https://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html#sphx-glr-auto-examples-svm-plot-oneclass-py>

Setting tips <https://scikit-learn.org/stable/modules/svm.html#svm-outlier-detection>

(1) Train SNVs model


(2) Train Indels model
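
The training code is not included in this section. As a sketch of the novelty-detection approach described in the scikit-learn links above, the example below fits a One-class SVM on training variants and labels new variants as inliers (+1) or outliers (-1). The three feature columns (depth, mapping quality, ALT allele fraction), the synthetic data, the `nu` value and the use of `StandardScaler` are all assumptions for illustration, not the settings used by this project.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM


def train_novelty_model(train_features, nu=0.05):
    """Fit a scaled One-class SVM on features of the training variants."""
    model = make_pipeline(
        StandardScaler(),                          # put features on one scale
        OneClassSVM(kernel="rbf", gamma="scale", nu=nu),
    )
    model.fit(train_features)
    return model


# Synthetic stand-in for the training table (hypothetical feature columns:
# depth, mapping quality, ALT allele fraction).
rng = np.random.default_rng(0)
train = rng.normal(loc=[30.0, 60.0, 0.5], scale=[3.0, 1.0, 0.05], size=(500, 3))
model = train_novelty_model(train)

# One variant that resembles the training data, one that clearly does not.
candidates = np.array([[30.0, 60.0, 0.5], [3.0, 20.0, 0.02]])
labels = model.predict(candidates)  # +1 = inlier, -1 = outlier
```

The same function could be fitted separately on the SNV and Indel training tables, since the two variant types are kept apart throughout the earlier steps.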


## 6. Integrate the information

Add INFO annotations that indicate which platforms support each variant, and how many sequencing sites, replicates, mappers, and callers support it.

<https://github.com/brentp/vcfanno/>
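
To make the intended annotation concrete, the sketch below summarises the call sets that support one variant into an INFO-style string. The tag names (`PLATFORMS`, `N_SITES`, and so on), the dictionary keys and the function `support_info` are hypothetical; in the pipeline itself the annotation would be written with a tool such as vcfanno.

```python
from collections import defaultdict


def support_info(callsets):
    """Summarise supporting call sets as 'PLATFORMS=...;N_SITES=...;...'.

    callsets: iterable of dicts with the keys
    'platform', 'site', 'replicate', 'mapper' and 'caller'.
    """
    seen = defaultdict(set)
    for callset in callsets:
        for key in ("platform", "site", "replicate", "mapper", "caller"):
            seen[key].add(callset[key])
    # Platforms are listed by name; the other categories are counted.
    parts = ["PLATFORMS=" + ",".join(sorted(seen["platform"]))]
    for key in ("site", "replicate", "mapper", "caller"):
        parts.append(f"N_{key.upper()}S={len(seen[key])}")
    return ";".join(parts)
```

Note that replicates are counted by label only, so replicate `1` at two different sites counts once; a real implementation may want to count (site, replicate) pairs instead.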


## 7. WDL

<https://gatkforums.broadinstitute.org/wdl/discussion/6716/scatter-gather-parallelism>