LUYAO REN 8d93caff55 first commit 5 years ago

Integration of High-Confidence Sites

I. Workflow

II. Modules

1. Select the SNVs and Indels supported by all call sets

http://choppy.3steps.cn/renluyao/HCC_Select_HF_mutations.git

2. Extract SNV and Indel information from the VCF and BAM files

(1) Separate the SNVs and Indels

vcftools --gzvcf sample.filtered.normed.vcf.gz --remove-indels --recode --recode-INFO-all --out sample.filtered.normed.snv

vcftools --gzvcf sample.filtered.normed.vcf.gz --keep-only-indels  --recode --recode-INFO-all --out sample.filtered.normed.indel

rtg bgzip sample.filtered.normed.snv.vcf
rtg index sample.filtered.normed.snv.vcf.gz

rtg bgzip sample.filtered.normed.indel.vcf
rtg index sample.filtered.normed.indel.vcf.gz
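
The split above can be sketched in Python to make the rule explicit. This is a minimal illustration, not a replacement for vcftools: it assumes plain-text, tab-separated VCF lines, treats a site as an indel when any ALT allele differs in length from REF (the definition given in the vcftools manual), drops header lines, and does not handle symbolic alleles such as `<DEL>`. The function names are hypothetical.

```python
def is_indel(ref, alts):
    """Return True if any ALT allele changes the length of REF."""
    return any(len(alt) != len(ref) for alt in alts)


def split_snvs_and_indels(vcf_lines):
    """Split VCF data lines into (snv_lines, indel_lines); headers are skipped."""
    snvs, indels = [], []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header lines are not classified in this sketch
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        if is_indel(ref, alts):
            indels.append(line)
        else:
            snvs.append(line)
    return snvs, indels
```

Note that a multi-allelic site with one SNV allele and one indel allele lands in the indel set under this rule, which is one reason to normalize and split multi-allelic records before this step.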

(2) Separate training and testing vcf files

rtg vcffilter -i sample.filtered.normed.snv.vcf.gz -o sample.filtered.normed.snv.train.vcf.gz --include-vcf=all.filtered.vcf.gz

rtg vcffilter -i sample.filtered.normed.snv.vcf.gz -o sample.filtered.normed.snv.test.vcf.gz --exclude-vcf=all.filtered.vcf.gz

rtg vcffilter -i sample.filtered.normed.indel.vcf.gz -o sample.filtered.normed.indel.train.vcf.gz --include-vcf=all.filtered.vcf.gz

rtg vcffilter -i sample.filtered.normed.indel.vcf.gz -o sample.filtered.normed.indel.test.vcf.gz --exclude-vcf=all.filtered.vcf.gz
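
The idea behind the training/testing split is that variants also present in `all.filtered.vcf.gz` form the training set and the remainder form the testing set. The sketch below illustrates that logic under a simplifying assumption: variants are matched exactly on (CHROM, POS, REF, ALT), whereas `rtg vcffilter` matches on overlap, so results can differ for indels and multi-allelic records. The function names are hypothetical.

```python
def variant_key(line):
    """Build an exact-match key from a tab-separated VCF data line."""
    fields = line.rstrip("\n").split("\t")
    return (fields[0], int(fields[1]), fields[3], fields[4])


def split_train_test(sample_lines, consensus_lines):
    """Return (train, test): records found / not found in the consensus set."""
    consensus = {
        variant_key(line) for line in consensus_lines if not line.startswith("#")
    }
    train, test = [], []
    for line in sample_lines:
        if line.startswith("#"):
            continue  # headers are skipped in this sketch
        if variant_key(line) in consensus:
            train.append(line)
        else:
            test.append(line)
    return train, test
```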

(3) Extract the information from VCF file

gzip -d sample.filtered.normed.snv.train.vcf.gz
gzip -d sample.filtered.normed.snv.test.vcf.gz
gzip -d sample.filtered.normed.indel.train.vcf.gz
gzip -d sample.filtered.normed.indel.test.vcf.gz

python extract_vcf_information.py -i sample.filtered.normed.snv.train.vcf -o sample.filtered.normed.snv.train.txt
python extract_vcf_information.py -i sample.filtered.normed.snv.test.vcf -o sample.filtered.normed.snv.test.txt
python extract_vcf_information.py -i sample.filtered.normed.indel.train.vcf -o sample.filtered.normed.indel.train.txt
python extract_vcf_information.py -i sample.filtered.normed.indel.test.vcf -o sample.filtered.normed.indel.test.txt
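
The contents of `extract_vcf_information.py` are not shown here, so the following is only a guess at the kind of flattening such a script might do: it turns one single-sample VCF record into a flat dictionary of site, INFO, and FORMAT values that could be written out as one row of a feature table. The function name and the `INFO_`/`FMT_` key prefixes are hypothetical.

```python
def flatten_vcf_record(line):
    """Flatten one tab-separated, single-sample VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    row = {
        "CHROM": fields[0],
        "POS": int(fields[1]),
        "REF": fields[3],
        "ALT": fields[4],
        "QUAL": fields[5],
        "FILTER": fields[6],
    }
    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    for item in fields[7].split(";"):
        if item in ("", "."):
            continue
        key, _, value = item.partition("=")
        row["INFO_" + key] = value if value else True
    # FORMAT keys and the first sample's values are ':'-separated and aligned.
    if len(fields) > 9:
        for key, value in zip(fields[8].split(":"), fields[9].split(":")):
            row["FMT_" + key] = value
    return row
```

Values are kept as strings; converting fields such as depth or quality to numbers would be needed before model training.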

(4) Extract the information with bam-readcount

## get bed
# snv
cat sample.filtered.normed.snv.train.vcf | sed '/^#/d' | awk '{print $1"\t"$2"\t"$2"\t"$4"\t"$5}' > sample.filtered.normed.snv.train.bed

cat sample.filtered.normed.snv.test.vcf | sed '/^#/d' | awk '{print $1"\t"$2"\t"$2"\t"$4"\t"$5}' > sample.filtered.normed.snv.test.bed
#indel
python bed_for_bamReadcount.py -i sample.filtered.normed.indel.train.vcf -o sample.filtered.normed.indel.train

python bed_for_bamReadcount.py -i sample.filtered.normed.indel.test.vcf -o sample.filtered.normed.indel.test

## bam-readcount
#snv
bam-readcount -f reference.fa -l sample.filtered.normed.snv.train.bed sample.bam -b 20 > sample.filtered.normed.snv.train.readcount

bam-readcount -f reference.fa -l sample.filtered.normed.snv.test.bed sample.bam -b 20 > sample.filtered.normed.snv.test.readcount

#indel
bam-readcount -f reference.fa -l sample.filtered.normed.indel.train.bed sample.bam -i -b 20 > sample.filtered.normed.indel.train.readcount

bam-readcount -f reference.fa -l sample.filtered.normed.indel.test.bed sample.bam -i -b 20 > sample.filtered.normed.indel.test.readcount
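
Each bam-readcount output line holds the site (chromosome, position, reference base, depth) followed by one colon-separated entry per observed base. The sketch below parses only the first two items of each entry, the base and its read count, and assumes that layout as described in the bam-readcount README; check it against the version you run. The function names are hypothetical.

```python
def parse_readcount_line(line):
    """Parse one bam-readcount line into site fields and per-base counts."""
    chrom, pos, ref, depth, *per_base = line.rstrip("\n").split("\t")
    counts = {}
    for entry in per_base:
        parts = entry.split(":")
        counts[parts[0]] = int(parts[1])  # parts[0] = base, parts[1] = count
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "depth": int(depth),
        "counts": counts,
    }


def allele_fraction(record, allele):
    """Fraction of reads at the site that support the given allele key."""
    depth = record["depth"]
    return record["counts"].get(allele, 0) / depth if depth else 0.0
```

With `-i`, indels are reported under keys such as `+A` or `-AT` rather than under the VCF ALT string, so the allele key has to be built accordingly for indel sites.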

(5) Parse bam-readcount information and combine with vcf information

# append the ALT allele from the bed file to each bam-readcount line
awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.indel.train.bed sample.filtered.normed.indel.train.readcount > sample.filtered.normed.indel.train.bamReadcount

awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}' OFS="\t" sample.filtered.normed.indel.test.bed sample.filtered.normed.indel.test.readcount > sample.filtered.normed.indel.test.bamReadcount

awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}'  OFS="\t" sample.filtered.normed.snv.train.bed sample.filtered.normed.snv.train.readcount > sample.filtered.normed.snv.train.bamReadcount

awk 'NR==FNR{a[$1,$2]=$5;next} ($1,$2) in a{print $0, a[$1,$2]}'  OFS="\t" sample.filtered.normed.snv.test.bed sample.filtered.normed.snv.test.readcount > sample.filtered.normed.snv.test.bamReadcount
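
For readers less familiar with the two-file awk idiom, the join above can be written out in Python. This sketch mirrors the awk logic as closely as possible: it indexes the bed file by (chromosome, position), then appends the ALT column to every readcount line whose site is in the index. It splits on tabs only, and the function name is hypothetical.

```python
def append_alt(bed_lines, readcount_lines):
    """Append the bed file's ALT column (5th field) to matching readcount lines."""
    alt_by_site = {}
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        # If a site occurs twice, the later row overwrites the earlier one,
        # which is also what the awk array assignment does.
        alt_by_site[(fields[0], fields[1])] = fields[4]

    joined = []
    for line in readcount_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        site = (fields[0], fields[1])
        if site in alt_by_site:
            joined.append(line + "\t" + alt_by_site[site])
    return joined
```

Readcount lines whose site is missing from the bed file are dropped silently, so comparing line counts before and after the join is a cheap sanity check.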

5. Train machine-learning models on every call set

Novelty detection https://scikit-learn.org/stable/modules/outlier_detection.html#outlier-detection

Example https://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html#sphx-glr-auto-examples-svm-plot-oneclass-py

Setting tips https://scikit-learn.org/stable/modules/svm.html#svm-outlier-detection

(1) Train SNVs model

(2) Train Indels model
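
A minimal sketch of the novelty-detection approach linked above, using scikit-learn's `OneClassSVM`: the model is fitted only on training variants (those supported by all call sets) and then labels unseen variants as inliers (+1) or outliers (-1). The two features (depth and alternate-allele fraction), the synthetic data, and the `nu` and `gamma` values are illustrative assumptions, not settings taken from this project.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM


def train_novelty_model(train_features, nu=0.05, gamma="scale"):
    """Fit a scaled one-class SVM on features of trusted variants only."""
    model = make_pipeline(
        StandardScaler(),  # SVMs are sensitive to feature scale
        OneClassSVM(kernel="rbf", nu=nu, gamma=gamma),
    )
    model.fit(train_features)
    return model


# Synthetic stand-in for training features: [depth, alt allele fraction].
rng = np.random.default_rng(0)
train = rng.normal(loc=[30.0, 0.5], scale=[5.0, 0.05], size=(200, 2))
model = train_novelty_model(train)

# One typical-looking candidate and one far from the training distribution.
candidates = np.array([[30.0, 0.5], [3.0, 0.05]])
labels = model.predict(candidates)  # +1 = inlier, -1 = outlier
```

`nu` is an upper bound on the fraction of training points treated as outliers, so it controls how strict the model is; separate models would be fitted for SNVs and Indels, as the two steps above suggest.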

6. Integrate the information

Add INFO annotations indicating which platforms support each variant and how many sequencing sites, replicates, mappers, and callers support it.

https://github.com/brentp/vcfanno/
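
To illustrate what such support annotations could contain, here is a plain-Python sketch (it does not use vcfanno). It assumes, purely for illustration, that each supporting call set is named `platform_site_replicate_mapper_caller`; that naming scheme, the function name, and the output keys are all hypothetical.

```python
def summarize_support(callset_names):
    """Summarize which platforms, and how many sites etc., support a variant."""
    field_names = ("platform", "site", "replicate", "mapper", "caller")
    seen = {name: set() for name in field_names}
    for label in callset_names:
        for name, value in zip(field_names, label.split("_")):
            seen[name].add(value)

    # Platforms are listed by name; the other fields are reported as counts
    # of distinct labels.
    summary = {"PLATFORMS": ",".join(sorted(seen["platform"]))}
    for name in field_names[1:]:
        summary["N_" + name.upper()] = len(seen[name])
    return summary
```

Counting distinct replicate labels treats replicate `1` at two different sites as the same replicate; if replicates are site-specific, the count should be taken over (site, replicate) pairs instead.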

7. WDL

https://gatkforums.broadinstitute.org/wdl/discussion/6716/scatter-gather-parallelismq