# DeepVariant

![DeepVariant workflow](./picture/Screen Shot 2019-06-11 at 10.09.27 AM.png)

DeepVariant was trained on 8 whole-genome replicates of NA12878 sequenced under a variety of library-preparation conditions, including loading concentration, library size selection, and laboratory technician. The trained model then generalizes to a variety of new datasets for variant calling.

The Docker image uploaded to Alibaba Cloud was pulled from [Docker Hub](<https://hub.docker.com/r/dajunluo/deepvariant>) (version r0.8.0).

The v0.8 WGS training datasets include 12 HG001 PCR-free, 2 HG005 PCR-free, and 4 HG001 PCR+ samples, all sequenced on Illumina HiSeq. For Illumina NovaSeq, BGISEQ-500, and BGISEQ-2000 data, we probably need to train customized small-variant callers; more information can be found on [GitHub](<https://github.com/google/deepvariant/blob/r0.8/docs/deepvariant-tpu-training-case-study.md>).

[CUSTOMIZED MODELS TO BE ADDED]

The DeepVariant pipeline consists of three steps:

1. `make_examples` consumes reads and the reference genome to create TensorFlow examples for evaluation with deep learning models.
2. `call_variants` (multi-threaded) consumes the TFRecord files of tf.Example protos created by `make_examples`, together with a deep learning model checkpoint, and evaluates the model on each example in the TFRecord. The output is a TFRecord of CallVariantsOutput protos.
3. `postprocess_variants` (single-threaded) reads the output TFRecord files from `call_variants`. It needs to see all of the `call_variants` output for a single sample in order to merge it into a final VCF.
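
The three steps above can be sketched as separate commands. This is a minimal dry-run sketch, not the pipeline's actual wrapper: the binary directory, the checkpoint path, and the file names are assumptions (they follow the layout commonly seen in the r0.8 image and may differ in yours), and `run` and `deepvariant_steps` are hypothetical helper names. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```bash
#!/usr/bin/env bash
# Sketch: the three DeepVariant stages, one command each (dry run by default).
set -euo pipefail

BIN_DIR="${BIN_DIR:-/opt/deepvariant/bin}"      # assumed location of the binaries
REF="${REF:-ref.fa}"                            # hypothetical reference FASTA
BAM="${BAM:-sample.bam}"                        # hypothetical aligned reads
MODEL="${MODEL:-/opt/models/wgs/model.ckpt}"    # assumed WGS checkpoint path
OUT="${OUT:-output}"                            # hypothetical output directory
DRY_RUN="${DRY_RUN:-1}"                         # 1 = print only, 0 = execute

run() {
  # Print the command line; execute it only when DRY_RUN=0.
  printf '%s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

deepvariant_steps() {
  # Step 1: reads + reference -> tf.Example records
  run "$BIN_DIR/make_examples" --mode calling --ref "$REF" --reads "$BAM" \
    --examples "$OUT/examples.tfrecord.gz"

  # Step 2: examples + model checkpoint -> CallVariantsOutput records
  run "$BIN_DIR/call_variants" --examples "$OUT/examples.tfrecord.gz" \
    --checkpoint "$MODEL" --outfile "$OUT/call_variants_output.tfrecord.gz"

  # Step 3: merge all call_variants output for the sample into one VCF
  run "$BIN_DIR/postprocess_variants" --ref "$REF" \
    --infile "$OUT/call_variants_output.tfrecord.gz" \
    --outfile "$OUT/output.vcf.gz"
}

deepvariant_steps
```

Each stage writes a file that the next stage reads, which is why `postprocess_variants` can only start once `call_variants` has finished for the sample.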

**Some tips for DeepVariant:**

1. Duplicate marking may be performed; it makes almost no difference in accuracy except at lower (<20x) coverages.
2. The authors recommend that you do not perform BQSR; running BQSR causes a small decrease in accuracy.
3. Indel realignment is not necessary; DeepVariant accuracy is the same either way.

You can run the whole pipeline with one command using the `run_deepvariant.py` script:

```bash
python run_deepvariant.py --model_type=WGS \
  --ref=../../data/"${REFERENCE_FILE}" \
  --reads=../../data/"${BAM_FILE}" \
  --regions "chr20:10,000,000-10,010,000" \
  --output_vcf=../output/output.vcf.gz \
  --output_gvcf=../output/output.g.vcf.gz
```
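
If the script is run inside the Docker image rather than from a checkout, the same call can be wrapped in `docker run`. The sketch below is illustrative only: the image tag, the in-container path `/opt/deepvariant/bin/run_deepvariant`, the host directories, and the file names are all assumptions, and `run_in_docker` is a hypothetical helper. `--num_shards` is the `run_deepvariant` option that sets how many `make_examples` tasks run in parallel. The command is only printed unless `DRY_RUN=0`.

```bash
#!/usr/bin/env bash
# Sketch: wrap run_deepvariant in `docker run` (dry run by default).
set -euo pipefail

IMAGE="${IMAGE:-dajunluo/deepvariant:r0.8.0}"   # assumed image tag
DATA_DIR="${DATA_DIR:-$PWD/data}"               # hypothetical host dir: reference + BAM
OUT_DIR="${OUT_DIR:-$PWD/output}"               # hypothetical host dir for results
REFERENCE_FILE="${REFERENCE_FILE:-ref.fa}"      # placeholder file name
BAM_FILE="${BAM_FILE:-sample.bam}"              # placeholder file name
NUM_SHARDS="${NUM_SHARDS:-4}"                   # parallel make_examples tasks
DRY_RUN="${DRY_RUN:-1}"                         # 1 = print only, 0 = execute

run() {
  # Print the command line; execute it only when DRY_RUN=0.
  printf '%s\n' "$*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

run_in_docker() {
  # Inputs are mounted read-only at /input, results are written to /output.
  run docker run --rm \
    -v "$DATA_DIR:/input:ro" \
    -v "$OUT_DIR:/output" \
    "$IMAGE" \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type=WGS \
    --ref="/input/$REFERENCE_FILE" \
    --reads="/input/$BAM_FILE" \
    --output_vcf=/output/output.vcf.gz \
    --output_gvcf=/output/output.g.vcf.gz \
    --num_shards="$NUM_SHARDS"
}

run_in_docker
```

Mounting the input directory read-only keeps the run from modifying the reference or BAM files by accident.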

Four files are generated:

```bash
output.vcf.gz
output.vcf.gz.tbi
output.g.vcf.gz
output.g.vcf.gz.tbi
```
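
A quick sanity check after a run is to confirm that all four files exist and to count the records in the VCF. The helpers below are a sketch: `check_outputs` and `count_variants` are hypothetical names, and the output directory in the usage comment is a placeholder. Only `gzip` and `awk` are needed, since bgzip-compressed files can be read by `gzip -dc`.

```bash
#!/usr/bin/env bash
# Sketch: verify the expected outputs and count variant records.
set -euo pipefail

# Count the non-header lines (records) in a gzip/bgzip-compressed VCF.
count_variants() {
  gzip -dc "$1" | awk '!/^#/ { n++ } END { print n + 0 }'
}

# Report each expected file; return non-zero if any is missing or empty.
check_outputs() {
  local dir="$1" missing=0 f
  for f in output.vcf.gz output.vcf.gz.tbi output.g.vcf.gz output.g.vcf.gz.tbi; do
    if [ -s "$dir/$f" ]; then
      echo "ok       $f"
    else
      echo "missing  $f"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (placeholder directory):
#   check_outputs ../output && count_variants ../output/output.vcf.gz
```

Note that the count covers every record in the file, including any that did not pass filtering.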

Command used in the choppy app:

```bash
python run_deepvariant.py --model_type=WGS \
  --ref=${ref_dir}/${fasta} \
  --reads=${Dedup.bam} \
  --output_vcf=${sample}_DP.vcf.gz
```

**References:**

1. DeepVariant GitHub: <https://github.com/google/deepvariant>
2. DeepVariant paper: <https://www.nature.com/articles/nbt.4235>