4 years ago · a345bc3cdc
--- a/README.md
+++ b/README.md
@@ -3,11 +3,11 @@
 > Author: Yaqing Liu
 >
 > E-mail: yaqing.liu@outlook.com
 >

 CNVkit is a Python library and command-line software toolkit to infer and visualize copy number from high-throughput DNA sequencing data. It is designed for use with **hybrid capture**, including both whole-exome and custom target panels, and short-read sequencing platforms such as Illumina and Ion Torrent.

 Official documentation: https://cnvkit.readthedocs.io/en/stable/index.html

 ## Install

 ```
@@ -18,7 +18,9 @@ choppy install YaqingLiu/CNVkit
 ```

 ## Input

 Please note that the input file of this APP is in the JSON format instead of CSV.

 ```json
 {
    "tumor_bam": [
@@ -48,20 +50,24 @@ Please note that the input file of this APP is in the JSON format instead of CSV
 ```

 ## Submit your task

 ```
 # choppy batch
 choppy batch YaqingLiu/CNVkit-latest samples.json -p project_name
 ```

 ## Copy number calling pipeline

 ![image](https://cnvkit.readthedocs.io/en/stable/_images/workflow.png)

 ### Note
 <font color=darkred>***-m {hybrid,amplicon,wgs}, --seq-method {hybrid,amplicon,wgs}, --method {hybrid,amplicon,wgs}***</font>

 <font color=darkred>**_-m {hybrid,amplicon,wgs}, --seq-method {hybrid,amplicon,wgs}, --method {hybrid,amplicon,wgs}_**</font>

 Sequencing assay type: hybridization capture ('hybrid'), targeted amplicon sequencing ('amplicon'), or whole genome sequencing ('wgs').
 Determines whether and how to use antitarget bins.

 <font color=darkred>***sequencing-accessible regions***</font>
 <font color=darkred>**_sequencing-accessible regions_**</font>

 Many fully sequenced genomes, including the human genome, contain large regions of DNA that are inaccessible to sequencing. (These are mainly the centromeres, telomeres, and highly repetitive regions.) In the FASTA genome sequence these regions are filled in with large stretches of N characters. These regions cannot be mapped by resequencing, so we can avoid them when calculating the antitarget locations by passing the locations of the accessible sequence regions with the -g or --access option.
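
 As a rough illustration of what "accessible" means here, the sketch below scans a sequence string for runs of non-N characters. It is not CNVkit's own implementation; the function name `accessible_regions` is hypothetical, and a real genome would be read from a FASTA file one chromosome at a time.

```python
import re


def accessible_regions(sequence):
    """Return 0-based, half-open (start, end) spans of non-N bases."""
    # Each match is one uninterrupted run of bases other than N/n.
    return [match.span() for match in re.finditer(r"[^Nn]+", sequence)]


# Toy sequence: two accessible stretches separated by runs of N.
regions = accessible_regions("NNNNACGTNNNNNNTTGANN")
print(regions)
```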

@@ -73,40 +79,40 @@ cnvkit.py batch ... -t Tiled.bed -g Tiled.bed ...

 This results in empty “.antitarget.cnn” files which CNVkit will handle safely from version 0.3.4 onward. **However, this approach does not collect any copy number information between targeted regions, so it should only be used if you have in fact prepared your samples with a targeted amplicon sequencing protocol.**

 **_To reuse an existing reference or create a new one:_**

 ***To reuse an existing reference or create a new one:***

 *-r REFERENCE, --reference REFERENCE*
 _-r REFERENCE, --reference REFERENCE_

 Copy number reference file (.cnn).

 *--output-reference FILENAME*
 _--output-reference FILENAME_

 Output filename/path for the new reference file being created. (If given, ignores the -o/--output-dir option and will write the file to the given path. Otherwise, "reference.cnn" will be created in the current directory or specified output directory.)

 ***--annotate***
 **_--annotate_**

 The gene annotations file (refFlat.txt) is useful to apply gene names to your baits BED file, if the BED file does not already have short, informative names for each bait interval. This file can be used in the next step.
 If the BED looks like this:
 > chr1   1508981   1509154   SSU72
 > 
 > chr1   2407978   2408183   PLCH2
 > 
 > chr1   2409866   2410095   PLCH2

 > chr1 1508981 1509154 SSU72
 >
 > chr1 2407978 2408183 PLCH2
 >
 > chr1 2409866 2410095 PLCH2

 Then you don’t need refFlat.txt.

 ***index files***
 **_index files_**

 If you’ve prebuilt the index files (.bai, .fai), make sure each one’s timestamp is later than that of the corresponding BAM or FASTA file.

 CNVkit will automatically index the file if needed – that is, if the .bai/.fai file is missing, or if the timestamp of the .bai/.fai file is older than that of the corresponding .bam/.fa file. 
 CNVkit will automatically index the file if needed – that is, if the .bai/.fai file is missing, or if the timestamp of the .bai/.fai file is older than that of the corresponding .bam/.fa file.
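
 The timestamp rule above can be checked before submitting a task. This is a minimal sketch of the comparison, not CNVkit's own code; `index_is_stale` is a hypothetical helper name and the file paths in the usage note are placeholders.

```python
import os


def index_is_stale(data_path, index_path):
    """Return True if the index is missing or older than the file it indexes."""
    if not os.path.exists(index_path):
        return True
    # An index whose modification time predates the data file's is treated as outdated.
    return os.path.getmtime(index_path) < os.path.getmtime(data_path)
```

 For example, `index_is_stale("tumor.bam", "tumor.bam.bai")` returning `True` suggests the index would be rebuilt.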

 ***-s min_gap_size***
 **_-s min_gap_size_**

 Minimum gap size between accessible sequence regions. Regions separated by less than this distance will be joined together. [Default: 5000]
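
 The joining behaviour can be sketched as follows. This is an illustration under simplifying assumptions (regions on a single chromosome, given as 0-based half-open pairs), not CNVkit's actual implementation; `join_regions` is a hypothetical name.

```python
def join_regions(regions, min_gap_size=5000):
    """Merge (start, end) regions separated by less than min_gap_size."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] < min_gap_size:
            # Gap to the previous region is too small: extend that region.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# The 2000 bp gap is closed; the 12000 bp gap is kept.
joined = join_regions([(0, 1000), (3000, 8000), (20000, 25000)])
print(joined)
```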

 ***segment method***
 **_segment method_**
 Segmentation methods
 The following segmentation algorithms can be specified with the -m option:

@@ -123,10 +129,11 @@ hmm-germline (experimental) – a 3-state HMM with fixed amplitude for the loss,
 none – simply calculate the weighted mean log2 value of each chromosome arm. Useful for testing or debugging, or as a baseline for benchmarking other methods.
 The first two methods use R internally, and to use them you will need to have R and the R package dependencies installed (i.e. DNAcopy, cghFLasso). If you installed CNVkit with conda as recommended, these should have been installed for you automatically. If you installed the R packages in a nonstandard or non-default location, you can specify the location of the right Rscript executable you want to use with --rscript-path.

 The HMM methods hmm, hmm-tumor and hmm-germline were introduced provisionally in CNVkit v.0.9.2, and may change in future releases. 
 The HMM methods hmm, hmm-tumor and hmm-germline were introduced provisionally in CNVkit v.0.9.2, and may change in future releases.

 ## Output
 1. *.cnn/cns of each sample.

 1. \*.cnn/cns of each sample.
 2. A whole-genome copy ratio profile as a PDF scatter plot.
 3. An ideogram of copy ratios on chromosomes as a PDF.
 4. A segment file which can be imported into IGV.
 4. A segment file which can be imported into IGV.
--- a/defaults
+++ b/defaults
@@ -2,8 +2,6 @@
    "fasta": "oss://pgx-reference-data/GRCh38.d1.vd1/GRCh38.d1.vd1.fa",
    "faidx": "oss://pgx-reference-data/GRCh38.d1.vd1/GRCh38.d1.vd1.fa.fai",
    "docker": "registry.cn-shanghai.aliyuncs.com/pgx-docker-registry/cnvkit:0.9.7",
    "samtools_docker": "registry.cn-shanghai.aliyuncs.com/pgx-docker-registry/samtools:v1.3.1",
    "samtools_cluster": "OnDemand bcs.a2.large img-ubuntu-vpc",
    "disk_size": "400",
    "cluster_config": "OnDemand bcs.a2.7xlarge img-ubuntu-vpc",
    "bed": "oss://pgx-reference-data/bed/hg38/Exome-Agilent_V6_chr.bed",
--- a/inputs
+++ b/inputs
@@ -8,8 +8,6 @@
  "{{ project_name }}.method": "{{ method }}",
  "{{ project_name }}.segment_method": "{{ segment_method }}",
  "{{ project_name }}.reference": "{{ reference }}",
  "{{ project_name }}.samtools_docker": "{{ samtools_docker }}",
  "{{ project_name }}.samtools_cluster": "{{ samtools_cluster }}",
  "{{ project_name }}.docker": "{{ docker }}",
  "{{ project_name }}.bed": "{{ bed }}",
  "{{ project_name }}.disk_size": "{{ disk_size }}",
--- a/tasks/batch.wdl
+++ b/tasks/batch.wdl
@@ -2,7 +2,6 @@ task batch {
    String sample_id
    Array[File] tumor_bam
    Array[File] normal_bam
    Array[File] bam_index
    File bed
    File faidx
    File fasta
@@ -25,7 +24,6 @@ task batch {
        nt=$(nproc)

        mkdir -p /cromwell_root/tmp/cnvkit
        cp -r ${sep=" " bam_index} /cromwell_root/tmp/cnvkit

        # required parameters
        cp ${fasta} /cromwell_root/tmp/cnvkit/hg38.fa
--- a/tasks/samtools.wdl
+++ b/tasks/samtools.wdl
@@ -1,28 +0,0 @@
 task samtools {
  Array[File] tumor_bam
  Array[File] normal_bam
  String docker
  String cluster
  String disk_size

  command <<<
    set -o pipefail
    set -e
    nt=$(nproc)
    mkdir -p /cromwell_root/tmp/samtools
    cd /cromwell_root/tmp/samtools
    /opt/conda/bin/samtools index ${sep=' ' tumor_bam}
    /opt/conda/bin/samtools index ${sep=' ' normal_bam}
  >>>

  runtime {
    docker: docker
    cluster: cluster
    systemDisk: "cloud_ssd 40"
    dataDisk: "cloud_ssd " + disk_size + " /cromwell_root/"
  }

  output {
    Array[File] bam_index = glob("/cromwell_root/tmp/samtools/*bai")
  }
 }
--- a/workflow.wdl
+++ b/workflow.wdl
@@ -1,5 +1,4 @@
 import "./tasks/access.wdl" as access
 import "./tasks/samtools.wdl" as samtools
 import "./tasks/batch.wdl" as batch
 import "./tasks/export.wdl" as export

@@ -17,8 +16,6 @@ workflow {{ project_name }} {
    String segment_method
    String docker
    String cluster_config
    String samtools_docker
    String samtools_cluster
    String disk_size

    call access.access as access {
@@ -33,15 +30,6 @@ workflow {{ project_name }} {
        disk_size = disk_size
    }

    call samtools.samtools as samtools {
        input:
        tumor_bam = tumor_bam,
        normal_bam = normal_bam,
        docker=samtools_docker,
        cluster=samtools_cluster,
        disk_size=disk_size
    }

    call batch.batch as batch {
        input: 
        sample_id = sample_id,
@@ -53,7 +41,6 @@ workflow {{ project_name }} {
        reference = reference,
        tumor_bam = tumor_bam,
        normal_bam = normal_bam,
        bam_index = samtools.bam_index,
        bed = bed,
        access_bed = access.access_bed,
        docker = docker,