# Novoalign
[Novoalign](<http://www.novocraft.com/>) is a commercial read-mapping program. It has no published paper, and it requires a license.

For Read1, Novoalign uses a seeded alignment process to find alignment locations, each with a Read1 alignment score. For each good location found, Novoalign does a [Needleman-Wunsch alignment](<https://zh.wikipedia.org/zh-hans/%E5%B0%BC%E5%BE%B7%E6%9B%BC-%E7%BF%81%E6%96%BD%E7%AE%97%E6%B3%95>) of the second read against a region starting from the Read1 alignment and extending 6 standard deviations beyond the mean fragment length. The best alignment for Read2 defines the pair score for Read1/Read2. All the alignments are added to a collection for Read1.

This process is repeated using a Read2 seeded alignment and then N-W for Read1, creating a collection of Read2/Read1 pairs. There are very likely duplicates between the two collections.

Novoalign then decides whether there is a "proper pair" or not. To do this, a structural variation penalty is used as follows.

Novoalign has a proper pair if the score of the best pair (Read1/Read2 or Read2/Read1 combined score, including the fragment length penalty) is less than the structural variation penalty (default 70) plus the best single-end Read1 score plus the best single-end Read2 score.

If Novoalign has a proper pair, the Read1/Read2 and Read2/Read1 lists are combined, removing duplicates and sorting by alignment score. At this point Novoalign has a list of one or more proper pair alignments. This list is passed to reporting, which can report one or more alignments depending on the options.

If there was no proper pair, Novoalign reports alignments to each read in single-end mode, and the reporting options decide whether Novoalign reports one or more alignments.

The result of the paired search can be two paired alignments where the pairing is more probable than a structural variation, or it can be two individual alignments, one to each read of the pair.

Given the threshold, gap penalties, and reads, it is quite possible for Novoalign to find alignments with gaps in both ends of the reads. There are no design restrictions that prevent this type of result; it depends only on the scoring parameters and threshold. (Quoted from the Novoalign manual.)
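
The proper-pair rule and the mate search region described above can be sketched in a few lines of shell arithmetic. This is only an illustration of the quoted rule, not Novoalign's implementation: `is_proper_pair` and `mate_search_span` are hypothetical helper names, the example scores and fragment statistics are made up, and scores are assumed to be integer penalties where lower is better.

```bash
#!/usr/bin/env bash
# Hypothetical helpers illustrating the rule quoted from the manual.
# They are NOT part of Novoalign; all example numbers are invented.

# is_proper_pair PAIR_SCORE BEST_SE_READ1 BEST_SE_READ2 [SV_PENALTY]
# A pair is "proper" when its combined score is lower than the structural
# variation penalty (default 70) plus both best single-end scores.
is_proper_pair() {
  local pair_score=$1 se1=$2 se2=$3 sv_penalty=${4:-70}
  if (( pair_score < sv_penalty + se1 + se2 )); then
    echo "proper pair"
  else
    echo "single-end"
  fi
}

# mate_search_span MEAN_FRAGMENT_LENGTH STD_DEV
# Length of the region searched for the mate, measured from the first
# read's alignment: mean fragment length plus 6 standard deviations.
mate_search_span() {
  local mean=$1 sd=$2
  echo $(( mean + 6 * sd ))
}

is_proper_pair 120 40 30    # 120 <  70 + 40 + 30 = 140
is_proper_pair 150 40 30    # 150 >= 140
mate_search_span 300 50     # 300 + 6 * 50
```

With the default penalty of 70, the first call reports a proper pair and the second falls back to single-end reporting. Raising the penalty (the optional fourth argument) makes the rule more willing to accept a pair.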
##### 1. Index reference sequences
```bash
# Build the index GRCh38 from the reference FASTA
novoindex GRCh38 GRCh38.d1.vd1.fa
```
##### 2. Novoalign
```bash
# Items in <...> are placeholders; $nt is the thread count and $sample the sample name
novoalign -d <reference.ndx> -f <read1.fastq.gz> <read2.fastq.gz> -o SAM -c $nt --hlimit 9 -t 0,2 > ${sample}.novoalign.sam
```
##### 3. Convert SAM to BAM and index the BAM
```bash
# Add read groups, sort by coordinate, and write a BAM from the SAM produced in step 2
java -jar picard.jar AddOrReplaceReadGroups I=${sample}.novoalign.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample
samtools index rg_added_sorted.bam
```
##### 4. Mark duplicates
```bash
# Pass 1: collect the score information used to pick duplicates
sentieon driver -t NUMBER_THREADS -i SORTED_BAM \
  --algo LocusCollector --fun score_info SCORE.gz
# Pass 2: remove duplicates using the collected scores
sentieon driver -t NUMBER_THREADS -i SORTED_BAM \
  --algo Dedup --rmdup --score_info SCORE.gz \
  --metrics DEDUP_METRIC_TXT DEDUPED_BAM
```
##### NIST's settings
```bash
novoalign -d <reference.ndx> -f <read1.fastq.gz> <read2.fastq.gz> \
  -F STDFQ --Q2Off -t 400 -o SAM -c 10
```
Parameter explanations

`-F` Specifies the format of the read files. Novoalign can normally detect the format, so this option is usually not required. `STDFQ` means FASTQ format with Sanger coding of quality values, where each quality character is `-10 * log10(Perr) + '!'`.

`--Q2Off` Stops Novoalign from treating Q=2 bases as the Illumina "Read Segment Quality Control Indicator". With Q2 off, Q=2 bases are treated as normal bases with a quality of 2; they are included in quality calibration and may be recalibrated to higher qualities.

`-t` Sets the absolute threshold, that is, the highest alignment score acceptable for the best alignment.

`-o` Specifies the report format.

`-c` Sets the number of threads to use. On licensed versions it defaults to the number of CPUs as reported by `sysinfo()`. On the free version this option is disabled.
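
As a worked example of the Sanger quality coding used by `STDFQ`, the sketch below converts an error probability into its FASTQ quality character. `phred_char` is a hypothetical helper written for this note, not a Novoalign command, and it assumes an `awk` with the standard `log` function is available.

```bash
#!/usr/bin/env bash
# phred_char PERR
# Hypothetical helper: maps an error probability to a Sanger (Phred+33)
# quality character, i.e. the character with code -10*log10(Perr) + '!'.
phred_char() {
  awk -v p="$1" 'BEGIN {
    q = int(-10 * log(p) / log(10) + 0.5)   # Phred score, rounded to an integer
    printf "%c\n", q + 33                    # 33 is the ASCII code of "!"
  }'
}

phred_char 0.001   # Q30 -> "?"
phred_char 0.01    # Q20 -> "5"
```

An error probability of 0.001 corresponds to Q30, which is stored as `?` (ASCII 63), and 0.01 corresponds to Q20, stored as `5` (ASCII 53).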
##### Per-sample running time
| Module | Time |
| --------- | ------- |
| Novoalign | 13h |
| SamToBam | 3h40min |
| indexBam | 15min |
| Dedup | 20min |