import json
import argparse
from functools import reduce

import pandas as pd
parser = argparse.ArgumentParser(
    description="Collect MultiQC and Sentieon metrics and report raw FASTQ, BAM "
                "and variant-calling (precision and recall) quality metrics.")
parser.add_argument('-quality', '--quality_yield', type=str, help='*.quality_yield.txt', required=True)
parser.add_argument('-depth', '--wgs_metrics', type=str, help='*deduped_WgsMetricsAlgo.txt', required=True)
parser.add_argument('-aln', '--aln_metrics', type=str, help='*_deduped_aln_metrics.txt', required=True)
parser.add_argument('-is', '--is_metrics', type=str, help='*_deduped_is_metrics.txt', required=True)
parser.add_argument('-fastqc', '--fastqc', type=str, help='multiqc_fastqc.txt', required=True)
parser.add_argument('-fastqscreen', '--fastqscreen', type=str, help='multiqc_fastq_screen.txt', required=True)
parser.add_argument('-hap', '--happy', type=str, help='multiqc_happy_data.json', required=True)
args = parser.parse_args()
# Rename inputs:
quality_yield_file = args.quality_yield
wgs_metrics_file = args.wgs_metrics
aln_metrics_file = args.aln_metrics
is_metrics_file = args.is_metrics
fastqc_file = args.fastqc
fastqscreen_file = args.fastqscreen
hap_file = args.happy
#############################################
# fastqc
fastqc = pd.read_table(fastqc_file)
#fastqc = dat.loc[:, dat.columns.str.startswith('FastQC')]
#fastqc.insert(loc=0, column='Sample', value=dat['Sample'])
#fastqc_stat = fastqc.dropna()
# qualimap
#qualimap = dat.loc[:, dat.columns.str.startswith('QualiMap')]
#qualimap.insert(loc=0, column='Sample', value=dat['Sample'])
#qualimap_stat = qualimap.dropna()
# fastqc
#dat = pd.read_table(fastqc_file)
#fastqc_module = dat.loc[:, "per_base_sequence_quality":"kmer_content"]
#fastqc_module.insert(loc=0, column='Sample', value=dat['Sample'])
#fastqc_all = pd.merge(fastqc_stat, fastqc_module, how='outer', on='Sample')
# fastqscreen
dat = pd.read_table(fastqscreen_file)
dat['Sample'] = dat['Sample'].str.replace('_screen', '', regex=False)
# Copy the slice so the insert below does not trigger SettingWithCopyWarning.
fastqscreen = dat.loc[:, dat.columns.str.endswith('percentage')].copy()
fastqscreen.insert(loc=0, column='Sample', value=dat['Sample'])
# pre-alignment
pre_alignment_dat = pd.merge(fastqc, fastqscreen, how='outer', on='Sample')
pre_alignment_dat['FastQC_mqc-generalstats-fastqc-total_sequences'] /= 1e6
del pre_alignment_dat['FastQC_mqc-generalstats-fastqc-percent_fails']
del pre_alignment_dat['FastQC_mqc-generalstats-fastqc-avg_sequence_length']
del pre_alignment_dat['ERCC percentage']
del pre_alignment_dat['Phix percentage']
del pre_alignment_dat['Mouse percentage']
pre_alignment_dat = pre_alignment_dat.round(2)
pre_alignment_dat.columns = ['Sample', '%Dup', '%GC', 'Total Sequences (million)', '%Human', '%EColi',
                             '%Adapter', '%Vector', '%rRNA', '%Virus', '%Yeast', '%Mitoch', '%No hits']
pre_alignment_dat.to_csv('pre_alignment.txt', sep='\t', index=False)
############################
# post-alignment
dat = pd.read_table(aln_metrics_file)
dat['PCT_ALIGNED_READS'] = dat['PF_READS_ALIGNED'] / dat['TOTAL_READS']
aln_metrics = dat[['Sample', 'PCT_ALIGNED_READS', 'PF_MISMATCH_RATE']].copy()
# Scale only the numeric columns; multiplying the whole frame by 100 would
# also "multiply" the Sample strings (string repetition).
aln_metrics[['PCT_ALIGNED_READS', 'PF_MISMATCH_RATE']] *= 100
dat = pd.read_table(is_metrics_file)
is_metrics = dat[['Sample', 'MEDIAN_INSERT_SIZE']]
dat = pd.read_table(quality_yield_file)
dat['%Q20'] = dat['Q20_BASES'] / dat['TOTAL_BASES']
dat['%Q30'] = dat['Q30_BASES'] / dat['TOTAL_BASES']
quality_yield = dat[['Sample', '%Q20', '%Q30']].copy()
quality_yield[['%Q20', '%Q30']] *= 100
dat = pd.read_table(wgs_metrics_file)
wgs_metrics = dat[['Sample', 'MEDIAN_COVERAGE', 'PCT_1X', 'PCT_5X', 'PCT_10X', 'PCT_30X']].copy()
wgs_metrics[['PCT_1X', 'PCT_5X', 'PCT_10X', 'PCT_30X']] *= 100
data_frames = [aln_metrics, is_metrics, quality_yield, wgs_metrics]
post_alignment_dat = reduce(lambda left, right: pd.merge(left, right, on='Sample', how='outer'), data_frames)
# benchmark
with open(hap_file) as hap_json:
    happy = json.load(hap_json)
dat = pd.DataFrame.from_records(happy)
dat = dat.loc[:, dat.columns.str.endswith('ALL')]
dat_transposed = dat.T
benchmark = dat_transposed.loc[:, ['sample_id', 'METRIC.Precision', 'METRIC.Recall']]
benchmark.columns = ['Sample', 'Precision', 'Recall']
# output
post_alignment_dat.round(2).to_csv('post_alignment.txt', sep='\t', index=False)
fastqscreen.to_csv('fastqscreen.final.result.txt', sep='\t', index=False)
benchmark.to_csv('benchmark.final.result.txt', sep='\t', index=False)
# fastqc_all and qualimap_stat are only defined in the commented-out blocks
# above, so writing them here would raise a NameError:
#fastqc_all.to_csv('fastqc.final.result.txt', sep='\t', index=False)
#qualimap_stat.to_csv('qualimap.final.result.txt', sep='\t', index=False)