RNA Sequencing Quality Control Pipeline

Author: Li Zhihui

E-mail: 18210700119@fudan.edu.cn

Git:

Last Updated: 2020/07/13

Installation

# Activate the choppy environment
source activate choppy
# Install the app
choppy install lizhihui/test_dataportol1

App overview: the Chinese Quartet No. 1 (中华家系1号) reference materials

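A key-technology framework for biometrology and quality control of high-throughput whole-genome sequencing is essential for making sequencing data comparable across technology platforms and laboratories, research results reproducible, and data shareable. Establishing national genomic reference materials and reference datasets, and advancing biometrology for genomics, is a necessary step in translating sequencing technology into clinical applications, and one that has so far not been addressed internationally.

The National Institute of Metrology of China, Fudan University, and the Fudan University Taizhou Institute of Health Sciences jointly developed the human Chinese Quartet No. 1 genomic reference materials (Quartet, a set of 4 samples numbered LCL5, LCL6, LCL7, and LCL8, where LCL5 and LCL6 are monozygotic twin daughters, LCL7 is the father, and LCL8 is the mother), together with the corresponding whole-genome sequencing reference datasets ("reference values"). These provide a "yardstick" for judging whether sequence detection is accurate and serve as a national benchmark for the reliability of gene sequencing data.

The reference materials are derived from a family with monozygotic twins in the Taizhou cohort. Their genetic structure reflects the population structure found at the boundary between northern and southern China, and the family design also provides a genetic basis for determining the reference values.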

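This Quality_control app assesses the quality of transcriptome sequencing (RNA Sequencing, RNA-Seq) data. It covers quality control of raw data, alignment data, and gene expression data.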

Workflow and parameters

App input files

inputSamplesFile

#read1	#read2	#sample_id	#adapter_sequence	#adapter_sequence_r2

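read1 is the Alibaba Cloud address of the FASTQ read 1 file

read2 is the Alibaba Cloud address of the FASTQ read 2 file

sample_id is the name assigned to the sample

adapter_sequence is the adapter to be trimmed from the R1 end

adapter_sequence_r2 is the adapter to be trimmed from the R2 end

All uploaded files should follow a consistent naming convention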
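
The inputSamplesFile layout described above can be sketched in Python as follows. This is a minimal illustration, not part of the app: the helper name `format_samples`, the `oss://` paths, the sample name, and the adapter sequences are all placeholder assumptions, and whether the app expects the `#`-prefixed header row inside the file itself is also assumed from the header line shown above.

```python
import csv
import io

# Column order taken from the inputSamplesFile header line.
COLUMNS = ["read1", "read2", "sample_id", "adapter_sequence", "adapter_sequence_r2"]


def format_samples(rows):
    """Return tab-separated text with a '#'-prefixed header row (hypothetical helper)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["#" + name for name in COLUMNS])
    for row in rows:
        writer.writerow([row[name] for name in COLUMNS])
    return buf.getvalue()


# Placeholder sample: paths, sample name, and adapters are illustrative only.
example_rows = [
    {
        "read1": "oss://my-bucket/fastq/LCL5_1_R1.fastq.gz",
        "read2": "oss://my-bucket/fastq/LCL5_1_R2.fastq.gz",
        "sample_id": "LCL5_1",
        "adapter_sequence": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
        "adapter_sequence_r2": "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
    }
]

samples_text = format_samples(example_rows)
print(samples_text)
```

Writing the file through a fixed column list keeps the five fields in the order the header declares, which is the main thing a hand-edited sample sheet tends to get wrong.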

App output files

1. Upstream QC parameters

Column name	Description	Range
SampleID
#Date
#LibraryPrep
Replicate
Sample
#SequenceMachine
#SequenceSite
#SequenceTech

2. Downstream QC parameters

Quality metrics	Category	Description	Reference value
Number of detected genes	One group	This metric is used to estimate the detection abundance of one sample.	(**, 58,395]
Detection Jaccard index (JI)	One group	Detection JI is the ratio of the number of genes detected in both replicates to the number of genes detected in either replicate. This metric is used to estimate the repeatability of gene detection across replicates of one sample.	[0.8, 1]
Coefficient of variation (CV)	One group	CV is calculated for each gene from the normalized expression levels in all 3 replicates of one sample. This metric is used to estimate the repeatability of expression levels across replicates of one sample.	[0, 0.2]
Correlation of technical replicates (CTR)	One group	CTR is the correlation of expression levels between different replicates of one sample.	[0.95, 1]
Signal-to-noise Ratio (SNR)	More groups	Signal is defined as the average distance between libraries from different samples on PCA plots, and noise as the average distance between libraries from the same sample. SNR is used to assess the ability to distinguish technical replicates from different biological samples.	[5, inf)
Sensitivity of detection	One group / Reference dependent	Sensitivity is the proportion of "true" detected genes in the reference dataset that are correctly detected by the test set.	[0.96, 1]
Specificity of detection	One group / Reference dependent	Specificity is the proportion of "true" non-detected genes in the reference dataset that are correctly not detected by the test set.	[0.94, 1]
Consistency ratio of relative expression	Two groups / Reference dependent	Proportion of genes whose relative ratio (log2FC) falls within the reference range (mean ± 2 fold SD).	[0.82, 1]
Correlation of relative log2FC	Two groups / Reference dependent	Pearson correlation between the mean reference relative ratio and that of the test site.	[0.96, 1]
Sensitivity of DEGs	Two groups / Reference dependent	Sensitivity is the proportion of "true" DEGs in the reference dataset that are correctly identified as DEGs by the test set.	[0.80, 1]
Specificity of DEGs	Two groups / Reference dependent	Specificity is the proportion of "true" non-DEGs in the reference dataset that are correctly identified as non-DEGs by the test set.	[0.95, 1]
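
To make a few of the definitions in the table concrete, here is a minimal Python sketch of the detection Jaccard index, the per-gene CV, and detection sensitivity and specificity. It is not the app's implementation: the function names, the toy gene IDs, and the choice of the sample standard deviation for CV are assumptions made for illustration, and the app may normalize or filter genes differently.

```python
from statistics import mean, stdev


def jaccard_index(detected_a, detected_b):
    """Genes detected in both replicates divided by genes detected in either."""
    a, b = set(detected_a), set(detected_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coefficient_of_variation(values):
    """CV of one gene across replicates (assumes sample SD and a non-zero mean)."""
    return stdev(values) / mean(values)


def sensitivity_specificity(ref_detected, test_detected, all_genes):
    """Compare test-set detection calls against a reference set of detected genes."""
    universe = set(all_genes)
    ref = set(ref_detected) & universe
    test = set(test_detected) & universe
    true_pos = len(ref & test)             # detected in reference and in test
    true_neg = len(universe - ref - test)  # detected in neither
    ref_neg = len(universe - ref)          # not detected in reference
    sensitivity = true_pos / len(ref) if ref else 0.0
    specificity = true_neg / ref_neg if ref_neg else 0.0
    return sensitivity, specificity


# Toy inputs; the gene IDs and expression values are made up.
ji = jaccard_index({"A", "B", "C", "D"}, {"B", "C", "D", "E"})
cv = coefficient_of_variation([10.0, 11.0, 9.0])
sens, spec = sensitivity_specificity(
    ref_detected={"A", "B", "C"},
    test_detected={"A", "B", "D"},
    all_genes={"A", "B", "C", "D", "E", "F"},
)
print(ji, cv, sens, spec)
```

Each value can then be compared against the reference range in the table, for example checking that the Jaccard index falls within [0.8, 1].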