Title: | Analyze High-Dimensional High-Throughput Dataset and Quality Control Single-Cell RNA-Seq |
---|---|
Description: | The advent of genomic technologies has enabled the generation of two-dimensional or even multi-dimensional high-throughput data, e.g., monitoring multiple changes in gene expression in genome-wide siRNA screens across many different cell types (E Robert McDonald 3rd (2017) <doi: 10.1016/j.cell.2017.07.005> and Tsherniak A (2017) <doi: 10.1016/j.cell.2017.06.010>) or single cell transcriptomics under different experimental conditions. We found that simple computational methods based on a single statistical criterion are no longer adequate for analyzing such multi-dimensional data. We herein introduce 'ZetaSuite', a statistical package initially designed to score hits from two-dimensional RNAi screens. We also illustrate a unique utility of 'ZetaSuite' in analyzing single cell transcriptomics to differentiate rare cells from damaged ones (Vento-Tormo R (2018) <doi: 10.1038/s41586-018-0698-6>). 'ZetaSuite' consists of the following steps: QC of input datasets, normalization using Z-transformation, Zeta score calculation, and hit selection based on a defined Screen Strength. |
Authors: | Yajing Hao [aut] , Shuyang Zhang [ctb] , Junhui Li [cre], Guofeng Zhao [ctb], Xiang-Dong Fu [cph, fnd] |
Maintainer: | Junhui Li <[email protected]> |
License: | GPL-2 | GPL-3 |
Version: | 1.0.1 |
Built: | 2024-10-16 05:17:58 UTC |
Source: | https://github.com/cran/ZetaSuite |
A data frame with 1609 individual screened genes and 100 functional readouts. The data was generated from a siRNA screen for global splicing regulators. In this screen, we interrogated ~400 endogenous alternative splicing (AS) events by using an oligo ligation-based strategy to quantify 18,480 pools of siRNAs against annotated protein-coding genes in the human genome.
data("countMat")
A data frame with 1609 observations on the following 100 marker variables. Each row represents a gene knocked down by a specific siRNA pool, and each column is an AS event. The values in the matrix are the processed fold-change values between the read counts of included exons and skipped exons.
This data frame is the raw output data from large-scale screening.
data(countMat)
A scRNA-seq dataset generated from placenta that has been analyzed with CellRanger and used to develop EmptyDrops. We subsampled the genes from the real dataset to generate the matrix.
data("countMatSC")
A data frame with 1090 cells and 10000 genes. This is a subset of the data obtained from single-cell RNA-seq, provided for package testing. Each row represents one cell detected in single-cell RNA-seq, and each column is one gene detected in those cells. The values in the matrix are the raw read counts from single-cell RNA-seq.
This data frame was generated by single-cell RNA-seq.
data(countMatSC)
A zeta plot is generated from the input Z-score matrix. Zeta plot labels: x-axis: Z-score cutoffs; y-axis: the percentage of readouts that survive at a given Z-score cutoff over the total scored readouts. To generate this plot, the range of Z-scores is determined by ranking the absolute values of Zij (the Z-score in row i and column j) from the smallest to the largest. Z-cutoffs are then selected in the ranges (-|Z(N x M x 0.9999)|, -2) and (2, |Z(N x M x 0.9999)|), where Z(N x M x 0.9999) is the value at rank N x M x 0.9999 in that ordering, to exclude insignificant changes that may result from experimental noise (|Z| < 2, which corresponds to a p-value > 0.05). Each selected range (positive and negative) is then divided equally into x bins (the recommended value of x is 100), and the percentage of readouts scored beyond the Z-cutoff of each bin is determined.
EventCoverage(ZscoreVal, negGene, posGene, binNum, combine = TRUE)
ZscoreVal | Z-score matrix, e.g. the output of Zscore(). |
negGene | negative control dataset, the siRNAs/genes used as negative controls in screening. |
posGene | positive control dataset, the siRNAs/genes used as positive controls in screening. |
binNum | number of bins. |
combine | whether to combine the two directions of zeta (TRUE or FALSE); default is TRUE, as in the usage above. |
A list of data.frames and plots. The data.frames are 'ZseqList', 'EC_N_I', 'EC_N_D', 'EC_P_I' and 'EC_P_D'. The plots 'EC_jitter_D' and 'EC_jitter_I' are the zeta plots for positive and negative samples. 'ZseqList', 'EC_N_I', 'EC_N_D', 'EC_P_I' and 'EC_P_D' are the input files for the zeta plot and SVM.R. ZseqList describes the bin size in the zeta plot.
Yajing Hao, Shuyang Zhang, Junhui Li, Guofeng Zhao, Xiang-Dong Fu
data(countMat)
data(negGene)
data(posGene)
ZscoreVal <- Zscore(countMat, negGene)
ECList <- EventCoverage(ZscoreVal, negGene, posGene, binNum = 100, combine = TRUE)
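The binning behind the zeta plot can be illustrated with a small base-R sketch. This is not the package's implementation: `coverage_by_cutoff` is a hypothetical helper, the toy matrix is made up, and the upper bound of the cutoff range is approximated with `quantile()` rather than taken from the exact rank N x M x 0.9999.

```r
# Toy Z-score matrix: 4 genes (rows) x 6 readouts (columns); values are made up.
set.seed(1)
z <- matrix(rnorm(24, sd = 3), nrow = 4,
            dimnames = list(paste0("gene", 1:4), paste0("readout", 1:6)))

# Hypothetical helper (not part of ZetaSuite): for each gene, the fraction of
# readouts whose Z-score lies beyond each cutoff in one direction.
coverage_by_cutoff <- function(z, bin_num = 10,
                               direction = c("increase", "decrease")) {
  direction <- match.arg(direction)
  # Cutoffs run from |Z| = 2 up to (approximately) the 99.99th percentile of |Z|.
  z_max <- max(2, quantile(abs(z), 0.9999, na.rm = TRUE, names = FALSE))
  cutoffs <- seq(2, z_max, length.out = bin_num + 1)
  cov <- sapply(cutoffs, function(cut) {
    if (direction == "increase") {
      rowMeans(z >= cut, na.rm = TRUE)
    } else {
      rowMeans(z <= -cut, na.rm = TRUE)
    }
  })
  colnames(cov) <- round(cutoffs, 3)
  cov
}

ec_increase <- coverage_by_cutoff(z, bin_num = 10, direction = "increase")
ec_decrease <- coverage_by_cutoff(z, bin_num = 10, direction = "decrease")
```

Each row of the result is one gene's curve in the zeta plot: the fraction of readouts that survive can only stay flat or fall as the cutoff grows.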
Find a cutoff according to the Screen Strength (SS) and draw the Screen Strength plot. The zeta score is used to rank genes, and SS is then calculated to define a suitable cutoff, so that hits can be defined at different confidence intervals. Formula: SS = 1 - aFDR/bFDR, where aFDR (apparent FDR) = the number of non-expressors identified as hits divided by the total number of hits, and bFDR (baseline FDR) = the total number of non-expressors divided by the number of all screened genes. SS plot labels: x-axis: zeta score; y-axis: Screen Strength. The SS value is determined at each bin (m bins in total), and the individual SS values are then connected to generate a simulated SS curve based on balance points. Users may choose one or multiple balance points as different SS intervals.
FDRcutoff(zetaData, negGene, posGene, nonExpGene, combine = FALSE)
zetaData | Zeta score data calculated by ZetaSuite, e.g. the output of Zeta(). |
negGene | negative control dataset, the siRNAs/genes used as negative controls in screening. |
posGene | positive control dataset, the siRNAs/genes used as positive controls in screening. |
nonExpGene | non-expressed gene dataset. |
combine | whether to combine the two directions of zeta (TRUE or FALSE); default is FALSE. |
A list of a data.frame and plots. The data.frame is the cutoff matrix with 6 columns: "Cut_Off", "aFDR", "SS", "TotalHits", "Num_nonExp" and "Type". The plots are 'Zeta_type' and 'SS_cutOff'.
Yajing Hao, Shuyang Zhang, Junhui Li, Guofeng Zhao, Xiang-Dong Fu
data(nonExpGene)
data(negGene)
data(posGene)
data(ZseqList)
data(countMat)
ZscoreVal <- Zscore(countMat, negGene)
zetaData <- Zeta(ZscoreVal, ZseqList, SVM = FALSE)
cutoffval <- FDRcutoff(zetaData, negGene, posGene, nonExpGene, combine = TRUE)
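The Screen Strength formula, SS = 1 - aFDR/bFDR, can be worked through on toy numbers. The sketch below is illustrative only: `screen_strength` is a hypothetical helper, the zeta scores and non-expressor flags are made up, and the output columns merely borrow the names listed for the cutoff matrix.

```r
# Made-up zeta scores for 8 genes and a flag marking non-expressed genes.
zeta <- c(g1 = 9.1, g2 = 7.4, g3 = 6.8, g4 = 3.2,
          g5 = 2.9, g6 = 1.5, g7 = 0.8, g8 = 0.3)
non_exp <- c(g1 = FALSE, g2 = FALSE, g3 = FALSE, g4 = TRUE,
             g5 = FALSE, g6 = TRUE, g7 = TRUE, g8 = TRUE)

# Hypothetical helper (not part of ZetaSuite): SS at each zeta cutoff.
screen_strength <- function(zeta, non_exp, cutoffs) {
  # bFDR: non-expressors among all screened genes.
  b_fdr <- mean(non_exp)
  do.call(rbind, lapply(cutoffs, function(cut) {
    hits <- zeta >= cut
    total_hits <- sum(hits)
    num_non_exp <- sum(non_exp & hits)
    # aFDR: non-expressors among the hits (undefined when there are no hits).
    a_fdr <- if (total_hits > 0) num_non_exp / total_hits else NA_real_
    data.frame(Cut_Off = cut, aFDR = a_fdr, SS = 1 - a_fdr / b_fdr,
               TotalHits = total_hits, Num_nonExp = num_non_exp)
  }))
}

ss_table <- screen_strength(zeta, non_exp, cutoffs = c(1, 3, 5))
print(ss_table)
```

With these numbers bFDR is 0.5, and SS rises from 1/3 to 0.5 to 1 as the cutoff moves from 1 to 3 to 5, because fewer non-expressors remain among the hits.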
A data frame with 510 different well IDs in which the cells were treated with non-specific siRNAs. If users do not have built-in negative controls, the non-expressed genes should be provided here.
data("negGene")
A data frame with 510 different well IDs in which the cells were treated with non-specific siRNAs. These wells served as negative controls.
These wells were designed by the authors in the large-scale screen.
data(negGene)
A data frame with 722 different well IDs in which the cells were treated with siRNAs targeting non-expressed genes in HeLa cells. It is a subset of the total non-expressed genes in HeLa cells.
data("nonExpGene")
A data frame with 722 different well IDs in which the cells were treated with siRNAs targeting non-expressed genes in HeLa cells. These wells served as internal negative controls.
These non-expressed genes can be obtained from a prior expression profile.
data(nonExpGene)
A data frame with 299 different well IDs in which the cells were treated with siRNAs targeting PTB. If users do not have built-in positive controls, choose the parameter -withoutsvm; the filename can then be any name, such as 'NA'.
data("posGene")
A data frame with 299 different well IDs in which the cells were treated with siRNAs targeting PTB. These wells served as positive controls.
These wells were designed by the authors in the large-scale screen.
data(posGene)
Quality Control (QC) is a step for evaluating the experimental design. For all two-dimensional high-throughput data, the t-SNE plot is first used to evaluate whether the features are sufficient to separate positive and negative controls. The SSMD score (see Zhang 2007 in the references) is then generated for each readout to evaluate the percentage of high-quality readouts.
QC(countMat, negGene, posGene)
countMat | input data set. The siRNA/gene x readouts matrix from HTS2 or large-scale RNAi screens. |
negGene | negative control data set, the siRNAs/genes used as negative controls in screening. |
posGene | positive control data set, the siRNAs/genes used as positive controls in screening. |
A list of plots named 'score_q', 'tSNE_QC', 'QC_box' and 'QC_SSMD'. 'tSNE_QC' is the global evaluation based on all the readouts; it shows whether the positive and negative samples are well separated by the current set of readouts. The other three plots evaluate the quality of the individual readouts.
Yajing Hao, Shuyang Zhang, Junhui Li, Guofeng Zhao, Xiang-Dong Fu
van der Maaten L, Hinton G: Visualizing Data using t-SNE. Journal of Machine Learning Research 2008, 9:2579-2605.
Zhang XD: A pair of new statistical parameters for quality control in RNA interference high-throughput screening assays. Genomics 2007, 89:552-561.
data(countMat)
data(negGene)
data(posGene)
QC(countMat, negGene, posGene)
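To make the per-readout SSMD score concrete, here is a minimal sketch using the standard SSMD definition for two independent groups, (mean_pos - mean_neg) / sqrt(var_pos + var_neg). It is not the code used inside QC(): `ssmd` is a hypothetical helper and the control wells are simulated.

```r
set.seed(42)
# Simulated controls: 20 negative and 20 positive wells, 3 readouts.
neg <- matrix(rnorm(60, mean = 0, sd = 1), ncol = 3)
pos <- matrix(rnorm(60, mean = 4, sd = 1), ncol = 3)
pos[, 3] <- rnorm(20, mean = 0, sd = 1)  # readout3 barely separates the controls
colnames(neg) <- colnames(pos) <- paste0("readout", 1:3)

# Hypothetical helper (not part of ZetaSuite): SSMD per readout (column),
# assuming the positive and negative control groups are independent.
ssmd <- function(pos, neg) {
  (colMeans(pos) - colMeans(neg)) /
    sqrt(apply(pos, 2, var) + apply(neg, 2, var))
}

ssmd_scores <- ssmd(pos, neg)
print(round(ssmd_scores, 2))
```

Readouts with a large absolute SSMD separate the two control groups well, while a value near zero flags a readout that carries little signal.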
A radial kernel SVM is constructed to maximally separate positive controls from negative controls in the previously defined Z range, using the e1071 package for R; the SVM curve is generated from it.
SVM(ECdataList)
ECdataList | list of data returned by EventCoverage; the names of the list should be 'EC_N_D', 'EC_P_D', 'EC_N_I', 'EC_P_I' and 'ZseqList'. |
A list of data.frames, including 'cutOffD' and 'cutOffI'. cutOffD and cutOffI are the deduced SVM curves.
Yajing Hao, Shuyang Zhang, Junhui Li, Guofeng Zhao, Xiang-Dong Fu
data(countMat)
data(negGene)
data(posGene)
ZscoreVal <- Zscore(countMat, negGene)
ECdataList <- EventCoverage(ZscoreVal, negGene, posGene, binNum = 10, combine = TRUE)
SVM(ECdataList)
The SVM curves were calculated from the raw input matrix files. They were designed to maximally separate the positive and negative genes.
data("SVMcurve")
A data frame with 24 rows and 4 features. The first column contains the bin cut-offs for the decreased direction. The second column contains the percentage values at the cut-offs in column 1. The third column contains the bin cut-offs for the increased direction. The fourth column contains the percentage values at the cut-offs in column 3.
This data frame was generated by SVM.R.
data(SVMcurve)
This step calculates the Zeta score based on the two curves. First, it provides another curve above the SVM curve to represent the regulatory function of gene i. Then, the area between the two curves (the one mentioned above and the SVM curve) is calculated as the Zeta score for this gene. Since the graph of the curves is divided into m bins, the Zeta score can be calculated as the sum of the areas of all the bins that lie between the two curves.
Zeta(ZscoreVal, ZseqList, SVMcurve = NULL, SVM = FALSE)
ZscoreVal | Z-score matrix, e.g. the output of Zscore(). |
ZseqList | the list of bins. |
SVMcurve | SVM curves for the decreased and increased directions. Optional; not always used. |
SVM | whether to use the SVM curve or not; default is FALSE. |
A data.frame of zeta values for all tested knockdown genes, including positive and negative controls. The first column is the zeta value for the direction in which knocking down the gene leads to exon inclusion, and the second column is the zeta value for the direction in which knocking down the gene leads to exon skipping.
Yajing Hao, Shuyang Zhang, Junhui Li, Guofeng Zhao, Xiang-Dong Fu
data(ZseqList)
data(SVMcurve)
data(countMat)
data(negGene)
ZscoreVal <- Zscore(countMat, negGene)
zetaData <- Zeta(ZscoreVal, ZseqList, SVM = FALSE)
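The idea of a Zeta score as a sum of per-bin areas between two curves can be sketched on toy values. This is only an illustration under stated assumptions: `area_between` is a hypothetical helper, both curves are made up, and the trapezoid rule is one possible way to sum bin areas; the package may integrate differently.

```r
# Made-up curves evaluated at 5 Z-score cutoffs (4 bins of width 1).
cutoffs    <- seq(2, 6, by = 1)
gene_curve <- c(0.80, 0.60, 0.40, 0.20, 0.10)  # fraction of readouts passing
svm_curve  <- c(0.30, 0.20, 0.10, 0.05, 0.00)  # hypothetical SVM boundary

# Hypothetical helper (not part of ZetaSuite): area between an upper and a
# lower curve, summed bin by bin with the trapezoid rule.
area_between <- function(cutoffs, upper, lower = rep(0, length(upper))) {
  widths <- diff(cutoffs)
  gap <- upper - lower
  n <- length(gap)
  sum(widths * (gap[-n] + gap[-1]) / 2)
}

zeta_with_svm <- area_between(cutoffs, gene_curve, svm_curve)  # 1.15
zeta_no_svm   <- area_between(cutoffs, gene_curve)             # 1.65
```

Without a lower curve the sketch falls back to the area under the gene's own curve, which is one plausible reading of the SVM = FALSE case.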
This tool is used to evaluate the quality of the cells detected in single-cell RNA-seq. A zeta score is assigned to each cell, and a cut-off for low-quality and broken cells is provided. Users can apply the selected cut-off to keep the high-quality cells for further analysis.
ZetaSuitSC(countMatSC, binNum = 10, filter = TRUE)
countMatSC | Shalek input matrix. |
binNum | number of bins for the Zeta score calculation. |
filter | whether to filter out cells with extremely low read counts (nCount < 100); default is TRUE. |
A list of a data.frame and plots. The data.frame is the cell matrix with column names 'Cell' and 'Zeta'. The plot shows the distribution of Zeta scores for the detected cells and includes a cut-off for removing broken and empty cells.
Yajing Hao, Shuyang Zhang, Junhui Li, Guofeng Zhao, Xiang-Dong Fu
data(countMatSC)
zetaDataSC <- ZetaSuitSC(countMatSC, binNum = 50, filter = TRUE)
In this step, the input matrix is transformed into a Z-score matrix.
Zscore(countMat, negGene)
countMat | input data set. The siRNA/gene x readouts matrix from HTS2 or large-scale RNAi screens. |
negGene | negative control dataset, the siRNAs/genes used as negative controls in screening. The Z-transformation of each readout is based on these negative control siRNAs/genes. |
The initial input matrix has dimension N x M, where each row contains the individual functional readouts for one siRNA pool and each column contains the individual siRNA pools tested on a given functional readout. The readouts in each column may thus be considered as the data from a one-dimensional screen (many-to-one), so the typical Z statistic can be used to evaluate the relative function of individual genes in that column. The conversion is repeated for all columns, thereby converting the raw activity matrix into a Z-score matrix. Suppose Nij is the value in row i (1 <= i <= N, siRNA pools) and column j (1 <= j <= M, readouts) of the original matrix; then Zij = (Nij - uj) / sigma(j), where uj and sigma(j) are the mean and standard deviation of the negative control samples in column j.
A Z-transformed matrix, where each row represents one knockdown condition and each column is a specific readout (AS event). The values in the matrix are the normalized values (Z-scores).
Yajing Hao, Shuyang Zhang, Junhui Li, Guofeng Zhao, Xiang-Dong Fu
data(countMat)
data(negGene)
ZscoreVal <- Zscore(countMat, negGene)
ZscoreVal[1:5, 1:5]
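The column-wise transformation Zij = (Nij - uj) / sigma(j) can be reproduced in a few lines of base R. The sketch below is a stand-in for illustration rather than the body of Zscore(): `z_transform`, the toy matrix and the choice of negative-control rows are all hypothetical.

```r
# Toy matrix: 5 siRNA pools (rows) x 2 readouts (columns); values are made up.
raw <- matrix(c(1, 2, 3, 10, 0,
                5, 5, 7,  9, 1),
              ncol = 2,
              dimnames = list(paste0("pool", 1:5), c("AS1", "AS2")))
neg_ids <- c("pool1", "pool2", "pool3")  # hypothetical negative-control rows

# Hypothetical helper (not part of ZetaSuite): centre and scale every column
# by the mean and standard deviation of its negative-control rows.
z_transform <- function(mat, neg_ids) {
  neg <- mat[neg_ids, , drop = FALSE]
  mu <- colMeans(neg)
  sigma <- apply(neg, 2, sd)
  sweep(sweep(mat, 2, mu, "-"), 2, sigma, "/")
}

z <- z_transform(raw, neg_ids)
print(round(z, 3))
```

For readout AS1 the negative controls have mean 2 and standard deviation 1, so pool4 (raw value 10) maps to Z = 8 and pool5 (raw value 0) maps to Z = -2.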
A data frame with 11 different cut-offs and 2 directions. We divided the ranges of input values into bins. The number of bins is determined by the users.
data("ZseqList")
A data frame with 11 different cut-offs and 2 directions.
This data frame was generated by EventCoverage.R.
data(ZseqList)