The rapid advance of high-throughput technologies has enabled the generation of two-dimensional or even multi-dimensional high-throughput data, e.g., genome-wide siRNA screen (1st dimension) for multiple changes in gene expression (2nd dimension) in many different cell types or tissues or under different experimental conditions (3rd dimension). We show that the simple Z-based statistics and derivatives are no longer suitable for analyzing such data because of the accumulation of experimental noise and/or off-target effects.
Here we introduce ZetaSuite, a new R package designed to score and rank hits from two-dimensional screens, select the thresholds for high-confidence and candidate hits . Applying this method to two large cancer dependency screen datasets, we identify not only genes critical for cell fitness, but also those required for constraining cell proliferation. Strikingly, most of those cancer constraining genes function in DNA replication/repair checkpoint, suggesting that cancer cells also need to protect their genomes for long-term survival. We also illustrate a unique utility of ZetaSuite in analyzing single cell transcriptomics to differentiate rare cells from damaged ones, as showcased with the identification of critically infiltrated immune cells in placenta, demonstrating the power of ZetaSuite in elucidating underlying biology yet minimizing unwanted artifacts.
1). QC evaluation of the input matrix. This function only works when the user provides negative and positive controls. The aim of QC is to evaluate the ability of functional readouts in discriminating negative and positive controls. This process won’t affect the following analysis.
2). Calculating the Z-score to make the readouts are comparable. This function is optimal. The users can ignore this function when the values in the input matrix has been already normalized.
3). Generation of Zeta plot for each gene. We first divide the Z-score/values provided by users into bins, and at each bin, we determine the number of readouts that show significant changes at or above the cutoff at the bin for each siRNA-targeted gene and then divide this number with the total of measured readouts.
4). Using the SVM curve to filter the genes which show similar response with negative controls.
5). Calculating the Zeta score. Zeta score is the area under the screening results. It was designed to evaluate the hits effects in two dimensional screening and the quality of the cells detected in scRNAseq datasets.
6). Screen Strength for choosing appropriate cut-off. In SS plot, the SS value would be progressively elevated (indicating progressive reduction of false positive hits) with increasing cut-off stringency. The balance points in the Screen Strength are the solutions to choose cutoffs that maximum the hits number at certain false discovery levels.
We provided an example data (generated from our in-house HTS2 screening dataset) for using ZetaSuite to explore the hits. To save the testing time, we provided a subsampled dataset. While this test data may not yield reasonable results, it can be used to see how the workflow is configured and executed.
we started with the pre-processed data set which was already removed the low quality rows and columns.
Users can find the example data set with data() command. The example input files include:
input matrix file, countMat, Each row represents gene with specific knocking-down siRNA pool, each column is an AS event. The values in the matrix are the processed fold change values between included exons and skipping exons read counts. Check the data by ‘data(“CountMat”)’.(we randomly pick-up 1609 genes and 100 AS events as example matrix)
input negative file, the wells treated with non-specific siRNAs, negGene. If users didn’t have the build-in negative controls, the non-expressed genes should be provided here. Check the data by ‘data(“negGene”)’.
input positive file, the wells treated with siRNAs targeting to PTB, posGene. The file is optimal is users don’t want to deduce a SVM curve. Check the data by ‘data(“posGene”)’.
input internal negative control (non-expressed genes), genes which annotated as non-expressed (RPKM<1) in HeLa cells, nonExpGene. Check the data by ‘data(“nonExpGene”)’.
step 1: QC figures
library(ZetaSuite)
data(countMat)
data(negGene)
data(posGene)
qcList <- QC(countMat,negGene,posGene)
test_tSNE_QC is the global evaluation based on all the readouts. This figure can evaluate whether the positive and negative samples are well separated based on current all readouts.
The below 3 figures are the quality evaluation of the individual readouts.
step 2: Normalized matrix
## X5621.1 X157.1 X2451.1 X6798.1 X2570.1
## HLA-DOB -2.498000 1.8430201 0.20901748 0.7276768 1.80779786
## DOK2 1.665303 -2.3679279 0.45552333 2.6441584 1.49359264
## POU3F1 0.126659 -1.1401584 0.09130339 -0.8482665 -0.32759624
## VWC2 1.170266 0.3294906 -0.74381915 -0.7616473 1.52798191
## FGFBP3 2.042632 -1.7006933 -1.53998400 -0.7430438 0.06798774
ZscoreVal is the normalized matrix, each row represents each knocking-down condition and each column is a specific readout (AS event). The values in the matrix are the normalized values.
step 3: Generating Zeta plot for positive and negative samples
data(countMat)
data(negGene)
data(posGene)
ZscoreVal <- Zscore(countMat,negGene)
ECList <- EventCoverage(ZscoreVal,negGene,posGene,binNum=100,combine=TRUE)
ECList[[2]][[1]]
Step 4: Calculating Zeta score for each screened gene
data(ZseqList)
data(SVMcurve)
data(countMat)
data(negGene)
ZscoreVal <- Zscore(countMat,negGene)
zetaData <- Zeta(ZscoreVal,ZseqList,SVM=FALSE)
zetaData is the zeta values for all tested knockding-down genes including positive and negative controls. The first column is the direction which knockding-down gene will lead to exon inclusion, whereas the second column is the knock-down genes will lead to exon skipping.
Step 5: Generating Screen Strength plot for hits selection
data(nonExpGene)
data(negGene)
data(posGene)
data(ZseqList)
data(countMat)
ZscoreVal <- Zscore(countMat,negGene)
zetaData <- Zeta(ZscoreVal,ZseqList,SVM=FALSE)
cutoffval <- FDRcutoff(zetaData,negGene,posGene,nonExpGene,combine=TRUE)
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
software : Yajing Hao, Changwei Shao, Guofeng Zhao, Xiang-Dong Fu (2021). ZetaSuite, A Computational Method for Analyzing Multi-dimensional High-throughput Data, Reveals Genes with Opposite Roles in Cancer Dependency. Forthcoming
in-house dataset : Changwei Shao, Yajing Hao, Jinsong Qiu, Bing Zhou, Hairi Li, Yu Zhou, Fan Meng, Li Jiang, Lan-Tao Gou, Jun Xu, Yuanjun Li,Hui Wang, Gene W. Yeo, Dong Wang, Xiong Ji, Christopher K. Glass, Pedro Aza-Blanc, Xiang-Dong Fu (2021). HTS2 Screen for Global Splicing Regulators Reveals a Key Role of the Pol II Subunit RPB9 in Coupling between Transcription and Pre-mRNA Splicing. Cell. Forthcoming.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ZetaSuite_1.0.1 rmarkdown_2.28
##
## loaded via a namespace (and not attached):
## [1] plotly_4.10.4 sass_0.4.9 utf8_1.2.4 generics_0.1.3
## [5] tidyr_1.3.1 class_7.3-22 stringi_1.8.4 lattice_0.22-6
## [9] digest_0.6.37 magrittr_2.0.3 evaluate_1.0.1 grid_4.4.1
## [13] RColorBrewer_1.1-3 fastmap_1.2.0 plyr_1.8.9 Matrix_1.7-0
## [17] jsonlite_1.8.9 e1071_1.7-16 survival_3.7-0 gridExtra_2.3
## [21] mgcv_1.9-1 httr_1.4.7 kernlab_0.9-33 purrr_1.0.2
## [25] fansi_1.0.6 viridisLite_0.4.2 scales_1.3.0 lazyeval_0.2.2
## [29] jquerylib_0.1.4 cli_3.6.3 rlang_1.1.4 splines_4.4.1
## [33] munsell_0.5.1 withr_3.0.1 cachem_1.1.0 yaml_2.3.10
## [37] Rtsne_0.17 tools_4.4.1 reshape2_1.4.4 dplyr_1.1.4
## [41] colorspace_2.1-1 ggplot2_3.5.1 buildtools_1.0.0 vctrs_0.6.5
## [45] R6_2.5.1 proxy_0.4-27 lifecycle_1.0.4 stringr_1.5.1
## [49] htmlwidgets_1.6.4 MASS_7.3-61 segmented_2.1-2 pkgconfig_2.0.3
## [53] pillar_1.9.0 bslib_0.8.0 gtable_0.3.5 glue_1.8.0
## [57] data.table_1.16.2 Rcpp_1.0.13 highr_0.11 xfun_0.48
## [61] tibble_3.2.1 tidyselect_1.2.1 mixtools_2.0.0 sys_3.4.3
## [65] knitr_1.48 farver_2.1.2 nlme_3.1-166 htmltools_0.5.8.1
## [69] labeling_0.4.3 maketools_1.3.1 compiler_4.4.1