ZetaSuite

Introduction

The rapid advance of high-throughput technologies has enabled the generation of two-dimensional or even multi-dimensional high-throughput data, e.g., genome-wide siRNA screen (1st dimension) for multiple changes in gene expression (2nd dimension) in many different cell types or tissues or under different experimental conditions (3rd dimension). We show that the simple Z-based statistics and derivatives are no longer suitable for analyzing such data because of the accumulation of experimental noise and/or off-target effects.

Here we introduce ZetaSuite, a new R package designed to score and rank hits from two-dimensional screens, select the thresholds for high-confidence and candidate hits . Applying this method to two large cancer dependency screen datasets, we identify not only genes critical for cell fitness, but also those required for constraining cell proliferation. Strikingly, most of those cancer constraining genes function in DNA replication/repair checkpoint, suggesting that cancer cells also need to protect their genomes for long-term survival. We also illustrate a unique utility of ZetaSuite in analyzing single cell transcriptomics to differentiate rare cells from damaged ones, as showcased with the identification of critically infiltrated immune cells in placenta, demonstrating the power of ZetaSuite in elucidating underlying biology yet minimizing unwanted artifacts.

Quick start

## install ZetaSuite R package
install.packages("ZetaSuite")

## load ZetaSuite R pakcage
library(ZetaSuite)

ZetaSuite R package includes the following steps:

1). QC evaluation of the input matrix. This function only works when the user provides negative and positive controls. The aim of QC is to evaluate the ability of functional readouts in discriminating negative and positive controls. This process won’t affect the following analysis.

2). Calculating the Z-score to make the readouts are comparable. This function is optimal. The users can ignore this function when the values in the input matrix has been already normalized.

3). Generation of Zeta plot for each gene. We first divide the Z-score/values provided by users into bins, and at each bin, we determine the number of readouts that show significant changes at or above the cutoff at the bin for each siRNA-targeted gene and then divide this number with the total of measured readouts.

4). Using the SVM curve to filter the genes which show similar response with negative controls.

5). Calculating the Zeta score. Zeta score is the area under the screening results. It was designed to evaluate the hits effects in two dimensional screening and the quality of the cells detected in scRNAseq datasets.

6). Screen Strength for choosing appropriate cut-off. In SS plot, the SS value would be progressively elevated (indicating progressive reduction of false positive hits) with increasing cut-off stringency. The balance points in the Screen Strength are the solutions to choose cutoffs that maximum the hits number at certain false discovery levels.

An example of large-scale screening analysis workflow using ZetaSuite

We provided an example data (generated from our in-house HTS2 screening dataset) for using ZetaSuite to explore the hits. To save the testing time, we provided a subsampled dataset. While this test data may not yield reasonable results, it can be used to see how the workflow is configured and executed.

we started with the pre-processed data set which was already removed the low quality rows and columns.

Users can find the example data set with data() command. The example input files include:

  1. input matrix file, countMat, Each row represents gene with specific knocking-down siRNA pool, each column is an AS event. The values in the matrix are the processed fold change values between included exons and skipping exons read counts. Check the data by ‘data(“CountMat”)’.(we randomly pick-up 1609 genes and 100 AS events as example matrix)

  2. input negative file, the wells treated with non-specific siRNAs, negGene. If users didn’t have the build-in negative controls, the non-expressed genes should be provided here. Check the data by ‘data(“negGene”)’.

  3. input positive file, the wells treated with siRNAs targeting to PTB, posGene. The file is optimal is users don’t want to deduce a SVM curve. Check the data by ‘data(“posGene”)’.

  4. input internal negative control (non-expressed genes), genes which annotated as non-expressed (RPKM<1) in HeLa cells, nonExpGene. Check the data by ‘data(“nonExpGene”)’.

step 1: QC figures

library(ZetaSuite)
data(countMat)
data(negGene)
data(posGene)
qcList <- QC(countMat,negGene,posGene)

test_tSNE_QC is the global evaluation based on all the readouts. This figure can evaluate whether the positive and negative samples are well separated based on current all readouts.

qcList[[2]]

The below 3 figures are the quality evaluation of the individual readouts.

qcList[[1]]

grid::grid.draw(qcList[[3]])

qcList[[4]]

step 2: Normalized matrix

data(countMat)
data(negGene)
ZscoreVal <- Zscore(countMat,negGene)
ZscoreVal[1:5,1:5]
##           X5621.1     X157.1     X2451.1    X6798.1     X2570.1
## HLA-DOB -2.498000  1.8430201  0.20901748  0.7276768  1.80779786
## DOK2     1.665303 -2.3679279  0.45552333  2.6441584  1.49359264
## POU3F1   0.126659 -1.1401584  0.09130339 -0.8482665 -0.32759624
## VWC2     1.170266  0.3294906 -0.74381915 -0.7616473  1.52798191
## FGFBP3   2.042632 -1.7006933 -1.53998400 -0.7430438  0.06798774

ZscoreVal is the normalized matrix, each row represents each knocking-down condition and each column is a specific readout (AS event). The values in the matrix are the normalized values.

step 3: Generating Zeta plot for positive and negative samples

data(countMat)
data(negGene)
data(posGene)
ZscoreVal <- Zscore(countMat,negGene)
ECList <- EventCoverage(ZscoreVal,negGene,posGene,binNum=100,combine=TRUE)
ECList[[2]][[1]]

ECList[[2]][[2]]

Step 4: Calculating Zeta score for each screened gene

data(ZseqList)
data(SVMcurve)
data(countMat)
data(negGene)
ZscoreVal <- Zscore(countMat,negGene)
zetaData <- Zeta(ZscoreVal,ZseqList,SVM=FALSE)

zetaData is the zeta values for all tested knockding-down genes including positive and negative controls. The first column is the direction which knockding-down gene will lead to exon inclusion, whereas the second column is the knock-down genes will lead to exon skipping.

Step 5: Generating Screen Strength plot for hits selection

data(nonExpGene)
data(negGene)
data(posGene)
data(ZseqList)
data(countMat)
ZscoreVal <- Zscore(countMat,negGene)
zetaData <- Zeta(ZscoreVal,ZseqList,SVM=FALSE)
cutoffval <- FDRcutoff(zetaData,negGene,posGene,nonExpGene,combine=TRUE)

cutoffval[[2]][[1]]
## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
cutoffval[[2]][[2]]

Citations

software : Yajing Hao, Changwei Shao, Guofeng Zhao, Xiang-Dong Fu (2021). ZetaSuite, A Computational Method for Analyzing Multi-dimensional High-throughput Data, Reveals Genes with Opposite Roles in Cancer Dependency. Forthcoming

in-house dataset : Changwei Shao, Yajing Hao, Jinsong Qiu, Bing Zhou, Hairi Li, Yu Zhou, Fan Meng, Li Jiang, Lan-Tao Gou, Jun Xu, Yuanjun Li,Hui Wang, Gene W. Yeo, Dong Wang, Xiong Ji, Christopher K. Glass, Pedro Aza-Blanc, Xiang-Dong Fu (2021). HTS2 Screen for Global Splicing Regulators Reveals a Key Role of the Pol II Subunit RPB9 in Coupling between Transcription and Pre-mRNA Splicing. Cell. Forthcoming.

Session Info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ZetaSuite_1.0.1 rmarkdown_2.29 
## 
## loaded via a namespace (and not attached):
##  [1] plotly_4.10.4      sass_0.4.9         utf8_1.2.4         generics_0.1.3    
##  [5] tidyr_1.3.1        class_7.3-22       stringi_1.8.4      lattice_0.22-6    
##  [9] digest_0.6.37      magrittr_2.0.3     evaluate_1.0.1     grid_4.4.2        
## [13] RColorBrewer_1.1-3 fastmap_1.2.0      plyr_1.8.9         Matrix_1.7-1      
## [17] jsonlite_1.8.9     e1071_1.7-16       survival_3.7-0     gridExtra_2.3     
## [21] mgcv_1.9-1         httr_1.4.7         kernlab_0.9-33     purrr_1.0.2       
## [25] fansi_1.0.6        viridisLite_0.4.2  scales_1.3.0       lazyeval_0.2.2    
## [29] jquerylib_0.1.4    cli_3.6.3          rlang_1.1.4        splines_4.4.2     
## [33] munsell_0.5.1      withr_3.0.2        cachem_1.1.0       yaml_2.3.10       
## [37] Rtsne_0.17         tools_4.4.2        reshape2_1.4.4     dplyr_1.1.4       
## [41] colorspace_2.1-1   ggplot2_3.5.1      buildtools_1.0.0   vctrs_0.6.5       
## [45] R6_2.5.1           proxy_0.4-27       lifecycle_1.0.4    stringr_1.5.1     
## [49] htmlwidgets_1.6.4  MASS_7.3-61        segmented_2.1-3    pkgconfig_2.0.3   
## [53] pillar_1.9.0       bslib_0.8.0        gtable_0.3.6       glue_1.8.0        
## [57] data.table_1.16.2  Rcpp_1.0.13-1      xfun_0.49          tibble_3.2.1      
## [61] tidyselect_1.2.1   mixtools_2.0.0     sys_3.4.3          knitr_1.49        
## [65] farver_2.1.2       nlme_3.1-166       htmltools_0.5.8.1  labeling_0.4.3    
## [69] maketools_1.3.1    compiler_4.4.2