The human genome carries the instruction manuals for gene expression regulation. Deciphering the “genetic codes” encoded in the human genome through computational approaches has emerged as a central theme in our lab.
Cells are the fundamental building blocks of life. Several kinds of functional cis-elements are involved in regulating gene expression in cells, and more than 90% of the genetic variations linked to diseases and traits reside in noncoding regulatory regions. Modeling these noncoding cis-elements is one of the key cornerstones for deciphering the cellular regulatory map.
Eukaryotic tissue-specific gene expression regulation relies on complex gene expression networks, involving the coordination of multiple regulators and specific DNA elements, which act in a combinatorial manner. Inspired by our previous successful model design (CARMEN), we developed SVEN, a multi-modality, sequence-oriented in silico model, to quantify the regulatory effects of these cis-elements across more than 350 tissues and cell lines, and applied the model to annotate the transcriptomic impacts of genetic variants.
SVEN adopts a hybrid architecture rather than the canonical “one-holistic-network-for-all” design: it first learns “regulatory rules” from large-scale data through multiple class-oriented holistic models and feature-oriented separate models, and then applies these rules to infer tissue-specific gene expression levels directly from sequences (Figure 1):
- A set of sequence-based deep neural networks that learn regulatory codes from sequences to predict functional genomic features (TF binding, histone modification, and DNA accessibility). The basic idea here is to combine feature-oriented models (to learn context-sensitive sequence-to-regulatory codes, such as CTCF binding events in K562) and class-oriented models (to learn more generalized rules, such as TF binding across different cell lines and tissues) to make better use of the available data.
- A feature selection and transformation module to remove redundant features and reduce the dimensionality of the features.
- A set of gradient-boosting tree models to predict gene transcription levels based on the transformed functional genomic features. Each model corresponds to one tissue or cell type.
Figure 1. Architecture of the SVEN model
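The three stages listed above can be pictured as a simple composition of functions. The following is a minimal sketch under loud assumptions: the class `ThreeStagePipeline`, the toy feature functions, and the linear "tissue models" are all hypothetical stand-ins chosen so the example runs with the Python standard library. They are not SVEN's actual networks, feature transformation, or gradient-boosting models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical type aliases; none of these names come from the SVEN code base.
FeatureModel = Callable[[str], float]             # sequence -> one functional genomic score
Transform = Callable[[List[float]], List[float]]  # feature selection / dimensionality reduction
ExpressionModel = Callable[[List[float]], float]  # transformed features -> expression level


@dataclass
class ThreeStagePipeline:
    feature_models: Sequence[FeatureModel]     # stage 1: sequence-based predictors
    transform: Transform                       # stage 2: remove redundancy, reduce dimensions
    tissue_models: Dict[str, ExpressionModel]  # stage 3: one model per tissue or cell type

    def predict(self, sequence: str) -> Dict[str, float]:
        features = [model(sequence) for model in self.feature_models]
        reduced = self.transform(features)
        return {name: model(reduced) for name, model in self.tissue_models.items()}


# Toy stand-ins for trained models (assumptions for illustration only).
def gc_fraction(sequence: str) -> float:
    return sum(base in "GC" for base in sequence) / len(sequence)


def cpg_density(sequence: str) -> float:
    return sequence.count("CG") / len(sequence)


pipeline = ThreeStagePipeline(
    # The third feature duplicates the first on purpose, so stage 2 has something to drop.
    feature_models=[gc_fraction, cpg_density, gc_fraction],
    transform=lambda features: features[:2],
    tissue_models={
        "liver": lambda f: 2.0 * f[0] + f[1],
        "brain": lambda f: f[0] - f[1],
    },
)
predictions = pipeline.predict("ACGCGT")
print(predictions)
```

The point of the sketch is the separation of concerns: the per-tissue models in stage 3 only ever see the reduced feature vector, so a tissue can be added or retrained without touching the sequence-level predictors.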
Thanks to this design, SVEN consistently outperforms Enformer, which follows the canonical “one-holistic-network-for-all” approach, in predicting tissue-specific gene expression levels and assessing the effects of variants on gene expression (Figure 2), with a roughly 40% smaller model size (153M for SVEN’s largest model versus 249M for Enformer).
Figure 2. SVEN predicts tissue-specific gene transcription level accurately
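As a quick check of the size comparison quoted above, the relative reduction can be computed from the two reported figures. The variable names are illustrative, and the calculation assumes only that both sizes are given in the same unit.

```python
sven_size = 153.0      # "153M", SVEN's largest model, as reported in the text
enformer_size = 249.0  # "249M", Enformer, as reported in the text

# Relative reduction of SVEN's size with respect to Enformer's.
relative_reduction = (enformer_size - sven_size) / enformer_size
reduction_percent = round(100 * relative_reduction, 1)
print(f"{reduction_percent}% smaller")  # prints "38.6% smaller"
```

The result, about 38.6%, is consistent with the rounded figure of 40% given in the text.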
Noncoding variants that affect transcription are referred to as transcription-modulating variants. Several approaches have been developed to curate and characterize these variants effectively and efficiently, including our previous REVA database and CARMEN algorithm. Compared to small noncoding variants (≤ 50 bp), structural variants (SVs, > 50 bp) can have a more substantial impact on biological functions due to their larger scale.
The effects of SVs on gene expression were predicted by comparing the predicted expression levels of sequences containing reference alleles versus alternative alleles. In addition to its unique capability to quantify the transcriptomic impacts of large-scale SVs, SVEN can also infer tissue-specific gene expression profiles based solely on gene sequences. Notably, SVEN’s sequence-oriented design enables the identification of plausible underlying mechanisms for identified variants. Last but not least, SVEN is also capable of handling small transcription-modulating variants.
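The reference-versus-alternative comparison described above can be sketched as follows. This is an illustration under stated assumptions, not SVEN's implementation: `apply_deletion`, `variant_effect`, and `toy_predictor` are hypothetical names, the toy predictor simply treats each "GGG" as a silencer-like element, and scoring the effect as a log2 fold change with a pseudocount is one common choice rather than a detail taken from the paper.

```python
import math
from typing import Callable


def apply_deletion(reference: str, start: int, end: int) -> str:
    """Build the alternative allele by removing reference[start:end] (0-based, end-exclusive)."""
    if not 0 <= start < end <= len(reference):
        raise ValueError("deletion coordinates fall outside the sequence")
    return reference[:start] + reference[end:]


def variant_effect(
    predict: Callable[[str], float],
    reference: str,
    alternative: str,
    pseudocount: float = 1.0,
) -> float:
    """log2 fold change of predicted expression, alternative allele over reference allele."""
    ref_expression = predict(reference)
    alt_expression = predict(alternative)
    return math.log2((alt_expression + pseudocount) / (ref_expression + pseudocount))


def toy_predictor(sequence: str) -> float:
    # Hypothetical stand-in for a trained sequence-to-expression model:
    # every "GGG" acts like a silencer and lowers the predicted expression.
    return 8.0 / (1 + sequence.count("GGG"))


reference = "ACGT" + "GGG" + "ACGT"
alternative = apply_deletion(reference, 4, 7)  # delete the silencer-like "GGG"
effect = variant_effect(toy_predictor, reference, alternative)
print(alternative, effect)
```

A positive score means the alternative allele is predicted to raise expression, and a negative score means it is predicted to lower it. Because only the input sequence changes between the two predictions, the same routine applies to small variants and to large SVs alike.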
We assessed SVEN’s ability to predict the regulatory impact of SVs: SVEN demonstrated high accuracy, with a Spearman correlation of 0.921 between predicted and observed expression levels derived from paired RNA-seq data (Figure 3). Notably, a deletion upstream of FOLH1, the gene encoding the cancer biomarker PSMA, disrupts the promoter region, and an annotation-based algorithm predicted that this deletion would barely affect gene transcription; however, SVEN correctly predicted an increase in expression, partly because its annotation module indicated that the variant effectively increases the expression-activating H3K4me3 and H3K27ac signals rather than deleting known silencers or insulators. This finding suggests a plausible underlying mechanism for the observed effect of the deletion.
Figure 3. SVEN accurately quantifies the regulatory potential of genetic variants
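For readers less familiar with the metric used in this evaluation, the Spearman correlation is the Pearson correlation computed on ranks, so it rewards getting the ordering of expression levels right rather than their absolute values. Below is a minimal standard-library sketch. The function names are illustrative, and the `predicted` and `observed` numbers are made-up toy values, not data from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length, at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Toy values for illustration only.
predicted = [0.1, 0.4, 0.35, 0.8, 0.9]
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
rho = spearman(predicted, observed)
print(round(rho, 3))  # prints 0.9
```

In this toy example a single swapped pair among five points lowers the correlation from 1.0 to 0.9.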
SVEN is publicly available at https://github.com/gao-lab/SVEN.
Dr. Yu Wang from the School of Life Sciences, Peking University (now a postdoctoral researcher at Changping Laboratory) is the first author of the paper. Nan Liang was responsible for conducting the CRISPR experiments. This research received support from the National Key Research and Development Program, the State Key Laboratory of Protein and Plant Gene Research, the Beijing Advanced Innovation Center for Genomics, and Changping Laboratory. Computational analysis was conducted on the High-performance Computing Platforms of Changping Laboratory, the High-performance Computing Platform of the Center for Life Science, and the High-performance Computing Platform of Peking University.
Link to the paper:
https://doi.org/10.1038/s41467-024-55392-7