HELIX: AI-Driven Precise Prediction of RNA Splicing and Isoform Usage

RNA splicing is a fundamental biological process that generates diverse transcript isoforms with distinct functions through alternative splicing of precursor mRNA (pre-mRNA), thereby greatly expanding the complexity of the human transcriptome. RNA splicing plays critical roles in tissue development, cell differentiation, and disease progression, and accumulating evidence has linked aberrant splicing to major diseases such as cancer. However, because RNA splicing is jointly regulated by cis-regulatory elements, RNA-binding proteins (RBPs), and tissue microenvironments, accurately characterizing and predicting its dynamic changes across tissues, cell types, and disease states has remained a longstanding challenge.

To address this challenge, Professor Yuan Gao and his team at the China National Center for Bioinformation developed HELIX (Hierarchical Explainable LSTM for Isoform eXpression), a deep-learning framework for modeling RNA splicing and transcript isoform usage. By integrating genomic sequence features with tissue-specific RBP expression profiles, HELIX predicts RNA splicing patterns and transcript isoform usage with high accuracy.

The study was published in Nature Computational Science on May 19th.

HELIX overcomes the limitations of previous approaches through a two-layer deep-learning architecture. The framework first integrates DNA sequence information with the expression profiles of 1,499 RBPs and then employs long short-term memory (LSTM) networks to capture the complex dependencies and competitive relationships among multiple splice sites. This design enables precise prediction of RNA splicing and transcript isoform usage. The model was trained and optimized on large-scale RNA-seq datasets spanning 30 distinct human tissues, allowing accurate quantification of complex transcript structures and isoform usage. In benchmarking, HELIX substantially outperformed existing mainstream methods in both splicing strength prediction (Pearson correlation coefficient, PCC = 0.896) and overall isoform usage prediction (PCC = 0.960).
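For readers unfamiliar with the benchmarking metric, the following is a minimal sketch of how a Pearson correlation coefficient (PCC) between predicted and observed values can be computed. The function name and the example values are hypothetical and chosen purely for illustration; they are not data or code from the study.

```python
import math


def pearson_correlation(predicted, observed):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel in the ratio).
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("PCC is undefined when either input is constant")
    return cov / math.sqrt(var_p * var_o)


# Hypothetical isoform usage values, for illustration only (not study data).
predicted_usage = [0.10, 0.35, 0.80, 0.55]
observed_usage = [0.12, 0.30, 0.85, 0.50]
pcc = pearson_correlation(predicted_usage, observed_usage)
print(f"PCC = {pcc:.3f}")
```

A PCC close to 1 indicates that predictions rise and fall almost linearly with the measured values; it does not by itself show that the predicted values match the measurements in absolute terms.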

In disease studies, HELIX demonstrated strong capability in deciphering aberrant RNA splicing and transcript isoform alterations. Using large colorectal cancer cohorts, the researchers identified widespread splicing dysregulation and abnormal isoform usage in tumor cells and further revealed strong associations between these alterations and genomic mutations, RBP dysregulation, and clinical characteristics of patients. These findings suggest that splicing abnormalities may serve as important molecular signatures for understanding tumor development and patient stratification.

The team further developed scHELIX, a single-cell extension of HELIX tailored for single-cell RNA sequencing data. scHELIX enables prediction of transcript isoform usage across different cell types and tumor subpopulations at single-cell resolution, providing a more refined view of intratumoral heterogeneity. The results revealed distinct RNA splicing and isoform usage patterns among tumor subclones, offering new insights into tumor evolution and potential therapeutic targets.
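To make the notion of isoform usage per cell population concrete, the sketch below computes observed usage fractions for a single gene by pooling isoform read counts within each cell group. This illustrates the commonly used definition (an isoform's share of all reads assigned to the gene) and is not scHELIX's prediction method; the function name, group labels, isoform identifiers, and counts are all hypothetical.

```python
from collections import defaultdict


def isoform_usage_by_group(cells):
    """Pool per-cell isoform counts by group and convert them to fractions.

    cells: iterable of (group_label, {isoform_id: read_count}) for one gene.
    Returns {group_label: {isoform_id: fraction}}; fractions sum to 1 per group.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for group, counts in cells:
        for isoform, count in counts.items():
            totals[group][isoform] += count

    usage = {}
    for group, counts in totals.items():
        gene_total = sum(counts.values())
        if gene_total == 0:
            continue  # gene not detected in this group, so usage is undefined
        usage[group] = {iso: c / gene_total for iso, c in counts.items()}
    return usage


# Hypothetical counts for one gene with two isoforms (illustration only).
cells = [
    ("tumor_subclone_A", {"iso1": 3, "iso2": 1}),
    ("tumor_subclone_A", {"iso1": 5, "iso2": 3}),
    ("tumor_subclone_B", {"iso1": 1, "iso2": 3}),
]
usage = isoform_usage_by_group(cells)
for group, fractions in usage.items():
    print(group, fractions)
```

Pooling counts within a group before dividing is one simple way to reduce the effect of sparse per-cell coverage; differences in these fractions between subclones are the kind of pattern the paragraph above describes.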

Overall, HELIX and its single-cell extension scHELIX provide a powerful artificial intelligence framework for dissecting RNA splicing regulation and transcript isoform usage under complex biological conditions. This work not only advances understanding of tissue-specific and disease-associated splicing mechanisms, but also provides valuable tools and theoretical foundations for cancer classification, pathogenic variant interpretation, and precision medicine research.

Fig. Schematic overview of the HELIX framework
