Shapiro–Senapathy algorithm
From Wikipedia, the free encyclopedia
The Shapiro–Senapathy algorithm (S&S) is a computational method for identifying splice sites in eukaryotic genes. The algorithm employs a Position Weight Matrix (PWM) scoring formula to predict donor and acceptor splice sites in any given gene. This methodology has been used to discover splice sites and disease-causing splice site mutations in the human genome, and has become a standard tool in clinical genomics.

The S&S algorithm has been cited in thousands of clinical studies, according to Google Scholar. It has also formed the basis of widely used software, including Human Splicing Finder,[1] SROOGLE,[2] and Alamut,[3] which identify splice sites and splice site mutations that cause disease. The algorithm has uncovered splicing mutations in diseases ranging from cancers to inherited disorders, and predicted the deleterious effects of these mutations including exon skipping, intron retention, and cryptic splice site activation.
The algorithm
A splice site defines the boundary between a coding exon and a non-coding intron in eukaryotic genes. The S&S algorithm employs a sliding window, corresponding to the length of the splice site motif, to scan a gene sequence and detect potential splice sites. For each sliding window, the algorithm calculates a score by comparing the nucleotide sequence to a Position Weight Matrix (PWM) derived from known splice sites. This formula generates a percentile score, indicating the likelihood that a given sequence functions as a donor or acceptor splice site.
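The sliding-window scoring described above can be sketched as follows. This is a minimal illustration, not the published implementation: the matrix `TOY_DONOR_PWM`, the 9-base window, the threshold of 70, and the function names are all assumptions made for the example, and min-max normalization is used here as one plausible way to obtain a 0 to 100 score.

```python
# Toy donor-site matrix: 9 positions (3 exonic bases, then 6 intronic bases).
# The weights are ILLUSTRATIVE assumptions, not the published S&S values.
TOY_DONOR_PWM = {
    "A": [35, 60, 10, 0, 0, 60, 70, 10, 15],
    "C": [35, 10, 5, 0, 0, 5, 10, 5, 15],
    "G": [20, 15, 80, 100, 0, 30, 10, 80, 20],
    "T": [10, 15, 5, 0, 100, 5, 10, 5, 50],
}
WIDTH = 9


def pwm_score(window, pwm=TOY_DONOR_PWM):
    """Sum the per-position weights and min-max normalize to a 0-100 scale."""
    total = sum(pwm[base][i] for i, base in enumerate(window))
    lowest = sum(min(pwm[b][i] for b in "ACGT") for i in range(WIDTH))
    highest = sum(max(pwm[b][i] for b in "ACGT") for i in range(WIDTH))
    return 100.0 * (total - lowest) / (highest - lowest)


def scan(sequence, threshold=70.0):
    """Slide a WIDTH-long window along the sequence and yield scoring hits."""
    seq = sequence.upper()
    for offset in range(len(seq) - WIDTH + 1):
        window = seq[offset:offset + WIDTH]
        if set(window) <= set("ACGT"):  # skip windows with N or other symbols
            score = pwm_score(window)
            if score >= threshold:
                yield offset, window, score


# Hypothetical input sequence containing one consensus-like donor site.
for offset, window, score in scan("TTTTCAGGTAAGTTTTT"):
    print(offset, window, round(score, 1))
```

An acceptor-site scan would follow the same pattern with a longer window and its own matrix.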
A substantial proportion of disease-causing mutations in the human genome affect splice sites. Clinical genomics studies analyze the splice site scores generated by the S&S algorithm to predict the consequences of splice site mutations, including exon skipping and intron retention. The algorithm's sensitivity to single-nucleotide changes allows it to flag mutations that may impact RNA splicing and contribute to disease.
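The comparison of a wild-type and a mutated site can be sketched as a score difference. The example below is a hedged illustration: the matrix is a made-up toy (not the published S&S weights), `mutation_effect` is a hypothetical helper, and the resulting numbers only show the direction of change. The sequence pair is the BRCA1 donor site listed in this article's cancer table.

```python
# Toy 9-position donor matrix; the weights are ILLUSTRATIVE assumptions,
# not the published S&S values.
TOY_DONOR_PWM = {
    "A": [35, 60, 10, 0, 0, 60, 70, 10, 15],
    "C": [35, 10, 5, 0, 0, 5, 10, 5, 15],
    "G": [20, 15, 80, 100, 0, 30, 10, 80, 20],
    "T": [10, 15, 5, 0, 100, 5, 10, 5, 50],
}
WIDTH = 9


def pwm_score(window):
    """Min-max normalized sum of per-position weights (0-100)."""
    total = sum(TOY_DONOR_PWM[base][i] for i, base in enumerate(window))
    lowest = sum(min(TOY_DONOR_PWM[b][i] for b in "ACGT") for i in range(WIDTH))
    highest = sum(max(TOY_DONOR_PWM[b][i] for b in "ACGT") for i in range(WIDTH))
    return 100.0 * (total - lowest) / (highest - lowest)


def mutation_effect(wild_type, mutant):
    """Report which positions changed and how the site score moved."""
    changed = [i for i, (a, b) in enumerate(zip(wild_type, mutant)) if a != b]
    wt_score, mut_score = pwm_score(wild_type), pwm_score(mutant)
    return {
        "changed_positions": changed,  # 0-based index within the window
        "wild_type": wt_score,
        "mutant": mut_score,
        "delta": mut_score - wt_score,  # negative means the site got weaker
    }


effect = mutation_effect("AAGGTGTGT", "AAAGTGTGT")
print(effect)
```

A clearly negative `delta` is the kind of signal that would prompt follow-up analysis; the cutoff for calling a change deleterious is a separate choice not shown here.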
In addition to identifying real splice sites, the S&S algorithm has been used to discover cryptic splice sites — alternative splice sites activated by mutations — which may disrupt normal splicing. The algorithm detects mutations that lead to the activation of cryptic splice sites, which may be located proximal to real splice sites or deep within non-coding introns. It has thus been used to determine the causes of numerous diseases that are due to cryptic splicing.
Cancer gene discovery using S&S
The S&S algorithm has been used to identify splice-site mutations in genes associated with several cancers. For example, genes causing commonly occurring cancers including breast cancer,[4][5][6] ovarian cancer,[7][8][9] colorectal cancer,[10][11][12] leukemia,[13][14] head and neck cancers,[15][16] prostate cancer,[17][18] retinoblastoma,[19][20] squamous cell carcinoma,[21][22][23] gastrointestinal cancer,[24][25] melanoma,[26][27] liver cancer,[28][29] Lynch syndrome,[30][31][11] skin cancer,[21][32][33] and neurofibromatosis[34][35] have been found. In addition, splicing mutations in genes causing less commonly known cancers including gastric cancer,[36][37][24] gangliogliomas,[38][39] Li-Fraumeni syndrome, Loeys–Dietz syndrome, Osteochondromas (bone tumor), Nevoid basal cell carcinoma syndrome,[7] and Pheochromocytomas[9] have been identified.
Specific splice site mutations have been uncovered in genes causing breast cancer (e.g., BRCA1, PALB2), ovarian cancer (e.g., SLC9A3R1, COL7A1, HSD17B7), colon cancer (e.g., APC, MLH1, DPYD), colorectal cancer (e.g., COL3A1, APC, HLA-A), skin cancer (e.g., COL17A1, XPA, POLH), and Fanconi anemia (e.g., FANC, FANA). Donor and acceptor splice site mutations identified by S&S in genes causing a variety of cancers are shown in Table 1.
| Disease type | Gene symbol | Mutation location | Original sequence | Mutated sequence | Splicing aberration |
|---|---|---|---|---|---|
| Breast cancer | BRCA1 | Exon 11 | AAGGTGTGT | AAAGTGTGT | Skipping of exon 12[40] |
| Breast cancer | PALB2 | Exon 12 | CAGGCAAGT | CAAGCAAGT | Potentially weakening the canonical donor splice site[41] |
| Ovarian cancer | SLC9A3R1 | Exon 2 | GAGGTGATG | GAGGCGATG | Significant effect in 'splicing'[8] |
| Colorectal Cancer | MLH1 | Exon 9 | TCGGTATGT | TCAGTATGT | Skipping of exon 8 and protein truncation[10] |
| Colorectal Cancer | MSH2 | Intron 8 | CAGGTATGC | CAGGCATGC | Intervening sequence, RNA processing, no amino acid change[10] |
| Colorectal Cancer | MSH6 | Intron 9 | TTTTTAATTTTAAGG | TTTTTAATTTTGAGG | Intervening sequence, RNA processing, no amino acid change[10] |
| Skin Cancer | TGFBR1 | Exon 5 | TTTTGATTCTTTAGG | TTTTGATTCTTTCGG | Exon 5 skipping[21] |
| Skin Cancer | ITGA6 | Intron 19 | TTATTTTCTAACAGG | TTATTTTCTAACACG | Skipping of exon 20, resulting in an in-frame deletion[42] |
| Birt–Hogg–Dubé (BHD) syndrome | FLCN | Exon 9 | GAAGTAAGC | GAAGGAAGC | Skipping of exon 9 and weak retention of 131 bp of intron 9[43] |
| Nevoid basal cell carcinoma | PTCH1 | Intron 4 | CAGGTATAT | CAGGTGTAT | Exon 4 skipping[7] |
| Mesothelioma | BAP1 | Exon 16 | AAGGTGAGG | TAGGTGAGG | Creates a novel 5' splice site that results in a 4 nucleotide deletion of the 3' end of exon 16[44] |
Discovery of genes causing inherited disorders using S&S
Specific splice site mutations have been uncovered in genes that cause inherited disorders, including Type 1 diabetes (e.g., PTPN22, TCF1 (HCF-1A)), hypertension (e.g., LDL, LDLR, LPL), Marfan syndrome (e.g., FBN1, TGFBR2, FBN2), cardiac diseases (e.g., COL1A2, MYBPC3, ACTC1), and eye disorders (e.g., EVC, VSX1). A few example mutations in the donor and acceptor splice sites of genes causing a variety of inherited disorders, identified using S&S, are shown in Table 2.
| Disease type | Gene symbol | Mutation location | Original sequence | Mutated sequence | Splicing aberration |
|---|---|---|---|---|---|
| Diabetes | PTPN22 | Exon 18 | AAGGTAAAG | AACGTAAAG | Skipping of exon 18[45] |
| Diabetes | TCF1 | Intron 4 | TTTGTGCCCCTCAGG | TTTGTGCCCCTCGGG | Skipping of exon 5[46] |
| Hypertension | LDL | Intron 10 | TGGGTGCGT | TGGGTGCAT | Normolipidemic to classical heterozygous FH[47] |
| Hypertension | LDLR | Intron 2 | GCTGTGAGT | GCTGTGTGT | May cause splicing abnormalities, according to in-silico analysis[48] |
| Hypertension | LPL | Intron 2 | ACGGTAAGG | ACGATAAGG | Cryptic splice sites are activated in vivo[49] |
| Marfan syndrome | FBN1 | Intron 46 | CAAGTAAGA | CAAGTAAAA | Exon skipping/cryptic splice site[50] |
| Marfan syndrome | TGFBR2 | Intron 1 | ATCCTGTTTTACAGA | ATCCTGTTTTACGGA | Abnormal splicing[51] |
| Marfan syndrome | FBN2 | Intron 45 | TGGGTAAGT | TGGGGAAGT | Splice site alterations leading to frameshift mutations, causing a truncated protein[51] |
| Cardiac disease | COL1A2 | Intron 46 | GCTGTAAGT | GCTGCAAGT | Permitted almost exclusive use of a cryptic donor site 17 nt upstream in the exon[52] |
| Cardiac disease | MYBPC3 | Intron 5 | CTCCATGCACACAGG | CTCCATGCACACCGG | Abnormal mRNA transcript with a premature stop codon that will produce a truncated protein lacking the binding sites for myosin and titin[53] |
| Cardiac disease | ACTC1 | Intron 1 | TTTTCTTCTCATAGG | TTTTCTTCTTATAGG | No effect[54] |
| Eye disorder | ABCR | Intron 30 | CAGGTACCT | CAGTTACCT | Autosomal recessive RP and CRD[55] |
| Eye disorder | VSX1 | Intron 5 | TTTTTTTTTACAAGG | TATTTTTTTACAAGG | Aberrant splicing[56] |
Genes causing immune system disorders
More than 100 immune system disorders affect humans, including inflammatory bowel diseases, multiple sclerosis, systemic lupus erythematosus, Bloom syndrome, familial cold autoinflammatory syndrome, and dyskeratosis congenita. The Shapiro–Senapathy algorithm has been used to discover genes and mutations involved in many immune disorders, including ataxia telangiectasia, B-cell defects, epidermolysis bullosa, and X-linked agammaglobulinemia.
In xeroderma pigmentosum, an autosomal recessive disorder, the S&S algorithm identified a new preferred splice donor site; the resulting faulty proteins lead to defective nucleotide excision repair.[27]
Type I Bartter syndrome (BS) is caused by mutations in the gene SLC12A1. The S&S algorithm helped reveal two novel heterozygous mutations, c.724+4A>G in intron 5 and c.2095delG in intron 16, leading to complete skipping of exon 5.[28]
Mutations in the MYH gene, which is responsible for removing oxidatively damaged DNA lesions, confer cancer susceptibility. The IVS1+5C variant plays a causative role in the activation of a cryptic splice donor site and the alternative splicing of intron 1. The S&S algorithm shows that guanine (G) at position IVS+5 is well conserved (at a frequency of 84%) among primates, which supports the finding that the G/C SNP in the conserved splice junction of the MYH gene causes the alternative splicing of intron 1 of the β type transcript.[29]
Splice site scores calculated according to S&S were used in studying EBV infection in X-linked lymphoproliferative disease.[57] Familial tumoral calcinosis (FTC), an autosomal recessive disorder characterized by ectopic calcifications and elevated serum phosphate levels, has been attributed to aberrant splicing.[58]
Application of S&S in hospitals for clinical practice and research
The Shapiro–Senapathy (S&S) algorithm has played a significant role in advancing the diagnosis and treatment of human diseases through its application in modern clinical genomics. With the widespread adoption of next-generation sequencing (NGS) technologies, the S&S algorithm is now routinely integrated into clinical practice by geneticists and diagnostic laboratories. It is implemented in various computational tools such as Human Splicing Finder (HSF),[1] Splice Site Finder (SSF),[59] and Alamut Visual,[3] which assist in interpreting the functional impact of genetic variants on RNA splicing.
The algorithm is particularly useful in identifying pathogenic splice site mutations in cases where the clinical presentation is unclear or where conventional diagnostic methods have failed to identify a causative gene. Its utility has been demonstrated across diverse patient cohorts, including individuals from different ethnic backgrounds with various cancers and inherited genetic disorders. The following are selected examples illustrating its application in clinical research.
Cancers
| No. | Cancer type | Publication title | Year | Ethnicity | Number of patients |
|---|---|---|---|---|---|
| 1 | Hereditary Breast Cancer | Uncovering the clinical relevance of unclassified variants in DNA repair genes: a focus on BRCA negative Tunisian cancer families[60] | 2024 | Tunisian | 67 Patients |
| 2 | Basal Cell Carcinoma | PTCH1 gene variants rs357564, rs2236405, rs2297086 and rs41313327, mRNA and tissue expression in basal cell carcinoma patients from Western Mexico[61] | 2024 | Western Mexico | 250 Patients, 290 Controls |
| 3 | Non-Small Cell Lung Cancer | Associations between telomere attrition, genetic variants in telomere maintenance genes, and non-small cell lung cancer risk in the Jammu and Kashmir population of North India[62] | 2023 | India | 162 Patients 561 Controls |
| 4 | Prostate Cancer | Somatic and germline aberrations in homologous recombination repair genes in Chinese prostate cancer patients[63] | 2023 | Chinese | 721 Patients |
| 5 | Colorectal Cancer | Lynch-like syndrome is as frequent as Lynch syndrome in early-onset nonfamilial nonpolyposis colorectal cancer[64] | 2023 | Argentina | 102 patients |
| 6 | Colorectal Cancer | Germline Variants of CYBA and TRPM4 Predispose to Familial Colorectal Cancer[65] | 2022 | Poland | 15 Families |
| 7 | Hereditary Ovarian Cancer | The Genetic and Molecular Analyses of RAD51C and RAD51D Identifies Rare Variants Implicated in Hereditary Ovarian Cancer from a Genetically Unique Population[66] | 2022 | French Canadians | 17 Families, 53 Patients |
| 8 | Thymic Carcinoma | Mutation profile and immunoscore signature in thymic carcinomas: An exploratory study and review of the literature[67] | 2021 | Italian | 15 Patients |
| 9 | Neurofibromatosis Type 1 | Simultaneous Detection of NF1, SPRED1, LZTR1, and NF2 Gene Mutations by Targeted NGS in an Italian Cohort of Suspected NF1 Patients[68] | 2020 | Italian | 250 Patients |
| 10 | Neuroendocrine Pancreatic Tumor | Identification of new candidate genes and signalling pathways associated with the development of neuroendocrine pancreatic tumours based on next generation sequencing data[69] | 2020 | Caucasian | 24 Patients |
| 11 | Oral Cancer | Polymorphic variants of drug-metabolizing enzymes alter the risk and survival of oral cancer patients[70] | 2020 | Indian | 909 Controls, 539 Patients |
| 12 | Endometrial Cancer | Targeted sequencing of genes associated with the mismatch repair pathway in patients with endometrial cancer[71] | 2020 | Australia | 199 patients |
| 13 | Breast cancer | The germline mutational landscape of BRCA1 and BRCA2 in Brazil[72] | 2018 | Brazil | 649 Patients |
| 14 | Hereditary non-polyposis colorectal cancer | Prevalence and characteristics of hereditary non-polyposis colorectal cancer (HNPCC) syndrome in immigrant Asian colorectal cancer patients[10] | 2017 | Asian Immigrant | 143 Patients |
| 15 | Renal cell cancer | Genetic screening of the FLCN gene identify six novel variants and a Danish founder mutation[73] | 2016 | Danish | 143 individuals |
| 16 | Nevoid basal cell carcinoma syndrome | Nevoid basal cell carcinoma syndrome caused by splicing mutations in the PTCH1 gene[7] | 2016 | Japanese | 10 Patients |
| 17 | Prostate cancer | Identification of Two Novel HOXB13 Germline Mutations in Portuguese Prostate Cancer Patients[74] | 2015 | Portuguese | 462 Patients, 132 Controls |
| 18 | Colorectal adenomatous polyposis | Identification of Novel Causative Genes for Colorectal Adenomatous Polyposis | 2015 | German | 181 Patients, 531 Controls |
Inherited disorders
| No. | Disease name | Publication title | Year | Ethnicity | Number of patients |
|---|---|---|---|---|---|
| 1 | Congenital Myopathy | Exome sequencing in undiagnosed congenital myopathy reveals new genes and refines genes–phenotypes correlations[75] | 2024 | Multiple Population | 310 Families (429 patients) |
| 2 | Neurodevelopmental Delay and Neurodevelopmental Comorbidities | Phenotypic and genetic analysis of children with unexplained neurodevelopmental delay and neurodevelopmental comorbidities in a Chinese cohort using trio-based whole-exome sequencing[76] | 2024 | Chinese | 163 Patients |
| 3 | Congenital Cataracts | Evaluation of Genetic Testing in a Cohort of Diverse Pediatric Patients in the United States with Congenital Cataracts[77] | 2023 | Chicago, USA | 52 Patients |
| 4 | X-linked hypophosphatemia | A genetic study of a Brazilian cohort of patients with X-linked hypophosphatemia reveals no correlation between genotype and phenotype[78] | 2023 | Brazil | 41 Patients |
| 5 | Hereditary Cerebellar Ataxia | Molecular Characterization of Portuguese Patients with Hereditary Cerebellar Ataxia[79] | 2022 | Portuguese | 19 Families (30 individuals) |
| 6 | Stargardt disease | ABCA4 c.859-25A>G, a Frequent Palestinian Founder Mutation Affecting the Intron 7 Branchpoint, Is Associated With Early-Onset Stargardt Disease[80] | 2022 | Palestinian | 175 patients |
| 7 | Hearing Impairment & Retinal Dystrophy | Unraveling the genetic complexities of combined retinal dystrophy and hearing impairment[81] | 2021 | Mexican & Iranian | 59 patients |
| 8 | Angelman syndrome | New genes involved in Angelman syndrome-like: Expanding the genetic spectrum[82] | 2021 | Spain | 14 patients |
| 9 | Acute intermittent porphyria | Molecular Analysis of 55 Spanish Patients with Acute Intermittent Porphyria[83] | 2020 | Spanish | 55 patients |
| 10 | Hearing loss | Novel Loss-of-Function Variants in CDC14A are Associated with Recessive Sensorineural Hearing Loss in Iranian and Pakistani Patients[84] | 2020 | Iranian & Pakistani | 2 Families |
| 11 | Non-syndromic hearing loss | GJB2 and GJB6 Genetic Variant Curation in an Argentinean Non-Syndromic Hearing-Impaired Cohort[85] | 2020 | Argentinean | 600 patients |
| 12 | Inherited retinal diseases | Molecular genetic analysis using targeted NGS analysis of 677 individuals with retinal dystrophy[86] | 2019 | Denmark | 677 patients |
| 13 | Odontogenesis Diseases | Genetic Evidence Supporting the Role of the Calcium Channel, CACNA1S, in Tooth Cusp and Root Patterning[87] | 2018 | Thai families | 11 Patients, 18 Controls |
| 14 | Unclear speech developmental delay | Progressive SCAR14 with unclear speech, developmental delay, tremor, and behavioral problems caused by a homozygous deletion of the SPTBN2 pleckstrin homology domain[88] | 2017 | Pakistani family | 9 Patients, 12 controls |
| 15 | Beta-Ketothiolase Deficiency | Clinical and Mutational Characterizations of Ten Indian Patients with Beta-Ketothiolase Deficiency[89] | 2016 | Indian | 10 Patients |
| 16 | Bardet-Biedl Syndrome | The First Nationwide Survey and Genetic Analyses of Bardet-Biedl Syndrome in Japan[90] | 2015 | Japan | 38 Patients (disease identified in 9 patients) |
| 17 | Dent's disease | Dent's disease in children: diagnostic and therapeutic consideration[91] | 2015 | Poland | 10 Patients |
| 18 | Atypical Haemolytic Uraemic Syndrome | Genetics Atypical hemolytic-uremic syndrome[92] | 2015 | Newcastle cohort | 28 Families, 7 Sporadic patients |
| 19 | Age-related Macular Degeneration and Stargardt disease | Genetics of Age-related Macular Degeneration and Stargardt disease in South African populations[93] | 2015 | African Populations | 32 Patients |
S&S - Algorithm for identifying splice sites, exons and split genes
The Shapiro–Senapathy algorithm (SSA) was developed to identify splice sites in uncharacterized genomic sequences, with early applications in the Human Genome Project.[94][95] The method introduced a Position Weight Matrix (PWM)-based approach to analyze splicing sequences across eukaryotic organisms, marking the first computational framework to systematically define splice sites using probabilistic scoring.
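A position weight matrix of this kind can be derived by counting bases column by column in a set of aligned, known splice sites. The sketch below is an assumption-laden illustration: `build_pwm` is a hypothetical helper, and the seven input sequences are simply donor sites copied from this article's cancer table, far too few to yield meaningful weights.

```python
from collections import Counter


def build_pwm(sites):
    """Return per-position base percentages for equal-length aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for column in zip(*sites):  # iterate over alignment columns
        counts = Counter(column)
        pwm.append({base: 100.0 * counts[base] / len(sites) for base in "ACGT"})
    return pwm


# Illustrative input only: wild-type donor sites listed in this article.
DONOR_SITES = [
    "AAGGTGTGT", "CAGGCAAGT", "GAGGTGATG", "TCGGTATGT",
    "CAGGTATGC", "CAGGTATAT", "AAGGTGAGG",
]

for position, column in enumerate(build_pwm(DONOR_SITES), start=1):
    print(position, {base: round(pct, 1) for base, pct in column.items()})
```

With a realistic training set of thousands of sites, the resulting columns would play the role of the weights used when scoring a candidate window.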
Key innovations of the algorithm included:
- Exon Detection – Exons were defined as sequences bounded by acceptor and donor splice sites with S&S scores above a threshold, requiring an open reading frame (ORF) for validation.
- Gene Prediction – The method enabled the identification of complete genes by assembling predicted exons, forming a basis for later gene-finding tools.
- Mutation Analysis – The algorithm distinguishes deleterious splice-site mutations (which disrupt protein function by lowering S&S scores) from neutral variations. This capability allowed researchers to study disease-linked cryptic splice sites in humans, animals, and plants.
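The exon-detection idea in the list above can be sketched as pairing acceptor and donor sites that pass a score threshold and keeping the pairs whose enclosed sequence has an open reading frame. Everything here is a simplifying assumption: the site scores are supplied as precomputed dictionaries, the threshold and maximum exon length are arbitrary, and "open reading frame" is reduced to "at least one forward frame without a stop codon".

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_open_frame(segment):
    """True if at least one of the three forward frames has no stop codon."""
    for frame in range(3):
        codons = (segment[i:i + 3] for i in range(frame, len(segment) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            return True
    return False


def candidate_exons(seq, acceptors, donors, threshold=70.0, max_len=300):
    """Pair scored sites into candidate exons.

    acceptors: {exon start index: score}; donors: {exon end index (exclusive): score}.
    Both dictionaries are hypothetical inputs from an earlier scoring step.
    """
    exons = []
    for start, a_score in sorted(acceptors.items()):
        if a_score < threshold:
            continue
        for end, d_score in sorted(donors.items()):
            if d_score < threshold or not 0 < end - start <= max_len:
                continue
            if has_open_frame(seq[start:end]):
                exons.append((start, end))
    return exons


# Hypothetical sequence: intron end, a 12-base exon, intron start.
sequence = "TTTCAG" + "ATGGCCAAAGGC" + "GTAAGT"
print(candidate_exons(sequence, acceptors={2: 40.0, 6: 90.0}, donors={18: 95.0}))
```

Assembling a gene from such candidates would additionally require choosing a consistent chain of exons, which this sketch does not attempt.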
SSA's PWM-based framework influenced subsequent computational methods, including machine learning and neural network approaches, for splice-site prediction and alternative splicing research. It remains a foundational tool in genomics and disease studies.
Discovering the mechanisms of aberrant splicing in diseases
The Shapiro–Senapathy algorithm has been used to determine the aberrant splicing mechanisms that arise from deleterious splice site mutations and cause numerous diseases. Such mutations impair the normal splicing of gene transcripts and thereby make the encoded protein defective. A mutant splice site can become "weak" compared to the original site, so that the mutated splice junction is no longer recognized by the spliceosomal machinery. This can lead to the exon being skipped in the splicing reaction, resulting in the loss of that exon from the spliced mRNA (exon skipping). Alternatively, a partial or complete intron can be included in the mRNA when a splice site mutation makes the site unrecognizable (intron inclusion). Partial exon skipping or intron inclusion can shift the reading frame and lead to premature termination of the protein, which is then defective and can cause disease. The S&S algorithm has thus helped determine the mechanisms by which a deleterious mutation leads to a defective protein, resulting in different diseases depending on which gene is affected.
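The effect of exon skipping on the reading frame can be illustrated with a small sketch. The three exon sequences below are hypothetical and chosen only so that skipping the middle exon shifts the frame; `describe_skip` and `first_stop` are illustrative helper names, and the check only looks for the first in-frame stop codon rather than translating the protein.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop(mrna):
    """Index of the first stop codon in frame 0, or None if there is none."""
    for i in range(0, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return i
    return None


def describe_skip(exons, skipped_index):
    """Compare the normal transcript with one lacking the given exon."""
    normal = "".join(exons)
    skipped = "".join(e for i, e in enumerate(exons) if i != skipped_index)
    return {
        # A skipped exon whose length is not a multiple of 3 shifts the frame.
        "frameshift": len(exons[skipped_index]) % 3 != 0,
        "normal_stop": first_stop(normal),
        "skipped_stop": first_stop(skipped),
    }


# Hypothetical exons: 6 + 5 + 13 bases, with the normal stop codon at the end.
EXONS = ["ATGGCC", "AAAGG", "TGATGCTCTTTAA"]
print(describe_skip(EXONS, skipped_index=1))
```

In this toy case the stop codon moves from the end of the transcript to immediately after the first exon, which corresponds to the truncated protein described above.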
Examples of splicing aberrations
| Disease type | Gene symbol | Mutation location | Original donor/acceptor | Mutated donor/acceptor | Aberration effect |
|---|---|---|---|---|---|
| Colon Cancer | APC | Intron 2 | AAGGTAGAT | AAGGAAGAT | Skipping of Exon 3[96] |
| Colorectal cancer | MSH2 | Exon 15 | GAGGTTTGT | GAGGTTTCT | Skipping of Exon 15[97] |
| Retinoblastoma | RB1 | Intron 23 | TCTTAACTTGACAGA | TCTTAACGTGACAGA | New splice acceptor, intron inclusion[19] |
| Atrophic benign epidermolysis bullosa | COL17A1 | Intron 51 | AGCGTAAGT | AGCATAAGT | Leads to exon skipping, intron inclusion, or the use of a cryptic splice site, resulting in either a truncated protein or a protein lacking a small region of the coding sequence[98] |
| Choroideremia | CHM | Intron 3 | CAGGTAAAG | CAGATAAAG | Premature termination codon[99] |
| Cowden syndrome | PTEN | Intron 4 | GAGGTAGGT | GAGATAGGT | Premature termination codon within exon 5[49] |
An example of a splicing aberration is the exon skipping caused by a mutation in the donor splice site of exon 8 of the MLH1 gene, which led to colorectal cancer. This example shows that a mutation in a splice site within a gene can have a profound effect on the sequence and structure of the mRNA, and on the sequence, structure and function of the encoded protein, leading to disease.

S&S in cryptic splice sites research and medical applications
Splice sites must be identified with high precision because the consensus splice sequences are very short and gene sequences contain many other sequences similar to the authentic splice sites, known as cryptic, non-canonical, or pseudo splice sites. When an authentic (real) splice site is mutated, a cryptic splice site close to it can be erroneously used as the authentic site, resulting in an aberrant mRNA. The erroneous mRNA may include a partial sequence from the neighboring intron or lose part of an exon, which may introduce a premature stop codon. The result may be a truncated protein that has completely lost its function.
The Shapiro–Senapathy algorithm can identify cryptic splice sites in addition to authentic splice sites. Cryptic sites can often be stronger than the authentic sites, with a higher S&S score. However, lacking an accompanying complementary donor or acceptor site, such a cryptic site is not active or used in a splicing reaction. When a neighboring real site is mutated so that it becomes weaker than the cryptic site, the cryptic site may be used instead of the real site, resulting in a cryptic exon and an aberrant transcript.
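The switch from an authentic to a cryptic donor site can be sketched by scoring every GT-containing window before and after a mutation. This is a toy model under stated assumptions: the matrix is illustrative rather than the published S&S weights, the sequence is invented, candidate sites are required to carry GT at the exon-intron boundary, and the highest-scoring candidate is taken as the site used, which ignores the pairing requirement mentioned above.

```python
# Toy 9-position donor matrix; weights are ILLUSTRATIVE assumptions only.
TOY_DONOR_PWM = {
    "A": [35, 60, 10, 0, 0, 60, 70, 10, 15],
    "C": [35, 10, 5, 0, 0, 5, 10, 5, 15],
    "G": [20, 15, 80, 100, 0, 30, 10, 80, 20],
    "T": [10, 15, 5, 0, 100, 5, 10, 5, 50],
}
WIDTH = 9


def pwm_score(window):
    """Min-max normalized sum of per-position weights (0-100)."""
    total = sum(TOY_DONOR_PWM[base][i] for i, base in enumerate(window))
    lowest = sum(min(TOY_DONOR_PWM[b][i] for b in "ACGT") for i in range(WIDTH))
    highest = sum(max(TOY_DONOR_PWM[b][i] for b in "ACGT") for i in range(WIDTH))
    return 100.0 * (total - lowest) / (highest - lowest)


def donor_candidates(seq):
    """Score every window that has GT at window positions 4-5 (0-based 3:5)."""
    return [
        (start, pwm_score(seq[start:start + WIDTH]))
        for start in range(len(seq) - WIDTH + 1)
        if seq[start + 3:start + 5] == "GT"
    ]


def strongest_donor(seq):
    """Return (start, score) of the highest-scoring candidate donor window."""
    return max(donor_candidates(seq), key=lambda hit: hit[1])


# Hypothetical sequence: a cryptic site at index 14, the authentic site at 31.
wild = "ATGGCTGCTTCAAC" + "AAGGTGAGA" + "CCTCTACC" + "CAGGTAAGT" + "TCTTTCC"
mutant = wild[:34] + "A" + wild[35:]  # G>A at the first intronic base (+1)

print("wild type:", strongest_donor(wild))
print("mutant:   ", strongest_donor(mutant))
```

In the wild-type sequence the authentic site wins; after the mutation destroys its GT, the upstream cryptic window becomes the strongest remaining candidate.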
Numerous diseases have been caused by cryptic splice site mutations or usage of cryptic splice sites due to the mutations in authentic splice sites.[100][101][102][103][104]
S&S in animal and plant genomics research
S&S has also been used in RNA splicing research in many animals[105][106][107][108][109] and plants.[110][111][112][113][114]
mRNA splicing plays a fundamental role in the regulation of gene function. It has been shown that A-to-G conversions at splice sites can lead to mRNA mis-splicing in Arabidopsis.[110] In the molecular characterization of class V β-1,3-glucanase from the carnivorous sundew (Drosera rotundifolia L.), the predicted splicing and exon–intron junctions were consistent with the GT/AG rule (S&S).[111] Unspliced (LSDH) and spliced (SSDH) transcripts of NAD+-dependent sorbitol dehydrogenase (NADSDH) of strawberry (Fragaria ananassa Duch., cv. Nyoho) were investigated under phytohormonal treatments.[112]
Ambra1 is a positive regulator of autophagy, a lysosome-mediated degradative process involved in both physiological and pathological conditions. To date, this function of Ambra1 has been characterized only in mammals and zebrafish.[106] Diminution of rbm24a or rbm24b gene products by morpholino knockdown resulted in significant disruption of somite formation in mouse and zebrafish.[107] The S&S algorithm has been used extensively to study the intron–exon organization of fut8 genes. The intron–exon boundaries of Sf9 fut8 agreed with the consensus sequences for the splicing donor and acceptor sites determined using S&S.[108]