ORF Prediction

The OpenProt pipeline predicts open reading frames (ORFs) from the three-frame translation of all annotated transcripts for a given species. All ORFs of 30 codons or more that begin with an AUG start codon and end with a stop codon are considered, regardless of the transcript biotype or the presence of a canonical coding sequence. The resulting ORFs are classified as (1) refORF if the sequence has already been annotated by the source annotations, (2) novel isoform if the translated sequence shares significant homology with an annotated sequence of the same gene, or (3) altORF if the ORF is significantly different from any currently annotated sequence.
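The scanning rule described above can be illustrated with a minimal sketch. This is not the OpenProt implementation: the function name find_orfs is hypothetical, the sketch assumes the 30-codon minimum excludes the stop codon, and it assumes that the first AUG after the previous in-frame stop is used as the start (so nested AUGs are not reported separately).

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}
MIN_CODONS = 30  # minimum ORF length stated in the text


def find_orfs(transcript, min_codons=MIN_CODONS):
    """Return (start, end, frame) tuples for AUG...stop ORFs in the three forward frames.

    Coordinates are 0-based and half-open; `end` includes the stop codon.
    Assumption: the codon count excludes the stop codon.
    """
    rna = transcript.upper().replace("T", "U")
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(rna) - 2, 3):
            codon = rna[pos:pos + 3]
            if start is None and codon == "AUG":
                start = pos  # first AUG since the last in-frame stop
            elif start is not None and codon in STOP_CODONS:
                n_codons = (pos - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # resume scanning after this stop
    return orfs
```

For example, a transcript made of AUG, 29 further sense codons and a stop codon yields one ORF in frame 0, whereas the same construct with one codon fewer yields none.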

All translated sequences from all predicted ORFs are subjected to a BLAST search to assess homology to any sequence within the same species. Sequences with an exact match to an annotated protein (refProt) are given the accession from the source annotation. Sequences with an overlap of at least 50% of the query sequence and a bit score over 40 are considered novel isoforms of the annotated refProt and are given an accession number of the form II_XXXXX. The remaining sequences (altProts) are given accession numbers of the form IP_XXXXX. Each protein sequence is given a unique and persistent identifier.
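As a rough illustration of the classification thresholds, the sketch below sorts a translated sequence into one of the three classes from its BLAST hits. The BlastHit record and its field names are hypothetical and do not correspond to any BLAST output format or to OpenProt code; the same-gene requirement for novel isoforms is taken from the class definitions given earlier.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    # Hypothetical summary of one hit against an annotated protein.
    identical: bool        # exact match to the annotated sequence
    same_gene: bool        # annotated protein belongs to the same gene
    query_overlap: float   # fraction of the query covered by the alignment
    bit_score: float


def classify_protein(hits):
    """Return 'refProt', 'novel isoform' or 'altProt' for one translated sequence."""
    if any(h.identical for h in hits):
        return "refProt"
    if any(h.same_gene and h.query_overlap >= 0.5 and h.bit_score > 40
           for h in hits):
        return "novel isoform"  # would receive an II_ accession
    return "altProt"            # would receive an IP_ accession
```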

Mass Spectrometry

OpenProt re-analyses published MS/MS datasets using the OpenProt library to identify peptides matching alternative proteins as well as novel isoforms and reference proteins. The data are obtained from the PRIDE repository along with other large independent datasets, including TCGA and the BioPlex study.

The pipeline uses SearchGUI 4.2.8 to run searches with four algorithms: OMSSA, MS-GF+, X!Tandem and Comet. PeptideShaker 2.2.23 is then used to obtain an aggregated PSM report from the searches, along with annotated spectra. The scored PSMs (validated and non-validated, including the decoys) are then passed to Ms2Rescore, together with the retention times and the MS2 spectra, to re-score the PSMs. The output from Ms2Rescore is the result of a Percolator run that combines the search engine features with the deltas between predicted and observed retention times (RT) and fragmentation patterns. The list of identified peptides is then obtained by filtering for a q-value below 0.01. Identified peptides are then assigned to proteins according to the peptide assignment rules.

The resulting PSMs are displayed on the details page for each protein.

Peptide assignment rules

Peptides are preferentially assigned to refProts. If a peptide matches multiple proteins in a protein group containing both refProts and altProts, the peptide is only assigned to the refProt. If it matches to a group containing multiple altProts but no refProts, it is assigned to all altProts.
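The assignment rule can be sketched as follows. This is an illustration only: the function name and the accessions in the usage note are hypothetical, and the sketch assumes that when a group contains several refProts the peptide is assigned to all of them, a case the text does not spell out.

```python
def assign_peptide(protein_group):
    """Return the accessions a peptide is assigned to.

    protein_group maps each matched accession to 'refProt' or 'altProt'.
    """
    ref_prots = sorted(acc for acc, kind in protein_group.items()
                       if kind == "refProt")
    if ref_prots:
        # Any refProt in the group takes precedence over altProts.
        return ref_prots
    # No refProt: the peptide is shared by every altProt in the group.
    return sorted(acc for acc, kind in protein_group.items()
                  if kind == "altProt")
```

With a group such as {"REF_EXAMPLE": "refProt", "IP_000001": "altProt"} the peptide goes to REF_EXAMPLE only; with {"IP_000001": "altProt", "IP_000002": "altProt"} it goes to both altProts.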

MS score

The mass spectrometry score (MS score) assigned to every protein is used to indicate the level of MS/MS evidence available for the protein. It represents the sum of the number of unique peptides identified within each dataset.
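One reading of this definition, shown as a small sketch: peptides are de-duplicated within each dataset and the per-dataset counts are then summed, so a peptide seen in two datasets contributes twice. The function name and input layout are assumptions made for illustration.

```python
def ms_score(peptides_by_dataset):
    """Sum, over datasets, the number of distinct peptides identified for one protein.

    peptides_by_dataset maps a dataset identifier to the list of peptide
    sequences identified for the protein in that dataset.
    """
    return sum(len(set(peptides)) for peptides in peptides_by_dataset.values())
```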

RiboSeq

The re-analysis of ribosome profiling data was done using the PRICE algorithm. PRICE is an entropy-based model used for identifying translated ORFs from ribosome profiling datasets (PMID: 29529017). The acronym stands for Probabilistic Inference of Codon activities by an EM algorithm. PRICE utilizes parameters inferred from well-translated and annotated ORFs to model the stochastic events in ribosome profiling.

In essence, when a particular codon is present in a ribosomal P site, it can generate multiple footprints. PRICE employs Maximum Likelihood algorithms to reconstruct the set of codons that are most likely to produce the observed reads. These codons are then assembled into ORF candidates, and a machine-learning algorithm predicts the start codon. The detected ORFs undergo filtering based on a stringent False Discovery Rate (FDR) of 1% (traditionally set at 10%) to focus on highly confident translation events.

The p-value associated with an ORF detection corresponds to the significance of a generalized binomial test (not corrected for multiple comparisons). In ribosome profiling experiments, noise can arise from various sources such as ribosomal scanning, abortive translation events in the leader region, non-ribosome-mediated mRNA protection from RNases, or overlapping ORFs.

Handling Multi-Mapped Reads

In the analysis, we employ the "rescue" mode in PRICE. If a footprint maps to multiple locations in the genome, it is either discarded or rescued based on the presence of uniquely mapped reads near any of the potential genomic loci.

TE score

The TE (Translation Event) score displayed on the search results page and the Translation tab represents the number of studies in which a significant identification of a translation event was made. It indicates how frequently the translation event has been observed across different studies.
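As a sketch of how such a count could be derived, the function below tallies the number of distinct studies per ORF from a list of significant detections. The input layout (study identifier, ORF identifier) is an assumption for illustration, and the detections are assumed to have already passed the significance filter.

```python
def te_scores(detections):
    """Return {orf_id: number of distinct studies with a significant detection}.

    detections is an iterable of (study_id, orf_id) pairs.
    """
    studies_by_orf = {}
    for study_id, orf_id in detections:
        studies_by_orf.setdefault(orf_id, set()).add(study_id)
    return {orf_id: len(studies) for orf_id, studies in studies_by_orf.items()}
```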

Conservation

The phylogenetic conservation analysis was conducted using the InParanoid approach. The InParanoid algorithm (PMID: 25429972) is used to identify ortholog and paralog groups. It involves an all-vs-all Basic Local Alignment Search Tool (BLAST) comparison of all protein sequences in two species.

For example, all proteins from Homo sapiens are compared (BLAST) against all proteins from Pan troglodytes. The algorithm identifies different types of orthologies, including one-to-one (pairwise best reciprocal hit), one-to-many (multiple orthologs to one query protein), many-to-one (all queries matching one ortholog), and many-to-many (all orthologs to all queries). Additionally, the algorithm can identify paralogs within a species. OpenProt applies a significance filter at a bitscore of 40 for an overlap over 50% of the query sequence, as previously published (PMID: 29083303).
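The one-to-one case (pairwise best reciprocal hit) can be sketched as below. This covers only that seed step and not the full InParanoid clustering; the tuple layout of the hits is hypothetical, and treating the thresholds as inclusive (bit score of at least 40, overlap of at least 50%) is an assumption.

```python
def best_hits(hits, min_bitscore=40.0, min_overlap=0.5):
    """Return {query: best subject} after applying the significance filter.

    hits is an iterable of (query, subject, bit_score, query_overlap).
    """
    best = {}
    for query, subject, bit_score, overlap in hits:
        if bit_score < min_bitscore or overlap < min_overlap:
            continue  # fails the significance filter
        if query not in best or bit_score > best[query][1]:
            best[query] = (subject, bit_score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return the set of (a, b) pairs that are each other's best hit."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return {(a, b) for a, b in a_best.items() if b_best.get(b) == a}
```

Here a_vs_b would hold the hits of one species searched against the other (for instance Homo sapiens against Pan troglodytes) and b_vs_a the reverse search.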

Orthologs and Paralogs

Orthologous proteins are similar proteins from different species that share a common ancestral gene. Paralogous proteins are similar proteins encoded by different genes within one species.

Structure prediction

AlphaFold and OmegaFold were used to predict the structure of all altProts. AlphaFold was used in cases where an adequate multiple alignment matrix (min 30 sequences) could be constructed and OmegaFold was used for all other sequences. OmegaFold performs better at structure prediction in cases of limited phylogenetic information.

Intrinsically Disordered Regions

Short Linear Motifs

The Eukaryotic Linear Motif (ELM) resource (PMID: 34718738) is a database of manually curated studies on protein sequence motifs. Motifs are organized into classes represented by regular expressions. Short linear motifs (SLiMs) were predicted in all sequences on OpenProt by matching the regular expressions of the classes provided by ELM. Only SLiMs fully contained within intrinsically disordered regions are reported. ELM warns about the possibility of a high false positive rate and calls for caution in the interpretation of the results.
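The matching and filtering step can be sketched as follows. The class name and pattern in the usage note are made up for illustration and are not an actual ELM class; the disordered regions are assumed to be given as 0-based, half-open intervals, and re.finditer reports non-overlapping matches only, which may differ from how the matches were actually enumerated.

```python
import re


def find_slims(sequence, motif_classes, disordered_regions):
    """Return (class_name, start, end, match) for motifs lying inside a disordered region.

    motif_classes maps a class name to its regular expression.
    disordered_regions is a list of (start, end) intervals, 0-based, half-open.
    """
    hits = []
    for name, pattern in motif_classes.items():
        for match in re.finditer(pattern, sequence):
            start, end = match.span()
            # Keep the motif only if it is fully contained in one region.
            if any(r_start <= start and end <= r_end
                   for r_start, r_end in disordered_regions):
                hits.append((name, start, end, match.group()))
    return hits
```

For instance, with the illustrative class {"EXAMPLE_PxxP": "P..P"}, the sequence MAAPLSPGGG and a disordered region spanning positions 2 to 9, the match PLSP at positions 3 to 7 is reported; with a region spanning positions 0 to 5 it is not.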

Protein functional domain prediction

InterPro v90.0 was used with default parameters to scan all protein sequences (PMID: 36350672). When significant (e-value < 10⁻³), domain predictions and gene ontology (GO) are reported.

Subcellular localization prediction

DeepLoc 2.0 (PMID: 35489069) is a deep-learning approach to predict the subcellular localisation of proteins. The model was trained on a dataset containing human and eukaryotic proteins with experimentally verified annotations of 9 types of sorting signals. The entire OpenProt 2.0 database was submitted to DeepLoc 2.0 and its output was added to the website.