Explore the extended proteome

The concept behind OpenProt

Current annotations

Current genome annotations apply restrictive criteria to open reading frames (ORFs), including a minimum ORF length of 100 codons and a single ORF per transcript. Transcripts that do not meet these criteria are labeled non-coding RNAs (ncRNAs), and transcripts from unprocessed pseudogenes are also systematically annotated as non-coding.

OpenProt annotations

OpenProt relaxes traditional annotation criteria by including all ORFs longer than 30 codons and by allowing multiple ORFs per transcript, including ORFs encoded in ncRNAs and in pseudogene transcripts. OpenProt thus offers a deeper description of the proteome and a more realistic, biologically relevant perspective on it.
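The difference between the two sets of criteria can be illustrated with a minimal Python sketch. It is not OpenProt's code: the record layout (`transcript`, `codons`, `pseudogene`), the function names, and the example values are hypothetical, and keeping the longest ORF as the "single ORF per transcript" is an assumption made only for illustration.

```python
def conventional_orfs(orfs, min_codons=100):
    """Keep at most one ORF per transcript, at least `min_codons` long.

    Assumption: the single ORF retained per transcript is the longest one,
    and ORFs on pseudogene transcripts are never retained.
    """
    longest = {}
    for orf in orfs:
        if orf["pseudogene"] or orf["codons"] < min_codons:
            continue
        kept = longest.get(orf["transcript"])
        if kept is None or orf["codons"] > kept["codons"]:
            longest[orf["transcript"]] = orf
    return list(longest.values())


def relaxed_orfs(orfs, min_codons=30):
    """Keep every ORF longer than `min_codons`, whatever its transcript."""
    return [orf for orf in orfs if orf["codons"] > min_codons]


# Hypothetical ORFs on three transcripts (T3 stands for a pseudogene transcript).
example_orfs = [
    {"transcript": "T1", "codons": 250, "pseudogene": False},
    {"transcript": "T1", "codons": 45, "pseudogene": False},
    {"transcript": "T2", "codons": 60, "pseudogene": False},
    {"transcript": "T3", "codons": 120, "pseudogene": True},
    {"transcript": "T3", "codons": 20, "pseudogene": True},
]

print(len(conventional_orfs(example_orfs)), "ORF(s) under conventional criteria")
print(len(relaxed_orfs(example_orfs)), "ORF(s) under relaxed criteria")
```

In this toy input, the relaxed criteria retain the second ORF on T1, the short ORF on T2, and the long ORF on the pseudogene transcript, all of which the conventional criteria discard.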

OpenProt discoveries: re-interpret already acquired data

The annotation of sequences is central to current research in biomolecular sciences. The addition of unannotated protein sequences to the OpenProt protein library has resulted in many important discoveries in the human proteome through the re-analysis of publicly available data. Many of these have been selected for further investigation.

The OpenProt pipeline

Prediction pipeline

The OpenProt ORF prediction pipeline starts from an exhaustive description of the transcriptome, consisting of all RNA transcripts reported by Ensembl and NCBI RefSeq. A 3-frame in silico translation then yields the ORFeome: any ORF longer than 30 codons in any frame of any transcript. This ORFeome is then filtered to categorize the predicted ORFs. The first filter retrieves all known proteins, or reference proteins (refProts): all ORFs already annotated in Ensembl, NCBI RefSeq, and/or UniProtKB. The second filter is based on the homology of currently unannotated ORFs with the refProt of the same gene (if applicable), and retrieves novel predicted isoforms. The remaining ORFs encode novel proteins, called alternative proteins (altProts).
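The 3-frame scan described above can be sketched in a few lines of Python. This is an illustrative sketch, not the OpenProt implementation. It assumes that an ORF runs from a canonical ATG start codon to the first in-frame stop codon, that the stop codon is not counted toward the length, and that the most upstream ATG is used. The function name `find_orfs` and its return layout are hypothetical, and the subsequent filtering into refProts, novel isoforms, and altProts is not reproduced.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: only the canonical start codon is considered


def find_orfs(transcript, min_codons=30):
    """Scan the three forward frames of a transcript for ORFs.

    Returns a list of (frame, start, end, n_codons) tuples, where start and
    end are 0-based, end is exclusive and includes the stop codon, and
    n_codons counts the start codon but not the stop codon (an assumption).
    Only ORFs strictly longer than `min_codons` are reported.
    """
    seq = transcript.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None:
                if codon == START_CODON:
                    start = pos  # open an ORF at the most upstream ATG
            elif codon in STOP_CODONS:
                n_codons = (pos - start) // 3
                if n_codons > min_codons:
                    orfs.append((frame, start, pos + 3, n_codons))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical transcript: a 2-nt leader, then ATG + 30 GCC codons + stop.
example = "GG" + "ATG" + "GCC" * 30 + "TAA"
for frame, start, end, n_codons in find_orfs(example):
    print(f"frame={frame} start={start} end={end} codons={n_codons}")
```

Because each frame is scanned independently, one transcript can yield several ORFs, which is what allows multiple ORFs per transcript to be annotated.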

Evidence pipeline

  • Conservation evidence: for every ORF annotated, OpenProt identifies orthologs and paralogs (across the 10 species currently supported by OpenProt).
  • Translation evidence: Publicly available ribosome profiling datasets are re-analysed using the PRICE algorithm. This gathers translation evidence for any ORF annotated in OpenProt.
  • Expression evidence: Publicly available mass spectrometry datasets are re-analysed using multiple search engines. This gathers expression evidence for any ORF annotated in OpenProt.

Acknowledgements

We would like to thank the tools and servers that made the new version of OpenProt possible:
  • MMseqs2: for enabling large-scale multiple sequence alignments
  • AlphaFold: for three-dimensional structure prediction of proteins with an MSA greater than 30
  • OmegaFold: for predicting the three-dimensional structure of proteins from their sequences
  • The Eukaryotic Linear Motif Resource (ELM): for finding short linear motifs in our protein sequences
  • DeepLoc 2.0: for predicting the subcellular localization of proteins from their sequences
  • CompOmics: for SearchGUI, PeptideShaker, and MS2Rescore, the tools at the core of our pipeline
  • InterPro: for InterProScan, which enabled us to predict protein domains
  • GTEx: for expression profiles across tissues
  • PRICE: for generating the RIBO score of our proteins
  • InParanoid: for identifying ortholog and paralog groups
  • JBrowse: for the genome browser in the summary tab
  • PRIDE: for finding and downloading most of our MS studies

Older versions of OpenProt