What is OpenProt?
OpenProt is the first database that enforces a polycistronic model of eukaryotic genome annotation. It therefore annotates known proteins (called RefProts) as well as novel isoforms and novel proteins (called AltProts). It also provides supporting evidence for each protein, such as detection by mass spectrometry and ribosome profiling, protein homology, and predictions of functional domains.
For details, see the original paper (Brunet MA, Brunelle M, Lucier JF, et al., NAR, 2019).
Why use OpenProt?
Annotations substantially shape today's research by defining the scope of what can be investigated. With OpenProt, you gain a better view of the proteomic complexity inherent to each gene and each transcript. Because it gathers experimental evidence, OpenProt is a data-driven protein database. All data are freely available and can be downloaded for in-house analyses.
Getting more out of your data!
How to use OpenProt?
OpenProt offers multiple downloads, in particular for mass spectrometry-based proteomics analyses, as well as a search page and a genome browser that allow users to interrogate the database. Detailed tutorials on getting started, downloads, and frequently asked questions are available on the help page.
You can also refer to our article in JoVE for a tutorial and representative example of OpenProt use in proteomics analysis.
The concept behind OpenProt
The current annotation model
Current genome annotations apply arbitrary criteria for Open Reading Frame (ORF) annotation, such as:
- a minimal ORF length of 100 codons;
- a single ORF per transcript (monocistronic);
- transcripts that do not meet the above criteria are considered non-coding (ncRNAs);
- transcripts from unprocessed pseudogenes are considered non-coding (ncRNAs).
With the rare exception of previously characterized examples, these rules are applied and considerably shape the annotated protein landscape (the Reference Proteome). Yet a wealth of experimental data highlights the pitfalls of such an annotation model.
The OpenProt annotation model
OpenProt challenges the aforementioned arbitrary criteria, and thus:
- annotates any ORF longer than 30 codons;
- annotates multiple ORFs per transcript (polycistronic);
- annotates ORFs within ncRNAs;
- annotates ORFs within pseudogene transcripts.
OpenProt therefore annotates known proteins (Reference Proteins or RefProts), novel isoforms, and novel proteins (Alternative Proteins or AltProts). It offers a deeper and more realistic view of the proteome (the Biological Proteome).
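The difference between the two rule sets can be contrasted in a short Python sketch. It is only an illustration under stated assumptions: the `Orf` record, its fields, and both function names are hypothetical, keeping the longest ORF as the single ORF per transcript is an assumption, and none of this is OpenProt's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    """Hypothetical record for one predicted ORF (not an OpenProt data structure)."""
    transcript_id: str
    n_codons: int
    from_pseudogene: bool = False


def conventional_model(orfs, min_codons=100):
    """Keep at most one ORF per transcript, of at least `min_codons` codons,
    and none from pseudogene transcripts. Picking the longest ORF is an assumption."""
    longest = {}
    for orf in orfs:
        if orf.from_pseudogene or orf.n_codons < min_codons:
            continue  # the transcript stays non-coding unless another ORF qualifies
        kept = longest.get(orf.transcript_id)
        if kept is None or orf.n_codons > kept.n_codons:
            longest[orf.transcript_id] = orf
    return list(longest.values())


def openprot_model(orfs, length_threshold=30):
    """Keep every ORF longer than `length_threshold` codons, whatever its transcript."""
    return [orf for orf in orfs if orf.n_codons > length_threshold]


# Made-up candidates, for illustration only.
candidates = [
    Orf("T1", 350),                        # long ORF on a coding transcript
    Orf("T1", 45),                         # second, shorter ORF on the same transcript
    Orf("T2", 60),                         # short ORF on a transcript otherwise called ncRNA
    Orf("T3", 120, from_pseudogene=True),  # ORF on a pseudogene transcript
    Orf("T4", 25),                         # too short under both models
]

reference_like = conventional_model(candidates)
openprot_like = openprot_model(candidates)
```

With these made-up candidates, the conventional rules retain only the 350-codon ORF on `T1`, while the OpenProt-style rules retain every candidate except the 25-codon ORF on `T4`.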
OpenProt: How does it work?
OpenProt prediction pipeline
The OpenProt prediction pipeline first retrieves transcripts (RNAs) from two widely used annotations (Ensembl and NCBI RefSeq). Together, these constitute an exhaustive transcriptome.
A 3-frame in silico translation then yields the ORFeome: any ORF longer than 30 codons in any frame of any transcript. This ORFeome is then filtered to categorize the predicted ORFs. The first filter retrieves all known proteins (all ORFs already annotated in Ensembl, NCBI RefSeq, and/or UniProtKB); these are the RefProts. The second filter assesses the homology of the ORFs that are not yet annotated with the RefProt of the same gene (if applicable), and retrieves novel predicted isoforms.
The remaining ORFs encode novel proteins, called AltProts.
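The 3-frame in silico translation step can be sketched as follows. This is a minimal Python illustration, not OpenProt's implementation: the function name `find_orfs`, the ATG-only start codon, the standard stop codons, the DNA alphabet, and the choice to count sense codons without the stop codon are all assumptions that the description above does not specify.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)
START_CODON = "ATG"                  # ATG-only initiation (assumption)


def find_orfs(transcript, length_threshold=30):
    """Scan the three forward frames of a transcript for ATG-to-stop ORFs.

    Returns a list of (frame, start, end, n_codons) tuples. Coordinates are
    0-based and end-exclusive, and include the stop codon. n_codons counts
    sense codons only. An ORF is kept when n_codons > length_threshold.
    """
    seq = transcript.upper()
    orfs = []
    for frame in range(3):
        start = None  # first ATG seen since the last stop codon in this frame
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == START_CODON:
                start = pos
            elif start is not None and codon in STOP_CODONS:
                n_codons = (pos - start) // 3
                if n_codons > length_threshold:
                    orfs.append((frame, start, pos + 3, n_codons))
                start = None  # resume the search after this stop codon
    return orfs


# Synthetic transcript carrying two 32-codon ORFs in different frames.
one_orf = "ATG" + "GCT" * 31 + "TAA"
synthetic_transcript = one_orf + "C" + one_orf
predicted = find_orfs(synthetic_transcript)
```

Because each frame is scanned independently and every qualifying ORF is kept, a single transcript can yield several ORFs, which is what a polycistronic model requires. The later filtering into RefProts, novel isoforms, and AltProts depends on external annotations and homology searches and is not covered by this sketch.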
OpenProt evidence pipeline
To increase confidence in ORF expression, since ORFs can also occur by chance, OpenProt accumulates several lines of evidence, such as:
- Conservation evidence: for every ORF annotated, OpenProt identifies orthologs and paralogs (across the 10 species currently supported by OpenProt).
- Translation evidence: OpenProt retrieves publicly available ribosome profiling datasets and re-analyses them using the Price algorithm with a stringent 1% false discovery rate (FDR). This gathers translation evidence for any ORF annotated in OpenProt.
- Expression evidence: OpenProt retrieves publicly available mass spectrometry datasets and re-analyses them using multiple search engines and a stringent 0.001% FDR. This gathers expression evidence for any ORF annotated in OpenProt.