OpenVar Documentation

The hypotheses behind OpenVar

What is the genome annotation model used by OpenVar?

OpenVar uses the OpenProt annotation model. OpenProt is the first and only protein database to enforce a polycistronic model of mammalian genome annotation. Where common genome annotations report a single coding sequence (CDS) per transcript, OpenProt allows for multiple coding sequences per transcript. OpenProt therefore reports dual-coding and polycistronic genes in eukaryotic genomes. For more information, see PMID 33179748.

What is OpenProt? How can I know more about OpenProt?

OpenProt is the first proteogenomic resource supporting a polycistronic annotation model of eukaryotic genomes. It is freely available at www.openprot.org. All releases and related articles are listed in the About section of the website.

What is an alternative open reading frame (altORF) / alternative protein (altProt)?

Current genome annotations in eukaryotes rely partly on ORF prediction algorithms, which are reliable only for sequences above a certain length. Consequently, three main criteria are enforced to distinguish true ORFs from random ones: (1) a minimum length of 100 codons; (2) a single coding sequence (CDS) per transcript; and (3) the use of an ATG start codon. OpenProt relaxes these criteria and uses different terms to identify proteins based on their genome annotation status:
- Proteins currently annotated in databases, such as UniProtKB, are translated from canonical CDSs (reference ORFs or refORFs) and are termed reference proteins (or RefProts).
- Alternative ORFs (or AltORFs) are defined as potential protein-coding ORFs located either in non-coding RNAs (e.g. long non-coding RNAs, pseudogene RNAs, etc.), or in UTRs or alternative reading frames overlapping a CDS in mRNAs. Predicted proteins translated from AltORFs are termed alternative proteins (or AltProts; IP_). An AltProt and a RefProt from the same gene are not isoforms: they are encoded by different ORFs and their amino acid sequences are completely different.
- Predicted proteins translated from an alternative ORF (as defined above) that display either (1) close homology with a reference protein from the same gene, or (2) the same start and/or stop codon as the reference protein, and an alignment score above the threshold, are considered novel isoforms of the reference protein (or Isoforms; II_).
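To make the effect of the classic criteria concrete, here is a minimal sketch of an ORF scan restricted to ATG starts with a minimum codon count. It is an illustration only, not the OpenProt pipeline: the function name find_orfs, the toy sequence, and the lowered threshold of 10 codons are all assumptions chosen for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=100):
    """Return (start, end) positions of ATG-initiated ORFs in the three forward frames.

    Each ORF runs from the first ATG after the previous stop to the next
    in-frame stop codon. The codon count includes the ATG but not the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end index includes the stop codon
                start = None  # close the candidate and keep scanning
    return sorted(orfs)

# Toy transcript (hypothetical): one ORF of 11 codons followed by a stop.
toy = "ATG" + "GCC" * 10 + "TAA"
print(find_orfs(toy))                 # the 100-codon rule discards the short ORF
print(find_orfs(toy, min_codons=10))  # a lower, arbitrary threshold keeps it
```

With the default threshold the short ORF is filtered out, which is how small ORFs escape annotations built on the 100-codon rule.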

How to use the OpenVar website

What do I need to submit an analysis to OpenVar?

To submit an analysis to OpenVar via the website, you will need the following:
- An email address: the email address will be used to send you a unique link to the results of the analysis.
- A study name: the study name will be used to identify your analysis report.
- The species: select the appropriate species for your analysis.
- The genome assembly: select the genome assembly that was used for variant calling (for Human, please choose carefully between hg19, b37 and hg38).
- The desired genome build: select the genome build with which you would like to annotate your variant calling file (VCF).
- A variant calling file (VCF): upload your VCF. Please note that no VCFs are stored on our servers after your result link expires (10 days after completion of your analysis). Wait for the upload to complete before submitting your analysis.

When will I get my results?

This depends on the size of the VCF you submit: for a VCF with 500 variants, expect about 5 minutes; for a VCF with 5 million variants, expect about 1 hour.

Can I download the results of the analysis? What is in the output of the analysis?

Yes, absolutely! Everything that you see on the results page can be downloaded. You can separately download either your annotated VCF for downstream in-house analysis, or the table of maximal impact annotation per ORF type for each variant. Alternatively, you can download everything as a zipped folder by clicking on "download all results".
The zipped folder contains the following:
- The submitted input VCF (input_vcf.vcf).
- A text file of all analysis warnings (warnings.txt). This file lists all SNPs that were not included in the analysis and the reason why (e.g. none of the alleles match the allele of the reference genome at the given position: invalid alleles).
- The usual annotated VCF (unique_id.ann.vcf).
- A TSV file listing all consequences for each variant (study_name_annOnePerLine.tsv). This file contains as many lines as there are consequences across all of the submitted variants. For example, a SNP altering a canonical ORF and an alternative ORF has 2 lines, one listing the effect on the canonical ORF, the second listing the effect on the alternative ORF.
- A TSV file listing only the maximal impact on canonical and alternative ORFs for each variant (study_name_max_impact.tsv). This file contains as many lines as there are submitted variants. For example, for a SNP altering a canonical ORF and two alternative ORFs, the SNP's line contains the maximal impact on the canonical ORF alongside the highest impact on any of the alternative ORFs. Please note that the maximal impact is based on the SnpEff categorization into modifier, low, moderate or high impact (for more information, consult the SnpEff documentation). With this file, you always keep the effect on the canonical ORFs, but you also have the information gained from deeper annotation. Two columns (ref_max_impact and alt_max_impact) ease filtering: to look at the variants whose highest impact is in an alternative ORF, filter on alt_max_impact = 1. Please note that when the impact of the SNP falls in the same category for the canonical ORF and the alternative ORF, both columns equal 1 (ref_max_impact = 1 and alt_max_impact = 1).
- Figures: the zipped folder includes 4 figures in .svg format, which are easy to customize and include in publications. The figures consist of the top 100 mutated genes (top_genes_var_rate.svg), the number of variants per category of impact in canonical and alternative proteins (var_per_impact.svg), the fold-change for each impact category gained with deeper annotation (impact_foldchange.svg), and a bar chart of the number of altORFs against the percentage of each gene's variants clustered on the altORF (hotspots_barchart.svg). In the last figure, for example, the far right of the graph shows the number of alternative proteins that each account for all of the SNPs of the corresponding gene. They are divided into categories based on the total number of SNPs: only one SNP, one to ten SNPs, and more than ten SNPs.
- A pickle object (summary.pkl), which contains, in a Python format, all the summary statistics needed to present the results on the web page. This should ease any downstream filtering you may want to perform.
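As an illustration of the filtering described for study_name_max_impact.tsv, the sketch below separates variants whose highest impact is found only in an alternative ORF from those where the canonical and alternative ORFs share the same category. Only the column names ref_max_impact and alt_max_impact come from the description above; the "variant" column, the toy rows, the use of 0 for an unset flag, and the function names are assumptions, so adapt them to the header of your own file.

```python
import csv
import io

def is_flag_set(value):
    """Return True when a flag column holds 1 (accepts '1' or '1.0')."""
    try:
        return float(value) == 1.0
    except (TypeError, ValueError):
        return False

def split_by_max_impact(tsv_handle):
    """Split rows into alt-only top impact vs. top impact shared with the canonical ORF."""
    alt_only, shared = [], []
    for row in csv.DictReader(tsv_handle, delimiter="\t"):
        alt = is_flag_set(row.get("alt_max_impact"))
        ref = is_flag_set(row.get("ref_max_impact"))
        if alt and not ref:
            alt_only.append(row)   # highest impact is in an alternative ORF only
        elif alt and ref:
            shared.append(row)     # same impact category in both ORF types
    return alt_only, shared

# Toy table; the "variant" column and its values are hypothetical.
EXAMPLE_TSV = (
    "variant\tref_max_impact\talt_max_impact\n"
    "var_a\t1\t0\n"
    "var_b\t0\t1\n"
    "var_c\t1\t1\n"
)

alt_only, shared = split_by_max_impact(io.StringIO(EXAMPLE_TSV))
print([row["variant"] for row in alt_only])
print([row["variant"] for row in shared])
```

To run this on a real download, replace the io.StringIO object with an open file handle, for example open("study_name_max_impact.tsv", newline="").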

What are the impact categories reported by OpenVar?

OpenVar uses the same categorization of impacts as SnpEff. Some examples are listed below:
- Modifier (impact 0): intergenic variants, intronic variants, and UTR variants.
- Low (impact 1): synonymous variants.
- Moderate (impact 2): missense variants.
- High (impact 3): frameshift insertions or deletions, premature stop codons.
For an exhaustive list of mutation categories, please consult the SnpEff documentation.
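When post-processing results, the four categories can be ranked with the numeric levels listed above. The following is a minimal sketch; the names IMPACT_LEVELS, impact_level and highest_impact are illustrative and not part of OpenVar or SnpEff.

```python
# Numeric levels as listed above (0 = modifier ... 3 = high).
IMPACT_LEVELS = {"MODIFIER": 0, "LOW": 1, "MODERATE": 2, "HIGH": 3}

def impact_level(category):
    """Map an impact category name (case-insensitive) to its numeric level."""
    return IMPACT_LEVELS[category.strip().upper()]

def highest_impact(categories):
    """Return the category with the highest level among several consequences."""
    return max(categories, key=impact_level)

# Hypothetical consequences of one variant on several ORFs.
consequences = ["modifier", "moderate", "low"]
top = highest_impact(consequences)
print(top, impact_level(top))
```

An unknown category name raises a KeyError, which makes unexpected labels visible instead of silently ranking them.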