Protein structure determination using metagenome sequence data
Sergey Ovchinnikov, Hahnbeom Park, Neha Varghese, Po-Ssu Huang, Georgios A. Pavlopoulos, David E. Kim, Hetunandan Kamisetty, Nikos C. Kyrpides, David Baker.
- Contacts - Using bitscore 27, Pfam consensus sequence used
- Structures - interactive data. Includes the predicted contacts overlaid on contacts made in the models.
- Map_align - source code for our contact map alignment software (Hosted at GitHub)
- What is metagenomics?
Metagenomics is the study of genetic material from environmental samples.
A metagenome is a collection of sequences from a particular sample.
- How are metagenomic sequences different from those found in Uniprot/Genbank?
Traditional sequencing efforts have been biased toward organisms that can be cultured in the lab. Metagenomic datasets, on the other hand, often contain sequences
from unknown organisms and from protein families (PFAMs) not well represented in current sequence databases. For example, there are large protein families
found in all aquatic metagenomic sequences, yet these PFAMs have been neglected due to their low counts in traditional sequencing efforts.
- For the interactive database, what are the different categories?
Some targets are, we think, of biological or structural interest; these we [highlight] in the paper and/or supplementary information. To
make these easy to find, we put them at the top of the list. Though we are [confident] about the fold for all the posted targets,
some targets look like they could use some [refine]ment. Refinement is usually needed if the target was broken into multiple domains and is in need of
assembly, or if the target was modelled as a monomer though it looks like it could form a homo-oligomer.
- What are "con" and "rc"?
The rc score is the ratio (# of contacts made)/(# of expected contacts). The closer this number is to 1, the better the fit to the co-evolution data. There are some
caveats, though: it is relatively easy to make a "spaghetti" model that makes all the predicted contacts but is not physically/biologically realistic. Also, not all contacts should be
made. Some contacts are involved in conformational change, ligand mediation and higher-order symmetry. Because of this, we restrict
sampling to protein-like space (using fragments from known structures) and do not force any contacts. To gauge whether the model is correct, we look at convergence (con),
which is the average pairwise TMscore of many independent runs from the initial stages of our protocol. TMscore is
a metric in the range (0,1] for measuring the structural similarity of two protein models.
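
To make the two scores concrete, here is a minimal Python sketch. It is not the code used for the paper, and everything in it is an illustrative assumption: the function names, the input formats, the 8 Å distance cutoff for deciding that a contact is "made", and the use of the number of supplied predicted contacts as the "expected" count. The TMscore calculation itself is not implemented; `convergence` takes any pairwise similarity function the caller supplies.

```python
from itertools import combinations
from math import dist


def rc_score(predicted_contacts, coords, cutoff=8.0):
    """Ratio of predicted contacts that a model makes.

    predicted_contacts: iterable of (i, j) residue index pairs (hypothetical format)
    coords: dict mapping residue index -> (x, y, z) of one representative atom
    cutoff: distance in angstroms at or below which a pair counts as made
            (assumed value, not taken from the paper)
    """
    contacts = list(predicted_contacts)
    if not contacts:
        raise ValueError("no predicted contacts given")
    made = sum(1 for i, j in contacts if dist(coords[i], coords[j]) <= cutoff)
    # Assumption: the expected count is simply the number of contacts supplied.
    return made / len(contacts)


def convergence(models, similarity):
    """Average of similarity(a, b) over all unordered pairs of models.

    similarity: caller-supplied function returning a score such as TMscore.
    """
    pairs = list(combinations(models, 2))
    if not pairs:
        raise ValueError("need at least two models")
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)
```

For example, a model that places one of two predicted residue pairs within the cutoff gets an rc of 0.5, and three models whose pairwise scores are 0.9, 0.6 and 0.3 have a convergence of about 0.6.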