Data selection by intersecting COG, PFam and EC
Three widely used gene classifications were used: COG, PFam and EC.
Clusters of Orthologous Groups (COG)
COG is a very widely used classification of genes.
Clusters are built from the complete gene sets of organisms. A similarity search
is run with an arbitrary seed gene as the query, and the highest-scoring gene is
recorded. A second search is then run with that gene as the query. If the
original query gene comes back as the highest-scoring gene, the two genes form a
pseudo-cluster. A cluster forms when three genes, each from a different
organism, are mutually the highest-similarity matches of one another. A cluster
then expands by including any gene whose highest-similarity match is a gene
already in the cluster. These strict criteria for forming a cluster and for
adding genes to it make the classification very specific. Because of this high
specificity, each gene belongs to at most one cluster.
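The clustering steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual COG pipeline: the `scores` and `genome_of` inputs, the gene names, and the function names are all hypothetical, and the sketch assumes that the "highest similarity" match is taken separately for each other genome, which is what allows three genes to be mutual best matches.

```python
from itertools import combinations


def best_hits(scores, genome_of):
    """For each gene, find its highest-scoring hit in every other genome.

    scores: dict mapping (query, target) -> similarity score (hypothetical input).
    genome_of: dict mapping gene -> genome name.
    Returns a dict mapping (query, target_genome) -> best target gene.
    """
    best = {}
    for (query, target), score in scores.items():
        if genome_of[query] == genome_of[target]:
            continue  # assumption: only cross-genome hits are considered
        key = (query, genome_of[target])
        if key not in best or score > scores[(query, best[key])]:
            best[key] = target
    return best


def is_mutual(a, b, best, genome_of):
    """True when a and b are each other's best hit (a two-gene pseudo-cluster)."""
    return best.get((a, genome_of[b])) == b and best.get((b, genome_of[a])) == a


def seed_triangles(genes, best, genome_of):
    """Return triples of genes from three genomes that are all mutual best hits."""
    triangles = []
    for a, b, c in combinations(sorted(genes), 3):
        if len({genome_of[a], genome_of[b], genome_of[c]}) < 3:
            continue
        pairs = ((a, b), (a, c), (b, c))
        if all(is_mutual(x, y, best, genome_of) for x, y in pairs):
            triangles.append({a, b, c})
    return triangles


def expand(cluster, genes, best, genome_of):
    """Add genes whose best hit in some genome is already a cluster member."""
    cluster = set(cluster)
    changed = True
    while changed:
        changed = False
        for g in genes:
            if g in cluster:
                continue
            if any(best.get((g, genome_of[m])) == m for m in cluster):
                cluster.add(g)
                changed = True
    return cluster


# Toy data: made-up genes and scores, for illustration only.
genome_of = {"a1": "A", "b1": "B", "c1": "C", "d1": "D", "x1": "A", "y1": "B"}
pair_scores = {("a1", "b1"): 90, ("a1", "c1"): 85, ("b1", "c1"): 88,
               ("d1", "a1"): 70, ("x1", "y1"): 95}
scores = {}
for (g1, g2), s in pair_scores.items():
    scores[(g1, g2)] = s  # store both directions of each comparison
    scores[(g2, g1)] = s

best = best_hits(scores, genome_of)
triangles = seed_triangles(genome_of, best, genome_of)
cluster = expand(triangles[0], genome_of, best, genome_of)
print(sorted(cluster))
```

In this toy input, `a1`, `b1` and `c1` form the seed triangle, `d1` joins because its best match is `a1`, and `x1` and `y1` remain a two-gene pseudo-cluster because they match only each other.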
Protein family (PFam)
PFam is another widely used classification scheme. Each
family is grown from a manually selected seed. An iterative
process of profile building and similarity search then classifies further genes
into the family.
Enzyme Commission number (EC)
This numbering scheme was established by the Enzyme Commission and is now maintained by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB). A gene is assigned an identifier composed of four numbers. The numbers form a hierarchy in which each subsequent number designates a more specific functionality than the preceding one. Numbers are assigned manually, so the classification is considered very accurate.
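The hierarchy can be illustrated with a short Python sketch that counts how many leading levels two EC numbers share. The function names `parse_ec` and `shared_depth` are hypothetical, and the three EC numbers (alcohol dehydrogenase, L-lactate dehydrogenase and hexokinase) are illustrative examples that are not drawn from the data sets used here.

```python
def parse_ec(ec):
    """Split an EC identifier such as '1.1.1.1' into its four numeric levels."""
    parts = ec.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {ec!r}")
    return tuple(int(p) for p in parts)


def shared_depth(ec_a, ec_b):
    """Count the leading levels two EC numbers have in common (0 to 4)."""
    depth = 0
    for x, y in zip(parse_ec(ec_a), parse_ec(ec_b)):
        if x != y:
            break  # deeper levels are only meaningful under a shared prefix
        depth += 1
    return depth


# EC 1.1.1.1 (alcohol dehydrogenase) and EC 1.1.1.27 (L-lactate dehydrogenase)
# agree on the first three levels and differ only at the most specific one.
print(shared_depth("1.1.1.1", "1.1.1.27"))  # 3

# EC 2.7.1.1 (hexokinase) differs at the top level, so nothing is shared,
# even though its third number happens to match.
print(shared_depth("1.1.1.1", "2.7.1.1"))  # 0
```

Comparison stops at the first mismatch because a number at a deeper level only has meaning within the context of the levels above it.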
Data selection
Genetic sequences often possess multiple domains that pertain to their folding kinetics and in vivo functionality. Sequences that share domains are likely to act on common substrates. A nearly identical domain structure means that two sequences are very likely functionally identical genes found in different organisms. Twelve families that have a one-to-one correspondence among COG, PFam and EC were selected. The families were combined three to a data set, giving four data sets in total. Since each data set comprises highly distinct families, ambiguity in gene membership is removed and clear cluster boundaries are achieved.
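One way to read the one-to-one criterion is sketched below in Python. This is an assumption about how such a filter could be written, not the procedure actually used: the `annotations` input, the function name `one_to_one_families`, and the placeholder labels (`cogA`, `pfamA`, `ecA`, and so on) are all hypothetical. A family is kept only when its COG, PFam and EC labels never occur with any other combination of labels.

```python
from collections import defaultdict


def one_to_one_families(annotations):
    """Return {(cog, pfam, ec): genes} for label triples that map one-to-one.

    annotations: iterable of (gene, cog, pfam, ec) tuples (hypothetical input).
    A triple is kept only if each of its three labels occurs in no other triple.
    """
    triples_by_label = defaultdict(set)
    members = defaultdict(set)
    for gene, cog, pfam, ec in annotations:
        triple = (cog, pfam, ec)
        members[triple].add(gene)
        for key in (("cog", cog), ("pfam", pfam), ("ec", ec)):
            triples_by_label[key].add(triple)

    selected = {}
    for triple, genes in members.items():
        cog, pfam, ec = triple
        keys = (("cog", cog), ("pfam", pfam), ("ec", ec))
        if all(triples_by_label[k] == {triple} for k in keys):
            selected[triple] = genes
    return selected


# Placeholder annotations, for illustration only.
annotations = [
    ("g1", "cogA", "pfamA", "ecA"),
    ("g2", "cogA", "pfamA", "ecA"),
    ("g3", "cogB", "pfamB", "ecB"),
    ("g4", "cogC", "pfamB", "ecC"),  # pfamB is shared, so both triples are dropped
    ("g5", "cogD", "pfamD", "ecD"),
]

families = one_to_one_families(annotations)
for triple in sorted(families):
    print(triple, sorted(families[triple]))
```

Here the families labelled `cogB` and `cogC` are rejected because both map to `pfamB`, which leaves only families whose membership is unambiguous under all three classifications.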