Data selection by intersecting COG, PFam and EC

 

Three widely used gene classification schemes were used: COG, PFam and EC.

Clusters of Orthologous Groups (COG)

COG is a very widely used classification of genes. Clusters are built from the complete gene sets of organisms. A similarity search is run with an arbitrary seed gene as the query, and the gene with the highest similarity score is returned. A second search is then run with the returned gene as the query. If the initial query gene comes back as the highest-scoring gene, the two genes form a pseudo-cluster. A cluster forms when three genes are mutually one another's highest-similarity matches. A cluster expands by adding any gene whose highest-similarity match is a gene already in the cluster. This strict criterion for forming and extending clusters makes the classification very specific. Because of this high specificity, each gene belongs to at most one cluster.
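The cluster-building steps described above can be sketched in a few lines of Python. This is only a toy illustration, not the actual COG pipeline: the `best_hits` table, the gene names and the function names are hypothetical, and the sketch assumes that each gene's highest-similarity matches (one per other organism) have already been computed by some similarity search.

```python
from itertools import combinations


def is_reciprocal(best_hits, a, b):
    """True if a and b are each among the other's highest-similarity matches."""
    return b in best_hits.get(a, set()) and a in best_hits.get(b, set())


def build_clusters(best_hits):
    """Seed clusters from mutually best-matching triples, then expand them."""
    genes = sorted(best_hits)
    assigned = set()   # genes that already belong to a cluster
    clusters = []
    for a, b, c in combinations(genes, 3):
        if a in assigned or b in assigned or c in assigned:
            continue
        # A cluster forms only when all three pairs are reciprocal best matches.
        if not (is_reciprocal(best_hits, a, b)
                and is_reciprocal(best_hits, b, c)
                and is_reciprocal(best_hits, a, c)):
            continue
        cluster = {a, b, c}
        # Expand: add any unassigned gene whose best match is already a member.
        grew = True
        while grew:
            grew = False
            for g in genes:
                if g in assigned or g in cluster:
                    continue
                if best_hits[g] & cluster:
                    cluster.add(g)
                    grew = True
        assigned |= cluster
        clusters.append(cluster)
    return clusters


# Hypothetical best-match table: gene -> set of its highest-similarity matches.
best_hits = {
    "a1": {"b1", "c1"},
    "b1": {"a1", "c1"},
    "c1": {"a1", "b1"},
    "d1": {"a1"},   # joins the cluster through its best match a1
    "x1": {"y1"},
    "y1": {"x1"},   # reciprocal pair only: a pseudo-cluster, not a cluster
}
clusters = build_clusters(best_hits)
```

Because a gene is marked as assigned as soon as it joins a cluster, no gene can end up in two clusters, which mirrors the "at most one cluster" property. The triple enumeration is cubic in the number of genes and is meant for illustration only.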
 

Protein family (PFam)

PFam is another widely used classification scheme. Each family starts from a manually selected seed set of member sequences. An iterative process of profile building and similarity search is then performed to classify further genes into the family.
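The iterative loop of profile building and similarity search can be illustrated with a deliberately simplified sketch. PFam itself relies on profile hidden Markov models built from a seed alignment; the code below instead assumes ungapped sequences of equal length and uses plain column-wise residue frequencies as the "profile". The function names, the threshold and the example sequences are all hypothetical.

```python
from collections import Counter


def build_profile(members):
    """Column-wise residue frequencies for equal-length, ungapped sequences."""
    length = len(members[0])
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in members)
        profile.append({res: n / len(members) for res, n in counts.items()})
    return profile


def score(profile, seq):
    """Mean profile frequency of the residues in seq (between 0 and 1)."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, seq)) / len(profile)


def grow_family(seed, candidates, threshold=0.45, max_rounds=10):
    """Alternate profile building and searching until no new member is found."""
    members = list(seed)
    remaining = [c for c in candidates if c not in members]
    for _ in range(max_rounds):
        profile = build_profile(members)
        hits = [c for c in remaining
                if len(c) == len(profile) and score(profile, c) >= threshold]
        if not hits:
            break
        members.extend(hits)
        remaining = [c for c in remaining if c not in hits]
    return members


# Toy sequences: "AGGF" is too distant from the seed profile in round one
# but is picked up in round two, after "ACGF" has been added to the profile.
family = grow_family(seed=["ACDE", "ACDF"], candidates=["ACGF", "AGGF", "WWWW"])
```

The example shows why iteration matters: a sequence that does not match the seed profile closely enough can still be recruited once the profile has been rebuilt with the members found in an earlier round.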
 

Enzyme Commission number (EC)

This numbering scheme is managed by the Enzyme Commission, a nomenclature body. A gene is assigned an identifier composed of four numbers. The numbers form a hierarchy in which each subsequent number designates a function of higher specificity than the preceding one. The identifier is assigned manually and is therefore considered very accurate.
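The hierarchical reading of the four numbers can be made concrete with a short sketch. It assumes the usual dotted notation (for example "1.1.1.1") and treats "-" as an unspecified level; the function names are hypothetical. Two identifiers that agree on more leading numbers describe more closely related functions.

```python
def parse_ec(ec):
    """Split an EC identifier such as '1.1.1.1' into its four levels."""
    parts = ec.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec!r}")
    return tuple(parts)


def shared_levels(ec_a, ec_b):
    """Count how many leading levels two EC identifiers have in common."""
    shared = 0
    for x, y in zip(parse_ec(ec_a), parse_ec(ec_b)):
        # Stop at the first mismatch or at an unspecified level ('-').
        if x != y or x == "-":
            break
        shared += 1
    return shared


same_enzyme = shared_levels("1.1.1.1", "1.1.1.1")     # all four levels agree
close_function = shared_levels("1.1.1.1", "1.1.1.2")  # differ only at the last level
unrelated = shared_levels("1.1.1.1", "2.7.1.1")       # differ at the first level
```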

 

Data selection

Genetic sequences often possess multiple domains that relate to their folding kinetics and in vivo function. Sequences that share domains are likely to act on common substrates. A highly identical domain structure means that two sequences are very likely functionally identical genes found in different organisms. Twelve families that have a one-to-one correspondence among COG, PFam and EC were selected. The families were combined three at a time into data sets, giving a total of four sets. Since each data set comprises highly distinct families, ambiguity in gene membership is removed and clear cluster boundaries are obtained.
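The selection criterion can be sketched as follows. The sketch assumes that every gene carries exactly one COG, one PFam and one EC label, and it keeps only those label triples in which each label occurs with no other partner. The gene names, the placeholder labels ("COG_A", "PF_A", "EC_A", and so on) and the function names are hypothetical and do not correspond to the twelve families actually selected.

```python
from collections import defaultdict


def one_to_one_families(annotations):
    """Return (COG, PFam, EC) triples whose labels map to each other uniquely."""
    partners = defaultdict(set)   # (scheme, label) -> triples it appears in
    for cog, pfam, ec in annotations.values():
        triple = (cog, pfam, ec)
        partners[("COG", cog)].add(triple)
        partners[("PFam", pfam)].add(triple)
        partners[("EC", ec)].add(triple)
    selected = []
    for cog, pfam, ec in set(annotations.values()):
        keys = (("COG", cog), ("PFam", pfam), ("EC", ec))
        # Keep the triple only if none of its labels occurs in another triple.
        if all(len(partners[k]) == 1 for k in keys):
            selected.append((cog, pfam, ec))
    return sorted(selected)


def group_families(families, size=3):
    """Combine consecutive families into data sets of the given size."""
    return [families[i:i + size] for i in range(0, len(families), size)]


# Hypothetical annotations: gene -> (COG, PFam, EC) labels.
annotations = {
    "g1": ("COG_A", "PF_A", "EC_A"),
    "g2": ("COG_A", "PF_A", "EC_A"),
    "g3": ("COG_B", "PF_B", "EC_B"),
    "g4": ("COG_C", "PF_C", "EC_C"),
    "g5": ("COG_C", "PF_D", "EC_C"),   # COG_C maps to two PFam families
}
families = one_to_one_families(annotations)
data_sets = group_families(list(range(12)))   # twelve families, three per set
```

In this toy input, the families labelled A and B survive the filter, while everything involving COG_C is discarded because that cluster corresponds to two PFam families. Grouping twelve selected families three at a time yields the four data sets described above.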