Absolute CU value as a measure of overall data homogeneity

Five COG families were selected (Table 1). Every combination of three families out of the five was then taken as a data set, giving a total of 5C3 = 10 data sets. In clustering, edges are distinguished by whether they join two nodes in different clusters (out-edges) or two nodes in the same cluster.
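The enumeration of the data sets can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study; the names COG_FAMILIES and three_family_sets are hypothetical, and the family identifiers are taken from Table 1.

```python
from itertools import combinations

# COG family identifiers as listed in Table 1.
COG_FAMILIES = ["0001", "0160", "0161", "0331", "0523"]


def three_family_sets(families):
    """Return every unordered choice of three families, in lexicographic order."""
    return list(combinations(families, 3))


data_sets = three_family_sets(COG_FAMILIES)

# Print one line per data set, numbered from 1.
for number, members in enumerate(data_sets, start=1):
    print(f"({number}) COG {', '.join(members)}")
```

Because the input list is sorted, the combinations come out in the same order as the panels of Figure 2, starting with COG 0001, 0160, 0161 and ending with COG 0161, 0331, 0523.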

Homogeneity denotes the overall degree of mutual similarity between clusters of a data set. It is quantitatively defined as

H = Number of out-edges / Total number of edges

This value varies between 0 and 1. A value of zero corresponds to the case where all edges lie within clusters and none bridge across clusters; a value of one corresponds to the case where every edge bridges two clusters. The homogeneity values of the ten data sets span a range of approximately 0.8, which is very wide.
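The definition of H can be sketched in a few lines. The representation used here is an assumption made for illustration only: edges are given as pairs of node names and cluster_of is a hypothetical mapping from node name to cluster label. The toy graph is not taken from the COG data.

```python
def homogeneity(edges, cluster_of):
    """Fraction of edges whose two endpoints lie in different clusters."""
    edges = list(edges)
    if not edges:
        raise ValueError("homogeneity is undefined for a graph with no edges")
    # An out-edge joins two nodes that belong to different clusters.
    out_edges = sum(1 for u, v in edges if cluster_of[u] != cluster_of[v])
    return out_edges / len(edges)


# Toy example (illustrative only): two clusters of two nodes each.
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1}
edges = [("a", "b"), ("c", "d"), ("b", "c"), ("a", "d")]

# Two of the four edges bridge the clusters.
print(homogeneity(edges, cluster_of))  # 0.5
```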

Perfect clustering results were generated for each data set and the corresponding CU values were measured. Figure 1 indicates that the CU value decreases linearly as the homogeneity increases. The maximum CU value is 3.5 and the minimum is approximately -0.5.
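One way to check a linear trend of this kind is an ordinary least-squares fit of CU against homogeneity. The sketch below is an assumption about how such a check could be done, not the method used to produce Figure 1; fit_line is a hypothetical helper, and the numbers in the usage example are toy values, not measurements from the ten data sets.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy values only (NOT data from the study): a perfectly linear decrease.
toy_x = [0.0, 1.0, 2.0]
toy_y = [5.0, 3.0, 1.0]
slope, intercept = fit_line(toy_x, toy_y)
print(slope, intercept)  # -2.0 5.0
```

A negative slope corresponds to CU decreasing as homogeneity increases.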

Table 1. Five COG family profiles

COG     Nodes   Description
0001    35      [H] COG0001 Glutamate-1-semialdehyde aminotransferase
0160    79      [E] COG0160 PLP-dependent aminotransferases
0161    49      [H] COG0161 Adenosylmethionine-8-amino-7-oxononanoate aminotransferase
0331    38      [I] COG0331 (acyl-carrier-protein) S-malonyltransferase
0523    28      [R] COG0523 Putative GTPases (G3E family)

 

Figure 1. Linear relationship between CU value and data homogeneity

 

Plots of QI vs. CU

(1) COG 0001, 0160, 0161
(2) COG 0001, 0160, 0331
(3) COG 0001, 0160, 0523
(4) COG 0001, 0161, 0331
(5) COG 0001, 0161, 0523
(6) COG 0001, 0331, 0523
(7) COG 0160, 0161, 0331
(8) COG 0160, 0161, 0523
(9) COG 0160, 0331, 0523
(10) COG 0161, 0331, 0523

Figure 2. Linear correlation between QI and CU for each data set