Quality, Containment and Fragmentation Index

 

To measure clustering quality, a set of indices has been defined.

The containment index (CI) is defined as

CI = (Number of correctly clustered sequences) / (Total number of sequences)

CI is the fraction of sequences that have been correctly clustered, that is, sequences that are not mixed with sequences from other families in the same cluster.

CI varies between 0 and 1. A value of 0 indicates that every cluster contains sequences from more than one family. A value of 1 indicates that every cluster is made up of sequences from a single family. In the extreme case where all clusters are singletons, CI is 1, yet the clustering result is uninformative and the clustering quality is low. A second index, the fragmentation index (FI), is therefore introduced.
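The CI definition can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes that a sequence counts as "correctly clustered" when its cluster holds sequences from only one family, and the function name `containment_index` and the `(cluster_id, family)` input format are hypothetical choices made for this example.

```python
from collections import defaultdict


def containment_index(assignments):
    """Return CI for an iterable of (cluster_id, family) pairs, one per sequence.

    Assumption: a sequence is correctly clustered when every sequence
    in its cluster belongs to the same family.
    """
    families_in_cluster = defaultdict(set)
    cluster_size = defaultdict(int)
    for cluster_id, family in assignments:
        families_in_cluster[cluster_id].add(family)
        cluster_size[cluster_id] += 1

    total = sum(cluster_size.values())
    if total == 0:
        raise ValueError("no sequences given")

    # Count sequences that sit in single-family clusters.
    correct = sum(
        size
        for cluster_id, size in cluster_size.items()
        if len(families_in_cluster[cluster_id]) == 1
    )
    return correct / total


# Six sequences: cluster c1 mixes families A and B, c2 and c3 are pure.
example = [("c1", "A"), ("c1", "A"), ("c1", "B"),
           ("c2", "A"), ("c3", "B"), ("c3", "B")]
print(containment_index(example))  # 0.5

# All-singleton clustering of the same sequences: every cluster is pure.
singletons = [(i, family) for i, (_, family) in enumerate(example)]
print(containment_index(singletons))  # 1.0
```

The second call reproduces the extreme case described above: the singleton clustering scores a perfect CI even though it groups nothing.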

The fragmentation index (FI) is defined as the sum, over all families, of the fraction of sequences that do not belong to the cluster holding the largest number of that family's sequences. Such sequences are said to lie in minor clusters.

FI = (Number of sequences in minor clusters) / (Total number of sequences)

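FI also varies between 0 and 1. In the all-singleton case mentioned above, FI is close to 1: with N sequences from F families, only one sequence per family counts as being in its family's largest cluster, so FI = (N - F) / N.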
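The FI definition can be sketched the same way. This is again only an illustration under stated assumptions: the function name `fragmentation_index` and the `(cluster_id, family)` input format are hypothetical, and ties for a family's largest cluster are resolved by treating one of the tied clusters as the main one.

```python
from collections import Counter, defaultdict


def fragmentation_index(assignments):
    """Return FI for an iterable of (cluster_id, family) pairs, one per sequence."""
    # For each family, count how many of its sequences land in each cluster.
    per_family = defaultdict(Counter)
    for cluster_id, family in assignments:
        per_family[family][cluster_id] += 1

    total = sum(sum(counts.values()) for counts in per_family.values())
    if total == 0:
        raise ValueError("no sequences given")

    # Sequences outside their family's largest cluster are in minor clusters.
    minor = sum(
        sum(counts.values()) - max(counts.values())
        for counts in per_family.values()
    )
    return minor / total


# Family A is split 2 + 1 and family B is split 1 + 2,
# so one sequence per family lies in a minor cluster.
example = [("c1", "A"), ("c1", "A"), ("c1", "B"),
           ("c2", "A"), ("c3", "B"), ("c3", "B")]
print(fragmentation_index(example))  # 2 minor sequences out of 6

# All-singleton clustering: all but one sequence per family is "minor".
singletons = [(i, family) for i, (_, family) in enumerate(example)]
print(fragmentation_index(singletons))  # 4 minor sequences out of 6
```

Under this reading, an all-singleton clustering of N sequences from F families yields (N - F) / N, which approaches 1 as families grow large.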

CI is positively related to clustering quality and FI is negatively related to it, so the two are combined into a single scalar value, the quality index (QI).

QI = CI - FI
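As a worked example, the three indices can be computed together from a table of (cluster, family) counts. The sketch below is illustrative only: the function name `quality_indices` and the input format are hypothetical, and "correctly clustered" is assumed to mean "in a cluster that holds a single family".

```python
from collections import Counter


def quality_indices(assignments):
    """Return (CI, FI, QI) for a list of (cluster_id, family) pairs."""
    pair_counts = Counter(assignments)  # (cluster_id, family) -> count
    if not pair_counts:
        raise ValueError("no sequences given")

    cluster_size = Counter()
    family_size = Counter()
    families_per_cluster = Counter()
    largest_share = {}  # family -> size of its largest single-cluster group
    for (cluster_id, family), n in pair_counts.items():
        cluster_size[cluster_id] += n
        family_size[family] += n
        families_per_cluster[cluster_id] += 1
        largest_share[family] = max(largest_share.get(family, 0), n)

    total = sum(cluster_size.values())
    # CI: sequences in single-family clusters (assumed meaning of "correct").
    ci = sum(n for c, n in cluster_size.items()
             if families_per_cluster[c] == 1) / total
    # FI: sequences outside their family's largest cluster.
    fi = sum(family_size[f] - largest_share[f] for f in family_size) / total
    return ci, fi, ci - fi


# Ideal case: each family sits whole in its own cluster.
ideal = [("c1", "A")] * 3 + [("c2", "B")] * 3
print(quality_indices(ideal))  # (1.0, 0.0, 1.0)

# Mixed case: c1 holds both families, and both families are split.
mixed = [("c1", "A"), ("c1", "A"), ("c1", "B"),
         ("c2", "A"), ("c3", "B"), ("c3", "B")]
ci, fi, qi = quality_indices(mixed)
print(ci, fi, qi)  # CI = 3/6, FI = 2/6, QI = 1/6
```

Because CI and FI each lie between 0 and 1, QI lies between -1 and 1, with 1 reached only when every family occupies exactly one pure cluster.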