Clustering Algorithms Used
Sequences to be clustered are often abstracted as graph nodes, and the similarity between sequences is modeled as graph edges. The sequence clustering problem then reduces to a graph problem. Forming a cluster requires a degree of similarity among its members, and different conditions can be used as the criterion for identifying sets of nodes that possess high similarity. BAG includes a more comprehensive survey of various sequence clustering algorithms.
The clustering package BAG has mainly been used. It uses bi-connectedness of components as the criterion for forming clusters. An edge between two sequences is weighted by their pairwise similarity. The graph is kept undirected by taking the maximum of the two possibly different edge scores between each pair of sequences.
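The edge construction and clustering criterion described above can be illustrated with a small sketch. It is not BAG's implementation: the function names, the `cutoff` parameter, and the example scores are hypothetical, and the biconnected components are found with a textbook depth-first search rather than whatever BAG uses internally.

```python
from collections import defaultdict

def symmetrize(directed_scores, cutoff=0.0):
    """Collapse directed scores {(a, b): score} into undirected edges.

    Each unordered pair keeps the maximum of its two possible scores.
    `cutoff` is a hypothetical threshold for keeping an edge at all.
    """
    undirected = {}
    for (a, b), score in directed_scores.items():
        if a == b:
            continue  # ignore self-hits
        key = frozenset((a, b))
        undirected[key] = max(score, undirected.get(key, score))
    return {k: s for k, s in undirected.items() if s >= cutoff}

def biconnected_components(edges):
    """Return the biconnected components (sets of nodes) of an undirected graph."""
    adj = defaultdict(set)
    for edge in edges:
        a, b = tuple(edge)
        adj[a].add(b)
        adj[b].add(a)

    index, low = {}, {}
    edge_stack, components = [], []

    def dfs(u, parent):
        index[u] = low[u] = len(index)
        for v in adj[u]:
            if v == parent:
                continue
            if v not in index:
                edge_stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= index[u]:
                    # u separates the subtree under v: pop one component
                    comp = set()
                    while True:
                        e = edge_stack.pop()
                        comp.update(e)
                        if e == (u, v):
                            break
                    components.append(comp)
            elif index[v] < index[u]:
                # back edge to an ancestor
                edge_stack.append((u, v))
                low[u] = min(low[u], index[v])

    for node in list(adj):
        if node not in index:
            dfs(node, None)
    return components

# Hypothetical pairwise scores; the two directions may disagree.
scores = {
    ("A", "B"): 50, ("B", "A"): 70,
    ("B", "C"): 60, ("C", "B"): 60,
    ("A", "C"): 80,
    ("C", "D"): 40,
}
edges = symmetrize(scores)
clusters = biconnected_components(edges)
```

In this toy graph, A, B and C form a triangle, so they end up in one biconnected component, while D hangs off C by a single edge. A two-node component of this kind corresponds to a lone edge; whether such pairs are reported as clusters is a policy choice this sketch leaves open.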
Another edge construction method is used by the clustering package Proclust. The edges between nodes are normalized local alignment scores. The raw alignment score is computed using the standard technique. The self-similarity is defined as the raw similarity score of a gene with itself. The edge score from one gene to another is the mutual similarity score divided by the self-similarity score, and is then multiplied by 100 for normalization. This method of edge construction differs from that used by BAG: the graphs used in Proclust are directed, while those used in BAG are undirected.
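The normalization can be made concrete with a minimal sketch. It assumes that the "standard technique" is a Smith-Waterman style local alignment and that the denominator is the self-similarity of the gene the edge starts from; neither detail is stated here. The scoring parameters (`match`, `mismatch`, `gap`) and the example sequences are hypothetical stand-ins for a real substitution matrix and real genes, and none of this is Proclust's own code.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Raw local alignment score (Smith-Waterman with a linear gap penalty).

    The scoring values are illustrative assumptions, not Proclust's.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def normalized_edge_score(source, target):
    """Score of the directed edge source -> target, scaled to 0..100.

    Assumption: the raw score is divided by the self-similarity of `source`.
    """
    self_score = local_alignment_score(source, source)
    if self_score == 0:
        raise ValueError("source sequence has zero self-similarity")
    return 100.0 * local_alignment_score(source, target) / self_score

# Hypothetical sequences: the short one is contained in the long one.
short_seq = "ACGT"
long_seq = "ACGTACGT"

forward = normalized_edge_score(short_seq, long_seq)   # short -> long
backward = normalized_edge_score(long_seq, short_seq)  # long -> short
```

Because each direction is divided by a different self-similarity, the two scores generally differ: here the short sequence is fully covered by the long one, but not the other way round. This asymmetry is why the resulting graph is directed.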