BAG: A Graph Theoretic Sequence Clustering Algorithm

BAG is a graph theoretic sequence clustering algorithm that utilizes two graph properties, biconnected components and articulation points. With the current prototype, implemented in C++ with the LEDA package, we were able to compare many different sets of genomes. To use BAG, you need to get a license for LEDA. The license for academia is not expensive.
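
To illustrate the two graph properties named above, here is a minimal Python sketch that finds the biconnected components and articulation points of a small undirected graph with a standard depth-first search. It is not the BAG implementation (which is written in C++ with LEDA). The function name `biconnected_components` and the toy graph are hypothetical, and treating vertices as sequences and edges as pairwise similarities above some cutoff is an assumption made for illustration.

```python
from collections import defaultdict


def biconnected_components(edges):
    """Return (components, articulation_points) of an undirected graph.

    components is a list of vertex sets, one per biconnected component.
    articulation_points is the set of vertices whose removal disconnects
    the part of the graph they belong to.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    disc = {}        # DFS discovery time of each vertex
    low = {}         # lowest discovery time reachable from the DFS subtree
    edge_stack = []  # edges seen but not yet assigned to a component
    components = []
    articulation = set()
    counter = 0

    def dfs(u, parent):
        nonlocal counter
        disc[u] = low[u] = counter
        counter += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v not in disc:
                # Tree edge: descend, then check whether v's subtree
                # can reach above u without passing through u.
                children += 1
                edge_stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:
                    # The DFS root is a cut vertex only with 2+ children.
                    if parent is not None or children > 1:
                        articulation.add(u)
                    comp = set()
                    while True:
                        a, b = edge_stack.pop()
                        comp.update((a, b))
                        if (a, b) == (u, v):
                            break
                    components.append(comp)
            elif disc[v] < disc[u]:
                # Back edge to an ancestor.
                edge_stack.append((u, v))
                low[u] = min(low[u], disc[v])

    for start in list(adj):
        if start not in disc:
            dfs(start, None)
    return components, articulation


# Toy graph (hypothetical): two triangles that share the single vertex "C".
edges = [("A", "B"), ("B", "C"), ("A", "C"),
         ("C", "D"), ("D", "E"), ("C", "E")]
components, articulation = biconnected_components(edges)
print(sorted(sorted(c) for c in components))
print(sorted(articulation))
```

In this toy graph, "C" is the only vertex shared by the two triangles, so it is the only articulation point, and each triangle forms its own biconnected component. The sketch uses recursion, so it is suited to small examples rather than genome-scale graphs.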

There is also a LEDA free edition. We will try to modify BAG to utilize the LEDA free edition.
See also the LEDA book by Stefan Näher and Kurt Mehlhorn and the tutorial by Joachim Ziegler.

Try BAG on the web

BAG is available on the web, and you can try it without LEDA. The web interface is configured to handle comparisons of multiple bacterial genomes. We are still testing this service, so expect some changes and intermittent service interruptions for a while.
Try it: Clustering with FASTA or Clustering with BLAST

Supplementary Data

Supplementary data for Cluster Utility

For clustering result
Note that clusters with "SPLIT INTO" tags are NOT final ones. These are shown in the output for the cluster generation hierarchy.
  • COG analysis result
  • SCOP analysis result
    with a bitscore cutoff 50
    with a bitscore cutoff 100
    with a bitscore cutoff 150
  • Result from 10 COG pairs
    please find sequence file (.seq) and clustering result (.cluster result) for each pair.
  • E. coli and H. influenzae (Hinf) analysis result
    sequence files; clustering results
    Note: files with "NCBI" suffix can be linked to NCBI site (final clusters are shown).
  • Arabidopsis (Arab) analysis result
    sequence files; clustering results
    Note: files with "NCBI" suffix can be linked to NCBI site (final clusters are shown).
    a talk at the meeting with Dow scientists
    a talk at IU CS

    Below are the results from an old version. Updated information will be available soon. The BAG implementation is available upon request. Please send me an email. (For a new version, please send me a request after September 2003.)

  • A sample analysis of E. coli and H. influenzae

  • A sample analysis of Arabidopsis
    with the -o number option; the higher the number, the more specific the clusters.
    The number means that all members in a cluster should share number percent of the longest overlaps among the members.
    with the option -o 50

    with the option -o 20

    Applications of BAG

  • motif discovery
  • genome alignment
  • Protein secondary structure prediction (preliminary)
  • LTR mobile element detection (with Haixu Tang and Mike Lynch group)
  • a core component tool in building the PLATCOM system and its siblings, CGAS and ComPath.


  • BAG: A Graph Theoretic Sequence Clustering Algorithm
    International Journal of Data Mining and Bioinformatics
    Sun Kim and Jason Lee
    Vol. 1, No. 2, pp. 178-200, 2006
    Please cite this paper for BAG. Thanks!

  • Cluster Utility: A New Metric to Guide Sequence Clustering
    Supplementary data for Cluster Utility
    Sun Kim and Jang Lee

    An old version by Sun Kim and Arvind Gopu
    Technical Report 116, 2004, School of Informatics, Indiana University
  • GBAG: A Genome-scale Sequence Clustering Algorithm
    Sun Kim
  • Graph theoretic sequence clustering algorithms and their applications to genome comparison,
    Sun Kim
    Chapter 4 in Computational Biology and Genome Informatics
    edited by Cathy H. Wu, Paul Wang, and Jason T. L. Wang, World Scientific, 2003
    This chapter was written in 2001, so recent papers (including ones I missed at that time, e.g., ProtoMap, ProClust, proximal seq. space) will be listed on a separate web page.
  • BAG: A Graph Theoretic Sequence Clustering Algorithm
    an old manuscript (pdf, ps)

    Links to clustering papers

    I add papers as I find relevant ones, so this is a somewhat random collection.
    1. ProtoMap
    2. ProClust
    3. proximal seq. space
    4. On the maximal cliques in c-max-tolerance graphs and their application in clustering molecular sequences
    5. InParanoid
    6. MultiParanoid
    7. Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space
      Bioinformatics 2008 24(13):i41-i49; doi:10.1093/bioinformatics/btn174
    8. COCO-CL: hierarchical clustering of homology relations based on evolutionary correlations
      Raja Jothi, Elena Zotenko, Asba Tasneem, and Teresa M. Przytycka
      Bioinformatics, 1 April 2006; 22: 779-788.
    9. Efficient functional clustering of protein sequences using the Dirichlet process
      Duncan P. Brown
      Bioinformatics, 15 August 2008; 24: 1765-1771.

    Under construction.