BAG: A Graph Theoretic Sequence Clustering Algorithm
BAG is a graph theoretic sequence clustering algorithm
that utilizes two graph properties, biconnected components
and articulation points. With the current prototype
implemented in C++ with the LEDA package,
we were able to compare many different sets of genomes.
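As a rough illustration of the two graph properties named above, here is a minimal Python sketch of the textbook depth-first search that finds articulation points and biconnected components. It is not BAG's C++/LEDA code. The function name `biconnected_components` and the toy graph are hypothetical, and treating vertices as sequences and edges as significant pairwise similarities is an assumption about how a sequence graph would be built.

```python
from collections import defaultdict

def biconnected_components(edges):
    """Return (articulation_points, components) for a simple undirected graph.

    `edges` is an iterable of (u, v) pairs; each component is a set of vertices.
    """
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)

    disc = {}            # discovery time of each vertex
    low = {}             # lowest discovery time reachable from the vertex's subtree
    edge_stack = []      # edges of the components still being built
    articulation = set()
    components = []
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in graph[u]:
            if v == parent:
                continue
            if v not in disc:
                children += 1
                edge_stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:
                    # u separates v's subtree from the rest of the graph;
                    # the DFS root counts only if it has more than one child
                    if parent is not None or children > 1:
                        articulation.add(u)
                    component = set()
                    while True:
                        a, b = edge_stack.pop()
                        component.update((a, b))
                        if (a, b) == (u, v):
                            break
                    components.append(component)
            elif disc[v] < disc[u]:
                # back edge to an ancestor
                edge_stack.append((u, v))
                low[u] = min(low[u], disc[v])

    for start in list(graph):
        if start not in disc:
            dfs(start, None)
    return articulation, components

# Hypothetical toy graph: two triangles that share the vertex "c".
edges = [("a", "b"), ("b", "c"), ("c", "a"),
         ("c", "d"), ("d", "e"), ("e", "c")]
articulation, components = biconnected_components(edges)
```

In this toy graph, "c" is the only articulation point, and removing it separates the two triangles, each of which is one biconnected component. The recursive search is kept for readability; a genome-scale graph would need an iterative version or a graph library.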
To use BAG, you need to get a license for LEDA.
The academic license is not expensive.
There is also a LEDA free edition, and we will try to modify BAG to use it.
LEDA resources: the LEDA book by Stefan Näher and Kurt Mehlhorn,
and a tutorial by Joachim Ziegler.
Try BAG on the web
BAG is available on the web, so you can try it
without LEDA. The web interface is configured to handle comparisons
of multiple bacterial genomes.
We are currently testing this service, so expect some changes
and intermittent interruptions for a while.
Try me: Clustering with FASTA or Clustering with BLAST
Supplementary Data
Supplementary data for Cluster Utility
About the clustering results:
Note that clusters with "SPLIT INTO" tags are NOT final.
They appear in the output only to show the cluster generation hierarchy.
COG analysis result
here
SCOP analysis result
with a bitscore cutoff 50
with a bitscore cutoff 100
with a bitscore cutoff 150
Results from 10 COG pairs
A sequence file (.seq) and a clustering result (.cluster result) are provided for each pair.
E. coli and H. influenzae analysis result
sequence files; clustering results
Note: files with the "NCBI" suffix link to the NCBI site (final clusters are shown).
Arabidopsis analysis result
sequence files; clustering results
Note: files with the "NCBI" suffix link to the NCBI site (final clusters are shown).
a talk at the meeting with Dow scientists
a talk at IU CS
Below are the results from an old version. Updated information
will be available soon.
The BAG implementation is available upon request.
Please send an email to
sunkim@bio.informatics.indiana.edu (for a new version, please send your request after September 2003).
A sample analysis of E. coli and H. influenzae
A sample analysis of Arabidopsis
Run with the -o number option; the higher the number, the more specific the clusters.
The number means that all members of a cluster should share
number percent of the longest overlap among the members.
with the option -o 50
with the option -o 20
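The page does not spell out how overlaps are measured, so the following Python sketch shows only one possible reading of the -o description above: every pair of members must overlap by the given percentage of the longest pairwise overlap in the cluster. The function `satisfies_overlap_option`, the overlap table, and the lengths in it are all hypothetical; this is not BAG's actual check.

```python
from itertools import combinations

def satisfies_overlap_option(members, overlap, percent):
    """Check one reading of -o: every pair of members overlaps by at least
    `percent` percent of the longest pairwise overlap in the cluster.

    `overlap` maps frozenset({x, y}) to an overlap length (assumed input).
    """
    pairs = list(combinations(members, 2))
    if not pairs:
        return True  # a single-member cluster passes trivially
    lengths = [overlap.get(frozenset(pair), 0) for pair in pairs]
    threshold = max(lengths) * percent / 100.0
    return all(length >= threshold for length in lengths)

# Hypothetical overlap lengths for a three-member cluster.
overlaps = {
    frozenset(("a", "b")): 200,
    frozenset(("a", "c")): 120,
    frozenset(("b", "c")): 90,
}
loose = satisfies_overlap_option(["a", "b", "c"], overlaps, 20)
strict = satisfies_overlap_option(["a", "b", "c"], overlaps, 50)
```

Under this reading the cluster passes at 20 (every overlap is at least 40) but fails at 50 (the overlap of 90 is below 100), which matches the statement that a higher number is more specific.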
Applications of BAG
motif discovery
genome alignment
protein secondary structure prediction (preliminary)
LTR mobile element detection (with Haixu Tang and the Mike Lynch group)
a core component tool in building the PLATCOM system
and its siblings, CGAS and ComPath
References
BAG: A Graph Theoretic Sequence Clustering Algorithm
International Journal of Data Mining and Bioinformatics
Sun Kim and Jason Lee
Vol. 1, No. 2, pp. 178-200, 2006
Please cite this paper for BAG. Thanks!
Cluster Utility: A New Metric to Guide Sequence Clustering
Supplementary data for Cluster Utility
Sun Kim and Jang Lee
Submitted
An old version by
Sun Kim and Arvind Gopu
Technical Report 116, 2004, School of Informatics, Indiana University
GBAG: A Genome-scale Sequence Clustering Algorithm
Sun Kim
manuscript
Graph theoretic sequence clustering algorithms and
their applications to genome comparison,
Sun Kim
Chapter 4 in Computational Biology and Genome Informatics
edited by Cathy H. Wu, Paul Wang, and Jason T. L. Wang,
World Scientific, 2003
This chapter was written in 2001, so more recent papers (including ones I missed at that time,
e.g., ProtoMap, ProClust, and proximal seq. space)
will be listed on a separate web page.
BAG: A Graph Theoretic Sequence Clustering Algorithm
an old manuscript (pdf, ps)
Links to clustering papers
I add papers as I find relevant ones, so this is a somewhat random collection.
- ProtoMap
- ProClust
- proximal seq. space
- On the maximal cliques in c-max-tolerance graphs and their application in clustering molecular sequences
- InParanoid
- MultiParanoid
- Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space
  Bioinformatics 2008 24(13):i41-i49; doi:10.1093/bioinformatics/btn174
- COCO-CL: hierarchical clustering of homology relations based on evolutionary correlations
  Raja Jothi, Elena Zotenko, Asba Tasneem, and Teresa M. Przytycka
  Bioinformatics, 1 April 2006; 22: 779 - 788.
- Efficient functional clustering of protein sequences using the Dirichlet process
  Duncan P. Brown
  Bioinformatics, 15 August 2008; 24: 1765 - 1771.