Motivation: Biologists frequently align multiple sequences to determine consensus sequences or to look for predominant residues and conserved regions. Identifying conserved regions in an alignment is one of the most important of these tasks, yet in most cases it is still done manually by experts. Because protein sequences are often several hundred residues long or longer, it is difficult to distinguish biologically important conserved regions (motifs or domains) from the rest of the alignment. The widely used tools, sequence logos and information content, often fail to highlight such regions. A computational tool that highlights biologically important regions with higher accuracy would therefore be valuable.
Results: This paper presents ARCS (Aggregated Related Column Score), a new scoring scheme for aligned biological sequences, together with an ARCS-based algorithm that highlights the conserved regions among aligned sequences. In an extensive experimental evaluation using 533 PROSITE patterns, ARCS highlighted the motif regions with up to 82% accuracy.
Download: ARCS v1.1
Command line: arcs.pl -matlab -i PS00702.aln -o PS00702 -p 'C-P-[LP]-T-x-E-[ST]-x-C' -n PS00702 -s 3 -t 4 -v
-matlab or -octave : the program to use, either MATLAB or Octave
-i PS00702.aln : multiple alignment file in Clustal-W format
-o PS00702 : output name; generates the MATLAB code PS00702.m, the ARCS output PS00702.arcs, and the figure PS00702.eps
-p 'C-P-[LP]-T-x-E-[ST]-x-C' : (optional) highlight in the figure the positions where the pattern occurs
-n PS00702 : (optional) title to show on the figure
-s 3 : (optional) the smoothing window size [default: 3]
-t 4 : (optional) the neighborhood size [default: 4]
-v : (optional) print detailed information
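The -p option takes a pattern written in PROSITE syntax. To illustrate what such a pattern denotes, the following is a minimal Python sketch that translates a PROSITE pattern into a regular expression and reports where it occurs in a single ungapped sequence. It is not part of ARCS and does not claim to reproduce how arcs.pl handles -p; the function names prosite_to_regex and pattern_positions and the example sequence are hypothetical. The sketch covers only the common elements (single residues, x, [..], {..}, repeats such as x(2) or x(2,4), and the < and > anchors outside brackets).

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression (illustrative only)."""
    pieces = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")   # '<' anchors the pattern at the N-terminus
        at_end = element.endswith(">")       # '>' anchors the pattern at the C-terminus
        element = element.lstrip("<").rstrip(">")
        # Split an element such as "x(2,4)" into its core "x" and its repeat "2,4".
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError("unsupported pattern element: %r" % element)
        core, repeat = m.groups()
        if core == "x":                      # x matches any residue
            core = "."
        elif core.startswith("{"):           # {AB} means any residue except A or B
            core = "[^" + core[1:-1] + "]"
        # [AB] classes and single residues are already valid regex syntax.
        piece = core + ("{" + repeat + "}" if repeat else "")
        pieces.append(("^" if at_start else "") + piece + ("$" if at_end else ""))
    return "".join(pieces)

def pattern_positions(sequence, pattern):
    """Return 1-based, inclusive (start, end) positions of every match, overlaps included."""
    # A lookahead with a capture group lets finditer report overlapping matches.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.start() + len(m.group(1)))
            for m in regex.finditer(sequence)]

if __name__ == "__main__":
    pattern = "C-P-[LP]-T-x-E-[ST]-x-C"
    print(prosite_to_regex(pattern))                   # CP[LP]T.E[ST].C
    print(pattern_positions("MKCPLTAESGC", pattern))   # [(3, 11)]
```

Sequences taken from an alignment file contain gap characters, so under this sketch they would need to have their gaps removed before matching, and the resulting positions mapped back to alignment columns.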
Experiments with 533 PROSITE patterns
ARCS results for 47 PROSITE patterns that are not aligned correctly by Clustal-W