EMBOSS: matcher

matcher

Function

Description

matcher identifies local similarities between two input sequences using a rigorous algorithm based on Bill Pearson's lalign application, version 2.0u4 (Feb. 1996). The substitution matrix and the gap insertion and extension penalties are specified by the user. The requested number of top-scoring pairwise local sequence alignments is written to the output file.

Algorithm

matcher is based on Bill Pearson's lalign application, version 2.0u4 (Feb. 1996). lalign uses code developed by X. Huang and W. Miller (Adv. Appl. Math. (1991) 12:337-357) for the "sim" program, which is a linear-space version of an algorithm described by M. S. Waterman and M. Eggert (J. Mol. Biol. (1987) 197:723-728).
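To make the "linear-space" idea concrete, the following is a minimal sketch and not the sim algorithm itself. It computes only the best local alignment score while keeping two rows of the dynamic-programming table in memory. The function name best_local_score and the match, mismatch and gap values are assumptions chosen for illustration. A single linear gap penalty stands in for the separate insertion and extension penalties that matcher uses, and the sketch does not recover the alignment or any alternative alignments.

```python
def best_local_score(a, b, match=5, mismatch=-4, gap=-8):
    """Return the best local alignment score of strings a and b.

    Illustrative only: uses a simple match/mismatch score and a linear
    gap penalty, and keeps just two rows of the DP table at a time.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row
    best = 0
    for ca in a:
        curr = [0]  # column 0 of a local alignment table is always 0
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            score = max(0, diag, up, left)  # 0 restarts the alignment
            curr.append(score)
            if score > best:
                best = score
        prev = curr  # discard the older row
    return best
```

Memory use here grows with the length of one sequence rather than with the product of both lengths, which is the property the text attributes to the linear-space approach.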

Usage

Command line arguments

Input file format

matcher reads any two sequence USAs (Uniform Sequence Addresses) of the same type (DNA or protein).

Output file format

Data files

For protein sequences, EBLOSUM62 is used as the substitution matrix. For nucleotide sequences, EDNAFULL is used.

EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.

Users can provide their own data files in their own directories. Project-specific files can be put in the current directory or, for tidier directory listings, in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory or, again, in a subdirectory called ".embossdata".

The directories are searched in the following order:

. (your current directory)
.embossdata (under your current directory)
~/ (your home directory)
~/.embossdata
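The search order above can be sketched as a small lookup routine. This is an illustration of the listed order only, not EMBOSS's own code: the function name find_data_file and its cwd and home parameters are hypothetical, and the fallback to the standard data directory named by EMBOSS_DATA is deliberately not modelled.

```python
import os


def find_data_file(name, cwd=None, home=None):
    """Return the first path where `name` exists, or None.

    Directories are tried in the order listed in the text:
    current directory, ./.embossdata, home directory, ~/.embossdata.
    """
    cwd = cwd or os.getcwd()
    home = home or os.path.expanduser("~")
    candidates = [
        cwd,
        os.path.join(cwd, ".embossdata"),
        home,
        os.path.join(home, ".embossdata"),
    ]
    for directory in candidates:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            return path  # first match wins
    return None  # not found in any user directory
```

Calling find_data_file("EBLOSUM62") from a project directory would return a project-specific copy in preference to one kept under the home directory.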

Notes

matcher is rigorous but also very slow. The advantage of matcher over water (which is also rigorous) is that it uses far less memory, so you are much less likely to run out of memory when aligning large sequences.

matcher reports a specified number of alignments between the two sequences. water, in contrast, reports only the single (optimal) match. The default number of alignments output is 1, but this can be increased to (for example) the 10 best alignments by using the -alternatives 10 command-line qualifier. In some cases, for example multidomain proteins or comparisons of cDNA with genomic DNA, there may be many interesting and significant alignments.
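As an illustration of requesting several alignments, the sketch below assembles a matcher command line from Python. Only the -alternatives qualifier is taken from the text above. The -asequence, -bsequence and -outfile qualifier names and all file names are assumptions that should be checked against the command line arguments of your EMBOSS installation, and the helper matcher_command is hypothetical. The command is only built and printed, not executed.

```python
import shlex


def matcher_command(aseq, bseq, outfile, alternatives=1):
    """Build an argument list for a matcher run (hypothetical helper)."""
    if alternatives < 1:
        raise ValueError("alternatives must be at least 1")
    return [
        "matcher",
        "-asequence", aseq,        # assumed qualifier name
        "-bsequence", bseq,        # assumed qualifier name
        "-alternatives", str(alternatives),
        "-outfile", outfile,       # assumed qualifier name
    ]


# Hypothetical file names for a cDNA against genomic DNA comparison.
cmd = matcher_command("cdna.fasta", "genomic.fasta",
                      "cdna_vs_genomic.matcher", alternatives=10)
print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run once the qualifier names have been confirmed.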

matcher is not guaranteed to produce an optimal alignment in the way that water is. water implements the Smith-Waterman algorithm: it generates the single, optimal local alignment and uses memory on the order of the product of the lengths of the sequences to be aligned. If you require an optimal alignment you should use water. If you run out of memory or want several possible good alignments, use matcher.
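The memory trade-off can be illustrated with rough arithmetic. The sketch below compares a table that grows with the product of the sequence lengths against storage that grows with their sum. The function name estimate_memory_bytes and the figure of 4 bytes per cell are assumptions made for illustration, not measurements of water or matcher.

```python
def estimate_memory_bytes(len_a, len_b, bytes_per_cell=4):
    """Return (quadratic, linear) memory estimates in bytes.

    quadratic: a full table of len_a * len_b cells.
    linear: storage proportional to len_a + len_b.
    bytes_per_cell is an assumed value for illustration only.
    """
    quadratic = len_a * len_b * bytes_per_cell
    linear = (len_a + len_b) * bytes_per_cell
    return quadratic, linear


# Two sequences of 100,000 residues each.
quadratic, linear = estimate_memory_bytes(100_000, 100_000)
print(f"full table: {quadratic / 1e9:.0f} GB, linear space: {linear / 1e3:.0f} kB")
```

Under these assumptions the full table needs about 40 GB while the linear-space storage needs about 800 kB, which is why large sequences are more practical to align with a linear-space method.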

References

X. Huang and W. Miller (1991) Adv. Appl. Math. 12:337-357
M. S. Waterman and M. Eggert (1987) J. Mol. Biol. 197:723-728

Warnings

None.

Diagnostic Error Messages

None.

Exit status

0 upon successful completion.

Known bugs

None.

water will give a single best rigorous local alignment. It will use memory of the order of the product of the lengths of the sequences to be aligned. If you wish the 'best' local alignment you should use water. If you run out of memory or want several possible good alignments, use matcher.

Author(s)

This program was originally written by Bill Pearson as part of the FASTA package under the name 'lalign'.

This application was modified for inclusion in EMBOSS by
