supermatcher calculates approximate alignments between all the sequences in a first set and all those in a second set, typically a database. The alignments are written to a standard alignment file. A combination of a word-match algorithm and the Smith-Waterman local alignment (dynamic programming) algorithm is used. The alignments will be less accurate than optimal alignments generated by full dynamic programming, but the program runs faster and uses less memory, which makes it suitable for use with larger sequences.
The output alignment is in simple format by default.
The file 'supermatcher.error' will contain any errors that occurred while the program ran. One possible cause is that the word-match step could not find any matches, so no suitable start point was found for the Smith-Waterman calculation.
supermatcher generates approximate local alignments for large sequences. The alignments are approximate because, as a first step, all the word matches between the two sequences are found. By identifying the highest-scoring, non-overlapping matches, a set of approximate local alignments is calculated for the two sequences. These give the centre points for more accurate Smith-Waterman type alignments in a region whose width is specified by the user. Restricting Smith-Waterman to narrow regions means the overall alignment will be rough, but the memory saved allows much larger sequences to be aligned.
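The word-match step described above can be illustrated with a minimal Python sketch. It is not the EMBOSS implementation: the function names `word_hits` and `best_diagonal` are hypothetical, and the sketch simply counts exact word hits per diagonal instead of scoring non-overlapping matches as supermatcher does.

```python
from collections import defaultdict

def word_hits(a, b, wordlen):
    """Return (i, j) start positions where a[i:i+wordlen] == b[j:j+wordlen]."""
    # Index every word of b by its start position.
    index = defaultdict(list)
    for j in range(len(b) - wordlen + 1):
        index[b[j:j + wordlen]].append(j)
    # Look up every word of a in that index.
    hits = []
    for i in range(len(a) - wordlen + 1):
        for j in index.get(a[i:i + wordlen], ()):
            hits.append((i, j))
    return hits

def best_diagonal(hits):
    """Return the diagonal (i - j) carrying the most word hits, or None."""
    if not hits:
        return None  # no start point for the later Smith-Waterman step
    counts = defaultdict(int)
    for i, j in hits:
        counts[i - j] += 1
    return max(counts, key=counts.get)

# Toy data (assumed for illustration): b is embedded in a at offset 4.
a = "TTTTACGTACGGATCCAAAA"
b = "ACGTACGGATCC"
hits = word_hits(a, b, 6)
diag = best_diagonal(hits)
print(len(hits), diag)
```

The diagonal returned here plays the role of the centre line around which a narrow Smith-Waterman calculation would then be run. An empty hit list corresponds to the case where no start point can be found.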
For the Smith-Waterman alignment, the gap open and extension penalties and the substitution matrix may be specified; by default the latter is EBLOSUM62 for protein sequences and EDNAMAT for nucleotide sequences.
For the word-match alignment, the word length may be specified. The time required for alignment depends very much on word size. With a small word size (e.g. 4), alignment may take a very long time even for short sequences. Much larger word sizes (e.g. 30) will give a very quick result. The default of 16 should give reasonably fast alignments.
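One reason word size affects run time is that shorter words match in many more places, so there are more candidate matches to process. The following Python sketch counts exact word hits for several word sizes on two made-up sequences. The function name `count_word_hits` and the sequences are assumptions for illustration, and the sketch does not measure the actual run time of supermatcher.

```python
def count_word_hits(a, b, wordlen):
    """Count pairs (i, j) with a[i:i+wordlen] == b[j:j+wordlen]."""
    # Tally how often each word occurs in b.
    words_b = {}
    for j in range(len(b) - wordlen + 1):
        word = b[j:j + wordlen]
        words_b[word] = words_b.get(word, 0) + 1
    # Each word of a contributes one hit per occurrence in b.
    return sum(words_b.get(a[i:i + wordlen], 0)
               for i in range(len(a) - wordlen + 1))

# Toy sequences (assumed) sharing one 23-base region.
a = "ACGTACGTTAGCTAGCTAGGATCCGATCGATCGTACGATCG"
b = "TTAGCTAGCTAGGATCCGATCGAACGTACGTAGCTAGCAT"

counts = {w: count_word_hits(a, b, w) for w in (2, 4, 8, 16)}
for w, n in counts.items():
    print(w, n)
```

The hit count can only stay the same or fall as the word size grows, because every match of a longer word also contains a match of each shorter word at the same position.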
supermatcher performs a Smith-Waterman alignment (albeit only in narrow best-matching regions identified by simple word-matching) and can therefore still use large amounts of memory if the sequences are large. The longer the sequences and the wider the specified alignment width, the more memory will be used. If the program terminates due to lack of memory, you can run the UNIX command limit to see whether your stack or memory usage has been limited and, if so, run unlimit (e.g. % unlimit stacksize).
Because the alignment is made within a narrow area on each side of the 'best' diagonal identified by word-matching, the path of the Smith-Waterman alignment can wander outside this area if there are enough indels between the two sequences. Making the width larger can avoid this problem, but uses more memory.
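The trade-off between width, indels and memory can be illustrated with a small banded Smith-Waterman sketch in Python. This is a simplified illustration, not the EMBOSS code: the function name `banded_sw_score` is hypothetical, a single linear gap penalty replaces the separate gap open and extension penalties, and plain match/mismatch scores replace a substitution matrix such as EBLOSUM62.

```python
def banded_sw_score(a, b, diag, width, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with |(i - j) - diag| <= width.

    Returns (best score, number of cells computed). Cells outside the band
    are treated as 0, so an alignment cannot pass through them.
    """
    prev = {}   # scores of the previous row, keyed by column j
    best = 0
    cells = 0
    for i in range(1, len(a) + 1):
        cur = {}
        lo = max(1, i - diag - width)
        hi = min(len(b), i - diag + width)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(0,
                        prev.get(j - 1, 0) + s,    # align a[i-1] with b[j-1]
                        prev.get(j, 0) + gap,      # gap in b
                        cur.get(j - 1, 0) + gap)   # gap in a
            cur[j] = score
            best = max(best, score)
            cells += 1
        prev = cur
    return best, cells

# Toy data (assumed): a is b with a 3-base insertion in the middle.
x, y = "ACGTACGTAC", "TTGACCATGA"
a = x + "GGG" + y
b = x + y

narrow_score, narrow_cells = banded_sw_score(a, b, diag=0, width=1)
wide_score, wide_cells = banded_sw_score(a, b, diag=0, width=3)
print(narrow_score, narrow_cells)
print(wide_score, wide_cells)
```

With the narrow band the alignment cannot follow the shift caused by the insertion, so it stops scoring after the first block. The wider band can cross the 3-base indel and align both blocks, at the cost of computing more cells per row, which is the source of the extra memory use.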