EMBOSS: needle

needle

Function

Description

needle reads two input sequences and writes their optimal global sequence alignment to file. It uses the Needleman-Wunsch alignment algorithm to find the optimum alignment (including gaps) of two sequences along their entire length. The algorithm uses a dynamic programming method to ensure the alignment is optimum, by exploring all possible alignments and choosing the best. A scoring matrix is read that contains values for every possible residue or nucleotide match. Needle finds the alignment with the maximum possible score where the score of an alignment is equal to the sum of the matches taken from the scoring matrix, minus a penalties arising from opening and extending gaps in the aligned sequences. The substitution matrix and gap opening and extension penalties are user-specified.

Algorithm

The Needleman-Wunsch algorithm is a member of the class of algorithms that can calculate the best score and alignment of two sequences in the order of mn steps, where n and m are the sequence lengths. These dynamic programming algorithms were first developed for protein sequence comparison by Needleman and Wunsch, though similar methods were independently devised during the late 1960's and early 1970's for use in the fields of speech processing and computer science.

An important problem is the treatment of gaps, i.e., spaces inserted to optimise the alignment score. A penalty is subtracted from the score for each gap opened (the 'gap open' penalty) and a penalty is subtracted from the score for the total number of gap spaces multiplied by a cost (the 'gap extension' penalty). Typically, the cost of extending a gap is set to be 5-10 times lower than the cost for opening a gap.

There are two ways to compute a penalty for a gap of n positions :

gap opening penalty + (n - 1) * gap extension penalty
gap penalty + n * gap length penalty

The two methods are basically equivalent. The first way is used by EMBOSS and WU-BLAST. The second way is used by NCBI-BLAST, GCG, Staden and CLUSTAL. Fasta for a long time used the first way, but recently shifted to the second.

In a Needleman-Wunsch global alignment, the entire length of each sequence is aligned. The sequences might be partially overlapping or one sequence might be aligned entirely internally to the other. There is no penalty for the hanging ends of the overlap. In bioinformatics, it is usually reasonable to assume that the sequences are incomplete and there should be no penalty for failing to align the missing bases.

Usage

Command line arguments

Input file format

needle reads any 2 sequence USAs of the same type (DNA or protein).

Output file format

The Identity: is the percentage of identical matches between the two sequences over the reported aligned region (including any gaps in the length).

The Similarity: is the percentage of matches between the two sequences over the reported aligned region (including any gaps in the length).

Data files

For protein sequences EBLOSUM62 is used for the substitution matrix. For nucleotide sequence, EDNAFULL is used. Others can be specified.

Notes

needle is a true implementation of the Needleman-Wunsch algorithm and so produces a full path matrix. It therefore cannot be used with genome sized sequences unless you've a lot of memory and a lot of time.

References

Needleman, S. B. and Wunsch, C. D. (1970) J. Mol. Biol. 48, 443-453.
Kruskal, J. B. (1983) An overview of squence comparison In D. Sankoff and J. B. Kruskal, (ed.), Time warps, string edits and macromolecules: the theory and practice of sequence comparison, pp. 1-44 Addison Wesley.

Warnings

needle is for aligning two sequences over their entire length. This works best with closely related sequences. If you use needle to align very distantly-related sequences, it will produce a result but much of the alignment may have little or no biological significance.

A true Needleman Wunsch implementation like needle needs memory proportional to the product of the sequence lengths. For two sequences of length 10,000,000 and 1,000 it therefore needs memory proportional to 10,000,000,000 characters. Two arrays of this size are produced, one of ints and one of floats so multiply that figure by 8 to get the memory usage in bytes. That doesn't include other overheads. Therefore only use water and needle for accurate alignment of reasonably short sequences.

If you run out of memory, try using stretcher instead.

Diagnostic Error Messages

Uncaught exception
 Assertion failed
 raised at ajmem.c:xxx

Probably means you have run out of memory. Try using stretcher if this happens.

Exit status

0 upon successful completion.

Known bugs

None.

When you want an alignment that covers the whole length of both sequences, use needle.

When you are trying to find the best region of similarity between two sequences, use water.

stretcher is a more suitable program to use to find global alignments of very long sequences.