merger

Function

Description

merger reads two overlapping input sequences of the same type (typically nucleotide) and uses a global alignment algorithm (Needleman & Wunsch) to optimally align the sequences. A merged sequence is generated from the alignment and writen to the output file. The sequence alignment is also written.

Algorithm

It uses a global alignment algorithm (Needleman & Wunsch) to optimally align the sequences and then creates a merged sequence from the alignment. When there is a mismatch in the alignment between the two sequences, the base included in the merged sequence is the base from the sequence which has the best local sequence quality score.

The following heuristic is used to find the sequence quality score. If one of the bases is a 'N', then the other sequence's base is used. Otherwise, a window size around the disputed base is used to find the local quality score. This window size is increased from 5, to 10 to 20 bases or until there is a clear decision on the best choice. If there is no best choice after using a window of 20, then the base in the first sequence is used.

To calculate the quality of a window of a sequence around a base:

    * quality = sequence value/length under window either side of the base
    * sequence value = sum of points in that window
    * unambiguous bases (ACGTU) score 2 points
    * ambiguous bases (MRWSYKVHDB) score 1 point
    * Ns score 0 points
    * off end of the sequence scores 0 points 

This heavily discriminates against the iffy bits at the end of sequence reads.

Usage

Typically, one of the sequences will need to be reverse-complemented to put it into the correct orientation to make it join. For example:

% merger file1.seq file2.seq -sreverse2 -outseq merged.seq

Command line arguments


Input file format

merger reads any two sequence USAs of the same type (protein or nucleic acid.)

Output file format

The output sequence file contains the joined sequence, by default in FASTA format. Where there is a mismatch in the alignment, the chosen base is written to the output sequence in uppercase.

The output report file contains descriptions of the positions where there is a mismatch in the alignment and shows the alignment. Where there is a mismatch in the alignment, the chosen base is written in uppercase.

Data files

It reads the scoring matrix for the alignment from the standard EMBOSS 'data' directory. By default it is the file 'EBLOSUM62' (for proteins) or the file 'EDNAFULL' (for nucleic sequences).

Notes

This program was originally written to aid in the reconstruction of mRNA sequences which had been sequenced from both ends as a 5' and 3' EST (cDNA). eg. joining two reads produced by primer walking sequencing.

The gap open and gap extension penalties have been set at a higher level than is usual (50 and 5). This was experimentally determined to give the best results with a set of poor quality EST test sequences.

References

None.

Warnings

Care should be taken to reverse one of the sequences (e.g. using the qualifier -sreverse2) if this is required to get them both in the correct orientation..

merger uses the memory-hungry Needleman & Wunsch alignment. The required memory may be greater than the available memory when attempting to merge large (cosmid-sized or greater) sequences.

Diagnostic Error Messages

None.

Exit status

It exits with a status of 0

Known bugs

None.

Author(s)

History

Target users

Comments