tranalign

Function

Description

tranalign is a re-implementation in EMBOSS of the program mrtrans by Bill Pearson. It reads a set of (unaligned) nucleotide sequences and a corresponding set of aligned protein sequences that are their translations, and writes the coding regions to file as a nucleotide sequence alignment. The sequences must be in the same order in both input sets. Each nucleotide sequence is translated in all three forward frames using the specified genetic code, and the translations are compared to the corresponding protein sequence from the input alignment. The contiguous nucleotide sequence that codes for the protein is written to file (tranalign will not splice together different exons to produce a coding sequence).
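The three-frame search described above can be illustrated with a minimal Python sketch. This is not the EMBOSS source: the function names translate and find_coding_region are hypothetical, the standard genetic code is hard-coded (tranalign lets you specify the code), and exact substring matching is an assumption about how the comparison might be done.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(nuc, frame):
    """Translate nuc in forward frame 0, 1 or 2; unknown codons become 'X'."""
    nuc = nuc.upper().replace("U", "T")
    codons = (nuc[i:i + 3] for i in range(frame, len(nuc) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def find_coding_region(nuc, aligned_protein):
    """Return the contiguous stretch of nuc that codes for the protein, or None.

    Gap characters in the aligned protein are ignored for the comparison.
    Hypothetical helper: exact matching is an assumption of this sketch.
    """
    target = aligned_protein.replace("-", "").upper()
    for frame in range(3):
        pos = translate(nuc, frame).find(target)
        if pos >= 0:
            start = frame + 3 * pos
            return nuc[start:start + 3 * len(target)]
    return None


region = find_coding_region("ccATGGCTAAAtaa", "M-AK")
```

In this example the protein is found in the third forward frame, so the two leading bases and the trailing stop codon are left out of the returned region. A None result corresponds to the case in which the guide protein is not found in the nucleic sequence.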

Algorithm

The protein sequences will typically include gap (-) characters. These are ignored during sequence comparison but replaced by --- in the nucleotide sequence alignment output.
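As a rough illustration of the gap handling, the following Python sketch threads a coding sequence through an aligned protein, emitting --- for each gap and one codon for each residue. The function name thread_gaps is hypothetical, and the length check is an assumption added for the sketch rather than documented tranalign behaviour.

```python
def thread_gaps(coding_nuc, aligned_protein):
    """Insert '---' into coding_nuc wherever aligned_protein has a '-'.

    coding_nuc must hold exactly one codon per non-gap residue.
    """
    residues = len(aligned_protein) - aligned_protein.count("-")
    if len(coding_nuc) != 3 * residues:
        raise ValueError("coding sequence length does not match the protein")

    codons = (coding_nuc[i:i + 3] for i in range(0, len(coding_nuc), 3))
    out = []
    for residue in aligned_protein:
        # A gap column becomes three gap characters; a residue consumes one codon.
        out.append("---" if residue == "-" else next(codons))
    return "".join(out)


aligned = thread_gaps("ATGGCTAAA", "M-AK")
```

Here the aligned protein M-AK has one gap after the first residue, so the nucleotide output carries a three-character gap after the first codon.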

Usage

Command line arguments


Input file format

The input is a set of unaligned nucleic sequences and the set of aligned protein sequences to be used as a guide in the alignment of the output nucleic sequences.

The ID names of the nucleic acid and protein sequences are NOT checked to see if they correspond to each other. They can have any names.

There must be at least as many protein sequences as nucleic acid sequences; extra protein sequences are ignored.

Each of the nucleic acid sequences must have a corresponding protein sequence which is derived from the coding region of that nucleic acid sequence. The two sets of sequences must be in the same order.

Output file format

The output is the regions of the nucleic acid sequences which code for the corresponding protein sequence, with gap characters ('-') introduced so that they have the same alignment as the corresponding protein sequences.

Data files

None.

Notes

In general, it is better to use protein sequences for multiple alignment, but to use DNA sequences for phylogeny, for example when using the programs dnadist, dnapars, dnaml, etc. in the PHYLIP package. Where one has a protein sequence alignment, it would be time-consuming to remove gap characters before back-translating the proteins. tranalign helps by generating aligned cDNA sequences from a protein sequence alignment.

tranalign finds the coding regions for contiguous sequences only. It will not splice together different exons to produce a coding sequence. You should therefore use either mRNA sequences, or nucleic sequences which you have constructed to hold a contiguous coding region (for example by using extractseq, or yank and union).

References

None.

Warnings

The sequences must be in the same order in both input sets of sequences. Some alignment programs (including clustalw/emma) will re-order their input sequences so as to group similar sequences together.
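If an alignment program has re-ordered the proteins, one way to restore the order before running tranalign is sketched below in Python. It assumes that you have given each protein the same ID as its nucleic acid sequence; tranalign itself does not check IDs, so this is a convention of the sketch only. The function name reorder_to_match and the (id, sequence) tuple representation are hypothetical.

```python
def reorder_to_match(nuc_ids, aligned_proteins):
    """Return the aligned proteins in the order given by nuc_ids.

    aligned_proteins is a list of (id, aligned_sequence) tuples in whatever
    order the alignment program wrote them. Assumes IDs are unique and shared
    between the two input sets.
    """
    by_id = dict(aligned_proteins)
    missing = [i for i in nuc_ids if i not in by_id]
    if missing:
        raise KeyError("no aligned protein for: " + ", ".join(missing))
    return [(i, by_id[i]) for i in nuc_ids]


proteins = [("seqC", "M-K"), ("seqA", "MAK"), ("seqB", "MA-")]
ordered = reorder_to_match(["seqA", "seqB", "seqC"], proteins)
```

The re-ordered list can then be written back out in your preferred alignment format and passed to tranalign together with the nucleic acid sequences.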

Diagnostic Error Messages

"No guide protein sequence available for nucleic sequence xxx" - the corresponding protein sequence for this nucleic sequence has not been input. You have input more nucleic acid sequences than protein sequences.

"Guide protein sequence xxx not found in nucleic sequence xxx" - the region of the nucleic sequence which codes for the protein was not found. The coding region in the nucleic acid sequence must be a single contiguous sequence. The protein sequence might not be the corresponding one for this nucleic acid sequence if they are out of order.

Exit status

It always exits with status 0.

Known bugs

None.

Author(s)

The original program mrtrans was written by Bill Pearson (wrp@virginia.edu).

tranalign was written in EMBOSS code using the description of mrtrans as a guide by

History

Target users

Comments