tranalign
tranalign is a re-implementation in EMBOSS of the program mrtrans by Bill Pearson. It reads a set of (unaligned) nucleotide sequences and a corresponding set of aligned protein sequences which are their translations, and writes the coding regions to file as a nucleotide sequence alignment. The sequences must be in the same order in the two input sets. Each nucleotide sequence is translated in all three forward frames using the specified genetic code, and the translations are compared to the corresponding protein sequence from the input alignment. The contiguous nucleotide sequence that codes for the protein is written to file (it will not splice together different exons to produce a coding sequence).
The protein sequences will typically include gap (-) characters. These are ignored during sequence comparison but replaced by --- in the nucleotide sequence alignment output.
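The procedure described above can be illustrated with a minimal Python sketch. This is not the EMBOSS implementation: the function names `translate` and `back_align` are hypothetical, only the standard genetic code is built in (tranalign selects the code with -table), and ambiguity codes are simply translated as X.

```python
# Standard genetic code, built from the usual T, C, A, G codon ordering.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nuc, frame):
    """Translate one forward frame (0, 1 or 2); unknown codons become X."""
    nuc = nuc.lower()
    return "".join(
        CODON_TABLE.get(nuc[i:i + 3], "X")
        for i in range(frame, len(nuc) - 2, 3)
    )


def back_align(nuc, aligned_protein):
    """Return the coding region of nuc, gapped like aligned_protein.

    Gap characters are ignored when searching for the protein in the
    three forward translations, then written out as '---'.
    Returns None when no frame contains the ungapped protein.
    """
    ungapped = aligned_protein.replace("-", "")
    for frame in range(3):
        pos = translate(nuc, frame).find(ungapped)
        if pos < 0:
            continue
        next_codon = frame + 3 * pos  # start of the contiguous coding region
        out = []
        for residue in aligned_protein:
            if residue == "-":
                out.append("---")
            else:
                out.append(nuc[next_codon:next_codon + 3])
                next_codon += 3
        return "".join(out)
    return None


print(back_align("ccatggcaaaatga", "MA-K"))
```

In this toy input the protein is found in the third forward frame, so the two leading bases and the trailing stop codon are dropped from the output.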
% tranalign ../data/tranalign.pep tranalign2.seq
Generate an alignment of nucleic coding regions from aligned proteins
Standard (Mandatory) qualifiers:
  [-asequence] seqall Nucleotide sequence(s) filename and optional format, or reference (input USA)
  [-bsequence] seqset (Aligned) protein sequence set filename and optional format, or reference (input USA)
  [-outseq] seqoutset [
Qualifier | Description | Allowed values | Default |
---|---|---|---|
Standard (Mandatory) qualifiers | | | |
[-asequence] (Parameter 1) | Nucleotide sequence(s) filename and optional format, or reference (input USA) | Readable sequence(s) | Required |
[-bsequence] (Parameter 2) | (Aligned) protein sequence set filename and optional format, or reference (input USA) | Readable set of sequences | Required |
[-outseq] (Parameter 3) | (Aligned) nucleotide sequence set filename and optional format (output USA) | Writeable sequences | <*>.format |
Additional (Optional) qualifiers | | | |
-table | Code to use | | 0 |
Advanced (Unprompted) qualifiers | | | |
(none) | | | |
The ID names of the nucleic acid and protein sequences are NOT checked to see if they correspond to each other. They can have any names.
There must be at least as many protein sequences as nucleic acid sequences; extra protein sequences are ignored.
Each of the nucleic acid sequences must have a corresponding protein sequence which is derived from the coding region of that nucleic acid sequence. The two sets of sequences must be in the same order.
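The pairing rules above (order only, names not checked, surplus proteins ignored) can be sketched as follows. This is an illustrative sketch rather than tranalign's own code; the function name `pair_in_order` and the (name, sequence) tuple layout are assumptions made for the example.

```python
def pair_in_order(nuc_records, protein_records):
    """Pair sequences purely by position; names are never compared.

    Each record is a (name, sequence) tuple. Surplus protein records
    are ignored; too few protein records is an error.
    """
    if len(protein_records) < len(nuc_records):
        raise ValueError(
            "need at least as many protein sequences as nucleic acid sequences"
        )
    # zip stops at the shorter input, which drops any extra proteins.
    return list(zip(nuc_records, protein_records))


pairs = pair_in_order(
    [("HSFAU1", "atggca"), ("HSFAU2", "atggcc")],
    [("any_name", "MA"), ("other_name", "MA"), ("extra", "MK")],
)
for (nuc_name, _), (prot_name, _) in pairs:
    print(nuc_name, "<->", prot_name)
```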
>HSFAU1 ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc tcctggcaggccccctggaggatgaggccactctgggccagtgcggggtggaggccc tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc tctaataaaaaagccacttagttcagtcaaaaaaaaaa >HSFAU2 ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc tcctggcaggcgcgcccctggaggatgcactctgggccagtgcggggtggaggccc tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc tctaataaaaaagccacttagttcagtcaaaaaaaaaa >HSFAU3 ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc gtgctggaaaagtgagaggtcagactcctaagggggccaaacaggagaagaagaagaaga agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc tctaataaaaaagccacttagttcagtcaaaaaaaaaa >HSFAU4 ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg cccagatcaaggctcatgaaatagcctcactggagggcattgccccggaagatcaagtcgtgc tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc tgactaccctggaagtagcaggccgcatgcttgcccgaggtaaagttcatggttccctggccc gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga 
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc tctaataaaaaagccacttagttcagtcaaaaaaaaaa >HSFAU5 ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc tgactaccctggaagtaggccgcatgctttttggaggtaaagttcatggttccctggccc gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc tctaataaaaaagccacttagttcagtcaaaaaaaaaa |
>HSFAU1_3 PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV LLAG-PLEDEATLGQCGVEALTTLEVAGRMLG-GKVHGSLARAGKVRGQTPKVAKQEKKK KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS >HSFAU2_3 PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV LLAGAPLEDALWASAGWRP >HSFAU3_3 PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV LLAGAPLEDEATLGQCGVEALTTLEVAGRMLG-GKVHGSLARAGKVRGQTPKGAKQEKKK KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS >HSFAU4_3 PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHEIASLEGIAPEDQVV LLAGAPLEDEATLGQCGVEALTTLEVAGRMLARGKVHGSLARAGKVRGQTPKVAKQEKKK KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS >HSFAU5_3 PLSRLHLRGSWDRRSVANMQLFVRAQELHTFEVTGQETVAQIKAHVAS-LEGIAPEDQVV LLAGAPLEDEATLGQCGVEALTTLEVGRMLFG-GKVHGSLARAGKVRGQTPKVAKQEKKK KKTGRAKRRMQYNRRFVNVVPTFGKKKGPNANS |
>HSFAU1 cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg ctcctggcaggc---cccctggaggatgaggccactctgggccagtgcggggtggaggcc ctgactaccctggaagtagcaggccgcatgcttgga---ggtaaagttcatggttccctg gcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaag aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg cccacctttggcaagaagaagggccccaatgccaactct >HSFAU2 cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg ctcctggcaggcgcgcccctggaggatgcactctgggccagtgcggggtggaggccc--- ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ --------------------------------------- >HSFAU3 cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg ctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggcc ctgactaccctggaagtagcaggccgcatgcttgga---ggtaaagttcatggttccctg gcccgtgctggaaaagtgagaggtcagactcctaagggggccaaacaggagaagaagaag aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg cccacctttggcaagaagaagggccccaatgccaactct >HSFAU4 cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc cagatcaaggctcatgaaatagcctcactggagggcattgccccggaagatcaagtcgtg ctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggcc ctgactaccctggaagtagcaggccgcatgcttgcccgaggtaaagttcatggttccctg gcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaag aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg cccacctttggcaagaagaagggccccaatgccaactct >HSFAU5 cctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgcag 
ctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcgcc cagatcaaggctcatgtagcctca---ctggagggcattgccccggaagatcaagtcgtg ctcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggcc ctgactaccctggaagtaggccgcatgctttttgga---ggtaaagttcatggttccctg gcccgtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaag aagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtg cccacctttggcaagaagaagggccccaatgccaactct |
The output is the regions of the nucleic acid sequences which code for the corresponding protein sequence, with gap characters ('-') introduced so that they have the same alignment as the corresponding protein sequences.
In general, it is better to use protein sequences for multiple alignment, but to use DNA sequences for phylogeny, for example when using the programs dnadist, dnapars, dnaml, etc. in the PHYLIP package. Where one has a protein sequence alignment, it would be time-consuming to remove gap characters before back-translating the proteins. tranalign helps by generating aligned cDNA sequences from a protein sequence alignment.
tranalign finds the coding regions for contiguous sequences only. It will not splice together different exons to produce a coding sequence. You should therefore use either mRNA sequences, or nucleic sequences which you have constructed to hold a contiguous coding region (for example using extractseq, or yank and union).
The sequences must be in the same order in both input sets of sequences. Some alignment programs (including clustalw/emma) will re-order their input sequences so as to group similar sequences together.
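If an alignment program has re-ordered the proteins, the aligned set can be put back into the nucleotide order before running tranalign. The following is a minimal sketch, assuming FASTA input and assuming each protein ID is the nucleotide ID plus a fixed suffix (`_3` in the example files on this page); the helper names `read_fasta` and `reorder_proteins` are hypothetical and not part of EMBOSS.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first word of the header line.
            name = (line[1:].split() or [""])[0]
            records.append([name, ""])
        elif records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def reorder_proteins(nuc_ids, protein_records, suffix="_3"):
    """Return protein records in the order of nuc_ids.

    Assumes protein ID == nucleotide ID + suffix; raises KeyError
    if a nucleotide sequence has no matching protein.
    """
    by_id = dict(protein_records)
    return [(nid + suffix, by_id[nid + suffix]) for nid in nuc_ids]


shuffled = read_fasta(">HSFAU2_3\nPLS-R\n>HSFAU1_3\nPL\nSRL\n")
ordered = reorder_proteins(["HSFAU1", "HSFAU2"], shuffled)
for name, seq in ordered:
    print(">" + name)
    print(seq)
```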
"Guide protein sequence xxx not found in nucleic sequence xxx" - the region of the nucleic sequence which codes for the protein was not found. The coding region in the nucleic acid sequence must be a single contiguous sequence. The protein sequence might not be the corresponding one for this nucleic acid sequence if they are out of order.
Program name | Description |
---|---|
edialign | Local multiple alignment of sequences |
emma | Multiple sequence alignment (ClustalW wrapper) |
infoalign | Display basic information about a multiple sequence alignment |
plotcon | Plot conservation of a sequence alignment |
prettyplot | Draw a sequence alignment with pretty formatting |
showalign | Display a multiple sequence alignment in pretty format |
tranalign was written in EMBOSS code, using the description of mrtrans as a guide, by Gary Williams (gwilliam © rfcgr.mrc.ac.uk)
MRC Rosalind Franklin Centre for Genomics Research
Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK
tranalign written (March 2002) - Gary Williams