EMBOSS: infoalign

infoalign

Function

Description

infoalign displays on screen basic information about sequences in an input multiple sequence alignment. This includes the sequences' USA, name, two measures of length, counts of gaps, and numbers of identical, similar and different residues or bases in this sequence when compared to a reference sequence, together with a simple statistic of the % change between the reference sequence and this sequence. Any combination of these records is easily selected or unselected for display. The same information may be written to an output file which (optionally) may be formatted in an HTML table.

The reference sequence is the one against which all the other sequences are compared using a specified substitution matrix. It is either the calculated consensus sequence of the alignment (the default) or it can be one of the set of aligned sequences, specified by either the ordinal number of that sequence in the input file, or by name. There are various options to control how the consensus is calculated.

Algorithm

The set of aligned sequences is read in.

If the reference sequence is the consensus sequence (this is the default) then this is calculated. If the reference sequence is specified as an ordinal number, then the sequences are counted (from 1) until the reference sequence is identified. If the reference sequence is specified by its name then the names of the sequences are compared to the specified name until the reference sequence is identified.
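The selection of the reference sequence described above can be sketched in Python. This is an illustration only, not the program's source: the function name find_reference and its reference parameter are hypothetical, None stands in for "use the consensus", and an exact name match is an assumption.

```python
def find_reference(names, reference=None):
    """Return the 0-based index of the reference sequence.

    Returns None when the consensus should be used instead (the default).
    `reference` is a hypothetical parameter: an int is an ordinal counted
    from 1, a str is a sequence name (exact match is an assumption).
    """
    if reference is None:
        return None  # caller is expected to calculate the consensus

    if isinstance(reference, int):
        # Ordinal numbers are counted from 1, as in the description above.
        if not 1 <= reference <= len(names):
            raise ValueError(f"no sequence number {reference} in the alignment")
        return reference - 1

    # Otherwise compare each name in turn until the reference is identified.
    for index, name in enumerate(names):
        if name == reference:
            return index
    raise ValueError(f"no sequence named {reference!r} in the alignment")
```

For example, with names ["seqA", "seqB", "seqC"], an ordinal of 2 and the name "seqB" both select the second sequence.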

For each sequence:

    Find the position of the first residue or base which is not a gap character.
    Find the position of the last residue or base which is not a gap character.

    For each position from the first non-gap character to the last non-gap character:

        if the character at this position is a gap character, then
            increment the 'GapLen' count
            if this character is the start of a new gap, increment the 'Gaps' count
        else
            compare the character at this position with the character at the same position in the reference sequence
            if the sequence character and the reference character are identical (apart from case) then
                increment the 'Ident' count
            else if the similarity matrix score for the two characters is > 0 (i.e. if they are similar) then
                increment the 'Similar' count
            else
                increment the 'Different' count

The 'SeqLen' length of the sequence is the number of non-gap characters in the sequence (i.e. 'Ident' + 'Similar' + 'Different')

The 'AlignLen' length of the sequence is the length from the first non-gap character to the last non-gap character, i.e. the number of bases or residues of the sequence plus the number of gap characters internal to the sequence.

The '%Change' value for the sequence is calculated as:
('AlignLen' - 'Ident') * 100 / 'AlignLen'
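The counting steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the program's implementation: the names align_stats, toy_score and GAP_CHARS are hypothetical, the set of gap characters is assumed, and toy_score is a made-up stand-in for a real substitution matrix.

```python
GAP_CHARS = "-."  # assumption: which characters count as gaps


def toy_score(a, b):
    """Made-up similarity score: I, L, V and M are treated as mutually similar."""
    group = set("ILVM")
    return 1 if a.upper() in group and b.upper() in group else 0


def align_stats(seq, ref, score=toy_score):
    """Count gaps and identical/similar/different positions of seq against ref.

    seq and ref are aligned strings of equal length; score(a, b) returns a
    number that is > 0 when the two characters are considered similar.
    """
    stats = {"SeqLen": 0, "AlignLen": 0, "Gaps": 0, "GapLen": 0,
             "Ident": 0, "Similar": 0, "Different": 0, "%Change": 0.0}
    non_gap = [i for i, c in enumerate(seq) if c not in GAP_CHARS]
    if not non_gap:
        return stats  # sequence consists only of gap characters

    first, last = non_gap[0], non_gap[-1]
    in_gap = False
    for i in range(first, last + 1):
        c = seq[i]
        if c in GAP_CHARS:
            stats["GapLen"] += 1
            if not in_gap:  # start of a new internal gap
                stats["Gaps"] += 1
            in_gap = True
        else:
            in_gap = False
            r = ref[i]
            if c.upper() == r.upper():  # identical apart from case
                stats["Ident"] += 1
            elif score(c, r) > 0:
                stats["Similar"] += 1
            else:
                stats["Different"] += 1

    stats["SeqLen"] = stats["Ident"] + stats["Similar"] + stats["Different"]
    stats["AlignLen"] = last - first + 1
    stats["%Change"] = (stats["AlignLen"] - stats["Ident"]) * 100 / stats["AlignLen"]
    return stats
```

As a worked example, comparing "--AAA---AIA-" with the reference "AAAAAAAAALAA" gives SeqLen 6, AlignLen 9, Gaps 1, GapLen 3, Ident 5, Similar 1 and Different 0, so %Change is (9 - 5) * 100 / 9, about 44.4. Note that internal gap characters raise %Change because they contribute to AlignLen but never to Ident.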

Usage

Command line arguments

Input file format

infoalign reads a normal multiple sequence alignment file, as produced by an alignment program.

Output file format

The first non-blank line is the heading. This is followed by one line per sequence containing the following columns of data separated by one or more space or TAB characters:

The USA (Uniform Sequence Address) that EMBOSS can use to read in the sequence.
Name - name of the sequence.
SeqLen - length of the sequence when all gap characters are removed.
AlignLen - length of the sequence including internal gap characters, i.e. gap characters at the start or the end are not included.
Gaps - number of internal gaps, e.g. 'AAA---AAA' contains 1 gap, which is 3 gap characters long (see GapLen).
GapLen - total number of internal gap characters in this sequence, summed over all of its internal gaps, e.g. 3 for 'AAA---AAA'.
Ident - number of characters that are identical to the specified reference sequence (uppercase 'A' is identical to lowercase 'a').
Similar - number of characters which are not identical to the reference sequence but which score > 0 in the comparison matrix when compared to it.
Different - number of characters which score <= 0 in the comparison matrix when compared to the reference sequence.
%Change - a simple measure of the percentage change as compared to the reference sequence: (AlignLen - Ident) * 100 / AlignLen
Description - the description annotation of the sequence (if any).

If qualifiers to inhibit various columns of information are used, then the remaining columns of information are output in the same order as shown above, so if '-noseqlength' is used, the order of output is: usa, name, alignlength, gaps, gapcount, idcount, simcount, diffcount, change, description.
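Because the columns keep a fixed order, a line of output can be split back into named fields. The following Python sketch assumes the column labels listed above and that the description, when present, is the last column; the names COLUMNS and parse_line and the sample line are hypothetical and are not taken from real program output.

```python
# Column labels in the order described above (labels are an assumption;
# the heading line of a real output file may spell them differently).
COLUMNS = ["USA", "Name", "SeqLen", "AlignLen", "Gaps", "GapLen",
           "Ident", "Similar", "Different", "%Change", "Description"]


def parse_line(line, columns=COLUMNS):
    """Split one data line on runs of spaces/TABs into a dict of strings.

    At most len(columns) fields are produced, so a final Description
    column keeps any spaces it contains. Missing trailing fields (for
    example an empty description) are simply absent from the result.
    """
    fields = line.strip().split(None, len(columns) - 1)
    return dict(zip(columns, fields))


# Made-up sample line for illustration only.
sample = "fasta::demo.aln:seq1\tseq1\t6\t9\t1\t3\t5\t1\t0\t44.444444\tdemo sequence one"
record = parse_line(sample)

# If a column was suppressed (e.g. with -noseqlength), drop it from the list.
without_seqlen = [c for c in COLUMNS if c != "SeqLen"]
```

All values are returned as strings; converting the numeric columns is left to the caller, since which columns are present depends on the qualifiers used.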

When the -html qualifier is specified, the output is wrapped in HTML table tags, ready for inclusion in a Web page. Note that the enclosing page-level tags are not output by this program, as the table is expected to form only part of the contents of a web page; the rest of the web page must be supplied by the user.

The lines of output information are guaranteed not to have trailing white-space at the end.

Data files

infoalign reads in scoring matrices to determine the consensus sequence and to determine which matches are similar or not.

Notes

There are many qualifiers to control exactly what information on the sequence is output and how it is formatted. If you only want a few fields in the output file, the command line may be shortened by preceding the appropriate qualifier with -only. For example, instead of specifying -nohead -nousa -noname -noalign -nogaps -nogapcount -nosimcount -noidcount -nodiffcount -noweight to get only the sequence length output, you can specify -only -seqlength.

By default, the output file starts each line with the USA of the sequence being described, so the output file is a list file that can be manually edited and read in by any other EMBOSS program that can read in one or more sequences to be analysed.
