EMBOSS: compseq

compseq

Function

Description

compseq calculates the composition of words of a specified length (dimer, trimer etc) in the input sequence(s). The word length is user-specified. The unique sequences (words), their observed count, observed frequency, expected frequency and (observed / expected) frequency are written to the output file. The (observed / expected) frequency highlights any words with unusually high (or low) occurence in the input sequences.

Algorithm

By default, compseq makes the (false) assumption that each word is equally likely. The expected frequency therefore of any dimer is 1/16 - this is simply the inverse of the number of possible dimers (AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT). Similarly, the expected frequency of any trimer is 1/64, etc. Clearly this is not the case in real sequences where there will be bias in favour of some monomers and words. There are ways around this (see "Notes").

The normal behaviour of compseq is to count the frequencies of all words that occur by moving a window of length 'word' up by one each time. The '-frame' option allows you to move the window up by the length of the word each time, skipping over the intervening words. You can count only those words that occur in a single frame of the word by setting this value to a number other than zero. If you set it to 1 it will only count the words in frame 1, 2 will only count the words in frame 2 and so on.

Usage

Command line arguments

Input file format

Normal sequence(s) USA.

Output file format

The output format consists of:

Header information and comments are preceeded by a '#' character at the start of the line.

The Word size and the Total count are then given on separate lines,

The headers of the columns of results are preceeded by a '#'

The results columns are: the sub-sequence word, the observed frequency, the expected frequency (which will be read from the input file if one is given, else it is a simple inverse of the number of words of the size specified that can be constructed), the ratio of the observed to expected frequency.

After a blank line at the end, the results of 'Other' words is given - this is the number of words with a sequence which has IUPAC ambiguity codes or other unusual characters in.

Data files

The input data file is not required.

The input data file format is exactly the same as the output file format.

It expects to read in a previous output file of this program. An error is produced if the word size of the current compseq job and that of the output file being read in are different.

Notes

There is no way for compseq to guess what the true expected frequency should be for each word. It can however read in the result of a previous compseq analysis and use this to set the expected frequencies of the subsequences. In this case, the input sequences under investigation should be representative of those used for the previous compseq analysis. It is down to your biological expertise to ensure the sequences are genuinely "representative", for instance, you might select a group of sequences belonging to the same taxonomic rank such as genus or species.

The file of expected frequencies is specified by name with the -infile qualifier. The word size in the current run must be the same as the one in this results file. Obviously, you should use a file produced from protein sequences if you are counting protein sequence word frequencies, and you must use one made from nucleotide frequencies if you are analysing a nucleotide sequence.

As an alternative to using -infile, the expected frequencies of words may be calculated from the observed frequency of single bases or residues in the sequences. To do this, specify the -calcfreq qualifier. If you are reporting a word size of 1 (single bases or residues) then there is no point in using this option because the calculated expected frequency will be equal to the observed frequency.

Calculating the expected frequencies like this will give an approximation of the expected frequencies that you might get by using an input file of frequencies produced by a previous run of this program. If an input file of expected word frequencies has been specified then the values from that file will be used instead of this calculation of expected frequency from the sequence, even if 'calcfreq' is set to be true.

References

None.

Warnings

The results are held in an array in memory before being written to a file. For large values of wordsize (over about 7 for nucleic, 5 for protein), you may run out of memory or generate a very large output file.

Diagnostic Error Messages

"The word size is too large for the data structure available.": You chose a word size that cannot be stored by the program.
"Insufficient memory - aborting.": You do not have enough memory - use a machine with more memory.
"The word size you are counting (n) is different to the word size in the file of expected frequencies (n).": You chose different word sizes in the run of compseq that produced your results file used to display the expected word frequencies to the word size used in this run of compseq.
"The 'Word size' line was not found, instead found:": You appear to be trying to read a corrupted compseq results file

Exit status

It always exits with status 0 unless one of the above error conditions is found

Known bugs

This program can use a large amount of memory is you specify a large word size (7 or above). This may impact the behaviour of other programs on your machine.

If you run out of memory, you may see the program crash with a generic error message that will be specific to your machine's operating system, but will probably be a warning about writing to memory that the program does not own (eg "Segmentation fault" on a Solaris machine)

This is not a bug, it is a feature of the way this program grabs large amounts of memory.