oddcomp reads one or more protein sequences, identifies proteins containing regions with a specific sequence word composition, then writes a list of sequence identifiers of those proteins to an output file. The word composition is read from an input file which gives the minimum number of occurrences for any number of sequence words. For an input sequence to be listed in the output file, each word must be found at least the stated number of times within at least one window over the input sequence. The window size may be set to any value or to the length of the current protein (-fullwindow option).
The input file of sequence word composition data is in the same format as the output from compseq. Only one word size (of any length) can be used, and it is specified at the top of the file. The search for words is a boolean AND, meaning that all words given in the file must be found in a sequence for it to be reported.
Each word must occur at least the stated number of times in a window over an input sequence for the sequence to be reported. The word size given in the input data file must be less than the specified window size; you will not get any hits otherwise.
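The window rule above can be sketched in Python. This is an illustration only, not oddcomp's actual implementation: the function name window_matches is hypothetical, overlapping word occurrences are assumed to be counted, and a sequence shorter than the window is assumed to be checked as a single window.

```python
from collections import Counter


def window_matches(seq, min_counts, word_size, window=None):
    """Return True if at least one window of seq contains every word
    in min_counts at least the stated number of times.

    Assumptions (not confirmed by the oddcomp documentation):
    overlapping occurrences are counted, and a sequence shorter than
    the window is treated as one window covering the whole sequence.
    """
    seq = seq.upper()
    if window is None:
        # Mimics the full-window case: one window spanning the protein.
        window = len(seq)
    window = min(window, len(seq))
    # The word size must be less than the window size, else no hits.
    if word_size >= window:
        return False
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        counts = Counter(
            chunk[i:i + word_size]
            for i in range(len(chunk) - word_size + 1)
        )
        # Boolean AND: every word must reach its minimum in this window.
        if all(counts[word] >= n for word, n in min_counts.items()):
            return True
    return False
```

For instance, with the minimum counts RS 2 and SR 1, the sequence RSAAAAAARSSR would match with a full-length window but not with a window of 5 residues, because no single 5-residue stretch holds two RS words.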
The columns "Obs Frequency", "Exp Frequency" and "Obs/Exp Frequency" are not required - they were simply included in this example to show the similarity between this input file format and the output of the program compseq. A compseq output file can be used as the input to oddcomp - the extra columns are ignored by oddcomp.
A minimal composition input data file would look like this:
Word size 2
Total count 0
RS 2
SR 1
Blank lines and lines starting with '#' are ignored.
The first non-comment line should start with 'Word size' and will specify the word size to use.
A line starting with the word 'Total' is required.
Anything after the line starting with the word 'Total' will be read as word count data.
Word count data consists of a word to search for and the count of that word to search for within the sliding window. The columns are separated by one or more spaces or TAB characters. Anything after these two columns will be ignored.
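The file rules listed above can be expressed as a small parser. The sketch below is a hedged illustration in Python, not code from oddcomp: the function name parse_composition is hypothetical, lines between the 'Word size' line and the 'Total' line are assumed to be skipped, and entries whose word length differs from the declared word size (such as a summary row in compseq output) are assumed to be skipped as well.

```python
def parse_composition(text):
    """Parse composition data into (word_size, {word: minimum_count}).

    Assumptions (not confirmed by the oddcomp documentation): lines
    before the 'Total' line other than 'Word size' are skipped, and
    words whose length differs from the word size are skipped.
    """
    word_size = None
    seen_total = False
    min_counts = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and lines starting with '#' are ignored.
        if not line or line.startswith("#"):
            continue
        if word_size is None:
            # The first non-comment line gives the word size.
            if not line.startswith("Word size"):
                raise ValueError("first data line must start with 'Word size'")
            word_size = int(line.split()[2])
            continue
        if not seen_total:
            # Everything after the 'Total' line is word count data.
            if line.startswith("Total"):
                seen_total = True
            continue
        fields = line.split()  # splits on runs of spaces or TABs
        if len(fields) < 2:
            raise ValueError(f"expected a word and a count: {raw!r}")
        word, count = fields[0], int(fields[1])
        # Columns after the first two are ignored.
        if len(word) == word_size:
            min_counts[word] = count
    if not seen_total:
        raise ValueError("a line starting with 'Total' is required")
    return word_size, min_counts
```

Parsing the minimal file shown earlier would give a word size of 2 and the minimum counts RS 2 and SR 1.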
oddcomp was originally written to identify proteins with SR/RS dimers, for example, windows of forty amino acids containing at least 3 SR and 4 RS words. More generally, it will help answer questions of the type 'which proteins contain at least x occurrences of word X and y occurrences of word Y in regions of n residues'. For example, one could search for serine-rich, polyglutamine-rich, collagen helix, or similar proteins using this program.
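A composition file for a question of this type can be generated rather than typed by hand. The Python sketch below is illustrative only: the function name format_composition is hypothetical, TAB is chosen as the column separator, and the total count is written as 0, as in the minimal example file.

```python
def format_composition(min_counts):
    """Build composition file text from a {word: minimum_count} mapping.

    All words must share one length, since only one word size can be
    used per file. The 'Total count' value is written as 0 here, an
    assumption taken from the minimal example file.
    """
    sizes = {len(word) for word in min_counts}
    if len(sizes) != 1:
        raise ValueError("all words must have the same, non-zero length")
    word_size = sizes.pop()
    lines = [f"Word size\t{word_size}", "Total count\t0", ""]
    lines += [f"{word}\t{count}" for word, count in min_counts.items()]
    return "\n".join(lines) + "\n"


# The SR/RS question from the text: at least 3 SR and 4 RS words.
print(format_composition({"SR": 3, "RS": 4}), end="")
```

The window size of forty residues is not part of this file; it is supplied separately when running oddcomp.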
oddcomp does not report the location of the word matches in the sequence, merely the sequence ID. To search for a specific set of words in a sequence, you should edit the input file of sequence word composition data.