prophecy

Function

Description

prophecy generates, for an input sequence alignment, either a simple frequency matrix (for use by profit) or a position-specific weighted profile built with the Gribskov (1) or Henikoff (2) method (for use by prophet). To construct a Gribskov or Henikoff profile, a residue substitution matrix, a gap opening penalty and a gap extension penalty must be specified.

Algorithm

The Gribskov scoring scheme is based on a notion of distance between a sequence and an ancestral or generalized sequence. The Henikoff scheme is instead based on weights derived from the diversity observed at each position in the alignment, rather than on a sequence distance measure.
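One published way to turn per-position diversity into sequence weights is the position-based scheme of reference 2. The sketch below illustrates that idea only; it is not taken from the prophecy source, and the function name henikoff_weights, the treatment of gap characters as ordinary symbols, and the final normalisation are all assumptions made for illustration.

```python
from collections import Counter


def henikoff_weights(alignment):
    """Position-based sequence weights for equal-length aligned sequences.

    At each column, a sequence receives 1 / (k * n), where k is the number
    of distinct symbols in the column and n is how many sequences share
    this sequence's symbol there. Contributions are summed over columns.
    Assumption: gap characters are counted like any other symbol.
    """
    n_seqs = len(alignment)
    length = len(alignment[0])
    weights = [0.0] * n_seqs
    for col in range(length):
        column = [seq[col] for seq in alignment]
        counts = Counter(column)
        k = len(counts)  # distinct symbols observed in this column
        for i, symbol in enumerate(column):
            weights[i] += 1.0 / (k * counts[symbol])
    total = sum(weights)
    # Normalise so the weights sum to 1 (an illustrative choice).
    return [w / total for w in weights]
```

With this scheme, a sequence that carries rare residues at many positions receives a larger weight than sequences that repeat the majority residue.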

Usage

Command line arguments


Input file format

prophecy reads a protein or nucleic acid sequence alignment, specified as a USA (Uniform Sequence Address).

Output file format

The output is a profile file.

Simple frequency matrix

The columns hold the counts for the residue codes from A to Z. The rows represent the alignment positions from 1 to n. In the header, "Simple" marks the file as a simple frequency matrix and "Length" gives the number of alignment positions (20 amino acids in the example this description refers to, for which the maximum score the matrix can give is 496). The "Threshold" value is an instruction to the profit application to report only matches above the given score.
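The layout described above (one row per alignment position, one column per letter A to Z) can be sketched as follows. This is an illustration, not prophecy's code: the function names are hypothetical, and the way the maximum score is derived here (the sum of the largest count in each row) is an assumption rather than something stated in this document.

```python
import string

ALPHABET = string.ascii_uppercase  # columns A..Z


def frequency_matrix(alignment):
    """Return one row of 26 counts per alignment position."""
    length = len(alignment[0])
    matrix = []
    for pos in range(length):
        row = [0] * len(ALPHABET)
        for seq in alignment:
            idx = ALPHABET.find(seq[pos].upper())
            if idx >= 0:  # gaps and other non-letter symbols are skipped
                row[idx] += 1
        matrix.append(row)
    return matrix


def max_possible_score(matrix):
    """Assumed definition: the best residue at every position is matched."""
    return sum(max(row) for row in matrix)
```

For the three-sequence alignment ACD, AC-, TCD the rows contain A=2 and T=1, then C=3, then D=2, so under the stated assumption the maximum score is 7.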

Gribskov profile

The columns represent the amino acids from A to Z; the rows denote the alignment positions from 1 to n. The last column is the indel penalty. "Name" is the name used by prophet to refer to the profile, and "Matrix" is the name of the scoring matrix used (containing the residue substitution values). "Length" is the length of the alignment. "Max_score" is the maximum score this profile can produce. "Threshold" is an instruction to prophet to report only hits equal to or above the given value. The gap opening and extension values are used by prophet in the dynamic programming alignment (Smith-Waterman equivalent).

Data files

The two profile methods require a residue substitution scoring matrix.

The Gribskov matrices use the data file 'Epprofile' by default. This is derived from a PAM250 scoring matrix.

The Henikoff matrices use the data file 'EBLOSUM62' by default.

Notes

Profile analysis is a method for detecting distantly related proteins by sequence comparison. The basis for comparison is not only the customary Dayhoff mutational-distance matrix but also the results of structural studies and information implicit in the alignments of the sequences of families of similar proteins.

This information is expressed in a position-specific scoring table (profile), which is created from a group of sequences previously aligned by structural or sequence similarity. The similarity of any other target sequence to the group of aligned probe sequences can be tested by comparing the target to the profile using dynamic programming algorithms.

The profile method differs in two major respects from methods of sequence comparison in common use:

  (i) Any number of known sequences can be used to construct the profile, allowing more information to be used in the testing of the target than is possible with pairwise alignment methods.
  (ii) The profile includes the penalties for insertion or deletion at each position, which allow one to include the probe secondary structure in the testing scheme.
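To make the comparison of a target to a profile by dynamic programming concrete, here is a minimal local-alignment sketch. It is a simplification, not the algorithm used by prophet: it charges a single linear penalty per gap character taken from the profile position (rather than separate opening and extension penalties), and the function name, the dictionary-based profile representation, and the default score of 0 for residues absent from a row are all assumptions.

```python
def profile_local_score(profile, gap_penalties, target):
    """Best local alignment score of `target` against `profile`.

    profile[i]       -- dict mapping residue -> score at profile position i
    gap_penalties[i] -- cost of one gap character at profile position i
    Residues missing from a row score 0 (an illustrative default).
    """
    m = len(target)
    prev = [0.0] * (m + 1)  # scores for the previous profile position
    best = 0.0
    for i, row in enumerate(profile):
        cur = [0.0] * (m + 1)
        gap = gap_penalties[i]
        for j in range(1, m + 1):
            match = prev[j - 1] + row.get(target[j - 1], 0.0)
            skip_profile = prev[j] - gap     # profile position left unmatched
            skip_target = cur[j - 1] - gap   # target residue left unmatched
            # Local alignment: a score never drops below zero.
            cur[j] = max(0.0, match, skip_profile, skip_target)
            best = max(best, cur[j])
        prev = cur
    return best
```

Because the gap cost is looked up per profile position, positions where insertions or deletions are tolerated (for example, loops between secondary-structure elements) can be given a lower penalty than conserved positions, which is the second distinguishing feature listed above.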

References

  1. Gribskov M, McLachlan AD, Eisenberg D. Proc Natl Acad Sci U S A. 1987 Jul; 84(13): 4355-8.
  2. Henikoff S, Henikoff JG. J Mol Biol. 1994 Nov 4; 243(4): 574-8.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

Author(s)

History

Target users

Comments