EMBOSS: profit

profit

Function

Description

profit scans one or more sequences with a simple frequency matrix and writes an output file with any high-scoring matches. All possible ungapped alignments of each sequence to the matrix are scored and any matches with a score higher than the specified threshold are written to the output file. The output file includes the name of any matching sequence found, the start position in the sequence of the match and the percentage of the maximum possible score.

Algorithm

All possible ungapped alignments of each sequence to the frequency matrix are scored. The first alignment has the first positions of the sequence and matrix in the same register. If the sequence is longer than the matrix, there will be more than one alignment, each shifted one position further along the sequence. Otherwise, there will be just one.

The score for a match is the sum, over every position of the matrix, of the matrix value for the corresponding residue in the sequence. The maximum possible score is the sum of the highest value at each position in the frequency matrix, and the output file reports each match score as a percentage of that maximum. A hit is reported where the match score is above the threshold percentage of the maximum possible score for that matrix.
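The scoring described above can be sketched in a few lines of Python. This is an illustration only, not the profit source code: the function name scan, the representation of the matrix as a list of per-column dictionaries, and the choices to skip sequences shorter than the matrix and to treat residues absent from a column as scoring zero are all assumptions made for the example.

```python
def scan(sequence, matrix, threshold_pct):
    """Score every ungapped alignment of `sequence` against `matrix`.

    `matrix` is a list with one dict per column, mapping residue -> count
    (a hypothetical in-memory form, not prophecy's file format).
    Returns a list of (start_position, percentage) tuples, 1-based.
    """
    width = len(matrix)
    # Maximum possible score: sum of the highest value in each column.
    max_score = sum(max(col.values()) for col in matrix)
    if max_score == 0:
        return []

    hits = []
    # One alignment per offset; a sequence shorter than the matrix
    # yields no alignments in this sketch (an assumption).
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues missing from a column score zero (an assumption).
        score = sum(col.get(res, 0) for col, res in zip(matrix, window))
        pct = 100.0 * score / max_score
        # "Above the threshold" is read here as strictly greater than.
        if pct > threshold_pct:
            hits.append((start + 1, pct))
    return hits


# Tiny two-column matrix, made up for illustration.
toy_matrix = [{"A": 2, "C": 1}, {"G": 3}]
toy_hits = scan("TAGAG", toy_matrix, 75)
```

With the toy matrix the maximum possible score is 2 + 3 = 5, so the window "AG" scores 100% and the window "CG" would score 80%.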

Usage

Before running the example, we need to make a simple frequency matrix using prophecy.

This is the ungapped aligned set of sequences used to make the matrix:

% more m.seq
>one
DEVGGEALGRLLVVYPWTQR
>two
DEVGREALGRLLVVYPWTQR
>three
DEVGGEALGRILVVYPWTQR
>four
DEVGGEAAGRVLVVYPWTQR



% prophecy
Creates matrices/profiles from multiple alignments
Input sequence set: m.seq
Profile type
         F : Frequency
         G : Gribskov
         H : Henikoff
Select type [F]: 
Enter a name for the profile [mymatrix]: 
Enter threshold reporting percentage [75]: 
Output file [outfile.prophecy]:

Command line arguments

Input file format

profit reads a simple frequency matrix produced by prophecy and uses it to search one or more protein or nucleic acid sequence USAs.

Output file format

The output is a list of three columns.

The first column is the name of the matching sequence found.
The second is the start position in the sequence of the match.
The third column (after the word 'Percentage:') is the percentage of the maximum possible score (sum of the highest value at each position in the frequency matrix).

Data files

None.

Notes

A 'simple frequency matrix' is simply a table containing a count of the number of times any particular amino acid occurs at each position in the sequence alignment from which it was derived. Simple frequency matrices are created using the program prophecy with the option -type F to create the correct type of matrix. The input alignment should not have gaps in it.
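To make the idea of a count table concrete, the following Python sketch tallies the residues in each column of the four-sequence alignment from the Usage section. It only illustrates the counting; it does not reproduce prophecy or its output file format. The function name frequency_matrix is hypothetical, and the gap characters it rejects ('-' and '.') are an assumption.

```python
from collections import Counter


def frequency_matrix(aligned):
    """Count residues per column of an ungapped alignment.

    Returns a list with one Counter per alignment column.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    # Assumed gap characters; the input alignment should not have gaps.
    if any(ch in "-." for seq in aligned for ch in seq):
        raise ValueError("alignment must not contain gaps")
    # zip(*aligned) walks the alignment column by column.
    return [Counter(column) for column in zip(*aligned)]


# The sequences from m.seq in the Usage section.
alignment = [
    "DEVGGEALGRLLVVYPWTQR",
    "DEVGREALGRLLVVYPWTQR",
    "DEVGGEALGRILVVYPWTQR",
    "DEVGGEAAGRVLVVYPWTQR",
]
matrix = frequency_matrix(alignment)

# Column 1 is D in all four sequences; column 5 is G three times, R once.
print(matrix[0])
print(matrix[4])
```

Most columns of this alignment hold a single residue with a count of 4. Only columns 5, 8 and 11 vary, so those are the positions where a match can lose score.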

References

None.

Warnings

The aligned set of sequences used to make the simple frequency matrix should not have gaps in it. profit will let you use a matrix made from a gapped alignment, but the results will probably not be sensible.
