EMBOSS: patmatmotifs

patmatmotifs

Function

Description

patmatmotifs reads a protein sequence and searches it against the PROSITE database of motifs. It writes a standard EMBOSS report file giving the location and score of each matching motif. Optionally, the full documentation for matching patterns is included in the report. By default, matches to simple post-translational modification sites are not reported. The PROSITE database must be processed by running prosextract before patmatmotifs is run.

The home web page of PROSITE is: http://www.expasy.ch/prosite/. Quoting from the PROSITE user's documentation:

"PROSITE is a method of determining what is the function of uncharacterized proteins translated from genomic or cDNA sequences. It consists of a database of biologically significant sites and patterns formulated in such a way that with appropriate computational tools it can rapidly and reliably identify to which known family of protein (if any) the new sequence belongs.

In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment, but it can be identified by the occurrence in its sequence of a particular cluster of residue types which is variously known as a pattern, motif, signature, or fingerprint. These motifs arise because of particular requirements on the structure of specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity. These requirements impose very tight constraints on the evolution of those limited (in size) but important portion(s) of a protein sequence. To paraphrase Orwell, in Animal Farm, we can say that "some regions of a protein sequence are more equal than others" !

The use of protein sequence patterns (or motifs) to determine the function(s) of proteins is becoming very rapidly one of the essential tools of sequence analysis. This reality has been recognized by many authors, as it can be illustrated from the following citations from two of the most well known experts of protein sequence analysis, R.F. Doolittle and A.M. Lesk:

"There are many short sequences that are often (but not always) diagnostics of certain binding properties or active sites. These can be set into a small subcollection and searched against your sequence (1)".

"In some cases, the structure and function of an unknown protein which is too distantly related to any protein of known structure to detect its affinity by overall sequence alignment may be identified by its possession of a particular cluster of residues types classified as a motifs. The motifs, or templates, or fingerprints, arise because of particular requirements of binding sites that impose very tight constraint on the evolution of portions of a protein sequence (2)."

A search of a protein sequence against the PROSITE database commonly reports many matches to the short motifs that indicate post-translational modification sites, such as glycosylation, myristoylation and phosphorylation sites. These matches are often unwanted, so they are not reported by default. You can turn reporting of these short motifs on by giving the -noprune option on the command line.
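
The frequency of such matches follows from how short these patterns are. As an illustration, the PROSITE N-glycosylation pattern N-{P}-[ST]-{P} constrains only four residues. The Python sketch below translates basic PROSITE pattern notation into a regular expression and scans a sequence with it. It is an illustration only: the function names are hypothetical, the test sequence is invented for the example, only the basic notation is handled (residues, x, [..], {..}, repeat counts and the < and > anchors), and it is not a description of how patmatmotifs itself performs the search.

```python
import re

# One pattern element: a residue, x, [allowed], or {forbidden},
# optionally followed by a repeat count such as (3) or (2,4).
_ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    pattern = pattern.strip().rstrip(".")   # patterns may end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):         # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):           # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)
        core, low, high = match.groups()
        if core == "x":                     # any residue
            core = "."
        elif core.startswith("{"):          # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(prefix + core + suffix)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (start, end, matched_text) for every hit, 1-based inclusive.

    A lookahead is used so that overlapping hits are all reported.
    """
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.start() + len(m.group(1)), m.group(1))
            for m in regex.finditer(sequence)]


# Invented 15-residue sequence, used only to exercise the functions.
print(find_motif("N-{P}-[ST]-{P}", "MKNGTAPNPSANLSV"))
# prints [(3, 6, 'NGTA'), (12, 15, 'NLSV')]
```

Even this short invented sequence yields two hits for a four-residue pattern, which suggests why such matches are suppressed unless -noprune is given.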

Usage
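
As a hedged illustration of an invocation, the Python sketch below assembles a patmatmotifs command line without running it. The -noprune option is the one described above. The -sequence, -outfile and -full qualifier names are assumptions not stated in this section and should be checked against the output of patmatmotifs -help on your installation. The function name and the file names are hypothetical.

```python
import shlex


def build_patmatmotifs_command(sequence, outfile, full=False, prune=True):
    """Build an argument list for patmatmotifs (the command is not executed).

    sequence: a USA naming the protein sequence to search
    outfile:  name of the report file to write
    full:     request full documentation for matches (assumed -full qualifier)
    prune:    when False, add -noprune so that the short post-translational
              modification motifs are reported as well
    """
    command = ["patmatmotifs", "-sequence", sequence, "-outfile", outfile]
    if full:
        command.append("-full")
    if not prune:
        command.append("-noprune")
    return command


# Hypothetical file names, shown only to illustrate the resulting command.
command = build_patmatmotifs_command(
    "protein.fasta", "protein.patmatmotifs", prune=False)
print(shlex.join(command))
```

The list form can be handed to subprocess.run on a machine where EMBOSS is installed and prosextract has already been run.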

Command line arguments

Input file format

patmatmotifs reads a protein sequence, specified as a USA (Uniform Sequence Address).

Output file format

By default patmatmotifs writes a 'dbmotif' report file.

Data files

The PROSITE data and documentation files are read automatically. They must first be generated and formatted by running prosextract.

Notes

The program is only useful once prosextract has been run.

References

If you want to refer to PROSITE in a publication you can do so by citing:

Bairoch A., Bucher P., Hofmann K. The PROSITE database, its status in 1997. Nucleic Acids Res. 25:217-221(1997).

Other references:

Bairoch A., Bucher P. (1994) PROSITE: recent developments. Nucleic Acids Research, Vol 22, No. 17, 3583-3589.
Bairoch A. (1992) PROSITE: a dictionary of sites and patterns in proteins. Nucleic Acids Research, Vol 20, Supplement, 2013-2018.
Peek J., O'Reilly T., Loukides M. (1997) Unix Power Tools, 2nd Edition.
Doolittle R.F. (In) Of URFs and ORFs: a primer on how to analyze derived amino acid sequences., University Science Books, Mill Valley, California, (1986).
Lesk A.M. (In) Computational Molecular Biology, Lesk A.M., Ed., pp17-26, Oxford University Press, Oxford (1988).

Warnings

Your EMBOSS administrator must have set up the local EMBOSS PROSITE database using the utility prosextract before this program will run.

Diagnostic Error Messages

The error message:

"Either EMBOSS_DATA undefined or PROSEXTRACT needs running"

indicates that your local EMBOSS administrator has not yet correctly set up the local EMBOSS PROSITE database using the utility 'prosextract'.
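
The two causes named in the message can be told apart with a quick check. The Python sketch below is a rough diagnostic under stated assumptions: it assumes that prosextract writes a file named prosite.lines into a PROSITE subdirectory of the directory named by EMBOSS_DATA, and it ignores any other locations that EMBOSS may search for data files. The function name is hypothetical, so treat the result as a hint rather than a definitive diagnosis.

```python
import os
from pathlib import Path


def diagnose_prosite_setup(environ=None):
    """Guess why patmatmotifs cannot find its PROSITE data.

    ASSUMPTION: the processed data live at $EMBOSS_DATA/PROSITE/prosite.lines.
    Other data locations that EMBOSS may consult are not checked here.
    """
    environ = os.environ if environ is None else environ
    data_dir = environ.get("EMBOSS_DATA")
    if not data_dir:
        return "EMBOSS_DATA is not set"
    lines_file = Path(data_dir) / "PROSITE" / "prosite.lines"
    if not lines_file.is_file():
        return "%s not found; prosextract may need running" % lines_file
    return "ok"


print(diagnose_prosite_setup())
```

Passing a dictionary instead of the real environment makes the check easy to try against a test directory before asking the administrator to rerun prosextract.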
