preg
Function
Description
preg searches one or more input protein sequences for matches to a supplied regular expression. A regular expression is a way of specifying an ambiguous pattern to search for. The output is a standard EMBOSS report file with details of any matches.
Usage
Command line arguments
Input file format
preg reads any protein sequence USA (Uniform Sequence Address).
Output file format
Data files
None.
Notes
The following is a short guide to regular expressions in EMBOSS:
- ^
-
use this at the start of a pattern to insist that the pattern can only
match at the start of a sequence. (eg. '^M' matches a methionine at
the start of the sequence)
- $
-
use this at the end of a pattern to insist that the pattern can only
match at the end of a sequence (eg. 'R$' matches an arginine at
the end of the sequence)
- ()
-
groups a pattern. This is commonly used with '|' (eg. '(ACD)|(VWY)'
matches either the first pattern 'ACD' or the second pattern 'VWY')
- |
-
This is the OR operator to enable a match to be made to either one
pattern OR another. There is no AND operator in this version of regular
expressions.
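The anchors, grouping and OR operator described above can be tried out with a minimal sketch in Python. It assumes that Python's re module treats these constructs the same way as the regular expression engine used by preg; the two sequences are made up for illustration.

```python
import re

# Hypothetical protein sequences, for illustration only.
seq_a = "MKACDLLR"
seq_b = "GGVWYKKA"

# '^M' only matches a methionine at the start of the sequence.
starts_with_m = [bool(re.search(r"^M", s)) for s in (seq_a, seq_b)]

# 'R$' only matches an arginine at the end of the sequence.
ends_with_r = [bool(re.search(r"R$", s)) for s in (seq_a, seq_b)]

# '(ACD)|(VWY)' matches either the 'ACD' pattern or the 'VWY' pattern.
alternation = [re.search(r"(ACD)|(VWY)", s).group(0) for s in (seq_a, seq_b)]

print(starts_with_m)  # [True, False]
print(ends_with_r)    # [True, False]
print(alternation)    # ['ACD', 'VWY']
```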
The following quantifier characters specify the number of times that
the character before (in this case 'x') matches:
- x?
-
matches 0 or 1 times (ie, '' or 'x')
- x*
-
matches 0 or more times (ie, '' or 'x' or 'xx' or 'xxx', etc)
- x+
-
matches 1 or more times (ie, 'x' or 'xx' or 'xxx', etc)
- {min,max}
-
Braces can enclose the specification of the minimum and maximum number
of matches (eg. 'x{3,6}' matches 'x' between 3 and 6 times)
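As a rough check of the quantifiers listed above, the following Python sketch tests which strings of 'x' are matched in full by each one. It assumes Python's re module follows the same quantifier rules as preg; the helper name matches_whole is hypothetical.

```python
import re

def matches_whole(pattern, text):
    """Return True if the pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

candidates = ["", "x", "xx", "xxx", "xxxxxxx"]

# 'x?' matches 0 or 1 times.
optional = [c for c in candidates if matches_whole(r"x?", c)]
# 'x*' matches 0 or more times.
zero_or_more = [c for c in candidates if matches_whole(r"x*", c)]
# 'x+' matches 1 or more times.
one_or_more = [c for c in candidates if matches_whole(r"x+", c)]
# 'x{3,6}' matches between 3 and 6 times.
three_to_six = [c for c in candidates if matches_whole(r"x{3,6}", c)]

print(optional)      # ['', 'x']
print(zero_or_more)  # ['', 'x', 'xx', 'xxx', 'xxxxxxx']
print(one_or_more)   # ['x', 'xx', 'xxx', 'xxxxxxx']
print(three_to_six)  # ['xxx']
```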
Quantifiers can follow any of the following types of character specification:
- x
-
any literal character (eg 'A')
- \x
-
the character after the backslash is used instead of its normal
regular expression meaning. This is commonly used to turn off the
special meaning of the characters '^$()|?*+[]-.'. It may be especially
useful when searching for gap characters in a sequence (eg '\.' matches
only a dot character '.')
- [xy]
-
match one of the characters 'x' or 'y'. You may have one or more
characters in this set.
- [x-z]
-
match any one of the set of characters starting with 'x' and
ending in 'z' in ASCII order (eg '[A-G]' matches any one of: 'A', 'B',
'C', 'D', 'E', 'F', 'G')
- [^x-z]
-
matches any one character that is NOT in the set of characters from
'x' to 'z' in ASCII order (eg '[^A-G]' matches anything EXCEPT any one
of: 'A', 'B', 'C', 'D', 'E', 'F', 'G')
- .
-
the dot character matches any single character (eg: 'A.G' matches
'AAG', 'AaG', 'AZG', 'A-G', 'A G', etc.)
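The character specifications above can be illustrated with a short Python sketch. It assumes Python's re module handles sets, ranges, negated sets, escapes and the dot in the same way as preg; all test strings are invented for the example.

```python
import re

# '[KR]' matches one of the characters 'K' or 'R'.
either = re.findall(r"[KR]", "MKRA")

# '[A-G]' matches any one character from 'A' to 'G' in ASCII order.
in_range = re.findall(r"[A-G]", "AHGZ")

# '[^A-G]' matches any one character outside that range.
not_in_range = re.findall(r"[^A-G]", "AHGZ")

# '\.' matches only a literal dot, such as a gap character.
literal_dot = re.findall(r"\.", "A.G-A")

# 'A.G' matches 'A', then any single character, then 'G'.
any_middle = re.findall(r"A.G", "AAGxAZGxA-GxAG")

print(either)        # ['K', 'R']
print(in_range)      # ['A', 'G']
print(not_in_range)  # ['H', 'Z']
print(literal_dot)   # ['.']
print(any_middle)    # ['AAG', 'AZG', 'A-G']
```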
Combining some of these features gives these examples from the PROSITE
patterns database:
'[STAGCN][RKH][LIVMAFY]$'
which is the 'Microbodies C-terminal targeting
signal'.
'LP.TG[STGAVDE]'
which is the 'Gram-positive cocci surface proteins
anchoring hexapeptide'.
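To see how the two PROSITE patterns above behave, the sketch below runs them against a made-up sequence using Python's re module. This is not preg itself and does not reproduce the EMBOSS report format; the sequence, the hits list and the choice of 1-based start and end positions are assumptions made for illustration.

```python
import re

# The two PROSITE patterns quoted above, keyed by their descriptions.
patterns = {
    "Microbodies C-terminal targeting signal": r"[STAGCN][RKH][LIVMAFY]$",
    "Gram-positive cocci surface proteins anchoring hexapeptide": r"LP.TG[STGAVDE]",
}

# Hypothetical protein sequence, for illustration only.
seq = "MSTLPETGAKL"

# Collect (description, start, end, matched text) with 1-based positions.
hits = []
for name, pattern in patterns.items():
    for m in re.finditer(pattern, seq):
        hits.append((name, m.start() + 1, m.end(), m.group(0)))

for hit in hits:
    print(hit)
```

With this sequence the first pattern matches 'AKL' at positions 9 to 11, because of the '$' anchor, and the second matches 'LPETGA' at positions 4 to 9.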
Regular expressions are case-sensitive.
The pattern 'AAAA' will not match the sequence 'aaaa'.
For this reason, both your pattern and the input sequences are
converted to upper-case.
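The effect of the upper-case conversion can be sketched in Python as follows. The function search_upper is hypothetical and only mimics the behaviour described above. Note that naively upper-casing a pattern would also change any backslash escapes that contain letters, so this sketch is only suitable for simple patterns.

```python
import re

def search_upper(pattern, sequence):
    """Upper-case both the pattern and the sequence, then search.

    Hypothetical helper: only safe for patterns without letter escapes.
    """
    return re.search(pattern.upper(), sequence.upper())

# Without conversion, the case-sensitive match fails.
raw = re.search("AAAA", "aaaa")

# With both sides converted to upper-case, the match succeeds.
normalised = search_upper("AAAA", "aaaa")

print(raw)                  # None
print(normalised.group(0))  # AAAA
```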
The syntax in detail
References
None.
Warnings
Regular expressions are case-sensitive.
The pattern 'AAAA' will not match the sequence 'aaaa'.
For this reason, both your pattern and the input sequences are
converted to upper-case.
Diagnostic Error Messages
None.
Exit status
It always exits with a status of 0.
Known bugs
None.
See also
Other EMBOSS programs allow you to search for simple patterns and may be
easier for users who have never used regular expressions before:
- fuzznuc - Nucleic acid pattern search
- fuzzpro - Protein pattern search
- fuzztran - Protein pattern search after translation
Author(s)
History
Target users
Comments