preg

Function

Description

preg searches the input protein sequence(s) for matches to the supplied regular expression. A regular expression is a way of specifying an ambiguous pattern, that is, a single pattern that can match many different literal strings. The output is a standard EMBOSS report file with details of any matches.

Usage

Command line arguments


Input file format

preg reads any protein sequence USA.

Output file format

Data files

None.

Notes

The following is a short guide to regular expressions in EMBOSS:

^
use this at the start of a pattern to insist that the pattern can only match at the start of a sequence. (eg. '^M' matches a methionine at the start of the sequence)
$
use this at the end of a pattern to insist that the pattern can only match at the end of a sequence (eg. 'R$' matches an arginine at the end of the sequence)
()
groups a pattern. This is commonly used with '|' (eg. '(ACD)|(VWY)' matches either the first pattern 'ACD' or the second pattern 'VWY')
|
This is the OR operator to enable a match to be made to either one pattern OR another. There is no AND operator in this version of regular expressions.
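
The anchors, grouping and OR operator above can be tried out in a short sketch. It uses Python's `re` module as a stand-in, on the assumption that these basic constructs behave the same way there as in the regular expression library used by preg; the sequence is a made-up toy example.

```python
import re

# Hypothetical toy protein sequence, for illustration only.
seq = "MKVWYACDR"

# '^M' can only match a methionine at the start of the sequence.
starts_with_met = re.search(r"^M", seq) is not None

# 'R$' can only match an arginine at the end of the sequence.
ends_with_arg = re.search(r"R$", seq) is not None

# '(ACD)|(VWY)' matches either grouped alternative, wherever it occurs.
alternatives = [m.group(0) for m in re.finditer(r"(ACD)|(VWY)", seq)]

print(starts_with_met, ends_with_arg, alternatives)
# prints: True True ['VWY', 'ACD']
```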

The following quantifier characters specify the number of times that the character before (in this case 'x') matches:

x?
matches 0 or 1 times (ie, '' or 'x')
x*
matches 0 or more times (ie, '' or 'x' or 'xx' or 'xxx', etc)
x+
matches 1 or more times (ie, 'x' or 'xx' or 'xxx', etc)

{min,max}
Braces can enclose the specification of the minimum and maximum number of matches. A match of 'x' of between 3 and 6 times is: 'x{3,6}'
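
A minimal sketch of the quantifiers listed above, again using Python's `re` module as an assumed stand-in for the library behind preg. The helper `fully_matches` is a hypothetical name introduced only for this illustration; it checks whether a whole string, and nothing more, is matched by the pattern.

```python
import re

def fully_matches(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# 'x?' accepts zero or one 'x'.
optional = [fully_matches(r"x?", s) for s in ("", "x", "xx")]

# 'x*' accepts zero or more 'x'.
star = [fully_matches(r"x*", s) for s in ("", "x", "xxx")]

# 'x+' accepts one or more 'x'.
plus = [fully_matches(r"x+", s) for s in ("", "x", "xxx")]

# 'x{3,6}' accepts between 3 and 6 'x'; tested here on runs of 2 to 7.
braces = [fully_matches(r"x{3,6}", "x" * n) for n in range(2, 8)]

print(optional)  # [True, True, False]
print(star)      # [True, True, True]
print(plus)      # [False, True, True]
print(braces)    # [False, True, True, True, True, False]
```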

Quantifiers can follow any of the following types of character specification:

x
any literal character (eg 'A')
\x
the character after the backslash is used instead of its normal regular expression meaning. This is commonly used to turn off the special meaning of the characters '^$()|?*+[]-.'. It may be especially useful when searching for gap characters in a sequence (eg '\.' matches only a dot character '.')
[xy]
match one of the characters 'x' or 'y'. You may have one or more characters in this set.
[x-z]
match any one of the set of characters starting with 'x' and ending in 'z' in ASCII order (eg '[A-G]' matches any one of: 'A', 'B', 'C', 'D', 'E', 'F', 'G')
[^x-z]
matches any one character that is NOT in the given range of characters in ASCII order (eg '[^A-G]' matches anything EXCEPT any one of: 'A', 'B', 'C', 'D', 'E', 'F', 'G')
.
the dot character matches any single character (eg: 'A.G' matches 'AAG', 'AaG', 'AZG', 'A-G', 'A G', etc.)
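
The character specifications above can be checked with a short sketch. It assumes that Python's `re` module treats these constructs the same way as the library used by preg; all strings are made-up toy data.

```python
import re

# '\.' matches only a literal dot, e.g. a gap character in a sequence.
gap_positions = [m.start() for m in re.finditer(r"\.", "AC..G.T")]

# '[ST]' matches one character that is either 'S' or 'T'.
ser_or_thr = re.findall(r"[ST]", "MSTAT")

# '[A-G]' matches any one character from 'A' to 'G' in ASCII order.
in_range = re.findall(r"[A-G]", "AHGZC")

# '[^A-G]' matches any one character outside that range.
out_of_range = re.findall(r"[^A-G]", "AHGZC")

# 'A.G' needs exactly one character, of any kind, between 'A' and 'G'.
dot_results = [
    re.fullmatch(r"A.G", s) is not None
    for s in ("AAG", "AaG", "AZG", "A-G", "A G", "AG")
]

print(gap_positions)  # [2, 3, 5]
print(ser_or_thr)     # ['S', 'T', 'T']
print(in_range)       # ['A', 'G', 'C']
print(out_of_range)   # ['H', 'Z']
print(dot_results)    # [True, True, True, True, True, False]
```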

Combining some of these features gives these examples from the PROSITE patterns database:

'[STAGCN][RKH][LIVMAFY]$'

which is the 'Microbodies C-terminal targeting signal'.

'LP.TG[STGAVDE]'

which is the 'Gram-positive cocci surface proteins anchoring hexapeptide'.
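
As a rough illustration of how these two PROSITE patterns behave, the sketch below applies them to two invented sequences using Python's `re` module. This is not how preg is implemented and the output is not an EMBOSS report; the sequence names, the `find_hits` helper and the choice of 1-based inclusive positions are assumptions made for the example.

```python
import re

# The two PROSITE patterns quoted above.
MICROBODY_SIGNAL = re.compile(r"[STAGCN][RKH][LIVMAFY]$")
ANCHOR_HEXAPEPTIDE = re.compile(r"LP.TG[STGAVDE]")

# Hypothetical toy sequences, not real proteins.
sequences = {
    "toy1": "MAAPLKESKL",   # ends in 'SKL'
    "toy2": "MKKLPETGEEN",  # contains 'LPETGE'
}

def find_hits(pattern, seqs):
    """Return (name, start, end, matched text) with 1-based inclusive positions."""
    hits = []
    for name, seq in seqs.items():
        for m in pattern.finditer(seq):
            hits.append((name, m.start() + 1, m.end(), m.group(0)))
    return hits

microbody_hits = find_hits(MICROBODY_SIGNAL, sequences)
anchor_hits = find_hits(ANCHOR_HEXAPEPTIDE, sequences)

print(microbody_hits)  # [('toy1', 8, 10, 'SKL')]
print(anchor_hits)     # [('toy2', 4, 9, 'LPETGE')]
```

Because the first pattern ends in '$', it can only match the last three residues of a sequence, whereas the second pattern can match anywhere.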

Regular expressions are case-sensitive. The pattern 'AAAA' will not match the sequence 'aaaa'. For this reason, both your pattern and the input sequences are converted to upper-case.
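
The effect of this case conversion can be sketched as follows. The example mimics the behaviour described above by upper-casing both the pattern and the sequence before matching; it uses Python's `re` module and made-up data, and is an assumption about the observable effect rather than a description of preg's internals.

```python
import re

# Hypothetical pattern and mixed-case toy sequence.
pattern = "aaaa"
sequence = "ggAAaaTT"

# Without conversion the match is case-sensitive and fails.
raw_hit = re.search(pattern, sequence)

# Upper-casing both sides first makes the same search succeed.
converted_hit = re.search(pattern.upper(), sequence.upper())

print(raw_hit is not None)        # False
print(converted_hit is not None)  # True
print(converted_hit.start() + 1)  # 3 (1-based start of the match)
```

This simple upper-casing is only safe for plain residue letters as in the example; a pattern containing backslash escapes would need more careful handling.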

The syntax in detail

References

None.

Warnings

Regular expressions are case-sensitive. The pattern 'AAAA' will not match the sequence 'aaaa'. For this reason, both your pattern and the input sequences are converted to upper-case.

Diagnostic Error Messages

None.

Exit status

It always exits with a status of 0.

Known bugs

None.

Other EMBOSS programs allow you to search for simple patterns and may be easier for the user who has never used regular expressions before:

Author(s)

History

Target users

Comments