The EMBOSS package consists of a large number of separate programs, each with a specific function. A program usually takes one or more input files and some parameters that are important to its function, and produces output in the form of files, plots, web pages or simple text output.
A program can be invoked in several ways. Its name can be entered on the command line with all parameters, so that the program has all the information it needs at once. A more interactive way is a query-answer session, in which the user is asked to enter one piece of information at a time. A third way could be a web interface where the user chooses the options for the program using lists, checkboxes, radio buttons etc. In EMBOSS, the way a program interacts with the user, its interface, is independent of the actual program.
At the moment, EMBOSS programs are called by giving their name on the UNIX command line either with or without parameters. Many parameters can have qualifiers that will give more information about a parameter. For instance, the format of the information in a sequence file that is used as an input file could be specified on the command line, like:
% seqret filename.seq -sformat fasta
In this example the EMBOSS program 'seqret' is called with the filename 'filename.seq' as its first parameter. '-sformat fasta' indicates that the sequence file is in 'fasta' format. A complete description of the command line syntax will follow in section 2 Formal Description of the ACD language. The percentage sign '%' indicates that the command was entered on the UNIX command line. This will be used throughout the documentation.
Every EMBOSS program is accompanied by a so-called ACD (Ajax Command Definitions) file, which describes the parameters that the program needs. It contains information about the program's input and output files and any other parameters. It indicates whether any of the parameters are mandatory (like an input sequence file) or must lie within certain limits (a gap penalty for an alignment must be higher than 0, for instance). It can also indicate whether one parameter's value depends on the value or the presence of another. (An example: if the input sequence for an alignment program is DNA, it should not accept a protein comparison matrix.)
The parameters are defined in a special purpose language called Ajax Command Definitions or ACD, specially designed for EMBOSS. It will specify everything that can appear on the command line or can be used in another interface like web pages. It is a very 'forgiving' language in that it does not restrict the available syntax any more than is strictly necessary.
ACD files are simple text files that contain the definitions. The files usually have the same prefix as the program, but this is not required. ACD files use the extension '.acd'. This is mandatory.
Formalised:
token: token [ definition ]
is equivalent to
token=token [ definition ]
The first token in the file must be "application" directly followed by a colon ':' or an equal sign '='. The second token is the application name with which this ACD file is associated. The application name is followed by (required) application attributes enclosed in square brackets.
Formalised:
application: appname [ attributes ]
Example:
application: wossname [ documentation: "Finds programs by keywords" groups: "Display" ]
The first token of a parameter definition is an Ajax datatype, directly followed by a colon ':' (preferred) or equal sign '='. The second token is the name by which this parameter is going to be known (this is also the name that is used by the EMBOSS program to get the value of the parameter). After the name, definitions are in mandatory square brackets, [], which can make a definition span multiple lines.
Formalised:
datatype: parametername [ definition ]
Example:
sequence: asequence [ standard: "Y" ]
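As a minimal sketch of a definition that spans several lines, the fragment below sets two attributes on an integer parameter. The parameter name window and its values are chosen for illustration only, and lines starting with '#' are assumed to be treated as comments.

```acd
# Hypothetical parameter: a window length with a default value and a prompt.
integer: window [
  default: "10"
  information: "Window length"
]
```

Writing one attribute per line is only a layout choice; the same definition could be written on a single line between the square brackets.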
Tokens representing data types can be abbreviated up to the point where they are not ambiguous. For example, default: can be abbreviated to def: or even d: although the latter is not recommended due to lack of clarity.
Values can be delimited (i.e. treated as one token) by double quotes.
The first token of an ACD file must be the application: token, followed by the application name. The application name and the ACD filename (without the .acd extension) are usually identical, but this is not mandatory. When a program calls the embInit("program") function with "program" as its parameter, the function will only look for an ACD file called program.acd. It will not compare the parameter with the string given after the application: token.
The application: token has a documentation: attribute which is followed by a string describing the function of the program. This documentation string will be used to generate the description of the program when the program is run or the user specifies the -help qualifier. When the documentation: attribute is missing, a warning will be issued.
Formalised:
application: appname [ documentation: string ]
Example:
ACD file definition (partly):
application: seqret [ documentation: "Reads and writes (returns) a sequence" ]
Command line:
% seqret
Reads and writes (returns) a sequence
Input sequence:
The ACD file starts with the definition of the program seqret. The documentation: attribute is followed by a string briefly explaining the function of the program and this string is shown after the program is invoked and before it prompts the user for any input. The documentation: string is also searched by the wossname utility, which finds applications by keyword (in the doc string) and group.
The length of the documentation: string should be kept to 63 characters or shorter in order to allow the wossname utility to display each program name and its documentation on one 80-character line.
The documentation: string should not end with a '.' character.
Any acronyms or capitalised abbreviations in the documentation: string should be written in upper case. (e.g.: SNPs, EST, DNA, ABI, SRS, ASCII, CDS, mRNA, B-DNA, RNA, CpG, ORFs, MAR/SAR, PCR, STS, REBASE, SCOP, PROSITE, PRINTS, EMBL, TRANSFAC, AAINDEX, BLAST, GCG, EMBOSS)
The documentation: string should start with an upper-case letter.
The groups: attribute allows the EMBOSS programs to be grouped together based on their functionality. The groups: attribute is followed by a string value containing the name(s) of the group(s). When an application belongs to more than one group, the group names must be separated by either a comma (,) or semi-colon (;); i.e. the value is not a single token, but a list of group names.
The groups: string is also searched by the wossname utility, which finds applications by keyword (in the doc string) and group.
Formalised:
application: appname [ groups: "group name1, group2, ... " ]
Example: ACD file definition (partly):
application: seqret [ groups: "Display" ]
Group names can have spaces in them.
The group names can be split into sub-levels by the use of a ':' character:
First Level : Second Level
Several third-party interfaces are starting to rely upon there being a maximum of 2 levels, so do not use more than one ':' in a group name.
The group name is now checked against a list of accepted values in the file groups.standard which is defined and installed in the same directory as the ACD files. This file contains one line for each known group, with subgroups defined with a ":" delimiter, and spaces replaced by underscores. Each group also has a short description.
The table in the following section lists all groups currently defined.
The First and Second level group names are given below with some explanation of what might be expected to be placed in the group.
If a group is composed of two levels, such as
Alignment : Consensus
then the group specification must not use the group names singly (i.e. you must not use "Alignment" or "Consensus").
If the group consists of only one level, such as
Display
then please do not add sub-levels to it (i.e. you must not use "Display : Features").
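A sketch of how these group rules might look in an application definition. The application name myconsensus is hypothetical; the group names are the ones used as examples in this section, and the exact spelling that is accepted should be checked against groups.standard.

```acd
# Hypothetical application in one two-level group and one single-level group.
# The two group names are separated by a comma.
application: myconsensus [
  documentation: "Calculates a consensus from a multiple alignment"
  groups: "Alignment:Consensus, Display"
]
```

The two-level group is written in full, and the single-level group is used without a sub-level.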
You are strongly encouraged to use the following groups structure. This is the set of groups defined by the groups.standard file. We have found that most things will fit in one or more of these groups. When, however, a completely new category of program is written, please discuss the creation of the new group name with the developers' mailing list. Sometimes a new group is required (for example the group "Enzyme Kinetics" which had to be created to hold 'findkm').
ACD files describe the parameters that a program needs, in an object-oriented manner. The most important types or objects are file objects, sequence objects, number objects, Boolean objects and string objects. The current objects are listed in Table 1.
Array parameters are lists of numbers, either integer or floating point. The ACD attributes control validation, for example the number of values, or a list of numbers that adds to a given total. The data value is a list of numbers separated by spaces or commas.
Boolean parameters are simple switches. If they are entered on the command line the value will be Y (True), if they are absent from the command line the value will be the default value. The name can also be prefixed by 'no' to force the value to be N (False). This is needed if the default value is Y (True). The data value is Y for yes and N for no.
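A minimal sketch of a boolean whose default is Y, with the resulting values shown as comments. The parameter name reverse and the program name myprog are hypothetical.

```acd
# Hypothetical switch that is on by default.
boolean: reverse [
  default: "Y"
  information: "Reverse the sequence"
]

# Command line:
#   % myprog -reverse      the value is Y
#   % myprog -noreverse    the value is forced to N
#   % myprog               the value is the default, Y
```

Because the default is Y, the 'no' prefix is the only way to switch the option off from the command line.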
The integer data type can hold simple integer values. The value range can be controlled by minimum and maximum values (a minimum value of 0 or 1 is often useful).
The float data type can hold simple floating-point values. The value range can be controlled by minimum and maximum ACD attributes (a minimum value of 0.0 is often useful).
The range data type holds ranges of sequence positions. Originally defined as a simple list of paired numbers, ranges can now be specified in files with the range syntax "@filename", as pairs of numbers with text comments. For example:
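A sketch of range-limited numeric parameters. The names wordsize and gapext follow the naming conventions proposed later in this document; the limits and defaults are illustrative values, and the attribute spellings minimum: and maximum: are assumed here.

```acd
# Integer with a lower limit only.
integer: wordsize [
  default: "4"
  minimum: "2"
  information: "Word size"
]

# Float limited at both ends.
float: gapext [
  default: "0.5"
  minimum: "0.0"
  maximum: "10.0"
  information: "Gap extension penalty"
]
```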
# this is my set of ranges
12 23
4 5 this is like 12-23, but smaller
67 10348 interesting region
The string data type can hold any string value. The length can be controlled by ACD attributes, and a regular expression pattern can provide more general validation if necessary. Most string values are free text, although strings can be used by a program for any input that is not covered by a defined ACD type.
Toggle parameters are simple switches, and work in the same way as "boolean" parameters. Toggle parameters are intended for use in turning on/off other parameters. When ACD parameters are grouped in sections, a clean ACD file will have all the "required" parameters in the "required" section and all the "additional" parameters in the "additional" section. Some of these will have calculated values for the "standard" and "additional" attributes, controlled by the value of another parameter. The "toggle" parameters are designed to be used in these calculated values, and can be in the "required" section even if not themselves defined as "standard".
Exactly like "boolean" parameters, if they are entered on the command line the value will be Y (True), if they are absent from the command line the value will be the default value. The name can also be prefixed by 'no' to force the value to be N (False). This is needed if the default value is Y (True). The data value is Y for yes and N for no.
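A sketch of a toggle that controls whether two other parameters are prompted for. The parameter names are hypothetical, and the @( ) expression used to negate the toggle is not described in this section, so treat that syntax as an assumption.

```acd
# Hypothetical toggle: choose graphical or text output.
toggle: plot [
  default: "N"
  information: "Produce graphical output"
]

# Prompted for only when the toggle is on.
xygraph: graph [
  standard: "$(plot)"
]

# Prompted for only when the toggle is off.
outfile: outfile [
  standard: "@(!$(plot))"
]
```

With this arrangement the user is asked for either a graphics device or an output file, never both.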
Codon usage tables are simple files read from the EMBOSS data search path, and are distributed in the emboss/data directory.
Codon usage files can be read in several formats, including "gcg".
Cpdb (Cleaned PDB) files are simple input files in CPDB format. See the documentation for pdbparse, part of the EMBASSY domainatrix package, which generates CPDB files from PDB file input.
Datafile input refers to a formatted data file to be read from the standard EMBOSS data file locations (see the EMBOSS Administrator's guide for full details).
EMBOSS looks for data files in the local/share/EMBOSS/data directory, or in various user directories.
Most data files are already defined as their own ACD types: matrix, matrixf, codon. Others are hard-coded file names that do not need their own ACD definition, although users are free to define their own file with the appropriate name to override the default file provided.
Directory defines a directory that can be used for input or output definitions.
Directory is intended for future use to replace string definitions of directory names in some applications, and to provide additional validation of the user input specific to directory specifications.
Dirlist defines a set (list) of directories that can be used for input or output definitions.
Dirlist is intended for future use to replace string definitions of directory names in some applications, and to provide additional validation of the user input specific to directory specifications.
Discretestates is a new ACD type implemented specifically for the "phylipnew" EMBASSY package. Discretestates input is used by the phylip "discrete character" applications. By defining a specific ACD type EMBOSS can provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.
Distances is a new ACD type implemented specifically for the "phylipnew" EMBASSY package. Distances input is used by the phylip "distance matrix" applications. By defining a specific ACD type EMBOSS can provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.
Feature annotation in any known feature format. Features can also be read from a sequence and written with a sequence.
Filelist defines a set (list) of input files.
Filelist is intended for future use to replace string definitions of input file names in some applications, and to provide additional validation of the user input specific to multiple input files.
Frequencies is a new ACD type implemented specifically for the "phylipnew" EMBASSY package. Frequencies input is used by the phylip "gene frequency and continuous character" applications. By defining a specific ACD type EMBOSS can provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.
Non-sequence-related data file. This data type refers to files that are to be used in the program and usually do not contain sequence data. The type of data can be identified by a "knowntype" attribute and matched to Outfile standard types, or to report, align, featout, or seqout formats.
Comparison matrix files are used by many programs. They are data files read from the EMBOSS data search path, and are distributed in the emboss/data directory. For preference, we use the matrix files distributed with BLAST.
Integer matrices are usually faster and are preferred by most applications. Floating-point matrix files are also available if needed, and an integer matrix file can of course also be read as floating point.
The matrix data type has an attribute to force selection of a nucleic acid or protein comparison matrix. In ACD files, the type of the input sequence is often used here.
Remember that any application which uses gap penalties will need to set them separately for each matrix.
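A sketch of a matrix definition whose protein or nucleic acid choice follows the input sequence. The attribute name protein: and the variable $(acdprotein) are not named in this section and are assumptions here; the parameter name datafile is illustrative.

```acd
# Comparison matrix chosen to match the type of the input sequence.
matrix: datafile [
  additional: "Y"
  information: "Matrix file"
  protein: "$(acdprotein)"
]
```

Tying the matrix type to the sequence type is what prevents a protein matrix from being accepted for a DNA alignment.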
Floating point comparison matrices are required by some algorithms. An integer matrix file can of course be used equally well as a floating point matrix.
Pattern definitions files allow multiple search patterns to be described, each with a name.
Pattern files are used for PROSITE syntax sequence patterns. The same syntax is used for "regexp" input. Pattern files also allow mismatch values to be defined for each pattern, and a "-pmismatch" qualifier sets the mismatch default for all patterns in the file. Mismatches are not appropriate for regular expression matches.
Properties is a new ACD type implemented specifically for the "phylipnew" EMBASSY package. Properties input is used by the phylip applications to define weights, ancestral states and factors (multi-state characters). By defining a specific ACD type EMBOSS can provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.
Any regular expression value, or (new in release 4.0.0) a file containing regular expressions and names.
The length can be validated and controlled by ACD attributes. The case can be set to upper or lower case only. The regular expression must be supported by the EMBOSS regular expression library.
EMBOSS uses the "Perl-Compatible Regular Expression Library" (PCRE), so any regular expression that is valid in Perl 5.0 should be valid here.
SCOP files are simple input files in SCOP format.
USA (database reference or file) indicating a single sequence. The type of sequence can be restricted by specific attribute "type" (for example, the program should only accept DNA files). Can also read features if the "features" ACD attribute is set.
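A sketch of a single input sequence restricted to DNA and read together with its features. The value "dna" for the type attribute is an assumption about the accepted type names; the parameter name sequence follows the naming conventions proposed later in this document.

```acd
# Input sequence: DNA only, feature table read along with the sequence.
sequence: sequence [
  parameter: "Y"
  type: "dna"
  features: "Y"
]
```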
A set of single sequences that can be addressed one after another (for example a set of sequences that will be used in a multiple alignment). The type of sequence can be restricted by specific attribute "type" (for example, the program should only accept DNA files). Can also read features if the "features" ACD attribute is set.
A set of single sequences that can be used all at the same time (for example a database of some sort that is to be used for a pattern search). The type of sequence can be restricted by specific attribute "type" (for example, the program should only accept DNA files). Can also read features if the "features" ACD attribute is set.
One or more sets of single sequences that can be used all at the same time (for example a database of some sort that is to be used for a pattern search). The type of sequence can be restricted by specific attribute "type" (for example, the program should only accept DNA files). Can also read features if the "features" ACD attribute is set.
Tree is a new ACD type implemented specifically for the "phylipnew" EMBASSY package. Tree input is used by the phylip applications to define one or more phylogenetic trees. By defining a specific ACD type EMBOSS can provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options. The trees are currently parsed by phylip itself, but in the near future we will implement parsing methods in ACD processing.
Selection lists are a way to present the user with a limited list of options he/she can choose from. For the user, the difference between the list and selection data type is minimal and lies only in the way the choices are labelled. In a selection data type, the choices are numbered automatically from 1 up. In a list data type the choices can be labelled by any arbitrary text label. The user can choose one of the options by either typing the number (for a selection type) or the text of the label (for a list type) or a non-ambiguous part of the value of the choice. In practice, the list data type is much preferred for this reason.
A list of text descriptions with short labels. The user can enter one (or sometimes more) labels, or can specify partial text descriptions. The program is given a list of text labels as input.
A list of text descriptions (usually short, unlike list data), with generated numbers. The user can enter one (or sometimes more) numbers, or can specify partial text descriptions. The program is given a list of text descriptions as input. The list data type is usually preferred.
An output file for sequence alignments. Defined in the same way as a plain text "Outfile" but with extra qualifiers to allow a choice of alignment formats, and attributes to specify whether the alignment will have 2 or more sequences (which limits the possible formats). The data is stored as sequences, so the available formats include the most common sequence formats.
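A sketch contrasting the two menu types. The attribute names values:, delimiter: and codedelimiter: are not given in this section and are assumptions here, as are the parameter names and choices.

```acd
# list: each choice has a short label, then a description.
list: method [
  default: "nj"
  values: "nj:Neighbour joining;upgma:UPGMA"
  delimiter: ";"
  codedelimiter: ":"
  information: "Clustering method"
]

# selection: choices are numbered automatically from 1.
selection: order [
  values: "alphabetical;by length"
  delimiter: ";"
  information: "Sort order"
]
```

In the list, the program receives the label (for example "nj"); in the selection, it receives the text of the chosen description.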
Feature annotation in any known feature format. Can also be stored with the sequence if the sequence output "features" attribute is set.
Output file containing codon usage data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Output file containing cleaned PDB protein structure data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Output file containing cleaned formatted data as tables or lists. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Multiple outdata definitions are by default appended to a single file. The individual ACD definitions allow the format of each file section to be defined.
Output directory for multiple output files to be written. Specifying an outdir allows other properties to be defined, including the default file extension with the "extension" attribute.
Output file containing phylogenetics discrete characteristics data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Output file containing phylogenetics distance matrix data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Non-sequence-related data file, usually plain text. This data type refers to files that are to be produced by the program and usually do not contain sequence data. The type of data can be identified by a "knowntype" attribute and matched to an Infile standard type for use as input to another program.
Non-sequence-related data files, usually plain text. This data type refers to files that are to be produced by the program and usually do not contain sequence data. The type of data can be identified by a "knowntype" attribute and matched to an Infile standard type for use as input to another program.
Output file containing phylogenetics character frequency data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Output file containing integer comparison matrix data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Output file containing floating point comparison matrix data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Output file containing phylogenetics property data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Output file containing SCOP protein domain data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
Output file containing phylogenetic tree data. The default data format can be specified by an "oformat" attribute which the -oformat associated qualifier can override.
An output file for sequence annotation. Defined in the same way as a plain "Outfile" but with extra qualifiers to allow a choice of report formats. Report data is stored internally as a feature table, so the available formats include the most common feature formats.
USA (database reference or file) indicating a single sequence. Can also write features if the "features" ACD attribute is set.
The default file extension is the sequence format, but can be specifically set with the "osextension" attribute, for example where applications produce two or more sequence outputs.
A set of single sequences to be written to a single file. Can also write features if the "features" ACD attribute is set.
The default file extension is the sequence format, but can be specifically set with the "osextension" attribute, for example where applications produce two or more sequence outputs.
A set of single sequences stored in memory together, usually a multiple sequence alignment. Can also write features if the "features" ACD attribute is set.
The default file extension is the sequence format, but can be specifically set with the "osextension" attribute, for example where applications produce two or more sequence outputs.
For graphical output of any general kind, including dotplots. The data value is the graphics device, as specified by the "PLPLOT" graphics library used in EMBOSS at present. Example values include "ps" for Postscript, "png" for PNG files, and "X11" for X-Windows. A value of "?" in answer to the prompt will list the available graphics devices on your installation.
For graphical output as a simple two dimensional (2D) XY plot with the sequence along the x-axis. The data value is the graphics device, as specified by the "PLPLOT" graphics library used in EMBOSS at present. Example values include "ps" for Postscript, "png" for PNG files, and "X11" for X-Windows. A value of "?" in answer to the prompt will list the available graphics devices on your installation.
ACD objects have mandatory names.
Formalised:
datatype: parametername [ ]
Example:
sequence: asequence [ ]
This defines asequence to be the name of a sequence object.
In order to assign a value to a parameter, the name of the parameter can be specified on the command line (in a number of ways, see section 4) followed by a value that is appropriate for that data type.
Example:
ACD file definition (partly):
sequence: asequence [ ]
Command line :
% acddemo -asequence filename.seq
This defines filename.seq to be the value of the parameter named asequence for the EMBOSS program acddemo.
If a parameter is defined with a special parameter attribute (parameter: "Y"), using the name of the parameter on the command line is not mandatory (see section 3.4). This is commonly used for input data and for output filenames.
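A sketch of two parameters that can be given without their names. The program name acddemo and the parameter asequence come from the example above; the outfile parameter and the file names are hypothetical.

```acd
# Both values may be given on the command line without a qualifier name.
sequence: asequence [
  parameter: "Y"
]
outfile: outfile [
  parameter: "Y"
]

# Command line, names omitted:
#   % acddemo filename.seq results.out
# Command line, names given:
#   % acddemo -asequence filename.seq -outfile results.out
```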
The name of an object is also used, in the EMBOSS program, to refer to the value of the parameter. After the initiation call using the EMBOSS function embInit(), the values of the parameters have been read in and checked (see 1.4). The program must then assign the parameter to an actual EMBOSS object, like sequence (AjPSeq), string (AjPStr) etc. The actual function calls are beyond the scope of this document, and the reader is referred to the AJAX documentation (http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz?-fun+pagelibinfo+-info+EDATA for the SRS searchable Object documentation), but some examples can be found in section 1.4 and 1.5.
The name can also be used in the definition of other ACD parameters. The value of the parameter (or variable) is retrieved using the dollar sign '$' followed by the name of the parameter enclosed in a pair of parentheses.
Formalised:
$(parametername)
Example:
integer: gappenalty [ standard: Y default: 10 ]
integer: gapextpenalty [ default: $(gappenalty) ]
This defines the default for parameter gapextpenalty as the value of parameter gappenalty.
Naming conventions
Although everybody is free to use any (valid) name for a parameter, we would like to propose a naming convention, to streamline the development of ACD files.
Name | Datatype | Usage
sequence | sequence | primary input sequence, generally required
outseq | outseq | primary output sequence, generally required, generally should default to the primary input sequence name, extension defaults to the name of the output sequence format
outfile | outfile | primary output non-sequence results file, generally required. The file extension should be allowed to default to the application name
data | infile | primary auxiliary input data file, generally optional
minlen | int | minimal length of sequence feature to be found
maxlen | int | maximum length of sequence feature to be found
wordsize | int | word size for hash tables etc. generally minimum=2 for protein, 4 for DNA
window | int | window length for calculating dotplots/features/etc.
shift | int | amount by which window is shifted in each iteration
consensus | bool | flag for whether consensus sequence should be output
gap | float | gap penalty
gapext | float | gap extension penalty
from | int | position of start of input sequence to specify for an operation (e.g. deletion), defaults to start of sequence, minimum value = 1, maximum value = sequence length
to | int | position of end of input sequence to specify for an operation (e.g. deletion), defaults to the 'from' value, minimum value = 'from' value, maximum value = sequence length
threshold | float/int | threshold for various operations
left | bool | operation should be done at the start of the sequence
right | bool | operation should be done at the end of the sequence
pattern | string | pattern to search for in sequence
patterns | infile | file of patterns to search for in sequence
Table 3. Recommended naming conventions.
There are two types of attributes for parameters. 'Global' attributes can be defined for any ACD data type. Each data type then has its own set of 'specific' attributes. These definitions can refer to 'calculated' attributes generated automatically by ACD processing. The 'global' and 'specific' attributes are part of the parameter definition and are placed between the square brackets.
Formalised:
datatype: parametername [ attribute: "value" ]
Attributes can specify the default value for a parameter and the requirements for a correct value. They can specify whether the parameter is mandatory and what the limits are for a valid value. There are global attributes that apply to all data types and there are data type-specific attributes.
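A sketch that combines global attributes, specific attributes and a calculated attribute in one definition. The calculated attribute $(sequence.length) is an assumption, since this section does not list the calculated attributes; the parameter names follow the naming conventions table.

```acd
sequence: sequence [
  parameter: "Y"
]

# default: and information: are global attributes.
# minimum: and maximum: are specific to the integer type.
# $(sequence.length) is assumed to be calculated from the input sequence.
integer: from [
  default: "1"
  minimum: "1"
  maximum: "$(sequence.length)"
  information: "Start position"
]
```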
default:
Defines the default value for the parameter, which can be dependent on the values of parameters defined earlier.
Each data type has a default value, which can be valid (for example a boolean will default to "N") or invalid (many input types will default to an empty string).
information:
The string giving information about the parameter, for use on Web forms and in GUIs, and also as the default prompt to the user.
For some data types (sequence is a good example) there are standard prompts so no value is expected, and the acdvalid utility will issue a warning if an information attribute is found.
parameter:
Defines a parameter on the command line which can appear without a qualifier name. Also implies that the value is required and will be prompted for if missing.
standard:
Indicates whether a parameter is mandatory and will be prompted for if missing.
additional:
Indicates if the parameter should be queried for when the -options qualifier is set on the command line.
help:
The string shown when the -help qualifier is used on the command line.
Help is usually only defined if a specific string is needed. If help is not defined, the value of the "information" attribute, or the default prompt, will be used.
expect:
A string used in the "Default" column of the command line syntax table in the documentation. This table is automatically generated from the ACD file, and in most cases there is a reasonable value generated. Where there is no suitable value, this attribute should be used to provide one.
valid:
A string used in the "Allowed values" column of the command line syntax table in the documentation. This table is automatically generated from the ACD file, and in most cases there is a reasonable value generated. Where there is no suitable value, this attribute should be used to provide one.
knowntype:
The knowntype attribute defines one of a controlled vocabulary of known value types. Some ACD data types require a knowntype attribute.
These standard values are read from a file knowntypes.standard which is stored and installed in the ACD file directory. A few other values are accepted, for example "(programname) output" for an outfile data type. These are documented under each output type. The acdvalid utility will check all knowntype values in an ACD file, and report any missing values for data types that require a knowntype.
prompt:
The string used if the user has to be queried for a value, though information can be used instead and usually only one will be defined. information is preferred.
missing:
Indicates whether a qualifier can have no value, especially when it appears on the command line (for example to override a default value in the ACD file).
needed:
Indicates whether a parameter is expected to be included in a GUI form. Some parameters are available on the command line, but are not generally useful to users, or can cause confusion when presented in a GUI form with all other options.
outputmodifier:
Indicates that this qualifier modifies the output in ways that can break parsers, for example by changing text output into HTML. Authors of wrappers can use this to test for qualifiers that can be hardcoded to fix the output syntax and content. Please let the EMBOSS team know if any other qualifiers are candidates for marking as output modifiers.
code:
A code word (no spaces) which is searched for in the file codes.english to give a standard prompt, for example when asking for an alignment gap penalty. The standard default prompts are in the same file. The code word is not case-sensitive. information is preferred.
comment:
A comment, provided for use by the EBI's SoapLab project but not defined in the standard ACD files.
style:
Provided for use by the EBI's SoapLab project but not defined in the standard ACD files.
Any global or specific attribute must have a second token representing the value of the attribute. The attribute must be followed by a colon ':' and usually the value will be enclosed in double quotes.
The syntax of the global attributes is:
Formalised:
help: "String"
information: "String"
default: "value"
additional: "Y"/"N"
parameter: "Y"/"N"
standard: "Y"/"N"
Example:
sequence: asequence [ standard: "Y" information: "Enter filename" ]
The parameter: attribute is a boolean attribute, defining the order of the parameters on the command line, if the parameter name is not explicitly entered on the command line. If set to Y, the parameter can be entered on the command line without using the parameter name.
Formalised:
datatype: parametername [ parameter: Y/N ]
Example:
ACD file definition (partly) :
application: acddemo [ documentation: "" groups: "" ]
sequence: asequence [ ]
Command line :
% acddemo -asequence filename.seq
Is equivalent to:
ACD file definition (partly) :
sequence: asequence [ parameter: Y ]
Command line:
% acddemo filename.seq
In both examples filename.seq is the value of the parameter named asequence for the EMBOSS program acddemo.
The second example will also allow the command line from the first, as parameter names are accepted as qualifiers.
If more than one parameter: attribute is used, the order in which the parameters appear in the ACD file is the same as the order in which their values appear on the command line.
Example: ACD file definition (partly) :
application: acddemo [ documentation: "" groups: "" ]
sequence: asequence [ parameter: Y ]
outseq: outseq [ parameter: Y ]
Command line :
% acddemo infilename.seq outfilename.seq
will assign the name infilename.seq to parameter asequence, and outfilename.seq to parameter outseq.
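The ordering rule can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the ACD parser itself: the function name assign_positional is hypothetical, and the handling of a mix of named and unnamed values (unnamed values fill the remaining parameter slots in ACD order) is an assumption.

```python
def assign_positional(acd_params, argv):
    """Map command-line tokens to ACD parameter names.

    acd_params: list of (name, is_parameter) pairs in ACD file order.
    argv: command-line tokens after the program name.
    """
    positional = [name for name, is_param in acd_params if is_param]
    names = {name for name, _ in acd_params}
    values = {}
    free = []  # values given without a qualifier name
    i = 0
    while i < len(argv):
        token = argv[i]
        if token.startswith("-") and token[1:] in names:
            # Named form: -asequence filename.seq
            if i + 1 >= len(argv):
                raise ValueError(f"qualifier {token} has no value")
            values[token[1:]] = argv[i + 1]
            i += 2
        else:
            free.append(token)
            i += 1
    # Assumption: unnamed values fill the still-empty parameter slots
    # in the order the parameters appear in the ACD file.
    slots = [name for name in positional if name not in values]
    if len(free) > len(slots):
        raise ValueError("too many unnamed values on the command line")
    values.update(zip(slots, free))
    return values


params = [("asequence", True), ("outseq", True)]
print(assign_positional(params, ["infilename.seq", "outfilename.seq"]))
```

Reversing the two definitions in the ACD file would reverse the meaning of the two unnamed values, which is why the order of parameter: definitions matters.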
Any program is expected to have one or more required inputs. An ACD data type that is defined as a "parameter:" (see section 2.4.1.1.1) is automatically counted as required. All other required inputs should have the "standard:" attribute set.
When the program runs, the user will be prompted for any "required" values that are not already on the command line.
The only difference between "parameter:" and "standard:" is that a "parameter" can appear on the command line as the simple value with no name, to provide simple command lines.
When the additional: attribute is set, the user is queried for the parameter only if the -options qualifier is set, whether on the command line, as a system default through an environment variable (see 3.7), or in any other way. If the -options qualifier is not set and the parameter is omitted (not given on the command line or in any other way), the user will not be queried for it.
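The prompting rules for parameter:, standard: and additional: can be summarised as a small decision function. The following Python sketch is an illustration under stated assumptions: should_prompt is a hypothetical name, attributes are modelled as a plain dictionary of "Y"/"N" strings, and calculated attribute values are not considered.

```python
def should_prompt(attrs, given, options=False):
    """Return True if the user would be asked for this value.

    attrs: dict of ACD attribute names to "Y"/"N" strings.
    given: True if the value was already supplied (for example on the command line).
    options: True if the -options qualifier is in effect.
    """
    if given:
        return False
    # parameter: implies required, exactly like standard:
    if attrs.get("parameter") == "Y" or attrs.get("standard") == "Y":
        return True
    # additional: values are only asked for when -options is set
    if attrs.get("additional") == "Y":
        return options
    return False


print(should_prompt({"standard": "Y"}, given=False))
print(should_prompt({"additional": "Y"}, given=False))
print(should_prompt({"additional": "Y"}, given=False, options=True))
```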
The information: attribute defines the text hint to the user entering a data value. The same text is intended for use in the prompt to the user at a terminal, and as the text in an HTML form or a GUI.
In rare cases where the information: string is misleading, a prompt: string can be defined for use as a terminal prompt. For general use, information: is now preferred.
To provide standard prompts for common ACD data, there are default information: strings for most data types. These can be found in the file codes.english with the names DEFXXXX where XXXX is the name of the ACD data type.
Common practice is to use the default prompt for input and output ACD data types.
The help: attribute is shown in the help information when the user requests assistance using the -help qualifier on the command line, or when help is requested in another format (for example, a Web page).
Again, there is a default help string in the codes.english file with the name HELPXXXX where XXXX is the name of the ACD data type.
The codes.english file includes some additional standard prompts such as GAP for gap penalties. These prompts can be used with the code: attribute, for example code: "GAP", but GUI developers found these hard to use, so we have replaced them with normal information: attributes.
The default set of attributes is available for all ACD data type definitions.
Each ACD type has its own set of specific attributes, summarized in Table 1 and described in more detail below.
Formalised:
The value for an array is a set of floating point numbers separated by white space or commas. The size: attribute sets the number of elements in the array. As for the float data type, the minimum: and maximum: attributes define the lower and upper value limits and default to the boundaries as specified by the system's set-up. For validation purposes, the sum: attribute defines the total for all values in the array (tested unless the sumtest: attribute is false), and the tolerance: attribute specifies how closely the sum should match the total. Remember that most floating point fractions cannot be represented accurately in binary form.
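The array checks described above can be sketched as follows. This is an illustrative Python model, not the ACD implementation: validate_array is a hypothetical name, and the default values used here for the total and the tolerance are placeholders chosen for the sketch.

```python
import re


def validate_array(text, size, minimum=float("-inf"), maximum=float("inf"),
                   total=1.0, tolerance=0.01, sumtest=True):
    """Parse and check an array value given as text."""
    # Values may be separated by white space, commas, or both.
    values = [float(tok) for tok in re.split(r"[\s,]+", text.strip()) if tok]
    if len(values) != size:
        raise ValueError(f"expected {size} values, found {len(values)}")
    for v in values:
        if not minimum <= v <= maximum:
            raise ValueError(f"value {v} is outside {minimum}..{maximum}")
    # Compare the sum with a tolerance rather than exactly, because
    # most decimal fractions are not exact in binary floating point.
    if sumtest and abs(sum(values) - total) > tolerance:
        raise ValueError(f"values sum to {sum(values)}, expected {total}")
    return values


print(validate_array("0.1, 0.2 0.7", size=3))
```

The tolerance is what makes a value such as "0.1, 0.2 0.7" acceptable even when the computed sum differs from the total in the last binary digit.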
Although there are (currently) no specific attributes for a boolean ACD type, care should be taken over the definition of the information: and help: attributes. These are used to prompt the user (interactively or via a GUI), and to provide help text. The text provided in each case should reflect the expected default value of the boolean option, which may be the opposite of what the name implies. For example, if set to "Y" by default, then the command line option would typically be "-noxxx" where "xxx" is the qualifier. If set to "N" by default, then the default action may be the opposite of what the information or help text implies. If the value is calculated, the user may need some extra guidance.
The outputmodifier: attribute is set where this parameter changes the content or syntax of the output. This is provided for the developers of other interfaces and parsers of EMBOSS output so that they can fix the value, or provide parsers for each alternative.
The minimum: and maximum: attributes define the lower and upper value limits and default to the boundaries as specified by the system's set-up.
The increment: attribute defines the steps that this parameter is allowed to take, in case there is a need to iterate this parameter. The increment: attribute can be any valid float value.
The precision: attribute defines the maximum number of significant decimal places that will be taken into account for this value.
The integer data type can hold simple integer values. The minimum: and maximum: attributes define the boundaries and default to the boundaries as specified by the system's setup. The increment: attribute defines the steps that this parameter is allowed to take, in case there is a need to iterate this parameter.
Sequence ranges have similar attributes to integers. The minimum: and maximum: attributes define the boundaries and default to the boundaries as specified by the system's setup. The minlength: attribute defines the minimum number of values required.
The size: attribute defines an exact number of values required. The minsize: attribute defines a minimum number of values required for ranges that can be any length. Only one of these values should be defined for any range.
The value provided by the user is a list of sequence position pairs to be interpreted by the application. The upper and lower bounds (sequence positions can be negative to count back from the end) will depend on the length of the sequence to which they are applied.
The minlength: attribute defines the minimum length the string must be; the maxlength: attribute defines the maximum length the string can be. The default minimum length is zero. There is no default maximum.
The pattern: attribute defines a regular expression used to check the string value. ACD uses the Perl-compatible regular expression library (PCRE) so any Perl-compatible regular expression should be usable.
The word: attribute requires the result to be a valid word with no whitespace. The default minimum length of zero allows an empty string but this is not accepted as a word. This may change in future.
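The string checks can be modelled with a short Python function. This is a sketch under assumptions: validate_string is a hypothetical name, Python's re module stands in for PCRE (the two agree on common syntax but are not identical), and the pattern is applied as an unanchored search.

```python
import re


def validate_string(value, minlength=0, maxlength=None, pattern=None, word=False):
    """Check a string value against the string-specific attributes."""
    if len(value) < minlength:
        raise ValueError(f"shorter than minlength {minlength}")
    if maxlength is not None and len(value) > maxlength:
        raise ValueError(f"longer than maxlength {maxlength}")
    # Assumption: an unanchored search, so the pattern must supply
    # its own ^ and $ if the whole string is to be matched.
    if pattern is not None and re.search(pattern, value) is None:
        raise ValueError("does not match pattern")
    # An empty string passes the default minlength of zero,
    # but is still not accepted as a word.
    if word and (not value or any(ch.isspace() for ch in value)):
        raise ValueError("not a single word")
    return value


print(validate_string("AB1", minlength=1, maxlength=5, pattern=r"^[A-Z]+\d$"))
```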
Although there are (currently) no specific attributes for a toggle ACD type, care should be taken over the definition of the information: and help: attributes. These are used to prompt the user (interactively or via a GUI), and to provide help text. The text provided in each case should reflect the expected default value of the toggle option, which may be the opposite of what the name implies. For example, if set to "Y" by default, then the command line option would typically be "-noxxx" where "xxx" is the qualifier. If set to "N" by default, then the default action may be the opposite of what the information or help text implies. If the value is calculated, the user may need some extra guidance.
The outputmodifier: attribute is set where this parameter changes the content or syntax of the output. This is provided for the developers of other interfaces and parsers of EMBOSS output so that they can fix the value, or provide parsers for each alternative.
Formalised:
Codon usage tables are species-specific, and in some cases specific to a class of genes within a species. This makes it useful to specify a default value for a codon usage table name. Internally, a default is set in the ACD source code. Usually this is "Ehum.cut", the human codon usage table provided in the EMBOSS distribution.
Individual codon inputs can set their own default names with the name: attribute which in the current version has the same effect as setting the default: attribute.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a data file. The application must be able to accept a null value for this qualifier.
Cleaned PDB file input has a default value (typically "1azu") set in the ACD source code although this is not really a good idea.
Individual cpdb inputs can set their own default names with the name: attribute which in the current version has the same effect as setting the default: attribute.
The default datafile name is defined by two ACD attributes, name: and extension:. The directory: attribute defines the EMBOSS data subdirectory to be searched.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a data file. The application must be able to accept a null value for this qualifier.
The extension: attribute sets the extension for all files read from the directory. Files with other extensions will not be read.
The fullpath: attribute can be used to require a full rather than a relative path specification for a directory.
If a null value (the current directory) is allowed, the nullok: attribute must be set true.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a directory. The application must be able to accept a null value for this qualifier.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no directory) as the default for programs where a directory is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
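The interplay of nullok: and nulldefault: is easier to follow as a small table of cases. The Python sketch below models only what is described above; resolve_value is a hypothetical name, the missing: attribute and the -noxxx form are not modelled, and an explicit empty string on the command line is represented by given="".

```python
def resolve_value(standard_default, given=None, nullok=False, nulldefault=False):
    """Work out the value used for a qualifier.

    standard_default: the value ACD would normally generate.
    given: None if the qualifier is absent, "" for an explicit
           empty string, otherwise the string supplied by the user.
    """
    if given is None:
        # nulldefault: replaces the generated default with an empty string.
        return "" if nulldefault else standard_default
    if given == "":
        if nulldefault:
            # With nulldefault:, an empty string turns the qualifier "on"
            # and the standard default is generated instead.
            return standard_default
        if nullok:
            return ""
        raise ValueError("an empty value is only accepted when nullok: is set")
    return given


print(repr(resolve_value("./out", nulldefault=True, nullok=True)))
print(repr(resolve_value("./out", given="", nulldefault=True, nullok=True)))
```

The same reading applies to the other data types in this section that list nullok: and nulldefault: attributes.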
The extension: attribute sets the extension for all files read from the directories. Files with other extensions will not be read.
The fullpath: attribute can be used to require a full rather than a relative path specification for a directory.
If a null value (the current directory) is allowed, the nullok: attribute must be set true.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a directory. The application must be able to accept a null value for this qualifier.
The discretestates data type can be replaced by a simple input file in GUIs, with the user required to provide the correct data format.
The attributes define characteristics required for Phylip programs.
The length: attribute defines the number of state values (the length of the discrete characters string) in each set.
The size: attribute defines the number of sets of values, usually 1 but some programs will accept multiple sets.
The characters: attribute defines which discrete state characters can be specified. This is defined as a string containing all possible characters.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a discretestates file.
The distances data type can be replaced by a simple input file in GUIs, with the user required to provide the correct data format.
The attributes define characteristics required for Phylip programs. The distance matrices accepted by ACD include all the formats read by Phylip, with automatic interconversion.
The length: attribute defines the number of rows in the distance matrix.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a distance file.
The type: attribute defines whether the feature input is "protein" or "nucleotide". There is a default based on the type of any input sequence, but a value should always be specified.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without features input. The application must be able to accept a null value for this qualifier.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no feature input) as the default for programs where feature input is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
Filelist is equivalent to infile, but allows the user to specify one or more input files.
The nullok: attribute specifies that a missing input file is acceptable to the application, and that -noxxx can be used on the command line to avoid reading the default input file (if any).
The frequencies data type can be replaced by a simple input file in GUIs, with the user required to provide the correct data format.
The attributes define characteristics required for Phylip programs. The frequencies files formats accepted by ACD include all the formats read by Phylip, with automatic interconversion.
The length: attribute defines the number of loci (or values) in the frequencies file.
The size: attribute defines the number of sets of values, usually 1 but some programs will accept multiple sets.
The continuous: attribute specifies a frequencies file with continuous character data values.
The genedata: attribute specifies a frequencies file with genetic locus data values.
The within: attribute specifies a frequencies file with continuous data for multiple individuals (additional values on each line).
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a frequencies file.
The nullok: attribute specifies that a missing input file is acceptable to the application, and that -noxxx can be used on the command line to avoid reading the default input file (if any).
The trydefault: attribute specifies that the default filename may not exist. If nullok: is also defined as true then no error is reported.
The protein: attribute will determine if the scoring matrix is used as a DNA or Protein matrix.
The protein: attribute will determine if the scoring matrix is used as a DNA or Protein matrix.
Patterns are processed by an internal set of library functions designed to handle PROSITE-style pattern definitions.
The minlength: attribute defines the minimum length the pattern string must be; the maxlength: attribute defines the maximum length the pattern string can be.
The upper: and lower: attributes convert an input pattern to upper or lower case before compiling.
The type: attribute describes the pattern as applying to nucleotide or protein sequence. Nucleotide patterns are compared in both directions.
The properties data type can be replaced by a simple input file in GUIs, with the user required to provide the correct data format.
The attributes define characteristics required for Phylip programs. The properties files accepted by ACD include all the formats read by Phylip, with automatic interconversion.
The length: attribute defines the number of values in the properties file.
The size: attribute defines the number of sets of values, usually 1 but some programs will accept multiple sets.
The characters: attribute defines which property characters can be specified. This is defined as a string containing all possible characters.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a properties file.
Regular expressions are processed by the "Perl-Compatible Regular Expression Library" (PCRE). Any value must be accepted by this library's compilation function. Some additional attributes are provided for further validation by ACD.
The minlength: attribute defines the minimum length the regular expression string must be; the maxlength: attribute defines the maximum length it can be.
The upper: and lower: attributes convert an input regular expression to upper or lower case before compiling.
The type: attribute describes the pattern as applying to nucleotide or protein sequence. Nucleotide patterns are compared in both directions.
Scop file input has a default value (typically "d3sdha") set in the ACD source code although this is not really a good idea.
Individual scop inputs can set their own default names with the name: attribute which in the current version has the same effect as setting the default: attribute.
The type: attribute will force the sequence to be of the given type. By default, any sequence type is accepted.
We recommend always defining the type: attribute so that the accepted input sequence type is always clear.
If the features: attribute is set, the sequence input will include feature information either in the same file (if the sequence format supports it) or in a separate file (by default in GFF format).
If the entry: attribute is set, the sequence input will include the full original text of the input sequence or database entry.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a sequence input. The application must be able to accept a null value for this qualifier.
The sask: attribute sets the default for the -sask qualifier, and if set to "Y" specifies that the program will prompt the user for a sequence begin and end position, and prompt for the reversing of a nucleotide sequence. The EMBOSS "yank" program works with fragments of sequences, and uses the sask: attribute to prompt the user.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no sequence input) as the default for programs where sequence input is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The type: attribute will force the sequence(s) to be of the given type. By default, any sequence type is accepted.
We recommend always defining the type: attribute so that the accepted input sequence type is always clear.
If the features: attribute is set, the sequence input will include feature information either in the same file (if the sequence format supports it) or in a separate file (by default in GFF format).
If the entry: attribute is set, the sequence input will include the full original text of the input sequence or database entry.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a sequence input. The application must be able to accept a null value for this qualifier.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no sequence input) as the default for programs where sequence input is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The minseqs: attribute specifies a minimum number of sequences to be read. By default, a single sequence is acceptable.
The type: attribute will force the sequence set to be of the given type. By default, any sequence type is accepted.
We recommend always defining the type: attribute so that the accepted input sequence type is always clear.
The aligned: attribute, if true, specifies that all sequences in the input are expected to be aligned. If false, then the sequences are assumed to be unaligned, and are simply read into memory together for processing. We recommend always defining the aligned: attribute so that the nature of the sequence set is clearly defined.
If the features: attribute is set, the sequence input will include feature information either in the same file (if the sequence format supports it) or in a separate file (by default in GFF format).
If the entry: attribute is set, the sequence input will include the full original text of the input sequence or database entry.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a sequence input. The application must be able to accept a null value for this qualifier.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no sequence input) as the default for programs where sequence input is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The minseqs: attribute specifies a minimum number of sequences to be read. By default, a single sequence is acceptable.
The type: attribute will force the sequence set(s) to be of the given type. By default, any sequence type is accepted.
We recommend always defining the type: attribute so that the accepted input sequence type is always clear.
The aligned: attribute, if true, specifies that all sequences in the input are expected to be aligned. If false, then the sequences are assumed to be unaligned, and are simply read into memory together for processing. We recommend always defining the aligned: attribute so that the nature of the sequence set is clearly defined.
If the features: attribute is set, the sequence input will include feature information either in the same file (if the sequence format supports it) or in a separate file (by default in GFF format).
If the entry: attribute is set, the sequence input will include the full original text of the input sequence or database entry.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a sequence input. The application must be able to accept a null value for this qualifier.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no sequence input) as the default for programs where sequence input is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The minseqs: attribute specifies a minimum number of sequences to be read for each set. By default, a single sequence is acceptable.
The tree data type can be replaced by a simple input file in GUIs, with the user required to provide the correct data format.
The attributes define characteristics required for Phylip programs. The tree files accepted by ACD include all the formats read by Phylip, with automatic interconversion.
The size: attribute defines the number of trees in the input file, usually 0 but some programs will accept multiple sets. Some can only accept a single tree (so the value should be set to "1" for these).
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a tree file.
Formalised:
For both selection list types, the values that the user can choose from are defined in the values: attribute as a string, delimited by the character that is given by the delimiter: attribute (which defaults to the semi-colon ';'). For the list data type there is a second delimiter character (codedelimiter:) that separates the label from the value (defaults to the colon ':'). The minimum: and maximum: attributes define the number of choices that this parameter can handle. The header: attribute holds the text that is displayed above the option list. The casesensitive: attribute indicates whether the options are case sensitive, but the value of the parameter will be exactly what the list value is. The button: attribute, which can be either Y(es) or N(o), is used for web front ends to indicate whether radio buttons, checkboxes or selection lists are to be used, or whether the list is simply displayed with a text entry box beneath it in which to enter the option.
The values: attribute contains the list of valid code names and values. The delimiter: and codedelimiter: attributes specify how to parse this string into individual list items.
The minimum: attribute specifies the minimum number of selections required. By default, 1 selection is required.
The maximum: attribute specifies the maximum number of selections allowed. By default the maximum is 1, so exactly 1 selection is required. A higher value allows multiple selections.
The header: attribute defines text to appear before the list is presented to the user. The information: attribute defines text to be used as a prompt after the list.
The delimiter: attribute specifies the character used in the values: string to separate list items.
The codedelimiter: attribute specifies the character used in the values: string to separate codes (names) and descriptions of list items.
The button: attribute suggests whether a list is best represented as checkboxes or radio buttons in an interface (value "Y") or as a pull-down list.
The casesensitive: attribute defines whether the input must match the exact case of the selection list item.
Example:
list: matrix [
    default: "blosum"                 # default value
    minimum: 1 maximum: 1             # must select exactly 1
    header: "Comparison matrices"     # printed before list
    values: "B:blosum, P:pam, I:id"   # 3 valid values
    delim: ","                        # delimiter default ";"
    codedelim: ":"                    # label delimiter default ":"
    prompt: "Select one"              # prompt after list
    button: Y                         # use radio buttons rather than
                                      # checkboxes in HTML,
                                      # ignored by ACD
]
What you get is:
Comparison matrices
         B : blosum
         P : pam
         I : id
Select one [blosum] : PAM
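How the values: string is split by the two delimiters can be sketched in Python. This is an illustration only: parse_list_values and choose are hypothetical names, surrounding white space is assumed to be trimmed, and the sketch returns the matching (code, label) pair rather than asserting which of the two ACD passes to the application.

```python
def parse_list_values(values, delimiter=";", codedelimiter=":"):
    """Split a values: string into (code, label) pairs."""
    items = []
    for entry in values.split(delimiter):
        if not entry.strip():
            continue  # ignore an empty entry, e.g. after a trailing delimiter
        code, _, label = entry.partition(codedelimiter)
        items.append((code.strip(), label.strip()))
    return items


def choose(items, answer, casesensitive=False):
    """Find the list item whose code or label matches the user's answer."""
    for code, label in items:
        if casesensitive:
            hit = answer in (code, label)
        else:
            hit = answer.lower() in (code.lower(), label.lower())
        if hit:
            return code, label
    raise ValueError(f"{answer!r} is not in the list")


items = parse_list_values("B:blosum, P:pam, I:id", delimiter=",")
print(items)
print(choose(items, "PAM"))
```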
The values: attribute contains the list of valid values. The delimiter: attribute specifies how to parse this string into individual selection list items.
The minimum: attribute specifies the minimum number of selections required. By default, 1 selection is required.
The maximum: attribute specifies the maximum number of selections allowed. By default the maximum is 1, so exactly 1 selection is required. A higher value allows multiple selections.
The header: attribute defines text to appear before the selection list is presented to the user. The information: attribute defines text to be used as a prompt after the list.
The delimiter: attribute specifies the character used in the values: string to separate list items.
The button: attribute suggests whether a selection list is best represented as checkboxes or radio buttons in an interface (value "Y") or as a pull-down list.
The casesensitive: attribute defines whether the input must match the exact case of the selection list item.
Example:
select: matrix [
    default: "blosum"                 # default value
    minimum: "1" maximum: "1"         # must select exactly 1
    header: "Comparison matrices"     # printed before list
    values: "blosum, pam, id"         # valid values
    delimiter: ","                    # delimiter default ";"
    information: "Select one"         # prompt after list
    button: "Y"                       # use radio buttons rather than
                                      # checkboxes in HTML,
                                      # ignored by ACD
]
What you get is:
Comparison matrices
         1 : blosum
         2 : pam
         3 : id
Select one [blosum] : PAM
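The minimum: and maximum: counts and the case handling for a selection list can be modelled as below. This Python sketch is hypothetical (parse_selection_values and select are not EMBOSS functions) and matches answers by full name only. It returns the item as written in values:, following the statement above that the parameter value is exactly the list value.

```python
def parse_selection_values(values, delimiter=";"):
    """Split a values: string into selection items, trimming white space."""
    return [item.strip() for item in values.split(delimiter) if item.strip()]


def select(items, answers, minimum=1, maximum=1, casesensitive=False):
    """Validate the user's answers and return the matching list items."""
    chosen = []
    for answer in answers:
        key = answer if casesensitive else answer.lower()
        matches = [item for item in items
                   if (item if casesensitive else item.lower()) == key]
        if not matches:
            raise ValueError(f"{answer!r} is not a valid selection")
        # Return the item as defined in values:, not as typed by the user.
        chosen.append(matches[0])
    if not minimum <= len(chosen) <= maximum:
        raise ValueError(
            f"{len(chosen)} selections given, need {minimum} to {maximum}")
    return chosen


items = parse_selection_values("blosum, pam, id", delimiter=",")
print(select(items, ["PAM"]))
```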
Formalised:
The minseqs: and maxseqs: attributes define whether the alignment will contain exactly 2 sequences, 1 or more, 3 or more, or whatever the program will produce. These values can be used to validate the choice of formats on the command line with the -aformat qualifier.
The aformat: attribute is required. It defines the default value for the -aformat qualifier. The aglobal: attribute defines the default value for the -aglobal qualifier, and should be set true for programs that produce a global alignment. The multiple: attribute should be set true if the output can contain more than one alignment from the same input. If a null value (the current directory) is allowed, the nullok: attribute must be set true.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without an alignment file. The application must be able to accept a null value for this qualifier.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no alignment file) as the default for programs where an alignment file is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead.
The output filename is constructed from the name: and extension: attributes in a $(name).$(extension) format. If the name: attribute is not defined in the ACD file, it will default to the calculated attribute name: of the FIRST sequence that is read in ($(asequence.name) if the sequence parameter is named asequence).
The extension: attribute will default to the output feature format.
The type: attribute defines whether the feature output is "protein" or "nucleotide". There is a default based on the type of any input sequence, but a value should always be specified.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without this feature output.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no feature output) as the default for programs where feature output is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The offormat: attribute defines the default value for the -offormat qualifier, used as the feature format and the default feature file extension.
The ofname: attribute defines the default value for the -ofname qualifier, used as the default base file name.
The name: attribute will default to "outfile".
The extension: attribute will default to the format, with "cut" defined as the default format to match the usual codon usage filenaming convention. This format is also called "emboss".
The nulldefault: attribute overrides the default name generation, and uses an empty string (no data output) as the default for programs where data output is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The oformat: attribute defines the default value for the -oformat qualifier, used as the file format and the default file extension.
The extension: attribute will default to the output file format.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no data output) as the default for programs where data output is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The oformat: attribute defines the default value for the -oformat qualifier, used as the file format and the default file extension.
The extension: attribute will default to the output file format.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no data output) as the default for programs where data output is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The oformat: attribute defines the default value for the -oformat qualifier, used as the file format and the default file extension.
The extension: attribute sets the default extension for all files written to the directory.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no data output) as the default for programs where data output is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The fullpath: attribute requires the path to be specified in full when passed to the program, although the user may provide a path from the current working directory.
The extension: attribute will default to the output file format.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no data output) as the default for programs where data output is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The oformat: attribute defines the default value for the -oformat qualifier, used as the file format and the default file extension.
The extension: attribute will default to the output file format.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no data output) as the default for programs where data output is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The oformat: attribute defines the default value for the -oformat qualifier, used as the file format and the default file extension.
The extension: attribute will default to the output file format.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no data output) as the default for programs where data output is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The oformat: attribute defines the default value for the -oformat qualifier, used as the file format and the default file extension.
The output filename is constructed from the name: and extension: attributes in a $(name).$(extension) format. If the name: attribute is not defined in the ACD file, it will default to the calculated attribute name: of the FIRST sequence that is read in ($(asequence.name) if the sequence parameter is named asequence).
The extension: attribute will default to the program name, and is usually left as the default value.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no output file) as the default for programs where an output file is only occasionally required. Examples include programs where the original output format is available, usually for users that still require it for parsing in automated scripts. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without an output file.
The knowntype: attribute should always be defined. If the output is not of any special type, a knowntype of "(program) output" is the recommended value.
The append: attribute specifies that output is appended to the end of an existing output file. By default, the output file will be overwritten.
The output filename is constructed from the name: and extension: attributes in a $(name).$(extension) format. If the name: attribute is not defined in the ACD file, it will default to the calculated attribute name: of the FIRST sequence that is read in ($(asequence.name) if the sequence parameter is named asequence).
The extension: attribute will default to the program name, and is usually left as the default value.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no output file) as the default for programs where an output file is only occasionally required. Examples include programs where the original output format is available, usually for users that still require it for parsing in automated scripts. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without an output file.
The knowntype: attribute should always be defined. If the output is not of any special type, a knowntype of "(program) output" is the recommended value.
The extension: attribute will default to the output file format.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no data output) as the default for programs where data output is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The oformat: attribute defines the default value for the -oformat qualifier, used as the file format and the default file extension.
The extension: attribute will default to the output file format.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no data output) as the default for programs where data output is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The oformat: attribute defines the default value for the -oformat qualifier, used as the file format and the default file extension.
The extension: attribute will default to the output file format.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no data output) as the default for programs where data output is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The oformat: attribute defines the default value for the -oformat qualifier, used as the file format and the default file extension.
The extension: attribute will default to the output file format.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no data output) as the default for programs where data output is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The oformat: attribute defines the default value for the -oformat qualifier, used as the file format and the default file extension.
The extension: attribute will default to the output file format.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no data output) as the default for programs where data output is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The oformat: attribute defines the default value for the -oformat qualifier, used as the file format and the default file extension.
The minseqs: and maxseqs: attributes define whether the alignment will contain exactly 2 sequences, 1 or more, 3 or more, or whatever the program will produce. These values can be used to validate the choice of formats on the command line with the -aformat qualifier.
The rformat: attribute is required. It defines the default value for the -rformat qualifier.
The taglist: attribute defines the additional tags to be reported from the internal feature table. The tag names and types must match the source code of the application. Each tag is in the format "type:tagname[=columnname]", for example "int:length" or "string:gc=GC%".
The precision: attribute sets the floating point precision of the score value. For integer scores this can be set to "0".
The multiple: attribute should be set true if the output can contain more than one report from the same input.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without a report output file. The application must be able to accept a null value for this qualifier.
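The taglist: entry format "type:tagname[=columnname]" can be unpacked as in the following Python sketch. It is not the EMBOSS parser: the names Tag and parse_tag are hypothetical, and using the tag name as the column name when "=columnname" is omitted is an assumption based on the square brackets marking that part as optional.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    kind: str    # the "type" part, e.g. "int" or "string"
    name: str    # the tag name used in the application source
    column: str  # the column heading used in the report


def parse_tag(spec):
    """Split one "type:tagname[=columnname]" entry into its three parts."""
    kind, colon, rest = spec.partition(":")
    if not colon or not kind or not rest:
        raise ValueError(f"expected type:tagname[=columnname], got {spec!r}")
    name, _, column = rest.partition("=")
    if not name:
        raise ValueError(f"missing tag name in {spec!r}")
    # Assumption: without an explicit column name, the tag name is used.
    return Tag(kind, name, column or name)


print(parse_tag("int:length"))
print(parse_tag("string:gc=GC%"))
```

Both inputs are the examples given in the text; the second one yields the column heading GC% for the tag gc.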
The nulldefault: attribute overrides the default name generation, and uses an empty string (no report output) as the default for programs where report output is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The output filename is constructed from the name: and extension: attributes in a $(name).$(extension) format. If the name: attribute is not defined in the ACD file, it will default to the calculated attribute name: of the FIRST sequence that is read in ($(asequence.name) if the sequence parameter is named asequence).
If the features: attribute is set, the sequence output will include feature information either in the same file (if the sequence format supports it) or in a separate file (by default in GFF format).
The type: attribute defines the output sequence type. Although this will default to the type of the first input sequence, it is recommended that a value is always defined to make the output sequence type clear.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without this sequence output. The application must be able to accept a null value for this qualifier.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no sequence output) as the default for programs where sequence output is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The osextension: attribute sets the default file extension. This is usually the sequence format, but can be specifically set with this attribute, for example where applications produce two or more sequence outputs.
The output filename is constructed from the name: and extension: attributes in a $(name).$(extension) format. If the name: attribute is not defined in the ACD file, it will default to the calculated attribute name: of the FIRST sequence that is read in ($(asequence.name) if the sequence parameter is named asequence).
If the features: attribute is set, the sequence output will include feature information either in the same file (if the sequence format supports it) or in a separate file (by default in GFF format).
The type: attribute defines the output sequence type. Although this will default to the type of the first input sequence, it is recommended that a value is always defined to make the output sequence type clear.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without this sequence output. The application must be able to accept a null value for this qualifier.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no sequence output) as the default for programs where sequence output is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The ossingle: attribute sets the default value for the -ossingle qualifier and can be set to "Y" to direct output to multiple sequence files. For example, the EMBOSS program "seqretsplit" splits an input sequence into multiple files using this attribute.
The osextension: attribute sets the default file extension. This is usually the sequence format, but can be specifically set with this attribute, for example where applications produce two or more sequence outputs.
The output filename is constructed from the name: and extension: attributes in a $(name).$(extension) format. If the name: attribute is not defined in the ACD file, it will default to the calculated attribute name: of the FIRST sequence that is read in ($(asequence.name) if the sequence parameter is named asequence).
If the features: attribute is set, the sequence output will include feature information either in the same file (if the sequence format supports it) or in a separate file (by default in GFF format).
The type: attribute defines the output sequence type. Although this will default to the type of the first input sequence, it is recommended that a value is always defined to make the output sequence type clear.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without this sequence output. The application must be able to accept a null value for this qualifier.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no sequence output) as the default for programs where sequence output is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead. In combination with the nullok: and missing: attributes, this allows qualifiers to be null by default, and turned on from the command line.
The osextension: attribute sets the default file extension. This is usually the sequence format, but can be specifically set with this attribute, for example where applications produce two or more sequence outputs.
Formalised:
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without this graph output. The application must be able to accept a null value for this qualifier.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no graph) as the default for programs where a graph is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead.
The goutfile: attribute specifies the base file name for output. It can be used to direct output to a named file rather than default to the first sequence name in the input.
The command line qualifiers can be defined as ACD attributes. The most used are gdesc:, gtitle:, gxtitle: and gytitle:.
The nullok: attribute allows a default value to be replaced by an empty string or by -noxxx on the command line if the application can run without this graph output.
The nulldefault: attribute overrides the default name generation, and uses an empty string (no graph) as the default for programs where a graph is only occasionally required. If an empty string is specified on the command line, the standard default value will be generated instead.
The command line qualifiers can be defined as ACD attributes. The most used are gdesc:, gtitle:, gxtitle: and gytitle:.
The multiple: attribute specifies the number of multiple XY graphs in a single output. The default value is 1, but is sometimes defined in ACD files.
The goutfile: attribute specifies the base file name for output. It can be used, for example by the EMBOSS program "tmap" to direct output to a named file rather than default to the first sequence name in the input.
Calculated attributes are attributes that are assigned values (or calculated, in some cases) AFTER the parameter has been validated (for instance, for a sequence data type, after the sequence file has been checked to exist and has been read in). The values are extracted from the actual object the parameter refers to. At the moment calculated attributes exist only for sequence type objects and can hold things like the name of the sequence, its length and its type (protein, DNA, RNA etc.).
Formalised:
The type: attribute will describe the type of the sequence in a single token. The EMBOSS initialisation routines will try to establish the type, by reading the (first) sequence and examining the contents. Possible values for the type: attribute are listed in table 8.
The values of attributes (default, specific and calculated) can be referred to after they have been defined by appending the attribute name to the parameter name, separated by a dot '.', enclosing the result in parentheses and prefixing it with a dollar sign '$'.
Formalised:
$(parametername.attribute)
Example:
sequence: asequence [
  standard: Y
  prompt: "Enter filename"
]
integer: windowsize [
  default: $(asequence.length)
]
In this example the parameter windowsize will default to the length of the input sequence.
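How a $(parametername.attribute) reference is replaced by a value can be illustrated with a short Python sketch. It is a simplified model rather than the ACD parser: the function resolve and the dictionary of known values are hypothetical, and the sequence length and name are sample values chosen for illustration.

```python
import re

# Matches $(parametername.attribute) and captures both names.
REFERENCE = re.compile(r"\$\((\w+)\.(\w+)\)")


def resolve(text, known):
    """Replace every $(parameter.attribute) in text with its known value."""
    def lookup(match):
        parameter, attribute = match.group(1), match.group(2)
        return str(known[parameter][attribute])
    return REFERENCE.sub(lookup, text)


# Attribute values that would be available once asequence has been read in.
known = {"asequence": {"length": 5574, "name": "pdnirsecf"}}

print(resolve("$(asequence.length)", known))
print(resolve("$(asequence.name).fasta", known))
```

A reference can only be resolved once the parameter it names has been defined and validated, which is why windowsize has to come after asequence in the ACD file. In this sketch an unknown parameter or attribute simply raises a KeyError.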
For many of the parameters/objects, qualifiers can be used to specify the properties of that object on the command line. The format of a sequence file (data type 'sequence') can be specified by a qualifier as being, for instance, 'fasta'. These types of qualifiers are specific for a particular data type (or object) and are therefore called data type specific qualifiers.
A second type of qualifier is independent of the data types. These are the global qualifiers and apply to the complete program. They are usually used to change the behaviour of the program. A global qualifier can turn debugging on, for instance (the -debug qualifier), or instruct the program to behave like a filter, reading from the standard input and writing to the standard output (the -filter qualifier).
Qualifiers can be entered on the command line in a myriad of ways and a full description of the command line syntax will be given in 1.4. For the moment qualifiers will be used in the UNIX style, which means that a qualifier name is prefixed with a hyphen and the value (if necessary) is separated from the qualifier by a space.
Example:
% seqret sequence.seq -sformat fasta
Here seqret is the program being called, sequence.seq is the first (and only) parameter and "-sformat fasta" is the qualifier/value pair for this parameter.
The global qualifiers are boolean qualifiers and can be set by naming them on the command line and specifically unset by prefixing the qualifier with 'no', but since the global qualifiers all default to false anyway, there is no specific need to use this syntax at the moment.
Example:
% seqret sequence.seq -debug
% seqret sequence.seq -nodebug
In the first example seqret is the program being called, sequence.seq the first (and only) parameter and -debug instructs the program to turn debugging on. In the second example seqret is run with the same parameter, but the -debug qualifier is now prefixed with 'no', instructing the program to turn debugging off (this could be useful if debugging was turned on by default in the resource files or in an environment variable).
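The set/unset convention for boolean qualifiers can be sketched in Python as follows. This is a simplified model, not the EMBOSS command line parser: parse_boolean_qualifier is a hypothetical helper, and it handles only the "-name" and "-noname" forms shown above.

```python
def parse_boolean_qualifier(argument, known):
    """Return (name, value) for "-name" or "-noname", or None if not recognised.

    known is the set of boolean qualifier names the program accepts.
    """
    if not argument.startswith("-"):
        return None
    word = argument[1:]
    if word in known:
        return word, True
    # A "no" prefix negates a known qualifier.
    if word.startswith("no") and word[2:] in known:
        return word[2:], False
    return None


known = {"debug", "auto", "filter", "options"}
for argument in ["-debug", "-nodebug", "sequence.seq"]:
    print(argument, "->", parse_boolean_qualifier(argument, known))
```

Checking the full word before stripping "no" means that a qualifier whose own name happens to begin with "no" would still be recognised as set rather than as a negation.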
Qualifiers can have any name, but a recommended naming scheme is used at the moment. The first one or two letters of the qualifier indicate the data type they are related to. 'OS' is used for the output sequence data types (outseq, outseqset and outseqall) and 'S' for the input sequence data types (sequence, seqset and seqall). The rest of the qualifier's name is free but should be something sensible related to the data type.
Global qualifiers can change the behaviour of the program. They are boolean qualifiers and can be set by naming them on the command line and specifically unset by prefixing the qualifier with 'no' [2]. The qualifiers can be used on the EMBOSS program and as a qualifier for acdc. The current global qualifiers are listed in the table below.
Formalised:
This qualifier will turn on the log function of the ACD file processing. It will produce a logfile of the ACD file parsing process. The logfile will have the name of the application, with the extension .acdlog.
Example:
% seqret sequence.seq -acdlog -auto
% more seqret.acdlog
seqret [
sequence: sequence [ parameter: "1" ]
seqout: outseq [ parameter: "2" ]
-- All Done --
Definitions in ACD file
ACD 0
Name: 'seqret'
..(Removed lines)..
This example shows the application seqret being run with the -acdlog qualifier (and the -auto qualifier, which will be discussed later). After completion of the program a file called seqret.acdlog is created in the current directory, with the logging information, of which the first sixteen lines are shown.
The logfile will first list the ACD file description, with all abbreviated names extended to their full length. Next, it will list all the parameters and qualifiers it knows of and print out all the information it has on the data types and qualifiers.
When the -acdpretty qualifier is used, an ACD file will be produced which is a formatted version of the original ACD file. It will produce the full-length names of all data type names, attributes and qualifiers. It will show all attributes on a separate line and all values enclosed in quotes. The file will be saved as programname.acdpretty in the current directory (programname is the name of the original program).
The -auto qualifier will turn off any prompting of the user. It will try to run the program with all the default settings that are defined in the ACD file. If a parameter does not have a default value and it is flagged as required, the program will stop and produce an error message.
Example:
% seqret sequence.seq
Output sequence [pdnirsecf.fasta]:
% seqret sequence.seq -auto
%
The first example shows the application seqret being run without the -auto qualifier. The program will prompt the user for an output filename, because the output sequence is a mandatory parameter. It presents the user with a prompt and a default output filename (pdnirsecf.fasta, constructed from the input sequence name and the output format).
In the second example, the application seqret is run with the -auto qualifier. The user is not queried for the output filename and it will use the default filename (pdnirsecf.fasta) for the output file.
This qualifier will turn on the debug tracing. A file will be produced with the name of the program followed by the extension .dbg. The debug file will contain a complete trace of the actions of the program reported by calls to the AJAX function ajDebug().
Example:
% seqret sequence.seq -debug -auto
% more seqret.dbg
acdArgsScan acdDebug Yes
ajNamGetValueC 'acdroot' 'emboss_acdroot'
definition for 'acdroot' not found
ajNamResolve of '/packages/emboss/emboss/acd/seqret.acd'
closing file '/packages/emboss/emboss/acd/seqret.acd'
acdFindQualAssoc 'auto' pnum: 0 ifound: 0
acdSetQualAppl acdDebug Yes
ajNamGetValueC 'filter' 'emboss_filter'
definition for 'filter' not found
ajNamGetValueC 'options' 'emboss_options'
definition for 'options' not found
ajNamGetValueC 'acdlog' 'emboss_acdlog'
definition for 'acdlog' not found
ajNamGetValueC 'help' 'emboss_help'
definition for 'help' not found
ajSeqInClear called
Initializing seqInFormat, 24 formats
ajNamGetValueC 'format' 'emboss_format'
definition for 'format' not found
ajSeqRead: no file yet - test USA '/people/tdeboer/seq/nir.gb'
seqUsaProcess USA to test: '/people/tdeboer/seq/nir.gb'
format regexp: No
no format specified in USA
...input format not set
dbname dbexp: No
no dbname specified
entry-id regexp: Yes
found filename /people/tdeboer/seq/nir.gb
seqAccessFile /people/tdeboer/seq/nir.gb
ajNamResolve of '/people/tdeboer/seq/nir.gb'
ajSeqRead: calling seqRead '/people/tdeboer/seq/nir.gb'
seqRead seqin format 0 ''
try format 1 (gcg)
seqGcgDots .. found source 1..5574
try format 3 (embl)
first line 'LOCUS PDNIRSECF 5574 bp DNA BCT 30-MAR-1996
try format 5 (swiss)
first line 'LOCUS PDNIRSECF 5574 bp DNA BCT 30-MAR-1996
try format 7 (fasta)
This example shows the application seqret being run with the -debug qualifier (and the -auto qualifier, described above). After completion of the program, a file called seqret.dbg is created in the current directory which contains the debug information, of which some of the lines are shown.
The -filter qualifier makes the program behave like a filter, reading its (first) input 'file' from the standard input, and writing its (first) output 'file' to the standard output. The -filter qualifier will also invoke the -auto qualifier, so the user is never prompted for any missing values.
Example:
% cat sequence.seq | seqret -filter | lpr
The example shows the application seqret being run with the -filter qualifier. The input file is 'piped' into the program using the Unix command cat and the output is 'piped' directly to the Unix program lpr, which will print it on the printer.
Help on a program's use can be obtained by using the -help qualifier. The help that is displayed will be automatically produced from the information in the ACD file. It will list all the parameters and their associated qualifiers. It will show the names of the parameters and qualifiers, their type and a brief help text, that is extracted from the help: attribute.
A second qualifier -verbose gives a list of all available qualifiers, including any associated qualifiers (sequence formatting etc) and the general qualifiers such as -help.
A program will prompt the user for any missing "required" or "parameter" values. Some programs have more options that are normally not prompted for (although they can be used on the command line). When the -options qualifier is used, the program will query the user for the required parameters (data types with the parameter: attribute and/or standard: attribute) and also for the parameters that are labelled with the additional: attribute.
Example:
ACD file definition :
application: seqdemo
sequence: asequence [ parameter: Y ]
outseq: outseq [ standard: Y ]
integer: outputLength [ additional: Y information: "Output length" ]
Command line :
% seqdemo
Input sequence: sequence.seq
Output sequence [pdnirsecf.fasta]: <ENTER>
% seqdemo -options
Input sequence: sequence.seq
Output sequence [pdnirsecf.fasta]: <ENTER>
Output length: 10
In the first example the application seqdemo is run without any parameters or qualifiers. Since asequence is defined as a parameter, the program queries the user for the input filename. It also queries the user for the output sequence, since that is labelled as required by the standard: attribute. It will not query the user for the integer outputLength, since it is labelled neither as a parameter nor as required.
In the second example the user IS queried for the integer, since the -options qualifier forces the program to query for those parameters that are labelled with the additional: attribute.
Any parameter that is not defined as a parameter (with the parameter: attribute), as required (by the standard: attribute) or as optional (by the additional: attribute) can still be used on the command line, but the user will NEVER be queried for them. These parameters are considered an 'advanced feature' and can only be used on the command line. They will only be shown by the -help qualifier.
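The prompting rules described in this section can be summarised in a small Python sketch. It models only the behaviour stated in the text and is not EMBOSS code: should_prompt is a hypothetical helper, and each ACD definition is reduced to a dictionary of its parameter:, standard: and additional: flags.

```python
def should_prompt(definition, given_on_command_line, options=False, auto=False):
    """Decide whether the user is asked for a value.

    definition holds the boolean flags "parameter", "standard" and "additional".
    """
    if auto or given_on_command_line:
        return False  # -auto suppresses prompting; supplied values need no prompt
    if definition.get("parameter") or definition.get("standard"):
        return True   # parameters and required values are always prompted for
    # Additional values are prompted for only when -options is in effect.
    return bool(definition.get("additional")) and options


seqdemo = {
    "asequence": {"parameter": True},
    "outseq": {"standard": True},
    "outputLength": {"additional": True},
}

for options in (False, True):
    asked = [name for name, definition in seqdemo.items()
             if should_prompt(definition, False, options=options)]
    print("-options" if options else "(default)", asked)
```

With the seqdemo definitions from the example, the default run asks for asequence and outseq only, and the -options run also asks for outputLength. A definition with none of the three flags is never prompted for.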
When the -stdout qualifier is used, the user will still be prompted for all the information that is required, but the program will write its output to standard output by default. The user will also still be prompted for an output filename, in case the output should be saved to a file.
Example:
Command line :
% seqret -stdout
Input sequence: sequence.seq
Output sequence [stdout]: <ENTER>
In this example the -stdout qualifier changes the default output to be to standard output (the terminal) instead of to a file. The program can still prompt the user, so there is a chance to enter a filename instead. With -auto on the command line, the program would instead write to the terminal without asking.
Most global qualifiers default to FALSE unless they are set on the command line or the corresponding environment variable is set to TRUE (the exceptions are the message level qualifiers -warning, -error and -fatal). The default action of every global qualifier can be changed with an environment variable, which overrides the built-in default of the program. The variable name is constructed from the word EMBOSS (all capitals) and the name of the qualifier (also in capitals), joined by the underscore character '_'. A variable is switched on by setting it to YES, TRUE or 1. Both lowercase and uppercase are accepted, as is an abbreviation of the word YES or TRUE (for example Y or T).
Formalised:
(csh)
setenv EMBOSS_QUAL TRUE    or    setenv EMBOSS_QUAL true
setenv EMBOSS_QUAL YES     or    setenv EMBOSS_QUAL yes
setenv EMBOSS_QUAL 1

(sh or bash)
export EMBOSS_QUAL=YES
where QUAL represents the name of the global qualifier.
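As a rough illustration of the naming and value rules, the following Python sketch builds the variable name for a global qualifier and tests whether a value counts as true. The function names are hypothetical, and treating "a part of the word" as any leading part of YES or TRUE is an assumption of this sketch. The message level qualifiers, which default to TRUE, are not modelled.

```python
import os


def env_name(qualifier):
    """Build the variable name, e.g. 'options' -> 'EMBOSS_OPTIONS'."""
    return "EMBOSS_" + qualifier.upper()


def is_true(value):
    """Accept 1, or a leading part of YES or TRUE, in either case."""
    v = value.strip().lower()
    if v == "1":
        return True
    # Assumption: any non-empty prefix of "yes" or "true" is accepted.
    return v != "" and ("yes".startswith(v) or "true".startswith(v))


def global_default(qualifier, environ=os.environ):
    """Default of a global qualifier: FALSE unless its variable is set to a true value."""
    return is_true(environ.get(env_name(qualifier), ""))
```

For example, global_default("options", {"EMBOSS_OPTIONS": "y"}) is true, while an empty environment gives false.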
The table below lists all environment variables for global qualifiers.
Environment variables can be specified in the global emboss.defaults file, in the user's .embossrc file or set on the command line with the setenv command.
When the environment variable is set, its effect can be cancelled for a single run by prefixing 'no' to the boolean qualifier name on the command line of the program.
Example:
ACD file definition :
application: seqdemo

sequence: asequence [
  parameter: Y
]

outseq: outseq [
  parameter: Y
]

integer: outputLength [
  additional: Y
]
Command line :
Example 1:
% seqdemo
Input sequence: sequence.seq
Output sequence [pdnirsecf.fasta]: <ENTER>

% seqdemo -options
Input sequence: sequence.seq
Output sequence [pdnirsecf.fasta]: <ENTER>
Output length: 10

Example 2:
% setenv EMBOSS_OPTIONS YES
% seqdemo
Input sequence: sequence.seq
Output sequence [pdnirsecf.fasta]: <ENTER>
Output length: 10

% seqdemo -nooptions
Input sequence: sequence.seq
Output sequence [pdnirsecf.fasta]: <ENTER>

The first example shows the behaviour without the EMBOSS_OPTIONS environment variable being set. The program seqdemo behaves in the standard way, and only asks for the outputLength parameter when the -options qualifier is used. In the second example the environment variable EMBOSS_OPTIONS is set at the command line, so the program now asks for the outputLength parameter without the -options qualifier being used. The effect of the environment variable is cancelled by prefixing 'no' to the qualifier -options (giving -nooptions).
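The order of precedence in these two examples (command line over environment variable over built-in default) can be sketched as follows. This is a simplified model with a hypothetical function name; it only handles boolean global qualifiers and accepts a fixed set of true values.

```python
def resolve_boolean(name, argv, environ, default=False):
    """Command line (-name / -noname) wins over EMBOSS_NAME, which wins over the default."""
    value = default
    env_value = environ.get("EMBOSS_" + name.upper())
    if env_value is not None:
        # Simplification: only these spellings are treated as true here.
        value = env_value.strip().lower() in ("1", "y", "yes", "t", "true")
    for token in argv:
        if token == "-" + name:
            value = True
        elif token == "-no" + name:
            value = False
    return value
```

With environ set to {"EMBOSS_OPTIONS": "YES"}, resolve_boolean("options", ["-nooptions"], environ) is false, which corresponds to the last command line of example 2.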
Qualifiers refer to the parameter that precedes them, until another parameter of the same data type appears on the command line. Qualifiers that are specific to different data types can, however, be intermixed. If no two parameters share a data type, the order of the parameters and their qualifiers is irrelevant.
Example 1
% seqret in.seq out.seq -sformat fasta -osformat gcg
In this example, the program seqret takes two parameters, an input sequence (file in.seq, data type 'sequence') and an output sequence (file out.seq, data type 'outseq') and the order of the qualifiers is irrelevant, since the two qualifiers refer to different data types.
Example 2
% align aap.seq -sformat fasta noot.seq -sformat gcg
In this example, the program align takes two parameters, both input sequences (files aap.seq and noot.seq, data type sequence), and here the order of the qualifiers is important, since aap.seq is in 'fasta' format and noot.seq is in 'gcg' format.
Instead of having to adhere to a rigorous order for the qualifiers when two or more parameters of the same data type are defined, it is also possible to use numbers in the qualifier's name, to indicate to which parameter the qualifier is referring.
Formalised:
-qualifiername# qualifiervalue
where # represents an integer number, indicating which parameter the qualifier is referring to.
Example:
% align aap.seq noot.seq -sformat2 gcg -sformat1 fasta
This is equivalent to example 2 above, but uses qualifier numbering to indicate that the format of the first parameter is 'fasta' and that of the second is 'gcg'.
The number that is used is not the position of the parameter in the ACD definition, but its number among the parameters that take SIMILAR qualifiers, i.e. it counts only the parameters of the same data type.
Example:
#ACD definition
application: seqtest

sequence: asequence1 [
  parameter: Y
]

outfile: outfile [
  parameter: Y
]

sequence: asequence2 [
  parameter: Y
]
Command line :
% seqtest filename1.seq seqtest.out filename2.seq \
    -sformat1 gcg -sformat2 fasta

This defines that the first sequence file (filename1.seq) is in 'gcg' format and the second sequence file (filename2.seq) is in 'fasta' format. Note that the second -sformat qualifier has been numbered 2, although it refers to the third parameter (which is the second sequence parameter, hence number 2).
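The numbering rule can be illustrated with a short Python sketch. The function name and the list representation of the ACD file are hypothetical; the only rule modelled is that the number counts parameters of the same data type, in the order in which they are defined.

```python
def target_parameter(parameters, datatype, number):
    """Return the name of the number-th (1-based) parameter of the given data type.

    parameters is a list of (datatype, name) pairs in ACD definition order.
    """
    same_type = [name for dtype, name in parameters if dtype == datatype]
    if not 1 <= number <= len(same_type):
        raise ValueError(f"no {datatype} parameter number {number}")
    return same_type[number - 1]


# The seqtest definition from the example above.
SEQTEST = [
    ("sequence", "asequence1"),
    ("outfile", "outfile"),
    ("sequence", "asequence2"),
]
```

Since -sformat applies to the sequence data type, -sformat2 resolves through target_parameter(SEQTEST, "sequence", 2), which gives asequence2 even though it is the third parameter in the file.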
Operations make the ACD syntax more flexible. At the moment there are arithmetic and conditional operations. An operation is enclosed in a pair of parentheses '()' and preceded by the at symbol '@'.
Formalised:
@(operation)
If the operation contains white space, the whole token should be enclosed in double quotes (" ").
Formalised:
"@(operation with white space )"
Operations can be nested.
Formalised:
@(@(operation))
The current arithmetic operations are addition, subtraction, multiplication and division. The standard characters for the arithmetic operations are used: + - * and /.
Formalised:
@(a+b) (Addition)
@(a-b) (Subtraction)
@(a*b) (Multiplication)
@(a/b) (Division)

The operands a and b must parse to an integer or a float value. Only a single arithmetic operation is allowed per operation. If more than one arithmetic operation is required, one should make use of internal ACD variables to hold the intermediate results or nest separate @() operations.
Example1:
variable: protlen "@( $(sequence.length) / 3 )"

integer: window [
  maximum: "@($(protlen)-50)"
  default: 50
]

This is an example of using an internal ACD variable to store the intermediate result. The internal ACD variable $(protlen) is calculated from the length of the input sequence (sequence data type) and used in the definition of the maximum size of the window parameter.
Example2:
integer: window [ maximum: "@( @( $(sequence.length) / 3) - 50)" default: 50 ]
This example uses nesting of operations to achieve the same result as example 1. The maximum of the window parameter is calculated directly from the $(sequence.length) variable (a calculated attribute): the inner operation divides the sequence length by 3, and the outer operation subtracts 50 from the result.
If any of the operands are not numerical, the result is undefined.
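To make the inside-out evaluation of nested operations concrete, here is a small Python model. It is not the ACD parser: the function names are hypothetical, the variable values are passed in as a dictionary, and the choice of an integer quotient when both operands are integers is an assumption of the sketch. Operations with non-numerical operands are simply left unresolved.

```python
import re

# An innermost operation: "@(" a op b ")" with plain numbers as operands.
_OP = re.compile(r"@\(\s*(-?\d+(?:\.\d+)?)\s*([-+*/])\s*(-?\d+(?:\.\d+)?)\s*\)")
# A variable reference such as $(protlen) or $(sequence.length).
_VAR = re.compile(r"\$\(([\w.]+)\)")


def _number(text):
    return float(text) if "." in text else int(text)


def _apply(match):
    a, op, b = _number(match.group(1)), match.group(2), _number(match.group(3))
    if op == "+":
        result = a + b
    elif op == "-":
        result = a - b
    elif op == "*":
        result = a * b
    elif isinstance(a, int) and isinstance(b, int):
        result = a // b   # assumption: integer operands give an integer quotient
    else:
        result = a / b
    return str(result)


def resolve(expression, variables):
    """Substitute $(name) references, then evaluate @() operations inside-out."""
    text = _VAR.sub(lambda m: str(variables[m.group(1)]), expression)
    while True:
        text, count = _OP.subn(_apply, text)
        if count == 0:    # nothing left that can be evaluated
            return text
```

With a sequence length of 300, resolve("@( @( $(sequence.length) / 3) - 50)", {"sequence.length": 300}) first reduces the inner division to 100 and then the outer subtraction to 50, matching the two-step version that uses the protlen variable.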
There are three conditional operations: the Boolean operation, the simple conditional (if/then/else type) and the case type.
The Boolean operation resolves to a Boolean value. It uses one of the four comparison operators, for equality (==), non-equality (!=), less-than (<) and greater-than (>), or one of the logical operators not (!), or (|) and and (&).
Formalised:
@(token1==token2) (Equality)
@(token1!=token2) (Non-equality)
@(token1<token2) (Less-than)
@(token1>token2) (Greater-than)
@(!token1) (Not)
@(token1|token2) (Or)
@(token1&token2) (And)
The test values can be integers, floats and strings.
Example:
sequence: seq [ standard: Y ]
infile: data [ standard: @(seq.type==DNA) ]
In this example, the data file is only required if the type of sequence is 'DNA'.
The simple conditional is a tri-operand operator. The test value is followed by a question mark '?', which in turn is followed by the two values the operation can resolve to, separated by a colon ':'. Formalised:
@(boolval ? iftrue : iffalse)
The test value, boolval, which must be either a Boolean variable, a Boolean operation or an integer, is examined and if it resolves to true (or non-zero) the total operation resolves to the iftrue value. If the test value resolves to false (or zero) the operation resolves to the second value (iffalse).
Example:
string: matrix [ default: "@($(asequence.protein) ? BLOSUM62 : DNAMAT)" ]
The $(asequence.protein) variable is a Boolean value that resolves to true if the sequence data type with the name asequence is a protein sequence. The operation resolves to BLOSUM62 if the sequence is a protein sequence and to DNAMAT if it is not (i.e. a DNA or RNA sequence).
From EMBOSS 2.8.0 the preferred method is to use the automatic ACD variable $(acdprotein), which is set to the type of the first input sequence. This makes the conversion of ACD files for GUI interfaces and other wrappers simpler. The example then becomes:
string: matrix [ default: "@($(acdprotein) ? BLOSUM62 : DNAMAT)" ]
The results will be the same because internally EMBOSS will use the value of "$(asequence.protein)".
In the case-type operation, the test value is compared with a list of possible values. If a match is found, the operation resolves to the result associated with that possible value. The test value, which is parsed as a string, is followed by an equal sign '=', which in turn is followed by one or more pairs of possible and associated values, separated by a colon ':'. If none of the possible values match, the operation will resolve to the default result, that is associated with the keyword else.
The else : default value pair is not mandatory. If none of the possible values match in an operation without the default value, the operation will resolve to a null string. Formalised:
@(testval = poss_valA : ass_valA poss_valB : ass_valB else : default_val)
Example:
string: matrix [ default: "@($(sequence.type) = protein : BLOSUM62 dna : dnamat rna : rnamat else : unknown)" ]
The $(sequence.type) variable is a string value that holds the type of sequence present in the sequence data type with the name sequence. If the type is 'protein', the operation resolves to BLOSUM62, if the type is 'dna' it resolves to dnamat, etc. If the type is not in this list, the operation resolves to unknown.
If the test value cannot unambiguously be assigned to a single associated value, the operation will resolve to the LAST associated value that matches its possible value.
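The behaviour of the case-type operation, including the optional else value and the "last match wins" rule, can be modelled in Python as below. The function name is hypothetical and the sketch assumes that a possible value matches only when it is exactly equal to the test value; the actual matching rules of the ACD parser may be more permissive.

```python
def case_operation(testval, pairs, default=""):
    """Resolve a case-type operation.

    pairs is a list of (possible_value, associated_value) in the order written.
    default is the value of the optional 'else' pair; a null string when absent.
    """
    result = default
    for possible, associated in pairs:
        if testval == possible:   # assumption: exact string comparison
            result = associated   # a later match overwrites an earlier one
    return result


# The pairs from the matrix example above.
MATRIX = [("protein", "BLOSUM62"), ("dna", "dnamat"), ("rna", "rnamat")]
```

case_operation("dna", MATRIX, "unknown") gives dnamat, an unlisted type gives unknown, and the same call without a default gives a null string.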
The use of conditional operations in ACD files is often to test the values of list or selection data types.
These tests are often used for several other qualifiers. To help understand the ACD file, and to help the developers of ACD parsers, an ACD file can use a variable definition to define the result once only, and then to refer to the variable by name in all later ACD data type definitions.
Example1:
variable: usermatrix "@($(pwmatrix) == o)"

infile: pairwisedata [
  additional: "$(usermatrix)"
  default: ""
  nullok: "@(!$(usermatrix))"
  information: "Filename of user pairwise matrix"
  knowntype: "comparison matrix"
]

Note that, as a variable only has a single value and no attributes, the square brackets are not used.
Variables are used to simplify the ACD file, but they do indicate that there is some complexity in the ACD definitions. When a variable is used, or when a conditional operation refers to another ACD value, the application can be regarded as two or more separate applications with each possible condition resolved.
The parameters and qualifiers defined by an ACD file are processed in the order in which they appear. This is sufficient for ACD processing by EMBOSS applications, but does not give enough detail for user interfaces to build clean groupings of options.
To help user interfaces, all ACD parameters and qualifiers are now grouped into 5 major sections and in some cases into subsections. The 5 major sections always appear in the following order in the ACD file (the order is tested by the acdvalid tool):
Section name | Description |
Input | Input values, including any infile, sequence, seqset, seqall, matrix, fmatrix, codon, or any other ACD type that will read input. At present datafile is included, although this may change. Other qualifiers related to input can also be placed in this section. |
Required | Parameters and required qualifiers, including any whose "standard" attribute can be true but depends on a conditional operation. Also any toggles that their definitions use. Note that input and output parameters and qualifiers must be in their respective sections. Other qualifiers related to input and output can also be placed in those sections. |
Additional | Additional qualifiers, including any whose "additional" attribute can be true but depends on a conditional operation. Also any toggles that their definitions use. Note that input and output parameters and qualifiers must be in their respective sections. Other qualifiers related to input and output can also be placed in those sections. |
Advanced | Any qualifiers (except input and output qualifiers) which have no "standard" or "additional" attribute defined. Other qualifiers related to input and output can also be placed in those sections. |
Output | Output values, including any outfile, outdata, seqout, seqoutall, seqoutset, outtree or any other data type that will write output. This is the last section to be defined, so all output definitions must be at the end. Other qualifiers related to output can also be placed in this section. |
All sections and subsections are defined in the file sections.standard which is stored and installed in the same directory as the ACD files.
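A check of the section order, in the spirit of the test made by acdvalid, might look like the following Python sketch. It is not the acdvalid implementation: the function name and messages are hypothetical, and the sketch assumes that a section may be omitted but may not appear out of order or twice.

```python
SECTION_ORDER = ["input", "required", "additional", "advanced", "output"]


def check_section_order(sections):
    """Return a list of problems; the list is empty if the order is acceptable."""
    problems = []
    last = -1
    for name in sections:
        key = name.lower()
        if key not in SECTION_ORDER:
            problems.append(f"unknown section: {name}")
            continue
        rank = SECTION_ORDER.index(key)
        if rank <= last:   # also catches a repeated section
            problems.append(f"section out of order: {name}")
        else:
            last = rank
    return problems
```

For instance, check_section_order(["input", "output", "advanced"]) reports the advanced section, because output must be the last section to be defined.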
The behaviour of EMBOSS programs can not only be influenced by command line options, but also by environment variables. Some of the environment variables were already mentioned in section 3.4.2.2 for the global qualifiers and they are listed here also, for completeness. There are a few others that are not directly related to specific data types, but are more general to the workings of an EMBOSS program.
All environment variables should be described in the file variables.standard which is stored and installed in the same directory as the ACD files and is used to generate the table below. Unlike the other .standard files, there is no explicit test for a variable to be defined in this file.
A set of 6 utility programs will run, test and document an ACD file without the need to write the program which will use the ACD file.
The recommended approach for developing new applications is to first write and test the ACD file and then to write the application to use the values defined by the ACD file.
The acdc utility processes an ACD file in exactly the same way as an application, even if the application itself has not yet been written.
acdc can use general qualifiers such as -debug. Note that as the input files are read any debug calls made by the input functions will be reported.
The acdtrace utility runs like acdc but also reports the resolution of any ACD variables and operations as the file is processed. The output on screen can look a little confusing but is by far the best way to see how variables and operations work in your ACD file.
The acdvalid utility validates an ACD file, testing many features which will not prevent an application from running, but will create problems for the user interface (commandline or some wrapper).
Among the features tested by acdvalid are:
If the message is a "Warning" then the ACD file will work, although it is worth trying to fix the problem. Recommended solutions are described in the web page http://www.ebi.ac.uk/~pmr/emboss/acdvalid-fix.html used by the developers.
Further validation tests will be added in future releases, so it is worth running acdvalid on all local ACD files with each new version of EMBOSS.
The acdpretty utility simply reads an ACD file and rewrites it with clean indentation to file (programname).acdpretty which can be used to overwrite the original ACD file.
The acdtable utility is used to create the table of qualifiers, allowed values and defaults that appears in the application documentation. The allowed values uses the valid attribute, and the default value uses the expect attribute for cases where the ACD definition alone is not enough to define the value to be reported.
The acdlog utility runs like acdc but also produces a file (programname).acdlog which documents the internals of ACD processing.

Most of the EMBOSS programs will be started from the UNIX command line, either with or without extra parameters and qualifiers. Which parameters and qualifiers can appear on the command line is defined in the Ajax Command Definition (ACD) file that is associated with the EMBOSS program (See 1.2).
The command line syntax is very versatile and does not restrict the available syntax more than is strictly necessary. To save confusion, there will be a recommended EMBOSS command style, which probably will be the UNIX style using '=' for parameter and qualifier values.
For parameters it is not always mandatory to use the name of the parameter on the command line. If the parameter: attribute was used for a parameter it is not mandatory to use the name of the parameter as a prefix to the parameter value (See 3.4.1.1.1). For qualifiers it is always mandatory to provide the name of the qualifier (if a value for the qualifier is to be given on the command line).
In the rest of the definition of the command line syntax, wherever the word qualifier is used, it means both parameters and qualifiers. If 'parameter' is used it will only apply to parameters.
Example:
ACD definition
application: seqdemo

sequence: asequence [
  parameter: Y
]

boolean: output [
  default: Y
]
Command line :
% seqdemo filename.seq -output
% seqdemo filename.seq -nooutput
In the first command line example the bool parameter output is set to True (although it could have been omitted since the default value is True).
In the second command line example the output parameter is set to False, by the prefix 'no'.
Sequence specifications conform to the EMBOSS Uniform Sequence Address, but parts of the specification can also be given on the command line.
Examples
The following command lines all tell seqdemo to read sequence paamir.tfa in fasta format, starting at base 25.
% seqdemo -sbeg 25 paamir.tfa -sf fasta
% seqdemo fasta::paamir.tfa -sbegin=25
% seqdemo -sbegin=25 fasta::paamir.tfa
% seqdemo -sbegin=25 paamir.tfa -sformat fasta
% seqdemo -sbeg 25 paamir.tfa -sf=fasta
% seqdemo -sbeg 25 -sequence=paamir.tfa -sf=fasta
% seqdemo sbeg=25 -sequence=paamir.tfa sf=fasta
% seqdemo -sbeg 25 -sequence paamir.tfa -sf fasta
% seqdemo /SBEG=25 /SEQUENCE=paamir.tfa /SF=fasta
This may seem rather confusing, but only because there is no enforcement of a standard recommended way for users to specify the command lines.
For general use, we strongly recommend the first example above.
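The equivalence of the nine command lines can be demonstrated with a small Python model that reduces each of them to the same set of name and value pairs. This is a sketch under several assumptions, not the EMBOSS command line parser: the function names are hypothetical, only qualifiers that take a value are handled, abbreviations are resolved as unique prefixes of the three names used in the examples, and a file name that itself starts with '/' would be misread as a qualifier.

```python
KNOWN = ["sequence", "sbegin", "sformat"]


def expand(name, known=KNOWN):
    """Resolve a possibly abbreviated, case-insensitive qualifier name."""
    name = name.lower()
    if name in known:
        return name
    matches = [k for k in known if k.startswith(name)]
    if len(matches) != 1:
        raise ValueError(f"unknown or ambiguous qualifier: {name}")
    return matches[0]


def normalise(command, known=KNOWN):
    """Reduce a seqdemo command line to a dictionary of full names and values."""
    tokens = command.split()[1:]              # drop the program name
    result = {}
    while tokens:
        token = tokens.pop(0)
        if token.startswith(("-", "/")):      # -name value, -name=value, /NAME=value
            name, sep, value = token[1:].partition("=")
            if not sep:
                value = tokens.pop(0)         # the value is the next token
        elif "=" in token:                    # name=value without a prefix
            name, _, value = token.partition("=")
        else:                                 # the single positional parameter
            name, value = "sequence", token
        name = expand(name, known)
        if name == "sequence" and "::" in value:
            # format::file form of the sequence address
            result["sformat"], value = value.split("::", 1)
        result[name] = value
    return result
```

Each of the command lines above then normalises to {"sbegin": "25", "sequence": "paamir.tfa", "sformat": "fasta"}.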
This part is intended as a simple guide to getting started as a developer using EMBOSS. EMBOSS is a new package, and can seem difficult at first. If you follow these steps you will find it can be easy.
We start by assuming you want to write a new application. You will need to write the application source code, which will use the EMBOSS libraries, and you also need to add the application to EMBOSS.
Strangely, you add the application to EMBOSS before you write any source code. This is because EMBOSS can use the application command definition (ACD) file to test all the input for you, without any new source code.
ACD files live in the emboss/acd/ directory with a filename of appname.acd ("appname" is the name of your application). You will find some files there already which you are welcome to use as templates for your own ACD file.
application: appname
Defines the application name. All ACD files start this way.
sequence: sequence [ parameter: Y ]
This asks for a sequence as the first parameter on the command line.
outfile: outfile [ parameter: Y ]
This asks for an output file name as the second parameter on the command line.
integer: weight [ ]
Allows an integer value to be specified by "-weight" on the command line. It will not be prompted for unless you add "standard:Y", in which case you should add "prompt: 'prompt for the user'" as well.
There are other things you can specify too. All values can have defaults provided in the ACD file and tests to make sure the values are reasonable. See the ACD documentation for more information.
Now for the cunning part. EMBOSS has an application called acdc which can pretend to be any other application. You put
% acdc appname
on the command line. It will read appname.acd and will read in any required data just as if the application itself was running. It will also test anything else you add on the command line and report syntax errors in exactly the same way as the real application.
When the ACD file is ready, which should not take long, you can start on the application code. This lives in emboss/appname.c and to start with you will simply call the startup routines and pick up the values you defined in the ACD file.
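#include "emboss.h"

/* Values picked up from the ACD file */
AjPSeq  seq;
AjPFile outfile;
int     iweight;

int main (int argc, char * argv[])
{
    /* Read appname.acd, parse the command line and prompt for missing values */
    embInit ("appname", argc, argv);

    seq     = ajAcdGetSeq ("sequence");
    outfile = ajAcdGetOutfile ("outfile");
    iweight = ajAcdGetInt ("weight");

    ajExit();

    return 0;
}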
The "embInit" call is exactly what acdc was doing. It will read appname.acd and read everything you need from the command line or by prompting the user.
The next 3 lines pick up the sequence, output file and integer value in the suggested ACD file (you will have your own set of calls here).
Now you are ready to write your code. The sequence is in seq in an internal representation. You can use EMBOSS functions to work with this sequence, or convert it to a string and use the string functions, or convert it to a null-terminated C character string and use the C library functions and C pointers. It does not really matter which you choose.
All output will be written to outfile. You should use the EMBOSS output functions to do this, typically ajFmtPrintF which works just like C's "printf" except that it uses an AJAX file object (AjPFile) and has some extra format options like %S for AJAX strings (AjPStr).
You can add a new application without too much difficulty if you are using the full developer's version from the CVS server. Go to the emboss directory (/emboss). Edit the file Makefile.am and make two changes. Add appname to the list of applications in bin_PROGRAMS and add a new line:
appname_SOURCES = appname.c
Then go back up to the top directory (emboss/) and run ./configure, which will magically update the makefiles for you. You can then use make to make EMBOSS with your new application.
What next?
Time to write some real code for your application. Good luck and happy coding!
The following ACD file for a hypothetical application called ajtest tests the data types for both required and optional values. The application will prompt for one value of each data type, in the order in which they are defined, and will accept definitions of the optional data on the command line.
# AJTEST application
# AJAX COMMAND DEFINITION (ACD) FILE
# use "" for missing values - these are required.
# values in "" are trimmed to single spaces.
# everything is treated as single tokens delimited by white space
# (space, tab, newline)
# pmr 8-jul-98

application: ajtest [
  documentation: "Testing ACD files"
  groups: "Test"
]

boolean: reqbool [
  default: Y
  standard: Y
  information: "Required bool"
]

boolean: bool [
  default: N
  information: "Another bool"
]

integer: reqint [
  minimum: -50
  maximum: +50
  standard: Y
  information: "Number -50 to 50"
]

integer: int [
  minimum: -50
  maximum: +50
  information: "Enter a number -50 to 50"
  additional: y
]

float: reqfloat [
  minimum: -0.07
  maximum: 2.5
  standard: Y
  information: "Float to 2.5"
]

float: float [
  minimum: -7e-2
  maximum: 2.5
  information: "Float -0.07 to 2.5"
]

sequence: psequence [
  parameter: Y
]

outfile: outfile [
  default: stdout
  extension: "test"
  name: "ajtest"
  type: "text"
  standard: Y
]

sequence: qsequence [
  parameter: Y
]

string: reqstring [
  default: "abcd"
  standard: Y
  information: "rqstring"
  minlen: 4
]

string: string [
  default: "b"
  information: "string"
  minlen: 1
  maxlen: 50
]
The AJAX Command Definition, and the command line entered by the user, are processed automatically by a single start-up call to embInit. The same call also handles all prompting of the user for missing information.
To help in evaluating the ACD files, there is a special EMBOSS application acdc (the ACD compiler) which, when given the name of an ACD file as the first argument on the command line, will process that file and use the remainder of the command line. This causes acdc to behave exactly like any (possibly not yet written) application and makes it very easy to test how a particular application could be defined.
For example
% acdc ajtest
would use the example ACD file above and would prompt for each of the required data types:
% acdc ajtest
Testing ACD files
Required bool [Y] :
Number -50 to 50 [0] :
Float to 2.5 [0.0] :
First sequence : gcg::egmsmg.gcg
Output file [stdout] :
Second sequence : paamir.tfa
rqstring [abcd] : fred

% acdc ajtest -sask
Testing ACD files
Required bool [Y] :
Number -50 to 50 [0] :
Float to 2.5 [0.0] :
First sequence : gcg::egmsmg.gcg
Begin at base [1] :
End at base [1217] :
Reverse strand [N] :
Output file [stdout] :
Second sequence : paamir.tfa
Begin at base [1] :
End at base [1000] :
Reverse strand [N] :
rqstring [abcd] : fred

% acdc ajtest -noreqb -reqf=1.5 egmsmg.gcg -sformat gcg
Testing ACD files
Number -50 to 50 [0] :
Output file [stdout] :
Second sequence : paamir.tfa
rqstring [abcd] : fred

% acdc ajtest -reqst=xyz
Testing ACD files
Required bool [Y] :
Number -50 to 50 [0] :
Float to 2.5 [0.0] :
First sequence : gcg:egmsmg.gcg
Output file [stdout] :
Second sequence : paamir.tfa
Too short - minimum length is 4 characters
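The last session shows a value being rejected because of the minlen attribute. The kind of test involved can be sketched in Python as below. The function names are hypothetical, and apart from the "Too short" text copied from the session, the messages are invented for the sketch and are not the messages EMBOSS prints.

```python
def check_integer(text, minimum, maximum):
    """Return an error message, or None if the value is acceptable."""
    try:
        number = int(text)
    except ValueError:
        return "not an integer"                      # hypothetical message
    if number < minimum:
        return f"less than minimum {minimum}"        # hypothetical message
    if number > maximum:
        return f"more than maximum {maximum}"        # hypothetical message
    return None


def check_string(text, minlen=0, maxlen=None):
    """Return an error message, or None if the length is acceptable."""
    if len(text) < minlen:
        return f"Too short - minimum length is {minlen} characters"
    if maxlen is not None and len(text) > maxlen:
        return f"too long - maximum length is {maxlen} characters"  # hypothetical message
    return None
```

check_string("xyz", minlen=4) models the -reqst=xyz case for reqstring, and check_integer("51", -50, 50) models an out-of-range value for reqint.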
This documentation was originally written by Thon de Boer at the HGMP RC in Hinxton, UK with input from Peter Rice at the Sanger Centre and Gary Williams at the HGMP RC. It is now maintained by Peter Rice at the European Bioinformatics Institute.
A web version can be found at http://emboss.sourceforge.net/developers/acd/
[1] Input (and output) will usually take place using files, but 'files' is used here in the broadest sense, since there are many different ways that the input can actually take place (via the USA method).
[2] The global qualifiers all default to 'false' at this moment, so there is no compelling reason to use this syntax.
Need to add qualifier_assocqualifier syntax
Need to add a lot more on features input and output and notes on features in reports
Check details for ranges - can we add some attributes e.g. forwardonly so range appears in the tables
Check the -nooutput option in the example(s)
acd error checks - list and example for each - and how to log and debug the details
cpdb, scop should all use name and extension attributes - makes more sense than just name. Only codon should have a default.
datafile - does extension get added to anything the user specifies? always, only if it has no extension?