extractfeat is a simple utility for extracting regions of a sequence that are annotated as being a specified type of feature. It reads one or more sequences, and writes out the sequences and features of interest to an output sequence file. 'joined' features can either be extracted as individual sequences, or as a single concatenated sequence if the -join qualifier is used. If the feature is annotated as being in the reverse sense of a nucleic acid sequence, then that feature's sub-sequence is reverse-complemented before it is written.
There are many options to control exactly what parts of the feature table are given in the output file. In addition, it is often useful to have contextual information about a feature. There are options to specify a number of positions before and/or after the specified feature which will be reported in the output file.
Feature tables in Swissprot, EMBL, GFF, etc. format can be added using '-ufo featurefile' on the command line.
The sequences of the specified features are written out.
The ID name of the sequence is formed from the original sequence name with the start and end positions of the feature appended to it. So if the feature came from a sequence with an ID name of 'XYZ' from positions 10 to 22, then the resulting ID name of the feature sequence will be 'XYZ_10_22'.
The name of the type of feature is added to the start of the description of the sequence in brackets, e.g.: '[exon]'.
The sequence is written out as a normal sequence.
If the feature is in the reverse sense of a nucleic acid sequence, then it is reverse-complemented before being written.
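The naming and reverse-complement rules above can be sketched in a few lines of Python. This is an illustration only, not the EMBOSS implementation: the helper names (extract_feature, reverse_complement) are hypothetical, positions are assumed to be 1-based and inclusive, and the single space between the bracketed type and the original description is an assumption.

```python
# Hypothetical helpers that mimic how an output record is named.
# They are not part of EMBOSS.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_feature(seq_id, description, sequence, start, end,
                    feat_type, reverse=False):
    """Return (id, description, sub-sequence) for one feature.

    start and end are assumed to be 1-based, inclusive positions.
    """
    subseq = sequence[start - 1:end]
    if reverse:
        # Reverse-sense nucleic acid features are reverse-complemented.
        subseq = reverse_complement(subseq)
    new_id = f"{seq_id}_{start}_{end}"
    # Assumed layout: bracketed type, a space, then the old description.
    new_desc = f"[{feat_type}] {description}".strip()
    return new_id, new_desc, subseq


sequence = "A" * 9 + "ACGTACGTACGTA" + "T" * 5
forward = extract_feature("XYZ", "test sequence", sequence, 10, 22, "exon")
reverse = extract_feature("XYZ", "test sequence", sequence, 10, 22, "exon",
                          reverse=True)
print(forward)
print(reverse)
```

The forward call yields the ID 'XYZ_10_22' from the example above; the reverse call returns the same ID with the sub-sequence reverse-complemented.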
There are many options to control exactly what parts of the feature table are written to file.
By default every feature in the feature table is extracted. -type will set the specific feature type to extract. See http://www3.ebi.ac.uk/Services/WebFeat/ for a list of the EMBL feature types and see Appendix A of the Swissprot user manual in http://www.expasy.ch/txt/userman.txt for a list of the Swissprot feature types.
By default any feature tag in the feature table is extracted. -tag specifies the feature tag to show. Tags are the types of extra values that a feature may have. For example in the EMBL feature table, a 'CDS' type of feature may have the tags /codon, /codon_start, /db_xref, /EC_number, /evidence, /exception, /function, /gene, /label, /map, /note, /number, /partial, /product, /protein_id, /pseudo, /standard_name, /translation, /transl_except, /transl_table, or /usedin. Some of these tags also have values, for example /gene can have the value of the gene name.
By default any feature tag value in the feature table is shown. -value is used to set this to match any feature tag value you wish to show. Tag values are the values associated with a feature tag, for example /gene can have the value of the gene name. Bear in mind that only some of these tags can have values.
By default any feature source in the feature table is shown. -source is used to set this to match a specific feature source.
By default features in either sense are extracted. The -sense option specifies a particular sense.
The minimum and maximum score of features to be reported may be specified with -minimum and -maximum.
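The selection options described above act as a set of filters that a feature must pass before it is extracted. The following Python sketch shows that idea under stated assumptions: the Feature class and select function are hypothetical, matching with shell-style wildcards is an assumption, sense is modelled as 1 (forward) or -1 (reverse), and a bound of None means "no limit". It does not reproduce how extractfeat itself evaluates these options.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class Feature:
    """Hypothetical, minimal stand-in for one feature table entry."""
    type: str
    source: str
    sense: int      # assumed encoding: 1 = forward, -1 = reverse
    score: float


def select(features, type="*", source="*", sense=None,
           minimum=None, maximum=None):
    """Keep the features that pass every filter; defaults accept all."""
    chosen = []
    for feat in features:
        if not fnmatchcase(feat.type, type):
            continue
        if not fnmatchcase(feat.source, source):
            continue
        if sense is not None and feat.sense != sense:
            continue
        if minimum is not None and feat.score < minimum:
            continue
        if maximum is not None and feat.score > maximum:
            continue
        chosen.append(feat)
    return chosen


table = [
    Feature("exon", "EMBL", 1, 0.0),
    Feature("exon", "EMBL", -1, 5.0),
    Feature("CDS", "EMBL", 1, 9.0),
]
everything = select(table)
reverse_exons = select(table, type="exon", sense=-1)
high_scoring = select(table, minimum=4.0)
```

With no arguments every feature is kept, which mirrors the "by default" behaviour described for each option.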
To aid you in identifying the type of feature that has been output, the type of feature is added to the start of the description of the output sequence. Sometimes the description of a sequence is lost in subsequent processing of the sequences file, so it is useful for the type to be a part of the output sequence ID name. The -featinname option specifies this behaviour.
To aid you in identifying the properties of a feature that has been output, -describe specifies one or more tag names that will be added to the output sequence "Description" text, together with their values (if any). For example, if this is set to be gene, then if any output feature has the tag (for example) /gene=BRCA1 associated with it, then the text (gene=BRCA1) will be added to the Description line. By default no feature tag is given in the "Description" text. You can set -describe to specify any feature tag you wish to show.
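As a rough illustration of the -describe behaviour, the sketch below builds the extra Description text from a feature's tags. The function name describe_text is hypothetical, tags are modelled as a plain dictionary, and the handling of tags without a value (written as the bare tag name in parentheses) and the space separating several tags are assumptions rather than documented behaviour.

```python
def describe_text(tags, wanted):
    """Build the text added to the Description line.

    tags   -- dict of tag name -> value (None when the tag has no value)
    wanted -- tag names requested, as with -describe
    """
    parts = []
    for name in wanted:
        if name not in tags:
            continue  # the feature does not carry this tag
        value = tags[name]
        if value is None:
            parts.append(f"({name})")          # assumed form for valueless tags
        else:
            parts.append(f"({name}={value})")
    return " ".join(parts)


feature_tags = {"gene": "BRCA1", "pseudo": None, "note": "example"}
gene_only = describe_text(feature_tags, ["gene"])
print(gene_only)
```

For a feature carrying /gene=BRCA1, requesting the gene tag gives the text (gene=BRCA1), as in the example above.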
Some features, such as CDS (coding sequence) and mRNA, are composed of exons concatenated together. There may be other forms of 'joined' sequence, depending on the feature table. By default, 'joined' features are extracted as individual sequences. If the -join option is specified, then any group of these features will be output as a single sequence. If the -before and -after qualifiers have been set, then only the sequence before the first feature and after the last feature is added.
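The difference between the default behaviour and -join, and the way context is added only at the outer ends of a joined feature, can be sketched as follows. The function names are hypothetical, positions are assumed to be 1-based and inclusive, only forward-sense features are handled, and clamping the context at the ends of the sequence is an assumption.

```python
def extract_separately(sequence, spans):
    """Default behaviour: one output sequence per span."""
    return [sequence[start - 1:end] for start, end in spans]


def extract_joined(sequence, spans, before=0, after=0):
    """-join behaviour: concatenate the spans into one sequence.

    Context is taken only before the first span and after the last one.
    """
    ordered = sorted(spans)
    first_start = ordered[0][0]
    last_end = ordered[-1][1]
    # Clamp so the context never runs off the start of the sequence.
    left = sequence[max(0, first_start - 1 - before):first_start - 1]
    # Slicing past the end of a Python string is safe and simply truncates.
    right = sequence[last_end:last_end + after]
    body = "".join(sequence[start - 1:end] for start, end in ordered)
    return left + body + right


sequence = "aaaCCCgggTTTaaa"
spans = [(4, 6), (10, 12)]
separate = extract_separately(sequence, spans)
joined = extract_joined(sequence, spans, before=2, after=1)
print(separate)
print(joined)
```

Here the two spans are written as two sequences by default, while the joined form is a single sequence with two positions of context in front and one behind; the 'ggg' between the spans is never included.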
Bear in mind that database annotation cannot always be trusted. If you rely upon annotation written by other people or another program and do not independently verify it, then there is a chance that some of the reported features will be erroneous.