EMBOSS: textsearch

textsearch

Function

Search the textual description of sequence(s)

Description

textsearch searches for words (specified as a regular expression) in the description text of one or more input sequences. It writes an output file with optional contents such as the name, description and accession number of any sequence whose description line from the annotation matches the search term. Optionally, the search is case-sensitive and the results output as an HTML table. textsearch is convenient for small input files but will be slow for larger files and databases; you should use use SRS or Entrez instead.

Algorithm

textsearch searches only the description line, not the full sequence annotation.

Usage

Here is a sample session with textsearch

Search for 'lactose':

% textsearch tsw:* 'lactose' Search the textual description of sequence(s) Output file [edd1_rat.textsearch]:

Go to the input files for this example
Go to the output files for this example

Example 2

Search for 'lactose' or 'permease' in E.coli proteins:

% textsearch tsw:*_ecoli 'lactose | permease' Search the textual description of sequence(s) Output file [laci_ecoli.textsearch]:

Go to the output files for this example

Example 3

Output a search for 'lacz' formatted with HTML to a file:

% textsearch tembl:* 'lacz' -html -outfile embl.lacz.html Search the textual description of sequence(s)

Go to the input files for this example
Go to the output files for this example

Command line arguments

   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     (Gapped) sequence(s) filename and optional
                                  format, or reference (input USA)
  [-pattern]           string     The search pattern is a regular expression.
                                  Use a | to indicate OR.
                                  For example:
                                  human|mouse
                                  will find text with either 'human' OR
                                  'mouse' in the text (Any string is accepted)
  [-outfile]           outfile    [*.textsearch] Output file name

   Additional (Optional) qualifiers:
   -casesensitive      boolean    [N] Do a case-sensitive search
   -html               boolean    [N] Format output as an HTML table

   Advanced (Unprompted) qualifiers:
   -only               boolean    [N] This is a way of shortening the command
                                  line if you only want a few things to be
                                  displayed. Instead of specifying:
                                  '-nohead -noname -nousa -noacc -nodesc'
                                  to get only the name output, you can specify
                                  '-only -name'
   -heading            boolean    [@(!$(only))] Display column headings
   -usa                boolean    [@(!$(only))] Display the USA of the
                                  sequence
   -accession          boolean    [@(!$(only))] Display 'accession' column
   -name               boolean    [@(!$(only))] Display 'name' column
   -description        boolean    [@(!$(only))] Display 'description' column

   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory3        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Standard (Mandatory) qualifiers Allowed values Default

[-sequence]
(Parameter 1) (Gapped) sequence(s) filename and optional format, or reference (input USA) Readable sequence(s) Required

[-pattern]
(Parameter 2) The search pattern is a regular expression. Use a | to indicate OR. For example: human|mouse will find text with either 'human' OR 'mouse' in the text Any string is accepted An empty string is accepted

[-outfile]
(Parameter 3) Output file name Output file <*>.textsearch

Additional (Optional) qualifiers Allowed values Default

-casesensitive Do a case-sensitive search Boolean value Yes/No No

-html Format output as an HTML table Boolean value Yes/No No

Advanced (Unprompted) qualifiers Allowed values Default

-only This is a way of shortening the command line if you only want a few things to be displayed. Instead of specifying: '-nohead -noname -nousa -noacc -nodesc' to get only the name output, you can specify '-only -name' Boolean value Yes/No No

-heading Display column headings Boolean value Yes/No @(!$(only))

-usa Display the USA of the sequence Boolean value Yes/No @(!$(only))

-accession Display 'accession' column Boolean value Yes/No @(!$(only))

-name Display 'name' column Boolean value Yes/No @(!$(only))

-description Display 'description' column Boolean value Yes/No @(!$(only))

Standard (Mandatory) qualifiers	Allowed values	Default
[-sequence] (Parameter 1)	(Gapped) sequence(s) filename and optional format, or reference (input USA)	Readable sequence(s)	Required
[-pattern] (Parameter 2)	The search pattern is a regular expression. Use a \| to indicate OR. For example: human\|mouse will find text with either 'human' OR 'mouse' in the text	Any string is accepted	An empty string is accepted
[-outfile] (Parameter 3)	Output file name	Output file	<>*.textsearch
Additional (Optional) qualifiers	Allowed values	Default
-casesensitive	Do a case-sensitive search	Boolean value Yes/No	No
-html	Format output as an HTML table	Boolean value Yes/No	No
Advanced (Unprompted) qualifiers	Allowed values	Default
-only	This is a way of shortening the command line if you only want a few things to be displayed. Instead of specifying: '-nohead -noname -nousa -noacc -nodesc' to get only the name output, you can specify '-only -name'	Boolean value Yes/No	No
-heading	Display column headings	Boolean value Yes/No	@(!$(only))
-usa	Display the USA of the sequence	Boolean value Yes/No	@(!$(only))
-accession	Display 'accession' column	Boolean value Yes/No	@(!$(only))
-name	Display 'name' column	Boolean value Yes/No	@(!$(only))
-description	Display 'description' column	Boolean value Yes/No	@(!$(only))

Input file format

textsearch reads one or more normal sequence USAs.

Input files for usage example

'tsw:*' is a sequence entry in the example protein database 'tsw'

Input files for usage example 3

'tembl:*' is a sequence entry in the example nucleic acid database 'tembl'

Output file format

Output files for usage example

File: edd1_rat.textsearch

# Search for: lactose
tsw-id:LACI_ECOLI LACI_ECOLI    P03023	Lactose operon repressor.
tsw-id:LACY_ECOLI LACY_ECOLI    P02920	Lactose permease (Lactose-proton symport).

Output files for usage example 2

File: laci_ecoli.textsearch

# Search for: lactose | permease
tsw-id:LACI_ECOLI LACI_ECOLI    P03023	Lactose operon repressor.
tsw-id:LACY_ECOLI LACY_ECOLI    P02920	Lactose permease (Lactose-proton symport).

Output files for usage example 3

File: embl.lacz.html

Search for: lacz

tembl-id:J01636	J01636	J01636	E.coli lactose operon with lacI, lacZ, lacY and lacA genes.
tembl-id:V00296	V00296	V00296	E. coli gene lacZ coding for beta-galactosidase (EC 3.2.1.23).

The first column in the name or ID of each sequence. The remaining text is the description line of the sequence.

When the -html qualifier is specified, then the output will be wrapped in HTML tags, ready for inclusion in a Web page. Note that tags such as <HTML>, <BODY>, </BODY> and </HTML> are not output by this program as the table of databases is expected to form only part of the contents of a web page - the rest of the web page must be supplier by the user.

The lines of out information are guaranteed not to have trailing white-space at the end. So if '-nodesc' is used, there will not be any whitespace after the ID name.

Data files

None.

Notes

This is a rather slow way to search for text in databases. If you are searching for text in public databases, you should consider using either Entrez (http://www.ncbi.nlm.nih.gov/Entrez/) or SRS (http://srs.rfcgr.mrc.ac.uk/ or http://www.sanger.ac.uk/srs6/ etc.)

References

None.

Warnings

Diagnostic Error Messages

None.

Exit status

It always exits with status 0

Known bugs

None.

Program name	Description
abiview	Display the trace in an ABI sequencer file
cirdna	Draws circular maps of DNA constructs
infoalign	Display basic information about a multiple sequence alignment
infobase	Return information on a given nucleotide base
inforesidue	Return information on a given amino acid residue
infoseq	Display basic information about sequences
lindna	Draws linear maps of DNA constructs
pepnet	Draw a helical net for a protein sequence
pepwheel	Draw a helical wheel diagram for a protein sequence
prettyplot	Draw a sequence alignment with pretty formatting
prettyseq	Write a nucleotide sequence and its translation to file
remap	Display restriction enzyme binding sites in a nucleotide sequence
seealso	Finds programs with similar function to a specified program
showalign	Display a multiple sequence alignment in pretty format
showdb	Displays information on configured databases
showfeat	Display features of a sequence in pretty format
showseq	Displays sequences with features in pretty format
sixpack	Display a DNA sequence with 6-frame translation and ORFs
tfm	Displays full documentation for an application
whichdb	Search all sequence databases for an entry and retrieve it
wossname	Finds programs by keywords in their short description

Author(s)

Gary Williams (gwilliam © rfcgr.mrc.ac.uk)
MRC Rosalind Franklin Centre for Genomics Research Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SB, UK

History

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments

None

Function

Description

Algorithm

Usage

Command line arguments

Input file format

Input files for usage example

Input files for usage example 3

Output file format

Output files for usage example

File: edd1_rat.textsearch

Output files for usage example 2

File: laci_ecoli.textsearch

Output files for usage example 3

File: embl.lacz.html

Data files

Notes

References

Warnings

Diagnostic Error Messages

Exit status

Known bugs

See also

Author(s)

History

Target users

Comments