dbxfasta
Function
Description
dbxfasta indexes a flat file database of one or more files, and builds
EMBOSS B+tree format index files. These indexes allow access to flat
files larger than 2 GB.
Usage
Command line arguments
Input file format
dbxfasta reads any normal sequence USA (Uniform Sequence Address).
Output file format
dbxfasta creates one summary file for the database and two files
for each field indexed.
- dbalias.ent is the master file containing the names of the files
that have been indexed. It is an ASCII file. This file also contains
the database release and date information.
- dbalias.xid is the B+tree index file for the ID names. It is a binary
file.
- dbalias.pxid is an ASCII file containing information regarding the
structure of the ID name index.
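As a quick check after indexing, the three files described above can be looked for in the index directory. The following Python sketch is illustrative only: the helper names and the example path are assumptions, and only the master file and the ID index files named above are covered.

```python
from pathlib import Path


def expected_id_index_files(dbalias: str) -> list[str]:
    """Return the master file and ID index file names for a given dbalias."""
    return [f"{dbalias}.ent", f"{dbalias}.xid", f"{dbalias}.pxid"]


def missing_index_files(index_dir: str, dbalias: str) -> list[str]:
    """List the expected files that are not present in index_dir."""
    directory = Path(index_dir)
    return [name for name in expected_id_index_files(dbalias)
            if not (directory / name).is_file()]


if __name__ == "__main__":
    # Hypothetical index directory; replace with your own location.
    print(missing_index_files("/data/embl/fasta/indexes", "emrod"))
```

An empty list means all three files were found; any names returned point to files that are absent from the directory.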
Data files
None.
Notes
The indexing system has been designed to allow on-the-fly updating
of indexes. This feature is, however, not implemented in the current
version. It will be made available in future releases.
Once the EMBOSS indexes have been created, the database can be
defined in the file emboss.defaults with an entry such as:
DB emrod [
   type: N
   dbalias: emrod   (see below)
   format: fasta
   method: emboss
   directory: /data/embl/fasta
   file: emrod.fasta
   indexdirectory: /data/embl/fasta/indexes
]
The index file 'basename' given to dbxfasta must match the DB name
in the definition. If it does not, a 'dbalias' line must be given that
specifies the basename of the indexes.
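The basename rule above can be sketched in Python. This is a minimal illustration and not part of EMBOSS: the function db_definition and its parameters are hypothetical, and it emits only the attributes shown in the example definition. A 'dbalias' line is written only when the index basename differs from the DB name.

```python
def db_definition(name, basename, directory, filename, indexdirectory,
                  seqtype="N"):
    """Build an emboss.defaults DB entry like the example above (sketch)."""
    lines = [f"DB {name} [", f"   type: {seqtype}"]
    # dbalias is only needed when the index basename is not the DB name.
    if basename != name:
        lines.append(f"   dbalias: {basename}")
    lines += [
        "   format: fasta",
        "   method: emboss",
        f"   directory: {directory}",
        f"   file: {filename}",
        f"   indexdirectory: {indexdirectory}",
        "]",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # 'rodents' is a made-up DB name whose indexes were built as 'emrod'.
    print(db_definition("rodents", "emrod", "/data/embl/fasta",
                        "emrod.fasta", "/data/embl/fasta/indexes"))
```

Calling the function with identical name and basename produces an entry without a 'dbalias' line.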
Fields Indexed
By default, dbxfasta will index the ID name and the accession
number (if present).
If they are present in your database, you may also specify that
dbxfasta should index the Sequence Version and GI, the words
in the description, the keywords and the organism words using the
'-fields' qualifier with the appropriate values.
Global Parameters
dbxfasta requires that two global parameters be defined in
the file emboss.defaults. These are:
SET PAGESIZE 2048
SET CACHESIZE 200
The above values are recommended for most systems. PAGESIZE is
a multiple of the size of the disc pages that the operating system
buffers. CACHESIZE is the number of disc pages that dbxfasta is
allowed to cache.
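As a rough worked example of what these values imply, the sketch below multiplies PAGESIZE by CACHESIZE. It assumes that PAGESIZE is expressed in bytes and that each cached page occupies one PAGESIZE of memory; both are assumptions made for illustration, not statements about dbxfasta internals.

```python
PAGESIZE = 2048   # recommended value above; unit assumed to be bytes
CACHESIZE = 200   # number of pages that may be cached


def cache_bytes(pagesize: int, cachesize: int) -> int:
    """Approximate memory used by a full page cache, in bytes."""
    return pagesize * cachesize


print(cache_bytes(PAGESIZE, CACHESIZE))  # 409600 bytes, i.e. 400 KiB
```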
Resources
dbxfasta will ask you for the name of a resource definition
in the file emboss.defaults. This will be something like:
RES embl [
   type: Index
   idlen: 15
   acclen: 15
   svlen: 20
   keylen: 25
   deslen: 25
   orglen: 25
]
The length definitions are the maximum lengths of 'words' in the
field being indexed. Longer words are truncated to the
length set.
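The effect of these length limits can be illustrated with a short Python sketch. The dictionary mirrors the example resource above; the function index_word is hypothetical and shows only the truncation, not how dbxfasta stores words.

```python
# Maximum word lengths, copied from the example 'embl' resource above.
RESOURCE = {"idlen": 15, "acclen": 15, "svlen": 20,
            "keylen": 25, "deslen": 25, "orglen": 25}


def index_word(word: str, length_key: str, resource: dict = RESOURCE) -> str:
    """Return the word cut to the maximum length set for its field."""
    return word[:resource[length_key]]


print(index_word("AVERYLONGIDENTIFIER01", "idlen"))  # AVERYLONGIDENTI
```

Words at or below the limit are returned unchanged, so two long words sharing the same leading characters would become indistinguishable in this sketch.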
References
None.
Warnings
None.
Diagnostic Error Messages
None.
Exit status
dbxfasta always exits with status 0.
Known bugs
None.
Author(s)
History
Target users
Comments