est2genome
Function
Description
est2genome aids the prediction of genes by sequence homology. It aligns a set of spliced nucleotide sequences (ESTs cDNAs or mRNAs) to an unspliced genomic DNA sequence, inserting introns of arbitrary length when needed. Where feasible introns start and stop at the splice consensus dinucleotides GT and AG.
By default, est2genome makes three alignments: First it compares both strands of the spliced sequence against the forward strand of the genomic sequence, assuming the splice consensus GT/AG (ie in the forward gene direction). The maximum-scoring orientation is then realigned assuming the splice consensus CT/AC (ie in the reversed gene direction). By default, only the overall maximum-scoring alignment is reported, and then if it scores higher than a specific minimum threshold score. Optionally, all comparisons may be reported.
The program outputs a list of the exons and introns it has found. The format is like that of MSPcrunch, ie a list of matching segments. This format is easy to parse into other software. The program also indicates, based on the splice site information, the gene's predicted direction of transcription. Optionally the full sequence alignment is printed as well.
Algorithm
The program uses a linear-space divide-and-conquer strategy (Myers and Miller, 1988; Huang, 1994) to limit memory use:
1. A first pass Smith-Waterman local alignment scan is done to find the start, end and score of the maximally scoring segments (including introns of course). No other alignment information is retained.
2. Subsequences corresponding to these segments are extracted
3a. If the product of the subsequences' lengths is less than a user-defined threshold (-space parameter), i.e. they will fit in memory, the segments are realigned using the Needleman-Wunsch global alignment algorithm, which will give the same result as the Smith-Waterman since the subsequences are guaranteed to align end-to-end.
3b. If the product of the lengths exceeds the threshold (a full alignment will not fit in memory) the alignment is made recursively by splitting the spliced (EST) sequence in half and finding the genome sequence position which aligns with the EST mid-point. The process is repeated until the product of the lengths is less than the threshold. The problem reduces to aligning the left-hand and right-hand portions of the sequences separately and merging the result.
4. The genome sequence is searched against the forward and reverse strands of the spliced (EST) sequence, assuming a forward gene splicing direction (i.e. GT/AG consensus).
5. Then the best-scoring orientation is realigned assuming reverse splicing (CT/AC consensus). The overall best alignment is reported.
The worst-case run-time for the algorithm is about 3 times as long as would be taken to align using a quadratic-space program. In practice the maximal-scoring segment is often much shorter than the full genome length so the program runs only about 1.5 times slower.
The algorithm has the following steps:
- A first-pass Smith-Waterman scan is done to locate the score, start
and end of the maximal scoring segment (including introns of
course). No other alignment information is retained.
- Subsequences corresponding to the maximal-scoring segments are
extracted. If the product of these subsequences' lengths is less than
the area parameter then the segments are re-aligned using the
Needleman-Wunsch algorithm, which in this instance will give the same
result as the Smith-Waterman since they are guaranteed to align
end-to-end.
- If the product of lengths exceeds the area threshold then the
alignment is recursively broken down by splitting the EST in half and
finding the genome position which aligns with the EST mid-point. The
problem then reduces to aligning the left-hand and right-hand portions
of the sequences separately and merging the result.
The worst-case run-time for the algorithm is about 3 times as long as
would be taken to align using a quadratic-space program. In practice
the maximal-scoring segment is often much shorter than the full genome
length so the program runs only about 1.5 times slower.
Usage
Command line arguments
Input file format
est2genome reads two nucleotide sequences. The first is an EST
sequence (a single read or a finished cDNA). The second is a genomic
finished sequence.
Output file format
MSP type segments
There are four types of segment,
- each gapped Exon
- each Intron (marked with a ? if it does not start GT and end AG)
- the complete alignment Span
- individual ungapped matching Segments.
The score for Exon segments is the alignment score excluding
flanking intron penalties. The Span score is the total
including the intron costs.
The coordinates of the genomic sequence always refer to the positive
strand, but are swapped if the est has been reversed. The splice
direction of Introns are indicated as +Intron (forward, splice
sites GT/AG) or -Intron (reverse, splice sites CT/AC), or
?Intron (unknown direction). Segment entries give the
alignment as a series of ungapped matching segments.
Full alignment
You get the alignment if the -align switch is set. The alignment
includes the first and last 5 bases of each intron, together with the
intron width. The direction of splicing is indicated by
>>>> (forward) or <<<< (reverse) or ????
(unknown)
Data files
None
Notes
est2genome uses a linear-space dynamic-programming
algorithm. It has the following parameters:
parameter default description
match 1 score for matching two bases
mismatch 1 cost for mismatching two bases
gap_penalty 2 cost for deleting a single base in
either sequence,
excluding introns
intron_penalty 40 cost for an intron, independent of
length.
splice_penalty 20 cost for an intron, independent of
length and starting/ending on
donor-acceptor sites.
space 10 Space threshold (in megabytes)
for linear-space recursion. If the
product of the two sequence
lengths divided by 4 exceeds this then
a divide-and-conquer strategy is used
to control the memory requirements.
In this way very long sequences can
be aligned.
If you have a machine with plenty of
memory you can raise this parameter
(but do not exceed the machine's
physical RAM)
However, normally you should not need
to change this parameter.
There is no gap initiation cost for short gaps, just a penalty
proportional to the length of the gap. Thus the cost of inserting a
gap of length L in the EST is L*gap_penalty
and the cost
in the genome is
min { L*gap_penalty, intron_penalty } or
min { L*gap_penalty, splice_penalty } if the gap starts with GT and ends with AG
(or CT/AC if splice direction reversed)
Introns are not allowed in the EST. The difference between the
intron_penalty and splice_penalty allows for some slack in marking the
intron end-points. It is often the case that the best intron
boundaries, from the point of view of minimising mismatches, will not
coincide exactly with the splice consensus, so provided the difference
between the intron/splice penalties outweighs the extra mismatch/indel
costs the alignment will respect the proper boundaries. If the
alignment still prefers boundaries which don't start and end with the
splice consensus then this may indicate errors in the sequences.
The default parameters work well, except for very short exons
(length less than the splice_penalty, approx) which may be
skipped. The intron penalties should not be set to less that the
maximum expected random match between the sequences (typically 10-15
bp) in order to avoid spurious matches.
References
- Mott R. (1997) EST_GENOME: a program to align spliced DNA sequences to
unspliced genomic DNA. Comput. Applic. 13:477-478
- Huang X (1994) On global sequence alignment. Comput. Applic. Biosci.
10:227-235.
- Myers, EW and Miller, W (1988) Optimal alignments in linear space.
Comput. Applic. Biosci. 4:11-17
- Smith, TE and Waterman, MS (1981) Identification of common molecular
subsequences. J. Mol. Biol. 147:195-197
Warnings
None.
Diagnostic Error Messages
None.
Exit status
It returns 0 unless an error occurs.
Known bugs
None.
Author(s)
This application was modified for inclusion in EMBOSS by
The original program was est_genome, written by Richard Mott at the Sanger
Centre. The original version is available from
ftp://ftp.sanger.ac.uk/pub/pmr/est_genome.4.tar.Z
History
Target users
Comments