EMBOSS: est2genome

est2genome

Function

Description

est2genome aids the prediction of genes by sequence homology. It aligns a set of spliced nucleotide sequences (ESTs cDNAs or mRNAs) to an unspliced genomic DNA sequence, inserting introns of arbitrary length when needed. Where feasible introns start and stop at the splice consensus dinucleotides GT and AG.

By default, est2genome makes three alignments: First it compares both strands of the spliced sequence against the forward strand of the genomic sequence, assuming the splice consensus GT/AG (ie in the forward gene direction). The maximum-scoring orientation is then realigned assuming the splice consensus CT/AC (ie in the reversed gene direction). By default, only the overall maximum-scoring alignment is reported, and then if it scores higher than a specific minimum threshold score. Optionally, all comparisons may be reported.

The program outputs a list of the exons and introns it has found. The format is like that of MSPcrunch, ie a list of matching segments. This format is easy to parse into other software. The program also indicates, based on the splice site information, the gene's predicted direction of transcription. Optionally the full sequence alignment is printed as well.

Algorithm

The program uses a linear-space divide-and-conquer strategy (Myers and Miller, 1988; Huang, 1994) to limit memory use: 1. A first pass Smith-Waterman local alignment scan is done to find the start, end and score of the maximally scoring segments (including introns of course). No other alignment information is retained. 2. Subsequences corresponding to these segments are extracted 3a. If the product of the subsequences' lengths is less than a user-defined threshold (-space parameter), i.e. they will fit in memory, the segments are realigned using the Needleman-Wunsch global alignment algorithm, which will give the same result as the Smith-Waterman since the subsequences are guaranteed to align end-to-end. 3b. If the product of the lengths exceeds the threshold (a full alignment will not fit in memory) the alignment is made recursively by splitting the spliced (EST) sequence in half and finding the genome sequence position which aligns with the EST mid-point. The process is repeated until the product of the lengths is less than the threshold. The problem reduces to aligning the left-hand and right-hand portions of the sequences separately and merging the result. 4. The genome sequence is searched against the forward and reverse strands of the spliced (EST) sequence, assuming a forward gene splicing direction (i.e. GT/AG consensus). 5. Then the best-scoring orientation is realigned assuming reverse splicing (CT/AC consensus). The overall best alignment is reported. The worst-case run-time for the algorithm is about 3 times as long as would be taken to align using a quadratic-space program. In practice the maximal-scoring segment is often much shorter than the full genome length so the program runs only about 1.5 times slower.

The algorithm has the following steps:

A first-pass Smith-Waterman scan is done to locate the score, start and end of the maximal scoring segment (including introns of course). No other alignment information is retained.
Subsequences corresponding to the maximal-scoring segments are extracted. If the product of these subsequences' lengths is less than the area parameter then the segments are re-aligned using the Needleman-Wunsch algorithm, which in this instance will give the same result as the Smith-Waterman since they are guaranteed to align end-to-end.
If the product of lengths exceeds the area threshold then the alignment is recursively broken down by splitting the EST in half and finding the genome position which aligns with the EST mid-point. The problem then reduces to aligning the left-hand and right-hand portions of the sequences separately and merging the result.

The worst-case run-time for the algorithm is about 3 times as long as would be taken to align using a quadratic-space program. In practice the maximal-scoring segment is often much shorter than the full genome length so the program runs only about 1.5 times slower.

Usage

Command line arguments

Input file format

est2genome reads two nucleotide sequences. The first is an EST sequence (a single read or a finished cDNA). The second is a genomic finished sequence.

Output file format

MSP type segments

There are four types of segment,

each gapped Exon
each Intron (marked with a ? if it does not start GT and end AG)
the complete alignment Span
individual ungapped matching Segments.

The score for Exon segments is the alignment score excluding flanking intron penalties. The Span score is the total including the intron costs.

The coordinates of the genomic sequence always refer to the positive strand, but are swapped if the est has been reversed. The splice direction of Introns are indicated as +Intron (forward, splice sites GT/AG) or -Intron (reverse, splice sites CT/AC), or ?Intron (unknown direction). Segment entries give the alignment as a series of ungapped matching segments.

Full alignment

You get the alignment if the -align switch is set. The alignment includes the first and last 5 bases of each intron, together with the intron width. The direction of splicing is indicated by >>>> (forward) or <<<< (reverse) or ???? (unknown)

Data files

None

Notes

est2genome uses a linear-space dynamic-programming algorithm. It has the following parameters:

parameter               default         description

match                   1               score for matching two bases
mismatch                1               cost for mismatching two bases
gap_penalty             2               cost for deleting a single base in
                                        either sequence, 
                                        excluding introns
intron_penalty          40              cost for an intron, independent of
                                        length.
splice_penalty          20              cost for an intron, independent of
                                        length and starting/ending on
                                        donor-acceptor sites.

space                   10              Space threshold (in  megabytes) 
                                        for linear-space recursion. If the
                                        product of the two sequence 
                                        lengths divided by 4 exceeds this then
                                        a divide-and-conquer strategy is used 
                                        to control the memory requirements. 
                                        In this way very long sequences can
                                        be aligned. 
                                        If you have a machine with plenty of
                                        memory you can raise this parameter
                                        (but do not exceed the machine's
                                        physical RAM)
                                        However, normally you should not need
                                        to change this parameter.

There is no gap initiation cost for short gaps, just a penalty proportional to the length of the gap. Thus the cost of inserting a gap of length L in the EST is

 L*gap_penalty

and the cost in the genome is

 
min { L*gap_penalty, intron_penalty } or
min { L*gap_penalty, splice_penalty } if the gap starts with GT and ends with AG
                                     (or CT/AC if splice direction reversed)

Introns are not allowed in the EST. The difference between the intron_penalty and splice_penalty allows for some slack in marking the intron end-points. It is often the case that the best intron boundaries, from the point of view of minimising mismatches, will not coincide exactly with the splice consensus, so provided the difference between the intron/splice penalties outweighs the extra mismatch/indel costs the alignment will respect the proper boundaries. If the alignment still prefers boundaries which don't start and end with the splice consensus then this may indicate errors in the sequences.

The default parameters work well, except for very short exons (length less than the splice_penalty, approx) which may be skipped. The intron penalties should not be set to less that the maximum expected random match between the sequences (typically 10-15 bp) in order to avoid spurious matches.

References

Mott R. (1997) EST_GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Comput. Applic. 13:477-478
Huang X (1994) On global sequence alignment. Comput. Applic. Biosci. 10:227-235.
Myers, EW and Miller, W (1988) Optimal alignments in linear space. Comput. Applic. Biosci. 4:11-17
Smith, TE and Waterman, MS (1981) Identification of common molecular subsequences. J. Mol. Biol. 147:195-197

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It returns 0 unless an error occurs.

Known bugs

None.

Author(s)

This application was modified for inclusion in EMBOSS by

The original program was est_genome, written by Richard Mott at the Sanger Centre. The original version is available from ftp://ftp.sanger.ac.uk/pub/pmr/est_genome.4.tar.Z

Function

Description

Algorithm

Usage

Command line arguments

Input file format

Output file format

MSP type segments

Full alignment

Data files

Notes

References

Warnings

Diagnostic Error Messages

Exit status

Known bugs

Author(s)

History

Target users

Comments