EMBOSS

diffseq

Function

Description

diffseq reads two sequences which typically are very similar or almost identical. It finds regions of identity (exact matches) in the two sequences and reports on similarities and differences between the features of the two sequences within these regions. The output is a standard EMBOSS report file. The start and end positions of the regions of identity are reported. Any features that are shared and any differerences in features are reported. The original feature table of each sequence may also (optionally) be written to file.

Algorithm

diffseq searches for identical matches between all sequence words from both sequences. Identical sequence regions are found by creating a hash table of subsequences of user-defined size (-wordsize option), which is 10 by default. It then reduces the matches to a minimum set of overlapping matches by sorting them in order of size (largest size first). For each such match it removes any smaller matches that overlap. The result is a set of the longest regions of identity between the two sequences that do not overlap with each other. The mismatched regions between these matches are reported.

Usage

Command line arguments

Input file format

This program reads in two nucleic acid sequence USAs or two protein sequence USAs.

Output file format

By default diffseq writes a 'diffseq' report file.

The first line is the title giving the names of the sequences used.

The next two non-blank lines state the positions in each sequence where the detected overlap between them starts.

There then follows a set of reports of the mismatches between the sequences.
Each report consists of 4 or more lines.

The first line has the name of the first sequence followed by the start and end positions of the mismatched region in that sequence, followed by the length of the mismatched region. If the mismatched region is of zero length in this sequence, then only the position of the last matching base before the mismatch is given.
If a feature of the first sequence overlaps with this mismatch region, then one or more lines starting with 'Feature:' comes next with the type, position and tag field of the feature.
Next is a line starting "Sequence:" giving the sequence of the mismatch in the first sequence.

This is followed by the equivalent information for the second sequence, but in the reverse order, namely 'Sequence:' line, 'Feature:' lines and line giving the position of the mismatch in the second sequence.

At the end of the report are two non-blank lines giving the positions in each sequence where the detected overlap between them ends.

The last three lines of the report gives the counts of SNPs (defined as a change of one nucleotide to one other nucleotide, no deletions or insertions are counted, no multi-base changes are counted).

If the input sequences are nucleic acid, The counts of transitions (Pyrimide to Pyrimidine or Purine to Purine) and transversions (Pyrimidine to Purine) are also given.

It should be noted that not all features are reported.

The 'source' feature found in all EMBL/Genbank feature table entries is not reported as this covers all of the sequence and so overlaps with any difference found in that sequence and so is uninformative and irritating. It has therefore been removed from the output report.

The translation information of CDS features is often extremely long and does not add useful information to the report. It has therefore been removed from the output report.

Data files

None

Notes

diffseq is useful when looking for SNPs, differences between strains of an organism and anything else that requires the differences between two eseentially identical sequences to be highlighted.

Identical sequence regions are found by creating a hash table of subsequences of user-defined size (-wordsize option, which is 10 by default). Making this value larger (e.g. 20) may speed-up the program slightly, but will mean that any two differences within wordsize bases bases or residues of each other will be grouped as a single region of difference. This value may be made smaller to improve the resolution of nearby differences, but the program will go much slower.

The sequences can be very long; it should be possible to find differences between sequences that are Mega-bases long. If, however, you run out of memory, use a larger word size. This increases the length between mismatches that will be reported as one event. Thus a word size of 50 will report two single-base differences that are with 50 bases of each other as one mismatch.

References

None.

Warnings

Not all features are compared and reported. The 'source' feature found in all EMBL/Genbank feature table entries is not reported. This feature is uninformative; it covers the entire sequence and therefore would always be reported. The translation information of CDS features is often extremely long and does not add useful information to the report. It has therefore been removed from the output report.

By default diffseq finds regions of identity that are at least as long as the specified word-size. This is what's typically required when working with long overlapping nucleic acid sequences, where the non-overlapping sequence ends are less interesting. If however, you have protein sequences or short RNA sequences however, you may well be interested in differences at the very ends. The -globaldifferences option when set means the differences at the ends will also be reported.

Function

Description

Algorithm

Usage

Command line arguments

Input file format

Output file format

Data files

Notes

References

Warnings

Diagnostic Error Messages

Exit status

Known bugs

Author(s)

History

Target users

Comments