degapseq reads one or more sequences and writes them out again, stripped of any non-alphabetic characters. Its main purpose is to remove gap characters from aligned sequences, but it also removes other symbols, such as the translation STOP symbol ('*') in a protein sequence.
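The stripping behaviour described above can be illustrated with a minimal Python sketch. This is not the degapseq implementation; the function name `degap` is hypothetical, and the sketch assumes that "alphabetic" means the ASCII letters A-Z and a-z.

```python
import string

# Assumption: "alphabetic" means ASCII letters only.
_LETTERS = frozenset(string.ascii_letters)


def degap(sequence: str) -> str:
    """Return the sequence with every non-alphabetic character removed."""
    return "".join(ch for ch in sequence if ch in _LETTERS)


if __name__ == "__main__":
    # Gap characters of several kinds are dropped from an aligned sequence.
    print(degap("ACG-T..A~C"))
    # The translation STOP symbol is dropped from a protein sequence.
    print(degap("MKV*"))
```

Filtering on letters, rather than listing gap characters to delete, is what makes symbols such as '*' disappear along with the gaps.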
The input sequence can be nucleic or protein.
The input sequence can be gapped or ungapped.
There are many different formats for storing molecular sequences in files. Some formats are specifically for aligned sequences, where gaps are inserted into the sequences for purposes of alignment. The characters used to indicate gaps depend on the format in question, but commonly include '.', '-' and '~'. Some formats use more than one character to distinguish different types of gap. For example, gaps at the sequence ends, internal gaps, gaps inserted by a program and gaps inserted manually by a person editing the alignment may all be denoted with different characters.
EMBOSS uses only the dash character ('-') to indicate gaps. When an EMBOSS program reads a sequence with gaps, all gap characters are changed internally to a dash ('-'). Thus any characters that distinguish different gap types are converted to '-' on output.
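The gap normalisation described above can be sketched in Python as follows. This is an illustration only, not EMBOSS code: the function name `normalise_gaps` is hypothetical, and the set of gap characters is assumed to be just the three mentioned in the text ('.', '-' and '~'). The characters EMBOSS actually recognises when reading a given format may differ.

```python
# Assumption: only these three characters are treated as gaps.
GAP_CHARS = ".-~"

# Translation table mapping every assumed gap character to a dash.
_TO_DASH = str.maketrans({ch: "-" for ch in GAP_CHARS})


def normalise_gaps(sequence: str) -> str:
    """Replace every assumed gap character with '-', leaving residues as-is."""
    return sequence.translate(_TO_DASH)


if __name__ == "__main__":
    # End gaps ('~') and internal gaps ('.') both become '-'.
    print(normalise_gaps("~~ACG..T-A~~"))
```

After this step the distinction between gap types is lost, which matches the statement that only '-' appears on output.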