cpgreport

Function

Description

cpgreport identifies regions of a nucleotide sequence in which the dinucleotide CG occurs more often than expected.

Each position in the sequence is scored using a running sum calculated from all positions in the sequence. This differs from the sliding-window method typically used to identify CpG islands, for example by newcpgreport and cpgplot. The running-sum method overpredicts islands but finds the smaller ones around primary exons. An output file is written with information on the CpG-rich regions that are found. A feature table of sequence features in these regions is also written.

Algorithm

cpgreport scores each position in the sequence using a running sum calculated from all positions in the sequence, starting with the first and ending with the last. If there is no CG dinucleotide at a position, the score is decremented; if there is one, the score is incremented by a constant (user-defined) value. If the score for a region in the sequence is higher than a threshold (currently 17), a putative island is declared. Sequence regions scoring above the threshold are searched for recursively.
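The scoring scheme described above can be sketched in Python as follows. This is a simplified illustration and not the EMBOSS source code: the function name find_cpg_regions, the decrement of 1, the default increment of 17, the clamping of the running sum at zero, and the way a region's boundaries are chosen are all assumptions made for the example. Only the threshold of 17 and the user-defined increment come from the description.

```python
def find_cpg_regions(seq, increment=17, threshold=17, decrement=1):
    """Return (start, end, score) tuples (0-based, inclusive) for putative islands.

    Hypothetical sketch: decrement, clamping at zero and the region
    boundaries are assumptions, not taken from cpgreport itself.
    """
    seq = seq.upper()
    regions = []

    def search(lo, hi):
        # Scan seq[lo:hi] with a running sum that never drops below zero.
        total = 0
        best = 0
        run_start = lo              # where the current positive run began
        best_start = best_end = lo
        for i in range(lo, hi - 1):
            if seq[i] == "C" and seq[i + 1] == "G":
                if total == 0:
                    run_start = i
                total += increment  # CG found: increment the score
            else:
                total = max(0, total - decrement)  # no CG: decrement
            if total > best:
                best, best_start, best_end = total, run_start, i + 1
        if best > threshold:
            # Declare a putative island, then search both flanks recursively.
            search(lo, best_start)
            regions.append((best_start, best_end, best))
            search(best_end + 1, hi)

    search(0, len(seq))
    return regions
```

Calling find_cpg_regions on a sequence returns the regions in left-to-right order. With the assumed defaults a single isolated CG scores exactly 17 and is not reported, because the score must be higher than the threshold.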

Usage

Command line arguments


Input file format

Any DNA sequence specified as a USA (Uniform Sequence Address).

Output file format

The first non-blank line of the output file 'rnu68037.cpgreport' is the title line giving the program name, the name of the sequence being analysed, and the start and end positions of the sequence.

The second non-blank line contains the headings of the columns.

Subsequent lines contain columns with the following information:

If the count of GpC in the region is zero, then the ratio of CG/GC is reported as '-'.
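The handling of a zero GpC count can be illustrated with a short Python sketch. The function name cg_gc_ratio and the two-decimal formatting are assumptions made for the example; only the '-' placeholder comes from the description above.

```python
def cg_gc_ratio(region):
    """Return the CG/GC dinucleotide ratio as text, or '-' if there is no GC.

    Hypothetical helper: the two-decimal format is an assumption.
    """
    region = region.upper()
    # str.count is safe here: neither "CG" nor "GC" can overlap itself.
    cg = region.count("CG")
    gc = region.count("GC")
    if gc == 0:
        return "-"  # ratio is undefined, so report a placeholder
    return f"{cg / gc:.2f}"
```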

Data files

None.

Notes

"CpG" refers to a C nucleotide immediately followed by a G. The 'p' in 'CpG' refers to the phosphate group linking the two bases. Regions of genomic sequence rich in the CpG pattern, or "CpG islands", are resistant to methylation and tend to be associated with genes that are frequently switched on. It has been estimated that about half of all mammalian genes, and possibly all mammalian house-keeping genes, have a CpG-rich region around their 5' end. Non-mammalian vertebrates have some CpG islands that are associated with genes, but the association becomes equivocal in more distant taxonomic groups. The detection of a CpG island upstream of predicted exons or genes is evidence in support of a highly expressed gene.

As there is no official definition of what a CpG island is or how to identify where one begins and ends, we work with two definitions and thus two methods. These are:

1. cpgplot and newcpgreport use a sliding window within which the Observed/Expected ratio of CpG is calculated. For a sequence region to be reported as a CpG island, it must satisfy the following constraints:

   Observed/Expected ratio > 0.6
   % C + % G > 50%
   Sequence Length > 200

2. newcpgseek and cpgreport use a running sum calculated from all positions in a sequence, rather than a window, to produce a score. If there is no CG dinucleotide at a position, the score is decremented; if there is one, the score is incremented by a constant (user-defined) value. If the score for a region in the sequence is higher than a threshold (currently 17), a putative island is declared. Sequence regions scoring above the threshold are searched for recursively.

This method overpredicts islands but finds the smaller ones around primary exons. newcpgseek uses the same method as cpgreport but its output is different and more readable. For most purposes you should probably use newcpgreport rather than cpgreport. newcpgreport is used to produce the human cpgisland database that you can find on the EBI's ftp server as well as on the EBI's SRS server.

newcpgseek and cpgreport both now display the actual CpG count, the (%C + %G) and the Observed/Expected ratio in the region where the score is above the threshold.
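The three region statistics, and the sliding-window constraints listed under the first method, can be sketched in Python as follows. The function names region_stats and meets_window_criteria are hypothetical. The Observed/Expected formula used here, (number of CpG x length) / (number of C x number of G), is the commonly used one and is assumed rather than taken from the programs' source.

```python
def region_stats(region):
    """Return CpG count, %C + %G and Observed/Expected CpG ratio for a region.

    Hypothetical helper; the Obs/Exp formula is the commonly used
    (CpG count * length) / (C count * G count), assumed here.
    """
    region = region.upper()
    n = len(region)
    c = region.count("C")
    g = region.count("G")
    cpg = region.count("CG")
    pct_gc = 100.0 * (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0        # expected number of CpG
    obs_exp = cpg / expected if expected else 0.0
    return {"cpg": cpg, "pct_gc": pct_gc, "obs_exp": obs_exp, "length": n}


def meets_window_criteria(stats):
    """Apply the three constraints quoted for cpgplot and newcpgreport."""
    return (
        stats["obs_exp"] > 0.6
        and stats["pct_gc"] > 50.0
        and stats["length"] > 200
    )
```

A region with no C or no G has no expected CpG, so the sketch reports a ratio of 0.0 for it rather than dividing by zero.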

The geecee program measures CG content in the entire input sequence and is not to be used to detect CpG islands. It can be useful for detecting sequences that MIGHT contain an island.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

0 if successful.

Known bugs

None. As there is no official definition of what a CpG island is, or worse, of where one begins and ends, we have to live with two definitions and thus two methods. These are:

1. newcpgseek and cpgreport both declare a putative island if the score is higher than a threshold (currently 17). They now also display the actual CpG count, the % CG and the observed/expected ratio in the region where the score is above the threshold. This scoring method, based on sums and frequencies, overpredicts islands but finds the smaller ones around primary exons. newcpgseek uses the same method as cpgreport but its output is different and more readable.

2. newcpgreport and cpgplot use a sliding window within which the Obs/Exp ratio of CpG is calculated. The important thing to note about this method is that an island, in order to be reported, must be a region that satisfies the following constraints:

   Obs/Exp ratio > 0.6
   % C + % G > 50%
   Length > 200.

For all practical purposes you should probably use newcpgreport. It is actually used to produce the human cpgisland database you can find on the EBI's ftp server as well as on the EBI's SRS server.

geecee measures CG content in the entire input sequence and is not to be used to detect CpG islands. It can be useful for detecting sequences that MIGHT contain an island.

Author(s)

This program was originally written by

The algorithm was modified for inclusion in EGCG under the name 'CPGSPANS' by

This application was modified for inclusion in EMBOSS by

History

Target users

Comments