Abstract:
The Exon/Intron Database (EID) is redundant when it is constructed from GenBank records. To overcome this shortcoming, a non-redundant EID is derived from the RefSeq (Reference Sequence) database. The original Homo sapiens (human) genome databases (Build 36.3, March 26, 2008) were downloaded from the NCBI homepage (http://www.ncbi.nlm.nih.gov/). A script written in Perl automatically analyses the large amount of information contained in these original databases and extracts the CDS (Coding Sequence) field of every entry. After a complex parsing procedure, data related to eukaryotic genes (definition line, gene_id, gene sequence, protein_id, protein sequence, number of exons and introns, size of each exon and intron, sum of exons and introns, introns in UTRs, intron phase, and splice-site pattern) are collected into the EID. All human chromosomes (2,870,827,355 bp in total) are parsed, yielding 32,157 gene blocks. Among these genes, there are 7,398 pseudogenes, 4,014 alternatively spliced genes, 15,533 genes with introns in the CDS, 765 genes with introns in the UTR, 2,585 genes without introns, and other imperfect genes.
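
As an illustration of the kind of per-gene parsing described above, the following minimal sketch derives exon sizes, intron sizes, and intron phases from a GenBank-style CDS location string. It is not the authors' Perl script: it is written in Python, the function names `parse_cds_location` and `exon_intron_summary` and the example coordinates are hypothetical, and it assumes a forward-strand `join(...)` location in which every exon is written as `start..end`. Intron phase is taken here as the number of coding bases preceding the intron, modulo 3.

```python
import re


def parse_cds_location(location):
    """Return exon intervals as (start, end) pairs, 1-based and inclusive.

    Assumes a forward-strand location such as 'join(100..200,300..450)'.
    Partial-end markers ('<' and '>') are tolerated and ignored.
    """
    pairs = re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    return [(int(start), int(end)) for start, end in pairs]


def exon_intron_summary(location):
    """Compute exon/intron sizes and intron phases for one CDS location."""
    exons = parse_cds_location(location)
    exon_sizes = [end - start + 1 for start, end in exons]
    # An intron spans the bases strictly between two consecutive exons.
    intron_sizes = [nxt[0] - cur[1] - 1 for cur, nxt in zip(exons, exons[1:])]
    # Phase of each intron: coding bases accumulated before it, modulo 3.
    phases = []
    coding_bases = 0
    for size in exon_sizes[:-1]:
        coding_bases += size
        phases.append(coding_bases % 3)
    return {
        "exon_count": len(exon_sizes),
        "intron_count": len(intron_sizes),
        "exon_sizes": exon_sizes,
        "intron_sizes": intron_sizes,
        "exon_total": sum(exon_sizes),
        "intron_total": sum(intron_sizes),
        "intron_phases": phases,
    }


if __name__ == "__main__":
    # Hypothetical coordinates, chosen only for illustration.
    print(exon_intron_summary("join(100..200,300..450,600..700)"))
```

Reverse-strand (`complement(...)`) locations, single-base exons, and introns in UTRs are outside the scope of this sketch; handling them would require the mRNA feature and strand-aware ordering of the exons.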