Construction of a Non-redundant Exon/Intron Database for Homo sapiens

  • Abstract: An Exon/Intron Database (EID) built from GenBank records inherits the redundancy of GenBank. To overcome this shortcoming, a non-redundant EID was derived from the RefSeq (Reference Sequence) database. The Homo sapiens (human) RefSeq genome data (Build 36.3, last updated March 26, 2008) were downloaded from the NCBI homepage (http://www.ncbi.nlm.nih.gov/). A script written in Perl automatically parses the large volume of information in these source files and extracts the CDS (Coding Sequence) feature of every entry. After parsing, the data related to each protein-coding gene are collected into the EID: definition line, gene_id, gene sequence, protein_id, protein sequence, the number, sizes, and totals of exons and introns, introns in untranslated regions (UTRs), intron phase, and splice-site pattern. All 24 human chromosomes (22 autosomes and 2 sex chromosomes, 2,870,827,355 bp in total) were parsed, yielding 32,157 gene blocks (gene identifiers). Of these, 7,398 are pseudogenes, 4,014 are alternatively spliced (AS) genes, 15,533 contain introns in the CDS, 765 contain introns in the UTR, 2,585 contain no introns, and the remainder are imperfect genes.
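
The per-gene quantities listed in the abstract (exon and intron sizes, intron phase, splice-site pattern) can be illustrated with a minimal sketch. This is not the authors' Perl script: it is a Python illustration that assumes a forward-strand CDS location written in GenBank feature-table style (for example `join(1..10,21..27,38..41)`). The function names and the toy sequence are hypothetical, and reverse-strand (`complement`) locations are deliberately left out.

```python
import re
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive coordinates, as in GenBank feature tables


def parse_cds_location(location: str) -> List[Interval]:
    """Return the exon intervals of a forward-strand CDS location string."""
    if "complement" in location:
        raise ValueError("reverse-strand locations are outside this sketch")
    # Each 'start..end' pair is one CDS exon; '<' and '>' mark partial ends.
    pairs = re.findall(r"<?(\d+)\.\.>?(\d+)", location)
    return [(int(start), int(end)) for start, end in pairs]


def introns_between(exons: List[Interval]) -> List[Interval]:
    """An intron is the gap between two consecutive exons."""
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


def sizes(intervals: List[Interval]) -> List[int]:
    """Length in bases of each interval."""
    return [end - start + 1 for start, end in intervals]


def intron_phases(exons: List[Interval]) -> List[int]:
    """Phase = number of coding bases upstream of the intron, modulo 3."""
    phases, coding_bases = [], 0
    for start, end in exons[:-1]:
        coding_bases += end - start + 1
        phases.append(coding_bases % 3)
    return phases


def splice_site_patterns(sequence: str, introns: List[Interval]) -> List[str]:
    """First two and last two bases of each intron, e.g. 'GT-AG'."""
    patterns = []
    for start, end in introns:
        intron = sequence[start - 1:end].upper()
        patterns.append(f"{intron[:2]}-{intron[-2:]}")
    return patterns


# Toy gene (hypothetical data): three exons separated by two 10-base introns.
toy_sequence = "ATGGCCAAGG" + "GTAAGTTTAG" + "CTGAAGC" + "GCAAGTCCAG" + "ATAA"
toy_exons = parse_cds_location("join(1..10,21..27,38..41)")
toy_introns = introns_between(toy_exons)

print(sizes(toy_exons))                                 # [10, 7, 4]
print(sizes(toy_introns))                               # [10, 10]
print(intron_phases(toy_exons))                         # [1, 2]
print(splice_site_patterns(toy_sequence, toy_introns))  # ['GT-AG', 'GC-AG']
```

In this sketch a gene whose CDS location yields a single interval has no CDS introns, which corresponds to the intronless category in the counts above. Handling UTR introns would additionally require the mRNA feature coordinates, which are not modeled here.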

     
