Some Techniques for Constructing a Non-Redundant EID

SOME IMPORTANT PROBLEMS IN CONSTRUCTION OF NON-REDUNDANT EID

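  • Summary: The Exon/Intron Database (EID) built from GenBank contains a large amount of redundant data. To address this redundancy, a non-redundant EID was constructed from RefSeq. RefSeq is a reference sequence collection maintained and updated by NCBI staff; it provides a stable reference for genome annotation, gene identification, gene mutation and polymorphism analysis, expression studies, and comparative analyses. The resulting EID can be used for large-scale analysis of exon/intron structure and for research on intron splicing, and it has internal mechanisms to control data quality and possible errors. A further improvement is that it now includes data on the untranslated regions (UTRs) of gene sequences. This paper describes some techniques for constructing the RefSeq-based non-redundant EID.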

     

    Abstract: The Exon/Intron Database (EID) built from GenBank contains a large amount of redundant data. To resolve this problem, a non-redundant EID was constructed from RefSeq. RefSeq is a sequence database maintained and updated by NCBI staff for medical, functional, and diversity studies; it provides a consistent reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis, expression studies, and comparative analyses. The new EID is well suited to large-scale computational investigation of exon/intron structure and splicing. It has many internal filters that control for sequence quality, consistency of gene descriptions, accordance with standards, and possible errors. It also newly includes data on the untranslated regions (UTRs) of gene sequences. This paper addresses some issues in the construction of the non-redundant EID.

     
