GenBase: A Nucleotide Sequence Database

Gene sequences and their annotations (covering DNA, RNA, and protein sequences) are among the core data underpinning gene function research. Over the past few decades, with the rapid development of biology, life scientists in China have produced large volumes of gene sequence data. To meet the practical needs of Chinese researchers in collecting, managing, and sharing these data, there is an urgent need for a nucleotide data repository that adheres to international rules and standards, actively exchanges data with major global data centers, and provides enhanced data services for local and global researchers.

Researchers from the China National Center for Bioinformation recently developed GenBase, a gene sequence database. The work, titled "GenBase: A Nucleotide Sequence Database," was published online in the journal Genomics, Proteomics & Bioinformatics.

GenBase is an open-access data repository designed for archiving, searching, and sharing nucleotide sequences. It adopts GenBank’s data model and accepts submissions of diverse data types, such as messenger RNAs (mRNAs), genomic DNAs, non-coding RNAs (ncRNAs), organelles, viruses, plasmids, and phages, through a bilingual online submission portal that validates both generic and SARS-CoV-2 sequences in real time. In addition, GenBase integrates all sequences from GenBank with daily updates, making these international datasets freely and publicly available for distribution and sharing and easing data access for Chinese researchers.
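To make the idea of submission-time validation concrete, the following is a minimal sketch of the kind of check a portal could run on an uploaded FASTA file. It is an illustration only, not GenBase's actual validation logic: the function names `parse_fasta` and `validate_record` are hypothetical, and the rule applied (every residue must be an IUPAC nucleotide code) is an assumption about what a basic check might cover.

```python
# Hypothetical sketch: NOT GenBase's actual validation logic.
# Assumption: a record is acceptable if it has a header, a non-empty
# sequence, and only IUPAC nucleotide codes (gap characters not allowed).
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def validate_record(header, sequence):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    if not header:
        problems.append("empty header")
    if not sequence:
        problems.append("empty sequence")
    invalid = sorted(set(sequence.upper()) - IUPAC_NUCLEOTIDES)
    if invalid:
        problems.append("non-IUPAC characters: " + "".join(invalid))
    return problems


# Usage: the first record is clean, the second contains invalid letters.
example = ">seq1 demo\nACGTNNAC\nGGTA\n>seq2\nACGTXZ\n"
for header, seq in parse_fasta(example):
    print(header, validate_record(header, seq))
```

A production portal would presumably go much further, for example by checking annotations against the GenBank data model, but those rules are not described here and are left out of the sketch.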

As of August 1, 2024, GenBase had received 81,929 nucleotide sequences and 832,740 annotated protein sequences, submitted in 2,650 batches by 309 submitters from 197 organizations. Of the submitted data, 76,340 nucleotide sequences (93%) and 723,863 annotated protein sequences (87%) have been made publicly available, supporting the publication of 51 papers. Notably, of the 63,006 submitted SARS-CoV-2 genome sequences with standardized annotations, 59,913 have been released. In addition, to support local management of global gene sequence data, GenBase integrates about 580 million nucleic acid and protein sequences published by the International Nucleotide Sequence Database Collaboration (INSDC), making it more efficient for domestic researchers to query and retrieve data.
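The release percentages quoted above follow directly from the reported counts. The short sketch below recomputes them, using only the figures given in the text; the dictionary keys and the helper name `release_share` are illustrative choices.

```python
# Counts quoted in the text (as of August 1, 2024).
submitted = {
    "nucleotide": 81_929,
    "protein": 832_740,
    "sars_cov_2": 63_006,
}
released = {
    "nucleotide": 76_340,
    "protein": 723_863,
    "sars_cov_2": 59_913,
}


def release_share(kind):
    """Return the percentage of submitted records of `kind` that are public."""
    return 100.0 * released[kind] / submitted[kind]


for kind in submitted:
    print(f"{kind}: {release_share(kind):.1f}% released")
```

Rounded to whole numbers, this gives 93% for nucleotide sequences and 87% for annotated protein sequences, matching the text, and about 95% for the SARS-CoV-2 genomes.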

Figure: The overall architecture of GenBase