Databases for Biological Data

Now that we understand the building blocks of biological data, let us take a look at the various online databases that store them:

GenBank:

The GenBank sequence database is an open-access, annotated collection of all publicly available nucleotide sequences and their protein translations. It is produced and maintained by the National Center for Biotechnology Information (NCBI), part of the National Institutes of Health in the United States, as part of the International Nucleotide Sequence Database Collaboration (INSDC). GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 100,000 distinct organisms. The database was started in 1982 by Walter Goad and Los Alamos National Laboratory. GenBank has become an important database for research in biological fields and has grown exponentially in recent years, doubling roughly every 18 months. Release 242.0, produced in February 2021, contained over 12 trillion nucleotide bases in more than 2 billion sequences. GenBank is built from direct submissions by individual laboratories as well as bulk submissions from large-scale sequencing centers.
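To get a feel for what "doubling roughly every 18 months" implies, the short Python sketch below projects database size under idealized exponential growth. The starting value is the rounded base count quoted above for release 242.0; the function name and the projected values are illustrative assumptions, not actual GenBank release statistics.

```python
def projected_size(initial: float, months_elapsed: float,
                   doubling_months: float = 18.0) -> float:
    """Size after months_elapsed under idealized doubling growth."""
    return initial * 2 ** (months_elapsed / doubling_months)

# Rounded figure quoted above for release 242.0 (February 2021).
bases_feb_2021 = 12e12

# Yearly growth factor implied by an 18-month doubling time.
annual_factor = projected_size(1.0, 12)
print(f"Implied growth per year: x{annual_factor:.2f}")

# Idealized projections only; real releases will deviate from this curve.
for months in (18, 36, 54):
    size = projected_size(bases_feb_2021, months)
    print(f"+{months} months: {size:.1e} bases")
```

An 18-month doubling time corresponds to growth of a little under 60% per year, which is why storage and search tools for sequence data have to be designed with scale in mind.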

EMBL:

The European Molecular Biology Laboratory (EMBL) is a molecular biology research institution supported by 27 member states, two prospect states, and one associate member state. EMBL was created in 1974 and is an intergovernmental organization funded by public research money from its member states. Research at EMBL is conducted by approximately 85 independent groups covering the spectrum of molecular biology. The list of independent groups at EMBL can be found at www.embl.org. The Laboratory operates from six sites: the main laboratory in Heidelberg, and sites in Hinxton, England (the European Bioinformatics Institute, EBI), Grenoble (France), Hamburg (Germany), Rome (Italy), and Barcelona (Spain). EMBL groups and laboratories perform basic research in molecular biology and molecular medicine and train scientists, students, and visitors. The organization aids in the development of services, new instruments and methods, and technology in its member states. Israel is the only full member state located outside Europe.

PDB:

The Protein Data Bank (PDB) is a database for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids. The data are typically obtained by X-ray crystallography, NMR spectroscopy, or, increasingly, cryo-electron microscopy, and are submitted by biologists and biochemists from around the world. They are freely accessible on the Internet via the websites of the PDB's member organizations (PDBe, PDBj, RCSB, and BMRB). The PDB is overseen by an organization called the Worldwide Protein Data Bank (wwPDB). The PDB is key in areas of structural biology, such as structural genomics. Most major scientific journals and some funding agencies now require scientists to submit their structure data to the PDB. Many other databases use protein structures deposited in the PDB. For example, SCOP and CATH classify protein structures, while PDBsum provides a graphic overview of PDB entries using information from other sources, such as Gene Ontology.

FlyBase:

FlyBase is an online bioinformatics database and the primary repository of genetic and molecular data for the insect family Drosophilidae. For the most extensively studied species and model organism, Drosophila melanogaster, a wide range of data is presented in different formats. Information in FlyBase originates from a variety of sources ranging from large-scale genome projects to primary research literature. The data types include mutant phenotypes, molecular characterization of mutant alleles and other deviations, cytological maps, wild-type expression patterns, anatomical images, transgenic constructs and insertions, sequence-level gene models, and molecular classification of gene product functions. Query tools allow navigation of FlyBase through DNA or protein sequence, by gene or mutant name, or through terms from the several ontologies used to capture functional, phenotypic, and anatomical data. These tools are intended to provide efficient access to the available data and to facilitate the discovery of significant relationships within the database. Links between FlyBase and external databases, such as BDGP or modENCODE, provide opportunities for further exploration into other model organism databases and other resources of biological and molecular information. The FlyBase project is carried out by a consortium of Drosophila researchers and computer scientists at Harvard University and Indiana University in the United States, and the University of Cambridge in the United Kingdom.

UniProt:

UniProt is a freely accessible database of protein sequences and functional information, with many entries derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. It is maintained by the UniProt consortium, which comprises the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR). EBI, located at the Wellcome Trust Genome Campus in Hinxton, UK, hosts a large resource of bioinformatics databases and services. SIB, located in Geneva, Switzerland, maintains the ExPASy (Expert Protein Analysis System) servers that are a central resource for proteomics tools and databases. PIR, hosted by the National Biomedical Research Foundation (NBRF) at the Georgetown University Medical Center in Washington, DC, US, is the heir to the oldest protein sequence database, Margaret Dayhoff's Atlas of Protein Sequence and Structure, first published in 1965. In 2002, EBI, SIB, and PIR joined forces as the UniProt consortium.

SCOP:

The Structural Classification of Proteins (SCOP) database is a largely manual classification of protein structural domains based on similarities of their structures and amino acid sequences. A motivation for this classification is to determine the evolutionary relationship between proteins. Proteins with the same shapes but little sequence or functional similarity are grouped into "superfamilies", and are assumed to have only a very distant common ancestor. Proteins having the same shape and some similarity of sequence and/or function are placed in "families", and are assumed to have a closer common ancestor. Like the CATH and Pfam databases, SCOP classifies the individual structural domains of proteins rather than entire proteins, which may include a significant number of different domains. The SCOP database is freely accessible on the internet. SCOP was created in 1994 in the Centre for Protein Engineering and the Laboratory of Molecular Biology. It was maintained by Alexey G. Murzin and his colleagues in the Centre for Protein Engineering until its closure in 2010 and subsequently at the Laboratory of Molecular Biology in Cambridge, England.

CATH:

The CATH Protein Structure Classification database is a free, publicly available online resource that provides information on the evolutionary relationships of protein domains. It was created in the mid-1990s by Professor Christine Orengo and colleagues including Janet Thornton and David Jones, and continues to be developed by the Orengo group at University College London. CATH shares many broad features with the SCOP resource; however, there are also many areas in which the detailed classification differs greatly.

PROSITE:

PROSITE is an annotated collection of motif descriptors dedicated to the identification of protein families and domains. The motif descriptors used in PROSITE are either patterns or profiles, which are derived from multiple alignments of homologous sequences. The database consists of entries describing protein families, domains, and functional sites, as well as the amino acid patterns and profiles found in them. These are manually curated by a team from the Swiss Institute of Bioinformatics and tightly integrated into Swiss-Prot protein annotation. PROSITE was created in 1988 by Amos Bairoch, who directed the group for more than 20 years. Since July 2018, the director of PROSITE and Swiss-Prot has been Alan Bridge. PROSITE's uses include identifying possible functions of newly discovered proteins and analyzing known proteins for previously undetermined activity. Properties from well-studied genes can be propagated to biologically related organisms, and for different or poorly known genes, biochemical functions can be predicted from similarities. PROSITE offers tools for protein sequence analysis and motif detection. It is part of the ExPASy proteomics analysis servers.
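A PROSITE pattern is close in spirit to a regular expression, so a pattern can be scanned against a sequence with a few lines of Python. The sketch below translates the most common pattern elements into Python regex syntax and applies the N-glycosylation site pattern N-{P}-[ST]-{P}. The helper name prosite_to_regex and the toy sequence are assumptions made for illustration; the translation covers only the basic syntax and is not the scanning code PROSITE itself uses.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    regex = pattern.rstrip(".")                         # patterns end with a period
    regex = regex.replace("-", "")                      # '-' only separates elements
    regex = regex.replace("x", ".")                     # 'x' matches any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} means "any residue but P"
    regex = regex.replace("(", "{").replace(")", "}")   # x(2,4) becomes .{2,4}
    regex = regex.replace("<", "^").replace(">", "$")   # N- and C-terminal anchors
    return regex

# N-glycosylation site pattern: N, then not P, then S or T, then not P.
glyco = prosite_to_regex("N-{P}-[ST]-{P}.")

# Hypothetical toy sequence, not a real protein.
sequence = "MKNGSAANPTQNVTK"

# Collect (0-based start, matched residues) for each non-overlapping hit.
matches = [(m.start(), m.group()) for m in re.finditer(glyco, sequence)]
for start, residues in matches:
    print(start, residues)
```

In this toy sequence the stretch beginning N-P-T is skipped because proline is excluded at the second position, which is exactly the kind of rule a pattern can express and a plain substring search cannot.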

PRINTS:

PRINTS is a database of protein family 'fingerprints' offering a diagnostic resource for newly determined sequences. In contrast to PROSITE, which uses single consensus expressions to characterize particular families, PRINTS exploits groups of motifs to build characteristic signatures.
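The difference between a single consensus expression and a group of motifs can be sketched in a few lines of Python. Here a fingerprint is modeled as an ordered list of short motifs, and a sequence is checked for how many of them occur in order. This is a deliberately simplified assumption: the motifs, sequences, and function name are made up for illustration, and the real PRINTS scanning procedure is more involved than regex matching.

```python
import re

def fingerprint_hits(sequence, motifs):
    """Return start positions of the motifs found in order, stopping at the first miss."""
    positions = []
    cursor = 0
    for motif in motifs:
        match = re.compile(motif).search(sequence, cursor)
        if match is None:
            break
        positions.append(match.start())
        cursor = match.end()  # the next motif must occur after this one
    return positions

# Hypothetical three-motif fingerprint and toy sequences.
fingerprint = ["G[AS]GK", "D..G", "W[FY]R"]
full_match = "MAGSGKTTLDAAGQEWFRK"
partial_match = "MAGSGKTTLDAAGQEAAAK"

for name, seq in (("full", full_match), ("partial", partial_match)):
    hits = fingerprint_hits(seq, fingerprint)
    print(f"{name}: {len(hits)} of {len(fingerprint)} motifs at {hits}")
```

Because several motifs have to agree, a partial match (two of three motifs here) still carries information, whereas a single pattern can only report a hit or a miss.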

Pfam:

The Pfam database is a widely used resource for classifying protein sequences into families and domains. Between the previous published description of Pfam and release 33.1, over 350 new families were added and numerous improvements were made to existing entries. To facilitate research on COVID-19, the Pfam team revised the entries that cover the SARS-CoV-2 proteome and built new entries for regions that were not previously covered. Pfam-B has also been reintroduced; it provides an automatically generated supplement to Pfam and contains 136,730 novel clusters of sequences that are not yet matched by a Pfam family. The new Pfam-B is based on clustering by the MMseqs2 software. In addition, all of the regions in RepeatsDB have been compared with those in Pfam, and the results are being used to build and refine Pfam repeat families. Pfam is freely available for browsing and download at http://pfam.xfam.org/.

InterPro:

InterPro is a database of protein families, domains, and functional sites in which identifiable features found in known proteins can be applied to new protein sequences in order to functionally characterize them. The contents of InterPro consist of diagnostic signatures and the proteins that they significantly match. The signatures consist of models (simple types, such as regular expressions, or more complex ones, such as hidden Markov models) which describe protein families, domains, or sites. Models are built from the amino acid sequences of known families or domains and are subsequently used to search unknown sequences (such as those arising from novel genome sequencing) in order to classify them. Each of the member databases of InterPro contributes towards a different niche, from very high-level, structure-based classifications (SUPERFAMILY and CATH-Gene3D) to quite specific sub-family classifications (PRINTS and PANTHER). InterPro's intention is to provide a one-stop shop for protein classification, where all the signatures produced by the different member databases are placed into entries within the InterPro database. Signatures that represent equivalent domains, sites, or families are put into the same entry, and entries can also be related to one another. Additional information such as a description, consistent names, and Gene Ontology (GO) terms is associated with each entry, where possible.
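The idea of placing equivalent signatures from different member databases into one entry can be pictured as a simple grouping step. The Python sketch below is only an illustration of that data model: the member database names come from the text above, but every signature and entry identifier is a made-up placeholder, and the function name is an assumption rather than part of any InterPro tool.

```python
from collections import defaultdict

# (member database, signature ID, entry ID); all IDs are hypothetical placeholders.
signatures = [
    ("SUPERFAMILY", "sig_sf_1", "ENTRY_A"),
    ("CATH-Gene3D", "sig_g3d_1", "ENTRY_A"),
    ("PRINTS", "sig_pr_1", "ENTRY_B"),
    ("PANTHER", "sig_pthr_1", "ENTRY_B"),
    ("PRINTS", "sig_pr_2", "ENTRY_C"),
]

def group_by_entry(rows):
    """Collect the signatures that were assigned to the same entry."""
    entries = defaultdict(list)
    for database, signature_id, entry_id in rows:
        entries[entry_id].append((database, signature_id))
    return dict(entries)

entries = group_by_entry(signatures)
for entry_id, members in entries.items():
    sources = ", ".join(f"{db}:{sig}" for db, sig in members)
    print(f"{entry_id} <- {sources}")
```

A user who looks up one entry then sees every member-database signature that describes the same domain, site, or family, instead of having to query each source separately.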

MINT:

The Molecular INTeraction database (MINT) focuses on experimentally verified protein-protein interactions mined from the scientific literature by expert curators. Protein interaction databases store, in computer-readable form, the interaction information scattered across the scientific literature, and well-organized, easily accessible databases permit the retrieval and analysis of large interaction data sets. MINT was designed to store data on functional interactions between proteins. Beyond cataloging binary complexes, it was conceived to store other types of functional interactions, including enzymatic modifications of one of the partners.

Allermatch:

Novel proteins entering the food chain, for example through genetic modification of plants, have to be tested for allergenicity. Allermatch (http://allermatch.org) is a web tool for the efficient and standardized prediction of the potential allergenicity of proteins and peptides according to the current recommendations of the FAO/WHO Expert Consultation, as outlined in the Codex Alimentarius.

GEO:

The Gene Expression Omnibus (GEO) is a public functional genomics data repository supporting MIAME-compliant data submissions. Both array-based and sequence-based data are accepted. Tools are provided to help users query and download experiments and curated gene expression profiles.

OMIM:

Online Mendelian Inheritance in Man (OMIM) is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. The full-text, referenced overviews in OMIM contain information on all known Mendelian disorders and over 16,000 genes. OMIM focuses on the relationship between phenotype and genotype, and its entries contain copious links to other genetic resources. The database was initiated in the early 1960s by Dr. Victor A. McKusick as a catalog of Mendelian traits and disorders, entitled Mendelian Inheritance in Man (MIM). Twelve book editions of MIM were published between 1966 and 1998. The online version, OMIM, was created in 1985 by a collaboration between the National Library of Medicine and the William H. Welch Medical Library at Johns Hopkins. It was made generally available on the Internet starting in 1987. In 1995, OMIM was developed for the World Wide Web by NCBI, the National Center for Biotechnology Information. OMIM is authored and edited at the McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, under the direction of Dr. Ada Hamosh.

CEBS:

The Chemical Effects in Biological Systems (CEBS) database houses data of interest to environmental health scientists. CEBS is a public resource and has received depositions of data from academic, industrial, and governmental laboratories. Data in CEBS are housed in a relational database designed to display data in the context of biology and study design, and to permit data integration for cross-study analysis, knowledge generation, and novel meta-analysis.

