ORGANISATION AND STANDARDISATION OF INFORMATION IN SWISS-PROT AND TREMBL

SWISS-PROT is a curated, non-redundant protein sequence database which provides a high level of annotation and is integrated with a large number of other biological databases. It is supplemented by TrEMBL, a computer-annotated database which contains translations of all coding sequences in the EMBL Nucleotide Sequence Database which are not yet in SWISS-PROT. Each fully curated SWISS-PROT entry contains as much up-to-date information as possible from a variety of sources and the high quality of the annotation in SWISS-PROT provides the basis for the procedure which is used to automatically annotate the TrEMBL database. The large amounts of different data types found in both databases are stored in a highly structured and uniform manner and this structured organisation means that SWISS-PROT and TrEMBL together provide a comprehensive resource with data that are readily accessible for users and easily retrievable by computer programs.


INTRODUCTION
SWISS-PROT (Bairoch & Apweiler, 2000) is a curated protein sequence database which is maintained collaboratively by the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI), an outstation of the European Molecular Biology Laboratory (EMBL). The database distinguishes itself from other protein sequence databases by three distinct criteria: (i) It provides a high level of annotation. The entries are annotated by a team of biologists who use a variety of sources such as scientific literature, other databases, prediction programs and the help of external experts to add as much accurate and up-to-date information as possible to each entry. (ii) It is non-redundant, which means that all reports for a given protein are merged into a single entry, thus summarising many pages of scientific literature into a concise but comprehensive report. (iii) It provides a high level of integration with other databases. Cross-references are provided to other sequence databases as well as to specialised data collections. Currently, there are cross-references to more than 40 different databases and this allows users to access a large amount of additional information related to a particular protein.
SWISS-PROT is supplemented by TrEMBL (Bairoch & Apweiler, 2000), a computer-annotated database which contains translations of all coding sequences in the EMBL Nucleotide Sequence Database (Stoesser, Baker, van den Broek, Camon, Garcia-Pastor, Kanz et al., 2002) which are not yet in SWISS-PROT. TrEMBL was created in 1996 in response to the dramatic increase in data from genome sequencing projects and allows these sequences to be made publicly available as quickly as possible without diluting the high quality annotation found in SWISS-PROT.
The information in SWISS-PROT and TrEMBL is highly organised and structured. This is achieved through standardisation in a number of areas, including data storage and database management, data format, syntax and semantics of data items, and data analysis and automation of annotation. Each of these areas is discussed in detail below.

DATA STORAGE AND DATABASE MANAGEMENT
Although SWISS-PROT and TrEMBL are currently distributed as flatfile databases, both databases are now stored in ORACLE and, in the near future, production of the databases will switch to this system. This relational version of the databases is based on the relational schema used by the EMBL Nucleotide Sequence Database and shares as many parts of the EMBL schema as possible. Around the database, there is a C++ wrapper which allows for basic operations on the data such as loading and unloading of entries, creation of releases, and updates; this is a modified version of the code used in the EMBL database. Thus, starting from the EMBL schema and code, which cater for nucleic acid entries, a modified schema and code have been designed to accommodate SWISS-PROT and TrEMBL protein entries. This keeps the code as compatible with EMBL as possible and allows for easier maintenance.

DATA FORMAT
The SWISS-PROT and TrEMBL databases share a common data format, which means that all line types used in SWISS-PROT are also used in TrEMBL and, wherever possible, a particular line type has the same format in both databases. There are some necessary exceptions to this shared format. For example, the data class used in the ID (identification) line of a SWISS-PROT entry is "STANDARD", which shows that the entry has been fully curated, whereas in TrEMBL it is "PRELIMINARY", which shows that the entry has not yet been manually curated. However, apart from a small number of such differences, the format in both databases is identical.
The format of the SWISS-PROT and TrEMBL databases follows that of the EMBL Nucleotide Sequence Database as closely as possible so that the general structure of an entry is identical in all three databases. This means that many line types found in the EMBL database are also present in SWISS-PROT and TrEMBL and have the same format as that used in EMBL. There are some differences, such as line types defined in one database but not in the others or slight differences between the databases within a given line type but, where possible, all three databases share the same format.
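Because every line of an entry carries its line type, a program can split an entry into line types before interpreting any of them. The following Python fragment is a minimal sketch of this idea, not a reference parser. It assumes the flatfile layout of a two-letter line code followed by three spaces and the line content, with "//" terminating the entry; the entry name, accession number and helper names (group_by_line_type, data_class) are hypothetical and chosen for illustration only.

```python
from collections import defaultdict

# Hypothetical, heavily shortened entry used only for illustration.
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN   STANDARD;      PRT;   120 AA.
AC   P00000;
DE   Hypothetical example protein.
KW   Hydrolase; Transmembrane.
//
"""


def group_by_line_type(entry_text):
    """Map each two-letter line code to the list of its line contents."""
    lines_by_code = defaultdict(list)
    for line in entry_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("//"):
            break  # terminator line ends the entry
        # Assumed layout: columns 1-2 hold the code, content starts at column 6.
        code, content = line[:2], line[5:]
        lines_by_code[code].append(content)
    return dict(lines_by_code)


def data_class(entry_text):
    """Return the data class token of the ID line, e.g. STANDARD or PRELIMINARY."""
    id_content = group_by_line_type(entry_text)["ID"][0]
    # Assumed ID content: entry name, then the data class terminated by ';'.
    return id_content.split()[1].rstrip(";")
```

Under these assumptions, data_class distinguishes a curated SWISS-PROT entry from a TrEMBL entry by inspecting a single line, which is the kind of uniform access the shared format is meant to allow.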

Data types
The SWISS-PROT and TrEMBL databases consist of sequence entries (Figure 1) which are composed of different line types, each one having its own specified format. A full list of the line types used can be found in the SWISS-PROT user manual (SWISS-PROT, 2001). In SWISS-PROT, two classes of data can be distinguished: core data and annotation.

Core data
The core data is generally provided by the submitter of the sequence and consists of sequence data which come either from the translation of the corresponding nucleotide sequence in the EMBL Nucleotide Sequence Database or from submissions to SWISS-PROT in the case of peptide sequences; citation information which shows where the data has been published or, if unpublished, to which database it has been submitted; and taxonomic data which shows the biological source of the protein.

Annotation
The SWISS-PROT database strives to provide a high level of annotation and this is achieved through extraction of relevant information from scientific literature and rigorous sequence analysis by a team of biologists. Use is also made of external experts who have been recruited to send us their comments and updates concerning specific groups of proteins. This process allows the addition of as much correct and up-to-date information as possible about each protein, including descriptions of properties such as function(s) of the protein, post-translational modifications, domains and sites, secondary and quaternary structure, similarities to other proteins, diseases associated with deficiencies in a protein, developmental stages in which the protein is expressed, tissues in which the protein is found, pathways in which the protein is involved, and sequence conflicts and variants. The annotation is stored mainly in the comment or CC lines, the feature table or FT lines, and the keyword or KW lines. There are currently (Release 40) more than 300,000 CC lines, 470,000 FT lines and 300,000 keywords in SWISS-PROT.

CC lines
The comment or CC lines are free text comments which are used to convey any useful information about a protein. The information in the CC lines is contained in a number of defined topics, which allows the easy retrieval of specific categories of data from the database. A full list of the currently used comment topics and their definitions is shown in Table 1.
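Retrieval by topic can be illustrated with a short Python sketch. It assumes that each comment block opens with "-!- TOPIC: " and that continuation lines are indented, as in the distributed flatfiles; the function name parse_cc_topics and the example comment text are hypothetical.

```python
import re

# Assumed block opener: '-!- ' + upper-case topic + ': ' + free text.
_TOPIC_START = re.compile(r"-!- ([A-Z][A-Z ]*): ?(.*)")


def parse_cc_topics(cc_contents):
    """Group CC line contents (line code already removed) by comment topic.

    Returns a dict mapping each topic to a list of comment texts, one per
    block, because a topic may occur more than once in an entry.
    """
    topics = {}
    current = None
    for content in cc_contents:
        match = _TOPIC_START.match(content)
        if match:
            current = match.group(1)
            topics.setdefault(current, []).append(match.group(2))
        elif current is not None:
            # Continuation line: append to the most recent block of the topic.
            topics[current][-1] += " " + content.strip()
    return topics


example = [
    "-!- FUNCTION: Binds the example",
    "    ligand.",
    "-!- SUBCELLULAR LOCATION: Cytoplasmic.",
]
comments = parse_cc_topics(example)
```

With the topics separated in this way, a query such as "all entries with a SUBCELLULAR LOCATION comment" reduces to a dictionary lookup rather than a free-text search.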

FT lines
The feature table or FT lines provide a way of annotating position-specific data relating to the sequence. The lines have a fixed format and a defined set of feature keys which may be used. These feature keys describe domains and sites of interest within a sequence such as post-translationally modified residues, binding sites, enzyme active sites, secondary structure, and any other regions of interest. The full list of currently defined feature keys can be found in the SWISS-PROT user manual (SWISS-PROT, 2001).
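The fixed format makes position-based queries straightforward. The Python sketch below is illustrative only: it assumes an FT line content of the form "KEY from to description" with plain integer positions, and it splits on whitespace rather than on the exact column positions given in the user manual. The names Feature, parse_ft_line and features_covering are hypothetical, and uncertain or open-ended positions are not handled.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    key: str          # feature key, e.g. DOMAIN or ACT_SITE
    start: int        # first residue covered (1-based)
    end: int          # last residue covered (inclusive)
    description: str  # free-text description, may be empty


def parse_ft_line(content):
    """Parse one FT line content (line code already removed) into a Feature."""
    parts = content.split(None, 3)
    key, start, end = parts[0], int(parts[1]), int(parts[2])
    description = parts[3].strip() if len(parts) > 3 else ""
    return Feature(key, start, end, description)


def features_covering(features, position):
    """Return the features whose range includes the given residue position."""
    return [f for f in features if f.start <= position <= f.end]


example_features = [
    parse_ft_line("DOMAIN       23    110       Example domain."),
    parse_ft_line("ACT_SITE     57     57       Charge relay system."),
]
```

For instance, features_covering(example_features, 57) returns both features, since residue 57 is an active site lying within the annotated domain.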

Keywords
The keywords are found in the keyword or KW lines of an entry. They serve as a subject reference for each sequence and assist in the retrieval of specific categories of data from the database. A controlled list of approximately 800 keywords, each with a definition to clarify its biological meaning and intended usage, is maintained. The full list of currently defined keywords is available at http://www.expasy.org/cgi-bin/keywlist.pl.
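A controlled vocabulary also allows keywords to be checked mechanically. The sketch below assumes that keywords in the KW lines are separated by semicolons and that the last one ends with a full stop; the function names and the three-item controlled list are hypothetical stand-ins for the real list of approximately 800 keywords.

```python
def parse_keywords(kw_contents):
    """Split KW line contents (line code already removed) into keywords."""
    joined = " ".join(kw_contents).strip().rstrip(".")
    return [kw.strip() for kw in joined.split(";") if kw.strip()]


def unknown_keywords(keywords, controlled_list):
    """Return the keywords that are absent from the controlled list."""
    return [kw for kw in keywords if kw not in controlled_list]


# Hypothetical miniature controlled list, for illustration only.
CONTROLLED = {"Hydrolase", "Transmembrane", "Zinc"}

keywords = parse_keywords(["Hydrolase; Transmembrane;", "Zinc."])
```

A check of this kind can flag a misspelt or undefined keyword before an entry is released.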

Annotation bottleneck
To produce a fully curated SWISS-PROT entry containing all of the above types of data is a highly labour-intensive process. This is the rate-limiting step in the production of SWISS-PROT, as entries come into TrEMBL more quickly than they can be manually annotated and integrated into SWISS-PROT, thus creating a bottleneck of entries awaiting annotation. While it is necessary to maintain the high standard of annotation in SWISS-PROT, it is also vital to enhance the annotation of the proteins in TrEMBL, many of which are uncharacterised and about which very little functional information is known. This problem can be partly overcome by automatic annotation of TrEMBL entries (Apweiler, 2001).

Overcoming the annotation bottleneck by automatic annotation
For automatic annotation, a novel system of standardised transfer of annotation from well-characterised proteins in SWISS-PROT to unannotated TrEMBL entries has been developed (Fleischmann, Moeller, Gateau & Apweiler, 1999). Using this system, a TrEMBL entry is reliably recognised by a given method as being a member of a certain group of proteins. The annotation shared by the functionally characterised SWISS-PROT proteins of the group is then extracted and is assigned to the unannotated TrEMBL entry.
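The transfer step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure actually used for TrEMBL: "annotation shared by the group" is interpreted here as the set intersection of the annotation items of all characterised group members, each item is modelled as a (line type, value) pair, and the function names and example values are hypothetical.

```python
def shared_annotation(group_members):
    """Return the annotation items common to every characterised group member.

    group_members: list of sets of (line_type, value) pairs, one set per
    SWISS-PROT entry in the group.
    """
    if not group_members:
        return set()
    return set.intersection(*group_members)


def transfer_annotation(trembl_annotation, group_members):
    """Add the group's shared annotation to a TrEMBL entry's existing items."""
    return set(trembl_annotation) | shared_annotation(group_members)


# Hypothetical group of two characterised SWISS-PROT entries.
group = [
    {("KW", "Hydrolase"), ("KW", "Zinc"), ("CC", "SUBCELLULAR LOCATION: Cytoplasmic.")},
    {("KW", "Hydrolase"), ("KW", "Zinc"), ("CC", "SUBCELLULAR LOCATION: Secreted.")},
]
annotated = transfer_annotation({("KW", "Hypothetical protein")}, group)
```

In this example only the two keywords are transferred; the subcellular location differs between the group members and is therefore withheld, which reflects the conservative intent of transferring only what the whole group shares.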
For such a system to work successfully, a number of requirements must be met. Firstly, a well-annotated reference database is needed from which annotation can be extracted for transfer to unannotated entries. For the automatic annotation of TrEMBL, the SWISS-PROT database is used as the source of high quality annotation because of its well-annotated and standardised content.
Secondly, there needs to be a system to store and manage the annotation rules used in the system. For this, RuleBase (Apweiler, 2001), a database which contains the rules as well as their sources and usage, has been developed.
The final requirement for the success of the above system is that all of the above should be stored in a proper database management system and this is met by the fact that SWISS-PROT, TrEMBL, RuleBase and InterPro are all stored in ORACLE.
This process of automatic annotation brings the standard of annotation in TrEMBL closer to that found in SWISS-PROT through the addition of accurate, high-quality information to TrEMBL entries, thus improving the quality of data available to the user.

CONCLUSIONS
The SWISS-PROT and TrEMBL databases together provide a complete collection of protein sequences with minimal redundancy and offer a high level of integration with a large number of other biological databases. Each SWISS-PROT entry is manually annotated by a biologist, thus ensuring that the information in the database is as accurate and up-to-date as possible. Using this high quality annotation as a basis, a procedure has been developed to automatically annotate the TrEMBL database. This system adds information to entries which are awaiting manual curation and improves the quality of data available to users of the TrEMBL database. All information items in the SWISS-PROT and TrEMBL databases are stored in a highly structured and uniform manner, which means that they can be retrieved easily and consistently by users and by computer programs.

Table 1. Comment topics used in the SWISS-PROT database