Impact of the Protein Data Bank Across Scientific Disciplines

Zukang Feng (1,2), Natalie Verdiguel (3), Luigi Di Costanzo (1,4), David S. Goodsell (1,5), John D. Westbrook (1,2), Stephen K. Burley (1,2,6,7,8) and Christine Zardecki (1,2)
1 Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ, US
2 Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ, US
3 University of Central Florida, Orlando, Florida, US
4 Department of Agricultural Sciences, University of Naples Federico II, Portici, IT
5 Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, US
6 Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA, US
7 Rutgers Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ, US
8 Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, La Jolla, CA, US
Corresponding author: Christine Zardecki (christine.zardecki@rcsb.org)


Introduction
Since 1971, the Protein Data Bank (PDB) has served the scientific community as the single, global repository for structural data of biomolecules (Protein Data Bank, 1971). Data archived at the PDB include atomic coordinates and related experimental data from macromolecular crystallography, nuclear magnetic resonance spectroscopy and 3D electron microscopy studies. Understanding these 3D structures of proteins, nucleic acids, and large molecular machines informs our understanding of fundamental biology, medicine and drug discovery, and energy.
The PDB was conceived as a resource for the crystallographic community, to archive their primary results. However, as the number of structures grew, it became apparent that this body of information would have much wider application. Communities of researchers emerged that focused on data mining, using the available structures to hypothesize and test overarching principles of biomolecular structure, folding, and function. Soon after, the archive showed growing application in the field of structure-guided drug design, and has since been instrumental in the discovery and development of dozens of blockbuster medical treatments (Westbrook and Burley, 2019). In addition, structures from the archive are used widely to provide structural understanding of biomolecular structure and function, promoting research in many fields of biology, but also in chemistry, physics, mathematics, computer science and beyond. The growing utility of the archive naturally led to widespread policies of structure deposition, and today, most major journals require release

Previous work on the impact of data reuse of the PDB archive
Evaluation of the impact of data archives is important for data depositors, data users, and for resource management, planning, and funding. For scientific databases, citations are often used as a tangible expression of reuse of the data. Because the citing publications are peer-reviewed and appear in reputable journals, they lend confidence that the derivative work is contributing to the growing body of scientific knowledge, and they validate the role of the data archive as a central resource for the community.
Previous citation analyses of the PDB archive have focused on the inaugural article describing the RCSB PDB resource, "The Protein Data Bank" (Berman et al., 2000), which appeared in Nucleic Acids Research. This inaugural article is regularly used to cite both the PDB data archive and RCSB PDB services, and its high volume of citations makes it a useful reference to study. A 2014 analysis (Van Noorden et al., 2014) ranked the inaugural article 92nd among the top 100 most-cited research publications of all time, and a 2017 study (Basner, 2017) placed it 5th among papers published since 2000. The 2017 analysis by Basner also found, using internal methods for normalizing across categories, that articles citing the inaugural RCSB PDB publication had a citation-based impact exceeding the world average in 16 scientific fields including Biology & Biochemistry, Computer Science, Plant & Animal Sciences, Physics, Environment/Ecology, Mathematics and Geosciences. Another study (Markosian et al., 2018) found that the research areas for articles citing the inaugural RCSB PDB publication are changing over time, with more recent growth in disciplines such as Mathematical Computational Biology, Chemistry Medicinal, and Computer Science Interdisciplinary Applications.
Other studies have looked at how individual structures are referenced in the literature to demonstrate data reuse. For example, Huang et al. found an increase in the number of citations to PDB entries by URL rather than to publication (Huang et al., 2015), and Bousfield et al. cross-referenced open access literature with the PDB archive and found the average annual number of citations for a PDB structure is 6.7 (Bousfield et al., 2016). Scientific articles that are connected to open access data have been shown to be more highly cited than articles without data being made available (Colavizza et al., 2019). This is certainly reflected in the primary citations included in PDB entries. At the time the data for this study were collected (March 1, 2018), the PDB archive contained ~139,000 structures, and the Basner 2017 study found that the PDB archive from 2000-2016 had been cited by more than 1 million scientific publications in the Web of Science, giving an average number of ~40 citations per PDB structure publication (Burley et al., 2018). This number rises to ~80 citations per PDB structure publication for drug targets in all therapeutic areas (Westbrook and Burley, 2019).
This study explores citation patterns of individual PDB structures to identify PDB entries of most interest in specific fields and to examine trends in the application of structural biology. Each structure represents the results of an experiment determined by a laboratory and then deposited to the PDB data archive, biocurated by the wwPDB, and then made publicly available in the archive (Morris, 2018). The majority of PDB structures (~80%) have a corresponding primary citation that is the first paper to describe the molecule, its structure, and its function. Public release of most PDB data (87% in 2018) is coordinated with the time of publication of this primary citation. As identified by previous research, and due to the nature of the data, most papers citing PDB structures are in the field of biology. To identify strong examples of structures cited in other research categories, we identified the top cited structures within related disciplines.

Methods
For each entry in the PDB archive, we analyzed the set of articles that cite the primary citation of the PDB entry. First, the primary citations for each PDB entry were exported from the RCSB PDB database. A single primary citation may describe multiple PDB structures; in these cases, entries were treated and counted separately in the analysis. Then, publication data for articles that cite these PDB primary citations as of March 2018 were exported and organized by subject categories within the Web of Science (Clarivate Analytics, 2019). Related subject categories were aggregated; for example, the Chemistry category reported here includes Chemistry Physical, Chemistry Organic, Chemistry Analytical, and others. Note that each publication can be assigned to more than one category. The ranking of citation impact across disciplines and longitudinally is an ongoing area of active bibliometric research (Abaci, 2017; Bronmann and Williams, 2020; Diamandis, 2017; Koelblinger et al., 2019; Pendlebury, 2009; Schneider et al., 2019). For this study, the citation data are not normalized in any way to provide a direct comparison of impact between categories. Exported data were analyzed in July 2018.
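The counting procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline used for the study: the identifiers, the category mapping, and the sample records are all made up, and the choice to count an article once per aggregated category (even when it carries several sub-categories of that aggregate) is an assumption that the text does not specify.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from Web of Science subject categories to the
# aggregated categories used for reporting (illustrative subset only).
AGGREGATE = {
    "Chemistry Physical": "Chemistry",
    "Chemistry Organic": "Chemistry",
    "Chemistry Analytical": "Chemistry",
    "Physics Applied": "Physics",
}

# Made-up input 1: primary citation -> PDB entries it describes.
PRIMARY_TO_ENTRIES = {
    "cit_A": ["1abc", "1abd"],  # one paper describing two structures
    "cit_B": ["2xyz"],
}

# Made-up input 2: (citing article, primary citation cited, WoS categories).
CITING_ARTICLES = [
    ("art_1", "cit_A", ["Chemistry Physical", "Chemistry Organic"]),
    ("art_2", "cit_A", ["Physics Applied"]),
    ("art_3", "cit_B", ["Chemistry Analytical", "Physics Applied"]),
    ("art_4", "cit_A", ["Chemistry Physical"]),
]


def tally(primary_to_entries, citing_articles, aggregate):
    """Return {aggregated category: Counter({PDB entry: citation count})}."""
    counts = defaultdict(Counter)
    for _article_id, primary, categories in citing_articles:
        # Assumption: an article tagged with several sub-categories of the
        # same aggregate contributes one citation to that aggregate.
        merged = {aggregate.get(c, c) for c in categories}
        for category in merged:
            # Entries sharing a primary citation are counted separately.
            for entry in primary_to_entries.get(primary, []):
                counts[category][entry] += 1
    return counts


counts = tally(PRIMARY_TO_ENTRIES, CITING_ARTICLES, AGGREGATE)

# Number of unique structures cited within each aggregated category.
unique_structures = {cat: len(c) for cat, c in counts.items()}

# Raw (non-normalized) ranking of the most-cited entries per category.
top_cited = {cat: c.most_common(10) for cat, c in counts.items()}
```

Because a citing article can belong to more than one category, the per-category totals in such a tally overlap and should not be summed across categories.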

Overall citation of PDB structures
The top-cited PDB structures (Table 1) are landmark structures that signal achievements in fundamental biology and their application to biomedicine and biotechnology. A detailed description of the impact of the structures of the nucleosome (PDB 1aoi) and major histocompatibility complex 1 (1hla) has been published (Burley et al., 2018). The structure of bacteriorhodopsin (1brd) was the first EM structure released in the PDB, and represents a ground-breaking use of electron crystallography of two-dimensional sheets to determine the membrane-bound structure of a protein. The structure of the F1 portion of ATP synthase (1bmf) revealed the atomic details of the rotary molecular motor, providing a structural explanation for decades of biochemical studies. Similarly, the structure of the potassium channel (1bl8) resolved a long-standing question about the nature of specificity, revealing the central role of hydration and dehydration of ions in controlling ion passage across the cell membrane. The structures of MHC I (1hla), MDM2 (1rv1), and serum albumin (1uor) are a testament to the utility of atomic structures in the understanding of biomedically important biomolecules and in structure-based design of pharmaceuticals. Many additional milestone structures closely follow these top 10 entries, including photosystem II (1s5l) with 2286 citations and green fluorescent protein (1ema) with 1529 citations. In keeping with the importance of these molecules, all structures in the top-cited list (Figure 1) have been highlighted in the RCSB PDB's Molecule of the Month series at PDB101.rcsb.org (Goodsell et al., 2019). The importance of these 10 entries is also supported by the prominence of the journals where the primary citations were published. Of the nine articles describing the top 10 structures, four were published in Nature, four in Science, and one in the Journal of Molecular Biology. The oldest structure was published in 1987, and the most recent in 2007.
Surprisingly, the initial 12 structures archived in the PDB since the 1970s and that represent the seminal achievements of structural biology are not included in this list. This may be due in part to inconsistent citation practices of the time, both for the citations that were included in the early structure deposition to the archive, and for how PDB structures were cited in the literature. For example, the citation included in the entry for Kendrew's landmark structure of myoglobin (1mbn) is not the initial structure solution (Kendrew et al., 1960), which currently shows >1100 citations, but rather a later report of the molecule (Watson, 1969) that is not included in the Web of Science. Similarly, multiple publications were presented over decades during the structure solution of hemoglobin (2dhb), including the primary citation associated with the entry (Bolton and Perutz, 1970) with 253 citations and a key Nature paper with 932 citations (Perutz et al., 1960).

Top cited structures by category
We also analyzed subject categories for the journals where citing articles were published, to assess the range of disciplines where data from the PDB archive are having an impact. Not surprisingly, PDB structures are most cited by publications with the subject category Biochemistry & Molecular Biology, including 101,921 unique structures (72% of archive) at the time of this study. Six non-biological categories were chosen to show utility of the archive in related disciplines: Materials Science, Physics, Computer Science, Chemistry, Engineering, and Mathematics. We also identified the most highly-cited article that cited a PDB structure in each category. For the physics-related categories these highly-cited papers included reviews related to biotechnology and nanotechnology: Materials Science (Nel et al., 2009), Physics (Zweib et al., 1989), and Engineering (Hersel et al., 2003). For Computer Science, Chemistry, and Mathematics, the papers were primary citations for the widely-used molecular visualization program VMD (Humphrey et al., 1996), small and macromolecular structure determination program SHELX (Sheldrick, 2008), and database of theoretical models SWISS-MODEL (Arnold et al., 2006), respectively. These citations highlight how available data in the repository support cross-disciplinary use across the physical sciences. Figure 2 reveals that individual PDB entries have impact on a wide range of disciplines. The three most highly-cited structures are included, along with the number of citations falling into the top 5 categories. Not surprisingly, all have Biochemistry & Molecular Biology as the top category.
The following categories are quite different for these three entries, reflecting the different uses that are made of these structures: the nucleosome (1aoi) in basic biology and understanding of genetic mechanisms, the potassium channel (1bl8) as a central structure used for understanding and engineering specific ion channels, and rhodopsin (1f88) which was used for many years as a template for understanding and modeling the pharmacology of G-protein coupled receptors (GPCRs). The citations also fall into numerous other categories: for example, citations for the nucleosome structure fall into over 100 separate categories.
In the sections below, we identify the top-cited structures in each of these subject categories, and describe how these structures have impacted the fields.

Materials Science
Articles in this category cited 18,495 unique structures, or roughly 12% of the archive. Two of the top cited PDB entries here, serum albumin (1uor) and potassium channel (1bl8), also appear in the overall top cited list (Table 2). Surprisingly, two additional structures of serum albumin (1ao6/1bm0) also appear in this top cited list. The reasons for citation of these structures in materials science journals are reflected in the most frequently used keywords in these citing articles: in vivo, mechanism, drug delivery, adsorption, in vitro, crystal structure, protein/s, nanoparticles, and binding. These structures provide information that is useful in a variety of current bioengineering and nanotechnology goals. Serum albumin (1uor, 1ao6, 1bm0) plays essential roles in delivery of a wide variety of small molecules in the blood, thus it is often a key for assessing the ADME (Absorption, Distribution, Metabolism, Excretion) properties of engineered molecules. Many of these cited papers explore design of nanoparticles for delivery of molecules in the blood, building on knowledge of the structure. Similarly, alpha-hemolysin (7ahl) and potassium channels (1bl8) are worked examples of selective channels and have been used in bioengineering efforts. In particular, structural understanding of alpha-hemolysin has been instrumental in the engineering of nanopores for DNA sequencing. Designed DNA structures (3gbi) are some of the first successful examples of de novo design in bionanotechnology, and the strong streptavidin-biotin interaction (1stp) is often used to connect modular components in designed nanostructures. Materials Science publications reference the in situ structure of collagen microfibrils (3hqv and 3hr2), including reports exploring the properties of connective tissue and biomineralization.
Photosystem II appears in Materials Science (3wu2), as well as in nearly all of the other categories below, since these structures revealed the water-splitting details of the oxygen-evolving center.

Physics
Articles in this category cited 50,819 unique structures. Two of the top cited structures, potassium channel (1bl8) and the nucleosome (1aoi), also appear in the overall top cited list (Table 3). The most frequently used keywords in these citing articles include: model, mechanism, protein/s, binding, molecular dynamics/simulations, spectroscopy, and crystal structure.
This category includes an interesting mix of structures related to photosynthesis (3pcq, 1s5l, 3wu2, 1jb0, 1lgh) and structures related to development of experimental methods (also 3pcq, 1m8m, 1l2y). The structures of photosystems revealed the detailed arrangements of chromophores in the protein complexes, and thus provide concrete information on the types of geometries and distances that are relevant for excitation and electron transfer. These structures also include several seminal developments in structural science with strong ties to physics, including determination of photosystem I by femtosecond X-ray protein nanocrystallography (3pcq) and determination of the spectrin domain by solid-state magic-angle-spinning NMR spectroscopy (1m8m). In addition, two structures related to nanotechnology appear in the list: a very small de novo designed protein (1l2y) and alpha-hemolysin (7ahl), mentioned in the section above. Inclusion of the nucleosome (1aoi) in this list may seem puzzling until we recognize that much effort has been expended in trying to understand and model the physics of DNA bending as it relates to nucleosome positioning and higher-order chromatin structure.

Computer Science
Articles in this category cited 28,122 unique structures, and rhodopsin (1f88) also appears in the overall top cited list (Table 4). The most frequently used keywords in these citing articles included: docking, identification, inhibitors, prediction, molecular dynamics, forcefield, design, binding, and crystal structure. These structures represent important targets for drug development, and thus are often used to test new structure-based drug design methodology, including a shape-based 3-D scaffold hopping method (1y2f, 1y2g) and novel use of cyclic ureas to mimic substrate binding and displace a key water molecule in HIV protease (1hvr). Half of the list are GPCRs (2rh1, 3eml, 2vt4, 2r4s), along with the landmark structure of rhodopsin (1f88), which was used for many years to model GPCRs and ligand binding thereto.

Chemistry
Articles in this category cited 87,073 unique structures, and given the strong connections between biology and chemistry, half of the top cited structures (2v3h/2v3o, 1uor, 1f88, 1bl8) appear in the overall top cited list (Table 5). The most frequently used keywords in these citing articles were: complexes, protein/s, E. coli, inhibitors, mechanism, design, derivatives, binding, and crystal structure.
The top two entries are thrombin with an inhibitor (2v3h) and with a fluorinated version of the inhibitor (2v3o), and the primary reference is a review that is cited by studies of the effects of fluorination in inhibitor design. Serum albumin (1uor) showed up in "Materials Science" in relation to bioengineering efforts, but many of the citations in "Chemistry" are directly related to binding of molecules to the protein and characterizing its functional properties in blood. Two hydrogenase enzymes (1feh, 1hfe) and photosystem II (1s5l, 3wu2) perform interesting chemistry catalyzed by unusual metal clusters. The structure of a human telomeric quadruplex (1k8p) is cited by all manner of studies looking at its chemical properties and interactions with ions, small molecules, and proteins.

Engineering
Articles in this category cited 16,190 unique structures, and serum albumin (1uor) and potassium channel (1bl8) again appear in the top cited list (Table 6). The most frequently used keywords in these citing articles were: mechanism, in vivo, protein/s, purification, binding, in vitro, E. coli, and expression. Several of these structures are related to bioengineering projects. The citations for serum albumin (1uor) include studies about the interaction with a wide variety of dyes, nanoparticles and other engineered molecules. Structures of collagen (3hqv, 3hr2, 1cag), fibronectin (1fnf), and osteocalcin (1q8h) played key roles in the understanding of cell adhesion, connective tissues and bone. The potassium channel structure (1bl8) is cited by studies looking at nanopores and biosensors. The lipase (5tgl) and laccase (1gyc) structures were cited in studies of engineered and immobilized versions of the enzymes.

Mathematics
Articles in this category cited 7,306 unique structures. Four of the top cited structures (1bl8, 1f88, 1aoi, 1brd) appear in the overall top cited list (Table 7). The most frequently used keywords in these citing articles were: molecular dynamics, protein/s, binding, identification, recognition, sequence, prediction, database, and crystal structure. Many of these papers are involved in modeling and analysis of protein and nucleic acid structures, with "Mathematical Computational Biology" being the major mathematics-related category. For example, the potassium channel (1bl8) citations include many computational studies exploring the dynamics of channel gating and permeation, as well as methods for predicting structure and function of other channels based on this structure, and the rhodopsin structure (1f88) includes several citations for methods that model the structure of GPCRs. The nucleosome (1aoi) citations include studies about nucleosome positioning and modeling of DNA bending or higher-order chromatin structure. The designed protein structures (1qys, 1fsv, 1fsd) are cited by methods papers that explore prediction of protein folding and design, and the ribosome structure (1ffk) is cited by methods exploring RNA structure and interaction of RNA and protein.
The first atomic structure of a B-DNA helix (1bna) is cited in modeling studies of DNA conformation and interaction.

Conclusions
This analysis has shown a large impact of the PDB archive within the discipline of molecular biology and in many related disciplines. Of course, this analysis uses only one metric for assessing impact: the record of citations. Additional information could be obtained through analysis of instances of PDB structure IDs in publications, or linkage of specific PDB entries in digital resources. We are also interested in assessing the impact of the PDB archive in education and public understanding, which may potentially be approached through analysis of usage and citation of entries in textbooks and popular publications. That effort may be more difficult, however, given that the citation practices in those publications are not as tightly codified as in professional scientific publications.
The PDB archive was originally established to serve the structural biology community. The extensive usage of PDB structures across a variety of disciplines demonstrates the importance of structural studies and how data archives support interdisciplinary research.