Data curation and citation practices in the Earth sciences have emphasized the assignment and citation of identifiers in order to promote asset discoverability and interoperability (Klump, Huber & Diepenbroek 2016). Persistent identifiers undoubtedly contribute to the popularity of data citation by enabling improved tracking of data set use and reuse, providing credit for data producers, and aiding reproducibility efforts by associating research with the exact data set(s) used (Parsons and Fox 2013; Parsons 2014; Katz and Strasser 2015). By directly linking publications with the resources underlying the scientific findings reported therein, persistent identifiers ensure scientific integrity, promote data discovery and management, and facilitate scientific communication (Hanson 2016). While recognition of the benefits of such linking is not new, the systematic implementation of resource linking is not yet widespread. Some recent examples of implementation include the Global Change Information System, which linked findings from the Third US National Climate Assessment with their underlying data (e.g., Ma et al. 2014; Wolfe et al. 2015), and the tracing of indicators found within Integrated Ecosystem Assessments (Beaulieu et al. 2016). In geology, persistent identifiers are core elements of digital metadata records, and are especially useful when space or resource requirements complicate storage of the samples described therein (McNutt et al. 2016). Persistent identifiers are essential for maintaining provenance: tracing significant scientific conclusions back to the entities (data sets, algorithms, instruments, satellites, etc.) that led to them (Ramapriyan et al. 2016). As stated by Tilmes, Yesha & Halem (2010):
‘Provenance in this context refers to the source of data and a record of the process that led to its current state. It encompasses the documentation of a variety of artifacts related to particular data. Provenance is important for understanding and using scientific data sets, and critical for independent confirmation of scientific results’.
Unless each of the entities constituting the provenance has a persistent identifier, a provenance trace generated at a given time may not remain valid at a future time. Many professional societies, journals, and federal entities within the US have recently adopted commitments, recommendations, mandates, and procedures for citing research data used in research products, all of which rely on identifiers (for particular examples within the field of Earth sciences see Bloom, Ganley & Winkler 2014; GSA 2014; Hanson and van der Hilst 2014; Evans et al. 2015; Hanson, Lehnert & Cutcher-Gershenfeld 2015; Mayernik, Ramamurthy & Rauber 2015). The US Federal Open Data Policy mandated that data sets resulting from federally funded research be accompanied by appropriate citations and persistent identifiers (OMB 2013). Consequently, US federal agencies have begun implementing policies requiring, enabling, and facilitating data citations – e.g., the National Aeronautics and Space Administration (NASA 2015), the National Oceanic and Atmospheric Administration (NOAA 2015), the National Science Foundation (NSF 2014), and the United States Geological Survey (USGS 2016). The Federation of Earth Science Information Partners (ESIP), which has over 180 participating member organizations including federal agencies, universities, and commercial entities, has developed data citation guidelines (ESIP Data Stewardship Committee 2012) which have been adopted by various parties.
Outside the US, the UK Digital Curation Centre (Ball and Duke 2015) has provided a detailed guide on citing data sets and linking them to publications. Egloff et al. (2016) have emphasized data citations as important components of their data policy recommendations for the European Biodiversity Observation Network (EU BON). During 2013–2014, the Committee on Data for Science and Technology (CODATA) and the International Council for Scientific and Technical Information (ICSTI) held several international workshops to articulate data citation principles (CODATA-ICSTI 2015). These produced the Joint Declaration of Data Citation Principles (Data Citation Synthesis Group 2014), which explicitly calls out the need for unique and persistent identifiers for data sets. To date, these principles have been endorsed by 372 entities worldwide, 114 of which are science organizations including journal publishers and professional societies (FORCE11 2016). The use of persistent identifiers in citations for non-data set assets such as software (e.g., Gent, Jones & Matthews 2015; Smith et al. 2016) and projects (e.g., NCAR 2016) is an emerging area of focus, and builds on the success of data citation efforts.
The Digital Object Identifier (DOI) is by far the most widely used identifier system, with 130 million persistent identifiers assigned to date (International DOI Foundation 2016a). Administered by the International DOI Foundation (IDF), it is an integral part of scientific publishing and possesses a high level of maturity (Klump, Huber & Diepenbroek 2016). Recently, many journal-wide open access publishing endeavors, e.g., the Coalition for Publishing Data in the Earth and Space Sciences (COPDESS 2015) and the Transparency and Openness Promotion guidelines (McNutt 2016), have selected the DOI as their identifier of choice. DOIs are a core requirement for items registered with DataCite, the primary registration agent for data DOIs. Given the recent emphasis on DOIs for citing data sets, this paper explores the adoption of DOIs for Earth science data sets, outlines successes, and identifies some remaining challenges.
The DOI system is built on the Handle System, but exceeds its capabilities by providing a resolution system for identifiers and by requiring semantic interoperability, among other features (International DOI Foundation 2015). Various articles have further explored the distinctions between DOIs and other identifier schemes, and have shown the advantages of DOIs over the alternatives. For example, Duerr et al. (2011) compared DOIs with eight other identifier schemes for their use as Unique Identifiers, Unique Locators, Citable Locators, and Scientifically Unique Identifiers. Although they found none of the schemes to be suitable as Scientifically Unique Identifiers (which inform users that two data instances, even if in different formats, contain identical information), they found that DOIs fared as well as or better than the others for the first three uses, and especially well as Citable Locators for data sets. Lee and Stvilia (2012) compared DOIs with five other identifier schemes using the following 11 quality criteria: Uniqueness/Precision, Resolution, Interoperability, Persistence, Granularity/Flexibility, Complexity, Verifiability, Opacity/Clarity, Authority, Scalability, and Security. They observed that the Handle system was the most widely used among 34 institutional data repositories of the (then) 61 members of the Association of American Universities. However, they noted that DOIs ranked highest in all the quality criteria except Opacity/Clarity. They also stated that a possible reason for low adoption of DOIs by the repositories could be ‘the cost of registering with and purchasing DOI prefixes from the International DOI Foundation’. The DOI syntax is described in ISO 26324:2012.
It is important to emphasize that while the number of DOIs assigned overall is in the millions, the significantly lower numbers shown below for Earth science data sets are due solely to the relatively recent use of DOIs for identifying and citing such data sets.
For nearly 20 years, publications discussing the need for identifiers for data sets, many with particular emphasis on DOIs, have appeared in the literature (e.g., Helly et al. 1999; Helly, Staudigel & Koppers 2003; Brase 2004; Paskin 2006; Klump et al. 2006; Williams et al. 2009; Piwowar 2011; Hills et al. 2015; James and Wanchoo 2015; Christensen et al. 2015; Mayernik, Phillips & Nienhouse 2016; Prakash et al. 2016). Although material discussing the perceived benefits and drawbacks of various identifier schemes exists (see above), to our knowledge none to date has focused on evaluating DOI use. To what extent have DOIs accomplished their aims? Have game-changing developments occurred in the field of data citation that hinder DOI use? This paper explores these questions with particular focus on Earth science data sets held by organizations in the US.
DOI Use and Proliferation
As mentioned above, DOIs have proliferated in many domains since their advantages were first heralded during the late 1990s (see Rosenblatt 1997; Davidson and Douglas 1998). This section covers the recent growth of DOI assignments for Earth science data.
US federal agencies and affiliates have within the past few years initiated earnest efforts to assign DOIs to their data sets (see Table 1). In the cases of NASA, NOAA, and USGS, the small to moderate percentages of data sets assigned DOIs are a function of (1) the vast size of their catalogues and (2) the fact that, in these and various other federal entities, DOIs are minted upon the submission of data sets to the archive and their preparation for public distribution. Note that such programs have only recently begun assigning DOIs to data sets. In figshare, an independent service that supports self-archiving of data and other assets, 132,574 data sets out of 466,545 total assets (28%) were categorized as related to “Earth and Environmental Sciences” as of January 25, 2017 (Valen, pers. communication). The DataCite organization has been influential in promoting and enabling geoscience-focused organizations to create DOIs for data sets (Klump, Huber & Diepenbroek 2016). All organizations mentioned in Tables 1 and 2, as well as Figure 1, now register DOIs for data using DataCite services. NOAA and NASA deliberately assign DOIs at the coarse, collection level rather than at the finer, “granule” level, both to avoid a large proliferation of DOIs and to simplify citation. For example, NASA EOSDIS curates hundreds of millions of granules, compared to approximately 10,000 collections. After careful review of data sets and metadata, NOAA assigns DOIs manually, one at a time. DOIs were initially assigned manually within NASA EOSDIS as well, although the recent automation of the process has resulted in significantly faster assignments (Wanchoo, James & Ramapriyan 2017). In both cases the organizations ensure that well-developed landing pages providing information about the data sets exist before DOIs are assigned.
| Organization | Number of DOIs Assigned (Data Sets) | Total Number of Data Sets in Archive | % | Initial Year of DOI Assignment | Source | Valid Date |
|---|---|---|---|---|---|---|
| US Department of Energy (DOE) Atmospheric Radiation Measurement (ARM) Archive | 797 | 41602 | ~ 2 | 2012 | ARM Climate Facility Data Discovery Search Results (2017); Prakash, pers. communication | 2017-02-03 |
| NASA Earth Observing System Data and Information System Distributed Active Archive Centers (EOSDIS DAACs) | 5364 | ~ 10000 | ~ 54 | 2009 | NASA 2017 | 2017-02-08 |
| NOAA | 612* | n/a* | n/a* | 2013 | DataCite 2017; De La Beaujardière, pers. communication | 2017-02-08 |
| National Center for Atmospheric Research (NCAR) Research Data Archive** | 79 | 671 | 12 | 2012 | UCAR NCAR 2017 | 2017-02-09 |
| Rolling Deck to Repository (R2R)*** | 15405 | 20956 | 74 | 2015 | Arko, pers. communication | 2017-02-01 |
| USGS Science Data Catalog | 1407 | 8677 | 16 | 2011 | Bristol, pers. communication | 2017-02-13 |
| Organization | DOI Prefix (Search String) | Number of Google Scholar Results for Each Search String |
|---|---|---|
| NASA – EOSDIS [CDL.NASAGSFC] | “DOI:10.5067”; “doi.org/10.5067” | 346; 324 |
| National Snow & Ice Data Center (NSIDC) [CDL.NSIDC] | “DOI:10.7265”; “doi.org/10.7265” | 166; 202 |
| NCAR [CDL.NCAR] | “DOI:10.5065”; “doi.org/10.5065” | 927; 415 |
| NOAA [CDL.NOAA] | “DOI:10.7289”; “doi.org/10.7289” | 578; 261 |
| USGS [CDL.USGS] | “DOI:10.5066”; “doi.org/10.5066” | 112; 178 |
As DOIs for data sets from the above organizations have existed for a few years, it is worth examining the extent to which they have been used in citations. Using information gleaned from Google Scholar (2017), Table 2 presents the numbers of citations to objects (mostly data sets) using some DOI prefixes of interest. Note that the option “include patents” was unchecked while performing the Google Scholar searches. Google Scholar returns lists of publications containing a string specified in the search window. Thus, by specifying a DOI prefix reserved for data sets, it is theoretically possible to obtain lists (and numbers) of publications that cite all data sets using that prefix. However, the search results are sensitive to the manner in which the DOI prefixes are specified. For example, the five entries 10.7265, DOI:10.7265, doi.org/10.7265, “DOI:10.7265” and “doi.org/10.7265” return different sets of results, with some overlap. The first three entries provide several irrelevant results in which the number 7265 may have occurred in some context other than a DOI or a citation. The last two, in which the quotes are included, provide mostly relevant results. However, differences in results remain because authors do not use a consistent format in their citations – some use DOI:10.7265 and others use doi.org/10.7265. Without a detailed analysis of the results from such searches, it is difficult to deduce the extent of overlap between the two, and thus to determine the accurate number of citations of data sets with a given DOI prefix. Table 2 shows results from two search strings for each of the DOI prefixes analyzed. Despite the issues noted above, the use of DOIs for citing data sets offers a means of assessing the utility of data sets as the adoption of DOIs continues to take hold. The brackets indicate the data center code as represented in the DataCite system, e.g., CDL.USGS.
“CDL” refers to the California Digital Library, which provides DOI registration services via DataCite. The DataCite codes are shown to allow comparison with Figure 1.
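The overlap between the two citation styles could in principle be quantified with a short script. The following sketch is a hypothetical illustration (the DOI suffix is invented; only the NSIDC prefix 10.7265 comes from Table 2): it normalizes both the “DOI:” and “doi.org/” citation forms to canonical DOIs so that result sets from different search strings can be compared by set intersection.

```python
import re

# Matches either citation style: "DOI:10.7265/..." or "doi.org/10.7265/...".
# The captured group is the canonical DOI itself (prefix/suffix).
DOI_PATTERN = re.compile(r'(?:doi:|doi\.org/)\s*(10\.\d{4,9}/[^\s",;]+)',
                         re.IGNORECASE)

def extract_dois(text):
    """Return the set of canonical DOIs found in a block of citation text."""
    return {m.group(1).rstrip('.') for m in DOI_PATTERN.finditer(text)}

# Two hypothetical search results, one in each citation style; the suffix
# N5TESTDOI is invented for illustration:
hits_colon = extract_dois('... DOI:10.7265/N5TESTDOI ...')
hits_url = extract_dois('... https://doi.org/10.7265/N5TESTDOI ...')

# After normalization, the overlap is a simple set intersection:
overlap = hits_colon & hits_url
```

With result pages from both search strings normalized this way, the union gives the distinct data sets cited and the intersection gives the overlap that is otherwise difficult to deduce manually.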
Although DataCite and related initiatives have emphasized DOI assignment primarily for data sets, DOIs are actually being assigned to a much wider set of resources. Using the DataCite (2016) search service (accessed on August 24, 2016), Figure 1 divides DOI assignments by DataCite-identified resource type for 11 organizations. These organizations were chosen from the ESIP (2016) member directory, and the organizational names are drawn from DataCite. “CDL” refers to the California Digital Library, which provides DOI registration services.
The resource types displayed in Figure 1 are those allowed by the DataCite metadata schema, specifically the controlled list of values allowed in the “ResourceTypeGeneral” attribute of the “ResourceType” element. As this field was optional in the DataCite schema until 2016, not all metadata records provide it; records with unspecified resource types are listed as “Not Supplied”. In some cases, such as CDL.NASAGSFC (DOI prefix 10.5067), the “Not Supplied” category consists mostly of data sets, although this is difficult to determine from the metadata alone. Figure 1 contains one pie chart for each of the 11 organizations, showing the percentages of objects of each resource type to which DOIs have been assigned. Resource types comprising fewer than 1% of the total number of objects are not shown. The data underlying these charts are provided in the Appendix.
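The per-organization breakdown in Figure 1 amounts to tallying the “ResourceTypeGeneral” value of each DOI’s metadata record, with missing values grouped under “Not Supplied”. A minimal sketch of that tally, using invented metadata records for illustration:

```python
from collections import Counter

def tally_resource_types(records):
    """Tally DataCite ResourceTypeGeneral values across metadata records.

    Records lacking the (formerly optional) field are counted under
    "Not Supplied", as in Figure 1.
    """
    counts = Counter()
    for rec in records:
        counts[rec.get('resourceTypeGeneral') or 'Not Supplied'] += 1
    return counts

# Hypothetical metadata records (the DOI suffixes are invented):
sample = [
    {'doi': '10.5067/EXAMPLE1', 'resourceTypeGeneral': 'Dataset'},
    {'doi': '10.5067/EXAMPLE2'},  # field omitted, as in pre-2016 records
    {'doi': '10.5067/EXAMPLE3', 'resourceTypeGeneral': 'Text'},
]
type_counts = tally_resource_types(sample)
```

Converting the counts to percentages of the total then yields the proportions shown in each pie chart.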
DOIs have proved useful for identifying, locating, and citing data sets, and the extent of their recent adoption by various organizations shows that the practice of citing data sets is catching on in scientific publications. Data set citations can also facilitate the extraction of citation metrics for data sets from a given organization, for a given group of data sets, or even for a single data set, which would otherwise involve a considerable amount of manual effort.
Some Remaining Challenges with DOIs
Despite the successes with DOI implementations mentioned above, challenges remain. We organize this summary according to those identified in the literature and those based on our experience.
Challenges identified in the literature
As the DOI system was being developed in the late 1990s, there was a common sentiment within academic circles that the DOI system was too targeted to the needs of the publishing community rather than addressing the needs of the scholarly community (Cleveland 1998; Davidson and Douglas 1998; Lynch 1998). For instance, the “Armati Report” (Armati 1995), which was foundational in the development of the DOI system, focused in large part on intellectual property issues, which were (and still are) critical for academic publishers. Even in the early days, however, issues that are now problematic for scientific data were being discussed. For example, in referencing “critical issues” facing an information identification system, that report specifically called out “flexible, granular (sub-file level) data object identification” and “real time differentiation between master data object and individual expressions of whole or parts of it” (pg. 6). Similarly, in outlining the early developments of the DOI system, Paskin (1999) described debates about the resource types to which DOIs should be assigned, the value of the metadata associated with DOIs, and the granularity at which DOIs should be assigned to various resources. These questions re-emerged when the communal practice of assigning DOIs to data sets began in earnest (Paskin 2006), and continue to inspire debate (Altman and King 2007; Wynholds 2011; Starr et al. 2015). Lin and Strasser (2014) noted that the DOI system (1) lacked mechanisms for users to identify modifications to the data if needed, and (2) was designed to support references to entire data sets, not components such as individual values or granules. Parsons and Fox (2013) note that many of these challenges in applying DOIs to data relate to the DOI system’s strong association with the concept of “publication”. In the context of scholarly publishing, DOIs are assigned after a resource (e.g., a journal article) is fixed in its final state.
The DataCite DOI services were built with the same model in mind (Brase, Sens & Lautenschlager 2015). Data sets, however, often change or are updated even after being made publicly accessible. Therefore, the assumptions that people may hold for resources with DOIs – that they are static and well-bounded – are not necessarily true for data.
In a Public Library of Science blog, Fenner (2011) recognized the following issues with the DOI system’s implementation:

1. Many bibliographic databases store DOIs without permitting queries or providing links to services using them.
2. DOI string syntax is often very web-unfriendly, as the unregulated format of DOI strings can require ‘escaping’ various special characters.
3. DOIs assigned to individual components, such as individual figures or tables in a paper, while very useful, can ‘confuse bibliographic databases and make it more difficult to track all the links to a given article’.
4. Assignment of new DOIs to updated versions of articles complicates the tracking of all references to those articles.

He noted, however, that journals are beginning to handle these issues.
During a retrospective discussion of 20 years of web-based identifier management systems, Bide (2015) recognized that some of the key barriers to universally adopting identity management techniques/standards involve governance issues associated with achieving broad adherence to standards, coupled with the fact that use of identification systems is not free of charge (even though the costs may be small). In short, many challenges related to DOI adoption revolve around practices, not technical capabilities.
Challenges identified by authors’ experience
We have focused on the use of identifiers for Earth science data. Although this scientific community is slowly recognizing the value of assigning identifiers to data products, data citations have not yet become the norm in scientific publications. The emergence of this norm is expected to facilitate the gathering of accurate data set citation metrics; meanwhile, the lag in the adoption of data citations by data users and journals limits the ability to gather such metrics accurately. We have not yet found prevalent web-based mechanisms for consistently and accurately deriving citation metrics for data sets originating from given organizations. The difficulties include:
- As shown in Figure 1, it is very difficult to identify the resource type to which a DOI refers without consulting its metadata, and in many cases the resource type cannot be identified even from the metadata. Although this has very recently been remedied to some extent by making resource type a mandatory metadata field, it will take some time for data archives to comply with the new requirement.
- No universally accepted recommendations of good practices for DOI syntax (e.g., random strings versus clearly identifiable strings) exist. Although stated in the DOI Handbook (International DOI Foundation 2016b), the “best practice” of assigning opaque identifiers is not universally followed. Opaque DOIs are those with suffixes having no discernible meaning (e.g., 10.5065/D62J68XR).
- Google Scholar counts the appearance of multiple citations with a single DOI prefix (e.g., 10.5067) within a document as one record. The determination of the true number of citations using such a prefix therefore requires either manual or automated analysis of the search results. This would hold true for outputs from any search engine.
- The number of Google Scholar outputs changes daily, and sometimes decreases, complicating analysis efforts.
- No method exists for distinguishing between “internal citations” that use DOI prefixes for describing identifier assignment processes and “external citations” that cite data sets with those prefixes.
- Among data archives, variations in the definition of a “data set” and the level of granularity at which data sets exist complicate overall evaluation of the extent of DOI use. For instance, one archive may consider three similar data sets as one, whereas another archive may treat them as three distinct data sets.
It is possible that these difficulties will be overcome with specialized software for scripted searches.
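As one hypothetical illustration of such scripted analysis, the fragment below counts distinct cited DOIs per registrant prefix across a set of full-text documents, so that multiple citations within a single document are not collapsed into one record as in the Google Scholar results (only the EOSDIS prefix 10.5067 comes from the text; the documents and suffixes are invented):

```python
import re
from collections import defaultdict

# Generic DOI pattern; the trailing character class trims common delimiters.
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s",;)\]]+')

def count_citations_by_prefix(documents, prefixes):
    """Count distinct cited DOIs per registrant prefix across documents."""
    seen = defaultdict(set)
    for doc in documents:
        for doi in DOI_RE.findall(doc):
            prefix = doi.split('/', 1)[0]
            if prefix in prefixes:
                seen[prefix].add(doi.rstrip('.'))
    return {p: len(s) for p, s in seen.items()}

# Hypothetical full texts citing EOSDIS (prefix 10.5067) data sets:
docs = [
    'cites doi.org/10.5067/EXAMPLE1 and doi.org/10.5067/EXAMPLE2',
    'cites DOI:10.5067/EXAMPLE1 only',
]
prefix_counts = count_citations_by_prefix(docs, {'10.5067'})
```

Because DOIs are collected into a set, repeated citations of the same data set across documents are counted once, giving the number of distinct data sets cited under each prefix.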
Our analysis of the assignment of DOIs to resources (primarily data sets) by various organizations managing Earth science data reveals significant differences in approaches. These differences occur in syntax, the number of prefixes used, and the variety of resource types to which DOIs are assigned, among other aspects (see Figure 1 and Table 1). While this variation does not hinder the use of DOIs as persistent and unique references to individual resources, it makes it difficult to use DOIs for consistently deriving or aggregating information such as usage and citation metrics. Many data usage metrics, such as download counts and quantities of bytes delivered, are already highly repository specific, being contingent on the manner in which repositories collect, manage, and present data (Weber et al. 2013). Due to the variations in DOI implementation, DOI-based metrics are not likely to be any more comparable across organizations or repositories than other data usage metrics.
As demonstrated above, the utilization of DOIs within the Earth science community has come a long way since the 1990s. It is worth noting, though, that much of this work has emerged only recently, beginning around the start of this decade, owing in part to revolutions in the digital arena. This study examined DOI metrics only for US-based organizations; further investigation is needed to enable comparison with similar initiatives in other geographic areas. However, given the international interest in this topic and the multi-national scope of key organizations such as DataCite and the Research Data Alliance, we would expect to see only minor differences, if any, across national borders. Increased DOI utilization will necessitate the development and adaptation of automated approaches to data citation. Along with such automated approaches, the use of structured metadata with mandatory “resource type” fields will facilitate improved characterization of the importance of Earth science data sets and other types of resources, as well as the linking of them to related resources.
Within its data citation guidelines, the ESIP Data Stewardship Committee (2012) provided recommendations for citing so-called “dynamic data” (i.e., data sets with evolving contents). More recently, the Research Data Alliance (RDA) Working Group on Data Citation has proposed a query-based approach for generating precise citations from archival databases (Rauber et al. 2015). Furthermore, Buneman, Davidson & Frew (2016) have developed an automated approach to query-based citation generation that returns the data along with relevant citations from databases. When used with DOIs, such approaches should enable citing data more precisely, assessing provenance more accurately, and improving the reproducibility of scientific products. In addition, the increased use of web-based persistent identifier schemes will enable scientific data systems to leverage linked data and Semantic Web technologies more broadly. These technologies, which are being used by a number of organizations to support systems for data documentation, discovery, and integration, rely on resources having web-resolvable identifiers (Ma et al. 2014; Wilson et al. 2015). The use of persistent identifiers adds robustness to Semantic Web applications, reducing problems related to link obsolescence.
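A rough sketch of the query-based idea, under the assumption (following the approach of Rauber et al. 2015) that a citation pairs the query with an execution timestamp and a hash of the result set; the function and field names here are illustrative, not part of the RDA recommendations:

```python
import hashlib
import json
from datetime import datetime, timezone

def make_query_citation(query, result_rows):
    """Build a citation record for a query against a dynamic data set.

    Persisting the query, its execution time, and a hash of the results
    allows the exact cited subset to be verified and re-identified later.
    """
    result_hash = hashlib.sha256(
        json.dumps(result_rows, sort_keys=True).encode('utf-8')
    ).hexdigest()
    return {
        'query': query,
        'executed': datetime.now(timezone.utc).isoformat(),
        'result_sha256': result_hash,
    }

# Hypothetical query and result subset:
citation = make_query_citation(
    'SELECT * FROM observations WHERE year = 2015',
    [{'year': 2015, 'temp_c': 14.9}],
)
```

Re-running the stored query at a later time and comparing result hashes reveals whether the cited subset has changed, which is precisely the dynamic-data problem that fixed, publication-style DOIs alone do not address.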