The History and Future of Data Citation in Practice

In this review, we adopt the definition that ‘Data citation is a reference to data for the purpose of credit attribution and facilitation of access to the data’ (TGDCSP 2013: CIDCR6). Furthermore, access should be enabled for both humans and machines (DCSG 2014). We use this to discuss how data citation has evolved over the last couple of decades and to highlight issues that need more research and attention. Data citation is not a new concept, but it has changed and evolved considerably since the beginning of the digital age. Basic practice is now established and is slowly but increasingly being implemented. Nonetheless, critical issues remain. These issues arise primarily because we try to address multiple human and computational concerns with a system originally designed in a non-digital world for more limited use cases. The community is beginning to challenge past assumptions, separate the multiple concerns (credit, access, reference, provenance, impact, etc.), and apply different approaches for different use cases.


I. Introduction
Data citation helps make data sharing both more FAIR (findable, accessible, interoperable, and reusable; Wilkinson et al. 2016) and fair. Citation helps make data more findable and accessible through current scholarly communication systems. It can aid interoperability through precise reference to data and associated services and aid reusability by providing some context of how data have been created and used. Citation also helps credit the intellectual effort necessary to create a good data set and provides accountability for that data. It recognizes important scientific contributions beyond the written publication and, therefore, makes things fairer for everyone involved in doing good science.
Data citation is not a new concept. It rests on a fundamental principle of the scientific method that demands recognized, verifiable, and credited evidence behind an assertion. Traditionally, this was done within the literature through citation of materials often held in special library collections, such as field or lab logs, printed books, monographs, and maps (Downs et al. 2015). If the data were small enough, they were simply included directly in the publication. Some fields, such as astronomy, had whole journals devoted to publishing data. The digital age and the corresponding growth in the volume and complexity of data changed all that.
In this review, we adopt the definition that 'Data citation is a reference to data for the purpose of credit attribution and facilitation of access to the data' (TGDCSP 2013: CIDCR6). Furthermore, access should be enabled for both humans and machines (DCSG 2014). We use this to discuss how data citation has evolved over the last couple of decades and to highlight issues that need more research and attention.
Early work illustrates the desire to build from existing systems and culture (e.g., Altman & King 2007). Silvello (2018) provides an excellent, extensive review of the motivations, principles, and high-level practices of data citation. Borgman (2016) provides a similar review and illustrates the disconnect between bibliographic citation principles and data citation.
We build on this work by focusing on how the two concerns, credit and access, and the two audiences, humans and machines, can create tensions in how data citation is conceived and implemented. We review relevant literature and current activities while also drawing from our own experience working as data professionals managing data, systems, and communities for decades. We come from the perspective of observational science, where data record phenomena as they occur and cannot be repeated. This makes precise citation more necessary and more challenging. Our experience is primarily in Earth, environmental, and space science, but we believe our recommendations apply broadly.

II. History
The modern concept of data citation emerged in the late 1990s. For example, one of the authors was involved in an effort at that time in which NASA's Earth science archives (the DAACs) agreed to a common approach to data citation. The US Geological Survey also proposed guidelines (Berquist Jr. 1999). But neither of these approaches was broadly adopted, even within NASA and USGS. Part of the issue was that journals were still developing standards on how to cite electronic resources in general. In the analog era, the intangible concepts of an article were manifest in a physical object, but as we moved into the digital age, we needed "to think not of one space (the physical, paper space) but of three 'spaces' in which 'the same' articles appear: • Information space = the work as intangible entity (ideas) • Cyberspace = digital manifestation (electronic, made of bits) • 'Paper space' = physical manifestation (cellulose and ink, made of atoms)" (Paskin 2000: 2).
This added more complexity to the concept of citation, and the problem was exacerbated by the impermanence of web references (e.g., Lawrence et al. 2001). The Digital Object Identifier (DOI) first emerged in the year 2000, even though the underlying Handle system was almost as old as the web (Kahn & Wilensky 1995). Overall, in those early digital years, data were rarely cited, and if they were, the mechanisms were erratic and inconsistent.
From the mid-2000s, there was a growing consensus on the use of registered, resolvable, authority-based persistent identifiers (PIDs), not only for papers but also for other digital artifacts, notably data. DOIs emerged as the PID of choice for many publishers and data repositories, but there are other popular choices, such as Archival Resource Keys (ARKs) or compact URIs (CURIEs) resolved through metaresolvers like Name to Thing or identifiers.org. Indeed, local identifiers (accession numbers) have been used for centuries for internal management, and even external referencing, especially for biocollections. Moreover, digital entities (e.g., computer files), physical entities (e.g., rock samples), living things (e.g., wildlife), and descriptive entities (e.g., mitosis) have different requirements for identifiers (Guralnick et al. 2015). McMurry et al. (2017) provide a good contemporary review of identifiers and how to use them, and the RDA Data Fabric Interest Group has developed a set of assertions about the nature, creation, and implementation of PIDs (Wittenburg et al. 2017).
At the same time, there was an emerging argument that data should be 'published' in a manner akin to scientific literature and cited accordingly (Callaghan et al. 2009; Costello 2009; Klump et al. 2006; Lawrence et al. 2011). Data started getting DOIs in 2004 through a pilot project in Germany, and DataCite was established in December 2009, with the explicit global mission of minting DOIs for data (Klump et al. 2015). Even as the 'publication' paradigm for data was questioned (Parsons & Fox 2013; Schopf 2012), citation was broadly supported by the library and data management community. This support culminated with the broadly endorsed Joint Declaration of Data Citation Principles in 2014, which defined the core purposes and general practices of data citation (DCSG 2014). But this was an agreement of the library, data, and information science communities. The research community remained largely unaware, and studies have revealed that data citation remains an infrequent and inconsistent process (Howison & Bullard 2016; Mooney & Newton 2012; Mayernik et al. 2016; Silvello 2018).
Part of the reason for infrequent and inconsistent data citation is that, from the researcher's point of view, making data FAIR implies sharing and reuse and therefore effort by the researcher. Based on our collective experience managing multiple data archives and networks, large-scale community data, like satellite imagery, may be reused extensively, but much data from more localized research collections may never be shared or reused until the broad community recognizes that aggregating these data globally is necessary to make further progress (Parsons et al. 2008; Baker & Yarmey 2009). This takes time and effort. For example, starting in 1995 with the creation of the National Center for Ecological Analysis and Synthesis (NCEAS) (Hackett et al. 2008) and subsequent synthesis centers, sharing and reuse through synthesis became an established norm for disciplines like ecology, evolution, and socio-ecology. Nonetheless, only recently has data citation become common in synthesis papers. It appears that a culture of data citation must be preceded by a culture of sharing and reuse, one that values reproducibility, transparency, and credit in practice (Stuart 2017).

III. Current Activity and Issues
In the last few years, we have seen much activity to promote data citation and to define specific guidelines for both data and software citation. The Digital Curation Centre and Earth Science Information Partners (ESIP) have updated their respective, long-standing data citation guidelines (Ball & Duke 2015; EDPSC 2019). The Research Data Alliance (RDA) produced a Recommendation on citing specific subsets of very dynamic data (Rauber et al. 2015). Publishers and repositories are coming together on common citation practices. We are optimistic that data (and software) citation is emerging as a norm for observational science. We are especially encouraged by the recent project led by the American Geophysical Union and others on 'Enabling FAIR Data', which has led to many publishers now requiring data citation in their author guidelines (Stall et al. 2018). The new Transparency and Openness Promotion statement shows similar commitment beyond Earth, environmental, and space sciences (Aalbersberg et al. 2018). It appears we are approaching the critical mass for a broad behavioral shift. Nonetheless, multiple issues remain.
Many of the issues are rooted in the fact that we are taking a concept implemented for physically printed literature and human beings and trying to use it to address multiple concerns for both humans and machines. We awkwardly try to have the digital space match the 'paper' space, as Paskin (2000) put it.

A. Specific and verifiable citation
One issue of data citation practice is citing precise subsets of versioned data. This is necessary to meet the 'specific and verifiable' principle of the Joint Declaration and is fundamental to the reproducibility use case for citation. Of course, the simplest, logical approach is to assign a new PID if there is any change in the data set, but this can become unwieldy with large, dynamic data such as data streaming from a remote instrument undergoing multiple calibrations and corrections. Furthermore, repositories can have very different approaches to versioning their data and how they recommend citing different versions. They also package data and assign PIDs at very different levels of granularity.
We find the RDA Recommendation on Data Citation of Evolving Data (Rauber et al. 2015) the best approach to citing specific subsets of very dynamic data. The basic idea is to assign and maintain a PID for a specific, time-stamped query of a data set, as well as a PID for the data set as a whole. This means the repository must continue to resolve the query PID and maintain or migrate the technology necessary to resolve the actual query within the data set. The RDA Data Citation WG has conducted multiple implementation workshops and has reports from repositories adopting this approach every six months at RDA Plenaries. The approach is getting broader adoption, but it is not at all a norm across repositories. Many simply do not have the capacity to implement it yet; sustaining these citations means sustaining query systems not just data; and maintaining access to these cited queries through technology cycles can be quite challenging (Stockhause & Lautenschlager 2017). Note, this approach provides reference and access to a precise subset, but it does not necessarily address specific credit concerns for that subset, such as when different authors contribute to a larger collection.
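The basic idea of citing a time-stamped query can be illustrated with a minimal sketch. It is not drawn from the Recommendation itself or from any repository's implementation, and it rests on several assumptions: records carry `added` and `retired` timestamps (hypothetical field names) so that a past state can be reconstructed, timestamps are ISO 8601 strings in UTC with a common format, and the `example-pid:` prefix is a placeholder rather than a real identifier scheme.

```python
import hashlib
import json
import uuid


def run_query(records, station, as_of):
    """Return the rows for `station` that were current at time `as_of`."""
    rows = [
        r for r in records
        if r["station"] == station
        and r["added"] <= as_of
        and (r["retired"] is None or r["retired"] > as_of)
    ]
    return sorted(rows, key=lambda r: r["time"])


def fingerprint(rows):
    """Hash a canonical serialization of the result set."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class QueryStore:
    """Keeps the query, its timestamp, and a fingerprint under a query PID."""

    def __init__(self, records):
        self.records = records
        self.entries = {}

    def cite(self, station, as_of):
        rows = run_query(self.records, station, as_of)
        pid = "example-pid:" + uuid.uuid4().hex  # placeholder, not a real PID
        self.entries[pid] = {
            "station": station,
            "as_of": as_of,
            "sha256": fingerprint(rows),
        }
        return pid

    def resolve(self, pid):
        """Re-execute the stored query and verify the result is unchanged."""
        entry = self.entries[pid]
        rows = run_query(self.records, entry["station"], entry["as_of"])
        if fingerprint(rows) != entry["sha256"]:
            raise ValueError("result set no longer matches the cited fingerprint")
        return rows
```

Even in this toy form, resolving the citation depends on keeping the query logic and the record history operational, which is the sustainability burden noted above.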
There are other approaches for citing dynamic data recommended by DCC, ESIP, and DataVerse (Ball & Duke 2015; Crosas 2014; EDPSC 2019). These include capturing time slices or snapshots of an ongoing time series, having different PIDs for the data set concept and specific versions or instances, or simply establishing careful documentation practices. These approaches are incomplete in that they tend to be more appropriate for relatively static data and often require human interpretation. For example, it is not practical to continually mint new PIDs for a data set that may update every six seconds (typical for many automated meteorological stations). Similarly, some repositories do not find it appropriate to mint new PIDs for minor changes to a data set (e.g., a typo in the documentation) because it can unnecessarily complicate tracking the use of a data set. They rely on the user to apply their judgement on what is a meaningful change for their application. Computers cannot exercise such judgement.

B. What to cite
Another issue is deciding what constitutes a first-class object in scholarly discourse: the 'importance' principle. Data are only truly useful if they are accompanied by detailed documentation about how they were collected, their uncertainties, and relevant applications. Multiple data journals have emerged to provide 'peer-review' of data and documentation and to publish 'data papers' that provide recognizable credit for data authors or creators. They have different approaches, however, on what is to be cited: the document, the data, or both. Often the paper and the associated data set have different authors. The papers also differ in how they structure the information for humans and machines.
Some organizations are now developing well-structured, machine-readable 'publications' that provide the best services of both publishers and data repositories. A good example is the Whole Tale project, which is a collaboration among repositories (e.g., DataOne, Globus, DataVerse), computing providers, and publishers to implement reproducible data papers (Brinckman et al. 2019; Chard et al. 2019). Whole Tale and similar systems like Binder and CodeOcean provide mechanisms to package the data and products of research along with the code and computing environment that produced them, machine-readable provenance about how they were produced, references to published input data, and the scientific narrative that frames the rationale and conclusions for the work, all in a citable and re-executable Research Object (Bechhofer et al. 2010). These complex, hybrid publications span the boundaries between data, software, and publications to enable fully transparent research publication.

C. Tracking use and impact
Part of the purpose of citing data is to provide credit and attribution for the creation of the data set and correspondingly to help determine how and how often a data set is used. Data can have many important uses outside of research publications, though, and people are exploring better ways to track the impact of data. The National Information Standards Organization (NISO) defined a 'Recommended Practice' for alternative assessment metrics, but they primarily emphasize the need for data citation. They observe that 'there currently seems to be a lack of interest in altmetrics for data in the community' (NISO 2016: 16). Peters et al. (2016) also find little use of altmetrics and no correlation between altmetrics and citation. Because of inconsistent citation practices, text mining may be a better way to identify literature-data relationships (Kafkas et al. 2013). Nevertheless, research from Kratz and Strasser (2015b) indicates that citation and data downloads are the measures most valued by researchers. To that end, work through RDA has led to the 'Make Data [...]. Much more research is needed on how to assess data use and impact beyond bibliometrics. Groups such as the VALUABLES consortium are beginning to explore these issues through the lens of economics. For example, Bernknopf et al. (2016) found that federal agencies could save $7.7 million/year in post-wildfire response if Landsat data were used, while Cooke & Golub (2019) report that a 30% reduction in weather uncertainty impacts on corn and soybean futures due to soil moisture measurements from NASA's Soil Moisture Active Passive satellite had a net worth of $1.44 billion/year. Even more uncertain are methods of quantifying the impacts of data on public policy.
Providing credit for that impact is also tricky. Many people play critical roles in the creation of even the simplest data set, and their performance is evaluated in different ways. Not everyone is measured by their research paper publication record. One effort to recognize these other contributions is Project CRediT (Contributor Roles Taxonomy), which has defined a taxonomy of contributor roles for research objects and suggests using digital badges that detail what each author did for the work and link to their profiles elsewhere on the Web (Allen et al. 2014). We also recognize the concept of transitive credit, which could be used to recognize the developers of products other than papers (Katz 2014), and to use provenance to understand how upstream data and software enabled advances in research. Ultimately, we must recognize that credit is a human concern requiring context and interpretation that cannot be readily automated. We can work to build attribution into the scientific workflow, but human judgement is still required to assess the relative value of various contributions. Indeed, a study of software attribution found that automating credit mechanisms can lead to perverse metrics and incentives that can falsely represent the value of a contribution (Alliez et al. 2019).
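The arithmetic behind transitive credit can be sketched briefly. This is a simplified reading of the idea, not Katz's (2014) specification: it assumes each product declares fractional weights that sum to one over its direct contributors, where a contributor may itself be another product, and that the resulting graph has no cycles. The product names, roles, and weights below are invented for illustration.

```python
from fractions import Fraction


def transitive_credit(product, credit_maps, share=Fraction(1)):
    """Distribute `share` of credit for `product` down to individual people.

    `credit_maps` maps a product name to {contributor: weight}. A contributor
    that is itself a key of `credit_maps` is treated as an upstream product,
    and its portion is passed on to its own contributors.
    """
    totals = {}
    for name, weight in credit_maps[product].items():
        portion = share * weight
        if name in credit_maps:
            upstream = transitive_credit(name, credit_maps, portion)
            for person, value in upstream.items():
                totals[person] = totals.get(person, Fraction(0)) + value
        else:
            totals[name] = totals.get(name, Fraction(0)) + portion
    return totals


# Hypothetical weights: a paper that credits one quarter to a data set.
example_maps = {
    "paper": {"author": Fraction(3, 4), "dataset": Fraction(1, 4)},
    "dataset": {"curator": Fraction(1, 2), "field technician": Fraction(1, 2)},
}

paper_credit = transitive_credit("paper", example_maps)
for person, value in sorted(paper_credit.items()):
    print(person, value)
```

The calculation is mechanical once the weights exist; choosing the weights is precisely the step that requires the human judgement discussed above.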
Another concern related to both credit and understanding impact is to identify and trace the connections between all sorts of research objects (data, literature, software, people, organizations, algorithms, etc.). Multiple efforts try to address this. One effort directly related to scholarly publishing is the Scholix (Scholarly Link Exchange) initiative, which emerged out of RDA to more formally interconnect data and literature (Burton et al. 2017). The approach is functional and operational, as it builds from established citation hubs like DataCite, CrossRef, and OpenAire. This effort is expanding and collaborating with other related initiatives through a newly proposed 'Open Science Graphs for FAIR Data Interest Group' within RDA.
Other approaches use more decentralized, Web-based mechanisms which may allow more adaptability and extensibility (e.g., Ma et al. 2017; Parsons & Fox 2018). Work in ESIP and RDA explores how we can use schema.org web markup to identify and link data and repositories. ESIP is also exploring another W3C recommendation called Linked Data Notifications, a sort of RSS-style protocol to request and receive notifications about activities, interactions, and new information. The notification itself is an individual entity with its own URI. Another emergent approach is the Digital Object Interface Protocol (Kahn et al. 2018), which assigns a persistent ID to any digital object and allows that object to express what type of object it is and what operations it allows, such as various web services or basic management functions.
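To make the schema.org approach concrete, the following sketch builds a small JSON-LD description of a data set. `Dataset`, `DataCatalog`, `Person`, `identifier`, `creator`, and `includedInDataCatalog` are schema.org terms; the helper function, the DOI, the ORCID, and the names are placeholders invented for this example, and the profile actually recommended by ESIP or RDA may require other properties.

```python
import json


def dataset_markup(name, pid_url, creator_name, creator_id_url, catalog_name):
    """Build a minimal schema.org Dataset description as a JSON-LD dict."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "identifier": pid_url,
        "creator": {
            "@type": "Person",
            "@id": creator_id_url,
            "name": creator_name,
        },
        "includedInDataCatalog": {
            "@type": "DataCatalog",
            "name": catalog_name,
        },
    }


# All values below are placeholders, not real identifiers.
markup = dataset_markup(
    name="Example station temperature record",
    pid_url="https://doi.org/10.1234/example",
    creator_name="A. Researcher",
    creator_id_url="https://orcid.org/0000-0000-0000-0000",
    catalog_name="Example Repository",
)

# The serialized form would be embedded in a landing page inside a
# <script type="application/ld+json"> element.
print(json.dumps(markup, indent=2))
```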
All these interconnecting technologies are somewhat peripheral to the core purpose of citation, but they highlight how applying machine-actionable PIDs to digital objects can expand the possibilities for knowledge sharing well beyond the bounds of traditional (paper-based) citation. This brings us to the issue and concern of identity itself.

D. Identifying things
People are beginning to rethink how PIDs work. To date, the basic issue of persistence of locators on the web has been addressed by what we might call authority-based identifiers, which separate the identity of an object from its location and are maintained in trusted registries. Klump et al. (2015, 2017) discuss how this approach has evolved and raise issues around the interconnection of identity, institutional commitment, and cost models. They note: 'The focus of the DOI for the data community on paper-like documents and human actors has left some conceptual gaps' (Klump et al. 2015: 133). They argue that we need to explore more advanced features such as identifier templates, more sophisticated content negotiation when resolving identifiers, more machine actionability in general, as well as the social process of maintaining the persistence of an object and its reference.
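As a point of reference for what content negotiation means in practice, the doi.org resolver already lets a client ask for different representations of the same identifier through the HTTP `Accept` header. The sketch below only constructs the two requests and does not send them. The DOI is a placeholder, the `human` and `machine` labels are this example's own, and which media types a given DOI supports depends on its registration agency.

```python
from urllib.parse import quote
from urllib.request import Request

# Media types documented for DOI content negotiation; the audience labels
# are invented for this sketch.
MEDIA_TYPES = {
    "machine": "application/vnd.citationstyles.csl+json",  # structured metadata
    "human": "text/x-bibliography",                        # formatted citation
}


def build_request(doi, audience):
    """Prepare (but do not send) a request for one representation of a DOI."""
    url = "https://doi.org/" + quote(doi, safe="/")
    return Request(url, headers={"Accept": MEDIA_TYPES[audience]})


# Placeholder DOI, not a registered identifier.
machine_request = build_request("10.1234/example", "machine")
human_request = build_request("10.1234/example", "human")
print(machine_request.full_url, machine_request.get_header("Accept"))
print(human_request.full_url, human_request.get_header("Accept"))
```

Passing either request to `urllib.request.urlopen` would perform the resolution; the same identifier then yields metadata for a program or a formatted reference for a reader.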
In other work, data managers are looking to content-based identifiers (i.e., cryptographic-hash-based IDs) to identify exact copies of data, ideally without relying on third parties and external administrative processes. These content-based identifiers can be deployed and resolved in peer-to-peer environments like the InterPlanetary File System (IPFS) and Dat. There is already an established system called Qri (query) which allows users to reference, browse, download, create, fork, and publish data sets with a broad network of peers in IPFS. Furthermore, Dat includes public-key technology to provide assurance on the source of the data and any changes that may have occurred. These approaches are still not well-suited to massive volumes of streaming data, and there is still the issue that different representations of data may be scientifically equivalent but not identical. The hash really needs to include the provenance chain as well as the data set. Nonetheless, these approaches show great promise. In related work, Bolikowski et al. (2015: 281) use the concepts of blockchain and version control systems like Git to propose a distributed system for maintaining long-term resolvability of persistent identifiers. They argue that the 'system should be agnostic with respect to referent type (data sets, source codes, documents, people) and content delivery technology (HTTP, BitTorrent, Tor/Onion)'.
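The strength and the limitation of content-based identifiers can both be seen in a few lines. The sketch below uses a plain SHA-256 digest with an invented `sha256:` prefix; it is not the identifier format used by IPFS, Dat, or Qri, and the two small CSV payloads are made up for illustration.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Derive an identifier from the bytes themselves (illustrative format)."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


# Two hypothetical serializations of the same measurement.
csv_unix = b"station,temp_c\nA,1.5\n"
csv_windows = b"station,temp_c\r\nA,1.50\r\n"

# An exact copy always yields the same identifier, with no registry involved.
same_copy = content_id(csv_unix) == content_id(bytes(csv_unix))

# A scientifically equivalent file with different line endings and number
# formatting yields a different identifier.
equivalent_but_distinct = content_id(csv_unix) != content_id(csv_windows)

print(same_copy, equivalent_but_distinct)
```

Anyone holding the bytes can recompute and verify the identifier, but any change in representation, however immaterial scientifically, produces a new one.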
It is important to note that content-based identifiers have quite different properties and applications from authority-based identifiers. Content-based identifiers and blockchain technologies can be useful for tracking provenance, precise data queries, and internal repository management concerns. Authority-based identifiers are being used for ensuring the social requirements necessary to maintain a persistent and managed access location (Di Cosmo et al. 2018), while those governance structures are only now being developed for distributed, content-based identifier systems.
Finally, it is worth noting that various authority-based identifiers have been or are being developed for other research objects and entities. These include the Open Researcher and Contributor ID (ORCID) for individual researchers (Haak et al. 2012); the Research Organization Registry (ROR) Community, which is working to develop identifiers for research organizations; and the work of the Persistent Identification of Instruments Working Group of RDA, which is implementing processes to use DataCite DOIs for scientific instruments. These are only a few examples of the growth in the development and application of PIDs. Similar to Linked Open Data approaches (Bechhofer et al. 2011; Bizer et al. 2009), these other types of PIDs can make more elements of a citation precise and machine-actionable, but again they move us further away from the traditional human-oriented citation. Consider, fancifully, if this article were cited making full use of identifiers (Figure 1). It is more precise but also more opaque for the human reader.
Also, PIDs cannot always point precisely to the thing but rather only a representation of the thing (ORCIDs don't reference people, they reference descriptions of people). Access may be precisely defined, but credit and reference are inherently ambiguous to some degree (Hayes & Halpin 2008). The human remains in the loop.

IV. Looking to the Future
Despite recognized definitions (Borgman 2015; DCSG 2014; TGDCSP 2013), data citation remains a complex and evolving issue. On the one hand, the basic principles and process are well established. We know how to cite most data in research publications. We must only accelerate the implementation, and there does appear to be movement in that direction. On the other hand, long-established academic practices and assumptions about what is 'important' in scientific work lead us to bundle many different concerns into the concept of citation. At the same time, the opportunities promised by PIDs lead us to bundle even more concerns into reference schemes. Reconsideration and disaggregation of data citation concerns are overdue.
Data citation is often viewed as a computational or information science problem (Buneman et al. 2016;Silvello 2018), and sometimes more accurately as a social or cultural adaptation problem (Borgman 2015; Klump et al. 2015). But it is a complex socio-technical problem with many nuanced concerns. In short, it is an issue of praxis.
We are inspired by the Force11 effort to identify some of the myriad use cases for software citation (Smith et al. 2016). We feel we need to do the same with data citation: define multiple use cases that 1) de-emphasize the importance of the scientific paper in favor of more precise assertions and supporting evidence and 2) emphasize the valuable use of data outside traditional scholarly environments. Some of this work has begun in RDA and ESIP. This should help us sort out the different concerns.

[Figure 1: This article cited making full use of identifiers. Labels: author ORCIDs; imaginary ROR ID for DSJ or CODATA; imaginary DOI for article.]

Recognizing access as a machine concern can help us focus on providing data as a service rather than simply as an object downloaded by a human. This, in turn, can help us make intelligent choices about what type of identifiers to use for what application. The ID for the human interested in a general description of the data may be different and will behave differently than the ID for the machine.
Going forward, we should accelerate the substantial progress we have made on implementing data citation for the basic scholarly use case. At the same time, we should not overextend the concept nor expand our expectations for what citation can accomplish. It is time to rethink some of our assumptions if we are to make data both FAIR and fair.