I. Introduction

Data citation helps make data sharing both more FAIR — findable, accessible, interoperable, and reusable () — and fair. Citation helps make data more findable and accessible through current scholarly communication systems. It can aid interoperability through precise reference to data and associated services and aid reusability by providing some context for how data have been created and used. Citation also helps credit the intellectual effort necessary to create a good data set and provides accountability for those data. It recognizes important scientific contributions beyond the written publication and, therefore, makes things fairer for everyone involved in doing good science.

Data citation is not a new concept. It rests on a fundamental principle of the scientific method that demands recognized, verifiable, and credited evidence behind an assertion. Traditionally, this was done within the literature through citation of materials often held in special library collections, such as field or lab logs, printed books, monographs, and maps (). If the data were small enough, they were simply included directly in the publication. Some fields, such as astronomy, had whole journals devoted to publishing data. The digital age and the corresponding growth in the volume and complexity of data changed all that.

In this review, we adopt the definition that ‘Data citation is a reference to data for the purpose of credit attribution and facilitation of access to the data’ (). Furthermore, access should be enabled for both humans and machines (). We use this definition to discuss how data citation has evolved over the last couple of decades and to highlight issues that need more research and attention.

Early work illustrates the desire to build from existing systems and culture (e.g., ). Silvello () provides an excellent, extensive review of the motivations, principles, and high-level practices of data citation. Borgman () provides a similar review and illustrates the disconnect between bibliographic citation principles and data citation.

We build on this work by focusing on how the two concerns, credit and access, and the two audiences, humans and machines, can create tensions in how data citation is conceived and implemented. We review relevant literature and current activities while also drawing from our own experience working as data professionals managing data, systems, and communities for decades. We come from the perspective of observational science, where data record phenomena as they occur and cannot be repeated. This makes precise citation more necessary and more challenging. Our experience is primarily in Earth, environmental, and space science, but we believe our recommendations apply broadly.

II. History

The modern concept of data citation emerged in the late 1990s. For example, one of the authors was involved in an effort at that time in which NASA’s Earth science archives (the DAACs) agreed to a common approach to data citation. The US Geological Survey also proposed guidelines (). But neither of these approaches was broadly adopted, even within NASA and USGS. Part of the issue was that journals were still developing standards for how to cite electronic resources in general. In the analog era, the intangible concepts of an article were made manifest in a physical object, but as we moved into the digital age, we needed “to think not of one space (the physical, paper space) but of three ‘spaces’ in which ‘the same’ articles appear:

  • Information space = the work as intangible entity (ideas)
  • Cyberspace = digital manifestation (electronic, made of bits)
  • ‘Paper space’ = physical manifestation (cellulose and ink, made of atoms)” ().

This added more complexity to the concept of citation, and the problem was exacerbated by the impermanence of web references (e.g., ). The Digital Object Identifier (DOI) first emerged in the year 2000, even though the underlying Handle system was almost as old as the web (). Overall, in those early digital years, data were rarely cited, and if they were, the mechanisms were erratic and inconsistent.

From the mid-2000s, there was a growing consensus on the use of registered, resolvable, authority-based persistent identifiers (PIDs) — not only for papers but also for other digital artifacts, notably data. DOIs emerged as the PID of choice for many publishers and data repositories, but there are other popular choices, such as Archival Resource Keys (ARKs) or compact URIs (CURIEs) resolved through metaresolvers like Name-to-Thing or identifiers.org. Indeed, local identifiers (accession numbers) have been used for centuries for internal management, and even external referencing, especially for biocollections. Moreover, digital entities (e.g., computer files), physical entities (e.g., rock samples), living things (e.g., wildlife), and descriptive entities (e.g., mitosis) have different requirements for identifiers (). McMurry et al. () provide a good contemporary review of identifiers and how to use them, and the RDA Data Fabric Interest Group has developed a set of assertions about the nature, creation, and implementation of PIDs ().
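To make the resolution mechanics concrete, here is a minimal sketch, in Python, of how a compact identifier might be mapped to a metaresolver URL in the style of identifiers.org; the prefix and accession are chosen purely for illustration, and real prefixes must be registered with the resolver.

```python
# Illustrative sketch: a compact identifier (CURIE) of the form prefix:accession
# is turned into a resolvable URL by delegating to a metaresolver, which knows
# which authority currently hosts records for that prefix.
def curie_to_url(prefix: str, accession: str,
                 resolver: str = "https://identifiers.org") -> str:
    """Build a resolver URL for a compact identifier like 'taxonomy:9606'."""
    return f"{resolver}/{prefix}:{accession}"

# Example (prefix and accession chosen for illustration):
print(curie_to_url("taxonomy", "9606"))
# -> https://identifiers.org/taxonomy:9606
```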

At the same time, there was an emerging argument that data should be ‘published’ in a manner akin to scientific literature and cited accordingly (; ; ; ). Data started getting DOIs in 2004 through a pilot project in Germany, and DataCite was established in December 2009, with the explicit global mission of minting DOIs for data (). Even as the ‘publication’ paradigm for data was questioned (; ), citation was broadly supported by the library and data management community. This support culminated with the broadly endorsed Joint Declaration of Data Citation Principles in 2014, which defined the core purposes and general practices of data citation (). But this was an agreement of the library, data, and information science communities. The research community remained largely unaware, and studies have revealed that data citation remains an infrequent and inconsistent process (; ; ; ).

Part of the reason for infrequent and inconsistent data citation is that, from the researcher’s point of view, making data FAIR implies sharing and reuse and therefore effort by the researcher. Based on our collective experience managing multiple data archives and networks, large-scale community data, like satellite imagery, may be reused a lot, but much data from more localized research collections may never be shared or reused until the broad community recognizes that aggregating these data globally is necessary to make further progress (; ). This takes time and effort. For example, starting in 1995 with the creation of the National Center for Ecological Analysis and Synthesis (NCEAS) () and subsequent synthesis centers, sharing and reuse through synthesis became an established norm for disciplines like ecology, evolution, and socio-ecology. Nonetheless, only recently has data citation become common in synthesis papers. It appears a culture of data citation must be preceded by a culture of sharing and reuse, one that values reproducibility, transparency, and credit in practice ().

III. Current Activity and Issues

In the last few years, we have seen much activity to promote data citation and to define specific guidelines for both data and software citation. The Digital Curation Centre and Earth Science Information Partners (ESIP) have updated their respective long-standing data citation guidelines (; ). The Research Data Alliance (RDA) produced a Recommendation on citing specific subsets of very dynamic data (). Publishers and repositories are coming together on common citation practices (; ). Force11 and ESIP have collaborated on software citation principles and guidelines (; ; ).

We are optimistic that data (and software) citation is emerging as a norm for observational science. We are especially encouraged by the recent project led by the American Geophysical Union and others on ‘Enabling FAIR Data’, which has led many publishers to require data citation in their author guidelines (). The new Transparency and Openness Promotion statement shows similar commitment beyond Earth, environmental, and space sciences (). It appears we are approaching the critical mass for a broad behavioral shift. Nonetheless, multiple issues remain.

Many of the issues are rooted in the fact that we are taking a concept implemented for physically printed literature and human beings and trying to use it to address multiple concerns for both humans and machines. We awkwardly try to have the digital space match the ‘paper’ space, as Paskin () put it.

A. Specific and verifiable citation

One issue of data citation practice is citing precise subsets of versioned data. This is necessary to meet the ‘specific and verifiable’ principle of the Joint Declaration and is fundamental to the reproducibility use case for citation. Of course, the simplest logical approach is to assign a new PID if there is any change in the data set, but this can become unwieldy with large, dynamic data such as data streaming from a remote instrument undergoing multiple calibrations and corrections. Furthermore, repositories can have very different approaches to versioning their data and how they recommend citing different versions. They also package data and assign PIDs at very different levels of granularity.

We find the RDA Recommendation on Data Citation of Evolving Data () the best approach to citing specific subsets of very dynamic data. The basic idea is to assign and maintain a PID for a specific, time-stamped query of a data set, as well as a PID for the data set as a whole. This means the repository must continue to resolve the query PID and maintain or migrate the technology necessary to resolve the actual query within the data set. The RDA Data Citation WG has conducted multiple implementation workshops and receives reports every six months at RDA Plenaries from repositories adopting this approach. The approach is gaining broader adoption, but it is far from a norm across repositories. Many simply do not have the capacity to implement it yet; sustaining these citations means sustaining query systems, not just data; and maintaining access to these cited queries through technology cycles can be quite challenging (). Note that this approach provides reference and access to a precise subset, but it does not necessarily address specific credit concerns for that subset, such as when different authors contribute to a larger collection.
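As a rough illustration of this idea (not the Working Group’s reference implementation), the following sketch assumes a hypothetical repository that registers a normalized, time-stamped query alongside the data set PID and returns a derived query identifier that can later be re-resolved against the versioned data.

```python
import hashlib
from datetime import datetime, timezone

# Hypothetical sketch of the "query PID" idea: the repository stores a
# normalized, time-stamped query alongside the data set PID so the exact
# subset can be re-resolved later. Names and the PID scheme are illustrative.

class QueryStore:
    def __init__(self, dataset_pid: str):
        self.dataset_pid = dataset_pid
        self.registry = {}  # query PID -> (normalized query, timestamp)

    def cite_query(self, query: str) -> str:
        """Register a time-stamped query and return a derived query identifier."""
        normalized = " ".join(query.lower().split())  # trivial normalization
        stamp = datetime.now(timezone.utc).isoformat()
        digest = hashlib.sha256(f"{normalized}|{stamp}".encode()).hexdigest()[:16]
        query_pid = f"{self.dataset_pid}/q/{digest}"  # illustrative PID pattern
        self.registry[query_pid] = (normalized, stamp)
        return query_pid

    def resolve(self, query_pid: str):
        """Return the stored query and timestamp so the subset can be re-executed
        against the data as they existed at that time."""
        return self.registry[query_pid]

# Usage (the DOI and query are placeholders):
store = QueryStore("doi:10.1234/example-dataset")
pid = store.cite_query("SELECT * FROM observations WHERE station = 'A'")
print(pid, store.resolve(pid))
```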

There are other approaches for citing dynamic data recommended by DCC, ESIP, and DataVerse (; ; ). These include capturing time slices or snapshots of an ongoing time series, having different PIDs for the data set concept and specific versions or instances, or simply establishing careful documentation practices. These approaches are incomplete in that they tend to be more appropriate for relatively static data and often require human interpretation. For example, it is not practical to continually mint new PIDs for a data set that may update every six seconds (typical for many automated meteorological stations). Similarly, some repositories do not find it appropriate to mint new PIDs for minor changes to a data set (e.g., a typo in the documentation) because it can unnecessarily complicate tracking the use of a data set. They rely on the user to apply their judgement on what is a meaningful change for their application. Computers cannot exercise such judgement.

B. What to cite

Another issue is deciding what constitutes a first-class object in scholarly discourse — the ‘importance’ principle. Data are only truly useful if they are accompanied by detailed documentation about how they were collected, their uncertainties, and relevant applications. Multiple data journals have emerged to provide ‘peer-review’ of data and documentation and to publish ‘data papers’ that provide recognizable credit for data authors or creators. They have different approaches, however, on what is to be cited—the document, the data, or both. Often the paper and the associated data set have different authors. The papers also differ in how they structure the information for humans and machines.

Some organizations are now developing well-structured, machine-readable ‘publications’ that provide the best services of both publishers and data repositories. A good example is the Whole Tale project, which is a collaboration among repositories (e.g., DataONE, Globus, DataVerse), computing providers, and publishers to implement reproducible data papers (; ). Whole Tale and similar systems, like Binder and CodeOcean, provide mechanisms to package the data and products of research along with the code and computing environment that produced them, machine-readable provenance about how they were produced, references to published input data, and the scientific narrative that frames the rationale and conclusions for the work, all in a citable and re-executable Research Object (). These complex, hybrid publications span the boundaries between data, software, and publications to enable fully transparent research publication.

C. Tracking use and impact

Part of the purpose of citing data is to provide credit and attribution for the creation of the data set and correspondingly to help determine how and how often a data set is used. Data can have many important uses outside of research publications, though, and people are exploring better ways to track the impact of data. The National Information Standards Organization (NISO) defined a ‘Recommended Practice’ for alternative assessment metrics, but they primarily emphasize the need for data citation. They observe that ‘there currently seems to be a lack of interest in altmetrics for data in the community’ (). Peters et al. () also find little use of altmetrics and no correlation between altmetrics and citation. Because of inconsistent citation practices, text mining may be a better way to identify literature-data relationships (). Nevertheless, research from Kratz and Strasser () indicates that citation and data downloads are the measures most valued by researchers. To that end, work through RDA has led to the ‘Make Data Count’ project (; ), which has defined a consistent way to count data downloads through the COUNTER Code of Practice for Research Data (). These projects are still primarily oriented to the research community.

Much more research is needed on how to assess data use and impact beyond bibliometrics. Groups such as the VALUABLES consortium are beginning to explore these issues through the lens of economics. For example, Bernknopf et al. () found that federal agencies could save $7.7 million/year in post-wildfire response if Landsat data were used, while Cooke & Golub () report that a 30% reduction in weather uncertainty impacts on corn and soybean futures, due to soil moisture measurements from NASA’s Soil Moisture Active Passive satellite, had a net worth of $1.44 billion/year. Even more uncertain are methods of quantifying the impacts of data on public policy.

Providing credit for that impact is also tricky. Many people play critical roles in the creation of even the simplest data set, and their performance is evaluated in different ways. Not everyone is measured by their research paper publication record. One effort to recognize these other contributions is Project CRediT (Contributor Roles Taxonomy), which has defined a taxonomy of contributor roles for research objects and suggests using digital badges that detail what each author did for the work and link to their profiles elsewhere on the Web (). We also recognize the concept of transitive credit, which could be used to recognize the developers of products other than papers (), and to use provenance to understand how upstream data and software enabled advances in research. Ultimately, we must recognize that credit is a human concern requiring context and interpretation that cannot be readily automated. We can work to build attribution into the scientific workflow, but human judgement is still required to assess the relative value of various contributions. Indeed, a study of software attribution found that automating credit mechanisms can lead to perverse metrics and incentives that can falsely represent the value of a contribution ().

Another concern related to both credit and understanding impact is to identify and trace the connections between all sorts of research objects (data, literature, software, people, organizations, algorithms, etc.). Multiple efforts try to address this. One effort directly related to scholarly publishing is the Scholix (Scholarly Link Exchange) initiative, which emerged out of RDA to more formally interconnect data and literature (). The approach is functional and operational, as it builds from established citation hubs like DataCite, CrossRef, and OpenAIRE. This effort is expanding and collaborating with other related initiatives through a newly proposed ‘Open Science Graphs for FAIR Data Interest Group’ within RDA.

Other approaches use more decentralized, Web-based mechanisms which may allow more adaptability and extensibility (e.g., ; ). Work in ESIP and RDA explores how we can use schema.org web markup to identify and link data and repositories. ESIP is also exploring another W3C recommendation called Linked Data Notifications — a sort of RSS-style protocol to request and receive notifications about activities, interactions, and new information. The notification itself is an individual entity with its own URI. Another emergent approach is the Digital Object Interface Protocol (), which assigns a persistent ID to any digital object and allows that object to express what type of object it is and what operations it allows, such as various web services or basic management functions.
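As one concrete illustration of such web markup, the sketch below constructs a schema.org ‘Dataset’ description as JSON-LD; the DOI, ORCID, and URLs are placeholders rather than real records.

```python
import json

# Illustrative schema.org "Dataset" markup as JSON-LD, the kind of web markup
# explored for identifying and linking data sets, people, and repositories.
# Every identifier and value below is a placeholder, not a real record.
dataset_markup = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example surface temperature observations",
    "identifier": "https://doi.org/10.1234/example-dataset",  # placeholder DOI
    "creator": {
        "@type": "Person",
        "name": "A. Researcher",
        "identifier": "https://orcid.org/0000-0000-0000-0000"  # placeholder ORCID
    },
    "includedInDataCatalog": {
        "@type": "DataCatalog",
        "name": "Example Repository"
    },
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://repository.example.org/data.csv"  # placeholder URL
    }
}

# Embedded in a landing page as a <script type="application/ld+json"> block,
# this lets crawlers and aggregators discover the data set and its links.
print(json.dumps(dataset_markup, indent=2))
```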

All these interconnecting technologies are somewhat peripheral to the core purpose of citation, but they highlight how applying machine-actionable PIDs to digital objects can expand the possibilities for knowledge sharing well beyond the bounds of traditional (paper-based) citation. This brings us to the issue and concern of identity itself.

D. Identifying things

People are beginning to rethink how PIDs work. To date, the basic issue of persistence of locators on the web has been addressed by what we might call authority-based identifiers, which separate the identity of an object from its location and are maintained in trusted registries. Klump et al. (, ) discuss how this approach has evolved and raise issues around the interconnection of identity, institutional commitment, and cost models. They note: ‘The focus of the DOI for the data community on paper-like documents and human actors has left some conceptual gaps’ (). They argue that we need to explore more advanced features such as identifier templates, more sophisticated content negotiation when resolving identifiers, more machine actionability in general, as well as the social process of maintaining the persistence of an object and its reference.

In other work, data managers are looking to content-based identifiers (i.e., cryptographic-hash-based IDs) to identify exact copies of data, ideally without relying on third parties and external administrative processes. These content-based identifiers can be deployed and resolved in peer-to-peer environments like the InterPlanetary File System (IPFS) and Dat. There is already an established system called Qri (query), which allows users to reference, browse, download, create, fork, and publish data sets with a broad network of peers in IPFS. Furthermore, Dat includes public-key technology to provide assurance about the source of the data and any changes that may have occurred. These approaches are still not well suited to massive volumes of streaming data, and there is still the issue that different representations of data may be scientifically equivalent but not identical. The hash really needs to include the provenance chain as well as the data set. Nonetheless, these approaches show great promise. In related work, Bolikowski et al. () use the concepts of blockchain and version control systems like Git to propose a distributed system for maintaining long-term resolvability of persistent identifiers. They argue that the ‘system should be agnostic with respect to referent type (data sets, source codes, documents, people) and content delivery technology (HTTP, BitTorrent, Tor/Onion)’.
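As a minimal sketch of what a content-based identifier involves, assuming nothing beyond a standard cryptographic hash, the example below derives an identifier from the bytes themselves; exact copies share an ID without any registry, while even a scientifically trivial change produces a new one.

```python
import hashlib

# Sketch of a content-based identifier: the ID is derived from the bytes
# themselves, so every exact copy yields the same ID with no registry or
# resolving authority involved, while any change yields a different ID.
def content_id(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()

v1 = b"station,temp\nA,3.2\n"
v2 = b"station,temp\nA,3.3\n"  # a one-character correction

print(content_id(v1))                    # stable for every exact copy of v1
print(content_id(v1) == content_id(v2))  # False: even a trivial edit changes the ID
```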

It is important to note that content-based identifiers have quite different properties and applications from authority-based identifiers. Content-based identifiers and blockchain technologies can be useful for tracking provenance, precise data queries, and internal repository management concerns. Authority-based identifiers are used to meet the social requirements necessary to maintain a persistent and managed access location (), while such governance structures are only now being developed for distributed, content-based identifier systems.

Finally, it is worth noting that various authority-based identifiers have been or are being developed for other research objects and entities. These include the Open Researcher and Contributor ID (ORCID) for individual researchers (); the Research Organization Registry (ROR) Community, which is working to develop identifiers for research organizations; and the work of the Persistent Identification of Instruments Working Group of RDA, which is implementing processes to use DataCite DOIs for scientific instruments. These are only a few examples of the growth in the development and application of PIDs. Similar to Linked Open Data approaches (; ), these other types of PIDs can make more elements of a citation precise and machine-actionable, but again they move us further away from the traditional human-oriented citation. Consider, fancifully, if this article were cited making full use of identifiers (Figure 1). It is more precise but also more opaque for the human reader.

Figure 1. Imaginary citation for this article making full use of PIDs.

Also, PIDs cannot always point precisely to the thing itself but rather only to a representation of the thing (ORCIDs do not reference people; they reference descriptions of people). Access may be precisely defined, but credit and reference are inherently ambiguous to some degree (). The human remains in the loop.

IV. Looking to the Future

Despite recognized definitions (; ; ), data citation remains a complex and evolving issue. On the one hand, the basic principles and process are well established. We know how to cite most data in research publications. We must only accelerate the implementation, and there does appear to be movement in that direction. On the other hand, long-established academic practices and assumptions about what is ‘important’ in scientific work lead us to bundle many different concerns into the concept of citation. At the same time, the opportunities promised by PIDs lead us to bundle even more concerns into reference schemes. Reconsideration and disaggregation of data citation concerns are overdue.

Data citation is often viewed as a computational or information science problem (; ), and sometimes more accurately as a social or cultural adaptation problem (; ). But it is a complex socio-technical problem with many nuanced concerns. In short, it is an issue of praxis.

We are inspired by the Force11 effort to identify some of the myriad use cases for software citation (). We feel we need to do the same with data citation: define multiple use cases that 1) de-emphasize the importance of the scientific paper in favor of more precise assertions and supporting evidence and 2) emphasize the valuable use of data outside traditional scholarly environments. Some of this work has begun in RDA and ESIP. This should help us sort out the different concerns.

We already know that credit is primarily a human concern and access is a machine concern (reference is both), but what does that mean in practice? In health and social sciences, researchers have developed a ‘Payback Framework’ with a logical model of the complete research process and categories of (social health) payback from research (). Can we extend this and apply it to data by recognizing the reuse and value generated at many different stages? Can machine-actionable badges capture credit better than centralized citation indices?

Recognizing access as a machine concern can help us focus on providing data as a service rather than simply as an object downloaded by a human. This, in turn, can help us make intelligent choices about what type of identifiers to use for which application. The ID for the human interested in a general description of the data may be different from the ID for the machine, and it will behave differently.

Going forward, we should accelerate the substantial progress we have made on implementing data citation for the basic scholarly use case. At the same time, we should not overextend the concept nor expand our expectations for what citation can accomplish. It is time to rethink some of our assumptions if we are to make data both FAIR and fair.