Introduction

The demand for better reproducibility of research results is growing. A dataset, small or large, is often revised to correct errors, apply new algorithms, add new surveys, etc. A single data product can be released in different formats and by multiple providers: each of these can then be the source of additional derived data products, often by authors different from those of its precursor. For the reproducibility of research it is therefore important to know:

  1. The source of each version that was used in any subsequent analysis and the sequential history of any evolved data product (provenance); and
  2. Which organisation or individual produced and is sustaining the release of any version (attribution).

As more and more datasets become available online, the versioning problem is becoming acute. In some cases, datasets have become so large that downloading the data as a single file is no longer feasible. Using web services, datasets can be accessed and subsetted at a remote source when needed, but the user often has no knowledge of whether, and when, the dataset they accessed online has been changed or updated.

Errors in research data have the potential to cause significant damage when used to inform decision making. A high-profile case was a multi-national study on a potential COVID-19 treatment that had to be retracted because its results could not be reproduced (). Similarly, another published study, which had used data from the same source to investigate a potential COVID-19 treatment, also had to be retracted (). A lack of best practices for data versioning has been recognised as a contributing factor in what has been called a ‘reproducibility crisis’ (). For research to be reproducible, it is essential for a researcher to be able to cite the exact extract of the dataset that was used to underpin their research publication ().

Versioning procedures and best practices are well established for scientific software (e.g., ; ), and the codebase of large software projects bears some resemblance to large dynamic datasets. The related Wikipedia article gives an overview of software versioning practices (). We investigated whether these concepts could be applied to data versioning to facilitate the goals of reproducibility of scientific results.

Whilst the means for identifying datasets by using persistent identifiers have been in place for more than a decade, community-agreed and systematic data versioning practices are currently not available. Confusion exists about the meaning of the term ‘version’, which is often used in a very general sense. Our collection of use cases showed the term ‘version’ referring to all kinds of alternative artefacts and the relationships between them (), extending beyond the more common understanding of ‘version’ as referring primarily to revisions and replacements ().

The work presented in this paper was undertaken within the Research Data Alliance (RDA) Data Versioning Working Group (, ) that worked with other RDA Groups, such as the Data Citation and Provenance Patterns Working Groups, the Data Foundations and Terminology, Research Data Provenance and Software Source Code Interest Groups, the Use Cases Coordination Group, as well as the Dataset Exchange Working Group of the W3C to develop a common understanding of data versioning and recommended practices.

The work of the RDA Data Versioning Working Group presented in this paper aimed to collect and document use cases and practices to make recommendations for the versioning of research data, and to investigate to what extent these practices can be used to enhance the reproducibility of scientific results (e.g., ). The outcomes from this work add a central element to the systematic management of research data at any scale by providing a conceptual framework and six principles that can be used to guide the versioning of research data.

Related Work

Versioning has been an important concept for tracking changes and identifying the state of a resource, especially for digital objects that constantly go through revisions and changes. For example, it is established practice in the software community to use version control systems such as the Concurrent Versions System (CVS) (see e.g., ) to keep track of changes in source code, and to name software versions and releases according to the semantic naming and numbering protocol ().

Following the versioning practices in software development, the data community has recognised that, for reproducibility, it is important to know whether a dataset has been changed in the course of its life cycle. DataCite recommends that a data publication include the version in its citation and that two versions of a data product cross-reference each other through the relation types ‘HasVersion’ and ‘IsVersionOf’ ().
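For illustration, such cross-referencing could be recorded as in the following minimal sketch; the DOIs and values are hypothetical placeholders, not prescribed by DataCite.

```python
# Sketch: a data product and one of its versions cross-referencing each
# other with DataCite relation types. All DOIs are hypothetical.
product = {
    "identifier": "10.1234/example",          # hypothetical concept DOI
    "relatedIdentifiers": [{
        "relatedIdentifier": "10.1234/example.v2",
        "relatedIdentifierType": "DOI",
        "relationType": "HasVersion",         # product -> version
    }],
}
version_2 = {
    "identifier": "10.1234/example.v2",       # hypothetical version DOI
    "version": "2.0",
    "relatedIdentifiers": [{
        "relatedIdentifier": "10.1234/example",
        "relatedIdentifierType": "DOI",
        "relationType": "IsVersionOf",        # version -> product
    }],
}
```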

Several international standards bodies include data versioning in their recommended practices. The W3C Dataset Exchange Working Group () gives definitions of data versioning concepts. Among the use cases documented by the W3C working group are four use cases that focus on versioning, including version definition, version identifier, version release date, and version delta, identifying current shortcomings and motivating the extension of the Data Catalog Vocabulary (DCAT) (). This work includes the provenance of versioning, as described in PROV-O (), and the provenance, authoring, and versioning (PAV) ontology ().

Many data centres now include data versioning as an important aspect of their data management practices (e.g., ), as can be seen in the many use cases collected by the RDA Data Versioning Working Group ().

The RDA Data Citation Working Group addressed the question of how to identify and cite a subset of a large and dynamic data collection. The recommendations given by the RDA Data Citation Working Group () include data versioning as a key concept: ‘Apply versioning to ensure earlier states of datasets can be retrieved’ (R1 – Data Versioning). Fundamental to this recommendation is the requirement for unambiguous references to specific versions of data used to underpin research results. In this concept, any change to the data creates a new version of the dataset. R6 – Result Set Verification offers a simple way to determine whether two datasets differ by calculating and comparing checksums.
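For illustration, a checksum comparison along the lines of R6 could be implemented as in the following minimal sketch; the file names are hypothetical.

```python
import hashlib

def dataset_checksum(path: str, algorithm: str = "sha256") -> str:
    """Return a hex checksum of a dataset file's bitstream."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            digest.update(chunk)
    return digest.hexdigest()

# Per R6, two result sets are bit-identical if and only if their
# checksums match; per R1, a differing bitstream is a new version.
if dataset_checksum("result_v1.csv") != dataset_checksum("result_v2.csv"):
    print("Bitstreams differ: identify a new version.")
```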

However, through our work in the RDA Data Versioning Working Group we recognised that determining changes in the bitstream of a dataset is only one aspect of what is commonly called ‘versioning’. Just knowing that the bitstreams of two datasets differ does not give us other essential information that we might need to know to determine the nature or significance of the change, or how different data files relate to each other.

The analysis of use cases and prior work showed that, although there are current practices, standards and tracking methods for data versioning, a high-level framework for guiding the consistent practice of data versioning is still lacking. The conceptual framework and data versioning principles proposed in this paper are intended to fill this gap.

Use Cases

The RDA Data Versioning Working Group collected 39 data versioning practice use cases from 33 organisations from around the world that cover different research domains, such as social and economic science, earth science, and molecular bioscience, and different data types (). The use cases describe current practices reported by data providers. These use case descriptions are useful in identifying differences in data versioning practices between data providers and highlighting encountered issues. The hash-prefixed names that appear in the list below refer to the use cases collected for our analysis and cited in ().

Through analysis of these use cases, we compiled the following list of issues or inconsistencies of practices across data producers:

  • Issue 1: What constitutes a new release of a dataset and how should it be identified? Consider the following situations:
    1. There is no change to the data itself, but rather to the data structure, format or schema;
    2. Revisions are made to correct identified errors (corrigenda and errata);
    3. New analytical and/or processing methods are applied to a select number of attributes/components of the existing dataset;
    4. Data is processed with a different calibration or parameterisation of the processing algorithm;
    5. Models and derived products are revised with new or updated data; and
    6. The data itself is revised as processing methods are improved, e.g. by a new algorithm.
      In each of the above situations, we observe inconsistencies in practice as to whether a new release or a new dataset should be recommended (#DIACHRON, #USGS, #BCO-DMO, #CSIRO, #Molecular, #AAO, #DEA).
  • Issue 2: What is the significance of the change from one version to the next?
    Although the definitions of minor revision, substantial revision and major revision are context-dependent, there should be a guideline on each, including how to identify corrigenda and errata. For example, a researcher who used an old version should be able to determine from the documentation of the changes whether a newer version with minor changes could change the outcome of their research (all use cases, e.g., #C-O-M, #C-F-H, #ASTER, #MT).
  • Issue 3: Do changes in the metadata change the version of the associated dataset?
    There are inconsistent practices for treating metadata revision and the implications for the data described by the metadata. For some use cases, a change in the metadata initiates a new version of the data and creation of a new DOI. Other use cases argue that if only the metadata is updated, neither a new version of the data nor a new DOI is created (compare #BCO-DMO, #CSIRO, #USGS).
  • Issue 4: What needs to be included in a versioning history?
    Inconsistency in documenting version history: some histories are comprehensive, some very light or absent altogether. When a dataset has a new version, it should be easy for users to judge what kinds of changes have been made, so that users can 1) select the appropriate version, and 2) assess if the changes would affect a research conclusion based on data from previous versions (all use cases, e.g., #DIACHRON, #VersOn, #USGS, #BCO-DMO).
  • Issue 5: How should a version be named or numbered?
    Inconsistency in naming or numbering each version: data producers use various terms for version, e.g., Version 1, 2; Collection 1, 2; Release 1, 2; Edition 1, 2; vYYYYMMDD. What are the differences between these terms, and what do the differences mean, if any? (#NASA: EOSDIS and SEDAC, #Molecular, #GA-EMC, #CMIP6, #RDA-DDC-R)
  • Issue 6: What level of granularity is appropriate for Persistent Identifiers (PIDs)?
    The granularity of PIDs: should every revision receive a PID, or should only each release, or a certain level of revision, receive one? (#USGS, #ESIP)
  • Issue 7: Which version does the landing page of a dataset point to?
    For a collection with multiple versions, a landing page may point to the latest version, to all published versions, or to all published and archived versions (#BCO-DMO, #NASA, #AAO, #GA-EMC, #Molecular).
  • Issue 8: What versioning information should be included in a data citation?
    Version-related information should, at a minimum, include a version number, a data access URL, a date of access, or other identifying information (see the sketch following this list).
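As a sketch of Issue 8, the following hypothetical helper assembles a citation carrying the minimum version information listed above; the citation format itself is illustrative, not a prescribed style.

```python
from datetime import date

def format_data_citation(creator: str, year: int, title: str,
                         version: str, publisher: str,
                         access_url: str, accessed: date) -> str:
    """Assemble a human-readable data citation that records the
    version number, access URL and date of access (Issue 8)."""
    return (f"{creator} ({year}): {title}. Version {version}. {publisher}. "
            f"{access_url} (accessed {accessed.isoformat()})")

# Example with hypothetical values.
print(format_data_citation("Example Author", 2020, "Example Dataset",
                           "2.1", "Example Repository",
                           "https://example.org/data", date(2020, 6, 1)))
```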

Revision vs. Release

In our analysis of the use cases we noticed that the terms ‘version’, ‘revision’, and ‘release’ were used almost interchangeably, even though the three terms can mean different things across the use cases. Following common practices in software development (), we distinguish between tracking revisions of and changes made to a dataset, and the editorial process of identifying a particular version of a dataset as a release.

Here we define a revision as a change made to the bitstream of a dataset, while the release of a dataset is an editorial process that designates a particular revision as published for reuse. The description of a dataset release must explain the significance of the changes to its designated user community. Not all revisions lead to a new release.
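This distinction can be sketched in code; the following minimal Python sketch, with hypothetical identifiers and descriptions, separates tracked revisions from the editorial act of designating one of them as a release.

```python
from dataclasses import dataclass

@dataclass
class Revision:
    revision_id: str   # unique identifier of a changed bitstream
    description: str   # what was changed

@dataclass
class Release:
    version: str        # editorial label communicated to users, e.g. '2.0'
    revision: Revision  # the particular revision designated for publication
    significance: str   # nature and significance of the change for reusers

# Several revisions may accumulate before one is designated a release.
r1 = Revision("rev-001", "corrected calibration of sensor 3")
r2 = Revision("rev-002", "reprocessed all records with improved algorithm")
release = Release("2.0", r2, "Major: results may differ from release 1.x")
```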

The FRBR model and its application to digital data versioning

The identification of versions is not unique to data and software but is relevant for other information resources too. The International Federation of Library Associations and Institutions (IFLA) Study Group on the Functional Requirements for Bibliographic Records developed a general conceptual framework to describe how information resources relate to each other in their Functional Requirements for Bibliographic Records (FRBR) (). We identified FRBR as a potential conceptual framework to inform our data versioning principles, which are discussed in the sections below. The FRBR model is a conceptual entity–relationship model that provides ‘a clearly defined, structured framework for relating the data that are recorded in bibliographic records to the needs of the users of those records’ (). Figure 1 shows the definition of the four top-level entities and their relationships: Work, Expression, Manifestation and Item.

Figure 1 

Relationship of Work, Expression, Manifestation and Item in the FRBR model ().

In the digital era, the FRBR model has the potential not only to distinguish multiple derivatives of an original dataset, but also to help establish transparent provenance chains describing how a particular dataset evolved from the initial collection of the original data through to its publication, and, more importantly, to provide attribution and credit to the researchers, institutions, funders, and others who were involved in the creation of each individual version.

Hourclé () was the first to apply the FRBR model to scientific data. Note that Hourclé’s work focuses on the specific use case of remotely sensed satellite data, and the mapping of processing levels () may not be universally applicable to all data types because of the specific requirements of that use case. As an example, the determination of an element’s isotopic ratio on a mass spectrometer has to go through several processing steps from the raw sensor output to a table of measurements and on to a higher aggregate product, but the types of corrections and conversions are completely different to those applied to satellite images. The concept of processing levels is still useful, but it has to be applied in a way that is specific to the use case.

As a generalisation of Hourclé’s work, we suggest an alternative mapping of the FRBR entities to data that also takes into account more recent concepts developed for the Observations and Measurements model (ISO 19156) ().

  1. A Work is the observation (e.g. an experiment) that results in the estimation of the value of a feature property, and involves the application of a specified procedure, such as a sensor, instrument, algorithm or process chain. In the FRBR model the Work is an abstract entity;
  2. An Expression is the realisation of a Work in the form of a logical data product. Any change in the data model or content constitutes a new Expression;
  3. A Manifestation is the embodiment of an Expression of a Work, e.g., as a file in a specific structure and encoding. Any change in its form (e.g., file structure, encoding) is considered a new Manifestation; and
  4. An Item is a concrete entity, representing a single exemplar of a Manifestation, e.g., a specific data file in an individual, named data repository. An Item can be one or more than one object (e.g., a collection of files bundled in a container object).
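To illustrate this mapping, the following minimal Python sketch models the four FRBR entities as nested data structures; the names, encodings and URLs are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Item:                  # concrete exemplar, e.g. a file at a repository
    location: str

@dataclass
class Manifestation:         # specific structure/encoding of an Expression
    encoding: str
    items: List[Item] = field(default_factory=list)

@dataclass
class Expression:            # logical data product realising the Work
    name: str
    manifestations: List[Manifestation] = field(default_factory=list)

@dataclass
class Work:                  # the abstract observation or experiment
    title: str
    expressions: List[Expression] = field(default_factory=list)

# Example: one Expression offered in two encodings, each held as one Item.
work = Work("Radiosonde observations", [
    Expression("tabulated profiles", [
        Manifestation("UTF-8 TSV", [Item("https://example.org/data.tsv")]),
        Manifestation("netCDF",    [Item("https://example.org/data.nc")]),
    ]),
])
```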

Taking the #ASTER use case as an example, Wyborn (in ) applied the FRBR model, combined with the well-established NASA processing levels (), to document the Full Path of Data () for the sequence of data products and data distributions derived from the original Japanese Space Systems (JSS) Advanced Spaceborne Thermal Emission and Reflectance Radiometer mission (ASTER – http://asterweb.jpl.nasa.gov, Figure 2). In this use case, we use the term Full Path of Data to track how the dataset evolves, starting with the capture of the original source data, through the production of multiple derivative products, to its ultimate distribution from multiple sources: each of these ‘products’ has its own data life cycle ().

Figure 2 

Schematic overview of the FRBR model applied to the Full Path of Data from the ASTER mission. Each processing level (grey boxes) has its own data life cycle (e.g., ()).

In late 2009, a national initiative supported by multiple Australian organisations produced a set of 17 ASTER National data products of the Earth’s surface mineralogy of Australia that could be used from a continental scale (1:250,000) down to the scale of a mineral prospect (1:50,000). Each of these 17 mineral maps was available in 1) Band Sequential (BSQ) image format; 2) more GIS-compatible products in GeoTIFF format; and 3) netCDF files to optimise use for analysis in High-Performance Computing ().

Considering the complexity of the ASTER use case, in particular the different formats of each data product combined with the multiple sites each is released from, the FRBR model proved to be useful. When combined with the use of unique persistent identifiers, the model can be used to help ensure reproducibility, by knowing the identity of each object; provenance, by knowing the source of each version that was used in any subsequent analysis as well as the sequential history of any evolved data product; and attribution, by knowing which organisation or individual had produced and was hosting the release of any version.

In detail, the various entities along the ASTER Full Path of Data are as follows:

  1. The Work: In this example, the Work is the observations taken by the ASTER sensor on board the Terra (EOS AM-1) satellite. This work is expressed in a number of data products.
  2. The Expression: The reduction of the ASTER Level 0 (raw instrument data) to Level 1B or Level 2 products by JSS produced the four initial versions of the ASTER ‘Work’, as shown in the grey boxes in Figure 2. The Australian initiative then produced a set of mineral map data products from the Level 1B or Level 2 products (). This involved applying a series of product masks/thresholds to generate a suite of geoscience mineral maps that included 14 ASTER VNIR/SWIR Geoscience products and three ASTER Thermal Infrared (TIR) products.
    In FRBR hierarchy terminology, each product at each processing level, L0 to L3, as well as each of the 17 maps derived at L4, is considered to be an Expression of the Work. These represent 22 individual Expressions in total.
  3. The Manifestations: Each of these 17 L4 mineral maps was made available in three different formats that relate to different user requirements, infrastructures and capabilities:
    1. Band sequential image (BSQ) files that can be restretched/processed;
    2. GeoTIFF files that were generated by contrast stretching and colour rendering to national standards to generate more user-friendly GIS-compatible products; and
    3. Self-describing netCDF files for analysis at full resolution and at continental scale: these could also be subsetted down to very small bounding boxes for local analysis at the prospect or local scale.
    In the terminology of the FRBR model, each set in each of the three formats is considered to be a Manifestation of each of the 17 L4 Expressions of the Work, resulting in 51 manifestations in total.
  4. The Items: The file sizes of some of the national coverages were very large – the netCDF files are ~60 GB each. When these files were first created in 2012, they were too large to make available online as file downloads, so a series of Items, subsetting these files into standard 1:1,000,000 map tiles, was generated from each Manifestation and delivered from various organisational websites, e.g., CSIRO, Geoscience Australia, and State-Territory Geoscience Maps at the State level. The NCI instance still provided web service access to the large single files, and the user could generate their own subset and either use it in situ or download it for local processing.
    Following the definition in the FRBR model, each file released from each location is considered to be an Item of a Manifestation of each of the 17 L4 Expressions of the original Work, represented by more than 1200 items in total.

To show the general applicability of FRBR to research data, we applied the pattern to a dataset from PANGAEA (). This dataset is a collection of 426 individual data sets of tabulated data, each documenting a meteorological radiosonde ascent launched from the Georg-Forster Antarctic Research Station operated by the German Democratic Republic. The data are stored in a relational database at the PANGAEA data repository. The tables for data delivery are generated on demand in a wide range of character encodings ().

In the example of data in PANGAEA, the FRBR model can be applied as follows:

  1. The Work: In this example, the Work is the observations taken during the radiosonde ascents. This work is expressed in a data product, i.e. ().
  2. The Expression: The series of radiosonde ascents is expressed as a set of tabulated data.
  3. The Manifestations: The data are offered by the PANGAEA database in a specific character encoding; each encoding is a Manifestation of the data.
  4. The Items: The data Items are the bitstreams of the data delivered by PANGAEA for download in the specified encoding.

The Six Data Versioning Principles

Using the FRBR model as our reference model, we analysed the issues as extracted from the use cases (), the work from W3C () and the RDA Data Citation Working Group (). While prior work focused on differences in the bitstream between versions, we found a number of additional questions that versioning practices try to address:

  • What constitutes a change in a dataset? (Revision: Issues 1, 2, 3)
  • What are the magnitude and significance of the change? (Release: Issues 1, 2)
  • Are the differences in the bitstream due to different representation forms? (Manifestation: Issue 2)
  • If the data are part of a collection, which elements of the collection have changed? (Granularity: Issues 2, 6)
  • How do two versions relate to each other? (Provenance: Issues 4, 5, 7)
  • How can we express information on versioning when citing data? (Citation: Issues 5, 8)

Note that we are specifically discussing changes in the dataset, not in the metadata records in a catalogue that describe an individual dataset. Updating the metadata record does not create a new version in our model; it only changes the catalogue entry. The metadata record of a dataset may be changed to correct the metadata, to add metadata elements, to change the location of the service endpoints, or for any other reason. As long as these changes do not change the bitstream of a dataset manifestation, a change in the metadata record does not constitute a new version.

The analysis of the use cases documented in () demonstrated the need to distinguish between versioning based on changes in a dataset (data revisions) versus communicating the significance of these changes (data release) as part of an editorial process in the data lifecycle. In addition, we recognised that concepts from the FRBR model can be used to describe relationships between different versions of a dataset.

As an outcome of our analysis, we recommend the following six principles to address the identified issues in data versioning.

Principle 1: Version Control and Revisions (Revision)

A new instance of a dataset, produced in the course of data production or data management, that is different from its precursor is called a ‘revision’, and it should be separately (uniquely) identified. As noted in the discussion of prior work, the recommendations given by the RDA Data Citation Working Group already state that any change to a dataset creates a new version of the dataset that needs to be identified (). This may also require the minting of a persistent identifier for this new version.

This practice of fine-granular identification of revisions is derived from version control commonly applied to the management of software code where every change to the code is identified as a separate version, often called a ‘revision’ or ‘build’ (). In the case of software versioning, the revision or build number can change far more frequently than the version number of a ‘released’ version.
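By analogy with such version control, a content-derived revision identifier could be computed as in the following sketch; this is one possible scheme, shown for illustration with made-up data, not a recommendation of a specific algorithm.

```python
import hashlib

def revision_id(bitstream: bytes) -> str:
    """Derive an identifier for a revision from the dataset bitstream,
    analogous to content-addressed commits in software version control."""
    return hashlib.sha256(bitstream).hexdigest()[:12]

# Hypothetical data: correcting a single value changes the bitstream.
v1 = revision_id(b"station,temperature\nGF,-42.1\n")
v2 = revision_id(b"station,temperature\nGF,-42.2\n")
assert v1 != v2  # any change to the bitstream yields a new revision identifier
```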

Principle 2: Identifying releases of a data product (Release)

In some cases, the production of a dataset can be quite complex. The dataset may go through a number of revisions before it is considered to be ‘final’. The publication of such a ‘final’ version of a dataset is called a ‘release’.

The release of a new version of a dataset must be accompanied by a description of the nature and the significance of the change, along with a description of possible implications for use that could result from the change. The significance of this change will depend on the intended use of the data by its designated user community. For instance, the release of a new version could signify changes in the data format and its compatibility with existing data processing pipelines, or significant changes to the content of the dataset. Concepts such as Semantic Versioning () describe a commonly used practice to communicate the significance of a version change in a dataset release and have been widely adopted in software development.
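As an illustration, the following sketch shows how Semantic Versioning conventions might be applied to data releases; the mapping of change types to version levels is an assumption made for illustration, not a prescription of the working group.

```python
def bump_version(version: str, significance: str) -> str:
    """Bump a MAJOR.MINOR.PATCH version string according to the
    significance of a change (Semantic Versioning convention)."""
    major, minor, patch = (int(p) for p in version.split("."))
    if significance == "major":    # e.g. schema change breaking pipelines
        return f"{major + 1}.0.0"
    if significance == "minor":    # e.g. added records, backwards compatible
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # e.g. corrected errata

assert bump_version("1.4.2", "major") == "2.0.0"
assert bump_version("1.4.2", "minor") == "1.5.0"
```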

Principle 3: Identification of Data Collections (Granularity)

A collection of data may be the result of successively generated datasets. The full set of aggregated data (the data collection) can be seen as a ‘work of works’, and may be organised in a number of sub-collections to be served by a data repository or archive (). The collection of works must be identified and versioned, and so must its constituent datasets or individual works ().

This practice of identifying elements of a collection, as well as the collection as a whole, is similar to the established bibliographic practice of identifying individual articles in a journal and identifying the journal series as a whole (; ). The granularity is to be determined by the use case to provide a way (or ways) of identifying parts and versions whenever the practical need arises (). Entire time series should be identified as collections (), as should time-stamped revisions, if the series is updated frequently ().

Principle 4: Identification of manifestations of datasets (Manifestation)

The same dataset may be expressed in different file formats or character encodings, sometimes referred to as distributions (), without differences in content. While these datasets will have different checksums, the work expressed in them does not differ; they are manifestations of the same work. From the perspective of content it might be sufficient to identify only the expressions of a work, and not its manifestations, but there might be technical considerations, such as machine actionability, that merit identifying different manifestations of a work, and their instances as items, through persistent identifiers ().

Principle 5: Requirements for provenance of datasets (Provenance)

For scientific reproducibility, it is essential to know whether a dataset was derived from a precursor and, if so, how these two objects relate to each other. Knowledge of the history of a piece of information is known as ‘provenance’. Using provenance, it should be possible to understand how a piece of information has changed and whether it is fit for the intended purpose or whether the information should be trusted (). Information accompanying a dataset release should therefore include information on the provenance of the dataset.
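As an illustration, a provenance statement linking two versions could be expressed using PROV-O terms; this minimal sketch assumes the Python rdflib library, and the URIs are hypothetical placeholders.

```python
from rdflib import Graph, Namespace, URIRef

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("https://example.org/dataset/")   # hypothetical namespace

g = Graph()
g.bind("prov", PROV)
# Record that version 2 was derived from, and is a revision of, version 1,
# and attribute it to a (hypothetical) data centre.
g.add((EX["v2"], PROV.wasDerivedFrom, EX["v1"]))
g.add((EX["v2"], PROV.wasRevisionOf, EX["v1"]))
g.add((EX["v2"], PROV.wasAttributedTo,
       URIRef("https://example.org/org/data-centre")))

print(g.serialize(format="turtle"))
```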

Principle 6: Requirements for data citation (Citation)

Data publications must include information about the Release in the citation and metadata. The DataCite metadata kernel () has an optional element (Element 15) to record the version of a dataset. DataCite recommends using Semantic Versioning and, furthermore, recommends issuing a new identifier with major releases. DataCite leaves it to the data stewards to define major and minor releases. DataCite further recommends using the alternate identifier (optional Element 11) and related identifier (optional Element 12) elements to identify releases and how they relate to other datasets, e.g., whether a dataset was derived from a precursor. Note that this is the minimum required for data citation by DataCite; data centres and other repositories may opt to offer a richer description of the release history and provenance of a dataset through other channels.
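To illustrate, the elements mentioned above could be populated as in the following sketch; this paraphrases the DataCite kernel as a Python mapping rather than reproducing its exact schema, and all identifiers and values are hypothetical.

```python
# Sketch of a DataCite-style record carrying version and relation
# information. All DOIs, URLs and values are hypothetical placeholders.
datacite_record = {
    "identifier": {"identifier": "10.1234/example.v2",
                   "identifierType": "DOI"},
    "version": "2.0.1",                       # optional Element 15
    "alternateIdentifiers": [{                # optional Element 11
        "alternateIdentifier": "https://example.org/data/v2",
        "alternateIdentifierType": "URL",
    }],
    "relatedIdentifiers": [{                  # optional Element 12
        "relatedIdentifier": "10.1234/example.v1",
        "relatedIdentifierType": "DOI",
        "relationType": "IsDerivedFrom",      # derived from a precursor
    }],
}
```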

Summary and Future Directions

In this paper we present six principles on data versioning based on the analysis of 39 use cases () and the FRBR framework (). The principles allow us to describe and discuss issues related to data versioning with greater precision and clarity, even when particular communities have different policies and procedures on data versioning. The six principles can be treated as a conceptual framework, yet each principle on its own is implementable as part of a data versioning policy or procedure.

Data publishers and providers must publish their data versioning policies and procedures to enable their users to identify the exact version of the data, and of any data extract, that was used in a research project and in subsequent publications. Rigorous versioning procedures and policies will also enable proper attribution and credit to the parties involved in the creation, publication and curation of any data product and its precursors. Where data are accessible as online web services and/or are constantly being updated, as in a time series, it is essential that the data publisher/provider also makes available machine-readable records of where they have made any changes to the dataset. This includes not only known changes to the data itself, but also changes in hardware and software, including versions of web service standards, that could also affect the data.

Versioning is also relevant to the application of the FAIR principles, particularly for the aspect of reuse (R1.2, ). To interoperate or aggregate data sets from multiple sources, the exact versions being merged need to be known to ensure compatibility, reproducibility, provenance, and attribution. Precise and unambiguous versioning also facilitates the reuse of data, particularly if each version is clearly licensed and provides revision history and source information, as these, in turn, help the user to judge the quality of the version and whether the data are fit for their specific purpose. Similarly, the adoption of such data versioning practices can also contribute to achieving transparency, responsibility, user focus, sustainability, and technology (TRUST), as described in the TRUST Principles for Digital Repositories ().

In a follow-up group to the RDA Data Versioning WG, we expect to work with those implementing the data versioning principles, and to verify and revise these principles where necessary. As a next step, the data versioning principles should be developed into actionable recommendations, by working with the community to develop domain-specific policies on how to identify and communicate data versioning, and by sharing examples of technical and policy implementations of the principles to develop best practices for the versioning of datasets.

The analysis of the use cases and discussions within the community raised questions about the ethics of data duplication and re-publication, incentives for the proper publication of all relevant data, particularly the rawer precursor forms of derived data products, and, above all, the reproducibility and replicability of research. This discussion highlights the need to document best practices for the identification of data aggregations, data re-publication, and the mirroring of data to multiple sites. In future work we will explore issues related to defining the authoritative or canonical version of a dataset, and the correct attribution and citation of data sources.