Digital Objects – FAIR Digital Objects: Which Services Are Required?

Some of the early Research Data Alliance working groups reused the notion of digital objects as digital entities described by metadata and referenced by a persistent identifier. More recently, the FAIR principles have taken on a prominent role as a framework for the sustainability of scientific data. Both approaches have always focused on machine actionability, the capability of computational systems to use services on data without human intervention. The more technical approach of digital objects turned out to provide a complementary, technical view on several aspects of the FAIR policy framework. After a deeper analysis and integration of these concepts by a group of European data experts, the discussion intensified on so-called FAIR digital objects. These objects, however, need to be accompanied by services as building blocks for automated processes. Here we describe the components of this framework and its potential, and which services are required within it.

of files in local file systems or streams of streaming providers for instance, and they are embedded in a structure of other important data concepts as one can see in Figure 1 below.
How to represent the logical structure of digital objects at the right level of abstraction is, however, in its details still a matter of discussion. As we have seen before, it certainly depends on how much of the logical structure is hidden by encapsulation behind a certain layer. It will also partly depend on the data itself and on specific workflows and use cases of data management and reuse.
But in any case the pointer, as the most abstract logical representation, has a prominent role here, and since data is and must be available across domains and sites, the pointer has to be a globally unique reference.
A global reference in the form of a URL could be seen as the easiest option, but URLs are unpredictably unstable references, because they change whenever the location of the data changes. See (Klein et al. 2014) for a deeper analysis of this problem and its consequences for scientific reproducibility. Librarians have known this problem for many years, and it is reflected in the term shelf mark, which originally denoted the location of a book on the shelf. It soon turned out that it does not make sense to always keep a book at the same location. A level of redirection was introduced, and shelf marks became symbolic entries in a catalog.
This additional level of redirection is essentially the rationale behind persistent identifiers (PIDs). They are just globally unique strings without any semantics, but each such string has a record in a database that leads to the object, for instance via a URL. If the location changes, the database record can and must be updated. These PIDs are seen as the right way to reference objects. The service that obtains the path to the object from the identifier string is called resolution. And since we are dealing with global references, they need to be globally resolvable, meaning that there must be a simple, globally organized way that leads to the referenced object. Otherwise these references would not be pointers in the sense of a logical representation in computer science.
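The redirection idea can be illustrated with a few lines of Python. This is only a minimal in-memory sketch, not the implementation of any real PID system: the class `PidResolver`, the identifier `21.T99999/example-0001` and both URLs are hypothetical and chosen purely for illustration.

```python
class PidResolver:
    """Toy resolver: maps opaque identifier strings to current locations."""

    def __init__(self):
        # PID string -> current URL of the referenced object
        self._records = {}

    def register(self, pid, url):
        """Create a new PID record; an existing PID is never reassigned."""
        if pid in self._records:
            raise ValueError(f"PID already registered: {pid}")
        self._records[pid] = url

    def update(self, pid, new_url):
        """Change the location stored in the record; the PID stays the same."""
        if pid not in self._records:
            raise KeyError(pid)
        self._records[pid] = new_url

    def resolve(self, pid):
        """Resolution: return the current location for an identifier string."""
        return self._records[pid]


resolver = PidResolver()
resolver.register("21.T99999/example-0001", "https://repo-a.example.org/data/42")

# The data moves to another site: only the record changes, not the reference.
resolver.update("21.T99999/example-0001", "https://repo-b.example.org/objects/42")

current_location = resolver.resolve("21.T99999/example-0001")
```

Anyone who stored the identifier string before the move still reaches the object afterwards, which is exactly what a plain URL cannot guarantee.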
Fortunately, such persistent identifiers are already widely used as global references in several domains of data management and publication, and several highly reliable global infrastructures have been available for many years. Most of these PID infrastructures do not provide global resolution by themselves, but one of these proven systems, the Handle system (Handle 2019), has an inherent, highly scalable global resolution mechanism. Therefore the PIDs of the Handle system can actually fulfill the role of pointers as the logical representation of digital objects.

The FAIR principles
The FAIR approach was defined much later, about three years ago, as "Data and services that are findable, accessible, interoperable, and re-usable both for machines and for people", articulated in fifteen high-level principles (Wilkinson et al. 2016). These FAIR principles have already become part of the EOSC roadmap, as one can see in (European Commission 2018). Most of these principles restate, at a high level, the strong relationship between metadata, the data or digital object itself and the persistent identifier as already described in Figure 1. But they go even further by stating, for instance, that metadata has to specify the data identifier (see F3 of the FAIR principles) and that (meta)data are retrievable by their identifier using a standardized communications protocol (see A1 of the FAIR principles).
This shows, on the one hand, the strong coupling between digital objects and the FAIR principles. On the other hand, the two approaches sit on conceptually different levels: the FAIR principles are policies, whereas digital objects are technical abstractions. Together this suggests that a deep interconnection of both approaches can be extremely fruitful, because concrete implementations of digital objects will lead to data structures that implicitly comply with at least parts of the policies. The idea of investigating this coupling more deeply and describing FAIR Digital Objects (FDOs) as digital objects that fulfill all FAIR principles was pursued, among others, by the GEDE Digital Object Topic Group of European Data Experts (GEDE 2019) and is also described in (Schultes 2018).

Persistent Identifiers, Handles and DOIs
As mentioned before, the persistent identifier as pointer plays a prominent role in the abstraction process as well as in the FAIR principles, and therefore in the FDO framework. In addition, there are clear advantages to using the Handle system as the PID technology for describing FDOs. Incidentally, the so-called digital object identifiers (DOIs), mainly used for the publication of articles or data, are also Handles, subject to certain additional policies. But Handles can have a much broader scope, and the policies that are necessary for publications are not always flexible enough to meet the needs of data management or data sharing between researchers. For these purposes, information related to the digital content or specific to a community, often at a finer granularity and tightly connected to the reference, is usually much more important than bibliographic information. Therefore other governance structures for Handles are needed to ensure reliable PID services with much higher flexibility in PID usage and policies.

Data Types
In addition to virtualization by reference, it is crucial to provide a description of the object that is also understandable by machines, in order to overcome the highly inefficient way data is currently handled and to select and prepare flexible services for digital objects in scientific workflows. It would also be helpful if these descriptions were already available at the reference level.
This principle is already familiar from the simple characterization of digital objects via MIME types, where the ending of a reference URL gives the necessary information. For the reusability of data, however, many other and more refined parameters are necessary. Such metadata enhancements of digital objects are called data types.
As mentioned before, the FAIR principles as well as the notion of the digital object emphasize a close coupling between metadata, data and the persistent identifier as pointer. With the abstract structures of the FDO this becomes more explicit. The early RDA working groups "PID Information Types" and "Data Type Registries" already made the coupling even tighter by allowing certain kinds of metadata to become part of the identifier record in the resolution database. Such metadata elements are called PID information types and, as shown in Figure 2, encapsulate a substantial amount of complexity in a generic structure. But one has to choose these additional metadata elements in the PID record carefully, because extensive use of additional fields might slow down the resolution infrastructure. An additional RDA working group on "Kernel Information Types" therefore developed rules and a profile (Weigel 2018) for a set of simple and most frequently needed metadata elements that should be stored together with the PID. The profile can be extended, for instance to meet the needs of scientific communities, and the rules serve as guidelines for these extensions. Currently this powerful technology is not supported by the DOI providers for paper and data publication. For scientific data management it is available with the more general Handle system, as provided for instance by ePIC.
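A PID record that carries a few metadata elements next to the location could look roughly like the following sketch. The class `PidRecord`, the identifier strings and the field names are assumptions made for illustration; they are only loosely inspired by the kind of elements a kernel information profile contains and do not reproduce the profile in (Weigel 2018).

```python
from dataclasses import dataclass, field


@dataclass
class PidRecord:
    """Toy PID record: an identifier plus a small set of typed entries."""

    pid: str
    # information type name -> value (kept deliberately small, see text)
    entries: dict = field(default_factory=dict)

    def get(self, info_type, default=None):
        """Read one PID information type without touching the data itself."""
        return self.entries.get(info_type, default)


# All identifiers, names and values below are hypothetical.
record = PidRecord(
    pid="21.T99999/example-0001",
    entries={
        "digitalObjectLocation": "https://repo.example.org/objects/42",
        "digitalObjectType": "21.T99999/type-netcdf",
        "checksum": "sha256:0123abcd",
        "version": "2",
        "wasDerivedFrom": "21.T99999/example-0000",
    },
)

# A client can decide at the reference level whether the object is relevant.
object_type = record.get("digitalObjectType")
version = record.get("version")
```

Keeping the record to a handful of small entries reflects the concern above that extensive additional fields might slow down resolution.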

Data Type Registries
In any case these types need some kind of standardization to achieve a minimal level of interoperability, another major goal of the FAIR principles. The classical route via the procedures of international standardization bodies is either too specific or not flexible and fast enough to cover the needs of diverse research and economic areas in the fast-growing field of data management.
A more promising approach is to provide community-driven, reliable registries that contain reviewed type definitions in machine-readable and machine-interpretable form, uniquely referenced and disambiguated, again, by PIDs. The PIDs of the type definitions can then be used as keys, with the metadata relevant to the digital objects as values, either in the PID record or in a special metadata record.
Such registries with type definitions are called Data Type Registries (DTRs) and have been a topic for the Research Data Alliance (RDA) since its first days (Lannom 2015). Two working groups made recommendations that led to a prototypical deployment of a working DTR implementation based on Cordra. Cordra is open source software for managing digital objects, now available in version 2.0. ePIC is running two instances of Cordra configured as DTRs, one for production data types and one for the preparation and testing of data types. The type definitions are openly available; an account is needed to create or change types. A distinctive feature of the ePIC DTRs is the ability to define types in a hierarchical manner, so that complex data types can also be defined easily and, for instance, schemata for the value domain can be derived from the definition (Schwardmann 2016). As a starting point, a short overview with links to these DTRs can be found on the ePIC web pages (ePIC 2019).
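How a schema for the value domain might be derived from hierarchically defined types can be sketched as follows. This is an illustration under stated assumptions, not the Cordra or ePIC DTR API: the dictionary `REGISTRY`, the layout of its definitions, the type PIDs and the function `derive_schema` are all hypothetical.

```python
# Hypothetical registry content: type PID -> type definition.
# Complex types refer to other types by PID, which gives the hierarchy.
REGISTRY = {
    "21.T99999/string": {"name": "string", "kind": "basic", "json_type": "string"},
    "21.T99999/integer": {"name": "integer", "kind": "basic", "json_type": "integer"},
    "21.T99999/checksum": {
        "name": "checksum",
        "kind": "complex",
        "properties": {
            "algorithm": "21.T99999/string",
            "value": "21.T99999/string",
        },
    },
    "21.T99999/file-info": {
        "name": "file-info",
        "kind": "complex",
        "properties": {
            "size": "21.T99999/integer",
            "checksum": "21.T99999/checksum",
        },
    },
}


def derive_schema(type_pid, registry=REGISTRY):
    """Recursively build a JSON-Schema-like dict for the given type PID."""
    definition = registry[type_pid]
    if definition["kind"] == "basic":
        return {"type": definition["json_type"]}
    # Complex type: resolve every referenced type PID in turn.
    properties = {
        prop: derive_schema(child_pid, registry)
        for prop, child_pid in definition["properties"].items()
    }
    return {
        "type": "object",
        "properties": properties,
        "required": sorted(properties),
    }


schema = derive_schema("21.T99999/file-info")
```

Because every nested type is itself referenced by a PID, a definition such as the checksum type can be reused by many complex types without being copied.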
Because DTRs enable the disambiguation and correct assignment of types for humans and machines, they form an integral part of the FDO framework. With the right choice of PID information types, depending on the needs of a scientific community, such FDOs enable fast decisions at the reference level about the relevance of data for certain scientific questions. They also allow the location to be identified and prepare the automated staging of remote data for processing in a scientific workflow, for instance with high performance computing, or even the automated decision that a remote computation would require less effort.

Which Services Are Required?
The introduction of PIDs as reliable pointers or references to digital objects is a precondition for long-term findability and already provides additional simplification and flexibility in the data domain. As mentioned at the beginning, the major goal is to enable automation; for findability in particular, essential requirements for automation were given in (Weigel 2020). There are several elementary services for PIDs, such as creating, managing and resolving them. Basic services can also operate on PID records if these contain additional metadata as PID information types.
Examples are the detection of duplicates based on checksums, of earlier versions based on 'was derived from' relations, or of candidates for format conversion based on MIME types and version numbers. A metadata service based on the metadata location given in a PID information type would be another example. Decisions in workflows can also be based on such PID information types, for instance the decision to move the application to the data or the data to the application, based on the data size.
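Two of these record-level services, duplicate detection by checksum and the placement decision by data size, can be sketched without ever opening the data. The record layout, the field names `checksum` and `size_bytes`, the example values and the threshold are assumptions chosen for illustration only.

```python
from collections import defaultdict

# Hypothetical PID records, reduced to the information types used here.
records = [
    {"pid": "21.T99999/a", "checksum": "sha256:aaa", "size_bytes": 2_000},
    {"pid": "21.T99999/b", "checksum": "sha256:bbb", "size_bytes": 5 * 10**12},
    {"pid": "21.T99999/c", "checksum": "sha256:aaa", "size_bytes": 2_000},
]


def find_duplicates(pid_records):
    """Group PIDs by checksum and keep only checksums seen more than once."""
    by_checksum = defaultdict(list)
    for record in pid_records:
        by_checksum[record["checksum"]].append(record["pid"])
    return {c: pids for c, pids in by_checksum.items() if len(pids) > 1}


def placement(record, threshold_bytes=10**11):
    """Decide from the recorded size alone what should be moved.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    if record["size_bytes"] > threshold_bytes:
        return "move_application_to_data"
    return "move_data_to_application"


duplicates = find_duplicates(records)
decisions = {record["pid"]: placement(record) for record in records}
```

Both functions read only the PID records, so they can run at the reference level before any data transfer is started.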
Collection representations can be based completely on PID information types, and a wide range of additional services and applications has been proposed as part of the collection API and beyond. Furthermore, for repository interoperability it would be beneficial to provide a collection enhancement based on a common agreement, such as the one given by the RDA working group on Research Data Collections (Weigel 2017), to enable more flexibility for structures imposed on digital objects.
All these examples show that the elementary service of resolution for retrieving PID information types from the PID record is required, and that services to describe the types in data type registries and to retrieve this information are needed as well. This in turn calls for interoperability between DTRs and for services that monitor this interoperability. As a next step, services are required that provide the set of information types that can be expected from (a class of) PIDs, so-called PID profiles.
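A PID profile in this sense can be thought of as a declared set of expected information types against which a record is checked. The following sketch assumes a very simple profile layout with required and optional type names; the names, the `PROFILE` structure and the function `check_against_profile` are hypothetical and not taken from any existing profile service.

```python
# Hypothetical profile: which information types a class of PIDs should carry.
PROFILE = {
    "required": {"digitalObjectLocation", "digitalObjectType", "checksum"},
    "optional": {"version", "wasDerivedFrom"},
}


def check_against_profile(entries, profile=PROFILE):
    """Compare the information types in a PID record with a profile."""
    present = set(entries)
    missing = sorted(profile["required"] - present)
    unexpected = sorted(present - profile["required"] - profile["optional"])
    return {
        "conforms": not missing and not unexpected,
        "missing": missing,
        "unexpected": unexpected,
    }


complete = check_against_profile(
    {
        "digitalObjectLocation": "https://repo.example.org/objects/42",
        "digitalObjectType": "21.T99999/type-netcdf",
        "checksum": "sha256:0123abcd",
        "version": "2",
    }
)

incomplete = check_against_profile(
    {
        "digitalObjectLocation": "https://repo.example.org/objects/43",
        "colour": "blue",
    }
)
```

A client that knows the profile of a PID can rely on the listed information types being present and skip defensive checks for each individual record.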

Repositories
Finally, the data services for FDOs themselves need to be based on repositories that provide reliable access to elementary digital objects. Currently these repositories often offer some data representation, enhanced by database systems that provide a local layer of data and metadata indexing. Often the provided data is not even registered with a PID. A FAIR and global data perspective calls for overcoming this situation. The FDO has to replace all other kinds of data representation inside repositories, and a more generic approach to metadata indexing is also necessary. In some cases it will be possible to provide adapters around legacy repository architectures, but overall this transformation is a big effort and may take a while. Nevertheless this effort is worthwhile in order not to end up with a fragmented data space with all its interoperability gaps, as we have today.