Application Profile for Machine-Actionable Data Management Plans

Tomasz Miksa; Paul Walk; Peter Neish; Simon Oblasser; Hollydawn Murray; Tom Renner; Marie-Christine Jacquemot-Perbal; João Cardoso; Trond Kvamme; Maria Praetzellis; Marek Suchánek; Rob Hooft; Benjamin Faure; Hanne Moa; Adil Hasan; Sarah Jones

1 Introduction

Data Management Plans (DMPs) are documents that accompany research proposals and project outputs. “They describe the data that is used and produced during the course of research activities, where the data will be archived, which licenses and constraints apply, and to whom credit should be given” (Miksa, Simms, Mietchen and Jones []). The existing practice of writing DMPs is primarily driven by research funders, who consider DMPs to be not only a planning tool but also a steering and evaluation tool. However, researchers often perceive DMPs as an annoying administrative exercise that does not support data management activities (Smale et al. []). This is because answering specific questions requires special knowledge, or because the information requested overlaps with data previously submitted elsewhere. This in turn has an impact on the quality of DMPs: questions remain unanswered, or free-text answers are copied between DMPs for different projects and lack specific details.

DMPs should continue to be presented in a human-readable way. However, this information must be complemented by a machine-actionable representation consisting of atomised, structured data. Breaking the information down into specific fields creates added value for all stakeholders in the research data lifecycle, such as researchers and funders, but also data stewards, repository operators, etc. The added value is created when the data is automatically collected and re-used by systems acting on behalf of stakeholders. An example would be a repository setting information on backup strategy and preservation policy in response to a data steward choosing that particular repository for data deposit. Similarly, information from DMPs can be used to trigger actions. For example, the license and embargo selected by a researcher can be used to automatically fill out information on data deposited into a repository.

To realise the vision of machine-actionable DMPs (maDMPs) as a way to exchange and act on information about data used and produced by researchers, we need all stakeholders to collaborate and synchronise their efforts. The basic framework requires: (i) an application profile for representing information in a common way; (ii) services that can provide and use this information in an automated way. The application profile is the focus of this paper and can be defined as a metadata design specification that uses a selection of terms from multiple metadata vocabularies, with added constraints, to meet application-specific requirements.

The Research Data Alliance (RDA) recognised the importance of making DMPs machine-actionable and established the DMP Common Standards working group to develop an application profile allowing for “automatic exchange, integration, and validation of information provided in DMPs and facilitating the exchange of information between systems acting on behalf of stakeholders involved in the research life cycle” (Miksa, Walk and Neish []).

In this paper, we describe the application profile for machine-actionable DMPs developed by the DMP Common Standards working group. We present the research conducted and the methodology used to define it. We describe the motivation and rationale behind the design decisions made. We also present a range of services adopting the application profile as examples of how it can be used to create value for stakeholders. This is the first paper to describe in a holistic way all the work done to release the official version of the application profile for machine-actionable DMPs (Miksa, Walk and Neish []), as well as its adoptions.

The remainder of this paper is structured as follows. Section 2 describes the related work. Section 3 describes how the application profile was developed. Section 4 describes the application profile using an example of a minimal maDMP. Section 5 discusses the application profile and the design decisions made. Section 6 presents adoptions of the application profile. Section 7 presents conclusions and future work.

2 Related work

“Data management plans are required by funding bodies and institutions all over the world, e.g. the National Science Foundation (NSF) in the USA, the European Commission in Europe, or the National Research Foundation (NRF) in South Africa” (Miksa, Neish, Walk and Rauber []). Researchers are often advised to follow the principles defined in Michener [] to write a good and comprehensive DMP. There is also a wide range of tools supporting researchers in the creation of a DMP, e.g. DMPonline, DMP Tool, Data Stewardship Wizard or RDM Organizer (Engelhardt et al. []). These tools provide questionnaires that must be answered to create a DMP compliant with a selected funder template. Jones et al. [] present a table describing the most recent DMP tools.

Science Europe, which brings together research funders from Europe, issued common guidelines for development of DMP templates (Doorn []). Hence, many templates used across Europe are similar. Furthermore, they are also often derived from the checklist created by the DCC (DCC []).

Machine-actionability (defined as “information that is structured in a consistent way so that machines, or computers, can be programmed against the structure”) and DMPs are of interest to the Research Data Alliance (RDA). The RDA facilitates the formation of community-driven groups to deal with topics related to data management and sharing. One of them is the RDA DMP Common Standards working group that developed the application profile described in this paper. The community needs that led to establishing this working group are described in the paper by Simms et al. []. An initial model for maDMPs and its mapping to tools and standards was presented in Miksa et al. [] and was used to kick off developments by the group.

We use the term application profile as defined by Dublin Core: “An application profile is a metadata design specification that uses a selection of terms from multiple metadata vocabularies, with added constraints, to meet application-specific requirements”. This definition emphasises that our goal was to reuse existing standards, models, terms, vocabularies, etc. Other application profiles follow the same approach. For example, the European Commission introduced an application profile for data portals in Europe that complements the W3C DCAT standard (Archer []) and describes which fields in W3C DCAT are mandatory and must be used by all repositories adopting the application profile. In a similar fashion, the RIOXX application profile was designed to fulfil the requirements of Research Councils UK (RCUK) for institutional repositories to share metadata about scholarly resources.

3 Methodology

In this section we present how the application profile was developed within the DMP Common Standards working group. We describe the open stakeholder consultation process, as well as tools and accompanying concepts developed.

We followed the Design Science research methodology (Hevner et al. []), which is common in computer science, especially in information and knowledge-based systems. Hence, given the problem description, we iteratively produced artifacts (methods, models, prototypes) and evaluated them accordingly.

All these activities helped in defining what machine-actionable DMPs are and resulted in the formulation of the application profile for maDMPs.

3.1 Open stakeholder consultations

We performed two open stakeholder consultations. Each of them included both physical workshops and virtual meetings to collect feedback. The first consultation was aimed at defining the scope of the application profile. We applied requirements engineering methods (Dalpiaz and Brinkkemper []) known in software engineering and used GitHub for collecting requirements in the form of user stories. Everyone, regardless of their geographical location and role in the research data lifecycle, was able to submit their own requirements that should be taken into account by maDMPs. In total, we collected 108 user stories. The user stories involved the viewpoints of funders, institutions, repository operators, research support, researchers, and service providers. The number of collected requirements and their diversity showed that we managed to engage with a wide range of stakeholders. This in turn helped in establishing a common definition of a machine-actionable DMP that reflects the expectations of the community. The first consultation is described in detail in Miksa, Neish, Walk and Rauber [].

In the second consultation we mapped the captured requirements to specific fields in existing standards and models to identify which of them can become a basis for the application profile. The second consultation was conducted together with domain experts and is described in detail in Miksa, Cardoso and Borbinha []. Table 1 presents a list of standards that we found relevant.

Table 1

List of relevant standards considered when developing the application profile.


RELEVANT STANDARD	URI
Access License and Indicators	http://www.niso.org/schemas/ali/1.0/
Dublin Core Element Set	http://purl.org/dc/elements/1.1/
DCMI Metadata Terms	http://purl.org/dc/terms/
Friend of a Friend (FOAF)	http://xmlns.com/foaf/0.1/
DCAT	https://www.w3.org/TR/vocab-dcat/
DataCite	https://schema.datacite.org
CERIF	https://www.eurocris.org/ontologies/cerif/1.3/
COAR	http://vocabularies.coar-repositories.org/
ISO 639-1	https://www.iso.org/iso-639-language-codes.html
ISO 4217	https://www.iso.org/iso-4217-currency-codes.html

We also collaborated with other RDA groups working on related topics, such as Exposing DMPs and Active DMPs, to collect their feedback and to incorporate the results of their consultations (Simms et al. []).

The open stakeholder consultations allowed us to iteratively define the scope of the application profile and to identify existing standards that are relevant. Thus, we were able to derive an initial list of fields that should be contained in the application profile to reflect the identified needs of stakeholders.

3.2 Proof of concept tools

Together with students of Data Stewardship at the TU Wien we developed proof of concept tools that demonstrate how existing data management practices can be improved and what new opportunities are created when maDMPs are in place. This helped us in two ways: (1) to better explain the novelty and benefits of maDMPs to stakeholders and thus get further feedback from them, and (2) to further refine the application profile and to better reflect the needs of automated data processing, e.g. to group information into specific classes and to design the most suitable structure of the application profile. The proof of concept tools can be broken down into different categories:

Tools showing how information from existing systems can be used for estimations and recommendations in an early phase of a project, e.g. for cost estimation, repository recommendation, or license selection (Miksa, Cardoso and Borbinha []).
Tools exporting information from existing DMP tools into maDMPs, e.g. work by: Pichler [], Breitenfellner [], Inschlag and Drechsel []. To do that, students developed simple mappings between the application profile and funder templates (EC Horizon 2020 and Science Europe). Thus, we identified that not all of the information currently requested by funders is machine-actionable. For example, the description of quality assurance processes will always be in free text, and the application profile must accommodate such non-machine-actionable information as well.
Tools making information contained in maDMPs human-readable, e.g. work by Aigner [], Alkhatib and Rivera [], Leidinger []. In other words, the tools use information from maDMPs to pre-fill existing DMP templates. Thus, the researchers do not start with an empty page when writing a DMP and funders still can receive a PDF version of a DMP.
Tools integrating systems using maDMPs to automate the upload of files into repositories by extracting information from maDMPs. The tools also support the opposite scenario: a dataset already exists in a repository and information about it needs to be added to the maDMP. The prototypes were developed for Dataverse (Hido1994 and Alhirthani []), Invenio (Tsepelakis []) and the content management system Alfresco (Bakos et al. []). We also analysed how maDMPs can help in automating repository recommendation (Oblasser et al. []).

Finally, we developed mock-ups (Oblasser []) for the next generation tool for DMPs that puts machine-actionability at its core. The mock-ups are a result of broad stakeholder consultations performed face to face with researchers in different domains and also of feedback collected online. The mock-ups were converted into a prototype tool (Oblasser []) allowing for export of maDMPs using the application profile for maDMPs.

3.3 Processes and Guidelines

We used the experience collected by developing and testing prototypes to define typical processes in which maDMPs can be used (Oblasser and Miksa []). The set of processes is independent of any implementation and can be used by any institution or organisation as guidance on how to build the ecosystem of services that exchange information using maDMPs. The processes are not implementation blueprints and require customisation. However, they help in understanding which organisational and technical components already exist and which of them must be introduced. For example, institutions willing to provide storage to researchers must also consider the cost models used for calculating costs of storage. Machine-actionable DMPs can carry information on costs, but the services of the organisation must provide the actual values. The maDMP itself does not contain any business logic.

While the processes help individual organisations to streamline the discussion on machine-actionable DMPs, we also developed 10 principles for machine-actionable DMPs that call for coordinated effort within the broad research data management community (Miksa, Simms, Mietchen and Jones []). The principles “contain specific actions that various stakeholders are already undertaking or should undertake in order to work together across research communities to achieve the larger aims of the principles themselves” (Miksa, Simms, Mietchen and Jones []). The principles also “describe existing initiatives to highlight how much progress has already been made toward achieving the goals of maDMPs as well as a call to action for those who wish to get involved” (Miksa, Simms, Mietchen and Jones []).

This helped us to better separate concerns and narrow down the focus of the application profile: the profile is the information carrier, while the actual automation is the task of the systems using it.

4 Application profile

In this section we outline the structure of the application profile and illustrate this with an example of a minimal maDMP. The full description of concepts used, as well as clarification on its purpose and usage, can be found in the official recommendation (Miksa, Walk and Neish []) and in its official repository.

Figure 1 presents the concepts used within the application profile. Each concept is further broken down into specific fields (not depicted). The central concept is the DMP, which provides generic information on the DMP. It is used to link information on projects, costs, contributors and the contact point for the DMP. Most of the relations are optional (0..*) and depend on the specific setting in which the maDMPs are used. Each DMP must have at least one Dataset. Datasets are used to group information on the data described by a DMP. This includes information such as: size, format, location of the data, existence of embargo, metadata standards used, etc.

Figure 1

Overview of concepts and relations between them.

A minimal maDMP consists of only those fields that are defined as obligatory in the application profile. An example of a minimal maDMP compliant with the application profile is depicted in Listing 1. The listing shows that each maDMP must have a title. Each maDMP must also provide information on the contact person who can provide further information on data described by the maDMP. Each maDMP must indicate when it was created for the first time and when last modified. Thus, systems processing maDMPs can track their versions. Each maDMP must indicate the language in which it is written, so that its contents can be displayed properly (e.g. translated). Each maDMP must indicate whether any sensitive, personal or ethical issues exist. These fields follow a controlled vocabulary with three values to choose from: yes, no, unknown, so that no assumptions have to be made. Each maDMP must contain at least one dataset. A single dataset can represent “all data in a project” – like traditional DMPs often do. However, it is recommended (but optional) to use more datasets to provide more precise information.

Listing 1

Example of a minimal maDMP in JSON.
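
As a rough illustration of the obligatory fields described above, the following Python sketch builds a minimal maDMP as a dictionary and checks it for missing fields. All key names (title, contact, created, modified, language, ethical_issues_exist, dataset, personal_data, sensitive_data), the sample values, and the helper find_problems are assumptions made for this illustration; the normative names and types are defined in the official recommendation.

```python
import json

# All key names and values are assumptions for illustration only; the
# official recommendation defines the normative names and types.
minimal_madmp = {
    "dmp": {
        "title": "Example DMP",
        "contact": {"name": "Jane Doe", "mbox": "jane.doe@example.org"},
        "created": "2020-01-15T10:00:00+00:00",
        "modified": "2020-02-01T09:30:00+00:00",
        "language": "eng",
        "ethical_issues_exist": "unknown",
        "dataset": [
            {
                "title": "All data in the project",
                "personal_data": "unknown",
                "sensitive_data": "unknown",
            }
        ],
    }
}

OBLIGATORY_DMP_FIELDS = (
    "title", "contact", "created", "modified",
    "language", "ethical_issues_exist", "dataset",
)
YES_NO_UNKNOWN = {"yes", "no", "unknown"}


def find_problems(madmp: dict) -> list:
    """Return human-readable problems; an empty list means none were found."""
    dmp = madmp.get("dmp", {})
    # An absent key, an empty string and an empty dataset list all count as missing.
    problems = [f"missing: {name}" for name in OBLIGATORY_DMP_FIELDS if not dmp.get(name)]
    # Collect every yes/no/unknown flag on the DMP and on each dataset.
    flags = [dmp.get("ethical_issues_exist")]
    for dataset in dmp.get("dataset", []):
        flags += [dataset.get("personal_data"), dataset.get("sensitive_data")]
    if any(flag not in YES_NO_UNKNOWN for flag in flags):
        problems.append("flag outside yes/no/unknown")
    return problems


if __name__ == "__main__":
    print(json.dumps(minimal_madmp, indent=2))
    print(find_problems(minimal_madmp))
```

The check deliberately covers only the fields named in the text; a real deployment would more likely validate against the JSON schema published with the recommendation.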

5 Discussion

This section provides discussion on the application profile. While the recommendation provides details on specific elements of the application profile, here we clarify selected aspects of using it and discuss design decisions taken.

5.1 Cardinality of fields

Only those fields for which the cardinality is set to “exactly one” or “one to many” must always be filled with information. Further fields defined in the application profile may be set if required (by business constraints), or when the information becomes available.

The application profile aims to be flexible, and for this reason many fields are optional. In specific deployments, requirements may be stricter. For example, a funder may require that a DMP contains information on a project number, while in the application profile specification this is optional.

All tools compliant with the application profile must expect to receive both obligatory and optional fields.
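
A consuming tool might handle this distinction roughly as in the sketch below: obligatory fields are read directly, optional ones are read with defaults, and a deployment can declare additional fields as required. The key names (project, cost) and both helper functions are hypothetical and are not taken from the recommendation.

```python
def summarise(madmp: dict) -> dict:
    """Read obligatory fields directly and optional fields with defaults."""
    dmp = madmp["dmp"]
    return {
        # Obligatory in the profile: a KeyError here signals an invalid maDMP.
        "title": dmp["title"],
        "dataset_count": len(dmp["dataset"]),
        # Optional in the profile: absence is normal and must not fail.
        "project_titles": [p.get("title", "") for p in dmp.get("project", [])],
        "has_costs": bool(dmp.get("cost")),
    }


def missing_for_deployment(madmp: dict, extra_required: tuple) -> list:
    """List optional fields that a stricter deployment requires but are absent."""
    dmp = madmp["dmp"]
    return [name for name in extra_required if name not in dmp]


# A document that carries only obligatory information (assumed key names).
example = {"dmp": {"title": "Example DMP", "dataset": [{"title": "All data"}]}}
```

For instance, a funder-specific deployment could call missing_for_deployment(example, ("project",)) and reject the maDMP if the result is not empty, while a generic tool would accept the same document.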

5.2 Granularity of Datasets

There is no single common vocabulary that describes types of datasets. This shows that there is no consensus within the community on how to describe datasets. Sometimes, especially at the early stages of the research data lifecycle, it is necessary to refer to datasets on an aggregated or even abstract level, e.g. collection of satellite images. However, in other settings, one may assume that a dataset is equivalent to a file.

The application profile reuses the Dataset class as defined by DCAT (Archer []), which in turn is a sub-class of Dataset defined in DC Terms as: “Data encoded in a defined structure”. This definition is also very general. For this reason, we decided not to constrain the definition of Dataset, and its granularity depends on the specific context.

If a DMP contains only one Dataset (the most generic setting), it can denote that all data for which the DMP is created is considered jointly. This is the case, for example, when a DMP is a short document created before a project begins and contains only an outline of planned actions.

If a DMP contains more than one Dataset, then each dataset can represent a logical group of data, e.g. raw data, software, etc. Thus, the application profile makes it possible to express that different data is handled in different ways. For example, software is deposited in a source code repository under embargo, while data is instantly available in an institutional repository.

A Distribution points to a specific instance of a dataset. Hence, a distribution contains information such as the format and size of files.

A dataset can have several distributions. For example, an image can be available both as PNG and as TIFF. Furthermore, a dataset can have several distributions to indicate where the data is kept temporarily, for example during a project, and where the data is going to be published or archived at the end of a project.
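
The relation between datasets and distributions can be sketched with two small Python data classes. The attribute names (media_type, byte_size, host) and the sample values are assumptions chosen for this illustration and are not the field names of the application profile.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Distribution:
    """One concrete instance of a dataset (attribute names are illustrative)."""
    title: str
    media_type: str
    byte_size: Optional[int] = None
    host: str = ""  # where this instance is kept


@dataclass
class Dataset:
    """A logical group of data with any number of distributions."""
    title: str
    distributions: List[Distribution] = field(default_factory=list)

    def media_types(self) -> Set[str]:
        return {d.media_type for d in self.distributions}

    def hosts(self) -> Set[str]:
        return {d.host for d in self.distributions if d.host}


# The same images offered in two formats: kept on project storage during
# the project and archived in a repository at its end.
images = Dataset(
    title="Satellite images",
    distributions=[
        Distribution("Working copy", "image/png", 1_200_000, "project storage"),
        Distribution("Archived copy", "image/tiff", 4_800_000, "institutional repository"),
    ],
)
```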

5.3 Versioning

Each DMP has a creation and a modification timestamp. The modification timestamp indicates the last modification of the DMP. Given two DMPs with different modification timestamps, one can identify which is newer by comparing the timestamps. Two DMPs that share the same creation timestamp and DMP identifier are different versions of the same DMP.

The application profile itself does not have any mechanisms to model different versions of data: if information is overwritten, then the previous information is not kept in the model. Systems processing DMPs must have suitable versioning mechanisms, if needed. For example, each update to a DMP can be committed to a database. Thus, the database engine makes it possible to retrieve different versions of a DMP over time, while the modification timestamp contained in the DMP itself makes it possible to identify, distinguish and refer to a specific DMP version. The modification timestamp must be set by the tool that modified the DMP.
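
The comparison rules described in this section can be sketched as follows. The key names (dmp_id, created, modified), the ISO 8601 timestamp format, and the sample records are assumptions for illustration.

```python
from datetime import datetime
from typing import Optional


def is_same_dmp(a: dict, b: dict) -> bool:
    """Two records are versions of one DMP if identifier and creation time match."""
    return a["dmp_id"] == b["dmp_id"] and a["created"] == b["created"]


def newer_version(a: dict, b: dict) -> Optional[dict]:
    """Return the more recently modified record, or None for unrelated DMPs."""
    if not is_same_dmp(a, b):
        return None
    a_modified = datetime.fromisoformat(a["modified"])
    b_modified = datetime.fromisoformat(b["modified"])
    return a if a_modified >= b_modified else b


# Hypothetical records as a processing system might store them.
v1 = {"dmp_id": "https://example.org/dmp/1",
      "created": "2020-01-15T10:00:00+00:00",
      "modified": "2020-01-15T10:00:00+00:00"}
v2 = {"dmp_id": "https://example.org/dmp/1",
      "created": "2020-01-15T10:00:00+00:00",
      "modified": "2020-03-01T08:00:00+00:00"}
other = {"dmp_id": "https://example.org/dmp/2",
         "created": "2020-02-01T12:00:00+00:00",
         "modified": "2020-02-02T12:00:00+00:00"}
```

Keeping the history itself remains the responsibility of the processing system; the sketch only shows how two stored records can be related to each other.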

5.4 Embargo

An embargo on data sharing means that data will be made available using a license, but not immediately after deposition of data in a repository. As long as no license applies, the data is considered closed.

A license can be assigned to each distribution. If a license is assigned, it means that the distribution will become available at some point. The start date set for the license indicates when the license becomes active, in other words, when the distribution becomes available under this license.

We used the same mechanisms as defined by the National Information Standards Organisation.
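
A system consuming a maDMP could derive the access state of a distribution from its licenses roughly as sketched below. The key names (license, license_ref, start_date) and the three state labels are assumptions for this illustration; "embargoed" is used here for data that is still closed but already has a license with a future start date.

```python
from datetime import date


def access_state(distribution: dict, today: date) -> str:
    """Classify a distribution as 'closed', 'embargoed' or 'available'."""
    licenses = distribution.get("license", [])
    if not licenses:
        # No license assigned: the data is considered closed.
        return "closed"
    start_dates = [date.fromisoformat(item["start_date"]) for item in licenses]
    if any(start <= today for start in start_dates):
        # At least one license is already active.
        return "available"
    # A license exists but has not started yet: still closed, under embargo.
    return "embargoed"


# Hypothetical distribution whose license starts on 1 January 2021.
published_later = {
    "license": [
        {"license_ref": "https://creativecommons.org/licenses/by/4.0/",
         "start_date": "2021-01-01"}
    ]
}
```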

5.5 State of the DMP

We use dates to indicate planned actions. A DMP has a modification timestamp, and a Dataset contains an issue date. Together these indicate whether the actions are planned or already performed:

if the issue date is set in the future (compared to DMP modification date), then the actions are planned,
if the issue date is set in the past (compared to DMP modification date), the actions were performed.

This approach is similar to the way embargo is modelled. We also wanted to avoid labels that can have different interpretations depending on the context. If we used a tag such as approved for a DMP, then someone could assume that the DMP was approved by the funder, whereas it was actually approved internally by the research support, before it was sent to the funder. Hence, we avoided modelling different states a DMP or a Dataset can be in, but focused on collecting facts, such as dates.
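
The date comparison can be written as a small function. The parameter names and date formats are assumptions, and treating an issue date equal to the modification date as already performed is a choice made for this sketch that the text does not specify.

```python
from datetime import date, datetime


def dataset_state(dmp_modified: str, dataset_issued: str) -> str:
    """Return 'planned' or 'performed' by comparing two ISO 8601 dates."""
    modified = datetime.fromisoformat(dmp_modified).date()
    issued = date.fromisoformat(dataset_issued)
    # Issue date after the last DMP modification: the action is still planned.
    # Assumption: equal dates are treated as already performed.
    return "planned" if issued > modified else "performed"
```

A dataset with issue date 2020-12-31 in a DMP last modified on 2020-02-01 is therefore reported as planned; once the DMP is modified after that date, the same dataset is reported as performed.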

5.6 Serialisations

All the examples provided are in JSON because of its readability and popularity: many tools implementing the application profile use the JSON serialisation. The application profile can be serialised to any other representation, e.g. XML, OWL, JSON-LD, etc., if needed. There is ongoing work on the ontological representation of the application profile.
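
To illustrate that the same information can be carried by different serialisations, the sketch below writes one in-memory structure as JSON and as XML using only the Python standard library. The dictionary-to-XML mapping (one element per key, repeated elements for list items) is a naive convention chosen for this example and is not an official XML schema of the application profile; the key names are likewise assumed.

```python
import json
import xml.etree.ElementTree as ET


def to_xml(tag: str, value) -> ET.Element:
    """Map nested dicts and lists to XML elements (naive, illustrative mapping)."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            # A list becomes repeated elements that share the same tag.
            items = child if isinstance(child, list) else [child]
            for item in items:
                element.append(to_xml(key, item))
    else:
        element.text = str(value)
    return element


document = {
    "dmp": {
        "title": "Example DMP",
        "dataset": [{"title": "A"}, {"title": "B"}],
    }
}

as_json = json.dumps(document)
as_xml = ET.tostring(to_xml("dmp", document["dmp"]), encoding="unicode")
```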

6 Adoptions

This section describes existing adoptions of the application profile. Its goal is to provide concrete examples and further pointers to information on how the application profile can be used. The list of adoptions below is not final and further applications of the application profile are pending.

Furthermore, Table 2 (see Appendix A) presents an overview of systems adopting the recommendation. It provides a short description of each system, lists stakeholders interacting with it, outlines key benefits and explains in what setting it is used.

6.1 Haplo

Haplo is based in the UK and supplies research information management software for Higher Education. A suite of products covers the full research lifecycle, including Current Research Information System (CRIS) and repository functionality. These products coexist within a single Haplo application, providing an integrated datastore for research activity and management at an institution. Haplo Repository and the maDMP implementation are open source.

Implementing a machine-actionable format enables Haplo to embed data management within the existing institutional research administration processes, sharing data between areas of the application. For example, an ethical approval process will update the project’s DMP if the submission indicates sensitive data will be produced. The DMP is then queried when depositing a dataset to the repository, and will notify repository staff if the DMP indicates the dataset needs to be restricted (Renner []) (illustrated in Figure 2). This implementation of maDMPs is in use at London South Bank University, and is in the process of being deployed to further institutions at the time of writing.

Figure 2

Notification to library staff identifying a potential security issue when depositing a dataset to the repository.

The application profile is flexible, expressive, and general enough to be the native data structure for maDMPs within Haplo applications. Its use within a fully featured CRIS has demonstrated that the maDMP, structured correctly, can enable data management within the full range of administrative processes that take place over the lifecycle of a research project. It enables the system to automatically apply policy, detect potential errors, and provide useful and meaningful reports on the data management activities at an institution.

6.2 Open Research Publishing Platforms

F1000 Research is an open access publisher providing open research platforms to several funding partners including Wellcome, the Gates Foundation, and the Health Research Board of Ireland. These platforms use a unique open post-publication peer review model and are supported by an open and FAIR data sharing policy. As part of F1000 Research’s commitment to open research, there is a pilot project in place to extend this service offering to include the publication of DMPs. Metadata is a critical component of this project and especially in this context, it is key to openly sharing information across technologies, disciplines, and stakeholders. The end goal is to establish DMPs as project ‘hubs’ linking out to related research articles, datasets, etc. In this way, published DMPs will provide linkages which make connecting and tracking the research landscape easier – particularly for funders and institutions.

By adopting this application profile, F1000 Research plans to provide basic interoperability between systems producing or consuming machine-actionable DMPs while reducing redundancy of effort. A key advantage from F1000 Research’s standpoint is that this application profile covers a broad range of use cases. Furthermore, it does not enforce any funder-specific or institution-specific requirements. It is also extremely important that the application profile represents information over the whole DMP lifecycle, given that each published DMP will be versionable.

6.3 DMPTool

The cornerstone of support for DMPs in the U.S. is the DMPTool, developed in 2011 by the California Digital Library (CDL) and founding collaborators. The Digital Curation Centre (DCC), based in the UK, and CDL have a formal partnership to co-develop and maintain a single open-source platform for providing DMP guidance. This shared software, DMPRoadmap, underpins both the DMPTool and the DCC service, DMPonline.

In 2017 CDL was awarded a National Science Foundation EAGER grant to support the creation of machine-actionable DMPs. As part of this grant, the DMPTool, together with the DMPRoadmap team, implemented the application profile within the shared code base and developed an application profile compliant API. Additionally, in partnership with DataCite, DMPTool developers have built a workflow for generating DOIs for DMPs and utilising the Event Data service from DataCite to record when assertions have been made on the DOI.

Ongoing pilot projects with domain-specific and institutional stakeholders are testing integration between different services and systems utilising maDMPs. The largest of these pilot projects is the FAIR Island project which is a collaboration between the University of California Gump Field Station, located on Moorea in French Polynesia, and Tetiaroa Society, which operates a newly established field station located on the atoll of Tetiaroa. By implementing mandatory registration requirements including extensive use of controlled vocabularies, PIDs, and other identifiers, maDMPs in this environment will be utilised as key documents for tracking provenance, attribution, compliance, deposit, and publication of all research data collected on the island. The FAIR Island Project offers a real-world example to prove the capabilities of machine-actionable DMPs and to analyse the downstream effects of these policies in the resulting release of data.

6.4 DMPonline

The Digital Curation Centre launched DMPonline in 2010 to support the UK Higher Education sector with the increasingly complex landscape of divergent funder policies for DMPs. Over the years the user base has expanded significantly, and the service now has a growing number of paying subscribers at institutions and funders in the UK, Netherlands, Sweden, Finland and Australia.

Together with partners at the CDL, the DCC is enhancing the open-source DMPRoadmap codebase to support machine-actionable DMPs. One of the first enhancements in this series of activities pre-dates the DMP Common Standards Working Group. In 2016, the DCC received an RDA Europe collaboration award to integrate the Metadata Standards Directory into DMPonline. By using external directories as a way to answer certain questions, we could ensure structured data was provided rather than free text. Work to integrate other registries of repositories and standards such as Re3data and FAIRsharing continues, and integrations with the Research Organisations Registry and FundRef are planned.

The DMPRoadmap data model instantly complied with the application profile for maDMPs at a minimal level (Rust []) but, as noted above, much work has been done to extend this, for example by adding additional metadata such as project start/end dates, utilising a range of persistent identifiers and developing an application profile compliant API. The existing DMP themes, which are used as tags on questions and guidance in the system, also provide an approximation to other key elements of the application profile such as ethical issues, metadata, costs and preservation statement. Mapping between these would permit data from the 80,000 existing DMPs to be converted to a more comprehensive maDMP for reuse in other systems. In a recent RDA Europe hackathon, developers from DMPTool and DMPonline trialled the new API and supported teams from Haplo, the Data Stewardship Wizard, OpenDMP and others to both export and import maDMPs from DMPRoadmap.

6.5 DMP OPIDoR

DMP OPIDoR is a DMP tool, based on DMPRoadmap code. It is made available to the French scientific community and has been adapted to meet its needs.

In France, commitment to open science is widely embraced by all stakeholders who are involved at different stages of the data lifecycle. They are willing to provide their expertise and interconnect their infrastructures with DMPs so as to enable seamless data management.

In order to ease information exchange between systems, a new flexible object data model has been developed based upon the DMP templates currently published in DMP OPIDoR and use cases submitted by the different actors. Its implementation into DMP OPIDoR is underway. As a result, more structured and standardised information will be collected, and the use of PIDs and controlled vocabularies will also be facilitated. Furthermore, the flexibility and extensibility of this model will allow it to be adapted to the specifics of various service providers and scientific communities.

These developments will first be tested as part of a project, in partnership with the French Bioinformatics Institute and funded by the French National Research Agency, whose main objective is to enable automated service provisioning by a data processing facility.

The implemented data model is compliant with the application profile described in this paper. DMP OPIDoR will provide an import/export feature to allow interoperability with other DMP tools or publishing platforms.

6.6 Data Stewardship Wizard

Data Stewardship Wizard (DSW, Pergl et al. []) takes a significantly different approach from other data management planning tools: it is not meant primarily to satisfy demands by funders and institutions, but to help researchers plan data management that works best for their project. Rather than using question templates directly, DSW makes use of customizable knowledge models that describe the hierarchical structure and content of questionnaires. DSW comes with a root knowledge model that avoids free-text answers and contains guidance towards making data FAIR. A questionnaire can be turned into a document using Jinja2 templates that can simply print out the questions and answers (e.g. the generic default template), synthesise text from the answers (e.g. following the Science Europe template), or transform answers to any textual format following a required schema (e.g. RDF, JSON, or OPML).

Machine-actionable DMPs are the key to interoperability between DMP tools and can potentially enable the use of structured information from DMPs by different stakeholders such as project management offices, institutional IT support, and funders. To support this, we implemented export and import functionality in DSW for maDMPs compliant with the application profile. We achieved export functionality using the Jinja2 templates for creating documents. The main part of this task was to find a mapping between the application profile and the root knowledge model of DSW. Many questions were already present and directly mappable in the root knowledge model, e.g. several questions regarding ethical issues. Some other questions had to be modified or moved in the structure, e.g. so that a maDMP can describe several projects instead of just one, as the application profile allows. Some new questions were also added, e.g. to support describing the funding of each project, where we also added an integration with the funders registry of CrossRef. Together, those changes were released as a new version of the root knowledge model, and anyone with an existing questionnaire can migrate their answers. The newly added maDMP export template queries the relevant answers in a questionnaire using their UUIDs, transforms them into a JSON object according to the DCS JSON schema, and dumps it as output. In addition to the JSON export, we also created a template for RDF using the ontology representation of the application profile. It uses the same JSON object to compose an RDF Turtle file that can then be converted internally to other RDF formats such as RDF/XML, TriG, JSON-LD, or N3.
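The export step described above (look up answers by UUID, then assemble a JSON object) can be illustrated with a minimal Python sketch. It is not DSW's actual template code, which is written as Jinja2 templates; the UUIDs, the `ANSWER_PATHS` mapping and the flat `answers` dictionary are hypothetical, and only the field names `dmp`, `title`, `ethical_issues_exist` and `language` are taken from the JSON serialisation of the application profile.

```python
import json

# Hypothetical mapping from question UUIDs to paths in the maDMP JSON object.
ANSWER_PATHS = {
    "11111111-0000-0000-0000-000000000001": ("dmp", "title"),
    "11111111-0000-0000-0000-000000000002": ("dmp", "ethical_issues_exist"),
    "11111111-0000-0000-0000-000000000003": ("dmp", "language"),
}


def export_madmp(answers):
    """Build a nested maDMP dictionary from answers keyed by question UUID."""
    madmp = {}
    for uuid, path in ANSWER_PATHS.items():
        if uuid not in answers:
            continue  # unanswered questions are left out of the export
        node = madmp
        for key in path[:-1]:
            node = node.setdefault(key, {})  # create intermediate objects
        node[path[-1]] = answers[uuid]
    return madmp


# Illustrative questionnaire state: two of the three questions are answered.
answers = {
    "11111111-0000-0000-0000-000000000001": "Example DMP",
    "11111111-0000-0000-0000-000000000002": "yes",
}
document = export_madmp(answers)
print(json.dumps(document, indent=2))
```

Keeping the mapping as data rather than code is one way to let the same table drive both export and import; whether DSW organises its mapping this way is not stated here.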

Implementing the import of maDMPs was more complex as there was no concept of importing questionnaires in our system beforehand. Nevertheless, we were able to create a prototype import functionality on the frontend part of DSW. After loading a valid JSON file, it decodes the maDMP into the internal structure and then, using the same mapping used for the export, transforms the data into answers for a new questionnaire. The user can preview the parsed data and then create such a pre-filled questionnaire. Within this prototype, the structure of maDMPs as well as the mapping was hard-coded; we will use this for further analysis and generalisation into a future flexible answer-import feature. By adopting maDMPs (as shown in Figure 3), we managed to interchange maDMPs with other DMP tools, such as DMPTool, Argos, and easyDMP. Moreover, using the document submission feature of DSW, it is possible for a user to send a DMP from DSW to DMPTool via its API by clicking a button in DSW.

Figure 3

Diagram of how DSW is using maDMPs.
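The import direction described for the DSW prototype (decode the JSON, then apply the export mapping in reverse to pre-fill a questionnaire) might look roughly like the following Python sketch. The question identifiers `q-title` and `q-ethics` stand in for real UUIDs and, like the `ANSWER_PATHS` table, are hypothetical. The validity check is deliberately minimal and is not a substitute for validation against the JSON schema of the application profile.

```python
import json

# Hypothetical mapping shared by export and import: question id -> maDMP path.
ANSWER_PATHS = {
    "q-title": ("dmp", "title"),
    "q-ethics": ("dmp", "ethical_issues_exist"),
}


def import_madmp(text):
    """Decode a maDMP JSON string and return answers keyed by question id."""
    madmp = json.loads(text)
    if not isinstance(madmp, dict) or "dmp" not in madmp:
        raise ValueError("not a maDMP: missing top-level 'dmp' object")
    answers = {}
    for question_id, path in ANSWER_PATHS.items():
        node = madmp
        for key in path:
            if not isinstance(node, dict) or key not in node:
                node = None  # this path is absent from the imported file
                break
            node = node[key]
        if node is not None:
            answers[question_id] = node
    return answers


# Fields without a mapped question (here: language) are ignored by this sketch.
prefilled = import_madmp('{"dmp": {"title": "Example DMP", "language": "eng"}}')
print(prefilled)
```

The returned dictionary corresponds to the preview step: it can be shown to the user before a pre-filled questionnaire is created.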

6.7 NSD DMP

NSD – Norwegian Centre for Research Data – is a national archive and research data centre. Its mission is to ensure free and open access to research data, and improve the basis for empirical research through a broad range of data and support services. By means of automation and system integration, the NSD DMP can assist researchers and institutions in their efforts to share their data.

NSD DMP adopts parts of the application profile. The NSD specifically looked at new ways of distinguishing between project information and dataset information when creating a DMP. We also considered ways of integrating security and privacy issues with information on relevant data hosting options. We found much valuable input and inspiration in the set of semi-automated workflows that were designed in the RDA DMP Common Standards Working Group. These processes show how data management planning can be supported by means of automation and system integration in an institutional context.

An example of machine-actionability in the NSD DMP is the classification module, which provides institution-specific policy recommendations for collecting, storing and archiving data based on the classification of data into either open (green), restricted/internal (yellow), confidential (red), or strictly confidential (black) categories. Another example is the archive guide, which helps users identify national archives and repositories that are relevant for their data. The archive guide uses APIs from re3data.org, and provides a list of suggested archives and repositories for data packages based on various criteria pre-selected by the user.
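The idea behind such a classification module can be sketched as a lookup from a colour category to a policy recommendation. The four categories below come from the description above; the `POLICY` table and its storage options are invented for illustration and do not reflect the policy of NSD or of any institution.

```python
from enum import Enum


class DataClass(Enum):
    """The four classification categories named in the text."""
    GREEN = "open"
    YELLOW = "restricted/internal"
    RED = "confidential"
    BLACK = "strictly confidential"


# Hypothetical institutional policy: permitted storage options per category.
POLICY = {
    DataClass.GREEN: ["public repository", "institutional storage"],
    DataClass.YELLOW: ["institutional storage"],
    DataClass.RED: ["encrypted institutional storage"],
    DataClass.BLACK: ["secure offline storage"],
}


def recommend(colour):
    """Return the classification label and storage options for a colour code."""
    data_class = DataClass[colour.upper()]  # raises KeyError for unknown colours
    return {"classification": data_class.value, "storage": POLICY[data_class]}


print(recommend("red"))
```

Because the policy is plain data, each institution could supply its own table without changing the lookup logic.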

In addition, the NSD DMP tool is integrated with a Data Policy Manager, which allows institutions to design interactive and machine-actionable policies that can be linked to internal and external systems as needed. The tool enables institutions to define their own policies for a wide variety of general and institution-specific storage services, data transfer applications and data collection tools.

The proof-of-concept tools and BPMN processes that were developed together with the application profile helped us to better imagine the full scope of a system and provided us with valuable suggestions on how to develop our own DMP tool with machine-actionable elements.

6.8 Argos – OpenDMP

Argos is a service offered by OpenAIRE utilising the OpenDMP platform, co-designed by the OpenAIRE and EUDAT initiatives, in order to support the management and distribution of machine-actionable DMPs.

The platform builds on a few core concepts, which are as follows:

Dataset description: structured and unstructured (i.e. textual) information about the dataset, shaped by a Dataset Profile
Data Management Plan: an aggregation (1..n) of dataset descriptions into a set managed as a whole in a given context
Dataset Profile: a definition, containing data elements and behavioural rules, that dictates how a dataset may be described by a user

Around those baseline concepts, numerous other “native” (i.e. a-priori defined in code) and “soft” (i.e. defined by configuration) elements are utilised to complete the data model of the system. Examples of those elements are projects, funders, researchers, organisations, repositories etc.

OpenDMP supports maDMP JSON import and export under a few assumptions. Its model maps natively to a substantial segment of the maDMP entities, attributes and relations. A DMP container aggregates a number of datasets, contributors, contacts and language information, while DMP timeline attributes are captured via a combination of DMP versions and timestamps. The project, grant and funder stack of the application profile is supported under a slightly different arrangement, which fits well when exporting maDMPs, yet imports may lead to a user prompt for further choices. There are two deviations: (i) OpenDMP supports the actual funding, the cost and ethical issues at the level of a dataset instead of the entire DMP, so these maDMP elements may be exported only for DMPs with a single dataset; (ii) only one distribution is supported per dataset, which works well during export, but imports may lead to user prompts for further choices. Beyond those differences, all attributes of the maDMP dataset model are covered by OpenDMP and may be both exported and imported. Through its customisable templates, OpenDMP may support concepts not directly mapped to maDMPs. To make the maDMP JSON format a fully re-importable export form, it utilises a reserved area of the file for maintaining the extra information. In the other direction, OpenDMP imports maDMP files under certain assumptions (as presented above) and tries to preserve additional data and structure that may not fit its model in reserved areas, to avoid discarding the extra information. This extra information remains accessible through its REST API. Work is currently underway to fine-tune the mapping of particular attributes to maDMP concepts and to extend the user actions supported during imports.
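The two import behaviours described above, choosing one distribution when several are present and keeping unmapped data in a reserved area, can be illustrated with a short Python sketch. It is not OpenDMP code. The `NATIVE_FIELDS` set, the `_reserved` key name and the `choose_distribution` callback (a stand-in for the user prompt) are assumptions made for the example; `title`, `description`, `dataset_id`, `keyword` and `distribution` are dataset fields of the application profile.

```python
# Hypothetical subset of dataset attributes that the importing tool models natively.
NATIVE_FIELDS = {"title", "description", "dataset_id"}


def import_dataset(dataset, choose_distribution):
    """Split one maDMP dataset into native fields, one distribution and extras."""
    native = {k: v for k, v in dataset.items() if k in NATIVE_FIELDS}
    extras = {
        k: v
        for k, v in dataset.items()
        if k not in NATIVE_FIELDS and k != "distribution"
    }
    distributions = dataset.get("distribution", [])
    if len(distributions) > 1:
        # Several distributions: ask the caller (standing in for the user) to pick one.
        index = choose_distribution(distributions)
        native["distribution"] = distributions[index]
        extras["distribution"] = [
            d for i, d in enumerate(distributions) if i != index
        ]
    elif distributions:
        native["distribution"] = distributions[0]
    # "_reserved" is a made-up key for the area that keeps data the model cannot hold.
    return {**native, "_reserved": extras}


dataset = {
    "title": "Survey data",
    "keyword": ["survey"],
    "distribution": [{"title": "CSV"}, {"title": "SPSS"}],
}
imported = import_dataset(dataset, choose_distribution=lambda options: 0)
print(imported)
```

Nothing from the input is discarded: whatever is not modelled natively, including the distributions that were not chosen, ends up under the reserved key and could be written back on export.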

6.9 Research Data Infrastructure at TU Wien

TU Wien is the largest technical university in Austria. Within the FAIR Data Austria project TU Wien is developing an integrated infrastructure for research data management that includes data and code repositories, as well as tools for machine-actionable DMPs. The goal is to integrate existing and new systems to ensure exchange of information and improve management of scientific data (see Figure 4).

Figure 4

maDMPs used to exchange machine-actionable DMPs among university services.

Machine-actionable DMPs are one of the main enablers of the project. They act as an inventory integrating information from various systems. For example, maDMPs provide a link between research groups, projects and the data produced by them. The maDMPs help in realising the ‘ask once only’ principle, which aims at reducing the workload imposed on researchers and support services. For example, information on existing ethical issues is requested when a new project is registered. This information is automatically transferred to a DMP. TU Wien is currently developing a tool for DMPs that will be shared with project partners: TU Graz and University of Vienna. The application profile and prototypes developed in the course of the research streamlined the architecture of the developed data infrastructure and are being implemented in live systems.

6.10 easyDMP

The easyDMP tool for creating DMPs was created in Norway in 2016 by UNINETT Sigma2 to provide a simple Python tool for creating DMPs that integrates with the services for provisioning storage on the Norwegian Infrastructure for Research Data (NIRD) operated by Sigma2. The tool supports a variety of templates (Horizon 2020, Science Europe and local templates) that are defined through an administration interface. The level of detail is defined by the template designer, and templates can have branches and make use of controlled vocabularies. Plans are stored in an SQL database and the information can be accessed through a read-only API.

The goal of easyDMP has been to make the plan machine-actionable, enabling storage services to reserve the storage space requested in the plan. The development of the application profile by the RDA working group has been fortuitous as it has provided a schema based on feedback from a large number of communities interested in providing maDMPs.

By aligning easyDMP with the application profile, we can ensure that our tool interoperates with other tools, making the plans independent of the tool implementation. We are also working to extend the schema such that our storage services can consume the plans and reserve the required storage space. Provided a different DMP tool implements our extension, our storage services would be able to consume plans created in that tool.
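How a storage service might consume a plan can be illustrated with a minimal Python sketch. It does not use the easyDMP schema extension mentioned above, whose contents are not described here; instead it assumes that the requested space can be estimated from the `byte_size` field that the application profile defines on each dataset distribution. The function name and the example plan are hypothetical.

```python
def requested_bytes(madmp):
    """Sum the byte_size of every distribution of every dataset in a maDMP."""
    total = 0
    for dataset in madmp.get("dmp", {}).get("dataset", []):
        for distribution in dataset.get("distribution", []):
            # Distributions without a declared size contribute nothing.
            total += int(distribution.get("byte_size", 0))
    return total


# Illustrative plan with two datasets; one distribution declares no size.
plan = {
    "dmp": {
        "title": "Example plan",
        "dataset": [
            {"title": "Raw", "distribution": [{"byte_size": 5_000_000_000}]},
            {
                "title": "Derived",
                "distribution": [{"byte_size": 250_000_000}, {"title": "no size"}],
            },
        ],
    }
}
reservation = requested_bytes(plan)
print(reservation)
```

Because the function reads only fields of the shared application profile, it would work on a plan exported from any compliant tool, which is the tool independence the alignment aims for.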

7 Conclusion

We can reduce the effort imposed on researchers and on other stakeholders involved in research data management when information contained in data management plans can be accessed, interpreted, exchanged and acted upon by machines. Automation and machine-actionability can lead to improved quality of the information provided and higher reuse of information, resulting in a better return on the investment made into data management services.

In this paper we reported on a community effort to define the application profile for machine-actionable data management plans that allows expressing information from traditional data management plans in a machine-actionable way. We described the methodology and research conducted to define the application profile. We also discussed design decisions made during its development and presented systems adopting it that include major DMP tool providers, as well as repositories, publishers and universities. The application profile is an official output of the RDA DMP Common Standards Working Group that has gathered more than 200 members from around the globe.

The next steps will focus on further adoptions of the application profile and development of novel services utilising the full expressiveness of the application profile, e.g. repository recommendation services and other services acting on behalf of stakeholders. We also plan to release new serialisations of the application profile, such as the OWL ontology to fully utilise the benefits of machine-actionability and the Semantic Web.

The RDA DMP Common Standards Working Group will continue to maintain the application profile and will incorporate feedback received from the adopters. Any updates to the specification of the application profile will remain domain and tool independent, so that the application profile remains generally applicable.

Additional File

The additional file for this article can be found as follows:

Appendix A

Overview of adoptions. DOI: https://doi.org/10.5334/dsj-2021-032.s1

Data Science Journal

Research Papers
