1 Introduction

This paper presents the latest developments around data management plans (DMPs) in the Photon and Neutron (PaN) Research Infrastructures (RIs) on the European Strategy Forum on Research Infrastructures (ESFRI) roadmap in Europe. The ESFRI PaN RIs participated in the Photon and Neutron Open Science Cloud (PaNOSC) project, financed by the European Commission Horizon 2020 programme, to contribute to building the European Open Science Cloud (EOSC). A sister project to PaNOSC, the European Open Science Cloud Photon and Neutron Data Service (ExPaNDS), was financed for the national PaN facilities in Europe. PaNOSC and ExPaNDS worked closely together to provide solutions and recommendations for the PaN community.

One of the tasks of PaNOSC for FAIR data was to develop a template for the DMPs and to implement these for the RIs involved. PaNOSC and ExPaNDS developed the DMP template together (see accompanying paper by Görzig et al. () in this issue). This paper documents the work which was done together at each of the six facilities (CERIC-ERIC, ELI-ERIC, ESRF, ESS, EuXFEL, and ILL; see Table 1) to implement a common template.

Table 1

PaNOSC Research Infrastructures implementing DMPs.


| Facility name | Location | Description | Website |
| --- | --- | --- | --- |
| CERIC-ERIC | Trieste, Italy | The Central European Research Infrastructure Consortium provides access to national research infrastructures in eight European countries for over 1000 users per year. | https://www.ceric-eric.eu/ |
| ELI-ERIC | Prague, Czechia | The Extreme Light Infrastructure is the world’s largest and most advanced high-power laser infrastructure, opened for users in 2022. | https://www.eli-laser.eu |
| ESRF | Grenoble, France | The European Synchrotron Radiation Facility is a fourth generation synchrotron facility; approx. 9000 scientists per year come to use the ESRF. | https://www.esrf.eu |
| ESS | Lund, Sweden | The European Spallation Source, the future world’s most powerful pulsed neutron source, is currently under construction in Lund, Sweden. | https://europeanspallationsource.se/ |
| EuXFEL | Schenefeld, Germany | The European X-Ray Free-Electron Laser Facility offers ultrashort X-ray pulses for over 1500 users per year. | https://www.xfel.eu |
| ILL | Grenoble, France | The Institut Laue-Langevin has been the world’s brightest neutron source for 50 years and has over 2000 users per year. | https://www.ill.eu/ |

2 Photon and Neutron Sources

The PaN sources and their instruments are essential tools for scientists and also for industry in many different fields, such as biology, medicine, materials, physics, cultural heritage, and geology. Thousands of users of these large RIs perform scientific experiments each year. These experiments produce petabytes of scientific data, which are curated and archived for 5 to 10 years, depending on the data policy of each facility. The data include raw data (data collected from experiments performed on a facility’s instruments) with the associated metadata (from the sample and the acquisition environment) and processed data (obtained by processing raw data in an automated manner, usually at the facility).

3 Experiment Workflows at Photon and Neutron Sources

3.1 DMP in the PaN facilities experiment workflows

In a simplified view, the main stages of the RI lifecycle are the following:

  • Proposal submission: The user submits a proposal to use specific instruments on a facility and applies for experiment time.
  • Approval: After a peer-review process, the proposal may be accepted.
  • Scheduling: Time on the instrument is allocated to the proposal.
  • Experiment: The scientists perform the experiment on their samples at the facility and acquire data.
  • Data analysis: The scientists carry out further analysis of the raw data.
  • Publication: The possible subsequent publication is registered with the facility.

To be of real use, DMPs for PaN facilities should be aligned with the facility workflow for research (Figure 1). It is important that the plan precedes execution; thus, for users, the planning stage takes place before the experiment.

Figure 1 

DMP in the PaN facilities workflow.

The start of the DMP is during the submission of a proposal and/or as part of activities preceding the experiment, such as sample declaration, visit planning, and safety training. During a user’s experiment, the DMP will be enhanced with information by the facility systems. The updating of the DMPs throughout the life cycle of the experiment is based on the concept of active DMPs; that is, DMPs are continuously updated. The next section presents the diverse sources of information available in the PaN facilities.

3.2 Information sources

Different IT systems have been identified as possible sources of information for the DMP:

  • The proposal and scheduling systems are the entry points to initiate the DMP and provide user data and general project information.
  • The metadata and data catalogue provide information on raw datasets.
  • The data processing and analysis pipelines provide software and datasets information.
  • From the facility storage system, format and volume information can be extracted.
  • The data publication system ensures the findability of the data.

Moreover, two databases, which did not yet exist in the facilities, the facility data information database and the instruments information database, add knowledge about instruments and general facility information in the DMP. The schema for these two databases has been proposed and implemented in a common software, which is described in section 6.

In some cases, the required information for the questionnaire may be available directly from the RI system, but users may also need to provide additional information themselves. The relevant information may be available at different stages of the project life cycle and in relevant DMP phases (see Table 2).

Table 2

DMP phases.


| DMP phase | Knowledge source |
| --- | --- |
| 0 Before proposal submission | Typically knowledge of the instrument scientist or RDM team (static parameter). |
| 1 Proposal submission | Typically knowledge of the user, with support from the facility administration and RDM team. |
| 2 Accepted experiment planning | Typically knowledge of the user, with support from the facility administration and instrument scientist. |
| 3 Data collection/data processing/analysis | Typically knowledge of the user, with support from the instrument scientist. |

4 A Common Knowledge Model

Instead of implementing a specific DMP for PaN facilities, a common structure, used to organise and represent information, has been built to meet an expanding set of requirements. This framework, called a knowledge model (KM), is translated into facility-specific questionnaires by selecting the relevant set of questions. The KM therefore contains the superset of questions from the facility questionnaires. The questions in the KM can be mapped to different templates (e.g., funder-specific templates, PaN facility templates).

In a survey of the DMP landscape, the Research Data Management Organiser (RDMO) stood out due to its broad range of questions applicable to the PaN facilities. All the sections of the RDMO KM were kept for the facilities in the PaNOSC project, but only the questions applicable to the facilities were retained. Where necessary, questions have been rephrased to fit the PaN facility users ().

Requirements on DMPs were described by each facility. Based on these requirements and on the RDMO questionnaire, a list of roughly 120 questions was mapped to the PaN facilities use cases. For each question, the following was identified and/or decided: the data source, by whom the question should be answered, and the phase in which the question could be answered.
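The attributes recorded for each question (data source, who answers, and phase) can be illustrated with a minimal Python sketch. It is not the PaNOSC KM itself: the four questions, their identifiers, and the helper names `facility_questionnaire`, `answerable_by`, and `prefilled_share` are assumptions made for illustration; only the phase numbering follows Table 2.

```python
from dataclasses import dataclass
from enum import IntEnum


class Phase(IntEnum):
    """DMP phases, numbered as in Table 2."""
    BEFORE_PROPOSAL = 0
    PROPOSAL_SUBMISSION = 1
    EXPERIMENT_PLANNING = 2
    DATA_COLLECTION_AND_ANALYSIS = 3


@dataclass(frozen=True)
class Question:
    uid: str          # hypothetical identifier
    text: str
    data_source: str  # where the answer comes from
    answered_by: str  # "facility" or "user"
    phase: Phase      # earliest phase in which it can be answered


# Tiny stand-in for the common KM (the real one has roughly 120 questions).
COMMON_KM = [
    Question("q-title", "What is the title of the project?",
             "proposal system", "facility", Phase.PROPOSAL_SUBMISSION),
    Question("q-format", "Which data formats are used?",
             "instruments information database", "facility",
             Phase.BEFORE_PROPOSAL),
    Question("q-volume", "What data volume is expected?",
             "user", "user", Phase.EXPERIMENT_PLANNING),
    Question("q-software", "Which software is needed to work with the data?",
             "analysis pipeline", "facility",
             Phase.DATA_COLLECTION_AND_ANALYSIS),
]


def facility_questionnaire(km, selected_uids):
    """Select a facility-specific subset of the common KM, keeping KM order."""
    return [q for q in km if q.uid in selected_uids]


def answerable_by(questions, current_phase):
    """Return the questions whose phase has been reached."""
    return [q for q in questions if q.phase <= current_phase]


def prefilled_share(questions):
    """Fraction of questions answered by the facility rather than the user."""
    if not questions:
        return 0.0
    return sum(q.answered_by == "facility" for q in questions) / len(questions)


subset = facility_questionnaire(COMMON_KM, {"q-title", "q-format", "q-volume"})
print([q.uid for q in answerable_by(subset, Phase.PROPOSAL_SUBMISSION)])
print(round(prefilled_share(subset), 2))
```

Keeping the facility questionnaire as a selection from the common list, rather than a separate copy, is what keeps the facility DMPs comparable in this sketch.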

4.1 Research data management knowledge model framework

There are two overriding principles for the RDMO KM framework:

  1. The questions should be lightweight for the user and beamline scientist(s), and, where possible, content should be automatically generated from the experiment proposal.
  2. Completion of the template should be staged, with the staging aligned with the access mechanism of the research infrastructure and the data life cycle.

The RDMO KM consists of seven sections grouping related questions to ease completion. The following is a summary of each section:

  1. General/Topic
    This section introduces the project, with questions surrounding the science to be conducted. For the PaN facilities in PaNOSC, the users and the source of funding are also included in this section.
  2. Content Classification/Datasets
    This section asks questions about the individual datasets covered by the DMP.
  3. Technical Classification/Data Collection
    This section records the dates when data collection will take place as well as when the data will be processed and analysed. The total volume of data and volume of data per year are recorded. The user is asked to provide information about the software required to work with the data. Finally, questions are asked to understand if and how the user will carry out versioning as they work with the data.
  4. Data Usage/Usage Scenarios
    This section establishes what PaN facility resources will be required in the lifetime of the data. It defines who handles the data and data backups as well as who can access the data and what provision is made for data security. The extent that data can be shared or form part of a collaboration is recorded. The user is asked to estimate the personnel and non-personnel costs associated with the data.
  5. Metadata and Referencing/Metadata
    Here, the user is asked to indicate what metadata are required to understand the data and to indicate whether this is collected automatically, semiautomatically, or manually. Additionally, the use of persistent identifiers (PIDs) is requested. The user is asked to estimate the personnel and non-personnel costs associated with metadata and PIDs.
  6. Legal and Ethics/General Legal Issues
    This section establishes whether the data is under the jurisdiction of more than one country and whether it includes personal or sensitive data. The user is also asked to detail any recommendations the funding body has about data management.
  7. Storage and Long-Term Preservation/Selection
    Here, the user is asked to explain their criteria for archiving data as well as the duration and accessibility of such archived data.

The questionnaire (Figure 2) can be extended by each facility or, even more specifically, for a dedicated instrument. Because every extension builds on the common KM, the DMPs remain compatible and interoperable between the PaN facilities.

Figure 2 

The PaNOSC knowledge model.

5 Choice of a DMP Tool

Several tools are available to help in developing a DMP, and some of them are listed in Table 3 (extracted from ). Most of these are based on a questionnaire and include templates for predefined frameworks.

Table 3

DMP tools.


| Tool name | Operator |
| --- | --- |
| DMPonline | Digital Curation Centre (DCC) |
| DMPTool | California Digital Library (CDL) |
| EasyDMP | EUDAT & UNINETT Sigma2 |
| Data Stewardship Wizard (DSW) | Dutch Techcentre for Life Sciences (DTL, ELIXIR NL) and Czech Technical University in Prague (CTU, ELIXIR CZ) |
| Research Data Management Organiser (RDMO) | Operated by many institutions (self-deploy model). Creators are Leibniz-Institut für Astrophysik Potsdam (AIP), Potsdam University of Applied Sciences (FHP), and Karlsruhe Institute of Technology (KIT). |
| Research Data Manager (UQRDM) | University of Queensland |
| DataWiz | Leibniz Institute for Psychology Information and Documentation (ZPID) |
| ezDMP | Interdisciplinary Earth Data Alliance (IEDA) |
| OpenDMP | OpenAIRE & EUDAT |
| ARGOS | OpenAIRE Service in EOSC |

After a thorough analysis of the different DMP tools available, it was decided to use the Data Stewardship Wizard (DSW) tool () to implement the RDMO KM. The DSW tool contains a KM editor, which allows the data steward user to build his/her own KM or to extend an existing one. The questions included in the KM can be defined as closed questions with a set of possible answers. Depending on the answer, follow-up questions can be defined. With this flexible tool, the common KM for all PaN facilities could be created, and each facility could extend it to fit its specificities. The KM implemented in DSW is provided in the official GitHub repository ().
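The idea of closed questions whose answers unlock follow-up questions can be sketched as a small tree. This is an illustration only and does not reproduce the DSW data model: the class names `ClosedQuestion` and `Option`, the function `questions_to_ask`, and the example follow-up question are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Option:
    """One possible answer to a closed question."""
    label: str
    follow_ups: List["ClosedQuestion"] = field(default_factory=list)


@dataclass
class ClosedQuestion:
    """A question with a fixed set of possible answers."""
    text: str
    options: List[Option]


def questions_to_ask(question: ClosedQuestion,
                     chosen: Dict[str, str]) -> Iterator[str]:
    """Yield the questions reached so far, given the answers in `chosen`.

    `chosen` maps a question text to the selected option label. Follow-ups
    are visited only below the option that was actually selected.
    """
    yield question.text
    selected = chosen.get(question.text)
    for option in question.options:
        if option.label == selected:
            for follow_up in option.follow_ups:
                yield from questions_to_ask(follow_up, chosen)


# Hypothetical example: a follow-up that appears only after answering "Yes".
anonymised = ClosedQuestion("Are the data anonymised?",
                            [Option("Yes"), Option("No")])
personal = ClosedQuestion("Do the data include personal data?",
                          [Option("Yes", [anonymised]), Option("No")])

print(list(questions_to_ask(personal, {})))
print(list(questions_to_ask(personal, {personal.text: "Yes"})))
```

A facility extension then amounts to attaching further options or follow-up questions to nodes of the shared tree.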

A useful feature of the DSW is the creation of custom templates (). This makes it possible to create DMPs adapted to each facility and based on their specific workflows and use cases. An active DMP is a living document and is updated at each major step of the experiment. DSW enables the data steward to define phases and map questions to them, and it allows the researcher/user to select the phase that matches the current state of his/her project. DSW enables users to work together on the same project, which is an important feature in a research project. The DSW tool provides a REST API which can be used for integration purposes, as described in the next section.

6 Common Software for Populating DSW

The majority of questions in the DMP are answered by the facility. To enable this, software that integrates with the DSW API was developed to create and populate DMPs on the user’s behalf. By splitting the facility-specific information into separate modules, different facilities can share the same code base for receiving events and communicating with DSW. This enables facilities to quickly integrate DSW into their IT infrastructure. As seen in Figure 3, the system uses message brokers (messaging software that applications and services use to communicate with each other) to receive events from systems such as the proposal system. This makes the system robust, as messages are only consumed upon successful communication with the DSW platform. Facility-specific information can be populated directly in the platform, by filling in the JSON configuration files, or by connecting to third-party systems for facility and instrument information.

Figure 3 

Common software architecture.
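The robustness property described in this section, where a message is consumed only after the DMP tool has been updated successfully, can be sketched without a real broker. The sketch below uses an in-memory queue in place of the message broker and a stand-in client in place of the DSW API; `build_update`, `drain`, `FlakyClient`, the event fields, and the `retention_years` default are assumptions for illustration and not part of the PaNOSC software.

```python
from collections import deque


def build_update(event, facility_defaults):
    """Merge facility-wide defaults with the fields carried by the event."""
    answers = dict(facility_defaults)
    answers["title"] = event["title"]
    return {"proposal_id": event["proposal_id"], "answers": answers}


def drain(pending, push, facility_defaults):
    """Process queued events in order and return how many were acknowledged.

    An event is removed from the queue (acknowledged) only after `push`
    succeeds. If `push` raises ConnectionError, the event stays queued and
    processing stops, so a later call retries it.
    """
    acknowledged = 0
    while pending:
        event = pending[0]  # peek: the message stays in the queue
        try:
            push(build_update(event, facility_defaults))
        except ConnectionError:
            break  # not acknowledged; retried on the next call
        pending.popleft()  # acknowledge only after success
        acknowledged += 1
    return acknowledged


class FlakyClient:
    """Stand-in for the DMP tool client that fails a given number of times."""

    def __init__(self, failures):
        self.failures_left = failures
        self.received = []

    def push(self, update):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise ConnectionError("DMP tool unreachable")
        self.received.append(update)


pending = deque([
    {"proposal_id": "P-1", "title": "First proposal"},
    {"proposal_id": "P-2", "title": "Second proposal"},
])
client = FlakyClient(failures=1)
defaults = {"retention_years": 10}  # hypothetical JSON configuration content

first_try = drain(pending, client.push, defaults)   # the tool is down
second_try = drain(pending, client.push, defaults)  # the retry succeeds
print(first_try, second_try, len(client.received))
```

With a real broker, the peek-then-remove step would typically correspond to manual acknowledgement of a delivered message.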

7 Deploying DMP Service in the PaN Facilities

The proposed system architecture and the common KM developed in the PaNOSC project show how a DMP tool can be integrated into the proposal workflow of a PaN facility, as summarised in Table 4 below.

Table 4

Overview of approaches for each facility.


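| Facility | DSW used | Templates | DMP mandatory | DMP questions | DMP prefilled | DMP created |
| --- | --- | --- | --- | --- | --- | --- |
| CERIC-ERIC | Yes | CERIC | Yes | 21 | 100% | 1 |
| ELI-ERIC | Yes | ELI | Yes | 10–20, depending on the call/facility | 50% | 0 |
| ESRF | Yes | H2020, ANR, HE | No | 51 | 60% | 1150 |
| ESS | Yes | H2020, RDMO | No | 25 | 80% | 50 |
| EuXFEL | Under evaluation | WIP | Planned | WIP | WIP | WIP |
| ILL | Testing | WIP | WIP | WIP | WIP | WIP |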

The adaptations required to deploy DSW at the PaN facilities are described below.

7.1 CERIC-ERIC

DMPs are generated in a fully automatic way, avoiding manual operations and getting almost all the information needed from the user office platform.

Workflow

The creation of a DMP is triggered when the associated proposal schedule is accepted. Thirty days after the end date of the experiment, the user is prompted to fill an ‘Achievements page’. On this page, the user must accept the DMP and can download the document in PDF format directly from the user office portal.

Implementation

Figure 4 shows an overview of the entities involved in the DMP service.

Figure 4 

CERIC DMP service overview.

The DSW has been adopted as the DMP tool and deployed using a bespoke docker-compose implementation starting from the example provided by the DSW team in the official GitHub repository (). The DSW instance can be reached at https://dsw.ceric-eric.eu, and the Keycloak () identity provider has been configured to make the tool available for all the CERIC users.

Questionnaire

The CERIC questionnaire contains only 21 questions, which we identified as the most interesting for the scientists and which can be answered automatically using the information stored in the user office platform. All the PaNOSC KM chapters have been maintained. In addition to the aforementioned changes to the questions, we rephrased each question as a statement that serves as a caption in the generated document.

Templates

A template which allows creating a four-page document that can be exported in HTML and PDF formats has been created.

7.2 ELI-ERIC

Workflow

ELI-ERIC’s Scientific Data Policy (), adopted and already in production for the very first ELI-ERIC call for proposals, introduces the DMPs to ELI-ERIC users and staff.

Implementation

The following basic functional diagram (Figure 5) describes the current approach and the next development steps.

Figure 5 

DMP implementation roadmap at ELI-ERIC.

Questionnaire

DSW has been deployed () and tested and is now integrated in the ELI-ERIC User Office workflow. When selecting a Scientific Instrument in the proposal submission workflow, the user automatically gets a DMP template (from DSW) with some of the fields/values/instrument parameters prefilled (editable, but default values preselected), based on the expertise and experience of the User Office and Technical Feasibility experts of the ELI facility hosting the experiment.

7.3 ESRF

ESRF has introduced data management plans (DMPs) for all proposals which have been accepted, excluding industrial proposals (i.e., privately funded proposals). The ESRF-specific data management policies and tools, such as the ESRF data policy (), IT infrastructure, the User Portal (SMIS), and the metadata catalogue (ICAT), are used to fill in 60% (36 questions) of the DMPs automatically.

Workflow

Six weeks before the first experiment of a proposal, a DMP is automatically created. When an experiment of the proposal is finished, the DMP is updated with information coming from the metadata catalogue. So far, five questions have been identified that can be answered in this way.
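The six-week lead time can be expressed as simple date arithmetic. The following is a minimal sketch and not the ESRF implementation; the function names `dmp_creation_date` and `is_due` and the example dates are hypothetical.

```python
from datetime import date, timedelta

LEAD_TIME = timedelta(weeks=6)


def dmp_creation_date(experiment_starts):
    """Date on which the DMP is created: six weeks before the first experiment."""
    return min(experiment_starts) - LEAD_TIME


def is_due(experiment_starts, today):
    """True once the creation date of the DMP has been reached."""
    return today >= dmp_creation_date(experiment_starts)


# Hypothetical proposal with two scheduled experiments.
starts = [date(2023, 6, 20), date(2023, 5, 15)]
print(dmp_creation_date(starts))
print(is_due(starts, date(2023, 4, 1)))
```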

Implementation

Figure 6 gives an overview of how the DSW DMP service has been integrated in the ESRF infrastructure.

Figure 6 

Architecture and DMP data flow at the ESRF.

The DSW is available at https://dmp.esrf.fr. The data portal () provides a link to the corresponding DMP for all proposals created since 2022 (Figure 7).

Figure 7 

Link to the DMP from the ESRF data portal.

Questionnaire

The PaNOSC common knowledge model has been extended to better fit the ESRF’s needs. The differences are mainly semantic.

Templates

Two templates have been developed, one for the French national funding agency (ANR) and one for the Horizon Europe funding scheme. The DSW service can also be used by scientists to build a DMP for a specific project, which can involve several ESRF proposals. Two ANR-funded projects have successfully used this service to generate DMPs for the funders.

7.4 ESS

At ESS, the DMP is considered an integral part of the experiment planning and feasibility assessment. Currently, the feasibility assessment of a proposal is performed in collaboration with the appropriate member of the instrument staff. The additional information obtained through the collection of the DMP will be reviewed by the data reduction and analysis team.

Workflow

The DMP is created in connection with the proposal submission. At this stage, the DMP can be populated with facility-specific information. The next update is initiated upon the scheduling of a proposal. By using information from the scheduler regarding the instrument, sample environment, and allocated days of the experiment, the DMP can be further enhanced with estimated storage requirements and instrument-specific information. The last stage is post-experiment wrap-up. At this stage, the DMP will be updated with information captured by the ESS metadata catalogue. This includes actual data volume captured and executed data analysis.
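The three update points of this workflow (submission, scheduling, and post-experiment wrap-up) can be sketched as successive enrichments of one record. This is an illustration, not the ESS implementation: the function names, field names, instrument label, and the storage estimate (allocated days multiplied by a per-instrument daily volume) are assumptions.

```python
def on_submission(facility_info):
    """Create the DMP record from facility-specific information."""
    dmp = dict(facility_info)
    dmp["stage"] = "proposal submitted"
    return dmp


def on_scheduling(dmp, instrument, allocated_days, daily_volume_tb):
    """Add instrument details and an estimated storage requirement."""
    updated = dict(dmp)
    updated["stage"] = "scheduled"
    updated["instrument"] = instrument
    # Assumed estimate: allocated days times a per-instrument daily volume.
    updated["estimated_volume_tb"] = allocated_days * daily_volume_tb[instrument]
    return updated


def on_wrap_up(dmp, actual_volume_tb, analyses):
    """Record what the metadata catalogue captured after the experiment."""
    updated = dict(dmp)
    updated["stage"] = "experiment completed"
    updated["actual_volume_tb"] = actual_volume_tb
    updated["analyses"] = list(analyses)
    return updated


# Hypothetical values throughout.
dmp = on_submission({"facility": "ESS"})
dmp = on_scheduling(dmp, "instrument-a", 3, {"instrument-a": 2})
dmp = on_wrap_up(dmp, 5, ["data reduction"])
print(dmp["stage"], dmp["estimated_volume_tb"], dmp["actual_volume_tb"])
```

Each step returns a new record rather than modifying the previous one, so earlier versions of the DMP remain available for comparison.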

Implementation

It was decided to offer the DMP service as an option for the user. Facility information regarding ESS has been collected and is used upon initial creation to answer 25 questions (i.e., 80%) of each DMP. This offers users an immediate benefit, as the DMP acts as an information channel when it comes to data management at ESS.

Each instrument scientist fills out the relevant questions known for their instrument in the instruments information database. This information is then populated in the DMP automatically.

Questionnaire

The users are offered the choice to either answer a more extensive DMP or a simplified version, depending on their preferences. A user can also select the questionnaire based on the template that will be generated.

Templates

At the moment, two ways of exporting data are supported: PDF and markup language. It is envisioned to enhance this by adding support for popular formats such as Horizon 2020 and the RDMO standard.

7.5 EuXFEL

European XFEL is in the process of updating its Scientific Data Policy, which will introduce DMPs for all experiments to be carried out at the facility. The DMPs should improve and formalise communication between the facility’s user groups, instrument scientists, and data experts. Due to the extremely high data volumes generated at some of the European XFEL instruments (petabytes), the DMP must identify the necessary steps to obtain datasets compliant with the objectives of the European XFEL long-term data preservation strategy.

The European XFEL will use the PaNOSC project deliverable () as a base for DMPs and enhance them with facility-specific questions.

The implementation of the DMP workflows at the European XFEL is ongoing. Similar to the other members of PaNOSC, the European XFEL is considering the use of DSW as an underlying tool to implement DMPs, supplemented by a message broker.

7.6 ILL

ILL is exploring how to apply DMPs to all proposals in a way that benefits users without overloading them. Because the ILL proposal portal is due to be redesigned starting next year, the DMP workflow will be implemented at that point. The DSW deployment is in the test phase.

8 Conclusions

A DMP framework has been built for all PaN facility users during the PaNOSC project. The framework is based on a common KM of roughly 120 questions. The use of a common KM ensures interoperability between the facilities. This questionnaire has been implemented at the six PaNOSC partner facilities using the DSW tool. DSW was selected by the PaNOSC members because of its flexibility and ability to be integrated within the IT environment of each facility. DSW fulfilled the requirements of the PaNOSC facilities thanks to its powerful KM approach, ease of installation, and excellent support by the developers.

DMPs will give facility users a better overview of data management during their project. Users will know what to expect from the facility and what data they will produce, and they will have information on the tools they can use to process data. They can also export a DMP which can be submitted to funders. Facilities can gather information on what users need for data management and processing and plan accordingly. Some facilities decided to make the DMP a mandatory step in the proposal workflow, but with a fully prefilled DMP. The majority of facilities still see the DMP as optional, although it could become mandatory in certain cases in the future. In those cases, the DMP will need to be completed with additional inputs from the user.

A community-specific template is planned in order to automatically generate a DMP for every scientific project. DMPs should be machine readable to enable optimal exploitation of their content by all DMP consumers. For now, the DMP service is integrated as a data consumer service, but it can also be a data producer service which provides information to other services (e.g., about storage and compute resources required). In the future, DMPs will serve as a source of structured information for building knowledge graphs.