1 Introduction

Since their original publication, the FAIR principles (Findable, Accessible, Interoperable, Reusable; ) have driven the advancement of research data management practices and requirements at an unprecedented pace. In essence, the FAIR principles formalize what one would generally understand as the data management aspects of good scientific practice (), that is, that digital objects forming the foundation of research results should be available to the global community in order to facilitate the validation of scientific results and enable broad reuse of scientific data.

Specifically, the FAIR principles have entered the day-to-day workflow of researchers, because funders and publishers more often than not require that project data underlying scientific publications be managed, archived and made available to the scientific community in line with the FAIR principles. Consequently, research data repositories and archives can offer researchers a corresponding service if data curation practice in line with the FAIR principles can be credibly demonstrated and communicated. Indeed, current efforts to align the CoreTrustSeal certification () with the FAIR principles are paving the way in that regard (; ) towards repositories being recognised as ‘FAIR-enabling’.

To date, however, there exists no standardised and globally accepted procedure to reliably evaluate the FAIRness of a research data repository’s (meta)data holdings and its data curation approach. While the technical aspects required for providing FAIR data services can be clearly defined (e.g., ; ), this does not hold for the domain-specific requirements at the dataset level. Although recommendations regarding the metrics to be considered in FAIR evaluations have recently been published (; ), the lack of global agreement on and adoption of discipline-specific FAIRness criteria requires concerted community effort and remains a challenge (; ). This state of affairs results in frustratingly persistent communication barriers between the scientific community and those driving the acceptance and adoption of the FAIR principles – often to the disadvantage of the FAIR concept.

To eventually overcome the deadlock surrounding FAIRness evaluation, a plethora of tools – manual and automated, comprehensive and less so – has been and continues to be developed and is openly available for evaluating archived (meta)data (). From the perspective of a repository operator aiming for a FAIRness evaluation, it is however not evident which tool to choose, because a thorough evaluation of the tools’ fitness-for-purpose is not available.

In this study, we aim to close this knowledge gap by applying an ensemble of five different FAIRness evaluation tools to selected (meta)data archived in the World Data Center for Climate (WDCC), which is hosted at the German Climate Computing Center (DKRZ) in Hamburg, Germany. The WDCC is a CoreTrustSeal-certified domain-specific archive for climate science, with a focus on ensuring the long-term reusability of climate simulation data and climate-related data products. In earlier work, a self-assessment of the WDCC along the FAIR principles () indicated a high level of FAIRness (0.9 of 1). That evaluation was based purely on self-developed metrics for the individual FAIR principles, did not evaluate individual datasets and provided a holistic view of the WDCC (meta)data curation approach.

Our study is further motivated by the fact that while it is clear that automation of FAIRness evaluation is needed to ensure scalability, we are unsure whether automated tools are entirely fit-for-purpose, especially when it comes to the evaluation of contextual reusability of archived (meta)data (; ; ; ; ) – probably one of the most important aspects of ‘R’. Or, in other words: what use are good findability, accessibility and interoperability if the data lack contextual metadata such as documentation of methods, uncertainty assessments, associated references or provenance information? We presume that automated assessment of such information is close to impossible with current technology – a question we address in detail in this study.

The aspect of contextual reusability is especially important to consider when assessing the FAIRness of archived climate simulation data, because the climate modeling community has, for at least the last decade, provided access to standardised collections of well-documented data for reuse by the global community (; ; ; ; ; ; ; ). As such efforts are only feasible by adhering to agreed-upon and adopted discipline-specific (meta)data standards (e.g., ; ), this can already be seen as a certain degree of FAIRness. Further, data curation approaches of repositories catering for the archival of climate data already include quality control mechanisms to ensure long-term reusability (e.g., ; ; ). FAIRness evaluation tools should therefore be capable of reflecting these efforts. In applying an ensemble of FAIRness evaluation tools in this study, we aim to answer the following research questions:

  1. How does the previous self-assessment of WDCC FAIRness () compare to currently available tools and proposed methods, how is this reflected in WDCC’s (meta)data curation approach and how can WDCC FAIRness be improved?
  2. How do the different FAIRness evaluation tools compare to each other and what can we take home from such an analysis?
  3. How fit-for-purpose are the different FAIRness evaluation tools for an evaluation of the domain-specific aspects of FAIRness, especially in terms of contextual (meta)data reusability?

Building on our analysis, we discuss the lessons-learned during the process of evaluation and conclude with a set of recommendations for the design and application of future FAIR evaluation approaches. The paper is organised as follows: we introduce our analysis method and data used in Section 2. This includes a detailed description of the FAIRness evaluation tools, the choice of evaluated WDCC-archived datasets and the approach taken to achieve comparability between the different FAIRness evaluation tools. Results are presented in Section 3 and discussed in Section 4. The paper concludes with a summary in Section 5.

2 Methods and Data

In this section, we detail our approach to selecting FAIRness evaluation tools for our ensemble from the pool of globally available tools. We also cover aspects of tool applicability, discuss our approach to making the results from different tools comparable to each other, and highlight the importance of constructive feedback loops between tool developers and FAIRness evaluators. Finally, we motivate our methodology for selecting the WDCC-archived entries to be tested.

2.1 Selection of evaluation approaches

We based our selection of tools on the collection of FAIRness evaluation tools prepared by the Research Data Alliance (RDA) FAIR Data Maturity Working Group (WG) (). That collection presents twelve FAIR assessment tools originating from various institutions around the globe. We find that only two of the twelve presented tools are actually fit-for-purpose in the context of our study. These are the Checklist for Evaluation of Dataset Fitness for Use () produced by the Assessment of Data Fitness for Use WG (WDS/RDA) (cf. Sec. 2.1.1) and the FAIR Maturity Evaluation Service documented in Wilkinson et al. () (cf. Sec. 2.1.2). The latter is not explicitly listed in Bahim, Dekkers & Wyns (), but represents the evolution of a listed tool (). We did not use the other tools listed in Bahim, Dekkers & Wyns () for a number of reasons (see Table 1).

Table 1

Summary of the FAIRness evaluation tools which we assessed but decided not to use in the context of this study. The evaluation approaches were assessed in April 2021; a reassessment took place for some tools in February 2022 (see references).


TOOL | NOT USED BECAUSE | REFERENCE
ANDS-Nectar-RDS FAIR data self-assessment tool | not accessible | ANDS ()
DANS-Fairdat | pilot version meant for internal testing at DANS | Thomas ()
SATIFYD | not maintained anymore (L. Cepinskas (DANS), pers. comm. 24 March 21) | Fankhauser et al. ()
The CSIRO 5-star Data Rating tool | not accessible as online tool | Yu & Cox ()
The Scientific Data Stewardship Maturity Assessment Model | non-automated capture of evaluation results; proprietary document format | Peng et al. ()
Data Stewardship Wizard | assistance for FAIR data management planning, not for evaluation of archived data | Pergl et al. ()
RDA-SHARC Evaluation | no fillable form readily provided | David et al. ()
WMO Stewardship Maturity Matrix for Climate Data (SMM-CD) | non-automated capture of evaluation results; proprietary document format | Peng et al. ()
Data Use and Services Maturity Matrix | unclear application concept | The MM-Serv Working Group ()
ARDC FAIR Self-Assessment Tool | test results not saveable; no quantitative FAIR measure | Schweitzer et al. ()

We further searched the internet for ‘FAIR data evaluation’. Thereby, we discovered the tool FAIRshake () and decided to use it in our ensemble approach (cf. Sec. 2.1.3). We also discovered the ARDC’s FAIR self-assessment tool (), but decided not to use it, as it neither provides a download option for test results annotated with sufficient metadata of the evaluated resource nor a quantitative measure of FAIRness as final output (see Table 1).

Building upon earlier collaboration with the developers of the F-UJI tool () (see examples in ), we also used that tool in its software version v1.1.1 for our assessment ensemble (cf. Sec. 2.1.4). Finally, we built on earlier in-house work to evaluate WDCC’s FAIRness () by performing a self-assessment using the metric collection presented in Bahim et al. () (cf. Sec. 2.1.5).

We summarise the main characteristics of the five FAIRness evaluation tools in Table 2. The detailed results obtained from applying the FAIRness evaluation approaches are available as supporting data (; ). All tools were applied during April and May 2021. The versions of the automated (FMES, F-UJI) and hybrid (FAIRshake) tools correspond to those current at that time.

Table 2

Summary of the five FAIRness evaluation tools used in this study. The hybrid method of FAIRshake combines automated and manual evaluation. The covered FAIR ((F)indable, (A)ccessible, (I)nteroperable, (R)eusable) dimensions refer to the number of metrics the tool tests; for example, FMES checks Findability using 8 different tests.


TOOL | ACRONYM | METHOD | COVERED FAIR DIMENSIONS | REFERENCE
Checklist for Evaluation of Dataset Fitness for Use | CFU | manual | n/a | Austin et al. ()
FAIR Maturity Evaluation Service | FMES | automated | F: 8, A: 5, I: 7, R: 2 | Wilkinson et al. ()
FAIRshake | n/a | hybrid | F: 3, A: 1, I: 0, R: 5 | Clarke et al. ()
F-UJI | n/a | automated | F: 7, A: 3, I: 4, R: 10 | Devaraju et al. ()
Self Assessment | n/a | manual | F: 13, A: 12, I: 10, R: 10 | Bahim et al. ()

2.1.1 Checklist for Evaluation of Dataset Fitness for Use ()

The Checklist for Evaluation of Dataset Fitness for Use (CFU) was originally developed to supplement the CoreTrustSeal repository certification process () by providing a tool to ‘…check the fitness for use (e.g. FAIRness) of a repository’s holdings…’ (J. Petters, pers. comm. (Email) April 2021). So although not specifically designed with the FAIR principles in mind, the CFU can be used in the context of our study because it addresses data curation aspects relevant to FAIR.

The CFU is a manual questionnaire provided in the format of a Google form and can be accessed from the URL provided in Austin et al. (). The questionnaire consists of twenty questions covering aspects of dataset identification, state of the repository’s certification, data curation, metadata completeness, accessibility, data completeness and correctness as well as findability and interoperability. It is evident that the topics covered by the questions map very well onto the FAIR principles (). The questions allow for nuanced answers (Yes; Somewhat; No) and are formulated in a sufficiently generic way to allow for discipline-specific answers. As with any manual questionnaire, the evaluator has to be familiar with common practice in the scientific domain and, ideally, be aware of the repository’s preservation practice. The answers are saved to an online spreadsheet. Evaluators using the CFU can always return to previous assessments, provided the spreadsheet is available, and trace the score a particular resource attained. Evaluator objectivity is, however, key for reproducibility. Providing resource metadata in the form facilitates findability of the evaluation, and the results of an assessment can be shared with anyone.

2.1.2 FAIR Maturity Evaluation Service ()

The FAIR Maturity Evaluation Service (FMES) is a fully automated FAIRness evaluation tool building on community-driven efforts to compile discipline-specific FAIR maturity indicators (; ). The current implementation of the FMES is accessible online and lets users choose from a set of different FAIR maturity indicator collections for testing. At the time of writing, the majority of available collections are discipline-agnostic and provided by the tool developers.

For testing, the FMES takes the URL or PID of the online resource as input for finding and accessing the resource via the machine-actionable metadata provided as JSON-LD. If a PID is available, it strictly has to be provided to FMES to yield meaningful evaluation results. For later identification of the test, FMES also requires a title for the evaluation and the ORCiD of the evaluator as metadata. Once an evaluation has been performed – this can take up to 15 minutes to complete; we experienced an average of about two minutes per entry – the result of the evaluation is immediately displayed in the web interface and reasons for failing certain tests are documented (see for more information). Evaluation scores are given as the number of passed tests versus the total number of tests.
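
As an illustration of this first, metadata-harvesting step, the following minimal Python sketch (not part of FMES; the landing-page URL is a placeholder) collects the schema.org JSON-LD embedded in a landing page, which is the material such automated tools evaluate:

```python
# Minimal sketch (not part of FMES): harvest embedded schema.org JSON-LD from
# a dataset landing page, as automated evaluators do as a first step.
# Assumes the 'requests' and 'beautifulsoup4' packages are installed; the URL
# below is a placeholder for a WDCC landing page.
import json

import requests
from bs4 import BeautifulSoup

LANDING_PAGE = "https://www.wdc-climate.de/ui/entry?acronym=EXAMPLE"  # placeholder

response = requests.get(LANDING_PAGE, timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Collect all <script type="application/ld+json"> blocks found in the page.
jsonld_blocks = [
    json.loads(tag.string)
    for tag in soup.find_all("script", type="application/ld+json")
    if tag.string
]

for block in jsonld_blocks:
    if isinstance(block, dict):
        # Typical fields an automated test would look for (if present).
        print(block.get("@type"), block.get("identifier"), block.get("license"))
```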

Every evaluation performed with the FMES is saved in its backend and can be searched for and accessed at any later time by anyone via the web-GUI. This enables comprehensibility and reproducibility of the evaluation results.

Here, we applied the FMES using the collection All Maturity Indicator Tests as of May 8, 2019. We used that collection because it contains tests for all aspects of the FAIR principles (cf. Table 2), was compiled by the maintainer of the tool, and because no climate-science-specific FAIR maturity indicator collection was available at the time of testing.

2.1.3 FAIRshake ()

The FAIRshake tool takes a hybrid (combination of manual and automated) approach to assessing the FAIRness of digital resources (). FAIRshake can be accessed online and was initially designed for use in biology-related disciplines. The framework is intentionally kept generic enough to also be applicable to other disciplines (). As with FMES, FAIRshake can be used with a number of different FAIR metrics collections, the so-called rubrics, which differ in the number of included FAIR metrics, in the type of resource to be evaluated or in the scientific discipline the rubric can be applied to.

Applying FAIRshake is open to anybody upon online registration. Once registered, users organise their evaluations in projects, which contain the results from the digital resource assessments. The assessment itself is done by providing the URL of the digital resource, as well as further metadata like title, description and type of resource for later reference. The automated part of the evaluation sources the machine-actionable JSON-LD metadata of the resource. For our assessments, we used the FAIRshake dataset rubric because, in our view, it contains the most adequate set of FAIR metrics for the purpose of our study (cf. Table 2) and the most comprehensible test formulations.

In the FAIRshake dataset rubric, an automated approach is taken to evaluate the metrics relating to accessing the dataset landing page, accessing the data, contacts and licensing. The other metrics, focusing on documentation of the data and its provenance, the repository the data is hosted in, versioning and citation of the dataset, have to be answered manually. If an automated test fails because the required criteria encoded in the tool are not met, the test can still be amended manually. The results are given as nuanced answers (Yes (100% score); Yes, but (75%); No, but (25%); No (0%)). An evaluator can add additional information like URLs or free text to justify the provided answer, which often requires the evaluator to be familiar with common practice in the scientific domain and also with the repository’s preservation practice. Through the combination of automated and manual metric assessment, FAIRshake offers the unique possibility of testing for generic aspects of the FAIR principles, while also catering for domain-specific requirements.

Every assessment performed with FAIRshake can be accessed by anybody from the tool’s homepage, allowing for transparency and reproducibility. Our results are organised in the FAIRshake project WDCC for DSJ.

2.1.4 F-UJI ()

F-UJI is an automated tool for the assessment of the FAIRness of research data developed in the framework of the FAIRsFAIR project. Within the project, a set of metrics following the core FAIR principles was developed for use with F-UJI (). F-UJI not only queries the machine-actionable (meta)data available as JSON-LD via the research data object’s landing page (specified by either URL or PID), but also harvests any available information on the hosting repository or the dataset itself from external resources. These external resources include established services like re3data, DataCite, the RDA Metadata Standards Catalog or Linked Open Vocabularies. This approach supports the automated evaluation of domain-specific FAIRness by leveraging the advantages of domain-specific over general repositories. For a more detailed description of F-UJI features, please refer to Devaraju & Huber () and Devaraju et al. ().

F-UJI is free to be used by anyone and can either be installed locally () or applied using an online demo version. The software behind the online demo corresponds to the most recent software version available for local installation (R. Huber, (PANGAEA, University of Bremen), pers. comm. (Email), April 2021). Here, we took the most economical approach to applying F-UJI and relied on the assessments of the online demo version. F-UJI takes the URL of the landing page of the resource to be tested as its only input. An assessment completes within a few seconds and the results are displayed in a dashboard-like manner. The overall FAIRness score is given as a percentage, with each metric having equal weight in the calculation.

An evaluator can easily inspect the reasons behind passed or failed tests by clicking on the corresponding icons. The results of an assessment can, however, not be saved online, so an earlier assessment result can only be retraced by re-executing the assessment. Of course, this only makes sense if the F-UJI software stack has not been updated in the meantime – which may indeed happen, since F-UJI is still in development and constantly updated (see Sec. 2.1.6). We saved PDF versions of F-UJI’s output to our local infrastructure and have made them available via the WDCC (). For a more systematic application of F-UJI, a local installation is more beneficial.
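
For such a systematic application against a local installation, assessments can be scripted via F-UJI’s REST interface. The sketch below is an assumption-laden example: the endpoint path, default port, payload keys and credentials are placeholders based on our reading of the F-UJI documentation and may differ between software versions, so they should be checked against the OpenAPI specification shipped with the installed release:

```python
# Hedged sketch for scripting assessments against a locally installed F-UJI
# instance. Endpoint path, port, payload keys and credentials are assumptions;
# verify them against the OpenAPI spec of the installed F-UJI release.
import json

import requests

FUJI_ENDPOINT = "http://localhost:1071/fuji/api/v1/evaluate"  # assumed default port/path
AUTH = ("username", "password")  # placeholder; use the credentials of the local instance

landing_pages = [
    "https://doi.org/10.26050/WDCC/EXAMPLE",  # placeholder identifier
]

for index, identifier in enumerate(landing_pages):
    payload = {"object_identifier": identifier, "use_datacite": True}  # assumed keys
    reply = requests.post(FUJI_ENDPOINT, json=payload, auth=AUTH, timeout=300)
    reply.raise_for_status()
    # Persist the raw result so earlier assessments can be retraced later,
    # ideally together with the software version and evaluation date.
    with open(f"fuji_result_{index}.json", "w") as handle:
        json.dump(reply.json(), handle, indent=2)
```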

2.1.5 WDCC-developed self-assessment along Bahim et al. ()

We constructed our own manual FAIRness evaluation tool by building on earlier in-house efforts to evaluate the FAIRness of the WDCC () and the FAIR metrics recommended in (). By relying on third-party recommendations on FAIR metrics (), the present approach reduces the risk of leaving the evaluation open to individual interpretation – a major problem of manual FAIRness assessments (e.g., ; ). Almost all of the maturity indicators listed in Bahim et al. () were evaluated, regardless of whether they are classified as Essential, Important or Useful, in order to obtain the most complete FAIRness assessment possible (cf. Supplement). We also allow for nuanced answers per maturity indicator where this makes sense, i.e., while some indicators can only fail (0%) or pass (100%), others can attain any value between 0% and 100%. For the final score per evaluated WDCC entry, every FAIR maturity indicator is given equal weight.
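
As a minimal sketch of this scoring scheme (the indicator names and values below are hypothetical and not taken from our actual assessment), the final score is simply the unweighted mean over all per-indicator results:

```python
# Minimal sketch of the equal-weight scoring used in our self-assessment.
# Indicator names and values are hypothetical; binary indicators are either
# 0.0 (fail) or 1.0 (pass), nuanced indicators may take any value in between.
indicator_results = {
    "F1: (meta)data assigned a globally unique identifier": 1.0,
    "R1: plurality of accurate and relevant attributes": 0.75,
    "A1.2: authentication/authorisation procedure documented": 1.0,
}

# Every maturity indicator carries equal weight in the final score.
fair_score = sum(indicator_results.values()) / len(indicator_results)
print(f"FAIR score: {fair_score:.2f}")  # 0.92 for these example values
```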

As with any manual FAIRness evaluation tool (cf. Secs. 2.1.1 and 2.1.3), conducting the evaluation in a trustworthy and useful manner requires a strong background in discipline-specific practices and standards, while also allowing for a high degree of domain specificity. The evaluation results are saved in a spreadsheet on local hardware and made publicly available in conjunction with this publication.

2.1.6 The benefit of contacting the tool authors

In the process of conducting the FAIRness assessments for this study, we inevitably came into contact with the developers to enquire about the usability of a tool for our purposes (CFU, FAIRshake), about unexpected results (FMES, F-UJI) or to recommend enhancements to the user experience (FAIRshake). Especially for FMES and F-UJI, quick turnaround times in email communication resolved issues very efficiently. In both cases, our enquiries led to improvements of the software by revealing bugs in the code or making the evaluation approaches more flexible, such as making the recognition of PIDs in the JSON-LD metadata case insensitive (FMES, M. Wilkinson, pers. comm., April 2021). An example from F-UJI is that the tool now correctly identifies the resource type from information given in the JSON-LD metadata – which leads to one more test being passed (R. Huber, pers. comm., April 2021).

For FAIRshake, we used the tool’s GitHub page to raise issues recommending improvements to the look and feel of the tool as well as the automated test routines. These recommendations were promptly adopted (usually within less than a working day).

2.2 Selection of WDCC entries for evaluation

The WDCC is a domain-specific long-term archiving service focusing on ensuring the long-term reusability of datasets relevant for simulation-based climate science. Therefore, the main focus lies on the preservation of datasets stemming from numerical simulations of Earth’s climate. Additionally, datasets originating from observations, for example, satellite data products, aircraft observations and in-situ measurements, are also preserved in WDCC but make up a relatively small fraction of the total data volume. Datasets preserved in the WDCC are required to comply with domain-specific (meta)data standards and file formats and be accompanied by rich and scientifically relevant metadata so as to ensure long-term reusability.

The total volume of datasets preserved in WDCC amounts to ≈3.1 PetaBytes (PB, August 2021). The largest part is represented by climate model output stemming from globally coordinated model intercomparison efforts like the global Coupled Model Intercomparison Project 5 (CMIP5, Taylor, Stouffer & Meehl, 2012) or regionalisations thereof produced within the Coordinated Regional Climate Downscaling Experiment (CORDEX, ). Those datasets are highly standardised, because global intercomparison studies rely on the efficient reusability of produced data across user communities. Indeed, data reuse is high for these datasets, justifying the standardisation effort (). Smaller holdings archived in WDCC stem from climate modeling or observational projects organised at the project or institutional level (e.g., ; ; ) and from research output forming the basis of academic publications (e.g., ; ).

The degree of data maturity (cf. ) required for archival in WDCC depends on whether or not a DOI is to be assigned to the archived data: data have to fulfill higher technical and scientific quality requirements if a DOI is to be assigned in the archival process (cf. ).

Individual WDCC-archived datasets, that is, files, are stored as parts of larger data collections – an approach broadly adopted in the simulation-based climate science community (e.g., ) which builds on the OAIS (Open Archival Information System, ) framework. In an OAIS, the archived information is organised in Archival Information Packages (AIPs), with two specialised AIP types being the Archival Information Unit (AIU) and the Archival Information Collection (AIC). Broadly speaking, AICs describe a collection of AIUs which are combined in a meaningful way to enable discoverability. AIUs contain metadata describing the actual archived datasets, whereas AICs contain metadata describing the respective collection of AIUs. For better readability, we will refer to AIUs as ‘units/datasets’ and AICs as ‘collections’ for the remainder of this paper.

In the WDCC, data collections are comprised of ‘entries’, that is, AIPs, which follow a strictly hierarchical structure: the topmost level is the ‘project’, followed by the levels ‘experiment’ (collection), ‘dataset_group’ (collection) and ‘dataset’ (unit/dataset) (). Of these, the entry types project and dataset are mandatory, whereas the entry types experiment and dataset_group are used as the organisational backbone of larger collections. At the WDCC, DOIs are assigned at the AIC level only. This is done to i) keep reference lists in publications using WDCC-archived data clear and concise and ii) display the effort put into the creation of a data collection through a single citation, with the aim of elevating the data publication to the level of a paper publication. However, some older data preserved in WDCC also have DOIs assigned at the AIU, that is, the dataset, level (e.g., ).
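
To make this hierarchy explicit, the following simplified sketch models the WDCC entry types; the class and field names are illustrative only and do not represent the actual WDCC data model:

```python
# Simplified, illustrative model of the WDCC entry hierarchy described above;
# class and field names are not the actual WDCC data model.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dataset:          # AIU ('unit/dataset'): describes the actual archived files
    entry_id: str
    doi: Optional[str] = None  # normally no DOI at this level (older entries excepted)


@dataclass
class DatasetGroup:     # AIC, optional organisational level for larger collections
    entry_id: str
    datasets: List[Dataset] = field(default_factory=list)


@dataclass
class Experiment:       # AIC, typically the level at which WDCC assigns a DOI
    entry_id: str
    doi: Optional[str] = None
    dataset_groups: List[DatasetGroup] = field(default_factory=list)
    datasets: List[Dataset] = field(default_factory=list)


@dataclass
class Project:          # topmost, mandatory level
    acronym: str
    experiments: List[Experiment] = field(default_factory=list)
```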

An evaluation of the entire WDCC-archive is evidently out-of-scope as it contains >1.3M datasets, with a total number of 1126 DOIs assigned at the time of writing (August 2021). We have therefore chosen to evaluate a sample of thirteen WDCC-archived AICs (see Table 3), resulting in a total of 32 evaluated AIPs (thirteen experiments, six dataset_groups, thirteen datasets). In the selection of the sample, we aimed at providing a representative assessment across the entire spectrum of WDCC-archived data collections covering various degrees of data maturity while at the same time providing a representative sample in terms of data volume. We evaluated two AICs for two projects (IPCC-AR5_CMIP5 and CliSAP) because data maturity is heterogeneous in these projects. One AIC was evaluated for the remaining nine chosen projects. The evaluation approach is detailed in the next section.

Table 3

WDCC projects selected for evaluation. The project acronyms can be used directly to search for and find the evaluated projects using the WDCC GUI. The project volume in TB (third column) refers to the total volume of the entire project named in the first column. See Peters-von Gehlen & Höck () for details of the evaluated resources.


PROJECT ACRONYM | DATA SUMMARY | PROJECT VOLUME [TB] | DOI ASSIGNED | CREATION DATE | COMMENTS
IPCC-AR5_CMIP5 | Coupled Climate Model Output, prepared following CMIP5 guidelines and basis of the IPCC 5th Assessment Report (2 AICs evaluated) | 1655 | yes and no | 2012-05-31 and 2011-10-10 |
CliSAP | Observational data products from satellite remote sensing (2 AICs evaluated) | 163 | yes and no | 2015-09-15 and 2009-11-12 | one collection with no data access
WASCAL | Dynamically downscaled climate data for West Africa | 73 | yes | 2017-02-23 |
CMIP6_RCM_forcing_MPI-ESM1-2 | Coupled Climate Model output prepared as boundary conditions for regional climate models, following CMIP6 experiment guidelines | 51 | yes | 2020-02-27 |
MILLENNIUM_COSMOS | Coupled Climate Model ensemble simulations covering the last millennium (800–2000 AD) | 47 | no | 2009-05-12 |
IPCC_TAR_ECHAM4/OPYC | Coupled Climate Model Output, prepared to support the IPCC’s 3rd Assessment Report | 2.6 | yes | 2003-01-26 | Experiment and dataset with DOI; first ever DOI assigned to data ()
Storm_Tide_1906_German_Bight | Numerical simulation of the 1906 storm tide in the German Bight | 0.3 | yes | 2020-10-27 |
COPS | Observational data obtained from radar remote sensing during the COPS (Convective and Orographically-Induced Precipitation Study) campaign | 0.2 | yes | 2008-01-28 |
HDCP2-OBS | Observations collected during the HDCP2 (High Definition Clouds and Precipitation for Climate Prediction) project | 0.06 | yes | 2018-09-18 |
OceanRAIN | In-situ, along-track shipboard observations of routinely measured atmospheric and oceanic state parameters over global oceans | 0.01 | yes | 2017-12-13 |
CARIBIC | Observations of atmospheric parameters obtained from commercial aircraft equipped with an instrumentation container | 7.7E-5 | no | 2002-04-27 |

We consider the evaluated AICs (cf. Table 3) as representative of the data maturity level of the entire WDCC project they are associated with, allowing us to extrapolate the results of our evaluation. Doing so, the cumulative data volume of the WDCC projects evaluated here amounts to ≈2 PB (cf. Table 3). The sample is representative of about 65% of WDCC-archived data. The remaining 35% are represented by a large number of smaller AICs for which testing would have been out of scope in the context of this study due to time constraints. The results obtained from the evaluation of our sample thus provide a good indication of overall WDCC FAIRness. We note here that some of the evaluated AICs were archived before the advent of the FAIR principles and therefore represent the long-established WDCC approach to ensuring long-term reusability of archived data collections.

2.3 Evaluation approach

The granularity of data collections archived in the WDCC is chosen to provide the most appropriate level of data organisation for accessibility and reuse (see above). The amount and richness of metadata (contacts, references, parameter lists, quality assessment reports, free-text summary, etc.) differ starkly between the levels of granularity. Therefore, reporting the FAIRness of WDCC-archived data at the level of individual AIUs would not be informative. Hence, we provide results of our assessment at the AIC level, that is, at the level of a WDCC data collection. This is also the only way to do justice to the domain-specific approach of organising climate-science-related simulation-based and observational datasets in larger collections (; ).

In practice, we assessed all AICs presented in Table 3 at the level of their AIUs and, for reporting, averaged the results at the AIC level for all assessment approaches except our self-assessment (Sec. 2.1.5). For that approach, we performed the evaluation directly at the AIC level.

2.4 Achieving comparability among evaluation approaches

The applied FAIRness evaluation tools all use a different number of maturity indicators, which are also distributed differently across the FAIR dimensions. In order to achieve comparability between the assessment approaches, we took a pragmatic approach and simply averaged the results over all maturity indicator tests per approach. We did so because this is what the two automated assessment approaches (F-UJI and FMES) apply anyway. Where necessary, we normalised the results to yield a FAIR score in the range between 0 and 1, indicating a low or high level of FAIRness, respectively.
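
As a minimal sketch of this normalisation step (the tool names are real, the raw values are hypothetical), each tool’s raw output is mapped onto the common 0–1 range before averaging:

```python
# Minimal sketch of normalising heterogeneous tool outputs to a common 0-1
# FAIR score (raw values are hypothetical examples, not actual results).
raw_results = {
    "FMES": (12, 22),         # tests passed out of total tests
    "F-UJI": 58.0,            # overall score given in percent
    "Self-Assessment": 0.77,  # already an unweighted mean in the range 0-1
}


def normalise(value):
    """Map a raw tool result onto the range 0 (not FAIR) to 1 (fully FAIR)."""
    if isinstance(value, tuple):   # passed/total test counts
        passed, total = value
        return passed / total
    if value > 1.0:                # percentage
        return value / 100.0
    return value                   # already in the 0-1 range


fair_scores = {tool: round(normalise(raw), 2) for tool, raw in raw_results.items()}
print(fair_scores)  # {'FMES': 0.55, 'F-UJI': 0.58, 'Self-Assessment': 0.77}
```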

We acknowledge that this way of comparing the results of different FAIRness evaluation tools somewhat distorts them, because the results per FAIR dimension are not equally weighted. However, the main focus of our study is to raise awareness of available FAIRness evaluation tools and to highlight the intricacies associated with applying them. In the end, the results of most tools compare well at the AIC level (see next section).

3 Results

3.1 Mean scores of FAIR assessments

We show the calculated scores obtained from the five FAIRness evaluation tools along with some general statistics in Table 4. The calculated level of FAIRness strongly depends on the assessment method and the evaluated AIC. Overall, we obtain an ensemble mean FAIR score for the WDCC of 0.67, with individual results per applied FAIRness evaluation tool ranging from 0.5 to 0.88. The calculation of the mean FAIR score does not account for any weighting by data volume per AIC. Scores are mostly higher for the manual or hybrid approaches compared to the automated ones. This is mostly because the automated FAIRness evaluation tools include checks on the actual data, which require the evaluated data to be openly accessible to the evaluation tool. Since almost all WDCC-archived data are open and free for use by anyone, but only accessible after authentication, the automated tests requiring data access fail by design. The manual evaluation tools, however, allow for an evaluation of WDCC-archived datasets, as these can be accessed through human intervention (wording taken from ). For automated tools, metadata must be prepared accordingly, for example as JSON-LD, so that it can be evaluated at all. We discuss further aspects behind the differences in FAIRness scores between the applied methods in Section 4.

Table 4

Results of FAIR assessments of WDCC data holdings using the ensemble of FAIRness evaluation tools detailed in Section 2.1. The scores per test are calculated as unweighted mean over all tested FAIR maturity indicators. The mean (∅), standard deviation (σ) and relative standard deviation (σ/∅) on a project basis (three rightmost columns) are calculated across the scores of the five FAIR assessment tools. The mean value representative for the WDCC (∅ (WDCC), last row) is calculated over all values in the respective column of the table. See main text for more details. Results at finer granularity are provided in the supporting data ().


PROJECT ACRONYM | SELF-ASSESSMENT | CFU | FMES | F-UJI | FAIRSHAKE | ∅ PER PROJECT | σ PER PROJECT | σ/∅ PER PROJECT
IPCC-AR5_CMIP5 | 0.84 | 0.72 | 0.44 | 0.58 | 0.95 | 0.71 | 0.20 | 0.29
IPCC-AR5_CMIP5, no DOI | 0.65 | 0.67 | 0.44 | 0.54 | 0.93 | 0.65 | 0.19 | 0.29
CliSAP | 0.86 | 0.78 | 0.48 | 0.58 | 0.97 | 0.73 | 0.20 | 0.28
CliSAP, no data accessible | 0.27 | 0.30 | 0.43 | 0.52 | 0.64 | 0.43 | 0.15 | 0.36
WASCAL | 0.90 | 0.80 | 0.50 | 0.58 | 0.91 | 0.74 | 0.18 | 0.25
CMIP6_RCM_forcing_MPI-ESM1-2 | 0.86 | 0.85 | 0.57 | 0.62 | 0.92 | 0.76 | 0.16 | 0.21
MILLENNIUM_COSMOS | 0.63 | 0.53 | 0.45 | 0.51 | 0.82 | 0.59 | 0.14 | 0.24
IPCC_TAR_ECHAM4/OPYC | 0.82 | 0.63 | 0.50 | 0.64 | 0.89 | 0.70 | 0.16 | 0.23
Storm_Tide_1906_German_Bight | 0.90 | 0.68 | 0.55 | 0.62 | 0.83 | 0.71 | 0.15 | 0.21
COPS | 0.86 | 0.47 | 0.53 | 0.55 | 0.87 | 0.66 | 0.19 | 0.29
HDCP2-OBS | 0.90 | 0.48 | 0.53 | 0.59 | 0.86 | 0.67 | 0.19 | 0.29
OceanRAIN | 0.90 | 0.75 | 0.57 | 0.60 | 0.97 | 0.76 | 0.18 | 0.23
CARIBIC | 0.62 | 0.70 | 0.50 | 0.54 | 0.82 | 0.64 | 0.13 | 0.20
∅ (WDCC) | 0.77 | 0.64 | 0.50 | 0.58 | 0.88 | 0.67 | 0.15 | 0.22

At the AIC-level (column “∅ per project” in Table 4), the spread around the ensemble mean is slightly smaller, ranging from 0.43 to 0.76. AICs with DOI obtain the highest FAIR scores, with an AIC associated with the project CMIP6_RCM_forcing_MPI-ESM1-2, which has a DOI assigned and is comprised of data produced within the framework of the CMIP6 initiative (), scoring highest.

Consequently, AICs having no DOI assigned, such as MILLENNIUM_COSMOS, score lower. The lowest score is determined for one of the CliSAP AICs (CliSAP, no DOI and no data accessible). While that AIC does provide ample metadata on the corresponding WDCC landing pages (cf. Supplement for details on how to find the tested AICs), the data is not accessible because the status of the AIC was never set to ‘completely archived’ by WDCC staff. The lack of data accessibility can in this case only be pinpointed using the manual and hybrid approaches – the automated ones fail to recognise this major shortcoming and therefore cannot be used to capture the actual data curation status. While such curation levels are the exception rather than the rule for the WDCC, we deliberately chose to include an AIC with no accessible data in our evaluation to analyse the entire spectrum of WDCC data curation levels and to check whether the automated tools recognise this.

Summarising this part of our results, we find that all FAIRness evaluation tools can be used to reliably distinguish between various degrees of (meta)data curation of AICs preserved in the WDCC and that for the most part, AICs preserved in the WDCC satisfy the majority of the FAIR maturity indicators addressed by the applied evaluation approaches.

3.2 Agreement between evaluation approaches

Our ensemble approach to FAIRness evaluation also offers the unique opportunity to analyse the consistency between the assessment approaches at the AIC level. To illustrate this, we computed the relative standard deviation, defined as the standard deviation of a sample divided by the mean of the sample (σ/∅), at the AIC level (rightmost column of Table 4) and the cross-correlations between the tools at the WDCC level shown in Table 5.
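
Both statistics can be reproduced directly from the per-tool scores in Table 4; the sketch below does so for the IPCC-AR5_CMIP5 row using numpy, with the sample standard deviation (ddof=1) as reported in Table 4:

```python
# Reproduce the per-project statistics of Table 4 (and, in principle, the
# cross-correlations of Table 5) from the per-tool scores; shown here for
# the IPCC-AR5_CMIP5 row.
import numpy as np

# Scores in the order Self-Assessment, CFU, FMES, F-UJI, FAIRshake (Table 4).
ipcc_ar5_cmip5 = np.array([0.84, 0.72, 0.44, 0.58, 0.95])

mean = ipcc_ar5_cmip5.mean()          # 0.71
sigma = ipcc_ar5_cmip5.std(ddof=1)    # 0.20 (sample standard deviation)
relative_sigma = sigma / mean         # 0.29
print(f"mean={mean:.2f}, sigma={sigma:.2f}, sigma/mean={relative_sigma:.2f}")

# The cross-correlations of Table 5 follow from stacking the score columns of
# all thirteen AICs per tool and computing pairwise correlation coefficients:
# np.corrcoef(scores_per_tool)  # scores_per_tool has shape (5 tools, 13 AICs)
```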

Table 5

Cross-correlations between the scores per project obtained with the five FAIRness evaluation tools (Table 4).


TOOL | SELF-ASSESSMENT | CFU | FMES | F-UJI | FAIRSHAKE
Self-Assessment | n/a | 0.61 | 0.65 | 0.73 | 0.79
CFU | | n/a | 0.36 | 0.50 | 0.78
FMES | | | n/a | 0.65 | 0.30
F-UJI | | | | n/a | 0.49
FAIRshake | | | | | n/a

If the applied FAIRness evaluation tools show a small spread in the determined FAIRness scores for a particular project, they agree and σ/∅ is small. We find the lowest values for datasets having a DOI assigned and being associated with ample machine-readable relevant metadata, that is, CMIP6_RCM_forcing_MPI-ESM1-2 () and Storm_Tide_1906_German_Bight (), or for a dataset with a low level of domain-specific maturity (CARIBIC). At the other end of the spectrum, the FAIRness evaluation tools disagree most for the CliSAP AIC for which no data is accessible – for the reasons discussed in Section 3.1. We provide a more detailed discussion of the differences between test results in Section 4.

The cross-correlations between the applied FAIRness evaluation tools (Table 5) clearly indicate that the level of agreement strongly depends on the applied methodology (manual, hybrid or automated), irrespective of the FAIR dimensions covered per approach (see Section 2.1). Generally, the results of manual or hybrid approaches compare better to each other than to the automated ones. Similarly, the two automated approaches (FMES and F-UJI) compare well. However, there is an exception: the results of our Self-Assessment and the F-UJI tool also compare relatively well.

Summarising this part of our results, we find that at the AIC-level, the five evaluation approaches broadly agree on the level of FAIRness (with one notable exception, see above). At the WDCC-level, we find that the scores obtained from FAIRness evaluation tools taking an identical methodology (manual, hybrid or automated) also compare well to each other. Here, manual and hybrid approaches can be seen as applying the same evaluation methodology (‘human expert knowledge’) as compared to the purely automated tests.

4 Discussion

From the beginning, the FAIR data guiding principles have been defined as being first and foremost applicable to any research discipline (; ), with the definition of FAIRness maturity indicators at the discipline level requiring the effort of domain specialists (). Since consolidation processes on the definition of suitable indicators are still ongoing in the global RDM community, we have put as much focus on discipline-specific aspects in our evaluation of WDCC-preserved (meta)data as possible. Global data sharing and data reuse is an essential part of everyday climate science, and the community has developed and adopted relatively sophisticated (meta)data standards to facilitate reuse (; ; ; ; , ). At WDCC, (meta)data is preserved with a focus on long-term reusability and is therefore required to adhere to these standards to a certain degree – we therefore anticipated a relatively high degree of FAIRness for preserved (meta)data.

In this section, we discuss the domain-specific aspects impacting our analysis of WDCC-FAIRness (Section 4.1) and the differences between and comparability of the different evaluation approaches (Section 4.2). Further, we present lessons learned (Section 4.3) and finish off with recommendations to inform the development and operationalisation of FAIRness evaluation (Section 4.4).

4.1 Data granularity

At WDCC, preserved data is organised in data collections following a strict top-down hierarchy (cf. Section 2.2), where each level in the hierarchy is identified by an entry ID and has its own landing page in the WDCC GUI. Initially, we planned to present results for each hierarchy level of an AIC (cf. Table 3), but realised early in the process that this approach does not reflect the evaluation of domain-specific FAIRness in climate science in general and data curation practice at WDCC in particular. As outlined in Section 2.3, we did in fact test all AIUs of the AICs separately and then computed the average. Because the amount and content of machine-actionable metadata vary starkly between the AIC hierarchy levels, especially the automated evaluation approaches yielded a range of FAIRness scores for the AIUs of a single AIC. For example, F-UJI computed scores of 0.54 and 0.7 at the ‘dataset’ and ‘experiment’ levels, respectively, for CMIP6_RCM_forcing_MPI-ESM1-2. In this case, the DOI is assigned at the experiment level, automatically resulting in a higher score. However, the two entities must not be considered separately: on the one hand, the actual data is not available at the experiment level; on the other hand, the dataset level lacks the contextual information required for reuse. These domain-specific particularities of data granularity can at the moment not be captured with automated FAIRness evaluation tools, but should be considered if FAIRness evaluation and certification become mandatory (see Section 4.4).

4.2 Comparability of test results

The varying capabilities of the different FAIRness evaluation tools became apparent early in our analysis. While the automated approaches (FMES and F-UJI) are useful for evaluating the machine-actionable aspects of preserved (meta)data, they fail to capture the actual curation status of (meta)data preserved in WDCC. We briefly describe four examples illustrating this point:

  • Datasets preserved in WDCC are accessible for free, but only after authentication. The machine-actionable metadata (JSON-LD) contain an indicator regarding data accessibility (‘isAccessibleForFree’: true; see the illustrative sketch after this list). While this is in full compliance with FAIR principle A1.2, the automated tests nevertheless fail. While this result is fully explainable (FMES and F-UJI check for dataset URLs, which are deliberately not included in the JSON-LDs for security reasons), it does reveal a central shortcoming of the automated evaluation approaches and highlights the intricacies of exactly matching the syntax of machine-actionable content required to pass automated tests.
  • In cases where data are actually not available, the information on the availability status of the data is only provided on the landing page and not as part of the machine-readable metadata. Therefore, the automated approaches evaluate these AICs exactly like the other tested WDCC entries (data not accessible, test failed), resulting in FAIRness scores that are too high.
  • Contextual information is practically impossible to evaluate using automated approaches. As the main goal behind providing FAIR data is to foster their reuse, providing adequate references, documentation and provenance information is essential. The machine-readable qualifiers (‘subjectOf’) included in the JSON-LDs lead to associated publications or reports. Once such a reference is detected by an automated evaluation approach, the corresponding test is passed. However, the actual content of the linked reference cannot be checked – it could therefore be completely irrelevant in the context of the evaluated (meta)data. In the context of this study, the AIC HDCP2-OBS represents such a case.
  • By virtue of their intended application, the automated evaluation approaches do not take any information provided on the human-readable landing pages into account. At the WDCC, these often contain ample information about the data, like dataset size and file format. These parameters are not included in the JSON-LD because the corresponding schema.org requirements are vaguely defined.
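
The following abbreviated sketch, written as a Python dictionary with placeholder values, illustrates the kind of schema.org JSON-LD discussed in the first and third points above:

```python
# Abbreviated, illustrative schema.org JSON-LD as embedded on a WDCC landing
# page (values are placeholders, not actual WDCC metadata). The accessibility
# flag is present while direct dataset download URLs are deliberately omitted,
# and 'subjectOf' only links to documentation without conveying its content.
dataset_jsonld = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example WDCC data collection",
    "identifier": "https://doi.org/10.26050/WDCC/EXAMPLE",  # placeholder DOI
    "isAccessibleForFree": True,   # access is free, but requires authentication
    "subjectOf": {                 # pointer to an associated report or publication
        "@type": "CreativeWork",
        "url": "https://example.org/report.pdf",
    },
    # No 'distribution'/'contentUrl' entries: dataset URLs are intentionally
    # not exposed in the machine-actionable metadata, so automated data-access
    # tests fail by design.
}
```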

All of the above points pose no problem to manual or hybrid tools. However, including the ‘human factor’ in the evaluation process may lead to inconsistencies. A further limitation of manual FAIRness evaluation tools is the obvious inability to check for machine-actionability. Since this is an essential component of FAIR data, checking only the human-readable aspects of preserved (meta)data is just as limiting as checking only the machine-actionable aspects. Put in other words, automated FAIRness evaluation tools check for technical FAIRness – or reusability – whereas manual approaches (can) check for contextual/scientific reusability.

A further point worth discussing is the comparability of the different test results. As outlined in Section 2.1, the five FAIRness evaluation tools do not cover the four FAIR dimensions in a comparable manner: FMES puts little focus on R (2 of 22), FAIRshake is dominated by R (5 of 9), F-UJI is dominated by F and R (together 17 of 24) and our own self-assessment following Bahim et al. () puts equal emphasis on all FAIR dimensions and is far more comprehensive than the other approaches (45 tests, compared to 20, 22, 9 and 24 for CFU, FMES, FAIRshake and F-UJI, respectively). Since there exist no recommendations regarding the importance of individual FAIR dimensions – apart from F, which is seen as the single most important principle of the FAIR spectrum to enable data reuse () – and their weighting in an evaluation, we provide simple arithmetic means of the test results. Similar to the ensemble approach applied in simulation based climate science, where the ensemble mean over multiple models is usually a better representation of reality than the simulation of an individual model (), we see an added-value in presenting the mean over all FAIRness evaluation tools as ‘WDCC-FAIRness’ (Table 4) as compared to relying on just a single test. Of course, once FAIRness evaluation becomes standardised and an operational requirement for repositories and archives in order to be regarded as trusted in science, basing a certification on the results of an ensemble of tests is impractical. We therefore hope that the results we present here help the community converge towards standardised, broadly applicable and officially recommended FAIRness evaluation tools.

4.3 Lessons learned

The process of applying five different FAIRness evaluation tools has helped us judge the WDCC preservation practice, critically reflect on our internal workflow, identify avenues for improving the FAIRness of our (meta)data holdings and develop a sound understanding of domain-specific FAIRness in climate science.

  • Machine actionability of archived data need not be the priority for data collections in the climate sciences. The size of datasets archived at WDCC is often O(10²) TB and more. It is simply not practical to include URLs pointing to the actual datasets in the machine-readable metadata, as this may incur both security and bandwidth issues. The WDCC is currently implementing a PID system at the dataset level to increase Findability.
  • Some of the automated tests could have been passed if the information given in the machine-actionable metadata had been as comprehensive as that supplied on the landing pages of archived datasets. One example would be the specification of the file format. At the moment, we do not provide this information in the JSON-LD, because in some cases the actual file format is NetCDF, a standard open file format of the climate science community, but the files are packed as .zip or .tar archives for download. Note, however, that these issues are rather minor and do not reduce the FAIRness of WDCC data holdings per se – including this information would merely increase the FAIR score obtained from the automated evaluation approaches.
  • Archiving climate-science-related data in data collections characterised by a strict top-down hierarchy, which do not have PIDs assigned to every data file, is a main characteristic of the discipline-specific standard procedure for making these data available to the community. Evaluating a collection in its entirety is essential to fully characterise its FAIRness.
  • Reaching out to the developers of the evaluation tools was essential to apply the tools correctly, comprehend the test results and even discover bugs in the tools’ source code. The value of close communication and collaboration between tool developers and those wishing to apply the tools cannot be overstated, and we wish to contribute further to their development and testing in the future.
  • In the process of defining the sample of AICs to be tested, we discovered several in which the data is not available due to shortcomings in the WDCC archival workflow. We are currently sifting through the WDCC data holdings to find and amend these AICs and make the data associated with them available to the community.
  • Applying the manual evaluation approaches is far less straightforward than applying the automated ones. Even if domain and repository experts perform the evaluation, the results may differ because subjectivity cannot be ruled out. One example would be a maturity indicator demanding the provision of dataset and provenance documentation. While supplying links to a third-party online database containing this information would suffice for one evaluator, this might not be the case for another. Therefore, evaluation results obtained by one evaluator should always be reviewed. In this context, the list of FAIR maturity indicators compiled by Bahim et al. () helps to reduce the risk of unconscious bias because it provides very specific guidance for testing.
  • For some AICs, documentation is provided in terms of README files or reports which are archived along with the data. However, these files are hard to find if a user is not familiar with the WDCC and does not know where to look. WDCC efforts to improve the user experience in this regard are underway, providing clearer access to associated documents and working towards community acceptance of the EASYDAB (EArth SYstem DAta Branding, ) concept, which allows users to clearly identify high-quality archived datasets.

4.4 Recommendations for future FAIRness evaluation tools

In the course of our analysis, it became apparent that none of the five applied FAIRness evaluation approaches was entirely fit-for-purpose to evaluate the WDCC data holdings (cf. Sections 4.2 and 4.3), but all of them have individual strengths on which to build future FAIRness evaluation tools. We provide an overview summarising our experiences from applying the five different FAIRness evaluation approaches in Table 6.

Table 6

Summary of the experiences gained from applying the ensemble of different FAIRness evaluation approaches in this study.


ASPECT | AUTOMATED | MANUAL | HYBRID
applied tools | FMES (); F-UJI () | CFU; self-assessment () | FAIRshake ()
application/use of the tool | the tools take the PID/DOI of the resource to be evaluated; if available, selection of appropriate metric sets is critical and requires prior review | completing questionnaires is time intensive and depends on the extent of metrics; expert knowledge is essential | the tools take the PID/DOI of the resource to be evaluated; selection of appropriate metric sets is critical and requires prior review; expert knowledge is required to evaluate contextual reusability; time intensive
preservation of results | results are saved in an online database or are exported (printed) as PDF; local installations store results locally; date of the evaluation has to be manually noted (in the tools evaluated here) | results are saved locally as spreadsheets; date of the evaluation has to be manually noted | results are saved in an online database; date of the evaluation has to be manually noted (using the tool evaluated here)
interpretation of results | detailed information on the applied metrics is available as documentation; if tests fail, the tools provide technical output interpretable by experts; results are provided as a quantitative measure | the form is filled in by a knowledgeable expert, interpretation is thus performed during the evaluation itself; quantification of results depends on evaluator perception | detailed information on the applied automated metrics is available as documentation; manual parts are filled in by a knowledgeable expert, interpretation is thus performed during the evaluation itself; quantification of results partly depends on evaluator perception
reproducibility | results are reproducible as long as the same code version is used | human evaluation is subjective, reproducibility depends on manual documentation of each evaluation | reproducibility of automated parts is given as long as the same code version is used; human evaluation is subjective, reproducibility depends on manual documentation of each evaluation
evaluation of technical reusability/machine actionability | good; tests fail if code specifications are not exactly met | limited; machine actionability cannot be specifically tested; assessment only based on implemented methods/protocols, not their functionality | very good; failed automated tests can be manually amended given that an implementation is present but does not exactly match the test implementation
evaluation of contextual reusability | limited; domain-specific and agreed standardised FAIR metrics are needed | good to excellent; depends on the domain expertise of the evaluator and the time and effort put into the evaluation | good to excellent; depends on the domain expertise of the evaluator and the time and effort put into the evaluation

For future FAIRness evaluation tools, we recommend the development of capable hybrid approaches to capture both the technical and contextual reusability of preserved research data.

For the reasons we elaborated on above, automated FAIRness evaluation tools are very good at testing maturity indicators which allow for binary yes/no answers following a standardised protocol. Of the two approaches used here, F-UJI seems to be more mature and capable than FMES, but still fails to capture the actual curation status of WDCC data holdings. At that point, the manual part of a FAIRness evaluation would take over to reliably judge the contextual reusability of the preserved (meta)data. Our recommendation to include domain experts and not to rely only on automated approaches in the evaluation of FAIRness and general (meta)data quality is also in line with recent work on the same topic following a similar line of argument (; ; ).

In practice, we envision a hybrid approach similar to that of FAIRshake, but substantially more comprehensive. The tool would also include internal databases specifying domain-specific information, like standards, file formats or essential metadata fields specific to the discipline. In this context, the concepts of FMES and FAIRshake, which enable the use of different sets of maturity indicator catalogs, are very promising. Nevertheless, even with highly standardised and accepted metrics in place, subjectivity can never be completely ruled out when humans evaluate the contextual reusability of scientific datasets. With the current rapid advances in machine- and deep-learning research applications, it may just be a matter of time until such approaches are mature enough to provide objective assessments of FAIRness, for example by comparing documentation in text form with the associated numeric data.

5 Summary

In this study, we have applied an ensemble of five different FAIRness evaluation tools to evaluate the FAIRness of (meta)data preserved in the WDCC (World Data Center for Climate). The tools differed in terms of their applied methodology (manual, hybrid or automated evaluation) as well as in the weighting of the individual FAIR dimensions (Findable, Accessible, Interoperable or Reusable) in the evaluation. The research questions of our study were three-fold. First, the results of an earlier self-assessment of WDCC FAIRness () were to be compared to results from available third-party FAIRness evaluation tools and methods, including a further development of our self-assessment approach. Second, we performed a comparative analysis of the results provided by the five tools to identify common strengths and/or weaknesses. Third, we intended to analyse the fitness-for-use of available FAIRness evaluation tools for the purpose of performing a comprehensive assessment of a repository’s (meta)data holdings. Building on the results of our study, the ultimate goals were to determine how WDCC’s preservation guidelines live up to external FAIRness evaluation, to identify possible limitations and shortcomings and to provide recommendations to the global research data management community regarding the further development and application of FAIRness evaluation tools.

Addressing the first research question, we found that our previous self-assessment () yielded a significantly higher level of WDCC-FAIRness (0.9 of 1) compared to the ensemble mean score of 0.67, with a range of 0.5 to 0.88, obtained from the five evaluation approaches applied here. Specifically, our self-assessment of this study, conducted along the recommendations of Bahim et al. (), yielded a lower score (0.77) than the previous one. We attribute this difference to the more comprehensive and objective evaluation presented in this paper. The web resource detailing WDCC FAIRness will be updated accordingly.

Regarding the second research question, we found that tools involving manual assessment yield higher FAIRness scores than automated tools. This is because the automated approaches cannot be used to assess the contextual reusability of preserved (meta)data. As data in WDCC is preserved with a focus on long-term reusability, it is usually accompanied by rich metadata providing, for example, documentation and provenance information (; WDCC, 2016) – an aspect which can only be adequately evaluated manually by a domain and/or repository expert. Further, lower FAIRness scores obtained from automated tools result from inaccessible data (WDCC data is only accessible after login, but free of charge) or missing information in the machine-actionable metadata provided by the WDCC. We are in the process of increasing the information content of those metadata. Further, the applied evaluation tools compare well at the data collection level if similar evaluation methodologies (manual, hybrid or automated) are used. An exception to this rule is the particularly good agreement between results from the automated F-UJI tool () and our own self-assessment based on Bahim et al. (). At the data collection level, we confirmed that a high level of (meta)data maturity () also directly translates into high FAIR scores (and vice versa) across all FAIRness evaluation tools.

Regarding the third research question, we concluded that none of the five applied FAIRness evaluation tools provides a completely satisfactory evaluation experience by itself, because manual and automated approaches lack the capacity to quantify the machine-actionable and contextual reusability of archived data, respectively. The hybrid methodology applied in FAIRshake () is most promising in this regard as it merges the two approaches, but it lacked comprehensiveness in the setup we applied here.

Finally, we recommend focusing the development, application and operationalisation of future FAIRness evaluations on hybrid methodologies featuring a capable and comprehensive automated part and a contextual part evaluated by a domain and/or repository expert. Our recommendation is in line with that of other recent studies (; ; ). We further strongly recommend that any part of a FAIRness evaluation be subject to scrutiny by expert reviewers.

With the ever-increasing demand for archives and repositories to showcase their FAIRness, we see our results and recommendations as a step forward in effectively consolidating efforts to develop and provide the most fit-for-purpose tools for evaluating the discipline-specific FAIRness of digital objects.

Reproducibility

The data and methods underlying this study are made publicly available via the WDCC (; ) and can be used to comprehend and reproduce the results presented here.