In recent years, the digital storage of data objects (data and metadata) has faced new challenges in most scientific fields. Large projects are mostly global collaborations and require an increasing orientation towards international standards. In parallel, data storage techniques have shifted to federated, distributed storage systems such as the Earth System Grid Federation (ESGF) for climate model data. For long-term archival (LTA), on the other hand, communities, funders, and data users make stronger demands on data and metadata quality to facilitate data use and re-use. Thus, for the efficient re-use of data objects, the metadata should contain the maximum possible information for judging the data’s fitness for use from the user’s view – for the intended user as well as for any not yet known re-users.
For the assessment of data objects, stakeholders from academia, industry, funding agencies, and scholarly publishers have formally defined and endorsed a set of FAIR Data Principles (Wilkinson et al. 2016).
At the same time, there was growing interest among scientists, journals, and other parts of the community in assessing the quality of different approaches to research data management (RDM). This led to a rising importance of RDM assessment systems, for which several layouts of maturity matrices have already been developed and some of them published (Crowston & Qin 2012, Peng et al. 2015). Regarding repositories, for example, the CoreTrustSeal certification is a standardised assessment. The World Data System (WDS) of the International Science Council (ISC) requires this certificate as a condition of membership. Additionally, applicants need to show their compliance with the WDS’ strong commitment to “open data sharing, data and service quality, and data preservation”.
These two assessment types, for data objects and for RDM techniques, have various overlaps, as the assessment of a repository is not independent of the quality of its data. The assessment of data, on the other hand, depends on the procedure of data curation which can be evaluated with regard to several criteria.
These curation criteria, in turn, may well differ for different data objects in the storage of one single repository serving different scientific communities. This was already pointed out by Treloar & Harboe-Ree (2008), who described this situation systematically by identifying eight different curation criteria in which the data undergo changes during their maturation process. As these changes mostly represent an evolutionary process, Treloar & Harboe-Ree refer to Curation Continua. Furthermore, they divided a data object’s evolution into the three phases Private Research, Shared Research, and Public.
However, the term maturity is problematic in a different respect as was discussed by Cox et al. (2017): ‘It might be taken to imply a single development path leading to a fixed mature finishing place. This is not normally the case. Also, terms like immature or underdeveloped, sometimes associated with maturity models might be seen as pejorative.’ This issue can be adopted for the term quality, as well.
An interaction between Treloar’s domain model and the maturity matrix (see Table 1 below) may improve this situation because the workflow does not terminate at impact/re-use. Maturity and quality depend on the phase, and each phase has its own options for maturity and quality. For example, the persistence of data differs between the private production domain, with local storage, and the mostly public long-term phase in a long-term archive.
Table 1
Assignment of the DKRZ data dissemination system to the domains as described by Treloar & Harboe-Ree (2008).
| Domain | Phase | DKRZ system |
|---|---|---|
| Research preparation phase | concept generation | data management (DM) planning tool RDMO |
| Private Research | production/processing | DKRZ storage on hard disc and tape (HPSS) |
| Shared Research | project collaboration, intended use | ESGF, globally distributed project repository |
| Public | long-term archiving, impact, re-use | Long-term Archive |
This shows that ‘data products produced by the same organization are often in various levels of maturity in terms of their data quality, accessibility, and usability as well as the states of completeness of data quality metadata and documentation’ (Peng et al. 2015).
Peng et al. (2016) describe the requirements and challenges for the Public Domain: ‘as data are increasingly treated as valuable assets for decision-makers, decision support based on fast data analysis has made ensuring data quality a critical but challenging task. Therefore, having tools available is not just helpful but a necessity for effectively stewarding and serving digital scientific data. Those tools allow data and scientific stewards to effectively capture, describe, and convey data quality information’. Other helpful tools allow users to view data products before they request aggregated or subsetted data for their specific applications, or automatically select metadata from file headers.
However, any type of tool and other forms of automated processing require rich metadata – partly for technical reasons, partly to bring a complete picture of the data into the user’s view. Metadata is therefore an important criterion for data quality in terms of data re-use.
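As an illustration of such header-based metadata harvesting, the sketch below selects discovery-relevant global attributes from a file header. The header is a plain dict standing in for the global attributes of a netCDF file, and the attribute selection and example values are assumptions for illustration; in practice a library such as netCDF4 or xarray would read them.

```python
# CF conventions recommend descriptive global attributes such as these;
# tools can harvest them from file headers as discovery metadata.
CF_DISCOVERY_ATTRIBUTES = ("title", "institution", "source", "history", "Conventions")

def harvest_metadata(header: dict) -> dict:
    """Select the discovery-relevant attributes present in a file header."""
    return {key: header[key] for key in CF_DISCOVERY_ATTRIBUTES if key in header}

# Hypothetical header contents for illustration only.
header = {
    "title": "Surface air temperature, model run X",
    "institution": "Example institute",
    "Conventions": "CF-1.8",
    "grid_resolution": "1.0 degree",  # not a discovery attribute, ignored
}
print(harvest_metadata(header))
```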
Additionally, during a project’s runtime in shared research, scientists want to use the data products at the earliest stage possible and often content themselves with poor information on the data, i.e. with poor metadata quality. To cover these needs, some projects might want to publish the data in an early and not yet fully documented state.
For the Shared Research Domain and Public Domain, it is important to differentiate between the intended use and the actual re-use of the data because this reflects that with increases in maturity, the data curation processes become more refined, institutionalized, and standardized (Crowston & Qin 2012). The data producers/providers have the knowledge to deal with their raw data. So the need for more standardisation mainly arises from the requirements of the re-use.
At the German Climate Computing Centre (Deutsches Klimarechenzentrum, DKRZ), this situation has led to a twofold dissemination system for the data, which follows the internal storage on project hard disks. The ESGF data dissemination system focuses on the needs of users as partners in globally running projects who want to access the data during the project collaboration – this is the intended use phase. The DKRZ digital long-term archive (DKRZ-LTA), in contrast, aims for long-term data holding and data re-use far beyond the project runtime. This requires high generic standards for metadata quality.
In this paper we describe the development of a Maturity Matrix assessment for the quality of data and metadata (Chapters 2 and 3). We present the different criteria used at DKRZ to rate the maturity of data during data curation (Chapter 4).
In addition to using maturity matrices for RDM services and capabilities, this technique also has been applied to other areas as pointed out by Cox et al. (2017). Examples are software engineering (Paulk et al. 1993), digital preservation (Kenney & McGovern 2003), and data intensive research (Lyon et al. 2012). Maturity models have also been applied in the RDM space, within institutions (ANDS 2011), and within research projects (Crowston & Qin 2012). In Peng (2018) the reader finds a good overview of the current state of assessing the maturity of stewardship of digital scientific data.
In 2015 we took the initiative to apply the Quality Maturity Matrix (QMM) technique to implement quality assessment for the digital storage workflow of climate model data at DKRZ. The first application field for implementing the QMM was the World Data Center for Climate (WDCC) sector of the DKRZ-LTA. We decided to use the System Maturity Matrix (SMM) of the German Weather Service (DWD) and EUMETSAT, which collaborate on climate-related data products. The DWD in turn cooperates with us in many climate modelling projects. So the SMM became the starting point of the QMM development at DKRZ.
For the DKRZ QMM, the quality of the repository itself was not considered, e.g., persistency of access and physical reliability of long-term storage. This should be covered by repository certifications such as the CoreTrustSeal. We removed from our QMM scheme those aspects that are contained in published stewardship maturity matrices, as for example described by Peng et al. (2015), and focus on the quality of the data objects rather than of their stewardship. This reflects that the QMM is intended to be used purely for data objects. It can be used by anyone interested in the quality of data objects and is not limited to science and research.
Peng (2018) provides us with the motivation to develop a Maturity Matrix for data stewardship and other processes, using the maturity model description: ‘A maturity model is considered as a desired or anticipated evolution from a more ad hoc approach to a more managed process. It is usually defined in discrete stages for evaluating maturity of organizations or process’. This can be expanded to similar techniques for the assessment of data objects like the DKRZ QMM.
This motivation led us to the following points for the development of a Maturity Matrix for data objects:
The two dimensions of the QMM are the levels and criteria/aspects. The levels and their characteristics are given in Figure 1. The QMM levels are numbered 1 to 5. The QMM criteria are consistency, completeness, accessibility, and accuracy; these are described fully in section 4 below.
Figure 1
Characteristics of data and metadata quality assurance maturity levels. The QMM levels correspond a) to different steps of the data production workflow and b) to the five data production phases with their standardisation characteristics and increasing degrees of formalisation.
The criteria are developed to support the phases of the data production: concept, production/processing, project collaboration/intended use, long-term archiving and impact/re-use.
In this context, support means the description of experiential knowledge that makes sense in the context of the data production steps and helps to reach the next level. This is the coarse outline of the levels. Data in the production phase can reach level 5 if the criteria of the aspects are fulfilled.
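As a minimal illustration (not part of the QMM specification itself), the five levels and the data production phases they correspond to can be written down as a small lookup:

```python
# The five QMM levels mapped to the data production phases named in the text.
QMM_LEVELS = {
    1: "concept",
    2: "production/processing",
    3: "project collaboration/intended use",
    4: "long-term archiving",
    5: "impact/re-use",
}

def phase_of(level: int) -> str:
    """Return the data production phase corresponding to a QMM level."""
    if level not in QMM_LEVELS:
        raise ValueError(f"QMM levels range from 1 to 5, got {level}")
    return QMM_LEVELS[level]

print(phase_of(4))  # -> long-term archiving
```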
In Figure 1 the level colouring of the QMM levels changes from red for making a concept for the data to green for data of the highest degree of maturation.
The starting point of the QMM development at DKRZ was the DWD/EUMETSAT System Maturity Matrix (SMM), which is used for monitoring the process of generating Climate Data Records for satellite data (Bates & Privette 2012).
The monitoring aspect of the SMM is missing in the QMM approach because software readiness (an SMM criterion covering, e.g., coding standards) does not change after the climate model data and associated data (observations) have been transferred to the long-term archive of the WDCC. The documentation of the methods can still make progress, but the software itself cannot (Table 2).
Table 2
Comparison of the SMM criteria and their QMM counterparts.
| SMM | QMM |
|---|---|
| Software Readiness | Omitted: the data object is considered persistent. Software development would lead to new data objects; only the software documentation remains relevant, and that is part of the metadata provenance. |
| Metadata | Criterion: Completeness, Aspect: Existence of Metadata |
| User Documentation | Criterion: Completeness, Aspect: Existence of Metadata |
| Uncertainty Characterisation | Criterion: Accuracy |
| Public Access/Feedback/Update | Criterion: Accessibility; Criterion: Completeness, Aspect: Existence of Metadata (level 5: a data provenance chain exists, including internal and external objects, e.g. software, articles, method and workflow descriptions); Criterion: Consistency, Aspect: Versioning and Controlled Vocabularies (CVs) |
| Usage | Omitted: we use the ISO 19157 explanation of data usability, which depends on the ‘particular application’. From this point of view, an evaluation of usage is not possible. |
Other modification aspects from SMM to QMM definition are:
Most of these modifications were included in the first draft of the DKRZ QMM, as we reported in a presentation (Höck et al. 2015).
In addition, we adapted the relevant terms to the reference model of the Open Archival Information System (OAIS, Figure 2) and implemented the OAIS Preservation Description Information (PDI, Figure 3) as obligatory where applicable. The latter should be a minimum set of metadata in the long-term archive, accompanying the Content Data Object (CDO).
Figure 2
OAIS Reference Model information packages in the different phases of the QMM process, showing the submission (SIP), archival (AIP), and dissemination (DIP) information packages.
Figure 3
DKRZ Long Term Archive – example of minimum metadata (PDI), following the OAIS reference model.
For the Quality Maturity Matrix we consider the four criteria consistency, completeness, accessibility, and accuracy. Each of these criteria is subdivided into aspects; for example, for Completeness the aspects are ‘Existence of Data’ and ‘Existence of Metadata’, as shown in Table 3.
Table 3
Overview of the QMM quality criteria and sub-criteria (aspects).
| Criterion | Aspect |
|---|---|
| Consistency | Data Organisation and Data Object |
| Consistency | Versioning and Controlled Vocabularies (CVs) |
| Consistency | Data-Metadata Consistency |
| Completeness | Existence of Metadata |
| Completeness | Existence of Data |
| Accessibility | Metadata Access by Identifier |
| Accessibility | Data Access by Identifier |
| Accuracy | Plausibility |
| Accuracy | Statistical Anomalies |
One of the ways to obtain the best possible re-use (and impact) of data objects is to make data FAIR. In this respect we are guided by the interpretation of Mons et al. (2017) for the European Open Science Cloud (EOSC): ‘…as long as such data are clearly associated with FAIR metadata, we would consider them fully participating in the FAIR ecosystem.’
All in all, the FAIR mission statement consists of 15 aspects. With the QMM, one can assess to which degree the FAIR Data Principles are fulfilled for a data object and whether the data can therefore be marked as FAIR.
As the criteria and the levels of the QMM represent a matrix, and also for space reasons, a presentation in tabular form was chosen for the following subsections. In the four tables (Tables 4, 5, 6, 7), we give an overview of the different factors relevant for the four criteria and their aspects. Connections between the QMM and the FAIR Data Principles of Wilkinson et al. (2016) are marked with the principles’ identifiers (e.g. F1, I2, R1.2).
Table 4
QMM criterion consistency.
Aspect: Data Organisation and Data Object

Level 1: conceptual development.
Level 2: the data organisation is structured and conforms to internal rules, informally documented; SIPs are consistent with the internal rules.
Level 3: the data organisation conforms to the project specification; SIPs correspond to project requirements; data formats (Content Data Object, OAIS) correspond to project requirements; data sizes are consistent; file extensions are consistent.
Level 4 (R1.2): the data organisation conforms to well-defined rules, e.g. discipline-specific standards and long-term archive requirements (OAIS Package Info – binds); AIPs conform to well-defined rules (I1, I2); DIPs are fully machine-readable, with references to sources; data formats conform to well-defined rules (I1).
Level 5: the data organisation conforms to interdisciplinary standards; AIPs conform to interdisciplinary standards and are up to date and consistent with external scientific objects if feasible; DIP datasets are self-describing (I1); data formats conform to interdisciplinary standards.

Aspect: Versioning and Controlled Vocabularies (CVs)

Level 1: conceptual development.
Level 2: versioning follows internal rules, informally documented; data are labelled with informal CVs if feasible.
Level 3: versioning is systematic and corresponds to project requirements; data are labelled with formal, project-defined CVs if feasible.
Level 4: versioning is a systematic collection including the documentation of enhancements and conforms to well-defined rules; old versions are stored if feasible; in case new versions are published, the documentation is consistent with previous versions; data are labelled with CVs conforming to discipline-specific standards (I1, I2).
Level 5: data are labelled with CVs conforming to interdisciplinary standards.

Aspect: Data-Metadata Consistency

Level 1: not evaluated.
Level 2: the OAIS metadata components are consistent; PDI components: Provenance unsystematically documented; Reference: creators.
Level 3: PDI components: Provenance basically documented; Reference: creators and contact; Descriptive Information: naming conventions for discovery (find and search).
Level 4: the complete PDI* is consistent: Provenance, Context, Reference (including cross references), Fixity, Access Rights and Representation Information, Descriptive Information, Package Info.
Level 5: external metadata and data are consistent (I3).

*Maintenance and storage policy are not affected, since they belong to the repository certification.
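A check of the ‘data labelled with CVs’ items of the consistency criterion can be sketched as a simple set comparison. The vocabulary entries and labels below are invented for illustration; a real check would use, e.g., the CF standard name table or project-defined CVs.

```python
def nonconforming_labels(labels, controlled_vocabulary):
    """Return, sorted, the labels that are not part of the controlled vocabulary."""
    return sorted(set(labels) - set(controlled_vocabulary))

# Hypothetical CV excerpt and dataset labels for illustration.
cv = {"air_temperature", "precipitation_flux", "sea_water_salinity"}
labels = ["air_temperature", "temp2m"]
print(nonconforming_labels(labels, cv))  # -> ['temp2m']
```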
Table 5
QMM criterion completeness.
Aspect: Existence of Data (Completeness and Persistence)

Level 1: not evaluated.
Level 2: data is in production and may be deleted or overwritten.
Level 3: datasets exist but are not complete; they may be deleted but not overwritten unless explicitly specified.
Level 4 (R1.2): data entities conforming to discipline-specific standards are complete; dynamic datasets (data streams) are not affected; the number of datasets (aggregation) is consistent; data are persistent as long as the expiration date requires.
Level 5: as level 4, but data entities conform to interdisciplinary standards.

Aspect: Existence of Metadata

Level 1: not evaluated.
Level 2: the OAIS metadata components exist; PDI components: Provenance unsystematically documented; Reference: creators.
Level 3: PDI components: Provenance basically documented; Reference: creators and contact; Descriptive Information: naming conventions for discovery (find and search).
Level 4 (F2, R1): complete PDI*: Provenance (R1.2), Context, Reference, Fixity, Access Rights and Representation Information, Descriptive Information (R1.1), Package Info (F4).
Level 5: metadata conform to interdisciplinary standards; a data provenance chain exists, including internal and external objects, e.g. software, articles, method and workflow descriptions.

*Maintenance and storage policy are not affected, since they belong to the repository certification.
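The ‘Existence of Metadata’ aspect lends itself to a simple automated check. The sketch below verifies that the PDI components required for level 4 are present in a metadata record; the record is a hypothetical dict, with component names taken from the PDI list above.

```python
# OAIS PDI components required for QMM level 4 of "Existence of Metadata".
LEVEL4_PDI_COMPONENTS = (
    "Provenance", "Context", "Reference", "Fixity",
    "Access Rights", "Representation Information",
    "Descriptive Information", "Package Info",
)

def missing_pdi_components(metadata: dict) -> list:
    """Return the required PDI components that are absent or empty."""
    return [c for c in LEVEL4_PDI_COMPONENTS if not metadata.get(c)]

# Hypothetical, incomplete metadata record for illustration.
record = {"Provenance": "model run documented", "Reference": "creators, contact"}
print(missing_pdi_components(record))
```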
Table 6
QMM criterion accessibility.
Aspect: Data Access by Identifier

Level 1: not evaluated.
Level 2: data is accessible by file names.
Level 3: data is accessible by an internal unique identifier corresponding to project requirements.
Level 4 (R1.2): data is accessible by a permanent identifier, with expiration documented (OAIS Package Info – identifies); datasets have an expiration date and are accessible for at least 10 years (conforming to the rules of good scientific practice).
Level 5 (F1, A1): data is accessible by a globally resolvable identifier (PID, persistent identifier), registered with resolution to data access including backup, where it is commonly accepted that the identifier is persistently resolvable at least to information about the fate of the object; data is accessible within other data infrastructures, including cross references.
In addition: checksums are correct; checksums are accessible; a bijective mapping between identifier and datasets is documented, e.g. in the data header (OAIS Package Info – binds, identifies).

Aspect: Metadata Access by Identifier

Level 1: not evaluated.
Level 2: not specified.
Level 3: metadata is accessible by an internal unique identifier corresponding to project requirements.
Level 4: metadata is accessible by a permanent identifier, with expiration documented (F4; OAIS Package Info – identifies); the complete data citation is persistent.
Level 5 (F1, A1): metadata is accessible by a globally resolvable identifier including backup; the complete data citation is persistent; external PID references are supported (I3).
In addition: a mapping between the data access identifier and the metadata access identifier is implemented (OAIS Package Info relates Content Info and PDI).
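The checksum items of the accessibility criterion can be sketched as a fixity check: a stored checksum must exist for the identifier and must match the data. The identifier scheme and the register below are assumptions for illustration; file contents are simulated as bytes.

```python
import hashlib

def sha256_checksum(data: bytes) -> str:
    """Compute the SHA-256 checksum of the given bytes."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical identifier-to-checksum register, as it could accompany
# the archived datasets.
register = {"hdl:example/ds-001": sha256_checksum(b"model output, v1")}

def fixity_ok(identifier: str, data: bytes) -> bool:
    """True if a stored checksum exists for the identifier and matches the data."""
    stored = register.get(identifier)
    return stored is not None and stored == sha256_checksum(data)

print(fixity_ok("hdl:example/ds-001", b"model output, v1"))  # -> True
print(fixity_ok("hdl:example/ds-001", b"model output, v2"))  # -> False
```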
Table 7
QMM criterion accuracy.
Aspect: Plausibility

Level 1: not evaluated. With increasing level:
- a documented procedure about technical sources of errors and deviation/inaccuracy exists (R1); data header and content are consistent;
- a documented procedure about methodological sources of errors and deviation/inaccuracy exists (R1); a documented procedure with validation against independent data exists; references to evaluation results (data) and methods exist (R1).

Aspect: Statistical Anomalies

Level 1: not evaluated. With increasing level:
- missing values are indicated, e.g. with fill values (R1);
- a documented procedure of statistical quality control is available (R1);
- scientific consistency among multiple datasets and their relationships is documented if feasible.
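The two accuracy aspects can be illustrated with a minimal screening routine: missing values are indicated by a fill value, and the remaining values are checked against a plausible range. The fill value and the range (near-surface air temperature in kelvin) are assumptions for illustration only.

```python
# Hypothetical fill value used to indicate missing data.
FILL_VALUE = -9999.0

def implausible_values(values, lower=180.0, upper=340.0):
    """Return values outside the plausible range, ignoring fill values."""
    return [v for v in values if v != FILL_VALUE and not lower <= v <= upper]

# 1288.2 could stem from, e.g., a unit error; the fill value is skipped.
series = [288.2, FILL_VALUE, 291.0, 1288.2]
print(implausible_values(series))  # -> [1288.2]
```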
The evaluation planning and implementation take the feasibility of the evaluation for the specific level into account.
At DKRZ, we first identify which QMM level a data object has reached when it is submitted to us. For a more detailed level evaluation, implementation check lists are provided (Höck 2019a) to assess whether or not the criteria are fulfilled for the specific level.
Once the positive outcome of the level evaluation is confirmed, we offer the user guidance on enhancing the data object to the next level of maturity.
For the implementation of the evaluation process at the WDCC, the submission process has so far been analysed for model data. The WDCC adds some check points to those it normally carries out in the workflow of the data object submission process. We found that these check points are sufficient to reach at least QMM level 4 (Höck 2019b).
This corresponds to the FAIR presentation of the (meta)data at the WDCC LTA (DKRZ-User Portal 2019). The (meta)data is FAIR, with the exception of guiding topic I3 (Wilkinson et al. 2016), under the sufficient but not necessary conditions that the (meta)data status is ‘completely archived’ (the long-term archiving process of data and metadata is finished), DataCite DOI(s) are assigned, and the data format is netCDF CF or GRIB.
The FAIR Data Principles do not cover the persistence of the data. In the QMM, however, persistence is included under the aspect Existence of Data (Completeness and Persistence).
It is recommended to assign levels at aspect granularity and not at the level of sub-aspects such as data formats – Content Data Object (OAIS), because netCDF CF is a discipline-specific standard (level 4) whereas netCDF is an interdisciplinary standard (level 5). To rule out that less stringent requirements lead to a higher level, the entire aspect must be fulfilled. Several quality results for different data quality aspects can be aggregated to the associated criterion if all aspects have reached the corresponding level (Table 3).
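The aggregation rule described above (a criterion reaches a level only if all of its aspects have reached it) amounts to taking the minimum over the aspect levels:

```python
def criterion_level(aspect_levels: dict) -> int:
    """Aggregate per-aspect QMM levels (1-5) to the level of the criterion.

    The criterion only reaches a level once every aspect has reached it,
    i.e. the criterion level is the minimum of the aspect levels.
    """
    return min(aspect_levels.values())

# Example: the Completeness criterion with its two aspects (Table 3).
completeness = {"Existence of Data": 5, "Existence of Metadata": 4}
print(criterion_level(completeness))  # -> 4
```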
The evaluation process was carried out as an exemplar at the DKRZ-LTA. The corresponding protocol is available online.
The DKRZ Long Term Archive stores Earth System data with a strong focus on climate model data. The Quality Maturity Matrix described here has been developed especially for the latter. However, it can easily be adapted to other data types such as satellite or in-situ data.
The aim of this data assessment by QMM is to give the data user the opportunity to appraise the Fitness for Use of the data objects. Metrics that represent this for the data records are useful for this purpose. The QMM described here should additionally provide clues for the improvement of the working method. With the QMM the data user is given an idea of how far the disseminated data follow the FAIR Data Principles and other standards and recommendations. The QMM goes beyond the FAIR Data Principles in the field of data persistence, which is of particular interest for archives and their users.
AIP: Archival Information Package (CCSDS 2012)
CDO: Content Data Object (CCSDS 2012)
CV: Controlled Vocabulary
DIP: Dissemination Information Package (CCSDS 2012)
DKRZ: Deutsches Klimarechenzentrum (German Climate Computing Center)
DOI: Digital Object Identifier (see https://datacite.org/)
DM: data management
DWD/EUMETSAT: Deutscher Wetterdienst and European Organisation for the Exploitation of Meteorological Satellites collaboration
EOSC: European Open Science Cloud
ESGF: Earth System Grid Federation
FAIR: Findable, Accessible, Interoperable, Reusable
WMO GRIB: World Meteorological Organization GRIdded Binary
HPSS: High Performance Storage System
ISC: International Science Council
LTA: long term archival storage
NetCDF CF: Network Common Data Form Climate and Forecast
OAIS (CCSDS): Open Archival Information System (The Consultative Committee for Space Data Systems)
PID: Persistent IDentifier
PDI: Preservation Description Information (CCSDS 2012)
QMM: Quality Maturity Matrix
RDM: Research Data Management
RDMO: Research Data Management Organiser
SIP: Submission Information Package (CCSDS 2012)
SMM: System Maturity Matrix
WDCC: World Data Center for Climate
WDS: World Data System
The authors wish to thank Michael Lautenschlager and Martina Stockhause for their assistance in developing the Quality Maturity Matrix.
The authors have no competing interests to declare.
The authors have been providing data management services to the climate research community for over 20 years. With backgrounds in earth sciences, informatics, mathematics, astronomy, legal issues, and database administration, they bring a broad range of expertise to the challenges of research data management.
ANDS. 2011. Research data management framework: Capability maturity guide. Melbourne: Australian National Data Service. Available at https://docplayer.net/15343597-Research-data-management-framework-capability-maturity-guide.html.
Bates, JJ and Privette, JL. 2012. A maturity model for assessing the completeness of climate data records, EOS. Transactions of the AGU, 93(44): 441. DOI: https://doi.org/10.1029/2012EO440006
CCSDS. 2012. Reference Model for an Open Archival Information System (OAIS), Recommended Practice, CCSDS 650.0-M-2 (Magenta Book), Issue 2. Available at https://public.ccsds.org/pubs/650x0m2.pdf [Last accessed 23 May 2018].
Cox, AM, et al. 2017. Developments in research data management in academic libraries: Towards an understanding of research data service maturity. Journal of the Association for Information Science and Technology, 68(9): 2182–2200. DOI: https://doi.org/10.1002/asi.23781
Crowston, K and Qin, J. 2012. A capability maturity model for scientific data management: Evidence from the literature. Proceedings of the American Society for Information Science and Technology, 48(1): 1–9. DOI: https://doi.org/10.1002/meet.2011.14504801036
DKRZ-User Portal. 2019. FAIRness of DKRZ’s LTA WDCC service. Hamburg, Germany: DKRZ. Available at https://www.dkrz.de/up/services/data-management/LTA/fairness [Last accessed 18 Aug 2020].
Höck, H, et al. 2015. Maturity Matrices for Quality of Model- and Observation-Based Data Records in Climate Science. Available at https://meetingorganizer.copernicus.org/EGU2015/EGU2015-10158-1.pdf.
Höck, H. 2019a. Technical Report Quality Maturity Matrix (QMM) Checklist. Hamburg, Germany: WDCC. DOI: https://doi.org/10.2312/WDCC/TR_QMM_Checklist.
Höck, H. 2019b. QC Checklist QMM Level 4 and 5 with Protocols at DKRZ-LTA. Hamburg, Germany: WDCC. DOI: https://doi.org/10.2312/WDCC/TR_QMM_Checkl_Levels_4-5_Prots.
ISO 19157:2013-12. Geographic information – Data quality (ISO 19157:2013(E)).
Kenney, AR and McGovern, NY. 2003. The five organizational stages of digital preservation. In Hodges, P, Sandler, M, Bonn, M and Wilkin, JP. (eds.), Digital libraries: A vision for the 21st century. Ann Arbor, MI: University of Michigan Scholarly Publishing Office. Available at http://hdl.handle.net/2027/spo.bbv9812.0001.001.
Lyon, L, et al. 2012. Developing a community capability model framework for data-intensive research. In iPres 2012. Proceedings of the Ninth International Conference on the Preservation of Digital Objects (pp. 9–16). Available at: https://ipres.ischool.utoronto.ca/sites/ipres.ischool.utoronto.ca/files/iPres%202012%20Conference%20Proceedings%20Final.pdf [Last accessed 15 Aug 2020].
Mons, B, et al. 2017. Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science Cloud. DOI: https://doi.org/10.3233/ISU-170824
Paulk, MC, et al. 1993. Capability maturity model, Version 1.1. IEEE Software, 10(4): 18–27. DOI: https://doi.org/10.1109/52.219617
Peng, G. 2018. The state of assessing data stewardship maturity – An overview. Data Science Journal, 17: Article 7. DOI: https://doi.org/10.5334/dsj-2018-007
Peng, G, et al. 2015. A unified framework for measuring stewardship practices applied to digital environmental datasets. Data Science Journal, 13: 231–253. DOI: https://doi.org/10.2481/dsj.14-049
Peng, G, et al. 2016. Scientific stewardship in the open data and big data era — Roles and responsibilities of stewards and other major product stakeholders. D-Lib Magazine, 22(5/6). DOI: https://doi.org/10.1045/may2016-peng
Treloar, AE and Harboe-Ree, C. 2008. Data management and the curation continuum: how the Monash experience is informing repository relationships. Available at: https://bridges.monash.edu/articles/Data_management_and_the_curation_continuum_how_the_Monash_experience_is_informing_repository_relationships/5627773 [Last accessed 18 Aug 2020].
Wilkinson, M, et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data, 3: 160018. DOI: https://doi.org/10.1038/sdata.2016.18