1. Needs for Curating and Sharing Dataset Quality Information

Knowledge about the quality of data and metadata at the individual dataset level, such as their accuracy, completeness, timeliness, and provenance, is imperative for establishing trust and supporting informed decisions on and accurate (re)use of data (). To support effective decision- and policy-making processes, dataset quality information, such as information about the state of data, metadata, documentation, software, workflows, and tools used for producing and managing the data, and about how the data are being curated and serviced, should be consistently captured in the metadata and be a part of the ecosystem that supports open science.

Assessment of data quality is key for ensuring that the available data and information are credible and such assessments are essential when establishing trust for reuse of the data (). Trusted data are perceived as worthy of use in decision making environments where the metadata is sufficient to adequately describe the data, e.g., information about the dataset author and data timeliness. Describing the quality of a data product and providing access to such quality information can support potential users of a particular dataset to determine whether it is appropriate for their planned usage, i.e., fitness for purpose.

A systematic analysis carried out under the European Commission estimated the annual cost of not sharing data to be a minimum of €10.2bn per year (). On average, data scientists spend 60–70% of their effort on dealing with data quality related issues (e.g., ). Thus, not sharing data quality information will compound that loss, especially the productivity lost to redundant assessments of data quality.

Although the importance of access to quality information is well recognized, methodologies for an evaluation framework and for presenting the resultant quality information to end users may not have been comprehensively addressed. Access to this information is especially important for enabling data to be findable, accessible, interoperable, and reusable (FAIR). The FAIR data guiding principles defined by Wilkinson et al. () emphasize the importance of data sharing in a machine-friendly environment. Since their inception, the FAIR data guiding principles have been adopted by global entities and have had a major impact in promoting data sharing and reuse globally (e.g., ; ; , ; ; ; ). However, the FAIR data guiding principles are somewhat limited in that they call only for (meta)data to be associated with detailed provenance and “richly described with a plurality of accurate and relevant attributes” (). They do not explicitly address the sharing of quality information about (meta)data. For example, if the FAIRness of a dataset has been evaluated, what can be done to ensure that the method used and assessment results are readily findable, accessible, and (re)usable to end users? What can be done to ensure that the quality information can be readily integrated across different tools and systems within and outside individual organizations?

Therefore, building on the direction that the FAIR guiding principles have provided for data sharing, we would like to go one step further and call for all dataset quality information to be FAIR to improve the sharing of quality information of individual datasets. The optimal goal of this call is to allow for global access to and global harmonization of quality information of individual datasets as an important step towards open science in both machine- and human-friendly environments. FAIR dataset quality information will also help improve data sharing as discussed in more detail in Section 4.

Dataset quality can be affected by activities that are conducted throughout the data lifecycle. For example, lack of a quality assurance and control (QA/QC) procedure when generating the data product may impact the scientific quality of the data, while lack of information about the QA/QC procedure will influence the quality of metadata and potentially reduce users’ confidence in trusting and using the data. In addition, data can be corrupted at any stage of the data lifecycle. In order for users and decision-makers to trust the data and the scientific findings resulting from analysis of these data, it is essential to establish and demonstrate, in a consistent and transparent way, the credibility of not only the data themselves, but also the whole process of producing, managing, stewarding, analyzing, and servicing the data (). Therefore, a data lifecycle approach to data quality assessment is necessary. Furthermore, such an approach can facilitate effective recording of data quality information during various data lifecycle activities. The details of such information could be lost during later stages in the data lifecycle if they are not recorded in a timely fashion when the data quality events or assessments occur. Moreover, managing quality throughout the entire dataset lifecycle is imperative for ensuring that the information and knowledge gained are not contaminated by inaccurate or corrupted data, as well as for facilitating accurate uncertainty estimates in the derived analyses. The value of lifecycle approaches to data quality has been recognized for various kinds of data, including remote sensing observations (), health services (), and health and biomedical citizen science (). Data lifecycle approaches to quality assessment could also be informed by lifecycle approaches to software quality ().

Another example where verifiable and consistent quality-controlled data are becoming increasingly important is in the domain of the emerging technologies of machine learning (ML) and artificial intelligence (AI). These are flourishing as useful and effective tools to uncover or gain new knowledge from various domains of Earth science data (see an overview by ). However, any sound analysis needs to build on reliable data (; ). Good quality data allow people and organizations to increase trust in data and analysis, apply robust ML models, and increase revenue and business performance. Poor quality data could lead to a negative economic impact and even loss of human lives. Without good definitions and procedures for working with data quality, critical operations could be at risk. Organizations should be aware of the quality of the data used for AI, and ensure that only data fit for purpose within acceptable risks are used (e.g., ).

Therefore, it is crucial to consistently record, curate, and represent quality information of individual datasets, and make it readily available and integrable. However, dataset quality information is not routinely curated, much less represented in a human- and machine-readable manner, despite the fact that international standards for describing the quality of geographic data have been in place since 2003 (e.g., ; ). The lack of adoption of one or more data quality standards may in part reflect the diversity of approaches, availability of resources, technologies, networks, and research questions of investigators (), as well as the context for the planned purpose and use of the data (; ). Lack of motivation to document quality can also stem from the lack of prescriptiveness of existing standards. Documentation of data quality metadata has always been optional in the ISO 19100 series and, as of 2014, the ISO 19115-1 standard for metadata no longer defines a minimum set of discovery metadata, which used to suggest at least one data quality element (the provenance of a dataset).

Several other issues may also contribute to challenges for assessing and reporting data quality, and ultimately for the curation of dataset quality information. A frequently cited barrier against documenting the quality of spatial data is that it requires specialized domain-expert technical knowledge and cross-domain integration of expert knowledge, while documenting general metadata can be done automatically or by non-specialists (; ). Oftentimes, knowledge of the quality information resides with domain experts who need to contribute to the information but may not be fully aware of the importance of their contributions, as the impact comes later on. The benefit of such knowledge may also be substantially less at an individual level than for the common good.

Investigating barriers to assessing and reporting data quality information in medical bioinformatics, Callahan et al. () identified issues, at both organizational and individual levels, that contribute to deficiencies in quality assurance. Such organizational issues include inadequate support, unclear expectations and insufficient training such as the absence of best practices. Individual ones include consequential as well as process issues, such as unresolvable conditions.

All these issues and challenges have led us to make this call-to-action statement to bring awareness about the needs and benefits of sharing quality information at the individual dataset level and rally people behind an important and challenging international and cross-disciplinary community effort.

The paper is organized as follows. Section 2 provides definitions of some key terms used in this paper. Section 3 touches on the multiplicity of dataset quality attributes and dimensions, which is one of the main challenges for assessing and curating dataset quality information. Some potential benefits of having FAIR dataset quality information are described in Section 4, along with a brief summary of a real-life use case of how pre-vetted, timely and readily usable data and quality information are critical to disaster response efforts conducted by utility companies, with lives and billions of dollars at stake. A call for global community guidelines is then made in Section 5, which also includes an outline of community recommendations for developing such guidelines. A summary in Section 6 concludes the paper.

2. Terms and Definitions

In this paper, data are representations of observations, objects, or other entities and can refer to anything that is collected, observed, generated or derived, and used as a basis for hypothesis testing, reasoning, discussion, or calculation. Observed data include in situ and remotely sensed measurements. In situ measurements can be from weather stations, rain gauges, buoys, or autonomous vehicles/vessels, while remotely sensed data can be from instruments on satellites or aircraft. Generated data can be results from a numerical model (e.g., a climate model) or a statistical model (e.g., a linear regression model). Model data can be analyses, predictions, or projections. Derived data can be produced from raw measurements or other data products. For example, atmospheric reanalysis data are one type of derived data that combines modeled data and observations via data assimilation to produce a dynamically consistent estimate of the atmospheric state.

Dataset refers to an identifiable collection of data (), and it can be published or curated by a single agent (). A dataset can be the digital rendition of a data product of a given version of an algorithm or model or experiment. A dataset may contain one or many data files or records in a database in an identical format, having the same variable(s) and product specification(s).

Dataset quality information consists of information about the quality of data, metadata and documentation. Documentation can include descriptions of measurement methods and instruments, software, provenance, as well as that on the state of practices, workflows, frameworks, tools, and systems associated with the dataset and production, data and quality management, data services and usage, customer support and user engagement.

3. Multi-Dimensionality of Dataset Quality

A dataset is associated with a number of distinct quality attributes or characteristics. For example, Wang and Strong () identified 179 individual data quality attributes through a survey from a data consumer perspective, including attributes such as accuracy, correctness, and freedom from bias. Furthermore, dataset quality attributes can be categorized into different perspectives or dimensions with emphasis on certain quality attributes (e.g., ; ; ; ; ). For instance, Wang and Strong () prioritized the 179 quality attributes down to 15, and categorized them into four perspectives, i.e., intrinsic, contextual, representational, and accessibility. Redman () defined accuracy, completeness, consistency, and currency as four quality dimensions of data values. Alternatively, based on the full dataset lifecycle, Ramapriyan et al. () categorized quality attributes into four quality dimensions: science, product, stewardship, and service. These various groups of quality attributes are explicitly listed to demonstrate that they can differ greatly depending on the perspective taken.

Assessment models are developed in Earth science to measure the maturity of different quality perspectives and dimensions at the dataset or collection level (see an overview by ). However, there are very limited and sparse actionable guidelines on how to curate and represent dataset quality information in a way that is consistent with FAIR principles for improved sharing.

Currently, dataset quality information, when available, is published in science journals that are text-based and cannot be readily integrated into data management and stewardship processes or across different systems. In addition, dataset quality information needs to be readily understandable by both machine and human end users, including those who plan to use the described data as well as those who are trying to determine whether the data are appropriate for their intended use. Therefore, we need to converge towards harmonized approaches for curating dataset quality information in a way that is consistent with FAIR guiding principles to effectively enable global access to this information.

4. Potential Benefits of FAIR Dataset Quality Information

The FAIR data guiding principles emphasize the importance of data sharing by ensuring that data and data descriptions (metadata) are findable, accessible, interoperable, and reusable (). Findable data can be discovered and understood by a search agent (e.g., a search engine or human user). Accessible data can be rendered and used by machine and human end users via standard protocols within use and access constraints. Interoperable data can be readily used in conjunction with other data products or services and can also be integrated with other data to create new data products or services. Reusable data can be used, under the proper license and given well-documented dataset provenance, by diverse audiences beyond those who were initially envisioned as potential users by the original data producer(s).

Applying the FAIR guiding principles when curating and representing dataset quality information can help ensure the information is optimal for sharing and enabling global access. Successful reuse of data often depends critically on a potential user being able to access information about the quality of the data and determine its fitness for the intended application. Information about data quality also contributes to or improves the FAIRness of a dataset. For example, when data can be discovered based on information about certain quality attributes, the findability of the data is improved for users who need data that have such attributes, and further, the quality information helps users assess the relevance of a discovered dataset to a research or operational need. Global access to and sharing of dataset quality information will help improve data transparency and enable reproducibility, which is especially critical for highly influential data that are used for decision-making (e.g., ).

Consistently curating and representing dataset quality information by following the FAIR principles could eventually lead to standardization within and across organizations, tools, and systems, which in turn will lead to harmonization of the information. In addition, describing quality information using standardized formats, schemas, and terminology with controlled vocabularies improves the interoperability and reusability of the data.
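As an illustration of what a standardized, machine-readable representation might look like, the following minimal sketch serializes the quality assessments of one dataset to JSON. The field names, the placeholder dataset identifier, and the example values are assumptions made for this illustration and are not part of any existing standard or of the guidelines under development. Only the four dimension names are taken from the lifecycle-based categorization described in Section 3.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

# The four lifecycle-based quality dimensions mentioned in Section 3.
DIMENSIONS = {"science", "product", "stewardship", "service"}


@dataclass
class QualityAssessment:
    """One assessed quality attribute (all field names are illustrative)."""
    attribute: str    # e.g. "completeness"
    dimension: str    # one of DIMENSIONS
    method: str       # how the attribute was assessed
    result: str       # assessed outcome, kept as free text for simplicity
    assessed_on: str  # ISO 8601 date of the assessment

    def __post_init__(self) -> None:
        # Reject dimensions outside the agreed set so records stay comparable.
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown quality dimension: {self.dimension!r}")


@dataclass
class DatasetQualityRecord:
    """Quality information attached to one dataset (hypothetical structure)."""
    dataset_id: str
    assessments: List[QualityAssessment] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() converts nested dataclasses, so the output is plain JSON.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


record = DatasetQualityRecord(
    dataset_id="doi:10.0000/hypothetical-dataset",  # placeholder identifier
    assessments=[
        QualityAssessment(
            attribute="completeness",
            dimension="product",
            method="fraction of expected records present",
            result="0.98",
            assessed_on="2021-01-15",
        )
    ],
)

print(record.to_json())
```

Because the record is plain JSON with a fixed set of keys, a second system could read it without knowledge of the tool that produced it, which is the kind of interoperability the paragraph above refers to.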

Appendix A describes in detail a real-life use case of how trusted, timely and readily-integrable data and quality information are critical to disaster responses by utility companies with billions of dollars at stake. For disaster response managers, any information needs to be trusted and readily integrated, and understood in layman’s terms in a matter of a minute. Accuracy and timeliness of data and information are extremely important, and any datasets that are selected to be a part of their decision-making processes need to be trusted. Managers will not trust just any available datasets, since their decisions can have an impact on the safety and survival of at-risk populations, can cost up to millions of dollars, and can influence the reputation of their organizations.

For this use case, datasets are pre-vetted with an operational readiness level (ORL) ranking that is readily available and easily understood by decision makers who are generally non-data experts. An assigned ORL enables such decision makers to rapidly trust datasets. Data and information are integrated into a system which underpins an easy-to-understand dashboard for disaster response managers to allow them to make decisions promptly. Thus, providing quality information along with the data establishes the trust needed for supporting such potentially life-saving emergency response activities, and maximizes the benefit of sharing data. More detailed information can be found in Appendix A.
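The way a pre-assigned ranking lets a system select trusted datasets without further expert review can be sketched as follows. This is a minimal illustration only: the integer scale, the convention that a higher value means greater readiness, and the dataset names are all assumptions made for the example and may not match the ORL scheme described in Appendix A.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VettedDataset:
    """A dataset with an optional readiness ranking (hypothetical structure)."""
    name: str
    orl: Optional[int]  # None means the dataset has not been vetted yet


def trusted_for_response(
    datasets: List[VettedDataset], minimum_orl: int
) -> List[VettedDataset]:
    """Keep only vetted datasets whose ranking meets the threshold.

    Assumes a higher ORL value means higher readiness; unvetted
    datasets are always excluded.
    """
    return [d for d in datasets if d.orl is not None and d.orl >= minimum_orl]


# Hypothetical catalogue entries, for illustration only.
catalogue = [
    VettedDataset("flood-extent-map", orl=4),
    VettedDataset("experimental-wind-forecast", orl=2),
    VettedDataset("new-sensor-feed", orl=None),
]

selected = trusted_for_response(catalogue, minimum_orl=3)
print([d.name for d in selected])
```

The design point is that the vetting effort happens once, ahead of time, and its outcome travels with the dataset, so that the selection step at decision time reduces to a simple comparison.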

Pre-vetting datasets and developing the dashboard require years of work and ongoing effort in addition to cultivated human relationships. Readily-available and consistently curated quality information of an individual dataset will help improve the process of establishing the trust necessary to support tools and services provided for disaster response, saving time and money. It will also support effective (re)use of the dataset for other applications, resulting in wide community utilization and therefore maximizing the value of the dataset.

5. Call for Global Earth Science Community Guidelines

The voluntary sharing of meteorological observations and scientific knowledge among countries and between individual researchers and organizations has been going on for over a century (e.g., ). Since 1991, the World Meteorological Organization (WMO) has passed a series of resolutions for sharing essential and high value data, including basic weather, hydrological, and climate data (; ; ). Guidelines on acquisition, quality assurance and control of meteorological station and model data were developed (e.g., ; ; ; ; ).

Sharing and reusing data is a big challenge, but significant progress has been made collectively by the Earth science community over the last few decades. For the first time, WMO () issued a regulatory technical recommendation on managing climate data to be more accessible and usable. Recognizing the need to address ever increasing data volume and variety of data types across disciplines, WMO () called for one unified data policy to support global environmental data sharing and open science. The United Nations Committee of Experts on Global Geospatial Information Management (UN-GGIM) developed a strategic framework with recommendations for a geospatial community with a strong focus on data, data quality and standards. These recommendations will help countries make their geospatial data and information reliable, accessible, and easy to use (; ). Success stories include greatly enhanced accessibility, usability, and interoperability of observational and climate model data through coordinated community efforts such as the Observations for Model Intercomparisons Project (Obs4MIP; ) and the Coupled Model Intercomparison Project (CMIP; ). CMIP6 data are published with persistent identifiers such as Digital Object Identifiers (DOIs) (). Sharing and reusing dataset quality information is an even bigger challenge that requires global community effort. Issuing a call to the community for such an effort is the first step towards achieving that goal.

To explore the needs for dataset quality information, and the approaches and challenges for consistently evaluating and representing the quality information, a one-day virtual workshop was held on Monday July 13, 2020. Domain experts from 9 countries across America, Europe and Oceania participated in the workshop. It was followed by a report-out session on Wednesday July 22, 2020, during the virtual Earth Science Information Partners (ESIP) 2020 Summer Meeting (SM20), July 14–24, 2020. Additional information can be found in the workshop report by Peng et al. ().

A total of 14 presentations from organizations across 9 countries were given during the two live sessions of the workshop and the ESIP SM20 report-out session. Presenters summarized the data quality assessment approaches that have been developed and/or adopted by organizations representing the scope of national and international Earth science data producers, data management stewardship programs, data and service centers as well as data and information providers. Participants also included data users from academic and private sectors.

The needs, challenges and approaches of consistently curating digital Earth science data and products were discussed during the live sessions as well as online during the following weeks. There is an overwhelming need for developing actionable community guidelines. The scope and path forward for developing such community guidelines for Earth science dataset quality information were also discussed. Key takeaways from the discussions are listed and described in the following subsections.

5.1. Built by the Global Community

While needs are strong for practical community guidelines for curating data quality information, currently such guidance is very limited and sparse. Therefore, to ensure the relevance of such guidelines, it is crucial for the guidelines to be developed through a coordinated effort via an iterative process, leveraging community best practices and the experiences and expertise of an international team of interdisciplinary domain experts.

An international working group has thus been formed to develop practical community guidelines. The current members of the working group consist of data producers, publishers, managers and stewards from national science and/or data centers, repositories, as well as data users from the academic and private sectors. Collectively they bring together many years of valuable experience in production, management, services, and applications of various types of Earth science data, including satellite, in situ, and model data, along with knowledge of the challenges and best practices in their domains.

The membership of the working group is open for any domain expert who is willing and able to contribute. One can also support this effort by reviewing the draft of the guidelines or providing a use case. Interested parties are encouraged to contact the corresponding author of this article.

5.2. Quality-Attribute Agnostic Guidelines

As characterized in Section 3, assessing dataset quality is a multi-dimensional problem. The selection of the relevant attributes is context-dependent and leads to different categorizations and practical dimensions (). Complexity exists even within a single discipline, where a quality attribute can have different definitions and be measured and represented differently. An example of this is data uncertainty as explored by Moroni et al. ().

The selection of the relevant attributes is context-dependent because datasets are often crafted for specific designated communities. Traditionally, designated communities of data consumers are domain literate and have some familiarity with the scientific context, data generation, or intended data use. However, with the increasing availability of data today, the existence of interested audiences with a variety of scientific backgrounds outside the domain of data collection must be taken into consideration in order for scientific knowledge to be widely conveyed and understood. Designated communities may also change over time (). These guidelines are intended to be more general and agnostic of the quality attributes. How to tailor the dataset quality information to the designated community is left to the specific entities that serve that particular community.

Therefore, the guidelines aim to equip data consumers with readily available information that is consistently curated and disseminated, in a way that can be easily found, accessed, understood by end-users and integrated across tools and systems, regardless of the quality attribute and the assessment approach.

5.3. Community Consented Terminology for Enhanced Interoperability

For a given quality dimension, consistency in various components and attributes across entities in each component, namely, semantic and structured consistency as defined by Redman (), is important for generating machine-actionable quality information. Common terminology is necessary for integrating data and information across workflows, tools and systems, as well as for curating and representing dataset quality information. Moreover, terms should be defined, and, ideally, referenced with a persistent identifier, for all stages of a dataset lifecycle.
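A minimal sketch of how a shared vocabulary could support this is given below. Each term carries a definition and an identifier, and labels used locally by individual systems are mapped onto the common terms. The vocabulary entries, the definitions, the synonym mapping, and the `example.org` identifiers are all hypothetical placeholders for illustration; an operational vocabulary would use community-agreed definitions and resolvable persistent identifiers.

```python
from typing import Dict, NamedTuple


class Term(NamedTuple):
    label: str
    definition: str
    identifier: str  # placeholder URI standing in for a persistent identifier


# Hypothetical shared vocabulary; definitions are illustrative only.
VOCABULARY: Dict[str, Term] = {
    "completeness": Term(
        "completeness",
        "Degree to which expected data values are present.",
        "https://example.org/vocab/quality/completeness",
    ),
    "timeliness": Term(
        "timeliness",
        "Degree to which data are available when they are needed.",
        "https://example.org/vocab/quality/timeliness",
    ),
}

# Hypothetical labels used by individual systems, mapped to shared terms.
SYNONYMS: Dict[str, str] = {
    "coverage": "completeness",
    "currency": "timeliness",
}


def resolve(local_label: str) -> Term:
    """Map a locally used label to the shared term, or raise KeyError."""
    key = local_label.strip().lower()
    key = SYNONYMS.get(key, key)
    if key not in VOCABULARY:
        raise KeyError(f"no shared term for label: {local_label!r}")
    return VOCABULARY[key]


term = resolve("Coverage")
print(term.label, term.identifier)
```

Raising an error for unknown labels, rather than passing them through, is a deliberate choice in this sketch: it surfaces terminology gaps at curation time instead of letting undefined terms propagate across systems.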

5.4. Continuous Engagement with Stakeholders

The guidelines are being developed through an iterative process to allow for feedback from the community and all stakeholders, including those who contribute to the acquisition, curation, dissemination, and application of data. Continuous community engagements are planned by means of informal updates to various working groups and formal presentations to the targeted stakeholders, including those to the American Geophysical Union (AGU) community at the 2020 AGU Fall Meeting (), the ESIP community at its winter meeting in January 2021 (), and the European Geosciences Union community at its general assembly in April 2021 (). Additionally, reviews of the guidelines draft are planned prior to its being baselined. The guidelines document will be a living document to allow for evolving community requirements and best practices.

5.5. Long-Term Sustainability

It has been suggested by the participants that long-term sustainability should be planned for such community guidelines. Once baselined, the guidelines will be publicly accessible via an open science platform such as ESIP Figshare (esip.figshare.com) and/or Open Science Framework (osf.io) that provides a globally unique and persistent identifier with searching and sharing capability for the document. To ensure currency and timeliness of the guidelines, curation and revision planning is essential. These approaches will help with maintaining relevancy and long-term access to the guidelines.

6. Summary

In summary, there are fundamental challenges in collecting, curating, and representing dataset quality information. Those challenges include:

  • Dataset quality information is multi-dimensional with many quality attributes,
  • Information about dataset quality traverses different knowledge domains and curating it requires cross-disciplinary collaborations and knowledge integration,
  • Requirements, as well as ways to assess dataset quality, may differ across user applications.

There is also a strong community need for, and benefit in having, community-developed guidelines that are practical to implement. At the same time, guidelines for curating and representing dataset quality information are limited and sparse.

Recognizing the needs, challenges, and benefits of sharing information on quality at a dataset/collection level, interdisciplinary domain experts around the world have come together and called for community guidelines towards global access and harmonization of information on quality of individual Earth science datasets. The guidelines will be targeted at addressing these fundamental challenges, as well as others that are identified through the development activities.

The quality-attribute agnostic guidelines will be developed under community effort via an iterative process, leveraging the experiences and expertise of an international team of interdisciplinary domain experts and best practices. Description of what the quality attributes are, how the attributes are assessed, and what assessment approaches are utilized, should be included in relevant metadata or a document, preferably in a consistent way for transparency and enhanced usability. The guidelines will call for a machine-actionable mechanism to represent assessed results for enhanced interoperability across systems and disciplines.

By adopting the FAIR principles, the guidelines will help to ensure global access to the dataset quality information. Effective sharing of structured dataset quality information will help to move towards its global harmonization, which in turn will support (re)use of the data by both human and machine end users and therefore further enhance the value of the data.

As mentioned in Section 5.1, an international FAIR dataset quality information working group has been formed under the leadership of the ESIP Information Quality Cluster (IQC), the Barcelona Supercomputing Center (BSC) Evaluation and Quality Control (EQC) team, and the Australian Research Data Commons (ARDC) coordinated Australia/New Zealand Data Quality Interest Group (AU/NZ DQIG). The membership of this working group is open to any domain expert who is willing to contribute to the development effort. Development of the guidelines has begun and the outcomes will be reported in a follow-up paper. The guidelines will be primarily developed for the Earth science community. They will, however, be general enough so that other disciplines can readily adapt them, which will further promote global access and harmonization of dataset quality information in supporting open science.