Data Distribution Centre Support for the IPCC Sixth Assessment

Martina Stockhause1, Martin Juckes2, Robert Chen3, Wilfran Moufouma Okia4, Anna Pirani4, Tim Waterfield4, Xiaoshi Xing3 and Rorie Edmunds5
1 German Climate Computing Center (DKRZ), DE
2 Centre for Environmental Data Analysis (CEDA), GB
3 Center for International Earth Science Information Network (CIESIN), Columbia University, US
4 IPCC WG1 Technical Support Unit (TSU) c/o Université Paris Saclay, FR
5 World Data System (WDS), JP
Corresponding author: Martina Stockhause (stockhause@dkrz.de)


Introduction
The Intergovernmental Panel on Climate Change (IPCC) is the United Nations (UN) body for assessing the science related to climate change. It was established by the United Nations Environment Programme (UNEP) and the World Meteorological Organization (WMO) in 1988 to provide member countries with periodic scientific assessments concerning climate change, its implications and risks, and options for adaptation and mitigation strategies.
Each assessment cycle includes three main Assessment Reports prepared by each of the three Working Groups (WGs), as well as Special Reports that cut across more than one WG. These reports are prepared by teams of authors over a period of several years, with scientific oversight by the WG Bureaus and supported by their respective Technical Support Units (TSUs). The Task Group on Data Support for Climate Change Assessments (TG-Data; formerly TGICA) oversees the Data Distribution Centre (DDC), which manages key data and information resources generated by IPCC assessments that are needed in subsequent assessments and that are useful to the broader scientific and policy communities. The DDC plays a key role in ensuring that authoritative, well-documented, and traceable data and associated information from the IPCC assessment are made fully available to scientists and other interested users worldwide, particularly those located in developing countries, who may have limited access and who seek to expand analyses and data access in understudied regions. During the Fifth Assessment cycle (AR5), the links between the DDC and WGs were weaker than in previous assessments, resulting in the WGs implementing ad hoc data management solutions and the DDC capturing AR5 data for long-term preservation only after the assessments were released.
A primary challenge for TG-Data and the DDC is to identify, access, archive, and document the critical data and information resources stemming from IPCC activities, especially once the assessment is completed and TSUs scale back between assessments. Such efforts are essential given the long history of the IPCC assessments, spanning more than three decades, and the inevitable turnover of scientists and other contributors involved in the process.
Over the past decade, in addition to the traditional assessment of scientific literature, other scholarly information such as scientific data, models, and software has become increasingly relevant and integrated into the standard scientific practices of research institutions, for example, as formulated by the German Research Foundation (DFG, 2013). Organizations such as DataCite (http://datacite.org), ESIP (Earth Science Information Partners; http://esipfed.org/), RDA (Research Data Alliance; http://rd-alliance.org), WDS (World Data System; http://world-data-system.org) and CODATA (Committee on Data; http://www.codata.org/) of the International Science Council (ISC), and various publishers have helped to develop principles and guidelines such as the FAIR (Findable, Accessible, Interoperable and Reusable; Wilkinson et al., 2016) data principles and to expand their use in scholarly practice (e.g., Scholarly Link eXchange, Scholix: http://scholix.org; Burton et al., 2017; Stall et al., 2017, 2018; Cousijn et al., 2018). Additionally, CoreTrustSeal (http://www.coretrustseal.org; Edmunds et al., 2016), jointly developed by WDS and the Data Seal of Approval (DSA), sets a basic quality standard for trustworthy repositories for research data. These evolving best practices are important to consider in the continuing evolution of the DDC and its efforts to provide high-quality data stewardship and support to arguably one of the most important, long-term scientific endeavors in modern history.

Improvements for the IPCC Sixth Assessment
The AR6 cycle places considerable weight on integrating the risk framework with solution-focused information and on the growing demand for policy-relevant regional climate information, and promotes enhanced interlinkages across the contributions of Working Groups I, II and III on several cross-cutting topics, including the production of a Regional Atlas. This opens up new data management challenges for the DDC.
An IPCC review of the TGICA and the DDC after the Fifth Assessment cycle resulted in a revised mandate for the renamed TG-Data (Task Group on Data Support for Climate Change Assessments), strengthening the data management aspects and providing new guidance for the DDC (IPCC-47, 2018). The selection of members for TG-Data for the Sixth Assessment cycle has been finalized by the IPCC Bureau and is scheduled for approval in the next IPCC session (IPCC-49, 2019).
The key activities of the renewed guidance for the DDC fall into three categories:
1. Long-term data archival and data curation: Archive and curate data and scenarios in the DDC, which are used by the IPCC in its reports or underpinning key figures, key tables, and headline statements; data is archived in a stable, transparent, and traceable manner.
2. Collaboration with additional data centers: Collaborate as appropriate with data centers that hold data or provide functions relevant to the IPCC in a transparent manner; especially, contribute to a sustainable structure established and approved by the IPCC to provide observed and model data and information relevant at regional scales.
3. Support for IPCC authors and data users: Improve accessibility to Data Distribution Centre materials for supporting IPCC authors and external users, especially in developing countries.
For practical support of the WGs on technical issues, the DDC Support group (https://cedadev.github.io/ipcc_ddc) was formed in September 2017, consisting of members from the three TSUs alongside the three DDC managers. It carried out necessary practical discussions on data management and curation for AR6 during the transition of TGICA to TG-Data but is not a formal group of the IPCC. The future relation of DDC Support to TG-Data will be decided within TG-Data. WG TSUs as well as the DDC Managers are ex-officio members of TG-Data. In the following subsections, initiatives targeting these three DDC key activities are described.

Long-term data archival and data curation
The IPCC DDC has targeted long-term data archival and curation from its inception. However, this has been limited to specific data collections at the three data centers: the Reference Data Archive for global climate model output (WDCC, hosted by the German Climate Computing Center, DKRZ), socioeconomic data, scenarios and observed impacts (CIESIN), and climatologies (CEDA). Owing to resource constraints, the three centers worked with the former TGICA to prioritize efforts on these core data collections that are important in each assessment cycle, and in particular where long-term continuity of management and access would be beneficial to the long-term assessment process, to the broad research community, and to potential stakeholders and decision makers. For example, as part of each assessment cycle, the IPCC systematically analyzes and compares the outputs of numerous global climate model runs from research groups worldwide, producing a core set of reference data that not only form the basis for the IPCC's conclusions, but also support subsequent research on impacts and policy options. The DDC performs a central role in ensuring that these data are properly preserved, well documented, and easily accessible for all. Similarly, the DDC has supported community needs for access to emissions and climatological data, quantitative scenarios about future socio-economic development (e.g., Shared Socioeconomic Pathways, SSPs), and data and information about observed impacts of climate change.
Of course, the IPCC Assessments have drawn on a much broader range of available scientific data and information resources, including the results of Integrated Assessment Models (IAMs). In some cases, archiving and redistribution of data are constrained by intellectual property (IP) and permission issues. To avoid duplication of effort, the DDC developed a review process for linking to relevant high-quality data that are archived and curated at other data centers or institutions (Juckes et al., 2012).
Apart from these primary data, key datasets underlying figures, tables, or headline statements of the IPCC reports contain the central messages of each climate assessment. These key datasets, together with the primary data and the algorithms applied in analysis scripts, require long-term curation and accessibility to ensure transparency and traceability of IPCC results. In this context, Cousijn et al. (2018) note that data citation is aimed at significantly improving the robustness and reproducibility of science, and at enabling FAIR data. For AR5, only a snapshot of climate projection data was archived long term in the DDC at WDCC/DKRZ, and data citation was enabled by minting DataCite DOIs (Digital Object Identifiers). Within AR6, data references provided for the source data (Stockhause and Lautenschlager, 2017) enable IPCC authors to cite their data sources for the first time in the history of the IPCC climate assessment reports. Thus, source data providers get credit for the preparation of their data and the source data can be tracked.
In addition, datasets underlying figures and tables of the report should be preserved together with basic tracking or provenance information, including a reference to the source data, the applied analysis scripts, and the data products. Gil et al. (2016) emphasize the importance of provenance for the traceability of research findings in their vision of an interactive future publication in geosciences. The information about the creation of key results that is inherent in the applied software and the provenance record should be published in addition to the data. For software publication, Zenodo issues DOIs on software versions extracted from GitHub. For the publication of provenance records, W3C PROV (Moreau et al., 2013) provides the international standard.
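The tracking information described above (source data, applied analysis script, resulting data product) can be pictured as a small provenance record. The following is a minimal sketch only, assuming a PROV-JSON-like layout built from W3C PROV relation names (used, wasGeneratedBy, wasDerivedFrom). The `ex:` namespace, all identifiers, the type labels, and the helper names `build_figure_provenance` and `sources_of` are hypothetical and do not describe an existing IPCC or DDC workflow.

```python
import json


def build_figure_provenance(figure_id, script_id, source_ids):
    """Return a PROV-JSON-like dict linking a figure dataset to its inputs.

    All identifiers are hypothetical placeholders in an example namespace.
    """
    record = {
        "prefix": {"ex": "http://example.org/"},
        "entity": {figure_id: {"prov:type": "ex:FigureDataset"}},
        "activity": {script_id: {"prov:type": "ex:AnalysisScriptRun"}},
        "used": {},
        "wasGeneratedBy": {
            "_:gen1": {"prov:entity": figure_id, "prov:activity": script_id}
        },
        "wasDerivedFrom": {},
    }
    for i, source_id in enumerate(source_ids, start=1):
        # Each source dataset is an entity that the script run used ...
        record["entity"][source_id] = {"prov:type": "ex:SourceDataset"}
        record["used"][f"_:use{i}"] = {
            "prov:activity": script_id,
            "prov:entity": source_id,
        }
        # ... and from which the figure dataset was derived.
        record["wasDerivedFrom"][f"_:der{i}"] = {
            "prov:generatedEntity": figure_id,
            "prov:usedEntity": source_id,
        }
    return record


def sources_of(record, figure_id):
    """List the source entities a figure dataset was derived from."""
    return sorted(
        relation["prov:usedEntity"]
        for relation in record["wasDerivedFrom"].values()
        if relation["prov:generatedEntity"] == figure_id
    )


if __name__ == "__main__":
    example = build_figure_provenance(
        "ex:figure-data-1",
        "ex:plot-script-run-1",
        ["ex:source-dataset-a", "ex:source-dataset-b"],
    )
    print(json.dumps(example, indent=2, sort_keys=True))
    print(sources_of(example, "ex:figure-data-1"))
```

Keeping the record as plain data means it can be stored next to the figure dataset and queried later, for example to list every source dataset behind a given figure.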
Working Group I (WGI) will support the development and implementation of best practices for the development of common software to produce figures and tables in the IPCC report. The activities will entail archiving datasets underlying figures and tables of the report, assessing data provenance, accessibility and curation, and tracking the origin of source climate data, including a reference to source datasets, applied analysis scripts, and the data products.
WGI will build on existing community efforts that encourage open exchange of diagnostic source code for model evaluation to establish an end-to-end provenance mechanism that ensures reproducibility of figures and tables of the IPCC AR6 WGI report. WGI will encourage authors to develop community software tools, such as the Earth System Model eValuation tool ESMVal (Eyring et al., 2016), to develop easily reproducible and traceable common model evaluation metrics. The planned application of the ESMVal tool on the source data to be archived in the Reference Data Archive for AR6 at WDCC/DKRZ will add important information on model and therefore data quality of the AR6 data archived in the IPCC DDC. The TSU is also investigating other resources to access and process stored data, for example Jupyter notebook systems being developed as part of the WGI Atlas or data-centric infrastructure projects (Balaji et al., 2018) such as the Pangeo Environment (https://pangeo.io/). Ultimately, the AR6 cycle will make the analysis and visualization scripts public on a platform like GitHub with the details on how to recreate the figure or table and datasets used. This will allow the figures and tables to be available and utilized by the science community or any interested party. This will help in ensuring traceability of data products and analysis scripts used in the IPCC assessment, provide long-term discoverable archival solutions, and facilitate the availability and consistent use of climate change data and scenarios to support the IPCC work program.
The DDC, in collaboration with TG-Data, may also be able to identify key data being used to address specific questions in the assessment, and work in advance to ensure long-term preservation and access of these data. Examples might include the emerging set of georeferenced human settlement and population data needed to better understand land-use change and potential climate impacts (e.g., https://www.popgrid.org), together with digital elevation model (DEM) data that are fundamental to better assess the risks of sea-level rise.

Collaboration with additional data centers
The DDC was initially formed in 1997 from data centers in the UK and Germany, with a third data center from the U.S. joining in 2003. In 2007, CEDA (formerly the British Atmospheric Data Centre) replaced the Climatic Research Unit of the University of East Anglia as the UK element of the DDC. Over the past three decades, the DDC has relied on national support provided by the governments of the UK, Germany, and the U.S. to maintain operations. The benefits of relying on a small set of centers have included closer coordination with the IPCC TSUs and the TGICA; adherence to IPCC requirements regarding resources made available through the DDC; strong relationships with the relevant scientific communities; and continuity, quality control, and transparency in data management and stewardship across assessment cycles.
In the past few years, WDCC hosted by DKRZ and the NASA Socioeconomic Data and Applications Center (SEDAC) operated by CIESIN have become Regular Members of the World Data System, with CEDA joining as a component of the Natural Environment Research Council's Environmental Data Centres, as a WDS Network Member (Figure 1). As a result, the three data centers meet the recently developed requirements for certification as Trustworthy Data Repositories (TDRs) established by CoreTrustSeal. WDS brings to bear broad expertise in scientific data management, including significant data holdings and data management capabilities in fields important to the IPCC assessments. WDS also spans a wide range of countries with growing climate research communities and increasing participation in IPCC assessments. It therefore makes sense to explore how the DDC could expand partnerships with other WDS Members, and potentially whether adding other WDS Members to the core DDC operations would help achieve the DDC's goals for data stewardship and access. For either approach, it is important to maintain the high IPCC standards for quality and transparency, together with the CoreTrustSeal standards established by the research community.
Experiences from previous IPCC assessment cycles have shown that data in external data centers (including data disseminated via ESGF data nodes) might vanish or change over time, compromising the traceability of IPCC results. Other data for AR5, currently not part of the DDC data archives, is attached to the website of the IPCC AR5 report, which is problematic for long-term availability and discoverability. A task that the IPCC has identified as important is the archival, curation and dissemination of, and support for, regional climate model data and information. The current DDC partners store global data. Serving regional data and information requires the set-up of a federation of regional data centers, located in the regions they serve; for example, WDCC/DKRZ archives regional model data for the domain Europe.
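One common way to detect that an externally held file has vanished or changed is to record a checksum when the data are first used and to compare it later. The sketch below is an assumption made for illustration, not a procedure described for the DDC; the function names `sha256_of` and `check_manifest` and the manifest layout (file path mapped to recorded checksum) are hypothetical.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(manifest):
    """Compare files against checksums recorded when they were first used.

    `manifest` maps a file path to its recorded SHA-256 hex digest.
    Returns a dict mapping each path to 'ok', 'changed', or 'missing'.
    """
    status = {}
    for path, recorded in manifest.items():
        if not Path(path).is_file():
            status[path] = "missing"
        elif sha256_of(path) != recorded:
            status[path] = "changed"
        else:
            status[path] = "ok"
    return status


if __name__ == "__main__":
    # Demonstration with a throwaway file standing in for a source dataset.
    with tempfile.TemporaryDirectory() as workdir:
        data_file = os.path.join(workdir, "source_data.bin")
        with open(data_file, "wb") as handle:
            handle.write(b"original content")
        manifest = {data_file: sha256_of(data_file)}
        print(check_manifest(manifest))

        # Simulate a silent change at the external data center.
        with open(data_file, "wb") as handle:
            handle.write(b"modified content")
        print(check_manifest(manifest))
```

A check of this kind only reports that something changed; restoring the original version still depends on a long-term archive holding it.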
New collaborative data management approaches and technologies should facilitate such efforts, in which a collaboration with the WDS International Technology Office (WDS-ITO), established in 2018, might play an important role. An example of such a collaboration could be the development of compute services, which can be used to reduce the transferred data volume or to derive specific climate parameters that are not part of the archived DDC data.
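To illustrate how a compute service could reduce the transferred data volume, the following sketch subsets a global latitude-longitude grid to a regional bounding box and returns a single area-weighted mean instead of the full field. It is a hypothetical, pure-Python illustration under simplifying assumptions (regular grid, cell centres given in degrees, weights proportional to the cosine of latitude). The function name `regional_mean` and the toy grid are invented for the example and do not describe an existing DDC or WDS-ITO service.

```python
import math


def regional_mean(field, lats, lons, lat_bounds, lon_bounds):
    """Area-weighted mean of `field` inside a latitude-longitude box.

    `field[i][j]` is the value at latitude `lats[i]` and longitude `lons[j]`.
    Weights are cos(latitude), a common approximation for regular grids.
    """
    lat_min, lat_max = lat_bounds
    lon_min, lon_max = lon_bounds
    weighted_sum = 0.0
    weight_total = 0.0
    for i, lat in enumerate(lats):
        if not lat_min <= lat <= lat_max:
            continue
        weight = math.cos(math.radians(lat))
        for j, lon in enumerate(lons):
            if lon_min <= lon <= lon_max:
                weighted_sum += weight * field[i][j]
                weight_total += weight
    if weight_total == 0.0:
        raise ValueError("no grid cells inside the requested region")
    return weighted_sum / weight_total


if __name__ == "__main__":
    # Toy global grid: 5 latitudes x 4 longitudes, values equal to latitude.
    lats = [-60.0, -30.0, 0.0, 30.0, 60.0]
    lons = [0.0, 90.0, 180.0, 270.0]
    field = [[lat for _ in lons] for lat in lats]

    # The user receives one number for the region instead of the 20 grid values.
    print(regional_mean(field, lats, lons, (20.0, 70.0), (0.0, 100.0)))
```

The design point is that the reduction runs next to the archive, so only the derived quantity crosses the network.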

Support for IPCC authors and data users
The IPCC DDC provides data user support as part of its data dissemination function. Regarding the IPCC users, WDCC/DKRZ provides an additional storage media mailing service for developing and economy-in-transition countries with limited Internet bandwidths. Additionally, TGICA and the DDC have worked together to develop and disseminate guidance materials supporting data usage on key topics such as the data content, data usage, and scenario development.
To support the IPCC AR6 authors, Virtual Workspaces (VMs) have been set up at CEDA and DKRZ based on the author experiences from AR5, feedback obtained from the 2016 IPCC expert meeting on the future of TGICA (IPCC-43, 2016), and requirements collected from AR6 contributors. In the VMs, IPCC authors gain fast and easy access to the extensive climate data archives held at CEDA and WDCC/DKRZ and can share scripts and results. With a portion of the data evolving during the assessment period, core climate projection and observation data, as well as possibly scripts and core results of the IPCC authors, will need to be available on both sites in the latest versions. Thus, an exchange of scripts and core results of the IPCC authors is required for consistency between the VMs at the two sites.
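The consistency requirement between the two sites can be pictured as a comparison of version manifests. The sketch below is a hypothetical illustration, not the mechanism used at CEDA or DKRZ. It assumes that each site can list its datasets with a version label that sorts correctly as a string (here, date-style labels such as `v20190101`), and it reports what each site would need to fetch from the other. The function name `plan_sync` and the dataset identifiers are invented for the example.

```python
def plan_sync(site_a, site_b):
    """Compare two {dataset_id: version} manifests.

    Returns (to_a, to_b): the datasets and versions that site A must fetch
    from site B, and those that site B must fetch from site A.
    Version labels are compared as strings, so they must sort correctly.
    """
    to_a, to_b = {}, {}
    for dataset_id in sorted(set(site_a) | set(site_b)):
        version_a = site_a.get(dataset_id)
        version_b = site_b.get(dataset_id)
        if version_a == version_b:
            continue  # both sites already hold the same version
        if version_a is None or (version_b is not None and version_b > version_a):
            to_a[dataset_id] = version_b
        else:
            to_b[dataset_id] = version_a
    return to_a, to_b


if __name__ == "__main__":
    # Hypothetical manifests for the two sites.
    site_a = {"tas_monthly": "v20190101", "pr_monthly": "v20180601"}
    site_b = {"tas_monthly": "v20190301", "pr_monthly": "v20180601",
              "psl_monthly": "v20180101"}
    to_a, to_b = plan_sync(site_a, site_b)
    print("site A needs:", to_a)
    print("site B needs:", to_b)
```

The same comparison could cover shared scripts and core results, provided they carry version labels of the same kind.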
The specification of the core data was made, on the one hand, based on the analysis of AR5 data download and data usage information and, on the other hand, derived from tool requirements and consultations with national and international users. For example, the data needs of the model evaluation tool ESMVal are integrated in the core data specification at WDCC/DKRZ (http://bit.ly/2oNXdNs).
The main DDC web service at www.ipcc-data.org has also been moved to a public cloud hosting service in order to provide the increased level of service which is important now that it is acting more as a networking hub. Source pages, documents and other artifacts used to maintain the site have been moved into the GitHub cloud collaboration environment.

Discussion and Conclusion
The IPCC assessment must be comprehensive, rigorous and transparent, and the provision of data and related data services needs to be reliable and long term. The assessment process is guided by the principles of openness, transparency and traceability. For data handling, there are significant opportunities for improvement in AR6 and future assessment cycles. The DDC intends to work more closely with the IPCC TSUs, IPCC authors, and TG-Data than in the past to ensure that the highest priority needs are met in ways that are consistent with best practice in research data management. This can be accomplished by taking advantage of rapidly evolving data management approaches and technologies, and building on recently established standards for Trustworthy Data Repositories. In addition, the DDC needs to explore how to expand the expertise and resources available to support IPCC needs; for example, by working collaboratively with existing networks such as WDS and with key research institutions and organizations.
The remit of WDS is to coordinate and support trustworthy scientific data services in the long-term provision, use, and preservation of quality-assured scientific data and data services, products, and information, while strengthening their links with the research community. WDS Members are therefore well placed to work with the DDC in ensuring the data legacy of the IPCC assessments. Furthermore, a strategic concern of the WDS Scientific Committee (WDS-SC), the governing body of WDS, is that there is currently no strict data policy for secondary or derived data resulting from analyses of CMIP simulations and cited in the IPCC assessments: thousands of researchers are doing detailed analyses, but the data and the analytical tools used to generate the data may not be archived appropriately. Hence, WDS Members have the potential to offer this archival function, and the WDS-SC is in a position to provide specific guidelines to the analytical community on the documentation and storage of the forward model protocols and code being used.
A major challenge for IPCC service providers, including the DDC, is that the IPCC relies on national funding to develop and maintain infrastructure components and data-related services at the high quality required. This carries the risk of reduced IPCC DDC services if funding streams decrease or are interrupted. The CoreTrustSeal requirements for WDS Regular Members address the problem of interrupted funding and center closure. If this were to occur, the data would be transferred to another data center, but the services and support for the data, which are indispensable for data reuse, might not be maintained to the same degree by the new data center, as they are largely dependent on experience with this domain-specific data. Other challenges for the DDC lie in the increasing data volumes and complexity, and in IPCC WG use of a broader range of data, for example regional or impact-related data.
In the short term, it is worth exploring whether additional countries would be willing to support these activities, through (say) their relevant in-country WDS Members. In the medium term, it may be important to seek additional sources of support to, for instance, increase participation of data centers or data experts from developing countries in DDC activities.
Driven by the needs of the scientific authors, and restricted by available funding resources, the strict IPCC AR6 timeline, as well as quality requirements, the DDC support group aims to improve both the data traceability and support of the IPCC authors for the IPCC Sixth Assessment.