Improving Opportunities for New Value of Open Data: Assessing and Certifying Research Data Repositories

Investments in research that produce scientific and scholarly data can be leveraged by enabling the resulting research data products and services to be used by broader communities and for new purposes, extending reuse beyond the initial users and purposes for which the data were originally collected. Submitting research data to a data repository offers opportunities for the data to be used in the future, providing ways for new benefits to be realized from data reuse. Improvements to data repositories that facilitate new uses of data increase the potential for data reuse and for gains in the value of open data products and services that are associated with such reuse. Assessing and certifying the capabilities and services offered by data repositories provides opportunities for improving the repositories and for realizing the value to be attained from new uses of data. The evolution of data repository certification instruments is described and discussed in terms of the implications for the curation and continuing use of research data.


INTRODUCTION
Society invests in science and also benefits from the results of scientific research activities that have been conducted. As part of scientific and scholarly research, significant investments are made to collect data, which enables new activities to build on the efforts of earlier work. Studies are conceptualized, investigative teams are formed, needed resources are identified, support is acquired, procedures are planned, instruments are designed and produced, and experiments and observations are conducted to collect the data. Often, data collection and other research investments are supported by academic institutions, government agencies, or philanthropic organizations. After investing so much to obtain data, it is important for the data to be made publicly accessible as open data, if possible, so that researchers, planners, decision-makers, educators, students, and the general public have an opportunity to use the data without restrictions and realize the continuing value of such investments.
In addition to providing evidence of the methods and the results of a research study to facilitate verification of the conclusions, open data can be reused by subsequent investigations for new purposes. New hypotheses are investigated and new research questions are pursued if the data are freely available for other investigators to use, as revealed by data citations in various disciplines, despite inconsistent data citation practices (He & Nahar, 2016; Khan, Pink & Thelwall, 2020; Khan, Thelwall & Kousha, 2019; Robinson-García, Jiménez-Contreras & Torres-Salinas, 2016). Researchers also could use available data to conduct replication, reproduction, and comparison studies. Furthermore, new open data products and services could be created by integrating freely accessible open data with other open data that also are freely available for use. Leveraging such opportunities, the scientific community and non-scientists will continue to realize the benefits of data collection efforts long after the publication of the initial results of the studies that were conducted to collect the data. The producers of such data also can receive additional recognition for their research contributions as the use of their data enables new studies to be conducted and new data products and services to be created, fostering more research. By curating, preserving, and enabling new uses of data by additional users, the value of the initial investment of limited research funds also continues to be extended to propagate new studies, new results, and new benefits to society. In effect, benefits from open data sharing are engendered by both creators and users of data.

SOCIETAL VALUE OF OPEN DATA
Many datasets can be reused for purposes that were not necessarily foreseen by those who originally collected the data. Open data that are preserved, curated, and disseminated to foster diverse uses offer more opportunities to increase the value of the data through reuse. Such reuse can include non-scientific activities as well as those that contribute to science and scholarship. Researchers from the original data collection team or from the same discipline might want to extend the research that was conducted when the data were initially collected and analyzed. Scholars and researchers from different disciplines may see new opportunities for data analyses to investigate phenomena that were not necessarily considered when collecting the data originally. New investigations may leverage the potential for integrating the data with data products collected within other disciplines to create entirely new data products and services. The potential for new value being derived from data increases with each act of reuse.
In addition to members of the research community deriving new value from data, other members of society also have the potential to derive new value from data. Journalists, educators and students, planners and policy-makers, representatives of commercial and non-profit entities, and members of the general public also generate value by reusing data. For example, planners and decision-makers are able to explore issues of interest that may be unrelated to the studies conducted by the original data producers. Educators have the ability to use the data to provide experiences for their students who may be studying techniques for analyzing data for various purposes. Members of the general public also may have interest in research data due to the location represented by the data or the time period when the data were collected, for example. Disseminating data products for use by broad audiences, including non-scientists, offers opportunities for shared understanding across society (Baker, Duerr & Parsons, 2015). Given such possibilities for new and diverse uses of data that leverage the initial investments in data collection, opportunities for enabling future use of the data should be considered. Facilitating data reuse by diverse audiences, including the general public, requires efforts to reduce barriers that might prevent reuse, including such concerns as navigability, interpretability, accessibility, and analyzability (Grand et al., 2016; Levin et al., 2016). Data repositories that enable the use of data by diverse audiences, as well as enabling use by the designated community, improve the potential usefulness of data and increase opportunities for contributing to both the scientific value and the societal value of data.
Downs Data Science Journal DOI: 10.5334/dsj-2021-001
If data are determined to have potential value for future use, creating opportunities for new uses enables future data users to explore ways to realize additional value from the previous investments that were made to collect the data. Documenting and submitting such datasets to a data repository that represents the domain of potential users offers current and future users opportunities to realize such value. Enabling additional uses by scientists from other disciplines, as well as non-scientists, heightens that value. Archival activities should facilitate secondary value by enabling new uses (Borgerud & Borglund, 2020; Schellenberg, 1956, as cited in Jaillant, 2019). Infrastructure that enables new uses further increases the value of data as a result of such second-order benefits of reuse (Carrera & Hoyt, 2006). As new users make new discoveries, new value for society also will be realized from using the data that were previously collected. With appropriate recognition for data sharing, submitting datasets to a relevant data repository also enables the original data producers to realize new value from their initial investment in the data collection effort (Krzton, 2018). Even though attitudes vary among scientists for sharing data that they have collected (Tenopir et al., 2018), data producers and others who have contributed to the collection and sharing of open data should receive rewards for such contributions, including recognition for their data collection and sharing efforts from journals, sponsors, and promotion committees (Bierer, Crosas & Pierce, 2017). Recognizing contributions that enable the use of open data also can serve as an incentive and an inspiration for open science (Burgelman et al., 2019; Perrier, Blondal & MacDonald, 2020).

ADDED-VALUE OF DATA REPOSITORIES
Data repositories, such as research data centers and scientific archives, contribute to the potential for new value by facilitating reuse of data through the management and dissemination of open data products and services. By depositing data in a repository for dissemination as open data, data producers contribute to open science and to the reuse of their data for new purposes. Data repositories, particularly specialist or domain repositories, facilitate the reuse of data within and across disciplines by offering extensive data curation and stewardship services (Boté & Térmens, 2019). Quality control, documentation, and peer-review also are necessary data curation functions of repositories that facilitate reuse (Koltay, 2020). When open data are described and curated, potential users have an opportunity to explore data products to determine their potential usefulness for a particular purpose. Such data stewardship services add value by enabling data to be reused for new research, decision-making, and learning activities that otherwise might not be possible. Recognizing the importance of data stewardship to facilitate reuse, data repositories work with both data producers and data users to ensure that data are reusable. Meeting the need to foster open science by enabling the reuse of open data, data producers select and deposit their data in a repository that they can collaborate with on data curation to ensure that their data will be reusable by broad audiences and create new value for society.

IMPROVING RESEARCH DATA REPOSITORIES
Selection of the data repository, where future reuse of the data will be enabled, also contributes to the new value to be attained from the initial investments in data collection. Within this context, the characteristics of the data repository, such as disciplines and communities served, data policies, costs of deposit and preservation, discovery and access capabilities, user support services, sustainability, and reputation, as well as other selection criteria, are important. In addition, data depositors may be interested in the trustworthiness of the repository that will be responsible for providing stewardship and enabling continuing access to the data that have been submitted for future use. Characteristics of TRUST, including "transparency, responsibility, user focus, sustainability, and technology" (Lin et al., 2020, 4), contribute to the capabilities of a data repository that provide the stewardship to facilitate reuse of digital data over time. Perceptions of reputation, societal commitment, and mission also may influence the trustworthiness of a data repository (Yoon, 2014). Knowing whether a candidate data repository has been independently assessed to be appropriately managing data for future use also can help data producers determine where to submit their data. Identifying a data repository that has been assessed and certified as trustworthy offers assurances to data depositors and other stakeholders that the repository has been independently evaluated in terms of its data stewardship capabilities and data curation practices.
Several instruments have emerged in recent decades to evaluate the trustworthiness of data repositories. Since the initial development of the Open Archival Information System (OAIS) Reference Model by the Consultative Committee for Space Data Systems in 2002 (CCSDS, 2002), and subsequent publication by the International Organization for Standardization (ISO) as ISO 14721:2003 and ISO 14721:2012, data repositories have strived to meet the requirements for becoming trustworthy (Garrett et al., 2015). Because the OAIS Reference Model is a framework and not an instrument for measuring or assessing data repository trustworthiness, several instruments for measuring the trustworthiness of data repositories have been developed since its initial ISO publication. Early assessment instruments that were developed include the Data Seal of Approval (DSA) (Dillo & de Leeuw, 2015); the NESTOR Seal for Trustworthy Digital Archives that was published as the Deutsches Institut für Normung (DIN) 31644 standard (Harmsen et al., 2013); TRAC, the Trustworthy Repositories Audit & Certification: Criteria and Checklist (CRL, 2007); and DRAMBORA, the Digital Repository Audit Method Based on Risk Assessment (Innocenti, 2007), as well as others.
While the CCSDS and the ISO were conducting the reviews of the draft ISO 16363 standard, Audit and Certification of Trustworthy Digital Repositories (2011), which also emanated from the OAIS Reference Model, the European Framework for Audit and Certification of Digital Repositories was signed in 2010 by representatives of the DSA, DIN 31644, and ISO 16363 to offer guidance for data repositories seeking certification (Callaghan et al., 2014; Klump, 2011). Subsequently, the International Council for Science (ICSU) World Data System (WDS), which had been established from the redesign of the World Data Centers and has since become the International Science Council (ISC) WDS, published Certification of WDS Members to describe the evaluation criteria and procedures for WDS membership (2012). The ISO review of the CCSDS draft was completed and published as ISO 16363:2012 (2012). Also, the Data Management Maturity (DMM) Framework was established by the American Geophysical Union (Stall, 2015). After many repositories had been certified in compliance with the WDS requirements and with the DSA requirements, these two assessment instruments were merged to create the Core Trustworthy Data Repositories Requirements that are offered by the CoreTrustSeal (L'Hours, Kleemola & de Leeuw, 2019; Rickards et al., 2016). Figure 1 depicts the evolution of several instruments that have been developed and used, internationally, for assessing trustworthy data repositories.
With many of these data repository assessment instruments, in addition to requesting an independent assessment or audit of a digital repository to determine compliance with the requirements of a particular instrument, repositories also can conduct a self-assessment. Using these instruments to conduct self-assessments enables repositories to determine whether they meet the requirements and to identify requirements for which additional capabilities need to be developed to attain compliance.
These assessment instruments are being used by repositories to improve their management, infrastructure, policies, and practices and, in many cases, to also obtain certification. Receiving certification provides data repositories with a credential that attests to their capabilities and affords their recognition as trustworthy data repositories. In addition to repository staff members recognizing the value of attaining certification to be designated as trustworthy (Donaldson et al., 2017), several incentives and benefits, both extrinsic and intrinsic, for obtaining digital repository certification have been reported in the literature (Lindlar & Schwab, 2018). In addition, most repositories that have attained certification as trustworthy have publicly declared their achievement by posting information about the certification on their website (Donaldson, 2020).
Characteristics of data also contribute to the potential for new value that would be realized from subsequent use. Attainment of the FAIR Principles, for "findable, accessible, interoperable, and reusable" research objects, also has been recognized as necessary for enabling use (Wilkinson et al., 2016, 3). From this perspective, the FAIR Principles also are relevant to the curation practices that enable future use and should be considered as integral when providing capabilities for data curation. Meeting the requirements associated with the FAIR Principles, in conjunction with certification of digital repositories, also has become a consideration for realizing the value of data (Koers et al., 2020; Mokrane & Recker, 2019). The evolution of digital repository certification requirements will need to consider the assessment of capabilities to meet aspects of both the TRUST Principles and the FAIR Principles. Other efforts, internationally, also contribute to the evolution of digital repository certification instruments, including, for example, the advent of maturity models for data (Peng et al., 2018). Similarly, other international efforts are influencing practices for improving data stewardship. The Global Earth Observation System of Systems (GEOSS) Data Management Principles (GEO, 2020) and the GEOSS Data Sharing Principles offer guidance to improve practices for the stewardship and sharing of data, particularly as open data (GEO, 2020).

CHALLENGES FOR THE EVOLUTION OF DATA REPOSITORY CERTIFICATION REQUIREMENTS
With the designation of the CoreTrustSeal data repositories requirements as the instrument that is being used for certification and re-certification of members of the WDS, several research data centers and other repositories around the globe are using these requirements to improve and certify their capabilities for managing and curating data (Core Certified Repositories, 2020). Even though the same CoreTrustSeal instrument is being used within many data repositories, the diversity of repositories and the differences among them often necessitate the application of individual approaches to attain compliance with the requirements for trustworthiness. For example, in serving its Designated Community, a repository may offer unique features to address particular disciplinary needs or even local context. As technology changes and community expectations increase, further innovation among data repositories also can be expected. By exploring and comparing the commonalities and differences among data repositories in terms of their efforts to attain capabilities for trustworthiness, recommended practices can be shared with more data repositories to meet the challenges of serving users of research data. Sharing of such practices offers opportunities for integration of technologies and for increasing efficiencies attained by the repositories that continually strive to improve their capabilities for enabling the use of data in a trustworthy manner. Data repositories that steadfastly improve their capabilities for using data will contribute to the evolving community of data repositories and to the new value that is gained from the use of the data that these repositories are sharing. The evolution of practices within the digital data stewardship community also will necessitate evolution of the instruments for assessing data repositories.
The CoreTrustSeal serves as an example of such evolution as it has since been revised in accordance with plans for reviewing the requirements every three years (CoreTrustSeal Standards and Certification Board, 2019). CoreTrustSeal also has begun exploring opportunities to further meet the needs of the data repository community by expanding the types of certification that would be offered. Since emanating from the merging of the DSA and the WDS requirements for certification, CoreTrustSeal has been certifying many domain repositories that each serve a specified designated community and meet the certification requirements. However, recognizing that generalist repositories and related service providers also need to be certified in recognition of their capabilities, CoreTrustSeal has been conducting discussions with the broader community of domain repositories, generalist repositories, and data service providers to identify the requirements and the associated certifications that could be offered to meet the needs of the broader community (CoreTrustSeal, 2020). Like the CoreTrustSeal, other instrument developers also may need to consider the potential of broadening the scope of their instruments to reflect the evolving infrastructure for digital data and related research resources.
Generalist repositories, such as institutional repositories and large digital repositories that serve broad audiences, including non-scientists, recognize the importance of being entrusted with the responsibility of providing stewardship for the digital objects that they are being asked to manage. Likewise, data service providers, such as those that offer storage, processing, access, or other services to fulfill digital curation needs during the data lifecycle, also recognize the importance of being a part of the chain of trust that can be relied upon when providing services to support digital stewardship. Obtaining formal certification could offer evidence of attaining trustworthiness while assisting the entity that is seeking certification to improve its practices to be worthy of such trust. Similarly, pursuing periodic recertification would provide such entities with opportunities to continually earn and renew that trust while improving to meet the evolving requirements that are necessary for any standard to progress with changes in expectations and technology. Navigating the differences in requirements for attaining certification as trustworthy will be challenging as domain repositories, generalist repositories, and related service providers refine their roles within the digital data stewardship industry.

CONCLUSIONS
For research processes to continue effectively within the digital realm, it is necessary to establish, maintain, and verify sustainable infrastructure for managing and enabling access to data so that such capabilities can support current and future investigations. Current and future certification requirements will need to be developed by the communities that contribute to and are served by data repositories throughout the data lifecycle. In addition to reflecting the short- and long-term objectives of data stewards, repository certification requirements also could reflect the evolving needs and workflows of data producers and users, as well as the perspectives of repository developers and service providers, tool developers, research institutions, sponsors, and publishers, for example. Improved channels of communication among such stakeholders, across repositories, disciplines, and international borders, can help to identify common needs for improving the certification of data repositories, along with ways to address such needs by working together to improve the value of data. Similar to peer-review of research objects and institutions, community-review of data repositories offers opportunities to contribute to the capabilities needed to support scientific progress and the reuse of open data into the future.
The evolution of digital repository certification instruments appears to be contributing to the research data community's ability to navigate the abundance of issues associated with the development and provision of infrastructure for managing and enabling continuing use of data. Certification instruments are being improved over time and these, in turn, are being used to improve policies, procedures, skills, tools, services, and other capabilities for data repositories to enable data reuse on an ongoing basis. Research on the assessment and certification of data repositories also is needed to contribute to the improvement of approaches for managing and sharing data. Education, training, and the provision of learning resources that emphasize the importance of data stewardship policies and practices also are needed to improve understanding about the capabilities for enabling current and long-term use of data products and services by diverse communities, including those from disparate disciplines, as well as the general public.
The continuing evolution of instruments and processes for certifying data repositories is necessary to ensure that data are managed effectively and are available as open data for reuse and for attaining the value that can result from reuse. As certification instruments and processes evolve, it will be necessary for data repositories to leverage community resources to improve while meeting the expectations of their user communities. Along with the new value that is being derived from the reuse of data, costs are incurred for enabling the reuse of data. Recognizing that data repositories are a critical part of the global research infrastructure and that diverse capabilities are needed to support the reuse of data, how might the research community and sponsors further nurture the development, operation, and certification of these facilities?
In consideration of the kinds of funding often obtained for operating research data repositories, it also will be necessary to maintain a balance between the costs of obtaining and maintaining certification and the costs of operating a repository so that the process is feasible, without creating excessive barriers for repositories to be certified. Such costs include improvements needed to prepare and apply for certification as well as the cost of any fees for procuring certification. Increases to current requirements could be identified through community participation so that infrastructure improvements for meeting new requirements can be planned for and conducted incrementally to fit within the budgets and resource constraints of data repositories, especially data repositories that rely on grants and contracts to sustain their operations. Similarly, the costs of certification fees cannot be so large as to prohibit small data repositories from obtaining and periodically renewing certification. If the costs of meeting new requirements and obtaining certification exceed the budgets of research data repositories, such repositories will be at risk of being eclipsed by large ventures that have the resources to incur the technological, administrative, and financial costs that are beyond the reach of smaller data repositories.
In addition, the evolution of users' expectations, technologies, standards, certification instruments, and requirements will place demands on repositories that will need to be met with improved efficiencies, both in operations and in the development of improvements. Data curation, stewardship, and usage capabilities also will need to be enhanced. Data repositories also will need to reduce the potential resource demands that stem from continuous improvement and efforts to maintain certification by sharing resources within and across repository communities. Resource sharing offers an opportunity to reduce costs, improve efficiencies, and leverage economies of scale while improving capabilities to serve broader communities along with the designated community of users. Options for resource sharing include utilizing common infrastructure, services, and tools, as well as collaborating and sharing knowledge and competencies with other members of the international data repository community. In light of the potential benefits of resource sharing, how can resource sharing be facilitated among data repositories in an efficient and mutually beneficial manner?
Cooperation among other community stakeholders also can contribute to the sustainability of repositories and to the value that can be attained from the reuse of data. Research sponsors could recognize the need to provide sustainable infrastructure for managing and facilitating the reuse of data, especially those data whose collection and initial use they already have invested in supporting. Improving the capabilities and sustainability of repositories can facilitate the reuse of data by a variety of future users, beyond the disciplines represented by the data producers. Considering the additional gains that can be achieved by leveraging initial investments in data, how can sponsors ensure that repositories provide sustainable services to improve the value of data?
Furthermore, in addition to publishing reports of their work, data producers should have an opportunity to make informed decisions to facilitate the future use of their data. Early planning to begin preparing data for reuse and to select a repository can reduce the need for hasty decisions later, enabling data producers to choose a repository that they can collaborate with to foster broad reuse of their data by new users as open data. Recognizing that workflow changes are needed, how might journals, especially data journals, improve guidance to data producers on the selection of a data repository for the dissemination of data for reuse as open data?
A change in the research culture also is needed to facilitate the reuse of open data and foster open science. The value of open data is an outcome of a research ecosystem that requires participation by various global stakeholders that contribute to and benefit from scientific and scholarly information resources. In addition to contributing to the production, curation, preservation, dissemination and support of open data and related research resources, members of the ecosystem have an opportunity to contribute to the individual efforts that enable value to be derived from open data. Scientific and scholarly communities, as well as members of society, can recognize the new value that is generated by producing, curating, and sharing open data when they are deciding on the allocation of resources and rewards. Recognition of the value of open data also includes proper data citation when data have been used to produce scholarly and scientific publications. Furthermore, review panels and promotion committees can contribute to the value of open data by acknowledging and recognizing data producers, curators, sponsors, repositories and hosting institutions for the contributions to science, scholarship, and society that are realized by the production, curation, and sharing of open data. Such acknowledgement and recognition could incentivize the contributions of stakeholders who contribute to the value of open data throughout the data lifecycle. Considering the need for cultural change across many institutions to further improve the value of data, we might ask ourselves the following question. How can we propagate needed changes throughout the research community to recognize and incentivize activities that contribute to the value of data to science, scholarship, and society?