1. Introduction

Harvestable metadata services are an effective, established and widely used approach to promoting data discovery and sharing across broad communities of potential data users and across multiple disciplines (; ). For the purpose of this study, we understand harvestable metadata as a set of metadata records in a standardized format and schema that is shared with aggregation services by means of specific, standardized protocols for metadata transfer. In this paper, we describe examples of ongoing and successful implementations of harvestable metadata services, which apply emerging and established standards and community practices to fit local and domain-specific research data management contexts. These use cases originated from the Harvestable Metadata Services Working Group (HMetS-WG), which met frequently in a series of working sessions over 6 months during 2020, followed by occasional meetings during 2021. The study offers an overview of the infrastructures, standards and communities of the repositories that were members of the HMetS-WG, together with a wider-ranging discussion of the challenges that repositories may face when developing data services such as harvestable metadata.

Taking a qualitative approach, this study explores issues that repositories face when implementing harvestable metadata services. We start with a description of use cases, focusing on each repository’s technical features, along with the challenges encountered in pursuit of repository-defined and community-oriented service development goals. Repositories are also characterized by the subject and disciplinary areas covered, targeted user groups, and services offered. The full-length profiles for each repository are described as use cases by Urquidi Diaz et al. (). After examining the use cases within the context of the current literature on recommended practices for metadata syndication and pathways toward interoperability, we present a set of common characteristics and challenges described by the repositories in this study. These experiences involved making decisions about which technologies to develop for an often heterogeneous and dynamic user base, within an evolving technological landscape, in order to implement data and metadata services that fit within the resource and policy constraints of the repository.

Metadata harvesting, standards and protocols

In a typical metadata sharing process, a research data repository will share a catalogue of assets: a collection of metadata records that describe each dataset, which are typically accessible through a search interface on the repository’s portal. The repository may also share a set of standardized metadata records via additional access points (or harvestable metadata services), using a metadata transfer protocol through which aggregation services, such as harvesters, obtain the metadata (see Figure 1). Persistent links to data landing pages at the host repository are typically contained in those records. An aggregator may then convert (re-format or cross-walk) the acquired records into a unified display standard, to be disseminated by means of a federated metadata catalogue or a federated search engine. Examples of metadata harvesters that target research data include the Canadian Federated Research Data Repository (FRDR), and B2FIND (Europe).

Figure 1 

The metadata harvesting process. Standardized metadata is harvested from repository catalogues, then processed by an aggregation service. The service disseminates the metadata records through a search and discovery portal and/or by serving it to further aggregation services for distribution.
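
As a minimal sketch of the process shown in Figure 1, the following Python example builds an OAI-PMH ListRecords request, parses a response, and cross-walks each Dublin Core record into a simplified display record. The endpoint URL, the sample response, and the target field names (source_id, title, creator, landing_page) are assumptions made for illustration; they do not describe any repository in this study, and no network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical endpoint, used for illustration only.
BASE_URL = "https://repository.example.org/oai"

# Namespaces defined by OAI-PMH 2.0 and the oai_dc (Dublin Core) format.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query

# Invented sample standing in for the response a harvester would fetch.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:ds-001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example geomagnetic dataset</dc:title>
          <dc:creator>Example Observatory</dc:creator>
          <dc:identifier>https://repository.example.org/datasets/ds-001</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()

def crosswalk(record):
    """Map one oai_dc record onto a simplified, hypothetical unified schema."""
    def first(tag):
        element = record.find(f".//dc:{tag}", NS)
        return element.text if element is not None else None

    return {
        "source_id": record.findtext("oai:header/oai:identifier", namespaces=NS),
        "title": first("title"),
        "creator": first("creator"),
        # The link back to the landing page at the host repository.
        "landing_page": first("identifier"),
    }

def harvest(xml_text):
    """Parse a ListRecords response and cross-walk every record in it."""
    root = ET.fromstring(xml_text)
    return [crosswalk(record) for record in root.iterfind(".//oai:record", NS)]

request_url = list_records_url(BASE_URL)
records = harvest(SAMPLE_RESPONSE)
print(request_url)
print(records)
```

A production harvester would also need to follow OAI-PMH resumption tokens and handle deleted records, which this sketch omits.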

The adherence to shared standards and community practices is a key tenet for successful digital research infrastructure (DRI) integration and interoperability (; ; ). Common standards for harvestable metadata include the Dublin Core (), DataCite () and ISO 19115 () metadata schemas, as well as protocols for transferring metadata, like the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) () and the Open Geospatial Consortium Catalogue Service for the Web (OGC-CSW) (). Metadata records are usually transferred as eXtensible Markup Language (XML) or JavaScript Object Notation (JSON, and JSON-LD, for linking data within semantic metadata) files. Another approach to syndicating metadata uses semantic metadata tags, such as Schema.org, that are placed in the HTML of dataset landing pages on a repository’s web portal, or in separate metadata files. This strategy relies on web crawlers (such as Google) parsing the semantic metadata to aggregate and index the landing pages for search engine retrieval. Even though this approach to metadata sharing can complement harvestable metadata services, the semantic strategy was not pursued by the HMetS-WG. Instead, WDS-ITO engaged members of various communities in a separate initiative to develop semantic metadata using Schema.org ().
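
To illustrate the semantic approach described above, the following sketch generates a Schema.org Dataset description as JSON-LD and wraps it in a script element of the kind that can be placed in the HTML of a landing page. The dataset values and the helper names (dataset_jsonld, script_tag) are hypothetical; only a small subset of Schema.org properties is shown, and the sketch is not drawn from any repository in this study.

```python
import json

def dataset_jsonld(name, description, url, keywords):
    """Return a Schema.org Dataset description as a JSON-LD string."""
    document = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": keywords,
    }
    return json.dumps(document, indent=2, ensure_ascii=False)

def script_tag(jsonld):
    """Wrap JSON-LD in a script element for embedding in a landing page."""
    # Escape "</" so that the payload cannot close the script element early.
    safe = jsonld.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + safe + "\n</script>"

# Invented example values.
example = dataset_jsonld(
    name="Example geomagnetic indices",
    description="Illustrative record, not a real dataset.",
    url="https://repository.example.org/datasets/ds-001",
    keywords=["geomagnetism", "space weather"],
)
print(script_tag(example))
```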

As the importance of reusing data is increasingly recognized across disciplines, data repositories have proliferated to meet this demand, with the number of data repositories listed in repository registries, such as the Registry of Research Data Repositories (Re3Data), growing rapidly (). Also, with each data repository offering additional open data products, finding a particular dataset of interest becomes challenging for potential data users (; ). A recognized approach to address this challenge is for repositories to establish capabilities for harvesting metadata to facilitate searchability and global discoverability (; ; ). Furthermore, the availability of harvestable metadata is an indicator of dataset findability (; ) and contributes to repository TRUST-worthiness as it improves integration with the wider data management community ().

2. Research Methodology

Data Collection

In September 2019, the WDS-ITO invited WDS member repositories to participate in the HMetS-WG and to (optionally) serve as use cases for the study. The invitation was sent to 35 unique WDS member organizations that had previously expressed interest in being informed about new WDS initiatives. Nine WDS member repositories participated in the group (Table 1), and seven (Table 2) were adopted as use cases. Over the course of the group’s sessions, the repositories presented an overview of their infrastructure, data holdings, and services. All participating repositories also provided the group with a schematic overview of their features. Subsequently, three repositories (NSSDC, INTERMAGNET and SEDAC) completed the implementation plan template, described below, and shared it with the WDS-ITO.

Table 1

HMetS-WG participants, with WDS membership type and host institutions.


| WDS MEMBER | TYPE | HOST INSTITUTION(S) |
| --- | --- | --- |
| Centre de Données Astronomiques de Strasbourg (CDS) | Regular | Strasbourg Astronomical Observatory (ObAS); University of Strasbourg; French National Centre for Scientific Research (CNRS) |
| Global Change Research Data Publishing and Repository (GCdataPR) | Regular | Institute of Geographical Sciences and Natural Resources Research (IGSNRR), Chinese Academy of Sciences (CAS); Geographical Society of China |
| International Real-time Magnetic Observatory Network (INTERMAGNET) | Network | Multiple institutions (worldwide) |
| International Service of Geomagnetic Indices (ISGI) | Regular | School and Observatory of Earth Sciences (EOST); University of Strasbourg; French National Centre for Scientific Research (CNRS) |
| International GNSS Service (IGS) | Network | Multiple institutions |
| National Space Science Data Center (NSSDC) | Regular | National Space Science Center (NSSC), Chinese Academy of Sciences (CAS) |
| Socioeconomic Data and Applications Center (SEDAC) | Regular | Center for International Earth Science Information Network (CIESIN), Columbia University; Earth Observing System Data and Information System (EOSDIS), National Aeronautics and Space Administration (NASA) |
| World Data Center for Geomagnetism (Edinburgh) | Regular | British Geological Survey (BGS) |
| World Data Centre for Renewable Resources and Environment (WDC-RRE) | Regular | IGSNRR; CAS |

Table 2

Subject areas represented by repositories and target user groups. Subject areas were provided to WDS-ITO by the repositories.


| REPOSITORY | SUBJECT AREAS | USER GROUPS |
| --- | --- | --- |
| GCdataPR | Agriculture, Area studies, Earth sciences, Economics, Environmental studies, Forestry, Geo-ecosystems, Geography, and History | Global change students, researchers, policy makers and society in China and worldwide |
| IGS | Earth sciences, Geodesy, GNSS, GPS, Precise positioning, Navigation, Timing, and Space sciences | Mainly IGS staff, project and working group participants. More broadly: worldwide users of modern mapping, orientation and navigation systems, enterprises, non-profits, institutions and government actors |
| INTERMAGNET | Earth sciences, Geomagnetism, Space sciences | Scientific community, geomagnetism community, members of IAGA, commercial users |
| ISGI | Solar-Terrestrial physics, Space weather-Space Climate, Space sciences, Earth sciences, Geomagnetism | Academia (including behavioral biology), members of IAGA communities, private and public sectors (military, telecommunications, satellite operators) |
| NSSDC | Astronomy, Computer sciences, Planetary science, Space physics, Space sciences, Space weather | Typical users are Chinese and international researchers in subject areas |
| SEDAC | Agriculture, Architecture and design, Anthropology, Area studies, Business, Chemistry, Climate science, Computer sciences, Cultural and ethnic studies, Earth sciences, Economics, Engineering, Environmental science, Environmental and forestry studies, Geography, Health sciences, Information system science, Political science, Sociology, Statistics, Sustainability science, Systems science, Transportation | User community interested in studying human interactions in the environment |
| WDC-RRE | Earth sciences, Ecology, Environmental studies and forestry, Geography, Geoinformatics, Natural resources | Mainly academic researchers and students, also scientific staff and technicians, general public, government agencies, policy makers, and international organizations |

The group’s work agenda was initially guided by a workflow structure proposed by WDS-ITO (Figure 2), which represents harvestable metadata services development as a set of discrete, successive steps. As group discussions progressed, WDS-ITO provided members with a Harvestable Metadata Services Implementation Plan template () to describe their implementation plans. The template was inspired by and borrowed heavily from the CESSDA-Saw guidance package () and the JISC project plan templates () for data service planning, which were designed to be adapted to specific use cases for a single service or a subset of services, such as harvestable metadata services. Both of these resources also include guidance for drafting implementation plans for these types of services (; ), which also informed the development of the template. Supporting information resources also included an interactive Twine narrative that walks through the implementation plan flowchart (), and a Zotero library with resources related to harvestable metadata services (). While the questions derived from the workflow structure guided initial HMetS-WG discussions, the availability of these additional resources, along with the individual repository overviews and implementation plans, facilitated broader discussions of implementation issues among the HMetS-WG repositories.

Figure 2 

Flow-chart diagram of a typical harvestable metadata services implementation (). This diagram gives a schematic representation of the steps involved in creating a harvestable metadata service. The HMetS-WG used these steps to scaffold the group’s initial work.

Data Processing and Analysis

Building on the initial discussions of the workflow questions, the subsequent broader discussions among the HMetS-WG repositories further contributed to the development of detailed repository profiles, which are accessible online as use cases (). Where available, the profiles reference the repositories’ technical documentation and other relevant publications to provide informative use cases. Urquidi Diaz et al. () described the following characteristics of the repositories within the use cases:

  1. Institutional overview: Brief description of the repository’s institutional context: Its governance, history, mandate, mission, memberships, and other organizational features.
  2. User community: Target communities for repository services.
  3. Infrastructure overview: Description of repository’s data holdings and technical infrastructure for service provision.
  4. Current state of metadata: Metadata formats, standards used, and metadata services (if any).
  5. Planned development: Plans for future development.
  6. Resources: Description of repositories’ sources of support and financing.
  7. Challenges: Initially, each repository described the challenges they have faced in developing harvestable metadata and other data services on their platforms.

Discussing the institutional overviews and implementation plans, as well as the compiled information resources, in terms of their applicability to repository practices contributed to understanding the current state of the repositories’ implementation issues. While differences across the repositories were observed, the discussions also identified similarities among the challenges that the repositories represented in the HMetS-WG faced when considering the development of harvestable metadata services. Recognition of these similarities led to a consensus on the challenges that the participating repositories face in developing and deploying harvestable metadata services.

3. The HMetS-WG Set of Use Cases

3.1. Repositories participating in the working group

The host institutions of the participating repositories (see Table 1) were based in China, the UK, the US, and France. Seven repositories were Regular Members of the WDS and two were Network Members ().

3.2. Research areas and target user communities

The research areas served by the repositories are predominantly oriented toward the Earth and planetary sciences. Social sciences, including environmental and economic sciences, are also strongly represented. As described by Urquidi Diaz et al. (), three repositories can be classified broadly as social and environmental science research centers that focus on spatial data: The World Data Centre for Renewable Resources and the Environment (WDC-RRE), Global Change Data Publishing & Repository (GCdataPR), and the Socioeconomic Data and Applications Center (SEDAC). Two repositories, the Chinese National Space Science Data Centre (NSSDC) and the International GNSS Service (IGS), can be categorized as representing astronomy and geodesy (). Lastly, the International Real-time Magnetic Observatory Network (INTERMAGNET) and the International Service of Geomagnetic Indices (ISGI) are dedicated to managing and sharing geomagnetic research data and related data products ().

3.3. Repository features

Table 3 gives an overview of each repository’s technical features: the type of repository platform and catalogue service used, metadata standards and protocols, and a list of any current, known aggregators of their metadata assets. Figure 3 presents the metadata exchange protocols utilized by the repositories studied, in the context of those of the larger WDS membership, as surveyed in 2019 by the WDS-ITO (). Relative to WDS members previously surveyed, the repositories in the use cases have, or plan to develop, more OGC-CSW and OpenSearch services, and fewer OAI-PMH services (). It should also be noted that the WDS member survey data reported by Payne and Urquidi Diaz () do not distinguish between protocols residing within repositories and those that are provided by aggregators, such as the Earth Observing System Data and Information System (EOSDIS) and the Global Earth Observation System of Systems (GEOSS), that disseminate metadata on behalf of repositories.

Table 3

Use Case Infrastructures: Summary of Features.


| REPOSITORY | REPOSITORY PLATFORM & CATALOGUE | METADATA STANDARDS | METADATA SERVICE PROTOCOLS | KNOWN AGGREGATORS |
| --- | --- | --- | --- | --- |
| GCdataPR | Custom GCdataPR 2.0 | DCI, DataCite | OpenSearch | CrossRef, China-GEOSS, CNKI, DCI, CSTR, ScienceEngine |
| IGS | Catalogue via NASA CMR; • Developing new discovery platform | DIF 10, ECHO 10, ISO 19115-2:2009 (MENDS and SMAP dialects), UMM-C | CMR CSW, CMR public APIs, OpenSearch | via NASA’s CMR |
| INTERMAGNET | Custom repository, with some datasets on GFZ Potsdam data repository | Via INTERMAGNET: IAGA2002, CDF; Via GFZ: GeoJSON, DataCite, ISO 19115 | Via homepage: HTTP, FTP; Via GFZ: request to DataCite’s API | DataCite, FIDGEO |
| ISGI | Custom; • Public access metadata service | IAGA2002; • CERIF, DataCite, and/or DCAT based profiles and/or crosswalks | Via homepage: HTTPS; request to DataCite’s API | |
| NSSDC | Custom | NSSDC Core Metadata Specification, SPASE; • DataCite, Data model compatible with NSSDC | OpenSearch, OGC-CSW (via WDS China), Data search platform; • OAI-PMH | National Science and Technology Data Sharing Network of China, Scientific Data Center, CAS |
| SEDAC | Vital Digital Asset Mgt. System (Fedora); • Migrating to Drupal 8 | FGDC CSDGM, ISO 19115, DataCite | IDN OGC CSW, NASA CMR CSW, CMR public APIs, OpenSearch | DataCite, GEOSS (via EOSDIS/CMR) |
| WDC-RRE | Custom: Debian OS, OSS NGINX, PostgreSQL, TorCMS | Dublin Core, ISO 19115, custom Data Identification and Metadata Standards; • Revision planned | OpenSearch, OGC-CSW 3.0.0, OAI-PMH 2.0, SRU 1.1; • Geonetwork | WDS-China, CNKI |

Figure 3 

This bar chart compares the mechanisms for metadata exposure (aggregation, discovery, etc.) that were reported by the HMetS-WG repositories with those reported by the WDS repositories in a 2019 member survey (). Since some repositories reported serving their metadata via third-party services, these services also have been included (e.g. DataCite, EOSDIS, etc.). *Includes schema.org.

3.3.1. Participation in research data networks

As described in the sections below, it appears that participation in national, regional, as well as subject-specific networks has generally shaped the repositories’ infrastructure, particularly in the ways that their adoption of harvestable metadata services has developed or is being planned for development.

All of the studied repositories have been guided or supported by a larger entity while developing harvestable metadata services: INTERMAGNET and ISGI have participated in the European Open Science Cloud’s (EOSC) EPOS ERIC project, while WDC-RRE, NSSDC and GCdataPR have developed with support from Chinese research data institutions. One of GCdataPR’s data sources is its cooperation with journals to enable discovery. In 2015, GCdataPR initiated a tri-journal program to facilitate the publication of datasets, data papers and science discoveries. The three journals have worked closely with authors to publish discovery papers as well as datasets and data papers. Finally, both SEDAC’s and IGS’s infrastructures have been supported by the National Aeronautics and Space Administration (NASA) EOSDIS community and its extensive collections of knowledge and technical resources.

Geomagnetism data in Europe: the EPOS ERIC. Within the European geomagnetism community, the European Plate Observing System European Research Infrastructure Consortium (EPOS ERIC) has played a major role in promoting the uptake of 21st century technologies and standards to create more granular and robust metadata and dataset documentation (; ). Following EPOS ERIC’s leadership, ISGI plans to migrate the repository’s metadata records into an interoperable schema that will allow repositories to serve metadata to European aggregators like OpenAIRE. Currently, ISGI is considering implementing CERIF, DataCite, and/or DCAT compliant metadata. Since 2013, INTERMAGNET has been publishing yearly definitive data through the GFZ (GeoForschungsZentrum) Data Service, which serves dataset metadata to aggregators using various metadata standards and sharing protocols. Furthermore, a metadata development project is underway to gather metadata for all observatories recording geomagnetic data worldwide. This includes the INTERMAGNET geomagnetic observatories metadata combined with metadata records held by the WDC for Geomagnetism, Edinburgh.

The Chinese research data infrastructure. GCdataPR, NSSDC and WDC-RRE were among the original Chinese data repositories that joined the ICSU system of World Data Centers in 1988. In 2008, to promote collaboration between the eight Chinese repositories at the WDS, the WDS China Common Clearinghouse was created (). The prototype for the WDS China’s unified metadata search portal was constructed with pycsw, a Python implementation of the OGC’s Catalogue Services for the Web (CSW) specification (). This initiative, led by WDC-RRE, encouraged and supported WDS members to develop harvestable metadata services based on similar spatial data interoperability standards, notably ISO 19115/19139/19119 metadata and the OGC CSW protocol.
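
As an illustration of how a harvester might query a CSW catalogue of this kind, the sketch below builds key-value-pair GetRecords requests and pages through a result set. The endpoint URL is hypothetical, and the choice of CSW version 2.0.2, the csw:Record type name, and the ISO (gmd) output schema are assumptions that would need to be checked against the capabilities document of the actual service; no request is sent.

```python
from urllib.parse import urlencode

# Hypothetical catalogue endpoint, used for illustration only.
CSW_ENDPOINT = "https://catalogue.example.org/csw"

# Namespace commonly used to request ISO 19139-encoded records.
ISO_OUTPUT_SCHEMA = "http://www.isotc211.org/2005/gmd"

def get_records_url(endpoint, start=1, max_records=10, iso_output=True):
    """Build a CSW GetRecords request as a key-value-pair GET URL."""
    params = {
        "service": "CSW",
        "version": "2.0.2",          # assumed version; services may differ
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "elementSetName": "full",
        "resultType": "results",
        "startPosition": start,      # CSW positions are 1-based
        "maxRecords": max_records,
    }
    if iso_output:
        params["outputSchema"] = ISO_OUTPUT_SCHEMA
    return endpoint + "?" + urlencode(params)

def page_urls(endpoint, total_records, page_size=10):
    """Return one request URL per page needed to cover total_records."""
    return [
        get_records_url(endpoint, start=start, max_records=page_size)
        for start in range(1, total_records + 1, page_size)
    ]

first_page = get_records_url(CSW_ENDPOINT)
pages = page_urls(CSW_ENDPOINT, total_records=25, page_size=10)
print(first_page)
print(len(pages), "requests needed for 25 records")
```

In practice, the total number of matching records would be read from the first response rather than supplied by hand.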

Outside of the WDS, the Chinese repositories contribute to the larger Chinese digital research infrastructure, as part of 20 Chinese Data Centers organized under the National Science and Technology Infrastructure Center of China. The 20 national data centers provide their metadata collections on a regular basis to a unified metadata search portal operated by the National Science and Technology Data Sharing Network of China (). These records must comply with the Chinese Science and Technology Infrastructure Resource Core Metadata standard (). Furthermore, all metadata records held by the 20 national data centers, including NSSDC, must be registered in accordance with the Science and Technology Resource Identification (CSTR), GB/T 32843-2016 (), so that these metadata records can be discovered in the CSTR Identification platform. Another class of data repository publishes peer-reviewed datasets through a digital journal. The Global Change Data Repository is a digital journal (ISSN 2096-868X), which is issued monthly and is compatible with the Journal of Global Change Data & Discovery (ISSN 2096-3645), a journal for publishing data papers. The two journals and the data and knowledge hub (metadata-based links for specific applications) are part of the Global Change Research Data & Repository (GCdataPR). Through its publication methodology and procedures, the GCdataPR maintains long-term preservation and public availability of timely, quality and informative datasets. Both WDC-RRE and NSSDC also maintain custom metadata profiles that integrate local and international interoperability features. In addition, the China National Knowledge Infrastructure (CNKI) aggregates metadata from GCdataPR and WDC-RRE.

EOSDIS at NASA. Two of the repositories, SEDAC and IGS, are (at least partially) based in the United States, and they receive support from the National Aeronautics and Space Administration’s (NASA) infrastructure. As one of NASA’s Distributed Active Archive Centers (DAACs), SEDAC participates actively in initiatives stewarded by the Earth Science Data and Information System (ESDIS) project and SEDAC metadata is provided to NASA’s EOSDIS Common Metadata Repository (CMR). The CMR is the back-end of Earthdata Search, the Global Change Master Directory (GCMD), and the International Data Network (IDN), the latter of which transfers SEDAC metadata into GEOSS. The complete collection of IGS data, which is distributed across data centers, has one of two complete mirrors hosted by a NASA EOSDIS data center, the Crustal Dynamics Data Information System (CDDIS) (the second mirror is hosted by the European Space Agency). Thus, at present, metadata records for SEDAC datasets and for IGS collections are served in metadata search/retrieval endpoints at the CMR (), and they are available in multiple established metadata formats, specifically: DIF 10, ECHO 10, ISO 19115-2:2009 (MENDS and SMAP dialects), and UMM-C ().
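
To make the idea of format-specific retrieval endpoints concrete, the following sketch builds collection search URLs for the CMR in each of the formats named above. The base URL, the format extensions, the provider and page_size parameters, and the provider identifier are assumptions based on the publicly documented CMR search API and should be verified against the current CMR documentation before use; the sketch only constructs URLs and sends no request.

```python
from urllib.parse import urlencode

# Assumed base URL of the CMR collection search endpoint.
CMR_SEARCH = "https://cmr.earthdata.nasa.gov/search/collections"

# Assumed mapping from the formats named in the text to CMR URL extensions.
FORMAT_EXTENSIONS = {
    "DIF 10": "dif10",
    "ECHO 10": "echo10",
    "ISO 19115": "iso19115",
    "UMM-C": "umm_json",
}

def cmr_collections_url(provider, metadata_format, page_size=10):
    """Build a collection search URL for one provider and metadata format."""
    extension = FORMAT_EXTENSIONS[metadata_format]
    query = urlencode({"provider": provider, "page_size": page_size})
    return f"{CMR_SEARCH}.{extension}?{query}"

# "SEDAC" is assumed here to be the provider identifier used by the CMR.
for metadata_format in FORMAT_EXTENSIONS:
    print(cmr_collections_url("SEDAC", metadata_format))
```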

4. Challenges

As described in the Research Methodology section, analysis and discussion of the challenges that the HMetS-WG repositories face in developing and deploying harvestable metadata services led to consensus on the similarities among them. This consensus revealed three overarching themes for the common challenges: changing user needs, sustainability, and evolving technologies.

The three themes found for the challenges faced by the HMetS-WG repositories when developing and deploying harvestable metadata services are closely linked. Developing a good understanding of current and evolving technology trends and changing user needs, in light of existing and projected capabilities and resources, can help repositories to identify a sustainable approach for their new development efforts and reduce the risk of incurring the costs of expensive corrective measures in the future.

4.1. Changing user needs

The first major theme reflects repositories’ efforts to identify and meet the changing needs of the user communities that they serve. Such efforts include adopting standards that maximize metadata interoperability and deploying metadata schemas that are widely used, yet versatile and extensible enough to address the changing needs of the user community. Serving the needs of repository users, including data producers and data reusers, is one of the primary objectives of research data repositories. Continuing to provide services to the user community as its needs change is a key indicator of repository success.

Minimally, a research data repository exists to make a collection of data assets available to a designated community of users. Deploying harvestable metadata catalogues is a key strategy for reaching users, as these services can inform potential users and increase awareness of repository holdings. Such catalogues can be especially effective if they are tailored for interoperability with infrastructures (e.g. metacatalogues) that are highly visible, feature-rich, widely-used, and also themselves integrated within the larger ecosystem of research infrastructures.

4.1.1. New users, new challenges

As a repository shares data more widely, its users become more diverse and heterogeneous. Catering to these evolving user needs is one of the most salient challenges faced by the HMetS-WG data repositories.

ISGI and INTERMAGNET provide good examples of how users’ growing diversity may pose challenges to repositories, even those with well-established data-sharing cultures. Open data and sharing have always been essential for the geomagnetism community, as Earth observation research can rarely be done without data from multiple countries. In fact, geomagnetism’s established data-sharing tradition is evidenced by over 50 years of collaborative data practices, which have included yearly data publications and established, shared standards, e.g. the IAGA2002 Data Exchange Format (). At INTERMAGNET, participants are volunteer magnetic observatories which, following standards defined by the network, seek to share and confidently reuse geomagnetic data within the community. ISGI’s participants, in contrast, are institutes whose official task is defined by the International Association of Geomagnetism and Aeronomy (IAGA): to derive and make available officially endorsed data products. In recent years, the geomagnetism community has sought to achieve interoperability with other scientific fields of Earth and environmental observation, and to keep up with current trends to make data more usable, and also more useful, to a larger group of users, not only geomagnetism specialists. As we shall see below, both organizations have needed to factor in these developments when selecting their data and metadata sharing technologies.

Post-pandemic, data-driven regional economic development efforts involve new challenges, especially in rural areas, mountain regions, and small islands. In order to help regional stakeholders, including decision makers and small businesses, GCdataPR initiated the Geographical Indications Environment & Sustainability (GIES) program. By opening quality datasets, data papers, and metadata (physical geographical data, agricultural product data, socio-economic data and local culture information, as well as timely in situ ecosystem monitoring data), information about the geographical indications or specific agricultural products could be used by consumers. The GIES case clusters and practices demonstrated that this is an effective way for the repository to serve local people in attaining the 2030 Sustainable Development Goals (SDGs) ().

4.1.2. Stakeholder engagement, user outreach, adaptation of services

The repositories in this study have shown a clear user orientation, and most report an intent to serve diverse user communities: from the general public to industry data users, to researchers in highly specialized knowledge areas (see Table 2). Concerted outreach is regularly carried out among multiple groups of users and stakeholders, including current and potential users. Also, without exception, each of the repositories participates actively in various working groups and opportunities to exchange knowledge, within grassroots, top-down, or federated organizations. These include the WDS, the Research Data Alliance (RDA), the International Science Council’s Committee on Data for Science and Technology (CODATA), the American Geophysical Union (AGU), the European Open Science Cloud’s (EOSC) EPOS ERIC, the Group on Earth Observations (GEO), China-GEOSS, and the ESDIS system at NASA.

At IGS, for example, data services are being developed to meet the needs of new and established users (), such as those found within IGS itself, including product coordinators and participants in working groups, pilot projects or analysis centers (). But because all users of modern mapping, orientation and navigation systems are beneficiaries of the work done by IGS, the IGS Central Bureau has established various channels for outreach and communication (ibid.: 18) with the public and with individuals, enterprises, non-profits, institutions and government actors worldwide. These channels include social media outlets like Twitter, where IGS uses the #GNSS4impact hashtag to tweet about common applications of GNSS data. Part of the aim is to make the general public aware of this foundational yet invisible infrastructure. Making IGS’s work visible to the general public in ways that can be measured, such as through citation of IGS data, products, and other published outputs, helps IGS advocate for the organization and make a strong case to its supporting partners and funders ().

Repositories will also need to adapt the metadata that they distribute as the needs of the user communities that they serve change. In addition to revising the repository services offered, such as recommended uses and data formats, it may be necessary to adopt metadata standards and enhance metadata harvesting capabilities to reflect the knowledge and research interests of the new community segments and domains that are being served. For example, a repository may discover changes in the disciplines of its users by identifying the disciplines of publications and authors that are currently citing the repository’s data holdings. Learning about such changes can enable the repository to identify additional metadata standards, particular metadata elements, specific vocabularies and harvesters that can serve the needs of the new communities as the disciplines of users change. Recent developments, such as those described by Musen et al. (), include metadata templates, discipline-specific ontologies, and metadata evaluation software tools that enable rich FAIR-compliant metadata to be produced for distribution to particular communities and across communities of data users.
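
The citation-based approach to detecting disciplinary change can be sketched as follows. The example compares the share of citing publications per discipline between two years and reports disciplines whose share has grown by at least a chosen margin. The citation records, discipline labels, and the 10 percentage point threshold are invented for illustration and are not data from any repository in this study.

```python
from collections import Counter

# Invented records: (publication year, discipline of the citing publication).
CITATIONS = (
    [(2018, "geophysics")] * 4
    + [(2018, "space physics")] * 1
    + [(2022, "geophysics")] * 3
    + [(2022, "space physics")] * 1
    + [(2022, "ecology")] * 2
)

def discipline_shares(citations, year):
    """Return each discipline's share of the citing publications in a year."""
    counts = Counter(discipline for y, discipline in citations if y == year)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {discipline: n / total for discipline, n in counts.items()}

def emerging_disciplines(citations, earlier, later, min_gain=0.1):
    """List disciplines whose share grew by at least min_gain between years."""
    before = discipline_shares(citations, earlier)
    after = discipline_shares(citations, later)
    return sorted(
        discipline
        for discipline, share in after.items()
        if share - before.get(discipline, 0.0) >= min_gain
    )

emerging = emerging_disciplines(CITATIONS, 2018, 2022)
print(emerging)
```

A discipline flagged in this way could then prompt a review of whether the repository’s metadata elements and vocabularies serve that community.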

4.1.3. Repository usage metrics and citation counts

To some extent, repositories can keep track of their efforts to increase data discovery and, ultimately, usage through counters that measure user engagement with repository assets (e.g. clicks, downloads, searches, turnaways), which can reveal fluctuations and patterns in a repository’s engagement and usage. A current standard for repository metrics is embodied in the COUNTER Code of Practice (). Some repositories, such as NSSDC and SEDAC, employ a simple user authentication requirement, via a single log-in or registration with an e-mail address, to gain insight into data usage patterns beyond raw metrics, shedding light on the frequency of usage for each item and the types of users who may be accessing data assets. In contrast, GCdataPR reports using IP addresses and real-time usage statistics to keep track of the repository’s international visits, in a way that is consistent with GCdataPR’s stated goal of reaching a broader international user base. But, while potentially useful for tracking users’ online interactions with the repository, these alternative metrics also have limitations as indicators of actual dataset reuse ().

Alternatively, data citation tracking, despite its limitations, is increasingly becoming a tool that can be used to estimate the scientific impact of a repository’s data assets and to facilitate some types of bibliometric analysis of data usage. Among our use cases, GCdataPR, SEDAC and WDC-RRE report tracking data citations. SEDAC’s platform has also implemented a searchable online database that contains references to citations of the repository’s datasets ().

4.2. Sustainability

The second set of challenges that repositories face in developing and deploying harvestable metadata services concerns the limited opportunities they have for ensuring the sustainability of their services, especially given resource and policy constraints. Sustainable services are needed to provide continuous operations while facing the combined challenges of meeting the changing needs of users with technology that is evolving. Furthermore, with limited resources for technical development, repositories must consider the costs of establishing new services while providing and maintaining existing services.

Securing continual support for sustainable repository development and maintenance is a fundamental management challenge, especially for small- and medium-scale research facilities. Our group of repositories has faced these challenges by gaining support within their host institutions and finding support through partnerships.

4.2.1. Sustainable growth and operations

In research organizations without a strong culture of research data management (RDM), it may take time to build support for expanding data services with initiatives such as a new metadata service. For example, the Göttingen eResearch Alliance () built institutional support by engaging with the organization’s key decision makers and stakeholders. Alternatively, SEDAC and the three WDS members in China have been able to build support for their data centers within their host institutions and their national data infrastructures, and this is reflected in the repositories’ maturity status. These examples also underscore that collaboration among community stakeholders fosters efforts to attain data repository interoperability, as reported by Gries et al. ().

For less hierarchical organizations like research networks and data federations, the most salient challenges involve coordinating the development of a common standard or application profile, or coordinating the adoption of an existing technology (). The two WDS network members among our use cases, IGS and INTERMAGNET, are different examples of established, international data federations that managed to create impressive infrastructures on the basis of voluntary member participation, through many decades of collaborative work.

The voluntary, federated character of IGS relies on decentralized funding schemes for projects and initiatives, usually by public institutions, governments or other research organizations. To maintain its reliable service provision, IGS must rely on system redundancy and on multi-year support commitments from the institutions that host the key elements of the system (; ). To marshal support for a project, repository partners have to be able to envision the positive and tangible ways in which the project will impact funding partners and their constituencies, and how it will benefit the institution and society as a whole. In particular, IGS public outreach and communication initiatives reflect the organization’s keen understanding of that fact.

4.2.2. Resource constraints

It is also useful to bear in mind that open-source software (OSS) is being produced and made available on a regular basis, some of which is intended for repositories to implement harvesting protocols with lower investment costs. For example, harvesting protocols can be implemented as modules in bespoke repository platforms by means of Viringo, an OAI-PMH API created by DataCite and further developed at FRDR, or Pycsw, a Python implementation of the OGC CSW protocol that is used by WDC-RRE for its Catalogue Service. A minimal implementation of harvestable metadata may consist of a web-accessible folder (WAF), sitemap, or publicly accessible XML file of machine-readable metadata.
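As a minimal sketch of the sitemap option mentioned above, the following Python example builds a sitemap that lists the locations of machine-readable metadata records so that a harvester can discover them. It uses only the standard library and the public sitemap namespace; the record URLs, dates and the function name build_sitemap are hypothetical placeholders and are not drawn from any of the repositories in this study.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the public sitemap protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(records):
    """Return sitemap XML (as a string) for (url, last_modified) pairs.

    `records` is an iterable of tuples: the URL of a metadata record and
    the date it was last modified, in ISO 8601 form (e.g. "2021-03-01").
    """
    # Serialize the sitemap namespace as the default (unprefixed) namespace.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url, last_modified in records:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical metadata record locations, for illustration only.
example_records = [
    ("https://repo.example.org/metadata/ds-001.xml", "2021-03-01"),
    ("https://repo.example.org/metadata/ds-002.xml", "2021-04-15"),
]
print(build_sitemap(example_records))
```

A static file of this kind can be regenerated whenever records change, which keeps the investment low compared with operating a full protocol endpoint such as OAI-PMH or CSW, at the cost of offering harvesters no selective or incremental queries.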

While a discussion of the advantages and disadvantages of OSS lies outside of the scope of this paper (see , for a discussion of OSS pros and cons), it bears mentioning that repository managers will need to weigh the benefits of OSS against potential trade-offs (e.g. increased labor costs, community vs. corporate support services, etc.). Nevertheless, software solutions implemented with OSS may offer advantages for adoption if technological compatibility and software reusability are possible.

Independent of the decision to select a particular approach for implementing an enhancement, such as harvestable metadata capabilities, additional sources of support may be needed to sustainably develop and deploy improvements to data repository infrastructure. If the costs of enhancements are not absorbed by operating budgets, such costs may need to be supported separately. In such cases, data repositories may need to initiate projects and secure additional support for improvements to their services as part of their approach to providing sustainable data stewardship ().

4.3. Evolving technologies

The third theme reflects the set of challenges for making strategic decisions and associated investments in a landscape of evolving technologies and changing standards. Weighing the factors that influence such decisions presents a significant challenge for repository managers. Repositories must assess the potential of a technology or standard to meet current and future needs, as well as its maturity, to determine whether and when it can be adopted.

The repositories in this study represent established data-sharing communities that have been sharing scientific data (in analogue and digital formats) since long before the advent of the internet. Considering the ever-changing technological landscape, the ‘ideal’ constellation of technologies and services may seem like a moving target: Over the past few decades, these repositories have experienced multiple waves of technical innovation, which have time and again transformed the ways in which data is obtained, documented and shared with other researchers.

4.3.1. Metadata and open data access policies

In general, repositories may be hesitant to expose metadata for protected datasets and/or collections. Although none of our repositories reported hosting private or confidential data, some assets in the NSSDC repository are embargoed for a short time period, which is deemed long enough to ensure that data owners’ rights and interests are protected. NSSDC’s approach is compatible with the requirement that data be as open as allowable, but as restricted as necessary. SEDAC favors the use of open data licenses (mainly CC BY 4.0), ‘unless there are extenuating circumstances such as data restrictions inherited from input data’ (). Wherever relevant, necessary consideration must also be given to data sharing practices and principles – beyond FAIR – that focus on various ethical concerns, such as the First Nations Principles of OCAP (), and the CARE Principles for Indigenous Data Governance (). This means investing in the technical solutions that embody those principles: differentiated access policies and secure data storage, with trustworthy capabilities for offering selective data access under distinct protection classifications; or providing access only to authorized users. Machine-readable data licenses in metadata () can instruct search engines and automated software to display and filter content according to its licensing, which can in turn remind users of the freedoms and obligations (e.g. proper attribution) associated with the dataset.
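To illustrate how a machine-readable license can drive filtering, the sketch below selects harvested records whose license field matches a list of permitted license URLs. The record layout (a dictionary with a "license" key holding a URL, loosely modeled on the schema.org license property), the sample records and the function names are assumptions made for this example and do not describe any particular aggregator.

```python
# Canonical URL of the Creative Commons Attribution 4.0 license.
CC_BY_4 = "https://creativecommons.org/licenses/by/4.0/"


def normalize_license(url):
    """Reduce trivial variations (scheme, case, trailing slash) before comparing."""
    url = url.strip().lower()
    if url.startswith("http://"):
        url = "https://" + url[len("http://"):]
    return url.rstrip("/")


def filter_by_license(records, allowed_licenses):
    """Return the records whose 'license' URL is among the allowed licenses.

    Records with no license statement are excluded, because an automated
    consumer cannot determine what reuse is permitted.
    """
    allowed = {normalize_license(u) for u in allowed_licenses}
    return [
        r for r in records
        if normalize_license(r.get("license") or "") in allowed
    ]


# Hypothetical harvested records, for illustration only.
records = [
    {"name": "Dataset A", "license": "https://creativecommons.org/licenses/by/4.0/"},
    {"name": "Dataset B", "license": "http://creativecommons.org/licenses/by/4.0"},
    {"name": "Dataset C", "license": "https://repo.example.org/custom-terms"},
    {"name": "Dataset D"},  # no license statement
]

open_records = filter_by_license(records, [CC_BY_4])
print([r["name"] for r in open_records])
```

Expressing the license as a resolvable URL rather than free text is what makes this kind of comparison possible; a statement such as "free for academic use" would require human interpretation.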

4.3.2. PIDs, DOIs, and identifiers for dynamic datasets

Persistent, unique identifiers (PIDs) for digital objects can enhance and enable a range of interoperability features, from automatic metadata retrieval for bibliographic references in tools like RefWorks and Zotero, to deduplicated aggregation of dataset metadata into federated catalogues, to the analysis and visualization of networks of scholarly communication and collaboration like OpenAIRE’s Research Graph (). The Digital Object Identifier (DOI) standard (), which emerged in the 1990s, as well as newer PIDs like the Research Organization Registry (ROR) and Open Researcher and Contributor IDs (ORCID) have opened new avenues for automating links between metadata records, and for creating new digital research services. The growing use of the ROR identifier in dataset metadata is a case in point. Since implementing ROR tags in 2020, national aggregation platforms like the Federated Research Data Repository (FRDR) have the option to selectively harvest Canadian data from non-Canadian repositories when at least one of the authors is affiliated with a Canadian research organization (). Similarly, ORCIDs make it easier to track the scholarly output of individual researchers.
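The selective harvesting scenario described above can be sketched as a simple filter over harvested records: a record is kept when at least one creator lists an affiliation whose ROR identifier belongs to a set of interest. This is an illustration under stated assumptions, not FRDR's actual implementation; the record structure, the function names and the ROR values shown are hypothetical placeholders rather than real identifiers.

```python
def has_affiliation_in(record, ror_ids):
    """Return True if any creator of the record has an affiliation in ror_ids."""
    for creator in record.get("creators", []):
        for affiliation in creator.get("affiliations", []):
            if affiliation.get("ror") in ror_ids:
                return True
    return False


def select_for_harvest(records, ror_ids):
    """Keep only records with at least one creator affiliated with a listed organization."""
    return [r for r in records if has_affiliation_in(r, ror_ids)]


# Placeholder identifiers standing in for the ROR IDs of national organizations.
NATIONAL_ROR_IDS = {
    "https://ror.org/EXAMPLE-CA-01",
    "https://ror.org/EXAMPLE-CA-02",
}

# Hypothetical records harvested from a repository outside the country.
harvested = [
    {
        "title": "Dataset with one national co-author",
        "creators": [
            {"name": "A. Author", "affiliations": [{"ror": "https://ror.org/EXAMPLE-XX-09"}]},
            {"name": "B. Author", "affiliations": [{"ror": "https://ror.org/EXAMPLE-CA-01"}]},
        ],
    },
    {
        "title": "Dataset with no national affiliation",
        "creators": [
            {"name": "C. Author", "affiliations": [{"ror": "https://ror.org/EXAMPLE-XX-09"}]},
        ],
    },
    {"title": "Dataset without creator metadata"},
]

selected = select_for_harvest(harvested, NATIONAL_ROR_IDS)
print([r["title"] for r in selected])
```

Matching on an identifier avoids the ambiguity of free-text affiliation strings, which can vary in spelling, language and abbreviation across repositories.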

The ability to permanently and uniquely reference arbitrary data subsets and subsequent versions of a dataset is key to safeguarding the reproducibility of scientific studies that rely on shared data. To tackle the technical challenge involved, groups such as DataCite () and the Research Data Alliance’s (RDA) Data Versioning Working Group (; ) have developed approaches and recommendations to implement dataset versioning and dynamic data citation. In 2015 the latter group released an RDA recommendation describing the dynamic assignment of PIDs to every new, unique data query that produced a given data subset (). With this approach, when a dataset changes due to updates or reprocessing (), or when a subset of data is extracted from a larger dataset, or republished within a larger data collection (as described in ), these unique products can themselves be reconstructed, identified, referenced, cited and reused. These RDA recommendations have been implemented in various data repositories that enable citation of time-stamped versions of subsetted dynamic datasets with persistent identifiers, facilitating retrieval, across sundry data types, for reuse ().
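The idea of identifying a data subset by the query that produced it can be sketched as follows. This is a simplified illustration, not the RDA recommendation itself or any repository's implementation: it derives a local identifier from a hash of the whitespace-normalized query text and the timestamp of the dataset version queried, and stores a checksum of the result so that a later re-execution can be verified. A production system would instead register a true PID (such as a DOI) with a registration service; the class names, the identifier prefix and the sample query are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryRecord:
    identifier: str         # local stand-in for a registered PID
    query: str              # normalized query text
    version_timestamp: str  # timestamp of the dataset version that was queried
    result_checksum: str    # fingerprint of the returned subset


class QueryStore:
    """Assigns one identifier per unique (query, dataset version) pair."""

    def __init__(self, prefix="local-id"):
        self.prefix = prefix
        self._records = {}

    @staticmethod
    def _normalize(query):
        # Simplification: collapse whitespace only. Real systems normalize
        # queries more thoroughly so that equivalent queries are detected.
        return " ".join(query.split())

    def register(self, query, version_timestamp, result_rows):
        normalized = self._normalize(query)
        key = hashlib.sha256(
            f"{normalized}|{version_timestamp}".encode("utf-8")
        ).hexdigest()
        identifier = f"{self.prefix}/{key[:16]}"
        if identifier not in self._records:
            # Sort rows so the checksum does not depend on row order.
            serialized = "\n".join(sorted(str(row) for row in result_rows))
            checksum = hashlib.sha256(serialized.encode("utf-8")).hexdigest()
            self._records[identifier] = QueryRecord(
                identifier, normalized, version_timestamp, checksum
            )
        return self._records[identifier]


store = QueryStore()
rows = [("station-1", 42.0), ("station-2", 40.5)]
first = store.register(
    "SELECT * FROM obs  WHERE year = 2020", "2021-01-15T00:00:00Z", rows
)
repeat = store.register(
    "SELECT * FROM obs WHERE year = 2020", "2021-01-15T00:00:00Z", rows
)
later = store.register(
    "SELECT * FROM obs WHERE year = 2020", "2021-06-01T00:00:00Z", rows
)
print(first.identifier, repeat.identifier, later.identifier)
```

In this sketch, repeating the same query against the same dataset version returns the existing identifier, while the same query against a later version receives a new one, which is the behavior that allows a cited subset to be reconstructed after the underlying dataset has changed.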

Of our present set of use cases, only the WDC-RRE repository reported having already implemented a system to assign PIDs to versioned datasets (), in which identifiers are coded to refer back to data queries executed on specific, timestamped dataset versions. Two others, INTERMAGNET and ISGI, expressed an interest in developing a PID versioning system in future stages of their repositories’ development. This approach would expedite the release of non-definitive datasets of geomagnetic observations, making these very detailed and highly valuable data assets available sooner to the scientific community. Another recent and well-documented example of metadata versioning from a WDS Member repository is Project MINTED at Ocean Networks Canada (; ), which also had an active role in developing the outputs of the RDA’s Data Versioning WG.

4.3.3. Maximizing data asset potential: Two approaches

To determine how much an existing repository infrastructure can achieve, and to pursue new development opportunities accordingly, an ongoing and thorough assessment of a repository’s infrastructure is recommended. To support a repository’s initial self-assessment, the global RDM community has produced instruments to assess the maturity and trustworthiness of a data repository and the data assets, including metadata records, it contains (; ). Some practical, up-to-date frameworks for reviewing a repository’s current state are the most recent version of the CoreTrustSeal requirements for trustworthy data repositories (), the RDA’s new FAIR data maturity model (), the CARE Principles for Indigenous Data Governance (), and the TRUST Principles for digital repositories (). Data repositories also need to continually assess the technology landscape to identify opportunities for improving capabilities to serve their designated communities. Cooperating with other repositories, within and across disciplines, helps with such assessments, especially when cooperating repositories share adoption stories and lessons learned.

Two cases in our study reflect an interplay between changing user needs, evolving technologies, and resource constraints. The two geomagnetism data repositories, INTERMAGNET and ISGI, contain data assets with enormous potential for innovative, interdisciplinary research, but their metadata formats and services have not been updated to current standards. For each repository, the challenge lies in finding a strategy that will allow it to exploit its data’s potential to serve its current (known) users as well as future (known and unknown) ones. This involves balancing general-purpose and use-case-based repository developments, including metadata standards and exchange protocols. ISGI and INTERMAGNET have reported different strategies, based on different priorities, to respond to this challenge. INTERMAGNET has reported having to ponder the advantages of general-purpose, extensive standards that can open future (yet unknown) avenues of research and collaboration, versus use-case-based approaches that tailor new developments to better support each new case. In contrast, the existence of concrete opportunities for interdisciplinary collaboration (for example, between ISGI and researchers in the biological sciences) may justify an approach that tailors a repository’s developments to a set of concrete use cases, taking a chance on their potential for future extensibility. HMetS-WG repositories also recognize the tension between the two fundamental principles of investing in future-proof technologies and maximizing user engagement with the data over time. In practice, repositories will usually attempt to balance both principles when designing their development plans.

4.4. Limitations of a harvesting strategy for dataset discovery

In many of the cases described in these reports, the development strategy for harvestable metadata services has been very thorough. To varying degrees, the SEDAC, WDC-RRE and GCdataPR use cases hint at the limits of a discovery/findability strategy based on harvestable metadata services alone. These repositories, in particular, have motivated the ITO’s decision to create an inventory of metadata aggregation services () that will allow repository managers to find aggregators outside their community’s beaten path. Furthermore, and as mentioned above, motivated in part by inclusion in Google Dataset Search, SEDAC has a metadata harvesting capability already underway. WDC-RRE and GCdataPR have expressed future interest in receiving ITO support to develop a semantic metadata strategy as well.

5. Conclusions

The experiences reported in this study frame the socio-technical dimensions of research service development, where success depends largely on meeting the diverse needs of stakeholders within the designated communities of the repositories studied. And within each repository, the users may reflect different research perspectives in terms of interests and methods, or they may even employ different epistemological and ontological approaches (). In effect, developing repository services, including harvestable metadata, involves identifying, adopting, and developing technologies that are continuously evolving to demonstrably serve the changing needs of heterogeneous user communities, within the policy and funding constraints of the institution. While the ‘ideal’ constellation of technologies and services may seem like a moving target, finding the right balance for their unique use case appears to be an attainable goal for most repositories.

When developing new services using cross-domain recommendations and policies, the ‘need for standardization and interoperability’ must be balanced ‘against the need for flexibility and discipline-specific nuance’ (). Which standards and technologies will best serve the original producers and established users of datasets, as well as the larger user community, including new and future data users? Nearly all of our repositories conduct some level of market research and intelligence gathering to inform their service development in general, and harvestable metadata services in particular: Gathering usage data and data citation counts and characteristics is necessary to monitor how data is queried and used. Other common practices involve engaging in designated community outreach and participation in cross-domain and/or international working groups, as well as having dedicated working groups with diverse stakeholders; or engaging with current and prospective users directly, such as via interdisciplinary research collaborations.

Strategies for project sustainability vary according to the repositories’ institutional structure. For repositories embedded in centralized and hierarchical institutions (such as research centers, or national digital infrastructure projects), attaining long-term sustainability is contingent on continued support by parent organizations. In these settings, some key strategies include sustained engagement with the organization’s key decision makers and stakeholders to seek strategic alignment, and maximizing opportunities to build support for data centers within their host institutions. For repositories embedded in decentralized organizations, like research networks and data federations, the main sustainability challenge is one of coordination and community development. Among our use cases, IGS and INTERMAGNET represent examples of data infrastructures that leverage voluntary member participation and decades of collaborative work to develop and maintain their services over time.

Lastly, the results from this study strongly suggest that participation and integration into technical networks (national, regional or subject-specific) can be a driver of technological development in member repositories. In all cases, the intermediating entity (a network, community or institution) effectively functions as a catalyst for service development and standards implementation, as well as an incubator that connects repositories’ local ecosystems with global research data sharing spaces. The three themes that have been identified in this study for the challenges of developing and deploying harvestable metadata services also offer implications for the challenges that repositories face, generally and in terms of other capabilities, as they try to improve their services while meeting the changing needs of users with evolving technology in a sustainable manner. Such implications may be considerations for future research and theory development.