A Discussion of Value Metrics for Data Repositories in Earth and Environmental Sciences

Cynthia Parr1, Corinna Gries2, Margaret O’Brien3, Robert R. Downs4, Ruth Duerr5, Rebecca Koskela6, Philip Tarrant7, Keith E. Maull8, Nancy Hoebelheinrich9 and Shelley Stall10

1 National Agricultural Library, Agricultural Research Service, USDA, Beltsville, US
2 Center for Limnology, University of Wisconsin, Madison, US
3 Marine Science Institute, University of California Santa Barbara, Santa Barbara, US
4 Center for International Earth Science Information Network (CIESIN), Columbia University, New York, US
5 Ronin Institute for Independent Scholarship, Boulder, US
6 DataONE, University of New Mexico, Albuquerque, US
7 Julie Ann Wrigley Global Institute of Sustainability, Arizona State University, Tempe, US
8 National Center for Atmospheric Research, Boulder, US
9 Knowledge Motifs, US
10 American Geophysical Union, Washington DC, US


Introduction
Publicly funded research data repositories would like the ability to demonstrate a return on investment (ROI) in their efforts to archive and publish research data. Businesses use data analytics to improve decision-making, reduce costs, and launch products or services. Estimates of data-generated revenue range from tens of billions to over 160 billion USD in 2018 (Statista, 2018; IDC, 2018) and may be broken down by market size and value added, such as the number of jobs created, cost savings, and efficiency and productivity gains (Günther et al., 2017). Teasing out the contribution of Earth and environmental observation data to these revenue estimates is difficult. However, it is important to note that some of the most valuable uses of environmental data are in emergency response management (Dubrow, 2018; Pinelli et al., 2018), drought monitoring (Bernknopf et al., 2019), pollution assessment, agricultural production and groundwater quality (Forney et al., 2012), fisheries and water management, logistics and trade, and, on longer time scales, in real estate development, risk assessment and the insurance industry (Voosen, 2017; Downs, 2018).
Beyond commercial value are the returns that accrue to the scientific enterprise itself. While the impact of open and accessible data on accelerating certain areas of science is still an active discussion (Gewin, 2016), a number of positive measures are emerging, e.g., increased numbers of publications by data publishers themselves (Milham et al., 2018) and up to a 25% increase in citations when the data were published in an open repository (Colavizza et al., 2019). Even harder to measure, and usually not part of monetary analyses, are the intangible benefits to society. For example, the ability to predict, based on data, environmental habitability and needed changes in lifestyle is priceless, but not valueless. Societal impacts are often captured anecdotally (e.g., Ramapriyan and Behnke, 2019; NOAA, 2019), or in major impact reports (IPCC, 2014). Data repositories are essential to the functioning of these activities as well.
Determining a monetary return from funds invested in the research data repositories that house these data remains challenging. In part, this is due to a lack of hard metrics (Dillo et al., 2016). The few comprehensive ROI studies of research data repositories do not distinguish between the impact of publicly available data and related research services; for these together they estimate ROI between 2- and 10-fold (Beagrie and Houghton, 2013). Beagrie and Houghton's (2013) analyses of the value and impact of three data centers in the UK (Archeology, Economics and Atmospheric sciences) examined complex metrics, ranging from value estimations for both users and depositors of data, which measured social welfare and work-time savings, to explorations of non-economic benefits. They found that data centers contribute to significant increases in research production and that the value to users exceeds their investment in data sharing and curation. Qualitatively, academic users reported that having the data preserved for the long term, with the repositories targeting dissemination, was the most beneficial aspect of depositing data there. Kuwayama and Mabee (2018) described similar results from impact assessments of the socioeconomic benefits of satellite data applications at different decision-making scales, and reported on efforts to measure the benefits to human health of data on air quality and harmful algal blooms. They also summarized other analyses of very specific stakeholder groups, e.g., for the value of Landsat mapping to the gold mining sector and of a frost prediction application to tea farmers in Kenya.
Analyses to quantify benefit in economic terms are complex in that they require expertise in several fields typically unrelated to the repository itself (e.g., social sciences, economics, survey statistics). They are expensive and time-consuming to perform, and so happen only rarely, and typically only for repositories with long lifespans and relatively large user communities. Beagrie and Houghton's reports (2013) were commissioned and funded by national agencies over a period of two years, not by the repositories themselves. Moreover, methods for determining the economic value of repositories might necessarily vary dramatically among scientific domains. Thus, it seems worthwhile to adopt a practical approach that can help repositories demonstrate their value efficiently, on short time scales, and within the context of their disciplines.
The Make Data Count (MDC) project is an initiative to design and develop consistent, standardized metrics that measure accesses of individual research datasets (Kratz and Strasser 2015a), an essential step towards realizing comparable metrics of reuse. MDC surveyed scholars and publishers to determine which data-use metrics and approaches would offer the most value to the research community. Researchers and other stakeholders rated data usage or access metrics for research data as an important indicator of impact, second only to data citations (Kratz and Strasser 2015b). However, standards were lacking on how usage metrics should be collected and reported, so the MDC project collaborated with COUNTER, a non-profit organization, which provides the Code of Practice for Research Data Usage Metrics (Fenner et al., 2018) so that publishers and vendors can report consistent, credible and comparable usage data for their electronic resources. Here, we build on the MDC work by focusing on indicators of the value of a repository managed as a whole rather than that of individual datasets.
Repositories are experiencing increased expectations, e.g., to meet criteria for making their data holdings "FAIR" (Findable, Accessible, Interoperable and Reusable; Wilkinson et al., 2016; Stall et al., 2018; GO FAIR, 2016), to align with schema.org (Guha et al., 2015), and to ensure that content is machine readable. Previously, repository stakeholders were mainly research funding agencies and researchers; that group has now expanded to include publishers of academic journals and international audiences, yet these new stakeholders typically do not provide the resources required to implement and maintain the capabilities needed. Consequently, additional requirements that may not provide a clear benefit to primary stakeholders are difficult for repositories to embrace. This paper explores these increasing challenges in assessing the value of repositories. For background, we introduce generally recognized categories of costs and benefits of publishing data in dedicated repositories. We then describe an approach for quantifying the value of data repositories, assemble possible metrics to measure both the costs and the benefits they create, and report on an exercise to closely evaluate and prioritize these metrics, with recommendations intended to guide metrics development and refinement. Ultimately, repositories want to be sure they are worth the funding they receive, and a reduced set of consistent, streamlined, and meaningful metrics will help.

Background: the costs and benefits of publishing data in repositories
We define 'data publishing' as the process of making data accessible in a public repository that provides a defined level of professional services. Net value requires evaluating both costs and benefits, where benefits should go beyond financial considerations to include broader societal benefits.

Costs
Cost metrics are an important component of any business (Rubin 1991, Phelps 2003), in that they ensure that expenses are understood and contribute to operational decisions and strategic planning. In principle, measuring these is relatively simple. Data repositories may be more similar to libraries or museums, although those have far more costs associated with physical infrastructure than do data repositories (e.g., Lawrence et al., 2001). When funding data publication in a repository, costs can be categorized into four areas that focus on typical aspects of physical infrastructure and personnel (expertise, time, salary), and are usually outlined in operational budgets (Table 1; Curation Cost Exchange, 2018).

Benefits
Benefits are less straightforward to articulate than costs and translate less easily into financial terms. Although the view that making data publicly available for reuse will benefit science or society has been contested (Lindenmayer and Likens, 2013; Longo and Drazen, 2016), many scientists, professional societies, funding agencies and journal publishers agree on its overall benefits, summarized in Table 2 (McNutt, 2016; AGU, 2013; Baker et al., 2015; Popkin, 2019; Wilkinson et al., 2016; Starr et al., 2015; Piwowar et al., 2011).

Approach
This work began under a National Science Foundation grant, which brought together data and repository managers interested in pathways for increased, sustainable collaboration and coordination to benefit both research networks and individual data use scenarios. In late 2015, a collaboration area developed within the Federation of Earth Science Information Partners (ESIP, http://esipfed.org) for further activities, of which one was to consider frameworks for describing Return on Investment (ROI) in data repositories.
In a series of workshops and teleconferences, thirteen self-identified data curation specialists representing seven environmental data repositories and two data-aggregation facilities (listed under Notes, below) reviewed the literature and current practices for assessing data repository value. They identified and categorized 50 specific metrics (Appendix 1) for measuring the costs and benefits that were applicable to environmental data repositories. As an exercise, each individual scored each metric with respect to its importance to measuring repository value on a scale from 'not valuable, not applicable, or unclear', 'low', 'moderate', to 'high'. Importance was generally understood as how critical the metric was to demonstrating repository value and was judged based on the extensive experience that curation specialists brought to the discussion.
Each repository scored each metric by its ease of implementation, categorized as 'metric is already collected', 'metric not collected, but could be collected easily', 'collection will require nominal additional resources', or 'metric could not be implemented without new actions, such as research on its methodology, a refined definition or guidelines, significant additional funds or community policies'. Scorings are somewhat subjective and only roughly quantitative, but the activity allowed us to closely consider the metrics for trends and recommend priorities for adoption or further discussion. No attempt was made here to assign monetary values to any metric. This work was not intended to be an independent survey, in that a group of repository representatives both identified the metrics and scored them. However, the group brought both deep and broad experience in repository management, technology implementation, user support, data curation, and in obtaining funding for repository operations. Because this was merely an exercise, we refer to the outputs of that exercise in an associated dataset (Gries et al., 2018, dataset). In this discussion, the metrics tallied in the accompanying dataset are denoted by italics.
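The two scoring scales described above can be tallied in a few lines. The following is a minimal sketch only: the numeric codes assigned to the categories, the three metric names, the votes, and the thresholds are all hypothetical illustrations and are not taken from the associated dataset.

```python
from statistics import median

# Ordinal codes for the two scales (the numeric encoding is an assumption).
IMPORTANCE = {"none": 0, "low": 1, "moderate": 2, "high": 3}
EASE = {"collected": 0, "easy": 1, "nominal": 2, "major": 3}

# Hypothetical votes: metric -> (importance votes by individuals,
#                                ease votes by repositories).
scores = {
    "number of downloads": (["low", "moderate", "low"],
                            ["collected", "collected", "easy"]),
    "data citations": (["high", "high", "moderate"],
                       ["major", "nominal", "major"]),
    "user satisfaction": (["high", "moderate", "high"],
                          ["major", "major", "major"]),
}


def summarize(scores):
    """Return {metric: (median importance code, median ease code)}."""
    return {
        metric: (
            median(IMPORTANCE[v] for v in importance_votes),
            median(EASE[v] for v in ease_votes),
        )
        for metric, (importance_votes, ease_votes) in scores.items()
    }


def important_but_hard(summary, min_importance=3, min_ease=2):
    """List metrics rated important that need extra resources to collect."""
    return sorted(
        metric
        for metric, (importance, ease) in summary.items()
        if importance >= min_importance and ease >= min_ease
    )


summary = summarize(scores)
print(important_but_hard(summary))
```

Using the median keeps the summary on the original ordinal scale; a mean would imply equal spacing between categories, which the scoring exercise does not claim.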

Findings
Of the 50 identified metrics, 30 measure the benefit (or value) created by holding datasets and making them accessible from a repository, and 20 measure the direct costs related to curation, publication and preservation of those same holdings.
In total, 35 (70%) of the 50 identified metrics had been implemented by at least one repository. Eleven (22%) were implemented by only one repository; however, for most of those, several repositories stated that they would be able to implement them with no additional resources, indicating that if these metrics were to be included in a set of unified guidelines, they could be addressed quickly.
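As a worked illustration of how such shares are derived, the sketch below counts metrics by the number of repositories that already collect them. The per-metric counts are synthetic and were chosen only so that the totals match the figures reported above; the real per-repository scores are in the associated dataset.

```python
# Synthetic table: metric id -> number of repositories already collecting it.
# The split (11 metrics with one repository, 24 with several, 15 with none)
# is an assumption chosen to reproduce the totals reported in the text.
implemented_by = {}
for i in range(50):
    if i < 11:
        count = 1          # implemented by exactly one repository
    elif i < 35:
        count = 2 + i % 3  # implemented by two to four repositories
    else:
        count = 0          # not implemented anywhere
    implemented_by[f"metric_{i:02d}"] = count


def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to an integer."""
    return round(100 * part / whole)


total = len(implemented_by)
at_least_one = sum(1 for n in implemented_by.values() if n >= 1)
exactly_one = sum(1 for n in implemented_by.values() if n == 1)

print(f"{at_least_one} of {total} ({share(at_least_one, total)}%) implemented somewhere")
print(f"{exactly_one} of {total} ({share(exactly_one, total)}%) implemented by one repository")
```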

Table 1: Categories of costs for publishing data in a repository.

Cost of Initial Investment
Gathering requirements, preservation planning, development of repository platform (hardware, software licenses) and search and access capabilities, development of policies for data acceptance and retention.
Costs can vary widely depending on the scope of the requirements, the suitability of off-the-shelf software, and the time required for initial set up, testing and evolution to full production. Requirements may be imposed by the funder or the interests of the scientific community, which influence the repository's design and infrastructure.

Cost to Publish
Data acquisition, appraisal, quality review, standards-compliant metadata preparation and dissemination, overhead, marketing, user support.
The variety of scientific community needs leads to a variety of curation practices and repository goals, with costs partly depending on the data source. Earth and environmental data naturally range from the relatively homogeneous, e.g., from sensors or instruments, to highly complex organismal observations and physical samples (biological, chemical, geoscience) under both ambient conditions and from experimental manipulations. Large, mission-centric repositories (e.g., satellite data) have costs generally tied to data collection. Repositories serving many individual data producers rely considerably on their contributors' expertise and time, which distributes part of the curation cost to those projects. Repositories that are primarily aggregators (whose goal is to collect a variety of metadata or sources for indexing) rely on a minimum level of metadata standardization from their sources; their costs typically arise from resolving incoherent source data and heterogeneous metadata, with related outreach efforts to improve practices.

Cost to Add Value
Data dissemination planning, processing, data product development, and quality control of the new data products, overhead.
Varies greatly among repositories, but may represent the most visible return, or possibly even an opportunity for commercialization. Some raw data will have already received comprehensive processing to make them further usable. The concept of "Analysis Ready Data" is applied in other domains with value-adding steps by the repository to target uses from multiple disciplines, non-research uses (e.g., policy makers, general public, education), or per the demand by such groups for the development of specific data products (Baker and Duerr, 2017). The cost for tasks to add value depends greatly on data types, diversity and envisioned uses.

Cost to Preserve
Anticipated retention period, facilities system maintenance, enhancements, and migration; staff development and technology upgrade.
While tracking existing needs is relatively straightforward, future costs may be more difficult to predict. Preservation costs are greatly influenced by technological change (e.g., new hardware, standards, vocabularies, storage formats), and new requirements and data policies that must be translated into repository operations (Maness et al., 2017; Baker and Duerr, 2017). Iterative migration necessitates expenses in development, data and metadata conversion and user engagement, sometimes without immediately noticeable changes in service. Moving from supporting primarily data publishing to supporting data which are frequently reused requires new services and possibly, value-added products.

In general, highly variable responses reflect that major aspects of operation differ greatly among repositories (even within the research domain of Earth and environmental sciences), and that all repositories were already tracking several metrics, although the suites differed. Across both metric types (cost and benefit), on average the repositories had already implemented 13 (range 4-20), with another 15 rated as simple to implement (mean; range 10-29; Gries et al., 2018, dataset). An average of 15 were judged to require significant additional research, extensive resources or outside expertise. Overall, seven (roughly 40%) of the 18 metrics ranked most important were not implemented by any of the repositories. More than half of the metrics in two categories were ranked as important but were not implemented. These were metrics related to "Value-added Data Products" (Benefits, see Table 2) and "Cost to Preserve" (Costs, see Table 1). Metrics for costs were generally easier to implement than those for benefits. The metrics most likely to be implemented were related to categories commonly found in budgets: direct costs (e.g., Hardware) and personnel (both as general staff positions and as the primary cost of software development). Most of those metrics were already measured.
With respect to benefits, ease of implementation scores fell into three broad groupings (Table 3). Note that the metrics that were most likely to have been implemented or took little effort and resources were those that could be gleaned from the data holdings themselves or from server logs.
Interestingly, a few repositories had already implemented some of the metrics that were ranked important yet difficult (e.g., use of data in papers, reduced storage cost to users, and user satisfaction). Such implementation generally required significant planning, staff time and effort, and confirmed the judgement by more than half of the repositories that implementation would be expensive.

Table 2: Benefits of publishing data in a repository.

Avoidance of Data Generation Costs
Data gathering is expensive; offering reusable data avoids the cost of recreation.
Data value may be easily estimated as the cost to create; however, the future value of data cannot be predicted and different kinds of data will have different useful lifespans, generally dependent on how easy or expensive data are to create and whether they lose or gain in applicability over time. It may be feasible to recreate experimental data, but it generally is impossible to recreate observational field data.

Efficiency of Data Management
Infrastructure investments benefit all data producers; central programming functions for data search and access improve discoverability and reduce distribution costs to researchers; efficiency benefits are most obvious in repositories serving a large number of single investigators, though all repositories keep large amounts of data safe by upgrading hardware and software as technology changes, and by managing services, such as unique identifiers (e.g., Digital Object Identifiers, DOI).
Data repositories can be compared to specialized analytical laboratories, as they employ an expert workforce with specific skills in data curation and preservation that ensure the quality and interoperability of their holdings. Once data have met curation standards, repositories maintain their usability and extend their working life beyond the lifespan of the original creator's data storage options by addressing format obsolescence and other issues.

Long-term Usability and re-use of Data
Implementing sustainable data curation, stewardship, metadata capture, and quality of data and metadata enables meta-analyses and innovative re-use for new science or applications.
Lengthening the working life of data creates enduring value by enabling subsequent usage over time. Ongoing stewardship can support new uses and user communities. Properly curated data can be combined or analyzed with data that will be collected in the future, allowing researchers to build upon prior work (Starr et al., 2015).

Transparency of Scientific Results
Making data publicly available in a repository is an important step toward transparency and reproducibility of research, which in turn assures credibility of scientific results (McNutt et al., 2016) and the ability to build on prior work.
Historically, best efforts have been made to preserve publications and the salient data published in them. In modern publishing, data needs to be managed and published as a product in its own right (Downs et al., 2015).

Value Added Data Products
Some repositories increase data utility via pre-processing, semantic and format standardization, data aggregation and interpretation, and specific tools that support the creation of new data products, uses and audiences beyond the original data users (e.g. general public, policy makers, education and outreach) (Baker et al., 2015).
Among all benefits metrics, the rate of implementation did not correlate positively with being evaluated as 'important'. For example, three metrics relate to users' interactions: finding data, accessing data, and actual downloads. Interestingly, of these the first two (finding and accessing data) were deemed far more important than actual downloads, which ranked near the bottom in importance, yet number of downloads is a frequently implemented metric (likely due to its ease), while feedback on users' ability to successfully find data of interest is hard to obtain.
Three metrics (specific costs of preservation and related infrastructure, user support, and enabling future access) were the only metrics to consistently receive a high Importance score. However, each of these was measured, at best, at only 40% of repositories. Only one repository (NCAR Research Data Archive) has a mechanism for anticipating future costs. Most repositories (seven out of nine) were not yet able to implement this metric for various operational reasons.
Other benefits metrics ranked as important involved the ability to count data use in publications or citations and track impact on societal priorities. These reflect a repository's ability to promote efficient data management, to provide for long-term usability of its data holdings, including the generation of new knowledge (proposals, studies), and to create value-added products. However, with the exception of being able to tally possible reuse (e.g., via page or catalog visits), many repositories had no current or planned mechanism to collect these, agreeing that a change in data citation practice, and more research or discussion were needed.

Discussion
The broad range of perspectives reported here reflects the repositories' diverse pre-existing priorities, technology choices, user interaction history, and resources, and occasionally, differences in interpretation. Varying degrees of effort are required to implement the different metric types (e.g., Table 3). Budget-related and tangible metrics (e.g., FTE) are relatively easy to measure, as are the number of downloads from server logs. However, the repository managers in this group rated downloads and total page views as lower in importance than did respondents to the Make Data Count project survey (Kratz and Strasser, 2015b). This generally reflects the uncertainty about whether downloads are a reliable correlate of use or impact, given the lapse in time between download, use, decision-making, and knowledge gained, and the difficulty of tracking that pathway. These scores also reflect lingering concerns about over-standardization and interpretation. A simple measure like downloads ignores the volume of data downloaded; moreover, the repository will have done a better job if users do not need to download data excessively. However, number of downloads is typically valued by scientists publishing data, and several community efforts are underway to standardize and track the number of dataset downloads (Kratz and Strasser 2015b). Generally, explicit data citations were rated as much more important here, which is consistent with Kratz and Strasser (2015b), though this tracks academic use only.
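The point that a raw download count ignores data volume can be illustrated with a small tally over server log records. This is a sketch under stated assumptions: the record layout (client IP, dataset identifier, bytes sent) and the sample values are hypothetical, and a production implementation would also need the filtering rules (e.g., for robots and repeated requests) that a usage-metrics code of practice defines.

```python
from collections import defaultdict

# Hypothetical, already-parsed log records: (client IP, dataset id, bytes sent).
log = [
    ("192.0.2.1", "dataset-A", 5_000_000),
    ("192.0.2.1", "dataset-A", 5_000_000),
    ("192.0.2.7", "dataset-A", 5_000_000),
    ("198.51.100.4", "dataset-B", 120_000_000_000),
]


def usage_by_dataset(records):
    """Tally downloads, distinct client IPs and bytes served per dataset."""
    stats = defaultdict(lambda: {"downloads": 0, "ips": set(), "bytes": 0})
    for ip, dataset, nbytes in records:
        entry = stats[dataset]
        entry["downloads"] += 1
        entry["ips"].add(ip)
        entry["bytes"] += nbytes
    return {
        dataset: {
            "downloads": entry["downloads"],
            "distinct_ips": len(entry["ips"]),
            "bytes": entry["bytes"],
        }
        for dataset, entry in stats.items()
    }


usage = usage_by_dataset(log)
for dataset, entry in sorted(usage.items()):
    print(dataset, entry)
```

In this made-up sample, dataset-A has three times as many downloads as dataset-B but only a tiny fraction of the volume served, so ranking by download count alone and ranking by volume would give opposite pictures.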

Table 3: Summary of findings on ease of implementation of repository benefit metrics. For detailed list and description of metrics see Appendix A. Here the metrics are not necessarily named individually but restated in general terms.

Currently measurable by most repositories
Derived from data holdings: temporal, spatial and subject coverage.
Value of repository services: number of data submitters and users supported, grants or projects served, workforce development achieved; cost savings for trustworthy data storage and distribution per submitter.
Support for reuse: completeness of metadata; expressiveness of metadata standard; presence and enforcement of data/metadata quality policies.
Data reuse: numbers of downloads, page views, or distinct IP addresses accessing the data, of metadata pages accessed, data products and specific tools accessed; time spent at the site.

Possible in the foreseeable future with research, advanced technology and changed practices
Scientific impact: extracted with artificial intelligence technologies from current publications, webpages, blogs, proposals and data management plans, and more reliably based on standardized data citations once practice is established.

Requiring major additional resources and expertise
Surveys: interviews to ascertain user satisfaction and perceived impact on research success (research enabled, time saved, new questions developed).
Economic and societal impact: of data and data products beyond scientific use, or for fraud avoidance.

The metrics considered most important to a detailed understanding of value are typically intangible (e.g., the benefits to future knowledge or understanding the impact on policy, and cost of ensuring future access), and will be a challenge to measure at all, much less measure consistently. The expectations of funding agencies to make data available with minimal conditions, i.e., without requiring account registration or user identification, dramatically reduce opportunities for gathering more detailed customer-related metrics. If contact details are not collected at the time of download, e.g., by forms, questionnaires or log-in, the repository has limited ability to follow up with the user regarding the perceived quality of data and metadata, or the value and relevance to the intended purpose. Across all categories, repositories need to balance the requirement for free access and privacy-compliant practices with requests from funding agencies, institutions, or even the user communities themselves to report on data usage (e.g., to generate personalized data use statistics). This is an area of continued discussion with recognised benefits and disadvantages on all sides of the argument. Given these limitations and the fact that many easily acquired benefit metrics are hard to translate into a comparable value for data (Kratz and Strasser 2015b) or repositories, data citation has been identified by this group and many others (e.g., https://datacite.org/, Kratz and Strasser 2015a) as the best metric for measuring 'value of repository' to the science community, if not the larger world. However, data citation is still evolving as a practice (Parsons et al., 2019; Garza and Fenner, 2018; Data Citation Synthesis Group, 2014), and is neither a direct analog to paper citation nor firmly established within the Earth and environmental science communities.
Hence, several represented repositories have resorted to manual linking of datasets to publications based on expert knowledge and manual or semi-automated literature searches. Wider implementation of Scholix will certainly help here (Cousijn et al., 2019). In the future, we expect that reliable metrics about academic use will be based on standardized data citation practices. Established practices, in turn, could form the training datasets for artificial intelligence technologies that extract usage from publications, laws and regulations, webpages, blogs, proposals, and data management plans, and so more fully measure complex metrics (e.g., policy impacts).
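One simple form that a semi-automated literature search could take is scanning publication text for the repository's own dataset DOIs. The sketch below is illustrative only and is not the procedure used by any of the represented repositories: the regular expression is a common heuristic rather than a complete DOI grammar, and the DOIs and sample text are hypothetical. It finds only explicit DOI mentions, so datasets referred to by name or URL would still need manual review.

```python
import re

# Heuristic DOI pattern: "10.", a registrant code, "/", then a suffix that
# runs until whitespace, a quote or an angle bracket (an approximation).
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def find_dataset_dois(text, repository_dois):
    """Return the repository's DOIs that are mentioned in `text`."""
    known = {doi.lower() for doi in repository_dois}
    found = set()
    for match in DOI_PATTERN.findall(text):
        # Drop trailing sentence punctuation; DOIs are case-insensitive.
        doi = match.rstrip(".,;)").lower()
        if doi in known:
            found.add(doi)
    return sorted(found)


# Hypothetical repository holdings and publication text.
holdings = {"10.5555/example.123", "10.5555/example.456"}
paper = (
    "Data are available at https://doi.org/10.5555/EXAMPLE.123 (Smith, 2018). "
    "See also doi:10.5555/other.9."
)

linked = find_dataset_dois(paper, holdings)
print(linked)
```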
Major additional resources or expertise will be required for socio-economic metrics that rely on user surveys and interviews to assess satisfaction and perceived value and impact; these include metrics such as research enabled, time saved, new questions developed, the economic impact of data and data products on society beyond scientific endeavors, and even the repositories' ability to engage the public and its scientific and data management communities. Those types of societal value and impact metrics require expertise generally not found among data curators or repository managers and necessitate targeted resources or funding for survey techniques and economics to simply define, let alone carry out. For example, the methods of Tanner (2012) for measuring impact of digital resources from memory institutions, such as museums and libraries could be adapted for use in this context. Some US federal agencies, e.g., the National Aeronautics and Space Administration (NASA), fund annual customer satisfaction surveys, with results typically driving repository activity over the next year. NASA also collects and has historically sponsored the creation of stories about the use and impact of particular types of data (Ramapriyan and Behnke, 2019). In some instances repositories have been able to obtain funding for advanced products including science analyses to support products for newly identified audiences (Baker et al., 2015).
Some agencies or networks have specifically targeted data synthesis and reuse efforts instead of new data creation, sometimes awarding funds primarily on that basis. These efforts will highlight the importance of initiatives such as "FAIR Data Principles" (Wilkinson, et al., 2016). For example, the Belmont Forum encourages new science to come from existing data concomitantly, even requesting examples of successful research working side-by-side with data management (Belmont Forum, 2018). The Marine Biodiversity Observation Networks specifically target existing long-term research-grade data to model practices for networks of scientists, resource managers and end-users (Wetzel et al., 2015). Repository curation and preservation services make these types of integrated, synthetic research possible.

Recommendations and conclusion
1. Sponsors should invest in research on defining the most important complex benefit metrics, support their implementation, and support evolving repository practices in this new environment. Repositories should be involved in the research components to ensure applicability and feasibility.
2. An initial set of metrics for regular reporting by environmental science repositories should be those that are already measurable and generally useful, with consistent dashboards (such as those noted above or promoted by Make Data Count), but repositories should progressively develop specific metrics to suit their individual stakeholders, while coordinating with similar repositories to avoid duplication of effort.
3. Stakeholders should be aware that many extant ROI calculations from economics-based analyses or specialists are expensive and will happen only when resources permit.
Without the investment of curation and long-term funding of repositories to preserve their data holdings, further research with today's irreplaceable data will not be feasible. Repositories measure what is valuable to their stakeholders and reasonable to collect given their budgets and missions. Future science and societal needs will help determine the value of the long-term investment. As research data publishing and data repositories continue to mature into an integral part of our scientific research endeavors, we should expect to gain a better understanding of the costs and benefits of publishing and preserving research data. We should also expect to see a rationalization of the repository landscape with refined practices for how and where research data are curated. It may be determined that fewer repositories will better leverage the investments made in infrastructure, or alternatively metadata aggregators will provide an ideal entry point into smaller, discipline-focused repositories. Either way, our goal should be to maximize the availability and usability of the data produced. By achieving this goal we can ensure that we extract the maximum value from our research funding while also improving the transparency and credibility of the conclusions drawn.