Introduction

In the 21st century, digital data drive innovation and decision-making in nearly every field (Big Data Value Association (BDVA) 2015, Holdren 2013, Houghton and Gruen 2014, Kalil and Miller 2015, Manyika et al. 2011, Manyika et al. 2013, Obama 2013, Podesta et al. 2014, Science and Technology Council 2007, Vickery 2011). Key questions today center not on whether data can add value in these areas, but rather on how to obtain access to more data, how best to leverage data given concerns about privacy and security, and how much value can be gained by reusing data (Cummings et al. 2008, Manyika et al. 2013, NSF 2007, Obama 2009 and 2011, Office of Management and Budget (OMB) 2012, Thompson Reuters 2013, Vickery 2011, Vickery 2012). These questions are being raised in both the public and private sectors, where stakeholders increasingly see data as an asset that can be leveraged to spur creativity, innovation and economic growth, as well as to increase trust (e.g., in government or the results of scientific research) (Association of Research Libraries 2006, BDVA 2014, Berman et al. 2010, Borgman 2012, National Academy of Sciences (NAS) 2009, National Research Council (NRC) 2003, NSF 2007, Organization for Economic Co-operation and Development (OECD) 2015, Podesta et al. 2014, Sveinsdottir et al. 2013, Tenopir et al. 2011, The Royal Society 2012, Ubaldi 2013, Wallis et al. 2013).

Demand for greater transparency and access to data has been particularly strong for data 1) that are produced at public expense, whether as part of publicly-funded research or other public initiatives, and 2) that are used in or produced as part of sponsored scholarly inquiry (“sponsored research data” or “research data”), whether publicly or privately funded. Demand for access to the former is driven by public interest and opportunity for public benefit (Podesta et al. 2014, Borgman 2012, Holdren 2013, Manyika et al. 2013, Obama 2013, OMB 2012, Ubaldi 2013, Vickery 2011, Vickery 2012). Demand for access to the latter is driven by public interest and principles of scholarship, especially those that advocate for open availability of knowledge to support further inquiry (Borgman 2012, Holdren 2013, NAS 2009, NRC 2003, OECD 2015, Research Councils UK (RCUK) 2015, The Royal Society 2012, Tenopir et al. 2011). Demand is also driven by a desire for increased reproducibility and accountability. Several high-profile cases involving a lack of supporting data or suspected or actual fraud have brought greater scrutiny to the availability of data to replicate and verify research results (see for example Climatic Research Unit email controversy 2015, Vogel 2011, Wicherts et al. 2011).

The demand for greater access to sponsored research data has focused attention on the chain of activities that lead to data access, including what data are saved by those who create them, where and how those data are stored and preserved, how they are described, what support for their reuse is available, and how they can be discovered and accessed. Taken together, these activities to maintain the integrity of and preserve access to data are commonly known as data stewardship. The National Academies, in a 2009 study (NAS 2009, p. 27), define stewardship as:

“…the long-term preservation of data so as to ensure their continued value, sometimes for unanticipated uses. Stewardship goes beyond simply making data accessible. It implies preserving data and metadata so that they can be used by researchers in the same field and in fields other than that of the data’s creators. It implies the active curation and preservation of data over extended periods, which generally requires moving data from one storage platform to another. The term “stewardship” embodies a conception of research in which data are both an end product of research and a vital component of the research infrastructure.”

The importance of data stewardship to leveraging sponsored research data for a variety of purposes, both now and in the future, is reflected in the initiatives and policies created in recent years to increase access to sponsored research data, particularly in the United States and Europe but in other countries as well (Holdren 2013, National Aeronautics and Space Administration n.d., Obama 2013a, Obama 2013b, OECD 2007, OMB 2002, OMB 2013, RCUK 2015, The Royal Society 2012, Willetts et al. 2013).

While there is general agreement about the actions that must be taken and roles that must be played to steward research data, there is a lack of clarity about who should have responsibility for fundamental aspects of stewardship, and there are different understandings of what constitutes effective stewardship. This contributes to a fractured and diffuse environment for stewardship (Wynholds et al. 2012, Borgman 2015; see also Pepe et al. 2014, ARL 2006, Borgman 2012, Downs and Chen 2013, Esanu et al. 2004, Thaesis and van der Hoeven 2010, Berman et al. 2010). In fact, despite the large number of data repositories, stewardship initiatives, and policies across the research data landscape, we know relatively little about the total amount, characteristics, or sustainability of stewarded research data (Berman 2008, Berman 2014, Gantz et al. 2008, Hilbert and López 2011, Pienta 2006, STC 2007, Turner et al. 2014).

What we do know gives us pause. For instance, in 2015, Read et al. conducted a study of the number of datasets mentioned in journal articles resulting from National Institutes of Health (NIH)-funded research that were deposited in “well-known, publicly-accessible data repositories” (Read et al. 2015, p. 3). They found mentions of such deposits in only 12% of published articles. Based on the number of datasets Read et al. identified in articles where no deposit of datasets was mentioned, they estimated that between 200,000 and 235,000 datasets resulting from NIH-funded research in 2011 were “invisible” (not found in one of the well-known repositories). The PARSE.Insight project found similar results in its wide-ranging study to “gain insight into the practices, needs and requirements of research communities” (Kuipers and van der Hoeven 2009, p. 9). It found in a survey of more than 1,300 researchers across multiple disciplines that only 20% of respondents deposited data in a digital archive (Thaesis and van der Hoeven 2010).

This research raises important questions. Are stewardship arrangements sufficient? Do researchers, research sponsors, and research institutions adequately understand what they need to do? Are public policies appropriate? These are questions worth answering.

The starting point for answering these questions is the substantial published literature on research data stewardship. In this article we explore what we know about research data stewardship through the lens of that literature, which allows us to characterize the important questions that previous researchers have asked and to identify areas that will require additional research. We return to those unanswered questions at the end of this article to propose lines of future research.

Data Collection and Analysis

Our literature review explores three different samples of literature, which we used to conduct the different analyses presented in the paper. For the purposes of developing our samples, we defined data stewardship according to the National Academy of Sciences definition provided above. Also as above, we defined “sponsored research data” or “research data” as data used in or produced as part of sponsored scholarly inquiry, whether publicly or privately funded.

The first sample (Sample A) is a body of 87 works, including literature reviews, reports, and empirical research that we analyzed to discover what scholars and practitioners identify as challenges to data stewardship. A list of these works can be found in York et al. (2018b).

We conducted descriptive coding of this sample, from which we identified three levels of stewardship gap “areas” and “sub-areas.” At the highest level we defined six gap areas. We arrived at these by consolidating 14 intermediate gap areas, which we had in turn aggregated from 56 more granular gap sub-areas. The areas and sub-areas are described below. The different levels of gap areas and sub-areas, as well as the papers from which we identified them, are available to browse at York et al. (2018d and 2018e).

The second sample (Sample B) comprises 74 works selected out of the 87 works from Sample A. These are listed at York et al. (2018c). In addition to identifying challenges to data stewardship, the authors of these 74 works also identified relationships between the challenges (e.g., challenges that cause or exacerbate others). The data in Figures 1 and 2 refer to this sample.

Figure 1 

Statements about gap areas and the relationships between them.

Figure 2 

Gap areas and relationships between them. The figure is arranged to show that gap areas in each column impact the gap areas in the rows below them. For instance, Culture (in the first column) impacts Knowledge, Commitment, Legal and Policy Issues, etc. (the gap areas in the rows of that column).

The third sample (Sample C) is a set of 142 works, some of which are included in Sample A and Sample B, that explicitly seek to measure stewardship gap areas and sub-areas, or articulate metrics for measuring them. These are listed at York et al. (2018d). Sample C excludes reports and other works that, for instance, articulate strategies or ideas for addressing stewardship challenges but do not conduct empirical research to measure at least one of our identified gap areas, or theoretical research to identify what might be measured.

We limited works in Sample C for the most part to those dealing explicitly with research data (as opposed, for example, to preservation of digitized cultural materials), though we included a few others. These include studies that investigated the total amount of digital information (e.g., Lyman and Varian 2000 and 2003, Gantz et al. 2007, Manyika et al. 2011), studies targeted toward digital curation skills broadly (but that include consideration for research data) (e.g., Cirrinnà et al. 2013, Hank et al. 2010) and some studies that investigated public sector or government information (e.g., Ubaldi 2013, Vickery 2011).

We conducted initial stages of coding using a combination of spreadsheets and the Web-based tool Workflowy. We subsequently kept track of article codes using spreadsheets and a Web-based database platform (Drupal) where data from the project are available (see http://www.stewardshipgap.net; for data in tabular form see York et al. 2018a).

We identified the works in all three samples through a variety of methods, including searching for topics related to stewardship and curation in and across databases (e.g., using services such as Google Scholar and cross-database aggregation services such as Summon), and analyzing cited references in relevant articles, reports, and projects. The works have a geographic bias toward North America and Europe and a language bias toward English. We describe our analyses using these samples below.

Defining the Stewardship Gap

Identifying Gap Areas

While numerous studies and reports have defined data stewardship, identified stewardship needs, put forth strategies to improve stewardship, and undertaken measurement and analysis of key factors that contribute to data stewardship (described below), no community-wide metrics for or measurements of the stewardship gap as a whole exist (one method for identifying the existence of a stewardship gap is described in York et al. 2016).

Measuring the stewardship gap is complex not only because it is difficult to measure the amount of sponsored research data that exist, but because a simple quantified measure of data would not provide critical information about the stewardship environment, prospects for stewardship, or other indicators that could yield insight into the likelihood that data will be stewarded either in the short or long term. Measuring the stewardship gap involves taking stock of a wide variety of component issues or “gaps” and the ways these interrelate and affect one another.

We show the scale of the issue in Table 1, in which we identify 14 gap areas, drawn from 87 articles, reports, and other works related to data stewardship (Sample A).

Table 1

Stewardship gap areas and their descriptions.

Gap Area | Description

Culture | Gap arising from differences in attitudes, goals, practices, and priorities among disciplines and communities that have an impact on data stewardship and reuse
Legal/Policy | Gap between current regulations and policies that govern data stewardship and reuse and those that would maximally facilitate stewardship and reuse
Knowledge | Gap between what is known and what needs to be known to effectively plan for and ensure effective data stewardship
Responsibility | Gap between who currently has responsibility for stewardship and who is best placed to steward data over time
Commitment | Gap between the stewardship commitments that exist on valuable data and the commitments necessary to ensure long-term preservation and access
Human Resources | Gap between the human effort and skills needed to steward and make data accessible, and the effort and skilled workers that are available
Infrastructure and Tools | Gap between the infrastructure available to steward and reuse data and infrastructure needed to maximize stewardship and reuse capabilities
Funding | Gap between the funding needed for effective stewardship and the funding available
Curation, Management, and Preservation | Gap between the ways data are managed and prepared for preservation and reuse and ways that would maximize their potential for preservation and reuse
Sustainability Planning | Gap between planning that is done to ensure adequate resources for stewardship and the planning that is needed
Collaboration | Gap between the collaboration needed for effective stewardship and the collaboration that takes place
Sharing and Access | Gap between the amount of data that are shared or made accessible and the amount that are not
Discovery | Gap between the amount of accessible data that are discoverable and the amount that are not
Reuse | Gap between the data that are available for reuse and the data that are used

While in this paper we discuss all 14, in some of our analysis we combined these into six categories as listed below:

  1. Culture (including Legal and Policy Issues)
  2. Knowledge
  3. Responsibility
  4. Commitment
  5. Resources (including Infrastructure and Tools, Human Resources and Funding)
  6. Stewardship Actions (including Curation, Management and Preservation, Sustainability Planning, Collaboration, Sharing and Access, Discovery, and Data Reuse)

Further information about each gap area is provided in the Appendix.

Identifying Relationships Between Gaps

Many of the articles and reports that we examined also indicate a relationship between gap areas—for instance, that deficiencies or gaps in policies for archiving data affect the quantity of data that are shared. Examples of statements indicating such relationships are shown in Figure 1. The arrow indicates the direction of the relationship. As the fourth and fifth statements indicate, the influences are not always unidirectional (e.g., Knowledge can affect Sustainability Planning and vice versa).

Figure 2 shows the relationships between the 14 gap areas as identified from nearly 300 relationship statements like the ones above within 74 of the 87 works we reviewed (Sample B). The figure is arranged to show that gap areas in each column impact the gap areas in the rows below them. For instance, Culture (in the first column) impacts Knowledge, Commitment, Legal and Policy Issues, etc. (the gap areas in the rows of that column). The relationships shown are direct relationships drawn from the statements, and not something that we have inferred. One might infer, for example, that legal and policy issues would have an impact on how much we know in certain areas, or who is responsible for which aspects of stewardship. Since these relationships are not explicitly indicated in the literature, however, they are not represented here. The figure, then, does not attempt to represent comprehensive or definitive relationships between the gap areas. It does, however, represent what has been written about in a fairly large sample of widely cited literature about research data stewardship.

Horizontal rows with significant amounts of red indicate areas where many factors are at play. For instance, there are many factors that affect funding for data stewardship, the seventh row from the top (e.g., Culture, Knowledge, Responsibility, Commitment, etc.). Rows with significant white space indicate areas that may be difficult to address because few identified factors influence them. For instance, Responsibility is shaped by Collaboration and Commitment, but there are few factors that affect Collaboration and Commitment themselves (and two of the factors that affect Commitment are bi-directional relationships with Responsibility and Collaboration).

One finding from this analysis is that many gap areas that have the largest impact on other areas are affected by relatively few factors. This implies that changes in such areas, including Collaboration, Culture, Knowledge, Responsibility and Commitment, could benefit data stewardship, but may also be difficult to effect. On the positive side, our analysis shows that there are at least some factors that do influence these gaps (e.g., Collaboration is impacted by Infrastructure and Tools and Culture by Funding and Legal and Policy Issues) and these factors could potentially be leveraged in efforts to change the size and nature of some gaps.

A second significant finding is the scarcity of references to factors that have an impact on Discovery of data, or vice versa. Discovery is only mentioned in a couple of contexts in the literature, mainly in connection with infrastructure (e.g., that infrastructure is needed for discovery). Many sources talk about curation, management and preservation influencing reuse of data, but skip the step of how it is made known that data are available for reuse.

Our effort to define the stewardship gap leads us to believe that there are multiple gaps, that the gaps are not isolated from one another but rather relate to and impact each other in different ways, and that while a number of such relationships have been identified in the literature, the relationships between some may be better understood than others (e.g., more is known about factors that affect Infrastructure and Tools than Discovery). It follows from these beliefs that the development of effective strategies to address apparent stewardship gaps will depend on an analysis of the gap areas that are most relevant in a particular context, an understanding of which other gap areas could be targeted to help reduce or eliminate the observed gaps, and reliable means of measuring the extent of the gaps (in order to calibrate levels of investment). We turn our attention now to the last of these: means of measurement.

Stewardship Gap Measurements and Metrics

How do we measure the stewardship gap? The stewardship literature includes many studies that define ways to measure the gap (“metrics”), or that actually measure the gap itself (“measurements”). For the purposes of our investigation, we classified a study as a measurement study if it gathered information relevant to a stewardship gap area (whether through case studies, interviews, surveys, ethnography, or another method), and as a metrics study if it stated criteria that could be used as a basis for measurement. The value of measurements is that they help us understand specific attributes of stewardship that can be measured, for example the amount of resources or the size of archives; the value of metrics is that they help us understand how to measure stewardship or define measurements.

Our initial review of the literature led us to identify the 14 gap areas described in the previous section. We discuss measurement of these areas across the literature below, following some examples of what we considered to be measurement and metrics studies.

Examples of Measurement and Metrics Studies

Fecher et al.’s (2015) article “What Drives Academic Data Sharing” is an example of a study that includes both measurements and metrics. Fecher and colleagues describe a framework for understanding data sharing in academic settings, which we consider metrics. The framework comprises six categories of factors that contribute to data sharing. These are, as described in the paper:

  • Data donor, comprising factors regarding the individual researcher who is sharing data (e.g., invested resources, returns received for sharing)
  • Research organization, comprising factors concerning the crucial organizational entities for the donating researcher, being their own organization and funding agencies (e.g., funding policies)
  • Research community, comprising factors regarding the disciplinary data-sharing practices (e.g., formatting standards, sharing culture)
  • Norms, comprising factors concerning the legal and ethical codes for data sharing (e.g., copyright, confidentiality)
  • Data recipients, comprising factors regarding the third party reuse of shared research data (e.g., adverse use)
  • Data infrastructure, comprising factors concerning the technical infrastructure for data sharing (e.g., data management system, technical support)

In order to develop the framework, Fecher and colleagues conducted a systematic review of the literature and a survey of secondary data users, which we consider measurement. In their research, they explored questions such as why researchers do not share data, what returns or awards are received from data sharing, whether data sharing is encouraged by employers or funding agencies, what would motivate researchers to share data, and what value is gained from data sharing. They related the results from their survey to findings of other studies on data sharing in order to build the data sharing framework, which they believed had both theoretical and practical use. Their findings indicated that, in contrast to theoretical representations of open science or crowd science, “[r]esearch data is in large parts not a knowledge commons.” Their results pointed to “a perceived ownership of data (reflected in the right to publish first) and a need for control (reflected in the fear of data misuse). Both impede a commons-based exchange of research data.” This finding, they argued, had practical implications for policy:

“Considering that research data is far from being a commons, we believe that research policies should work towards an efficient exchange system in which as much data is shared as possible. Strategic policy measures could therefore go into two directions: First, they could provide incentives for sharing data and second impede researchers not to share.” (Fecher et al. 2015, p. 19)

Overall, they argued that their framework helped “to gain a better understanding of the prevailing issues and [provide] insights into underlying dynamics of academic data sharing” (Fecher et al. 2015, p. 19).

Fecher et al.’s study is unusual in addressing both measurement and metrics, and we found only one other study, “A game theoretic analysis of research data sharing” by Pronk et al. (2015), that articulated metrics for data sharing. This study describes a game theoretic model in which there is a cost associated with sharing datasets and a benefit associated with reusing datasets. The model includes such parameters as the time-cost to prepare a dataset for sharing and for reuse, the benefits of gaining citations, the probability of finding an appropriate dataset to reuse, and the percentage of scientists sharing their research data. The authors ran simulations with varying parameter values and found that not sharing data is always the best option for researchers individually; however, both researchers who share data and those who do not are better off when more researchers share, and more researchers can thus gain the benefits associated with reusing data. Pronk et al. note that this is a classic example of the prisoner’s dilemma. They conclude from their experiments that introducing a “citation benefit” for papers that are accompanied by a shared dataset is a more effective means of incentivizing and increasing rates of sharing than, for instance, reducing the costs of data sharing or making sharing obligatory through the use of policies.
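The structure of the dilemma that Pronk et al. describe can be illustrated with a toy payoff function. The Python sketch below is not their model: the functional form and all parameter values (share_cost, reuse_benefit, find_probability, citation_benefit) are assumptions chosen only to reproduce the qualitative pattern, and it treats the fraction of researchers who share as unaffected by any one researcher's choice.

```python
def payoff(shares, fraction_sharing, share_cost=1.0, reuse_benefit=4.0,
           find_probability=0.5, citation_benefit=0.0):
    """Toy payoff for one researcher (all parameters are hypothetical).

    Every researcher gains from reusing data in proportion to the fraction
    of the community that shares; only sharers pay the cost of preparing a
    dataset and receive any citation benefit attached to sharing.
    """
    reuse_gain = reuse_benefit * find_probability * fraction_sharing
    sharing_net = (citation_benefit - share_cost) if shares else 0.0
    return reuse_gain + sharing_net


# Without a citation benefit, not sharing is individually better at every
# level of community sharing, yet everyone gains as more researchers share.
for fraction in (0.0, 0.5, 1.0):
    print(fraction, payoff(True, fraction), payoff(False, fraction))

# A citation benefit larger than the sharing cost reverses the incentive.
print(payoff(True, 0.0, citation_benefit=1.5),
      payoff(False, 0.0, citation_benefit=1.5))
```

Under these assumed values a sharer in a fully sharing community is better off than a non-sharer in a community where nobody shares, even though withholding data is the better individual choice at any fixed level of sharing, which is the prisoner's dilemma pattern noted above.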

The majority of studies regarding data sharing concentrated on measurement alone, focusing on attitudes towards data sharing, whether and how data are shared, limits on data sharing (e.g., privacy, intellectual property, or security concerns), incentives for data sharing, and problems encountered when trying to share data.

Measurement and Metrics Across the Literature

Table 2 shows the number of studies, reports, and projects (hereafter referred to as “studies”) out of the 142 investigated in Sample C that either measure or provide metrics for measuring aspects of the stewardship gap. There are 56 distinct gap sub-areas within the 14 gap areas described above, which in turn fall within our final six areas. The sub-areas, gap areas, and final six areas are represented as Level 3, Level 2, and Level 1, respectively, in Table 2. We identified some type of study (related to either measurement or metrics) in 48 of the 56 sub-areas.

Table 2

Three levels of coding of gap areas and sub-areas and how many studies for each we identified in the literature. We identified some works as both measurement and metrics studies and some fell into multiple gap areas and sub-areas. The rows with totals include the distinct number of studies (out of 142) in each Level 1 gap area.

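Level 1 | Level 2 | Level 3 | Measurement Studies | Metrics Studies

Culture | Culture | Sharing attitudes and practices | 45 | 2
Culture | Culture | Standards | 8 | 0
Culture | Culture | Research and development culture | 6 | 0
Culture | Culture | Evaluation of quality | 5 | 10
Culture | Culture | Stewardship priority | 2 | 0
Culture | Culture | Demand for data | 1 | 0
Culture | Culture | Data definition | 1 | 0
Culture | Culture | Intellectual property | 1 | 0
Culture | Culture | Archive mandates and objectives | 0 | 0
Culture | Culture | Identifying what is valuable | 11 | 5
Culture | Legal and Policy | Lack of consistency and alignment | 11 | 3
Culture | Legal and Policy | Deficiencies that inhibit stewardship, access, and use | 10 | 0
Culture | Legal and Policy | Institutional structures and pressures | 6 | 1
Culture | Legal and Policy | Incentives that support stewardship, access, and use | 5 | 1
Culture | Measurement and Metrics Study Total | | 77

Knowledge | Knowledge | Amount of data | 27 | 5
Knowledge | Knowledge | Costs of stewardship | 14 | 10
Knowledge | Knowledge | Infrastructure for stewardship | 2 | 0
Knowledge | Knowledge | Where to deposit data | 2 | 0
Knowledge | Knowledge | Challenges of enabling data reuse | 1 | 0
Knowledge | Knowledge | How to preserve | 1 | 0
Knowledge | Knowledge | Provenance and authenticity | 0 | 3
Knowledge | Knowledge | Reuse possibilities | 0 | 0
Knowledge | Measurement and Metrics Study Total | | 47

Responsibility | Responsibility | Conduct stewardship activities | 9 | 8
Responsibility | Responsibility | Coordinate stewardship activities | 1 | 1
Responsibility | Responsibility | Support stewardship activities | 1 | 7
Responsibility | Measurement and Metrics Study Total | | 18

Commitment | Commitment | Lack of commitment | 1 | 1
Commitment | Commitment | Extent of commitment | 1 | 1
Commitment | Commitment | Duration of commitment | 0 | 1
Commitment | Measurement and Metrics Study Total | | 2

Resources | Human Resources | Lack of skills | 19 | 4
Resources | Human Resources | Lack of support for data management | 10 | 0
Resources | Human Resources | Lack of people | 5 | 0
Resources | Human Resources | Uneven distribution of skills | 2 | 0
Resources | Human Resources | Unequal access to resources and expertise | 0 | 0
Resources | Infrastructure and Tools | Lack of infrastructure | 19 | 2
Resources | Infrastructure and Tools | Lack of tools | 16 | 1
Resources | Infrastructure and Tools | Difficulty meeting generalized and special needs | 2 | 0
Resources | Infrastructure and Tools | Different timescales of infrastructure development and maturity | 0 | 0
Resources | Funding | Lack of funding | 12 | 0
Resources | Funding | Imbalance in funding | 0 | 0
Resources | Measurement and Metrics Study Total | | 37

Actions | Curation, Management, and Preservation | Fragmented data management | 21 | 0
Actions | Curation, Management, and Preservation | Insufficient data curation or management | 18 | 7
Actions | Curation, Management, and Preservation | Difficulty managing data for reuse | 14 | 2
Actions | Curation, Management, and Preservation | Difficulty establishing the trustworthiness of curated data | 1 | 2
Actions | Curation, Management, and Preservation | Difficulty maintaining the integrity of data over time | 1 | 9
Actions | Curation, Management, and Preservation | Tradeoffs between data management for short or long term | 0 | 0
Actions | Sustainability Planning | Business and economic models | 5 | 8
Actions | Sustainability Planning | Dynamic and adaptable infrastructure | 4 | 0
Actions | Sustainability Planning | Lack of strategy and planning | 4 | 0
Actions | Sustainability Planning | Design and staffing of organizations | 3 | 2
Actions | Collaboration | Lack of collaboration | 3 | 0
Actions | Collaboration | Challenges forming partnerships | 2 | 0
Actions | Collaboration | Support structures | 0 | 0
Actions | Collaboration | Lack of critical mass | 0 | 0
Actions | Sharing and Access | Sharing and Access | 36 | 7
Actions | Discovery | Discovery | 7 | 0
Actions | Reuse | Reuse | 26 | 2
Actions | Measurement and Metrics Study Total | | 87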

Many studies were relevant to more than one gap area. The overall distribution of studies is as follows: Culture: 77; Knowledge: 47; Responsibility: 18; Commitment: 2; Resources: 37; Actions: 87.

We did not find any measurement studies in the following areas in our sample: tradeoffs between data management for the short or long term; lack of critical mass for collaboration; support structures for collaboration; duration of commitment (one metrics study); archive mandates and objectives; provenance and authenticity; reuse possibilities; imbalance in funding; unequal access to resources and expertise; different timescales of infrastructure development and maturity. As can be seen in Table 2, there were many more areas for which we did not find metrics studies as well. We discuss these later.

The studies reviewed do not comprehensively represent all written works related to the stewardship gap, but they constitute a large subset of such works. The bibliography on which this analysis is based is posted online (see York et al. 2018a and York et al. 2018d), and we expect to add to it over time. A dynamic visualization of the data in Table 2 is available from York et al. (2018e).

Results

Many stories could be told from the results presented in Table 2. The results most pertinent from the perspective of measuring the stewardship gap are imbalances and differences in the numbers and types of studies in the different gap areas. These are, more specifically:

  • Imbalances in the attention given to different gap areas
  • Imbalances between the number of measurements and metrics studies
  • Differences in the depth of investigation undertaken

Imbalances in attention to different areas

The differing amounts of attention given to measuring different aspects of the stewardship gap are clear from the counts of studies in Table 2. The small amount of attention given to Commitment and Collaboration is particularly striking because these are two areas where deficiencies or strengths have the greatest potential impact on other gap areas (as identified in Figure 2). The large number of studies that focus on sharing and access (under the umbrellas both of Culture and Sharing and Access) in comparison to the smaller numbers on Sustainability Planning, Legal and Policy, Funding, and Curation, Management, and Preservation, is also notable given the influence the latter areas have on data sharing (also as shown in Figure 2).

Table 2 also illustrates the differing attention given to metrics across the gap areas. Some of the most striking results point to areas where no metrics were found (30 out of 56 areas). These include fragmented data management, lack of strategy and planning, dynamic and adaptable infrastructure, discoverability, kinds of collaboration, adequate funding or staff support, and different cultures of research and development. A lack of metrics in these areas may indicate a lack of common targets for individuals or organizations to achieve, or a deficiency in means of evaluating progress.
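The tallies cited in the discussion of Table 2 (some type of study in 48 of 56 sub-areas, no measurement studies in 10, and no metrics studies in 30) can be re-derived from the sub-area counts. The Python sketch below transcribes those counts and recomputes the tallies. The names TABLE_2 and count_where are hypothetical, and the metrics count for “Lack of support for data management” is assumed to be zero, following the “10 to zero” comparison made in the text.

```python
# (Level 2 gap area, Level 3 sub-area, measurement studies, metrics studies)
TABLE_2 = [
    ("Culture", "Sharing attitudes and practices", 45, 2),
    ("Culture", "Standards", 8, 0),
    ("Culture", "Research and development culture", 6, 0),
    ("Culture", "Evaluation of quality", 5, 10),
    ("Culture", "Stewardship priority", 2, 0),
    ("Culture", "Demand for data", 1, 0),
    ("Culture", "Data definition", 1, 0),
    ("Culture", "Intellectual property", 1, 0),
    ("Culture", "Archive mandates and objectives", 0, 0),
    ("Culture", "Identifying what is valuable", 11, 5),
    ("Legal and Policy", "Lack of consistency and alignment", 11, 3),
    ("Legal and Policy", "Deficiencies that inhibit stewardship", 10, 0),
    ("Legal and Policy", "Institutional structures and pressures", 6, 1),
    ("Legal and Policy", "Incentives that support stewardship", 5, 1),
    ("Knowledge", "Amount of data", 27, 5),
    ("Knowledge", "Costs of stewardship", 14, 10),
    ("Knowledge", "Infrastructure for stewardship", 2, 0),
    ("Knowledge", "Where to deposit data", 2, 0),
    ("Knowledge", "Challenges of enabling data reuse", 1, 0),
    ("Knowledge", "How to preserve", 1, 0),
    ("Knowledge", "Provenance and authenticity", 0, 3),
    ("Knowledge", "Reuse possibilities", 0, 0),
    ("Responsibility", "Conduct stewardship activities", 9, 8),
    ("Responsibility", "Coordinate stewardship activities", 1, 1),
    ("Responsibility", "Support stewardship activities", 1, 7),
    ("Commitment", "Lack of commitment", 1, 1),
    ("Commitment", "Extent of commitment", 1, 1),
    ("Commitment", "Duration of commitment", 0, 1),
    ("Human Resources", "Lack of skills", 19, 4),
    # Metrics count assumed to be 0 (see the lead-in above).
    ("Human Resources", "Lack of support for data management", 10, 0),
    ("Human Resources", "Lack of people", 5, 0),
    ("Human Resources", "Uneven distribution of skills", 2, 0),
    ("Human Resources", "Unequal access to resources and expertise", 0, 0),
    ("Infrastructure and Tools", "Lack of infrastructure", 19, 2),
    ("Infrastructure and Tools", "Lack of tools", 16, 1),
    ("Infrastructure and Tools", "Difficulty meeting needs", 2, 0),
    ("Infrastructure and Tools", "Different timescales", 0, 0),
    ("Funding", "Lack of funding", 12, 0),
    ("Funding", "Imbalance in funding", 0, 0),
    ("Curation", "Fragmented data management", 21, 0),
    ("Curation", "Insufficient data curation or management", 18, 7),
    ("Curation", "Difficulty managing data for reuse", 14, 2),
    ("Curation", "Difficulty establishing trustworthiness", 1, 2),
    ("Curation", "Difficulty maintaining integrity over time", 1, 9),
    ("Curation", "Tradeoffs between short and long term", 0, 0),
    ("Sustainability Planning", "Business and economic models", 5, 8),
    ("Sustainability Planning", "Dynamic and adaptable infrastructure", 4, 0),
    ("Sustainability Planning", "Lack of strategy and planning", 4, 0),
    ("Sustainability Planning", "Design and staffing of organizations", 3, 2),
    ("Collaboration", "Lack of collaboration", 3, 0),
    ("Collaboration", "Challenges forming partnerships", 2, 0),
    ("Collaboration", "Support structures", 0, 0),
    ("Collaboration", "Lack of critical mass", 0, 0),
    ("Sharing and Access", "Sharing and Access", 36, 7),
    ("Discovery", "Discovery", 7, 0),
    ("Reuse", "Reuse", 26, 2),
]


def count_where(rows, predicate):
    """Count the rows for which predicate(row) is true."""
    return sum(1 for row in rows if predicate(row))


sub_areas = len(TABLE_2)
gap_areas = len({row[0] for row in TABLE_2})
some_study = count_where(TABLE_2, lambda r: r[2] > 0 or r[3] > 0)
no_measurement = count_where(TABLE_2, lambda r: r[2] == 0)
no_metrics = count_where(TABLE_2, lambda r: r[3] == 0)
print(sub_areas, gap_areas, some_study, no_measurement, no_metrics)
```

Some sub-area labels are abbreviated in the sketch to keep lines short; the counts are those given in Table 2.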

Future research needs to address the importance of areas that have been little studied until now, and direct attention to those that will have the greatest impact on future stewardship.

Imbalances in measurements and metrics studies

There are several areas where the contrast between the numbers of measurement and metrics studies is particularly pronounced. These include sharing attitudes and practices (45 measurement studies to 2 that articulate metrics), reuse of data (26 to 2), fragmented data management (21 to 0), lack of skills (19 to 4), lack of infrastructure (19 to 2), lack of tools (16 to 1), lack of funding (12 to 0), difficulty managing data for reuse (14 to 2), lack of support for data management (10 to 0), and three legal and policy sub-areas: incentives (5 to 1), deficiencies (10 to 0), and consistency and alignment (11 to 3).

One of the common challenges encountered by studies of stewardship gap areas is the difficulty of obtaining comparable results across different academic domains, especially at large scale. For example, Borgman et al. (2014) note that while the case study method they use to investigate research data infrastructures could be used in other domains, large-scale surveys would likely be less effective due to the importance of local context. Similarly, in their study of the value and impact of research data, Beagrie and Houghton (2014) describe the challenges of conducting their study in different contexts: “The data collection and economic analysis are time consuming and need to be tailored to the specific nature of operation and use of each data centre.” It is possible that a greater focus on metrics in the areas above (e.g., what indicates a lack of infrastructure; what it means for data management to be fragmented; how the difficulty of managing data can be quantified) as well as areas where metrics have been articulated but not widely agreed upon, would result in the collection of more consistent information in different contexts and domains. This could in turn result in more consistent measurement and comparison of research findings across disciplinary boundaries and at scale.

The imbalance between measurement and metrics studies suggests that future research should emphasize metrics, effectively setting broadly applicable standards for measuring discrete aspects of effective stewardship in order to understand how to improve stewardship, both in specific research and data domains, and more generally across the board.

Differences in the depth of studies

The literature contains multiple types of studies. In one common type, which we termed “targeted,” the entire investigation is focused in one or two closely related areas, such as resources or specific actions like curation (e.g., Akmon 2014, Atkins 2003, Ayris et al. 2010, Beagrie and Houghton 2013a and 2013b, Borgman et al. 2014, Cirrinnà et al. 2013). Another common type comprises “wider” studies, which investigate several different gap areas at once, often in the context of a single institution, a nation’s scientific enterprises, or a comparative international framework (e.g., Alexogiannopoulos et al. 2010, Gibbs 2009, Hoeflich Mohr et al. 2015, Jerrome and Breeze 2009, Kuipers and van der Hoeven 2009, Martinez-Uribe 2009, Mitcham et al. 2015, Open Exeter Project Team 2012, Parsons et al. 2013, Perry 2008, Peters and Dryden 2011, Thornhill and Palmer 2014, UNC-CH 2012, Waller and Sharpe 2006). Of the 142 studies we reviewed, 115 were targeted and 28 were wider.

Wider studies, though they may cover many topics (e.g., in the context of a survey), often have only one or a few questions about any given gap area (Wynholds et al. 2011 and Tenopir et al. 2012 are two exceptions that examine multiple gap areas in depth). A raw count of studies including both targeted and wider studies may thus overestimate the depth of investigation that has occurred in a particular area. Table 3 shows 16 of the 50 overall gap sub-areas where we found either a measurement or metrics study. In all of these 16 there is a significantly higher proportion of wider studies (which did not necessarily investigate the indicated area in depth) than of targeted studies. We include only measurement studies in the table as all but one metrics study, related to responsibility for conducting stewardship activities, were targeted.

Table 3

Gap sub-area measurement studies with a larger proportion of “wider” studies than “targeted”.

| Gap sub-area | Measurement: Targeted | Measurement: Wider | Measurement: Total |
|---|---|---|---|
| Fragmented data management | 7 | 14 | 21 |
| Lack of infrastructure | 4 | 16 | 19 |
| Lack of skills | 4 | 15 | 19 |
| Difficulty managing data for reuse | 3 | 11 | 14 |
| Insufficient data curation or management | 4 | 14 | 18 |
| Lack of funding | 2 | 10 | 12 |
| Lack of tools | 4 | 12 | 16 |
| Identifying what is valuable | 4 | 7 | 11 |
| Lack of support for data management | 0 | 10 | 10 |
| Conduct stewardship activities | 0 | 9 | 9 |
| Deficiencies that inhibit stewardship, access, and use [in legal and policy areas] | 0 | 10 | 10 |
| Standards | 1 | 7 | 8 |
| Incentives that support stewardship, access, and use | 0 | 5 | 5 |
| Evaluation of quality | 1 | 4 | 5 |
| Lack of people | 1 | 4 | 5 |
| Lack of strategy and planning | 1 | 3 | 4 |
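The proportions behind Table 3 can be recomputed from the printed counts. The sketch below is illustrative only: the figures are copied from Table 3 as printed, and the names `table3`, `wider_share`, `shares`, and `mismatched` are assumptions introduced here. It derives each share from the targeted and wider counts and, as a sanity step, lists any row whose two counts do not sum to its printed total rather than altering it.

```python
# (targeted, wider, total) measurement-study counts, copied from Table 3 as printed.
table3 = {
    "Fragmented data management": (7, 14, 21),
    "Lack of infrastructure": (4, 16, 19),
    "Lack of skills": (4, 15, 19),
    "Difficulty managing data for reuse": (3, 11, 14),
    "Insufficient data curation or management": (4, 14, 18),
    "Lack of funding": (2, 10, 12),
    "Lack of tools": (4, 12, 16),
    "Identifying what is valuable": (4, 7, 11),
    "Lack of support for data management": (0, 10, 10),
    "Conduct stewardship activities": (0, 9, 9),
    "Deficiencies that inhibit stewardship, access, and use": (0, 10, 10),
    "Standards": (1, 7, 8),
    "Incentives that support stewardship, access, and use": (0, 5, 5),
    "Evaluation of quality": (1, 4, 5),
    "Lack of people": (1, 4, 5),
    "Lack of strategy and planning": (1, 3, 4),
}


def wider_share(targeted, wider):
    """Fraction of a sub-area's measurement studies that are wider studies."""
    return wider / (targeted + wider)


# Share of wider studies per sub-area, computed from the two component counts.
shares = {area: wider_share(t, w) for area, (t, w, _) in table3.items()}

# Rows whose targeted and wider counts do not add up to the printed total.
mismatched = [area for area, (t, w, total) in table3.items() if t + w != total]

if __name__ == "__main__":
    for area, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
        print(f"{area:55s} {share:.0%}")
    print("Rows to re-check against the source data:", mismatched)
```

Every share exceeds one half, consistent with the table's caption. Any flagged row would need to be checked against the underlying bibliography data rather than adjusted here.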

The proportion of targeted versus wider studies is an important factor in understanding the universe of research relevant to the stewardship gap. In many cases, such as those indicated in Table 3, not only more research but more in-depth research is critical to advance our knowledge of the stewardship gap and to give guidance to policy makers, researchers, and research institutions about ways that they can ensure that the research data critical for future success are well stewarded.

Conclusion

This paper has reported the results of our efforts to understand the nature and characteristics of the stewardship gap through a review of relevant literature. In the process of our review we came to understand that there is not a single stewardship gap, but rather numerous and diverse components that contribute to and influence whether research data are responsibly stewarded. We identified 14 gap components or areas from the literature and the relationships between them. We further categorized these components into six major areas, Culture, Knowledge, Responsibility, Commitment, Resources, and Actions, and identified studies that had been conducted to measure or develop metrics in these areas and corresponding subareas. Our effort to measure the stewardship gap led us to focus on three primary results: imbalances in the attention given to different gap areas in the reviewed literature, imbalances in the number of measurement versus metrics studies, and differences in the depth at which studies investigated gap areas.

Our review has shown that the stewardship gap literature is rich with descriptions of challenges to effective stewardship, but that measurement of those challenges is not necessarily balanced. At the same time, the literature is also rich with descriptions of the relationships between challenge or gap areas, and these relationships can provide guidance to institutions and organizations, acting individually or cooperatively, to prioritize and affect gap areas that are most relevant to their situations and needs. Some key questions going forward are:

  • What strategies are most effective for addressing particular gaps or combinations of gaps, and over what timescales?
  • How might these strategies differ depending on discipline, cultures of practice, or levels of knowledge, responsibility or commitment?
  • How can we improve ongoing measurement and evaluation of gap areas to adjust strategies appropriately over time?
  • How can we stay abreast of changes to the gap areas themselves to ensure meaningful and accurate measurement?

Regarding the final two questions, it is important to note that the gap areas presented in this paper do not represent all gap areas, only those identified in the literature reviewed. In addition, our review does not cover all works that have been written that are relevant to the stewardship gap. Although it covers a significant subset, and has significantly guided the direction of our research, the stewardship gap bibliography is a work in progress that we expect will become more comprehensive over time through continuing investigation.

Data Accessibility Statement

The works reviewed in the samples of literature used in the study (samples A, B, and C), together with information on the evidence of gaps, gap relationships, and study designations associated with each, are compiled in the “Stewardship Gap Project Bibliography,” which is available at https://doi.org/10.7302/Z2ZW1J47.

Additional File

The additional file for this article can be found as follows:

Appendix

Description of Gap Areas. DOI: https://doi.org/10.5334/dsj-2018-019.s1