Interest in research data and its preservation, findability, and reuse is focused on currently produced, machine-readable data. Older data in paper or analog format held in labs, offices, and archives across research institutions are an often overlooked resource although their use is evident in individual studies throughout the literature.
Researchers across a wide variety of the sciences make use of older analog data. There are examples from ecology, fisheries, forestry, climate change, geosciences, and many other fields (Buma et al. 2019; Allen & Mandrak 2019; Chen 2019; McGowan et al. 2012; Bradshaw, Rickards, & Aarup 2015). Although longitudinal studies make valuable contributions to the knowledge base in many fields, it can be difficult and expensive to collect data over a multi-year period. Combining previously collected data with new data is another way to study changes over time (Clavero & Delibes 2013; Burgi, Steck, & Bertiller 2010; Coulter et al. 2020).
Through informal interactions with researchers on our campus we came to believe that there were accumulations of older analog data scattered throughout labs and offices. In these conversations, the data seemed to be viewed as having value, but it was unclear if researchers were planning to share them with others or preserve them for the long term. Beyond the use of these types of data by individual researchers, we found little in the literature to address the broader issues of how much of these data exists, how they are discovered by researchers, and what efforts are being taken to safeguard or preserve them. Almost everything we did find was limited to particular disciplines. This included a book chapter that grew out of a 1984 conference on research data management in the ecological sciences, a 1995 report by the Future of Long-term Ecological Data (FLED) committee of the Ecological Society of America that includes recommendations about a registry and cooperation among organizations, and a 2015 article in an earth sciences journal that discussed the value of “heritage” data and the fact that it may be at risk (Bowser 1986; Gross & Pake 1995; Griffin 2015). One recent article does look across disciplines as it provides guidance on assessing risks to scientific data, both print and electronic (Mayernik et al. 2020).
To work toward a better overall understanding of this issue, we decided to conduct a survey to assess the landscape of older analog data at one large research university. We wished to locate individuals who held or knew about older analog data to find out the nature of the data, whether it had been reused, and what plans they had for its future. Two of the authors have backgrounds in life sciences, which gave us an understanding of the types of data that might exist, its condition, and its potential value. Our expertise in data management informed this work, and we anticipated that there were more issues, but we did not know what they were. We were concerned about the fate of the data and we wanted to become more familiar with the landscape so we could consider what services might be useful.
This study aimed to examine the extent of agricultural and life sciences analog data existing in laboratories, offices, departments, research centers, and field stations, and how researchers are currently using, storing, and planning for the future of the data.
For the purpose of this study, analog data is defined as non-digital data, primarily in print, such as field books, lab notebooks, ledgers, data sheets, photographs, maps, drawings/sketches, slides, and so on. It does not include physical specimens (e.g., soil samples, tissue samples, herbarium specimens, organisms), but would include any analog data describing physical specimens.
Participants for the survey were recruited based on their employment status as faculty within either the College of Biological Sciences (CBS), the College of Food, Agricultural and Natural Resources Sciences (CFANS) or Extension. Direct email invitations to participate in the survey were sent to 772 faculty in these colleges. The survey was distributed using Qualtrics and was open for one month from November 2018 to December 2018. One reminder message was sent during the open survey period. Sixteen emails were undeliverable.
The survey was 31 questions long, with a mixture of multiple-choice and free-text questions. Questions aimed to discover first if researchers had any analog data in their possession and then:
A full list of questions is available in Appendix A. Data from Qualtrics was pulled into Excel and analyzed. In cases where answers were written as free text, results were coded by two authors into broader categories to preserve anonymity. For the question about how much analog data a researcher possessed, we coded their responses using the guidelines for standardized holding counts and measures (see Appendix B). Where their response could be reasonably quantified, respondents’ answers were converted into linear feet, a measure of shelf space necessary to store documents on edge or horizontally.
Results cover the quantity, condition, and physical accessibility of analog data in the possession of researchers, their preservation and sharing practices, and feelings and practices around data reuse. In all, 108 people responded to the survey, representing a 14% response rate. However, the number of people responding to each survey question varied, based on the survey logic.
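The reported response rate can be recomputed from the figures given in the methods (772 invitations, 16 undeliverable emails, 108 responses). The following is a minimal Python sketch; the function and constant names are illustrative, and the adjusted rate that excludes undeliverable invitations is an assumption added for comparison, not a figure reported in this study.

```python
def response_rate(responses: int, invited: int) -> float:
    """Return the response rate as a percentage of invitations."""
    if invited <= 0:
        raise ValueError("invited must be positive")
    return 100 * responses / invited


INVITED = 772        # direct email invitations sent
UNDELIVERABLE = 16   # invitations that could not be delivered
RESPONSES = 108      # people who responded to the survey

# Rate over all invitations, matching the figure stated in the text.
reported = response_rate(RESPONSES, INVITED)

# Assumption (not used in the study): count only delivered invitations.
adjusted = response_rate(RESPONSES, INVITED - UNDELIVERABLE)

print(f"{reported:.1f}% of all invitations")
print(f"{adjusted:.1f}% of delivered invitations")
```

Either denominator rounds to the same whole-number rate, so the choice does not change the 14% reported above.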
Respondents held a range of faculty positions, including: assistant professor (20%; 18), associate professor (16%; 14), full professor (39%; 35), and emeritus professor (4%; 4). Twenty-one percent (19 people) of respondents indicated holding other positions, such as administrative, extension educators, and a variety of other extension roles. Extension researchers, who may have dual college appointments, represented 51% (45 people) of the respondents.
Respondents were asked to indicate the college in which they held their appointment. The majority represented CFANS (52 people), followed by Extension (23 people). There were only 7 respondents from CBS. Colleges outside those targeted were also represented, including Medicine (2 people) and the College of Education and Human Development (1 person). One person did not indicate which college they represented.
Along with college, we asked respondents what department they worked in. Twenty-one different departments were represented. There was high representation from Fisheries, Wildlife and Conservation Biology (9 people), Forest Resources (9 people), Horticultural Science (6 people), Applied Economics (6 people), Extension - Agriculture, Food and Natural Resources (5 people), and Extension - Community Vitality (5 people). The remaining departments each had fewer than five respondents and included: Plant Pathology; Genetics, Cell Biology and Development; Ecology, Evolution and Behavior; Food Science and Nutrition; Bioproducts and Biosystems Engineering; Entomology; and Soil, Water and Climate.
Survey respondents were asked how long they have been doing research in their discipline and how long they have been at the University of Minnesota (UMN). The largest number of respondents indicated they have been doing research in their discipline for more than 20 years (48%; 43). Early career researchers (0–5 years) represented only 11% (10) of responses. Those working from 6–10 years were also 11% (10) of responses whereas those working from 11–15 years were 21% (19) of responses (see Figure 1). Similarly, those who have been at the university over 20 years constituted 35% (31) of respondents. Twenty-six percent of researchers (23) had only been at UMN for 0–5 years (see Figure 2).
Seven questions in the survey asked researchers about the quantity of analog data in their possession in order to determine the breadth and depth of analog data issues on campus. In the first question, respondents were asked whether or not they had analog data in their possession (e.g., field books, lab notebooks, ledgers, datasheets, photographs, maps, drawings/sketches, slides, etc.). The majority of respondents, 69% (74), said yes and 31% (34) said no.
Respondents were also asked to specify who produced the data. In this select-all-that-apply question, 55 faculty said they personally created the data; however, in 41 cases graduate students or postdocs were the creators, and in 35 cases staff had created the data. Additional selections included predecessors (17) and collaborators (17). Six respondents wrote in answers that mentioned volunteers, undergraduate students, other faculty, and government staff. Additionally, the majority (95%, 54) of respondents identified themselves as the person responsible for the ongoing management of the data. Of these, 38 people said they had sole responsibility whereas 16 people said that numerous people held responsibility, including staff, research technicians, postdoctoral researchers, graduate students, and undergraduates. Only three people said that others, including staff or graduate students, were responsible for the data.
Researchers were asked to quantify the amount of data in a free-text answer. They were given the opportunity to describe the quantity in a manner that would allow them to more easily estimate the amount, rather than being asked to use predefined categories that might not describe their situation. These responses ranged from a few notebooks to several file cabinets. Of the 61 respondents who answered this question, 6 did not quantify an amount of data, replying “dozens of large boxes” or “some”, and their responses were not included in the results. The remaining 55 respondents had a sum total of 1,414.83 linear feet of analog data, with individual totals ranging from 0.02 linear feet to 864 linear feet. More than one-half of respondents (57%, 31) reported they had under 5 linear feet, and 16% (9) of respondents reported 5–10 linear feet. Fewer respondents reported larger amounts: 11–15 linear feet (9%, 5), 16–25 linear feet (7%, 4), 26–50 linear feet (7%, 4), and 51–75 linear feet (2%, 1). One respondent was an extreme outlier, reporting 864 linear feet (2%, 1). Excluding the outlier, 73% (40) of respondents had 10 linear feet or less of analog data, and the remaining 25% (14) had between 11 and 75 linear feet of analog data.
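The grouped figures at the end of this paragraph can be reproduced from the per-range counts reported above. The following minimal Python sketch is an illustration only: the dictionary keys and variable names are assumptions, and the shares are computed over all 55 quantified responses, which appears to be the denominator that yields the grouped percentages given in the text.

```python
# Number of respondents per size range (linear feet), as reported in the text.
BINS = {
    "under 5": 31,
    "5-10": 9,
    "11-15": 5,
    "16-25": 4,
    "26-50": 4,
    "51-75": 1,
    "864 (outlier)": 1,
}


def share(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    return 100 * count / total


total = sum(BINS.values())  # all responses that could be quantified

# Group the ranges the same way the text does.
small = BINS["under 5"] + BINS["5-10"]
mid = sum(BINS[k] for k in ("11-15", "16-25", "26-50", "51-75"))

print(f"quantified responses: {total}")
print(f"up to 10 linear feet: {small} ({share(small, total):.0f}%)")
print(f"11-75 linear feet: {mid} ({share(mid, total):.0f}%)")
```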
Additionally, respondents were asked to estimate the length of time that the analog data covers (see Figure 3). The responses ranged from one year to over ninety years of data. Respondents were only asked to estimate the number of years, rather than specific dates.
We also asked what subject areas or topics their analog data covers. This was a write-in question; responses were coded into 40 broad categories. Agriculture (11) and ecology (10) data were most highly represented. Horticulture (8), forestry (7), plant pathology (6), and avian population biology (5) were mentioned at high frequency. The rest of the topics were mentioned fewer than five times; however, they represented a wide range of topics, including invasive species, genetics, clean energy, veterinary medicine, physics, engineering, and landscape design.
The majority of respondents, 66% (41), indicated that the analog data are not tied to physical specimens that are held elsewhere. However, 34% (21) did indicate physical specimen-related data. The data related to specimens referred to collections held at the University of Minnesota’s Bell Museum of Natural History (e.g., herbarium or avian collections) and Insect Collection. Additional mentions included plant, reptile, invertebrate, and biological samples.
When asked about where analog data are physically stored in a select-all-that-apply question, the majority (55) stated in their office, followed by 24 that had data in their labs, and four stated elsewhere on campus. However, several indicated that their analog data are offsite with seven stating they were at their home (see Figure 4). One mentioned that their graduate students had it and another person stated that it was in the “home of a friend in India - too expensive to bring it back.”
In addition to asking respondents about the background of their analog data, this survey sought to explore the current and potential future use of the data, along with attitudes around sharing and reusing data, and preservation measures being undertaken.
When asked if they were still using the data in analog form, 48% of the respondents (30) stated yes. Thirty-five percent (22) said no and 16% (10) stated they were not currently using it, but planned to in the future.
The vast majority of respondents, 79% (49), were still adding to the analog data with only 21% (13) not doing so. However, respondents were split as to whether they were adding to the analog data either in analog form (21%; 13), in digital form (11%; 7), or both in analog and digital formats (47%; 29).
The majority of respondents, 58% (36), indicated that the analog data in their possession is in both the raw (unchanged since collection) and processed (edited, cleaned, transformed, analyzed, etc.) formats. However, 37% (23) only have analog data in the raw format, and 5% (3) only have it in the processed format. When asked if the data were sensitive, confidential or proprietary, the majority, 71% (44), indicated they were not, while 18% (11) stated that they were sensitive, and 11% (7) were unsure.
When researchers were asked if their analog data has been digitally scanned, 40% (25) said yes, and 60% (37) said no. Those that had scanned data were using a variety of storage methods. The majority (21) stated the scanned data are stored on a university or departmental drive. Fourteen stored it on a personal device or drive (e.g., hard drive, external drive, personal cloud account), while smaller numbers uploaded it to a website or a data repository (see Figure 5). When asked who had access to the scanned data, many respondents (15) stated it was just them. Others stated it was people working in their lab (9), collaborators (7), and departmental (6) or other university staff (6). No one mentioned that their scanned data were freely available on the web.
Additionally, we asked if the analog data were machine readable (i.e., written tables that were rekeyed into a spreadsheet). Responses were closely split, with 53% (32) indicating yes, their data were made machine-readable, and 48% (29) indicating they have not been made machine-readable.
We asked if written documentation existed about the data (e.g., a description of methods, analyses, how it was processed, etc.) and 69% (43) indicated yes documentation exists, while 31% (19) said no. However, virtually all respondents to this question (92%; 57) said there was someone who could describe/explain how the data was collected, analyzed, or processed, and only 8% (5) said no, no one could explain the data.
In addition to wanting to understand the physical state of the analog data, we hoped to learn if the data has been reused or shared. We asked if anyone in their lab or department had reused the data in a new study. The majority of respondents (66%; 41) stated they have used the data in a new study and 34% (21) stated they had not reused the data. The respondents were asked how they would feel about sharing their analog data with researchers: 50% (31) would share it but have never been asked, 40% (25) have shared it in the past, and 10% (6) said they are not interested in sharing it. As a follow-up we asked in what formats they shared the data. The formats included: the original raw data, scans, photocopies, rekeyed spreadsheet data (both raw and summative), and verbal distribution. In addition, we asked what were their reasons for not wanting to share their data. These reasons included:
Respondents were also asked what will happen to the analog data when they leave their institution. Replies fell into three main categories: respondents hope the data will be retained by their successor or their department (2 people), the data will be destroyed or thrown out (5 people), or they were unsure (10 people). Two respondents mentioned that they hope their data is archived or digitized when they leave the institution. One person discussed the complexities of preserving analog data, stating:
“Assuming someone in the department or university system would need it or use it, the material could be given to them. If there is no apparent need among the current researchers it would be offered to the library for catalogue and archiving. If the library decided it was not worth keeping, it would be in short term storage until someone decided to throw it away.”
In this study, we learned much about faculty researchers’ analog data: how much data exists on our campus, what condition it is in, if it is accessible, and what researchers are doing (if anything) to preserve it. Due to the vast increase in digital data over the last fifty years, we assumed that most researchers would no longer be collecting data in analog formats. However, our research showed that this was not uniformly true. Some people are still adding to their data solely in analog format; however, many more are adding to their historical analog data in both analog and digital formats. Laid end to end, the quantity of data respondents estimated that they had control over would span roughly four American football fields (each 360 feet long: 100 yards of playing field plus a 10-yard-deep end zone at each end). Analog data storage is a concern, as there is often only one unique copy available and if it is lost, it is not recoverable. Upon asking respondents where they were storing their analog data, 7% of them indicated it was at home, while 5% described other off-site scenarios, illustrating that it may be vulnerable and at risk. Acquisition and appraisal of off-site historical analog data, as a subset of a group of records, fails in part because of the length of time between record creation and efforts to recover or preserve materials (Maday & Moysan 2014). For example, if the original creator is incapacitated or deceased, family members or colleagues may be unaware of the data’s existence or value, may not know how to interpret it, or may be unsure of whom to contact to manage the records.
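The football-field comparison above can be checked with simple arithmetic. The minimal Python sketch below assumes a field length of 120 yards (a 100-yard playing field plus two 10-yard end zones), which equals the 360 feet cited, and uses the 1,414.83 linear feet total reported in the results. All constant and variable names are illustrative.

```python
FEET_PER_YARD = 3
PLAYING_FIELD_YARDS = 100
END_ZONE_YARDS = 10  # one end zone at each end of the field

# Full field length in feet, end zones included.
field_length_feet = (PLAYING_FIELD_YARDS + 2 * END_ZONE_YARDS) * FEET_PER_YARD

# Sum of analog data reported by the 55 respondents who quantified it.
total_linear_feet = 1414.83

fields = total_linear_feet / field_length_feet
print(f"field length: {field_length_feet} feet")
print(f"analog data spans about {fields:.2f} field lengths")
```

The result falls just under four field lengths, which is consistent with the comparison in the text.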
Although digitization is seen as a way to preserve analog datasets and increase their accessibility, scanning old data is known to be both time and resource intensive (Bosilovich et al. 2013; Farrell et al. 2019). Yet, more than 40% of our survey respondents have done some scanning, illustrating their commitment to preserving the data. Their protocols and standards were unreported. However, when asked how they were storing their scanned data, a fairly large number of people were storing it in unstable electronic formats or vulnerable locations: 36% of people were storing these data on personal drives and 5% were storing the data on a website. While researchers are becoming more accustomed to sharing their digital research data via data repositories, sharing of historical paper or analog data (or its scanned counterpart) does not seem to be as common. We saw only 5% of respondents were sharing their scanned analog data in repositories. The majority of researchers (53%) have made select parts of their data machine readable, whether or not the entire corpus has been scanned.1 This speaks to the value that the data holds to the researchers as it is a time and resource intensive process.
We hoped to learn if these analog datasets could be reused and about researchers’ perception around data sharing. In theory, most of the data described in this study should be reusable since respondents indicated that there is either adequate documentation for the dataset or someone who could describe the data to other researchers. Ninety percent of respondents stated that they either would share or have already shared their analog datasets. Therefore, if these datasets were made discoverable and accessible, they could be used in additional research studies. However, many researchers were uncertain about their analog data’s future and do not know what will happen to their data when they leave the university. Action must be taken to intervene and help researchers plan for their data’s preservation or in many cases, it will most likely be lost. While policies and best practices have recently been developed for currently produced electronic data on how to describe, manage, preserve and share data, similar guidelines often did not exist when this older data was generated. UMN does not currently have a policy around managing legacy data.
There is a sense of urgency to this problem: as our results showed, many of those holding analog data are late-career faculty, potentially close to retirement. They are uncertain about where their data will go when they leave the institution and no broad solution exists to help them with this dilemma. Researchers have held on to these data for long periods of time and yet many do not have plans for how to safeguard them. Although university archives have been broadly mentioned as a possible repository, policies for intaking research data vary on an institutional basis and it is not guaranteed that archives would accept an individual faculty member’s research output.2
However, preservation is only one facet of the problem. If the data is deemed to be valuable, and the goal is to share and have this research data reused in further scientific studies, it is crucial that work be done to make it findable and accessible. The FAIR (findable, accessible, interoperable, reusable) principles serve as guidance on working towards these goals (Wilkinson et al. 2016). Libraries are in a position to help researchers tackle these problems, as they bring disciplinary subject expertise as well as expertise in research data management, metadata creation, indexing, repository management, management of archival materials, and search/discovery platforms. Subject librarians could build on their relationships with academic departments and individual faculty members to initiate conversations with researchers now about their historical analog data and identify potential solutions that both preserve it and increase its findability and accessibility. Some of the data may be able to be ingested into data repositories, some may be able to be scanned and placed into digital repositories, some could possibly be added to archives in its analog format, and other analog datasets may be able to be indexed and identified via a registry. A case study that investigated how to organize, preserve, and make a historical fruit breeding analog dataset discoverable implemented a variety of these strategies (Farrell et al. 2019). For other data sets, a different combination of methods might be more appropriate. However, if we wait to have these discussions until researchers are either on the verge of retirement or have already retired, it may be too late to save the data.
This study had several limitations: it took place at a single institution and represented the responses of faculty within only three colleges. The response rate was below 15%, the largest share of those responding were full professors, and nearly half had been working in their discipline for more than 20 years.
As a follow-up to this survey, we are interviewing select faculty at our institution about their analog data in order to gain a deeper understanding of the kinds of data, what amount of analog data has not been previously published, how much of the data has been inherited from a predecessor, and what kinds of services researchers would like to help them with their data. We also hope to continue to work with individual faculty (both late-career and others) to improve organization and documentation of their existing analog datasets so that they are in line with digital data management standards. Finally, additional research could address this study’s limitations by expanding the types of disciplines and institutions surveyed and interviewed in order to understand the full scope of analog data on university campuses.
This work was an attempt to take a deeper look at the issue of analog data in the life sciences on the UMN campus. Because we limited our survey pool to three colleges, we recognize that there is probably much more analog data on our campus across various disciplines. We have identified that these data exist in large amounts, that researchers have retained it over time, that many do not have long-term preservation plans in place, and that many would share or have previously shared it. Therefore, we should be working to help them develop storage and preservation plans and the means to share their research data. The work that we did with horticultural researchers regarding their historical fruit breeding data serves as an example of one approach that could be taken and the impact it can have (Farrell et al. 2019).
Although work has been done in various smaller groups or for specific purposes, particularly in the ecological sciences and earth sciences, larger solutions have not emerged. On a wider scale, discoverability remains a problem. Researchers need to know these data exist in order to reuse them. Libraries are in a position to collaborate on devising broad-scale, cross-disciplinary solutions with their high-level view spanning multiple disciplines and expertise in metadata description, preservation, and repository management.
If we are to ensure that these data sets survive over time, work must be done to identify and preserve them now. As researchers leave their institutions or retire, their data are at risk of loss. These data, even though they could be reused in future scientific studies, could end up discarded. As we have seen, historical data has been incorporated with newer data in research studies to observe changes over time. This kind of work is particularly important in studies of biodiversity and climate change.
The additional files for this article can be found as follows:
Appendix A. Survey Questions. DOI: https://doi.org/10.5334/dsj-2020-051.s1
Appendix B. Guidelines for Standardized Measurements. DOI: https://doi.org/10.5334/dsj-2020-051.s2
1 Our survey did not ask respondents to supply a rationale for selection of items for digitization or conversion to machine-readable format.
2 While our university does have a records information management policy, the implementation is open to interpretation by each curator. Things that affect their decision to consider acquiring the researcher’s files include the researcher’s notability, condition of the files, supporting information, and space considerations.
The authors have no competing interests to declare.
Allen, B and Mandrak, NE. 2019. Historical changes in the fish communities of the Credit River watershed. Aquatic Ecosystem Health & Management, 22(3): 316–328. DOI: https://doi.org/10.1080/14634988.2019.1672463
Bosilovich, MG, et al. 2013. On the Reprocessing and Reanalysis of Observations for Climate. In: Asrar, GR and Hurrell, JW (eds.), Climate Science for Serving Society: Research, Modeling and Prediction Priorities. Dordrecht: Springer Netherlands, pp. 51–71. DOI: https://doi.org/10.1007/978-94-007-6692-1_3
Bowser, CJ. 1986. Historic data sets: lessons from the past, lessons for the future. In: Michener, WK (ed.), Research Data Management in the Ecological Sciences. Columbia: University of South Carolina Press, pp. 155–179.
Bradshaw, E, Rickards, L and Aarup, T. 2015. Sea level data archaeology and the Global Sea Level Observing System (GLOSS). GeoResJ. (Rescuing Legacy Data for Future Science), 6: 9–16. DOI: https://doi.org/10.1016/j.grj.2015.02.005
Buma, B, et al. 2019. 100 yr of primary succession highlights stochasticity and competition driving community establishment and stability. Ecology. DOI: https://doi.org/10.1002/ecy.2885
Burgi, M, Steck, C and Bertiller, R. 2010. Evaluating a Forest Conservation Plan with Historical Vegetation Data: A Transdisciplinary Case Study from the Swiss Lowlands. Gaia-Ecological Perspectives for Science and Society, 19(3): 204–212. DOI: https://doi.org/10.14512/gaia.19.3.10
Chen, X. 2019. Dynamics of forest composition and growth in Alabama of USA under human activities and climate fluctuation. Journal of Sustainable Forestry. Philadelphia: Taylor & Francis Inc, 38(1): 54–67. DOI: https://doi.org/10.1080/10549811.2018.1497998
Clavero, M and Delibes, M. 2013. Using historical accounts to set conservation baselines: the case of Lynx species in Spain. Biodiversity and Conservation, 22(8): 1691–1702. DOI: https://doi.org/10.1007/s10531-013-0506-4
Coulter, A, et al. 2020. Using harmonized historical catch data to infer the expansion of global tuna fisheries. Fisheries Research, 221: 105379. DOI: https://doi.org/10.1016/j.fishres.2019.105379
Farrell, SL, et al. 2019. Resurfacing Historical Scientific Data: A Case Study Involving Fruit Breeding Data. Journal of eScience Librarianship, 8(2). DOI: https://doi.org/10.7191/jeslib.2019.1171
Griffin, ER. 2015. When are Old Data New Data? GeoResJ, 6: 92–97. DOI: https://doi.org/10.1016/j.grj.2015.02.004
Gross, KL and Pake, CE. 1995. Final report of the Ecological Society of America committee on the future of Long-term Ecological Data (FLED). Washington, DC: Ecological Society of America, p. 122.
Maday, C and Moysan, M. 2014. Records management for scientific data. Archives and Manuscripts, 42(2): 190–192. DOI: https://doi.org/10.1080/01576895.2014.911686
Mayernik, MS, et al. 2020. Risk Assessment for Scientific Data. Data Science Journal. Ubiquity Press, 19(1): 10. DOI: https://doi.org/10.5334/dsj-2020-010
McGowan, S, et al. 2012. Humans and climate as drivers of algal community change in Windermere since 1850. Freshwater Biology, 57(2): 260–277. DOI: https://doi.org/10.1111/j.1365-2427.2011.02689.x
Wilkinson, M, et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3: 160018. DOI: https://doi.org/10.1038/sdata.2016.18