Introduction

Concerns about the management of data, including its preservation, findability, and reuse, are almost entirely focused on recently generated data in electronic, machine-readable formats. While many principles of electronic data management, such as proper description and good organization, apply to data in any format, the application of those principles to older data in non-electronic formats has received little attention.

In this paper we review publications in various scientific fields that discuss older data in analog or print format and the use or reuse of older data in general. By analog data we mean items in print format such as numeric data as well as field or lab notebooks, photographs, drawings, and maps. Analog data may also be called historic data, legacy data, heritage data, or dark data, although these and other phrases can include older data that is not necessarily in print format. Some authors use the term ‘data rescue’, which has also been used to describe recent efforts to duplicate and secure electronic data that may be at risk of loss (see Data Refuge: https://www.datarefuge.org/).

Our interest in this topic began when a few senior faculty members approached the University library for assistance in organizing and possibly housing their analog data (). A survey of life sciences researchers on campus revealed that many held analog data and considered it valuable but were unsure of how to preserve it (). Nearly all were willing to share it. Given that most researchers now either collect data digitally or quickly transfer any analog data, this is a finite problem, but because many of the stewards of analog data are nearing retirement, it is timely. We undertook this literature review to learn how scientific researchers are dealing with the analog data in their possession and whether any large-scale efforts have been undertaken to address the issues.

Types of analog data

Much of the analog data that exists in offices, labs, homes, archives, and other locations is numeric in nature. It was probably collected before electronic spreadsheets were commonly available for both capturing and analyzing data. The format could be loose notebook paper, index cards, large data sheets, or bound or unbound notebooks. It could also take the form of a log, possibly combining numeric and descriptive data in chronological order.

The data may also be descriptive in nature and contained in field notebooks or diaries. The tags associated with museum and herbarium specimens are often mined for the data that they record, such as species, location, dates, and other parameters. Although the tags and the specimens are inextricably tied, when we discuss analog data we include only the information on the tags, not the specimens themselves.

Drawings and photographs may accompany other forms of data or may stand on their own, hopefully with enough description to make them useful to current researchers. The same is true of maps, which may be printed or hand drawn.

History of concern about analog data

A number of authors have written about analog data over the last 50+ years, often noting its potential value and lamenting the lack of procedures, funding, and best practices to help support its ongoing use and preservation. Psychologists in the 1960s and 1970s noted not only the importance of new observations coming from the re-examination of older data but also the practice of comparing newly gathered data to historic data (; ). Speaking about data that authors have not retained, Wolins () suggests a role for professional associations: ‘If it were clearly set forth by the APA [American Psychological Association] that the responsibility for retaining raw data…this dilemma would not exist’. For a time the U.S. government played a role through the American Documentation Institute at the Library of Congress, which accepted some raw data to be preserved (). Recently, Buma made use of photographs in Glacier Bay to longitudinally track plant growth and establishment and noted that if it were easier to learn of the existence of older data and to obtain copies, its value would grow (; ).

In most cases authors limit their discussions to the situation in their own subspecialty, although a few have taken a broader view. A notable example is the final report of the Ecological Society of America (ESA) committee on the Future of Long-term Ecological Data (). The lengthy report details the situation and offers numerous recommendations for the future. Although it does not focus exclusively on analog data, it states ‘[a]mong the least secure are data in the hands of an individual researcher who has made little or no provision for long-term curation’.

Also in 1995, the National Research Council published both ‘Finding the Forest in the Trees: The Challenge of Combining Diverse Environmental Data’ and ‘Preserving Scientific Data on Our Physical Universe’ (; ). The former report highlights the variables, measures, and data management, and puts forward 18 recommendations. These call on professional societies, research institutions, funding agencies, and individual researchers to collaborate, plan carefully, focus on interoperability, create rich metadata, and make data more widely available. The latter report notes many problems and few solutions, stating ‘[t]he most important deficiencies are in the documentation, access, and long-term preservation of data in usable form.’ Again, analog data was not the focus of these works but it was covered.

Easterday et al. () note the ‘potential of historical dark data to contribute to the modern digital ecological data landscape’. They also note the importance of metadata and the need to promote the data and the best practices around it. In his book Repurposing Legacy Data, Berman () states that ‘data repurposing creates value where none was expected’. The book includes case studies from a variety of disciplines and has chapters on identifying data that might lend itself to repurposing and on understanding the organization of older data.

Griffin () advocates for the value of ‘heritage data’, noting that much of it is at risk and in order to secure it for future use, ‘certain priorities need to be re-ordered, new skills acquired and taught, resources redirected, and new networks constructed’. Griffin was active in the CODATA Data at Risk Task Group which, along with its successor, the Research Data Alliance’s Data Rescue Interest Group, worked to highlight the value of older data and promote projects that used or preserved it (https://codata.org/blog/2015/07/02/data-at-risk-and-data-rescue/). Patil and Siegel () note that bringing more dark data to the forefront will require different incentives from all those involved: ‘journals, citation indexes, funding agencies, academic institutions and, not least, the researchers themselves’. Although they write from a health sciences perspective, this probably applies more broadly.

A number of authors have drawn attention to the use or potential use of analog data in their particular fields. In fisheries, Singer and his co-authors () surveyed fellow researchers to get a better idea of how and why they used fish collections in order to inform both researchers and those who manage the collections. The value and possible reuse of data collected at biological field stations has been noted since at least the 1980s (). Bowser emphasized the importance of data management and suggested that field station data might be deposited with libraries, historical societies or federal agencies. Easterday and colleagues () make their observations about the use of data science principles by highlighting work from three California field stations and Michener and colleagues () wrote an article entitled ‘Biological Field Stations: Research Legacies and Sites for Serendipity’.

Ecological researchers have long mined analog data and historical records in their work, according to Beans (). While she focuses on journal entries, maps, and photos, she highlights common challenges such as locating material and working with someone else’s organizational scheme. She highlights Loren McClenachan (, , ), a marine ecologist who utilizes historical data in her research and also published a policy-oriented article on the benefits of using older data to set baselines in marine studies. Over 20 years ago Olson and McCord (, ) wrote two book chapters on data archiving in the ecological sciences. Although the emphasis is on digital data, they spell out recommendations on incentives, metadata, and components of an archive that apply to analog and digital materials.

Kwok () reports on the use of older data in the fields of both ecology and climate science. In the area of climate science, Brönnimann et al. () are mainly concerned with digital data but provide an overview of efforts to locate and digitize analog data, commenting that ‘the fraction of yet-to-be-digitized data is difficult to quantify’, implying that it is large indeed.

Geological researchers sometimes have an added reason to want to discover and use older data: it may have been collected using methods that are now difficult or impossible to employ due to stricter regulations. Diviacco et al. () write about a project where data was both analog and digital and had been obtained using dynamite. Vearncombe et al. (, ), using examples from the mining industry, note that ‘upcycling’ of data can mean cost savings as well as new insights from reexamination of data.

A number of disciplines have employed citizen science projects to assist in analog data efforts. These take the form of either mining older citizen science projects for their data or initiating new projects that provide person-hours to reformat or otherwise transform or collate analog data. Clavero and co-authors (, ) examine species lists to study trout decline, Hof and Bright () look at previous counts of hedgehogs, and Snall et al. () consider the use of presence data from bird monitoring. A recent citizen science project on the Zooniverse platform involves identifying data in papers written by students at the University of Michigan Field Station (https://www.zooniverse.org/projects/jmschell/unearthing-michigan-ecological-data/about/faq).

While many authors bemoan the unfortunate state of older data in their subdisciplines, a few areas offer success stories. Researchers working in biodiversity, many of whom are connected with museums or herbaria which hold physical specimens and their metadata-rich identification tags, are an example. They have built networks and secured funding for several international biodiversity-related projects that address data tied to specimens as well as the objects themselves. Projects include Integrated Digitized Biocollections (iDigBio, https://www.idigbio.org/), Global Biodiversity Information Facility (GBIF, https://www.gbif.org/), and Distributed System of Scientific Collections (DiSSCo, https://www.dissco.eu/). The progress in digitization and dissemination of biodiversity data over the last 20 years is summarized by Nelson and Ellis ().

Climate researchers have also made great strides in gathering disparate data in analog and digital format and making it accessible to the global community of scientists. The EU-based Copernicus Climate Change Service (C3S, https://datarescue.climate.copernicus.eu/) and International Data Rescue Portal (I-DARE, https://www.idare-portal.org/) serve as examples.

Some contemporary groups that rescue and reuse older analog data have very narrowly focused subject areas. The Living Data Project (https://www.ciee-icee.ca/data.html), sponsored by the Canadian Institute of Ecology and Evolution, funds new projects each year with topics such as ‘Species ranges, diversity and life history of Neotropical birds’ and ‘Responses of freshwater zooplankton to road salt pollution: A global perspective’. Another project, based at the USDA National Agricultural Library (Data Rescue Case Study: Long-Term Livestock Production Data), gathered older data from throughout the US, converted it to electronic formats and deposited it in AgData Commons ().

Field and lab notebooks have been the focus of a number of digitization projects. They may be held in archives, museums, libraries, or research facilities as well as by individuals. The Biodiversity Heritage Library, in conjunction with several other institutions including the Smithsonian, includes nearly 3,000 scanned field books (https://www.biodiversitylibrary.org/collection/FieldNotesProject). On a smaller scale, Texas A&M Libraries has digitized the field notebooks and specimen catalogs of W. B. Davis (1930–1981), which have been viewed over 1,000 times (https://oaktrust.library.tamu.edu/handle/1969.1/129120). Thomer et al. () propose a method for efficiently extracting species data from handwritten field notebooks.

Ways that older analog data is utilized

Researchers may use older data in a variety of ways. Some strive to repeat an earlier survey or experiment as closely as possible (; ; ; ). Others reexamine older data or incorporate portions of it into their current work (; ; ; ). Authors may also have consulted earlier data as they developed their research plans. Mandates for the preservation of data that have emerged in the last 15 years have elevated the topic of data reuse, although most recent research has considered only digital data (; ; ).

The methods that researchers use to obtain older data often remain a mystery. Large data collections such as iDigBio provide background, training, examples, and other resources for potential data users (https://www.idigbio.org/research) and authors are likely to mention or cite these collections. This is often not true for projects that use older data. In a preliminary investigation, the authors examined 66 scientific papers that used analog data; only seven spelled out how the data was located (see Figure 1). None of the authors of this set of papers mention going back to the original authors of the publications to obtain more detailed information, although it is hard to imagine that none of them took that step.

Figure 1. Description of sources of historic data for scientific researchers who had re-used it in publications. Scientific papers (N = 66) that illustrated evidence of use of this data were examined to determine the source of the data and how it was identified and located by the researchers.

Obtaining data directly from the researchers is known to be problematic and a statement such as ‘data available on request’ in an article does not always lead to success. A 2014 study focused on 500+ articles from two to 22 years old and the authors state ‘[o]ur results reinforce the notion that, in the long term, research data cannot be reliably preserved by individual researchers’ (). A new study suggests that all data associated with open data publishing needs to go into an open repository before publication. Of authors who indicated that data were available on request in publications, 1,670 (93%) did not respond to a request for data or chose not to share ().

In addition to individual researchers, various types of organizations may be in possession of analog data. Government agencies hold weather data as well as the aforementioned museum and herbarium records. Fisheries and agricultural records have also been used along with conservation-related documents (; ; ; ; ). Nonprofits may hold data from citizen science projects (). Archives can also be a source for analog data; this is sometimes where researchers discover field books and diaries (; ). They may also hold photographs used by those conducting repeat photography work (; ; ).

There are numerous examples of authors reusing analog data that they located using less conventional sources. These include literature, ship logs, tax records, newspapers, and church records (; ; ; ; ; ; ).

Challenges with reusing older data

Scientists note the challenges and potential pitfalls when combining or comparing old and new data. There are few standards or best practices (; ; ). Individual authors provide rich insights from their experiences, but general guidance is mostly lacking, unlike the situation with digital data (; Wiebe & Allison 2015). Also unlike the digital data landscape, ownership and stewardship responsibilities are often unclear. Costs for reformatting and preservation of analog data can be high with few options for funding (; ). Institutions have few incentives to save data ().

Although not much is known about how individual researchers find the analog data that they reuse, several authors note the difficulty in locating it. It languishes in labs, gets redistributed to multiple locations, or disappears (; ; ; ; ). Many data repositories, especially those housed at academic institutions, require data to be in machine-readable formats. Some, such as AgData Commons (https://data.nal.usda.gov/), will accept scanned data. Zenodo (https://zenodo.org/), the European Community repository that includes data as well as software and documents, welcomes data in any format, although it is currently working on guidelines for deposit. Data registries, where metadata about analog data could reside, have not materialized as predicted ().

There are numerous challenges as researchers bring together data gathered years apart. Combining old and new data sets can be complex (; ; ). Metadata is a concern, as the original description may lack elements, or elements may have been defined differently in the earlier work (; ; ; ; ; ).

Interpreting historic data may involve assumptions and comparison methods that need to be selected carefully (; ; ; ; ). Engelhard, Righton, and Pinnegar (), studying the distribution of North Sea cod, noted ‘the well-known problems with fisheries data such as discarding and misreporting practices by fishers’. Beans () notes that this underreporting was often due to attempts to minimize taxes on a boat’s catch. Historical records may have biases that must be dealt with when comparing with current surveys (; ; ; ).

Possible paths forward

While the individuals who are the current stewards of analog data and the organizations where they work have major parts to play in the solution to this issue, other entities can also take a role in developing solutions. Although few professional societies are in a position to host a data repository, there are other important roles that they can play. They could investigate and report on the availability, use, and status of analog data in their realm, as the ESA did. If they publish journals they could encourage authors to cite data papers and accept data papers where appropriate. Societies could call on their members to describe and preserve their own analog data. They could also endorse standards for metadata. If they have the financial means, they could fund the digitization of selected data.

Funding agencies already play a large role in the preservation and reuse of recently-produced electronic data through their mandates and they could also play a role with older data. Agencies could encourage pre-mandate grant recipients to make their data available or follow the lead of the USDA which welcomes scanned as well as machine-readable data from pre-mandate grant recipients in its AgData Commons repository. Agencies could promote the idea of data papers for material from earlier grants and endorse particular repositories in their subject areas. Funders could also award grants to projects that preserve analog data or make it more easily findable.

What can individual researchers do? They can organize, inventory, and describe any analog data in their purview and document details about how it was generated (). Many standards exist for doing this with digital data and those can be used for analog data as well. In a survey of those holding analog data, many reported that there was a person who could describe the origins of the data but fewer had documented that information (). If you have used historic data, think about how you found it and how you wish you might have been able to find it. Explore the concept and content of data papers and think about whether you might have some older data that you could describe in that same way. Talk to others about the topic and look for commonalities, especially across disciplines in your organization. Consult with the science librarians at your institution to see how you might work together. Think about your professional societies and how they might play a role.
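As an illustration of the inventory step described above, the following is a minimal sketch of how a single analog data item might be described in a structured record. The field names, the example values, and the `AnalogDataRecord` class are all hypothetical and are not drawn from any particular metadata standard; a real inventory would more likely adopt a schema endorsed in the researcher's discipline.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class AnalogDataRecord:
    """One inventory entry for a physical (analog) data item.

    All field names are hypothetical, chosen only for illustration.
    """
    title: str
    creator: str
    date_range: str          # period the data covers, e.g. "1962-1975"
    data_format: str         # e.g. "bound field notebooks", "index cards"
    location: str            # where the physical item is stored
    collection_methods: str  # how the data was generated
    contact: str             # person who can explain the data's origins
    related_publications: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of descriptive fields that are still empty."""
        return [name for name, value in asdict(self).items()
                if value in ("", [])]


# Placeholder values, not a real data set.
record = AnalogDataRecord(
    title="Stream survey field notebooks",
    creator="Example Researcher",
    date_range="1962-1975",
    data_format="bound field notebooks",
    location="Building A, room 101, cabinet 3",
    collection_methods="",
    contact="Example Researcher",
)

# Serialize the record so it can be shared or deposited alongside scans.
print(json.dumps(asdict(record), indent=2))
print("Still undocumented:", record.missing_fields())
```

A record like this can be completed while the person who knows the origins of the data is still available, and the list of empty fields gives a simple prompt for what remains to be documented.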

Conclusions

Researchers across the sciences use older data in analog format but little is known about how they learn of its existence or locate it. Over the last 50+ years authors have expressed concern about its fate and noted challenges with its use. With the exception of the community of biodiversity researchers, there have been few large projects to address the preservation and findability of analog data and little interest expressed by government agencies, professional associations, and academic and research institutions that would be in a position to act on a broader scale. The best practices (including selection of metadata schema, developing a data dictionary, describing data collection methods) and policies developed to govern the preservation and dissemination of digital data could serve as an example for developments concerning analog data. In the digital realm best practices are often developed by professional associations, both disciplinary and data-focused, as well as those who manage data repositories.