The management of research data in its ‘long tail’ (Heidorn 2008), where data are collected, analyzed and archived by small research groups, continues to challenge researchers and curators. As Heidorn states: “While great care is frequently devoted to the collection, preservation, and reuse of data on very large projects, relatively little attention is given to the data that is being generated by the majority of scientists” in these small research groups. Moreover, while research in small groups in many disciplines has become increasingly data-intensive, allocation of resources for data management has not always kept pace (Brase et al. 2014) and data management training for scientists is lacking (Tenopir et al. 2016). Yet, research funding agencies and publishers have increasingly emphasized data management and sharing. In the United States (US) this emphasis increased with the National Science Foundation’s (NSF) 2011 requirement that data management plans be included with research funding proposals (NSF Proposal Guide 2011).
Since NSF’s data management requirement was established, research libraries in the US have stepped up to provide research data management services within academic institutions (Fearon et al. 2013). These services typically provide researchers with data management planning support and data management training opportunities, and some also provide curation services through which researchers can share their data (Tenopir et al. 2014; Bryant, Lavoie & Malpas 2017). Some of these research data management services are geared towards domain expertise or have domain experts within them (Wittenberg, Sackmann & Jaffe 2018; Teperek et al. 2018), and thus can provide more targeted data management training and support for researchers. For example, at Virginia Tech the research data management service is staffed with PhDs from engineering, social sciences, biological sciences and geosciences (Ogier et al. 2018). While these personnel do not cover all disciplines and subdisciplines of research across the university, their expertise does allow the service unit to provide a level of tailored training for some researchers at our institution.
The first author (Jonathan Petters of Data Services in the University Libraries) helped develop a tailored training for a research group in the Department of Fish and Wildlife Conservation at Virginia Tech. The goal was to build a data management training curriculum for field workers on long-term ecological field research projects in the Florida Panhandle. Here, we showcase this training curriculum and the subsequent efforts of the research group (Carola Haas, George Brooks, and Jennifer Smith, referred to as ‘the co-authors’ throughout) as a case study in building data management capacity. This case study serves as one example of how targeted training and efforts in data and project management for a research project can lead to substantial improvements in research data quality.
In June 2017, Haas contacted Data Services and outlined several data management issues they wanted to address with respect to their long-term ecological field research projects in the Florida Panhandle. The research team involves faculty, graduate students, technical staff, and undergraduate students, permanently based at Virginia Tech, and field crew leaders and field technicians, permanently based on site in Florida. The field technicians have a bachelor’s degree in biology, environmental science, or wildlife management, and the field crew leader has a master’s degree. While the team members at Virginia Tech focus on data analysis and interpretation towards answering research questions, the field-based team members are experienced in field techniques (e.g. trapping, marking, and measuring animals; locating nests; conducting prescribed burns or vegetation management) and are only occasionally involved in data analysis.
The project work includes population assessments of rare and declining amphibians and reptiles, including the federally endangered reticulated flatwoods salamander (Ambystoma bishopi) and the state threatened gopher tortoise (Gopherus polyphemus). Both species are at risk of illegal collection for the pet trade, and thus the data, particularly the site locality information, are highly sensitive in nature. Therefore, the databases were not amenable to storage on cloud services such as Google Drive or Dropbox, and instead have been relayed via portable storage devices or over lengthy email chains for several years. Haas’s initial e-mail elaborated these issues:
“I have a large long-term research project that takes place at a remote research site…Over the years, we’ve had different techs and graduate students creating or adding to databases…now we’re in a chaotic situation. Every database I open is full of typos, [field workers] are entering the same data in different places…I’ve tried to work with them on version control but there are still regular problems with multiple copies of databases going around and one tech is entering into one and another is using a different one.”
Petters responded to this initial e-mail and met the co-authors soon after. Their discussion revealed concerns that this “chaotic” data management situation could be compromising the quality of field data used in analysis, and that it could have negative downstream effects on the research. This first meeting resulted in several potential approaches to addressing these concerns. These approaches were:
Over the course of a few meetings it became clear that a primary issue centered around a disconnect between i) how the field workers collected and managed the field data, and ii) how the research group at Virginia Tech was using the data. Focusing on approaches c) and d) above, the co-authors agreed that the field workers (both seasonal and more permanent) would benefit most from a formal research data management training curriculum prior to November, when the field workers ramp up flatwoods salamander data collection.
Petters developed a one-and-a-half day customized data management training curriculum with the input of the co-authors, and in September 2017 they presented it to several field workers. This curriculum (Petters et al. 2017, available in slide form at http://hdl.handle.net/10919/89070) incorporated important input from the other co-authors and includes:
An important aspect of this presentation and ensuing discussion was not to unilaterally impose external rules on the field workers, but rather to begin a dialogue as to how field data collection and management was currently done and how everyone (field workers and the co-authors) could work together to improve the management of the collected data. Changing organizational practices and procedures, including those in research data management, is greatly eased when there is a collective agreement on why change and the effort to make change are important (Gilley, Gilley & McMillan, 2009). While all the DataONE educational modules provide data management content that can benefit researchers and field workers alike, material from the three modules of “Data Entry and Manipulation”, “Data Quality Control and Assurance”, and “Metadata” was selected because the co-authors targeted data collection, quality and documentation as particular areas of concern (DataONE, 2016b; DataONE, 2016c; DataONE, 2016d). Material from “Why Data Management?” was also used to help motivate the discussion (DataONE, 2016a).
Upon conclusion of the presentation and discussion, and in follow-up via e-mail, both the field workers and co-authors expressed their appreciation for the proposed curriculum. Brooks (via video conference) and the field supervisor Kelly Jones (in-person) subsequently held a data management training session for the field workers in November 2017. Setting aside time for group discussions was key, as crew members were able to articulate to us some of the challenges they faced. The field workers were in support of working with the research group in instituting new protocols for data management and acquisition. The co-authors encouraged the field workers to make data quality a priority in their work (sometimes at the cost of collecting more observations) so that their findings could be shared with the broader community. As Whitlock (2011) states, “The central goal to have in mind when archiving your own data is to ensure that a new user, perhaps someone unknown to you working with the data 20 years later, can correctly interpret the results and derive correct conclusions from the data.” If the data are quality controlled and sufficiently documented for this purpose of long-term archiving, they would also meet the needs of the co-authors in their current research. This data management training session was reprised in November 2018 for the next season of field workers.
While the co-authors agreed on the importance of implementing this training curriculum for the field workers, they also expended effort to improve some of the technical aspects of the data management workflow as described in a) and b) above. Previously the field workers were collecting data in multiple spreadsheets and databases on their field computer, leading to quality issues and confusion. Brooks implemented a process of maintaining the database on the server at Virginia Tech that all the field workers can access through an editable front-end. For small research groups, developing these skill sets can prove a significant obstacle. Both Brooks and Smith were self-motivated to learn SQL and invested substantial time to learn enough of it for effective database implementation.
There were initial stumbling blocks to navigate after the database was implemented, as varying levels of permissions to the server had to be created for all, and VPN access provided to the Florida crew. Additionally, several training sessions over Skype were required to i) explain the functionality, and ii) communicate the utility of a split database design (i.e. a database with separate front-end and back-end components). Following implementation, the split database vastly improved version control, and has received resounding praise as a massive improvement by the technicians that routinely enter data and the researchers that have witnessed the transition. Moving all data to a permanent storage system on the Virginia Tech server yielded additional benefits to data security; the server is backed up hourly, and protected by high-end encryption software.
Once the database had been centralized and permanently housed, systematic proofing of historic data became possible. Basic SQL code was created to search for, and amend, typos (e.g., Temperature = 76°C) and inconsistencies (e.g., yes vs YES vs Y). The research group members also created forms within the split database for data entry, with strict criteria for each data field, to prevent similar errors continuing into the future. For example, depending on when members of our field crew learned frog identification, they might know the scientific name of the southern leopard frog as Rana utricularia, Rana sphenocephala, or Lithobates sphenocephalus, and the six-letter ID code in our database was thus sometimes entered as Ranutr, Ransph, or Litsph, making it impossible to easily locate all records of this common species. Availability of just one of these codes on the drop-down menu forced everyone to use the same abbreviation. The co-authors continued to remind field workers that the data they were collecting could be very useful for other researchers and conservation managers decades in the future, but only if these other groups could interpret the information.
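The kind of SQL proofing described above might look something like the following minimal sketch, written in Python with the standard-library sqlite3 module. The table name, column names, sample rows, the 0 to 45 °C plausibility window, and the choice of Litsph as the single retained species code are all assumptions made for illustration; the project's actual schema and queries are not described here.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE observations (
        id INTEGER PRIMARY KEY,
        species_code TEXT,
        temperature_c REAL,
        breeding TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO observations (species_code, temperature_c, breeding) "
    "VALUES (?, ?, ?)",
    [
        ("Ranutr", 24.0, "yes"),
        ("Ransph", 76.0, "YES"),  # implausible for a Celsius field
        ("Litsph", 22.5, "Y"),
        ("Litsph", 23.0, "no"),
    ],
)

# 1. Search for likely typos: list records outside an assumed plausible
#    range so that a person can check them against the field data sheet.
suspect_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM observations "
        "WHERE temperature_c NOT BETWEEN 0 AND 45 ORDER BY id"
    )
]

# 2. Amend inconsistencies: collapse yes / YES / Y (and no / N) to one form.
conn.execute(
    "UPDATE observations SET breeding = 'yes' "
    "WHERE lower(trim(breeding)) IN ('yes', 'y')"
)
conn.execute(
    "UPDATE observations SET breeding = 'no' "
    "WHERE lower(trim(breeding)) IN ('no', 'n')"
)

# 3. Map synonym species codes onto a single code (assumed here: Litsph).
conn.execute(
    "UPDATE observations SET species_code = 'Litsph' "
    "WHERE species_code IN ('Ranutr', 'Ransph')"
)
conn.commit()

species_codes = sorted(
    {row[0] for row in conn.execute("SELECT species_code FROM observations")}
)
breeding_values = sorted(
    {row[0] for row in conn.execute("SELECT breeding FROM observations")}
)
print(suspect_ids, species_codes, breeding_values)
```

In this sketch the out-of-range temperature is only listed, not overwritten, on the assumption that such values are better corrected by someone who can consult the original data sheet.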
The co-authors also worked with the field workers to develop written protocols for data proofing. Originally, Haas assumed that field technicians knew how to proof data. However, some of the field workers thought it meant only glancing over the field data sheet to make sure that a day’s worth of data had been entered. They did not realize that the values for each entry should be checked by another observer. Once the co-authors understood that training needed to be provided on data proofing, they could write a protocol including ways to quickly check for major errors, such as sorting on a value or using filters in a spreadsheet to look for unusual or extreme values.
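The quick checks mentioned in this protocol (sorting on a value, or filtering for unusual or extreme values) can also be expressed outside a spreadsheet. The following is a minimal Python sketch under stated assumptions: the field name mass_g, the sample values, and the 1 to 10 g expected range are hypothetical and are not taken from the project's protocol.

```python
# Hypothetical records, invented for illustration only.
records = [
    {"id": 1, "mass_g": 4.2},
    {"id": 2, "mass_g": 3.9},
    {"id": 3, "mass_g": 42.0},  # possible misplaced decimal point
    {"id": 4, "mass_g": 4.5},
    {"id": 5, "mass_g": 0.4},   # possible misplaced decimal point
    {"id": 6, "mass_g": 4.1},
]


def extremes(rows, field, n=1):
    """Sort on one field and return the n smallest and n largest rows,
    mimicking a sort-and-inspect pass in a spreadsheet."""
    ordered = sorted(rows, key=lambda r: r[field])
    return ordered[:n], ordered[-n:]


def out_of_range(rows, field, low, high):
    """Filter for rows whose value falls outside an expected range."""
    return [r for r in rows if not (low <= r[field] <= high)]


lowest, highest = extremes(records, "mass_g")
flagged = out_of_range(records, "mass_g", low=1.0, high=10.0)

for row in flagged:
    print(f"Check record {row['id']}: mass_g = {row['mass_g']}")
```

Checks like these only surface candidates for review; they do not replace having a second observer compare each entry with the field data sheet.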
In November 2018, a year removed from working intensely with the other co-authors, Petters reached out to ask them: “What improvements/changes have occurred with respect to data management for your field research projects?” The co-authors’ responses suggest a rousing success story. Both the type and frequency of errors (e.g. typographical, inconsistent data entry) seen in their databases have been drastically reduced. Prior to Haas’s initial e-mail to Petters there were approximately ten errors per one hundred records; that error rate was reduced to two errors per one hundred records, a reduction of roughly 80%. Additionally, channels of communication between the research group and field workers have been used to much greater effect. The field workers are now recommending modifications to improve database functionality prior to starting data entry as opposed to after data entry. The co-authors’ implementation of new data collection procedures has also led the field workers to be more aware of the need to develop their own protocols for recording and proofing data.
Additionally, the co-authors found this effort to improve data management for these long-term ecological field research projects in the Florida Panhandle worthwhile enough to extend to other research projects and into their educational curriculum. Haas noted that they “are starting a new field project in Virginia and are planning on using the [curriculum] materials for that too, and would definitely recommend it to others.” Smith, now a professor at The University of Texas at San Antonio, added that through this effort they have “certainly become very aware of data management and best practices and [am] dedicating time in my class next semester about data management”.
An important but not uncommon challenge for this field research project to overcome in improving data management was communications and data transfer between the research group at Virginia Tech and the field workers in the Florida Panhandle. It took more than one attempt to port the new split database to the field workers’ server in Florida due to poor network connectivity. Because the server is on federal government property it is difficult to upgrade the network connection. There were also permissions and security issues to address for the field workers to access servers at Virginia Tech remotely. Effectively troubleshooting these issues was eased by use of screen-sharing software, but that also took some time to implement.
Also, as noted above, Brooks and Smith took the time and effort to learn enough SQL and database implementation to replace the previous data acquisition and management system. While research libraries and others at academic institutions can offer assistance to researchers in such data management efforts, this time investment can be a significant obstacle for small research groups.
While data management for these field research projects has substantially improved in the last year, there are issues that continue to require attention. First, both the research group at Virginia Tech and the field workers in the Florida Panhandle undergo substantial turnover in personnel from year to year. Smith is currently at a new institution and Brooks will soon move on as well. Many of the field workers involved in the initial data management training discussion in September 2017 have also left this project. It is vital that new research group members are acculturated to these new data management practices and their importance, and this will require time and effort going forward. A concise data management policies and procedures document could be shared with these new members to facilitate this acculturation.
Additionally, metadata entry and file naming have not yet been standardized. The US Geological Survey (USGS), a potential future funder of these research projects, expects metadata to be kept and made available in the Federal Geographic Data Committee Content Standard for Digital Geospatial Metadata or following the International Organization for Standardization (ISO) 191xx series of metadata standards; meeting this expectation could be a goal in the near future for the research group (USGS 2019). The USGS provides a list of several tools for creating such metadata (USGS 2019).
Briefly comparing the present narrative case study to other cases described in the literature does not make for rigorous exploratory or critical incident case study research (e.g. Yin 2009). However, comparison of this case study to a few other cases in the literature (Parsons, Brodzik, & Rutter 2004; Burnette, Williams & Imker 2016; Curdt 2019) is useful in illuminating broader generalities about data management processes and procedures across scientific research conducted in large projects or within the ‘long tail’ (small projects).
Parsons, Brodzik, & Rutter (2004) and Curdt (2019) describe large field projects: respectively, the Cold Land Processes Field Experiment (CLPX) and the Collaborative Research Centre/Transregio 32 ‘Patterns in Soil-Vegetation-Atmosphere Systems: Monitoring, Modeling and Data Assimilation’ (CRC/TR32). Both of these field projects acquired relatively large volumes of data in datasets spanning a wide range of spatial and temporal scales, and these data were to be used by several research teams. Consequently, these field projects were funded for dedicated data management support and training personnel. For example, CLPX supported up to four data management specialists in the field for each of its four intensive observation periods (IOPs), and appointed a ‘data wrangler’ to oversee transfer of CLPX datasets to respective data providers. CRC/TR32 maintained an online database system (TR32DB) during and after the field project, and provided workshops and tutorials for how CRC/TR32 scientists should interact with the system.
In contrast, Burnette, Williams & Imker (2016) and the present case study describe research projects managed by and designed for small research groups (i.e. fewer than 10 scientists) that did not have the funds for additional dedicated data management support and training personnel. Thus, having researchers within the group who have both data management skills and an internal impetus to focus on data management was vital for improved data management. The co-PIs interviewed in Burnette, Williams & Imker (2016) were motivated by previous data management challenges to start on a better footing for this research project. They “explicitly sought out someone with superior attention to detail and organizational skills” to help with data management; a substantial portion of this project manager’s time was devoted to data management processes and procedures both prior to and during the research project.
The co-authors of the present study put in concerted time and effort that led to substantial improvement in data management. Petters took a substantial amount of time to provide tailored consultative guidance and a useful data management training framework for the co-authors and field workers. However, without the implementation of this framework and other data management improvements made by Brooks, Smith, and the field-based crew described above, we would likely not be able to describe this case study as a success. Also, unlike the research project described in Burnette, Williams & Imker (2016), data management improvements in the present case study were made mid-stream (i.e. in the midst of years of data collection that has increased in complexity over time). To facilitate these mid-stream changes, Petters’s consultative approach intentionally began with motivating field workers on the importance of field data collection and quality control to underpin robust research. While streamlining the technical aspects of data management is an important step towards improving research data management, technical improvements are not sufficient. Haas notes that addressing the disconnect in understanding of goals and needs between field workers and research group members, and framing the importance of getting everyone on board about why data management was important, was a “huge contribution and a great first approach to staff training”. For the field workers, making an observation was an accomplishment and provided its own reward while recording the observation in a particular format and proofreading it seemed tedious. If they found a rare species breeding in a new location or returning to a site that had been unoccupied for years after the crew had conducted habitat restoration, they felt that they had accomplished the major goals. 
However, for the campus-based research group, those observations were only valuable if they could be published and shared with other managers and researchers. The co-authors could not publish the observations if they could not locate the information or easily summarize it.
Regardless of the scale of the research project, clearly defining data management roles and responsibilities was seen as critical. Petters’s framework for these roles and responsibilities was perceived as a useful starting point for more clearly defining who was responsible for what regarding data management. The projects described in Parsons, Brodzik, & Rutter (2004) and Curdt (2019), with dedicated data management staff and data management training for researchers, were able to clearly define roles and responsibilities for data management from the outset of data collection. The research group in Burnette, Williams & Imker (2016) hired a project leader who was explicitly given responsibility for data management for the project.
One additional point of comparison and contrast is that the two large field projects (Parsons, Brodzik, & Rutter 2004; Curdt 2019) were designed with the express intention of providing for long-term archiving of and access to datasets collected. In Parsons, Brodzik, & Rutter (2004) CLPX data were transferred to the National Snow and Ice Data Center for these actions. The TR32DB system is the preservation and access point for the CRC/TR32 project described in Curdt (2019). Burnette, Williams & Imker (2016) do not describe archiving and preservation plans. The co-authors of the present study do not intend to make all data collected from their long-term ecological field research projects openly available owing to the sensitivity of endangered species data, but are interested in being able to share some of it publicly. Research data consultants in libraries like Petters can help small research groups (in the ‘long tail’) consider ways to make their data openly accessible when appropriate, and can sometimes provide platforms to this end.
This case study in building data management capacity for a small (fewer than 10 members) research group at Virginia Tech serves as one example of how targeted training in data and project management can lead to substantial improvements in research data quality for field research projects. Consultative research data management support from Data Services in the University Libraries played an integral role in the development of this training. The research group members agreed that emphasizing the importance of data quality to the field workers at the beginning of the training discussion was a vital part of its success. Another critical factor towards these substantial improvements was the co-authors’ internal motivation to see data quality improve, and to take the time to work with field workers and on data management systems to see these improvements through. The co-authors anticipate continuing to give attention to data management and quality as field workers cycle in and out of their long-term projects.
We thank Kelly Jones and Steve Goodman (current and former research associates at Virginia Tech) for bringing the field crew perspective to campus and leading the implementation of the training of the field workers, and thank Brandon Rincon and Vivian Porter for ongoing efforts to improve the data acquisition and management process. We thank Mark Parsons of Rensselaer Polytechnic Institute for his encouragement to publish this case study. We also thank Mark, Nicholas Caruso (currently in Carola Haas’ research group), and five anonymous reviewers for comments that led to substantial improvements of this manuscript. Publication was funded by Virginia Tech University Libraries through the VT Open Access Subvention Fund.
The authors have no competing interests to declare.
Brase, J, Socha, Y, Callaghan, S, Borgman, C, Uhlir, P and Carroll, B. 2014. Data Citation: Principles and Practice. In Ray, J (ed.), Research Data Management: Practical Strategies for Information Professionals, 167–186. West Lafayette, IN: Purdue University Press.
Bryant, R, Lavoie, B and Malpas, C. 2017. A Tour of the Research Data Management (RDM) Service Space. The Realities of Research Data Management, Part 1. Dublin, Ohio: OCLC Research. DOI: https://doi.org/10.25333/C3PG8J
Burnette, MH, Williams, SC and Imker, HJ. 2016. From Plan to Action: Successful Data Management Plan Implementation in a Multidisciplinary Project. Journal of eScience Librarianship, 5(1): 6. DOI: https://doi.org/10.7191/jeslib.2016.1101
Curdt, C. 2019. Supporting the Interdisciplinary, Long-Term Research Project ‘Patterns in Soil-Vegetation-Atmosphere-Systems’ by Data Management Services. Data Science Journal, 18(1). DOI: https://doi.org/10.5334/dsj-2019-005
DataONE. 2016a. DataONE Education Module: Why Data Management? Available at https://www.dataone.org/sites/all/documents/L01_DataManagement.pptx.
DataONE. 2016b. DataONE Education Module: Data Entry and Manipulation. Available at http://www.dataone.org/sites/all/documents/L04_DataEntryManipulation.pptx [last accessed 16 July 2019].
DataONE. 2016c. DataONE Education Module: Data Quality Control and Assurance. Available at https://www.dataone.org/sites/all/documents/education-modules/pptx/L05_DataQualityControlAssurance.pptx [last accessed 16 July 2019].
DataONE. 2016d. DataONE Education Module: Metadata. Available at https://www.dataone.org/sites/all/documents/education-modules/pptx/L07_Metadata.pptx [last accessed 16 July 2019].
Fearon, D, Gunia, B, Pralle, BE, Lake, S and Sallans, AL. 2013. Research Data Management Services. SPEC Kit 334. Washington, DC: Association of Research Libraries. DOI: https://doi.org/10.29242/spec.334
Gilley, A, Gilley, JW and McMillan, HS. 2009. Organizational change: Motivation, communication, and leadership effectiveness. Performance Improvement Quarterly, 21(4): 75–94. DOI: https://doi.org/10.1002/piq.20039
Heidorn, PB. 2008. Shedding light on the dark data in the long tail of science. Library Trends, 57(2): 280–299. DOI: https://doi.org/10.1353/lib.0.0036
National Science Foundation. 2011. NSF Grant Proposal Guide, Chapter 11.C.2.j. NSF 11-1 January 2011. Available at http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_index.jsp [last accessed 7 January 2019].
Ogier, AL, Brown, AM, Petters, J, Hilal, A and Porter, N. 2018. Enhancing Collaboration Across the Research Ecosystem: Using Libraries as Hubs for Discipline-Specific Data Experts. Practice and Experience in Advanced Research Computing. DOI: https://doi.org/10.1145/3219104.3219126
Parsons, MA, Brodzik, MJ and Rutter, NJ. 2004. Data management for the Cold Land Processes Experiment: improving hydrological science. Hydrological Processes, 18(18): 3637–3653. DOI: https://doi.org/10.1002/hyp.5801
Petters, JL, Haas, CA, Brooks, G and Smith, J. 2017. Eglin AFB Field Projects Data Management Training Curriculum. http://hdl.handle.net/10919/89070.
Tenopir, C, Allard, S, Sinha, P, Pollock, D, Newman, J, Dalton, E, Baird, L, et al. 2016. Data management education from the perspective of science educators. International Journal of Digital Curation, 11(1): 232–251. DOI: https://doi.org/10.2218/ijdc.v11i1.389
Tenopir, C, Sandusky, RJ, Allard, S and Birch, B. 2014. Research data management services in academic research libraries and perceptions of librarians. Library & Information Science Research, 36(2): 84–90. DOI: https://doi.org/10.1016/j.lisr.2013.11.003
Teperek, M, Cruz, MJ, Verbakel, E, Böhmer, JK and Dunning, A. 2018. Data Stewardship–addressing disciplinary data management needs. DOI: https://doi.org/10.31219/osf.io/5w9pj
US Geological Survey. 2019. Data Management. Available at https://www.usgs.gov/products/data-and-tools/data-management/metadata [last accessed 15 March 2019].
Whitlock, MC. 2011. Data archiving in ecology and evolution: Best practices. Trends in Ecology and Evolution, 26(2): 61–65. DOI: https://doi.org/10.1016/j.tree.2010.11.006
Wittenberg, J, Sackmann, A and Jaffe, R. 2018. Situating Expertise in Practice: Domain-Based Data Management Training for Liaison Librarians. The Journal of Academic Librarianship, 44(3): 323–329. DOI: https://doi.org/10.1016/j.acalib.2018.04.004