YARD: A Tool for Curating Research Outputs

Limor Peer; Joshua Dull

Introduction

YARD (Yale Application for Research Data) is an adaptable curation workflow tool that enhances research outputs and associated digital artifacts designated for archival and reuse.

Quality and curation in repositories

The scientific principle of self-correction asks researchers to be transparent about their design, data, methods, and analysis. Transparency makes it possible for independent researchers to “reproduce reported results; test alternative specifications on the data; identify misreported or fraudulent results; reuse or adapt materials (e.g., survey instruments) for replication or extension of prior research; and better understand the interventions, measures, and context” (). To the greatest extent possible, data and materials, including code, should be made publicly available in order to increase accountability for researcher error (; ) and allow others to reproduce and confirm results ().

Universities, private and public funders, scholarly societies, journals, and other stakeholders in the scientific enterprise have been looking to data repositories to make research data, code, and other materials underlying reported findings more widely discoverable and accessible. Data repositories, however, do not apply uniform or standardized curation practices, with many offering self-deposit or opting for a minimal curation model in an attempt to appeal to busy researchers. As a result, even as the digital artifacts associated with research outputs proliferate across repositories, they are not necessarily usable or interpretable. Stodden et al. () recently found “serious shortcomings in usability and persistence” of digital artifacts supporting publications (see also a discussion of attempts to evaluate empirical claims in published studies, ). We view the ability of future users to independently understand and reuse research outputs as a key aspect of quality (see also ; ; King, 1995; ).

The cost of repositories’ failure to ensure the usability and interpretability, or quality, of research outputs can be great. As data science emerges as the next frontier (; ), the ability to reliably use data and other digital artifacts associated with research outputs for the purpose of validating the integrity of scientific claims must be a precondition. From a practical standpoint, the community should expect that investment in research infrastructure is extended to the production of digital artifacts that can be used meaningfully. Moreover, given that, “a large number of scientific studies… suffer from the underlying computational and statistical issues” (), usable and interpretable research outputs are imperative.

Curation of digital research data is traditionally defined as activities that reduce threats to their long-term research value and mitigate the risk of digital obsolescence (). We refer to gold-standard curation as the measures taken to ensure that research outputs are independently understandable for informed reuse (). Some curatorial activities – such as the periodic review of the digital integrity of a file and remedial actions to protect data from digital erosion or hardware failure – need to be ongoing. Other activities – such as code review – may be more pertinent at certain points of the data lifecycle (see , for a comprehensive list). We believe all curation activities are vital.

Our concern here is with the optimization of curatorial activities that enhance the quality of research outputs and associated digital materials. Ensuring the quality of these digital materials designated for long-term reuse requires effort and inevitably some cost. It has been noted that, “managing research data for quality, in one form or another, has in fact been the core responsibility of data curation since its inception as a distinct sub-discipline within the library and information sciences” (). At present, however, there is no consensus in the scientific community about who is responsible for this effort: Researchers, data centers, university libraries, or data repositories? Moreover, an analysis of curation practices in general purpose and domain repositories found that review and curation of these materials are sometimes minimal or limited in scope (), with self-archiving models often guaranteeing no more than bit-level preservation in order to control costs.

Absent a consensus on responsibilities and practices, it is not surprising that there are very few tools currently available for curators and others to manage, standardize, and share responsibility for curation activities. In contrast with custom tools built and used by individual repositories to accommodate their own specific curation needs and preferences, a universal tool affords other entities a way to engage with curation activities earlier in the data lifecycle. For example, laboratories can use the tool during active research, such as during data collection, to ensure that subsequent transformations to raw data are documented in sufficient detail. This can enable other researchers to trace the final analysis datasets supporting published findings back to those original raw data.

We describe YARD, a tool that responds to this challenge. YARD, the Yale Application for Research Data, is a workflow tool that facilitates gold-standard curation tasks. Our goal in developing YARD is to create highly curated research packages that can then be deposited into any data repository. In addition to the standard curation activities, the tool facilitates tasks for reviewing and enhancing research outputs, including verifying that data, code, and other relevant digital artifacts computationally reproduce the results those materials reportedly support. The tool also creates rich metadata about the artifacts, helping generate findable, accessible, interoperable, and reusable (FAIR) digital objects. By tracking curation tasks, YARD supports a transparent and documented workflow that can help researchers, curators, and publishers share responsibility for curation activities through a single pipeline. Finally, by building flexibility into the system, YARD is designed to adapt to changing requirements and standards.

General Description

The curation tool is a web-based application designed to process digital artifacts associated with research outputs, including metadata management, and to deliver highly curated research packages into designated repositories (see Figure 1). YARD, the implementation of the tool at Yale University’s Institution for Social and Policy Studies (ISPS), began in 2018.

Figure 1

YARD is a workflow tool for reviewing and enhancing research outputs and delivering them into a repository.

Specifically, the curation tool offers two main benefits:

Managing complex workflows. The workflow design helps guide depositors and curators through tasks for reviewing and enhancing research outputs. The tool tracks these curation tasks and generates rich metadata. The tool can be used to manage any updates to metadata and data, which can then be pushed out to public repositories. An advantage of the tool is that the high-quality data packages it produces can be linked to different endpoints for dissemination. For archives and repositories that already do a fair amount of curation, the tool facilitates a systematic workflow with tracking and integration capabilities. For self-archiving systems that offer little or no curation, the tool can be an option for depositors, for example as a means of enforcing minimal documentation standards. In addition, organizations with distributed expertise can use the tool to collaborate, coordinate, and standardize curation activities. For example, university library staff may be responsible for metadata generation, a statistical support unit responsible for verifying computational reproducibility of statistical analyses, and a repository responsible for assigning persistent links. An advantage of a distributed workflow model is the potential to increase the feasibility of scaling curation services without shouldering the entire cost of labor and technology.
Enforcing data curation standards. Specifically, the curation tool supports FAIR principles for findable, accessible, interoperable, and reusable data and other digital research artifacts. It can accommodate extending curation workflows to include additional quality checks (e.g., verification of computational reproducibility). The tool supports archival preservation policies by enforcing standards (e.g., the OAIS Reference Model requirement to clearly define roles, see ) and providing documentary evidence of such, which ISO 16363 () and CoreTrustSeal require (). As standards evolve, the tool can be configured to adapt.

Technical Features

Key features are based on services critical to rigorous data curation: Templates for multi-file metadata creation and editing, item-level metadata creation and editing, metadata error reporting, customizable metadata exports, controlled vocabularies for selected fields, controlled vocabulary editing capabilities, record versioning, user access options, administrator and tracking controls, and a variety of content management features. The curation tool is API-enabled and modular.

For a minimal installation, the tool requires two open source software pieces, a web server, a database, and file storage.

The two open source software pieces, available on GitHub under a GNU Affero General Public License v3.0 (), are the Curation Service and the Curation Web Application.

a) The Curation Service manages the curation workflow and logs all application events. The workflow is based on established curation steps triggered by certain user actions, and automated when possible.
b) The Curation Web Application provides a web interface for the Curation Service. All users – researchers depositing data and code, curators processing the outputs, and administrators – can access the curation tool through the web application.

The curation tool also requires:

c) A web server to host the curation tool. For the YARD implementation, the Curation Service and Web Application are installed on a Windows 2012 web server, which also hosts other software components.
d) A curation database for storing the application and curation metadata.
e) File storage for data, codebooks, code, and all other files required for curation. The application requires a storage location for each of three phases: (1) an ‘original’ directory for the original files and metadata comprising the research package, (2) an ‘active’ directory for copies of the files and metadata during active curation, and (3) a ‘processed’ directory for copies of the files processed and approved, as well as metadata.
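
As an illustration of the three storage phases, the following minimal Python sketch creates one directory per phase. Only the phase names come from the description above; the function name `create_storage_layout` and the use of a temporary root are assumptions made for this example and do not reflect how the curation tool itself provisions storage.

```python
import tempfile
from pathlib import Path

# Phase names are taken from the text; everything else here is hypothetical.
PHASES = ("original", "active", "processed")


def create_storage_layout(root: Path) -> dict[str, Path]:
    """Create one directory per curation phase under `root` and return them."""
    layout = {}
    for phase in PHASES:
        phase_dir = root / phase
        phase_dir.mkdir(parents=True, exist_ok=True)
        layout[phase] = phase_dir
    return layout


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    layout = create_storage_layout(Path(tmp))
    created = sorted(p.name for p in Path(tmp).iterdir())
```

Keeping the phases in separate directories means that the files as deposited are never modified by curation work carried out on the copies.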

The curation tool affords easy integration with other software or workflows. These optional components are not integral to the functioning of the curation tool but are congruent with its purpose; they can be replaced, enhanced, or left out per organization policy. Figure 2 illustrates the curation tool components, their function, and relationship.

Figure 2

Curation tool components, required and optional.

At Yale, the YARD implementation of the curation tool integrates with components that provide additional desired functionality. It is configured to the requirements of ISPS and includes some proprietary software components. The YARD implementation includes:

Colectica Repository, proprietary software developed by Colectica to create variable-level metadata extracted from SPSS, Stata, CSV, and Excel files (). The metadata scheme is based on the Data Documentation Initiative (DDI), and the tool allows adding new fields from other established metadata schemes. This software requires its own database to store the metadata.
ClamAV as an antivirus check for all uploaded files ().
StatTransfer for creating plain-text copies of data files ().
Yale’s Persistent Linking Service to create persistent URLs ().

Table 1 lists the required and optional components, specifies the components used for the YARD implementation, and suggests alternative options for components where available.

Table 1

Curation Tool Components.

Component	Function	License	YARD implementation	Alternate Component options

*Required components*
Curation Web Application	Web interface for the Curation Service	AGPL 3.0	Curation Web Application
Curation Service	Data deposit and curation	AGPL 3.0	Curation Service
Curation Database	Storage for curation tool data	Proprietary	Microsoft SQL	Postgres, MySQL
File storage	Storage for files	Yale Service	Network attached local service (storage@Yale)	Any file storage (requires read/write access)
*Optional components*
Metadata Repository	Captures, generates, and versions DDI Lifecycle metadata	Proprietary	Colectica Repository	Repository software that generates metadata
Metadata database	Stores variable-level metadata	Proprietary	Microsoft SQL	Postgres, MySQL
Anti Virus	Virus scan for deposited files	GPL	ClamAV	Any Antivirus software
File Conversion	Creates csv copies of data files	Proprietary	StatTransfer	Any statistical or custom software
Persistent Link	Persistent Link	Yale Service	Yale Handle service	Any persistent linking service

User Roles

All users are required to create an account in the curation tool. Users are assigned one of the following roles: Depositor, Curator, or Organization Admin. Users of the curation tool will experience a different workflow and have access to different features depending on their role in the system. Below, we discuss the curation workflow through the lens of the three main roles.

Depositor

By default, all users are Depositors. A Depositor can submit data, code, and other research outputs that comprise a Catalog Record. A Depositor can be a researcher affiliated with an organization hosting the curation tool or one of the organization’s staff members. A Depositor creating a new record will add study-level metadata (e.g., author, title, sample size, field dates, etc.), upload all related files, add file-level metadata, and finally submit for curation. Figure 3 is an example of uploaded files associated with a sample study.

Figure 3

The Depositor view of the file list after initial upload, as seen in the user interface.

Each file is checked by Clam AntiVirus, assigned a universally unique identifier (UUID), and deposited into the ‘original’ directory, where copies of all the raw data are kept. A Depositor can update the record if there are changes or new additions to a study and re-submit for curation. Each version submitted is preserved in the ‘processed’ directory so no data are lost.
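
The deposit step described above can be sketched as follows. This is a simplified illustration and not the tool's code: the function `deposit_file`, the one-folder-per-UUID layout, and the `is_clean` callback are all hypothetical. The antivirus check is passed in as a callable because the way the tool invokes ClamAV is not described here.

```python
import shutil
import tempfile
import uuid
from pathlib import Path
from typing import Callable


def deposit_file(source: Path, original_dir: Path,
                 is_clean: Callable[[Path], bool]) -> tuple[str, Path]:
    """Scan a file, assign it a UUID, and copy it into the 'original' directory."""
    if not is_clean(source):
        raise ValueError(f"{source.name} failed the antivirus check")
    file_id = str(uuid.uuid4())          # universally unique identifier
    target_dir = original_dir / file_id  # hypothetical layout: one folder per UUID
    target_dir.mkdir(parents=True)
    target = target_dir / source.name
    shutil.copy2(source, target)         # copy; the uploaded file is left untouched
    return file_id, target


# Demonstration with a stand-in scanner that accepts every file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    upload = root / "survey.csv"
    upload.write_text("id,answer\n1,yes\n")
    original = root / "original"
    original.mkdir()
    file_id, stored = deposit_file(upload, original, is_clean=lambda path: True)
    stored_text = stored.read_text()
```

In a real deployment the `is_clean` callable would wrap the antivirus scanner, and a failed scan would stop the file from reaching storage.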

Organization Admin and Curator

Once a Depositor submits a study for curation, copies are made and stored in the ‘active’ directory and a notification is sent to the Organization Admin. The Organization Admin has permissions to edit the organization’s settings, including setting the domain name, assigning storage locations, specifying a repository destination, adding a deposit agreement, managing roles and permissions, and other technical settings. The Admin assigns a Curator to each Catalog Record and approves publication once curation is complete.

When notified of a new record, the Curator will complete all curation tasks which the tool automatically assigns based on the file types (for example, a data file will have different curation steps than a codebook). Figure 4 is an example of the Curator’s review panel which lists curation tasks. Once curation is complete, the Curator will submit the Catalog Record for publication approval by the Admin.

Figure 4

The Curator view of the curation tasks for the assigned record, as viewed in the user interface.
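
The idea of assigning curation tasks by file type can be illustrated with a small lookup. The task names and the extension mapping below are hypothetical examples chosen to echo tasks mentioned in this paper; they are not the rules the tool actually applies.

```python
from pathlib import Path

# Hypothetical task lists per kind of file.
TASKS_BY_KIND = {
    "data": ["check variable labels",
             "check for personally identifiable information"],
    "code": ["confirm code executes", "confirm results reproduce"],
    "documentation": ["confirm file opens and is complete"],
}

# Hypothetical mapping from file extension to kind of file.
KIND_BY_EXTENSION = {
    ".dta": "data", ".sav": "data", ".csv": "data",
    ".do": "code", ".r": "code", ".py": "code",
    ".pdf": "documentation", ".txt": "documentation",
}


def assign_tasks(filename: str) -> list[str]:
    """Return the curation tasks for a file, based only on its extension."""
    suffix = Path(filename).suffix.lower()
    # Unknown or missing extensions fall back to the documentation checks.
    kind = KIND_BY_EXTENSION.get(suffix, "documentation")
    return list(TASKS_BY_KIND[kind])
```

For instance, `assign_tasks("analysis.do")` yields the two code-review tasks, while a Stata data file receives the data checks instead.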

Approval of a record for publication triggers a series of events on the server side. A plain-text preservation copy (e.g., .csv) is created of certain proprietary data files (e.g., Stata .dta) and added to the ‘processed’ directory. Any updates or additions to the study-level metadata are synced in the Curation Database. Changes to the files are versioned by Git, which is built into the Curation Service. The Curation Database also stores the details about each completed curation step, including the date, time, and which user completed the step. The tool provides a full history log for each Catalog Record. If configured, updates to the data files and variables are synced to the Metadata Database. The optional Colectica Repository software uses metadata from both the Curation and Metadata Databases to create a detailed metadata file using the Data Documentation Initiative (DDI) 3.2 schema. Finally, the Catalog Record, and each file marked as ‘public’, are assigned a persistent link. YARD uses Yale’s in-house handle service to generate these links, but integration with other services is possible.
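
To make the history log concrete, the sketch below records who completed each curation step and when, and returns the history for one Catalog Record. It is an in-memory illustration under our own assumptions: the class and field names are hypothetical, and the actual tool stores this information in the Curation Database.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class CurationEvent:
    record_id: str
    step: str
    user: str
    completed_at: datetime  # carries both the date and the time


class HistoryLog:
    """Append-only log of completed curation steps (in memory only)."""

    def __init__(self) -> None:
        self._events: list[CurationEvent] = []

    def complete_step(self, record_id: str, step: str, user: str,
                      when: Optional[datetime] = None) -> CurationEvent:
        event = CurationEvent(record_id, step, user,
                              when or datetime.now(timezone.utc))
        self._events.append(event)
        return event

    def history(self, record_id: str) -> list[CurationEvent]:
        """Return the events for one record, in the order they were logged."""
        return [e for e in self._events if e.record_id == record_id]


log = HistoryLog()
log.complete_step("record-1", "antivirus check", "curator_a")
log.complete_step("record-2", "metadata review", "curator_b")
log.complete_step("record-1", "code review", "curator_a")
steps_for_record_1 = [e.step for e in log.history("record-1")]
```

Because events are only ever appended, the log for a record doubles as an audit trail of the curation work performed on it.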

The Curation Workflow

The curation workflow as implemented in YARD is designed to the specifications of ISPS at Yale University. The workflow is based on the Inter-university Consortium for Political and Social Research () pipeline and adapted for quantitative research output from randomized controlled trials (RCTs) in the social sciences (). For example, YARD prompts Curators to review whether documentation and contextual information necessary for long-term usability (e.g., a codebook, a readme file) are included. The curation workflow has been further enhanced to include tasks for reviewing code and statistical analysis to obtain verification of computational reproducibility (; ). For example, YARD guides Curators to review code files – statistical and other programming scripts – by verifying that the code executes and that the published scientific results can be computationally reproduced with the given code and data. The workflow was developed with input from potential users at the Odum Institute Data Archive at the University of North Carolina, Chapel Hill and the Cornell Institute for Social and Economic Research at Cornell University.

Integration with a Repository

The curation tool is not a replacement for a repository in so far as it is not meant to be an access point for other scholars or the general public. End users can only access records and files processed through the curation tool if they are ingested into another system such as a data repository or archive. The YARD implementation is currently designed to integrate with Drupal and provides access to processed records via the ISPS Data Archive. Organizations can determine a preferred means of dissemination based on their own infrastructure.

There are two methods for integrating processed records into other software or workflows: the ‘processed’ directory and the Extensible Markup Language (XML) feed. The ‘processed’ directory is a compressed directory created by the curation tool when each catalog record is finalized and approved. The zipped archive, as seen expanded in Figure 5, is structured to match BagIt specifications. It contains a file manifest with MD5 checksums, a handle map, and a ‘data’ directory containing the curated files and the application-generated DDI file. This DDI file contains the metadata necessary to ingest studies into another system or repository. Since studies can be re-curated and re-approved, a unique archive directory is created for each instance of publication (e.g., a study reviewed and finalized a second time will have two archive directories, one for each version).

Figure 5

An example archive directory for a catalog record after curation and final approval.
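
For readers unfamiliar with BagIt, the following sketch writes the two files at the heart of a bag: the `bagit.txt` declaration and an MD5 payload manifest for everything under `data`. It is a generic illustration of the BagIt layout and not the tool's implementation. The function names are hypothetical, the BagIt version written here is an assumption, and the handle map and DDI file that YARD adds are omitted.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_bag_files(bag_dir: Path) -> list[str]:
    """Write bagit.txt and manifest-md5.txt for the payload under bag_dir/data."""
    payload = sorted(p for p in (bag_dir / "data").rglob("*") if p.is_file())
    # Each manifest line is "<checksum>  <path relative to the bag root>".
    lines = [f"{md5_of(p)}  {p.relative_to(bag_dir).as_posix()}" for p in payload]
    (bag_dir / "manifest-md5.txt").write_text("\n".join(lines) + "\n",
                                              encoding="utf-8")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8")
    return lines


# Demonstration with a one-file payload.
with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp) / "record-v1"
    (bag / "data").mkdir(parents=True)
    (bag / "data" / "hello.txt").write_bytes(b"hello")
    manifest_lines = write_bag_files(bag)
    bag_contents = sorted(p.name for p in bag.iterdir())
```

A receiving repository can recompute the checksums on ingest and compare them with the manifest to confirm that no file was altered in transit.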

The second option for disseminating records is the XML feed, which is created when a record is approved for publication. The feed contains metadata about each study and any associated files processed through the curation tool. Figure 6 shows a sample of the XML feed. Only studies and files marked as ‘public’ will appear in the feed. The feed includes the persistent link for each file, so files can be downloaded or ingested via the feed. The feed can be ingested into any system with an XML mapping option. In the YARD implementation, the XML feed is ingested into a Drupal site using the Drupal feed importers module, whereby each XML element is mapped to a specific Drupal field.

Figure 6

An example XML feed produced by YARD showing the metadata for a catalog record.
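
A system consuming such a feed would map each XML element to a field of its own. The sketch below shows that mapping in Python for readers who do not use Drupal. The feed structure and element names (`record`, `title`, `file`, `name`, `persistentLink`) are invented for this example and should not be read as the actual YARD feed schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed; the element names are invented for illustration.
SAMPLE_FEED = """\
<records>
  <record>
    <title>Sample Study</title>
    <file>
      <name>trial.csv</name>
      <persistentLink>https://example.org/handle/0000/1</persistentLink>
    </file>
  </record>
</records>
"""


def files_in_feed(feed_xml: str) -> list[dict[str, str]]:
    """Map each file element in the feed to a flat dictionary of fields."""
    root = ET.fromstring(feed_xml)
    rows = []
    for record in root.iter("record"):
        title = record.findtext("title", default="")
        for file_el in record.findall("file"):
            rows.append({
                "study": title,
                "name": file_el.findtext("name", default=""),
                "link": file_el.findtext("persistentLink", default=""),
            })
    return rows


rows = files_in_feed(SAMPLE_FEED)
```

Each resulting dictionary holds the study title, the file name, and the persistent link from which the file could be retrieved.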

Customizability

The curation tool is designed to be fully customizable. That includes configuring the curation workflow such that curation tasks can be adjusted. For example, other curation frameworks could be applied (e.g., the Data Curation Network’s CURATED checklist, see ) or specific tasks can be made optional or dropped altogether (e.g., checking for the presence of personally-identifiable information). These and other changes to the application code–changing user roles (e.g., the Depositor role may be eliminated if a repository only allows Curators to deposit, or other roles can be added), changing the metadata schema as appropriate to other disciplines, customizing study-level information (e.g., adding a geolocation field), and more–can be done by a skilled developer.

Other customization involves changes to admin settings in the application or editing config files on the web server. For example, within the application, administrators can customize file storage locations, edit or turn off persistent link minting, and integrate with other software to automate tasks where possible (see Table 1). Admin settings also allow customization of email notifications and of permissions to create new user accounts. Finally, admin access to the web server grants permissions to edit config files in order to customize functions, such as changing database paths, limiting or increasing the allowable upload file size, and customizing error reporting.

Given the diversity of research practices and products, the tool is designed to be modular and built with interchangeable components, such that it could be customized by an organization to meet its specifications, requirements, and policies.

Availability

YARD is currently supported at Yale with local IT resources and infrastructure, including admins who deploy the full stack and monitor and maintain the web and database servers. Any organization that assumes responsibility for the curation of research outputs–for example, a repository, a research lab, an academic research center, or a library–can have a local installation by compiling from the source code. Access to the code and comprehensive documentation are available in a public repository (). As an open source project, it is our hope that interested parties will join us in supporting and improving the software in accordance with best practice governance models in the academic Open Source community.

Discussion and Conclusions

This paper describes a policy-driven, adaptable workflow tool supporting the archival and dissemination of high-quality research outputs. The tool is designed to increase the potential for long-term usability by creating high-quality and FAIR-compliant data packages. The essential design principles applied to this tool are modularity and open source. The tool also promotes research transparency by connecting the activities of researchers, curators, and publishers through a single pipeline. Our vision is for this tool to be used by organizations committed to both rigorous research practices and high-quality output. We believe this project is a significant step toward the “development of more generic tools and processes for validating and improving various aspects of data quality,” as called for by Digital Curation Centre Director, Kevin Ashley ().

YARD addresses variability in research output quality by helping economize and standardize curation efforts and services. It does so by:

Providing a workflow in which curation activities can be managed, tracked, inspected, standardized, and shared; and
Enabling implementation of quality standards and policies aligned with making research outputs more usable and interpretable in the long term, and deploying a design approach that facilitates accommodating new conditions and integrating with improved tools.

Developing YARD was the collaborative effort of several groups at Yale and Colectica. The team made use of project management tools to communicate with the developers, track software bugs, and document the software development process. At Yale, good working relationships with partners in Yale Information Technology Services and Yale University Library IT were essential to the project’s success in all steps of development. Looking back at the trajectory of the project’s development, we recognize that, as with many software development projects, we were subject to tightly resourced environments that presented a challenge to well-intentioned but sometimes compromised efforts to test and deploy the tool within scheduled timelines and to assume local project ownership beyond the initial Colectica development. A more agile approach to deployment and testing of the software could have mitigated the consequences of some legacy decisions made at the project’s inception, such as hosting the software on Yale ITS managed infrastructure (which provided automated server backups and security management but required additional coordination across departments) as opposed to a cloud service like AWS (which would give us more flexibility and control but require additional internal resources). Despite a lack of funding beyond the initial development and unforeseen delays, we have confidence in YARD’s sound fundamentals and potential to contribute to standardized, efficient, and transparent curation.

Future improvements to the software may include developing an API to allow further integration of published records with various workspaces or repository destinations. The curation log generated by the tool may be mined for information about curation tasks to inform staffing needs and educational efforts relating to research data management and curation. The curation tool’s version tracking and UUID capabilities may be used to track the evolution of digital objects throughout the research lifecycle, from creation to publication or archival, and to link them to other systems, such as institutional record keeping for sponsored projects. Relatedly, other methods of authentication may be implemented to allow seamless integration with other systems. We urge the community to take advantage of the open source software. For now, we are confident that the curation tool provides a framework and a method for enhancing the digital artifacts underpinning scientific research – something that research institutions, repositories and archives, and publishers have a vested interest in.

Data Science Journal

Practice Papers