1. Introduction

There is broad concern over the lack of reproducibility in science (; ), with many believing there is a crisis (). While the extent is contested (; ), concerns about scientific reproducibility are ongoing, and flawed study designs and irreproducible analyses play a role. There have been efforts to encourage better practices, such as pre-publication of study protocols, analysis plans, and all code (). However, as argued in Haring (), while the different biases in production and reporting of research are largely identifiable and modifiable, continued methodological training for early career researchers is also crucial.

Use of machine learning (ML) in the biosciences has proliferated so rapidly that it is difficult for adoption of good practices and proper training to keep pace. Open Science practices, such as public release of code and data, aim to remedy this (). While access to code and data is necessary for reproduction of computational results, such access does not guarantee that results can be reproduced. Indeed, the recent Ten Years Reproducibility Challenge investigated the ability to rerun code and reproduce results from projects ten or more years old, and the issues encountered resulted in a useful ‘reproducibility checklist’ (). Additionally, efforts have been made to set standards for reproducible code, including for ML, and these serve as rubrics for assessing reproducibility (). What seems lacking, however, are detailed examples of practical implementations. This work provides such an example by explaining how an ML-enabled study was planned and executed with reproducibility as an explicit goal from the outset of the project.

In our example, the study is an ML-enabled inventory of biodata resources identified from the scientific literature. Biodata resources are biological, life sciences, and biomedical databases that archive research data generated by scientists, serving as the repositories of record for particular data types, as well as knowledge bases that add value through aggregation, processing, and expert curation. These resources are connected through extensive exchanges of data and form a distributed global infrastructure. They are crucial to the entire life science research endeavor and are used ubiquitously.

However, the infrastructure is not well-described. A number of existing resource registries, such as re3data and FAIRsharing, have done a commendable job of cataloging resources either through self-registration by the resource owner or through addition by a curator. However, neither the number of resources nor their location has been systematically explored. A better understanding of the scale of the infrastructure, as provided by this inventory, will aid funders and other stakeholders in addressing challenges to sustainability faced by the infrastructure. The methods and results of creating this inventory are fully described elsewhere (). However, during preparation of that manuscript we realized that there were many additional details to share about how we attempted to design and implement a reproducible workflow—details we wish we had found in the literature ourselves.

As context for this reproducibility case study, the following provides an outline of the research project (Figure 1), and we invite readers to access the openly available article referenced above for additional details. Briefly, the study first utilized the API of Europe PMC (europepmc.org) (), a data resource that archives a large corpus of medical and life sciences publications (). Europe PMC supports both individual (browser-based) and automated (API-based) queries. Our workflow started with a targeted query to the Europe PMC API to retrieve the titles and abstracts of publications for which both a URL and the word ‘data,’ ‘database,’ or ‘resource’ are present in the title and/or abstract. The results of the query represented publications that might describe a biological (biodata) resource. A 10% random subset of publications from this initial result was manually classified as describing or not describing a biodata resource (see and additional documentation in ). Those that did describe a biodata resource were curated to label the resource’s common name (e.g., PDB) and full name (e.g., Protein Data Bank) (). BERT (Bidirectional Encoder Representations from Transformers) models have recently performed well on natural language processing (NLP) tasks (). Several BERT models pre-trained on biomedical corpora (e.g., SciBERT, PubMedBERT, BioMed-RoBERTa-RCT) were selected from huggingface.co and fine-tuned for the classification (predicting whether an article describes a biodata resource) and named-entity recognition (predicting the common and full names) tasks. Further downstream processing, including URL extraction and HTTP status checking, was performed before finalizing the inventory.

Figure 1 

Flowchart of overall study design to identify biodata resources from the scientific literature. The fine-tuning procedure is not shown. Reproduced unmodified from () under Creative Commons Attribution License.
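For readers unfamiliar with the Europe PMC web services, the following is a minimal sketch of the kind of API query described above. The query string is illustrative only (the exact string used in the study is preserved in the project repository), and pagination and error handling are omitted.

```python
# Minimal sketch of a Europe PMC REST API search (illustrative only; the exact
# query string used in the study is recorded in the project's repository).
import requests

EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

params = {
    # Hypothetical query: publications whose title or abstract mention a
    # URL-like string and one of the target words.
    "query": '(TITLE:"http" OR ABSTRACT:"http") AND (TITLE:"database" OR ABSTRACT:"database")',
    "resultType": "core",   # "core" results include abstracts
    "format": "json",
    "pageSize": 100,
}

response = requests.get(EPMC_SEARCH, params=params, timeout=30)
response.raise_for_status()

for record in response.json()["resultList"]["result"]:
    print(record.get("pmid"), record.get("title"))
```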

During the study, a strong emphasis was placed on Open Science, reproducibility, and robustness of the codebase and documentation, for both philosophical reasons (in support of Open Science) and practical reasons (enabling future updating of the inventory). The entire process, from data splitting through model training and selection to all downstream processing, is encapsulated in a Snakemake workflow (). This allows reproduction of the entire analysis with a single command. Strong standards of code quality were developed and are enforced through static code checking and automated testing. Additionally, significant efforts were made to make all data products findable, accessible, interoperable, and reusable (FAIR) ().

When we began the project, we turned to the literature for robust examples of reproducibility that implemented both open data practices and code standards. Several articles contain excellent conceptual overviews (e.g., ; ; and a recent synthesis in ), and examples of efforts to implement Open Science practices, including open data and/or computational reproducibility, have been reported from many domains (e.g., ; ; and ). These examples show how reports often focus on a few critical aspects of implementing Open Science practices; for example, although Bush et al.’s work did not provide the explicit code details we were interested in, it offers excellent administrative considerations, such as accounting for trade-offs. Figueiredo et al. provide a clear and detailed ‘kit’ for using computational notebooks, both to show the value of reproducible workflows and to enable their adoption. Kim et al. first describe their efforts to reproduce a study in which the original authors had taken steps towards reproducibility, the challenges faced despite those steps, and then their own iteration towards greater reproducibility. While there is similarity between these efforts and our goals, when it comes to implementation, many details are inherently different, if described at all, because of variation in the nature of the work and the relevant packages and tools. Not surprisingly, we were unable to locate implementation details that mapped exactly to our project and goals, so we adapted what we found to fit our scenario. As an ML project, we found Heil et al.’s rubrics especially helpful in providing a framework for us to consider and specific goals to aim for. We recognize that there are other ways of attaining these goals, and projects that have subsequently cited Heil et al.’s standards show this diversity (e.g., ; ; and ). We offer our experience as just one example of how to make a computationally heavy study reproducible and open. We provide the reasoning behind the various considerations, which may be applicable to other research projects. We also provide specific examples of how those considerations were realized in this study.

2. Have a Plan

‘A goal without a plan is just a wish,’ a maxim often attributed to Antoine de Saint-Exupéry (). As with any other part of a research project, planning ahead makes the path to achieving reproducibility as smooth as possible. To this end, early in the project we developed an Open Science Implementation Plan (). In this document, we outlined our goals for reproducibility and how we planned to achieve them. These goals were organized into four groups: reproducibility of methods, code standards, data standards, and external review/validation (Figure 2).

Figure 2 

Graphical overview of the objectives of the study and the tools and methods used to address them regarding reproducibility, code quality, and data standards. The execution of these objectives was assessed by external review and validation.

By considering these topics early in the project, we explicitly defined the expectations we had for our Open Science goals. Keeping these goals in mind helped ensure that the effort and resources required to achieve them were anticipated and treated as a core aspect of the project. This minimized the accumulation of technical debt that would have been time-consuming and difficult to address near the end of the project.

3. Reproducibility of Methods

We found the reproducibility standards (bronze, silver, gold) defined by Heil et al. () useful for ranking reproducibility levels. In our case, bronze alone (data published and downloadable, models published and downloadable, source code published and downloadable) was not acceptable. Silver (bronze + dependencies set up in a single command, key analysis details recorded, all analysis components made deterministic) was acceptable, but the gold standard (silver + entire analysis reproducible with a single command) was our goal.

3.1. Meeting the bronze standard

The bronze standard of reproducibility is characterized by having the following published and downloadable: all data necessary for reproduction, trained models, and source code.

Data availability and, more broadly, FAIRness (findability, accessibility, interoperability, and reusability) will be further discussed in a later section. To address the minimum requirements of the bronze standard, all data are available for download from the project’s GitHub and Zenodo repositories.

Model availability is addressed in a few ways. All of the models used in this project were pre-trained by other groups and made available on HuggingFaceHub (HFHub, https://huggingface.co/). As part of model training, these pretrained models were fine-tuned to various tasks (sequence classification and token classification). These fine-tuned models are made available on HFHub.
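As a sketch of how such a fine-tuned model might be retrieved and used, the snippet below loads a classifier from HFHub with the transformers library. The repository identifier is the one given in the Data Accessibility Statement; the subfolder name and example text are hypothetical, and the actual repository layout is documented in the project README.

```python
# Hedged sketch of loading a fine-tuned classifier from the Hugging Face Hub.
# The repository id comes from the Data Accessibility Statement; the subfolder
# name is hypothetical and the actual layout is documented in the project README.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

REPO_ID = "globalbiodata/inventory_2022_all_models"
SUBFOLDER = "classification/best_model"  # hypothetical path within the repo

tokenizer = AutoTokenizer.from_pretrained(REPO_ID, subfolder=SUBFOLDER)
model = AutoModelForSequenceClassification.from_pretrained(REPO_ID, subfolder=SUBFOLDER)
model.eval()

text = "XyzDB is an open-access database of example records."  # toy abstract
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print("Predicted class id:", int(logits.argmax(dim=-1)))
```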

All source code is stored in two places. First, GitHub serves as a ‘living’ repository. An important aspect of Open Science is providing a place for open discussion (and criticism) of methods, and the GitHub Issues system permits and encourages free and open commentary on computational methods. However, GitHub repositories are not immutable. It is important to have the methods, as described in the original publication, preserved and available, so the source code used to obtain the results in the associated full publication mentioned above has been archived as a code release on GitHub and also deposited, unmodified, into the Zenodo archive.

3.2. Meeting the silver standard

The silver standard requires, in addition to those aspects listed in the bronze standard, that all dependencies can be installed and set up with a single command, key analysis details are recorded, and all analysis components are deterministic (not random).

A common challenge for reproducibility is having simple installation procedures. To reach the silver standard in this regard, we wanted it to be possible to install all dependencies with a single command. For Python-based projects that is often possible with the command ‘pip install -r requirements.txt’ (). However, sometimes other dependencies not covered by pip need to be installed. To simplify this step, we utilized Make (GNU Make v4.2.1) (). While Make is a powerful tool intended for controlling the generation of executable files, we use it only to create aliases for shell commands. In the case of installation, we provide a Make target called ‘setup’, so the user can simply type ‘make setup’ to execute the shell commands that install all dependencies, including running pip (v21.1.2) for Python dependencies () and renv (v0.14.0) for R dependencies ().

In addition to providing a simple pip install procedure, we created a conda installation procedure (). While using pip to install dependencies at the user level is sufficient in isolated environments, such as Google Colab (https://colab.research.google.com/), it can lead to conflicts on other systems if a virtual environment is not used. Conda (v22.9.0) provides an isolated environment in which the project-specific dependencies are installed. By providing a conda environment description (YAML) file, we make it possible to recreate the conda environment with a single command.

Beyond virtual environments, containers such as Docker () are often used for documenting and sharing computational environments directly. However, containers can be challenging to use in certain environments. We wanted this project to be reusable for people with a wide range of technical skills, including those who may not have ready access to a robust computational infrastructure. This is especially important when thinking of potential users on a global scale, whose access to resources will be highly variable. This dependence on access to computational resources has been noted as an important part of data democratization (). Here, we designed this project to be run on Google Colab for its low barrier to entry and its provision of graphics processing units (GPUs) for free use. Unfortunately, Colab does not natively support common container services such as Docker. However, by providing several options for dependency installation we hope that future users can find one to suit their needs.

Sufficient documentation of ‘key analysis details’ is subjective. To satisfy this requirement, in addition to an overview README that describes the entire repository, we provide README files in every directory within the repository. These explain what the various files/scripts are and how they relate to each other. Since 2022, GitHub has supported the use of Mermaid, a JavaScript-based diagramming and charting tool (), in Markdown files, which we leverage to create informative flowcharts illustrating workflow logic.

An often overlooked key to reproducibility in computational methods, particularly ML methods, is seeding pseudo-random processes such that they are deterministic (; ). The random numbers generated by pseudo-random number generators can have significant effects on the trained model and model performance (). To make these processes reproducible, we therefore added options to seed them so that they run deterministically.
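As an illustration (not the project’s exact code), a typical Python setup seeds every pseudo-random number generator involved in fine-tuning before training begins:

```python
# Illustrative seeding of the pseudo-random number generators commonly involved
# in fine-tuning a transformer model (not the project's exact code).
import os
import random

import numpy as np
import torch


def set_all_seeds(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch RNGs so training runs deterministically."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Favor deterministic CUDA kernels where available (may slow training).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_all_seeds(42)
```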

3.3. Meeting the gold standard

The gold standard requires that the entire analysis can be run with a single command (). Such single-command analyses require the use of a workflow manager, of which there are several options. We utilize Snakemake (v7.1.1), which facilitates automation through the definition of ‘rules’, or steps, that take inputs and generate outputs. Given a statement of which outputs are desired, Snakemake creates a directed acyclic graph of the rules that must be executed to create them. For instance, in this project we specify that we would like the final output file to contain the classified articles along with extracted metadata. If the final output is not present, Snakemake executes all necessary steps in the pipeline, including data splitting, model training and comparison, classification and named-entity recognition (NER), and all downstream processing. With the help of a Make alias, the Snakemake workflow for reproducing all results can be run with the single command ‘make train_and_predict’.
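To make the mechanism concrete, the following is a generic Snakefile sketch (rule names, file paths, and scripts are hypothetical, not the project’s actual rules): requesting the final output causes Snakemake to run whichever upstream rules are needed to produce it.

```
# Generic Snakefile sketch (rule names, paths, and scripts are hypothetical).
# Requesting out/final_inventory.csv causes Snakemake to first run train_model
# if the trained model is missing, because each rule's inputs must exist.
rule all:
    input:
        "out/final_inventory.csv"

rule train_model:
    input:
        "data/train.csv"
    output:
        "out/checkpoints/best_model"
    shell:
        "python src/train.py --train {input} --out {output}"

rule predict:
    input:
        model="out/checkpoints/best_model",
        papers="data/query_results.csv"
    output:
        "out/final_inventory.csv"
    shell:
        "python src/predict.py --model {input.model} --papers {input.papers} --out {output}"
```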

It is important to be able to reproduce all results from the raw data to the final results, including model training. However, model training is resource intensive and may require specialized hardware, such as a GPU, for training to complete in a reasonable amount of time. Requiring that all models be retrained in order to reproduce results can therefore be a practical barrier to reproducibility. To minimize the computational resources necessary for reproduction, all fine-tuned models are available on HFHub. If the fine-tuned models have been downloaded and are present when Snakemake is run, Snakemake will not execute model training.

4. Beyond Reproducibility

The goal of reproducibility is to allow anyone to reproduce the results of published research. We have provided, as described above, a system that allows the results of the inventory of global biodata resources to be reproduced. However, this project was also designed to allow the entire analysis to be rerun periodically. Strictly speaking, this goes beyond reproduction since the underlying data is expected to change as more publications are added to the corpus of literature archived in Europe PMC, so the methods developed need to be generalizable. Generalizability benefits from the same considerations as reproducibility but tends to include additional challenges.

We approached generalizability with the same standards as reproducibility and wanted to make updating the inventory possible with a single command. To this end we designed a second Snakemake workflow for periodically updating the inventory. For this process the trained models can be automatically obtained from Zenodo using the setup command. The previously best performing models for each task are used, which eliminates the need for retraining and evaluation.

5. Code Standards

We have taken the philosophy that the results of a computational research project are no more trustworthy than the code used to produce them. Trustworthiness of code depends on code quality, including considerations such as readability and robustness. In this section we describe the measures taken to ensure code quality, such as code formatting, static code checking, and automated testing.

5.1. Code formatting

For Open Science, accessibility of code should not be limited to the code being publicly available. True accessibility requires that code also be readable and well documented. A good first step is to utilize a code formatter, which is available for all modern programming languages. We used yapf v0.31.0 to format all of the Python code in this project (). Similarly, Snakemake files were formatted with snakefmt v0.6.0, and R files were formatted with styler v1.7.0 (; ). These steps are meant to ensure that all components of the project are readably formatted and documented to maximize their ease of use for others.

5.2. Static code checking

Another measure taken to increase code robustness is static code checking. Again, the code checking tools available will depend on the language. We utilize the linters pylint v2.8.2 and flake8 v3.9.2 to check all Python code, ensuring that community code standards are upheld and detecting code smell (patterns indicative of potential problems) (; ). Many of the items that these linters check can greatly improve code quality and readability. Examples include: line lengths must be limited to predefined thresholds, any given context (e.g., a function) should not contain too many variables, and all functions should have docstrings. These, and many other requirements, encourage developers to write cleaner, more readable code.

Additionally, while type annotations are not required in the Python community, we implemented them because they provide a number of benefits. Type annotations provide built-in documentation by defining the data types of all inputs and outputs of functions. A less discussed benefit is that they enhance the integrated development environment (IDE) experience: with more knowledge of the variables, the IDE can give better help messages, syntax highlighting, and autocompletion. The final benefit of type annotations is the prevention of unforeseen bugs when they are used in conjunction with a static type checker. We used mypy v0.812 to check type compatibility within all our Python code (). Because Python is interpreted and dynamically typed, type errors surface not at compile time but at runtime, where they can be more difficult to resolve and may not appear until the code is run much later; static type checking significantly reduces the chances of encountering such bugs.
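As a brief, generic illustration (not taken from the project’s code), type annotations document the expected types, and mypy can flag a violation before the code is ever run:

```python
# Generic illustration (not project code): annotations document the expected
# types, and mypy can flag violations before the code is ever run.
from typing import List


def count_resource_mentions(abstracts: List[str], keyword: str) -> int:
    """Count how many abstracts mention the keyword (case-insensitive)."""
    return sum(keyword.lower() in abstract.lower() for abstract in abstracts)


print(count_resource_mentions(["PDB is a database.", "No mention here."], "database"))

# count_resource_mentions(3.14, "database")
# mypy reports: error: Argument 1 to "count_resource_mentions" has incompatible
# type "float"; expected "List[str]"
```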

While static code checking has many benefits, programmers need not strictly adhere to all suggestions made by the code checkers. Luckily, most tools are configurable. Importantly, the user can disable certain warnings. To ensure portability of these configurations, most code checkers allow for configurations to be defined in a resource configuration (rc) file rather than in global or user settings. Accordingly, we have included our rc files in the GitHub repository so that when someone else runs the code checkers on our published code they yield the same results.

5.3. Testing

A crucial software engineering practice that is often absent from research code is testing. Testing in all of its forms (unit, integration, and end-to-end) defines the specifications of a piece of software, and passing tests give assurance that the software meets those specifications. This has numerous benefits that cannot be overstated.

One of the primary benefits is that tests serve as a contract, which is a form of documentation. A unit test of a function explicitly states what kinds of input are expected and what kinds of outputs will be produced. For documentation, the only thing better than telling what a function does (through comments and docstrings) is showing through tests (asserting that when certain inputs are provided, the expected output is returned). While the descriptions provided in docstrings and comments are what the developer intends the software to do, a passing test demonstrates that it indeed does what was intended. Conversely, anything not covered in the test cases is where the contract ends. Tests ensure that the code can do what it says.
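A toy example of this ‘contract’ follows; the function and test are illustrative, not drawn from the project’s test suite.

```python
# Toy example of a test acting as a contract (illustrative, not project code).
# test_shorten_name() states exactly what inputs are accepted and what output
# is promised; if shorten_name() ever changes behavior, the test fails.
import pytest


def shorten_name(full_name: str, max_len: int = 30) -> str:
    """Truncate a resource name to max_len characters, ending with an ellipsis."""
    if max_len < 4:
        raise ValueError("max_len must be at least 4")
    if len(full_name) <= max_len:
        return full_name
    return full_name[: max_len - 3] + "..."


def test_shorten_name() -> None:
    assert shorten_name("Protein Data Bank") == "Protein Data Bank"
    assert shorten_name("A" * 40, max_len=10) == "AAAAAAA..."
    with pytest.raises(ValueError):
        shorten_name("PDB", max_len=2)
```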

From an Open Science perspective, testing is particularly valuable. Not only does testing provide more detailed documentation than could ever be provided in an article’s methods section, it also facilitates community feedback and contributions. Making changes to software always poses the risk of disrupting previous functionality, which is problematic when considering whether to apply community feedback or contributions. However, with strong test coverage, developers can have more confidence that updates do not introduce breaking changes, as long as all previously passing tests still pass. Indeed, tests provide a clear avenue for addressing bugs caught by the community: developers can add a test case that exposes the bug, then modify the code such that the new test and all previous ones pass. This effectively amends the contract provided by the tests so that it is more comprehensive. Without tests in place, developers would have to manually check that the code still behaves as described. Such checking is so error prone that many researchers may be hesitant to implement changes suggested by others.

Of course, adding strong test coverage does require more work than, for instance, implementing static code checks or formatting. Without tests, though, code must be manually assessed to ensure that a given piece of software is able to perform its intended task, and there is a barrier to implementing community feedback. Further, a lack of tests is a form of technical debt, and the price is paid when trying to refactor or fix bugs.

Pytest v6.2.4 was used as the testing framework for all Python code in this project (). Pytest plugins for flake8, pylint, and mypy are used to include static code checks of each file as part of the test suite (pytest-flake8 v1.0.7, pytest-pylint v0.18.0, pytest-mypy v0.8.1) (; ; ). This means the test suite cannot pass unless all static checks pass. Additionally, most functions have associated tests, and most scripts also have end-to-end tests that ensure they properly reject bad inputs and produce correct output when given good input. While we aim for good test coverage, some functions and scripts are not comprehensively tested. This is generally the case for functions/scripts that take a very long time to run, such as the actual process of model training. Additionally, the Snakemake workflows developed are not formally tested using an automated testing framework, although it would be best to do so and we may implement this at a later time.
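For the end-to-end script tests mentioned above, a pattern like the following can be used. The script name, flags, and test input paths here are hypothetical, not the project’s actual interface.

```python
# Illustrative end-to-end test pattern (script name, flags, and paths are
# hypothetical): run a command-line script as a subprocess, check that it
# rejects bad input with a non-zero exit status and produces output when
# given good input.
import subprocess


def test_rejects_missing_input_file() -> None:
    proc = subprocess.run(
        ["python", "src/check_urls.py", "--file", "does_not_exist.csv"],
        capture_output=True,
        text=True,
    )
    assert proc.returncode != 0
    assert proc.stderr != ""


def test_produces_output(tmp_path) -> None:
    out_file = tmp_path / "out.csv"
    proc = subprocess.run(
        ["python", "src/check_urls.py", "--file", "tests/inputs/good.csv",
         "--out", str(out_file)],
        capture_output=True,
        text=True,
    )
    assert proc.returncode == 0
    assert out_file.exists()
```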

5.4. Configurability

Our aim was that users of the code, whether for reproducibility, generalization, or separate implementation, would not need to edit source code to change its behavior within the intended use cases. Parameters that may change should instead be supplied as inputs/arguments. Often, this means that paths to input files should not be hard-coded but rather passed in when calling a script. For ML projects, this also frequently applies to hyperparameters.

One solution to this is to use parameterization extensively and, in order to make the analyses reproducible, to store the parameters used in configuration (config) files. By doing so, others can see what parameters were used to generate the results. This process additionally gives future users a clear indication of what parameters are likely okay to change, all without them having to edit any source code.
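As a generic sketch of this pattern (argument names and the default model are illustrative, not the project’s actual interface), a script can receive everything it needs as command-line arguments, whose values are in turn supplied by a config file rather than hard-coded:

```python
# Generic sketch of parameterization (argument names are hypothetical): paths
# and hyperparameters arrive as command-line arguments rather than being
# hard-coded, so behavior can be changed without editing source code.
import argparse


def get_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Fine-tune a classifier.")
    parser.add_argument("--train-file", required=True, help="Labeled training data (CSV)")
    parser.add_argument("--out-dir", required=True, help="Directory for checkpoints")
    parser.add_argument("--model-name", default="allenai/scibert_scivocab_uncased",
                        help="Pretrained model to fine-tune")
    parser.add_argument("--learning-rate", type=float, default=2e-5)
    parser.add_argument("--seed", type=int, default=42, help="Random seed for determinism")
    return parser.parse_args()


if __name__ == "__main__":
    args = get_args()
    print(f"Training {args.model_name} on {args.train_file} -> {args.out_dir}")
```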

We store a large number of parameters in config files, such as input/output directories, training parameters, and locations of fine-tuned models. To train a new model and compare its performance to existing models, a new row simply needs to be added to a tab-separated config file. The README file in the config/ directory describes the acceptable ranges of values allowed in the config files, including a description of what kinds of models are compatible with the existing workflow.

Snakemake also makes extensive use of config files, and the config files described here are formatted such that Snakemake can utilize them when executing the workflow. So, to change the behavior of the workflow (again, within the expected range of uses), only config files need to be edited.

6. Data Standards

6.1. Source selection

Both code and data were integral components of this project, and both required consideration for reproducible outcomes. To create an open inventory as a product, we aimed to reuse and create data that aligned with the FAIR guiding principles (). The primary data source needed was bibliographic metadata. There are several commercial sources of bibliographic metadata, such as Dimensions (Digital Science), Scopus (Elsevier), and Web of Science (Clarivate Analytics). However, these resources require a subscription, which would limit others’ ability to reproduce and reuse our workflow, and they are not openly licensed. Therefore, we opted to use the open metadata available from Europe PMC as the data source for creating the inventory. Although not as exhaustive as the commercial options mentioned, Europe PMC covers a large swath of the life sciences; as of October 2023, high quality, interoperable metadata, including titles and abstracts, was available for over 40 million articles. Additionally, Europe PMC offers robust and well-documented APIs that facilitate access and are especially useful for a reproducible pipeline. Although we know that some biodata resources will be missed because articles are published outside of the ~4000 journals available in Europe PMC, we felt that this tradeoff was justified in order to optimize openness and reproducibility.

6.2. Addressing data findability and accessibility

Depending on context, anyone interested in reusing the data from this project might wish to start at a different point, so we offer multiple options. The exact query string we used can be rerun to obtain results from Europe PMC. Additionally, since bibliographic databases may change slightly over time (e.g., records added, removed, or corrected), the query results themselves (PMID, title, abstract) may be of use for reproducing our results with the exact same data. There is also the labeled training data used to train the various models and a preliminary inventory that was subjected to selective review by a curator; finally, the primary data product of this project is the final inventory itself. The query string, query results, training data, preliminary inventory, and final inventory are all available within the project’s GitHub repository and were archived for long-term preservation and persistent reference in an associated Zenodo deposition once the article was accepted for publication. Zenodo provides a DOI and relies on the DataCite metadata schema, which allows the dataset to be found within Zenodo’s search interface, DataCite’s central metadata store, and via internet search engines such as Google.

6.3. Addressing data interoperability

For the final inventory, we retained unique article identifiers (PMIDs) to allow easy extraction of additional metadata or access to the full text, when available, from either Europe PMC or PubMed Central. Additionally, we logged URL status codes per RFC 9110 (), extracted countries from author affiliations following ISO 3166 (), and retained geographic coordinates from IP address look-ups, when available. While it would have been ideal to include a persistent identifier for the biodata resources located (e.g., a ROR ID or DOI), most resources do not have such an identifier, which perfectly illustrates the challenge of trying to locate these resources in the first place.
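A minimal sketch of the kind of status check involved is shown below; the request logic is generic, not the project’s exact implementation.

```python
# Generic sketch of checking and logging a URL's HTTP status code (status
# semantics per RFC 9110); not the project's exact implementation.
import requests


def check_url_status(url: str, timeout: int = 10) -> int:
    """Return the HTTP status code for a URL, or 0 if the request fails."""
    try:
        # HEAD is cheaper; some servers reject it, so fall back to GET.
        response = requests.head(url, timeout=timeout, allow_redirects=True)
        if response.status_code >= 400:
            response = requests.get(url, timeout=timeout, allow_redirects=True)
        return response.status_code
    except requests.RequestException:
        return 0


print(check_url_status("https://example.com"))  # e.g., 200 if reachable
```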

6.4. Addressing data reusability

In addition to the efforts towards interoperability described above, we also maintained a structured format throughout and used the CSV format for preservability and ease of reuse. These files are accompanied by a plaintext README file that includes a description of each variable as well as data collection details and licensing. By using open data from Europe PMC, we were able to release the data under a CC0 license, allowing the broadest reuse possible. Together, this documentation, the repository’s GitHub history, and Zenodo’s commitment to long-term archiving provide provenance.

Finally, to further extend the potential for reuse, we plan to provide identified biodata resources to Europe PMC as community annotations. This will allow easy bulk access to the identified resources as well as their associated articles. The annotations can be used for several purposes, for example, mining articles with full text available or analysis of the intersection between these annotations and the many other annotation types available within Europe PMC.

7. External Review/Validation

In the Open Science Implementation Plan that we drafted (see Section 2 above), we also included a desire to have a party external to the team review the products of the study. Working within a team inherently provides a mechanism for internal feedback, but review by another person outside of the project helps reveal implicit knowledge developed during the project that would otherwise remain hidden to potential reusers. For example, team members may, without realizing it, adopt terms or abbreviations that are not well-known outside of the project.

This section of the Open Science Implementation Plan was not particularly well-developed beyond acknowledging that such a review would be ideal, as noted by others (; ), and that this role is included in the CRediT taxonomy (). As we moved closer to having products finalized, we had a better sense of what sorts of review would be most valuable. We recruited an individual who reviewed the code and documentation in detail and ran nearly all the code available in the open archive. We budgeted 40 hours for this work, which was easily consumed given the review effort required. Others may wish to allocate even more resources to this activity, which we found extremely helpful for identifying errors and pointing out gaps in our documentation. We formally acknowledge this effort here as well as in the associated article.

8. Discussion

Here we have described the efforts taken to develop a methodology for obtaining and updating a biodata resource inventory that meets Heil et al.’s gold standard of reproducibility, with a robust codebase and compliance with FAIR data standards.

8.1 From principles to practice

We, and many others, are committed to Open Science and see the imperative of reproducibility. Putting these principles into practice on a complex project presented an opportunity for us to work through philosophical, organizational, and technical details. We were successful in meeting the goals outlined in the Open Science Implementation Plan established at the beginning of the project. Installation of dependencies and reproduction of the entire analysis can each be performed with a single command, and analysis steps are fully documented. All code passes static code checks for formatting, linting, and type compatibility. Much of the code was formally tested with unit and integration tests. The core data products, such as the labeled training data and preliminary inventory, are present in GitHub and in Zenodo, with accompanying documentation.

The methodologies used in this work are not novel on their own. Wherever possible, we looked to existing tools and practices. The automation employed to make reproduction simple relies on the widely used Snakemake workflow manager. It is also common practice in software engineering disciplines to leverage static code checking and testing as we have done. Regarding data standards, we looked to the FAIR principles. The purpose of this report is to provide an example of how a research project that utilizes computational methods, particularly ML, can be implemented to maintain robustness and strive for a high level of reproducibility. However, we recognize that there are numerous ways to accomplish this and do not mean to claim that our implementation is foolproof.

8.2 From details to decisions

When we began the project, we were especially interested in finding implementation details. How exactly does one make it possible to re-run an entire analysis with a single command? How exactly does one make data ‘interoperable’? Although we knew these details would be different in our case, concrete examples can provide clarity and inspiration. As the project progressed and we learned by doing, our questions evolved to focus on the choices that must be made. One example is the tradeoff of using only open data versus a more extensive commercial data source, which would likely have yielded a larger, but in our estimation less useful, inventory. Many of the trickiest decisions involved accounting for the diverse interests of, and the resources available to, potential reusers, now and into the future.

There were also ambitions that we had at the start of the study that are now future directions because we chose to devote time to developing a robust workflow instead. This required principled project management and caused, even as we write this, some amount of wistfulness. In the end, we could not do it ‘all’, and we fully appreciate that others must decide for themselves where to place their efforts. Such decisions required us, and will require others, to devote a substantial amount of time to thinking them through and implementing them. We were able to do this only because of our team’s collective belief that these efforts were worth the resources invested.

8.3 Limitations

Certain improvements could be made, such as using a more robust package manager like Poetry and using git hooks to automatically run tests upon committing. Importantly, test coverage is lacking in some areas, especially for portions that involve heavy computation such as model training. Still, the current test coverage is enough to increase confidence in the code’s behavior. As Peng () noted, ‘Given the barriers to reproducible research, it is tempting to wait for a comprehensive solution to arrive.’ Rather than wait, we thought our experiences might be helpful to share.

Possibly the greatest limitation, or threat to long-term reproducibility, was the decision not to use containers, a trade-off made to remain compatible with Google Colaboratory. In the current configuration, all dependencies are listed in a requirements.txt file and must be installed to run the code. However, it is possible that some dependencies will become unavailable or incompatible in the future. Containers mitigate this problem by packaging all dependencies with the code.

A key consideration is how generalizable the efforts and methods toward reproducibility presented here are to other research projects, methods, and domains. Fortunately, most of the methods and tools here are not specific to natural language processing pipelines, and therefore generalize well to most computational research tasks. For example, workflow managers such as Snakemake can be applied to data analysis pipelines in general. Additionally, the more conceptual steps, like creating the Open Science Implementation Plan at the start of a project, could be broadly applied.

9. Conclusion

Through articulating our goals early on and dedicating time and resources, we were able to accomplish our Open Science and reproducibility goals. Throughout this case study, we provided details on the steps we took to make the code clean and robust and the data FAIR. We invested considerable effort into ensuring reproducibility, with the intent that both the methods and the outputs would be of use to us and others. Our first update of the inventory, initiated approximately one year after project completion, required only modifications to the Colab notebooks to account for Google Colaboratory changes, and otherwise functioned as expected. With this promising, albeit early, success, we remain cautiously optimistic that the work is durable. By presenting our experiences, we hope this Practice Paper provides a helpful example for others to consider as they work to build greater reproducibility into their research.

Data Accessibility Statement

Code and data generated during the course of the project are archived in Zenodo along with associated documentation (https://zenodo.org/doi/10.5281/zenodo.10105161). The final inventory and associated data dictionary are available as a separate Zenodo deposit (https://zenodo.org/doi/10.5281/zenodo.10105947). Readers may visit HuggingFaceHub (https://huggingface.co/globalbiodata/inventory_2022_all_models/tree/main) to access the fine-tuned models. Additionally, all materials are available on GitHub, which may be updated after this publication (https://github.com/globalbiodata/inventory_2022/). All other software used is openly available and shown in Table 1.

Table 1

Glossary of Software.


NAME            DESCRIPTION                                                           REFERENCE

conda           Package and environment management system                             ()
flake8          Python linter (static code checking)                                  ()
Make            Build automation tool, used here for creating shell command aliases   ()
Mermaid         Diagram generator for Markdown                                        ()
mypy            Static type checker for Python                                        ()
pip             Package manager for Python                                            ()
pylint          Python linter (static code checking)                                  ()
pytest          Python testing framework                                              ()
pytest-flake8   Pytest plugin to run flake8                                           ()
pytest-mypy     Pytest plugin to run mypy                                             ()
pytest-pylint   Pytest plugin to run pylint                                           ()
renv            Dependency manager for R                                              ()
snakefmt        Code formatter for Snakemake                                          ()
Snakemake       General-purpose workflow manager                                      ()
styler          Code formatter for R                                                  ()
yapf            Code formatter for Python                                             ()