1. Introduction and Background

The FAIR Principles describe 15 aspirational dimensions of research data management (). They provide a starting point for mapping out the data stewardship practices needed for any given research project. However, there is no way to ensure that all research objects adhere to FAIR, and FAIR is not all-encompassing. For example, FAIR is silent on data quality and reproducibility. FAIR also does not comment on data sovereignty, such as is covered in the CARE Principles of Indigenous Data Governance (). Consensus is lacking on whose responsibility it is to ensure the FAIRness of research objects, as well as on the underlying issue of who is responsible for research data (; ). Stakeholders range from individual researchers to institutions, funders, and publishers (; ). In each case, the stakeholder group is looking for clear and practical advice, and is less interested in philosophizing about the need for research data management practices or in complex and detailed arguments over which approach is better. In practice, researchers address these topics only as much as funders require. The US National Institutes of Health (NIH) require a data management plan and, beginning in 2023, expanded the requirements to cover data sharing (). The US National Science Foundation (NSF) requires a data management plan to be included with proposal materials. Funders around the globe require the discussion of research data and FAIR to varying degrees; the European Union's (EU) research funding calls and Australia lead these trends. Were it not for these funder requirements, many researchers would take these steps only on a voluntary basis. However, the requirement provides a key opportunity for outreach and awareness of the FAIR principles, and how they relate to newer technologies, during proposal preparation. The FAIR+ Implementation Survey Tool (FAIRIST) creates information that can be included in a proposal's data management plan or project description. Its contribution and value lie as much in what FAIRIST produces as in the conversations and decisions its completion evokes. Even where institutional support and services are not available to researchers, the mention of FAIR implementation possibilities can initiate important discussion.

This work is organized as follows. Definitions and terminology are introduced in Section 2. Related work is also presented in this section. In Section 3, the motivations and stakeholders of FAIRIST are discussed. Section 4 provides detail on FAIRIST’s design, functionality, and the user feedback process. This work closes with a discussion and perspectives on future work.

2. Literature Review, Definitions, and Terminology

This work refers to concepts from information and computer science.

FAIR data is data that meets principles in four categories: Findability, Accessibility, Interoperability, and Reusability. The FAIR Principles are the 15 principles that correspond to making research objects FAIR (). The principles are not prescriptive and are not rules; rather, they are touchstones for concepts that lead to research object, or data, usability. One of the first steps in the FAIR Principles is to make data 'Findable' (; ). Data should be easily findable by both humans and computers (; ). The automatic and reliable discovery of datasets and services depends on machine-readable persistent identifiers and metadata. Persistent identifiers are important because they unambiguously identify data and facilitate data citation; an example is a Digital Object Identifier (DOI). The (meta)data should be retrievable by their identifier using a standardized and open communications protocol, with restrictions in place if necessary. Metadata should remain available even when the data are no longer available. Not all data need to be open; data can be restricted and still be FAIR. Open or not, data should be stored somewhere safe for the long term. The data should be able to be integrated with other data, applications, and workflows. The format of the data should therefore be open and interpretable by various tools. The concept of interoperability applies at both the data and metadata level. Common formats, standards, and controlled vocabularies should be used. Ultimately, FAIR aims to optimize the reuse of data. To do this, data should be well documented, have a clear license to govern the terms of reuse, and include provenance information.

FAIR Digital Objects: Even though the original FAIR Principles publication called for making all types of research artifacts FAIR, there has been an overemphasis on data, e.g., a chunk of information or a single data point. This work acknowledges the need to make all digital objects FAIR, including software, models, algorithms, and workflows. The term 'FAIR Digital Object', or FDO, describes a concept and associated, evolving guidelines for packaging metadata about each chunk of information together with the data, and for associating each component with its own unique identifier. A master identifier is then assigned to the assembled package of data and metadata (; ). FAIRIST takes the approach that all research objects should be assigned identifiers. In doing so, FAIRIST aims to move towards recommendations that help create FDO-compliant research objects.
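
As a purely illustrative sketch of the packaging idea (not the evolving FDO specification itself), an FDO-like bundle with hypothetical identifiers could be represented as follows:

```python
# Illustrative sketch only: each component carries its own identifier and
# metadata, and the assembled package receives a master identifier.
# All identifiers and locations below are hypothetical placeholders.
fair_digital_object = {
    "master_id": "doi:10.9999/example.fdo.001",  # master PID for the package
    "components": [
        {
            "id": "doi:10.9999/example.dataset.001",
            "type": "dataset",
            "metadata": {"format": "HDF5", "license": "CC-BY-4.0"},
            "location": "https://repository.example.org/dataset-001",
        },
        {
            "id": "doi:10.9999/example.model.001",
            "type": "ml-model",
            "metadata": {"framework": "scikit-learn", "license": "CC-BY-4.0"},
            "location": "https://repository.example.org/model-001",
        },
    ],
}
```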

FAIR+: The term FAIR+ is used in this work as shorthand for FAIR combined with reproducibility.

Open Science has been used by different stakeholders to focus on different aspects of openness, from technological architecture to the (public) accessibility of knowledge creation, measurement and the democratization of access (). This work focuses on the qualities of the research processes that lead to openness, including transparency and reproducibility through technical and practical approaches. FAIRIST supports open science aims via recommendations for implementing the FAIR Principles that relate to findability and accessibility, as well as reproducibility.

Reproducibility: This work uses Gundersen's definition of reproducibility. It chiefly states that science should be able to be reproduced, not to the extent that the results are numerically identical, but such that the results support the same inferences drawn from the original research (). Although often mistaken for the "R" in FAIR, reproducibility is aided by the implementation of the FAIR principles, especially those that pertain to the openness of software, tools, and libraries, the accessibility of data, etc. Both FAIR and reproducibility are continuums rather than destinations. Research reproducibility can be resource-intensive; therefore, researchers should do as much as possible to document and provide a path for another researcher to retest their conclusions. However, it is understood that it is often not possible to recreate the exact same environment, or to provide the compute and storage resources needed for reproducibility work.

The FAIR principles and associated literature give very good recommendations and guidance on how to address data in research projects. Usually, however, researchers work with text files and other unstructured documentation. Some efforts have been made to transform these recommendations and guidance into computational tools that can standardize the process of data 'FAIRification'. Other important advancements that relate to FAIRIST include tools for interviewing researchers about FAIR implementation and data management practices, and upcoming tools for publishing and reusing data management plans. Four tools in particular were surveyed to understand whether they could be extended to include the approach conceived for FAIRIST. Argos and DMPTool are both good candidates for partnership, as discussed in Future Work. The FAIR Implementation Profile Wizard uses a complementary approach, and it was important to understand what could be leveraged or learned from that platform. The last platform examined, FAIR Connect, is relevant as a potential platform for publishing and sharing plans created using FAIRIST.

Argos: A joint effort between OpenAIRE and EUDAT, this platform provides a way to create and manage Data Management Plans (DMPs) (). Argos allows for manual entry or guidance with a wizard. Argos aims to document a research project and its outputs, mainly datasets. The Argos UI is streamlined and appears to employ modern UI/UX principles. It applies FAIR principles to how it collects data, utilizing APIs wherever possible so that researchers, institutions, and funders are not entered manually but connected via unique identifiers. The Dataset feature allows a researcher to document a dataset manually or prefilled from templates customized for the needs of the funder. Argos allows for collaborative writing, and DMP templates can be added, updated, and modified. Argos provides DOIs and DMP versioning via Zenodo and supports export in JSON format. Argos does require knowledge of data management concepts to complete the forms. Argos is a potential partner for integrating some of the questions from FAIRIST, although this would be a significant expansion in scope for the platform.

DMPTool: This tool, provided by the California Digital Library (CDL), was created several years ago to assist researchers with creating data management plans to accompany proposals (). The survey is comprehensive and updated regularly. However, the primary text is entered by researchers into many text boxes. This can be daunting for researchers who are new to research data management and are not sure where to start. Furthermore, in practice, researchers may use the DMPTool only once and then reuse plans from project to project, with minimal adjustment to the plan for the new project's needs. The survey and output of FAIRIST could be appended to the DMPTool and combined for ease of use by researchers.

FAIR Implementation Profile (FIP) Wizard: Created by members of the GO FAIR International Office and the GO FAIR Foundation, the FIP Wizard eases the creation of a machine-readable FAIR Implementation Profile (; ). Users answer questions in survey style, and the output is in the Resource Description Framework (RDF) format. The tool is also part of an exercise for communities to discuss (metadata) standards choices, such as those used by the WorldFAIR project (). Participants across several domains, or 'petals', of the project reported that having the discussion around metadata choices was as valuable as creating the FIP (). The FIP Wizard employs some of the same techniques and design as FAIRIST. However, it is concerned with aggregate information for a domain or subdomain of science, rather than individual projects.

FAIR Connect: This new initiative from the GO FAIR Foundation and IOS Press seeks to provide new tools for data stewards and researchers (). It provides a way to publish FIPs and DMPs as nanopublications. It also allows data stewards to comment on or endorse submissions. Additionally, it provides a way for stewards to be recognized for their contributions via citations. FAIRIST outputs could be published in FAIR Connect as nanopublications and assigned persistent identifiers for citation and attribution.

3. Motivation and Stakeholders

Researchers desire practical advice on how to implement the FAIR principles, but are challenged by the steep learning curve and background needed to engage in data stewardship. Some may not even be aware of FAIR until they see it mentioned in a funding solicitation. A systematic tool can help by narrowing down topics based on the research activities and outputs planned, rather than presenting everything and leaving it to the researcher to select the relevant principles. Such a tool should be designed to broach only topics that apply directly to the researcher's planned work. Those in the humanities who only plan to produce data and disseminate findings on a website would not encounter more complex topics, such as where to share machine learning (ML) models. Conversely, a computationally intensive project would receive specific suggestions on where to deposit ML artifacts and how to aid the reproducibility of the work by others in the future.

FAIRIST began as a templated response used to assist colleagues in crafting DMPs. In particular, researchers sought advice on how to implement FAIR, how to address FAIR when machine learning is employed, and what artifacts to make FAIR. The implementation advice distilled as many of the FAIR principles as possible into a table that a researcher could include in their DMP. Table 1 gives an example of the text created for an NSF proposal. This template was reused for other proposals, where the project name was replaced and the dimensions of FAIR added or subtracted depending on the planned research. This capitalized on researchers' interest in learning more about FAIR implementation and research data management during the proposal process. However, proposal development is a very busy time, and most attention is given to the project description or plan, not the DMP. Knowing this, the advice was written to be almost ready for inclusion in the DMP, and areas to update were clearly marked in angle brackets (< >).

Table 1

Template that inspired the creation of FAIRIST.


FAIR DIMENSION

Findable
  • Data will be assigned a PID <how?> and will be referenced on the <project website>
  • A catalog entry will be added to <FAIR Data Point or community/institutional catalog>.
  • Metadata and links to related ontologies will be available on the <project website>.
  • Where tags exist, schema.org descriptors will be utilized.

Accessible
  • Available via <storage location>, which does not require specialized software to access. This includes both the raw data and curated or derived data.
  • The surrogate and other ML benchmarks will be deposited in <repository>.
  • Any APIs will be versioned and described, linked from the <project website>.

Interoperable
  • Code stored on GitHub and linked from the <project website>.
  • Uses libraries from <project name> that utilize <standard or standard Python libraries, etc.>.
  • Uses standard references for <more here>.
  • Both input and output data are in <specify> format.

Reusable
  • ML model and data will be deposited at <repository>.
  • Notebooks will demonstrate how to assemble model and sample training datasets. Each notebook product will be assigned a DOI using <specify DOI source>.
  • The <project> notebook interface is on <place shared, e.g., GitHub>.
  • Provenance of the simulation creation will be available as part of the metadata.
  • A designation will be added to the website noting that all data are licensed under the Creative Commons Attribution 4.0 International License.

After filling out these templates manually a few times, it became clear that the process could be streamlined through a self-service survey. Even though every research project is different and the topics can be complex, much of the human logic could be distilled into 'if/then' statements. For example, if the project intends to produce notebooks, then the DMP should specify where the notebooks will be shared, whether they will be given a DOI, and whether a notebook template will be used.
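
As a minimal sketch of such a rule, with hypothetical question wording and prompts rather than FAIRIST's actual implementation:

```python
# Sketch of the 'if/then' survey logic described above; the prompts are
# hypothetical examples, not FAIRIST's actual questions.

def notebook_followups(produces_notebooks: bool) -> list[str]:
    """Return follow-up prompts triggered when a project plans to produce notebooks."""
    if not produces_notebooks:
        return []
    return [
        "Where will the notebooks be shared?",
        "Will each notebook be assigned a DOI?",
        "Will a notebook template be used?",
    ]

# A project that plans notebooks sees three extra prompts; others see none.
for prompt in notebook_followups(produces_notebooks=True):
    print(prompt)
```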

FAIRIST provides customized text for a researcher to include in a data management plan or proposal. Some form of DMP is required by many federal funders. The added benefits of planning data management at the outset of a research project are many: it makes the project easier to audit, to check for compliance with requirements, and to document, all of which benefits both researchers and funding agencies. Raising these topics as part of creating a required document can also put a research project in good stead for complying with other domain-specific publication requirements later. For example, the Association for the Advancement of Artificial Intelligence (AAAI) hosts one of the most prestigious annual conferences for AI researchers. Submitted papers must include a reproducibility checklist (). Many of the implementation solutions to the FAIR principles also prepare researchers for the reproducibility checklist. For example, papers submitted to AAAI that rely on datasets must answer the question, 'All novel datasets introduced in this paper will be made publicly available upon publication of the paper with a license that allows free usage for research purposes (yes/partial/no/NA).' Adherence to the FAIR principles relating to the clear statement of data usage licenses and the accessibility of data would prepare researchers to answer 'yes' to this question.

The stakeholders for a tool like FAIRIST include researchers from all domains and sectors, such as academia and industry, although FAIRIST is tuned for research grant proposals. Additional stakeholders include anyone involved in the proposal process where a DMP is required or where discussion of the FAIR principles is beneficial. This could include research support professionals, pre-award and project managers, and students or postdocs involved in proposal creation. The tool could also be used in synchronous and asynchronous trainings, such as the curriculum (), a grantsmanship course (), or a data management plan training course hosted by a university library.

4. FAIRIST Technology & Testing

FAIRIST surveys aspects of the project and then maps them to possible options and suggestions. Based on past experience and project requirements, topics in FAIRIST are organized so that the logic is easy for the researcher to follow, in an attempt to save them time. When the form is complete, the information is automatically generated in an organized and structured way, ready to be embedded in the project proposal. The important qualities of this approach include: it makes a complex topic accessible; it makes efficient use of researchers' time; and it uses the time spent in the survey to raise awareness of the topic, its richness, and its dimensions. For example, if the project being described will produce machine learning (ML) models, then a follow-up question is added asking, 'Where will the ML models be shared?', with several answer options the user may not have been aware of previously. An additional question asks, 'What are the reproducibility considerations you will undertake to document analysis that utilizes ML?' (Figure 1). By providing check-box options rather than only a free-form text box, the survey imparts knowledge about the topic that does not rely on a specific understanding of FAIR+ concepts. The example shown in Figure 1 distills ML implementation factors that can affect reproducibility to introduce the concept and ways to remediate variability.

Figure 1 

Embedded logic in FAIRIST expands the survey questions to fit the project described.

The reproducibility consideration options are distilled from a much longer and more complex computer science paper on sources of irreproducibility (). The source paper is linked in the FAIRIST survey question, in case the user wishes to read more about the topic before deciding on or implementing the suggestions. The advice from the paper was adapted as postcard-sized material (Figure 2) that could be used for outreach and awareness building for the concept and the FAIRIST tool (). This approach could be used to rapidly put other research data management scholarship into practice.

Figure 2 

Outreach and awareness building material that could be used in concert with FAIRIST by libraries, research computing facilitators, and other researcher support personnel.

Technologies & methodology utilized

FAIRIST was built in Qualtrics, a business survey tool with a full-featured user interface (UI) that allows for survey customization as input is given (). For example, it is possible to ask up front what types of research objects will be created and add or skip questions automatically based on the first response. The Qualtrics UI allows a non-programmer to refine the form iteratively, enabling an 'Agile' approach. Agile refers to a software methodology with four pillars: 'individuals and interactions over processes and tools, working software over comprehensive documentation, customer collaboration over contract negotiation, and responding to change over following a plan' (). At the time Agile was introduced, it stood in stark contrast to process- and resource-intensive methodologies, such as Waterfall (; ). The key principles adopted in the creation of FAIRIST included focusing on the individual's needs above the assumed process of writing a DMP, and quickly creating a working example that could be iteratively refined based on early user input.

Qualtrics is limited in the customized output it can provide. Some of the survey responses are used to determine what additional questions to ask, whereas the other responses set variables. Via the Qualtrics API, the variables are passed to a Python script hosted on a local cloud instance that transforms them into completed sentences (). This text output is formatted for inclusion in a DMP but could also be extended to include a machine-actionable DMP. This would be accomplished by providing the text formatted for written language, as well as in RDF. The Python program that formats the variables collected from Qualtrics could be adapted to any format, including the Research Data Alliance DMP Common Standard ().
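
As a hedged illustration of this transformation step (not the actual FAIRIST backend), the sketch below maps hypothetical survey variables onto sentence templates:

```python
# Sketch: survey variables arrive as key/value pairs and are slotted into
# sentence templates. Variable names and templates are illustrative only.

SENTENCE_TEMPLATES = {
    "ml_model_repo": "ML model and data will be deposited at {value}.",
    "data_format": "Both input and output data are in {value} format.",
    "license": "A posted notice will designate research objects as licensed under {value}.",
}

def render_recommendations(responses: dict[str, str]) -> str:
    """Turn collected survey variables into DMP-ready sentences."""
    lines = [
        SENTENCE_TEMPLATES[key].format(value=value)
        for key, value in responses.items()
        if key in SENTENCE_TEMPLATES
    ]
    return "\n".join(lines)

print(render_recommendations({"ml_model_repo": "OpenML.org", "data_format": "HDF5"}))
```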

The Qualtrics web service workflow is asynchronous; it can take 1–5 minutes for the output to be sent to the FAIRIST output generator, after which an email notifies the user that the recommendations are available for viewing. FAIRIST output links are encoded with a 128-bit universally unique identifier (UUID), so that others' output cannot easily be guessed, and thus viewed.
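
A minimal sketch of how such a link could be generated, assuming Python's standard uuid module and a hypothetical base URL (the actual FAIRIST endpoint may differ):

```python
# Sketch: generate a hard-to-guess retrieval link keyed by a random
# 128-bit UUID (uuid4). The base URL below is a hypothetical placeholder.
import uuid

def output_link(base_url: str = "https://fairist.example.org/output") -> str:
    """Return a retrieval URL that cannot easily be guessed."""
    return f"{base_url}/{uuid.uuid4()}"

print(output_link())  # e.g., https://fairist.example.org/output/1b9d6bcd-...
```

An example of the output is shown in Table 2.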

Table 2

Example Output from FAIRIST Showing Recommendations.


FAIRIST Recommendations
Based on your responses, the following recommendations are included for your consideration and/or inclusion in your project’s Data Management Plan.
Types of Data
Research objects associated with the project can be classified into the following groups:
  • Data
  • (Machine Learning) Models
Data Stewardship Practices Planned
Table 1 shows specific data stewardship actions that will be undertaken during the project as they relate to the high-level goals of FAIR.

FAIR DIMENSION | RESEARCH DATA STEWARDSHIP PRACTICES PLANNED

Findable
  • Research products will be posted to the Project website.
  • Data will be assigned a unique identifier per community best practices and will be referenced on the Project’s website.
  • Metadata and links to related ontologies will be available on the Project website.
  • Where tags exist, schema.org descriptors will be utilized.

Accessible
  • Available via an open, web-accessible folder.
  • All data are open.

Interoperable
  • Code stored on GitHub (and linked from the Project website).
  • Uses libraries included with the code.
  • Both input and output data are in HDF5 format.

Reusable
  • ML model and data will be deposited at OpenML.org.
  • A posted notice will designate research objects as licensed under CC-BY.

Table 1: Data Stewardship Practices Planned by FAIR Dimension

User feedback and testing

FAIRIST questions were developed within the core team and refined as a questionnaire in a document. Once that was converted to the Qualtrics format and initial user testing began, the wording of questions was refined for clarity and brevity. Several questions had to be reworded so that the user input specified would form a grammatically correct sentence. For example, for the question, 'Will your data management plan, or the document this is being developed for, be shared?', the options were: 'Yes, at FAIR Connect'; 'Yes (specify)'; 'No'. In the feedback, the 'Yes' options triggered the inclusion of the sentence, 'This plan will be shared', appended with the variable for the question. The user input for 'Yes (specify)' would produce a grammatically incorrect sentence unless the user knew to begin with a preposition. If the user specified 'my institutional repository', the resulting sentence would be 'The plan will be shared [sic] my institutional repository'. This question was changed to not include a 'specify' option; if the answer is 'Yes', the feedback includes, 'The plan will be shared.' The information on FAIR Connect was moved into the question itself, 'Examples of places to share DMPs include FAIR Connect', with a link to the FAIR Connect website. Where possible, links were embedded into the questions so that users could read more about a topic. For example, if a user answers 'Yes' to 'Will an API be provided?', they receive the follow-up question, 'Are you using a Smart API?', where 'Smart API' links to the website https://smart-api.info/.
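
To make the grammatical pitfall concrete, a small sketch (with illustrative template text, not FAIRIST's actual code) shows how a naively appended 'specify' response breaks the sentence:

```python
# Sketch: appending free-text 'specify' input to a fixed sentence stem only
# yields grammatical output if the user happens to start with a preposition.
def share_sentence(user_text: str) -> str:
    return f"This plan will be shared {user_text}."

print(share_sentence("my institutional repository"))     # ungrammatical
print(share_sentence("at my institutional repository"))  # correct only by chance
```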

Once initial tests were complete, the tool was released to researchers at the University of California San Diego and the University of Porto. A webform was supplied for reporting bugs and feature requests. Several researchers sought out the project team and provided direct feedback. Most indicated that FAIRIST was easy to use and that they would recommend it to colleagues.

Highlights of the feedback:

  • Some reported that FAIRIST made them aware of tools previously unknown to them, including the Smart API mentioned above.
  • One researcher wanted to 'reverse engineer' how to implement FAIR for his project, as his results indicated only that he had plans for Findable and Accessible, but not Interoperable or Reusable.
  • Another researcher noted that FAIRIST made no mention of gateways, software portals used to access other infrastructure, especially to make High Performance Computing (HPC) more accessible.

One of the major takeaways was that FAIRIST's value lay in helping streamline a part of the proposal process. The major bugs uncovered related to the ability to specify custom text or 'other' on most questions. This created extra complexity on the programming side, and many of those responses were difficult to insert into fill-in-the-blank sentences. Based on the researchers' feedback:

  • Some of the 'specify' answer options have been or will be eliminated. Feedback that incorporates 'specify' responses will be rewritten to work with open-ended statements.
  • Relevant questions now include an answer option for gateways.

FAIRIST is available for use at http://fairist.sdsc.edu/.

Feedback can be submitted at https://tinyurl.com/fairist.

5. Conclusion and Future Work

While FAIR implementation depends on factors specific to each research project and domain, and changes as technology evolves, it is possible to provide researchers with concrete advice. Tools such as FAIRIST provide a framework for embedding new information as practices develop and for raising awareness of open science practices. By using Agile methodologies and readily available cloud-based tools to create FAIR tools and resources, implementation advice can be distilled and presented to researchers more rapidly. These takeaways can be packaged not only for use in FAIRIST and tools like it, but also reformatted as outreach and awareness materials that promote both the tools and FAIR+ concepts. Though created to assist researchers and their teams with FAIR implementation and to increase adoption of the FAIR Principles, survey tools with proactive suggestions can assist researchers in other ways. Streamlined tools like FAIRIST can anticipate publishing and other funder requirements.

The preliminary results obtained from researchers are positive, and FAIRIST appears to be a good proof of concept. A more detailed evaluation is needed, in which the version of FAIRIST refined through these preliminary results, along with its refined questionnaire, is made available to a larger audience. The analysis of those data will open new paths for improvement. A call for feedback will be issued through partnerships with research data consortia and other organizations active in both research and FAIR practice development. In the interim, feedback received from users will be considered and used to incrementally improve FAIRIST. The team is contemplating a feature that addresses the request to reverse engineer the survey, which would allow one to view the full set of FAIR implementation steps. Further feedback from researchers will be gathered to inform what would be most useful and how it should be presented.

Future work should include a wider review of the survey options and outputs by experts in information science and research computing. It would be more sustainable if these questions and survey techniques were adopted by an existing tool. However, if that does not occur, an advisory board should be formed to guide decisions. For example, one of the questions, 'Where will your ML datasets be shared?', provides several options. Should the survey reflect current practices, or eliminate options the community determines to be suboptimal? Qualtrics as a platform is not a long-term solution for FAIRIST because of its output limitations: users wait for an email with a link to feedback rather than seeing the information immediately. However, in the short term, Qualtrics enables rapid, incremental improvements based on user testing and input from other experts. Once FAIRIST's questions and output have been well tested and refined, FAIRIST should migrate to a custom, efficient, stand-alone Python script and/or be integrated with an existing DMP tool. If FAIRIST remains a stand-alone tool, it should be converted to an application with a database for storing surveys and accessing past FAIRIST output. The output should be immediately available and also formatted as machine-readable output using an established standard, in addition to the text meant for DMPs. Future work could include connecting FAIRIST to data sources, so that funder and program specifics could influence the questions asked and the feedback given. This would also allow for the specification of resources by PID, such as specific equipment and standards. This is already present to some degree in Argos (see Section 2); a potential path would be to fold FAIRIST's features into an existing platform such as Argos.

There are numerous other sources to mine for potential questions and implementation suggestions. This work focused on computer science and AI because of the authors' backgrounds and a perceived gap in implementation advice. A future version of FAIRIST could include custom options tailored to advice for specific domains. This is beyond the complexity that can be handled in Qualtrics but would be possible in a future iteration of FAIRIST. For example, for projects from the Earth Sciences that will produce a new domain repository, FAIRIST could offer the option to include the repository in the Magnetics Information Consortium () or the Council of Data Facilities (CDF) consortium (). At the moment, this option is offered to anyone who indicates that the project will create a domain repository, regardless of scientific domain. FAIRIST could be further tailored to research needs by asking custom questions based on the agency funding source. As funders or institutions implement machine-actionable DMPs, FAIRIST could also provide its implementation guidance in a machine-readable format, e.g., triples in RDF. This could then be used to automatically verify compliance with the planned research data management practices.
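
To illustrate what such machine-readable guidance might look like, the following minimal sketch emits one planned practice as RDF triples, assuming the rdflib Python library; the namespace and property names are hypothetical placeholders rather than an established vocabulary such as the RDA DMP Common Standard:

```python
# Sketch: express one planned data stewardship practice as RDF triples.
# The EX namespace and property names are hypothetical, for illustration only.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("https://example.org/fairist/")

g = Graph()
plan = URIRef("https://example.org/fairist/plan/123")  # hypothetical plan identifier
g.add((plan, EX.fairDimension, Literal("Reusable")))
g.add((plan, EX.plannedPractice,
       Literal("ML model and data will be deposited at OpenML.org.")))

print(g.serialize(format="turtle"))  # machine-readable Turtle output
```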