Implementing Informatics Tools with Data Management Plans for Disease Area Research

Vivek Navale; Matthew McAuliffe

Introduction

Data Management Plans (DMPs) serve as planning tools that describe the type of data and metadata produced during a research project, the standards used during data collection, and the designated repositories for storage, along with information on the software tools required for access. Traditionally, DMPs have been considered static, human-readable documents; more recently, however, there is an impetus for DMPs to be machine-actionable. Embedding DMPs in existing workflows can improve the experience of stakeholders, including funding agencies, researchers, repository managers, publishers, and others. DMPs are integral to the research data life cycle, which involves the collection, processing, validation, storage, analysis, and access of data. Recently, the US National Institutes of Health (NIH) finalized its policy for Data Management and Sharing. The policy highlights the importance of good data management practices and enables sharing of scientific data generated from NIH-funded or NIH-conducted research. The policy is supported by supplemental information, including guidance on privacy protection when sharing human research participant data and the elements of Data Management and Sharing Plans.

In biomedical clinical research, research hypotheses are typically tested, and the results depend on the accuracy and reliability of the data collected during the study. Clinical data management involves various stages that can be integrated by implementing DMPs. The plans include designing Case Report Forms (CRFs) to organize the clinical protocol-specific information collected for a research project. The data fields in CRFs should be clearly defined and consistent throughout. The CRFs should be supported by specifications for data and metadata submission procedures, validation and discrepancy resolution methodology, extraction, coding, audit trail management, and secure storage within a database.

CRFs have traditionally been paper-based, involving hand-written notes; however, electronic CRFs (eCRFs) are increasingly being used to provide consistency during clinical research data collection. CRF development occurs at the very beginning of the research study, during the protocol development and approval process. Responsibility for the CRFs resides primarily with the Principal Investigator and the team conducting the research study. As a best practice for clinical research, only required data should be collected. By using standards that apply to the research protocols, the CRFs can further increase data quality during a research study.

Informatics software can be used to support clinical DMPs. It can facilitate the creation of CRFs, import existing paper-based forms into electronic CRFs, and provide procedures for validating the data collected during the study. Recently, clinical DMPs have been combined with mobile applications to collect data for a community-level disease study.

In this paper, we discuss informatics tools that researchers use to support clinical DMPs for advancing Traumatic Brain Injury (TBI) and Parkinson’s Disease (PD) research. The tools are part of the Biomedical Research Informatics System (BRICS) developed for several National Institutes of Health (NIH) biomedical research programs. We discuss the software services that are used in the research projects and explain how the informatics tools support the elements of the NIH Data Management and Sharing plan. The tools discussed lead to the sharing of de-identified data, with the goal of the data being FAIR (findable, accessible, interoperable, and reusable) within designated disease-specific digital repositories that are trustworthy.

Data collection

The NIH and the US Department of Defense provide research grants that support TBI and PD studies and require submission of data to the designated data platforms: the Federal Interagency Traumatic Brain Injury Research (FITBIR) Informatics System and the Parkinson’s Disease Biomarker Program (PDBP), respectively. Within a disease area (e.g., TBI or PD), clinical research is protocol-specific, which involves collecting data on specific variables (data elements) that enable the analyses needed to test research hypotheses. For TBI and PD researchers, using Common Data Elements (CDEs) as part of the clinical DMP is recommended. A CDE is defined as a fixed representation of a variable collected within a clinical domain, interpretable unambiguously in human- and machine-readable forms. The National Institute of Neurological Disorders and Stroke (NINDS) provides detailed information on CDEs and has also developed CRF templates (the CRF library) with data dictionaries for different neurological diseases. The data dictionaries comprise data elements, form structures, and electronic forms. A data element has a name and a precise definition, with permissible values when applicable. A data element directly relates to a question on a form, and the form structure serves as the container for data elements.
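The relationship between data elements and form structures described above can be illustrated with a minimal sketch. The class layout and the element names below are assumptions made for illustration; they are not taken from the NINDS CDE catalog or from the BRICS data dictionary implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class DataElement:
    """One variable in a data dictionary (hypothetical layout)."""
    name: str
    definition: str
    permissible_values: Optional[tuple] = None  # None means free-form

    def accepts(self, value: str) -> bool:
        # A value is acceptable when no permissible values are defined,
        # or when it is one of the listed values.
        return self.permissible_values is None or value in self.permissible_values


@dataclass
class FormStructure:
    """Container that groups the data elements asked on one form."""
    name: str
    elements: dict = field(default_factory=dict)

    def add(self, element: DataElement) -> None:
        self.elements[element.name] = element


# Illustrative names only; not actual CDEs.
form = FormStructure("ExampleAssessment")
form.add(DataElement("HandednessTyp", "Hand preference of the participant",
                     ("Left", "Right", "Both")))
form.add(DataElement("VisitNotes", "Free-text notes recorded at the visit"))
```

Keeping the permissible values on the element, rather than on the form, reflects the idea that the same element can be reused across forms with one fixed meaning.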

As a first step, researchers are asked to complete a data submission form (a component of DMP) that is reviewed and approved by the data access committee. Creating a research study in the designated data repository (e.g., TBI and PD repositories) and associating the submission form is part of the data management process. The study metadata includes information on the organization, Principal Investigator, funding source and IDs, study type(s), start and end dates for grants, therapeutic agents used, sample size, data types, forms used, and publications.

The Protocol and Form Research Management System (ProFoRMS) provides researchers with services to manage research protocols when collecting clinical data. Its electronic CRF system supports scheduling patient visits, collecting and adding new data, modifying previously collected data entries, and correcting any discrepancies before submission to designated repositories. The software tools used to implement ProFoRMS have been discussed in an earlier publication. The CDE-based data dictionaries are available on the FITBIR and PDBP data platforms. ProFoRMS also supports automatic validation against the data dictionaries for TBI and Parkinson’s disease. Researchers have the option to collect data by other methods (e.g., REDCap); however, the output files from other methods are validated against the data dictionaries before submission to the designated repositories for TBI and Parkinson’s disease.

Data de-identification

A random alphanumeric unique identifier that is not directly generated from personally identifiable information (PII) is assigned to each patient. The BRICS privacy-preserving record linkage tool, also known as the Global Unique Identifier (GUID) tool, creates one-way encrypted hash codes, allowing the PII to reside only at the researcher’s site. The GUID tool is available through the FITBIR and PDBP platforms. This approach of using unique identifiers allows for the tracking of patients who may be enrolled in multiple studies.
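The general idea of this kind of record linkage can be sketched as follows. This is not the BRICS GUID algorithm, whose details are not given here; the PII fields, the use of HMAC-SHA-256, the shared key, and the `EX` identifier prefix are all assumptions chosen for illustration.

```python
import hashlib
import hmac
import secrets
import unicodedata


def pii_hash(first: str, last: str, birth_date: str, shared_key: bytes) -> str:
    """One-way keyed hash of normalized PII, computed at the researcher's site."""
    normalized = "|".join(
        unicodedata.normalize("NFKC", part).strip().upper()
        for part in (first, last, birth_date)
    )
    return hmac.new(shared_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


class GuidRegistry:
    """Central lookup that sees only hash codes, never PII (hypothetical)."""

    def __init__(self) -> None:
        self._by_hash = {}

    def get_or_create(self, hash_code: str) -> str:
        # The identifier is random, so it cannot be derived from the hash;
        # the same hash always maps back to the same identifier.
        if hash_code not in self._by_hash:
            self._by_hash[hash_code] = "EX" + secrets.token_hex(6).upper()
        return self._by_hash[hash_code]
```

In this sketch, two studies that enroll the same patient compute the same hash code locally and therefore receive the same identifier from the registry, while the registry never receives the name or birth date.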

Data validation

The file format for data submissions is comma-separated values (CSV), structured to be consistent with CDE variable names and data values. A validation tool is available to support the data repositories and the ProFoRMS module. The tool compares the submitted values with the defined and/or acceptable ranges for TBI and Parkinson’s disease CDEs. Any errors identified during this process are corrected before a data submission package is produced for uploading to a designated repository.
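A check of this kind can be sketched in a few lines. The dictionary layout, the element names, and the rule keys (`required`, `min`, `max`, `values`) are hypothetical and serve only to show how submitted values might be compared with permissible values and accepted ranges; the actual validation tool is not described at this level of detail.

```python
import csv
import io

# Hypothetical dictionary: element name -> rule.
DICTIONARY = {
    "GUID": {"required": True},
    "AgeYrs": {"required": True, "min": 0, "max": 120},
    "HandednessTyp": {"required": False, "values": {"Left", "Right", "Both"}},
}


def validate_csv(text: str, dictionary: dict) -> list:
    """Return a list of (row_number, column, message) for every problem found."""
    errors = []
    reader = csv.DictReader(io.StringIO(text))
    # Row 1 is the header: every column must be a known data element.
    for column in reader.fieldnames or []:
        if column not in dictionary:
            errors.append((1, column, "column is not a known data element"))
    for row_number, row in enumerate(reader, start=2):
        for name, rule in dictionary.items():
            value = (row.get(name) or "").strip()
            if not value:
                if rule.get("required"):
                    errors.append((row_number, name, "required value is missing"))
                continue
            if "values" in rule and value not in rule["values"]:
                errors.append((row_number, name, f"{value!r} is not a permissible value"))
            if "min" in rule or "max" in rule:
                try:
                    number = float(value)
                except ValueError:
                    errors.append((row_number, name, f"{value!r} is not numeric"))
                    continue
                if number < rule.get("min", number) or number > rule.get("max", number):
                    errors.append((row_number, name, f"{value} is outside the accepted range"))
    return errors
```

Returning all problems at once, rather than stopping at the first, matches the workflow in which errors are corrected before the submission package is produced.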

Data storage

Research study information is stored within the TBI and Parkinson’s disease repositories. Management of a research study within a repository prompts researchers to describe in detail the data collected to make data accessible to users. The information within repositories contains patient assessment (form) data, imaging, electroencephalogram, magnetoencephalography, and derived genomics data.

Data access

As raw data is stored in the repositories, initially, access is limited to the Principal Investigator and their team members who can share with other researchers associated with their work. Currently, the TBI and PD data access committee permits researchers to maintain the data in a private state for a year after the research grant has ended, but after one year, the data moves to a shared state. All approved data users have access to the shared data. The data repository also provides an interface for generating digital object identifiers for a study that can be referenced in research articles.
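The one-year rule can be expressed as a small function. This is a sketch under stated assumptions: the function name, the state labels, and the treatment of the exact anniversary date as the first shared day are illustrative choices, not repository policy.

```python
from datetime import date


def sharing_state(grant_end: date, today: date) -> str:
    """Return 'private' until one year after the grant end date, then 'shared'."""
    try:
        release = grant_end.replace(year=grant_end.year + 1)
    except ValueError:
        # Grant ended on 29 February; use 1 March of the following year.
        release = date(grant_end.year + 1, 3, 1)
    return "shared" if today >= release else "private"
```

For example, a study whose grant ended on 30 June 2022 would remain private through 29 June 2023 under these assumptions.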

Metadata and study summaries are also made available via the FITBIR and PDBP public sites. FITBIR provides a metadata visualization tool that helps in searching for research studies (https://fitbir.nih.gov/visualization).

Table 1 illustrates the various elements of the NIH Data Management and Sharing Plan. The plan highlights the importance of providing required information on the amount and type of data (e.g., imaging, genomic, mobile, survey) being collected, the level of aggregation (e.g., individual, aggregated, summarized), and the degree of data processing that has occurred (i.e., raw or processed data). It requires information on standards (data formats, dictionaries, identifiers, definitions, and associated documentation) used when collecting data. Also, the plan should provide information on how data and metadata will be findable and identifiable (e.g., persistent unique identifier or other indexing tools), maintain privacy and confidentiality (i.e., de-identification, certificates of confidentiality, and other protective measures), and identify the repository(ies) where the scientific data and metadata will be preserved. Information should also indicate when the scientific data and metadata will be made available to other users and specify the duration of accessibility to the data.

Table 1

Associating Data Management and Sharing Plan Elements with Informatics Tools.


DATA TYPE	DATA STANDARD	DATA PRESERVATION	DATA ACCESS AND SHARING	SERVICES AND TOOLS	OVERSIGHT OF DATA MANAGEMENT
TBI patient clinical data, imaging data, bio-samples	FITBIR Common Data Elements, data dictionaries	FITBIR repository, CSV files, DICOM images	Controlled access, DAC approvals needed, meta-studies	BRICS service modules: GUID, ProFoRMS, FITBIR repository, Query tool	DOD, CIT, and NINDS (https://fitbir.nih.gov/)
PD patient clinical data, imaging data, bio-samples, genomics data (VCF)	PDBP Common Data Elements, data dictionaries	PDBP repository, CSV files, DICOM images	Controlled access, DAC approvals needed, meta-studies	BRICS service modules: GUID, ProFoRMS, PDBP repository, Query tool	CIT and NINDS (https://pdbp.ninds.nih.gov/)
Eye disease clinical data, genomics data (processed)	NEI Common Data Elements, data dictionaries, LOINC data standards	NEI repository, CSV files	Controlled access, DAC approvals needed, meta-studies	BRICS service modules: GUID, ProFoRMS, NEI repository, Query tool	NEI and CIT (https://eyegene.nih.gov)

In light of the NIH Data Management and Sharing Policy, TBI, PD, and eye disease are provided as examples to illustrate the information needed for the plan elements. The type of data collected, the CDEs used during collection, and the designated repositories for each of the examples are unique to biomedical programs. For the examples shown in Table 1, the services and software tools used are similar.
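As one possible step toward a machine-actionable plan, a row of Table 1 can be written as a structured record and checked for completeness. The key names and the `missing_elements` helper below are assumptions made for illustration; the values are transcribed from the TBI row of Table 1, and no standard DMP schema is implied.

```python
import json

# The six element names follow the column headings of Table 1.
PLAN_ELEMENTS = (
    "data_type", "data_standard", "data_preservation",
    "data_access_and_sharing", "services_and_tools", "oversight",
)

tbi_plan = {
    "data_type": ["patient clinical data", "imaging data", "bio-samples"],
    "data_standard": ["FITBIR Common Data Elements", "data dictionaries"],
    "data_preservation": ["FITBIR repository", "CSV files", "DICOM images"],
    "data_access_and_sharing": ["controlled access", "DAC approval", "meta-studies"],
    "services_and_tools": ["GUID", "ProFoRMS", "FITBIR repository", "Query tool"],
    "oversight": ["DOD", "CIT", "NINDS"],
}


def missing_elements(plan: dict) -> list:
    """List the plan elements that are absent or empty."""
    return [name for name in PLAN_ELEMENTS if not plan.get(name)]


# A serialized form that other tools in a workflow could read.
serialized = json.dumps(tbi_plan, indent=2)
```

A record of this kind could be validated automatically when a study is created, which a static, human-readable document does not allow.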

FAIR data and trustworthy repositories

Clinical DMPs play an important role in enabling data to be FAIR. Maintaining the confidentiality of the patients is of paramount importance. The DMPs for TBI and PD require data de-identification as a prerequisite step during data processing. With the assignment of unique identifiers, data from the same patient is findable and can be integrated as needed. Data accessibility is governed by the data repository policies and requires approval by the data access committees. The metadata summaries are publicly available through the FITBIR and PD data platforms.

The use of data dictionaries and associated CDEs for TBI and PD research studies provides consistency in data collection, improves data quality, and facilitates integrating data for different studies during analysis work. Also, the adoption of research community-recommended standards during data collection and subsequent preservation in centralized repositories for TBI and PD leads to the trustworthiness of data.

Conclusion

Clinical DMPs are an integral part of the research data life cycle, which involves collection, processing, validation, and storage within repositories. Informatics tools provide direct support for clinical DMPs to effectively establish the confidentiality, integrity, and accuracy of clinical records. Maintaining patient data confidentiality is essential within a clinical setting. To ensure data quality during the de-identification process, it is important to use clearly defined data concepts and variables, data dictionaries, and systems that support data curation and preservation. Electronic data collection with eCRFs that are associated with data dictionaries can reduce the time involved in data curation. Validating data before storage improves data quality and promotes the trustworthiness of repositories.

Data Science Journal

Practice Papers

Abstract