Introduction

High-throughput technologies are increasingly being employed in biomedical- and healthcare-informatics research, producing large biological and clinical data sets at rapid speed (). Modern biomedical and clinical research is thus characterised by the exponentially increasing volume of a variety of data types and structures, produced and processed at unprecedented velocity (). Integrating these different sources of information holds great potential to elucidate the aetiologies of complex medical conditions, develop novel treatments for such conditions and revolutionise modern health care with the incorporation of personalised medicine, predictive modelling and clinical decision support, and improved disease and safety surveillance ().

Stroke, defined as an acute focal or global neurological deficit, results from spontaneous haemorrhage or infarction of the central nervous system with objective evidence of infarction or haemorrhage irrespective of duration of clinical symptoms (). It remains one of the primary causes of brain injury, disability and death worldwide (; ). Currently, big data analytics are employed in stroke research to elucidate the genetic and environmental underpinnings of stroke (; ), as well as to research improved methods of stroke health care, such as revolutionising visual analytics, employing predictive analytics in hypertension patients and employing telecardiology as a method of care ().

Large-scale data analytics have introduced new challenges to biomedical researchers, including the storage, management, sharing and analyses of datasets (). The datasets require powerful and novel technologies to extract biologically meaningful information and conclusions, as well as enable more broad-based health-care solutions (). Standardising the methods in which clinical and research data are collected, reported, shared, managed and (or) stored can support the resolution of these challenges, by enhancing data compatibility, interoperability, reproducibility and re-use (), and facilitating data sharing and collaboration (). This can be particularly useful for stroke research in low-resource settings, where primary research is historically epidemiological in nature, and the technological capacity for appropriate case identification is lacking ().

Biomedical standardisation efforts are being led globally by the Global Alliance for Genomics and Health (GA4GH) (www.ga4gh.org), a policy-framing and technical standards-setting organization, as well as FAIRsharing (www.fairsharing.org), a dynamic standards database which aims to promote the FAIR (Findable, Accessible, Interoperable, and Reusable) principles (). Additionally, the PhenX (consensus measures for Phenotypes and eXposures) Toolkit (https://original-phenxtoolkit.rti.org/index.php) promotes standardised data collection by recommending a catalogue of standard measures of phenotypes and environmental exposures for use in biomedical research, although some measures are limited in terms of applicability to low-resource settings ().

Drawing from the aforementioned resources, as well as existing data collection measures, data reporting standards, controlled vocabularies, data dictionaries and relevant ontologies, the Human Heredity and Health in Africa’s (H3Africa) Bioinformatics Network’s (H3ABioNet, www.h3abionet.org) (; ) Data & Standards work package aimed to develop domain/field-specific research data reporting guidelines for several diseases. The current report focused on the development of the stroke-specific research data reporting guideline entitled, ‘The Minimum Information Required Guideline for Stroke Research and Clinical Data Reporting (Version 1.0)’.

Methods

A list of reporting recommendations for the guideline was proposed based on a review of existing literature and online resources. These resources included data collection methods hosted on the PhenX Toolkit, including the measure for collecting stroke history, and experimental reporting guidelines hosted on FAIRsharing, including The Minimum Information About a Proteomics Experiment (MIAPE) () and The Minimum Information required for DMET experiment (MIDE) (). These recommendations were also harmonised with the H3Africa Standard Case Report Form (CRF) (www.h3abionet.org/data-standards/datastds). A reporting guideline was drafted based on these recommendations and then subdivided into three main sections to illustrate the different levels and users of the data: participant-level information (which details information specific to the study participants), study-level information (which details information specific to the study) and experiment-level information (which details information specific to the experiments within a study).

Thereafter, the drafted guideline was reviewed by a broad range of international stroke researchers and clinicians, using an anonymous online survey, constructed to evaluate, harmonise and consolidate the proposed recommendations, identify which recommendations represented “essential” or “optional” information, propose additional recommendations and remove any existing reporting, ontology, nomenclature, and unit inconsistencies. The online survey was constructed, and study data were collected and managed using REDCap (Research Electronic Data Capture) 7.5.0, hosted at The Centre for Proteomic and Genomic Research (CPGR) (). The online survey consisted of 116 fields (Supplementary File 1).

Once harmonised, the recommendations (henceforth referred to as elements) were manually defined using ontologies found through the BioPortal search engine (), the Ontology Lookup Service (OLS) at the European Bioinformatics Institute (EMBL-EBI) () and the Zooma annotation tool.

Following completion, an associated eXtensible Markup Language (XML) schema was designed for implementation in REDCap. The schema was designed to carry all the data and metadata within the reporting guideline, while maintaining the associations between them, to allow the exchange of clinical data between dissimilar health information or research systems without losing the semantics and structure of the reported data. The schema also defines the datatype, atomic units and validation rules for each element, to ensure reporting correctness.
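The kind of per-element rule described above (a datatype, an atomic unit and a validation rule) can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the published schema: the `ElementRule` class, the `RULES` table and the numeric bounds are hypothetical and were chosen here only to make the example concrete.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class ElementRule:
    """Hypothetical validation rule for one reporting element."""
    name: str
    datatype: type
    unit: Optional[str] = None       # atomic unit, e.g. "cm" or "mmHg"
    minimum: Optional[float] = None  # illustrative bound, not from the guideline
    maximum: Optional[float] = None  # illustrative bound, not from the guideline

    def validate(self, value: Any) -> List[str]:
        """Return a list of problems; an empty list means the value passed."""
        # Accept ints where a float is expected, but never booleans.
        accepted = (int, float) if self.datatype is float else (self.datatype,)
        if isinstance(value, bool) or not isinstance(value, accepted):
            return [f"{self.name}: expected {self.datatype.__name__}"]
        problems = []
        if self.minimum is not None and value < self.minimum:
            problems.append(f"{self.name}: below {self.minimum} {self.unit}")
        if self.maximum is not None and value > self.maximum:
            problems.append(f"{self.name}: above {self.maximum} {self.unit}")
        return problems


# Element names follow Table 1; the bounds are assumptions for illustration.
RULES = {
    rule.name: rule
    for rule in (
        ElementRule("Date of Birth", date),
        ElementRule("Average Height", float, "cm", 30, 280),
        ElementRule("Average Systolic Blood Pressure", float, "mmHg", 40, 300),
    )
}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Check every reported value against its rule and collect the problems."""
    problems = []
    for name, value in record.items():
        rule = RULES.get(name)
        if rule is None:
            problems.append(f"{name}: not a known element")
        else:
            problems.extend(rule.validate(value))
    return problems


record = {
    "Date of Birth": date(1961, 4, 2),
    "Average Height": 171.5,
    "Average Systolic Blood Pressure": 410,  # implausible, should be flagged
}
print(validate_record(record))
```

Keeping the rules in one table, separate from the checking logic, mirrors the role of a schema: the same rules can be applied by any system that receives the data.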

Results

The online survey was completed by 20 international stroke specialists. The majority of respondents were based in Africa (10), followed by America (4), Europe (4) and Australia (2). Of these respondents, 28% were working as clinicians, 15% as researchers and 57% as dual clinician-researchers. The largest group of respondents (38%) had between 10 and 20 years’ experience in the field, whilst 34% had more than 20 years’ experience, 15% had between 5 and 10 years’ experience and 13% had less than 5 years’ experience. Figures 1 and 2 illustrate the survey response to the proposed elements. Furthermore, respondents proposed additional elements, which shaped the final structure of the reporting guideline. These include, but are not limited to, Diet and Dyslipidaemia. The raw survey results can be found in Supplementary File 2.

Figure 1 

Survey response to proposed participant-level information.

Figure 2 

Survey response to proposed study- and experiment-level information.

The Minimum Information Required Guideline: Stroke Research and Clinical Data Reporting is summarised in Table 1.

Table 1

The Minimum Information Required Guideline: Stroke Research and Clinical Data Reporting.

| Elements | Importance | Definition |
| --- | --- | --- |
| Participant-level information | | |
| Demographics > Date of Birth | E | The calendar date on which a participant was born. |
| Sex | E | The classification of the participant’s sex. |
| Self-reported Race/Ethnicity | E | Membership to social group based on a common heritage. |
| Country of Birth (COB) | E | The country that the participant was born in. |
| Country of Residence | O | The country that the participant resides in. |
| Native Language(s) | E | The primary systematic means of communication used in the participant’s household. |
| Tribal Affiliation | O | The tribe which the participant is affiliated to. |
| Father’s Country of Birth | O | The country in which the participant’s biological father was born. |
| Mother’s Country of Birth | O | The country in which the participant’s biological mother was born. |
| Lifestyle Factors > History of Hypertension > Has a healthcare worker ever said that you have high blood pressure or hypertension? | E | Participant’s background regarding high blood pressure or hypertension. |
| If yes, then at what age were you first told this? | | |
| FOR WOMEN ONLY: Was this during pregnancy? | | |
| Have you ever taken medication for hypertension/high blood pressure? | | |
| If yes, then at what age did you begin taking medicine for this? | | |
| Physical Activity > 7-Day Frequency | E | The number of occurrences of physical activity per unit time (7 days). |
| Time | | The average time spent per physical activity (in min). |
| Intensity | | The average energy expended per physical activity. Light exercise is 20–60 minutes and elevates heart rate to 35–60% of maximum heart rate (e.g. housework, gardening, slow walking); moderate exercise is 20–60 minutes and elevates heart rate to 35–60% of maximum heart rate (e.g. basketball, single tennis, brisk walking); strenuous exercise elevates heart rate to over 60% of maximum heart rate (e.g. jogging, swimming, bicycling). |
| Alcohol Use > Lifetime Use | E | A description of an individual’s current and past experience with alcoholic beverage consumption. |
| Age of Initiation | | The age of initiation of alcoholic beverage consumption. |
| 30-Day Frequency | | The number of occurrences of alcoholic beverage consumption per unit time (past 30 days). |
| 30-Day Quantity | | A record of the quantity of alcohol consumption (in standard drinks) (past 30 days). |
| Tobacco Use > Lifetime Use | E | Record of whether the participant has ever used any tobacco product during his or her entire life. |
| Lifetime Frequency | | |
| Age of Initiation | | The age of initiation of tobacco use. |
| Recreational Drug Use > Lifetime Use | E | Record of whether the participant has ever used a drug during his or her entire life. |
| Age of Initiation | | The age of initiation of drug use. |
| 30-Day Type | | A record of the participant’s type of drug use within the past 30 days. |
| 30-Day Frequency | | The number of occurrences of drug use per unit time (past 30 days). |
| Diet | E | The customary allowance of food and drink taken by a person from day to day. |
| Anthropometrics > Average Height | E | The vertical measurement of distance from the sole to the crown of the head with body standing on a flat surface and fully extended (in cm). Averaged over 3 measurements. |
| Average Weight | E | The measurement of mass or quantity of heaviness of an individual (in kg). Averaged over 3 measurements. |
| Waist Circumference | O | The abdominal circumference at the navel (in cm). |
| Head Circumference | O | A circumferential measurement of the head at the widest point (in cm). |
| Body Surface Area | O | A measure of the 2-dimensional extent of the body surface (i.e. the skin) (in m²). |
| Prosthesis (if applicable) | E | Location of a device which is an artificial substitute for a missing body part or function. |
| Blood Pressure > Average Systolic Blood Pressure | E | The average pressure exerted into the systemic arterial circulation during the contraction of the left ventricle of the heart (in mmHg). |
| Average Diastolic Blood Pressure | | The average pressure exerted into the systemic arterial circulation during cardiac ventricular relaxation and filling (in mmHg). |
| Adverse Drug Reactions > Medication | O | The drug product which caused a detrimental or unintended response associated with the use of a medication. |
| Type | | The type of detrimental or unintended response associated with the use of a medication. |
| Date | | The calendar date on which the ADR occurred. |
| Dyslipidaemia > High-Density Lipoprotein (HDL) | E | A lipoprotein metabolism disorder characterized by decreased levels of high-density lipoproteins, or elevated levels of plasma cholesterol, low-density lipoproteins and/or triglycerides. |
| Low-Density Lipoprotein (LDL) | | |
| Triglycerides | | |
| Stroke History > History of Stroke > Were you ever told by a doctor or healthcare worker that you had a stroke, TIA, mini-stroke, transient-ischemic attack? | E | A question to determine if the respondent has had a stroke and/or any symptoms related to this event. |
| Have you ever had a sudden painless weakness or numbness on one side of the body, suddenly lost one half of your vision, lost the ability to understand what people are saying or lost the ability to express yourself verbally or in writing? | | |
| Family History of Stroke > Has anyone in your family had a stroke? | E | A record of a patient’s background regarding stroke and stroke-related events of blood relatives. |
| Primary Prevention | E | Primary prevention involves prevention of disease in susceptible individuals or populations through promotion of health and specific protection as distinguished from the prevention of complications or after-effects of existing disease. |
| Secondary Prevention | E | Secondary prevention involves procedures or treatment processes designed to prevent further complications. |
| Consanguinity > Any cases of consanguineous mating in the family? | O | Reproduction between genetically related individuals. |
| ‘If Yes, please specify:’ | | |
| Sample-specific Information > Sample Identifier | E | Name or other identifier of an entry from a biosample database. |
| Sample’s Case or Control Status | O | An indication of a subject’s status as a case or a control for a given study. |
| Consent | E | The planned process that an individual agrees to participate in. |
| Stroke-related Information > Differential Diagnosis | E | Differential diagnosis refers to the process of differentiating one diagnosis from another, and in turn, providing the most fitting diagnosis based on an individual’s presentation. |
| Date of Diagnosis | O | The calendar date of confirmatory diagnostic testing. |
| Stroke Scale | O | Classification systems employed for clinical and research purposes to improve diagnostic accuracy, determine the suitability of specific treatments, monitor change in neurologic impairments, and predict and measure outcomes. |
| Instrumentation | O | The specialized objects, or items of electrical or electronic equipment, employed to perform diagnosis (with versions). |
| Clinical Signs | E | The objective evidence of disease perceptible to the examining healthcare worker. |
| Stroke Outcome | E | The result of an action (stroke). |
| Stroke Impact | E | Disabilities and impairments due to a stroke. |
| Histopathology | E | The visual examination of cells or tissue (or images of them) with an assessment regarding the quality of the cells or tissue. |
| Pre-Stroke Co-morbidities (Systemic) | E | The presence of co-existing or additional medical conditions pre-stroke. |
| Post-Stroke Co-morbidities (Systemic) | E | The presence of co-existing or additional medical conditions post-stroke. |
| Pathogenic Co-morbidities | E | The presence of co-existing or additional pathogenic diseases with reference to an initial diagnosis or with reference to the index condition that is the subject of study. |
| Allergies | O | An immune response or reaction to substances that are usually not harmful. |
| Prescribed Medication > Medication | E | A record of the prescribed drug product currently in use. |
| Dosage | | The size or frequency of a dose of a medicine or drug. |
| Strength | | The amount of the medicine or drug that provides its particular effect. |
| Reason | | The cause of the prescription. |
| Start Date | | The calendar date on which treatment was initiated. |
| Stop Date | | The calendar date on which treatment is to be or was terminated. |
| Non-Prescribed Medication > Medication | E | A record of the non-prescribed drug product use in the past 2 weeks. |
| Dosage | | The size or frequency of a dose of a medicine or drug. |
| Reason | | The cause of the prescription. |
| Start Date | | The calendar date on which treatment was initiated. |
| Stop Date | | The calendar date on which treatment is to be or was terminated. |
| Study-level information | | |
| Study-specific information > Research Institute | E | The name of the organisation affiliated with a specific study. |
| Study Duration | O | The duration of any specifically defined piece of work that is undertaken or attempted to meet requirements (in years). |
| Study Start Date | O | The calendar date on which the project is initiated. |
| Study ID | E | The unique identifier of the project. |
| Disease | E | Clinical entity defined by a set of phenotypic abnormalities resulting from a common physiopathological mechanism with a homogeneous evolution and homogeneous therapeutic possibilities. |
| Clinical Subtype | O | The subdivision of a disease, malformation syndrome, morphological anomaly, biological anomaly, clinical syndrome or particular clinical situation in a disease or a syndrome further defined by its particular clinical presentation. |
| Study Design | E | The nature of the investigation or the investigational use for which clinical study is being done. |
| Study Aim | E | A textual entity describing the study aim. |
| Sample Size | E | The subset number of a larger population, selected for investigation to draw conclusions or make estimates about the larger population. |
| PMID | O | PubMed unique identifier of an article. |
| DOI | O | Digital Object Identifier (DOI) of a published article. |
| Experiment-level information | | |
| General > Biospecimen Type | E | The type of a material sample taken from a biological entity for research purposes. |
| Sample Management Protocol | E | The specifications employed for the management of samples. |
| Quality Control Protocol | E | The specifications employed to ensure a certain level of quality of biospecimens. |
| Experimental Aim | E | A textual entity describing the experimental aim. |
| Experimental Protocol | E | The specifications with respect to the design and implementation of an experiment or set of experiments. |
| Instrumentation | E | Specialised equipment, tools, appliances, and(or) apparatus employed in the experiment(s). |
| Data Analysis | E | The data transformation techniques used to analyse and interpret the data to gain a better understanding of it. |
| Experimental Result | E | The outcome of the experiment or set of experiments. |
| Output Location | O | Full name and location of output (raw or analysed data). |

Footnote: E – Essential; O – Optional.

The information reported using the standard is separated into three fields: participant-, study- and experiment-level information. The standard further divides elements into essential and optional information. Optional elements refer to information which is not necessary for the interoperation of studies within the same field but is useful for integrating studies from varying disease fields. Participant-level information contains 13 subsections of varying essential and optional elements, including Demographics, Lifestyle Factors, Anthropometrics, Blood Pressure, Adverse Drug Reactions, Urine-Related Test Index, Stroke History, Sample-Specific Information, Stroke-related Information, Prescribed Medication, Non-Prescribed Medication, and Therapy. The study-level information includes various elements which describe the details of a given study, including essential elements such as Study ID, Research Institute and Study Design, and optional elements such as Study Duration, Study Start Date, and PubMed Unique Identifier. Finally, experiment-level information includes various elements which describe the experiments within a given study, including essential elements such as Biospecimen Type, Instrumentation, Sample Management Protocol, Quality Control Protocol and Experimental Aim, and optional elements such as Output Location, which describes where the data will be saved. Although descriptions of the Output Location are widely encouraged, this data element remains optional to accommodate scenarios where data are private or under embargo, and(or) where the reporting guideline is used explicitly for internal research data management.
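The distinction between essential and optional elements lends itself to a simple completeness check. The sketch below is illustrative only: the importance flags are copied from a handful of rows of Table 1, while the `IMPORTANCE` mapping, the `missing_essential` function and the example record are hypothetical names and values introduced here.

```python
# Importance flags for a few elements of Table 1 (E = essential, O = optional).
IMPORTANCE = {
    "Date of Birth": "E",
    "Sex": "E",
    "Country of Residence": "O",
    "Study ID": "E",
    "Study Design": "E",
    "Study Duration": "O",
    "Biospecimen Type": "E",
    "Output Location": "O",
}


def missing_essential(record):
    """Return the essential elements that are absent or empty in `record`."""
    return sorted(
        name
        for name, flag in IMPORTANCE.items()
        if flag == "E" and record.get(name) in (None, "")
    )


# Hypothetical, partially completed report.
record = {
    "Date of Birth": "1961-04-02",
    "Sex": "female",
    "Study ID": "STUDY-001",
    "Output Location": "",  # optional, so leaving it empty is acceptable
}
print(missing_essential(record))
```

Optional elements never appear in the result, which reflects the design above: a report that omits Output Location because the data are under embargo is still complete.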

The complete reporting guideline can be found in Supplementary File 3, specifying each element’s data type, collection format and (or) accepted values, and related ontologies and standards. Herein, the Ontology ID column contains the most appropriate ontology which the element is mapped to whilst the Concordant Ontologies and Concordant Standards columns describe ontologies and standards which include similar data elements. These lists are not meant to be comprehensive or exhaustive, but to illustrate the utilization and overlap with existing resources. A comprehensive guideline explaining how to employ the reporting guideline locally can also be found in Supplementary File 4.

An associated XML schema was developed for REDCap implementation, consisting of 3 sections: the participant-, experiment- and study-level information. It can be found in Supplementary File 5. The relationship between these sections is illustrated in Supplementary File 6. The XML schema represents and requests information as outlined in the proposed reporting guideline, and therefore functions as a standard format of the reporting guideline. Importantly, the reporting guideline and the associated XML schema can also be obtained from the H3ABioNet website (www.h3abionet.org/data-standards/datastds), along with a guideline document on how to employ the reporting guideline locally.
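To illustrate how three linked sections can be carried in a single XML document without losing the associations between them, the following sketch uses Python's standard `xml.etree.ElementTree` module. The tag and attribute names (`StrokeReport`, `Study`, `Participant`, `Experiment`, `SampleIdentifier`, `BiospecimenType`) and the identifiers are assumptions made for this example; they are not taken from Supplementary File 5 and do not reproduce REDCap's own project XML format.

```python
import xml.etree.ElementTree as ET


def build_report(study_id, participant_id, sample_id, biospecimen_type):
    """Build a small report with study, participant and experiment sections."""
    root = ET.Element("StrokeReport")
    # Study-level section, identified by its Study ID.
    ET.SubElement(root, "Study", id=study_id)
    # Participant-level section, linked to the study by the `study` attribute.
    participant = ET.SubElement(
        root, "Participant", id=participant_id, study=study_id
    )
    ET.SubElement(participant, "SampleIdentifier").text = sample_id
    # Experiment-level section, linked to both the study and the sample.
    experiment = ET.SubElement(
        root, "Experiment", study=study_id, sample=sample_id
    )
    ET.SubElement(experiment, "BiospecimenType").text = biospecimen_type
    return root


def samples_for_study(root, study_id):
    """List the sample identifiers of participants enrolled in one study."""
    return [
        participant.findtext("SampleIdentifier")
        for participant in root.findall("Participant")
        if participant.get("study") == study_id
    ]


# Serialise and parse again, as a receiving system would.
report = build_report("STUDY-001", "P-0001", "SAMPLE-0001", "whole blood")
xml_text = ET.tostring(report, encoding="unicode")
parsed = ET.fromstring(xml_text)
print(samples_for_study(parsed, "STUDY-001"))
```

Because the links are stored as identifiers within the document itself, they survive the round trip through text, which is the property needed when exchanging data between dissimilar systems.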

Discussion

The paper outlines the development of the Minimum Information Required: Stroke Research and Clinical Data Reporting Guideline. To our knowledge, though ontologies and collection standards have previously been described for stroke-related clinical care and research, no reporting guideline for stroke research and clinical data has previously been proposed or published. Most notably, the Stroke Ontology (https://bioportal.bioontology.org/ontologies/STO-DRAFT) defines the terms and relationships of the knowledge domain of stroke and The Human Phenotype Ontology (HPO) () defines stroke-related phenotypes. Similarly, PhenX, Clinical Data Interchange Standards Consortium (CDISC) and Health Level Seven (HL7) have previously developed and proposed standards fit for clinical data collection in various disease fields. In the development of our reporting guideline, we utilised these existing resources to harmonise collection measures and terms and develop a comprehensive and harmonised data management tool which allows centralised management of both clinical and research data in a complex disease field (stroke) which requires collaborative and inter-disciplinary research. Combining the clinical and research data elements in one standard allows principal investigators to maintain various levels of data access whilst still centralising comprehensive data management and storage. This empowers principal investigators to manage their research data in a coordinated and comprehensive manner, and to maintain the participant-level data associated with various studies and(or) experiments in a user-friendly way (and vice versa).

Employing the reporting guideline can thus add great benefit to stroke research studies, as it references stroke-based ontologies, data dictionaries and collection standards, ensuring comprehensive, harmonised data reporting, which is re-usable and enhances interoperability. The reporting guideline is designed for use by research clinicians and healthcare workers, researchers, data managers and bioinformaticians involved in stroke research, bearing in mind different levels of data access. Given the appropriate levels of data input and access rights, the reporting guideline can be used in both research and clinical settings, whilst defining the information as essential or optional allows it to be adapted to various types of stroke research. Additionally, the reporting guideline goes beyond listing “minimum required” data elements and aims to provide a comprehensive data dictionary and controlled vocabulary with standardised response options, which is scalable and can be adapted for broader or custom use.

In multidisciplinary fields, standardization can often be difficult to implement; the reporting standard is therefore accompanied by an associated platform-specific XML schema. Although XML is not inherently user friendly, it is highly computationally amenable, and the schema was specifically designed with the REDCap platform in mind. It is therefore immediately implementable to promote user friendliness in terms of both data capturing and governance, allowing accurate and seamless duplication in the local setting (). Additionally, the accompanying Recommendations For Use guideline (Supplementary File 4) further enables use and user friendliness. XML has been used extensively for describing data in many applications for storage or transport (). The language, by its design, allows for extensibility and self-description. Its openly documented standards, wide adoption, and support in many applications and existing tools make it a good first choice for describing scientific data that could be exchanged between healthcare systems (). It has previously been used in health reporting for such purposes (; ).

As previously exhibited in oncology research, widespread utilization of the developed reporting guideline can function to reduce data and reporting inconsistency and redundancy across systems, as well as promote collaboration and(or) interoperability between systems (; ; ). Promoting such broad use could allow for improved data mapping in clinical registries, improving data quality and interoperability (). A given standard may be more widely adopted if advocated or endorsed by “omics” databases, funding bodies and scientific journals, geared towards stroke research, specifically. To promote the adoption of the reporting guideline, we hope to employ the reporting guideline within our own consortia studies, and advocate use on an international platform.

In the future, the H3ABioNet’s Data & Standards Work Package aims to develop more domain-specific reporting guidelines which are relevant to both African health and the H3Africa consortia. We also aim to align our efforts with the standardization efforts driven by GA4GH. This will include further refining elements such as ethnicity, diet and prescribed medicine to accommodate African-specific considerations. The Minimum Information Required Guideline: Stroke Research and Clinical Data Reporting aims to promote FAIR reporting and will therefore be added to the FAIRsharing database, as the database provides curation support to resource maintainers, as well as a point of contact for the standard, and related support material (). Bearing in mind the diverse target group the reporting standard aims to accommodate, various methods of implementation will be investigated in the future, to provide comprehensive solutions for collaborative efforts and increase the research data value. Education and training in the use and implementation of these standards will be of high importance to supplement use. Furthermore, additional elements will be investigated for incorporation into the standard, including various environmental factors. Ultimately, the reporting guideline has the potential to support both the H3Africa community as well as the stroke research community at large with current and future research.

Data Accessibility Statement

All data referenced in the article can be found in Supplementary File 2.

Additional Files

The additional files for this article can be found as follows:

File 1

Stroke Online Survey. DOI: https://doi.org/10.5334/dsj-2019-026.s1

File 2

Stroke Online Survey – Raw Results. DOI: https://doi.org/10.5334/dsj-2019-026.s2

File 3

The Minimum Information Required Guideline: Stroke Research and Clinical Data Reporting Data Dictionary (Version 1.0). DOI: https://doi.org/10.5334/dsj-2019-026.s3

File 4

Recommendations For Use Guideline. DOI: https://doi.org/10.5334/dsj-2019-026.s4

File 5

Stroke Research Data Reporting REDCap XML. DOI: https://doi.org/10.5334/dsj-2019-026.s5

File 6

Relationship between the different sections of the reporting guideline. DOI: https://doi.org/10.5334/dsj-2019-026.s6