1 Introduction

Neuroblastoma is a pediatric cancer that forms in certain types of nerve tissue, most commonly arising from the adrenal glands in the abdomen (). It is the most common cancer among newborns and affects thousands of children worldwide (). Ninety percent of neuroblastoma cases are diagnosed by age five. The 5-year survival rate for children with neuroblastoma in England is approximately 67% (). High-risk neuroblastoma is usually treated with intensive chemotherapy, surgery, radiation therapy, and bone marrow or hematopoietic stem cell transplantation.

Knowledge about this pediatric cancer can be discovered not only with laboratory results and clinical trials, but also by analyzing the data contained in the electronic health records (EHRs) of patients. Computational statistical methods and machine learning techniques applied to data derived from structured EHRs can, in fact, be effective tools to infer new knowledge about neuroblastoma and eventually to help medical doctors to develop better treatments ().

Datasets derived from electronic health records, although extremely useful for the progress of scientific research, are often kept private and not shared with the rest of the scientific community, because of privacy issues or because of the lack of a data sharing culture in the hospital or research centre. In recent years, a wave of requests for open data sharing has been launched by researchers around the world, which culminated in the definition of the FAIR principles for data sharing (; ; ).

The advantages of open data sharing have been already demonstrated by important initiatives that have had a profound impact on scientific research. The University of California Irvine Machine Learning Repository (), for example, is an online catalogue of free datasets coming from several domains (biology, medicine, physics, computer science and engineering, social sciences, business, and games) that started in 1987 and now contains 588 different datasets. It includes some derived from electronic health records, but not of neuroblastoma.

Another online resource that shares public datasets is Kaggle. Since 2010, this online community of data scientists has collected and provided open datasets to its users, to be used for scientific competitions or for independent analyses. To date, Kaggle lists around ten thousand public datasets (). In 2017, Google launched the Dataset Search () engine, where users can easily look for public datasets on the internet. Re3data () is another resource worth mentioning: it is a public registry of scientific data repositories available online.

In bioinformatics, Gene Expression Omnibus (GEO) () has been providing thousands of open gene expression and methylation datasets to researchers worldwide, which have led to thousands of studies and publications. GEO also contains gene expression data of neuroblastoma, which led to some significant genetic discoveries (; ). Neuroblastoma bioinformatics data were publicly shared and integrated for the CAMDA 2017 conference Neuroblastoma Data Integration Challenge (; ). Multiple image datasets have also recently been released on public repositories, including ophthalmological images () and cancer images ().

Regarding neuroblastoma, the International Neuroblastoma Risk Group (INRG) recently released the INRG Data Commons (; ), a database of thousands of EHRs of patients with neuroblastoma. Despite its usefulness, access to the INRG Data Commons is restricted to the participants in approved projects: to obtain the data, one needs to fill out and submit a project application, which might or might not be approved by the INRG Data Commons board. Even when a proposal is accepted, this procedure requires processing time: months or even years can go by between the proposal submission and the actual data download, and the resulting delays can negatively affect the project itself. This restricted approach therefore limits the use of the data.

A similar initiative, the Pediatric Cancer Data Commons (; ), was launched by the University of Chicago and provides EHR data of patients with childhood cancers, including neuroblastoma. However, access to this dataset is also restricted to pre-authorized researchers. These access limitations create several problems: pre-authorization can limit data access, research transparency, and the sharing of work that could be the basis for further research.

Since a public free resource containing public open unrestricted data derived from electronic health records of patients diagnosed with neuroblastoma is currently missing, we gathered the five open datasets derived from EHRs of neuroblastoma described in this survey in a website called Neuroblastoma Electronic Health Records Open Data Repository, that we freely released online. Moreover, we converted these datasets into numerical values, making them computer-readable for any computational analyses.
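As an illustration of what converting a clinical table into numerical values can involve, the following Python sketch encodes a few categorical columns as integers. The example records, the column selection, and the integer codes are assumptions made for this illustration; they are not necessarily the encoding used in the repository.

```python
# Invented records whose values follow the style of Table 3
# (yes/no/– and M/F); they do not come from any real patient.
rows = [
    {"sex": "M", "11q-": "yes", "outcome": "NED"},
    {"sex": "F", "11q-": "no", "outcome": "EFS"},
    {"sex": "F", "11q-": "–", "outcome": "NED"},
]

# Assumed integer codes; None marks a missing value.
CODES = {
    "sex": {"M": 0, "F": 1},
    "11q-": {"no": 0, "yes": 1, "–": None},
    "outcome": {"NED": 0, "EFS": 1},
}


def encode(row):
    """Replace each categorical value with its assumed integer code."""
    return {column: CODES[column][value] for column, value in row.items()}


encoded = [encode(row) for row in rows]
print(encoded)
```

Keeping the code table explicit, rather than relying on automatic label ordering, makes it possible to document which integer stands for which clinical category.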

Users and researchers can take advantage of this resource by downloading the datasets and performing their own scientific analyses, which might lead to new discoveries about this pediatric cancer. Moreover, we performed a detailed descriptive analysis of all the clinical features present in the five listed datasets, by providing detailed information about all the variables. Some information about these variables was absent even in the original dataset publications.

We organize the rest of this study as follows. After this Introduction, we describe the datasets of our survey in section 2, and we discuss the main insights of their clinical variables in section 3. We finally outline some conclusions in section 4.

2 Data and methodology

We performed a thorough literature search of scientific articles related to neuroblastoma electronic health records (EHRs) in November and December 2020, using the Google Scholar () search engine. We performed this dataset search by following the guidelines by Marco Pautasso (), looking for the keywords ‘electronic medical records and neuroblastoma’, ‘EMR and Neuroblastoma’, ‘electronic health records and Neuroblastoma’, ‘EHR and neuroblastoma’, and ‘clinical records and neuroblastoma’. Following the example of Khan and colleagues (2021), we collated and screened the results returned from the first ten pages of the search.

We identified nine articles including EHRs data of patients diagnosed with neuroblastoma in their main texts or in their supplementary information.

Five of them have public open datasets released under CC BY 4.0 license and therefore downloadable and usable without non-commercial restrictions (; ; ; ; ), included in the supplementary information of the articles. Three of them contain datasets, but without an open license, and therefore cannot be used (; ; ). Moreover, one article contains an open dataset, but it is related to coronary artery disease and only minimally pertinent to neuroblastoma ().

We report the quantitative characteristics of these five open datasets in Table 1. Each of these five open datasets is available as a CSV file and can therefore be opened with most spreadsheet software, including free and open source programs such as LibreOffice Calc (), or processed with appropriate software packages on any computer, for example in R ().
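The three quantities reported for each dataset (number of patients, number of features, number of missing values) can be obtained directly from a CSV file. The sketch below uses only the Python standard library and an invented in-memory table; the column names, the values, and the set of strings treated as missing are assumptions for illustration.

```python
import csv
import io

# Invented CSV content standing in for one downloaded dataset file.
CSV_TEXT = """age,sex,ferritin,outcome
14,M,120.5,0
9,F,,1
30,F,NaN,0
"""

# Strings treated as missing values (an assumption of this sketch).
MISSING_MARKERS = {"", "NaN", "NA"}


def summarize(csv_text):
    """Return (#patients, #features, #missing values) for a CSV table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    n_features = len(reader.fieldnames)
    n_missing = sum(
        1
        for row in rows
        for value in row.values()
        if value is None or value.strip() in MISSING_MARKERS
    )
    return len(rows), n_features, n_missing


print(summarize(CSV_TEXT))
```

For a real file, the in-memory string would be replaced by the content of the downloaded CSV, and the missing-value markers would need to match those used in that file.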

Table 1

Quantitative characteristics of the analyzed datasets. #patients: number of patients. #features: number of clinical features. #missing values: number of missing data instances.


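| Dataset name | Reference | #patients | #features | #missing values | Table |
|---|---|---|---|---|---|
| dataBB2013 | Banelli et al. () | 121 | 11 | 50 | Table 2 |
| dataCK2018 | Kim et al. () | 20 | 16 | 29 | Table 3 |
| dataEV2013 | Villamón et al. () | 19 | 11 | 13 | Table 4 |
| dataYBC2019 | Choi et al. () | 7 | 10 | 0 | Table 5 |
| dataYM2018 | Ma et al. () | 169 | 13 | 52 | Table 6 |
| interval | | [7, 169] | [10, 16] | [0, 52] | |
| mean | | 67.2 | 12.2 | 28.8 | |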
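The interval and mean rows of Table 1 can be rechecked from the per-dataset values. The following Python sketch copies the numbers from Table 1; the variable and function names are chosen for this illustration only.

```python
from statistics import mean

# Values copied from Table 1: (patients, features, missing values).
TABLE_1 = {
    "dataBB2013": (121, 11, 50),
    "dataCK2018": (20, 16, 29),
    "dataEV2013": (19, 11, 13),
    "dataYBC2019": (7, 10, 0),
    "dataYM2018": (169, 13, 52),
}


def column_summary(index):
    """Return ((min, max), mean rounded to one decimal) for one column."""
    values = [row[index] for row in TABLE_1.values()]
    return (min(values), max(values)), round(mean(values), 1)


for name, index in [("patients", 0), ("features", 1), ("missing values", 2)]:
    interval, average = column_summary(index)
    print(name, interval, average)
```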

We generated the frequency table image shown in Figure 1 with the ggplot2 R package ().

Figure 1 

Presence and absence of the clinical features in the five analyzed datasets. x axis: dataset names. y axis: clinical features of the datasets. dataBB2013: Banelli et al. () (Table 2). dataCK2018: Kim et al. () (Table 3). dataEV2013: Villamón et al. () (Table 4). dataYBC2019: Choi et al. () (Table 5). dataYM2018: Ma et al. () (Table 6).

2.1 Details on the datasets

In this section, we describe the details about the datasets, including the quantitative characteristics of the patients and of the clinical features, and the meaning of each of them.

2.1.1 dataBB2013

The dataBB2013 dataset was released by Banelli et al. () and was previously employed by the same research team in earlier studies (,, ). The dataset contains data from 121 patients, including surviving and deceased subjects, collected at the Istituto Giannina Gaslini hospital (Genoa, Italy) between 1990 and 2004. Each patient has 11 clinical features (Table 2), but one of them is constant: all patients are at stage 4 of the International Neuroblastoma Staging System (INSS).

Table 2

Meaning and values of the features of the dataBB2013 dataset. Number of patients: 121. Number of features: 11. AICR: alive in complete remission. AWD: alive with disease. AWSD: alive with stable disease. CR: complete remission. DOD: dead of disease. GNB: ganglioneuroblastoma. HR: high risk. INRG: International Neuroblastoma Risk Group. MYCN: MYCN oncogene. NaN: not a number. NB: neuroblastoma. NS: not specified, it was impossible to make a more precise diagnosis (). OS: overall survival. PCDHB: Protocadherin Beta Cluster. PFS: progression-free survival. SFN: Stratifin gene. Additional information can be found in the dataset original article by Banelli et al. ().


| Feature | Meaning | Type | Values |
|---|---|---|---|
| age at diagnosis | age at diagnosis | integer | 2, …, 196 |
| ferritin | ferritin serum level | ng/ml | –99, 19, …, 2250 |
| histological category | histological category of the neuroblastoma | categorical | NS, NB, GNB |
| INRG Risk classification | risk group: HR high risk and I/LR intermediate/low risk | categorical | 0, 1 |
| INSS Stage | stage of the tumor (only stage 4 patients) | categorical | 1 |
| MYCN amplification | status of nMYC oncogene: 0: amplified; 1: unamplified | binary | 0, 1 |
| Methylation PCDHB cluster (%) | methylation of 17 genes of the Protocadherin B cluster | percentage | 29.44, …, 88.93 |
| Methylation SFN (%) | methylation of the SFN gene | percentage | 36.6, …, 99 |
| OS | overall survival | float | 0.27, …, 164.47 |
| outcome | clinical outcome | categorical | AICR, AWD, AWSD, CR, DOD |
| PFS | progression free survival | float | 3.7, …, 74.7, NaN |

The dataset includes some typical features, such as age at diagnosis, histological category, risk group, MYCN amplification, overall survival, progression-free survival, and outcome. The outcome indicates if the patient survived or not. This dataset also contains two features about the methylation of a gene family (Protocadherin Beta Cluster, PCDHB) and of a specific gene (SFN). The prognostic role of the methylation of these two genes is the key aspect of the original study ().

The dataBB2013 dataset is the only one including data related to the methylation of the PCDHB gene family and to the methylation of the SFN gene. The methylation of Protocadherin Beta cluster genes is known to be associated with poor prognosis in patients with neuroblastoma (), while the association between the methylation of the protein-coding Stratifin gene and advanced stage, high-risk neuroblastoma was shown by the dataset authors in a previous study ().

This dataset contains only one blood test variable: ferritin. Ferritin is a protein that contains iron and is known to be a clear prognostic factor for neuroblastoma ().

2.1.2 dataCK2018

The dataCK2018 dataset was published by Kim et al. (). It contains data of 20 patients, collected from January 2009 to December 2015 at the Samsung Medical Center of Sungkyunkwan University (Seoul, South Korea). All the patients had chemotherapy and surgery, with or without local radiation therapy, followed by differentiation therapy. Each patient has 16 clinical factors (Table 3), making it the dataset with the highest number of features among the five listed in this article.

Table 3

Meaning and values of the features of the dataCK2018 dataset. Number of patients: 20. Number of features: 16. BM: bone marrow. CR: complete response. CT: chemotherapy. DT: differentiation therapy. F: female. gy: gray units. ID: identifier. LDH: lactic acid dehydrogenase. LN: lymph node. NSE: neuron-specific enolase. M: male. MR: mixed response. PR: partial response. S: surgery. U/L: units per liter. VGPR: very good partial response. VMA: vanillylmandelic acid. mg: milligrams. mo: months. ng/mL: nanograms per milliliter. no: number. In the original article dataset, sex and age are joined in a unique feature called ‘Sex/age (mo)’, and outcome and outcome months are joined in a unique feature called ‘Outcome (mo)’. Additional information can be found in the dataset original article by Kim et al. ().


| Feature | Meaning | Type | Values |
|---|---|---|---|
| 11q– | presence of chromosomal aberration at 11q site | boolean | yes, no, – |
| 17q+ | presence of chromosomal aberration at 17q site | boolean | yes, no, – |
| 1p– | presence of chromosomal aberration at 1p site | boolean | yes, no, – |
| age | age at diagnosis | months | 0, …, 10.2 |
| ferritin | ferritin levels | ng/mL | 15, …, 1638.6 |
| LDH | lactic acid dehydrogenase level | U/L | 539, …, 6200 |
| local RTx dose (gy) | dosage of radiation | float | 15, 23.4, 25.2, – |
| metastatic sites | sites of metastasis | categorical | BM, bone, kidney, liver, LN, lung, mediastinum, muscle, pleura, skin |
| NSE | neuron-specific enolase levels | ng/mL | 7.3, …, 947 |
| outcome | event-free survival (EFS) or no evidence of disease (NED) | boolean | EFS, NED |
| outcome_mo | follow-up | months | 17, …, 91 |
| primary site | tumor primary site | binary | abdomen, mediastinum |
| sex | male or female | binary | M, F |
| tumor response after CT & S | response after chemotherapy and surgery | categorical | CR, MR, PR, VGPR |
| tumor response after DT | response after differentiation therapy (DT) | categorical | CR, MR, PR, VGPR |
| urine VMA | vanillylmandelic acid levels in urine | mg/day | 0.5, …, 53.9 |

The clinical feature list includes several traditional variables such as age, outcome, sex, follow-up duration, tumor primary site, metastatic sites. Some uncommon variables related to the treatment received by the patients are present: local dosage of radiation, response after chemotherapy and surgery, and response after differentiation therapy.

The main finding reported by the authors of the original study was that ‘patients younger than 18 months with stage 4 MYCN nonamplified neuroblastoma had high survival changes, without significant late adverse effects, when treated with alternating cycles of cyclophosphamide (CEDC) and ifosfamide, carboplatin, and etoposide (ICE), followed by surgery and differentiation therapy’ ().

This dataset contains variables of ferritin and LDH levels in the blood, both recognized as prognostic factors for neuroblastoma in the medical community (). It includes a variable related to a urine test (urine vanillylmandelic acid), which is an indicator for neuroblastoma diagnosis in screening tests for children, and a variable of the blood neuron-specific enolase level, which can be used as an auxiliary test for neuroblastoma.

The dataCK2018 dataset has the only cohort for which data of genetic factors are indicated: the 11q–, 17q+, and 1p– chromosomal aberrations. These genetic abnormalities are common in patients diagnosed with neuroblastoma.

2.1.3 dataEV2013

The dataEV2013 dataset was made open by Villamón et al. (). It contains data from 19 patients and 11 clinical variables (Table 4).

Table 4

Meaning and values of the features of the dataEV2013 dataset. Number of patients: 19. Number of features: 11. ADF: alive disease-free. AWD: alive with disease. B: bone. BM: bone marrow. CR: complete response. DOD: died of disease. DOS: died of sepsis. DP: disease progression. DTC: died of treatment complication. F: female. HR-NBL1: High-Risk Neuroblastoma Study 1. INES: Infants Neuroblastoma European Study, SIOPEN protocols. LN: lymph nodes. M: male. N-II-92 and NAR-99: names of national clinical trials in Spain (). PR: partial response. ST: soft tissue. SurPR: surgical partial resection. VGPR: very good partial response. nGNB: nodular ganglioneuroblastoma. pdNB: poorly differentiated NB. uNB: undifferentiated neuroblastoma. Additional information can be found in the dataset original article by Villamón et al. ().


| Feature | Meaning | Type | Values |
|---|---|---|---|
| age at diagnosis | age | months | 9, …, 108 |
| follow-up time | overall survival | months | 1, …, 132 |
| metastases | presence of metastasis | boolean | yes, no |
| outcome | clinical outcome | categorical | ADF, AWD, DOD, DOS, DTC |
| pathology | pathological category | categorical | nGNB, pdNB, uNB |
| protocol treatment | treatment protocol | categorical | HR-NBL1, INES, LNESG1, N-II-92, NAR-99 |
| relapse | if the cancer relapsed or not | boolean | yes, no |
| sex | male or female | binary | M, F |
| stage | stage of the tumor | categorical | 1, 2, 3, 4 |
| time to first relapse | time to first relapse | months | 4, …, 28 |
| treatment response | response to first line treatment | categorical | CR, DP, PR, SurPR, VGPR |

All the 11 features are health-related, and they do not contain any genomics or genetic information. The patients’ samples were collected between 1999 and 2007, at the Spanish Reference Centre for Neuroblastoma Biological and Pathological studies at the time of diagnosis.

The original study focuses on genetic instability and intratumoral heterogeneity with MYCN amplification and 11q deletion. According to the NIH Genetic and Rare Diseases Information Center, the chromosome 11q deletion is “a chromosome abnormality that occurs when there is a missing (deleted) copy of genetic material on the long arm (q) of chromosome 11” (National Institutes of Health ()), which can cause developmental delay, intellectual disability, behavioral problems, and distinctive facial features.

2.1.4 dataYBC2019

The dataYBC2019 dataset was published by Choi et al. (). It contains data from 7 patients, each of them with 10 health indicators (Table 5). This dataset has the smallest cohort of patients among the datasets described in this report, but it is the only complete dataset, without any missing values. The data of the patients were collected from January 2012 to December 2014 in South Korea.

Table 5

Meaning and values of the features of the dataYBC2019 dataset. Number of patients: 7. Number of features: 10. A: amplified. BM: bone marrow. CEC: carboplatin, etoposide, and cyclophosphamide. CR: complete response. CT×5: five cycles of chemotherapy. CT×6: six cycles of chemotherapy. CT×7: seven cycles of chemotherapy. Dx: diagnosis. HDCT: high-dose chemotherapy. L-RT: local radiotherapy. LNs: lymph nodes. MEC: melphalan, carboplatin, and etoposide. MIBG-TM: high-dose 131I-metaiodobenzylguanidine treatment, thiotepa, and melphalan. NA: not amplified. PR: partial response. SCT: stem cell transplantation. TTC: topotecan, thiotepa, and carboplatin. VGPR: very good partial response. m: months. y: years. Additional information can be found in the dataset original article by Choi et al. ().


| Feature | Meaning | Type | Values |
|---|---|---|---|
| age at Dx. | age at diagnosis | years | 1.5, …, 3.5 |
| age at relapse | age at relapse | years | 4.1, …, 8.6 |
| HDCT1 regimen | first high-dose chemotherapy | binary | TTC, CEC |
| HDCT2 regimen | second high-dose chemotherapy | binary | MEC, MIBG-TM |
| interval to relapse | interval to relapse | months | 12, …, 75 |
| MYCN status | amplified (A) or not amplified (NA) | binary | A, NA |
| relapsed sites | relapse sites in the body | categorical | Primary, Brain, Bone, LNs, BM |
| stage at Dx | only metastatic tumors | categorical | 4 |
| treatment prior to haplo-SCT | treatment prior to haploidentical SCT | categorical | Surgery, L-RT, CT×5, CT×6, CT×7 |
| tumor status at haplo-SCT | tumor status at haploidentical SCT | categorical | PR, CR, VGPR |

The clinical variables of this dataset include traditional factors such as age at diagnosis, MYCN status, as well as uncommon features related to the treatment (HDCT1 and HDCT2 chemotherapy, treatment prior to haploidentical stem cell transplantation). Moreover, this dataset contains multiple interesting factors related to cancer relapse: age at relapse, body sites of the cancer relapse, and interval to relapse. This dataset, however, includes no genomics or genetic variable about the patients’ biological profile.

The original study by the dataset curators aimed to verify whether early natural killer cell infusion following haploidentical stem cell transplantation would reduce relapse in patients with neuroblastoma (). In both that study and its dataset, the relapse information was of great importance.

2.1.5 dataYM2018

The dataYM2018 dataset was released by Ma et al. (). It contains data from 169 patients, with 13 clinical factors (Table 6): it is the largest patient cohort among the datasets listed in this study. The data were collected at Children’s Hospital of Fudan University (China) between 2010 and 2015 ().

Table 6

Meaning and values of the features of the dataYM2018 dataset. Number of patients: 169. Number of features: 13. FH: favorable histology. MYCN: MYCN oncogene. UH: unfavorable histology. Additional information can be found in the dataset original article by Ma et al. ().


| Feature | Meaning | Type | Values |
|---|---|---|---|
| age | 0: < 12 months; 1: 12–60 months; 2: ≥ 60 months | integer | 0, 1, 2 |
| autologous stem cell transplantation | autologous stem cell transplantation: 0: no; 1: yes | binary | 0, 1 |
| degree of differentiation | 0: undifferentiated; 1: poorly differentiated; 2: differentiated | categorical | 0, 1, 2 |
| histology prognosis | 1: FH favorable histology; 0: UH unfavorable histology | binary | 0, 1 |
| MYCN status | status of nMYC oncogene: 0: amplified; 1: unamplified | binary | 0, 1 |
| outcome | clinical outcome: 1: dead of disease; 0: alive or lost to follow-up | binary | 0, 1 |
| radiation | if the patient had radiation | boolean | 0, 1 |
| risk | risk group: 0: intermediate-risk; 1: high-risk | categorical | 0, 1 |
| sex | 0: male; 1: female | integer | 0, 1 |
| site | primary tumor site: 0: adrenal gland; 1: mediastinum; 2: others | categorical | 0, 1, 2 |
| stage | stage of the tumor | categorical | 1, 2, 3, 4 |
| surgical methods | total or partial resection | binary | 0, 1 |
| time | overall survival | months | 1, …, 100 |

This dataset includes several traditional variables that can be found in many neuroblastoma electronic health records, such as age, MYCN status, outcome, sex, risk, and overall survival time. It also includes information about treatment, that is if the patient had autologous stem cell transplantation and radiation, and information about the primary tumor site.

The focus of the original study is MYCN amplification. The authors claim that MYCN amplification was an independent adverse prognostic factor in this cohort of patients with neuroblastoma ().

The autologous stem cell transplantation is one of the therapies employed for patients with neuroblastoma ().

2.2 Shared clinical features

Most frequent features. The age at diagnosis is the only variable present in all five datasets, while the outcome is present in four out of five (all the datasets except dataYBC2019). The neuroblastoma stage is present in four out of five datasets, too: all the datasets except dataCK2018. MYCN and sex are each present three times: MYCN in dataBB2013, dataYBC2019, and dataYM2018, and sex in dataCK2018, dataEV2013, and dataYM2018. Seven other variables are present in two datasets (ferritin, histology, interval to relapse, overall survival, primary site, risk, treatment response). All the other features are present in only one dataset.

We show the features present in each dataset in Figure 1.
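The counts given above for the five most frequent features can be reproduced with a small presence/absence structure. In the Python sketch below, the per-dataset sets follow the text of this section; the harmonised feature names (for example `MYCN` for both "MYCN amplification" and "MYCN status") are a simplification made for this illustration, and the remaining features are omitted.

```python
from collections import Counter

# Harmonised names of the five most frequent features, per dataset,
# following the counts reported in this section.
FEATURES = {
    "dataBB2013": {"age", "outcome", "stage", "MYCN"},
    "dataCK2018": {"age", "outcome", "sex"},
    "dataEV2013": {"age", "outcome", "stage", "sex"},
    "dataYBC2019": {"age", "stage", "MYCN"},
    "dataYM2018": {"age", "outcome", "stage", "MYCN", "sex"},
}

# Count in how many datasets each feature appears.
frequency = Counter(f for features in FEATURES.values() for f in features)

# Print a text version of the presence/absence grid,
# most frequent features first.
datasets = sorted(FEATURES)
ordered = sorted(frequency.items(), key=lambda item: (-item[1], item[0]))
for feature, count in ordered:
    marks = " ".join("x" if feature in FEATURES[d] else "." for d in datasets)
    print(f"{feature:8s} {marks}  ({count}/{len(datasets)})")
```

The same structure, extended to all features, is the kind of input from which a presence/absence image such as Figure 1 can be drawn.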

INRG features. Some clinical variables refer to the International Neuroblastoma Risk Group (INRG) classification system, which has been developed to establish an international consensus for pre-treatment risk stratification and to direct patients to the most suitable treatment protocol (). Age at diagnosis, tumor stage, MYCN status, histologic category, degree of tumor differentiation, ploidy, and loss of 11q (11q–) are currently used in the INRG schema to assign a very-low, low, intermediate, or high-risk group (). All features, except ploidy, have been reported in at least one of the studies (Figure 1).

Treatment variables. In modern therapy, patients diagnosed with neuroblastoma receive heterogeneous treatments, ranging from observation for very low-risk patients to combinations of intensive multi-agent induction chemotherapy, surgery, radiation, myeloablative consolidation therapy with stem cell rescue and transplantation, 13-cis retinoic acid, and immunotherapy for high-risk patients (). Risk group, treatment protocol, surgical method, radiation, autologous stem cell transplantation, and local radiation therapy (RTx) dose were each reported in at least one of the studies (Figure 1). Standardized methods to define and interpret response to first-line treatment are important for efficiently monitoring and advancing therapy for neuroblastoma (). Response to first-line treatment has been included in three out of five studies, but this clinical feature has been indicated with different names, including tumor response (), treatment response (), and tumor status at haplo-SCT ().

Sex. A significant, but modest, positive association has been observed between male sex and neuroblastoma (). Although a significant impact of sex on neuroblastoma prognosis or diagnosis has not been described yet, the sex feature was included in three out of five datasets (Figure 1). Serum lactate dehydrogenase (LDH), serum ferritin, neuron-specific enolase (NSE), and urine vanillylmandelic acid (VMA) are well-known biomarkers used in the clinical setting to perform neuroblastoma diagnostic and prognostic evaluations (; ; ; ).

LDH, ferritin, genetic aberrations. Serum lactate dehydrogenase (LDH) and serum ferritin are strong prognostic molecular markers with the potential to identify ultra-high-risk patients and to refine risk stratification (). Although the clinical utility of these molecular markers has been reported in the literature, the low impact of these factors in multivariate analyses has prevented their inclusion in the INRG risk classification schema (). The genetic aberrations of chromosomes 1p, 11q, and 17q are associated with poor outcome in neuroblastoma (; ; ; ). Although these chromosome aberrations clearly receive more consideration in the clinical setting than in the past, they were reported in only one of the datasets (Figure 1). Nevertheless, both the 1p and 17q aberrations are statistically correlated with nMYC amplification: this correlation is the main reason why the INRG classification does not utilize them, and why under the proper circumstances nMYC amplification could be used as a proxy for these chromosomal aberrations ().

Primary tumor sites and metastases. The location of the primary tumor and of any metastatic sites dictates the symptomatology (). The adrenal medulla is the most common site of the primary tumor, but the tumor may also arise from the paraspinal or other sympathetic ganglia and can be present anywhere from the neck to the pelvis (). Metastases are present at diagnosis in about 50% of patients, with the bone marrow, bone, and regional lymph nodes being the most common sites (). Two out of five datasets (Figure 1) reported the primary or metastasis sites. The outcome of patients with neuroblastoma, which indicates whether the patient is alive or dead, and the overall survival, that is, the time interval between the date of diagnosis and the last follow-up date, are the primary endpoints of the whole clinical activity, and biomarkers have been proposed in the literature to improve the survival of patients (; , , ; ). Two of the datasets reported both the outcome and the overall survival. The study by Kim et al. () used a feature named outcome, but it refers to the onset of a relapse.

Relapse. The relapse feature indicates whether a patient with neuroblastoma experienced a relapse. Relapse-free survival is the time interval between the date of diagnosis and the date of first relapse (). Currently, no curative salvage regimens for recurrent neuroblastoma are known, thus several studies have been reported in the literature to fill this gap (). Only one of the five datasets’ studies (Figure 1) reported features about relapse, relapse-free survival or relapse sites. Relapse-free survival was referred to as progression-free survival by Banelli et al. (). Methylation SFN, Methylation PCDHB cluster, HDCT2 regimen, HDCT1 regimen, and age at relapse are study-specific features, and their role in neuroblastoma development or progression still remains to be validated.
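The two time intervals defined above, overall survival and relapse-free survival, both start at the date of diagnosis. The following Python sketch computes them in months for a single hypothetical patient; the dates are invented, and the average month length used for the conversion is an assumed convention, not one stated by the original studies.

```python
from datetime import date

# Average month length in days (an assumed convention for this sketch).
DAYS_PER_MONTH = 365.25 / 12


def months_between(start, end):
    """Length of the interval from start to end, in months."""
    return (end - start).days / DAYS_PER_MONTH


# Invented dates for a single hypothetical patient.
diagnosis = date(2012, 3, 1)
first_relapse = date(2013, 3, 1)
last_follow_up = date(2015, 3, 1)

# Overall survival: from diagnosis to the last follow-up date.
overall_survival = months_between(diagnosis, last_follow_up)
# Relapse-free survival: from diagnosis to the first relapse.
relapse_free_survival = months_between(diagnosis, first_relapse)

print(round(overall_survival, 1), round(relapse_free_survival, 1))
```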

The set of patients feature refers to a technical split of a cohort into two distinct subsets of patients for training and validating a computational model. One study reported this kind of feature.

Recap. Taken together, the previous evidence suggests that:

  1. The number and type of clinical features are heterogeneous across the five datasets’ studies;
  2. The datasets’ studies investigated distinct subsets of patients, which included high-risk patients with stage 4 tumor (), patients younger than 18 months with stage 4 and MYCN not amplified tumor (), patients with MYCN amplified and low 11q tumor (), patients of all stages (), and patients with relapsed or refractory disease after HDCT or auto-SCT ();
  3. The features collected in previous studies are taken at different time points (including diagnosis, treatment, patient’s response, time of relapse, and follow-up);
  4. Sometimes different names are used to refer to the same clinical parameter, as is the case for treatment response () and tumor response (), or time to first relapse () and interval to relapse ();
  5. In one case, the same feature name is used to indicate different types of data, as is the case for outcome in the dataCK2018 dataset () and outcome in the dataBB2013 dataset ().
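Point 4 of the recap suggests that column names need to be harmonised before the datasets can be compared or merged. A minimal Python sketch of such a renaming step follows; the aliases are the ones named in point 4, while the canonical names on the right-hand side and the example records are hypothetical choices made for this illustration.

```python
# Aliases named in point 4 of the recap; the canonical names
# (right-hand side) are hypothetical and chosen for this sketch only.
ALIASES = {
    "treatment response": "treatment_response",
    "tumor response": "treatment_response",
    "time to first relapse": "interval_to_relapse",
    "interval to relapse": "interval_to_relapse",
}


def harmonise(record):
    """Rename known aliases; keep every other column name unchanged."""
    return {ALIASES.get(name, name): value for name, value in record.items()}


# Invented example records.
print(harmonise({"treatment response": "CR", "time to first relapse": 4}))
print(harmonise({"interval to relapse": 12, "MYCN status": "A"}))
```

Renaming alone does not address point 5: when the same name, such as outcome, carries different meanings in two datasets, the values themselves have to be checked against each study's definition.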

3 Analysis

Neuroblastoma is the most common malignant tumor diagnosed in children under one year old, and is derived from the sympathetic nervous system. The clinical presentation of neuroblastoma patients is often vague, with many signs and symptoms being non-specific. Initial clinical presentation frequently includes weight loss, fever, and lethargy, with the most overt clinical signs and symptoms, like a unilateral mass and opsoclonus-myoclonus, not showing up until much later in the disease progression (). Neuroblastoma is a tumor with clinical and prognostic heterogeneity: for some patients, the disease is aggressive and terminal, while other patients develop a benign disease which often has complete and spontaneous regression (). When diagnosed early as a localized disease, patients suffering from neuroblastoma frequently have high survival rates, approaching 90% in stage I and stage IVS of the disease. The survival rate, however, becomes quite dismal in patients with stage III, with only a 21% survival rate at 10 years ().

Epidemiologically, neuroblastoma is a rare disease with an incidence of around 11 cases per million children (). Because of the heterogeneity of clinical presentation coupled with a low incidence rate, most clinical centers lack enough data points to carry out meaningful clinical studies. Therefore, pooling data from multiple centers is critical for the development of new therapies and management guidelines. Sharing patients’ data across many medical centers has led to breakthrough studies in the past: Cohn and Pearson () led a large consortium that collected data from 8,800 patients suffering from neuroblastoma, and further analysis of this dataset led to the development of a new tumor staging system on the basis of surgical risk factors (). Risk stratification is pivotal in patients suffering from neuroblastoma, as it aids in choosing the most appropriate clinical management. Patients belonging to the low-risk strata are usually assigned to undergo local surgical resection; in contrast, patients in the high-risk strata are usually treated with high doses of chemotherapy, surgery, external beam therapy, and anti-GD2 immunotherapy ().

Although stratification of neuroblastoma patients is clear at both ends of the risk spectrum, there is no coherent consensus on how to accurately stratify patients at intermediate risk. Great efforts have been made to create a uniform risk stratification to share among research groups; the most important is the International Neuroblastoma Risk Group (INRG) classification system (). This risk stratification score uses age, histology, grade of tumor differentiation, MYCN status, 11q aberration, and ploidy to assess the survival risk in pre-treatment patients.

In this research study, we found five open unrestricted datasets with a diverse set of features that include clinical, genetic and laboratory data. Clinical data include age and sex: age appears in every dataset, and sex is present in three out of five. Age at diagnosis is one of the most important factors when defining prognosis in neuroblastoma patients and is treated as a proxy for the underlying genetic and biologic features of the tumor (). Infants less than one year old usually follow a far more benign course than their older counterparts, with close to 80% survival rate, while adolescents suffer, for the most part, an aggressive and terminal disease with very poor survival rates.

Clinical lab findings are missing in most datasets, with only dataCK2018 containing most of the lab findings, such as ferritin, LDH, NSE, and urine VMA. This aspect is concerning because these lab tests are, for the most part, readily accessible even in underprivileged areas; moreover, they are a key component in the work-up of patients suffering from this oncologic disease. These markers are not routinely used in the risk stratification of neuroblastoma patients, as they are not as precise as the genetic markers (); nevertheless, further inspection of their possible interactions with genetic and clinical values might shed light on a more precise risk stratification.

It is crucial to further analyze how ferritin and LDH might aid in subgroup analysis for both prognosis and treatment, as it has been shown in the past that LDH aids in selecting the appropriate therapeutics in other types of cancer. For example, initial levels of LDH can predict the benefit of bevacizumab in colorectal cancer (), and LDH response to first-line treatment predicts survival in breast cancer ().

Clinical markers are insufficient for a precise risk stratification and subsequent treatment, which is why obtaining genetic data is crucial when managing patients with neuroblastoma. Analysis of cytogenetic profiles of neuroblastoma cells has revealed that whole chromosomal changes without any segmental alteration are associated with excellent outcomes even in older patients or in disseminated disease, whereas segmental chromosome imbalances are associated with worse prognosis and high risk of relapse, even in patients who also present whole chromosomal changes (). MYCN amplification, a crucial genetic marker for risk stratification (), is present in three out of five datasets. Other common genetic markers, such as ploidy and 11q aberrations, which are used in the INRG risk stratification score, are severely lacking in the datasets, with 11q being present only in dataCK2018 and ploidy missing completely.

Confirming the diagnosis of neuroblastoma requires a biopsy, which also provides important information for the prognosis. These tumors are classified depending on the amount of Schwannian stroma present in the tumor () and, generally speaking, undifferentiated histology confers a worse prognosis, while highly differentiated histology confers a good prognosis. Histopathological information is present in the dataBB2013, dataEV2013 and dataYM2018 datasets, and can help researchers to further elucidate how the histology of this tumor might interact with genomics and clinical markers.

Treatment in neuroblastoma patients follows a binary trend due to the heterogeneity of the disease: the tendency in the past years has been to reduce therapy in patients belonging to the low-risk stratum, while increasing it in those at the high-risk end. The focus in the past decade has been to use higher doses of chemotherapy and radiotherapy in patients with high-risk neuroblastoma (). Chemotherapy data are present in the dataYBC2019, dataCK2018, and dataEV2013 datasets, and radiotherapy data are included in the dataYM2018 and dataCK2018 cohorts. Along with clinical, genomic and outcome data, this information could be used to support more individualized clinical decision making when choosing the right treatment for neuroblastoma patients ().

Limitations

Integration of the datasets is possible due to the presence of a subset of features that are reported for the majority of the datasets, such as age, stage, and MYCN status. These variables can be used for clinical and molecular characterization of the patients and their diseases.

However, the scarce overlap among the features across all datasets and the differences in the scale of feature values represent a limitation for these datasets, since they might reduce the comparability among the datasets and their usability in future studies of new biomedical markers for neuroblastoma. Compatibility across datasets is an important characteristic of any data collection and an aim of any standardization effort put in place by the scientific community. We believe that a larger set of common features would be of great benefit for the scientific community, as it would improve comparability across publicly available datasets and enhance the reusability of data for future studies on neuroblastoma.
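To make the feature overlap concrete, the following minimal Python sketch tabulates in how many of the five datasets each feature is reported. It encodes only the availability explicitly stated in this survey; features whose hosting datasets are not named in the text (for example sex, stage, and MYCN status) are deliberately left out. The names `AVAILABILITY`, `coverage`, and `shared_features` are illustrative assumptions and are not part of the repository.

```python
# Illustrative sketch: feature availability as stated in the text of this survey.
# Features whose hosting datasets are not named in the text are omitted.
DATASETS = ("dataBB2013", "dataCK2018", "dataEV2013", "dataYBC2019", "dataYM2018")

AVAILABILITY = {
    "age": set(DATASETS),  # reported in every dataset
    "histology": {"dataBB2013", "dataEV2013", "dataYM2018"},
    "chemotherapy": {"dataYBC2019", "dataCK2018", "dataEV2013"},
    "radiotherapy": {"dataYM2018", "dataCK2018"},
    "ferritin": {"dataBB2013", "dataCK2018"},
    "LDH": {"dataCK2018"},
    "NSE": {"dataCK2018"},
    "11q aberration": {"dataCK2018"},
    "ploidy": set(),  # missing from all five datasets
}


def coverage(availability, datasets):
    """Return, for each feature, the number of datasets that report it."""
    wanted = set(datasets)
    return {feature: len(present & wanted) for feature, present in availability.items()}


def shared_features(availability, datasets, min_datasets):
    """Return the sorted features reported in at least `min_datasets` datasets."""
    counts = coverage(availability, datasets)
    return sorted(f for f, count in counts.items() if count >= min_datasets)


if __name__ == "__main__":
    for feature, count in sorted(coverage(AVAILABILITY, DATASETS).items()):
        print(f"{feature:15s} {count}/{len(DATASETS)}")
    # Features available in the majority (at least three) of the datasets.
    print(shared_features(AVAILABILITY, DATASETS, min_datasets=3))
```

A tabulation of this kind can help identify which variables could serve as a common core when integrating the datasets, and which ones would need to be added by future standardization efforts.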

The low number of patients included in a study is one of the most limiting factors in studies on rare diseases such as neuroblastoma. Three out of five datasets reported features for fewer than twenty neuroblastoma patients. The low number of patients represents an additional limitation of these datasets because it might reduce the robustness of the analyses to be carried out on them in future studies. A larger number of patients would be necessary to report robust conclusions and to achieve a sufficiently trustworthy analysis able to support clinical decisions.
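The effect of cohort size on robustness can be illustrated with a confidence interval for a proportion, such as a survival rate. The sketch below uses the Wilson score interval, which is our choice for illustration and not a method applied in this survey; the cohort sizes and event counts in the usage example are hypothetical and are not taken from any of the five datasets.

```python
import math


def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    events: number of patients with the outcome of interest
    n:      cohort size
    z:      normal quantile (1.96 corresponds to roughly 95% confidence)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= events <= n:
        raise ValueError("events must lie between 0 and n")
    p = events / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical cohorts with the same observed proportion (50%).
    for events, n in [(10, 20), (100, 200)]:
        low, high = wilson_interval(events, n)
        print(f"n={n:4d}: {low:.3f} to {high:.3f} (width {high - low:.3f})")
```

With the same observed proportion, the interval for the smaller hypothetical cohort is markedly wider than the one for the larger cohort, which is the sense in which small datasets limit the conclusions that can be drawn from them.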

4 Conclusions

Neuroblastoma is a childhood cancer that affects thousands of newborns worldwide, and scientific research on the data derived from electronic health records can reveal new discoveries about this disease. EHR neuroblastoma datasets are available on the internet, but they are located on different websites or among the supplementary information of published articles, and are therefore difficult to find. We alleviate this issue by presenting this survey, where we describe each of the five public datasets on neuroblastoma EHRs currently available in the scientific literature, report the quantitative characteristics and the clinical features of the five analyzed datasets, and highlight which important variables were present in the datasets and which ones were absent (subsection S1.1).

In this survey, we also introduce our Neuroblastoma EHRs Open Data Repository, an online catalogue containing the five datasets which can be used by researchers and computers worldwide for any scientific analysis. Unlike the INRG Data Commons () and the Pediatric Cancer Data Commons (), our repository’s data can be accessed openly without any specific permission or project proposal, by anyone in the world at any time.

We believe our survey and data repository can be useful resources that facilitate new scientific discoveries about neuroblastoma and that can lead to an improvement of the conditions of the patients worldwide.

In the future, we plan to write surveys and to develop data repositories derived from electronic health records of patients with other diseases, such as sepsis () or amyotrophic lateral sclerosis ().

S1 Supplementary information

S1.1 Additional considerations

The initial work-up of a patient recently diagnosed with neuroblastoma usually includes a complete blood panel, with coagulogram, uric acid, electrolytes, kidney function, liver function tests, ferritin and LDH. LDH and ferritin are both correlated with a poor prognosis when elevated (; ). These blood tests are not included in the INRG classification system due to lack of specificity (); nonetheless, the employment of these markers might enhance the granularity of risk stratification while adding a negligible overhead to the management of patients, since these lab tests are affordable and readily accessible even in underprivileged areas of the world. LDH can only be found in the dataCK2018 dataset, while ferritin is present in both dataBB2013 and dataCK2018.

Patients suspected of suffering from neuroblastoma should have a urine specimen collected, since levels of catecholamines and their downstream metabolites are often elevated and support the diagnosis. Vanillylmandelic acid (VMA) and homovanillic acid (HVA) are both catecholamine metabolites that can be found in urine in up to 90% of children with neuroblastoma (). There is some evidence that these metabolites might aid in prognosis and that, especially when taking into account the DA (dopamine)/VMA ratio, they might help in biological grading ().

Neuron-specific enolase (NSE), a specific marker for neurons and peripheral neuroendocrine cells, has been found to be increased in advanced neuroblastoma and is correlated with poorer prognosis (). NSE is only available in the dataCK2018 dataset, and it might also help with subgroup analysis.

Various segmental chromosomal aberrations have been found to be associated with poor prognosis. Loss of heterozygosity at 11q and 1p () and gain at 17q () have been linked with poorer prognosis; only 11q is included in the INRG classification, mostly because 1p loss and 17q gain are statistically associated with MYCN amplification (). These genetic markers have been a key addition to risk stratification in the last decades, with MYCN amplification being one of the strongest predictors of high-risk disease.

Data Accessibility Statement

The five datasets described in this study are publicly available, under the CC BY 4.0 license (), at the following web address:

https://davidechicco.github.io/neuroblastoma_EHRs_data or at

https://doi.org/10.5281/zenodo.6915403

Additional File

The additional file for this article can be found as follows:

Supplementary datasets.

Supplementary file containing additional considerations about the datasets investigated. DOI: https://doi.org/10.5334/dsj-2022-017.s1