A Survey on Publicly Available Open Datasets Derived From Electronic Health Records (EHRs) of Patients with Neuroblastoma



INTRODUCTION
Neuroblastoma is a pediatric cancer that forms in certain types of nerve tissue and most commonly arises in the adrenal glands in the abdomen (Colon & Chung 2011). It is the most common cancer among newborns and affects thousands of children worldwide (American Cancer Society, 2021). Ninety percent of neuroblastoma cases are diagnosed by age five. The 5-year survival rate for children with neuroblastoma in England is approximately 67% (Children with cancer UK, 2021). High-risk neuroblastoma is usually treated with intensive chemotherapy, surgery, radiation therapy, bone marrow and hematopoietic stem cell transplantation.
Knowledge about this pediatric cancer can be discovered not only with laboratory results and clinical trials, but also by analyzing the data contained in the electronic health records (EHRs) of patients. Computational statistical methods and machine learning techniques applied to data derived from structured EHRs can, in fact, be effective tools to infer new knowledge about neuroblastoma and eventually to help medical doctors develop better treatments (Adkins 2017).
Datasets derived from electronic health records, although extremely useful for the progress of scientific research, are often kept private and not shared with the rest of the scientific community, because of privacy issues or because of the lack of a data sharing culture in the hospital or research centre. In recent years, a wave of requests for open data sharing has been launched by researchers around the world, which culminated in the definition of the FAIR principles for data sharing (Bertagnolli et al. 2017; Stall et al. 2019; Wilkinson et al. 2016).
The advantages of open data sharing have already been demonstrated by important initiatives that have had a profound impact on scientific research. The University of California Irvine Machine Learning Repository (University of California Irvine 1987), for example, is an online catalogue of free datasets coming from several domains (biology, medicine, physics, computer science and engineering, social sciences, business, and games) that started in 1987 and now contains 588 different datasets. It includes some datasets derived from electronic health records, but none on neuroblastoma.
Another online resource that shares public datasets is Kaggle. Since 2010, this online community of data scientists has collected and provided open datasets to its users, to be used for scientific competitions or for independent analyses. To date, Kaggle lists around ten thousand public datasets (Kaggle 2022). In 2017, Google launched the Dataset Search (Google 2022) engine, where users can easily look for public datasets on the internet. Re3data (2022) is another interesting resource to mention: it is a public registry of scientific data repositories available online.
In bioinformatics, Gene Expression Omnibus (GEO) (US National Center for Biotechnology Information 2021) has been providing thousands of open gene expression and methylation datasets to researchers worldwide, leading to thousands of studies and publications. GEO also contains gene expression data of neuroblastoma, which led to some significant genetic discoveries (Cangelosi et al. 2020; Melaiu et al. 2020). Neuroblastoma bioinformatics data were publicly shared and integrated for the CAMDA 2017 conference Neuroblastoma Data Integration Challenge (Francescatto et al. 2018; CAMDA 2017). Multiple datasets of images have also recently been released on public repositories: ophthalmological images (Khan et al. 2021) and cancer images (Clark et al. 2013) among them.
Regarding neuroblastoma, the International Neuroblastoma Risk Group (INRG) recently released the INRG Data Commons (International Neuroblastoma Risk Group 2017; Volchenboum et al. 2017), a database of thousands of EHRs of patients with neuroblastoma. Despite its usefulness, access to the INRG Data Commons is restricted to the participants in approved projects: to obtain the data, one needs to fill in and submit a project application, which might or might not be approved by the INRG Data Commons board. Even in the case of proposal acceptance, this procedure clearly requires some processing time: between the proposal submission and the actual data download, months or even years can go by, with consequent delays that can negatively affect the project itself. Of course, this restricted approach limits the possibility of using these data.
A similar initiative, the Pediatric Cancer Data Commons (Plana et al. 2021; Volchenboum et al. 2021), was launched by the University of Chicago and provides EHRs data of patients with [...]
Since a free public resource containing open, unrestricted data derived from electronic health records of patients diagnosed with neuroblastoma is currently missing, we gathered the five open datasets derived from EHRs of neuroblastoma described in this survey in a website called Neuroblastoma Electronic Health Records Open Data Repository, which we freely released online. Moreover, we converted these datasets into numerical values, making them computer-readable for any computational analysis.
Users and researchers can take advantage of this resource by downloading the datasets and performing their own scientific analyses, which might lead to new discoveries about this pediatric cancer. Moreover, we performed a detailed descriptive analysis of all the clinical features present in the five listed datasets, providing detailed information about all the variables. Some information about these variables was absent even in the original dataset publications.
We organize the rest of this study as follows. After this Introduction, we describe the datasets of our survey in section 2, and we discuss the main insights of their clinical variables in section 3. We finally outline some conclusions in section 4.
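As an illustration of what converting a clinical variable into numerical values can look like, the following minimal Python sketch maps the categories of one column to integer codes. It is not the procedure used to build the repository: the function name `encode_column`, the missing-value tokens, and the toy values are assumptions made only for this example.

```python
def encode_column(values, missing_tokens=("", "NA")):
    """Map each distinct category to an integer code; missing entries become None."""
    codes = {}      # category -> integer code, in order of first appearance
    encoded = []
    for value in values:
        value = value.strip()
        if value in missing_tokens:
            encoded.append(None)
            continue
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


# Toy values for illustration only; they are not taken from any of the datasets.
sex_raw = ["F", "M", "M", "", "F"]
sex_num, sex_codes = encode_column(sex_raw)
```

Returning the code table together with the encoded column keeps the mapping explicit, so that the numerical values can always be translated back to their clinical meaning.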

DATA AND METHODOLOGY
We performed a thorough literature search of scientific articles related to neuroblastoma electronic health records (EHRs) in November and December 2020, using the Google Scholar (Google 2021) search engine. We performed this dataset search by following the guidelines by Marco Pautasso (Pautasso 2013), looking for the keywords 'electronic medical records and neuroblastoma', 'EMR and Neuroblastoma', 'electronic health records and Neuroblastoma', 'EHR and neuroblastoma', and 'clinical records and neuroblastoma'. Following the example of Khan and colleagues (2021), we collated and screened the results returned from the first ten pages of the search.
We identified nine articles including EHRs data of patients diagnosed with neuroblastoma in their main texts or in their supplementary information.
Five of them have public open datasets released under the CC BY 4.0 license and therefore downloadable and usable without non-commercial restrictions (Banelli et al. 2013; Kim et al. 2018; Villamón et al. 2013; Choi et al. 2019; Ma et al. 2018), included in the supplementary information of the articles. Three of them contain datasets, but without an open license, and therefore cannot be used (Federico et al. 2015; Rosenbaum et al. 2013; Smith et al. 2010). Moreover, one article contains an open dataset, but it is related to coronary artery disease and only minimally pertinent to neuroblastoma (Matsumura et al. 2018).

dataBB2013
The dataBB2013 dataset was released by Banelli et al. (2013), and was previously employed by the same research team in earlier studies (Banelli et al. 2005a, 2005b, 2010). The dataset contains data from 121 patients, including survived and deceased subjects, collected at the hospital of Istituto Giannina Gaslini (Genoa, Italy) between 1990 and 2004. The clinical features of each patient are listed in Table 2. The dataset includes some typical features, such as age at diagnosis, histological category, risk group, MYCN amplification, overall survival, progression-free survival, and outcome. The outcome indicates whether the patient survived or not. This dataset also contains two features about the methylation of a gene family (Protocadherin Beta Cluster, PCDHB) and of a specific gene (SFN). The prognostic role of the methylation of these two genes is the key aspect of the original study (Banelli et al. 2013).
The dataBB2013 dataset is the only one including data related to the methylation of the PCDHB gene family and to the methylation of the SFN gene. The methylation of Protocadherin Beta cluster genes is known to be associated with poor prognosis in patients with neuroblastoma (Abe et al. 2005), while the association between the methylation of the protein-coding Stratifin gene and advanced stage, high-risk neuroblastoma was shown by the dataset authors in a previous study (Banelli et al. 2010).
This dataset contains only one blood test variable: ferritin. Ferritin is a protein that contains iron and is known to be a clear prognostic factor for neuroblastoma (Moroz et al. 2020).

dataCK2018
The dataCK2018 dataset was published by Kim et al. (2018). It contains data of 20 patients, collected from January 2009 to December 2015 at the Samsung Medical Center of the Sungkyunkwan University (Seoul, South Korea). All the patients had chemotherapy and surgery, with or without local radiation therapy, followed by differentiation therapy. Each patient has 16 clinical factors (Table 3), making it the dataset with the highest number of features among the five listed in this article. The clinical feature list includes several traditional variables such as age, outcome, sex, follow-up duration, tumor primary site, and metastatic sites. Some uncommon variables related to the treatment received by the patients are also present: local dosage of radiation, response after chemotherapy and surgery, and response after differentiation therapy.
The main finding reported by the authors of the original study was that 'patients younger than 18 months with stage 4 MYCN nonamplified neuroblastoma had high survival chances, without significant late adverse effects, when treated with alternating cycles of cyclophosphamide (CEDC) and ifosfamide, carboplatin, and etoposide (ICE), followed by surgery and differentiation therapy' (Kim et al. 2018).
This dataset contains variables of ferritin and LDH levels in the blood, both recognized as prognostic factors for neuroblastoma in the medical community (Moroz et al. 2020). It includes a variable related to a urine test (urine vanillylmandelic acid), which is an indicator for neuroblastoma diagnosis in screening tests in children, and a variable of the blood neuron-specific enolase level, which can be used as an auxiliary test for neuroblastoma.
The dataCK2018 dataset is the only one whose cohort includes data on genetic factors: 11q-, 17q+, and 1p- chromosomal aberrations. These genetic abnormalities are common in patients diagnosed with neuroblastoma.

dataEV2013
The dataEV2013 dataset was made open by Villamón et al. (2013). It contains data from 19 patients and 11 clinical variables (Table 4).
All the 11 features are health-related, and they do not contain any genomics or genetic information. The patients' samples were collected between 1999 and 2007, at the Spanish Reference Centre for Neuroblastoma Biological and Pathological studies at the time of diagnosis.
The original study is focused on genetic instability and intratumoral heterogeneity with MYCN amplification and 11q deletion. According to the NIH Genetic and Rare Diseases Information Center, the chromosome 11q deletion is "a chromosome abnormality that occurs when there is a missing (deleted) copy of genetic material on the long arm (q) of chromosome 11" (National Institutes of Health (NIH), Genetic and Rare Diseases Information Center (GARD), 2021), which can cause developmental delay, intellectual disability, behavioral problems and distinctive facial features.

dataYBC2019
The dataYBC2019 dataset was published by Choi et al. (2019). It contains data from 7 patients, each of them with 10 health indicators (Table 5). This dataset has the smallest cohort of patients among the five. The original study of the dataset curators was focused on verifying whether early natural killer cell infusion following haploidentical stem cell transplantation would reduce relapse in patients with neuroblastoma (Choi et al. 2019). In both that study and its dataset, the relapse information was of great importance.

dataYM2018
The dataYM2018 dataset was released by Ma et al. (2018). It contains data from 169 patients, with 13 clinical factors (Table 6): it is the largest patient cohort among the datasets listed in this study. The data were collected at Children's Hospital of Fudan University (China) between 2010 and 2015 (Ma et al. 2018).
This dataset includes several traditional variables that can be found in many neuroblastoma electronic health records, such as age, MYCN status, outcome, sex, risk, and overall survival time.It also includes information about treatment, that is if the patient had autologous stem cell transplantation and radiation, and information about the primary tumor site.
The focus of the original study is the MYCN amplification. The authors claim that the MYCN amplification was an independently adverse prognostic factor in this cohort of patients with neuroblastoma (Ma et al. 2018).
The autologous stem cell transplantation is one of the therapies employed for patients with neuroblastoma (Trahair et al. 2007).
MYCN and sex are each present in three datasets: MYCN in dataBB2013, dataYBC2019, and dataYM2018, and sex in dataCK2018, dataEV2013, and dataYM2018. Seven other variables are present in two datasets (ferritin, histology, interval to relapse, overall survival, primary site, risk, treatment response). All the other features are present in only one dataset.
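Counts like these can be derived mechanically once the feature list of each dataset is known. The sketch below shows one possible way to do it in Python. The feature sets are deliberately partial: they contain only the memberships stated in the text for MYCN, sex, ferritin and LDH, not the full feature lists of the datasets, and the helper name `feature_counts` is an assumption made for this example.

```python
from collections import Counter

# Partial feature sets, limited to memberships stated in the text.
# They are NOT the complete feature lists of the five datasets.
features_by_dataset = {
    "dataBB2013": {"MYCN", "ferritin"},
    "dataCK2018": {"sex", "ferritin", "LDH"},
    "dataEV2013": {"sex"},
    "dataYBC2019": {"MYCN"},
    "dataYM2018": {"MYCN", "sex"},
}


def feature_counts(features_by_dataset):
    """Count in how many datasets each feature appears."""
    counts = Counter()
    for features in features_by_dataset.values():
        counts.update(features)
    return counts


counts = feature_counts(features_by_dataset)
# Features present in at least two datasets are candidates for integration.
shared = {name: n for name, n in counts.items() if n >= 2}
```

Using sets rather than lists guarantees that a feature is counted at most once per dataset.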

SHARED CLINICAL FEATURES
We show the features present in each dataset in Figure 1.
INRG features. Some clinical variables refer to the International Neuroblastoma Risk Group (INRG) classification system, which has been developed to establish an international consensus for pre-treatment risk stratification and to direct patients to the most suitable treatment protocol (Cohn et al. 2009). Age at diagnosis, tumor stage, MYCN status, histologic category, degree of tumor differentiation, ploidy, and loss of 11q (11q-) are currently used in the INRG schema to assign patients to a very-low, low, intermediate or high-risk group (Cohn et al. 2009). All features, except ploidy, have been reported in at least one of the studies (Figure 1).

Treatment variables. In modern therapy, patients diagnosed with neuroblastoma receive heterogeneous treatments, ranging from observation for very low-risk patients to combinations of intensive multi-agent induction chemotherapy, surgery, radiation, myeloablative consolidation therapy with stem cell rescue and transplantation, 13-cis retinoic acid, and immunotherapy for high-risk patients (Qi & Zhan 2021). Risk group, treatment protocol, surgical method, radiation, autologous stem cell transplantation, or local radiation therapy (RTx) dose were reported in one of the studies (Figure 1). Standardized methods to define and interpret response to first-line treatment are important to efficiently monitor and advance therapy for neuroblastoma (Park et al. 2017). Response to first-line treatment has been included in three out of five studies, but this clinical feature has been indicated with different names including tumor response (Kim et al. 2018), treatment response (Villamón et al. 2013), and tumor status at haplo-SCT (Choi et al. 2019).
Sex. A significant positive, but modest, association has been observed between male sex and neuroblastoma (Williams et al. 2019). Although a significant impact of sex on neuroblastoma prognosis or diagnosis has not been described yet, the sex feature was included in three out of five datasets (Figure 1).
[...] clinical setting to perform neuroblastoma diagnostic and prognostic evaluations (Barco et al. 2014; Cangemi et al. 2012; Ferraro et al. 2020; Tolbert & Matthay 2018).
LDH, ferritin, genetic aberrations. Serum lactate dehydrogenase (LDH) and serum ferritin are strong prognostic molecular markers with potential ability to identify ultra-high-risk patients and to refine risk stratification (Moroz et al. 2020). Although the clinical utility of these molecular markers has been reported in the literature, the low impact of these factors in multivariate analyses has prevented their inclusion in the INRG risk classification schema (Cohn et al. 2009).
The genetic aberrations of chromosomes 1p, 11q, and 17q are associated with poor outcome in neuroblastoma (Attiyeh et al. 2005; Bown et al. 1999; Caron et al. 1996; Cohn et al. 2009). Although these aberrations clearly receive more consideration in the clinical setting than in the past, they were reported in only one of the datasets (Figure 1). Nevertheless, both the 1p and 17q aberrations are statistically correlated with MYCN amplification: this correlation is the main reason why the INRG classification does not utilize them, and why under the proper circumstances MYCN amplification could be used as a proxy for these chromosomal aberrations (O'Neill et al. 2001).
Primary tumor sites and metastases. The location of the primary tumor and any metastatic sites dictates the symptomatology (Tolbert & Matthay 2018). The adrenal medulla is the most common site where a primary tumor develops, but the tumor may also arise from the paraspinal or other sympathetic ganglia and can be present anywhere from the neck to the pelvis (Tolbert & Matthay 2018). Metastases are present at diagnosis in about 50% of patients, with the bone marrow, bone and regional lymph nodes being the most common sites (Tolbert & Matthay 2018). Two out of five datasets (Figure 1) reported the primary or metastasis sites.
Neuroblastoma patients' outcome, indicating whether the patient is alive or dead, and overall survival, that is the time interval between the date of diagnosis and the last follow-up date, are the primary endpoints of the whole clinical activity, and biomarkers have been proposed in the literature to improve the survival of patients (Cangelosi et al. 2013, 2014, 2016, 2020; Ognibene et al. 2017). Two of the datasets have reported the outcome and the overall survival. The study by Kim et al. (2018) also used a feature named outcome, but there it refers to the onset of a relapse.
Relapse. The relapse feature indicates whether a patient experienced a relapse. Relapse-free survival is the time interval between the date of diagnosis and the date of first relapse (Cangelosi et al. 2013). Currently, no curative salvage regimens for recurrent neuroblastoma are known, thus several studies have been reported in the literature to fill this gap (Basta et al. 2016). Only one of the five datasets' studies (Figure 1) reported features about relapse, relapse-free survival or relapse sites. Relapse-free survival was referred to as progression-free survival by Banelli et al. (2013).
Methylation SFN, Methylation PCDHB cluster, HDCT2 regimen, HDCT1 regimen, and age at relapse are study-specific features, and their role in neuroblastoma development or progression still remains to be validated.
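The two time-to-event definitions given above (overall survival and relapse-free survival) are both intervals measured from the date of diagnosis. A minimal Python sketch of this computation follows. The helper name `interval_in_days`, the choice of days as the unit, and the three dates are assumptions made for illustration and do not come from any of the datasets.

```python
from datetime import date


def interval_in_days(start, end):
    """Number of days elapsed from start to end; rejects reversed dates."""
    if end < start:
        raise ValueError("end date precedes start date")
    return (end - start).days


# Hypothetical dates for a single patient, for illustration only.
diagnosis = date(2010, 1, 15)
first_relapse = date(2011, 1, 15)
last_follow_up = date(2013, 1, 15)

# Relapse-free survival: from diagnosis to first relapse.
relapse_free_survival = interval_in_days(diagnosis, first_relapse)
# Overall survival: from diagnosis to last follow-up.
overall_survival = interval_in_days(diagnosis, last_follow_up)
```

Rejecting reversed dates makes data-entry errors visible instead of silently producing negative survival times.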
The set of patients feature refers to a technical split of a cohort into two distinct subsets of patients, used for training and validating a computational model. One study reported this kind of feature.
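A split of this kind can be sketched as follows. This is a generic random split written for illustration, not the procedure used in the study that reported the feature; the function name `split_cohort`, the 30% validation fraction, and the fixed seed are assumptions.

```python
import random


def split_cohort(patient_ids, validation_fraction=0.3, seed=0):
    """Assign each patient to 'training' or 'validation' in a reproducible way."""
    ids = list(patient_ids)
    shuffled = ids[:]
    # A dedicated generator with a fixed seed makes the split repeatable.
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(ids) * validation_fraction)
    validation = set(shuffled[:n_validation])
    return {pid: ("validation" if pid in validation else "training") for pid in ids}


# Ten hypothetical patient identifiers.
assignment = split_cohort(range(10))
```

Storing the assignment as a per-patient label, as the set of patients feature does, lets other researchers reproduce the same training and validation subsets.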
Recap. Taken together, the previous evidence suggests that: (i) The number and type of clinical features are heterogeneous across the five datasets' studies; (ii) The datasets' studies investigated distinct subsets of patients, which included high-risk patients with stage 4 tumor (Banelli et al. 2013), patients younger than 18 months, stage 4 and MYCN not amplified tumor (Kim et al. 2018), patients with MYCN amplified and low 11q tumor (Villamón et al. 2013), patients of all stages (Choi et al. 2019) and patients with relapsed or refractory disease after HDCT or auto-SCT (Ma et al. 2018); (iii) The features collected in previous studies are taken at different time points including diagnosis, treatment, patient's response, time of relapse, and follow-up; (iv) Sometimes different names are used to refer to the same clinical parameter, as is the case for treatment response (Villamón et al. 2013) and tumor response (Kim et al. 2018), or time to first relapse (Villamón et al. 2013) and interval to relapse (Choi et al. 2019); (v) In one case, the same feature name is used to indicate different types of data, as is the case for outcome in the dataCK2018 dataset (Kim et al. 2018) and outcome in the dataBB2013 dataset (Banelli et al. 2013).
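Points (iv) and (v) suggest that any integration of the datasets needs a naming step keyed on both the dataset and the feature, since one name can carry different meanings. The Python sketch below illustrates the idea. The canonical names on the right-hand side and the function `harmonize` are assumptions introduced for this example, and the feature names on the left follow the wording used in this survey, which may differ from the exact column names in the files.

```python
# (dataset, feature name as described in this survey) -> hypothetical canonical name
CANONICAL_NAMES = {
    ("dataCK2018", "tumor response"): "treatment_response",
    ("dataEV2013", "treatment response"): "treatment_response",
    ("dataYBC2019", "tumor status at haplo-SCT"): "treatment_response",
    ("dataEV2013", "time to first relapse"): "interval_to_relapse",
    ("dataYBC2019", "interval to relapse"): "interval_to_relapse",
    ("dataBB2013", "progression-free survival"): "relapse_free_survival",
    # Same name, different meaning: relapse onset versus survival status.
    ("dataCK2018", "outcome"): "relapse",
    ("dataBB2013", "outcome"): "survival_status",
}


def harmonize(dataset, feature):
    """Return the canonical name, or the original name if no rule applies."""
    return CANONICAL_NAMES.get((dataset, feature), feature)
```

For instance, `harmonize("dataCK2018", "outcome")` and `harmonize("dataBB2013", "outcome")` return different canonical names, which keeps the two meanings of outcome apart.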

ANALYSIS
Neuroblastoma is the most common malignant tumor diagnosed in children under one year old, and is derived from the sympathetic nervous system. The clinical presentation of neuroblastoma patients is often vague, with many signs and symptoms being non-specific. Initial clinical presentation frequently includes weight loss, fever, and lethargy, with the most overt clinical signs and symptoms, like a unilateral mass and opsoclonus-myoclonus, not showing up until much later in the disease progression (Keikhaei et al. 2012). Neuroblastoma is a tumor with clinical and prognostic heterogeneity: for some patients, the disease is aggressive and terminal, while other patients develop a benign disease which often regresses completely and spontaneously (Brodeur & Bagatell 2014). When diagnosed early as a localized disease, patients suffering from neuroblastoma frequently have high survival rates, approaching 90% in stage I and stage IVS of the disease. The survival rate, however, drops markedly in patients with stage III, with only a 21% survival rate at 10 years (Bernstein et al. 1992).
Epidemiologically, neuroblastoma is a rare disease with an incidence of around 11 cases per million children (Spix et al. 2006). Because of the heterogeneity of clinical presentation coupled with the low incidence rate, most clinical centers lack the data points needed to carry out meaningful clinical studies. Therefore, pooling data from multiple centers is critical for the development of new therapies and management guidelines. Sharing of patients' data across many medical centers has led to breakthrough studies in the past: Cohn and Pearson (Cohn et al. 2009) led a large consortium that collected data from 8,800 patients suffering from neuroblastoma, and further analysis of this dataset led to the development of a new tumor staging system on the basis of surgical risk factors (Maris 2010). Risk stratification is pivotal in patients suffering from neuroblastoma, as it aids in choosing the most proper clinical management. Patients belonging to the low-risk strata are usually assigned to undergo local surgical resection; in contrast, patients in the high-risk strata are usually treated with high doses of chemotherapy, surgery, external beam therapy, and anti-GD2 immunotherapy (Maris 2010).
Although the stratification of neuroblastoma patients is clear and concise at both ends of the risk spectrum, there has not been a coherent consensus on how to accurately stratify patients at intermediate risk. Great efforts have been made to create a uniform risk stratification to share among research groups; the most important is the International Neuroblastoma Risk Group (INRG) classification system (Cohn et al. 2009). This risk stratification score utilizes age, histology, grade of tumor differentiation, MYCN, 11q aberration, and ploidy to assess the survival risk in pre-treatment patients.
In this research study, we found five open unrestricted datasets which have a diverse set of features that include clinical, genetic and laboratory data. Clinical data include age and sex: age appears in every dataset and sex in three out of five. Age at diagnosis is one of the most important factors when defining prognosis in neuroblastoma patients and is treated as a proxy for the underlying genetic and biologic features of the tumor (Cheung et al. 2012). Infants less than one year old usually have a far more benign course than their older counterparts, with close to 80% survival rate, while adolescents suffer, for the most part, an aggressive and terminal disease with abysmal survival rates.
Clinical lab findings are missing in most datasets, with only dataCK2018 containing most of the lab findings like ferritin, LDH, NSE, and urine VMA. This aspect can be concerning, as these lab tests are for the most part readily accessible even in underprivileged areas; moreover, they are a key component in the work-up of patients suffering from this oncologic disease. These markers are not routinely used in the risk stratification of neuroblastoma patients, as they are not as precise as the genetic markers (Sokol & Desai 2019); nevertheless, further inspection of their possible interactions with genetic and clinical values might shed light on a more precise risk stratification.
It is crucial to further analyze how ferritin and LDH might aid in subgroup analysis for both prognosis and treatment, as it has been shown in the past that LDH aids in selecting the appropriate therapeutics in other types of cancer. For example, initial levels of LDH can predict the benefit of bevacizumab in colorectal cancer (Yin et al. 2014), and LDH response to first-line treatment predicts survival in breast cancer (Pelizzari et al. 2019).
Clinical markers are insufficient for a precise risk stratification and subsequent treatment, which is why obtaining genetic data is crucial when managing patients with neuroblastoma. Analysis of the cytogenetic profiles of neuroblastoma cells has revealed that whole chromosomal changes without any segmental alteration are associated with very good outcomes even in older patients or in disseminated disease, whereas segmental chromosome imbalances are associated with worse prognosis and high risk of relapse, even in patients that present whole chromosomal changes (Janoueix-Lerosey et al. 2009). MYCN amplification, a crucial genetic marker for risk stratification (Cao et al. 2017), is present in three out of five datasets. Other common genetic markers, such as ploidy and 11q aberrations, which are used in the INRG risk stratification score, are severely lacking in the datasets, with 11q being present only in dataCK2018 and ploidy missing completely.
Confirming the diagnosis of neuroblastoma requires a biopsy, which also brings important information for the prognosis. These tumors are classified depending on the amount of Schwannian stroma present in the tumor (Shimada et al. 2001) and, generally speaking, undifferentiated histology confers a worse prognosis, while highly differentiated histology confers a good prognosis. Histopathological information is present in the dataBB2013, dataEV2013 and dataYM2018 datasets, and can help researchers to further elucidate how the histology of this tumor might interact with genomics and clinical markers.
Treatment in neuroblastoma patients has followed a binary trend due to the heterogeneity of the disease: the tendency in the past years has been to reduce therapy in patients belonging to the low-risk strata, while increasing it in those at the high-risk end. The focus in the past decade has been to use higher doses of chemotherapy and radiotherapy in patients with high-risk neuroblastoma (Haghiri et al. 2021). Chemotherapy data are present in the dataYBC2019, dataCK2018, and dataEV2013 datasets, and radiotherapy data are included in the dataYM2018 and dataCK2018 cohorts. Along with clinical, genomic and outcome data, this information could be used to support more individualized clinical decision making when choosing the right treatment for neuroblastoma patients (Kent et al. 2018).
Limitations
Integration of the datasets is possible due to the presence of a subset of features that are reported for the majority of the datasets, such as age, stage, and MYCN status. These variables can be used for clinical and molecular characterization of the patients and their diseases.
However, the scarce overlap among the features across all datasets and the differences in the scale of feature values represent a limitation for these datasets, since this might reduce the comparability among the datasets and their usability in future studies of new biomedical markers for neuroblastoma. Compatibility across datasets is an important characteristic of a dataset and an aim of any standardization effort put in place by the scientific community. We believe that a larger set of common features would be of great benefit for the scientific community, as it would improve comparability across publicly available datasets and enhance the reusability of data for future studies on neuroblastoma.
The low number of patients included in a study is one of the most limiting factors in studies on rare diseases such as neuroblastoma. Three out of five datasets reported features for twenty or fewer neuroblastoma patients. The low number of patients represents an additional limitation of these datasets, because it might reduce the robustness of the analyses carried out on these datasets in future studies. A larger number of patients would be necessary to report robust conclusions and to achieve a sufficiently trustworthy analysis able to support clinical decisions.
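The effect of cohort size on robustness can be made concrete with a confidence interval for a proportion. The Python sketch below uses the Wilson score interval, a standard choice for small samples, to compare the smallest and the largest cohort sizes of this survey (7 and 169 patients). The survivor counts are hypothetical numbers chosen only so that both proportions are close to each other; they are not results from the datasets, and the function name `wilson_interval` is an assumption.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denominator = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denominator
    return center - half_width, center + half_width


# Hypothetical survivor counts for cohorts of 7 and 169 patients.
low_small, high_small = wilson_interval(5, 7)
low_large, high_large = wilson_interval(120, 169)

width_small = high_small - low_small
width_large = high_large - low_large
```

With similar observed proportions, the interval for the cohort of 7 patients is several times wider than the one for the cohort of 169 patients, which is the sense in which conclusions drawn from very small cohorts are less robust.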

CONCLUSIONS
Neuroblastoma is a childhood cancer that affects thousands of newborns worldwide, and scientific research on the data derived from electronic health records can reveal new discoveries about this disease. EHR neuroblastoma datasets are available on the internet, but they are located on different websites or among the supplementary information of published articles and are therefore difficult to find. We alleviate this issue by presenting this survey, where we describe each of the five public datasets on neuroblastoma EHRs currently available in the scientific literature, report the quantitative characteristics and the clinical features of the five analyzed datasets, and highlight which important variables were present in the datasets and which ones were absent (subsection S1.1).
In this survey, we also introduce our Neuroblastoma EHRs Open Data Repository, an online catalogue containing the five datasets which can be used by researchers and computers worldwide for any scientific analysis. Unlike the INRG Data Commons (Volchenboum et al. 2017) and the Pediatric Cancer Data Commons (Plana et al. 2021), our repository's data can be accessed openly without any specific permission or project proposal, by anyone in the world at any time.
We believe our survey and data repository can be useful resources that facilitate new scientific discoveries about neuroblastoma and that can lead to an improvement of the conditions of the patients worldwide.
In the future, we plan to write surveys and to develop data repositories derived from electronic health records of patients with other diseases, such as sepsis (Chicco & Jurman 2020) or amyotrophic lateral sclerosis (Kueffner et al. 2019).

S1.1 ADDITIONAL CONSIDERATIONS
The initial work-up of a patient recently diagnosed with neuroblastoma usually includes a complete blood panel, with coagulogram, uric acid, electrolytes, kidney function, liver function tests, ferritin and LDH. LDH and ferritin are both correlated with a poor prognosis when elevated (Hann et al. 1985; Quinn et al. 1980). These blood tests are not included in the INRG classification system due to lack of specificity (Sokol & Desai 2019); nonetheless, the employment of these markers might enhance the granularity of risk stratification while adding a negligible overhead to the management of patients, since these lab tests are affordable and readily accessible even in underprivileged areas of the world. LDH can only be found in the dataCK2018 dataset, while ferritin is present in both dataBB2013 and dataCK2018.
Patients suspected of suffering from neuroblastoma should have a urine specimen collected, since catecholamines and their downstream metabolites are often elevated, which supports the diagnosis. Vanillylmandelic acid (VMA) and homovanillic acid (HVA) are both catecholamine metabolites that can be found in urine in up to 90% of children with neuroblastoma (Strenger et al. 2007). There is some evidence that these metabolites might aid in prognosis and that the dopamine (DA)/VMA ratio, in particular, might help in biological grading (Strenger et al. 2007).
Neuron-specific enolase (NSE), a specific marker for neurons and peripheral neuroendocrine cells, has been found to be increased in advanced neuroblastoma and is correlated with poorer prognosis (Georgantzi et al. 2018). NSE is only available in the dataCK2018 dataset, and it might also help with sub-group analysis.
Various segmental chromosomal aberrations have been found to be associated with poor prognosis. Loss of heterozygosity at 11q and 1p (Attiyeh et al. 2005) and gain at 17q (Lastowska et al. 1997) have been linked with poorer prognosis. Only 11q is included in the INRG classification, mostly because 1p loss and 17q gain are statistically associated with MYCN amplification (O'Neill et al. 2001). These genetic markers have been a key addition to risk stratification in the last decades, with MYCN amplification being one of the strongest predictors of high-risk disease.

Table 1. Each of these five open datasets is available as a CSV file, thus it can be opened with most spreadsheet software, including free open source ones such as LibreOffice Calc (The Document Foundation 2022), or processed with appropriate software packages on any computer, in R for example (Software Carpentry 2022).

Table 2 Meaning and values of the features of the dataBB2013 dataset. Number
OS: overall survival. PCDHB: Protocadherin Beta Cluster. PFS: progression-free survival. SFN: Stratifin gene. Additional information can be found in the dataset original article by Banelli et al. (2013).
Chicco et al. Data Science Journal DOI: 10.5334/dsj-2022-017

Table 3 Meaning and values of the features of the dataCK2018 dataset. Number of patients: 20. Number of features: 16. BM: bone marrow. CR: complete response. CT: chemotherapy. DT: differentiation therapy. F: female. gy: gray units.

Table 4 Meaning and values of the features of the dataEV2013 dataset.
The dataYBC2019 dataset is the only complete one among the datasets described in this report, without any missing or empty values. The data of the patients were collected from January 2012 to December 2014 in South Korea.
The clinical variables of this dataset include traditional factors such as age at diagnosis and MYCN status, as well as uncommon features related to the treatment (HDCT1 and HDCT2 chemotherapy, treatment prior to haploidentical stem cell transplantation). Moreover, this dataset contains multiple interesting factors related to cancer relapse: age at relapse, body sites of the cancer relapse, and interval to relapse. This dataset, however, includes no genomics or genetic variable about the patients' biological profile.
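Since the datasets are distributed as CSV files, completeness can be checked by counting the missing entries in each column. The following Python sketch does this with the standard library only. The column names, the tokens treated as missing, and the toy CSV content are assumptions made for illustration; they do not reproduce any of the repository files.

```python
import csv
import io


def missing_per_column(csv_text, missing_tokens=("", "NA")):
    """Count, for each column of a CSV document, the entries that are missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    missing = {name: 0 for name in fieldnames}
    for row in reader:
        for name in fieldnames:
            value = row[name]
            # DictReader yields None for cells absent from a short row.
            if value is None or value.strip() in missing_tokens:
                missing[name] += 1
    return missing


# Toy CSV content, not taken from the repository.
toy_csv = "age_months,MYCN,sex\n14,1,0\n30,NA,1\n7,0,\n"
missing = missing_per_column(toy_csv)
is_complete = all(count == 0 for count in missing.values())
```

To apply the same check to a downloaded file, the file contents can be read into a string and passed to `missing_per_column`; a dataset is complete in the sense used above when every count is zero.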