Data access and analysis barriers within biomedical and social sciences research can arise for a variety of reasons including: i) ethical-legal restrictions surrounding confidentiality and the sharing of, or access to, disclosive data; ii) intellectual property or licensing issues surrounding research access to raw data; iii) the physical size of the data.
There are three processes by which individual level data (microdata) in biomedical research are typically shared or accessed (Table 1). Under a repository release model, data are released to researchers via, for example, (encrypted) hard drives, email, direct download, secure FTP, or cloud sharing and storage systems such as Google Drive. However, these release methods may not satisfy the privacy, ethical and legal restrictions, or the data security concerns, associated with these data. In such cases, the risks are mitigated by applying statistical disclosure limitation (Karr and Reiter, 2014; Shlomo et al. 2015) or anonymisation/pseudonymisation methods (Sweeney, 2002; Elliot et al. 2016) to the data prior to repository release.
|Repository release||Data are stored in a repository and released to users with or without governance controls.||NCDS (Power and Elliot, 2005), UK Biobank (Sudlow et al. 2015), UK Data Archive,1 European Genome-phenome Archive (Lappalainen et al. 2015).|
|Repository release mitigating disclosure||Repository releases data to users in a modified format to prevent disclosure.||Methods include: aggregation based on the microdata, data redaction/suppression, addition of noise, simulation data with the equivalent structure (Karr and Reiter, 2014; Shlomo et al. 2015); anonymisation/pseudonymisation of the data (e.g. Sweeney, 2002; Elliot et al. 2016).|
|Repository direct access-analysis||Users can analyse data stored in a repository. Restrictions on data extraction or analytic functionality may apply.||UK Data Service Secure Lab,2 UK SERP (Jones et al. 2016). Open source solutions include: DataSHIELD (Gaye et al. 2014; Wolfson et al. 2010), ViPAR (Carter et al. 2016).|
Under direct access-analysis models, users can analyse data within a closed virtual or physical environment (e.g. a secure analysis platform, data safe haven, virtual machine, distributed analysis platform), but may face restrictions on analytic functionality, or on the movement of data and/or outputs outside the environment, to prevent disclosure.
DataSHIELD3 has been created to address the additional requirement in the biomedical and social sciences to co-analyse microdata that may be sensitive from different sources, without physically sharing the data (Wolfson et al. 2010; Gaye et al. 2014). It is an infrastructure for distributed analysis that facilitates the direct access-analysis of repository data from multiple studies simultaneously.
The DataSHIELD infrastructure
Architecturally, DataSHIELD is built as a client-server model: the server contains the individual level data and sits with the data owner. This means it can sit behind the data owner’s firewall, with the data owner maintaining complete control over who is allowed to access the data and which commands are allowed to operate on them.
A researcher uses the client to issue requests for DataSHIELD commands to be run on the server. A command runs only if the researcher has been granted permission to run that particular command on that particular data set. All outputs returned from the server to the researcher are summary statistics and must adhere to the disclosure settings set by the data owner (or consortium); these settings are discussed in depth below (see: DataSHIELD non-disclosure mechanisms). It is this client-server pairing of functions, together with the built-in disclosure controls, that sits at the heart of DataSHIELD.
Based on this core modus operandi, DataSHIELD can be run in a variety of different data partition scenarios: no data partitioning (single-site DataSHIELD), horizontally partitioned data (multi-site DataSHIELD), and vertically partitioned data. In each case it is important to reiterate that all of the disclosure control works in the same way, regardless of the partitioning structure of the data. Each of these data scenarios is represented in Figure 1, and discussed further in the relevant sections below.
Core DataSHIELD architecture
DataSHIELD is built on two pieces of modular and open-source software: the R analytic environment (R Core Team, 2015) and the data warehouse Opal.4 For clarity, we refer to the server located behind the data owner’s firewall containing the data as the DataSHIELD server, and the client issuing analysis requests as the DataSHIELD client.
The DataSHIELD server comprises: Opal; a standard R environment preconfigured with the DataSHIELD server-side R packages; and an R parser that only allows DataSHIELD functions and their dependencies to be run. The DataSHIELD client is most commonly deployed as R Studio Server5 with the DataSHIELD client-side R packages (responsible for initiating DataSHIELD server-side functions) also installed. An approved researcher can access the DataSHIELD client from their web browser, without any software installation. Alternative DataSHIELD client infrastructure models are discussed in the sections below specific to different data partitionings.
DataSHIELD client-side functions are issued by the researcher through the R command line, coordinated by the DataSHIELD client and communicated to Opal on the DataSHIELD server via standard REST commands over HTTPS. The analysis on the microdata (i.e. running DataSHIELD server-side commands) occurs in the R environment behind the data owner's firewall after commands have been checked by the DataSHIELD R parser. The DataSHIELD client only receives low-dimensional summary statistics from the DataSHIELD server that are calculated from the individual level data; these are then communicated to the researcher. The optimal choice of summary statistics is function-specific, usually self-evident and, wherever possible, makes use of sufficient statistics, i.e. statistics that carry 100% of the information held in the data relating to the particular analysis being undertaken.
Hardware and software requirements
Hardware or virtual server requirements for DataSHIELD infrastructure implementation are relatively low. Servers in the system require a recent server-grade CPU (or ≥ 2 virtual CPUs), a minimum of 2 GB RAM (recommended > 4 GB RAM) and an appropriate amount of disk space for the dataset (approximately 10 GB for the operating system and 4 GB per 10,000 participants). The client runs optimally from a consumer grade CPU or ≥ 2 virtual CPUs. Within a DataSHIELD infrastructure, data processing servers or virtual servers benefit from utilising threaded, multi-core CPUs or multiple virtual CPUs as these ensure efficiency in running multiple DataSHIELD analysis sessions simultaneously. Similarly, increased RAM (> 8 GB) can be assigned for use by the analysis environment and/or the data management layer to increase efficiency in analysing or importing larger data sets respectively.
It is recommended that the DataSHIELD server infrastructure is deployed on a Debian derivative or an RPM based Linux distribution, with Java, MongoDB and/or MySQL, and R as software prerequisites. Implementation of the DataSHIELD infrastructure is flexible but is dependent on the data partitioning (Figure 1) and whether pooled analysis is required.
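The disk-space rule of thumb above can be written out as a short calculation. This is only an illustrative sketch in Python: the function name is hypothetical, and the two constants are simply the approximate figures quoted in the text rather than measured values.

```python
def estimate_server_disk_gb(participants: int) -> float:
    """Rough disk estimate for a DataSHIELD server, from the figures quoted in the text."""
    if participants < 0:
        raise ValueError("participants must be non-negative")
    os_gb = 10.0       # approximate operating-system footprint (from the text)
    gb_per_10k = 4.0   # approximate data footprint per 10,000 participants (from the text)
    return os_gb + gb_per_10k * participants / 10_000


# A hypothetical study of 25,000 participants.
estimate = estimate_server_disk_gb(25_000)
print(f"Estimated disk requirement: {estimate:.1f} GB")
```

Actual requirements will depend on the number and type of variables held, so the result should be read as a planning figure only.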
There is a wide variety of data formats and structures utilised in biomedical studies. It is common in longitudinal studies to store the canonical processed data in formats such as SPSS (.sav) or STATA (.dta) data files rather than databases. Historically, this would have been an easier mechanism to distribute their data to end users. In its current form, DataSHIELD is built to work with tabular data. As such, it is possible to import data and data dictionaries into Opal (on the DataSHIELD server) using a variety of file formats including comma-separated values (.csv), Microsoft Excel (.xls), SPSS data file (.sav) as well as SQL tables. Once imported, the data owner can then use the simple Opal web interface to manage their data availability in DataSHIELD, user permissions and define which analyses are allowed to be carried out on the data.
It is worth emphasising that Opal is a data warehouse to facilitate access to the data in this setting. As such only a copy, and not the canonical version, of the data is required. If necessary, it can hold a much reduced data set, e.g. with identifiers stripped out or data aggregated. Using Opal in this way mitigates issues around revoking user access and data deletion (e.g. due to withdrawal of consent), as these can be managed centrally.
Single-site DataSHIELD for data without partitioning
Single site DataSHIELD is akin to a secure data enclave allowing the analysis of data from one provider alone (Figure 2). Examples of use include enabling open access to simple descriptive statistics from a rigorously governed study and analytic access to sensitive (but not ultra-sensitive) datasets that have been linked through record linkage. Beyond this, single-site DataSHIELD also has applications as a free-ware and low-cost solution to accessing and analysing datasets from individual studies based in low and middle income countries – ensuring that the intellectual property and control of the dataset remains with the study.
In this simple case, the client portal may be located with the data owner alongside the server instance, or with a third-party e.g. the body responsible for governing access. Alternatively, it is possible to allow the researcher to be the DataSHIELD client in the system e.g. to run the DataSHIELD packages locally in R without running R Studio Server.
Multi-site DataSHIELD for horizontally partitioned data
To date, the most common implementation of DataSHIELD is for the co-analysis of harmonised data within a consortium, whereby the data is horizontally partitioned i.e. each study holds the same variables but different individuals. In this setting, DataSHIELD can be used for secured individual participant data meta-analysis or study-level meta-analysis.
Figure 3 summarises the DataSHIELD infrastructure used in a multi-site instance, with each data owner hosting a DataSHIELD server. Data owners will need to harmonise their variables in order for pooled analysis to work. This can be done prior to ingesting their data sets into their respective Opal instances, or new harmonised data sets can be derived from the relevant data already ingested into the Opal instances (Doiron et al. 2015; Fortier et al. 2016).
The DataSHIELD client can sit inside, or separate to, the firewall of any data provider in the consortium – or even at a third party locality such as a national data facility. Functions initiated by the researcher are coordinated by the DataSHIELD client for co-analysis across all studies. This process is function dependent and is detailed further in the section below (see: DataSHIELD analytic methodology). The DataSHIELD servers communicate with the DataSHIELD client alone, and not each other. They return low-dimensional, non-disclosive summary statistics to the DataSHIELD client, which are then processed for pooled analysis and communicated to the researcher. DataSHIELD multi-site analysis is typically fully efficient, acting as if the microdata were centrally warehoused and analysed collectively using conventional analytic methods (Jones et al. 2012).
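To illustrate why pooling summary statistics can be fully efficient, the following minimal Python sketch (not DataSHIELD code; all names are hypothetical) has each site return only a count, a sum and a sum of squares. The client combines these into a pooled mean and variance that match what a central analysis of the combined microdata would give. Disclosure checks, such as a minimum sample size, are omitted for brevity.

```python
from dataclasses import dataclass
from statistics import mean, variance
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class SiteSummary:
    n: int           # number of observations at the site
    total: float     # sum of the values
    total_sq: float  # sum of the squared values


def server_summary(values: Sequence[float]) -> SiteSummary:
    """Would run behind a data owner's firewall; only three numbers leave the site."""
    return SiteSummary(len(values), sum(values), sum(v * v for v in values))


def pooled_mean_variance(summaries: List[SiteSummary]) -> Tuple[float, float]:
    """Would run on the client; combines the per-site summaries."""
    n = sum(s.n for s in summaries)
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    pooled_mean = total / n
    pooled_variance = (total_sq - n * pooled_mean ** 2) / (n - 1)
    return pooled_mean, pooled_variance


# Invented microdata held separately at two sites.
site_a = [1.0, 2.0, 3.0]
site_b = [4.0, 5.0]

pooled = pooled_mean_variance([server_summary(site_a), server_summary(site_b)])
central = (mean(site_a + site_b), variance(site_a + site_b))
print(pooled, central)
```

The sum-of-squares formula is used here for clarity; it can lose numerical precision on large values, so a production implementation would likely use a more stable formulation.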
DataSHIELD for vertically partitioned data
A third application of DataSHIELD is currently being explored for secure analysis of sensitive vertically partitioned data. Applications include a record linkage setting e.g. with each study holding different variables from the same individuals. This version of DataSHIELD can potentially be used as a secure approach to undertake the record linkage process itself, or to enable statistical analysis when data are so sensitive that none of the data providers are willing for any single data provider – or even a trusted third party – to hold the combined data set once linkage has been completed.
Development of DataSHIELD for vertically partitioned data is at a much earlier stage than that for horizontally partitioned data, and may be implemented under a different architecture for direct study to study communication or through a client-server model. Additional disclosure protection is necessary in the form of sequential encryption and decryption in several settings including the passing of secure information between DataSHIELD servers, or in the case of a client-server implementation, between the DataSHIELD servers and the DataSHIELD client.
Analytic proof-of-principle has been demonstrated by implementing a generalised linear modelling (glm) algorithm across vertically partitioned data. As with development of multi-site DataSHIELD (Wolfson et al. 2010), once a methodology for fitting glms has been achieved, there is the immediate potential for extensions to many other classes of analysis in biostatistics. These will be reported in a forthcoming paper by the DataSHIELD team.
Crucially, the use of DataSHIELD for vertically partitioned data may be unnecessary if the data owners can authorise a pseudonymised version of the linked datasets to be sited with one of the data owners, or at a trusted third party facility such as the UK Secure eResearch Platform (UKSeRP) operated by Swansea University for the Farr Institute (Jones et al. 2014). In such a case, only a single-site DataSHIELD implementation would be needed to provide a privacy-protected analysis mechanism.
Additional data considerations
Regardless of the application and implementation of DataSHIELD, one must carefully consider all aspects of the data environment, specifically the context in which the data are held and the particular threats that may arise (Elliot et al. 2016). Key challenges include identifying:
- a secure location for the data to be held
- individuals that should maintain or manage the dataset
- a formal governance mechanism for data access via DataSHIELD
- optimal rules for disclosure protection
- whether contextual rules for disclosure protection are required – this is of particular importance when using text data.
DataSHIELD analytic methodology
The DataSHIELD analytic methodology is based on client-server function pairs (Gaye et al. 2014). From the DataSHIELD client, the researcher runs DataSHIELD client-side functions. These call DataSHIELD server-side functions that run on the individual level data and return only low-dimensional (non-disclosive) outputs, configured wherever possible as sufficient statistics. For example, for a generalised linear model (glm), the relevant outputs are score vectors and information matrices, and the resultant analysis is mathematically identical to placing all of the microdata in a single data warehouse and analysing those data using a standard glm (Jones et al. 2012; Jones et al. 2013).
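The equivalence for the glm can be seen from the standard Fisher scoring update. The notation here is an assumption introduced for illustration: $K$ studies indexed by $k$, coefficient vector $\beta$, and $U_k$ and $\mathcal{I}_k$ for the score vector and information matrix computed from the data of study $k$ alone. Because the log-likelihood of independent observations is a sum over individuals, the pooled score and information are simply sums of the study-level quantities:

```latex
\begin{aligned}
U\!\left(\beta^{(t)}\right) &= \sum_{k=1}^{K} U_k\!\left(\beta^{(t)}\right), \\
\mathcal{I}\!\left(\beta^{(t)}\right) &= \sum_{k=1}^{K} \mathcal{I}_k\!\left(\beta^{(t)}\right), \\
\beta^{(t+1)} &= \beta^{(t)} + \mathcal{I}\!\left(\beta^{(t)}\right)^{-1} U\!\left(\beta^{(t)}\right).
\end{aligned}
```

At each iteration the client would send the current $\beta^{(t)}$ to every server, receive $U_k$ and $\mathcal{I}_k$ in return, and apply the update. The sequence of estimates is therefore the same as that obtained by fitting the model to the combined microdata.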
In principle, any native R function or R package can be implemented in DataSHIELD, but any disclosure risk must be blocked. Some functions can therefore be implemented directly, while others require some components of their output to be changed or removed. For example, unlike the native R glm() function, the DataSHIELD equivalent ds.glm() function will not return regression residuals because they are disclosive. Similarly, the quantileMeanDS function that drives the estimation of means and quantiles will not permit the 0 % or 100 % quantiles to be returned because they are potentially disclosive. We are now developing methods to enable some of these disclosive outputs to be utilised in a non-disclosive manner, rather than blocking them e.g. methods for the glm residuals to be used for regression diagnostics.
Finally, there are some R functions with a primary purpose that is fundamentally disclosive and blocking that disclosure risk would negate the value of the function itself. For example, DataSHIELD cannot include the equivalent of R’s native print() function which lists every element of a designated data object – clearly this would be highly disclosive.
In practice there are three types of DataSHIELD analysis:
- One-step analyses – the client-side function requests non-disclosive output from all data sources e.g. ds.table2D (creates two dimensional contingency table) or ds.quantileMean (generates the mean and selected quantiles for a quantitative variable) (Gaye et al. 2014).
- Multi-step analyses – where the client-side function sequentially calls a number of server-side functions to be run for an analysis e.g. ds.histogram where the first step is to calculate the data bins for the histogram across all servers, and the second step calculates the frequency of each bin at each server.
- Iterative analyses – the DataSHIELD client coordinates parallel processes linked together by non-identifying summary statistics e.g. ds.glm (Wolfson et al. 2010, Jones et al. 2012; Jones et al. 2013; Gaye et al. 2014).
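The multi-step pattern can be sketched as follows for a pooled histogram. This is an illustrative Python sketch, not the ds.histogram implementation. All function names are hypothetical, and the outward rounding of each site's range in step one is an assumption made here to avoid returning exact minima and maxima.

```python
import math
from typing import List, Sequence, Tuple


def server_coarse_range(values: Sequence[float]) -> Tuple[int, int]:
    """Step 1 (server): return a range rounded outwards to whole numbers."""
    return math.floor(min(values)), math.ceil(max(values))


def client_bin_edges(ranges: List[Tuple[int, int]], n_bins: int) -> List[float]:
    """Step 1 (client): derive one common set of bin edges for all sites."""
    low = min(r[0] for r in ranges)
    high = max(r[1] for r in ranges)
    if high == low:  # guard against a zero-width range
        high = low + 1
    width = (high - low) / n_bins
    return [low + i * width for i in range(n_bins + 1)]


def server_bin_counts(values: Sequence[float], edges: List[float]) -> List[int]:
    """Step 2 (server): count observations per bin; only counts are returned."""
    counts = [0] * (len(edges) - 1)
    width = edges[1] - edges[0]
    for v in values:
        index = min(int((v - edges[0]) / width), len(counts) - 1)
        counts[index] += 1
    return counts


def client_pool(per_site_counts: List[List[int]]) -> List[int]:
    """Step 2 (client): add the per-site counts bin by bin."""
    return [sum(column) for column in zip(*per_site_counts)]


# Invented microdata held at two sites.
sites = [[1.2, 2.5, 3.1, 3.9], [0.4, 2.2, 5.7]]
edges = client_bin_edges([server_coarse_range(s) for s in sites], n_bins=3)
pooled_counts = client_pool([server_bin_counts(s, edges) for s in sites])
print(edges, pooled_counts)
```

A real implementation would also apply the data owner's disclosure settings to the bin counts before returning them.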
DataSHIELD v.4.06 currently includes approximately 140 client-side/server-side functions comprising core analytic functionality: descriptive statistics (e.g. mean); exploratory statistics (e.g. histogram); contingency tables (e.g. 1D and 2D); and modelling (survival analysis using piecewise exponential regression, glm). In addition to these, a number of functions are available for testing in our beta-test branch.
DataSHIELD non-disclosure mechanisms
The DataSHIELD architecture itself incorporates numerous measures aimed at mitigating the risk of sensitive data disclosure: microdata analysis occurs only behind the firewall at each data provider; typically each DataSHIELD server communicates solely with a single DataSHIELD client with a fixed IP address; the DataSHIELD client authenticates with the DataSHIELD server(s) using SSL certificates, with communications via secure web services (REST over HTTPS); and each DataSHIELD server contains an R parser configured to permit only DataSHIELD approved functions, with approved arguments, to be run on it. For example, the parser will block text strings as they may contain requests to activate subroutines.
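The idea of an allow-list check on incoming requests can be sketched in a few lines. The Python below is purely illustrative and does not reproduce the DataSHIELD R parser: the request representation, the allow-list contents and the argument pattern are all assumptions made for this example.

```python
import re
from typing import Sequence

# Hypothetical allow-list; in practice the data owner decides which functions may run.
APPROVED_FUNCTIONS = {"meanDS", "quantileMeanDS"}

# Assumed rule: arguments must look like plain symbols (e.g. D$age),
# so quoted text strings and nested calls are rejected.
SAFE_ARGUMENT = re.compile(r"[A-Za-z][A-Za-z0-9_.$]*")


def is_request_allowed(function_name: str, arguments: Sequence[str]) -> bool:
    """Return True only for an approved function called with symbol-like arguments."""
    if function_name not in APPROVED_FUNCTIONS:
        return False
    return all(SAFE_ARGUMENT.fullmatch(arg) is not None for arg in arguments)


print(is_request_allowed("quantileMeanDS", ["D$age"]))         # approved function, plain symbol
print(is_request_allowed("print", ["D"]))                      # function not on the allow-list
print(is_request_allowed("quantileMeanDS", ['"system(ls)"']))  # quoted text string is blocked
```

The design choice illustrated is to deny by default: anything not explicitly approved is refused, rather than trying to enumerate dangerous inputs.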
Within the DataSHIELD infrastructure, the data owner remains in primary control of who can analyse their data and in what way. Thus, DataSHIELD users should expect to apply for data access in an appropriate manner agreed by the individual data owner (or across the consortium as a whole). Only then may the putative user be given a logon for the DataSHIELD client, and the data owner will maintain the right (and ability) to block any individual user (or all users) at any time.
It is strongly advised that DataSHIELD should only be implemented in settings where a sound governance structure exists and the basic infrastructure holding all study data is already fundamentally sound and robust. For example, in our view, all potential users of sensitive data (via DataSHIELD or any other mechanism) should formally agree – at least via a user license or terms and conditions – not to try to identify any individual from the data they are analysing, and to acknowledge that sanctions will be applied if they do. However, a paradoxical consequence of this position is that while the long term aim of DataSHIELD is to enable governance thresholds to fall, thereby streamlining data access, in the short-term before DataSHIELD is well known and widely viewed with confidence, it is likely that access and ethics committees may demand a higher level of scrutiny than usual. Thus, applicants may need to seek approval for access to the microdata in the usual way, as well as seeking approval to use DataSHIELD on those data. As access via DataSHIELD is fundamentally less disclosive than having direct access to the microdata themselves, this position may be seen as being logically perverse but it is crucial that full respect and understanding is paid to the concerns of governance committees that are encountering a new approach for the first time.
The data to be made available via DataSHIELD are typically a small subset of the study data repository. It is best practice for data services such as DataSHIELD to be located separately from a study’s canonical data storage, and for the data made available on a DataSHIELD server also to be maintained separately. In addition, it is always recommended that the data held in Opal are pseudonymised to a level acceptable to the data owner. For example, all direct identifiers should be removed unless they are absolutely fundamental to the proposed analysis and their inclusion has been discussed and agreed with all data owners. To strengthen disclosure control, DataSHIELD allows the data owner to set (and control) a variety of optional privacy levels. These determine, for example, the minimum acceptable cell count in a contingency table returned to the DataSHIELD client and the maximum number of parameters that are allowed in a mathematical model relative to the number of observational units in a given study. In a co-analysis involving multiple data owners these privacy settings need to be discussed and determined by all data owners and analysts in the consortium. Although different studies may elect to use different privacy levels, this inevitably complicates statistical inference and is not recommended unless absolutely essential for governance purposes.
As each new function is developed and implemented in DataSHIELD, prevention of disclosure is the top priority. To include a pre-existing R function in DataSHIELD, all components of its usual output are scrutinised, with any potentially disclosive outputs removed or modified (see DataSHIELD analytic methodology). There are also some ad hoc statistical methods for non-disclosure built into certain DataSHIELD functions. For example, in the function ds.lexis (a multi-step function to facilitate data preparation for a piecewise regression analysis) the first step is for each study to return to the DataSHIELD client the calculated value of its maximum survival time with a random positive error added. This enables an identical set of survival epochs to be created in all studies, with certainty that even the longest survival time in any of the studies will be encompassed by the final epoch, and yet there is no need to reveal the precise – potentially disclosive – value of that maximum survival time.
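The two privacy levels mentioned above can be sketched as simple checks. This Python sketch is illustrative only and rests on stated assumptions: a table is withheld if any non-empty cell falls below the minimum count (so that a threshold of one imposes no restriction), and the parameter limit is expressed as a fraction of the number of observations. The function names, the fraction and the example counts are all invented.

```python
from typing import List, Optional

Table = List[List[int]]


def release_table(counts: Table, min_cell_count: int) -> Optional[Table]:
    """Return the table, or None if any non-empty cell is below the threshold."""
    for row in counts:
        for cell in row:
            if 0 < cell < min_cell_count:
                return None
    return counts


def model_allowed(n_parameters: int, n_observations: int,
                  max_parameter_fraction: float) -> bool:
    """Check the number of model parameters against the number of observations."""
    return n_parameters <= max_parameter_fraction * n_observations


# Invented 2x2 contingency table with one small cell.
example = [[12, 30], [2, 41]]

print(release_table(example, min_cell_count=5))  # withheld: a cell of 2 is below 5
print(release_table(example, min_cell_count=1))  # released: no restriction applies
print(model_allowed(n_parameters=5, n_observations=100, max_parameter_fraction=0.2))
```

Withholding the whole table, rather than masking individual cells, avoids the risk that a suppressed cell could be recovered from the margins.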
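The obscured-maximum step can be sketched as below. This is a minimal Python illustration of the idea, not the ds.lexis implementation: the function names, the noise distribution and its scale, and the fixed epoch width are assumptions made for the example.

```python
import math
import random
from typing import List, Sequence


def server_obscured_max(survival_times: Sequence[float], rng: random.Random,
                        noise_scale: float = 2.0) -> float:
    """Server side: return the maximum survival time plus a random positive error."""
    noise = noise_scale * (1.0 - rng.random())  # lies in (0, noise_scale]
    return max(survival_times) + noise


def client_epoch_breaks(obscured_maxima: Sequence[float],
                        epoch_width: float) -> List[float]:
    """Client side: build one set of epoch breakpoints covering every study."""
    upper = max(obscured_maxima)
    n_epochs = math.ceil(upper / epoch_width)
    return [i * epoch_width for i in range(n_epochs + 1)]


# Invented survival times held at two studies.
rng = random.Random(0)
sites = [[2.0, 7.5, 11.2], [3.3, 9.9]]
obscured = [server_obscured_max(times, rng) for times in sites]
breaks = client_epoch_breaks(obscured, epoch_width=5.0)
print(breaks)
```

Because the error is positive, the final breakpoint always lies at or beyond the true longest survival time, while the client never sees that exact value.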
Finally, every DataSHIELD analytic process can be logged and saved on the DataSHIELD server. These logs may therefore be monitored, manually or via data mining techniques, to flag potential disclosure risks e.g. where a user sends a series of related commands to subset and analyse the data, the combined outcomes of which could lead to sensitive data disclosure. To date no data providers have worked with these command logs, but going forward, this will be an important component of the security systems in DataSHIELD.
Existing applications of DataSHIELD
Following initial proof-of-principle (Wolfson et al. 2010; Jones et al. 2012; Gaye et al. 2014; Doiron et al. 2013) a stable platform has been developed (available under a GPL3 license) and the legal, ethical and social issues arising from the DataSHIELD approach to the analysis of biomedical data have been reviewed (Wallace et al. 2014; Budin-Ljøsne et al. 2014; Murtagh et al. 2012, 2016). DataSHIELD has been successfully piloted within two epidemiological projects in the FP7-funded BioSHaRE-EU consortium7 – co-analysing phenotypic data from separately located European biobanks investigating i) healthy obesity comprising 10 biobanks with 99 phenotypic variables (Gaye et al. 2014) and ii) the effect of environmental determinants on health comprising five biobanks with 51 phenotypic variables and 14 environmental variables extracted from exposure models (Cai et al. 2016; Zijlema et al. 2016).
A number of consortia are now in the process of implementing multi-site DataSHIELD in order to co-analyse harmonised horizontally partitioned data relating to nutrition (ENPADASI),8 diabetes (Interconnect)9 and intra-uterine determinants of child health and development and perinatal health services in Quebec and Shanghai, China (SPIRIT).10 Further interest in DataSHIELD has stemmed from a requirement to analyse: geospatial data linked to health data, national clinical audit datasets and commercially sensitive datasets.
Current DataSHIELD prototyping and integration
Increasing DataSHIELD functionality will facilitate its sustainability and broader use beyond biomedical research. Three distinct areas currently prototyped and discussed below are: applications in post-publication data access, text analysis and data visualisation.
Post-publication data access
The open science trend is seeing more funders and publishers requiring datasets underpinning biomedical research to be published alongside academic papers, or to be accessible for reuse by researchers (Boulton et al. 2011; Ross and Krumholz, 2013). Overall this encourages research and peer review transparency as well as reproducibility and data reuse (Kratz and Strasser, 2014). Difficulties in making sensitive datasets available in this way can arise from legal, ethical and governance concerns related to data privacy and security. Presently, any researcher wanting to replicate a published analysis would have to complete a data access request for each study whose microdata were used in the article. The current lack of ubiquity of DOIs for data extracts from repositories also means that there is no way of guaranteeing that a request for the same data will yield an identical dataset. This has obvious implications for reproducibility.
In collaboration with F1000Research,11 under the AMASED project (Access Methods for Analysing SEnsitive Data; Wilson and Burton 2015a; Wilson and Burton 2015b), the application of a single-site DataSHIELD was scoped to provide a means to replicate analysis published in a paper. F1000Research provided an existing peer reviewed paper and published tabular dataset (in SPSS .sav file format) pertaining to syphilis and HIV status of migrant and refugee women at the Thai-Myanmar border (McGready et al. 2015).
As described above, data and the associated variable/data dictionary from many different file types, including SPSS .sav files, can be imported into Opal on a DataSHIELD server. Once held in Opal, the data were available for analysis using DataSHIELD in a single-site infrastructure model similar to Figure 2. DataSHIELD was used to replicate three separate analyses identified in the original paper (Table 2). Re-analysing the published dataset in DataSHIELD produced identical values for seroprevalence of HIV, with the original paper reporting this as a percentage and DataSHIELD as an odds ratio. The prevalence of syphilis was deemed invalid in the DataSHIELD analysis, as data from fewer than five participants were returned in the result and the privacy level for the minimum cell count in a contingency table was, at that time, set by default at five. Subject to discussion between the journal, authors and data owners, there is no reason that the relevant privacy level could not be relaxed to three (the current default) or even to one (no restriction on minimum cell count). The latter setting may rationally be deemed appropriate for studying a rare condition, where low cell counts can be so common that any attempt to block them would seriously hinder the scientific analysis. In such a case, a decision has to be taken that balances the small and uncertain risk that a significant disclosure event may occur against the ‘damage’ to scientific knowledge, and to the broader interests of society as a whole, of prohibiting the analysis from taking place at all.
|Original description in paper||DataSHIELD command||DataSHIELD output|
|seroprevalence for HIV 0.47 % (0.30 – 0.76 95 % CI)||ds.glm()||0.004723534 (Odds Ratio); 0.002938389 (lower 95 % CI); 0.007584949 (upper 95 % CI)|
|seroprevalence for HIV (17/3599)||ds.table1D()||negative
|syphilis was lower in refugees (1/1469)||ds.table2D()||refugee – migrant status|
Applying DataSHIELD in academic publishing
The proposed implementation of DataSHIELD within academic publishing is intended to facilitate replication of analysis – and not new analysis – in the paper, through a restriction of DataSHIELD functionality (i.e. only a subset of functions would be available to run on the dataset in order to replicate the analysis). This is necessary when articles are published based on data from studies that have their own immutable legal or governance constraints that mandate formal oversight of full data access and new analysis provision e.g. most of the UK’s major cohort studies, biomedical studies and other sensitive datasets.
F1000Research already facilitate the publishing (with DOIs) of ‘Data Notes’ to describe new datasets and implement methods to cite and access data from their publications e.g. via direct download from the paper; data citation from the paper; data availability statement which can include a DOI or link to an external data source (e.g. figshare,12 national research data repositories etc). When revisions of a paper, or the data within the paper, occur these can be clearly identified in subsequent paper versions by the author(s) and given a new (but related) DOI. Working within the existing data and publication versioning framework of F1000Research it would be possible to manage, and make available, multiple versions of a dataset within the DataSHIELD infrastructure, for example through the hierarchical naming of data tables imported into Opal. In this way, a reader could re-analyse in DataSHIELD the version of a data set associated with a particular version of a paper; however, it is critical to provide transparency to the reader about these details.
We have demonstrated the utility of single-site DataSHIELD (with a single data source) in the context of post-publication data access. It provides a mechanism to fulfill the increasing requirements by research funders to make data more accessible, and can facilitate readers and reviewers to flexibly explore the data sets underpinning published articles without having physical access to the raw microdata. Further engagement with data owners, however, is required to ensure synchronicity of DataSHIELD post-publication data access with their own data governance processes.
Structured text data
The back end infrastructure and compatibility of structured text analysis within DataSHIELD have been scoped within two seed funded projects: AMASED (Wilson and Burton 2015a; Wilson and Burton 2015b) and BRISSKit13 (Butters et al. 2016).
Under AMASED, a DataSHIELD approach was scoped for application to digitised text held by the British Library. The British Library holds tens of thousands of digitised books, hundreds of thousands of digitised newspaper pages, and billions of web pages. Each month the library’s digital and digitised collection of text, image and audio-visual material grows by 6.8 terabytes (British Library Report, 2015). Many of these digitised materials are available to researchers as open data, however some are only available under license. The license used for each item stems from the copyright status of, and licensing agreements relating to, the digitised material. Some licenses limit researchers to analysing the data hosted onsite at the British Library or to only view/analyse a percentage of the data.
A test dataset of ~15,000 openly available digitised books was used in this scoping exercise. Each digitised book was a collection of well structured XML files, following the ALTO (open XML) schema. Each digitised book page was represented by an XML file, with each row of the XML file comprising one word and its metadata (Figure 4). This standardised format meant that a table could be generated in Opal to hold the data, with the relevant data types. This was achieved by using the Opal REST API to automatically build a table for each book. Each page was then iterated through, extracting the data from the XML and importing it into Opal using its REST API, the result of which is a flattened table structure (Figure 5, code available from Butters, 2016). Once held in Opal the data are available for analysis using standard R analysis packages – and any appropriate DataSHIELD R packages.
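The flattening step can be sketched as follows. This Python sketch is not the code referenced above. It assumes the common ALTO layout in which each word is a `String` element carrying `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes; the sample page, the function name and the output column names are invented, and the upload to Opal through its REST API is deliberately left out.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Union

Row = Dict[str, Union[str, int, float]]

# A tiny invented page in ALTO-like form (one text line, two words).
SAMPLE_PAGE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace><TextBlock ID="B1">
    <TextLine ID="L1">
      <String ID="S1" CONTENT="The" HPOS="10" VPOS="20" WIDTH="30" HEIGHT="12"/>
      <SP/>
      <String ID="S2" CONTENT="library" HPOS="45" VPOS="20" WIDTH="60" HEIGHT="12"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def flatten_page(xml_text: str, book_id: str, page_number: int) -> List[Row]:
    """Turn one page of ALTO XML into flat rows, one row per word."""
    root = ET.fromstring(xml_text)
    rows: List[Row] = []
    for element in root.iter():
        # Strip any XML namespace so the check works across ALTO versions.
        if element.tag.rsplit("}", 1)[-1] != "String":
            continue
        rows.append({
            "book_id": book_id,
            "page": page_number,
            "position": len(rows) + 1,  # order of the word on the page
            "word": element.get("CONTENT", ""),
            "hpos": float(element.get("HPOS", "nan")),
            "vpos": float(element.get("VPOS", "nan")),
        })
    return rows


rows = flatten_page(SAMPLE_PAGE, book_id="book-001", page_number=1)
for row in rows:
    print(row)
```

Each row could then be written to a table with matching column types, which is the flattened structure the text describes.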
The R package tm (Feinerer et al. 2008) was used to conduct simple word analyses, including word frequency and word length. These analyses demonstrated that, while some analyses can return aggregate or non-disclosive information (e.g. that in Figure 6), text data can be highly disclosive and may include identifiable information, e.g. first names and surnames in graphical outputs (Figure 7).
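To make the disclosure concern concrete, the sketch below computes word frequencies and a mean word length, and releases only words that reach a minimum count. It is an illustration in Python rather than the tm-based analysis used in the scoping exercise. The threshold rule is our assumption, made by analogy with the low-count suppression used in DataSHIELD plots, and `word_summary` and `min_count` are hypothetical names. Suppressing rare words reduces, but does not remove, the chance of releasing a name.

```python
from collections import Counter


def word_summary(words, min_count=3):
    """Return (released counts, number of suppressed words, mean word length)."""
    # Keep alphabetic tokens only and fold case before counting.
    cleaned = [w.lower() for w in words if w.isalpha()]
    counts = Counter(cleaned)
    # Release only words seen at least min_count times.
    released = {w: n for w, n in counts.items() if n >= min_count}
    suppressed = len(counts) - len(released)
    mean_length = sum(len(w) for w in cleaned) / len(cleaned) if cleaned else 0.0
    return released, suppressed, mean_length


# Invented token list; the rare tokens stand in for personal names.
words = ["the", "ship", "the", "sea", "Pip", "the", "sea", "sea", "Estella"]
released, suppressed, mean_length = word_summary(words, min_count=3)
print(released, suppressed, round(mean_length, 2))
```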
A second use case relating to structured text has been demonstrated through the integration and interoperability of DataSHIELD within the BRISSKit open source software stack (Butters et al. 2016) tailored for application to biomedical/clinical research. BRISSKit utilises i2b214 – a common open source clinical data warehouse. A key feature of i2b2 is that it tags each variable with an ontology code; this can be a standard ontology (e.g. SNOMED CT, which is one of the most comprehensive medical terminology references used internationally) or a bespoke one designed for the specific needs of the data. In either case, i2b2 presents the ontologies in a hierarchical manner, meaning that it is easy to infer information about a variable by looking at its parent ontology code – e.g. in SNOMED CT a parent of Syphilis is sexually transmitted infectious disease. Furthermore, when a standard ontology is used it is generally easy to look up additional information about a given code, e.g. through services like the BioPortal.15
In order to integrate i2b2 and Opal, a simulated clinical data set was exported from i2b2 and flattened. The export kept the relevant ontology codes, so the data maintained their full semantic description. Using the Opal REST API, a table and all the relevant variables were then automatically generated, and the data imported (source code available from Butters and Issa, 2016). A key point was that the ontology codes were imported into Opal as an attribute of the variable, so no descriptive information about the data was lost. This extra metadata would help reduce any ambiguity that DataSHIELD end users may encounter with short variable names: e.g. a variable labelled as ‘dressing’ may mean ‘can dress self’ or ‘has a surgical dressing’, and one labelled as ‘cold’ may mean ‘has common cold’ or ‘has cold sensation’. In each case the use of an ontology should disambiguate it.
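The idea of carrying an ontology code as an attribute of a variable can be illustrated with a small data structure. This is a sketch only: the `Variable` class is hypothetical and is not Opal's data model, and the codes shown are placeholders rather than real SNOMED CT identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    """A variable whose descriptive metadata travels with it as attributes."""
    name: str
    value_type: str
    attributes: dict = field(default_factory=dict)


def describe(variable):
    """Render a variable name together with its ontology annotation, if any."""
    code = variable.attributes.get("ontology_code")
    label = variable.attributes.get("ontology_label")
    if code is None:
        return f"{variable.name} (no ontology code; meaning may be ambiguous)"
    return f"{variable.name} = {label} [{code}]"


# Two variables share the short name 'cold'; placeholder codes tell them apart.
cold_illness = Variable("cold", "boolean", {
    "ontology_code": "DEMO:0001",  # placeholder, not a SNOMED CT code
    "ontology_label": "has common cold",
})
cold_sensation = Variable("cold", "boolean", {
    "ontology_code": "DEMO:0002",  # placeholder, not a SNOMED CT code
    "ontology_label": "has cold sensation",
})

print(describe(cold_illness))
print(describe(cold_sensation))
```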
Preparing DataSHIELD for text analysis
A vast range of text mining tools based on proprietary and open source software, including R, already exist (e.g. summarised in Feinerer et al. 2008; Miwa et al. 2012; Rak et al. 2012; Paynter et al. 2016), many of which can be implemented to provide remote and/or distributed analysis of typically open text sources. Examples of privacy preserving text mining tools are dominated by applications within a healthcare setting, particularly with respect to electronic patient/hospital records and the de-identification of text (Dehghan, 2015; Meystre et al. 2010; Zhou et al. 2015). The successful import of structured text and clinical ontology data into DataSHIELD as presented here, combined with the modular nature of the infrastructure, would make it possible to integrate and utilise existing open source text mining tools to give DataSHIELD users increasing functionality. Implemented in this way, DataSHIELD has the potential to facilitate co-analysis of multiple data sources and associated data types whilst protecting against disclosure. For example, within a biomedical setting, observational data could be co-analysed with text from electronic health records. Additionally, as demonstrated with the British Library books example, DataSHIELD can be used in an environment in which intellectual property is a limiting factor to data access.
DataSHIELD data visualisation
Existing and prototyped DataSHIELD data visualisation functionality is designed to represent the relationships between different variables whilst preserving their statistical properties and protecting data privacy. In this section we report two new developments in privacy protected data visualisation, applied to i) graphical outputs and ii) data visualisation in virtual reality.
The current release of DataSHIELD (v4.0) has a limited set of plotting functions for the non-disclosive graphical illustration of the statistical properties of the data (i.e. distributions and correlations). The protection of sensitive information is achieved by the suppression of cells, grids or bins with low counts in the generation of histograms, contour plots and heat map plots. An updated version of the existing graphical functions, including new prototyped functions for scatter plots (code available from Avraam and Wilson, 2017) and box plots, utilises statistical disclosure limitation approaches to mask the microdata. One approach is based on the k-Nearest Neighbours algorithm (Wu et al. 2008), which searches for the (k-1) nearest neighbours (having minimum metric distances) of each observation and then replaces its coordinates with the coordinates of the centroid between itself and its nearest neighbours. The method retains the original data structure and features (see an exemplar scatterplot in Figure 8) and ensures a privacy protected analysis. Evaluation and implementation of this, and alternative methods, are outside the scope of this paper and are included in a forthcoming paper (D. Avraam, pers. comm.).
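The centroid replacement step of the k-Nearest Neighbours approach can be sketched as below. This is a brute-force illustration and not the prototype code from Avraam and Wilson (2017): Euclidean distance, the value of k and the function name `knn_centroid_mask` are all assumptions, and the sketch makes no claim about the level of protection achieved for any particular k.

```python
import math


def knn_centroid_mask(points, k=3):
    """Replace each 2-D point by the centroid of itself and its k-1 nearest neighbours."""
    if k < 2 or k > len(points):
        raise ValueError("k must be between 2 and the number of points")
    masked = []
    for i, (xi, yi) in enumerate(points):
        # Euclidean distance from point i to every other point, nearest first.
        others = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        # The point itself plus its (k-1) nearest neighbours.
        group = [i] + [j for _, j in others[:k - 1]]
        cx = sum(points[j][0] for j in group) / k
        cy = sum(points[j][1] for j in group) / k
        masked.append((cx, cy))
    return masked


# Invented data: three clustered points and one outlier.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
masked = knn_centroid_mask(points, k=3)
for original, replacement in zip(points, masked):
    print(original, "->", replacement)
```

In this toy example the outlier is pulled towards the cluster, which shows how the masking hides an individual extreme value while the plotted points still follow the overall shape of the data.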
The complexity and size of biomedical data often demand a coherent representation of multidimensionality in a form that is easily and quickly interpretable by humans. New technologies such as Virtual Reality (VR) can play a key role with regards to data exploration and also have applications for public engagement. Such visualisation can only be implemented if measures to protect data privacy are included; DataSHIELD could provide such a mechanism.
Working with industry partners in computer games development – Masters of Pie16 and Lumacode17 – a VR data visualisation and exploration software (vARC)18 has been prototyped for application to a complex, longitudinal dataset simulated from the ALSPAC birth cohort study19 (Figure 9). The vARC prototype was the winning entry in the 2015 EPIC Games Wellcome Trust Big Data VR Challenge,20 demonstrating that VR can provide an intuitive and easily navigable environment enabling users to rapidly explore expansive views of the data and drill-down to fine granularity.
We are presently scoping how DataSHIELD can be integrated with this VR software within a data visualisation pipeline while ensuring the statistical structure and properties of the visualisations represent the microdata without disclosure. We also hold interests in applying the prototyped DataSHIELD privacy protected graphical visualisation methodologies described above to the VR environment.
DataSHIELD applications for data visualisation
Disclosure control in graphical outputs utilising sensitive biomedical data is usually mitigated by analysis restrictions placed in the terms and conditions of data use. Measures may include the agreement that no plots can be created, that plots are not allowed to be published or that plots cannot be taken away by the researcher. In cases where analysts are allowed to prepare plots, it must not be possible to reconstruct the graphic without access to the microdata (Hundepool et al. 2012). Present developments of privacy preserving graphical functionality are based on statistical disclosure limitation (Karr and Reiter, 2014; Shlomo et al. 2015) or secure multi-party computation (Yuan et al. 2015) approaches. The DataSHIELD prototype scatter plot function described above is based on a statistical disclosure limitation approach and can be implemented in DataSHIELD.
VR software is currently dominated by applications for entertainment and gaming purposes. Existing business or research data visualisation applications tend to be focussed on exploiting the VR environment for representation of spatial data such as engineering drawings, structures and mapping (Berg and Vance, 2016; Boulos et al. 2017; Sastry and Boyd, 1998; Seth et al. 2011). Within health and medical sciences VR is used for predominantly therapeutic or rehabilitation applications (e.g. after stroke), typically utilising spatial information (Howard et al. 2017; Iruthayarajah et al. 2017). VR for visual analytics still sits in a very niche area of applications, with just a few limited examples outside our own developments existing in the literature (Coffey et al. 2011; Donalek et al. 2014) and industry.21 This is reflective of the challenges that exist in this area that include: balancing the availability of cost effective and efficient computing resource; combining methods for big data processing and analysis; exploiting the VR environment for representation of multidimensional data; and overcoming the limitations of human perception and cognition to facilitate human interaction with virtual objects (Olshannikova et al. 2015).
Rapid technological development has vastly increased the scale and complexity of data routinely collected in biomedical studies. To optimise return on scientific investment, such data must be readily discoverable and accessible, placing a high value on rapid, intuitive ways to visualise data. Combined with DataSHIELD, emerging VR technologies can undoubtedly provide a powerful way to discover, explore and interpret big and complex data whilst maintaining data privacy.
As a result of the associated legal, ethical and governance restrictions surrounding the use and sharing of biomedical and health data, access to analyse these data is often via either a closed secure platform (e.g. a data safe haven), into which data are imported for researchers to use but from which they cannot remove or download any data, or a distributed analysis network.
Within the UK, the Secure Anonymous Information Linkage (SAIL) system and the Scottish Health Informatics Programme (SHIP) are examples of data safe havens that enable approved researchers to remotely analyse anonymised/pseudonymised microdata and linked Welsh and Scottish health records, respectively, without the researchers downloading the original data itself or removing data from the environment (Ford et al. 2009; Lyons et al. 2009; SHIP Report, 2012). NHS Digital provides access to national data from the UK National Health Service and provides a registry of safe havens22 through which researchers can connect to and analyse these data. Examples from other countries include the Australian SURE project23 and the NIH funded iDASH project that has created tools for secure data access, data analysis and privacy-preserving data sharing (Ohno-Machado et al. 2012).
Alternative approaches for the analysis of population health studies and health data based on distributed database networks have been developed using both proprietary (e.g. Brown et al. 2010a; Brown et al. 2010b) and open source software (e.g. Carter et al. 2016; Narasimhan et al. 2017). The Canadian Network for Observational Drug Effect Studies (CNODES, Suissa et al. 2012) and Mini-Sentinel (a safety surveillance system developed by the U.S. Food and Drug Administration, Platt and Carnahan, 2012) are both platforms to facilitate the running of analysis requests from approved users locally, along with disclosure checks, prior to securely combining the results centrally as a meta-analysis.
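The final combination step in such networks can be illustrated with a fixed-effect, inverse-variance meta-analysis. This is a sketch under stated assumptions: the source does not say which meta-analysis method CNODES or Mini-Sentinel use, the site estimates are invented, and `fixed_effect_meta` is a hypothetical name. Each site returns only an estimate and its standard error, which are then pooled centrally.

```python
import math


def fixed_effect_meta(estimates):
    """Pool (estimate, standard_error) pairs using inverse-variance weights."""
    weights = [1.0 / (se ** 2) for _, se in estimates]
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


# Invented per-site results: (effect estimate, standard error).
site_results = [(0.20, 0.10), (0.30, 0.10), (0.25, 0.05)]
pooled, pooled_se = fixed_effect_meta(site_results)
print(round(pooled, 4), round(pooled_se, 4))
```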
Each of these solutions meets a requirement to securely access or analyse sensitive microdata with disclosure controls, but there are limitations to these approaches. Data safe havens represent a major investment in informatics infrastructure and, by their very nature as a centralised data warehouse, they have limited application for a co-analysis of multiple data sources, i.e. all data sources would have to include their data in the safe haven. Data safe havens are often closed (not open source) systems, meaning they have to be treated as a black box to some extent. This may make it difficult to reproduce the analysis in the future, as not all aspects of the system will be known to the end user. In addition, it can be difficult for users to contribute to the development of the system or find bugs if they cannot see the underlying code. Data custodianship may also be a concern, as data are deposited into these systems and are outside the immediate control of the data owner. Related issues around the devolvement of management of data access, as well as keeping the deposited data in sync with the master data sets, may also arise. Finally, some safe havens charge a fee for use – whilst this may help the sustainability of a safe haven, cost may be a barrier to some users.
Distributed analysis networks avoid the requirement to implement, maintain, and manage access to a centralised data warehouse. They can be built from licensed or open source software, and have the capacity to be free at the point of use. Under distributed analysis networks, data remain under the control and management of data owners, however there may be delays in returning results to users. Analysis requests may take of the order of days, weeks or even longer to complete, as each data source is required to perform an analysis on its own data and check the output for disclosure before it can be combined with other sources. Further delays can arise from difficulties gaining access to the data or obtaining approvals for individual studies.
ViPAR uses an alternative co-analysis approach, utilising a central server to securely virtually pool data from distributed sources into memory to perform an analysis, deleting them on completion (Carter et al. 2016). It enables greater analytic flexibility for researchers than DataSHIELD, allowing them to maintain control of their analysis, with the ability to analyse anonymised data using scripts written in open source (R) and licensed (SAS, STATA) statistical software. Unlike DataSHIELD, however, ViPAR currently does not offer additional disclosure controls but has been successfully applied to population health studies (Schendel et al. 2013) with collaboration agreements in place and adherence to the stringent governance procedures of individual data owners.
Advantages and limitations of DataSHIELD
DataSHIELD provides an access-analysis solution to sensitive biomedical datasets – for either a single or multiple data sources – using a combination of computational and statistical controls implemented to prevent information disclosure. For example, communications across the DataSHIELD infrastructure only convey non-disclosive information, e.g. analysis requests, summary or sufficient statistics. This adds an additional layer of protection within the system compared to other solutions, since even if this communication were somehow intercepted it contains no identifying information. The safeguarding of intellectual property, governance procedures and data disclosure concerns of the data owner(s) means that DataSHIELD can potentially ‘lower the bar’ for the governance processes of such studies, thereby shortening the time taken for approved users to gain access to health-related microdata.
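The principle of exchanging only sufficient statistics can be sketched as follows. This is an illustration in Python rather than DataSHIELD's R client-server functions: the function names are hypothetical and the minimum sample size of 5 is an illustrative value, not a DataSHIELD setting. Each site releases a count, a sum and a sum of squares, from which the client recovers the pooled mean and variance without seeing any individual value.

```python
def site_summary(values, min_n=5):
    """Server side: release sufficient statistics only if enough observations exist."""
    n = len(values)
    if n < min_n:
        raise ValueError("too few observations to release a summary")
    return {"n": n, "sum": sum(values), "sum_sq": sum(v * v for v in values)}


def pooled_mean_variance(summaries):
    """Client side: combine per-site sufficient statistics into a mean and sample variance."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean ** 2) / (n - 1)
    return mean, variance


# Invented microdata held at two sites; only the summaries leave each site.
site_a = [1.0, 2.0, 3.0, 4.0, 5.0]
site_b = [6.0, 7.0, 8.0, 9.0, 10.0]
mean, variance = pooled_mean_variance([site_summary(site_a), site_summary(site_b)])
print(mean, round(variance, 4))
```

The pooled result equals what would be obtained from the combined microdata, which is why this kind of analysis does not require the data to be physically pooled.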
Similar to other distributed analysis network methods, under DataSHIELD the data remain with the data owner, who maintains full control of the data and access permissions. This means all local governance processes can be adhered to at all times, e.g. if a participant withdraws consent and is removed from a DataSHIELD server then that change will be reflected in all analyses connected to that server. This does highlight the issue of versioning of data – changing the number of participants in a data set will change the science. End users need to be aware when this happens, and if publications have been written based on a given data version then that version should be archived. This is not a problem limited to DataSHIELD – all of the alternative approaches have to address this, and it is best resolved by policies rather than new technologies.
An advantage of DataSHIELD is that researchers are given greater analytic flexibility – they are able to make complete use of microdata in their analyses, without seeing or downloading it. Additionally, as users are able to control their own analyses they do not experience the delays (of days, weeks or longer) associated with existing methods for meta-analysis that rely on third parties to perform their analysis requests at each study. In the most common implementation of the DataSHIELD client, which includes RStudio Server, once approval has been granted users only require a modern web browser and internet access to connect and start analysing data. Being browser based, analysis can be conducted across a range of operating systems without the need for high specification computer hardware – we have successfully used DataSHIELD for analysis from Windows, Linux, MacOS, Android and iOS.
Unlike closed systems, by using an entirely open source stack there are no operating or software licence costs incurred by data owners for implementation, and no costs at the point of use. Further to this, in keeping with the aims of open and reproducible research, end-users can interrogate the DataSHIELD software, report bugs/issues as well as submit solutions and new functionality for consideration to be included in DataSHIELD.
The main limitations of DataSHIELD are due to the way in which it is built. Each statistical function in DataSHIELD – including the adaptation of standard R functions – has to be written from the ground up to work with the infrastructure (i.e. be in client-server pairs) and has to have all of the DataSHIELD methods for disclosure control incorporated into it. As they are more complex (and take more time) to develop and test, there is a limited number of functions currently available. Engagement to grow the DataSHIELD community has encouraged contributions to this open source project and is essential for the longer term software sustainability. We have had several new pieces of development contributed by end users.
DataSHIELD is built exclusively for the open-source R analytic environment. Not all researchers use R, and instead may be more familiar with licensed statistical software such as SPSS or STATA. This can mean there is a steep learning curve for some DataSHIELD users. R, however, is well established with an active development community and numerous packages that can give DataSHIELD developers the flexibility to create additional functionality, including the analysis of additional or new data types (e.g. images, text and ‘omics data) that may not be possible in the other commonly used statistical software.
Analysis in a co-analysis or a vertical DataSHIELD setting will always take marginally longer than in a single-site instance. This is because the speed of analysis is limited by the network latency or lowest specification hardware or virtual server in the whole system. For one-step analyses this should have a negligible impact, but for multi-step and iterative analyses this may be more noticeable. The effect will be most noticeable when one (or more) of the data providers has a particularly slow connection or lower server specification compared to the others, since each step of the analysis has to finish at every site before the analysis can progress to the next step. Certain methods employed to prevent statistical disclosure also require marginally longer computing time. For example, additional processing is required to compute the k-Nearest Neighbours based algorithm to populate a scatter plot from multiple data sources. This takes a few seconds longer than creating a scatter plot and is not a limitation of DataSHIELD per se – but the consequence of the statistical technique to prevent data disclosure.
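The effect of a slow site on a multi-step analysis can be expressed with a simple timing model, in which each step takes as long as the slowest site and the steps run in sequence. This is a simplified sketch, not a measurement: the timings are invented, `total_analysis_time` is a hypothetical name, and client-side combination time is ignored.

```python
def total_analysis_time(step_times_by_site):
    """Sum, over steps, the time taken by the slowest site at each step."""
    n_steps = len(next(iter(step_times_by_site.values())))
    return sum(
        max(times[step] for times in step_times_by_site.values())
        for step in range(n_steps)
    )


# Invented per-step times in seconds for three sites and three analysis steps.
step_times = {
    "site_a": [1.0, 1.0, 1.0],
    "site_b": [1.0, 3.0, 1.0],
    "site_c": [2.0, 1.0, 1.0],
}
print(total_analysis_time(step_times))
```

Under this model the co-analysis can take longer than the total time of any single site on its own, because a different site may be the slowest at each step.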
Contemporary bioscience depends critically on the effective access, sharing and exploitation of “big” and “complex” data. At the same time the legal, ethical and data governance requirements associated with the data must be adhered to, without hindering the research process. We have shown that DataSHIELD uniquely provides a mechanism for the (co-)analysis of sensitive data by building in statistical disclosure controls and security measures to meet the requirements of data owners. Unlike existing approaches, DataSHIELD does not require the setup of the substantial infrastructure (technical and social) that is necessary for a closed repository or data safe haven. It is this unique placement and flexibility that we believe makes DataSHIELD an attractive solution for data owners who have a requirement to make their data more widely available, but may not be able to deposit it in a closed (or public) repository.
A key strength of DataSHIELD is that it avoids the serious inferential and analytic shortcomings of approaches aimed at rendering data truly anonymous, which often discard or distort information that may be of analytic relevance. For the researcher, DataSHIELD can reduce data governance restrictions (giving overall quicker access to the data), and can reduce the time taken for co-analysis – unlike existing approaches, all DataSHIELD analytic functions have disclosure control built in and do not require manual or third party disclosure checks. By using a completely open software stack with flexible components, additional functionality and the ability to process additional data types (such as those highlighted in this paper) can be built with automated disclosure controls and incorporated into DataSHIELD for users.
In its current form DataSHIELD has been shown to have a firm foundation in the biomedical domain as evidenced by its use in various international projects. In this paper we have demonstrated that it has potential applications in other domains, where disclosure control or data sensitivity is important. We also demonstrate the utility of DataSHIELD within the wider research data cycle such as academic publishing. In the next phase of DataSHIELD we will build on the prototype work outlined here and broaden out in both domain and scope to help reduce the barriers to transparency and reproducibility in biomedical research, and enhance the discoverability and usability of associated data.