Introduction

Funders have been seeking wider access to the research data they fund for many years (MRC 2000; OECD 2007). In the UK, these aspirations coalesced in 2016 (RCUK 2016), and data sharing has become mandatory for some researchers (EPSRC 2014). However, in all cases funders’ policies make exceptions for confidential human subject data (University of Cambridge 2016).

In the context of clinical trials, the AllTrials campaign has been arguing for openness, with the slogan ‘All trials registered, all results reported’ (AllTrials 2013). But it falls short of demanding the sharing of individual patient data (IPD). OpenTrials (Goldacre and Gray 2016), the implementation arm of AllTrials, does not intend to include IPD, as it ‘often presents privacy risks that mean it cannot be simply posted online’ (OpenTrials 2016).

Recognising that sharing IPD should not be out of scope, the Institute of Medicine’s (IOM) ‘Committee on Strategies for Responsible Sharing of Clinical Trial Data’ has been working on the theme since 2012, culminating in a January 2015 report, ‘Sharing Clinical Trial Data: maximizing benefits, minimizing risks’ (IOM 2015).

In January 2016, the International Committee of Medical Journal Editors (ICMJE) responded with a widely-published proposal (Taichman et al. 2016) that requires authors to share with others the IPD underlying the results presented in the article.

They announced a year’s delay in implementation, to give time for the necessary ethical reviews and informed consents to be in place.

From De-identification to Anonymity

While there were some protests at the ICMJE proposal, these were concerned with the implied loss of scientific freedom (Lewandowsky and Bishop 2016) and included the controversial ‘research parasites’ editorial in NEJM (Longo and Drazen 2016). What was not questioned was whether this sort of data sharing in general, and de-identification in particular, was achievable.

The ICMJE considers de-identification to be a way of mitigating the risk of disclosure in IPD from consenting trial participants. IOM, on the other hand, is happy to consider the sharing of unconsented data – where the data subjects were not asked about data sharing, rather than refused – if data can be de-identified ‘sufficiently’ (IOM 2015: 144). This implication was not lost on Dal-Ré (2016).

The long IOM Appendix B entitled ‘Concepts and Methods for De-identifying Clinical Trial Data’ cites and relies on the successful deposition of the International Stroke Trial (IST) database (Sandercock, Niewada and Członkowska 2011, and see below). This is an implementation of a widely published BioMed Central article (Hrynaszkiewicz et al. 2010), which discusses the preparation of raw clinical data for publication, by the redaction of direct and indirect identifiers. This is analogous to the US ‘Safe Harbor’ method of de-identifying protected health care information (HIPAA 2010).

While Hrynaszkiewicz et al. (2010) recommend that ‘consent for publication should be sought’, it is not considered essential, and the article chooses to focus instead on ‘confidentiality and anonymity’.

The International Stroke Trial Database

The IST did not seek consent for data sharing from trial participants, and the de-identified data is available as Open Data from the University of Edinburgh data repository (Sandercock, Niewada and Członkowska 2011).

The deposited data for IST has 19435 participants, and 112 variables. It is a very large trial, but some of the variable counts give cause for concern. For example, while there are 6257 participants from the UK, and 3437 from Italy, there are only 9 from Japan and 2 from France, a man and a woman.
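The concern can be made concrete by counting participants in each category of a candidate indirect identifier and flagging the small cells. The following is a minimal sketch only: it uses just the four country counts quoted above (the other countries in the deposit are omitted), and the threshold of 10 and the function name `small_cells` are illustrative assumptions, not part of the IST release or of any published guidance.

```python
# Country counts quoted in the text; all other countries are omitted here.
QUOTED_COUNTS = {"UK": 6257, "Italy": 3437, "Japan": 9, "France": 2}


def small_cells(counts, threshold=10):
    """Return the categories whose count is below the assumed threshold,
    ordered from the smallest cell upwards."""
    flagged = {name: n for name, n in counts.items() if n < threshold}
    return dict(sorted(flagged.items(), key=lambda item: item[1]))


flagged = small_cells(QUOTED_COUNTS)
for country, n in flagged.items():
    # Once the country is known, a record is one of only n candidates.
    print(f"{country}: {n} participants, so 1 in {n} within that cell")
```

With these counts the sketch flags France and Japan, and leaves the UK and Italy unflagged.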

Perhaps the concern is misplaced: the IST authors have removed all 28 direct and indirect identifiers referenced in Hrynaszkiewicz et al. (2010) and there are millions of elderly men and women in France. However, while the checklist of direct and indirect identifiers does not include ‘Country’, the first indirect identifier has ‘Place of treatment or health professional responsible for care’. Therefore, the identity of the two French participants may only be protected by the fact that membership of the collaborative trial group was not published.

Subsequent guidance on anonymisation from the ICO makes the context-sensitive nature of indirect identifiers clearer (ICO 2012). In an anonymised dataset, data does not have to be aggregated into frequency records – a common misconception. Individual-level data records are permitted, and allowed to be unique, providing that the direct identifiers have been removed (name, address, DOB, NHS number etc., cf. HIPAA (2010)); and indirect identifiers are either removed or are put into classes that (taken in combination) reduce disclosure risk to an acceptably low level. An example may be to report age in age-bands, rather than as actual age; or location as the first part of the postcode rather than the whole.
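The effect of putting indirect identifiers into classes can be sketched by comparing the size of the smallest group of records that share the same combination of values, before and after generalisation (the quantity that the k-anonymity literature calls k). The records, the ten-year band width and all function names below are invented for illustration; this is an assumption-laden sketch, not the ICO's method, and a real assessment would also have to consider context.

```python
from collections import Counter

# Hypothetical records with direct identifiers already removed.
RECORDS = [
    {"age": 67, "postcode": "CB2 1TN", "sex": "F"},
    {"age": 63, "postcode": "CB2 3QZ", "sex": "F"},
    {"age": 71, "postcode": "EH8 9YL", "sex": "M"},
    {"age": 78, "postcode": "EH8 9AG", "sex": "M"},
    {"age": 74, "postcode": "EH8 7HF", "sex": "M"},
]


def age_band(age, width=10):
    """Report age as a band, e.g. 67 becomes '60-69'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def outward_code(postcode):
    """Keep only the first part of the postcode, e.g. 'CB2 1TN' becomes 'CB2'."""
    return postcode.split()[0]


def generalise(record):
    """Combination of indirect identifiers after generalisation."""
    return (age_band(record["age"]), outward_code(record["postcode"]), record["sex"])


def smallest_class(records, key):
    """Size of the smallest group of records sharing the same key."""
    classes = Counter(key(r) for r in records)
    return min(classes.values())


raw_k = smallest_class(RECORDS, lambda r: (r["age"], r["postcode"], r["sex"]))
generalised_k = smallest_class(RECORDS, generalise)
print(f"smallest class before: {raw_k}, after: {generalised_k}")
```

In this toy data every raw record is unique on its indirect identifiers, whereas after banding each record shares its combination with at least one other. Whether a given class size is ‘acceptably low’ risk remains a judgement that depends on the pool from which the records are drawn.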

Indirect identifiers depend on context. While gender is usually considered an unproblematic variable, in a dataset about breast cancers, the few male cases would stand out.

In order to define the context in which a variable is an indirect identifier, we need to answer the question ‘out of what set of people does a record have to be anonymous’? While tempting to answer ‘out of the whole population’, in a research setting the eligibility criteria and locations of subject recruitment are published, increasingly in Open Access journals. Therefore the upper bound for considering the anonymity of data records is that a record should be anonymous within the eligibility pool. And given that a subject’s participation in a study or trial might be known (e.g. through social media), a lower bound is that a record should be anonymous within the pool of subjects actually recruited.

As a footnote, the IST team advertised the availability of Open Data for a subsequent study, IST-3, in the Lancet (Sandercock et al. 2016). However, on the repository website (Sandercock et al. 2016) the data is embargoed until 2021, and is only available, if at all, to physical visitors ‘in order to comply with UK NHS Information Governance’.

The Current State of Clinical Data Sharing

The IST authors are not alone in being confused as to what they can or should share. Even relatively recent advice to researchers in the UK (Tudur Smith et al. 2015, referencing NHS HRA 2014) does not quite clarify to what participants are being asked to consent, suggesting that informed consent forms should include the phrase:

I understand that the information collected about me will be used to support other research in the future, and may be shared anonymously with other researchers.

There is a widespread belief in the value of open science and the possibility of anonymity, but with little experience of managing access to controlled data, the current state of sharing clinical trial IPD can be characterised in Figure 1 (Strom et al. 2014; Bierer et al. 2016).

Figure 1 

The Open Data Institute’s (ODI) Data Spectrum, showing the data sharing options most widely considered in clinical research. The ODI source image is CC-BY licensed.

Artificially constraining data sharing choices is problematic if the anonymity of Open Data fails or falls under question. This happened in 2008, when a forensic application with the potential to identify DNA in a pool of samples intruded into academic genetic research, suggesting that participants could be identified from published aggregate data (Homer et al. 2008). This led to the widespread removal of Open Data from public websites, with data placed under the control of data access committees (Zerhouni and Nabel 2008). There are now discussions to try to reverse the process (NHGRI 2016), to increase the availability of data and reduce the governance burden.

Failures of Anonymisation

There have been many high-profile cases of failures in anonymisation. These failures can be split into four classes, all investigated and referenced in Barth-Jones (2014):

  1. carelessness in redacting direct identifiers. An example is the Personal Genome Project distributing some sequence data in files named after the participants;
  2. carelessness in redacting or grouping indirect identifiers. An example is the potential to identify men, by surname, from their Y-chromosome data;
  3. cases where the records are individually anonymous, but cease to be in combination. An example is the potential to identify mobile phone users by linking multiple locations;
  4. cases where external knowledge can be linked to anonymous records to break that anonymity. While cases of linking media reports of prominent individuals to released datasets abound (e.g. NYC Taxi FOIL release, and health records in Iceland, UK and US), more subtle attacks exist – e.g. adding IMDb records to a Netflix release.
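A class 4 failure can be illustrated with a toy linkage between a de-identified release and outside knowledge. Everything in the sketch below is hypothetical: the records, the field names, the choice of linking keys and the `link` function are assumptions made for illustration, and real attacks of this kind typically use fuzzier matching than the exact comparison shown.

```python
# Hypothetical de-identified release: no names, but indirect identifiers remain.
RELEASED = [
    {"record_id": "r1", "age_band": "60-69", "district": "CB2",
     "admitted": "2015-03-02", "diagnosis": "stroke"},
    {"record_id": "r2", "age_band": "60-69", "district": "CB2",
     "admitted": "2015-06-17", "diagnosis": "stroke"},
    {"record_id": "r3", "age_band": "70-79", "district": "EH8",
     "admitted": "2015-03-02", "diagnosis": "stroke"},
]

# Hypothetical external knowledge, e.g. a media report naming a person.
EXTERNAL = [
    {"name": "Person A", "age_band": "60-69", "district": "CB2",
     "admitted": "2015-06-17"},
]

LINK_KEYS = ("age_band", "district", "admitted")


def link(released, external, keys=LINK_KEYS):
    """Map each named person to the released records that agree on every key."""
    matches = {}
    for person in external:
        matches[person["name"]] = [
            rec["record_id"]
            for rec in released
            if all(rec[k] == person[k] for k in keys)
        ]
    return matches


matches = link(RELEASED, EXTERNAL)
# A person is re-identified when exactly one released record matches.
reidentified = {name: ids[0] for name, ids in matches.items() if len(ids) == 1}
print(reidentified)
```

Dropping the admission date from the linking keys leaves two candidate records for the named person rather than one, which shows how much depends on what the other party happens to know.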

Failures in classes 1–3 are amenable to risk analysis. Failures in class 4 are less so, as we cannot know what other people know, nor the tools they will deploy (Ohm 2009). El Emam et al. (2015) suggest that claims of anonymity should be time-limited (to 18–24 months before review) to take account of changing technological capabilities. The ICO Anonymisation guide puts it like this (ICO 2012: 16):

You may be satisfied that the data your organisation intends to release does not, in itself, identify anyone. However, in some cases you may not know whether other data is available that means that re-identification by a third party is likely to take place.

The impact on participants should be noted. Whereas anonymisation advocates may be content with a small known residual risk (arrived at by testing for risks of class 1–3), the unknowability of risk of failure of class 4, and the widespread reporting of such failures, may persuade participants that their participation in research is not worth the privacy risks.

Alternatives to an All or Nothing Approach

Once clinical trialists accept de-identification as a process of risk reduction, rather than a guarantor of a state of anonymity, they are free to adopt the risk assessment model that has been used so successfully by the social science community. In the UK, the UK Data Archive (UKDA 2016) holds such data, and its approach can be summarised in Figure 2. Most (consented) individual-level data is distributed to registered users from registered host institutions using familiar website login mechanisms. But there is also provision to allow access to less heavily de-identified data, typically where some indirect identifiers remain, or where data is linked across sources, e.g. surveys linked to genetics.

There is a range of data access mechanisms: by licence, where data is made available to a user who can supply the relevant credentials and has fulfilled the necessary training requirements; by application to a data access committee, where a user presents a scientific justification for use of more identifiable data; and by use of a secure computing setting. These measures are combined: only a trained user with a scientifically approved proposal is allowed access to the secure computing resource.

Usage of these measures is shown in Figure 3 for one of the major studies hosted at UKDA (Understanding Society 2016). Different versions of the same dataset, e.g. with geospatial identifiers included or redacted, may appear in several access categories, with the goal that the most heavily de-identified data is the most accessed. These access rates compare very favourably to reported rates from e.g. Strom et al. (2014). And as new identifiability threats are found, risks can be re-assessed, and data may be moved up and down the spectrum: the Homer et al. (2008) problem (where aggregate data was shown to be potentially disclosive) would simply result in data moving from Anyone to Group-based Access, to use the ODI wording.

Figure 2 

The Open Data Institute’s (ODI) Data Spectrum, showing the data sharing options used by the UK Data Archive. The ODI source image is CC-BY licensed.

Figure 3 

The Open Data Institute’s (ODI) Data Spectrum, with figures overlaid of successful access requests by the Understanding Society study group in 2014. The ODI source image is CC-BY licensed.

Conclusion

There is a political agenda to de-identification – advocates claim it is possible, and that failures are the result of incompetence, a view that is heavily contested (Ohm 2009; Cavoukian and Castro 2014; Narayanan and Felten 2014).

There are incentives to make exaggerated claims of anonymity: primarily to enhance openness, and to reduce governance burden. However, this may have the unwanted side effect of denying participants’ legitimate expectations and concerns.

Regardless of public expectation, the inability to guarantee anonymity should reinstate the ICMJE proposal in full: that de-identified IPD should be shared responsibly, through managed access, and in line with informed consent.

Finally, there are models beyond an All or Nothing approach, once the possibility of anonymity is dismissed and risk assessment is addressed seriously.