Web-based persistent identifiers have been around for more than 20 years, a period long enough for us to start observing patterns of success and failure. Persistent identifiers were invented to address challenges arising from the distributed and disorganised nature of the internet, which not only allowed new technologies to emerge but also made it difficult to maintain a persistent record of science (Dellavalle et al. 2003; Lawrence et al. 2001). The resulting loss of references, also dubbed “link rot”, affects all digital resources on the web, including research data (Vines et al. 2014).
It has been argued that “link rot” can be avoided by careful management of web servers to keep URLs stable over a long time, a principle called “Cool URIs” (Berners-Lee 1998). The use of Cool URIs has been proposed for semantic applications, and it has been questioned whether DOIs are necessary in a world of Cool URIs (Bazzanella, Bortoli, and Bouquet 2013).
“Pretty much the only good reason for a document to disappear from the Web is that the company which owned the domain name went out of business or can no longer afford to keep the server running.” (Berners-Lee 1998).
Unforeseen by Berners-Lee, the “dot.com bubble” burst a few years after his statement and many companies went out of business, leaving many web domains orphaned. Other companies were acquired and merged into existing entities, sometimes losing their original web domain in the process.
One way of addressing the root problem of the persistence of locators on the web was the introduction of persistent identifiers, which separate the identity of an object from its location on the web (see e.g. Arms 1995; Lawrence et al. 2001; Lynch 1997). Adding a system to ensure global uniqueness makes persistent identifiers a tool for the unambiguous identification of resources on the net. The expectation was that persistent identifiers would lead to greater accessibility, transparency and reproducibility of research results. The discussion of PID vs. Cool URI in Bazzanella, Bortoli, and Bouquet (2013) shows that persistent access to web resources is not merely a technical question, but rather a “social contract” that needs to be entered into by the stakeholders aiming to maintain persistent references to objects on the web.
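The separation of identity from location described above can be illustrated with a minimal sketch. The class below is a toy resolver and not the implementation of any real PID system; the class name, method names, identifier string and URLs are all hypothetical. It only shows the two properties named in the text: an identifier is assigned once (uniqueness), and the location behind it can change without changing the identifier.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Resolver:
    """Toy resolver (hypothetical): maps a stable identifier to its current location."""

    _table: Dict[str, str] = field(default_factory=dict)

    def register(self, pid: str, url: str) -> None:
        # Uniqueness: an identifier may be assigned only once.
        if pid in self._table:
            raise ValueError(f"identifier already registered: {pid}")
        self._table[pid] = url

    def update_location(self, pid: str, new_url: str) -> None:
        # The object moved: only the location changes, never the identifier.
        if pid not in self._table:
            raise KeyError(pid)
        self._table[pid] = new_url

    def resolve(self, pid: str) -> str:
        # Return the current location; raises KeyError for unknown identifiers.
        return self._table[pid]


# Hypothetical identifier and URLs, for illustration only.
resolver = Resolver()
resolver.register("example:123", "https://old.example.org/data/123")
resolver.update_location("example:123", "https://new.example.org/d/123")
```

A citation that records `example:123` keeps working after the move, provided someone keeps the table up to date. That maintenance duty is the “social contract” part, which no amount of code can supply.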
In this paper we review 20 years of persistent identifier practice and the uptake of different persistent identifier systems. In a series of case studies we characterise well-known persistent identifier systems, assess their successes and failures, and extract what can be learned from these examples.
Uptake of Persistent Identifiers
One way to assess the success of particular identifier systems is to survey their adoption by research data repositories. This might seem like a straightforward approach, but it turns out to be difficult to define a measure for the uptake of persistent identifier systems because the sizes of research data repositories and the granularity of the objects they identify vary by orders of magnitude. Our analysis of the uptake of persistent identifier systems by research data repositories is based on data from the Registry of Research Data Repositories (re3data.org). re3data.org is a global registry of research data repositories that covers all academic disciplines. The registry arose from two separate projects, re3data.org (Pampel et al. 2013) and DataBib (Witt and Giarlo 2012), and is now managed by DataCite. The total sample we obtained from re3data.org in December 2015 listed 1381 repositories. Out of this total, a subset of 475 repositories used some type of persistent identifier. Note that some repositories use more than one type of persistent identifier. Figure 1 summarises the persistent identifier types used by repositories listed in the re3data.org database.
The focus of re3data.org is on research data repositories and despite its size the re3data.org registry does not claim global coverage. Still, its catalogue can be considered to be a representative overview. Not covered by re3data.org are collections of research specimens like herbaria or cultural artefacts, which might also use persistent identifiers for the identification of items in their collections. Furthermore, identifiers are also used outside of research, for example in the identification of companies on stock exchanges. In this paper we will only discuss the use of persistent identifiers in the context of research data and the record of science.
In its standardised descriptions of research data repositories, re3data.org distinguishes between only a few identifier systems (Rücknagel et al. 2015). Digital Object Identifiers (DOI) are clearly the most widely adopted persistent identifier in research data repository systems. Figure 2 differentiates persistent identifiers used by three kinds of repositories, namely disciplinary repositories, institutional repositories and repositories that fall into neither of these two categories. Notable is the relatively frequent use by disciplinary repositories of “other” persistent identifier systems, which are not differentiated in the re3data.org description. This points to an important role of discipline-specific identifiers.
Again, DOIs are by far the most used persistent identifiers. “Other” types of identifiers seem to play an important role in disciplinary repositories, and 57 of these repositories serve the life sciences. This indicates a special role that “other” identifier systems play in particular disciplines. Institutional repositories frequently use identifiers based on the Handle system, which may be due to the fact that some institutional repository software, like DSpace (Smith et al. 2003), uses Handle-based identifiers to identify objects in its holdings.
Years of Crisis
While persistent identifiers are being adopted and implemented by a growing number of data archives, not every PID system has been a success story in recent years. A few PID systems even experienced problems severe enough to cause a temporary shutdown of some of their core services, which in turn led to orphaned, unresolvable or unmanaged PIDs.
As mentioned above, the years 2015 and 2016 turned out to be years of crisis for some persistent identifier systems, in particular for Persistent URL (PURL) and Life Science Identifiers (LSID). While PURL seems to have gained a new lease on life through transferring to a new organisational and technical base, the future of LSID as a resolvable persistent identifier seems uncertain.
PURL was introduced by the Online Computer Library Center, Inc. (OCLC) as a bridging technology to prepare for the introduction of Uniform Resource Names (URN). PURL implements the URI concept and thus does not separate the identifier from the resolving mechanism. PURL has no single global resolving mechanism, and PURL resolvers do not communicate with each other to share resolving information as DNS or Handle servers would do. For most of its history PURL had little social infrastructure and formal governance. In 2014 OCLC withdrew its institutional support and the future of PURL became unclear. PURL experienced severe technical problems for some time and the system was put into a ‘read-only’ maintenance mode (Baker 2015). In September 2016 OCLC and the Internet Archive announced that the URL redirection service on which PURL is based will in future be operated by the Internet Archive (OCLC 2016). This move brought PURL back from the brink of extinction. In December 2015 a total of 16 research data repositories in re3data.org were listed as using PURL, and only a few of them were using PURL exclusively. Using Google Scholar as a search engine, we estimate that about 16,400 PURL identifiers are being used in the entire scholarly record indexed by Google Scholar. Of these, fewer than 5,000 seem to identify digital objects like data; most seem to identify semantic concepts.
LSIDs were introduced by the Object Management Group (OMG) in 2004 as a way of naming and identifying data resources stored in multiple, distributed data stores. From 2009 onwards the standardisation authority of the biodiversity informatics community (Taxonomic Databases Working Group, TDWG) strongly supported LSIDs as the preferred GUID technology. LSIDs were intended to be used by all globally leading providers of biodiversity data to identify organism names. However, LSIDs provide neither a global resolving mechanism nor a centralised provider registration. The implementation of this standard is relatively complex: resolution is DNS-based and requires a multistep procedure, and the associated metadata format is RDF. As a consequence the technology was controversial (e.g. Hyam 2015; Page 2016), and opinions in discussion forums seemed to favour a simpler identifier system such as HTTP URIs.
As a result of these difficulties the system remained fragmented and fragile. In 2016, maintenance of TDWG's LSID resolution service was terminated and TDWG's support of LSIDs was called into question by members of the group (TDWG 2016). After two months without a central resolving system, a resolver was made available at http://www.lsid.info. However, the discussion is ongoing and significant parts of the biodiversity informatics community recommend switching from LSIDs to Cool URIs (Guralnick et al. 2015). Using Google Scholar as a search engine, we estimated that about 14,000 LSIDs have been used in the scientific literature.
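To make the structure of an LSID concrete, the sketch below parses the URN form `urn:lsid:<authority>:<namespace>:<object>[:<revision>]`. The function and class names are hypothetical, and the example LSID uses a made-up authority. The helper that builds a DNS service-record name reflects our understanding of the first step of the DNS-based resolution (an SRV lookup under the authority's domain); it is an assumption to be checked against the LSID specification, and the remaining resolution steps are not modelled here.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str                   # domain name of the issuing authority
    namespace: str                   # authority-defined grouping of objects
    object_id: str                   # identifier of the object in the namespace
    revision: Optional[str] = None   # optional version of the object


def parse_lsid(text: str) -> LSID:
    """Split an LSID URN into its components (hypothetical helper)."""
    parts = text.split(":")
    # Expect 'urn', 'lsid', authority, namespace, object and an optional revision.
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"LSID has an empty component: {text!r}")
    revision = (parts[5] or None) if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)


def srv_query_name(lsid: LSID) -> str:
    # Assumption: resolution starts with a DNS SRV lookup of this name.
    return f"_lsid._tcp.{lsid.authority}"


# Made-up example; example.org is not a real LSID authority.
example = parse_lsid("urn:lsid:example.org:names:12345")
```

Even in this reduced form, a client needs a URN parser and a DNS service lookup before any metadata can be fetched, whereas an HTTP URI can be dereferenced directly.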
Which are here to stay?
At the same time as criteria for trusted repositories were developed (Dobratz et al. 2009; Sesink, van Horik, and Harmsen 2008), similar efforts looked at criteria for trustworthy persistent identifier systems. Most notable are the criteria for trusted persistent identifier systems developed by Bütikofer (2009) in the context of the German nestor research programme on long-term preservation, and the review of persistent identifier systems as tools for science by Duerr et al. (2011). While Bütikofer's criteria emphasise technical and organisational aspects, the review of Duerr et al. focuses more on the usability of identifier systems as part of the academic record. Even though the authors come to different conclusions about which systems are likely to persist, both recognise the importance of organisational sustainability. If organisational stability is the Achilles heel of persistent identifier systems, are there ways we can achieve better sustainability of PID systems?
A first step towards better sustainability of PID systems would be more transparency. This should cover all aspects of a PID system: technical documentation, policies, governance and, in particular, the data and metadata that are necessary to resolve a PID.
Today, most discussions about the status of PID systems are hidden in online discussion fora and email lists and are only rarely made public. It is entirely unsatisfactory to publicly and officially promote a PID system while exit strategies are being discussed in the background or services are silently discontinued. This needs to change, and clearly a more participatory attitude and a proactive communication strategy would be beneficial for all PID system stakeholders.
A set of criteria, analogous to the criteria for the description of research data repositories published by re3data.org (Rücknagel et al. 2015), would help with the evaluation of PID systems. In conjunction with the re3data.org criteria, they would also help to identify weaknesses in the shared responsibility of the data provider and the operator of the PID resolver service for a reliable resolution of identifiers to web endpoints.
The large number of discipline-specific resolver systems tells us that there may be very specific needs in the governance of a PID system that are not met by the generic services. Here it is necessary to have a close look at the value proposition of a particular PID system and the services it provides.
In addition to organisational criteria, the value proposition of a particular PID system also asks us to evaluate its technical basis and alternative technical solutions. As we have seen from the LSID example, the seeming simplicity of Cool URIs is still attractive. The HTTP protocol is simple and universally available but the risks of “link rot” have not gone away. Other interesting technical alternatives are based on peer-to-peer networking technologies such as Blockchain (Bolikowski, Nowiński, and Sylwestrzak 2015) or MagnetLinks (Golodoniuc, Car, and Klump, this volume). Peer-to-peer technologies would allow a “Devolution of Power” and community-based backup strategies for PID resolution.
Coming back to the question of the value proposition of PID systems, do we need PID resolvers? Yes, because we do not yet live in a semantic web world where linked data graphs would lead us to resources as proposed by Sachs and Finin (2010). As an interim solution, data providers should take advantage of available web search engines and make their data holdings discoverable. A possible approach would be to use mainstream web technologies as a supplement and potential fall-back solution to PID systems. Candidate technologies are microformats or JSON-LD, which are suited to exposing both metadata and potentially multiple identifiers associated with a digital object. Complementary sitemaps or catalogue services catalogued in a publicly available registry such as the GEOSS CSR could enable the implementation of common, generic resolution services.
There is, however, the other value proposition of PID, the persistent identification of elements of the record of science. Properly identifying these elements in a way that can be consumed by human and machine clients alike, and maintaining the persistence of objects and identifier resolution, is not a purely technical problem but is maintained through a social contract. The stability of this social contract, together with a sustainable and adaptable technological base, will determine the sustainability and resilience of a PID system. It is tempting to assume that a social contract becomes increasingly binding as the user community relying on a particular PID system grows. With the examples discussed in this paper we show that this is most likely an illusion. The DOI system, which is arguably the most successful PID system today, has strong commercial backing, while minor systems such as URN and ARK have the backing of national libraries. It might be a bitter pill to swallow for some members of the research data community who are wary of all things commercial, but business models are essential aspects of PID systems: sustainable PID systems do not come for free.
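As an illustration of the fall-back approach, the sketch below builds a JSON-LD description of a dataset that carries several identifiers at once and wraps it in the script element commonly used to embed JSON-LD in a landing page. It assumes the schema.org vocabulary (`Dataset`, `identifier`, `PropertyValue`), which is one possible choice and not one prescribed here. The function names, dataset title, URL and identifier values are hypothetical.

```python
import json
from typing import Dict, List, Tuple


def dataset_jsonld(name: str, landing_page: str,
                   identifiers: List[Tuple[str, str]]) -> Dict:
    """Describe a dataset and all of its identifiers (schema.org terms assumed)."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "url": landing_page,
        # One entry per identifier scheme, so that a crawler can still find
        # the object if one of the resolver services is unavailable.
        "identifier": [
            {"@type": "PropertyValue", "propertyID": scheme, "value": value}
            for scheme, value in identifiers
        ],
    }


def as_script_tag(doc: Dict) -> str:
    """Serialise the description for embedding in an HTML landing page."""
    body = json.dumps(doc, indent=2).replace("</", "<\\/")  # keep "</" out of the script body
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical dataset with made-up identifiers, for illustration only.
doc = dataset_jsonld(
    "Example borehole log",
    "https://repository.example.org/datasets/42",
    [("doi", "10.1234/example.42"), ("handle", "20.500.12345/42")],
)
tag = as_script_tag(doc)
```

Because the identifiers travel with the landing page itself, a general web search engine that indexes the page can act as a resolver of last resort.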