Introduction

Digital resources are becoming crucially important to the humanities. Products of digital scholarship, i.e. research data and publications, are no longer predominantly confined to the linear representation of text but include a wide spectrum of data types, from digital objects like images or audio files to databases and aggregated software environments. A characteristic digital resource in many humanities disciplines is the multi-layered annotated collection, particularly in the form of a Scholarly Digital Edition (SDE). SDEs are now establishing themselves as the norm in many areas of philological endeavour (). During the Zuse Institute Berlin’s participation in the Humanities Data Centre (HDC) project (), we came across several projects that our partners from the digital humanities considered typical examples of research data presented as web applications, among them the SDE ‘Opus postumum’.

Conserving such a software stack, with all its interdependencies, can be a daunting task. Like other digital resources, complex software environments tend to become less accessible over time due to financial, technological, legal, or organizational issues, as well as due to a lack of auditing or cultural changes which threaten usability. The more technological features an SDE incorporates, the faster and the more likely new problems will occur in an evolving environment, including system security concerns, broken compatibility and unsupported external dependencies (). Sahle and Kronenwett () pose a central question: ‘Who cares for the presentational systems/“living systems” in the long run?’ In an ideal world, this task is shared between an information infrastructure provider (data facility/computing centre) and scholars of digital humanities with an insight into digital editing.

From our point of view as an infrastructure provider, research data has particularly good prospects of being preserved for a long time if it can be dissected into singular digital objects like TIFF images and XML files without losing important information implicitly hidden in the application code. Such digital objects and their metadata can be stored with reasonable technical and organizational effort in a digital preservation system compliant with the Open Archival Information System standard (OAIS; ). Hence, standard preservation actions like the migration of digital objects to other formats upon detection of data format obsolescence can be applied.

This perspective may easily conflict with scholars’ interest in interacting with the original web interface, whose content and technology remain unchanged or, even better, are continually enhanced. Ongoing maintenance of the fully functional online presentation would seem to be the ideal preservation solution but becomes unaffordable in the long run. Employing the preservation strategy of emulation to enable an authentic representation of SDEs would require the encapsulation and distribution of the complete hardware and software stacks, including the operating system and driver interdependencies. This is a complex and likewise expensive undertaking, potentially incurring intellectual property rights issues when offered as a service.

In this paper we discuss a set of service levels for the long-term preservation of SDEs beyond the day that project funding expires. After giving some background on SDEs and addressing the challenging question of what actually needs to be preserved in the next two sections, we introduce the service levels, putting them into perspective with regard to the interests and needs of contemporary and future scholars as well as resource constraints. Concentrating on two service levels, we discuss their applicability and potential to complement each other, considering two specific examples of SDEs in the section “Use Cases: Opus postumum and Edition Visualization Technology”. These use cases shall demonstrate how decisions in the design phase of an SDE may affect the time span for which it remains accessible and the size of the community that can use it.

Excursus: Scholarly Digital Editions

Sahle () defines a scholarly edition to be ‘the critical representation of historic documents’. A representation entails recoding, a transformation from one medium to another, e.g. by means of transcription or facsimiles. ‘Document’ is to be understood as a more generalised, i.e. broader, category of material than just text.

SDEs are deemed to be guided by a digital paradigm in their theory, method and practice (). It seems natural, yet still worth noting, that an SDE is not simply the online version of a traditional scholarly edition. Popular features include parallel views (for instance, the transcription juxtaposed with the facsimile), search and annotation facilities, and tools to study a particular segment in detail.

The increasing importance of the SDE has already been acknowledged by Gabler (), who states that the digital medium is becoming the scholarly edition’s original medium. From a technical point of view, most current SDEs are based on the guidelines of the Text Encoding Initiative (TEI). TEI provides guidance on the markup of text in XML, accounting for numerous aspects of semantics as well as visual appearance. Even the correspondence between passages of transcribed text and their original locations on facsimiles can be encoded in TEI-XML. Hence, preserving an SDE basically means taking care of text encoded in TEI-XML, as well as possibly accompanying facsimiles in some image format, and a (web) application rendering all that information in a form deemed conducive to scholarly work by the creators of the SDE.
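
To make this concrete, here is a minimal sketch in Python using lxml. The TEI fragment is hypothetical (element choice and coordinates are ours, not taken from any edition discussed here); it shows how a transcribed segment can point, via its facs attribute, to a zone on the facsimile, and how that correspondence can be resolved programmatically:

```python
from lxml import etree

# A minimal, hypothetical TEI fragment: one facsimile zone and one
# transcribed segment pointing to that zone via its @facs attribute.
sample = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <facsimile>
    <surface xml:id="page1">
      <graphic url="page1.tif"/>
      <zone xml:id="z1" ulx="100" uly="200" lrx="900" lry="260"/>
    </surface>
  </facsimile>
  <text><body>
    <p><seg facs="#z1">First line of the manuscript page.</seg></p>
  </body></text>
</TEI>"""

ns = {"tei": "http://www.tei-c.org/ns/1.0"}
root = etree.fromstring(sample)

# Resolve each transcribed segment to the image region it was read from.
for seg in root.xpath("//tei:seg[@facs]", namespaces=ns):
    zone_id = seg.get("facs").lstrip("#")
    zone = root.xpath("//tei:zone[@xml:id=$zid]", namespaces=ns, zid=zone_id)[0]
    coords = {k: zone.get(k) for k in ("ulx", "uly", "lrx", "lry")}
    print(seg.text, "->", zone_id, coords)
```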

What to preserve?

When considering the preservation of an SDE as a complete digital resource, one question to be addressed is how to decide which features are essential to preserve: What are the ‘significant properties’ of the SDE? The InSPECT Project defined significant properties as ‘[t]he characteristics of digital objects that must be preserved over time in order to ensure the continued accessibility, usability, and meaning of the objects, and their capacity to be accepted as evidence of what they purport to record’ (). The identification of significant properties is a complicated task that must reflect the consensus within an edition project regarding the most important aspects of an SDE. Since the very idea of deeming certain properties of a digital object significant derives from resource constraints, this concept is central to the question of which service level for the preservation of an SDE is appropriate (and economically feasible).

In contrast to a traditional scholarly edition in print, the information provided by an SDE is only indirectly accessible to humans and always requires sophisticated decoding and processing in order to be presented in some legible form.

One of the most obvious reasons for digitally encoding data in the first place is the intention to perform complex and/or fast operations on that data. A default route to the contents of print media, though not necessarily a full understanding of its meaning, may be assumed due to prevalent cultural techniques that only change at a moderate pace. A digital paradigm cannot rely on such a default route and should account for rapidly changing processing environments.

The challenge posed by the digital stewardship of SDEs lies in the fact that the data is presented in specialised aggregations that are themselves significant in terms of understanding, using, and curating the application (). However, scholars in the digital humanities tend to dismiss preservation efforts that do not encompass the ‘look and feel’ of the original web presentation in all its detail. During the HDC project, a non-representative survey of scholars at the Berlin-Brandenburgische Akademie der Wissenschaften revealed a strong conviction that the exact, authentic representation of an SDE is paramount to its scholarly value.

TEI-XML represents a significant investment of labour in the course of encoding a text, and discarding it would be unwise, especially considering that XML is by design not concerned with presentation, which makes it a flexible format that can be used in other contexts (e.g. after applying XSLT transformations). Also, Turska, Cummings, and Rahtz () point out that ‘different usage scenarios call for different presentations’. Returning to Sahle’s notion of a digital paradigm, the data might very well be valuable outside the context of the original application.
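
As a sketch of this flexibility, the following Python snippet applies a small XSLT stylesheet to a TEI document using lxml; the stylesheet and the file name edition.xml are hypothetical, since any real transformation depends on the markup conventions of the edition at hand:

```python
from lxml import etree

# A hypothetical stylesheet reducing a TEI document to plain paragraph
# text, e.g. for full-text indexing; real stylesheets would be tailored
# to the edition's markup.
xslt = etree.XML(b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
  <xsl:output method="text"/>
  <xsl:template match="tei:teiHeader"/>  <!-- skip the metadata header -->
  <xsl:template match="tei:p">
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>""")

transform = etree.XSLT(xslt)
doc = etree.parse("edition.xml")  # placeholder for a TEI file
print(str(transform(doc)))
```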

Service levels for Long Term Availability of SDEs

A comprehensive solution to the ongoing threat of research data loss should involve permanently funded institutions, i.e. institutions of a durable nature like data facilities providing storage capacity (computing centres). Since project funding is by definition limited in time, there is an inherent sustainability problem for infrastructures: when there is only limited time to prove the benefit of a service, its possible outreach is also limited. On the other hand, only services with considerable backing in the community have a chance of securing long-term funding.

The establishment of humanities data centres as a collaboration between well-established infrastructure providers and digital humanists is one possible approach to ensure trustworthy long-term preservation. Examples are the HDC (), the Data Center for the Humanities (DCH) Cologne (), the Data and Service Center for the Humanities (DaSCH) () and the Kompetenznetzwerk Digitale Edition (). Key to the success of such infrastructures is the relevance of the services provided to the research community.

With regard to the sustainability of software environments such as SDEs, a diverse spectrum of service levels is conceivable. We present four potential service levels, ordered by increasing effort required on the part of interested scholars to work with the preserved research (Figure 1).

Figure 1 

Different aspects of the proposed preservation levels.

1. Continuous Maintenance

Service level 1 is discussed mostly for the sake of completeness and is essentially hypothetical, because it implies the ongoing maintenance of the original application. Naturally, this form of indefinite curation can be assumed to be the preservation level most desired by researchers, due to the undiminished ‘functional completeness’ of the application. However, it requires permanent funding and it does not scale. Continuous Maintenance entails active development in order to accommodate changes not only to the host system but also to popular user clients. As long as the research project is alive and funded, there are sufficient resources for the maintenance of the software stack; beyond that point, one could envision a foundation or trust funding the ‘ongoing editing, enrichment and processing of digital research data’ () indefinitely. However well this approach may work for a single application, it becomes an increasingly laborious undertaking with each research product added to the service portfolio of an operator/data centre, since there are currently no agreed-upon standards on how to use the tools of the trade. Technological development and accumulating security issues probably render the curation of an SDE for more than a few years a futile endeavour.

2. Application Conservation

Service level 2 is a more pragmatic approach to the long-term availability of applications in the humanities realm using virtualisation technology. This concept has been developed in the context of the HDC project (under the name ‘Application Preservation’; ). All the components making up the application are transferred to a virtualised environment, e.g. a Docker container or a virtual machine. In the particular case of web applications, both server components (web server, deployed web applications and their runtime environments, possibly a database server, etc.) and client components (web browser, maybe some extensions, etc.) need to be virtualised in this way, possibly in two separate environments. Each time a user requests access to this particular research resource, the container is fired up and the conserved environment is presented to the user. Freezing the web application in an environment together with a tried-and-tested client (i.e. a particular browser version), which is then made accessible to interested users as a remote desktop client in their favourite browser, allows the conserved application to retain a close match to the original user experience. Initial changes to the web application will be required when it is first moved to a virtualised environment run by an infrastructure provider. A drawback is the lack of modifiability at the point of access: the application can only be used as is. A user can interact with the current application and reproduce research results retrospectively. Access to the underlying raw data is usually not possible.
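
A minimal sketch of such on-demand access, using the Docker SDK for Python; the image name sde-opus-postumum:frozen and the port mapping are hypothetical, standing in for a pre-built container image of the conserved stack:

```python
import docker

# Connect to the local Docker daemon.
client = docker.from_env()

# Fire up the conserved environment on demand; the image is assumed to
# contain web server, database and data exactly as they were at the
# time of conservation.
container = client.containers.run(
    "sde-opus-postumum:frozen",     # hypothetical image name
    detach=True,
    ports={"8080/tcp": 8080},       # expose the conserved web application
    read_only=True,                 # the frozen system must not change
)

print(f"Conserved SDE available at http://localhost:8080 ({container.short_id})")

# After the user's session ends, the instance is torn down again:
# container.stop(); container.remove()
```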

Once up and running, the virtualised environment is frozen, effectively growing into an unpatched system with serious security implications. Appropriate countermeasures will have to be put in place on the host system, and, in all likelihood, access will have to be restricted to authenticated users. Hence, the SDE – including its original user interface – can be conserved as long as the required environment is supported by the virtualisation solution and access restrictions for security reasons do not have to be too rigorous. This results in a limited lifespan of maybe 5–10 years, depending on the time that can be invested. This service intentionally does not extend to the sophisticated emulation of hardware and software environments. Once the application cannot be catered for within the deployed virtualisation infrastructure, it is no longer considered accessible and the service is terminated.

3. Application Data Preservation

Service level 3 aims at the separate preservation of all digital data objects underlying the application. With regard to SDEs, we concur with Rosselli Del Turco (), who states that ‘the only viable solution to ensure that an edition is usable for the foreseeable future is to completely decouple the edition data from the visualization mechanism’. Application Data Preservation means that the relevant data reflecting the significant properties of the application are exported, or rather extracted, from the software environment. The ‘atomised’ application data can be stored in a long-term digital preservation system using open, well-known and documented data formats, accompanied by technical, administrative and descriptive metadata explaining the application, its intended performance, and the underlying assumptions on presentation or visualisation. Migration from an obsolete data format to a new one is the predominant preservation action for this rather static type of (research) data.
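
What such an extraction step might look like is sketched below; the host, collection name and resource list are assumptions, while eXist-db conventionally exposes stored resources via its REST interface under /exist/rest/db:

```python
import pathlib
import requests

# Hypothetical eXist-db instance and collection name.
BASE = "http://localhost:8080/exist/rest/db/opus-postumum"
RESOURCES = ["page001.xml", "page002.xml"]  # would be enumerated in practice

# Staging area for the future information package.
staging = pathlib.Path("sip")
staging.mkdir(exist_ok=True)

# Pull each XML resource out of the running application as a plain file,
# ready to be described and ingested into a preservation system.
for name in RESOURCES:
    resp = requests.get(f"{BASE}/{name}", timeout=30)
    resp.raise_for_status()
    (staging / name).write_bytes(resp.content)
```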

This level may be compared to well-documented raw data storage in the empirical sciences. Close cooperation between data curators and scholars should ensure thorough description prior to ingest into a long-term digital preservation system. In this case, the lifespan of the intellectual content of the application is expected to be much longer than that for Application Conservation (level 2). Compared to Application Conservation, more effort has to be invested in the data acquisition process and also in any activity to virtually resurrect the data in the original or a new context. On the other hand, there is potential for the future reuse of data in hitherto unseen research contexts, provided that the data are maintained according to the FAIR principles of research data management (): Findability, Accessibility, Interoperability, and Reusability.

4. Bitstream Preservation

It can be assumed that Bitstream Preservation provides the longest prospective time for which the integrity of preserved data can be assured. The concept is well understood and addresses ‘the first requirement of digital preservation – to remain an intact physical copy of the digital object’ (). It relies on multiple copies, ideally stored on different machines using different technologies. The application (or at least its integral components like underlying databases, presentation code, logic code and configuration) is saved as is to some long-term storage, typically without further context information. We suggest that the sole bitstream be enriched with at least technical and administrative metadata, ideally also with descriptive metadata. Bitstream Preservation is the preservation level with the least functional completeness. It relies entirely on future users and their ability to reverse engineer the environment needed to run the application, be it software or hardware. This approach is considered a last resort when a higher level of preservation or maintenance is unfeasible.
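
For illustration, the fixity side of this level can be reduced to a small sketch (the archive layout is hypothetical): record a checksum manifest for the saved bitstreams once, then verify each stored copy against it at regular intervals:

```python
import hashlib
import pathlib

def sha256(path: pathlib.Path) -> str:
    """Compute the SHA-256 checksum of a file in streaming fashion."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(copy_root: pathlib.Path, manifest: pathlib.Path) -> None:
    """Record a checksum for every file below copy_root."""
    lines = [f"{sha256(p)}  {p.relative_to(copy_root)}\n"
             for p in sorted(copy_root.rglob("*")) if p.is_file()]
    manifest.write_text("".join(lines))

def verify(copy_root: pathlib.Path, manifest: pathlib.Path) -> list[str]:
    """Return the relative paths of all files whose checksum has changed."""
    damaged = []
    for line in manifest.read_text().splitlines():
        digest, rel = line.split("  ", 1)
        if sha256(copy_root / rel) != digest:
            damaged.append(rel)
    return damaged

# Hypothetical layout: the same bitstreams mirrored on two storage systems.
# write_manifest(pathlib.Path("/archive/copy-a"), pathlib.Path("manifest.sha256"))
# print(verify(pathlib.Path("/archive/copy-b"), pathlib.Path("manifest.sha256")))
```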

Use Cases: Opus postumum and Edition Visualization Technology

Despite the existence of the de facto encoding standard TEI-XML, SDEs are far from uniform. In fact, they can vary considerably with regard to the software components involved, implementation details and the data model. We are going to study two SDE setups with rather different design goals more closely. Their characteristics shall be discussed in relation to the requirements and goals of the service levels proposed in this paper. In particular, it shall be demonstrated why Application Data Preservation is not a general-purpose service level suitable for arbitrary applications but, under certain conditions, should be considered a valuable service complementary to the, in some sense, more generic Application Conservation approach. The Opus postumum is a critical edition of an extant manuscript by the famous 18th-century German scholar Immanuel Kant. First preparatory steps toward digitising and re-editing the manuscript took place in 2001, and work has continued at the Berlin-Brandenburgische Akademie der Wissenschaften (BBAW) ever since (Figure 2). The BBAW, one of the partners in the HDC project mentioned in the introduction, provided the data of this SDE for our investigations into possible solutions for long-term preservation.

Figure 2 

Opus postumum Online-Edition. Original screenshot (https://xmlpublic.bbaw.de/legacy/apps/kant/web/index.html).

The Opus postumum has been realised as a web application based on the XML database eXist (http://exist-db.org/), which stores the TEI-encoded transcriptions, and the digital image library digilib (https://robcast.github.io/digilib/), which provides access to the facsimiles. The setup has been designed not only to publish project results but also to aid project participants in the process of editing the text. Apart from native XML processing, the database provides an access control and permissions system, and its resources are accessible from the popular Oxygen XML editor. digilib serves the facsimile images in different resolutions at the user’s request and restricts data transmission to the segment currently being viewed, increasing performance when zooming in and studying the facsimile in detail. The frontend of the Opus postumum, developed at the BBAW, renders the facsimile and its transcription one manuscript page at a time, side by side, providing the user with various options for zooming in and navigating through the manuscript. In particular, the Opus postumum boasts reciprocal linking between passages in the transcription and the corresponding segments on the facsimile. The two representations of a passage are highlighted when hovering the mouse over either one of them, bringing the other one into view first if necessary, which effectively provides simultaneous scrolling.
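
For illustration, requests to a digilib server are plain parameterised URLs; the sketch below builds one in Python, following the parameter names of digilib’s Scaler interface (fn for the image path, dw/dh for the display size, wx/wy/ww/wh for the visible region), while the server address and image path are hypothetical:

```python
from urllib.parse import urlencode

# Hypothetical digilib instance; parameter names follow digilib's Scaler
# conventions (fn: image path, dw/dh: display size in pixels,
# wx/wy/ww/wh: visible region as fractions of the full image).
BASE = "http://localhost:8080/digilib/Scaler"

def region_url(image: str, width: int, height: int,
               x: float, y: float, w: float, h: float) -> str:
    """Build a URL requesting only the currently viewed image segment."""
    params = {"fn": image, "dw": width, "dh": height,
              "wx": x, "wy": y, "ww": w, "wh": h}
    return f"{BASE}?{urlencode(params)}"

# Upper-left quarter of a facsimile page, scaled to fit a 600x800 viewport.
print(region_url("kant/opus/page001", 600, 800, 0.0, 0.0, 0.5, 0.5))
```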

Searching for existing approaches to reusability in the domain of scholarly digital editing, we came across the publishing tool Edition Visualization Technology (EVT; http://evt.labcd.unipi.it/) (Figure 3). Its development started as part of the ‘Digital Vercelli Book’ project (). Distinguishing between the data forming the essence of the SDE and its presentation to the user has been a guiding principle of the project. Consequently, EVT has been designed to transform a TEI-encoded text or transcription, possibly accompanied by facsimile images, into a complete web application, i.e. to build the edition around the data ().

Figure 3 

The Digital Vercelli Book in EVT. Original screenshot (http://evt.labcd.unipi.it/).

In contrast to the Opus postumum, EVT is based on a client-only approach relying on established standard technologies like HTML5, CSS3 and JavaScript. The intention is to ease (re)deployment of an SDE on just about any web server, without further software being required on the server. On the other hand, EVT lacks the image annotation services and transfer performance optimisations provided by digilib, as well as the database features of eXist mentioned earlier as part of the Opus postumum, which are intended to assist scholars (not least the editors themselves) in their work on and with the SDE.

The two approaches toward creating SDEs described above would benefit in different ways from the service levels discussed in this article. Since Application Data Preservation is all about making data available for reuse in other applications, a few tests were performed on samples from the Opus postumum and the Digital Vercelli Book, kindly provided by the BBAW and Roberto Rosselli Del Turco (Univ. Turin), respectively. Simple XML schema validation of the TEI-XML encoded data revealed that both data sets violated the schema in some way or other. This was less surprising than one might think, given that the TEI guidelines are complex and voluminous. However, fixing the markup turned out to be fairly trivial with regard to the Digital Vercelli Book, whereas the task was more laborious, i.e. not always scriptable in an obvious way, in the context of the Opus postumum. Moreover, we tried our hand at processing some data from the Opus postumum for visualisation based on EVT. For this endeavour, the EVT documentation could be relied upon, but the Opus postumum, not being designed for data exchange, provides purely non-technical documentation explaining the user interface only. As it turned out, preparing a single page of the manuscript is not trivial but possible. Anything beyond that, however, involves the rather complex merging of TEI documents, handling ID collisions along the way due to an unorthodox data model.
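
The validation step itself is straightforward; the sketch below uses lxml with the TEI Consortium’s all-inclusive tei_all schema, here assumed in its RELAX NG form, and edition.xml stands in for a transcription file from either data set:

```python
from lxml import etree

# tei_all.rng: the TEI Consortium's all-inclusive RELAX NG schema;
# edition.xml is a placeholder for a transcription file under test.
schema = etree.RelaxNG(etree.parse("tei_all.rng"))
doc = etree.parse("edition.xml")

if schema.validate(doc):
    print("valid TEI")
else:
    # Each entry names the offending line and the violated constraint,
    # which is the starting point for (scripted) markup repair.
    for error in schema.error_log:
        print(f"line {error.line}: {error.message}")
```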

In defence of the Opus postumum it must be emphasised at this point that reusability of the underlying data in different contexts was, quite simply, not a design priority during its development, in contrast to EVT. Therefore, EVT has to be considered at an advantage with regard to the investigations described above. All the same, the results indicate that considering interoperability and reusability at the design stage really does make a difference. The intentional separation of data and interface makes any EVT-based SDE a promising candidate for Application Data Preservation. At this service level, the data is readily available to the research community, allowing scholars to use it with their own tools. Meanwhile, a plugin called TEI Publisher (https://teipublisher.com/) is available for recent releases of eXist, the database serving the Opus postumum. The purpose of TEI Publisher is to generate standalone web applications from an application originally served by eXist. Eliminating eXist from the requirements to access the SDE obviously reduces the complexity of the software stack to be maintained. Also, such a tool may conceivably serve the cause of data reusability by enforcing restrictions on the TEI input to be processed and imposing a certain degree of uniformity on the produced output.

On a more hypothetical note, it can be argued that the client-only approach pursued in the development of EVT supports the hosting of an unmaintained SDE on an otherwise fully maintained web server for an extended period of time, without too many risks for the host. Of course, clients are subject to change and may even be replaced, possibly breaking the interface of the SDE over time. At this point, Application Conservation may give the user access to a legacy browser that still renders the SDE as originally intended. Since EVT has been developed for scholars to build SDEs from their own datasets, it is shipped with documentation that may be a valuable source of information even if EVT ceases to function after the discontinuation of development. It is of potential interest to anyone performing preservation actions on an EVT-based SDE as part of service level 3 or trying to make sense of such an SDE preserved only at the bitstream level.

Application Conservation appears to be a good fit for the Opus postumum, because the whole environment can be cloned with only minor changes to the setup. This flexibility comes at a cost, however, since the service can only be operated under conditions that mitigate the risk of attacks on an unmaintained system. The Opus postumum is based on rather complex server components executed in a Java runtime environment. Access restrictions may conceivably become imperative sooner than hoped for in order to protect the system. As far as Bitstream Preservation is concerned, Application Conservation as an intermediary stage will probably assist in the curation process: it provides a fairly straightforward test case ensuring that all components required for execution (and thus potentially helpful for understanding the application) have been included. Obviously, this does not automatically enforce the appropriate level of supplemental documentation.

Summary and Conclusions

Infrastructure providers cannot step in for the developers of applications when the latter run out of resources, be it because project funding expires or because interests and priorities have shifted. Nonetheless, the service levels proposed and discussed here with special regard to SDEs address different needs arising from varying usage scenarios and application architectures. Providers implementing service levels 2, 3, and 4 should be in a position to offer reasonable or even attractive solutions for the mid- and long-term preservation of research data in the humanities. The Zuse Institute operates the digital preservation service EWIG (), based on the OAIS reference model, thus providing the infrastructure for service levels 3 and 4. Service level 2 is offered at the HDC in Göttingen. As stated in the introduction, the infrastructure is only part of the responsibilities: the curation effort preceding the transfer of research data to the infrastructure provider involves a lot of dedication on the part of the scholars familiar with the data.

Still, the impact of decisions made early in the design stage and throughout the development of an SDE on its preservation prospects can hardly be overestimated. In particular, the scope of Application Data Preservation (service level 3) is restricted by two important factors: first, it only makes sense if a subset of the information built into the application can be envisaged as a valuable contribution to future research without the full context of the original application; secondly, the extraction of this information from the application, modelled and encoded according to well-established and documented standards, must be affordable. The major benefit is that we are then dealing with objects that can be preserved in an OAIS-compliant way, with a fair chance of being subjected to preservation actions when required. Bitstream Preservation is the only one among the remaining service levels that can be handled by an OAIS-compliant preservation system, but it obviously limits the options for preservation planning.

Application Conservation, on the other hand, may be considered closer to hosting than to a preservation service in the traditional sense. It is very flexible, though, and probably the only viable option for maintaining access to certain applications and their data, at least for some time. Note that service level 2 is more flexible and therefore applicable to far more web applications than service level 3, but it is only operable for a limited time. For a certain period, the two service levels are complementary for the majority of applications that satisfy the requirements of Application Data Preservation. As indicated in Figure 1, service level 3 can be assumed to preserve reusable and possibly interoperable data for much longer than service level 2 is able to retain access to the original application – both exceeded by the virtually indefinite time spanned by Bitstream Preservation. Together, service levels 2 and 3 are a good match to provide findable, accessible, reusable and possibly interoperable research data in the humanities.