Perseids: Experimenting with Infrastructure for Creating and Sharing Research Data in the Digital Humanities

Bridget Almas

Overview

The Perseids project provides a platform for creating, publishing, and sharing research data in the form of textual transcriptions, annotations, and analyses. An offshoot and collaborator of the Perseus Digital Library (PDL), Perseids is also an experiment in reusing and extending existing infrastructure, tools, and services.

This paper discusses infrastructure in the domain of digital humanities (DH). It outlines some general approaches to facilitating data sharing in this domain, and the specific choices we made in developing Perseids to serve that goal. It concludes by identifying lessons we have learned about sustainability in the process of building Perseids, noting some critical gaps in infrastructure for the digital humanities, and suggesting some implications for the wider community.

General Approaches

What constitutes infrastructure, and how does it facilitate data sharing in the domain of DH, and in the Perseids project in particular? According to Mark Parsons, Secretary General of the Research Data Alliance (RDA), infrastructure can be defined as ‘the relationships, interactions and connections between people, technologies, and institutions that help data flow and be useful.’

In the realm of DH, any of the following might be considered infrastructure: original digital collections, linked data providers, general purpose and domain-specific platforms, content management systems (CMSs), virtual research environments (VREs), online tools and services, repositories and service providers, aggregators and portals, and shared standards and protocols. Table 1 provides some specific examples of these in the DH and digital classics (DC) communities, illustrating the diversity and breadth of infrastructure in these communities.

Table 1

Examples of infrastructure in digital humanities and digital classics.

Infrastructure type	Examples in DH and DC

Original digital collections	PDL, Papyri.info, NINES, Digital Latin Library, Coptic Scriptorium, Roman de La Rose
Linked data providers and gazetteers	Pleiades, PeriodO, VIAF, Getty, Trismegistos, DBPedia
General purpose platforms, CMS, VREs, tools and services	Omeka, MediaWiki, Heurist, TextGrid, Voyant, Mirador, CollateX, JUXTA, Neatline
Domain-specific platforms, CMS, VREs, tools and services	Perseids, Symogih, PECE
Repositories and service providers	CLARIN, DARIAH, EUDAT, MLA Commons/CORE, HumaNum, Hathi Trust Research Center, California Digital Library
Aggregators and portals	Europeana, Digital Public Library of America, HuNi, EHRI
Standards and protocols	IIIF, OA, TEI, OAuth, /SAML, CTS

Enabling data sharing includes ensuring that data objects have persistent, resolvable identifiers, providing descriptive and structural metadata, providing licensing and access information, and using standard data formats and ontologies. The recent W3C recommendation ‘Data on the Web Best Practices’ describes further strategies, such as providing version history, provenance information, and data quality information.

Above and beyond this, ensuring that adequate editorial and/or peer review has taken place before data is shared is often an important criterion for data sharing in the humanities.

Background

Perseids evolved to fill a critical need of the digital classics community of scholars and students: infrastructure that supports textual transcription, annotation, and analysis at a large scale, with review, in both scholarly and pedagogical contexts. Such infrastructure would give us the ability to work with text-centric publications containing a variety of different data types, and would include:

stable, persistent identifiers for all publications;
a versioned, collaborative editing environment;
the ability to extend the environment with data type-specific behaviors and tools;
customized review workflows.

Perseids is, in part, a successor to a prior ambitious, but ultimately unsuccessful, infrastructure effort in the humanities, Project Bamboo. One of the aims of Project Bamboo was to develop a Service Oriented Architecture (SOA) that could serve a wide variety of use cases and requirements for textual analysis and humanities research. This accorded with the goal of the PDL: to begin to decouple the many services making up the Perseus 4 application, so that they could be recombined and reused to build new applications. The PDL’s contribution to Bamboo included development (and implementation) of service APIs for and . These services, intended to be shared on the Bamboo Services Platform, reused code from two main sources: the PDL’s web application and the Project’s reading environment, and were designed to be easily extended to serve additional languages and use cases. They provided essential functionality for textual analysis and annotation.

At the same time, we also began separately investigating a scalable solution for engaging undergraduate students in the production of original transcriptions and translations of medieval Latin manuscripts and Greek epigraphy. This work was inspired by, and involved reuse of architecture and tools from, two major projects in digital classics, the Homer Multitext and Papyri.info.

One thing that prevented Bamboo from succeeding was the assumption that scholars would be willing to give up their domain-specific tools and services for a more general infrastructure to which everyone would contribute (Dombrowski 2015). Humanities use cases at the time appeared too diverse for that, and technologies were moving very fast. It is unclear whether Bamboo could have succeeded, but the project ended before we could develop a critical component needed for our own use cases: a platform for managing the data and scholarly workflow that would allow for full peer and professorial review.

Perseids took up in part where Bamboo left off, but with a more modest goal of providing infrastructure for our own specific set of use cases. We reused the services we built for Bamboo in Perseids, and also reused an existing piece of infrastructure from another project, the Son of SUDA Online (SoSOL), to fill the role of managing the data and review workflows.

Drawing on the experiences of Bamboo, we decided that Perseids would support a looser coupling of existing tools and services. One goal of infrastructure is to connect what already works, adding value and capacity without reinventing solutions. Our development approach for Perseids was thus based on three principles:

data interoperability;
flexibility and agility;
tool interoperability.

We wanted not only to support our scholarly workflows, but also to be sure that the outputs would be fully sharable and preservable.

Perseids currently serves an active user base, averaging between one and two thousand sessions by at least five hundred unique users per month during the academic year, the majority of whom come from six active DH communities: Tufts, the University of Nebraska at Lincoln, the College of Letters and Science of the Sao Paulo State University, the University of Leipzig, the University of Lyon, and the University of Zagreb. Several external projects also connect to Perseids’s tools and review workflow via its API.

Functionality

Use Cases

Perseids offers functionality for creation, curation and review of texts, translations and annotations. It enables its users to:

Create and edit a new text transcription.
Edit an existing text transcription.
Create and edit a new text translation.
Edit an existing text translation.
Create and edit a new commentary annotation.
Create and edit a new treebank annotation.
Create and edit a new text alignment annotation.
Ingest and edit simple annotation data from external sources.
Create and edit simple annotations on texts.

The process of creating a publication on Perseids involves workflows fulfilling one or more of these use cases (Figure 1).

Figure 1

The Perseids home screen, showing a variety of data types and actions.

Workflows

A workflow, in this context, is a series of actions carried out by a user to achieve some goal. In a typical workflow on Perseids the user creates a publication containing one or more of the supported data types. She uses an editing tool appropriate to the data type to edit and curate her work and then submits it to a review board for acceptance. For example, she may choose to create and edit a treebank annotation using the Arethusa editing tool (Figure 2).

Figure 2

Annotating a Treebank in Arethusa.

If the work is being done in the context of a pedagogical assignment, the review board is likely to be made up of the professor and teaching assistants for the class. If the work is being done in the context of a specific project or community, the review board will be composed of peers or expert members of an editorial team (Figures 3 and 4).

Figure 3

Perseids user interface – voting on a publication.

Figure 4

Perseids review workflow.

The ability to support peer-review functionality is a distinguishing feature of the Perseids infrastructure, and an important driving factor behind the architectural decision to build it upon the SoSOL platform. As we discuss further below, a common driver for external projects to integrate with Perseids is to take advantage of the flexible review workflow features it offers.
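The submit-and-vote cycle described above can be pictured as a small state machine. The following is only a minimal sketch in Python, not the SoSOL implementation: the class name `ReviewBoard`, the `approvals_needed` threshold, and the three status labels are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet


@dataclass
class ReviewBoard:
    """Hypothetical board: a set of members voting on one submitted publication."""
    members: FrozenSet[str]
    approvals_needed: int  # assumed acceptance threshold, not taken from SoSOL
    votes: Dict[str, bool] = field(default_factory=dict)

    def vote(self, member: str, approve: bool) -> None:
        if member not in self.members:
            raise PermissionError(f"{member} is not on this board")
        # A second vote by the same member replaces the earlier one.
        self.votes[member] = approve

    def status(self) -> str:
        approvals = sum(self.votes.values())
        rejections = len(self.votes) - approvals
        if approvals >= self.approvals_needed:
            return "accepted"
        # Rejected once the members who have not voted "no" can no longer reach the threshold.
        if len(self.members) - rejections < self.approvals_needed:
            return "rejected"
        return "in review"
```

In a pedagogical setting the members might be the professor and teaching assistants; in a project setting, the editorial team. A real deployment would also need to persist votes and notify the author, which this sketch leaves out.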

Architecture

The Perseids architecture (Figures 5, 6, 7) supports these workflows through a complex sequence of interactions between its core components, hosted tools and services, third-party applications and platforms, and external identity providers and content repositories.

Figure 5

Perseids infrastructure and ecosystem.

Figure 6

Perseids core components.

Figure 7

Perseids hosted tools and services.

SoSOL is the core of the Perseids platform. It is a Ruby on Rails application, built on top of a Git repository, that provides an open-access, version-controlled, multi-author web-based editing environment that supports working with collections of related data objects as publications. SoSOL was developed for the Papyri.info site by the Integrating Digital Papyrology project, a multi-institution project aimed at supporting interoperability between five different digital papyrological resources, and is now maintained jointly by the Duke Collaboratory for Classics Computing and the Perseids project.

A Git repository provides versioning support for all documents, annotations and other related objects managed on the platform. SoSOL also provides additional functionality on top of Git’s, including document validation, templates for document creation, review boards, and communities. It uses a relational database (MySQL) to store information about document status and to track the activity of users, boards, and communities. SoSOL uses the and /SAML protocols to delegate responsibility for user authentication to social or institutional identity providers. Social identity providers (IdPs) are supported through a third-party gateway, currently Janrain Engage.

The Perseids deployment of SoSOL incorporates the Canonical Text Services (CTS) protocol. The CTS specification defines an API protocol and a URN syntax for identifying and retrieving text passages via machine-actionable, canonical identifiers. To support CTS, as well as provide features such as tokenization of texts, the Perseids deployment of SoSOL delegates some functionality to external databases and services.
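To make the URN syntax concrete, the sketch below splits a CTS URN into its colon-delimited parts: the `urn:cts` prefix, a namespace, a dot-separated work component (text group, work, version), and an optional passage reference. It is an illustrative parser written for this discussion, not the library Perseids uses; the class and function names are hypothetical, it ignores deeper levels such as exemplars and subreferences, and the example URN in the comments is only a sample.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CtsUrn:
    namespace: str           # e.g. "greekLit"
    textgroup: str           # e.g. "tlg0012"
    work: Optional[str]      # e.g. "tlg001"
    version: Optional[str]   # e.g. "perseus-grc1"
    passage: Optional[str]   # e.g. "1.1-1.10" (single reference or range)


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into namespace, work hierarchy, and passage."""
    parts = urn.split(":")
    # Expected shape: urn : cts : namespace : work-component [: passage]
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    namespace, work_component = parts[2], parts[3]
    if not namespace or not work_component:
        raise ValueError(f"missing namespace or work component: {urn!r}")
    # The work component is a dot-separated hierarchy; pad missing levels with None.
    hierarchy: List[Optional[str]] = list(work_component.split("."))
    hierarchy += [None] * (3 - len(hierarchy))
    textgroup, work, version = hierarchy[:3]
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(namespace, str(textgroup), work, version, passage)
```

For example, `parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.perseus-grc1:1.1-1.10")` yields a version-level identifier with the passage range `1.1-1.10`, while a URN without a passage component identifies the text as a whole.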

The SoSOL application itself provides lightweight user interfaces for creating and editing documents and annotations, but in order to support an open-ended set of different editing and annotation activities, we rely on integrations with external web-based tools for editing and annotating. These integrations are enabled by API interactions between the tools and the SoSOL application.

The Perseids Client Applications component acts as a broker between the end-user, the SoSOL platform, external repositories and services, and the web-based editing and annotation tools. Built on the , this component implements a client-side workflow for the creation of new annotations of text passages identified by CTS URN. It uses the CTS abstraction libraries from infrastructure for CTS URN resolution and processing, as does the Nemo browsing interface, which offers a discovery interface for identifying texts to annotate and an anchoring point for front-end annotation tools and visualizations.

A recent addition to the platform is a Service which enables us to send data directly to external GitHub repositories after it has been through the review workflow. (See the ‘Tool Interoperability’ section below for further details on these scenarios.)

The role that each component of the architecture and ecosystem plays in supporting the workflow is described in the ‘Tool Interoperability’ section below.

Information Model

Data publications produced on Perseids are collections of related data objects of different types. The SoSOL information model was designed for this type of publication. The “Publication” is the container for a collection of data objects belonging to a parent abstract class of “Identifier.” Different object types are implemented as derivations of the “Identifier” class, which add type-specific behaviors and properties, such as schema validation rules. Figure 8 shows how this design applies in Perseids.

Figure 8

Perseids information model.
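The container-and-subclass design described above might look roughly like the following Python sketch. SoSOL itself is a Ruby on Rails application, so this is a transliteration of the idea rather than its code: the subclass names are hypothetical, and the "validation" shown (XML well-formedness plus a root-element check) is a deliberately simple stand-in for real schema validation.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


class Identifier(ABC):
    """Abstract parent of every data object held in a publication."""

    def __init__(self, name: str, content: str) -> None:
        self.name = name
        self.content = content

    @abstractmethod
    def is_valid(self) -> bool:
        """Type-specific validation rule, supplied by each derived class."""


class TranscriptionIdentifier(Identifier):
    def is_valid(self) -> bool:
        # Stand-in for TEI schema validation: well-formed XML with a TEI root element.
        try:
            root = ET.fromstring(self.content)
        except ET.ParseError:
            return False
        return root.tag.endswith("TEI")


class CommentaryIdentifier(Identifier):
    def is_valid(self) -> bool:
        # Stand-in rule: a commentary must contain some non-whitespace text.
        return bool(self.content.strip())


@dataclass
class Publication:
    """Container for a collection of related data objects."""
    title: str
    identifiers: List[Identifier] = field(default_factory=list)

    def invalid_items(self) -> List[str]:
        """Names of contained objects that fail their own type's validation."""
        return [item.name for item in self.identifiers if not item.is_valid()]
```

The point of the design is that the publication never needs to know which data types it holds: adding a new type means adding one subclass with its own validation rule.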

However, Perseids publications can also be thought of as research objects, where the object of the research is a passage or passages of canonically-identifiable text. Figure 9 shows our original vision for a CTS-focused publication on Perseids.

Figure 9

Perseids publication as a CTS focused research object.

Tool Interoperability

Decoupling data creation tools from the sources and destinations of the data was a key part of our design approach. APIs are critical components of infrastructure, and integration and sharing require that data be retrievable from and persistable to any source.

Perseids offers an API for Create, Read, Update, and Delete (CRUD) operations on all data types supported by the platform. API clients can authenticate using the OAuth 2.0 protocol, and co-hosted tools have the option of using a shared session cookie. These approaches enable integration with specific tools and services, such as the Arethusa Annotation Framework and the , as well as external projects such as () and the Gazetteer (Figure 10).

Figure 10

Creating and submitting a publication from an external application using OAuth2.
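As a rough illustration of how an external client might prepare such calls, the sketch below builds (without sending) an HTTP request carrying an OAuth 2.0 bearer token. The base URL, the resource paths, and the mapping of CRUD operations to HTTP methods are assumptions made for this example; they follow common REST conventions and are not taken from the Perseids API documentation.

```python
import json
import urllib.request
from typing import Optional

# Placeholder base URL; the real Perseids API root is not given in this paper.
API_BASE = "https://perseids.example.org/api"

# Conventional REST mapping of CRUD operations to HTTP methods (an assumption).
CRUD_METHODS = {"create": "POST", "read": "GET", "update": "PUT", "delete": "DELETE"}


def build_request(operation: str, path: str, access_token: str,
                  payload: Optional[dict] = None) -> urllib.request.Request:
    """Prepare, but do not send, an authenticated request for one CRUD operation."""
    body = json.dumps(payload).encode("utf-8") if payload is not None else None
    # OAuth 2.0 bearer tokens are passed in the Authorization header.
    headers = {"Authorization": f"Bearer {access_token}"}
    if body is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        url=f"{API_BASE}/{path.lstrip('/')}",
        data=body,
        headers=headers,
        method=CRUD_METHODS[operation],
    )
```

A client that has already obtained an access token through the OAuth 2.0 flow could pass the returned object to `urllib.request.urlopen` to perform the call. Co-hosted tools using a shared session cookie would skip the token and rely on the browser session instead.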

Perseids also uses external APIs to pull data from other infrastructures. We use the Canonical Text Services URN protocol and API to identify and retrieve textual transcription, translation, and annotation targets (Figures 11 and 12).

Figure 11

Sequence of API interactions for creating and editing a CTS-focused annotation template using the Perseids Client Apps and a locally hosted editing tool.

Figure 12

Using the Perseids Client Apps to create a new translation alignment annotation in Perseids for editing via the . Texts available for use are populated via a call to the CTS API.

We also offer a lightweight URL-based API which lets individual scholars and smaller projects, particularly those without the time or skills to develop client software, pull their own data in or integrate Perseids with their application. Professors such as Robert Gorman at the University of Nebraska at Lincoln are using this feature to produce templates for new annotations that they publish on their university Learning Management Systems (LMS). They then include links to Perseids in their syllabi that instruct Perseids to pull the templates from the LMS to create a new annotation publication (Figure 13).

Figure 13

Sequence of actions for creating a publication from an LMS-hosted syllabus and annotation template.
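The idea of a link that tells Perseids where to fetch a template can be sketched in a few lines. This is only an illustration of the pattern: the endpoint path and the query parameter names `template` and `target` are invented for the example and are not the actual Perseids URL API.

```python
from urllib.parse import urlencode


def build_annotation_link(base_url: str, template_url: str, target_urn: str) -> str:
    """Build a link that names a remote template and the passage to annotate.

    `template` and `target` are hypothetical parameter names used for illustration.
    """
    # urlencode percent-escapes the embedded URL and URN so they survive as query values.
    query = urlencode({"template": template_url, "target": target_urn})
    return f"{base_url}?{query}"


# Example: a link an instructor might paste into a syllabus (all values are placeholders).
link = build_annotation_link(
    "https://perseids.example.org/new-annotation",
    "https://lms.example.edu/course/101/template.xml",
    "urn:cts:greekLit:tlg0012.tlg001.perseus-grc1:1.1",
)
```

Because everything the platform needs is carried in the link itself, the instructor only has to host a file and publish a URL, which is what makes this approach usable without client software.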

Other applications such as Digital Athenaeus use Perseids’s URL API to offer links to Perseids with specific content already identified for transcribing, translating, or annotating (Figures 14 and 15).

Figure 14

Sequence of actions for creating a CTS targeted text annotation publication from a link from Digital Athenaeus.

Figure 15

Screenshot of the Digital Athenaeus interface (at http://www.digitalathenaeus.org) showing the links to Annotate in Perseids.

We also implemented a workflow for Marie-Claire Beaulieu’s course which allows students to use the annotation tool to annotate named entities and social networks of mythological characters from Smith’s Dictionary of Greek Names. This workflow uses the API to pull the annotations into Perseids for review and publication (Figure 16).

Figure 16

Perseids workflow.

The Perseids/EAGLE integration uses a combination of both of these pull strategies: links from EAGLE to Perseids identify a resource on the EAGLE site, and trigger a callback to the EAGLE MediaWiki API to pull metadata and data from that resource into new translation publications on Perseids (Figures 17 and 18).

Figure 17

Perseids/EAGLE workflow.

Figure 18

Screenshot of the EAGLE Portal (http://www.eagle-network.eu/wiki) showing a link to edit a translation in Perseids.

We also use external APIs to push data to external repositories. For the EAGLE project integration, Perseids uses the MediaWiki API to publish data to the EAGLE repository once it has passed through a review workflow. Through a new NEH-funded collaboration with the project, we have developed a service which allows us to push data to external GitHub repositories at the end of the review workflow (see Figure 4, Step 5b). Eventually we would like to be able to support pushing data to any external API endpoint.

Designing for Flexibility and Agility

From the outset, we have taken an agile approach to development of Perseids. While we do not use official sprints and strictly scheduled iterations, we approach planning in short increments, guided by a long-term vision and goals. In addition, we aim to deploy features to users as quickly as possible, so that we can get feedback from them. We do this not only for internal-facing features, but also to prototype new integrations with external services and projects. This flexibility allows us to try many things, keeping those that work and prove to be useful and deprecating those that do not.

To support this approach, we could not commit to a specific set of hardware requirements in advance, as we needed the flexibility to extend and reduce resources used as development proceeded. We therefore chose to budget for cloud-based resources on the Amazon Web Services (AWS) platform rather than using university IT resources. Full ownership and control over our infrastructure allowed us to experiment with features and integrations that otherwise would not have been possible; however, it did have some drawbacks and unexpected costs. These are described in the ‘Sustainability’ section below.

Standards for Data

Data Interoperability

A strategic principle in our development is to take steps to ensure data interoperability through the use of stable identifiers and standard formats.

We use CTS URNs to identify both texts and annotation targets. These URNs can be considered stable identifiers, but they do not quite qualify as persistent identifiers, as they are not universally resolvable or guaranteed to be available. Other identifier systems, such as Handles, are designed for persistence, and one approach we might take in the future would be to map CTS URNs to Handles. In the absence of this piece of infrastructure, CTS URNs nonetheless provide stable, machine-actionable identifiers that are technology independent.

We also use other types of stable identifiers within our annotations and texts, including the URIs published by the . We are working towards ensuring that any data published by the platform has a persistent identifier as well. We are therefore participating in the Research Data Alliance’s to develop a multidisciplinary, collections-based approach to data management that supports persistent identifiers for the collections themselves, and for the items within a collection.

We also strive to use standard data formats and ontologies for our data and to validate all objects against these. The primary data format standards supported on the platform include the TEI Schema for textual transcriptions and translations, the Open Annotation protocol for annotations, the ALDT/ALGT schemas for treebank data, the Alpheios Alignment Scheme for translation alignments, and the SNAP ontology for social network annotations.
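As an illustration of how these standards combine, the sketch below builds a minimal annotation in the Open Annotation vocabulary whose target is a CTS URN, serialized as JSON-LD. It is an assumption-laden example rather than the exact serialization Perseids produces: the function name is hypothetical, the body URI is a placeholder, and only the core `oa:hasBody`, `oa:hasTarget`, and `oa:motivatedBy` properties are shown.

```python
import json

OA_NAMESPACE = "http://www.w3.org/ns/oa#"


def make_annotation(target_urn: str, body_uri: str,
                    motivation: str = "oa:commenting") -> dict:
    """Return a minimal Open Annotation resource as a JSON-LD compatible dict."""
    return {
        "@context": {"oa": OA_NAMESPACE},
        "@type": "oa:Annotation",
        "oa:motivatedBy": {"@id": motivation},
        # The body is the annotation content; the target is the annotated passage.
        "oa:hasBody": {"@id": body_uri},
        "oa:hasTarget": {"@id": target_urn},
    }


annotation = make_annotation(
    "urn:cts:greekLit:tlg0012.tlg001.perseus-grc1:1.1",   # sample CTS URN
    "https://example.org/commentary/1",                    # placeholder body URI
)
serialized = json.dumps(annotation, indent=2)
```

Using a canonical URN as the target, rather than a URL into one particular website, is what allows the same annotation to be attached to the passage wherever the text is served.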

Provenance and Preservation

Incorporating provenance information in our publications is an important enabling factor for data sharing. We have taken steps in this direction, for example by supporting /SAML protocol for authentication on Perseids in order to be able to ensure a chain of authority for university repository systems. We have also included provenance information for tokenization services and tools in our annotation documents, and have explored models for more comprehensive approaches. However, capturing and recording provenance information reliably across a diverse ecosystem of tools and services is difficult, and we need general-purpose solutions that we can reuse. As articulated by Padilla: “A researcher should be able to understand why certain data were included and excluded, why certain transformations were made, who made those transformations, and at the same time a researcher should have access to the code and tools that were used to effect those transformations. Where gaps in the data are native to the vagaries of data production and capture, as is the case with web archives, these nuances must be effectively communicated.” We recognize that we fall short of meeting these goals currently and aim to do a more complete job of this in the future.

It is also very important to us that the research data produced with Perseids be preserved. However, our data models and approach to publications are constantly evolving, which makes coordination with the university library to preserve this data challenging, as the data do not necessarily fit the data models the library is already able to support. As a publicly available and open infrastructure, we also have many users from many institutions across the world, and it is not clear what responsibility Tufts, the university hosting the infrastructure, should have for data created by external users. We mitigate this with Perseids by providing links that users can use to access and download their data, and encouraging them to take responsibility for publishing and preserving it on their own. We continue to explore general models such as the Research Object, or , which will enable users to export data in a format that is ready to store in a repository. Another question is that of software preservation. As the Perseids software is under active development, it is difficult to keep the code for digital publications up to date with all the underlying services providing the data. We need to plan better for this preservation, including taking into account the need to represent interdependencies between visualizations and the underlying services and software.

Sustainability

Human and Governance Factors

We have learned much about infrastructure building throughout the course of this project. The technical hurdles to interoperability and sharing are usually much less difficult to overcome than those of social issues, funding, and governance. Even where there was a clear interest in interoperability and it was technically possible, we sometimes failed to implement or sustain an integration because doing so wasn’t in the funded mandate of the partner project. This was the case for us with the application from the Pelagios Project. But even where explicit funding support doesn’t exist, interoperability can still succeed if one project can fill a key gap in another, and if there are people willing to champion the effort to ensure its success. One example was our integration with the EAGLE project, in which Perseids provides a review workflow for EAGLE. This integration was implemented without being a funded deliverable for either project, but it remains to be seen whether we can sustain it indefinitely. This is an area where more formal governance structures, such as those offered by larger research infrastructures such as CLARIN and DARIAH, could be useful. The key challenge for the community is to encourage and support ad-hoc collaborations to get initial solutions working, and then move from there to more formal agreements to ensure sustainability.

Hardware and Software Factors

Laura Mandell talks about the various models being considered for where and how to position DH, and points out that the question of how to support diverse infrastructure needs is still unsolved. A second lesson we have learned from our experience on Perseids is that for development of interoperable infrastructure to succeed and be sustainable, we need better collaborative models for working with our university Information Technology departments and libraries. We knew we needed the flexibility to change our hardware requirements as we developed, and to deploy new code and services quickly to support rapid prototyping. This allowed us to develop and try out new solutions more rapidly than we would have been able to if we had had to go through university policies and procedures, but it also involved a lot of extra system administration work we had not anticipated, leaving us with a somewhat over-complicated infrastructure at the end of the first phase of the project. Accordingly, in the second phase we built in funding for a devops consultant, who helped us move to a fully configuration-managed system, so that the Perseids platform can be deployed easily by others and sustained for the long term. This is a critical characteristic for software-related infrastructure: in order for it to be reproducible by others, building and deploying it must be automated. In hindsight, having such consultancy from the outset would have been beneficial; collaboration between developers and the IT staff responsible for deploying and sustaining software is a more viable model than throwing code ‘over the wall’ at the end of a project.

As cloud computing becomes increasingly cost-efficient, and new models of deployment, such as container-based solutions, are introduced, there is a need for models in which university IT departments can partner with projects to provide expertise and facilities (for example, private cloud or container infrastructure, or extending university infrastructure to the public cloud).

Conclusion

With Perseids, we have explored an agile approach to infrastructure development, emphasizing reuse of both software and data. This has been successful on many levels. Reuse of existing infrastructure components leads to collaborations which increase the chances of sustainability, such as the joint maintenance of the SoSOL application. Agile approaches to prototyping cross-project integration also benefit all parties involved. However, transitioning to more formal governance models and increased engagement with host institutions will be essential to longer term success.

Data Science Journal

Practice Papers