Modeling Citable Textual Analyses for the Homer Multitext

The Homer Multitext project (hmt) is documenting the language and structure of Greek epic poetry, and the ancient tradition of commentary on it. The project's primary data consist of editions of Greek texts; automated and manually created readings analyze the texts across historical and thematic axes. This paper describes an abstract model we follow in documenting an open-ended body of diverse analyses. The analyses apply to passages of texts at different levels of granularity; they may refer to overlapping or mutually exclusive passages of text; and they may apply to non-contiguous passages of text. All are recorded with explicit, concise, machine-actionable canonical citation of both text passage and analysis, in a scheme aligning all analyses to a common notional text. We cite our texts with urns that capture a passage's position in an Ordered Hierarchy of Citation Objects (ohco2). Analyses are modeled as data-objects with five properties. We create analytical exemplars: new citable texts derived from these analyses and aligned to the versions they analyze.


Background
The Homer Multitext is an ongoing collaborative project in editing and analyzing the primary source documents for Greek epic poetry, particularly the Byzantine manuscripts from the 10th through the 13th centuries ce that preserve versions of the Homeric Iliad and the ancient tradition of commentary from Alexandrian and Roman scholars. The project's primary data consist of digital images of manuscript folios and diplomatic editions of the poetic and commentary texts that they contain.
The editions are encoded in xml validated against the Text Encoding Initiative's p5 schema (TEI Consortium Editors, 2016), but the hmt uses only a very small subset of the tei's tagset, limited to three semantic areas:
1. markup applying a citation scheme to the text
2. markup documenting the editorial status of portions of the text (such as text erased, added, or corrected by the original scribe)
3. markup identifying textual tokens that are not valid Greek lexical forms (such as Greek letters used as numbers, or regular abbreviations expanding to full forms)

This approach to editing is well understood today, but how best to model and organize analyses is an open question. We want to produce and publish citable analyses of our textual data, and be able to associate additional information with passages of text at different levels of granularity. Some examples can serve to illustrate the challenge. The text of the first two lines of the Homeric Iliad, as it appears in one manuscript, reads:

μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος
οὐλομένην, ἣ μυρί᾽ Ἀχαιοῖς ἄλγε᾽ ἔθηκε,

One obvious analysis of these lines might attach lexical information to each word-token. The first word-token is "μῆνιν". Another analysis might attach metrical information to each poetic foot. The first poetic foot corresponds to the text "μῆνιν ἄ". A syntactic analysis might want to identify the first noun-phrase of the Iliad, which is "μῆνιν οὐλομένην" (the first word of the first line, and the first word of the second line, but nothing in between).
This kind of analysis cannot be embedded in an xml edition. As a practical matter, the examples given above represent overlapping hierarchies that, even if they can be encoded in xml, would quickly make the document unmanageably complex. As a scholarly matter, we assume that there are an unlimited number of possible analyses, and we would like our approach to data to accommodate them all.

Annotation Standards
The hmt's workflow and data model for annotation take advantage of several existing standards for annotation of digital texts. Our editors capture transcriptions of the textual content of manuscripts as TEI-conformant xml files. In our current implementation, all project data (except for binary image files) is stored on the server as rdf statements, using some established namespaces and some project-specific namespaces.1 As described above, tei, and any xml-specific standard for annotation (such as XPointer2), was deemed insufficient and impractical for capturing multiple, concurrent, or even competing textual analyses. The Open Annotation working group's Web Annotation standard, a W3C Candidate Recommendation as of September 2016,3 provides a framework for annotation built on Linked Data fundamentals. As a generic framework, Open Annotation's vocabulary could express many of the relationships we describe below. Because our textual analyses reflect and extend a specific model of "text", we have chosen to use project-specific rdf with an hmt namespace.

1 The hmt's data is transformed to .ttl, to be served by an rdf triple-store. The project's .ttl data includes project-specific vocabularies from the cts, cite, and hmt namespaces. It also includes vocabularies from the following namespaces: dcterms

An Abstract Model of 'Text'
We treat citable texts as an ordered hierarchy of citation objects (ohco2) (Smith and Weaver, 2009), extending the ohco model of DeRose et al. (1990). ohco2 defines a citable text as a set of citable nodes that:
• belong to a bibliographic hierarchy
• belong to a citation hierarchy
• are ordered

This model frees us to convert representations of a citable text to any format determined to be equivalent under ohco2. When analyzing textual content, a tabular representation is often convenient; when integrating textual content with other related material, we use directed graphs. For editing, we instantiate a tree model using tei-xml (a widely used standard for encoding humanist texts) with associated metadata.
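The three ohco2 requirements can be sketched as a small data structure. This is an illustration only, under assumed names (`CitableNode`, `as_table`); it is not part of any hmt library:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CitableNode:
    """One node of an ohco2 citable text: a place in a bibliographic
    hierarchy, a place in a citation hierarchy, and text content."""
    version: str   # bibliographic hierarchy, e.g. "tlg0012.tlg001.msA"
    citation: str  # citation hierarchy, e.g. "1.1" for book 1, line 1
    text: str

# A citable text is an ordered sequence of such nodes; the list order
# carries the third ohco2 requirement (ordering).
iliad_msA = [
    CitableNode("tlg0012.tlg001.msA", "1.1", "μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος"),
    CitableNode("tlg0012.tlg001.msA", "1.2", "οὐλομένην, ἣ μυρί᾽ Ἀχαιοῖς ἄλγε᾽ ἔθηκε,"),
]

def as_table(nodes):
    """An equivalent tabular representation of the same citable text."""
    return [(n.citation, n.text) for n in nodes]
```

Because the model is format-neutral, the same nodes could equally be serialized as a directed graph or as tei-xml without loss.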

Canonical Text Services URNs
The cts urn captures the bibliographic hierarchy of a text (text-group, work, version) and the hierarchy of citation, in a concise, machine-actionable canonical citation (Blackwell and Smith, 2012b).4 This cts urn identifies the two lines of the Iliad quoted above:

urn:cts:greekLit:tlg0012.tlg001.msA:1.1-1.2

Each citable node contains text content, which may be structured as the editor of a specific edition chooses. The text contents of one edition might follow a plain-text model; others might use a mixed-content model represented by markdown or xml. This distinction between the canonical citation object, a fundamental component of the text, and its text content has implications for how we represent analyses of a text.
The Canonical Text Services (cts) protocol5 defines a networked service for retrieving passages of text identified by cts urn. Cts separates the concern of retrieval by canonical citation from the related but subsequent concern of analyzing the text content. We will now consider how we can work with analyses that do not align with the boundaries of citable nodes retrievable in the cts protocol.
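As a sketch of how the two hierarchies are packed into one identifier, the following function splits a cts urn into its bibliographic and citation components. This is a simplified reading of the urn syntax for illustration, not a conformant cts implementation:

```python
def parse_cts_urn(urn):
    """Split a cts urn of the form
    urn:cts:NAMESPACE:TEXTGROUP.WORK[.VERSION[.EXEMPLAR]]:PASSAGE
    into its namespace, bibliographic hierarchy, and citation."""
    parts = urn.split(":")
    if len(parts) != 5 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a cts urn: {urn!r}")
    namespace, work, passage = parts[2], parts[3], parts[4]
    bibliographic = work.split(".")  # text-group, work, version, ...
    citation = passage.split("-")    # a single citable node or a range
    return namespace, bibliographic, citation

ns, bib, cite = parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001.msA:1.1-1.2")
# bib -> ['tlg0012', 'tlg001', 'msA']; cite -> ['1.1', '1.2']
```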

Citation of Other Data
The hmt models data other than texts as versioned records in a collection of objects with common properties. We refer to these objects with cite urns (Blackwell and Smith, 2012a). A cite urn identifies a collection of objects unique within a namespace, a uniquely identified object in the collection, and a version-identifier for that object: urn:cite:hmt:clauses.1, for example, identifies object 1 in the clauses collection of the hmt namespace.

Declaration vs. Alignment
Expressing the analysis of a passage of text can be reduced to associating a cite urn with a cts urn. The cts urn can point to any passage: a single citable node in the text, a range of citable nodes, or a larger citation unit, 'Iliad Book 2', for example. When we analyze the contents of a citable node, however, we may need to work with textual contents that do not correspond perfectly to one or more citable nodes. We can align our analysis with characters in a particular edition by means of an extension to a cts urn identifying an indexed substring within a citable portion of specific text. urn:cts:greekLit:tlg0012.tlg001.msA:1.1@μῆνιν[1], for example, points to the first instance of the string 'μῆνιν' in Book 1, line 1, of the 'msA' edition of the Iliad.
In the cite architecture, @ extends a urn with a type-specific subreference. Cts urn subreferences index substrings from an origin of 1. The subreferenced string is based on the cdata content of the identified version of the text (excluding any markup), in the character encoding of that version. Subreferenced strings can be treated naively as strings, and they can serve for more sophisticated comparisons using language-aware methods.6
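The index-from-1 convention can be sketched as a small resolver over a node's cdata content; the function name is illustrative:

```python
def resolve_subreference(text, sub, index=1):
    """Return (start, end) character offsets of the index-th occurrence
    (counting from 1, per the cts convention) of `sub` in the cdata
    content of a citable node."""
    pos = -1
    for _ in range(index):
        pos = text.find(sub, pos + 1)
        if pos == -1:
            raise ValueError(f"occurrence {index} of {sub!r} not found")
    return pos, pos + len(sub)

# Resolving the subreference of ...:1.1@μῆνιν[1] against the node's text:
line = "μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος"
start, end = resolve_subreference(line, "μῆνιν", 1)
# line[start:end] is the substring 'μῆνιν' at the start of the line
```

Note that the offsets are only meaningful against the specific version (and character encoding) named in the urn, which is why the subreference always extends a version-level citation.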

Requirements for Declarative Analysis
For our editorial and analytical work to constitute a foundation for further scholarship, we want to emphasize the declarative over the procedural. That is, while computational processes such as search, tokenization, or difference operations are tools for scholarship, we want subsequent scholars to be able to point, explicitly and unambiguously, to any results of those operations.
With urn citations, we can construct what we call an 'analysis-relation' associating some data with a span of text. Each analysis-relation is a member of a collection of analysis-relations; the collection is ordered by the document order of the text analyzed. The textual component of the analysis-relation may be defined at any scale (a single character, for example, or 'Book 2 of the Iliad'). The data component may be unique to this analysis-relation ('the first scribal correction on manuscript A') or applied to many analysis-relations ('dactyl').
Every analysis 'deforms' the text, to borrow Stephen Ramsay's portmanteau term for the activity of reading: 'deformance' (Ramsay, 2011). An analysis of urn:cts:greekLit:tlg0012.tlg001.msA:1.1@μῆνιν as a lexical token might deform the text of that version simply as 'μῆνιν', with some conversion of Unicode characters to ensure precombined accents; it might deform it to 'mh=nin', using an ascii representation of polytonic Greek. A lemmatized analysis of this same text might deform the text to 'μῆνις' (the nominative, singular 'lexicon form'); a metrical analysis might deform it to ¯ ˘.
We fully document an analysis with five pieces of information:
1. The analyzed text is the citation to a version of the text, a cts urn, perhaps with an aligning sub-reference, identifying the text we are analyzing.
2. The analysis is a cite urn identifying the data resulting from this analysis.
3. The sequence is the index of this analysis-relation in the ordered collection of analyses.
4. The text deformation is a string of characters expressing the reading resulting from this analysis.
5. The analysis record is the canonical identifier for this cluster of objects: the five items documenting this instance of applying a specific analysis to a specific passage of text.
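These five properties map directly onto a flat record. The following sketch shows the shape of one analysis-relation, following the μῆνιν example above; the cite urn values are hypothetical identifiers invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnalysisRelation:
    analyzed_text: str     # 1. cts urn, possibly with a subreference
    analysis: str          # 2. cite urn for the resulting data
    sequence: int          # 3. index in the ordered collection
    text_deformation: str  # 4. the reading this analysis produces
    analysis_record: str   # 5. canonical identifier of this record

# The first lexical token of the Iliad (cite urns hypothetical):
first_token = AnalysisRelation(
    analyzed_text="urn:cts:greekLit:tlg0012.tlg001.msA:1.1@μῆνιν[1]",
    analysis="urn:cite:hmt:lexTokens.1",
    sequence=1,
    text_deformation="μῆνιν",
    analysis_record="urn:cite:hmt:lexAnalyses.1",
)
```

Because the record is flat, a collection of such relations serializes naturally to the plain-text tabular files described later in this paper.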

A Simple Example
Each of these five pieces of information is necessary, even in such an apparently simple case as tokenizing a text by word. The Homer Multitext's edition of the Iliad as it appears on the Venetus A manuscript (Marcianus Graecus Z454 [=822]; for an overview of this manuscript, see Dué (2008)) documents the text and editorial status of Book 2, line 4, with this tei xml markup:

τιμήσ<choice><reg>ει</reg><orig>ῃ</orig></choice>· ὀλέσῃ δὲ πολέας ἐπὶ νηυσὶν Ἀχαιῶν·

The scribe wrote the verb-form τιμήσῃ, but added an alternate ending -ει supralinearly. For our lexical analysis, our project's convention is to capture the original text. The first analysis, then, analyzes the textual content of our xml edition from the first instance of the character tau through the first instance of the combined eta with iota-subscript (τιμήσῃ). We can express this with the cts urn urn:cts:greekLit:tlg0012.tlg001.msA:2.4@τ[1]-2.4@ῃ[1]. That urn accurately identifies the textual content under analysis, but it does not itself represent a coherent text. Naively resolved, it would result in 'τιμήσ<choice><reg>ει</reg><orig>ῃ', which is neither Greek nor well-formed xml. Resolved without markup, it would result in 'τιμήσειῃ', which is not a Greek word. In either case, a sensible reading of the resulting text would require further, potentially complex, processing.
We identify explicitly the position of this analysis in the sequence of all lexical tokens in this version of the Iliad. We treat the lexical token 'τιμήσῃ' as a data-object in a collection of lexical tokens, and identify it with a unique cite urn; every instance of τιμήσῃ in our Iliad will be analyzed with this urn.
In this analysis by lexical token we choose to ignore editorial markup, but because its tokens are still aligned to the Edition, the editorial status of any given token (unclear, supplied, vel sim.) can be determined, even though the reading that results from the analysis is straightforward Greek without markup.
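The convention of reading the scribe's original text through tei <choice> markup can be sketched with the standard library's xml parser. This minimal sketch handles only top-level <choice> elements and ignores tei namespaces; it illustrates the idea, not the project's actual tokenizer:

```python
import xml.etree.ElementTree as ET

def reading(xml_fragment, prefer="orig"):
    """Flatten tei <choice> markup, keeping either the scribe's original
    reading (<orig>) or the regularized one (<reg>)."""
    root = ET.fromstring(f"<l>{xml_fragment}</l>")  # wrap to make it parseable
    out = [root.text or ""]
    for child in root:
        if child.tag == "choice":
            out.append(child.find(prefer).text or "")
        out.append(child.tail or "")
    return "".join(out)

line = "τιμήσ<choice><reg>ει</reg><orig>ῃ</orig></choice>· ὀλέσῃ δὲ πολέας ἐπὶ νηυσὶν Ἀχαιῶν·"
reading(line, "orig")  # the scribe's text, beginning 'τιμήσῃ·'
reading(line, "reg")   # the regularized text, beginning 'τιμήσει·'
```

Either flattening yields readable Greek without markup, while the urn of the analyzed text keeps the token aligned to the marked-up Edition.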

Metrical Feet
The meter of Greek epic poetry disregards word-boundaries. A comprehensive documentation of poetic meter, encoded in xml markup without violating the rule against overlapping hierarchies, would result in an xml edition so complex as to be unusable, especially when combined with markup documenting editorial status.
Here is the first line of the Iliad, divided into six metrical feet, with the analysis of each foot; the first foot, a dactyl, corresponds to the text 'μῆνιν ἄ', crossing the boundary of the first word.

Syntax
Syntactical analysis may also fail to align with word-boundaries. The Greek word 'οὔτε', for example, performs two syntactic functions: the 'οὔ' is an adverb, and the 'τε' is a coordinator. One response to this problem has been to create editions tailored to this specific type of analysis by inserting editorial word divisions into forms like οὔτε. For our work, we want to avoid introducing non-standard orthography into our edited texts merely to serve a single kind of analysis. By generating a 'syntax-token analysis' we can leave our edition intact, while precisely assigning syntactic roles to parts of words where needed.

Text-deformation: ὑπὸ τοῦ Μελίσσου καὶ Περικλέα αὐτὸν ἡττηθῆναι ναυμαχοῦντα πρότερον (Table 6: Identifying and reuniting non-contiguous reported speech)
Both analyses, above, align to the same string of characters in the original edition. But in the collection of analyses, they are two elements, one following the other in sequence.
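The two syntax-token analyses of a form like οὔτε can be sketched as two records sharing one analyzed-text urn; the passage reference and the cite urns here are hypothetical, invented for illustration:

```python
# One string of characters in the edition, two analysis-relations.
analyzed = "urn:cts:greekLit:tlg0012.tlg001.msA:1.1@οὔτε[1]"  # hypothetical passage

syntax_tokens = [
    {"analyzed_text": analyzed, "text_deformation": "οὔ",
     "analysis": "urn:cite:hmt:syntaxTokens.1", "sequence": 1},  # adverb
    {"analyzed_text": analyzed, "text_deformation": "τε",
     "analysis": "urn:cite:hmt:syntaxTokens.2", "sequence": 2},  # coordinator
]

# Both entries point at the same characters of the edition,
# but occupy consecutive positions in the ordered collection.
same_span = syntax_tokens[0]["analyzed_text"] == syntax_tokens[1]["analyzed_text"]
```

The edition's orthography is untouched; the split into two roles lives entirely in the analysis collection.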

Non-contiguous Text
Modelling analyses of 'text reuse' (quotation, paraphrase, allusion) can be challenging because it is often necessary to treat non-contiguous spans of text as a single unit. In the example of Table 6, bold type highlights the reused text. In this example, we analyze a string of text from our Edition, associating it with an Analysis urn that identifies an instance of text-reuse. For the text-deformation of our analytical exemplar, however, we choose to omit the verbum dicendi and speaker-attribution (i.e. 'φησὶν … Ἀριστοτέλης') and the sentence-adverbial ('δὲ'), which are not actually part of the quotation. We have kept this analysis separate from our base edition, but we can present our reading of the quotation as we choose, and attach commentary to the object pointed to by the Analysis urn.

Analytical Exemplars
Every analysis of a text is a reading tokenizing a text into a series of analyzed units. Our approach to documenting analyses results in an ordered collection of readings, aligned to the citation hierarchy of a version (through the analyzed text cts urn). Each of these analyses has its own text content (the text deformation), which is controlled and defined by its association with the analysis urn. So we essentially have the necessary components for a new text, in the ohco2 model. We implement this by defining an 'analytical exemplar', derived from a specific version, with an additional level added to the bibliographic hierarchy of a cts urn:7

urn:cts:greekLit:tlg0012.tlg001.msA: (the ms. A edition of the Homeric Iliad)
urn:cts:greekLit:tlg0012.tlg001.msA.lexTokens: (an analytical exemplar derived from the ms. A edition of the Homeric Iliad)

The analytical exemplar's citation scheme follows that of the version it analyzes, but its citation hierarchy adds a further level:

urn:cts:greekLit:tlg0012.tlg001.msA:2.4 (Book 2, line 4, of the ms. A edition of the Homeric Iliad)

The most finely grained unit of citation from the Edition resolves to:

τιμήσ<choice><reg>ει</reg><orig>ῃ</orig></choice>· ὀλέσῃ δὲ πολέας ἐπὶ νηυσὶν Ἀχαιῶν·
We can resolve our analytical exemplar more finely, while being sure to get meaningful text content:

urn:cts:greekLit:tlg0012.tlg001.msA.lexTokens:2.4.1 (Book 2, line 4, lexical token 1 of the 'lexical tokens' analytical exemplar derived from the ms. A edition of the Homeric Iliad)

This resolves to: τιμήσῃ

By creating an analytical exemplar, we can separate concerns more effectively. For an analysis of the language of the Iliad, an analytical exemplar of lexical tokens provides clear, editorially controlled data. Our text-mining for lexical forms or morphology need not work around paleographical or codicological issues. At the same time, every citation in an analytical exemplar is aligned to the version from which it is derived, and so all analytical exemplars are implicitly aligned to each other.

7 A cts urn captures a bibliographic hierarchy that is similar to that defined by the frbr (Functional Requirements for Bibliographic Records) recommendation of the International Federation of Library Associations and Institutions (IFLA) (http://www.oclc.org/research/activities/frbr.html). frbr asserts a hierarchy of: work, expression, manifestation, item. The last, "item", is defined as "a single exemplar of a manifestation". cts's "exemplar" aligns with frbr's "item" for physical volumes, e.g. Homer, Iliad, editio of Villoison, Thomas Jefferson's personal copy thereof. For digital texts, cts defines an "exemplar" as "a specific transformation explicitly derived from a specific version of a text."
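Deriving an exemplar-level urn from a version-level one is purely mechanical: add a fourth component to the bibliographic hierarchy and, for a single analyzed token, a further level to the citation. A sketch, with an illustrative function name:

```python
def exemplar_urn(version_urn, exemplar, token=None):
    """Extend a version-level cts urn with an exemplar name and,
    optionally, an extra level of citation for one analyzed token."""
    head, passage = version_urn.rsplit(":", 1)  # split off the citation
    urn = f"{head}.{exemplar}:{passage}"
    return f"{urn}.{token}" if token is not None else urn

exemplar_urn("urn:cts:greekLit:tlg0012.tlg001.msA:2.4", "lexTokens", 1)
# -> 'urn:cts:greekLit:tlg0012.tlg001.msA.lexTokens:2.4.1'
```

Because the exemplar urn embeds the version's citation unchanged, any exemplar token can be mapped back to its line of the Edition by dropping the final citation level.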


Clauses
Analytical exemplars also allow us to read and cite a text according to analytical tokenizations. For example, it might be desirable to 'read' the Iliad in chunks defined not by poetic line, but by grammatical clauses. By analyzing the text by clauses and creating an analytical exemplar, we can make this possible.
μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος
οὐλομένην, ἣ μυρί᾽ Ἀχαιοῖς ἄλγε᾽ ἔθηκε, (Iliad 1.1-1.2)

The first grammatical clause of the Iliad is 'μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος οὐλομένην'. This includes all of 1.1, and the first part of 1.2. The second is 'ἣ μυρί᾽ Ἀχαιοῖς ἄλγε᾽ ἔθηκε', the rest of 1.2. This would present a problem of overlapping hierarchies, if we were to embed this analysis in the xml of the edition. The following tables add two additional properties to each analysis: the analytical exemplar urn, by which tokens in this new reading can be cited, and, for each citable node of the exemplar, the next citable node. This satisfies the ohco2 requirements, documenting citable nodes in a citation hierarchy, and their sequence (Tables 7, 8, 9). There are two clauses, identified by the Analysis urns urn:cite:hmt:clauses.1 and urn:cite:hmt:clauses.2. There are three entries in our record of these two clauses. The first two both have urn:cite:hmt:clauses.1 as their Analysis Record and their Analysis (because in this case, the analysis is unique: the first clause of this edition of the Iliad).
The Analytical Exemplar urns are the key for understanding why we have two entries for the first clause. This analytical alignment creates an exemplar that is tokenized and citable according to clauses. The Analytical Exemplar urns, and the aligned analyses, make the following identifications:
• The first citable analysis of Iliad 1.1 is clauses.1.
• The first citable analysis of Iliad 1.2 is clauses.1.
• The second citable analysis of Iliad 1.2 is clauses.2.
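These three entries can be sketched as records carrying the two extra properties, the exemplar urn and the next citable node. The exemplar name 'clauses' and the exact citation values are assumptions for illustration:

```python
EX = "urn:cts:greekLit:tlg0012.tlg001.msA.clauses:"

clause_entries = [
    # Clause 1 spans two citable nodes of the edition (1.1 and 1.2)...
    {"exemplar": EX + "1.1.1", "analysis": "urn:cite:hmt:clauses.1", "next": EX + "1.2.1"},
    {"exemplar": EX + "1.2.1", "analysis": "urn:cite:hmt:clauses.1", "next": EX + "1.2.2"},
    # ...while clause 2 is wholly contained in line 1.2.
    {"exemplar": EX + "1.2.2", "analysis": "urn:cite:hmt:clauses.2", "next": None},
]

# Two of the three entries share the analysis urn of the first clause:
first_clause = [e for e in clause_entries if e["analysis"] == "urn:cite:hmt:clauses.1"]
```

The `next` pointers record the ordering required by ohco2, so the exemplar can be navigated as a text in its own right.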
If we were to navigate our Edition via a cts service, urns at the line level would return the text-content of each poetic line. If we were to navigate our Analytical Exemplar via a cts service, urns at the added level of the citation hierarchy would return the text-content of each clause.

Editing, Archiving, and Publication
Our reliance on abstract data models allows the hmt to use the most appropriate tools and formats for each of the tasks of editing, archiving, and publishing. Editors transcribe texts into tei xml files, taking advantage of automated validation. Analytical work is captured in plain-text .csv files; our editors often work via the GitHub web interface, which offers validation of tabular data. Data is archived as tabular text files; the project's archival repository (https://github.com/orgs/homermultitext) saves versions of texts as both xml and tabular files derived from them.
The project's online publication is through the HMT Digital web-application, which uses an rdf database (currently Apache Fuseki) as a datastore. The archival data is transformed with our cite Archive Manager utility into rdf statements. HMT Digital accepts queries on urns and allows browsing of texts, delivering data either as raw xml or json objects, or as xml transformed to html for human readers. 8
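As a sketch of what the transformation to rdf might look like for one analysis-relation, the following emits Turtle-style statements. The predicate names in the hmt namespace are invented for illustration and are not the project's actual vocabulary:

```python
def analysis_triples(record, analyzed, analysis, sequence, deformation):
    """Render one analysis-relation as Turtle-style triples,
    one statement per documented property."""
    return [
        f"<{record}> hmt:analyzedText <{analyzed}> .",
        f"<{record}> hmt:analysis <{analysis}> .",
        f'<{record}> hmt:sequence "{sequence}" .',
        f'<{record}> hmt:textDeformation "{deformation}" .',
    ]

triples = analysis_triples(
    "urn:cite:hmt:lexAnalyses.1",
    "urn:cts:greekLit:tlg0012.tlg001.msA:1.1@μῆνιν[1]",
    "urn:cite:hmt:lexTokens.1",
    1,
    "μῆνιν",
)
```

Because every statement is keyed on a urn, a triple-store can answer queries such as "all analyses of Iliad 1.1" by simple pattern matching.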

Conclusion
This approach to managing analytical data affords a number of benefits:
• It allows us to separate the concern of editing a text in machine-readable form from the concern of publishing analyses of that text.
• It allows us to record an open-ended number of analyses. We are not limited to any given xml vocabulary.
• It allows us to cite our analyses at a very granular level.
• It explicitly aligns each analysis to the edition analyzed, and so implicitly aligns all analyses to each other.
• Its simple structure can be represented by plain-text tabular data files, clear to read, easily repurposed and shared.
• It frees us to represent Greek using different encodings or orthographies without losing the connection to the primary source evidence of our manuscripts.
• It supports more complex sets of analyses than anything that can be embedded in xml markup.
This approach lends itself to automated analyses, such as morphological parsing or tokenizations by word+punctuation (as required for certain approaches to documenting syntax), and to automated generation of exemplars based on the editorial status of the text. It also lends itself to analyses hand-crafted by human editors, such as analyses of speeches, extended similes, or more complex instances of text reuse.
One goal of the Homer Multitext is the fullest possible account of the traditional language of the Greek epic poetic tradition. Such an account does not exist, and will depend on systematic analysis of syntax, morphology, and meter across full editions of Homeric texts and quotations of Homeric texts in ancient commentaries. While our work aligning citable analyses to citable texts remains experimental, we are confident that, as the hmt's collaborators continue to expand the project's collection of digital diplomatic editions, our approach adequately serves our present scholarly requirements, and is creating durable data. We are optimistic that our simple data formats can be readily transformed as future needs dictate, and will continue to offer new opportunities for engagement with and insights from the hmt corpus.