Automated granularity to integrate digital information: the "Antarctic Treaty Searchable Database" case study

Access to information is necessary, but not sufficient in our digital era. The challenge is to objectively integrate digital resources based on user-defined objectives for the purpose of discovering information relationships that facilitate interpretations and decision making. The Antarctic Treaty Searchable Database (http://aspire.nvi.net), which is in its sixth edition, provides an example of digital integration based on the automated generation of information granules that can be dynamically combined to reveal objective relationships within and between digital information resources. This case study further demonstrates that automated granularity and dynamic integration can be accomplished simply by utilizing the inherent structure of the digital information resources. Such information integration is relevant to library and archival programs that require long-term preservation of authentic digital resources.


Era of Digital Information
We have reached the threshold in our 'world information society' when accessing more information does not equate with generating more knowledge. Knowledge, which emerges from understanding relationships within and between information resources, derives from the process of integration. Distinctions between information access and integration underlie technological solutions for the future when "knowledge is the common wealth of humanity", as expressed by His Excellency Adama Samassekou at the 2004 CODATA meeting in Berlin. The purpose of this paper is to assess the challenges, strategies and efficiencies to integrate digital information resources.
To assess the challenges with the digital medium, it is instructive to take a broad view of written communications in our civilization. From stone and clay to paper and on to digital media, each era has increased our capacity to transport, produce and integrate information (Fig. 1). For example, the Internet has been evolving since the late 1960s (Berners-Lee et al. 2001, Pastor-Satorras and Vespignani 2004) with the number of Internet hosts increasing from 213 in 1981 to over 350,000,000 in 2005 (Internet Systems Consortium 2005). Since 1972, microprocessor speeds have increased 5 orders of magnitude (Intel Corporation 2005) while satellite systems have made it possible to collect and transmit information on a global scale (Evans 2000). Moreover, the volume of digital information doubled in the three years after 1999, with more than 5 exabytes (1 exabyte = 10^18 bytes) of information stored on print, optical and magnetic media in 2002 alone (Lyman et al. 2003). We also have powerful search engines to retrieve digital information from vast warehouses. These features all point to the observation that access to digital information has become effectively infinite and instantaneous.

FIGURE 1: Thresholds in the preservation and dissemination of written information in our civilization. Each of the media prior to digital had been used for millennia (Senner 1989). From stone to digital media: (a) the transport of information across time and space has increased; (b) the volume and rate of information produced has increased; and (c) the capacity to integrate information into new knowledge has increased.
While it is easy to understand that the capacities to transport and produce information have increased with each era (Fig. 1), it is less obvious that the capacity to integrate information also has increased. Today, more than 80% of the digital information is considered to be "unstructured", which means that it cannot be automatically decomposed into relational schema. Consequently, information integration is effectively limited to the remaining 20% of the digital resources that are structured with databases, metadata and markup. A principal challenge with the digital medium is being able to integrate information independent of whether it is "structured" or "unstructured" (Blumberg and Atre 2003).

Automated Granularity
Information has three indivisible elements (content, context and structure) that together provide meaning (Fig. 2). For example, when a message is encrypted (i.e., the structure is altered) it still has content and context, but no meaning. Similarly, if the names or dates and places are removed from an information resource, it still has context and structure, but limited meaning without the salient facts. Removing context features that can be used to authenticate an information resource also will compromise its meaning.
The paradigm shift created by digital technologies is the opportunity to utilize the structure of information as well as its content and context. A printed book can be managed based on its content (as in libraries) or its context (as in archives), but it is not possible to break a book into smaller units that can be managed automatically. It is this ability to automatically manipulate the granularity of information resources that distinguishes digital media from all of the hardcopy predecessors that have been applied throughout human civilization. This concept of granularity refers to the inherent conceptual units (i.e., information granules) that compose an information resource. With text, the granules could be as small as individual letters or characters, each of which could be identified within a byte-offset ontology relative to their parent resource (Berkman and Morgan 2003). More reasonably, the granules will be large enough to stand alone with sufficient content and context, such as paragraphs or chapters. A critical feature of each granule is that it internally retains information about its unique hierarchal position within its parent resource.
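As a minimal sketch of this idea, the following Python fragment cuts a text resource into paragraph-sized granules and records, inside each granule, its ordinal position and byte offsets relative to the parent resource. The `Granule` fields, the blank-line paragraph rule and the sample text are assumptions made for illustration only; they are not taken from the byte-offset ontology of Berkman and Morgan (2003).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Granule:
    parent: str    # identifier of the parent resource (hypothetical naming)
    position: int  # ordinal position of the granule within its parent
    start: int     # byte offset where the granule begins in the parent
    end: int       # byte offset just past the end of the granule
    text: str


def paragraph_granules(parent, raw):
    """Split a UTF-8 resource into paragraph granules at blank lines."""
    granules = []
    offset = 0
    for block in raw.split(b"\n\n"):
        if block.strip():
            granules.append(
                Granule(parent, len(granules) + 1, offset,
                        offset + len(block), block.decode("utf-8"))
            )
        offset += len(block) + 2  # skip the two-byte blank-line separator
    return granules


# Invented sample resource with two paragraphs.
resource = (
    "Chapter 1\nFirst paragraph of a sample resource.\n\n"
    "Chapter 2\nSecond paragraph of the same resource."
).encode("utf-8")

granules = paragraph_granules("sample-resource", resource)
for g in granules:
    # Each granule can be recovered from its parent by byte offsets alone.
    assert resource[g.start:g.end].decode("utf-8") == g.text
    print(g.parent, g.position, g.start, g.end)
```

Because the offsets and position travel with the granule, the granule can be displayed on its own and still be traced back to its exact location in the parent resource.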
Automated granularity has been considered previously to implement networked systems of embedded computers with the "vision of a world filled with large numbers of computing elements, many of which are hidden inside other objects and networked together" (National Research Council 2001). Automated granularity similarly extends to digital records, which have embedded content (Berkman and Morgan 2003). The value of automated granularity for digital libraries, archives and warehouses is that it provides a dynamic strategy for searching, retrieving, organizing and integrating both "structured" and "unstructured" information resources.
This paper uses the Antarctic Treaty Searchable Database (http://aspire.nvi.net, previously http://webhost.nvi.net/aspire) to assess the applications and implications of automated granularity.

Implementation Technology
The underlying technology to implement the Antarctic Treaty Searchable Database is the Digital Integration System (DigIn®). Operation of DigIn®, which is based on patented technologies assigned to EvREsearch LTD (Maynard 2001, 2002, 2003, 2004, 2006), involves four principal modules that can be used together or separately:
• GRANULARITY MODULE: creates information granules by using the inherent structure and patterns that bound relevant units of content. A unique categorical tag is assigned to each granule based upon an analysis of its provenance, parent-child location and contents. The categorical tags contain information to generate expandable-collapsible hierarchies.
• INDEXING MODULE: generates a database with the address (referenced within each categorical tag), content strings (words, numbers or other symbols) and their frequencies within each information granule.
• INTEGRATION MODULE: searches through the index to retrieve the information granules with terms or content strings that match the user-defined search queries in textual, numeric or other symbolic forms.
• AGGREGATION MODULE: combines relevant information granules based on their hierarchal relationships and user-defined criteria.
Each of the modules acts upon a set of expert rules that define its automated operation. These rules, which can be conveniently written with regular expressions (Friedl 2002), are optimized iteratively to integrate and display the relevant information granules within expandable-collapsible hierarchies. In addition, because DigIn® is modular, it can interface with statistical, graphical, semantic web, natural language or other types of software solutions that could be treated as additional modules.
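The four steps described above (creating granules, indexing their content strings, integrating by query and aggregating by hierarchy) can be sketched in a few lines of Python. This is not the DigIn implementation, which is patented and written in Perl; the regular expression, the function names, the tag layout and the sample report text are all assumptions chosen only to show how rule-driven modules of this kind might fit together.

```python
import re
from collections import Counter, defaultdict

# Hypothetical "expert rule": a granule starts at a heading line such as
# "Recommendation I-1" and runs until the next such heading.
GRANULE_RULE = re.compile(r"^(Recommendation [IVXLC]+-\d+)[ \t]*$", re.MULTILINE)
WORD_RULE = re.compile(r"[a-z0-9]+")


def make_granules(meeting, year, text):
    """Cut a report into granules, each keyed by a tag of its position."""
    headings = list(GRANULE_RULE.finditer(text))
    granules = {}
    for i, match in enumerate(headings):
        end = headings[i + 1].start() if i + 1 < len(headings) else len(text)
        tag = (year, meeting, match.group(1))  # illustrative categorical tag
        granules[tag] = text[match.end():end].strip()
    return granules


def build_index(granules):
    """Map each content string to its frequency within every granule."""
    index = defaultdict(dict)
    for tag, body in granules.items():
        for term, count in Counter(WORD_RULE.findall(body.lower())).items():
            index[term][tag] = count
    return index


def integrate(index, query):
    """Return the tags of all granules that contain the query term."""
    return sorted(index.get(query.lower(), {}))


def aggregate(tags):
    """Group matching tags into a year > meeting > measure hierarchy."""
    tree = defaultdict(lambda: defaultdict(list))
    for year, meeting, measure in tags:
        tree[year][meeting].append(measure)
    return {year: dict(meetings) for year, meetings in tree.items()}


# Invented sample text, not quoted from any adopted measure.
report = """Recommendation I-1
Parties should exchange information for peaceful purposes.
Recommendation I-2
Parties should exchange scientific personnel.
"""

granules = make_granules("ATCM I", 1961, report)
index = build_index(granules)
hits = integrate(index, "peaceful")
print(aggregate(hits))
```

Keeping each step as a separate function mirrors the statement that the modules can be used together or separately; for example, the index could be rebuilt after new granules are inserted without changing the query or display steps.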
DigIn® provides a general method that operates independently from any specific hardware and software. For example, DigIn® operates with ASCII or Unicode as well as proprietary schema. DigIn® also operates with metadata, markup and databases that each have standardized patterns to organize information in a structured manner (e.g., Sowa 1984). Moreover, DigIn® currently is written in Perl, which provides a stable cross-platform programming language that can read and write binary files as well as process very large files. DigIn® also could be written in other languages depending on the circumstances. Consequently, DigIn® is an interoperable method that can be utilized into the future in a persistent manner.

Implementation Design
The general activities to create the digital record of the Antarctic Treaty Searchable Database, as well as similar databases of policy documents, are illustrated in Figure 3. The first step is to define the collection parameters, which include the components of the collections as well as the resulting granularity and organization of the hierarchal displays that will be dynamically generated in response to the integration queries. After compiling the collection elements, the next step is to implement the appropriate granularity with a header tag in each granule that describes its unique hierarchal position relative to its parent resource. These tags, which preserve the provenance of each granule, will be used to dynamically generate expandable-collapsible hierarchies that comprehensively and objectively display granule relationships within and between the information resources. After searching and integrating the granules, the granule displays are assessed to determine whether the collection should be revised or whether the completed digital record should be fixed for archival purposes.

FIGURE 3: A generalized activity-flow diagram (Bobak 1997) of the processes to create the Antarctic Treaty Searchable Database or other digital records with the Digital Integration System™ (DigIn®) from EvREsearch LTD. Adapted from Berkman et al. (2005).
More specifically, the initial edition of the Antarctic Treaty Searchable Database was implemented in collaboration with the National Science Foundation and United States Department of State. Based on the characteristics of the Antarctic Treaty Handbook, 8th Edition (United States Department of State 1994), the following rules were used for compiling the contents of the initial Antarctic Treaty Searchable Database:
Rule 1: Include only the "measures" that were adopted by the Antarctic Treaty Consultative Parties "in furtherance of the principles and objectives of the Treaty."
Rule 2: Content of each adopted "measure" would include its text along with any tables or figures.
Rule 3: Exclude any "extracts," "introductory notes" or other additions from the United States Department of State, which is the depository government, because they were not formally adopted by the Antarctic Treaty Consultative Parties.
The next decision was to identify the appropriate granularity of the policy documents that would be searchable. Each Antarctic Treaty Consultative Meeting (ATCM) produced a report with adopted "recommendations," "decisions," "measures" or "resolutions", which sometimes included "appendices," "annexes" or "attachments." Periodically, the Antarctic Treaty Consultative Parties also adopted Conventions and larger policy documents that included specific "articles" along with "annexes." Based on these types of adopted measures, the following rules define the granularity of the policy documents for the Antarctic Treaty Searchable Database:
Rule 4: Each "recommendation," "decision," "measure" or "resolution" would be treated as a complete information granule (within the context of the ATCM and year of adoption as the two overlying hierarchal levels).
Rule 5: Each "appendix," "annex" or "attachment" would be treated as a complete information granule (within the context of the "recommendation," "decision," "measure" or "resolution" as well as within the ATCM and year of adoption as the three overlying hierarchal levels).
Rule 6: Each "article" and "annex" would be treated as a complete information granule (within a Convention or Protocol and year of adoption as the two overlying hierarchal levels).
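To make the granularity rules concrete, the sketch below represents each granule's header tag as a tuple of hierarchal levels and nests the tags into an outline of the kind an expandable-collapsible display would show. The tuple layout, the function names and the example labels are illustrative assumptions; the actual tag format used in the database is not described here.

```python
def build_hierarchy(tags):
    """Nest granule tags (tuples of hierarchal levels) into a dict tree."""
    root = {}
    for tag in tags:
        node = root
        for level in tag:
            node = node.setdefault(level, {})
    return root


def render(node, depth=0):
    """Return the hierarchy as indented outline lines."""
    lines = []
    for label in sorted(node, key=str):
        lines.append("  " * depth + str(label))
        lines.extend(render(node[label], depth + 1))
    return lines


# Illustrative tags only; the labels are not taken from the database.
tags = [
    # Rule 4: measure under year and ATCM (two overlying levels)
    ("1991", "ATCM XVI", "Recommendation XVI-10"),
    # Rule 5: annex under measure, ATCM and year (three overlying levels)
    ("1991", "ATCM XVI", "Recommendation XVI-10", "Annex"),
    # Rule 6: article under Protocol and year (two overlying levels)
    ("1991", "Protocol", "Article 3"),
]

outline = render(build_hierarchy(tags))
print("\n".join(outline))
```

Because every tag carries its full path, a display for any subset of granules can be rebuilt from the tags alone, without consulting the parent documents.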
The initial edition of the Antarctic Treaty Searchable Database, which was constructed in an automated manner based on the above rules, has been continuously updated as: (1) new measures have been adopted by the Antarctic Treaty Consultative Parties; (2) missing measures have been identified; and (3) missing components from the measures (e.g., tables or figures) have been identified.
These updates involve the insertion, tagging and editing of individual granules. Each update or edition of the Antarctic Treaty Searchable Database has been fixed by preserving all files and functionality on a webCDserver™ (Berkman 2002). Throughout, the contents of the Antarctic Treaty Searchable Database have been incorporated directly from authentic sources (i.e., United States Department of State, Marine Mammal Commission, Committee for Environmental Protection, and host nations for the ATCM). Overall implementation of the Antarctic Treaty Searchable Database is illustrated in Figure 4.

Implementation History
The history of the Antarctic Treaty Searchable Database goes back to 1998, when the United States Department of State was contacted about access to digital versions of the policy documents that they were managing as depository government for the 1959 Antarctic Treaty. This query was prompted because information management was rapidly moving toward digital media and the Antarctic Treaty Handbook (United States Department of State 1994) had become unwieldy for case-study activities in an undergraduate Antarctic science and policy course that had been taught since 1982 (Berkman 2002). In 1999, within a month of initially implementing the Antarctic Treaty Searchable Database, the Department of State introduced it at the 23rd ATCM in Lima, Peru.
Although originally intended as a supplement for the university course on Antarctic science and policy (Berkman 2002), the Antarctic Treaty Searchable Database soon evolved into a digital archive that has been maintained and updated subsequently to benefit a diverse community of Antarctic stakeholders (Table 1). The redefined purpose of the Antarctic Treaty Searchable Database has been to facilitate knowledge discovery about the policies and strategies that promote "international cooperation" and the "use of Antarctica for peaceful purposes only" as stated in the Preamble of the 1959 Antarctic Treaty. [...] Treaty System by adopting Decision XXIV-1 at the 24th ATCM to establish the Antarctic Treaty Secretariat in Buenos Aires. As these international negotiations regarding the Antarctic Treaty Secretariat were underway, the Antarctic Treaty Searchable Database was linked to the websites for the 24th and 25th ATCM in St. Petersburg and Warsaw, respectively. In addition to being the first digital collection of Antarctic Treaty documents ever produced, the Antarctic Treaty Searchable Database remains the most comprehensive source globally for integrating policy documents from the Antarctic Treaty System.

Conventional Granularity Limitations
The potential to discover meaningful relationships within and between the information resources grows exponentially with their granularity. For example, for a given search query, two books could generate 4 possible results (i.e., one book or the other, both books together, or neither of the books). If each book were divided into two granules, there would be 16 possible combinations with 0, 1, 2, 3 or 4 granules. If each book were divided into four granules (i.e., 8 total), there would be 256 possible combinations with 0 to 8 granules. Consequently, among N granules there are 2^N possible relationships. Being able to express and then decompose the ternary, quaternary and higher-order relationships may reveal functional dependencies among the granules or digital entities (Jones and Song 1996). Practically, the number of possible relationships among even 100 digital objects (i.e., 2^100) is too large to manage comprehensively on the front end. Nonetheless, conventional strategies involve descriptions of relationships on the front end with markup languages (Gill and Ratnakar 2001, Fensel et al. 2003) that add structure to information resources with tags to delimit, contain, or define the borders of certain content. For example, this front-end limitation applies to ontologies (McGuiness and Harmelen 2004, Lagoze et al. 2005) that describe relationships among components, properties, functions and processes of digital resources as well as taxonomies (Szykman et al. 1999, Daconta 2005). Aside from these limitations, there also is the practical concern that adding markup tags throughout a digital information resource is a form of contamination that may compromise its authentic content into the future. Importantly, by defining the relationships on the front end for the purpose of accessing information, results on the back end are effectively constrained, which greatly reduces the opportunity to be surprised.
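The counts in the two-book example can be checked by brute force. The short Python sketch below enumerates every subset of N granules (including the empty result) and confirms that the total equals two raised to the power N for the cases given in the text; the function name is hypothetical.

```python
from itertools import combinations


def count_combinations(n_granules):
    """Count every subset of n granules, including the empty result."""
    granules = range(n_granules)
    return sum(
        1
        for size in range(n_granules + 1)
        for _ in combinations(granules, size)
    )


for n in (2, 4, 8):
    # 2 books, 4 granules and 8 granules, as in the example above
    assert count_combinations(n) == 2 ** n
    print(n, count_combinations(n))

# For 100 digital objects the count is computed directly, because the
# subsets are far too numerous to enumerate or to describe in advance.
print(2 ** 100)
```

Enumeration is feasible only for very small N, which is the practical reason that relationships among granules cannot all be described in advance.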
Given these limitations and the suggestion that relationships cannot be managed comprehensively on the front end, what strategies are available to reveal the 2^N relationships among granules on the back end? In addition to applications of markup, which is considered to be structural metadata, knowledge discovery also is facilitated by descriptive and administrative metadata (Hodge 2001). With regard to descriptive metadata, there is an expanding universe of schema for different disciplines, institutions and activities (e.g., http://www.mapageweb.umontreal.ca/turner/meta/english/) that each contain different sets of attributes (e.g., name, size, data type, use restrictions, etc.) that must be defined or documented with specified nomenclatures for every digital object (Duval et al. 2001). More importantly, metadata does not scale, which is a principal reason behind the widely-held notion that there is "structured" and "unstructured" information. Recognizing that all information has structure (Fig. 2), however, digital information in reality is either "managed" or "unmanaged" with conventional technologies.
A simple experiment can be constructed to illustrate the scalability limitations of metadata (Fig. 5). Consider a book that has a volume of 20 (in arbitrary units) and assume that each completed metadata schema has a volume of 1 (in arbitrary units). If the book is divided into two granules, each of which must have its own metadata schema, then the total volume of the book remains constant while the total volume of metadata schema has doubled. If the granularity is repeatedly doubled, the volume of metadata soon will overwhelm the volume of the actual data that is being managed. The additional metadata also requires increased effort to generate, store and process, which translates into added costs and reduced efficiency. Moreover, if the metadata is stored in separate repositories, then loss of the metadata could compromise the preservation of the actual data.

FIGURE 5: A simple model to illustrate the exponentially increasing volume of metadata, as the granularity is doubled, relative to the actual data within a single digital resource. In this model, the total volume of all information granules is constant (arbitrarily 20 units) and total volume of the metadata schema for each granule is fixed (arbitrarily 1 unit) independent of granule size. Adapted from Berkman and Morgan (2003).
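The arithmetic of this model can be reproduced directly. In the Python sketch below, the two volumes come from the text (20 units of data, 1 unit per metadata schema); the function names are hypothetical and the code only illustrates the model, not any measured collection.

```python
DATA_VOLUME = 20    # total volume of the book (arbitrary units, from the text)
SCHEMA_VOLUME = 1   # volume of one completed metadata schema per granule


def metadata_volume(doublings):
    """Total metadata volume after the granularity has been doubled n times."""
    return SCHEMA_VOLUME * 2 ** doublings


def doublings_until_overwhelmed():
    """Smallest number of doublings at which metadata exceeds the data."""
    n = 0
    while metadata_volume(n) <= DATA_VOLUME:
        n += 1
    return n


for n in range(7):
    print(n, 2 ** n, DATA_VOLUME, metadata_volume(n))

print(doublings_until_overwhelmed())
```

Under these assumptions, the metadata volume passes the data volume at the fifth doubling (32 granules), while the data volume itself never changes.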
The Antarctic Treaty Searchable Database is an example of information resources that are being managed with increased granularity, but without conventional metadata or markup. Moreover, the Antarctic Treaty Searchable Database integrates granules to achieve 2^N possible information relationships without conventional "database" manipulations of tables.
The capacity to discover relationships with the Antarctic Treaty Searchable Database is further reflected by its 822 information granules, in contrast to the Website for the United States Department of State (http://www.state.gov/g/oes/rls/rpts/ant/) that includes the Handbook of the Antarctic Treaty System in 18 'locked' PDF files along with HTML files of five major documents (e.g., 1991 Protocol on Environmental Protection to the Antarctic Treaty). With such websites, each user is required to conduct full-text searches, one digital resource at a time, before being able to cut and paste and then organize the relevant pieces of information. These steps are automated with the Antarctic Treaty Searchable Database and other DigIn® applications (e.g., Marine Mammal Commission Digital Library of International Environmental and Ecosystem Policy Documents, http://nsdl.tierit.com).

A Dynamic Integration Application
The ability to integrate and generate objective relational schema based on the inherent structure of the information resources can be illustrated with the Antarctic Treaty Searchable Database. From the 822 granules in the 6th edition of the Antarctic Treaty Searchable Database (Table 2), for example, 23 granules contain the term "peaceful" (Fig. 6).

FIGURE 6: Expandable-collapsible hierarchy that was dynamically generated from the 6th Edition of the Antarctic Treaty Searchable Database (Table 2) with "peaceful" as the integration query. Objective policy relationships within and between years are derived from the information granules that were generated based on Rules 1-6 (see text).
FIGURE 7: [...] (Table 2). These relational profiles were based on: (a) search terms of "minor", "asses*", "impac*" and "valu*" where * is the wildcard character; and (b) combinations of 2, 3 or 4 of the above search terms. The number of policy measures equates with the number of granules that are displayed in the expandable-collapsible hierarchies (e.g., Fig. 6).
The granules, which are displayed in the expandable-collapsible hierarchy, identify policy relationships within and between the Antarctic Treaty meetings that were convened from 1959 to 2005. As can be seen, "peaceful" is a common feature of "dispute settlement" in the legal institutions that emerged from the Antarctic Treaty in 1980, 1988 and 1991. Moreover, upon closer inspection of the individual granules, the same phrase was reproduced in each year (i.e., "…dispute resolved by negotiation, inquiry, mediation, conciliation, arbitration, judicial settlement or other peaceful means…"). These results are objective because all relevant granules (i.e., those with "peaceful") are identified and each unique granule only occurs once in the hierarchy.
Relationships that can be displayed objectively also can be quantified accurately to test hypotheses, such as the hypothesis that key policy concepts have been increasingly integrated into the adopted "measures" over time. As an illustration, consider Antarctic environmental protection, which involves human impacts that are assessed as being "minor or transitory" in relation to various Antarctic Treaty System values. Based on data extracted from the hierarchal displays (e.g., Fig. 6), trends in the incorporation of key environmental concepts into Antarctic Treaty measures can be identified (Fig. 7).
Figure 7a shows that key terms have been incorporated increasingly into new Antarctic Treaty policies, with the largest change among "impact" concepts. In addition, their date of first use can be identified, as with the "value" concept that began appearing in 1961. Similarly, Figure 7b shows that policy measures progressively incorporated 2 then 3 and finally all 4 of the key environmental terms. Not only do the quantitative analyses (Figs. 7a,b) support the above hypothesis, but they reveal that trends can be objectively extracted from otherwise qualitative information in relation to fixed coordinate systems, such as time or space.
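Counts of this kind could be derived along the following lines. The Python sketch translates the four wildcard search terms into regular expressions and tallies, per year, how many granules match each term and how many match 2, 3 or 4 of the terms together. The wildcard translation, the function names and the sample granules are assumptions for illustration; they do not reproduce the database contents or the values plotted in Figure 7.

```python
import re
from collections import Counter

QUERIES = ["minor", "asses*", "impac*", "valu*"]  # search terms from the text


def to_regex(query):
    """Translate a query with a trailing * wildcard into a word-level regex."""
    stem = re.escape(query.rstrip("*"))
    suffix = r"\w*" if query.endswith("*") else ""
    return re.compile(r"\b" + stem + suffix + r"\b", re.IGNORECASE)


PATTERNS = {query: to_regex(query) for query in QUERIES}


def matched_terms(text):
    """Return the set of search terms that occur in one granule."""
    return {q for q, pattern in PATTERNS.items() if pattern.search(text)}


def profile(granules):
    """Tally granules per (year, term) and per (year, number of terms)."""
    per_term = Counter()
    per_count = Counter()
    for year, text in granules:
        terms = matched_terms(text)
        for q in terms:
            per_term[(year, q)] += 1
        if len(terms) >= 2:  # combinations of 2, 3 or 4 terms
            per_count[(year, len(terms))] += 1
    return per_term, per_count


# Invented sample granules, not quoted from any adopted measure.
sample = [
    (1975, "Parties shall assess the impact of activities."),
    (1991, "Activities with a minor or transitory impact on Antarctic "
           "values shall be assessed."),
    (1991, "The scientific value of the area is recognised."),
]

per_term, per_count = profile(sample)
print(sorted(per_term.items()))
print(sorted(per_count.items()))
```

Each granule is counted at most once per term and once per combination size, which parallels the statement that each unique granule occurs only once in the hierarchy.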

Persistent Dynamic Displays
The Antarctic Treaty Searchable Database has expanded from 608 to 822 granules between 1999 and 2005 (Table 2). Each of the annual editions of the Antarctic Treaty Searchable Database is preserved on a webCDserver™ (Berkman 2002) that contains a fully-executable, stand-alone copy of the Website with all of the associated files. This type of preservation activity can be used to archive fixed digital records (Gilliland-Swetland and Eppard 2000) while facilitating persistent access in dynamic environments, despite the obsolescence of the original hardware and software. Such solutions are necessary to resolve the paradox of digital preservation (Chen 2001): "to maintain digital information intact" while providing "access to this information in a dynamic use context", which is a central feature of the International Research on Permanent Authentic Records in Electronic Systems Project that involves the national archives from 13 countries (Duranti 2005a; http://www.interpares.org).
A necessary feature of records in an archive is that they are fixed at the time of preservation to ensure that they have not been altered in an undocumented manner. The practical result of fixity is consistent and reproducible access to records. With the Antarctic Treaty Searchable Database, the dynamic integration of granules results in reliable, reproducible and accurate hierarchies for given queries and time periods (e.g., Fig. 6). Under such circumstances, the results of a dynamic process would be fixed.
Records, which are created in the course of business, "constitute a primary and privileged source of evidence about the activities and the actors involved in them" (Thibodeau 2001). Records, which are set aside for archiving, also have necessary characteristics (Duranti 2005b):
• fixed form that can be rendered;
• unchangeable content;
• explicit linkages to other records;
• identifiable administrative context;
• author, addressee and writer; and
• action in which the record participates or supports.
In digital environments, however, these six characteristics may not be sufficient to provide the necessary evidence about the accuracy of a digital record that was generated dynamically by a computer system in response to an interaction or query. For example, if someone contested Figure 6 on the grounds that there was an error with its generation, what would be necessary to validate its accuracy?
The persistent solution involves being able to reconstruct the record with the original software or an emulation and then to test for anomalies in its content or relationships (Thibodeau 2002). To accomplish this reconstruction, it would be necessary to have detailed documentation about the system content, parameters and functionalities at the time the record was generated as well as a log of the interaction or query. With the Antarctic Treaty Searchable Database, the documentation is represented by the content of the webCDservers™ (Table 2), the flow diagrams (Figs. 3 and 4) and detailed descriptions of the underlying digital integration system (Berkman and Morgan 2003, Berkman et al. 2005).
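One way to picture such a reconstruction test is sketched below in Python. A log entry stores a digest of the fixed edition, the query, and a digest of the generated result; validation regenerates the result from the preserved edition and compares the entries. The use of SHA-256 digests, the log fields, the simple substring query and the sample granules are all assumptions for illustration and are not described in the sources cited above.

```python
import hashlib
import json


def edition_digest(granules):
    """SHA-256 over the tagged granules of a fixed edition."""
    h = hashlib.sha256()
    for tag in sorted(granules):
        h.update(tag.encode("utf-8") + b"\x00")
        h.update(granules[tag].encode("utf-8") + b"\x00")
    return h.hexdigest()


def run_query(granules, term):
    """Stand-in for dynamic integration: tags whose text has the term."""
    return sorted(
        tag for tag, text in granules.items() if term.lower() in text.lower()
    )


def log_interaction(granules, term):
    """Record what is needed to regenerate and check a dynamic display."""
    result = run_query(granules, term)
    result_digest = hashlib.sha256(json.dumps(result).encode("utf-8"))
    return {
        "edition": edition_digest(granules),
        "query": term,
        "result": result_digest.hexdigest(),
    }


def validate(granules, entry):
    """Regenerate the display from a preserved edition and compare it."""
    return log_interaction(granules, entry["query"]) == entry


# Invented sample edition with hypothetical tags and text.
edition = {
    "1959/Treaty/Article A": "Sample text about peaceful purposes.",
    "1959/Treaty/Article B": "Sample text about scientific cooperation.",
}
entry = log_interaction(edition, "peaceful")

altered = dict(edition)
altered["1959/Treaty/Article B"] = "Undocumented change with peaceful added."

print(validate(edition, entry), validate(altered, entry))
```

In this sketch an undocumented change to any granule alters the edition digest, so a contested display can be traced either to the preserved content or to the generating process.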
The challenge with digital records is to provide persistent access beyond a static screenshot or locked image file, which are effectively hardcopy records. It also is relevant to consider the efficiency and cost-effectiveness (Thibodeau 2001) of storing large volumes of static records that are generated by dynamic processes based on user interactions, such as querying a geographic information system or relational database for some administrative decision. The bottom line is that static records are not sufficient for all evidentiary purposes, as illustrated above. Consequently, it is necessary to establish strategies and methods to implement dynamic records that utilize the inherent structure of information, which is the unique distinction between digital and hardcopy information resources (Figs. 1 and 2). The Antarctic Treaty Searchable Database and its underlying methods offer a case study to implement persistent dynamic records that can be trusted.

Conclusion
The paradigm shift created by digital technologies is the opportunity to dynamically and objectively manage the structure of information as well as its content and context. Unlike the subjective decisions that may vary from person to person to describe the context and content of a record, the structure is an inherent element of a record that can be described objectively. It is this ability to automatically utilize the inherent structure of information that distinguishes information management with digital media from the hardcopy media that had been applied previously in our civilization (Figs. 1 and 2).
The Antarctic Treaty Searchable Database demonstrates a well-defined integration method that utilizes the inherent structure of digital information resources to automatically generate information granules. Based on user-defined integration queries, the information granules then can be dynamically combined into accurate, reliable and reproducible relational schema. The power of automated granularity is in efficiently discovering objective relationships among information resources without conventional markup, metadata or databases (e.g., Figs. 5-7). Such information integration is relevant to library and archival programs that require long-term preservation of authentic digital resources, as investigated by the International Research on Permanent Authentic Records in Electronic Systems Project (http://www.interpares.org). Automated granularity also has implications for realizing the vision of the World Summit on the Information Society when discovering "knowledge is the common wealth of humanity."