Building a web of data via the Semantic Web requires consistency in how data are represented. In the Semantic Web, anyone can invent and express new concepts, but the “universal Web” can only emerge through tying concepts together (Berners-Lee, Hendler, & Lassila, 2001). Introductory texts on ontology development often list the ability to reuse existing ontologies as one of the key attractions of the Semantic Web (Allemang & Hendler, 2011). Adopting existing ontologies for a new application, however, requires substantial knowledge of Semantic Web principles and underlying technologies. Naïve approaches, such as simply selecting ontologies that match keywords of interest, can lead to “Frankenstein ontologies” that run afoul of good ontological engineering practices (Corcho, Poveda-Villalon, & Gomez-Perez, 2015).
The EarthCollab project seeks to utilize the Semantic Web to advance geoscience research and the discovery of connections among research projects. This paper explores the effort of efficiently developing a local ontology while employing established ontologies that enhance connections to the broader community. The EarthCollab project, a collaboration between the University Corporation for Atmospheric Research (UCAR)/National Center for Atmospheric Research (NCAR), UNAVCO, and Cornell University, is using Semantic Web and linked data technologies to facilitate the coordination and organization of complex scientific projects, their communities, their tools, and their products. The EarthCollab partners are using the VIVO semantic software suite to enable more coherent discovery of distributed information and data for large multi-disciplinary scientific projects. To facilitate easier integration and data sharing across the geoscience community, the EarthCollab goal has been to reuse existing ontologies as much as possible when developing project ontologies and web applications. The questions this paper addresses are as follows:
- What are the key decision points for new Semantic Web applications in deciding when to reuse existing ontologies and when to develop original ontologies?
- How can new Semantic Web projects most efficiently and effectively identify and select ontologies to reuse?
2 EarthCollab Technology
EarthCollab is using the VIVO open-source software suite to represent and describe scientific networks (http://vivoweb.org). Since 2003, VIVO has been using a semantic approach to model research and scholarly activities focused on connecting many different types of entities – people, organizations, events, courses, grants, and publications – through named relationships. The primary use of VIVO is to provide an infrastructure for research networking and to enable the discovery of research and scholarship across disciplines by leveraging personal profiles, publication records, grant information, and subject expertise information. A sample VIVO profile is shown in Figure 1. This is a customizable visual display of information about a UNAVCO staff member, showing publications, research expertise, datasets and other information in one location.
The EarthCollab project is extending the use of the VIVO software to geoscientific settings, representing datasets, geoscientific instruments, and research projects (Figure 2). Datasets have associated publications, organizations, grants, creators, managers, instruments, and derivative datasets. Within the VIVO data model, almost everything can be represented as a first-order object, as long as appropriate ontologies are declared (Khan et al. 2011). EarthCollab is one of a small number of projects that are using VIVO to represent scientific information (Ma et al. 2014; Wilson et al. 2014).
The VIVO software suite leverages the VIVO-ISF (Integrated Semantic Framework) ontology, which defines types (classes), and the relationships between them (properties), for researchers, organizations, and a range of research activities (Mitchell et al. 2011). Existing widely used ontologies are reused in the core VIVO ontology, such as Friend of a Friend (FOAF) to define people, ensuring that VIVO can easily exchange and integrate data with external data sources.
3 EarthCollab Ontology Development Methodology
The EarthCollab project has identified a number of decision drivers that impact how to evaluate appropriate ontologies, and how to develop a set of effective ontological structures (Figure 3). This set of drivers includes the development of use cases, the constraints that existing systems and metadata place on new Semantic Web applications, and the characteristics of the VIVO application itself. In addition, EarthCollab has attempted to follow community recommendations for good ontological modeling practices in evaluating the characteristics of external ontologies.
3.1 Use Cases
Semantic Web methodologies recommend that ontology design and selection should follow from concrete use cases and application-specific concept designs (Fox & McGuinness 2008; Guizzardi 2010). The two main stakeholder communities for the EarthCollab project are: (1) the Bering Sea Project, an interdisciplinary field program whose data archive is hosted by NCAR’s Earth Observing Laboratory (EOL), and (2) UNAVCO, a geodetic facility and consortium that supports diverse research projects. Use case development exercises were conducted within the NCAR EOL and UNAVCO data center teams to compile high-level statements of specific tasks that scientists should be able to achieve through the use of EarthCollab systems. The use case descriptions followed the template provided by Fox and McGuinness (2008). This use case activity identified key tasks that EarthCollab should support, including: “finding all publications that used the GPS data held by UNAVCO”, and “identifying the people responsible for the collection of a specific Bering Sea dataset held by NCAR EOL.” The concept models based on the use cases depict the key entities of interest, e.g., publications, data, people, organizations, instruments, and projects. The concept models also depict the relationships that can exist between entities.
In addition to the data center-focused use cases, EarthCollab conducted a user engagement workshop in December 2014 in which nine scientists from the Bering Sea Project and UNAVCO communities were led through use case discussions. The resulting science-focused use cases had two main foci. One was identifying geospatial regions where certain data parameters show specific features, such as “Find the areas in a defined region where benthic biomass is high enough for walrus feeding”. The other was finding information related to a spatially and/or temporally specific natural event, such as an earthquake or a major field project (e.g., ice sheet melting in Greenland and its consequences).
3.2 Existing Metadata
Another set of key ontology concepts, terms, and relationships was developed through looking at the metadata schemes already in place at UNAVCO and NCAR EOL. These existing metadata stores contain significant amounts of structured and unstructured information with community-vetted terminologies and established relationships between entities. They imposed important constraints on the EarthCollab ontology, because they contain information about particular entity and relationship types. Existing metadata stores limit the statements that can be made about datasets and their relationships to other key entities, such as investigators and observational platforms, because they may not contain certain information of interest or may have incomplete information.
Existing metadata may not directly map to established ontologies. For example, UNAVCO metadata identifies Principal Investigators for GPS/GNSS stations. This is incompatible with the domain and range assertions associated with the “Principal Investigator Role” class in the subset of VIVO-ISF bundled with VIVO. The VIVO-ISF ontology does not include a “station” concept, and the “Principal Investigator Role” is modeled in association with grants only. Similarly, publications that have used UNAVCO or NCAR EOL datasets might not directly cite the datasets that they have used, instead indicating data use in the “acknowledgements” or “methods” sections of papers. The VIVO-ISF ontology, however, only models relationships between publications and datasets via a “cited as a data source” property. A paper describing an earthquake early warning system is one example where the paper references a collection of datasets used by the system (identified in existing metadata as a “related dataset”), but the authors did not use the datasets as data sources for the publication as the object property states.
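The station example can be made concrete with a small check. The following is a minimal sketch, assuming a toy list of triples in Python; every identifier (the `ex:` terms and the `domain_mismatches` helper) is a hypothetical placeholder rather than an actual VIVO-ISF or UNAVCO term. Note that under RDFS semantics a domain assertion acts as an inference rule, so a reasoner would conclude that the station is a grant rather than report an error; the sketch instead treats the domain as a constraint in order to surface the mismatch.

```python
# Toy triples: (subject, predicate, object). All names are illustrative
# placeholders, not real VIVO-ISF or UNAVCO identifiers.
RDF_TYPE = "rdf:type"

# Assumed constraint: the PI-role link may only be attached to grants.
ALLOWED_SUBJECT_TYPES = {"ex:hasPrincipalInvestigatorRole": {"ex:Grant"}}

data = [
    ("ex:grant1", RDF_TYPE, "ex:Grant"),
    ("ex:station1", RDF_TYPE, "ex:Station"),
    ("ex:grant1", "ex:hasPrincipalInvestigatorRole", "ex:role1"),
    ("ex:station1", "ex:hasPrincipalInvestigatorRole", "ex:role2"),
]


def types_of(node, triples):
    """Return the set of classes asserted for a node."""
    return {o for s, p, o in triples if s == node and p == RDF_TYPE}


def domain_mismatches(triples, allowed):
    """List triples whose subject has none of the types the property allows."""
    bad = []
    for s, p, o in triples:
        if p in allowed and not (types_of(s, triples) & allowed[p]):
            bad.append((s, p, o))
    return bad


mismatches = domain_mismatches(data, ALLOWED_SUBJECT_TYPES)
print(mismatches)
```

In this toy data only the station statement is flagged, which mirrors the situation described for existing station metadata: the source data are valid, but the reused property does not admit them.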
Another example of the importance of existing metadata in our development of the EarthCollab ontology approach is in our use of standard vocabularies of concept and keyword terms. The NCAR EOL data management team already assigns Global Change Master Directory (GCMD) keywords to datasets within their existing metadata. UNAVCO has created a local taxonomy that allows UNAVCO community members to self-identify as having expertise in certain scientific topics, software tools, and instrumentation types. We are currently investigating mappings between this local UNAVCO taxonomy, GCMD, and other potentially overlapping vocabularies, such as the Semantic Web for Earth and Environmental Terminology (SWEET) ontologies and Library of Congress Subject Headings (LCSH), as well as keyword terms provided by scientific societies (e.g. Geological Society of America (GSA) and American Geophysical Union (AGU)) and journal publishers via CrossRef.
3.3 Semantic Application
The VIVO application is built around the VIVO-ISF ontology. Additional external ontologies to be used with VIVO must therefore be compatible with the VIVO-ISF ontology approach. The VIVO-ISF ontology uses the Basic Formal Ontology (BFO) as its upper-level ontology. BFO models roles, such as an authorship role, as a class (Arp & Smith 2008), whereas many other ontologies model authorship via a direct property (e.g. “isAuthorof”). VIVO-ISF uses these kinds of “context classes” (Rodriguez, Bollen, & Van de Sompel 2007; Rust 2005), shown in Figure 4, for all roles. Modeling “Authorship” as a context class instead of a property allows for the designation of additional information, such as author order, to be related to the authorship of a particular document. Various aspects of the VIVO application input forms and web site display are structured to make use of these context classes.
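The difference between the two modeling styles can be illustrated with a few triples. This is a minimal sketch in Python under stated assumptions: the `ex:` property names (`ex:forDocument`, `ex:hasAuthor`, `ex:rank`) and the `ordered_authors` helper are hypothetical and do not reproduce the actual VIVO-ISF terms.

```python
# Direct-property style: one triple per author, with no node on which
# author order could be recorded.
direct = [
    ("ex:alice", "ex:isAuthorOf", "ex:paper1"),
    ("ex:bob", "ex:isAuthorOf", "ex:paper1"),
]

# Context-class style: each authorship is its own node, so extra facts
# (here a rank) can be attached to it. Property names are placeholders.
context = [
    ("ex:authorship1", "ex:forDocument", "ex:paper1"),
    ("ex:authorship1", "ex:hasAuthor", "ex:alice"),
    ("ex:authorship1", "ex:rank", 2),
    ("ex:authorship2", "ex:forDocument", "ex:paper1"),
    ("ex:authorship2", "ex:hasAuthor", "ex:bob"),
    ("ex:authorship2", "ex:rank", 1),
]


def ordered_authors(paper, triples):
    """Return the authors of a paper sorted by the rank on each authorship."""
    authorships = {s for s, p, o in triples
                   if p == "ex:forDocument" and o == paper}
    ranked = []
    for a in authorships:
        author = next(o for s, p, o in triples
                      if s == a and p == "ex:hasAuthor")
        rank = next(o for s, p, o in triples if s == a and p == "ex:rank")
        ranked.append((rank, author))
    return [author for _, author in sorted(ranked)]


ordered = ordered_authors("ex:paper1", context)
print(ordered)
```

The `direct` list carries no information from which an equivalent ordering could be recovered, which is the trade-off the context-class approach addresses at the cost of extra nodes.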
Reusing ontologies often requires addition of entities and relationships that are needed for a new application. For example, UNAVCO and NCAR EOL are consortia of universities and related geoscience organizations. The person and organization aspects of the ontology need to represent classes of members that are external to either organization. Figure 5 shows an EarthCollab custom class, AssociateMemberRole, and a custom property, hasLiaison, which are important to the UNAVCO application in connecting the UNAVCO consortium to its member organizations and associated representatives.
Another example of a VIVO application dependency is the application’s reliance on domain and range assertions. To provide flexibility, many ontologies do not have domain and range declarations. In the VIVO application, however, it is often helpful to have domains and ranges expressed through value constraint property restrictions (http://www.w3.org/TR/owl-ref/#Restriction) because they are used to structure web forms, make intelligent autocomplete suggestions, and organize information displays.
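As an illustration of why declared ranges help an application, the sketch below filters autocomplete suggestions by the range of the property being edited. It is a simplified Python sketch, not VIVO's implementation; the property `ex:hasContactPerson`, its assumed range, the sample individuals, and the `autocomplete` function are all hypothetical.

```python
# Assumed range declaration for one property (placeholder names).
PROPERTY_RANGES = {"ex:hasContactPerson": "ex:Person"}

# (identifier, class, display label) for a few made-up individuals.
INDIVIDUALS = [
    ("ex:p1", "ex:Person", "Alice Example"),
    ("ex:p2", "ex:Person", "Alan Example"),
    ("ex:o1", "ex:Organization", "Alpine Example Institute"),
    ("ex:p3", "ex:Person", "Bob Example"),
]


def autocomplete(prop, typed, individuals=INDIVIDUALS, ranges=PROPERTY_RANGES):
    """Suggest labels matching the typed prefix, limited to the property's range.

    If no range is declared for the property, every class is allowed.
    """
    wanted = ranges.get(prop)
    typed = typed.lower()
    return sorted(
        label
        for _, cls, label in individuals
        if (wanted is None or cls == wanted) and label.lower().startswith(typed)
    )


suggestions = autocomplete("ex:hasContactPerson", "al")
print(suggestions)
```

With the range declared, the organization is excluded from the suggestions; for a property without a declared range, the same prefix would also match the organization, which is the less helpful behavior the paragraph above alludes to.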
3.4 External Ontology Characteristics
In parallel with the above activities, the EarthCollab team investigated existing ontologies that mapped to the concept and relationship structures emerging from the use cases and metadata analysis. This process was largely informal, and was based on project members’ participation in discussions within the geoscience community, such as the Earth Science Information Partners (ESIP) Semantic Web Cluster, the NSF EarthCube program, and the American Geophysical Union’s Earth and Space Science Informatics (ESSI) focus group. Domain-specific ontology portals, such as http://ontobee.org or http://bioportal.bioontology.org/, are of limited utility for projects outside of those domains.
As possibly relevant ontologies were identified, EarthCollab project members examined their structural and semantic characteristics for applicability to our project goals. This examination was important as the full implications of ontology reuse may only be visible through detailed comparison of conceptual models. Conceptual models that use differing underlying higher-level ontologies may not be compatible, even if the same concepts are being modeled (Cox 2013; Cox in press).
In examining external ontologies, EarthCollab project members examined the definitions for possibly relevant classes, along with the domains and ranges defined for properties that connected those classes. For example, in evaluating whether the VIVO-ISF ontology would support the modeling of natural hazard events, such as earthquakes or hurricanes, we examined the Event ontology (http://purl.org/NET/c4dm/event.owl). The Event ontology was developed to model human events, such as conferences and concerts, but we investigated whether it is generic enough that natural hazard events could be modeled. The VIVO-ISF ontology includes the Event:Event class, along with a property labeled “related documents”. Initially, this combination of class and property appeared appropriate to our case, as it could be stated that a given natural hazard event, such as an earthquake, had a set of related documents housed at UNAVCO or NCAR EOL. Looking closer, however, the actual property underlying “related documents” is bibo:presents, which would be inappropriate in relation to natural hazard events.
Maintenance and versioning are key considerations for ontology reuse (Hyvönen 2010). Some ontologies have more robust maintenance support than others. Ontology updates can be a source of improvement or problems for ontology re-users. For example, the VIVO-ISF use of context classes was introduced as part of an ontology update. The original VIVO ontology modeled person and organizational roles via direct relationships (Corson-Rikert et al. 2012; Mitchell et al. 2011; Torniai et al. 2013). VIVO application users who transitioned to the new ontology faced a sizable conversion process, and a small minority of VIVO application users decided not to make the switch. These kinds of ontology maintenance and versioning challenges may push new Semantic Web applications to limit the scope of ontology reuse, because potential versioning problems are inherently tied to the number of ontologies in use.
3.5 Ontology Modeling Recommendations
In defining an EarthCollab ontology approach, we attempted to follow appropriate guidelines and recommendations from published ontological practices. For example, one area of research focuses on evaluating representational models and task-specific assumptions and decisions (Uschold et al. 1998). Another body of research, exemplified by the “Minimum Information to Reference an External Ontology Term” (MIREOT) recommendations, addresses ontology importing and the selective reuse of a subset of terms from larger ontologies (Courtot et al. 2009; d’Aquin et al. 2007; Pan, Serafini, & Zhao 2006).
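Selective term reuse of this kind can be sketched as a small record. As we understand the MIREOT guidelines, the minimum information for an external term is its source ontology, the term identifier, and its position (direct superclass) in the target ontology. The Python sketch below is an illustration under that reading; the class `ExternalTermReference`, the `ex:importedFrom` annotation, and all example identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalTermReference:
    """Minimum information kept when borrowing one term from a larger ontology."""

    source_ontology: str    # identifier of the ontology the term comes from
    source_term: str        # identifier of the borrowed term itself
    target_superclass: str  # where the term is placed in the local hierarchy

    def to_triples(self):
        """Emit the statements added locally (predicate names are placeholders)."""
        return [
            (self.source_term, "rdfs:subClassOf", self.target_superclass),
            (self.source_term, "ex:importedFrom", self.source_ontology),
        ]


ref = ExternalTermReference(
    source_ontology="ex:LargeOntology",
    source_term="exlarge:Instrument",
    target_superclass="ec:Equipment",
)
triples = ref.to_triples()
print(triples)
```

Only two statements are added locally for the borrowed term, rather than importing the whole source ontology.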
Researchers have discussed potentially problematic aspects of improper ontological modeling practices. The possible negatives of improper ontology reuse come in a number of forms. Hogan, Harth, and Polleres (2008) describe the potential for ontology reuse to lead to “ontology hijacking,” where the secondary ontology can introduce ontological commitments that change the semantics of specific classes or properties in the original ontology. For example, declaring new superclasses of classes from commonly used ontologies, such as foaf:Person, is non-problematic in a local instance, but can introduce problematic inferences for other applications that use those same classes when data and ontologies come together at larger scale. Similarly, inconsistent use of the owl:sameAs property can result in inferencing of equivalences between two entities that may not be appropriate in all contexts (Halpin et al. 2010).
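A crude way to screen a local ontology for statements of this kind is to look for schema-level axioms whose subject belongs to someone else's namespace. The Python sketch below is a heuristic illustration only, not the method of Hogan, Harth, and Polleres; the class names, the choice of predicates, and the `possible_hijacks` helper are assumptions made for the example.

```python
LOCAL_PREFIX = "ec:"

# Schema-level predicates that change what can be inferred about their subject.
AXIOM_PREDICATES = {
    "rdfs:subClassOf",
    "rdfs:domain",
    "rdfs:range",
    "owl:equivalentClass",
}

# Hypothetical axioms in a local ontology file.
local_axioms = [
    # Local class specialises an external one: no effect on foaf:Person.
    ("ec:Researcher", "rdfs:subClassOf", "foaf:Person"),
    # New superclass declared for an external class: every foaf:Person
    # elsewhere would be inferred to be an ec:Contributor.
    ("foaf:Person", "rdfs:subClassOf", "ec:Contributor"),
    # Not a schema-level axiom, so it is ignored by the check.
    ("foaf:Person", "rdfs:label", "Person"),
]


def possible_hijacks(axioms, local_prefix=LOCAL_PREFIX):
    """Return axioms that redefine a term outside the local namespace."""
    return [
        (s, p, o)
        for s, p, o in axioms
        if p in AXIOM_PREDICATES and not s.startswith(local_prefix)
    ]


flagged = possible_hijacks(local_axioms)
print(flagged)
```

Flagged statements are candidates for review rather than definite errors, since some redefinitions may be intentional and harmless in a given deployment.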
It is rare that a pre-existing ontology meets all of a new application’s needs. In the EarthCollab case, clearly defined use cases helped to determine the extent to which existing ontologies should be used. Inventing a new ontology is required when an application has specific needs or concepts that existing ontologies do not address. In assessing existing ontologies based on the five categories of criteria shown in Figure 3, two ontologies were identified as being most appropriate for EarthCollab applications: (1) the Global Change Information System (GCIS) ontology, and (2) the Data Catalog (DCAT) ontology. These two ontologies complement EarthCollab’s use of the VIVO-ISF ontology for modeling people, organizations, grants, and publications.
The GCIS ontology (http://data.globalchange.gov/gcis.owl) was developed by the Tetherless World Constellation (TWC) at Rensselaer Polytechnic Institute (RPI), as part of the U.S. Global Change Research Program. EarthCollab use of the GCIS ontology focuses on modeling scientific instruments, platforms, datasets and the relationships between them. The DCAT ontology (http://www.w3.org/TR/vocab-dcat/) provides a compact structure to model datasets, data catalogs, and associated information. EarthCollab’s use of the DCAT ontology focuses on describing datasets in greater detail than is possible with the VIVO-ISF and GCIS ontologies. Additional EarthCollab-specific ontology components, such as the ec:sourcePlatform property (Figure 6), filled gaps that none of the other ontologies covered.
Figure 6 shows how these ontologies are coupled together for the EarthCollab applications. This diagram depicts a set of core concepts and relationships that encompass key aspects of the use cases and map to existing metadata schemes used by UNAVCO and NCAR EOL. Many details are omitted in this figure (such as class attributes), but it shows classes and properties from the VIVO-ISF ontology, the GCIS ontology, and other ontologies, such as the Bibliographic Ontology (bibo), which are incorporated into the VIVO-ISF and/or GCIS ontology.
Another task involved evaluating and consolidating modeling among classes with the same name that are present in multiple ontologies (Figure 7). In particular, the VIVO-ISF and GCIS ontologies both contain Project classes, and VIVO-ISF, GCIS, and DCAT ontologies all contain Dataset classes. We chose to model these relations via subclass properties in the EarthCollab ontology to avoid “hijacking” the VIVO-ISF and GCIS Project classes, as well as to take advantage of properties associated with both classes. In the Dataset case, the GCIS and DCAT ontological structures have much more comprehensive modeling related to datasets than the VIVO-ISF ontology. Based on our discussions within the VIVO developer community, the Dataset class in the VIVO-ISF ontology has not been used extensively in VIVO applications. As such, there is little concern about causing unintended ontological commitments by defining it as a subclass of the corresponding GCIS and DCAT Dataset classes.
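The effect of these subclass declarations can be sketched with a small transitive-closure computation. This is a minimal Python sketch under stated assumptions: the prefixed names are shorthand placeholders (the exact class identifiers in each namespace are not taken from the ontologies themselves), and `superclasses` and `inferred_types` are hypothetical helpers that mimic simple RDFS subclass inference.

```python
# Subclass declarations as described in the text, using placeholder names.
SUBCLASS_OF = {
    # A local Project class sits under both external Project classes,
    # leaving the external classes themselves unchanged.
    "ec:Project": {"vivo:Project", "gcis:Project"},
    # The VIVO-ISF Dataset class is declared a subclass of the others.
    "vivo:Dataset": {"gcis:Dataset", "dcat:Dataset"},
}


def superclasses(cls, hierarchy=SUBCLASS_OF):
    """Return all direct and indirect superclasses of a class."""
    seen = set()
    stack = [cls]
    while stack:
        for parent in hierarchy.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_types(asserted_type, hierarchy=SUBCLASS_OF):
    """Types an instance would carry after subclass inference."""
    return {asserted_type} | superclasses(asserted_type, hierarchy)


project_types = inferred_types("ec:Project")
print(sorted(project_types))
```

An instance typed with the local Project class picks up the properties of both external classes, while `superclasses("vivo:Project")` stays empty, meaning nothing new is asserted about the external Project classes.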
The EarthCollab project emphasized the reuse of existing ontologies to support the sharing of information about scientific data, projects, publications, instruments, and researchers. Using the VIVO software suite with its built-in VIVO-ISF ontology as our base, EarthCollab has investigated and selected an additional set of existing ontologies to more fully cover the concepts and relationships relevant to geoscientific research. The process used to integrate these additional ontologies into the VIVO application has involved multiple steps, and has closely adhered to recommended practice for ontology import and reuse. The EarthCollab ontology namespace, which includes only ten statements related to the issues noted above, is available at https://library.ucar.edu/earthcollab/schema/.
NCAR EOL and UNAVCO have each set up an installation of the VIVO system, and are actively ingesting information about the people, publications, instruments, and datasets within their respective communities. Ingested information comes from existing metadata databases at NCAR EOL and UNAVCO and from newly developed sources. The two VIVO instances for the EarthCollab project are: