A Method for Extending Ontologies with Application to the Materials Science Domain

Huanyu Li; Rickard Armiento; Patrick Lambrix

1 Introduction

Since the beginning of the 2000s, materials science has shifted towards its fourth paradigm, (big) data-driven science (). More and more researchers in materials science have realized that data-driven techniques could accelerate the discovery and design of materials. Therefore, a large number of research groups and communities have developed data-driven workflows, including data repositories (for an overview see ()) and data analytics tools for particular purposes. As data-driven techniques become widely used, big data challenges regarding volume, variety, variability and veracity () and challenges in reproducing, sharing, and integrating data (, , , , ) are growing at the same time.

These challenges have also occurred in other fields. For instance, in () the problems of locating, retrieving and integrating data in the biomedical field were addressed. These problems relate to the more recently introduced FAIR principles, which aim to help machines automatically find and use data, and individuals reuse the data (). The FAIR principles state that data should be Findable, Accessible, Interoperable, and Reusable. In several areas, including the materials science domain (), research is under way to make data management conform to these principles. Ontologies and ontology-based techniques are among the recognized enablers of the principles. Ontologies provide a shared standardized representation of knowledge of a domain. By describing data using ontologies, the data will be more findable. By using ontologies for representing the metadata, the level of accessibility can be raised. By using the same terminology as defined by ontologies, interoperability is enabled. Finally, as ontologies are shared and standardized, reusability is supported.

Taking nanotechnology as an example, in () it is stated that there exists a gap between data generation and shared data access. The domain lacks standards for collecting and systematically representing nanomaterial properties. In () stakeholder-identified technical and operational challenges for the integration of data in the nanotechnology domain are presented. The technical challenges mainly refer to (i) the use of different data formats, (ii) the use of different vocabularies, (iii) the lack of unique identifiers, and (iv) the use of different data conceptualization methods. The operational challenges refer to (i) the fact that organizations have different levels of data quality and completeness, and (ii) the lack of understandable documentation. To address these challenges, it has been proposed that ontologies and ontology-based techniques can play a significant role in data-driven materials science and enable reproduction, sharing and integration of data. This was, for instance, the main outcome of a workshop on interoperability in materials modelling organized by the European Materials Modelling Council ().

Although the work is in its infancy, some organizations and research groups have started to develop ontologies and standards for the materials domain (Section 2.2), including the nanotechnology domain. However, developing ontologies is not an easy task and the resulting ontologies are often incomplete. In addition to being problematic for the correct modelling of a domain, such incomplete ontologies also influence the quality of semantically-enabled applications such as ontology-based search and data integration. When used in semantically-enabled applications, incomplete ontologies can lead to valid conclusions being missed. For instance, in ontology-based search, queries are refined and expanded by moving up and down the hierarchy of concepts. Incomplete structure in ontologies therefore influences the quality of the search results. In experiments in the biomedical field, an example was given where a search in PubMed (http://www.ncbi.nlm.nih.gov/pubmed/), a large database with abstracts of research articles in the biomedical field, using the MeSH (Medical Subject Headings) (http://www.nlm.nih.gov/mesh/) ontology would miss 55% of the documents if the relation between the concepts Scleral Disease and Scleritis is missing ().

In this paper, we present a novel method for extending existing ontologies by detecting new concepts and relations in the concept hierarchy that should be included in the ontologies. We do this by presenting a new approach, formal topical concept analysis, that integrates a variant of topic modeling and formal concept analysis. Further, we apply our method to two ontologies (NanoParticle Ontology and eNanoMapper) in the materials science domain. The choice of ontologies in the nanotechnology domain is motivated by the fact that, as discussed above, there is an awareness of the need for ontologies to deal with interoperability and reusability issues. Further, there are not yet many ontologies in materials science (see Section 2.2) and the chosen ontologies are among the more mature ontologies in the field. Therefore, they represent the most difficult case for extending ontologies.

The remainder of the paper is organized as follows. In Section 2 we describe what ontologies are, efforts on ontologies in the materials domain as well as work on extending ontologies. Section 3 describes our approach while Section 4 shows and discusses the results of the application of our approach in the nanotechnology domain. We show how NanoParticle Ontology and eNanoMapper were extended and evaluate the usefulness of the approach. We also compare our results to the results of an experiment with another popular system on the same data. Finally, the paper concludes in Section 5.

2 Background

2.1 Ontologies

Intuitively, ontologies can be seen as defining the basic terms and relations of a domain of interest, as well as the rules for combining these terms and relations. Ontologies are used for communication between people and organizations by providing a common terminology over a domain. They provide the basis for interoperability between systems, and can be used as an index to a repository of information as well as a query model and a navigation model for data sources. They are often used as a basis for integration of data sources, thereby alleviating the variety and variability problems. The benefits of using ontologies include reuse, sharing and portability of knowledge across platforms, and improved maintainability, documentation, and reliability. Overall, ontologies lead to a better understanding of a field and to more effective and efficient handling of information in that field (e.g., ()).

From a knowledge representation point of view, ontologies may contain four components: (i) concepts that represent sets or classes of entities in a domain, (ii) instances that represent the actual entities, (iii) relations, and (iv) axioms that represent facts that are always true in the topic area of the ontology. Axioms can represent such things as domain restrictions, cardinality restrictions, or disjointness restrictions. Ontologies can be classified according to which components and the information regarding the components they contain. As an example, Figure 1 represents a small piece of the NanoParticle Ontology () regarding ‘chemical entity’ and ‘quality’. Regarding chemical entities NanoParticle Ontology contains, for instance, the concepts chemical entity, chemical substance, ion, particle, isotope and molecular entity. The black full arrows represent axioms representing is-a relations, i.e. if A is a B, then all entities that belong to concept A also belong to concept B. We also say then that A is a sub-concept of B. In this example we have that chemical substance, particle, ion, isotope and molecular entity are sub-concepts of chemical entity. Therefore, all chemical substances, particles, ions, isotopes, and molecular entities are also chemical entities. Further, all primary particles are particles, all nanoparticles are primary particles, all polymeric nanoparticles are nanoparticles and all gelatin nanoparticles are polymeric nanoparticles. The is-a relation is transitive such that, for instance, a gelatin nanoparticle is also a particle. Regarding different kinds of qualities NanoParticle Ontology contains, for instance, the concepts particle size, molecular weight, particle concentration, organic, inorganic, shape, chemical composition, density, hydrodynamic size, mass, size, and electric charge. 
Further, particles have qualities; this is represented by an axiom that states that concepts particle and quality are connected to each other by the relation has quality (green dashed arrows in Figure 1). Properties represented by relations are inherited via the is-a hierarchy. Therefore, also the subconcepts of particles are related to qualities.
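The transitivity of is-a and the inheritance of relations along the hierarchy can be illustrated with a minimal Python sketch. It encodes only the is-a chain and the has quality axiom named in the text, and it assumes a single parent per concept for brevity. The names `IS_A`, `ASSERTED`, `ancestors` and `has_relation` are hypothetical and do not belong to any ontology tool.

```python
# Direct is-a edges taken from the example in the text (child -> parent).
# Assumption: one parent per concept, which is enough for this chain.
IS_A = {
    "gelatin nanoparticle": "polymeric nanoparticle",
    "polymeric nanoparticle": "nanoparticle",
    "nanoparticle": "primary particle",
    "primary particle": "particle",
    "particle": "chemical entity",
}

# Relations asserted directly on a concept: (concept, relation) -> target.
ASSERTED = {("particle", "has quality"): "quality"}


def ancestors(concept):
    """Return all super-concepts of `concept`, following is-a transitively."""
    result = []
    while concept in IS_A:
        concept = IS_A[concept]
        result.append(concept)
    return result


def has_relation(concept, relation):
    """True if the relation is asserted on the concept or on one of its ancestors."""
    return any((c, relation) in ASSERTED for c in [concept] + ancestors(concept))
```

With these definitions, `ancestors("gelatin nanoparticle")` contains particle, and `has_relation("gelatin nanoparticle", "has quality")` holds because the relation is inherited from particle.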

Figure 1

Example from NanoParticle Ontology.

In Figure 2 we show the part of NanoParticle Ontology that represents particles using the ontology development system Protégé (https://protege.stanford.edu/). On the left-hand side the concepts and the is-a hierarchy are shown. The is-a relations are represented by indentation. For instance, gelatin nanoparticle (highlighted in Figure 2) is a sub-concept of polymeric nanoparticle, which in turn is a sub-concept of nanoparticle. On the right-hand side of Figure 2 information related to the axioms is shown using a special notation reflecting constructs in the representation language OWL (http://www.w3.org/TR/owl-features/, http://www.w3.org/TR/owl2-overview/), a knowledge representation language that is often used for representing ontologies and that is based on description logics (). For instance, we note that the concept gelatin nanoparticle was defined to be equivalent to nanoparticle and (has_component_part some gelatin). This means that every gelatin nanoparticle is a nanoparticle that has a component that is gelatin, and vice versa, whenever a nanoparticle has a component that is gelatin, then it is a gelatin nanoparticle. Further, there is information about the types of qualities that gelatin nanoparticles have (inherited from the particle concept). An advantage of using a description logics-based representation is that it allows for reasoning. In the ontology it was defined that gelatin nanoparticle is equivalent to nanoparticle and (has_component_part some gelatin) (as we just noted), that polymeric nanoparticle is equivalent to nanoparticle and (has_component_part some polymer), and that gelatin is a sub-concept of protein, which is a sub-concept of biopolymer, which in turn is a sub-concept of polymer. Based on these axioms the system can derive the additional information that a gelatin nanoparticle is a polymeric nanoparticle, which is also shown on the right-hand side of Figure 2 (under ‘SubClass Of’).
Figure 3 shows the actual OWL representation for the concepts gelatin nanoparticle, polymeric nanoparticle and nanoparticle.
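The derivation that a gelatin nanoparticle is a polymeric nanoparticle can be mimicked with a small Python sketch. This is not a description logic reasoner: it only handles definitions of the form "base and (relation some filler)" and the three is-a axioms mentioned in the text. The names `SUBCLASS`, `DEFINITIONS`, `is_subconcept` and `entails_subsumption` are hypothetical.

```python
# is-a axioms from the text (child -> parent).
SUBCLASS = {"gelatin": "protein", "protein": "biopolymer", "biopolymer": "polymer"}

# Equivalence definitions of the form: base and (relation some filler).
DEFINITIONS = {
    "gelatin nanoparticle": ("nanoparticle", "has_component_part", "gelatin"),
    "polymeric nanoparticle": ("nanoparticle", "has_component_part", "polymer"),
}


def is_subconcept(a, b):
    """True if a equals b or a reaches b by following is-a axioms."""
    while a != b and a in SUBCLASS:
        a = SUBCLASS[a]
    return a == b


def entails_subsumption(sub, sup):
    """Check whether the definition of `sub` implies the definition of `sup`.

    Holds when the base of `sub` is subsumed by the base of `sup`, both use the
    same relation, and the filler of `sub` is subsumed by the filler of `sup`.
    """
    base_sub, rel_sub, fill_sub = DEFINITIONS[sub]
    base_sup, rel_sup, fill_sup = DEFINITIONS[sup]
    return (
        is_subconcept(base_sub, base_sup)
        and rel_sub == rel_sup
        and is_subconcept(fill_sub, fill_sup)
    )
```

Here `entails_subsumption("gelatin nanoparticle", "polymeric nanoparticle")` holds because gelatin is, via protein and biopolymer, a sub-concept of polymer; the reverse direction does not hold.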

Figure 2

Example from NanoParticle Ontology opened in Protégé.

Figure 3

Example from NanoParticle Ontology – OWL/XML Syntax Format.

2.2 Ontologies in materials domain

Within the materials domain the use of semantic technologies is in its infancy, with ontologies and standards still under development. According to () domain ontologies have been used to organize materials knowledge in a formal language, as a global conceptualization for materials information integration (e.g. ()), for linked materials data publishing, for inference support for discovering new materials and for semantic query support (e.g., (, )). Most ontologies focus on specific sub-domains of the materials field (e.g., metals, ceramics, thermal properties, nanotechnology) and have been developed with a specific use in mind (e.g., search, data integration, discovery). Some examples of ontologies are the Materials Ontology () for data exchange among thermal property databases, PREMΛP ontology () for steel mill products, MatOnto ontology () for oxygen ion conducting materials in the fuel cell domain, and the FreeClassOWL ontology () for the construction and building materials domain. An ontology design pattern regarding material transformations was proposed in (). More recently, the European Materials Modelling Council has started developing the European Materials Modelling Ontology ().

In the sub-field of nanotechnology, the NanoParticle Ontology () was created for understanding biological properties of nanomaterials, searching for nanoparticle relevant data and designing nanoparticles. It builds on the Basic Formal Ontology (BFO, http://basic-formal-ontology.org/) () and Chemical Entities of Biological Interest Ontology (ChEBI) () to represent basic knowledge regarding physical, chemical and functional features of nanotechnology used in cancer diagnosis and therapy. The eNanoMapper ontology () aims to integrate a number of ontologies such as the NanoParticle Ontology for assessing risks related to the use of nanomaterials.

Furthermore, standards for exporting data from databases and between tools are being developed. These standards provide a way to exchange data between databases and tools, even if the internal representations of the data in the databases and tools are different. They are a prerequisite for efficient materials data infrastructures that allow for the discovery of new materials ().

In several cases the standards formalize the description of materials knowledge and thereby create ontological knowledge. For instance, one effort is by the European Committee for Standardization which organized workshops on standards for materials engineering data of which the results are documented in (). Another recent effort is connected to the European Centre of Excellence NOMAD ().

2.3 Extending ontologies from unstructured text

The ontology extension problem that we tackle deals mainly with concept discovery and concept hierarchy derivation. These are also two of the tasks in the problem of ontology learning (). Therefore, most of the related work comes from that area. For instance, a recent survey () discusses 140 research papers. Different techniques can be used for concept and relationship extraction. In this setting, new ontology elements are derived from text using knowledge acquisition techniques.

Linguistic techniques use part-of-speech tagged corpora for extracting syntactic structures that are analyzed regarding the words and the modifiers contained in the structure. One kind of linguistic approach uses lexico-syntactic patterns. The pioneering research in this line is (), which defines a set of patterns indicating is-a relationships between words in the text. Other linguistic approaches may make use of, for instance, compounding, the use of background and itemization, term co-occurrence analysis or superstring prediction (e.g. (, )).

Another paradigm is based on machine learning and statistical methods which use the statistics of the underlying corpora, such as k-nearest neighbors approach (), association rules (), bottom-up hierarchical clustering techniques (), supervised classification () and formal concept analysis (). There are also some approaches that use topic models (, , ) but they focus on concept names that are words, rather than phrases as in our approach.

Ontology evolution approaches (, ) allow for the study of changes in ontologies and using the change management mechanisms to detect candidate missing relations. An approach that allows for detection and user-guided completion of the is-a structure is given in (, ) where completion is formalized as an abduction problem and the RepOSE tool is presented.

3 Approach

Our approach for extending ontologies, shown in Figure 4, contains the following steps. In the first step, creation of a phrase-based topic model, documents related to the domain of interest are used to create topics. The phrases as well as the topics are suggestions that a domain expert should validate or interpret and relate to concepts in the ontology. In the second step the (possibly validated and updated) topics are used in a formal topical concept analysis which returns suggestions to the domain expert regarding relations between topics and thus concepts in the ontology. Both steps lead to the addition of new concepts and (subsumption) axioms to the ontology. In the following subsections we describe these steps.

Figure 4

Approach: The upper part of the Figure shows the creation of a phrase-based topic model with as input unstructured text and as output phrases and topics. The lower part shows the formal topical concept analysis with as input topics and as output a topical concept lattice. In both parts a domain expert validates and interprets the results.

3.1 Phrase-based Topic Model

In our first step we use the phrase-based topic model of the ToPMine system (). Given a corpus of documents and the number of requested topics, representations of latent topics in the documents are computed. Essentially, each topic can be seen as a probability distribution over words or phrases. The ToPMine approach is purely data-driven, i.e., it does not require domain knowledge or specific linguistic rule sets. This is important for our application domain as there is a lack of annotated background knowledge. An important property of the system is that it works on bags-of-phrases, rather than the traditional bag-of-words. This means that words occurring closer together have more weight than words far away. Further, as we assume existing ontologies, it is very likely that concepts with one-word names are already in the ontology and we therefore focus on phrases.

The approach consists of two parts: phrase mining and topic modelling. In the first part frequent contiguous phrases are mined, which consists of collecting aggregate counts for all contiguous words satisfying a minimum support threshold. Then the documents are segmented based on the frequent phrases. Further, an agglomerative phrase construction algorithm merges the frequent phrases guided by a significance score. In the second part topics are generated using a variant of Latent Dirichlet Allocation, called PhraseLDA, that deals with phrases, rather than words.
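The first part, collecting counts for contiguous word sequences that satisfy a minimum support threshold, can be sketched as follows. This is a deliberately naive illustration and not the ToPMine implementation: it omits the pruning, the significance-guided agglomerative merging, and PhraseLDA. The function name `frequent_phrases`, the `max_len` parameter and the toy documents are assumptions made for the example.

```python
from collections import Counter


def frequent_phrases(documents, min_support, max_len=4):
    """Count contiguous word sequences and keep those meeting min_support.

    Assumption: whitespace tokenization and a hypothetical cap `max_len`
    on the phrase length, both chosen only for this illustration.
    """
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_support}


# Toy corpus (hypothetical, not taken from the corpus used in the paper).
docs = [
    "gold nanoparticle synthesis",
    "gold nanoparticle size",
    "silver nanoparticle size",
]
phrases = frequent_phrases(docs, min_support=2)
```

In this toy corpus the two-word sequence gold nanoparticle reaches the support threshold of 2, while silver nanoparticle occurs only once and is discarded.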

3.2 Formal Topical Concept Analysis

In the second step we define a new variant of Formal Concept Analysis (e.g., ()) and use this new variant on topics. These topics can come directly from the previous step or can be a modified version of the topics of the previous step, where non-relevant topics or phrases are removed.

We first define the notions of formal topical context, formal topical concept and topical concept lattice. (Note that formal topical concepts should not be confused with concepts in the ontologies.)

Definition 1. (Formal Topical Context) A formal topical context is a triple (P, T, I) where P is a set of phrases, T is a set of topics, and I is a binary relation between P and T (I ⊆ P × T).

Definition 2. (Formal Topical Concept) (A, B) is a formal topical concept of (P, T, I) iff A ⊆ P, B ⊆ T, A′ = B, and B′ = A, where A′ := {t ∈ T | ∀p ∈ A: (p, t) ∈ I} and B′ := {p ∈ P | ∀t ∈ B: (p, t) ∈ I}. A is the extent and B is the intent of (A, B).

Definition 3. (Topical Concept Lattice) Formal topical concepts can be ordered. We say that (A₁, B₁) ≤ (A₂, B₂) iff A₁ ⊆ A₂. The set Φ(P, T, I) of all formal topical concepts of (P, T, I), with this order, is called the topical concept lattice of (P, T, I).

As an example, in Figure 5(a) we show a matrix representing the occurrence of phrases in topics in a topic model, the resulting formal topical concepts in Figure 5(c) and the topical concept lattice in Figure 5(b). In the lattice a node represents a formal topical concept (same numbering as in Figure 5(a)). For a formal topical concept (A, B), its extent (phrases) is found by collecting all phrases in its node as well as its descendants. The intent (topics) is found by collecting all topics in its node as well as its ancestors.
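Definitions 1 to 3 can be made concrete with a short Python sketch that computes the two derivation operators and enumerates all formal topical concepts by closing every subset of topics. The brute-force enumeration is exponential in the number of topics and is meant only as an illustration, not as the algorithm used in our experiments. The function names and the toy context are hypothetical and do not reproduce Figure 5.

```python
from itertools import combinations


def common_topics(phrases, incidence, all_topics):
    """A' : the topics that contain every phrase in `phrases`."""
    return frozenset(t for t in all_topics if all((p, t) in incidence for p in phrases))


def common_phrases(topics, incidence, all_phrases):
    """B' : the phrases that occur in every topic in `topics`."""
    return frozenset(p for p in all_phrases if all((p, t) in incidence for t in topics))


def formal_topical_concepts(all_phrases, all_topics, incidence):
    """Enumerate all (extent, intent) pairs by closing every subset of topics."""
    concepts = set()
    topics = sorted(all_topics)
    for k in range(len(topics) + 1):
        for subset in combinations(topics, k):
            extent = common_phrases(subset, incidence, all_phrases)
            intent = common_topics(extent, incidence, all_topics)
            concepts.add((extent, intent))
    return concepts


def leq(c1, c2):
    """Order of Definition 3: (A1, B1) <= (A2, B2) iff A1 is a subset of A2."""
    return c1[0] <= c2[0]


# Toy formal topical context (hypothetical phrase and topic names).
P = {"p1", "p2", "p3"}
T = {"t1", "t2"}
I = {("p1", "t1"), ("p2", "t1"), ("p2", "t2"), ("p3", "t2")}
concepts = formal_topical_concepts(P, T, I)
```

For this toy context the phrase p2 occurs in both topics, so ({p2}, {t1, t2}) is a formal topical concept that lies below both ({p1, p2}, {t1}) and ({p2, p3}, {t2}) in the lattice.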

Figure 5

Examples of (a) phrase occurrences in topics, (b) Formal Topical Concept Lattice and (c) Formal Topical Concepts.

3.3 Domain Expert Validation

As shown in Figure 4, a domain expert is involved in the different steps in our approach to validate and interpret the results of the phrase-based topic model and the formal topical concept analysis.

The domain expert validates or interprets all phrases that appear in all topics. The outcome can be one of the following:

The phrase is a meaningful representation of a concept in the specific domain and it is already in the ontology. For example, gold nanoparticle is a specific concept within the nanotechnology domain and it is already in the NanoParticle Ontology. We distinguish two cases: (1) a concept with the same name or a name that is a synonym of the original form of the phrase already exists in the ontology (EXIST) or (2) a concept with a name that is a modified form of the phrase already exists in the ontology (EXIST-m).
The phrase is a meaningful representation of a concept in the specific domain but it is not in the ontology. For example, microcrystalline silicon is a meaningful representation of a concept but such concept does not exist in the ontology. We distinguish two cases: (1) a concept with the same name as the original form of the phrase should be added into the ontology (ADD) or (2) a concept with as name a modified form of the phrase should be added into the ontology (ADD-m).
No concept related to the phrase should be added to the ontology. This can happen because the phrase does not make sense in the domain (No), but also because it is a meaningful representation of a concept in a more general domain (No-g). For example, electron transfer is a general concept from the perspective of materials science, but it need not be in a nanotechnology ontology.

A second interaction with the domain expert occurs in the interpretation of topics. The outcome can be one of the following:

Using the representative phrases in a topic, the domain expert labels the topic. Using this label as a phrase, we have the outcomes EXIST, EXIST-m, ADD, ADD-m, No-g and No, as above. Furthermore, we add an outcome Q (for query) when the label for the topic is too specific for adding to the ontology, but could be defined using concepts in the ontologies and OWL constructs.
Using a subset of representative phrases in a topic, the domain expert labels the subset. Using this label as a phrase, we have the outcomes EXIST, EXIST-m, ADD, ADD-m, No-g, No, and Q as above. This can be done for different subsets.

Finally, the domain expert interprets the lattice.

Given the relationships in the lattice, as well as the connections of the topics and phrases to concepts in the ontology, new relationships between ontology concepts can be identified.

4 Extending NanoParticle and eNanoMapper Ontologies

In the following subsections, we show the usefulness of our approach by extending two ontologies in the nanotechnology domain.

4.1 Corpus and ontologies

The corpus that we use is based on reports on nanoparticles from the Nanoparticle Information Library (http://nanoparticlelibrary.net). For each nanoparticle report, we take the text in ‘Research Abstract’ as well as the abstracts (or only the titles if there is no abstract) from the publications in ‘Related Publications’. The final corpus contains 117 abstracts from the ‘Research Abstract’ field in the reports and 510 abstracts (or titles) from publications. We have chosen to only retrieve titles and abstracts rather than full texts. The title and abstract cover the basic content of an article. For a research article in the materials science domain they will generally contain a summary of the problem, experiments, simulations and computations. As the ontologies aim to represent basic knowledge in the domain, these parts of a research article often contain enough information for extraction of concepts. When using the full text, more proposals for concepts may be generated, but many of those will not be relevant. In related fields, it has been shown that the use of titles (and abstracts) may be a reasonable approach (e.g., ()).

The ontologies that we extend are the NanoParticle Ontology () (1904 concepts and 81 relations) and the eNanoMapper ontology () (12,531 concepts and 4 relations). Both ontologies are available via BioPortal (https://bioportal.bioontology.org/).

4.2 Experiments Setup

In our experiments, we configure the phrase mining threshold with two values (high and low), and PhraseLDA with different numbers of requested topics (20, 30 and 40). The other parameters of PhraseLDA are set as follows: the total number of Gibbs sampling iterations over the entire data is 1000, and the hyper-parameters are α = 50/T and β = 0.01, where T is the number of topics. These initial values for the hyper-parameters are justified in (). Thus we have six experiments over the data.
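The six settings and the resulting values of α can be enumerated in a few lines of Python. This is only a bookkeeping sketch: the dictionary keys are hypothetical names and do not correspond to actual ToPMine or PhraseLDA parameter names.

```python
from itertools import product

thresholds = ["low", "high"]   # phrase mining threshold
topic_counts = [20, 30, 40]    # number of requested topics T

# One entry per experiment; the key names are assumptions for illustration.
settings = [
    {"threshold": th, "topics": T, "alpha": 50 / T, "beta": 0.01, "iterations": 1000}
    for th, T in product(thresholds, topic_counts)
]

for s in settings:
    print(s["threshold"], s["topics"], round(s["alpha"], 3))
```

For T = 20, 30 and 40 this gives α = 2.5, approximately 1.667, and 1.25, respectively.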

After the interpretation of the phrases by the domain expert, for each setting, all (rows regarding) phrases interpreted with No are removed from the phrase occurrence matrix. The updated matrix (with all EXIST(-m), ADD(-m) and No-g phrases) is used as input for the formal topical concept analysis and a formal topical concept lattice is generated.
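The filtering step can be sketched as follows, with the phrase occurrence matrix stored as a mapping from phrases to the topics in which they occur. The topic identifiers, the phrase "results show" and its label are hypothetical and serve only to illustrate a rejected phrase; the function name `drop_rejected` is likewise an assumption.

```python
def drop_rejected(matrix, labels):
    """Keep only the rows of phrases whose expert label is not 'No'."""
    return {phrase: topics for phrase, topics in matrix.items() if labels[phrase] != "No"}


# Phrase -> set of topics in which the phrase occurs (hypothetical topic ids).
matrix = {
    "gold nanoparticle": {"t1", "t3"},
    "microcrystalline silicon": {"t2"},
    "electron transfer": {"t3"},
    "results show": {"t1"},          # hypothetical non-meaningful phrase
}

# Expert labels as defined in Section 3.3.
labels = {
    "gold nanoparticle": "EXIST",
    "microcrystalline silicon": "ADD",
    "electron transfer": "No-g",
    "results show": "No",
}

filtered = drop_rejected(matrix, labels)
```

Note that No-g phrases such as electron transfer remain in the matrix; only phrases labelled No are removed.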

For the interpretation of the phrases, topics and lattice results a domain expert (second author) worked together with two ontology engineering experts (first and third author). In a first two-hour session the three experts went through the phrases of all topics for one of the settings (low mining threshold, 40 topics) of the topic model approach. For each phrase, the experts discussed whether it was relevant for a nanotechnology ontology, checked whether concepts with the same or similar names existed in the NanoParticle Ontology, and decided on EXIST(-m)/ADD(-m)/No(-g) as well as on the axioms that may need to be added to the ontology. In addition to investigating the ontologies, in some cases terms were checked via Wikipedia or research articles. As a preparation for the second session, the ontology engineering experts prepared suggestions for the phrases for the other settings, based on the interpretation results of the first session and a search in the two ontologies. During the second session (four hours) the phrases for all settings were interpreted and related to both ontologies. Further, the topics for one setting were interpreted. In the third (two-hour) session the remaining topics as well as the lattice results were interpreted.

4.3 Results and discussion of results

In Table 1 we show the results regarding the interpretation of the phrases. In addition to the number of concepts in the EXIST(-m), ADD(-m), and No(-g) categories, we also show the precision. The precision of the system is the ratio of the number of relevant proposed concepts to the number of proposed concepts. We define a relevant proposed concept as a proposed concept that the domain expert recognizes as a relevant concept, whether it is in the ontology, is more specific than concepts in the ontology, or could belong to a more general ontology. Therefore, the relevant proposed concepts are the ones that do not belong to the ‘No’ category. This conforms to what is relevant in the ontology learning setting.
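As a worked example of this definition, the precision for the setting with 20 topics, low mining threshold and NanoParticle Ontology can be recomputed from the category counts reported in Table 1. The function name `precision` is hypothetical; the truncation to two decimals follows the table caption.

```python
def precision(counts):
    """Share of proposed concepts not labelled 'No', truncated to two decimals."""
    total = sum(counts.values())
    relevant = total - counts["No"]
    # Integer arithmetic avoids rounding up: truncate to hundredths.
    return (100 * relevant) // total / 100


# Counts for the setting "20, low, NanoParticle" in Table 1.
row = {"ADD": 32, "ADD-m": 4, "EXIST": 26, "EXIST-m": 19, "No-g": 16, "No": 9}
print(precision(row))
```

Of the 106 proposed concepts, 97 are not in the ‘No’ category, which gives 97/106 ≈ 0.915 and a truncated precision of 0.91.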

Table 1

Result of interpreting phrases. The first column defines the case using the number of topics, low or high mining threshold, and ontology. The precision is truncated.

	ADD	ADD-m	EXIST	EXIST-m	No-g	No	precision

20, low, NanoParticle	32	4	26	19	16	9	0.91
20, low, eNanoMapper	29	3	24	25	14	12	0.88
30, low, NanoParticle	30	4	26	18	16	9	0.91
30, low, eNanoMapper	28	3	24	26	12	11	0.89
40, low, NanoParticle	32	4	26	15	16	10	0.90
40, low, eNanoMapper	29	3	24	22	14	12	0.88
20, high, NanoParticle	9	1	14	7	4	0	1.00
20, high, eNanoMapper	8	2	12	10	3	0	1.00
30, high, NanoParticle	8	2	14	8	0	1	0.96
30, high, eNanoMapper	7	1	12	10	0	1	0.96
40, high, NanoParticle	9	2	14	12	4	4	0.91
40, high, eNanoMapper	9	2	12	14	2	4	0.90

For the meanings of ADD(-m), EXIST(-m) and No(-g), see Section 3.3.

For ADD and ADD-m, a new concept is defined in the ontology and one or more subsumption axioms are added.

We note that some phrases may contribute to the addition of multiple concepts and axioms. Furthermore, the low mining threshold settings generate the largest number of phrases (in total and per topic). Except for one ‘No’ phrase, all phrases generated by any of the high mining threshold settings are also generated by at least one (and usually all) of the low mining threshold settings. For the low mining threshold settings there are only small differences regarding the phrases that occur in topics. There are 29 phrases that are generated by all settings. Of these, 13 exist in the ontologies and relate, among others, to kinds of nanotubes, microscopy, spectroscopy, and various properties of nanoparticles. Furthermore, 7 exist in a modified form, e.g., temperature for low/high/room temperature and core-shell nanoparticle for the phrase core shell. The remaining 9 should be added to the ontologies in the same or modified form. These relate to properties (resolution, pore size, band gap, electrical conductivity, crystallinity), a technique (vapor deposition) and nano-objects (mesoporous silica nanoparticle, thin film). Reverse micelle-synthesized quantum dot leads to the creation of a specific kind of quantum dot as well as a specific synthesis technique. The phrases that are only found by low mining threshold settings relate to different kinds of silicon, nanoparticles, properties and techniques, of which many should be added to the ontologies. There are, however, also several phrases that relate to more general concepts in the materials domain that should not necessarily be added to an ontology in the nanotechnology domain. In all settings, the EXIST(-m) cases form the largest group, which shows that the phrases are relevant with respect to the existing ontologies. Furthermore, we found many ADD(-m) cases which lead to new concepts and axioms.
There are also some phrases that relate to more general concepts and some phrases that do not lead to anything meaningful in the context of extending the ontology. From Table 2 we note that the more topics the system generates, the lower the percentage of topics that contribute to EXIST(-m) and ADD(-m) categories.

Table 2

The number (and truncated percentage in parentheses) of topics that contribute to extending the ontologies. The first column defines the case using the number of topics, low or high mining threshold, and ontology.

	Contribute to ADD and ADD-m	Contribute to EXIST and EXIST-m	Contribute to No-g

20, low, NanoParticle	18 (90.0%)	16 (80.0%)	6 (30.0%)
20, low, eNanoMapper	18 (90.0%)	16 (80.0%)	5 (40.0%)
20, high, NanoParticle	11 (55.0%)	13 (65.0%)	3 (15.0%)
20, high, eNanoMapper	11 (55.0%)	13 (65.0%)	2 (10.0%)
30, low, NanoParticle	19 (63.3%)	19 (63.3%)	11 (36.6%)
30, low, eNanoMapper	18 (60.0%)	20 (66.6%)	11 (36.6%)
30, high, NanoParticle	10 (33.3%)	19 (63.3%)	3 (10.0%)
30, high, eNanoMapper	9 (30.0%)	20 (66.6%)	2 (6.6%)
40, low, NanoParticle	22 (55.0%)	21 (52.5%)	12 (30.0%)
40, low, eNanoMapper	21 (52.5%)	23 (57.5%)	9 (22.5%)
40, high, NanoParticle	13 (32.5%)	16 (40.0%)	4 (10.0%)
40, high, eNanoMapper	12 (30.0%)	18 (45.0%)	3 (7.5%)

In Table 3 we show the results regarding the interpretation of the topics. We note that the high mining threshold settings generate the most concepts to add to the ontologies. In each setting there are one or two concepts that were not found during the interpretation of the phrases (e.g., high resolution experiment, water soluble reverse micelle systems, core-shell semiconductors). All EXIST(-m) concepts were also found during the interpretation of the phrases. The No-g category consists of earlier found phrases or specializations of those. Furthermore, many of the topics are very specific and it was decided they should not be added to the ontology, but queries (or complex concepts) using concepts in the ontologies and OWL constructs can be constructed. We also observe that the results for the two ontologies are almost the same, which may be because the topic labels are (much) more specific than the phrase labels and the ontologies do not model concepts at the lowest levels of specificity.

Table 3

Result of interpreting topics. The first column defines the case using the number of topics, low or high mining threshold, and ontology. Note that some topics may be empty and some topics may require several concepts. The values in parentheses show the number of added concepts that were not found in the phrase interpretation phase.

	ADD	ADD-m	EXIST	EXIST-m	No-g	Q	No	precision

20, low, both	3(1)	0	2	0	1	13	0	1.00
30, low, both	8(2)	0	4	0	1	13	0	1.00
40, low, both	16(1)	0	11	1	2	10	5	0.88
20, high, both	8(1)	0	3	2	0	7	0	1.00
30, high, both	3(2)	0	10	2	0	7	0	1.00
40, high, NanoParticle	10(2)	0	10	3	2	3	2	0.93
40, high, eNanoMapper	10(2)	0	9	4	2	3	2	0.93

For the meanings of ADD(-m), EXIST(-m), No(-g) and Q, see Section 3.3.

For ADD and ADD-m, a new concept is defined in the ontology and one or more subsumption axioms are added.

In the final step we generated lattices for all settings. As an example, a part of the lattice for the case of 40 requested topics with a low mining threshold is shown in Figure 6. Nodes that contain one topic/one phrase and have as child the bottom node and as parent the top node are not shown. These have been dealt with in the phrase interpretation step and as there are no connections to other nodes (except top and bottom), no additional information can be gained for those nodes.

Figure 6

Part of the lattice for the 40 topics and low mining threshold setting. Nodes that contain one topic/one phrase and have as child the bottom node and as parent the top node are not shown.

The lattices were used in the following ways. First, the domain expert labeled the nodes based on the phrases connected to the nodes. These may be the extents or subsets of the extents of topics. The results are given in Table 4. Some new concepts were found that are more general than concepts related to topics (e.g., core-shell cdse nanoparticles), but in general, little additional information was found.

Table 4

Result of interpreting lattice nodes. The first column defines the case using the number of topics, low or high mining threshold, and ontology. The values in parentheses show the number of added concepts that were not found in the phrase or topic interpretation phases.

	ADD	ADD-m	EXIST	EXIST-m	No-g	Q	No	precision

20, low, both	1(0)	0	1	0	2	0	0	1.00
30, low, NanoParticle	4(2)	0	3	0	1	0	0	1.00
30, low, eNanoMapper	3(2)	0	4	0	1	0	0	1.00
40, low, both	3(0)	0	1	0	0	0	0	1.00
20, high, both	0(0)	0	1	0	1	1	0	1.00
30, high, both	1(1)	0	1	0	0	0	0	1.00
40, high, both	0(0)	0	0	0	0	0	0	1.00

For the meanings of ADD(-m), EXIST(-m), No(-g) and Q, see Section 3.3.

For ADD a new concept is defined in the ontology and one or more subsumption axioms are added.

Secondly, the domain expert labeled the nodes based on the phrases connected to the nodes and their descendants. As a node contains fewer phrases than all its ancestors, a labeling may lead to the definition of a new concept that is a super-concept of the concepts related to the ancestor topics (and relevant axioms). As, according to the topic interpretation step, many topics are very specific, this approach may give a way to decide on the appropriate level of specificity for concepts to add to the ontology. In our experiments, however, the lattices were very flat and the nodes with empty intent contained only one phrase and thus did not lead to additional concepts.

Thirdly, the domain expert used the lattice as a visualization tool to check the original topic interpretation. According to the domain expert, the use of the lattice provides significant help in interpreting the topics. As it groups phrases that are in common between different topics and distinguishes phrases that are specific for certain topics, the structure of complex concepts (based on other concepts) is clarified. It results in a better organization and visualization of the topics and their underlying notions. For instance, for a topic with phrases ‘particle size’, ‘quantum dot’, and ‘gold nanoparticle’, the phrase ‘particle size’ was in common with another topic. By removing ‘particle size’ from the phrase list of the topic, it was easier to see that the topic was a combination of ‘particle size’ and a notion of ‘quantum dots of gold nanoparticles’.
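The grouping effect described above can be illustrated with a minimal formal concept analysis sketch. It is not the implementation used in the experiments. It assumes that topics act as objects and phrases as attributes, and the topic names as well as the second topic's phrase 'thin film' are hypothetical; only the first topic's phrases and the shared phrase 'particle size' come from the example in the text.

```python
from itertools import combinations


def formal_concepts(context):
    """Return all (extent, intent) pairs of a small topic-to-phrases context."""
    phrase_sets = [frozenset(phrases) for phrases in context.values()]
    # Closed intents are the full phrase set plus every intersection of topic phrase sets.
    intents = {frozenset().union(*phrase_sets)}
    for size in range(1, len(phrase_sets) + 1):
        for group in combinations(phrase_sets, size):
            intents.add(frozenset.intersection(*group))
    concepts = []
    for intent in intents:
        # The extent holds every topic that contains all phrases of the intent.
        extent = frozenset(t for t, phrases in context.items() if intent <= set(phrases))
        concepts.append((extent, intent))
    return concepts


# Hypothetical context: topic names and 'thin film' are assumptions for illustration.
context = {
    "topic_a": {"particle size", "quantum dot", "gold nanoparticle"},
    "topic_b": {"particle size", "thin film"},
}
concepts = formal_concepts(context)

# Phrases shared by both topics, and the phrases that remain specific to topic_a.
shared = context["topic_a"] & context["topic_b"]
specific_to_a = context["topic_a"] - shared
print(sorted(shared), sorted(specific_to_a))
```

In this toy context the node whose extent contains both topics carries only 'particle size', which mirrors how removing the shared phrase made the remaining notion of the topic easier to see. The enumeration over all topic combinations is exponential and is only meant for very small contexts.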

4.4 General discussion

For the experiments we have so far used few resources, i.e., circa 600 abstracts and less than 10 hours for each of the three experts. Even with these limited resources our approach finds 35 and 32 new concepts for the NanoParticle Ontology and the eNanoMapper ontology, respectively, as shown in Table 5, as well as 42 and 37 new axioms, respectively, as shown in Table 6. In addition to the new concepts and new axioms, other concepts are also influenced. Indeed, for a new axiom A is-a B, the sub-concepts of A receive B and all its super-concepts as their super-concepts (and thus inherit their properties), and all super-concepts of B receive A and its sub-concepts as sub-concepts (and thus all instances of these concepts are also instances of B and its super-concepts). In this experiment, 72 concepts from the NanoParticle Ontology are influenced by the new axioms. Therefore, the quality of semantically-enabled applications is improved whenever one of the 35 new or 72 influenced concepts is used. For the eNanoMapper ontology, the number of existing concepts influenced by the new axioms is 37. In general, if domain and range are used for the definition of relations in the ontologies, even more concepts would be influenced. Thus, adding these axioms improves the quality of the ontologies and the semantically-enabled applications that use these ontologies. It is clear that the effort for extending the ontologies is worthwhile.
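The way a single new is-a axiom influences other concepts can be sketched as a transitive traversal of the concept hierarchy. The following is a minimal illustration, not the procedure used to obtain the counts above. The function names are hypothetical, and the toy hierarchy is an assumed state before the extension, built from three axioms listed in Table 6.

```python
def reachable(start, edges):
    """Return every concept reachable from `start` by following `edges` transitively."""
    seen, stack = set(), [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def influenced_concepts(sub, sup, parents):
    """Concepts affected by adding the axiom `sub` is-a `sup`.

    `parents` maps a concept to its direct super-concepts before the axiom is added.
    """
    # Invert the is-a edges so that sub-concepts can be found with the same traversal.
    children = {}
    for child, supers in parents.items():
        for parent in supers:
            children.setdefault(parent, set()).add(child)
    gain_super_concepts = reachable(sub, children)  # sub-concepts of A
    gain_sub_concepts = reachable(sup, parents)     # super-concepts of B
    return gain_super_concepts, gain_sub_concepts


# Toy hierarchy (assumed state before the new axiom is added).
parents = {
    "block copolymer": {"copolymer"},
    "polymer": {"organic material"},
}
# New axiom: copolymer is-a polymer.
below, above = influenced_concepts("copolymer", "polymer", parents)
print(sorted(below), sorted(above))
```

Here 'block copolymer' gains 'polymer' and 'organic material' as super-concepts, and 'organic material' gains 'copolymer' and 'block copolymer' as sub-concepts, in addition to the direct relation between 'copolymer' and 'polymer'.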

Table 5

New concepts for the NanoParticle and eNanoMapper ontologies.

Concepts	NanoParticle	eNanoMapper

amorphous silicon	✓
band gap	✓
Barium Titanate	✓	✓
block copolymer	✓	✓
copolymer	✓	✓
polymer		✓
CdSe nanocrystal	✓	✓
CdTe nanoparticle	✓	✓
copper nanoparticle	✓
conductivity	✓	✓
electrical conductivity	✓	✓
gold nanorod	✓	✓
growth mechanism	✓	✓
resolution	✓	✓
layer by layer growth	✓	✓
liquid solid	✓
pressure	✓
MCM 41	✓	✓
mechanical property	✓	✓
viscosity		✓
melt spin	✓	✓
mesoporous silica nanoparticle	✓	✓
mesoporous silica nanosphere	✓	✓
microcrystalline silicon	✓	✓
optical property		✓
polymorphous silicon	✓	✓
pore size	✓
porous silicon	✓	✓
quantum confinement	✓	✓
reverse micelle-type quantum dot	✓	✓
semiconductor nanocrystal	✓	✓
nanocrystal	✓	✓
silicon thin film	✓	✓
thin film	✓	✓
crystallinity	✓	✓
thermal conductivity	✓	✓
tunnel spectroscopy	✓	✓
ZnO nanowire	✓	✓
	35	32

Table 6

New axioms for the NanoParticle and eNanoMapper ontologies.

Axioms	NanoParticle	eNanoMapper

amorphous silicon is a silicon	✓
band gap is a quality	✓
Barium Titanate is an inorganic compound or molecule	✓
Barium Titanate is a chemical substance		✓
block copolymer is a copolymer	✓	✓
copolymer is a polymer	✓	✓
polymer is an organic material		✓
CdSe nanocrystal is a nanocrystal	✓	✓
CdTe nanoparticle is a nanoparticle	✓	✓
copper nanoparticle is a metal nanoparticle	✓
conductivity is an independent general individual quality	✓
conductivity is a quality		✓
electrical conductivity is a conductivity	✓	✓
gold nanorod is a nanorod	✓	✓
growth mechanism is a process	✓	✓
resolution is an independent general individual quality	✓
resolution is a quality		✓
layer by layer growth is a mechanism process	✓	✓
liquid solid is a liquid solid interface	✓
pressure is an independent general individual quality	✓
MCM 41 is a mesoporous silica nanoparticle	✓	✓
mechanical property is a realizable entity	✓
mechanical property is a quality		✓
viscosity is a mechanical property	✓	✓
melt spin is a technique	✓	✓
mesoporous silica nanoparticle is a nanoparticle	✓	✓
mesoporous silica nanosphere is a nanosphere	✓	✓
microcrystalline silicon is a silicon	✓
microcrystalline silicon is a chemical substance		✓
nanotube array has part nanotube	✓	✓
optical property is a property		✓
polymorphous silicon is a silicon	✓
polymorphous silicon is a chemical substance		✓
pore size is a nanoparticle property	✓
porous silicon is a silicon	✓
porous silicon is a chemical substance		✓
raman scatter is a synonym of raman spectroscopy	✓	✓
quantum confinement	✓	✓
reverse micelle-type quantum dot is a quantum dot	✓	✓
semiconductor nanocrystal is a semiconductor and is a nanocrystal	✓	✓
nanocrystal is a nano-object and is a crystal	✓	✓
silicon thin film is a thin film	✓	✓
thin film is a fiat material part and one-dimensional nano-object	✓	✓
crystallinity is an independent general individual quality	✓
crystallinity is a quality		✓
transition metal is a synonym of transition element	✓
thermal conductivity is a conductivity	✓	✓
tunnel spectroscopy is a spectroscopy	✓	✓
scanning tunneling spectroscopy is same as tunnel spectroscopy	✓	✓
chemical vapor deposition is a vapor deposition	✓	✓
physical vapor deposition is a vapor deposition	✓	✓
ZnO nanowire is a nanowire	✓	✓
	42	37

The current corpus is mainly related to the themes of Chemical synthesis, Engine Emissions, Flame Combustion, and Furnace Emissions. A larger corpus would allow us to find more concepts and axioms as well as extend the coverage, i.e., larger parts of the ontologies could be extended.

Our results show that the approach generates many EXIST(-m) cases. This provides a sanity check for our approach as it shows that existing concepts can be found. In a future system we may want to filter out suggestions by checking the existence of the term or a similar term in the ontologies before showing them to the domain expert. This may lead to less unnecessary validation work for the domain expert as EXIST(-m) cases would be removed. However, this may also lead to missing some new concepts as the terms used in different ontologies may not always mean the same thing. For instance, in () it was shown that ‘metabolism’ in MeSH has a different meaning than ‘metabolism’ in ToxOntology. Therefore, only using (approximate) string matching and using synonyms may not be enough to filter out EXIST(-m) cases.
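A pre-filter of the kind considered above could be sketched with approximate string matching from the Python standard library. This is only an illustration under stated assumptions: the function name, the cutoff value of 0.85, and the list of ontology labels are hypothetical and would need tuning and real label data in practice.

```python
import difflib


def likely_existing(candidate, ontology_labels, cutoff=0.85):
    """Return the closest ontology label for `candidate`, or None if none is similar enough."""
    # Compare case-insensitively but return the label as written in the ontology.
    by_lowercase = {label.lower(): label for label in ontology_labels}
    hits = difflib.get_close_matches(
        candidate.lower(), list(by_lowercase), n=1, cutoff=cutoff
    )
    return by_lowercase[hits[0]] if hits else None


# Hypothetical labels standing in for concept names of an ontology.
labels = ["nanoparticle", "quantum dot", "Raman spectroscopy"]

match_for_plural = likely_existing("quantum dots", labels)
match_for_new_term = likely_existing("band gap", labels)
print(match_for_plural, match_for_new_term)
```

Such a filter only compares surface forms. As the 'metabolism' example shows, a matching label does not guarantee matching meaning, so flagged candidates may still need to be shown to the domain expert rather than silently dropped.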

For the domain expert it was easier to interpret and label the topics for the settings with high mining thresholds. As mentioned, the number of phrases for topics for the low mining threshold settings is larger than for the high mining threshold settings. Often the topics for the low mining thresholds contained too many phrases to easily interpret the topic. In an extreme case, the domain expert thought that a topic “looked like the subject of a particular research article”.

One issue that the domain expert noted was that it was not always easy to decide which level of granularity to use during the interpretation. The question is how specific or how general the interpretation could be and still make sense for the ontology. Although our approach gives much flexibility in this sense, it does give much responsibility to the domain expert and some way to automate recommendations would be helpful. Another related issue is the fact that we found several concepts that were too general for the nanotechnology domain, but that are still relevant. In this case we did not add these to the ontology, but one may reflect on how to deal with this issue, e.g., by importing or linking to other ontologies.

In this experiment we did not find cases where the lattice was in conflict with the ontologies. In our method the domain expert is involved in interpreting the lattice. Therefore, if there were a conflict between the domain expert’s validation and the ontologies, there would be two possibilities. First, it is possible that the domain expert made a mistake, and by observing the conflict could rectify the mistake. Second, there may be a mistake in the ontologies. By observing the conflict, we then have an opportunity for debugging the ontology using specialized tools (e.g., ()).

4.5 Comparison to Other Approaches

Literature. As mentioned before, we are mainly dealing with concept discovery and concept hierarchy derivations. As these are also two tasks in ontology learning, we find most related work in that area. While we addressed different methods in Section 2.3, in this section we address systems. A number of ontology learning systems generate concepts. Examples are ASIUM (), CRCTOL (), OntoGain (), OntoLearn () and Text2Onto (). ASIUM applies linguistics-based sentence parsing, syntactic structure analysis, and sub-categorization frames to return concepts. CRCTOL implements both linguistics-based methods and relevance analysis. OntoGain extracts concepts by using linguistics-based part-of-speech tagging, shallow parsing, and relevance analysis. OntoLearn generates concepts based on the concepts and glossary from WordNet. Finally, Text2Onto uses statistics-based co-occurrence analysis. We show the performance of these five systems in Table 7 according to ().

Table 7

Performance of ontology learning systems in different domains (). (Precision is truncated).

System	Domain	Precision

ASIUM	French journal Le Monde	0.86
CRCTOL	Patterns of Global Terrorism	0.92
OntoGain	Computer Science corpus	0.86
	Medical corpus	0.89
OntoLearn	Tourism	0.85
Text2Onto	Text from the paper ()	0.61
	Patterns of Global Terrorism	0.74

Experiment with Text2Onto. To compare our approach with another system, we have chosen to experiment with Text2Onto (). It was the only system that we found that we could download and install. It is also one of the most popular and well-known ontology learning systems and therefore a good choice. Text2Onto is an ontology learning system based on mining textual resources. For extracting concepts from the textual resource, Text2Onto implements four algorithms, which are entropy-based, C-value/NC-value-based, relative term frequency-based, and term frequency-inverse document frequency (TF-IDF)-based, respectively. As shown above, it performed well in different domains.
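To make the last of these four scoring schemes concrete, the following minimal sketch ranks terms by TF-IDF over a set of tokenized documents. It is a textbook formulation, not Text2Onto's implementation; the function name, the summation of scores over documents, and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(documents):
    """Score each term by its TF-IDF summed over all tokenized documents."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = Counter()
    for doc in documents:
        if not doc:
            continue  # skip empty documents to avoid division by zero
        for term, count in Counter(doc).items():
            tf = count / len(doc)                   # relative term frequency
            idf = math.log(n_docs / doc_freq[term])  # inverse document frequency
            scores[term] += tf * idf
    return scores


# Hypothetical tokenized documents.
documents = [
    ["gold", "nanorod", "growth"],
    ["gold", "nanoparticle", "size"],
    ["gold", "thin", "film"],
]
scores = tf_idf_scores(documents)
ranked_terms = [term for term, _ in scores.most_common()]
print(ranked_terms)
```

A term that occurs in every document, such as 'gold' here, receives a score of zero, whereas terms concentrated in few documents rank higher. This weighting may be one reason why a TF-IDF-based ranking surfaces different candidates than a ranking based on frequency alone.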

In this experiment, we use Text2Onto on the same corpus as in the experiment for our approach. We split the corpus into segments as Text2Onto uses too much memory when applied to the whole corpus. We apply Text2Onto with default settings for its four algorithms on our corpus. For each of the settings, Text2Onto returns thousands of candidates ranked by relevance. We apply the same domain expert validation as in our method for interpreting phrases, presented in Section 3.3. Instead of using the complete ranked lists of thousands of proposed concepts, we decided to investigate the results of the sub-lists containing the 100, 200, 300 and 400 top elements in the lists, respectively. The results are shown in Table 8. The entropy-based and C-value/NC-value-based methods return exactly the same results. For the relative term frequency-based method the 160 highest ranked proposed concepts are the same as the 160 highest ranked proposed concepts for the entropy-based and C-value/NC-value-based methods. The precision for the entropy-based, C-value/NC-value-based, and relative term frequency-based methods is nearly identical for each fixed number of proposed concepts (differing by at most 0.01), while the TF-IDF-based method has the lowest precision. However, the TF-IDF-based method finds the largest number of relevant new concepts (ADD(-m)). Further, the precision decreases and the number of relevant new concepts increases for all algorithms when we take larger sub-lists of top elements.

Table 8

The results of Text2Onto with different algorithms and different numbers of returned candidates. (Precision is truncated).

# of elements	Algorithm	ADD	ADD-m	EXIST	EXIST-m	No-g	No	precision

100	Entropy	5	0	39	19	4	33	0.67
	C-value/NC-value	5	0	39	19	4	33	0.67
	Relative term frequency	5	0	39	20	4	32	0.68
	TF-IDF	17	0	22	12	6	43	0.57
200	Entropy	7	1	63	43	7	79	0.60
	C-value/NC-value	7	1	63	43	7	79	0.60
	Relative term frequency	7	1	63	42	8	79	0.60
	TF-IDF	24	1	38	19	19	99	0.50
300	Entropy	12	1	80	52	16	139	0.53
	C-value/NC-value	12	1	80	52	16	139	0.53
	Relative term frequency	13	1	78	52	16	140	0.53
	TF-IDF	28	1	58	36	29	148	0.50
400	Entropy	18	1	98	62	20	199	0.50
	C-value/NC-value	18	1	98	62	20	199	0.50
	Relative term frequency	19	1	100	61	20	199	0.50
	TF-IDF	36	1	70	44	38	211	0.47

In Table 9, we show the results for Text2Onto when all algorithms are used together for the different sub-lists of top elements and compare them to our method. In Table 10 we show all the new concepts found by our method and Text2Onto for the NanoParticle Ontology. Fourteen concepts were found by both methods. Further, our method found 21 new concepts that were not found by Text2Onto, while Text2Onto found 28 new concepts that were not found by our method. The two methods therefore seem to be complementary.

Table 9

Results for Text2Onto using all algorithms per setting and our method for extending NanoParticle Ontology. (Precision is truncated).

	ADD	ADD-m	EXIST	EXIST-m	No-g	No	precision

Text2Onto-100	20	0	51	27	11	71	0.60
Text2Onto-200	29	1	84	55	26	164	0.54
Text2Onto-300	39	1	118	78	44	266	0.51
Text2Onto-400	41	1	120	73	47	313	0.47
Our Method	32	3	25	18	14	22	0.80
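The precision column of Table 9 can be reproduced from the category counts with the short sketch below. It assumes that precision is the fraction of validated suggestions that do not fall into the No category, truncated to two decimals. This definition is inferred from the tabulated counts and is an assumption of the sketch, and the function names are hypothetical.

```python
import math


def precision(counts):
    """Fraction of suggestions that are not in the 'No' category (assumed definition)."""
    total = sum(counts.values())
    return (total - counts["No"]) / total


def truncate(value, digits=2):
    """Cut off `value` after `digits` decimals without rounding up."""
    factor = 10 ** digits
    return math.floor(value * factor) / factor


# Category counts copied from two rows of Table 9.
our_method = {"ADD": 32, "ADD-m": 3, "EXIST": 25, "EXIST-m": 18, "No-g": 14, "No": 22}
text2onto_100 = {"ADD": 20, "ADD-m": 0, "EXIST": 51, "EXIST-m": 27, "No-g": 11, "No": 71}

for name, row in [("Our Method", our_method), ("Text2Onto-100", text2onto_100)]:
    print(name, truncate(precision(row)))
```

Under this assumption the two rows give 0.80 and 0.60, which agrees with the values reported in the table.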

Table 10

New concepts found by our method and Text2Onto for the NanoParticle Ontology.

Concepts	Our method	Text2Onto

acid group		✓
activation energy		✓
amorphous silicon	✓
band gap	✓	✓
Barium Titanate	✓	✓
Barium Titanate nanowire		✓
block copolymer	✓	✓
boron nanowire		✓
catalyst		✓
cluster		✓
copolymer	✓	✓
crystallite		✓
crystallinity	✓
CdSe nanocrystal	✓
CdTe nanoparticle	✓
copper nanoparticle	✓	✓
conductivity	✓	✓
diblock copolymer		✓
electrical conductivity	✓
esterification		✓
ethylene oxide		✓
gold nanorod	✓	✓
growth mechanism	✓	✓
intensity		✓
resolution	✓
layer by layer growth	✓
liquid solid	✓
pressure		✓
MCM 41	✓
mechanical property	✓
melting		✓
melt spin	✓
mesoporous silica nanoparticle	✓
mesoporous silica nanosphere	✓
microcrystalline silicon	✓	✓
nano colloid		✓
nano composite		✓
nanocrystal	✓	✓
nano crystalline silicon particle		✓
nanogrid		✓
nano ribbon		✓
nanotube array	✓	✓
nanowire array		✓
oxidation		✓
photo activity		✓
polyelectrolyte		✓
polymorphous silicon	✓
pore size	✓	✓
porous silicon	✓
pressure		✓
quantum confinement	✓	✓
reverse micelle-type quantum dot	✓
semiconductor nanocrystal	✓	✓
silicon thin film	✓
silica nanosphere		✓
silicon nanowire		✓
silicon nanowire array		✓
superlattice nanowire		✓
thin film	✓
titanium nanotube		✓
thermal conductivity	✓
tunnel spectroscopy	✓
ZnO nanowire	✓
	35	42

5 Conclusions and Future Work

In this paper we have used a phrase-based topic model approach and introduced a formal topical concept analysis for extending ontologies. A domain expert interprets the results, which are phrases, topics, and a lattice. This leads to the confirmation of ontological concepts (EXIST(-m)) or to the addition of new concepts and axioms (ADD(-m)). The latter is the actual extension of the ontologies. Also, concepts from more general or other domains may be found, as well as very specific concepts in the domain that need not be added to the ontology. We have shown the usefulness of the approach by extending two ontologies in the nanotechnology domain using approximately 600 abstracts.

In the future we will investigate how to help the domain expert deal with the granularity issue. In particular, the topical concept lattice explored in this work appears to help refine topics into classifiers of content that are more general and meaningful in the domain. This may be a useful step towards a higher level of automation in the process of extracting ontology information from unstructured text. Furthermore, we will investigate the scalability of our approach by experimenting with more documents. Another possible direction is to investigate possible synergies between the topics and the ontology concepts, e.g., by using the ontologies to generate the corpora, or by iterating between topic generation and interpretation.

Data Science Journal

Research Papers