A Method for Extending Ontologies with Application to the Materials Science Domain

In the materials science domain the data-driven science paradigm has been a focus since the beginning of the 2000s. A large number of research groups and communities are building and developing data-driven workflows. However, much of the data and knowledge is stored in different heterogeneous data sources maintained by different groups. This leads to a reduced availability of the data and poor interoperability between systems in this domain. Ontology-based techniques are an important way to reduce these problems and a number of efforts have started. In this paper we investigate efforts in materials science, and in particular in the nanotechnology domain, and show how ontologies developed by domain experts can be improved. We use a phrase-based topic model approach and formal topical concept analysis on unstructured text in this domain to suggest additional concepts and axioms for the ontology that should be validated by a domain expert. We describe the techniques and show the usefulness of the approach through an experiment where we extend two nanotechnology ontologies using approximately 600 titles and abstracts.


Introduction
From the beginning of the 2000s materials science has shifted towards its fourth paradigm, (big) data-driven science (Agrawal & Choudhary 2016). More and more researchers in materials science have realized that data-driven techniques could accelerate the discovery and design of materials. Therefore, a large number of research groups and communities have developed data-driven workflows including data repositories (for an overview see ) and data analytics tools for particular purposes. As data-driven techniques become widely used, big data challenges regarding volume, variety, variability and veracity, and challenges in reproducing, sharing, and integrating data (Kalidindi & De Graef 2015, Agrawal & Choudhary 2016, Tropsha et al. 2017, Karcher et al. 2018, Rumble et al. 2019) are growing at the same time.
These challenges have also occurred in other fields. For instance, in (Lambrix 2005) the problems of locating, retrieving and integrating data in the biomedical field were addressed. These problems relate to the more recently introduced FAIR principles that aim to support machines to automatically find and use data, and individuals to reuse the data (Wilkinson et al. 2016). The FAIR principles state that data should be Findable, Accessible, Interoperable, and Reusable. In different areas, including the materials science domain, research is under way to make data management conform to these principles (Draxl & Scheffler 2018). Ontologies and ontology-based techniques are among the recognized enablers of these principles. Ontologies provide a shared standardized representation of knowledge of a domain. By describing data using ontologies, the data will be more findable. By using ontologies for representing the metadata, the level of accessibility can be raised. By using the same terminology as defined by ontologies, interoperability is enabled. Finally, as ontologies are shared and standardized, reusability is supported. Taking nanotechnology as an example, in (Tropsha et al. 2017) it is stated that there exists a gap between data generation and shared data access. The domain lacks standards for collecting and systematically representing nanomaterial properties. In (Karcher et al. 2018) stakeholder-identified technical and operational challenges for the integration of data in the nanotechnology domain are presented. The technical challenges mainly refer to (i) the use of different data formats, (ii) the use of different vocabularies, (iii) the lack of unique identifiers, and (iv) the use of different data conceptualization methods. The operational challenges refer to (i) the fact that organizations have different levels of data quality and completeness, and (ii) the lack of understandable documentation.
To address these challenges, it has been proposed that ontologies and ontology-based techniques can play a significant role in data-driven materials science and enable reproduction, sharing and integration of data. This was, for instance, the main outcome of a workshop on interoperability in materials modelling organized by the European Materials Modelling Council (European Materials Modelling Council 2017).
Although this work is in its infancy, some organizations and research groups have started to develop ontologies and standards for the materials domain (Section 2.2), including in the nanotechnology domain. However, developing ontologies is not an easy task and often the resulting ontologies are not complete. In addition to being problematic for the correct modelling of a domain, such incomplete ontologies also influence the quality of semantically-enabled applications such as ontology-based search and data integration. Incomplete ontologies when used in semantically-enabled applications can lead to valid conclusions being missed. For instance, in ontology-based search, queries are refined and expanded by moving up and down the hierarchy of concepts. Incomplete structure in ontologies influences the quality of the search results. In experiments in the biomedical field, an example was given where a search in PubMed (http://www.ncbi.nlm.nih.gov/pubmed/), a large database with abstracts of research articles in the biomedical field, using the MeSH (Medical Subject Headings) (http://www.nlm.nih.gov/mesh/) ontology would miss 55% of the documents if the relation between the concepts Scleral Disease and Scleritis is missing (Liu & Lambrix 2010).
In this paper, we present a novel method for extending existing ontologies by detecting new concepts and relations in the concept hierarchy that should be included in the ontologies. We do this by presenting a new approach, formal topical concept analysis, that integrates a variant of topic modeling and formal concept analysis. Further, we apply our method to two ontologies (NanoParticle Ontology and eNanoMapper) in the materials science domain. The choice of the use of ontologies in the nanotechnology domain is motivated by the fact that, as we have shown before, there is an awareness of the need for ontologies to deal with interoperability and reusability issues. Further, there are not so many ontologies in materials science yet (see Section 2.2) and the chosen ontologies are among the more mature ontologies in the field. Therefore, they represent the most difficult case for extending ontologies.
The remainder of the paper is organized as follows. In Section 2 we describe what ontologies are, efforts on ontologies in the materials domain as well as work on extending ontologies. Section 3 describes our approach while Section 4 shows and discusses the results of the application of our approach in the nanotechnology domain. We show how NanoParticle Ontology and eNanoMapper were extended and evaluate the usefulness of the approach. We also compare our results to the results of an experiment with another popular system on the same data. Finally, the paper concludes in Section 5.

Ontologies
Intuitively, ontologies can be seen as defining the basic terms and relations of a domain of interest, as well as the rules for combining these terms and relations. Ontologies are used for communication between people and organizations by providing a common terminology over a domain. They provide the basis for interoperability between systems, and can be used as an index to a repository of information as well as a query model and a navigation model for data sources. They are often used as a basis for integration of data sources, thereby alleviating the variety and variability problems. The benefits of using ontologies include reuse, sharing and portability of knowledge across platforms, and improved maintainability, documentation, maintenance, and reliability. Overall, ontologies lead to a better understanding of a field and to more effective and efficient handling of information in that field (e.g., (Stevens et al. 2000)).
From a knowledge representation point of view, ontologies may contain four components: (i) concepts that represent sets or classes of entities in a domain, (ii) instances that represent the actual entities, (iii) relations, and (iv) axioms that represent facts that are always true in the topic area of the ontology. Axioms can represent such things as domain restrictions, cardinality restrictions, or disjointness restrictions. Ontologies can be classified according to which components and the information regarding the components they contain. As an example, Figure 1 represents a small piece of the NanoParticle Ontology (Thomas et al. 2011) regarding 'chemical entity' and 'quality'. Regarding chemical entities NanoParticle Ontology contains, for instance, the concepts chemical entity, chemical substance, ion, particle, isotope and molecular entity. The black full arrows represent axioms representing is-a relations, i.e. if A is a B, then all entities that belong to concept A also belong to concept B. We also say then that A is a sub-concept of B. In this example we have that chemical substance, particle, ion, isotope and molecular entity are sub-concepts of chemical entity. Therefore, all chemical substances, particles, ions, isotopes, and molecular entities are also chemical entities. Further, all primary particles are particles, all nanoparticles are primary particles, all polymeric nanoparticles are nanoparticles and all gelatin nanoparticles are polymeric nanoparticles. The is-a relation is transitive such that, for instance, a gelatin nanoparticle is also a particle. Regarding different kinds of qualities NanoParticle Ontology contains, for instance, the concepts particle size, molecular weight, particle concentration, organic, inorganic, shape, chemical composition, density, hydrodynamic size, mass, size, and electric charge.
Further, particles have qualities; this is represented by an axiom that states that concepts particle and quality are connected to each other by the relation has quality (green dashed arrows in Figure 1). Properties represented by relations are inherited via the is-a hierarchy. Therefore, also the subconcepts of particles are related to qualities.
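The transitivity of is-a and the inheritance of relations along the hierarchy can be illustrated with a minimal Python sketch. It encodes only the is-a edges named in the text and assumes, for simplicity, that each concept has a single parent. The names IS_A, ASSERTED, ancestors and has_relation are illustrative assumptions and are not part of the NanoParticle Ontology or of any OWL tool.

```python
# Direct is-a edges taken from the example in the text (child -> parent).
# Simplifying assumption: every concept has at most one parent.
IS_A = {
    "gelatin nanoparticle": "polymeric nanoparticle",
    "polymeric nanoparticle": "nanoparticle",
    "nanoparticle": "primary particle",
    "primary particle": "particle",
    "particle": "chemical entity",
}

# Relations asserted directly on a concept: (concept, relation, target).
ASSERTED = {("particle", "has quality", "quality")}


def ancestors(concept):
    """Return all super-concepts of `concept`, following is-a transitively."""
    result = []
    while concept in IS_A:
        concept = IS_A[concept]
        result.append(concept)
    return result


def has_relation(concept, relation, target):
    """A relation holds if it is asserted on the concept or on a super-concept."""
    return any((c, relation, target) in ASSERTED
               for c in [concept] + ancestors(concept))
```

With these definitions, has_relation("gelatin nanoparticle", "has quality", "quality") holds because the relation is asserted on particle, which is a super-concept of gelatin nanoparticle.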
In Figure 2 we show the part of NanoParticle Ontology that represents particles using the ontology development system Protégé (https://protege.stanford.edu/). On the left-hand side the concepts and the is-a hierarchy are shown. The is-a relations are represented by indentation. For instance, gelatin nanoparticle (highlighted in Figure 2) is a sub-concept of polymeric nanoparticle which in its turn is a sub-concept of nanoparticle. On the right-hand side of Figure 2 information related to the axioms is shown using a special notation reflecting constructs in the representation language OWL (http://www.w3.org/TR/owl-features/, http://www.w3.org/TR/owl2-overview/), a knowledge representation language that is often used for representing ontologies and that is based on description logics (Baader et al. 2010). For instance, we note that the concept gelatin nanoparticle was defined to be equivalent to nanoparticle and (has_component_part some gelatin). This means that every gelatin nanoparticle is a nanoparticle that has a component that is gelatin, and vice versa, whenever a nanoparticle has a component that is gelatin, then it is a gelatin nanoparticle. Further, there is information about the types of qualities that gelatin nanoparticles have (inherited from the particle concept). An advantage of using a description logics-based representation is that it allows for reasoning. In the ontology it was defined that gelatin nanoparticle is equivalent to nanoparticle and (has_component_part some gelatin) (as we just noted), that polymeric nanoparticle is equivalent to nanoparticle and (has_component_part some polymer), and that gelatin is a subconcept of protein which is a subconcept of biopolymer which is in its turn a subconcept of polymer. Based on these axioms the system can derive the additional information that a gelatin nanoparticle is a polymeric nanoparticle, which is also shown on the right-hand side of Figure 2 (under 'SubClass Of').
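The derivation that a gelatin nanoparticle is a polymeric nanoparticle can be sketched in a few lines of Python. This is a deliberately restricted illustration and not how a description logic reasoner works in general: it only handles definitions of the form "base and (role some filler)" and compares them component by component. The names IS_A, DEFINED, is_subconcept and derived_is_a are hypothetical.

```python
# Asserted is-a edges among the fillers (child -> parent), from the text.
IS_A = {"gelatin": "protein", "protein": "biopolymer", "biopolymer": "polymer"}

# Definitions of the form: name is equivalent to base and (role some filler).
DEFINED = {
    "gelatin nanoparticle": ("nanoparticle", "has_component_part", "gelatin"),
    "polymeric nanoparticle": ("nanoparticle", "has_component_part", "polymer"),
}


def is_subconcept(a, b):
    """True if a equals b or b is reachable from a via asserted is-a edges."""
    while a != b and a in IS_A:
        a = IS_A[a]
    return a == b


def derived_is_a(a, b):
    """Check whether defined concept a is subsumed by defined concept b."""
    base_a, role_a, filler_a = DEFINED[a]
    base_b, role_b, filler_b = DEFINED[b]
    return (is_subconcept(base_a, base_b)
            and role_a == role_b
            and is_subconcept(filler_a, filler_b))
```

Because gelatin is (transitively) a sub-concept of polymer and both definitions share the base concept and the role, derived_is_a("gelatin nanoparticle", "polymeric nanoparticle") holds, while the reverse direction does not.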
Figure 3 shows the actual OWL representation for the concepts gelatin nanoparticle, polymeric nanoparticle and nanoparticle.

Ontologies in materials domain
Within the materials domain the use of semantic technologies is in its infancy with the development of ontologies and standards. According to (Zhang, Zhao & Wang 2015) domain ontologies have been used to organize materials knowledge in a formal language, as a global conceptualization for materials information integration (e.g., (Cheng et al. 2014)), for linked materials data publishing, for inference support for discovering new materials and for semantic query support (e.g., (Zhang, Luo, Zhao & Zhang 2015, Zhang et al. 2017)). Most ontologies focus on specific sub-domains of the materials field (e.g., metals, ceramics, thermal properties, nanotechnology) and have been developed with a specific use in mind (e.g., search, data integration, discovery). Some examples of ontologies are the Materials Ontology (Ashino 2010) for data exchange among thermal property databases, PREMΛP ontology (Bhat et al. 2013) for steel mill products, MatOnto ontology (Cheung et al. 2008) for oxygen ion conducting materials in the fuel cell domain, and the FreeClassOWL ontology (Radinger et al. 2013) for the construction and building materials domain. An ontology design pattern regarding material transformations was proposed in (Vardeman II et al. 2017). In the sub-field of nanotechnology, the NanoParticle Ontology (Thomas et al. 2011) was created for understanding biological properties of nanomaterials, searching for nanoparticle relevant data and designing nanoparticles. It builds on the Basic Formal Ontology (BFO, http://basic-formal-ontology.org/) (Arp et al. 2015) and Chemical Entities of Biological Interest Ontology (ChEBI) (de Matos et al. 2010) to represent basic knowledge regarding physical, chemical and functional features of nanotechnology used in cancer diagnosis and therapy. The eNanoMapper ontology (Hastings et al. 2015) aims to integrate a number of ontologies such as the NanoParticle Ontology for assessing risks related to the use of nanomaterials.
Furthermore, standards for exporting data from databases and between tools are being developed. These standards provide a way to exchange data between databases and tools, even if the internal representations of the data in the databases and tools are different. They are a prerequisite for efficient materials data infrastructures that allow for the discovery of new materials (Austin 2016).
In several cases the standards formalize the description of materials knowledge and thereby create ontological knowledge. For instance, one effort is by the European Committee for Standardization which organized workshops on standards for materials engineering data of which the results are documented in (European Committee for Standardization 2010). Another recent effort is connected to the European Centre of Excellence NOMAD (Ghiringhelli et al. 2016).

Extending ontologies from unstructured text
The ontology extension problem that we tackle deals mainly with concept discovery and concept hierarchy derivation. These are also two of the tasks in the problem of ontology learning (Buitelaar et al. 2005). Therefore, most of the related work comes from that area. For instance, a recent survey (Asim et al. 2018) discusses 140 research papers. Different techniques can be used for concept and relationship extraction. In this setting, new ontology elements are derived from text using knowledge acquisition techniques.
Linguistic techniques use part-of-speech tagged corpora for extracting syntactic structures that are analyzed regarding the words and the modifiers contained in the structure. One kind of linguistic approach uses lexico-syntactic patterns. The pioneering research in this line is (Hearst 1992), which defines a set of patterns indicating is-a relationships between words in the text. Other linguistic approaches may make use of, for instance, compounding, the use of background and itemization, term co-occurrence analysis or superstring prediction (e.g., (Wächter et al. 2006, Arnold & Rahm 2013)).
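As an illustration of a lexico-syntactic pattern, the following Python sketch implements a single "such as" pattern in the style of (Hearst 1992). It is a simplified assumption-laden example: it matches single-word terms with a regular expression instead of noun phrases from a part-of-speech tagger, and the names PATTERN and extract_is_a are hypothetical.

```python
import re

# One Hearst-style pattern: "<hypernym> such as <hyponym>, <hyponym> and <hyponym>".
# Only single-word terms are matched to keep the sketch short.
SEPARATOR = r"(?:, and |, or |, | and | or )"
PATTERN = re.compile(r"(\w+) such as (\w+(?:" + SEPARATOR + r"\w+)*)")


def extract_is_a(sentence):
    """Return (hyponym, hypernym) pairs suggested by the 'such as' pattern."""
    pairs = []
    for match in PATTERN.finditer(sentence):
        hypernym = match.group(1)
        hyponyms = re.split(SEPARATOR, match.group(2))
        pairs.extend((hyponym, hypernym) for hyponym in hyponyms)
    return pairs


pairs = extract_is_a("We study nanomaterials such as nanotubes, nanowires and fullerenes.")
```

Each extracted pair is only a candidate is-a relation; as with the other techniques discussed here, such candidates would still need validation.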
Another paradigm is based on machine learning and statistical methods which use the statistics of the underlying corpora, such as the k-nearest neighbors approach (Maedche et al. 2003), association rules (Maedche & Staab 2000), bottom-up hierarchical clustering techniques (Zavitsanos et al. 2007), supervised classification (Spiliopoulos et al. 2010) and formal concept analysis (Cimiano et al. 2005). There are also some approaches that use topic models (Schaal et al. 2005, Lin et al. 2012, Rani et al. 2017) but they focus on concept names that are words, rather than phrases as in our approach.
Ontology evolution approaches (Hartung et al. 2011, Dos Reis et al. 2013) allow for the study of changes in ontologies and for using the change management mechanisms to detect candidate missing relations. An approach that allows for detection and user-guided completion of the is-a structure is given in (Ivanova & Lambrix 2013, Lambrix et al. 2015), where completion is formalized as an abduction problem and the RepOSE tool is presented.

Approach
Our approach for extending ontologies, shown in Figure 4, contains the following steps. In the first step, creation of a phrase-based topic model, documents related to the domain of interest are used to create topics. The phrases as well as the topics are suggestions that a domain expert should validate or interpret and relate to concepts in the ontology. In the second step the (possibly validated and updated) topics are used in a formal topical concept analysis which returns suggestions to the domain expert regarding relations between topics and thus concepts in the ontology. Both steps lead to the addition of new concepts and (subsumption) axioms to the ontology. In the following subsections we describe these steps.

Phrase-based Topic Model
In our first step we use the phrase-based topic model in the ToPMine system (El-Kishky et al. 2014). Given a corpus of documents and the number of requested topics, representations of latent topics in the documents are computed. Essentially, topics can be seen as a probability distribution over words or phrases. The ToPMine approach is purely data-driven, i.e., it does not require domain knowledge or specific linguistic rule sets. This is important for our application domain as there is a lack of annotated background knowledge. An important property of the system is that it works on bags-of-phrases, rather than the traditional bag-of-words. This means that words occurring closer together have more weight than words far away. Further, as we assume existing ontologies, it is very likely that concepts with one-word names are already in the ontology and we therefore focus on phrases.
The approach consists of two parts: phrase mining and topic modelling. In the first part frequent contiguous phrases are mined, which consists of collecting aggregate counts for all contiguous words satisfying a minimum support threshold. Then the documents are segmented based on the frequent phrases. Further, an agglomerative phrase construction algorithm merges the frequent phrases guided by a significance score. In the second part topics are generated using a variant of Latent Dirichlet Allocation, called PhraseLDA, that deals with phrases, rather than words.
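The first part, counting contiguous word sequences against a minimum support threshold, can be sketched as follows. This is a simplified illustration under our own assumptions and not the ToPMine implementation: it counts all n-grams up to a fixed length by brute force and omits the document segmentation and the significance-guided agglomerative merging. The function name frequent_phrases, the parameter max_len and the toy documents are hypothetical.

```python
from collections import Counter


def frequent_phrases(documents, min_support, max_len=3):
    """Count contiguous word sequences and keep those meeting min_support."""
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {" ".join(p): c for p, c in counts.items() if c >= min_support}


# Toy corpus (hypothetical, not taken from the experiments).
docs = [
    "gold nanoparticle synthesis",
    "gold nanoparticle particle size",
    "particle size of quantum dot",
]
phrases = frequent_phrases(docs, min_support=2)
```

With a support threshold of 2, the multi-word phrases "gold nanoparticle" and "particle size" are kept, whereas "quantum dot" occurs only once and is dropped. Lowering the threshold admits more candidate phrases, which mirrors the high and low threshold settings used later in the experiments.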

Formal Topical Concept Analysis
In the second step we define a new variant of Formal Concept Analysis (e.g., (Ganter & Wille 2012)) and use this new variant on topics. These topics can come directly from the previous step or can be a modified version of the topics of the previous step, where non-relevant topics or phrases are removed.
We first define the notions of formal topical context, formal topical concept and topical concept lattice. (Note that formal topical concepts should not be confused with concepts in the ontologies.)

Figure 4: Overview of the approach. The upper part shows the creation of the phrase-based topic model, with unstructured text as input and phrases and topics as output. The lower part shows the formal topical concept analysis, with topics as input and a topical concept lattice as output. In both parts a domain expert validates and interprets the results.

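Definition 1. (Formal Topical Context) A formal topical context is a triple (P, T, I) where P is a set of phrases, T is a set of topics, and I ⊆ P × T is a binary relation between phrases and topics, with (p, t) ∈ I meaning that phrase p occurs in topic t.
Definition 2. (Formal Topical Concept) (A, B) is a formal topical concept of (P, T, I) iff A ⊆ P, B ⊆ T, A = {p ∈ P | (p, t) ∈ I for all t ∈ B} and B = {t ∈ T | (p, t) ∈ I for all p ∈ A}. A is the extent and B is the intent of (A, B).
Definition 3. (Topical Concept Lattice) Formal topical concepts can be ordered. We say that (A1, B1) ≤ (A2, B2) iff A1 ⊆ A2 (equivalently, B2 ⊆ B1). The set Φ(P, T, I) of all formal topical concepts of (P, T, I), with this order, is called the topical concept lattice of (P, T, I).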
As an example, in Figure 5(a) we show a matrix representing the occurrence of phrases in topics in a topic model, the resulting formal topical concepts in Figure 5(c) and the topical concept lattice in Figure 5(b).
In the lattice a node represents a formal topical concept (same numbering as in Figure 5(a)). For a formal topical concept (A, B), its extent (phrases) is found by collecting all phrases in its node as well as its descendants. The intent (topics) is found by collecting all topics in its node as well as its ancestors.
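The computation of formal topical concepts from a phrase-topic occurrence matrix can be sketched in Python. The sketch assumes the standard formal concept analysis reading in which the extent of a concept is the set of phrases shared by all topics in its intent, and the intent is the set of topics containing all phrases in its extent. It enumerates all subsets of topics, which is only feasible for small examples. The data and the names INCIDENCE, TOPICS, extent, intent, formal_topical_concepts and leq are hypothetical.

```python
from itertools import combinations

# Toy context: phrase -> set of topics in which it occurs (hypothetical data).
INCIDENCE = {
    "particle size": {"t1", "t2"},
    "quantum dot": {"t1"},
    "gold nanoparticle": {"t1"},
    "thin film": {"t2"},
}
TOPICS = {"t1", "t2"}


def extent(topics):
    """Phrases that occur in every topic of `topics`."""
    return frozenset(p for p, ts in INCIDENCE.items() if topics <= ts)


def intent(phrases):
    """Topics that contain every phrase of `phrases`."""
    return frozenset(t for t in TOPICS
                     if all(t in INCIDENCE[p] for p in phrases))


def formal_topical_concepts():
    """Enumerate all (extent, intent) pairs by closing every subset of topics."""
    concepts = set()
    for k in range(len(TOPICS) + 1):
        for subset in combinations(sorted(TOPICS), k):
            a = extent(set(subset))
            concepts.add((a, intent(a)))
    return concepts


def leq(c1, c2):
    """Lattice order: c1 <= c2 iff the extent of c1 is a subset of that of c2."""
    return c1[0] <= c2[0]


concepts = formal_topical_concepts()
```

For this toy context four concepts result: a top node containing all phrases and no topic, one node per topic, and a bottom node whose extent is the single shared phrase "particle size" and whose intent contains both topics.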

Domain Expert Validation
As shown in Figure 4, a domain expert is involved in the different steps in our approach to validate and interpret the results of the phrase-based topic model and the formal topical concept analysis. The domain expert validates or interprets all phrases that appear in all topics. The outcome can be one of the following: (i) The phrase is a meaningful representation of a concept in the specific domain and it is already in the ontology. For example, gold nanoparticle is a specific concept within the nanotechnology domain and it is already in the NanoParticle Ontology. We distinguish two cases: (1) a concept with the same name or a name that is a synonym of the original form of the phrase already exists in the ontology (EXIST) or (2) a concept with a name that is a modified form of the phrase already exists in the ontology (EXIST-m). (ii) The phrase is a meaningful representation of a concept in the specific domain but it is not in the ontology. For example, microcrystalline silicon is a meaningful representation of a concept but such a concept does not exist in the ontology. We distinguish two cases: (1) a concept with the same name as the original form of the phrase should be added into the ontology (ADD) or (2) a concept whose name is a modified form of the phrase should be added into the ontology (ADD-m). (iii) No concept related to the phrase should be added to the ontology. This can happen because the phrase does not make sense in the domain (No), but also because it is a meaningful representation of a concept in a more general domain (No-g). For example, electron transfer is a general concept from the perspective of materials science, but should not necessarily be in a nanotechnology ontology.
A second interaction with the domain expert occurs in the interpretation of topics. The outcome can be one of the following: (i) Using the representative phrases in a topic, the domain expert labels the topic. Using this label as a phrase, we have the outcomes EXIST, EXIST-m, ADD, ADD-m, No-g and No, as above. Furthermore, we add an outcome Q (for query) when the label for the topic is too specific for adding to the ontology, but could be defined using concepts in the ontologies and OWL constructs. (ii) Using a subset of representative phrases in a topic, the domain expert labels the subset. Using this label as a phrase, we have the outcomes EXIST, EXIST-m, ADD, ADD-m, No-g, No, and Q as above. This can be done for different subsets.
Finally, the domain expert interprets the lattice.
(i) Given the relationships in the lattice, as well as the connections of the topics and phrases to concepts in the ontology, new relationships between ontology concepts can be identified.

Extending NanoParticle and eNanoMapper Ontologies
In the following subsections, we show the usefulness of our approach by extending two ontologies in the nanotechnology domain.

Corpus and ontologies
The corpus that we use is based on reports on nanoparticles from the Nanoparticle Information Library (http://nanoparticlelibrary.net). For each nanoparticle report, we take the text in 'Research Abstract' as well as the abstracts (or only the titles if there is no abstract) from the publications in 'Related Publications'. The final corpus contains 117 abstracts from the 'Research Abstract' field in the reports and 510 abstracts (or titles) from publications. We have chosen to only retrieve titles and abstracts rather than full texts. The title and abstract cover the basic content of an article. For a research article in the materials science domain they will generally contain a summary of the problem, experiments, simulations and computations. As the ontologies aim to represent basic knowledge in the domain, these parts of a research article often contain enough information for extraction of concepts. When using the full text, more proposals for concepts may be generated, but many of those will not be relevant. In related fields, it has been shown that the use of titles (and abstracts) may be a reasonable approach (e.g., (Galke et al. 2017)). The ontologies that we extend are the NanoParticle Ontology (

Experiments Setup
In our experiments, we configure the phrase mining threshold with two values (high and low) and PhraseLDA with three numbers of requested topics (20, 30 and 40). Thus we have six experiments over the data. The other parameters of PhraseLDA are set as follows: the total number of Gibbs sampling iterations over the entire data is 1000, and the hyperparameters are α = 50/T and β = 0.01 where T is the number of topics. These initial values for the hyperparameters are justified in (Steyvers & Griffiths 2007).
After the interpretation of the phrases by the domain expert, for each setting, all (rows regarding) phrases interpreted with No are removed from the phrase occurrence matrix. The updated matrix (with all EXIST(-m), ADD(-m) and No-g phrases) is used as input for the formal topical concept analysis and a formal topical concept lattice is generated.
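The experimental grid and the removal of 'No' phrases described above can be written down compactly. The following Python sketch is illustrative only: the concrete values of the high and low mining thresholds are not stated in the text and are therefore represented by labels, the phrase "et al" with label No is a made-up example, and the names settings and drop_no_phrases are hypothetical.

```python
from itertools import product

# Phrase mining thresholds are kept as labels (actual values are not given in the text).
thresholds = ["high", "low"]
topic_counts = [20, 30, 40]

# One dictionary per experimental setting; alpha = 50 / T and beta = 0.01.
settings = [
    {"threshold": th, "topics": T, "alpha": 50 / T, "beta": 0.01, "iterations": 1000}
    for th, T in product(thresholds, topic_counts)
]


def drop_no_phrases(matrix, labels):
    """Remove the rows of phrases that the domain expert labeled 'No'."""
    return {phrase: topics for phrase, topics in matrix.items()
            if labels[phrase] != "No"}


# Hypothetical occurrence matrix (phrase -> topics) and expert labels.
matrix = {
    "gold nanoparticle": {"t1"},
    "microcrystalline silicon": {"t2"},
    "electron transfer": {"t1", "t2"},
    "et al": {"t2"},
}
labels = {
    "gold nanoparticle": "EXIST",
    "microcrystalline silicon": "ADD",
    "electron transfer": "No-g",
    "et al": "No",
}
filtered = drop_no_phrases(matrix, labels)
```

Note that No-g phrases are kept in the matrix, in line with the text; only phrases labeled No are removed before the lattice is generated.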
For the interpretation of the phrases, topics and lattice results a domain expert (second author) worked together with two ontology engineering experts (first and third author). In a first 2-hour session the three experts went through the phrases of all topics for one of the settings (low mining threshold, 40 topics) of the topic model approach. Each phrase was discussed regarding whether it was relevant for a nanotechnology ontology, checked whether concepts with the same or similar names existed in the NanoParticle Ontology, and a decision was made regarding EXIST(-m)/ADD(-m)/No(-g) as well as which axioms may need to be added to the ontology. In addition to investigating the ontologies, in some cases terms were checked via Wikipedia or research articles. As a preparation for the second session, the knowledge engineers prepared suggestions for the phrases for the other settings, based on the interpretation results of the first session and search in the two ontologies. During the second session (4 hours) the phrases for all settings were interpreted and related to both ontologies. Further, the topics for one setting were interpreted. In the third (2-hour) session the remaining topics as well as the lattice results were interpreted.

Results and discussion of results
In Table 1 we show the results regarding the interpretation of the phrases. In addition to the number of concepts in the EXIST(-m), ADD(-m), and No(-g) categories, we also show the precision. The precision of the system is the ratio of the number of relevant proposed concepts to the number of proposed concepts. We decided to define a relevant proposed concept as a proposed concept that the domain expert recognizes as a relevant concept, whether it be in the ontology, or more specific than concepts in the ontology, or could belong to a more general ontology. Therefore, the relevant proposed concepts are the ones that do not belong to the 'No' category. This conforms to what is relevant in the ontology learning setting.
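The precision measure defined above amounts to a one-line computation. The sketch below uses made-up counts purely for illustration; they are not the values of Table 1, and the function name precision is hypothetical.

```python
def precision(counts):
    """Ratio of relevant proposed concepts (everything except 'No') to all proposed."""
    proposed = sum(counts.values())
    relevant = proposed - counts.get("No", 0)
    return relevant / proposed if proposed else 0.0


# Hypothetical counts for one setting; these are not the values of Table 1.
example = {"EXIST": 10, "EXIST-m": 5, "ADD": 6, "ADD-m": 3, "No-g": 4, "No": 12}
example_precision = precision(example)
```

In this example 28 of the 40 proposed phrases fall outside the 'No' category, giving a precision of 0.7. No-g phrases count as relevant under this definition.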
We note that some phrases may contribute to the addition of multiple concepts and axioms. Furthermore, the low mining threshold settings generate the largest number of phrases (in total and per topic). Except for one 'No' phrase, all phrases generated by any of the high mining threshold settings are also generated by at least one (and usually all) low mining threshold settings. For the low mining threshold settings there are only small differences regarding the phrases that occur in topics. There are 29 phrases that are generated by all settings. Of these, 13 exist in the ontologies and relate, among others, to kinds of nanotubes, microscopy, spectroscopy, and various properties of nanoparticles. Furthermore, 7 exist in a modified form, e.g., temperature for low/high/room temperature and core-shell nanoparticle for the phrase core shell. The remaining 9 should be added to the ontologies in the same or modified form. These relate to properties (resolution, pore size, band gap, electrical conductivity, crystallinity), a technique (vapor deposition) and nano-objects (mesoporous silica nanoparticle, thin film). Reverse micelle-synthesized quantum dot leads to the creation of a specific kind of quantum dot as well as a specific synthesis technique. The phrases that are only found by low mining threshold settings relate to different kinds of silicon, nanoparticles, properties and techniques, of which many should be added to the ontologies. There are, however, also several phrases that relate to more general concepts in the materials domain that should not necessarily be added to an ontology in the nanotechnology domain. In all settings, most cases are EXIST(-m), which shows that the phrases are relevant with respect to the existing ontologies. Furthermore, we found many ADD(-m) cases which lead to new concepts and axioms.
There are also some phrases that relate to more general concepts and some phrases that do not lead to anything meaningful in the context of extending the ontology. From Table 2 we note that the more topics the system generates, the lower the percentage of topics that contribute to EXIST(-m) and ADD(-m) categories.
Table note: For the meanings of ADD(-m), EXIST(-m) and No(-g), see Section 3.3. For ADD and ADD-m, a new concept is defined in the ontology and one or more subsumption axioms are added.
In Table 3 we show the results regarding the interpretation of the topics. We note that the high mining threshold settings generate the most concepts to add to the ontologies. In each setting there are one or two concepts that were not found during the interpretation of the phrases (e.g., high resolution experiment, water soluble reverse micelle systems, core-shell semiconductors). All EXIST(-m) concepts were also found during the interpretation of the phrases. The No-g category consists of earlier found phrases or specializations of those. Furthermore, many of the topics are very specific and it was decided they should not be added to the ontology, but queries (or complex concepts) using concepts in the ontologies and OWL constructs can be constructed. We also observe that the results for the two ontologies are almost the same, which may be because the topic labels are (much) more specific than the phrase labels and the ontologies do not model concepts at the lowest levels of specificity.
In the final step we generated lattices for all settings. As an example, a part of the lattice for the case of 40 requested topics with a low mining threshold is shown in Figure 6. Nodes that contain one topic/one phrase and have as child the bottom node and as parent the top node are not shown. These have been dealt with in the phrase interpretation step and as there are no connections to other nodes (except top and bottom), no additional information can be gained for those nodes. The lattices were used in the following ways. First, the domain expert labeled the nodes based on the phrases connected to the nodes. These may be the extents or subsets of the extents of topics. The results are given in Table 4. Some new concepts were found that are more general than concepts related to topics (e.g., core-shell CdSe nanoparticles), but in general, little additional information was found.
Table note: For the meanings of ADD(-m), EXIST(-m), No(-g) and Q, see Section 3.3. For ADD and ADD-m, a new concept is defined in the ontology and one or more subsumption axioms are added.
Secondly, the domain expert labeled the nodes based on the phrases connected to the nodes and their descendants. As a node contains fewer phrases than all its ancestors, a labeling may lead to the definition of a new concept that is a super-concept of the concepts related to the ancestor topics (and to relevant axioms). As, according to the topic interpretation step, many topics are very specific, this approach may give a way to decide on the appropriate level of specificity for concepts to add to the ontology. In our experiments, however, the lattices were very flat and the nodes with empty intent contained only one phrase and thus did not lead to additional concepts.
Thirdly, the domain expert used the lattice as a visualization tool to check the original topic interpretation. According to the domain expert, the use of the lattice provides significant help in interpreting the topics. As it groups phrases that are shared between different topics and distinguishes phrases that are specific to certain topics, the structure of complex concepts (based on other concepts) is clarified. It results in a better organization and visualization of the topics and their underlying notions. For instance, for a topic with phrases 'particle size', 'quantum dot', and 'gold nanoparticle', the phrase 'particle size' was shared with another topic. By removing 'particle size' from the phrase list of the topic, it was easier to see that the topic was a combination of 'particle size' and a notion of 'quantum dots of gold nanoparticles'.
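The grouping effect described above can be illustrated with a minimal sketch. It is not the lattice construction used in the experiments; it only groups phrases by the exact set of topics in which they occur, which is the information the domain expert read off the lattice. The topic names, the second topic, and the phrase 'size distribution' are invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy input: topic label -> phrases. The first topic uses the
# phrases from the example in the text; the second topic and the phrase
# 'size distribution' are invented for illustration only.
topics = {
    "topic_a": {"particle size", "quantum dot", "gold nanoparticle"},
    "topic_b": {"particle size", "size distribution"},
}


def group_phrases_by_topics(topics):
    """Map each set of topics to the phrases occurring in exactly those topics."""
    phrase_to_topics = defaultdict(set)
    for topic, phrases in topics.items():
        for phrase in phrases:
            phrase_to_topics[phrase].add(topic)

    groups = defaultdict(set)
    for phrase, owners in phrase_to_topics.items():
        groups[frozenset(owners)].add(phrase)
    return dict(groups)


groups = group_phrases_by_topics(topics)
# Phrases shared by both topics versus phrases specific to topic_a.
shared = groups[frozenset({"topic_a", "topic_b"})]
specific_a = groups[frozenset({"topic_a"})]
print("shared:", sorted(shared))
print("specific to topic_a:", sorted(specific_a))
```

Separating the shared phrase from the topic-specific ones mirrors the step in which 'particle size' was removed from the phrase list to make the remaining notion easier to recognize.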

General discussion
For the experiments we have so far used few resources, i.e., circa 600 abstracts and fewer than 10 hours for each of the three experts. Even with these limited resources our approach finds 35 and 32 new concepts for the NanoParticle Ontology and the eNanoMapper ontology, respectively, as shown in Table 5, as well as 42 and 37 new axioms, respectively, as shown in Table 6. In addition to the new concepts and new axioms, other concepts are also influenced. Indeed, for a new axiom A is-a B, the sub-concepts of A receive B and all its super-concepts as their super-concepts (and thus inherit their properties), and all super-concepts of B receive A and its sub-concepts as sub-concepts (and thus all instances of A and its sub-concepts are also instances of B and its super-concepts). In this experiment, 72 concepts from the NanoParticle Ontology are influenced by the new axioms. Therefore, the quality of semantically-enabled applications is improved whenever one of the 35 new or 72 influenced concepts is used. For the eNanoMapper ontology, the number of existing concepts influenced by the new axioms is 37. In general, if domain and range were used for the definition of relations in the ontologies, even more concepts would be influenced. Thus, adding these axioms improves the quality of the ontologies and of the semantically-enabled applications that use them. We therefore consider the effort of extending the ontologies worthwhile. The current corpus is mainly related to the themes of Chemical synthesis, Engine Emissions, Flame Combustion, and Furnace Emissions. A larger corpus would allow us to find more concepts and axioms as well as extend the coverage, i.e., larger parts of the ontologies could be extended.
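The propagation effect of a single new is-a axiom can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to obtain the counts in this section: the concept names are placeholders, the hierarchy is a toy example, and the sketch counts A and B themselves among the influenced concepts, which may differ from how the reported numbers were counted.

```python
from collections import defaultdict


def closure(start, edges):
    """Return start plus every node reachable from it via edges (node -> set)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def influenced_concepts(is_a_pairs, new_axiom):
    """Concepts whose super- or sub-concepts change when new_axiom is added.

    is_a_pairs: iterable of (child, parent) pairs already in the ontology.
    new_axiom:  the pair (A, B) standing for the new axiom 'A is-a B'.
    """
    parents, children = defaultdict(set), defaultdict(set)
    for child, parent in is_a_pairs:
        parents[child].add(parent)
        children[parent].add(child)

    a, b = new_axiom
    gains_super = closure(a, children)  # A and its sub-concepts
    gains_sub = closure(b, parents)     # B and its super-concepts
    return gains_super | gains_sub


# Placeholder hierarchy: A1 is-a A, B is-a C, D is-a C.
existing = [("A1", "A"), ("B", "C"), ("D", "C")]
influenced = influenced_concepts(existing, ("A", "B"))
print(sorted(influenced))
```

In the toy hierarchy, D is a sibling of B and is therefore untouched by the new axiom, while C gains A and A1 as sub-concepts.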
Our results show that the approach generates many EXIST(-m) cases. This provides a sanity check for our approach as it shows that existing concepts can be found. In a future system we may want to filter out suggestions by checking whether the term or a similar term exists in the ontologies before showing them to the domain expert. This may reduce unnecessary validation work for the domain expert as EXIST(-m) cases would be removed. However, this may also lead to missing some new concepts, as the same term may not always have the same meaning in different ontologies. For instance, in (Ivanova et al. 2012) it was shown that 'metabolism' in MeSH has a different meaning than 'metabolism' in ToxOntology. Therefore, only using (approximate) string matching and using synonyms may not be enough to filter out EXIST(-m) cases.
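A possible pre-filter of this kind is sketched below using approximate string matching from the Python standard library. The function name, the cutoff value, and the ontology labels are assumptions made for illustration. In line with the caveat above, the sketch only returns possible matches so that they can be flagged for the domain expert; it does not decide that a suggestion is an EXIST(-m) case.

```python
from difflib import get_close_matches


def possible_existing(candidate, ontology_labels, cutoff=0.85):
    """Return ontology labels whose spelling is close to the candidate term.

    The cutoff of 0.85 is an arbitrary illustrative choice. A hit only means
    that the strings are similar, not that the meanings coincide.
    """
    normalized = {label.lower(): label for label in ontology_labels}
    hits = get_close_matches(candidate.lower(), list(normalized), n=3, cutoff=cutoff)
    return [normalized[hit] for hit in hits]


# Hypothetical ontology labels, for illustration only.
labels = ["quantum dot", "nanoparticle"]

print(possible_existing("Quantum dots", labels))
print(possible_existing("core-shell semiconductors", labels))
```

A suggestion with a non-empty result could be shown to the expert together with the matched label, while suggestions with an empty result would go through the normal validation step.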
For the domain expert it was easier to interpret and label the topics for the settings with high mining thresholds. As mentioned, the number of phrases for topics for the low mining threshold settings is larger than for the high mining threshold settings. Often the topics for the low mining thresholds contained too many phrases to easily interpret the topic. In an extreme case, the domain expert thought that a topic "looked like the subject of a particular research article".
One issue that the domain expert noted was that it was not always easy to decide which level of granularity to use during the interpretation. The question is how specific or how general the interpretation could be and still make sense for the ontology. Although our approach gives much flexibility in this sense, it does give much responsibility to the domain expert and some way to automate recommendations would be helpful. Another related issue is the fact that we found several concepts that were too general for the nanotechnology domain, but that are still relevant. In this case we did not add these to the ontology, but one may reflect on how to deal with this issue, e.g., by importing or linking to other ontologies.
In this experiment we did not find cases where the lattice was in conflict with the ontologies. In our method the domain expert is involved in interpreting the lattice. Therefore, if there were a conflict between the domain expert's validation and the ontologies, there would be two possibilities. First, it is possible that the domain expert made a mistake, and by observing the conflict could rectify the mistake. Second, there may be a mistake in the ontologies. By observing the conflict, we would then have an opportunity to debug the ontology using specialized tools (e.g., (Lambrix 2019)).

Comparison to Other Approaches
Literature
As mentioned before, we are mainly dealing with concept discovery and concept hierarchy derivation. As these are also two tasks in ontology learning, we find most related work in that area. While we addressed different methods in Section 2.3, in this section we address systems, namely ASIUM, CRCTOL, OntoGain, OntoLearn, and Text2Onto (Cimiano & Völker 2005). ASIUM applies linguistics-based sentence parsing, syntactic structure analysis, and sub-categorization frames to return concepts. CRCTOL implements both linguistics-based methods and relevance analysis. OntoGain extracts concepts by using linguistics-based part-of-speech tagging, shallow parsing, and relevance analysis. OntoLearn generates concepts based on the concepts and glossary from WordNet. Finally, Text2Onto uses statistics-based co-occurrence analysis. We show the performance of these five systems in Table 7 according to (Wong et al. 2012).

Experiment with Text2Onto
To compare our approach with another system, we have chosen to experiment with Text2Onto (Cimiano & Völker 2005). It was the only system that we found that we could download and install. Nevertheless, it is one of the most popular and well-known ontology learning systems and therefore a good choice. Text2Onto is an ontology learning system based on mining textual resources. For extracting concepts from the textual resource, Text2Onto implements four algorithms, which are entropy-based, C-value/NC-value-based, relative term frequency-based, and term frequency-inverse document frequency (TF-IDF)-based, respectively. As shown above, it performed well in different domains.
In this experiment, we use Text2Onto on the same corpus as in the experiment for our approach. We split the corpus into segments as Text2Onto uses too much memory when applied to the whole corpus. We apply Text2Onto with default settings for its four algorithms on our corpus. For each of the settings, Text2Onto returns thousands of candidates ranked by relevance. We apply the same domain expert validation as in our method in terms of interpreting phrases presented in Section 3.3. Instead of using the complete ranked lists of thousands of proposed concepts, we decided to investigate the results of the sub-lists containing the 100, 200, 300 and 400 top elements in the lists, respectively. The results are shown in Table 8. The entropy-based and C-Value/NC-Value-based methods return exactly the same results. For the relative term frequency-based method the 160 highest ranked proposed concepts are the same as the 160 highest ranked proposed concepts for the entropy-based and C-Value/NC-Value-based methods. The precision for the entropy-based and C-Value/NC-Value-based methods is the highest for each fixed number of proposed concepts, closely followed by the relative term frequency-based method. The TF-IDF-based method has the lowest precision. However, the TF-IDF-based method finds the largest number of relevant new concepts (ADD(-m)). Further, for all algorithms the precision decreases and the number of relevant new concepts increases when we take larger sub-lists of top elements.
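The evaluation over sub-lists of top elements can be sketched as follows. The sketch assumes that precision is the fraction of proposed concepts in a sub-list that the domain expert judged relevant, and that the relevant categories are ADD, ADD-m, EXIST and EXIST-m; both are assumptions made for illustration, and the label sequence is invented toy data, not a result from Table 8.

```python
# Assumption: these validation categories count as relevant proposals.
RELEVANT = {"ADD", "ADD-m", "EXIST", "EXIST-m"}
# New concepts are the ADD(-m) cases.
NEW = {"ADD", "ADD-m"}


def precision_at_k(ranked_labels, k):
    """Fraction of the k highest-ranked proposals that are relevant."""
    top = ranked_labels[:k]
    if not top:
        return 0.0
    return sum(label in RELEVANT for label in top) / len(top)


def new_concepts_at_k(ranked_labels, k):
    """Number of ADD(-m) cases among the k highest-ranked proposals."""
    return sum(label in NEW for label in ranked_labels[:k])


# Invented expert labels for a ranked candidate list (toy data).
labels = ["ADD", "EXIST", "No", "ADD-m", "No", "No-g", "ADD", "No"]

for k in (4, 8):
    print(k, precision_at_k(labels, k), new_concepts_at_k(labels, k))
```

On the toy list, the larger sub-list has lower precision but contains more new concepts, which is the same qualitative trade-off as the one reported above.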
In Table 9, we show the results for Text2Onto when all algorithms are used together for the different sub-lists of top elements and compare them to the results of our method. In Table 10 we show all the new concepts found by our method and Text2Onto for the NanoParticle Ontology. Fourteen concepts were found by both methods. Further, our method found 21 new concepts that were not found by Text2Onto, while Text2Onto found 28 new concepts that were not found by our method. The two methods therefore seem to be complementary.
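The overlap reported above can be checked with simple set arithmetic. The sketch below uses placeholder identifiers instead of the actual concepts in Table 10; only the three counts (14, 21 and 28) are taken from the text.

```python
# Placeholder identifiers standing in for the new concepts of each method.
# Only the counts come from the text; the identifiers are invented.
ours = {f"c{i}" for i in range(35)}            # 21 only ours + 14 shared
text2onto = {f"c{i}" for i in range(21, 63)}   # 14 shared + 28 only Text2Onto

found_by_both = ours & text2onto
only_ours = ours - text2onto
only_text2onto = text2onto - ours
combined = ours | text2onto

print(len(found_by_both), len(only_ours), len(only_text2onto), len(combined))
```

With these counts, the two methods together yield 63 distinct new concepts, of which fewer than a quarter were found by both, which is consistent with the observation that the methods are complementary.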