Research on an Agricultural Knowledge Fusion Method for Big Data

Nengfu Xie; Wensheng Wang; Bingxian Ma; Xuefu Zhang; Wei Sun; Fenglei Guo

1 Introduction

Currently, most people use the Internet and the World Wide Web to browse and retrieve information. In the web environment, however, the heterogeneity of information and the scale of big data scenarios make it difficult to obtain the complete, correct, and timely information or knowledge that directly affects judgment and decision-making. Knowledge fusion can be seen as an advanced information integration approach. Information integration focuses on how to find relevant information, whereas knowledge fusion merges this information to create knowledge that is more complete, less uncertain, and less conflicting than the input (). This reduces the cost of data access and enhances the value of the discovered data. Research on web-oriented knowledge fusion theory, methods, and tool development has become an important concern for knowledge-oriented services (; ).

In the 1960s, the international academic community began to research knowledge fusion, but early scholars did not explicitly put forward the concept of knowledge fusion. In the late 1980s, the rise of knowledge engineering increased attention to knowledge fusion. Feigenbaum () put forward a “knowledge principle”, in which knowledge fusion is one of the most important functional modules. Douglas Lenat’s Cyc project, built upon Feigenbaum’s knowledge principle, was an artificial intelligence project that attempted to assemble a comprehensive ontology and knowledge base of everyday common sense knowledge, with the goal of enabling AI applications to perform human-like reasoning ().

KRAFT (Knowledge Reuse And Fusion/Transformation) aimed to combine database and artificial intelligence technology so that scientists and engineers could find and exploit knowledge available on the Internet. KRAFT was a close collaboration between universities and industry (). Building on KRAFT, knowledge fusion has attracted many researchers. Hunter and Williams () advocated a knowledge-based approach to merging semi-structured information. They used fusion rules to manage the semi-structured information that was input for merging. These fusion rules were a form of scripting language that defined how structured reports should be merged. The work assumed that structured news reports did not require natural language processing and used fusion rules to handle their inconsistencies and uncertainty. Fusionplex was a system for integrating multiple heterogeneous and autonomous information sources that used data fusion to resolve factual inconsistencies among the individual sources. To accomplish this, the system relied on source features, which were metadata on the merits of each information source (). A dynamic ontology construction method has also been proposed that analyzes knowledge requirements to make knowledge fusion more effective ().

In the next sections, we will discuss the agricultural knowledge fusion problem and propose a general architecture for our fusion method. Finally, we will describe the knowledge fusion process in detail.

2 Related Technological Aspects

2.1 Agriculture Big Data Technologies

Agriculture big data refers to big data concepts, techniques, and methods applied in the agriculture domain. In addition to the general big data properties of large volume, variety of modalities, rapid generation, and low value density, agricultural big data are pervasive, contralateral, and have other characteristics. In the agricultural domain, production and research generate a large amount of data; in particular, the application of information and communications technology (ICT) in agriculture will produce more in-depth agricultural data, soon reaching the zettabyte (ZB) level. Integrating and mining these data will play an extremely important role in the development of modern agriculture. Big data technologies, including data-processing models and emerging tools, are being developed for the implementation of our fusion system.

2.2 Ontologies

In general, an ontology is an explicit specification of a conceptualization (). Nevertheless, the term ontology has been controversial in current AI practice, and so far no formal definition exists. In our work, we have chosen to use the term domain-specific ontology (DSO). In practical terms, developing an agricultural ontology (AgriOnto) includes three steps:

building a domain-specific knowledge hierarchy;
defining slots of the categories and representing axioms; and
acquiring knowledge, that is to say, filling in the specific data values for slots.

2.3 Information Integration and Knowledge Fusion

Knowledge fusion is closely related to information integration, and the two terms are often used as near synonyms. More precisely, information integration focuses on how to find related information, while knowledge fusion builds on information integration and focuses on how to find accurate and complete information. Therefore, knowledge fusion can be understood as a high-quality integration method that aims to resolve conflicts in the integrated data; information documents can be integrated to guarantee that information is understandable by machines.

It is well recognized that information integration based on a ranking function has very limited value in selecting the correct value from diverse web resources because inconsistencies exist among information from different agricultural information sources. Our proposed approach is a six-step data flow process built on information integration, which we call primary knowledge fusion (PKF) (Figure 1). First, it extracts related information from the PKF through a query. Second, semantic analysis determines whether each piece of information is an instance of a concept in the agricultural ontology (Agri-ontology) and what knowledge it contains. The third step annotates each instance according to the ontology. In the fourth step, the instances are grouped into clusters by instance similarity. Next, the instances are fused according to knowledge fusion rules. Finally, the fused result is evaluated and a new knowledge object (KO) is produced.

Figure 1

The six steps of our approach to knowledge fusion.

When multiple agricultural information sources provide inconsistent information, the knowledge fusion method is called upon to produce new information (knowledge) that is complete and accurate.
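The six steps can be sketched end to end on toy data. The following is a minimal illustration only, not the authors' implementation: the record layout, the `ONTOLOGY_CONCEPTS` set, the sample values, the use of "same annotation" as the similarity measure, and the range check used as the evaluation step are all assumptions introduced here.

```python
from collections import defaultdict
from statistics import mean

# Assumed fragment of the agricultural ontology (hypothetical).
ONTOLOGY_CONCEPTS = {"wheat", "maize", "potato"}

# Hypothetical records returned by information integration (PKF).
RECORDS = [
    {"source": "IS1", "concept": "wheat", "market": "market 1", "price": 1.90},
    {"source": "IS2", "concept": "wheat", "market": "market 1", "price": 1.95},
    {"source": "IS3", "concept": "tractor", "market": "market 1", "price": 50000.0},
    {"source": "IS1", "concept": "wheat", "market": "market 2", "price": 2.10},
]


def run_pipeline(records, market):
    # Step 1: extract information related to the query.
    related = [r for r in records if r["market"] == market]
    # Step 2: semantic analysis, keep only instances of ontology concepts.
    instances = [r for r in related if r["concept"] in ONTOLOGY_CONCEPTS]
    # Step 3: annotate each instance with its ontology concept.
    annotated = [{**r, "annotation": r["concept"]} for r in instances]
    # Step 4: cluster instances (here: identical annotation means similar).
    clusters = defaultdict(list)
    for inst in annotated:
        clusters[inst["annotation"]].append(inst)
    # Steps 5 and 6: fuse each cluster with the Avg rule, then evaluate the
    # result with a simple sanity check before emitting a knowledge object.
    knowledge_objects = []
    for concept, members in clusters.items():
        prices = [m["price"] for m in members]
        fused = mean(prices)
        if min(prices) <= fused <= max(prices):
            knowledge_objects.append((concept, (("price", fused),)))
    return knowledge_objects


if __name__ == "__main__":
    print(run_pipeline(RECORDS, "market 1"))
```

Each returned tuple mirrors the shape of a knowledge object, a name followed by (slot, value) pairs; a real system would replace every step with ontology-driven logic.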

3 Agricultural Knowledge Fusion Model

The agricultural knowledge fusion method provides integrated knowledge. It involves not only delivering valuable information to users via links but also analyzing and merging the results from agricultural information sources, for example by resolving inconsistencies among results and removing duplicates, based on the agricultural domain ontology.

Definition 1: Given a set of agricultural information sources (AISS), the PKF can be defined as a triple PKF = (AISS, M, Q), where:

AISS = {IS₁, IS₂, …, IS_n}.

M is the mapping between the global ontology and the ontologies of AISS, defined as M = (Ω, O, g).

Ω = {Ω₁, Ω₂, …, Ω_n}, Ω_i is the ontology of IS_i.

O is a global ontology.

g(Ω_i) is the mapping of Ω_i into O.

Q is the user query.

Definition 2: Given PKF = (AISS, M, Q), the agricultural knowledge fusion problem is defined as AKF = (PKF, f, FR), where:

PKF is the primary knowledge fusion.

f is the operating function, f(PKF) = {ω₁, ω₂, …, ω_n}, where ω_i is an information instance annotated by the ontology.

FR = {fr₁, fr₂, …, fr_n} is a set of knowledge fusion rules for attributes in agricultural ontology.

Definition 3: Given AKF = (PKF, f, FR), the solution K satisfies:

∀s∈slot(O), if ∃fr∈FR, then K·s = fr(s,ω).

K |= Q means K is the answer to the Q. In this paper, K is the knowledge object and is described as K = (K_Name, ((s₁, v₁), (s₂, v₂), …, (s_n, v_n))). We call (s_i, v_i) a knowledge unit. s_i is the slot attribute of a concept in ontology, and v_i is the value of s_i of an instance.

The above illustrates the agricultural knowledge fusion model in detail and gives a formal description of how to find a solution that merges the information from multiple agricultural information sources into consistent knowledge that will answer users’ queries.
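Definition 3 can be read as a small procedure: for every slot that has a fusion rule, apply that rule to the values reported by the annotated instances. The sketch below illustrates this reading under stated assumptions; the names `KnowledgeObject` and `solve`, and the representation of instances as dictionaries and of rules as plain callables, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Instance = Dict[str, object]                    # an annotated instance ω_i
FusionRule = Callable[[List[object]], object]   # a fusion rule fr


@dataclass(frozen=True)
class KnowledgeObject:
    """K = (K_Name, ((s1, v1), ..., (sn, vn)))."""
    name: str
    units: Tuple[Tuple[str, object], ...]       # knowledge units (s_i, v_i)


def solve(name: str,
          slots: List[str],
          instances: List[Instance],
          rules: Dict[str, FusionRule]) -> KnowledgeObject:
    units = []
    for s in slots:
        # Only slots with a fusion rule contribute: K.s = fr(s, ω).
        if s not in rules:
            continue
        values = [w[s] for w in instances if s in w]
        if values:
            units.append((s, rules[s](values)))
    return KnowledgeObject(name, tuple(units))


if __name__ == "__main__":
    omegas = [{"price": 1.90}, {"price": 1.95, "color": "gold"}]
    print(solve("wheat", ["price", "color"], omegas, {"price": max}))
```

Slots without a rule are simply left out of the knowledge object in this sketch; how such slots should be handled in practice is not specified in the definitions above.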

4 Agricultural Knowledge Fusion Architecture

In this paper, we propose a general agri-ontology-based knowledge fusion architecture as shown in Figure 2. The architecture consists of three main aspects: 1) agricultural ontology and fusion rules are the cornerstones of the convergence of agricultural knowledge; 2) agricultural ontology-based knowledge representation and matching, as well as mining and automatically selecting fusion rules based on the property of concept, are the key components in knowledge fusion; 3) in order to find more accurate knowledge to satisfy users’ queries, assessment of the fusion results is necessary to enhance knowledge fusion. All these parts form a complete system of knowledge fusion.

Figure 2

The agricultural knowledge fusion architecture.

4.1 AgriOnto

AgriOnto is the formal definition of agriculture and its relationships (see Figure 3). The definition and relationships form an integrated hierarchy of agriculture. With the labor object as the center of the agriculture hierarchy, we divide agriculture knowledge into seven taxa: labor object, production process, production technology, agriculture engineering, agriculture branch, agriculture environment, and agriculture regulation. Putting the labor object as the center of the agriculture knowledge hierarchy aims to aid those users who want labor object knowledge to access related knowledge of other taxa.

Figure 3

Some parts of the agriculture knowledge hierarchy. A well-structured hierarchy of agriculture knowledge is very useful for building an agriculture ontology. Our AgriOnto is built on this hierarchical structure.

4.2 Fusion Rules

Each fusion rule, such as Min, Max, and Avg, can be regarded as an aggregation function in a database (). We divide fusion rules into two types: the single data fusion rule and the multi data fusion rule.

Definition 4: The single data fusion rule (SFR) is a type of aggregation function such that:

$f : D_1 \times D_2 \times \dots \times D_n \to D$

where D_i is the value domain, unified into a single domain so that D₁ = D₂ = … = D_n = D. Given v_i ∈ D (i = 1, 2, …, n), ƒ(v₁, v₂, …, v_n) = v, v ∈ D. In this paper, the SFR includes Majr (Majority rule), Max, Min, Avg, Minr (Min-Priority rule), etc.

Definition 5: The multi data fusion rule (MFR) is a type of aggregation function such that:

$f : D_1 \times D_2 \times \dots \times D_n \to 2^D$

Given v_i ∈ D (i = 1, 2, …, n), ƒ(v₁, v₂, …, v_n) = D′, D′ ⊆ D. The MFR includes CInt (Interval Rule), Or, and And.
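The difference between the two rule types is the return type: a single data fusion rule returns one value of D, while a multi data fusion rule returns a subset of D. The following sketch shows that contrast in code. It covers only Majr, Max, Min, and Avg; Minr, Or, and And are omitted because their exact semantics are not given here, and `keep_all` is a hypothetical set-valued rule used purely to illustrate the MFR signature.

```python
from collections import Counter
from statistics import mean


def majr(values):
    """Majority rule: return the most frequently reported value.

    Ties are broken in favor of the value seen first.
    """
    return Counter(values).most_common(1)[0][0]


# Single data fusion rules (SFR): D x D x ... x D -> D
SINGLE_RULES = {
    "Majr": majr,
    "Max": max,
    "Min": min,
    "Avg": mean,
}


def keep_all(values):
    """Hypothetical multi data fusion rule (MFR): D x ... x D -> 2^D.

    Returns the set of distinct reported values, i.e. a subset D' of D.
    """
    return set(values)


if __name__ == "__main__":
    ages = [12, 13, 13]
    for rule_name, rule in SINGLE_RULES.items():
        print(rule_name, rule(ages))
    print("keep_all", keep_all(ages))
```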

In general, the single data fusion rule and the multi data fusion rule cannot be applied directly to an information set. Instead, we must analyze the query and the answer type and then define a combination of fusion rules. Usually, however, the user participates in rule selection to complete the knowledge fusion process. We have defined 13 fusion operator rules based on the global ontology. For example, the closed interval operator is a fusion operator defined as follows:

Definition 6: Given a domain D and possible values on it D′ = {v₁′, v₂′, …, v_n′}, the closed interval operator (CInt) satisfies:

$CInt(D') = [v_i, v_j], \text{ if } \forall v'_i \in D', \text{ then } v'_i \in [v_i, v_j]$

Example 1: If there exist three possible tuples: v₁ = (Wang da hong; age; 12), v₂ = (Wang da hong; age; 13), and v₃ = (Wang da hong; age; 15), then we will get CInt({v₁, v₂, v₃}) = (Wang da hong; age; [12–15]).
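Example 1 can be reproduced with a few lines of code. This is a minimal sketch that assumes the tightest enclosing interval is intended (the minimum and maximum of the reported values) and that tuples are represented as (entity, slot, value); the function names `cint` and `fuse_cint` are hypothetical.

```python
def cint(values):
    """Closed interval operator: smallest interval containing every value."""
    values = list(values)
    if not values:
        raise ValueError("CInt needs at least one value")
    return (min(values), max(values))


def fuse_cint(tuples):
    """Fuse (entity, slot, value) tuples that describe the same slot."""
    entities = {t[0] for t in tuples}
    slots = {t[1] for t in tuples}
    if len(entities) != 1 or len(slots) != 1:
        raise ValueError("tuples must share one entity and one slot")
    entity, slot, _ = tuples[0]
    return (entity, slot, cint(t[2] for t in tuples))


if __name__ == "__main__":
    v1 = ("Wang da hong", "age", 12)
    v2 = ("Wang da hong", "age", 13)
    v3 = ("Wang da hong", "age", 15)
    print(fuse_cint([v1, v2, v3]))
```

The interval is returned as a (low, high) pair, so the three tuples of Example 1 fuse into the age interval from 12 to 15.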

In our fusion rule selection, each rule is subject to conditions that can be deduced from the characteristics of the rule and from the query, which is defined as follows:

Definition 7: Given query ontology Ω, a knowledge fusion query can be formally defined:

$o \cdot \{(s_1, fr_1) = ?, \dots, (s_n, fr_n) = ?\} \mid cnt$

where o · {(s₁, fr₁) = ?, …, (s_n, fr_n) = ?} represents the query objects, and cnt is a set of constraint conditions. Here o is a concept or instance in Ω, each s_i is a slot (attribute) of o, and fr_i is a fusion rule. If fr_i is omitted, the query reduces to a general query in traditional information integration.

Example 2: Given a query = Potato · (price, Avg), the knowledge fusion system should return the average of the set of potato prices returned by information integration. If Avg is NULL, then the knowledge fusion system will return the potato prices in a way similar to traditional information integration. Often a user can select a rule according to their preference.
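The behavior described in Example 2 can be sketched as a function that applies the named rule when one is given and falls back to returning the raw integrated values otherwise. The record layout, the sample prices, the `RULES` table, and the optional `constraint` predicate standing in for cnt are assumptions made for illustration.

```python
from statistics import mean

# Rule names mapped to aggregation functions (illustrative subset).
RULES = {"Avg": mean, "Min": min, "Max": max}

# Hypothetical results of information integration.
RECORDS = [
    {"concept": "Potato", "market": "A", "price": 2.0},
    {"concept": "Potato", "market": "B", "price": 3.0},
    {"concept": "Wheat", "market": "A", "price": 1.9},
]


def answer(concept, slot, rule, records, constraint=None):
    """Evaluate the query concept . (slot, rule) | constraint."""
    values = [
        r[slot]
        for r in records
        if r["concept"] == concept
        and slot in r
        and (constraint is None or constraint(r))
    ]
    if rule is None:
        # No fusion rule: behave like traditional information integration.
        return values
    return RULES[rule](values)


if __name__ == "__main__":
    print(answer("Potato", "price", "Avg", RECORDS))
    print(answer("Potato", "price", None, RECORDS))
    print(answer("Potato", "price", "Avg", RECORDS,
                 constraint=lambda r: r["market"] == "A"))
```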

In the query ontology Ω, we define a default rule for each slot of a concept, involving two slot types: meta-slot and composite-slot. A meta-slot is a slot that cannot be divided semantically, while a composite-slot can be divided into several meta-slots. For example, the slot IdentityNo of a concept person is a meta-slot, but Name is usually a composite-slot that includes a meta-slot first-name and a meta-slot last-name. A fusion rule for a meta-slot is always pre-defined according to the meta-slot definition, but a composite-slot usually needs a concatenation rule. In order to acquire a high-quality answer, we need to extend the slots of a concept so that useless information can be filtered out. These additional slots are called data quality slots and include:

Authority (DQ_a): The data quality authority is used to measure the probability of information correctness in information sources.
Timeliness (DQ_t): Timeliness presents a means to estimate the goodness (or badness) of information in information sources in terms of time.
Completeness (DQ_c): The degree to which all data relevant to an application domain have been recorded in an information source.

Therefore, given a concept and its slot set {a₁, a₂, …, a_n}, the extensional slot set will be {a₁, a₂, …, a_n, DQ_a, DQ_t, DQ_c}.

4.3 Knowledge Inconsistency Problem Analysis

In general, knowledge consistency means that a judgment is in accord with both historical judgments and current facts. Conversely, inconsistency means a contradiction between the historical judgments and current facts. From the ontology perspective, consistency means that the logical relationships of the terminology are consistent, while inconsistency means that conflicts exist between some parts of the ontologies. For example, we define grain crops and cash crops as disjoint classes that do not have the same instances. If the class wheat belongs to both grain crops and cash crops, an inconsistency will occur.
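The grain crops and cash crops example suggests a simple mechanical check: report every class that is placed under two classes declared disjoint. The sketch below illustrates that one check only and is not the consistency method proposed in this paper; the data structures `DISJOINT` and `MEMBERSHIP`, the extra class cotton, and the function name are hypothetical.

```python
# Pairs of classes declared disjoint in the ontology (assumed fragment).
DISJOINT = [("grain crops", "cash crops")]

# For each class, the set of classes it is asserted to belong to.
MEMBERSHIP = {
    "wheat": {"grain crops", "cash crops"},   # violates disjointness
    "cotton": {"cash crops"},
}


def find_disjointness_conflicts(membership, disjoint_pairs):
    """Return (class, a, b) for every class placed under disjoint a and b."""
    conflicts = []
    for cls, parents in sorted(membership.items()):
        for a, b in disjoint_pairs:
            if a in parents and b in parents:
                conflicts.append((cls, a, b))
    return conflicts


if __name__ == "__main__":
    for cls, a, b in find_disjointness_conflicts(MEMBERSHIP, DISJOINT):
        print(f"inconsistency: {cls} belongs to both {a} and {b}")
```

A production system would typically delegate this kind of check to a description logic reasoner rather than hand-written loops.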

In this paper, agricultural ontology consistency includes consistency between the ontology definition and the knowledge based on the ontology. This means that we cannot obtain conflicting knowledge from the knowledge base. Generally, when a knowledge base exists, the absence of conflicting knowledge depends on the following conditions:

The consistency of concept definition. That is to say, the formal definition must have the same meaning as the informal one. Take the concept “dogs” as an example. If the formal definition of dogs is the same as that of the concept cats, an inconsistency exists.
The consistency of concept extension. Under either formal or informal concept definitions, conflicting knowledge can arise through concept explanation (including reasoning). For example, cats can catch mice, but we cannot say that mice can catch cats.
The consistency of axioms. The axiom system must not produce conflicting knowledge.

From the viewpoint of knowledge application, the knowledge base can guide users to make correct decisions and ensure that no confusing conclusions arise. In brief, consistency is an important criterion with which to evaluate an ontology-based knowledge base. Knowledge inconsistency will lead to unreliable service, which threatens knowledge correctness. This paper proposes a method with which to check ontology consistency.

Definition 8: Given a knowledge base K, the knowledge inconsistency problem is a triple KI = (K, Y, Q), where:

Y = {y₁, y₂, …, y_n} is a knowledge operation set.

Q is a given knowledge query.

Definition 9: Given a knowledge inconsistency problem KI = (K, Y, Q), if a knowledge conflict exists in K, it satisfies the following conditions:

∃ k, k₁₁, k₁₂, …, k_1j ∈ K, y₁₁, y₁₂, …, y_1j ∈ Y, $\sum_{l=1}^{j} y_{1l}(k_{1l}) \models k \wedge k \to Q$. The symbol |= ($\models$) indicates “reason out” and → represents “can satisfy”.

∃ k, k₂₁, k₂₂, …, k_2j ∈ K, y₂₁, y₂₂, …, y_2j ∈ Y, $\sum_{l=1}^{j} y_{2l}(k_{2l}) \models \neg k \wedge \neg k \to Q$.

From the above definitions, we see that the knowledge base is inconsistent if it contains two pieces of contradictory knowledge. It is very important to find a mechanism or method to check for this knowledge base inconsistency ().

5 AgriOnto-Based Knowledge Fusion

5.1 Equivalent Entity Distinguishing

Equivalent entity distinguishing uses a clustering algorithm to group equivalent entities into categories using identity slots (IS); that is to say, if IS(entity1) = IS(entity2), then entity1 is equivalent to entity2 from the entity viewpoint (entity1 ≈ entity2). We then regard the two entities as different descriptions of the same object. From the equivalent entity definition, we can conclude the following propositions:

Proposition 1: if E1 ≈ E2 ∧ E2 ≈ E3, then E1 ≈ E3.
Proposition 2: if E1 ≈ E2 ∧ E2 ≠ E3, then E1 ≠ E3.
Proposition 3: if E1 ≈ E2, then E2 ≈ E1.

In order to determine whether two entities are equivalent, we need to analyze the identity slots’ values:

Abbreviation. An abbreviation is a shorter way to say something, for example, Massachusetts = Mass.
Synonym. Given two words that are synonyms, they represent the same entity or concept, for instance, corn and maize.
Prefix & Suffix. An abbreviation using the first or last letter of each word, for example, IM = Instant Messaging.

If the data in the identity slots are pre-processed in this way and IS(entity1) = IS(entity2), then entity1 ≈ entity2.
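One way to realize this test is to normalize each identity slot value and then group entities whose normalized values are equal. The sketch below assumes small lookup tables for abbreviations, acronyms, and synonyms; the tables, the entity layout, and the function names are hypothetical. Because grouping is by equality of the normalized key, the resulting relation is symmetric and transitive, in line with the propositions above.

```python
from collections import defaultdict

# Hypothetical lookup tables; a real system would derive these from the
# ontology or from a thesaurus.
ABBREVIATIONS = {"mass.": "massachusetts", "im": "instant messaging"}
SYNONYMS = {"maize": "corn"}


def normalize(value):
    """Pre-process an identity slot value into a canonical key."""
    key = " ".join(value.lower().split())   # case and whitespace
    key = ABBREVIATIONS.get(key, key)       # abbreviations and acronyms
    return SYNONYMS.get(key, key)           # synonyms


def cluster_equivalent(entities, identity_slot):
    """Group entities whose normalized identity slot values are equal."""
    groups = defaultdict(list)
    for entity in entities:
        groups[normalize(entity[identity_slot])].append(entity)
    return list(groups.values())


if __name__ == "__main__":
    entities = [
        {"name": "Corn", "source": "IS1"},
        {"name": "maize", "source": "IS2"},
        {"name": "Wheat", "source": "IS3"},
    ]
    for group in cluster_equivalent(entities, "name"):
        print([e["source"] for e in group])
```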

5.2 Fusion Method

In our research, we define fusion rules at attribute granularity. Each fusion rule can be regarded as an aggregation function in a database, such as Min, Max, and Avg. In general, single data fusion rules and multi data fusion rules cannot be applied directly to an information set. Instead, we need to analyze the query and the answer type and then define the necessary combination of fusion rules. Usually, however, a user needs to participate in rule selection to complete the knowledge fusion process. Generally, the attribute constraint determines the rule selection, which is in turn affected by the query. We divide knowledge fusion into attribute fusion, instance fusion, and concept fusion.

• Attribute fusion

Attribute fusion merges the different values of an attribute. Consider the query “What price is the wheat at market 1?” (see Figure 4). The information fragments of two equivalent instances are extracted from the information sources. In this case, the two price values are inconsistent, so the final fused price will be “1.925¥/kilo” using the Avg rule. This is especially useful when one of the price values results from an editing error.

Figure 4

The extracted information fragments of two instances.

• Instance fusion

Instance fusion merges equivalent instances that have different descriptions of the same object (see Figure 5). Because most information sources describe a part of an object, the fused result is the union of the equivalent instances based on the attribute fusion.

Figure 5

The instance fusion process.
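Attribute fusion and instance fusion can be combined in one small routine: take the union of the slots of the equivalent instances and resolve each conflicting slot with its fusion rule. The following is a sketch under stated assumptions. The two input prices (1.90 and 1.95) are hypothetical values chosen so that Avg gives the 1.925 quoted above, since the content of Figure 4 is not reproduced here; keeping all distinct values for slots without a rule is also an assumption.

```python
from statistics import mean

# Default fusion rule per slot (illustrative).
DEFAULT_RULES = {"price": mean}


def fuse_instances(instances, rules):
    """Union of equivalent instances, with per-slot attribute fusion."""
    # Collect slots in first-seen order so the result is deterministic.
    slots = []
    for inst in instances:
        for s in inst:
            if s not in slots:
                slots.append(s)

    fused = {}
    for s in slots:
        values = [inst[s] for inst in instances if s in inst]
        distinct = list(dict.fromkeys(values))
        if len(distinct) == 1:
            fused[s] = distinct[0]        # sources agree, nothing to fuse
        elif s in rules:
            fused[s] = rules[s](values)   # attribute fusion by rule
        else:
            fused[s] = distinct           # no rule: keep all candidates
    return fused


if __name__ == "__main__":
    a = {"name": "wheat", "market": "market 1", "price": 1.90}
    b = {"name": "wheat", "price": 1.95, "unit": "yuan/kilo"}
    print(fuse_instances([a, b], DEFAULT_RULES))
```

The fused instance contains every slot that any source reported, which reflects the observation that most information sources describe only part of an object.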

• Concept fusion

Concept fusion takes into account the correlations among equivalent instances by combining different instances that the clustering algorithm has divided into different sets of equivalent instances (see Figure 6).

Figure 6

The concept fusion process.

6 Conclusion

Data have become a strategic resource as important as natural and human resources, carry great implied value, and have caught the attention of both the scientific and business communities. With the recent rapid growth in the amount of data, existing data processing technology has great difficulty in meeting the demands placed on it, and the data are very difficult to mine. In this paper, we propose a generic agricultural knowledge fusion method to fuse information from diverse information sources, so that a more comprehensive basis can be obtained for data analysis and knowledge discovery for agricultural big data. In recent years, information systems integration and business integration have received much attention (; ). Now we must pay attention to the integration of agricultural data in the area of big data because once the data are gathered and stored in an integrated database, they will have new value. This paper describes how to make full use of agricultural information from the perspective of knowledge fusion technology, which will accelerate the correct use of agricultural knowledge and provide a knowledge basis for big data mining. In the future, we will further study data consistency, ontology-based rules, and fusion algorithms and conduct more application tests in the open agricultural big data environment.

Data Science Journal

Proceedings Papers
