Improving database quality by eliminating duplicate records

Redundant or duplicate data are among the most troublesome problems in database management and applications. Approximate field matching is the key to resolving the problem: it identifies semantically equivalent string values that appear in syntactically different representations. This paper considers token-based solutions and proposes a general field matching framework that generalizes the field matching problem across different domains. By introducing the concept of String Matching Points (SMP) into string comparison, string matching accuracy and efficiency are improved compared with other commonly applied field matching algorithms. The paper discusses how field matching algorithms can be developed from the general framework. The framework and a corresponding algorithm are tested on a public data set from the NASA publication abstract database. The approach can be applied to similar problems in other databases.


INTRODUCTION
With the development of the WWW and databases, data have become available from many different sources. To facilitate data storage and retrieval, the majority of useful data is stored in large databases under database management systems (DBMS). Because of problems at different stages of the data flow (Dasu and Johnson, 2003), it is very easy to introduce data quality problems into databases. According to Kukich (1992a, 1992b), the average error rate is 1–3% in typed data, 1–16% in data processed by optical character recognition (OCR), and 5–6% in data obtained by voice communication. The errors are caused by misspellings, typos, standardized and non-standardized abbreviations, character recognition problems, or phonetic problems inherent in these data acquisition methods. Rahm and Do (2000) analyze data quality problems in single and integrated data sources, classifying them as single-source and multi-source problems. Single-source data quality problems are those caused by data acquisition and data modeling, such as illegal values, inter-field mapping problems, and duplicate records; multi-source data quality problems mainly refer to duplicate or redundant records caused by different data conventions and formats across multiple databases.
For the reasons mentioned above, data quality problems prevent real-world objects from being recorded correctly. When they occur in identifying attribute domains, these problems also obstruct efficient data access. The damage done by low-quality data is often illustrated with customer data, as described in Ananthakrishna, Chaudhuri and Ganti (2002). When multiple variations of a customer's contact information are saved in databases, such as [Lisa Simpson, Seattle, WA, USA, 98025] and [Lisa Simson, Seattle, WA, USA, 98025], the duplicated information significantly increases the cost of direct mailing. In addition, such duplicates cause incorrect results in data analyses, such as surveys of the customers of a specific chain store or identification of customers' shopping patterns. With the strikingly large costs they impose on enterprises and industries, as cited in a white paper from Trillium Software (2004), data quality problems are drawing increasing attention.
In this paper, the central problem of field matching in database management is discussed. A general field matching framework is proposed to treat token-based field matching in general. By introducing the concept of String Matching Points (SMP) into string comparison, improved efficiency and accuracy are observed. Major string matching problems in different attribute domains are analyzed, and a corresponding field matching algorithm is developed for a test data set from the NASA publication abstract database. Further improvements in computational efficiency are discussed in the experiments, and promising results are observed.
The rest of this paper is organized as follows. Section 2 reviews the related work on the field matching problem and related algorithms; Section 3 presents the concept of String Matching Points in string comparison; Section 4 presents the proposed field matching framework; Section 5 depicts the experimental setups and results; and Section 6 concludes the paper.

RELATED WORK
The nature of field matching is to identify similarities between domain values in string representations in order to determine their equivalency. String matching has been studied intensively for its applications in signal processing, computational biology, information retrieval, intrusion detection and other areas, as summarized in Sankoff and Kruskal (1983). In those areas, the goal of approximate string matching is to identify the Longest Common Subsequence (LCS), to locate a feature of one sequence in another sequence, or to identify abnormalities in the target sequences. Many algorithms have been developed to solve approximate matching problems in these fields; Navarro (2001) provides a comprehensive overview.
Since field matching aims to identify equivalent string values in attribute domains of databases, algorithms must be developed according to the characteristics and goals of the string matching problem: identifying semantically equivalent (identifying) attribute values in syntactically different representations. As in other application areas, the equivalency of two string values is modeled by a similarity degree in the range [0,1], where one means equivalent and zero means no similarity. Selecting a proper approximate matching algorithm is a domain-dependent task and should be done according to the nature of the quality problems in the selected domains.
In the literature, field matching algorithms can be classified into three broad categories: a) character-based, b) q-gram-based and c) token-based algorithms.

Character-based Field Matching
Character-based string matching is a large family of techniques with wide application areas, as summarized by Navarro (2001). Although designed with different strategies, character-based string matching treats a string as a sequence of characters, as in w = c1 c2 ... cn, and compares two strings character by character. The most commonly applied algorithm in this category is the Levenshtein algorithm (Levenshtein, 1966), also called the simple edit distance algorithm, which calculates the minimum number of atomic editing operations (e.g., insertions, deletions, transpositions and substitutions) required to transform one string into the other. The edit distance of two strings is denoted ED(str1, str2); for example, ED("universty", "university") = 1. The Levenshtein algorithm is often applied through its normalized variations (Hernandez and Stolfo, 1995; Elfeky, Verykios and Elmagarmid, 2002; Chaudhuri et al., 2003). The most popular variation normalizes the edit distance of two strings by their maximum length, formulated as NED(str1, str2) = 1 - ED(str1, str2) / max(|str1|, |str2|). Character-based field matching is capable of comparing strings with slight edit variations created in data entry and OCR processing. There are other character-based field matching algorithms, such as the Smith-Waterman algorithm (Smith and Waterman, 1981) and the Jaro algorithm (Jaro, 1989). Some algorithms are developed to handle first-character errors, such as the variation of the Jaro algorithm described in Winkler (1990). Many others are designed for searching for a string pattern in long textual files in online search, such as the two-way string comparison algorithm (Crochemore and Perrin, 1991). Online pattern search is a different problem from field matching and will not be addressed in this paper.
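The edit distance and its normalized variation can be sketched in Python as follows; the function names are illustrative, not from the paper:

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def normalized_similarity(s1: str, s2: str) -> float:
    """NED-style similarity: 1 - ED / max(length), as in the text."""
    if not s1 and not s2:
        return 1.0
    return 1.0 - edit_distance(s1, s2) / max(len(s1), len(s2))
```

With this sketch, edit_distance("universty", "university") evaluates to 1, matching the example above.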

Q-Gram-Based Field Matching
Given a string S and an integer q, the set of q-grams of S, denoted Gq(S), is obtained by sliding a window of length q over the characters of S. For instance, the 3-gram set of "shackleford" can be represented as G3("shackleford") = {sha, hac, ack, ckl, kle, lef, efo, for, ord}. String matching using q-grams is based on the following observation: if two strings are similar to each other, they share a large number of q-grams (Ukkonen, 1992). The similarity degree of two strings is usually modeled as the ratio of common q-grams to the total number of distinct q-grams in the two strings.
Combined with database management systems, q-grams can be applied efficiently in string matching, as reported in Gravano et al. (2001). Q-gram-based methods use sequential substrings to assess the similarity of two strings, and positional q-grams can handle out-of-order or word-transposition problems. But q-grams are an inherently space-expensive technique, producing (m-q+1) q-grams for a string of length m, and high space consumption means high computational cost in database systems.
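A minimal sketch of q-gram extraction and the common-to-distinct ratio described above (illustrative Python, not from the paper):

```python
def qgrams(s: str, q: int = 3) -> set:
    """Set of q-grams obtained by sliding a window of length q over s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_similarity(s1: str, s2: str, q: int = 3) -> float:
    """Ratio of shared q-grams to the distinct q-grams of both strings."""
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    if not g1 and not g2:
        return 1.0
    return len(g1 & g2) / len(g1 | g2)
```

For "shackleford" (length 11) this yields the m-q+1 = 9 distinct 3-grams listed above.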

Token-Based Field Matching
Token-based string matching considers strings as sequences or sets of words, as in S = w1 w2 ... wm. The importance of token-based field matching in resolving field matching problems has been addressed in the literature (Kukich, 1992a; Navarro et al., 2003; Han et al., 2004). Both simple and sophisticated algorithms have been developed and applied in field matching. The simplest token-based algorithm is the Jaccard similarity metric, which counts the number of common words NC and the number of distinct words ND of the two strings in comparison, and takes the ratio NC/ND as their similarity degree. The simple field matching algorithm (Monge and Elkan, 1996) divides the number of matching words by the mean of |str1| and |str2|, where |str1| is the number of words in str1 and |str2| is the number of words in str2. Given proper thresholds, these algorithms are capable of resolving some equivalence errors without introducing many false positives (i.e., irrelevant string values identified as matches), but they generate a large number of false negatives (i.e., semantically equivalent string values left unidentified).
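The Jaccard similarity metric described above can be sketched as follows (illustrative Python):

```python
def jaccard_similarity(s1: str, s2: str) -> float:
    """Common words (NC) over distinct words (ND) of two strings."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 and not w2:
        return 1.0
    return len(w1 & w2) / len(w1 | w2)
```

For example, "new mexico tech" and "new mexico institute" share 2 of 4 distinct words, giving a similarity of 0.5.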
In Lee et al. (1999), an algorithm was proposed to improve field matching accuracy. First, it sorts the tokens in each attribute value of the selected domain, and then sorts the records on the selected domain(s) to bring duplicate records as close together as possible. The concept of Record Similarity (RS) was proposed to calculate the similarity of records by combining the similarity of tokens and the similarity of field values. By defining string matching patterns, this algorithm improved field matching accuracy significantly.
Because of the sorting process, however, the algorithm misses some matches. Consider the processed strings comput science department new mexico institute of mining and technology and cs dept new mexico tech, both lowercased. After removing the stop words and sorting, the two sequences become comput institute mexico new science technology and cs dept mexico new tech. It is now impossible to match cs with comput science, and hence to identify the two as completely equivalent strings, because the original allocation of words has been changed.
From the above discussion, it is obvious that word allocation plays a significant role in string matching. Under different contexts, the same word means and functions differently. For example, the word new has different semantic meanings in the fragments a new cloth and new mexico state. Therefore, it is important to preserve the original word allocation in string comparison.
A more sophisticated field matching algorithm, the recursive field matching algorithm, was presented in Monge and Elkan (1996, 1997). This algorithm recursively compares substrings of two strings strA and strB to obtain the maximum degree of similarity, and calculates the overall similarity by averaging the best per-substring similarity degrees, as in sim(strA, strB) = (1/|strA|) * sum_i max_j sim(subAi, subBj). In the recursive string matching algorithm, strings are considered as sequences of words, and the best matching is achieved by comparing substrings iteratively. Combined with string matching patterns, this algorithm greatly improves string matching accuracy. But it is an extremely expensive algorithm due to its recursive comparison of substrings. Another essential problem is that it is often difficult to find the best matching substrings, especially when the related fields are messy.
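A sketch of this averaging scheme at the single-token level, using Python's standard-library SequenceMatcher as a stand-in for the inner similarity (the paper does not prescribe this inner metric):

```python
from difflib import SequenceMatcher

def token_sim(a: str, b: str) -> float:
    """Inner word-level similarity; any [0,1] metric can be substituted."""
    return SequenceMatcher(None, a, b).ratio()

def recursive_match(str_a: str, str_b: str) -> float:
    """Monge-Elkan style matching: average, over tokens of str_a,
    of the best similarity against any token of str_b."""
    tokens_a, tokens_b = str_a.split(), str_b.split()
    if not tokens_a or not tokens_b:
        return 0.0
    return sum(max(token_sim(a, b) for b in tokens_b)
               for a in tokens_a) / len(tokens_a)
```

The cost is visible in the nested loops: every token of one string is compared against every token of the other.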
Hence, an ideal field matching algorithm should consider the allocation of words in strings, define how strings match each other through string matching patterns, and identify potential matching substrings quickly in order to improve string matching efficiency. In this paper, we propose a field matching algorithm that addresses these problems to improve string matching accuracy and efficiency.

THE CONCEPT OF STRING MATCHING POINTS
The concept of String Matching Points (SMP) was inspired by the way we compare two strings with our own eyesight, obtaining a coarse similarity degree in two steps. In the first step, the two strings are scanned briefly and potential matching substrings are identified. In the second step, the two potential matching substrings are compared carefully, character by character or according to pre-defined string matching rules. These two steps are repeated to obtain the overall similarity of the two strings.
More formally, SMP are defined as follows: given two strings strA and strB, an SMP of a substring sub in strA is the index of a token tsmp in strB such that sub and tsmp share the same first character. By this definition, a substring may have multiple SMP, or none, in the other string. For example, the SMP of the substrings of cs dept new mexico tech in comput science department new mexico institute mining technology (all lowercased) are 1, 3 and 4 respectively, since cs shares its first character with comput, dept with department, and new with new.
If two strings genuinely match each other, each substring will spot its SMP, even under abbreviations and shortened representations. Spotting SMP can fail for erroneous variations in which the error occurs in the first letter. But according to Kukich (1992a), the first letter of a word can be trusted more than the others in string comparison, so matches with SMP will dominate in equivalent strings.
By introducing SMP, string matching becomes an iterative procedure of two steps: 1) looking for the string matching points and 2) comparing the subsequent substrings in detail. Since SMP searching is as simple as character comparison, it is an efficient strategy for improving string matching efficiency. As string matching proceeds, SMP move forward past unmatched tokens. Figure 1 shows the string matching process with the assistance of SMP. SMP searching makes identification of the maximum matching substrings easier, rather than the trial-and-error process of the recursive field matching algorithm proposed by Monge and Elkan (1997).
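The SMP search in step 1 amounts to a first-character scan, which can be sketched as below (0-based indices, whereas the example in the text counts tokens from 1):

```python
def find_smp(sub: str, tokens: list) -> list:
    """Return indices of tokens sharing sub's first character (its SMP)."""
    return [i for i, t in enumerate(tokens) if t and t[0] == sub[0]]
```

On the running example, the substrings cs, dept and new of cs dept new mexico tech locate their SMP at positions 0, 2 and 3 of the other string.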

General Framework
As discussed above, the general field matching framework simulates the two steps of string comparison and follows the basic steps of token-based field matching algorithms, as shown in Figure 2. The actual string matching starts from the shorter string, because the shorter string is more likely to be compressed by abbreviations and hence requires less SMP searching. The framework considers both string matches with SMP and matches without SMP. As discussed above, in two equivalent strings, matches in the first category are expected to be the majority; those in the second category are complementary, picking up matches with errors in the first characters.
In string matching with SMP, the first step is to identify the SMP for unmatched tokens, and then to construct substrings from the SMP and carry out the actual comparisons. In practical applications, the field matching algorithm can be developed from the framework by customizing the string matching patterns according to the data quality problems in the selected domain. Matches without SMP are caused by errors in the first characters, and no pre-defined string matching patterns apply to them. Therefore each unmatched substring in one string must be compared with all remaining unmatched substrings in the other string to find the maximum matches, as in the recursive field matching algorithm. Since matches in this category are infrequent, they do not cost much.

String Matching Patterns
To resolve string matching problems, it is important to define rules or string matching patterns that match syntactically different substrings. As different domains have different types of equivalence errors, different string matching patterns are required. To facilitate algorithm development, we classify all possible string matching patterns into the following categories:
1) One-to-one matching: matches between two single words, including:
• P1: exact match, as match → match;
• P2: prefixed match, as sci → science;
• P3: concatenation of prefix and postfix, as dept → department;
• P4: shorter representation, as blvd → boulevard;
• P5: error resilient, as institute → instutite;
2) One-to-n matching: matches between a single word and a multi-word string, including:
• P6: acronymic match, as NMT → New Mexico Tech;
• P7: concatenation of prefixes, as CalTech → California Institute of Technology;
3) N-to-n matching: matches between two multi-word strings, including:
• P8: word boundary problem, as New Mexico Tech → NewMexico Tech.
The above string matching patterns have been discussed in the context of field matching in the literature (Monge and Elkan, 1996, 1997; Lee et al., 1999). Obviously, not all patterns are required for every field matching application, nor is this list complete for all field matching problems.
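A few of these patterns can be sketched as simple predicates (hypothetical helper names; lowercased input assumed, and P5-style error tolerance omitted for brevity):

```python
def p1_exact(a: str, b: str) -> bool:
    """P1: exact match, e.g. match -> match."""
    return a == b

def p2_prefix(a: str, b: str) -> bool:
    """P2: prefixed match, e.g. sci -> science."""
    return a != b and (b.startswith(a) or a.startswith(b))

def p6_acronym(word: str, phrase: str) -> bool:
    """P6: acronymic match, e.g. nmt -> new mexico tech."""
    return word == "".join(t[0] for t in phrase.split() if t)
```

A full implementation would also need the fuzzier patterns (P3, P5, P7, P8), which require edit-tolerant comparison rather than exact prefix tests.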
In order to improve string matching efficiency, substring construction in the general framework is based on the defined string matching patterns, as follows:
• for one-to-one matching, words WAi and WBj are compared directly;
• for one-to-n matching, continuously unmarked tokens from WAi in strA and continuously unmarked tokens from WBj in strB are concatenated as subAi and subBj respectively, and string comparisons are made between WAi and subBj, or WBj and subAi;
• for n-to-n matching, continuously unmarked tokens from WAi in strA and continuously unmarked tokens from WBj in strB are concatenated as subAi and subBj respectively, and string comparisons are made between subAi and subBj.
Because comparisons in the one-to-one, one-to-n and n-to-n categories have different computational costs, it is important to select the proper string matching patterns for the string matching problems that actually arise in a specific database. For example, if the defined matches are one-to-one, it is not necessary to concatenate neighboring words into multi-word substrings; two tokens obtained from the two strings are enough. For n-to-n matches, continuously unmatched tokens must be concatenated into multi-word substrings in both strings. Obviously, word concatenation and matching of multi-word substrings cost more than one-to-one matches. Statistical results on equivalence errors in a set of 2,000 distinct affiliation names and a set of 2,000 distinct authors' names from the NASA publication abstract database on Astronomy and Astrophysics, and a set of oil producers' names from the New Mexico ONGARD database, are shown in Table 1. They were obtained by analyzing the equivalence errors among equivalent string values in these domains. As the statistics indicate, one-to-one patterns are sufficient to cover the majority of equivalence errors.

Setups
A set of affiliation names was extracted from the NASA publication abstract database on Astronomy and Astrophysics. Since this is an official publication database on Astronomy and Astrophysics, it plays an important role in literature retrieval in related research, and data standardization will greatly benefit data management and other research conducted on the database. Among authors' names, authors' affiliations, paper titles, publication titles, and other domains, authors' affiliation is the most problematic domain, containing many string matching problems, as summarized in Table 1.
The experiments are set up to classify equivalent affiliations into one class, which can greatly aid data standardization or duplicate removal in the later stages of data cleaning. To eliminate the effect of searching algorithms, a complete comparison is applied, in which every string is compared with every other string to determine their equivalency. Strings with similarity degrees higher than a pre-defined threshold are classified into one class. String matching accuracy is evaluated by the accuracy of the classification.
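The complete-comparison classification can be sketched as a greedy clustering pass; this is one plausible reading of the setup, with the similarity function and threshold left as parameters:

```python
def cluster_by_similarity(values, sim, threshold=0.85):
    """Assign each value to the first class containing a member whose
    similarity to it reaches the threshold; otherwise start a new class."""
    classes = []
    for v in values:
        for cls in classes:
            if any(sim(v, m) >= threshold for m in cls):
                cls.append(v)
                break
        else:
            classes.append([v])
    return classes
```

Any of the field matching algorithms discussed earlier can be plugged in as sim, which is what makes the accuracy comparison across algorithms possible.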
In this practice, each resulting class belongs to one of the following types when compared with a validated classification result:
Type 1: all representations in the class are of the same affiliation, and the class contains all representations of that affiliation; this is the perfect case.
Type 2: all representations in the class are of the same affiliation, but the class does not contain all representations of that affiliation; this is still a correct classification.
Type 3: all other cases, including classes containing representations of multiple affiliations; these are wrong classifications.

Customized Field Matching Algorithm
Based on Table 1, equivalence errors in the one-to-one category are essential for resolving the majority of field matching problems. By considering one-to-one equivalence errors (P1, P2, P3, P4 and P5) in matching with SMP, and the error-resilient type of equivalence error (P5) in matching without SMP, the majority of string matching problems can be resolved. With these string matching patterns, the actual field matching algorithm is customized as shown in Figure 3.

Comparison of Field Matching Algorithms
In comparison of field matching algorithms, we consider classification accuracy and time efficiency. Time efficiency is evaluated by computational costs, and classification accuracy is evaluated by both perfect classification and correct classification. Results are shown in Figures 4, 5 and 6.
From Figure 4, it is obvious that the time complexity of the new algorithm lies between the simple token-based algorithms, including the Jaccard similarity metric and the simple field matching algorithm, and the complicated token-based and character-based algorithms, including normalized edit distance (NED), Jaro and the record similarity algorithm (Lee et al., 1999). This indicates that the new algorithm belongs to the sophisticated token-based algorithms that consider multiple string matching patterns, while remaining efficient by skipping a large number of comparisons through SMP searching before detailed comparison. The new algorithm is less expensive than NED, yet more sophisticated in considering potential string matching problems such as word transpositions, abbreviations and single-error misspellings.
Results of perfect classification (Type 1) best reveal classification accuracy. As shown in Figure 5, all algorithms reach their peak performance at a threshold of 0.85, which indicates that a similarity threshold of 0.85 can be considered the optimal threshold for identifying equivalent values with these algorithms. The best performing algorithms are NED, Jaro and the new algorithm, with the new algorithm being much cheaper.
Correct classification, comprising Types 1 and 2, is also an indicator of the performance of a field matching algorithm. As shown in Figure 6, NED, Jaro and the new algorithm perform best. This observation shows that token-based algorithms work no better than character-based algorithms unless the necessary string matching patterns are designed, and it confirms the conclusion that proper string matching patterns are required in string matching. Because correct classification counts classes that are correctly clustered, where one perfect cluster may be split into several correct clusters, the results are more complicated. For the new algorithm, the accuracy of correct classification basically climbs as the similarity threshold increases, peaking at a threshold of 0.85.

Effect of String Matching Patterns
The effect of string matching patterns is studied on the same problem, as shown in Figure 7. Removing the prefixed match pattern degrades perfect classification, because the prefixed match is an important pattern, as indicated in Table 1, and ignoring prefixed matches causes losses in identifying equivalent values. But it does not necessarily degrade correct classification, because correct classification does not require all equivalent values to be clustered into one cluster.
Results also reveal that once the most important string matching patterns are considered, adding one or more further patterns produces no substantial increase in accuracy. This accords with the statistical results in Table 1, which show that one-to-one patterns dominate the matches in different attribute domains and that the other equivalence errors are minor problems in string matching.

CONCLUSION
In this paper, a central problem of entity identification, also known as object identification or field matching in the data cleaning community, is addressed. The following conclusions are reached:
• The purpose of the general field matching framework is to facilitate solving field matching problems across different domains and databases.
• By introducing SMP, string matching is converted into a two-step process: first identifying potential matching starting points for particular substrings, and then matching substrings based on pre-defined string matching patterns.
• Algorithm development from the general framework is addressed, and an algorithm is developed for a set of affiliations from the NASA publication abstract database.
• Experimental results demonstrate the advantages of the new field matching strategy over other field matching algorithms, in both string matching accuracy and computational efficiency.
• With the proposed general framework, field matching problems in different domains and databases can be addressed more easily and more accurately.