
Research Papers

Improving database quality through eliminating duplicate records

Authors:

Mingzhen Wei,

University of Missouri-Rolla, 1870 Miner Circle, Rolla, MO 65409, USA

Andrew H Sung,

New Mexico Institute of Mining and Technology, Socorro, NM 87801, USA

Martha E Cather

New Mexico Petroleum Recovery Research Center/New Mexico Tech, Socorro, NM 87801, USA

Abstract

Redundant or duplicate data are among the most troublesome problems in database management and applications. Approximate field matching is the key to resolving the problem: it identifies semantically equivalent string values that appear in syntactically different representations. This paper considers token-based solutions and proposes a general field matching framework that generalizes the field matching problem across different domains. By introducing the concept of String Matching Points (SMP) in string comparison, string matching accuracy and efficiency are improved compared with other commonly applied field matching algorithms. The paper discusses the development of field matching algorithms from the proposed general framework. The framework and the corresponding algorithm are tested on a public data set from the NASA publication abstract database. The approach can be applied to address similar problems in other databases.
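The token-based idea in the abstract can be illustrated with a minimal sketch. Note the tokenizer, the Jaccard similarity measure, and the threshold below are illustrative assumptions, not the SMP algorithm developed in the paper:

```python
# Minimal illustration of token-based approximate field matching.
# Hypothetical sketch: the tokenizer, Jaccard measure, and threshold
# are illustrative choices, not the paper's SMP-based algorithm.
import re


def tokenize(field: str) -> set:
    """Lowercase a field value and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", field.lower()))


def token_similarity(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two field values."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0  # two empty fields are treated as identical
    return len(ta & tb) / len(ta | tb)


def is_duplicate(a: str, b: str, threshold: float = 0.7) -> bool:
    """Flag two field values as likely duplicates above a similarity threshold."""
    return token_similarity(a, b) >= threshold
```

Because comparison operates on token sets, fields that differ only in word order, case, or punctuation, such as "New Mexico Tech, Socorro NM" versus "new mexico tech Socorro, NM", score as identical, which is the behavior that makes token-based matching attractive for cleaning bibliographic records.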
DOI: http://doi.org/10.2481/dsj.5.127
How to Cite: Wei, M., Sung, A.H. & Cather, M.E., (2006). Improving database quality through eliminating duplicate records. Data Science Journal. 5, pp.127–142. DOI: http://doi.org/10.2481/dsj.5.127
Published on 28 Nov 2006.
Peer Reviewed
