Entity matching
Publish date: 2012-03-13
Report number: FOI-R--3265--SE
Pages: 104
Written in: English
Keywords:
- record matching
- duplicate entry detection
- entity resolution
- vertex similarity
- ensemble classification
- data fusion
- information fusion
Abstract
This report reviews and surveys earlier work in the field of entity matching, as well as current software implementations in this area. Entity matching uses string matching methods known as field metrics to find similar text strings that could correspond to similar names or addresses. The outputs of these field metrics are often passed to classification methods that determine whether the strings (or the entire entries the strings are part of) match or do not match. These classification methods include both supervised and unsupervised methods originating in statistics and machine learning. This report proposes using additional classifiers, including vertex similarity and text-mining methods, to generate further evidence that two entities match. Vertex similarity is studied in network analysis and aims to identify nodes that share a large fraction of their neighbors, indicating that the entities have similar social or communication networks. Text-mining methods are useful for finding similar documents and other longer written texts, indicating that two entities have the same language usage or deal with the same topics. Small experimental evaluations using citation data from two different sources are offered to test these two methods of finding similar entities. Furthermore, the report proposes methods based on data fusion to combine these classifiers with the traditional field metrics into an ensemble.
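The three kinds of evidence named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which measures the report uses, so every concrete choice here is an assumption made for illustration: normalized Levenshtein distance stands in for a field metric, the Jaccard index over neighbor sets for vertex similarity, bag-of-words cosine similarity for a text-mining method, and a weighted mean for the data-fusion step. All function names and the sample records are hypothetical.

```python
from collections import Counter
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def field_similarity(a: str, b: str) -> float:
    """Assumed field metric: edit distance scaled to [0, 1], 1 = identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def vertex_similarity(neighbors_a: set, neighbors_b: set) -> float:
    """Assumed vertex similarity: Jaccard index of the two neighbor sets."""
    union = neighbors_a | neighbors_b
    if not union:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(union)


def text_similarity(text_a: str, text_b: str) -> float:
    """Assumed text-mining measure: cosine similarity of word counts."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def combined_score(scores, weights):
    """Placeholder fusion step: weighted mean of the individual scores."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


if __name__ == "__main__":
    # Two hypothetical citation records that may refer to the same author.
    name = field_similarity("J. Andersson", "Jon Andersson")
    network = vertex_similarity({"coauthor1", "coauthor2", "coauthor3"},
                                {"coauthor2", "coauthor3", "coauthor4"})
    text = text_similarity("entity matching with field metrics",
                           "field metrics for entity matching")
    print(combined_score([name, network, text], weights=[2, 1, 1]))
```

The weighted mean is only the simplest possible way to merge the scores; in practice the weights, or a trained classifier replacing them, would have to be chosen from labeled matching and non-matching pairs.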