Downes.ca ~ Stephen's Web ~ Merging Data Sets Based on Partially Matched Data Elements

Stephen Downes

Knowledge, Learning, Community

This is a difficult read (especially as the code is not authored with clarity in mind), but it's a really interesting topic. At issue is how you equate data elements that only partially match. For example, human readers have no problem knowing that the strings "S. Korea" and "South Korea" refer to the same country. But for a computer, this is a difficult problem. This post describes one algorithm for matching these sorts of pairs. You might think: it's just country names, do it by hand. But gRSShopper extracts author data from posts. Are "Clayton Wright" and "C.R. Wright" the same person? I have 8617 author records; I can't do it by hand. So: a difficult but significant problem.
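To make the problem concrete, here is a minimal sketch of partial string matching in Python. It is not the algorithm from the linked post; it swaps in the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure. The helper names (`normalize`, `similarity`, `best_match`) and the 0.6 threshold are assumptions chosen for illustration only.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, turn periods into spaces, and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def similarity(a: str, b: str) -> float:
    """Return a score between 0.0 and 1.0 for how alike two names are."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(name: str, candidates: list[str], threshold: float = 0.6):
    """Return the closest candidate, or None if nothing reaches the threshold.

    The threshold is an illustrative assumption, not a recommended value.
    """
    scored = [(similarity(name, c), c) for c in candidates]
    score, match = max(scored, default=(0.0, None))
    return match if score >= threshold else None


if __name__ == "__main__":
    countries = ["South Korea", "North Korea", "South Africa"]
    print(best_match("S. Korea", countries))  # -> South Korea
    print(best_match("Canada", countries))    # -> None

    # A score only flags a candidate pair; it cannot confirm identity.
    print(round(similarity("Clayton Wright", "C.R. Wright"), 2))
```

A score like this only narrows the field. It will likely flag "Clayton Wright" and "C.R. Wright" as a plausible pair, but whether they are the same person still needs other evidence (shared affiliations, links, co-authors) or a human check.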




Stephen Downes, Casselman, Canada
stephen@downes.ca

Copyright 2024
Last Updated: Nov 23, 2024 4:28 p.m.

Creative Commons License.
