For some odd reason, this week at work I’ve been pulled back into data cleaning and data structuring. St. Mungos, a local hospital is trying to bounce their service data against Mayhew Insurance claims data. We received several hundred scrolls of names and claims. None of their data is quite structured like our claims and enrollment data, so we’ve been building a crosswalking system. For data security purposes, none of the hospital data contains Social Security numbers and for Pain In The Ass purposes, none of their data contains the Mayhew Insurance coverage number.
So that means we have to find the identities of people in their service universe by a combination of birth date, last name, first name, provider, and date of service. Creating automated queries that match on a significant number of not quite unique keys has allowed for very high confidence matches on 97% of the probable unique individuals. The remaining 3% are manual look-ups and judgement calls.
For instance, is Harold J. Potter with a DOB of 7/31/80 from the hospital file the same person as Harry Potter DOB 7/31/80, with multiple dates of death in the insurance file the same individual? Eyeballing the data and making an informed judgement, I am guessing at 95% confidence that this is the same person. How about Sirius R. Black and S. Regulas Black? Are they the same person? Probably, but not definately. Being slightly more serious, how do you match Markos Moulitsas Zúniga to Kos Moulitsas where the DOBs are off by a day? Going through the 3% is costing the company at least a dollar per name, and probably closer to $1.50/name.
For the purposes of my current task getting 98% or 99% of the data set matched at high confidence is more than sufficient. The research plan allowed for a 5% mismatch rate.
Eyeballing the names that are in the unmatchable bucket, there are three notable groups. The first are recent births where the name is Baby Girl Smith matched against 15 girls born on that date in the shared claims universe. These are no big deal. The second group of unmatched names are women whose relationship status has changed multiple times. My wife would be in this group as the hospital data has her as Jane Doe while the insurance data has her as Jane Mayhew. The final group are people with names that fall outside of culturally dominant naming conventions. Names with extra punctuation and names whose ordering structures don’t follow the First/Last or First/Maiden-Last convention fall out very readily.
This is important because error is acceptable even if we are willing to spend money to minimize the error. Verifying voter rolls based on name and date of birth matches is guaranteed to produce significant unmatched. There is a good argument that voting rolls should be as up to date as possible, but to do that in a good faith manner requires spending serious money as one person dropped from the rolls who should not have been dropped is one too many. $5 to $10 per name at 100% confidence would be reasonable estimate of the cost of scrubbing an entire list of votes. The vast majority of names would be passed at a dime per name, while the last 5% to 1% might cost $2 per name to verify, and the last 1% will cost $50 per name to verify.
Scrubbing an entire state for pennies per name is a guarantee of producing false positives and statistically certain mismatches. And given who is more likely to have mismatched names between data sets, cynicism should reign.