A couple bucks per name and voting rights

by David Anderson| June 19, 20153:31 pm| 21 Comments

This post is in: Domestic Politics, Election 2016, Fuck The Poor, Ever Get The Feeling You've Been Cheated?, I Can't Believe We're Losing to These People, I Reject Your Reality and Substitute My Own

Facebook Tweet Email

For some odd reason, this week at work I’ve been pulled back into data cleaning and data structuring. St. Mungos, a local hospital is trying to bounce their service data against Mayhew Insurance claims data. We received several hundred scrolls of names and claims. None of their data is quite structured like our claims and enrollment data, so we’ve been building a crosswalking system. For data security purposes, none of the hospital data contains Social Security numbers and for Pain In The Ass purposes, none of their data contains the Mayhew Insurance coverage number.

So that means we have to find the identities of people in their service universe by a combination of birth date, last name, first name, provider, and date of service. Creating automated queries that match on a significant number of not quite unique keys has allowed for very high confidence matches on 97% of the probable unique individuals. The remaining 3% are manual look-ups and judgement calls.

For instance, is Harold J. Potter with a DOB of 7/31/80 from the hospital file the same person as Harry Potter DOB 7/31/80, with multiple dates of death in the insurance file the same individual? Eyeballing the data and making an informed judgement, I am guessing at 95% confidence that this is the same person. How about Sirius R. Black and S. Regulas Black? Are they the same person? Probably, but not definately. Being slightly more serious, how do you match Markos Moulitsas Zúniga to Kos Moulitsas where the DOBs are off by a day? Going through the 3% is costing the company at least a dollar per name, and probably closer to $1.50/name.

For the purposes of my current task getting 98% or 99% of the data set matched at high confidence is more than sufficient. The research plan allowed for a 5% mismatch rate.

Eyeballing the names that are in the unmatchable bucket, there are three notable groups. The first are recent births where the name is Baby Girl Smith matched against 15 girls born on that date in the shared claims universe. These are no big deal. The second group of unmatched names are women whose relationship status has changed multiple times. My wife would be in this group as the hospital data has her as Jane Doe while the insurance data has her as Jane Mayhew. The final group are people with names that fall outside of culturally dominant naming conventions. Names with extra punctuation and names whose ordering structures don’t follow the First/Last or First/Maiden-Last convention fall out very readily.

This is important because error is acceptable even if we are willing to spend money to minimize the error. Verifying voter rolls based on name and date of birth matches is guaranteed to produce significant unmatched. There is a good argument that voting rolls should be as up to date as possible, but to do that in a good faith manner requires spending serious money as one person dropped from the rolls who should not have been dropped is one too many. $5 to $10 per name at 100% confidence would be reasonable estimate of the cost of scrubbing an entire list of votes. The vast majority of names would be passed at a dime per name, while the last 5% to 1% might cost $2 per name to verify, and the last 1% will cost $50 per name to verify.

Scrubbing an entire state for pennies per name is a guarantee of producing false positives and statistically certain mismatches. And given who is more likely to have mismatched names between data sets, cynicism should reign.

« Terrorism, Racism, And The Charleston Massacre

Portraits in Cowardice »

21Comments

1.

Amir Khalid

June 19, 2015 at 3:47 pm

I think I can guess what you’ve been reading lately. ;)
2.

Jim Parish

June 19, 2015 at 3:48 pm

Shortly after my father died, we discovered that the date he and we had always celebrated as his birthday was not the date on his birth certificate. We have no idea which date was actually correct.
3.

Richard Mayhew

June 19, 2015 at 3:50 pm

@Amir Khalid: Actually, I have not read HP in at least the past 15 months, and I am holding off on re-reading it until my daughter is slightly older so we can enjoy 3-4 pages a night for the next 5 years together.

I was looking for some good examples that I could use with absolutely no privacy concerns . So Harry Potter it was…
4.

Fester Addams

June 19, 2015 at 3:58 pm

At least no one’s name was

Robert’); DROP TABLE Patients;–

(XKCD #327)
5.

Mike J

June 19, 2015 at 4:19 pm

@Fester Addams: Not an issue for competent DB admins.
6.

patrick II

June 19, 2015 at 4:22 pm

your numbers assume good intentions for the people trying to match names. Jeb Bush in Florida in 2000 hired a company who’s algorithm was refined enough that it differentiated between the various criminals he was trying to wipe off the poles and actual people with the right to vote. so he fired them and hired a company with a less rigorous algorithm with more false matches and got his brother elected.
7.

Richard mayhew

June 19, 2015 at 4:30 pm

@patrick II: exactly. This is just a bullshit detection tool post
8.

Roger Moore

June 19, 2015 at 4:31 pm

How about Sirius R. Black and S. Regulas Black? Are they the same person?

How about Sirius R. Black and Stubby Boardman? I’ve heard they’re actually the same person.
9.

Eric S.

June 19, 2015 at 4:50 pm

I’ve got a friend whose first name is “Jean-Paul”. On almost every legal document or financial form he is known as “Jean”. Most computer systems just balk and don’t know what to do with the hyphen.

Another friend has the same first and last name (like “James James”) and he is Junior. His credit report shows he’s owned cars and houses prior to his birth. He’s had it scrubbed and they pop back up.

All of us are programmers and we kind of shrug with these things happen attitude. It doesn’t make them any less troublesome and for the most part this is with people with good intentions.
10.

BillinGlendaleCA

June 19, 2015 at 4:57 pm

@Jim Parish: My wife’s birthday on legal docs(birth certificate, passport, driver’s license, etc) is not her real birthday, it was the date her father got around to registering her birth(about 7 months later).
11.

FlyingToaster

June 19, 2015 at 5:03 pm

I remember the multiple phone calls to our Blue when BabyGirl Toaster turned into WarriorBabyGirl Doktor. The hospital sent them the damn birth certificate with everyone’s name on it, and they still couldn’t figure out who was who, until HerrDoktor drove over to Cambridge and got three copies and faxed it to them (again).

Her Blue card did turn up when she was 7 weeks old.
12.

RSA

June 19, 2015 at 5:27 pm

@Eric S.:

Most computer systems just balk and don’t know what to do with the hyphen.

My last name has a period and a space in it. I can’t count the times a computer system has told me that my last name is invalid. Or that a clerk has had to try several variations to find me in the system.

It used to be irritating. Okay, it still is, but I’m more used to it.
13.

Old Dan and Little Ann

June 19, 2015 at 5:32 pm

My 48 year old sister has been writing Kelly Anne her whole life as first and middle name. My mom told me yesterday my sister recently looked at her original birth certificate and Kelley Ann was typed on it. Whoops.
14.

piratedan

June 19, 2015 at 5:44 pm

@RSA: the problem is that most medical systems have an issue as to when to use the hyphen as a “punctuation” versus an “arithmetic” expression.Plus you’re talking about delineating names into easily defined fields, it used to be just a first name, middle name and last name. Now you have suffixes, hyphens, apostrophes and fourth and fifth names. This is because we’re no longer as homogenized as a country as we were prior to the mid 70’s when medicine started to get invaded by tech and it’s still slow to catch up culturally. This makes it difficult for everyone from admitting clerks to nurses to try and decipher just who is whom.. The other side of this coin is that medical data is so space intensive, that usual transmission of data isn’t just solely numbers, but entire forms, medical histories and insurance requirements.

With many hospitals losing money (be they for profit or not) the idea of springing for millions of dollars to upgrade is difficult to sell to folks. Yet the idea that you can handle Risk Management better by reducing patient misidentification is a huge motivator and when the ACA was passed, there were numerous incentives that many hospitals took advantage of to try and keep up with the age.

The thing is, you have the same issues with airlines and other businesses ( I bet even banking has their issues as well).
15.

Annamal

June 19, 2015 at 5:49 pm

Slightly off from the point you are making (which is absolutely valid) but my partner is trying to do very similar sorts of data matching, just with parish records from the 18th and 19th centuries.

Since the preference is for 100% certainty of matches, ironically the foreign or peculiar names ( or those with odd occupations) actually come with a bit of an advantage since the data is full of John Smiths who were labourers and can not be matched but Thomas Peach who was a petrefactioner is much easier to identify.

We’re using metaphone3 to account for the scattershot approach to spelling back then.
16.

RSA

June 19, 2015 at 5:59 pm

@piratedan: Nice summary! Thanks.
17.

WaterGirl

June 19, 2015 at 6:06 pm

Excellent example and great post, Richard!
18.

Roger Moore

June 19, 2015 at 6:21 pm

@piratedan:

Plus you’re talking about delineating names into easily defined fields, it used to be just a first name, middle name and last name. Now you have suffixes, hyphens, apostrophes and fourth and fifth names. This is because we’re no longer as homogenized as a country as we were prior to the mid 70’s when medicine started to get invaded by tech and it’s still slow to catch up culturally.

Names have always been more complex than first, middle, and last, or at least they have been since well before the mid 70’s. There are plenty of Western European names that involve spaces and punctuation (Jean-Pierre, O’Malley, von Trapp, St. John, etc.) that should have clued people in. I suspect that a lot of the problem was that computer back in the 70s were underpowered by modern standards so they didn’t want to deal with a free-form name. So they made the simplifying assumption that names could fit into their nice little boxes and put the weight of dealing with outliers on the unlucky people who didn’t fit their patterns. Once that pattern was started, it got stuck by backwards compatibility.
19.

piratedan

June 19, 2015 at 6:33 pm

@Roger Moore: to a degree, part of the issue is history itself, a lot of the systems got around the exclusion of punctuation values by allowing the space or simply ignoring the punctuation itself. Why is the punctuation such a big deal? Goes back to the languages that were used to handle medical data because back in the day, space used to be an issue and cramming the amount of data intensive records, that were filled with diagnosis codes (ICD values, I believe that we’re now using ICD 10 these days) textual reports and raw data results from instrumentation and many of the systems were compartmentalized, i.e. dietary has theirs, as does radiology, the lab, the hospital information system and some have separate ER systems as well as some hospitals separate out their medical records. We haven’t even covered the instrumentation that is used, because now they transmit images, films from biopsies, autopsies, EKG’s etc

So you could chalk that up to being underpowered but a lot of the issues come back to space, because back in the day, the amount of space that you had available to store medical record data was limited. Now that this is much less of a concern, we’re seeing the integration of a lot more windows style information being laid over databases with middle ware easing the translation from the look to how the data is stored. Still a long way to go because of the limitations on how the systems talk to each other. Even one size fits all systems from a sole vendor have issues because their individual modules aren’t even fully integrated (as many of these systems bought out competing systems for one module and are STILL integrating them into seamless communication modules.
20.

Zoe Johnson

June 19, 2015 at 10:46 pm

Oh man, I wish I was in your shoes. I can’t think of anything more fun than matching lists of names and cleaning. I used to match motor and voter lists to get jury lists. 98% was our goal. I used SAS and especially soundex coding to get em close enough to figure it out. I got forcibly retired because I was to old at 58. Damn I love data cleaning and list matching. Most fun I can think of other than playing with grandbabies. I envy you. Those were the good old days.
21.

Big Jim Slade

June 21, 2015 at 5:49 pm

Jeb Bush and Katherine Harris should be in prison for purposefully disenfranchising thousands of citizens.

Comments are closed.

brantl on Tuesday Late Night Open Thread (Apr 24, 2024 @ 4:55am)

SWMBO on Tuesday Late Night Open Thread (Apr 24, 2024 @ 4:50am)

Chet Murthy on Tuesday Late Night Open Thread (Apr 24, 2024 @ 4:29am)

SectionH on Tuesday Late Night Open Thread (Apr 24, 2024 @ 4:17am)

Martin on Tuesday Late Night Open Thread (Apr 24, 2024 @ 3:54am)

The rest of the comments were smacking Boebert like she was a piñata.

The party of Reagan has become the party of Putin.

It’s easier to kill a dangerous animal than a man who just happens to have different thoughts/values than one’s own.

The revolution will be supervised.

Balloon Juice, where there is always someone who will say you’re doing it wrong.

Within six months Twitter will be fully self-driving.

Never entrust democracy to any process that requires republicans to act in good faith.

Good lord, these people are nuts.

He imagines himself as The Big Bad, Who Is Universally Feared… instead of The Big Jagoff, Who Is Universally Mocked.

Is it negotiation when the other party actually wants to shoot the hostage?

Innocent people don’t delay justice.

Shallow, uninformed, and lacking identity

In short, I come down firmly on all sides of the issue.

Pelosi: “He either is stupid, or he thinks the rest of us are.” Why not both?

Too often we hand the biggest microphones to the cynics and the critics who delight in declaring failure.

Schmidt just says fuck it, opens a tea shop.

…and a burning sense of injustice to juice the soul.

fuckem (in honor of the late great efgoldman)

I’m just a talker, trying to find a channel!

No one could have predicted…

Glad to see john eastman going through some things.

Just because you believe it, that doesn’t make it true.

Russia bombs Ukraine’s maternity hospitals; Republicans in the House can’t sort out supporting Ukraine.

Israel is using food as a weapon of war. Unforgivable.

A couple bucks per name and voting rights

Reader Interactions

21Comments