Hacker News

by Ryan Harman

Twice this week, I have come across embarassingly bad data (successfulsoftware.net)

69 points by hermitcrab 4 hours ago

stared 3 hours ago

I dislike the premise. I mean, good data is wonderful.

But if institutions are expected to release clean data or nothing, almost always it is the latter.

What is important is to offer as much methodology and as many caveats as possible, even if in an informal way. There is a difference between knowing that "data covers 72% of companies registered in..." and assuming the data is complete and authoritative when in fact parts of it are missing.

(Source: 10 years ago I worked a lot with official data. All data requires cleaning.)

Mordisquitos an hour ago

But surely we should expect some basic sanity checks on published data? This isn't some petrol stations being placed in the middle of a field due to minor typos or bad rounding, or some petrol stations' prices being listed as all 1.00 £/l out of laziness, or even a case of all unknown locations being listed as 0°0'0" N, 0°0'0" E by default. What the author reports appear to be mistakes which should be rather trivially detectable on input.

ZiiS 7 minutes ago

The problem is stats can actually do more with all the data, including obvious errors. If you start filtering out records where someone mis-entered the lat/long, you might introduce a new bias.

chaps an hour ago

Sure we should indeed expect that they do that. But look at enough data and you'll learn that those expectations are a path towards never-ending frustration. I've been there, spending >100 hours cleaning data... that never got published because I was too damn focused on the dozens of years of errors that many, many people created.

To be clear, I'm not saying that we should accept messy data. Just, reality is messy and it's naive to think we can catch and remove all of reality's messiness -- which includes the bureaucratic slop that led to the data being published in the first place.

freehorse an hour ago

I don't think these issues are close to the issues the article talks about. The author does not talk about data coverage, data collection methodologies, missing values or whatever, but about data that is actually wrong, i.e. location coordinates, prices, and numbers that make no sense, including swapped latitude/longitude and misplaced decimal points.

On the other hand, I agree that bad (but usually fixable) data is better than no data.

stared an hour ago

Yep, in real data expect genuinely confusing columns, NaNs cast to values like 1673, duplicates, etc.

I prefer getting data with swapped lat/lng (a trivial fix), or prices labeled as dollars but actually in cents, to getting no data.

sd9 2 hours ago

Agreed, pretty much all data is flawed. I still want my hands on it.

readthenotes1 2 hours ago

I read the premises as "1. at least look at it 2. Have a way to fix it"

Those seem reasonable asks.

Edit to add: the tragedy of the school in Minab is an example of how bad things can go--and it just hints at how much worse bad data can be.

chaps 3 hours ago

I have mixed feelings about this. On one hand, yeah, stop publishing garbage data, but as a FOIA nerd... I'll take the data in whatever state it is in. I'm not personally going to be able to clean the data before I receive it. Does that mean I shouldn't release the unsanitized (public) data knowing that it has garbage data within? Hell no. Instead, we should learn and cultivate techniques to work with shit data. Should I attempt to clean it? Sure. But it becomes a liability problem very, very quickly.

torginus 3 hours ago

What does it mean to clean the data?

Do you remove those weird implausible outliers? They're probably garbage, but are they? Where do you draw the line?

If you've established the assumption that the data collection can go wrong, how do you know the points which look reasonable are actually accurate?

Working with data like this has unknown error bars, and I've had weird shit happen where I fixed the tracing pipeline, and the metrics people complained that they corrected for the errors downstream, and now due to those corrections, the whole thing looked out of shape.

chaps 2 hours ago

"What does it mean to clean the data?"

This isn't possible to answer generally, but I'm sure you know that.

Look -- I've been in nonstop litigation for data through FOIA for the past ten years. During litigation I can definitely push back on messy data and I have, but if I were to do that on every little "obviously wrong" point, then my litigation will get thrown out for me being a twat of a litigant.

Again, I'd rather have the data and publish it with known gotchas.

Here's an example: https://mchap.io/using-foia-data-and-unix-to-halve-major-sou...

Should I have told the Department of Finance to fuck off with their messy data? No -- even if I want to. Instead, we learn to work with its awfulness and advocate for cleaner data. Which is exactly what happened here -- once me and others started publishing stuff about tickets data and more journalists got involved, the data became cleaner over time.

hermitcrab 3 hours ago

So you expect the 1000s of people trying to use the fuel price data to each individually clean and validate it, rather than the supplier doing it?

yorwba 2 hours ago

One of those people can republish their cleaned and validated version and the 999 others can compare it to the original to decide whether they agree with the way it was cleaned or not.
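A minimal sketch of what that comparison could look like, assuming both versions can be loaded as dictionaries keyed by a station ID. The function name `diff_records` and the field names are hypothetical, chosen only for illustration:

```python
def diff_records(original, cleaned):
    """List every difference between two versions of a dataset.

    Both arguments are assumed to be dicts mapping a record ID (string)
    to a dict of field values. Returns a list of
    (record_id, field, old_value, new_value) tuples; field is None when
    a whole record was added or removed.
    """
    changes = []
    for key in sorted(original.keys() | cleaned.keys()):
        old_row = original.get(key)
        new_row = cleaned.get(key)
        if old_row is None or new_row is None:
            # Record exists in only one of the two versions.
            changes.append((key, None, old_row, new_row))
            continue
        for field in sorted(old_row.keys() | new_row.keys()):
            if old_row.get(field) != new_row.get(field):
                changes.append(
                    (key, field, old_row.get(field), new_row.get(field))
                )
    return changes


# Hypothetical example: the cleaner swapped lat/lon back for station "a".
original = {
    "a": {"lat": -0.12, "lon": 51.5},
    "b": {"lat": 52.0, "lon": -1.0},
}
cleaned = {
    "a": {"lat": 51.5, "lon": -0.12},
    "b": {"lat": 52.0, "lon": -1.0},
}
for change in diff_records(original, cleaned):
    print(change)
```

Publishing the list of changes alongside the cleaned file lets the other users see exactly which values were touched and judge each fix for themselves.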

chaps 3 hours ago

What...?

GMoromisato 3 hours ago

Clean data is expensive--as in, it takes real human labor to obtain clean data.

One problem is that you can't just focus on outliers. Whatever pattern-matching you use to spot outliers will end up introducing a bias in the data. You need to check all the data, not just the data that "looks wrong". And that's expensive.

In clinical drug trials, we have the concept of SDV--Source Data Verification. Someone checks every data point against the official source record, usually a medical chart. We track the % of data points that have been verified. For important data (e.g., Adverse Events), the goal is to get SDV to 100%.

As you can imagine, this is expensive.

Will LLMs help to make this cheaper? I don't know, but if we can give this tedious, detail-oriented work to a machine, I would love it.

hermitcrab 3 hours ago

>Clean data is expensive--as in, it takes real human labor to obtain clean data.

Yes, data can contain subtle errors that are expensive and difficult to find. But the 2nd error in the article was so obvious that a bright 10-year-old would probably have spotted it.

GMoromisato 3 hours ago

Agreed--and maybe they should have fixed it.

But sometimes the "provenance" of the data is important. I want to know whether I'm getting data straight from some source (even with errors) rather than having some intermediary make fixes that I don't know about.

For example, in the case where maybe they flipped the latitude and longitude, I don't want them to just automatically "fix" the data (especially not without disclosing that).

What they need to do is verify the outliers with the original gas station and fix the data from the source. But that's much more expensive.

gdulli 3 hours ago

Why would you give this sort of work to a machine that can't be responsibly used without checking its output anyway?

GMoromisato 3 hours ago

It's not obvious to me that LLMs can't be made reliable.

torginus 3 hours ago

Data and metrics are 90% of what upper management sees of your project. You might not care about them, and treat them as an afterthought, but organizationally they are almost the most important thing about it.

People who don't heed this advice get to discover it for themselves (I sure did).

If you can't make the data convincing, you'll lose all trust, and nobody will do business with you.

genthree 2 hours ago

I have learned that you must have data.

I have also learned that rarely does anyone care if it’s any good, or means anything. This is generally true, but it’s especially true if you are going with the prevailing winds of whatever management fads are going on.

Like, right now, you can definitely get away with inflating the efficacy of “AI” any way you can, in almost any company. Nobody with any authority will call you on it.

Look at what management’s talking about and any pro-that numbers you come up with can be total gibberish, nobody minds. “Oh man, collecting good numbers for this and getting a baseline etc etc is practically impossible” ok so don’t and just use bad numbers that align with what management wants to do anyway. You’ll do great.

torginus 20 minutes ago

Not sure, but I have worked a lot on stuff where the metrics were often very easily convertible to business decisions, like expenses or income. Things like how much we need to pay for infrastructure, how much money each customer or product brings in and how, etc.

If the company were an airplane, upper management was essentially flying it by instrument. It would've been a scandal if the metrics had serious issues.

Some of the metrics less directly tied to business stuff were a bit more 'creative' - as in I could justify why I did them that way, but still not 100% solid.

Stuff like optimizing data pipelines, where data scientists' experiments that used to take 1hr now only took 10 mins.

I could say that data people were 6x as productive, but it's just as possible they were simply more careless with what they ran. But whatever, a white lie.

However, saying that stuff takes 1/6th the time when in fact it doesn't is an absolute no-go. Neither is not knowing why there is a run that took 500 hours or 5 seconds, both of which should be impossible.

Doing that stuff destroys the confidence in the rest of the data.

agent_anuj 3 hours ago

It is not just embarrassing, it can potentially kill your demo, project or even product, as users will look at the data first and the tech behind it second. If the data is wrong, to them it means the tech does not work. I never took data seriously during my demos in the first 10 years of my career, and no wonder the audience rejected most of my work even though it was backed by solid platforms.

Phlogistique 3 hours ago

It's better to publish the garbage data than to not publish it, though. I would worry about complaining too much lest they just decide to stop publishing it because it creates bad PR.

nick__m 3 hours ago

As long as the garbage data is authentic and the method used to produce it is adequately detailed, I agree with you that: "it's better to publish the garbage data than to not publish it"

But fake data, or garbage data without the method, is better left unpublished!

hermitcrab 3 hours ago

Hard disagree on that. They just need a basic smell test before they put it out.

Tempest1981 3 hours ago

Agree. Maybe just add a Disclaimer.md file.

bobro 2 hours ago

This article assumes that there is a person with dedicated time to validate the data. Imagine you want this data and ask for it, but the government says, “sorry, we have this data, but we read an article that said we can only publish it if we spend a lot of time validating it. This data changes frequently and we don’t have a chunk of a full-time data analyst’s salary to spend on it, so we just aren’t going to publish anything. We’d rather put out nothing than embarrass ourselves, so you can’t even try to validate it yourself.”

hermitcrab 37 minutes ago

>we don’t have a chunk of a full-time data analyst’s salary to spend on it

I found the errors in a few minutes with a $198 tool.

chaps 2 hours ago

In fact, the government agencies will argue that they have zero legal obligation to clean the data, let alone figure anything about the data, and that they're just giving you the data as-is. This happened to me on a FOIA call where I was trying to get data from the county state's attorney. They insisted they could only run a specific report and that they had no obligation to run any query, meaning I can't even get access to the data I need.

Clean vs not clean data is the wrong fight.

albert_e 3 hours ago

Concluding passage:

> Authors should have their work proof read

Agreed.

Opening passage:

> A quick plot of the latitude and longitude shows some clear outliners

"outliners"

Ouch!

hermitcrab 3 hours ago

OP here. Ouch indeed. I did actually get it proofread. But that was missed. I can't fire my proofreader, as we are married. ;0)

Now fixed.

rdiddly 3 hours ago

Not fixed at this hour

bobosola an hour ago

A couple of days after the UK Fuel Finder service launch last month, I wrote a hobby site using its API to get the cheapest local fuel prices: https://fuelseeker.net. I too discovered prices which had obviously been entered in pounds rather than pennies, or were even missing altogether in some cases. You would think that they could have done a bit more basic data cleansing on the server to catch that type of thing.

But, hey, we’re all wise after the event. To their credit though, they do seem to be actively reacting to feedback. I also contacted them about the bad data issue, and they are now adding user warnings about bad price values at the point of data entry (according to https://www.developer.fuel-finder.service.gov.uk/release-not...).
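For what it's worth, the pounds-versus-pence mix-up is the kind of thing a few lines can flag. This is only a sketch, not how Fuel Finder or my site actually does it: the function name `classify_price` and the plausible price range are assumptions picked for illustration, and the check flags suspicious values rather than silently rewriting them.

```python
from typing import Optional

# Plausible range for a pump price in pence per litre.
# Both bounds are illustrative assumptions, not official limits.
PENCE_MIN, PENCE_MAX = 80.0, 300.0


def classify_price(value: Optional[float]) -> str:
    """Classify a reported price that is supposed to be in pence per litre.

    Returns one of:
      "missing"     - no value was supplied
      "ok"          - value is in the plausible pence range
      "pounds?"     - value looks like it was entered in pounds
      "implausible" - none of the above; needs a human to look at it
    """
    if value is None:
        return "missing"
    if PENCE_MIN <= value <= PENCE_MAX:
        return "ok"
    if PENCE_MIN <= value * 100 <= PENCE_MAX:
        return "pounds?"
    return "implausible"


for reported in (142.9, 1.429, None, 0.0):
    print(reported, classify_price(reported))
```

Anything not classified as "ok" could be held back for review or shown with a warning instead of going straight into a cheapest-price table.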

hermitcrab 38 minutes ago

There is an obvious incentive for petrol stations to 'accidentally' enter too low a price, so they can get to the top of the table on services like yours. So they probably need to do more than add warnings.

hermitcrab 2 hours ago

Why did the title of this post get moderated from:

"Stop Publishing Garbage Data, It’s Embarrassing"

To the rather lamer:

"Twice this week, I have come across embarassingly bad data"

mlaretallack 3 hours ago

I saw the RAC one this morning and thought I was misreading the graph, as why would the RAC publish such an obvious mistake?

I have written my own Home Assistant custom component for the UK fuel finder data, and yes, the data really is that bad.

alias_neo 3 hours ago

I was looking at that RAC chart this morning. Given it's Sunday, and I was reading before my morning coffee, I'm not ashamed to say it took me a good few seconds of zooming in and out to realise they'd used a decimal point where a comma should have been.

Easy typo to make, but seriously, does no one even take a cursory look at the charts when publishing articles like this? The chart looks _obviously_ wrong, so imagine how many are only slightly wrong and are missed.

The fuel prices one could surely be solved with a tiny bit of validation; are the coordinates even within a reasonable range? Fortunately, in the UK, it's really easy to tell which is latitude and which is longitude due to one of them being within a digit or two of zero on either side.
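Something like the following would do as a first pass. It's a rough sketch: the bounding box numbers are my own approximate assumptions for the UK, and `check_coordinates` is a made-up name. It only reports what it finds and leaves the data alone, so a suspected swap can be confirmed at the source rather than "fixed" silently.

```python
# Approximate bounding box for the UK. These limits are illustrative
# assumptions, not values taken from the Fuel Finder service.
LAT_MIN, LAT_MAX = 49.0, 61.0
LON_MIN, LON_MAX = -9.0, 2.0


def in_uk_box(lat: float, lon: float) -> bool:
    """True if the point falls inside the rough UK bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


def check_coordinates(lat: float, lon: float) -> str:
    """Return "ok", "swapped?" or "out_of_range" for a reported location.

    The latitude and longitude ranges do not overlap for the UK, so a
    point that only fits the box after swapping is very likely a swap.
    """
    if in_uk_box(lat, lon):
        return "ok"
    if in_uk_box(lon, lat):
        return "swapped?"
    return "out_of_range"


print(check_coordinates(51.5, -0.12))   # central London, as reported
print(check_coordinates(-0.12, 51.5))   # same point with the fields swapped
print(check_coordinates(0.0, 0.0))      # the classic default location
```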

Frank-Landry 3 hours ago

Did a bot write this title?

hermitcrab 3 hours ago

If you are putting out data without doing even the most basic validation, then you should be ashamed.

ramon156 3 hours ago

What about most of Show HN's projects nowadays? Sometimes the docs straight up lie, and it takes 5 minutes to figure that out. Should they also be ashamed?

What about people who don't know how their own code works? Despite it working flawlessly? I'm asking because I don't really know.

Calazon 3 hours ago

> Sometimes the docs straight up lie, and it takes 5 minutes to figure that out. Should they also be ashamed?

Yes.

akudha 3 hours ago

How is it fair to compare a Show HN project with official government datasets? People depend on government datasets, multi-billion dollar businesses are built on top of them. A show HN project is typically someone building it in a weekend. They’re not even remotely in the same league.

Sure, it is expensive to check every number, but at least some of it can be automated and flagged for human review, no? Swapped lat/long numbers, for example.

add-sub-mul-div 3 hours ago

This has become a spam site for AI shovelware projects that are nearly always posted by accounts with no activity here outside of self promotion.

subscribed 3 hours ago

If they publish a lie they should be ashamed, even if their lie is orders of magnitude less impactful.

And if someone publishes flawless code but has no idea how it works, it's quite clearly not their code, and they should be ashamed if they claim it is.

It's just, like, my opinion, but I like it :)

hermitcrab 3 hours ago

>Sometimes the docs straight up lie, and it takes 5 minutes to figure that out. Should they also be ashamed?

Yes. Lying is bad, even if some people are trying hard to normalise it.

>What about people who don't know how their own code works? Despite it working flawlessly?

I think that is fine, as long as you aren't making untrue claims.
