In Data Analytics, there’s one phrase I hear over and over—so often it’s become a reflex: “We need clean data before we can do anything.”
And honestly, over the years that sentence has come to really annoy me. Don’t get me wrong: “Shit in, Shit out” is true. But could we please not always operate in extremes? After years of hearing it, it starts to feel less like a genuine concern and more like a default excuse: to delay analytics projects, to justify bigger budgets for data integration, or simply to avoid accountability.
I’m especially annoyed when a discussion about Data Analytics & AI kicks off and someone throws in that generic roadblock. Because, seriously: if you’ve been gathering data for years and it is STILL not ready to do SOMETHING, someone has really failed.
But recently, I heard something that finally put this frustration into words! I attended the TECH’N’DRINKS X THOUGHTWORKS Meetup, where Tiankai Feng gave a talk and said a sentence I had been missing for over a decade:
Data must be Fit for Purpose
Exactly that! Clean data is not a goal in itself. Data must be clean enough to fit a certain purpose. In some cases, a rough estimate is better than nothing. In others, you must be 100% accurate (by law). So simply demanding that all data be “clean” might be premature optimization.
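To make that concrete, here is a minimal sketch of what “clean enough for a purpose” could look like in Python. Everything in it is an assumption for illustration: the `Purpose` class, the thresholds, the example numbers, and the choice of completeness (share of non-missing values) as the only quality measure. Real projects will care about more dimensions than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Purpose:
    """A hypothetical use case with the data quality it actually needs."""
    name: str
    min_completeness: float  # required share of non-missing values, 0.0 to 1.0


def completeness(values):
    """Return the share of values that are not None (0.0 for an empty list)."""
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def fit_for_purpose(values, purpose):
    """Check the data against the purpose, not against an abstract 'clean'."""
    return completeness(values) >= purpose.min_completeness


# Illustrative thresholds only: a trend estimate tolerates gaps,
# a report required by law does not.
rough_estimate = Purpose("rough revenue trend", 0.80)
legal_report = Purpose("statutory report", 1.00)

# Made-up monthly figures with one missing value (9 of 10 present).
revenue = [120.0, 95.5, None, 130.2, 101.0, 99.9, 110.4, 87.3, 140.0, 125.6]

for purpose in (rough_estimate, legal_report):
    verdict = "fit" if fit_for_purpose(revenue, purpose) else "not fit"
    print(f"{purpose.name}: {verdict}")
```

The same dataset passes for the rough trend and fails for the statutory report. The data didn’t change, only the question asked of it.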
But again, let’s not operate in extremes. This doesn’t mean you should integrate data without caring about quality (Shit in, well …). But don’t overdo it!
So next time someone insists on “clean data,” ask them: “Clean for what?”