Monday – my preferred day to learn new stuff! As you know, this quarter is all about building a data strategy. Customers sometimes ask: “Could you build a data strategy for us in one go?” The simple answer is: “no.” The more elaborate answer: building a data strategy involves many moving parts. Every Monday of this quarter, we address one of these moving parts. Could we work on each of them and – over the course of multiple months – build a data strategy for you? Definitely! In this post, I will address the question of data quality (DQ).
“You know what statisticians say all the time? Garbage in, garbage out.”
Let us start with the “why” again. Why are we fixing data issues? Because without decent data, the company loses revenue (imagine a recommendation engine on an ecommerce website recommending the wrong products to its audience) and suffers from poor decision-making (which in turn means lost revenue, right?).
What kind of data quality issues can one encounter? Here’s a list of the most common ones, with a small code sketch after the list to make a few of them concrete (I’ll start with those that fall under PEBCAK – if you don’t know this acronym, look it up 😸):
- Invalid data: imagine people entering faulty data into a system, resulting in negative inventory levels, for example.
- Inconsistent formats: when people are asked to write a date, you’ll find all kinds of variations (e.g. 2 November 2022, 02/11/2022, 11/02/2022, 11/02/22).
- Inaccurate data (for sales teams: members who don’t practice “CRM hygiene”).
- Duplicate data (due to human or system error).
- Incomplete information (for example, omitting middle names when searching a birth register may prevent you from finding the right person).
- Data inconsistency (mixing metric and imperial units being the most obvious example).
- Imprecise data (due to ETL processes or statistical choices: grouping data, for example, makes it “less precise”).
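To make a few of these concrete, here is a minimal pandas sketch. The dataset and column names (`order_id`, `order_date`, `inventory_level`) are made up for illustration; it flags negative inventory, normalizes mixed date notations and drops duplicate rows:

```python
import pandas as pd

# Toy dataset illustrating invalid data, inconsistent formats and duplicates.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "order_date": ["2 November 2022", "02/11/2022", "02/11/2022", "2022-11-02"],
    "inventory_level": [10, -5, -5, 3],
})

# Invalid data: inventory levels should never be negative.
invalid_rows = df[df["inventory_level"] < 0]
print(f"{len(invalid_rows)} row(s) with negative inventory")

# Inconsistent formats: parse every date notation into one canonical form.
df["order_date"] = df["order_date"].apply(lambda s: pd.to_datetime(s, dayfirst=True))

# Duplicate data: drop exact duplicates (human or system error).
df = df.drop_duplicates()
print(df)
```

Real-world checks are of course broader than this, but even a handful of rules like these already surface most of the issues listed above.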
Now, how do you fix these issues, you might ask? If you were hoping for a “quick fix,” you’ll be disappointed. There are three main strategies:
- Fix the issues in the source systems
- Fix during the ETL process (a sketch follows the list below)
- Fix at the metadata level (this is not really fixing, more like patching, as the core data remains untouched)
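As an illustration of the second strategy, here is a minimal sketch of what a cleaning step inside an ETL pipeline could look like; the function, column names and rules are hypothetical, not a prescription:

```python
import pandas as pd

def transform(raw: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Hypothetical cleaning step between extraction and loading."""
    cleaned = raw.copy()
    # Standardize the date notation before the data reaches the warehouse.
    cleaned["order_date"] = cleaned["order_date"].apply(
        lambda s: pd.to_datetime(s, dayfirst=True)
    )
    # Quarantine invalid rows instead of silently loading them.
    valid = cleaned["inventory_level"] >= 0
    rejected = cleaned[~valid]
    cleaned = cleaned[valid].drop_duplicates()
    return cleaned, rejected
```

The `rejected` frame can be routed back to the owners of the source system, which is how a fix in the ETL layer can also feed the first strategy.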
To a certain extent, artificial intelligence has already solved some of these issues. Auto-adjusting formats based on location (for example: if data was entered on US soil, it is likely the imperial system was used) is one example; a toy sketch of that idea follows the list below. “Self-healing” data pipelines are another example worth mentioning. Without going into too much detail, data quality has been the main blocker for organizations looking to leverage their data. While AI seems to lend a helping hand, organizations still want to get up to speed by:
- Creating strong data foundations
- Defining which data can serve which KPI
- Assigning data quality accountability to business units (BUs), alongside a CDO
- Embedding data quality improvement into the organizational culture
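To give a feel for the location-based adjustment mentioned above, here is a deliberately naive, rule-based sketch; real AI-driven tooling is far more general, and the names and rules below are made up:

```python
KG_PER_LB = 0.45359237

def normalize_weight(value: float, country_code: str) -> float:
    """Toy rule: weights captured in the US are assumed to be in pounds."""
    if country_code == "US":
        return value * KG_PER_LB
    return value  # assume metric everywhere else

print(round(normalize_weight(150, "US"), 2))  # 68.04
print(normalize_weight(68, "BE"))             # 68 (already metric)
```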
In case you want to get started, feel free to reach out.