Data Cleaning Is the Foundation of Data Analysis: Why Data Analysis Is Actually Hard

12 January 2026


There is a photo from a political campaign where Donald Trump is standing next to a garbage truck. For some reason, this image always comes back to me when I think about the job of a data analyst. Not because of politics, but because the analogy fits surprisingly well.

A data analyst is someone who has access to powerful tools, systems and data that can influence real business decisions. At the same time, most of the work involves digging through messy, inconsistent, incomplete and sometimes downright ugly data. A lot of garbage. And someone has to deal with it.

When people think about data analysis, they usually picture dashboards, charts, insights and clever conclusions. That part exists, of course. But in reality, it is maybe ten percent of the job. The remaining ninety percent is data cleaning.

In this article, I want to explain why data cleaning is so important, why it takes so much time and why data analysis without proper cleaning is often meaningless. I will walk through five core operations that I perform almost every time I work with data, regardless of the tool, industry or company size.

If you want to become a data analyst, or you already are one and keep wondering why things never look as clean as in tutorials, this article is for you.


Why Data Cleaning Takes So Much Time

Real-world data is rarely clean. It comes from production systems, user input, integrations, migrations and business compromises. Each of these is a potential source of errors.

Someone typed something manually. A system was down for a few hours. A field was optional and people ignored it. A format changed over time. A developer assumed something that later turned out to be wrong.

Before you can analyze anything, you first need to understand what you are actually looking at. Data cleaning is not a technical detail you can skip. It is the foundation of the entire analysis. If the foundation is shaky, every dashboard and every insight built on top of it will be shaky too, no matter how good it looks.


The Two Sides of Data Cleaning

Data cleaning has two dimensions.

The first one is technical. This includes working with tables, columns and values using SQL, Python, Excel, Power BI, Tableau or any other tool. This is the part most people talk about.

The second dimension is much less visible but equally important. It is communication. Talking to the people who create the data, use the systems and consume the reports. Very often something that looks like a mistake from an analyst’s perspective is actually correct from a business point of view. Or the other way around.

In this article I focus on the technical side, but it is important to remember that data cleaning without business context can easily lead to wrong conclusions.


Step 1: Identifying and Removing Duplicates

The first thing I usually check when working with a new dataset is duplicates. And I do not mean removing them right away. Identification comes first.

Duplicates are not always a mistake. It depends entirely on how the data is structured. If you have a product table where each product should appear once, duplicates are a problem. But if the same product exists in multiple languages or markets, duplicated identifiers might be expected.

The real issue starts when duplicates are counted without anyone noticing. Sales numbers become inflated. Order counts are off. Inventory levels do not make sense.

The most dangerous situation is when the number of duplicates is small. Large errors are easy to spot. Small ones quietly slip into reports, influence decisions and stay unnoticed for months.

That is why identifying duplicates is a mandatory step. Even if you decide not to remove them, you need to understand why they exist and how they affect your metrics.
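As a minimal sketch of "identify first, remove later", here is how a duplicate check might look in plain Python. The table and column names are made up for illustration; in practice this would be a `GROUP BY ... HAVING COUNT(*) > 1` in SQL or `duplicated()` in pandas.

```python
from collections import Counter

# Hypothetical order rows: (order_id, product, amount).
# The repeated order_id 1002 simulates a duplicated record.
orders = [
    (1001, "keyboard", 49.90),
    (1002, "mouse", 19.90),
    (1002, "mouse", 19.90),   # duplicate row
    (1003, "monitor", 899.00),
]

# Step one: identify duplicates before deciding what to do with them.
id_counts = Counter(order_id for order_id, _, _ in orders)
duplicates = {oid: n for oid, n in id_counts.items() if n > 1}
print(duplicates)  # {1002: 2}

# Only after understanding why the duplicates exist would we remove them.
unique_orders = list(dict.fromkeys(orders))  # keeps the first occurrence
print(len(orders), "->", len(unique_orders))  # 4 -> 3
```

Note that the identification step and the removal step are deliberately separate: the first is always mandatory, the second is a business decision.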


Step 2: Handling Missing Values

In a perfect world, every column would be filled. In reality, missing values are everywhere. In SQL, we call them NULLs, and they are responsible for a surprising number of analytical mistakes.

A missing value can mean many different things. Sometimes it is meaningful. For example, in an orders table, a “return ID” might be empty if no return happened. That is perfectly fine.

In other cases, missing data is caused by technical issues. A system was unavailable. A sensor failed. An integration broke. Then the question becomes what to do with those gaps.

Should you remove those records entirely? Should you keep them and report the missing values? Or should you try to fill them?

For numerical time-based data, such as temperature or usage metrics, filling missing values based on neighboring observations can make sense in very specific scenarios. But this should always be a conscious decision, not an automatic one.

Ignoring missing values is almost never a good idea. They can break aggregations, averages and models in subtle ways that are hard to detect later.
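To make the "filling based on neighboring observations" idea concrete, here is a sketch of linear interpolation over a gap in a time series. The readings are invented, and this is one possible strategy, not the right one for every dataset.

```python
# Hypothetical hourly temperature readings; None marks a gap
# caused by a sensor outage.
readings = [21.0, 21.4, None, None, 22.6, 22.8]

def fill_linear(values):
    """Fill internal gaps by linear interpolation between the
    nearest known neighbors. Leading/trailing gaps stay None,
    because there is nothing to interpolate from."""
    filled = list(values)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            j = i
            while j < len(filled) and filled[j] is None:
                j += 1                      # find the end of the gap
            if 0 < i and j < len(filled):   # gap has both neighbors
                left, right = filled[i - 1], filled[j]
                step = (right - left) / (j - i + 1)
                for k in range(i, j):
                    filled[k] = round(left + step * (k - i + 1), 2)
            i = j
        else:
            i += 1
    return filled

print(fill_linear(readings))  # [21.0, 21.4, 21.8, 22.2, 22.6, 22.8]
```

The conscious decision here is the choice of strategy itself: interpolation assumes the quantity changes smoothly, which holds for temperature but would be wrong for, say, transaction counts.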


Step 3: Normalization and Standardization

Normalization often sounds like database theory, and it is. But it has very practical implications for analysts.

From an analytical perspective, normalization is about avoiding unnecessary duplication of information. If an orders table contains customer ID, name, surname and email, something is probably wrong. Customer details should live in a separate table, and orders should reference them via an identifier.

Standardization is a closely related concept. It is about formats and units. Currencies should not be mixed. Weight should not be sometimes kilograms and sometimes grams. Dates should follow one format, not ten.

These issues are especially common when data comes from multiple sources. Everything might look fine at first glance, but the numbers do not add up. In many cases, inconsistent formats are the reason.
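A small, hypothetical example of what standardization looks like in code: two source systems report weight in different units and dates in different formats, and everything is forced into one convention before any aggregation happens.

```python
from datetime import datetime

# Hypothetical rows from two source systems: one reports weight
# in grams with "DD/MM/YYYY" dates, the other in kilograms with
# ISO dates. Field names are made up for illustration.
rows = [
    {"weight": 1500, "unit": "g",  "date": "05/01/2026"},
    {"weight": 2.3,  "unit": "kg", "date": "2026-01-07"},
]

def standardize(row):
    """Bring every row to kilograms and ISO-8601 dates."""
    weight_kg = row["weight"] / 1000 if row["unit"] == "g" else row["weight"]
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            date = datetime.strptime(row["date"], fmt).date().isoformat()
            break
        except ValueError:
            continue  # try the next known format
    return {"weight_kg": weight_kg, "date": date}

clean = [standardize(r) for r in rows]
print(clean)
# [{'weight_kg': 1.5, 'date': '2026-01-05'},
#  {'weight_kg': 2.3, 'date': '2026-01-07'}]
```

The important design choice is that the list of accepted formats is explicit: an unknown format fails loudly instead of being silently misparsed.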


Step 4: Consistency and Sanity Checks

This is where common sense becomes a core analytical skill.

Data can be technically valid and still completely wrong. Free-text fields are a classic example. City names, country names or product categories entered manually will almost always appear in many variations.

Warsaw can be written as “Warsaw”, “warsaw”, “Wawa”, “Capital of Poland” and dozens of other forms. All of them are valid strings. Analytically, they are a nightmare.

The same applies to boolean flags. If a system does not enforce a single format, you will quickly see values like true, false, T, F, 1, 0, yes and no mixed together. Every analysis then requires additional logic just to interpret the data correctly.

Sanity checks also include ranges and logic. Birth dates in the future. Negative sales without returns. Quantities that suddenly jump to unrealistic values. These are all signals that something needs attention.
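The boolean and city-name problems above can be tamed with explicit mappings. This is a sketch with invented values; the point is that the interpretation logic is written down once, in one place, instead of being repeated in every query.

```python
# Hypothetical free-text flag values as they might arrive from a form.
raw_flags = ["true", "T", "1", "yes", "FALSE", "0", "no"]

TRUE_VALUES = {"true", "t", "1", "yes", "y"}
FALSE_VALUES = {"false", "f", "0", "no", "n"}

def to_bool(value):
    """Map the known flag spellings to a real boolean;
    fail loudly on anything unexpected."""
    v = str(value).strip().lower()
    if v in TRUE_VALUES:
        return True
    if v in FALSE_VALUES:
        return False
    raise ValueError(f"Unrecognized boolean flag: {value!r}")

flags = [to_bool(v) for v in raw_flags]
print(flags)  # [True, True, True, True, False, False, False]

# City names need an explicit mapping; no generic rule can guess
# that "Wawa" means Warsaw.
CITY_MAP = {"warsaw": "Warsaw", "wawa": "Warsaw",
            "capital of poland": "Warsaw"}

def normalize_city(name):
    key = name.strip().lower()
    return CITY_MAP.get(key, name.strip().title())

print(normalize_city("WAWA"))  # Warsaw
```

Raising an error on an unknown flag is deliberate: a value like "maybe" should stop the pipeline and trigger a conversation, not be quietly guessed.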


Step 5: Detecting and Sometimes Removing Anomalies

This is the most dangerous part of data cleaning.

Anomalies, or outliers, can be errors or they can be the most important signals in your data. A sudden drop to zero inventory might indicate a system failure, or it might mean a real stockout. A spike in sales could be a duplication issue, or it could be a successful campaign.

Removing anomalies is tempting because they “mess up” charts and averages. But removing them blindly can erase valuable business information.

Statisticians will tell you not to remove outliers. Business reality is more complicated. Decisions often need to be made quickly, and perfect methodology is not always possible.

The key is awareness. If you remove anomalies, you should know exactly why you are doing it and what the consequences are.
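One common way to build that awareness is to flag outliers rather than delete them. Below is a sketch using the interquartile-range fence on invented daily sales numbers; the flagged points go to a human for review, which keeps the "erase valuable business information" risk in check.

```python
import statistics

# Hypothetical daily sales; the spike on the last day could be a
# data error or a real campaign. We only flag it; a human decides.
daily_sales = [120, 132, 128, 119, 125, 131, 540]

def flag_outliers(values, k=1.5):
    """Flag points outside the interquartile-range fence.
    Flagging is not removing: each flag still needs a decision."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

print(flag_outliers(daily_sales))  # [540]
```

The fence multiplier `k` is itself a judgment call: a wider fence flags less, a narrower one flags more, and neither choice is neutral.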


Data Cleaning as Detective Work

Despite its reputation, data cleaning does not have to be boring. For me, it often feels like detective work. Finding patterns, spotting inconsistencies, understanding what went wrong and why.

If you look at a table and immediately feel that something is off, that is a very good sign. You do not need to be perfect. You just need curiosity and a healthy level of distrust toward raw data.

This is also where the difference between tool knowledge and analytical thinking becomes very clear. Anyone can learn syntax. Not everyone learns how to question data.


Why Data Analysis Without Data Cleaning Fails

Every chart, metric and conclusion is built on top of input data. If that input is flawed, everything downstream is flawed as well.

Data cleaning takes time because it has to. It cannot be fully automated. It requires judgment, context and decision-making.

If you are learning data analysis or thinking about switching careers, it is important to understand this reality. The job is not only about pretty dashboards. It is about doing the hard work that makes those dashboards trustworthy.


Conclusion

Data cleaning is the foundation of data analysis. Without it, numbers lie, charts mislead and decisions suffer. Duplicates, missing values, inconsistencies, broken formats and anomalies are not edge cases. They are the norm.

If you learn to appreciate this part of the process, data analysis can be a very satisfying profession. If you ignore it, the problems will catch up with you sooner or later.

If you think this article could help someone better understand what working with data really looks like, share it on social media. That is the easiest way to get this knowledge to people who are just starting their journey in data analytics.

The article was written by Kajo Rudziński – analytical data architect, recognized expert in data analysis, creator of KajoData and of the Polish community for analysts, KajoDataSpace.

That’s all on this topic. Analyze in peace!

Did you like this article 🙂?
Share it on Social Media 📱
>>> You can share it on LinkedIn and show that you learn something new every day.
>>> You can throw it on Facebook – and perhaps help a friend of yours who is looking for this.
>>> And remember to bookmark this page, you never know if it won’t come in handy in the future.

You prefer to watch 📺 – no problem
>>> Subscribe and watch my English channel on YouTube.

Prefer to read in Polish? No problem.
