Histograms are one of the most popular tools for analyzing data distribution. While we often associate them with visual charts, their logic and analytical value can also be replicated in SQL. In this article, I’ll show you step by step how to create a simple histogram in SQL using subqueries, Common Table Expressions (CTEs), and CASE WHEN.
1. What is a Histogram and How to Use It in SQL?
A histogram is a way of representing the distribution of data, allowing us to analyze the frequency of specific values in a dataset. In SQL, we can retrieve this information by properly aggregating data. For example, we can understand the structure of customer orders: do most customers place only one order, or do we have many loyal customers with frequent transactions?
2. First Steps – Grouping Data
To start the analysis, we can calculate how many orders each customer has placed. This is done using GROUP BY
and the COUNT
function:
SELECT
customer_id,
COUNT(DISTINCT order_id) AS order_cnt
FROM
orders
GROUP BY
customer_id;
The result is a table showing the number of orders placed by each customer.
3. Subqueries – Counting Customers in Ranges
The next step is to count how many customers placed a specific number of orders. To achieve this, we use a subquery:
SELECT
order_cnt,
COUNT(DISTINCT customer_id) AS customer_cnt
FROM (
SELECT
customer_id,
COUNT(DISTINCT order_id) AS order_cnt
FROM
orders
GROUP BY
customer_id
) s
GROUP BY
order_cnt;
At this stage, we have a table showing how many customers placed each specific number of orders. This is our raw histogram, but it still needs simplification.
4. Creating Buckets with CASE WHEN
For large datasets, the number of unique order counts might be too high, making the results hard to read. To address this, we group customers into ranges (buckets) using the CASE WHEN
construct. This allows us to create ranges like 1 order, 2-5 orders, 6-10 orders, etc.
Here’s the SQL code:
WITH order_counts AS (
SELECT
customer_id,
COUNT(DISTINCT order_id) AS order_cnt
FROM
orders
GROUP BY
customer_id
),
customer_counts AS (
SELECT
order_cnt,
COUNT(DISTINCT customer_id) AS customer_cnt
FROM
order_counts
GROUP BY
order_cnt
)
SELECT
CASE
WHEN order_cnt = 1 THEN '1'
WHEN order_cnt < 6 THEN '2 - 5'
WHEN order_cnt < 11 THEN '6 - 10'
ELSE '>10'
END AS customer_group,
SUM(customer_cnt) AS customer_group_cnt
FROM
customer_counts
GROUP BY
customer_group;
5. Result Interpretation
The output table might look like this:
Order Range | Number of Customers |
---|---|
1 | 12 |
2 – 5 | 317 |
6 – 10 | 415 |
>10 | 49 |
From this table, we can see that most customers place between 2 and 10 orders, while only a small number have more than 10 orders. Such an analysis helps understand the customer structure and can be useful, for example, in loyalty segmentation.
6. Conclusion
This method is an effective way to analyze data distribution in SQL. Using subqueries, CTEs, and CASE WHEN, we can not only efficiently group data but also present results in readable categories. If you want to learn more techniques like this, check out my courses on KajoData!
Prefer to read in Polish? No problem!
Other interesting articles:
- Simple Pivot Table in SQL + Handling NULL Values in CASE WHEN
- Data Cleaning Techniques: A Step-by-Step Guide for Data Analysts
- Common CV Mistakes to Avoid: Essential Tips for Job Seekers
- 9 LinkedIn Mistakes That Are Killing Your Job Search
That’s all on this topic. Analyze in peace!
Did you like this article 🙂?
Share it on Social Media 📱
>>> You can share it on LinkedIn and show that you learn something new every day.
>>> You can throw it on Facebook – and perhaps help a friend of yours who is looking for this.
>>> And remember to bookmark this page, you never know if it won’t come handy in in the future.
You prefer to watch 📺 – no problem
>>> Subscribe and watch my English channel on YouTube.