Histograms in SQL
A histogram is an approximate representation of the distribution of numerical data. In other words, histograms show the number of data points that fall within a specified range of values (typically called “bins” or “buckets”). In this post, I’m exploring options for querying data as histogram by using SQL.
Note that for the sake of simplicity, I’ll be using MySQL as a database, but the queries are quite generic and can be used in other SQL dialects.
Let’s assume we have a table for persisting
users. Our goal is to build a histogram for understanding distribution of orders among many users. The table is kept to minimum for the simplicity.
CREATE TABLE `orders` ( `id` bigint(20) NOT NULL AUTO_INCREMENT, `user_id` bigint(20) NOT NULL, PRIMARY KEY (`id`), KEY `IDX_user_auth_token_user_id` (`user_id`) ) ENGINE=InnoDB AUTO_INCREMENT=1; INSERT INTO `orders` (user_id) VALUES (1), (1), (2), (2), (2), (3), (3), (3), (3), (4), (4), (4), (4), (4), (4), (4);
After inserting some orders, we have the following data:
User with ID 1 has 2 orders User with ID 2 has 3 orders User with ID 3 has 4 orders User with ID 4 has 7 orders
Querying the table should produce the following result:
SELECT id, user_id FROM `orders`; Result: | id | user_id | | --- | ------- | | 1 | 1 | | 2 | 1 | | 3 | 2 | | 4 | 2 | | 5 | 2 | | 6 | 3 | | 7 | 3 | | 8 | 3 | | 9 | 3 | | 10 | 4 | | 11 | 4 | | 12 | 4 | | 13 | 4 | | 14 | 4 | | 15 | 4 | | 16 | 4 |
Building a Histogram
Choosing bucket sizes is a science by its own and depends on the objective and data distribution. We can keep things simple here and choose 3 as a bucket size. The buckets would be
6-9 and so on.
Lower and upper bounds
Now that we identified our buckets, the question is if we want to make the bucket bounds inclusive or exclusive. I’d say it again depends on the given task we want to accomplish. Typically, for clarity you would want to make the upper bound exclusive and lower bound inclusive.
First, let’s see how many orders each user has
SELECT COUNT(1) as count, user_id FROM orders GROUP BY user_id; Result: | count | user_id | | ----- | ------- | | 2 | 1 | | 3 | 2 | | 4 | 3 | | 7 | 4 |
This is already good, the only thing we need to do is to distribute the data points to our predefined value ranges (buckets)
SELECT FLOOR(ordercount/3.00)*3 AS bucket, COUNT(*) AS count FROM (SELECT COUNT(1) AS ordercount FROM orders GROUP BY user_id) o GROUP BY bucket; | bucket | count | | ------ | ----- | | 0 | 1 | | 3 | 2 | | 6 | 1 |
Here we go, we have our first version of histogram. The query above floors each data point to the nearest multiple of 3 and then groups by that value. This is the trick to create the buckets we needed. It is worth to mention that if there is no data point for a specific bucket, that row will be skipped from the results.
Inclusive upper bounds
It might be needed to actually have the upper bound inclusive. Rounding up instead of down in the above query will do it! So changing
CEIL is sufficient.
SELECT CEIL(ordercount/3.00)*3 AS bucket, COUNT(*) AS count FROM (SELECT COUNT(1) AS ordercount FROM orders GROUP BY user_id) uat GROUP BY bucket; Result: | bucket | count | | ------ | ----- | | 3 | 2 | | 6 | 1 | | 9 | 1 |
In Cumulative Histogram the cumulative number of observations is counted in all of the buckets up to the specified bucket. Let’s see what would be difference in our example.
| bucket | count | | ------ | ----- | | 3 | 2 | | 6 | 3 | | 9 | 6 |
In this scenario, we can’t use the rounding strategy anymore, however we can scan over the result of grouping and count the cumulative value. In order to do this, we would also need to predefine the buckets in advance in our query.
SELECT COUNT(CASE WHEN ordercount <= 3 THEN 1 END) AS bucket_le_3, COUNT(CASE WHEN ordercount <= 6 THEN 1 END) AS bucket_le_6, COUNT(CASE WHEN ordercount <= 9 THEN 1 END) AS bucket_le_9, COUNT(1) AS bucket_le_inf FROM (SELECT COUNT(1) AS ordercount FROM orders GROUP BY user_id) o; | bucket_le_3 | bucket_le_6 | bucket_le_9 | bucket_le_inf | | ----------- | ----------- | ----------- | ------------- | | 2 | 3 | 4 | 4 |
As you can see we also introduced a new bucket,
bucket_le_inf. This is required unless you know in advance the limit of the buckets you want to plot and it will contain the cumulative result of all previous buckets.
Histograms are not easy to get right and all of the above is only a good enough solution. Feel free to play around with examples and fiddle with it.