Creating a Histogram with SQL
data:image/s3,"s3://crabby-images/9b1a3/9b1a346233c8f6a4c188d38b66b60fb81a7e0a72" alt="Jeremy Greze"
sql
In data analysis, histograms are often used to visualize the distribution of a single variable. In this post, we'll look at an SQL query that creates a histogram, given a number of bins (in this case, 10). This SQL code can be adapted to work with any table and any data. For our example, we'll use a table named "weather", with a column called "temperature".
select
min(data.temperature) as min,
max(data.temperature) as max,
count(data.temperature) as count,
floor(
(
case when data.temperature = value.max then value.max*0.999999999 else data.temperature end
- value.min
) / (
value.max - value.min
) * 10
) +1 as bin_number
from "weather" as data
cross join (select max(temperature) as max, min(temperature) as min from "weather") as value
where data.temperature is not null
group by bin_number
order by bin_number;
min | max | count | bin_number |
---|---|---|---|
12.7 | 14.9 | 81 | 1 |
15 | 17.2 | 320 | 2 |
17.3 | 19.5 | 371 | 3 |
19.6 | 21.8 | 367 | 4 |
21.9 | 24.1 | 225 | 5 |
24.2 | 26.4 | 105 | 6 |
26.5 | 28.7 | 56 | 7 |
28.8 | 31 | 35 | 8 |
31.1 | 33.1 | 10 | 9 |
33.6 | 35.6 | 7 | 10 |
Quick explanation of the code:
- The select statement begins by getting the minimum and maximum temperature values in each bin, and counting the total number of entries in each bin.
- The bin_number is calculated by subtracting the minimum value from the temperature, dividing by the range of the values (max value - min value), and multiplying by the number of bins (10 in this case). The FLOOR function is used to round down to the nearest whole number and get the bin number (a + 1 is added to the end to make sure the bin number starts at 1 instead of 0).
- The case statement is used to make sure the maximum value is included in the last bin. If this wasn't there, the highest value would end up in its own bin, giving us 11 (n+1) bins instead of the specified 10 (n).
- We use a cross join to get the minimum and maximum values from the table.
- The where clause is to exclude any null values from the data.
- Finally, the group by and order by clauses group the data by bin number and order the results by bin number.
You can play around with the code and see the results in this SQLite Playground.