About Data Aggregation
Chapter 2
By Alistair Croll
Most datasets fall somewhere in the middle of a range of levels of aggregation, from standalone factoids to individual transaction records. If we know what kind of data we have to begin with, we can greatly simplify the task of correctly visualizing them the first time around.
Let’s look at these types of aggregation one by one, using the example of coffee consumption. Let’s assume a café tracks each cup of coffee sold and records two pieces of information about the sale: the gender of the buyer and the kind of coffee (regular, decaf, or mocha).
The basic table of these data, by year, looks like this (these are completely made-up coffee data, by the way):

| Year | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 |
|---|---|---|---|---|---|---|---|---|---|
| Total sales | 19,795 | 23,005 | 31,711 | 40,728 | 50,440 | 60,953 | 74,143 | 93,321 | 120,312 |
| Male | 12,534 | 16,452 | 19,362 | 24,726 | 28,567 | 31,110 | 39,001 | 48,710 | 61,291 |
| Female | 7,261 | 6,553 | 12,349 | 16,002 | 21,873 | 29,843 | 35,142 | 44,611 | 59,021 |
| Regular | 9,929 | 14,021 | 17,364 | 20,035 | 27,854 | 34,201 | 36,472 | 52,012 | 60,362 |
| Decaf | 6,744 | 6,833 | 10,201 | 13,462 | 17,033 | 19,921 | 21,094 | 23,716 | 38,657 |
| Mocha | 3,122 | 2,151 | 4,146 | 7,231 | 5,553 | 6,831 | 16,577 | 17,593 | 21,293 |
Factoid

A factoid is a piece of trivia. It is calculated from source data, but chosen to emphasize a particular point.
Example: 36.7% of coffee in 2000 was consumed by women.
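A factoid like this is just arithmetic on the source table. Here is a minimal sketch in Python, using the made-up figures above:

```python
# Deriving the factoid from the yearly table above (made-up data).
cups_female_2000 = 7_261
cups_total_2000 = 19_795

share_female = cups_female_2000 / cups_total_2000
print(f"{share_female:.1%} of coffee in 2000 was consumed by women")  # -> 36.7%
```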
Series

A series is one type of information (the dependent variable) compared to another (the independent variable). Often, the independent variable is time.

In this example, the total sales of coffee depend on the year. That is, the year is independent (“pick a year, any year”) and the sales are dependent (“based on that year, the consumption is 23,005 cups”).

| Year | 2000 | 2001 | 2002 | 2003 |
|---|---|---|---|---|
| Total sales | 19,795 | 23,005 | 31,711 | 40,728 |
A series can also be some other set of continuous data, such as temperature. Consider this table that shows how long it takes for an adult to sustain a first-degree burn from hot water. Here, water temperature is the independent variable (source: US Government Memorandum, Consumer Product Safety Commission, Peter L. Armstrong, Sept. 15, 1978):

| Water Temp °C (°F) | Time for 1st Degree Burn |
|---|---|
| 46.7 (116) | 35 minutes |
| 50 (122) | 1 minute |
| 55 (131) | 5 seconds |
| 60 (140) | 2 seconds |
| 65 (149) | 1 second |
| 67.8 (154) | Instantaneous |
And it can be a series of non-contiguous, but related, information in a category, such as major car brands, types of dog, vegetables, or the mass of planets in the solar system (source: National Space Science Data Center, NASA: http://nssdc.gsfc.nasa.gov/planetary/factsheet/planet_table_ratio.html):

| Planet | Mass relative to Earth |
|---|---|
| Mercury | 0.0553 |
| Venus | 0.815 |
| Earth | 1 |
| Mars | 0.107 |
| Jupiter | 317.8 |
| Saturn | 95.2 |
| Uranus | 14.5 |
| Neptune | 17.1 |
In many cases, series data have one and only one dependent value for each independent value. In other words, there is only one number for coffee consumption for each year on record. This is usually displayed as a bar, column, or time-series graph.
In cases where there are several dependent variables for each independent one, we often show the information as a scatterplot or heat map, or do some kind of processing (such as an average) to simplify what’s shown. We’ll come back to this in the section below, Using visualization to reveal underlying variance.
Multiseries

A multiseries dataset has several pieces of dependent information and one piece of independent information. Here are the data about exposure to hot water from before, with additional data (source: US Government Memorandum, Consumer Product Safety Commission, Peter L. Armstrong, Sept. 15, 1978):

| Water Temp °C (°F) | Time for 1st Degree Burn | Time for 2nd & 3rd Degree Burns |
|---|---|---|
| 46.7 (116) | 35 minutes | 45 minutes |
| 50 (122) | 1 minute | 5 minutes |
| 55 (131) | 5 seconds | 25 seconds |
| 60 (140) | 2 seconds | 5 seconds |
| 65 (149) | 1 second | 2 seconds |
| 67.8 (154) | Instantaneous | 1 second |
Or, returning to our coffee example, we might have several series:

| Year | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 |
|---|---|---|---|---|---|---|
| Male | 12,534 | 16,452 | 19,362 | 24,726 | 28,567 | 31,110 |
| Regular | 9,929 | 14,021 | 17,364 | 20,035 | 27,854 | 34,201 |
With this dataset, we know several things about 2001. We know that 16,452 cups were served to men and that 14,021 cups served were regular coffee (with caffeine, cream or milk, and sugar).
We don’t, however, know how to combine these in useful ways: they aren’t related. We can’t tell what percentage of regular coffee was sold to men or how many cups were served to women.
In other words, multiseries data are simply several series on one chart or table. We can show them together, but we can’t meaningfully stack or combine them.
Summable multiseries

As the name suggests, a summable multiseries is a particular statistic segmented into subgroups (gender, or type of coffee):

| Year | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 |
|---|---|---|---|---|---|---|---|---|---|
| Male | 12,534 | 16,452 | 19,362 | 24,726 | 28,567 | 31,110 | 39,001 | 48,710 | 61,291 |
| Female | 7,261 | 6,553 | 12,349 | 16,002 | 21,873 | 29,843 | 35,142 | 44,611 | 59,021 |
Because we know a coffee drinker is either male or female, we can add these together to make broader observations about total consumption. For one thing, we can display percentages.
Additionally, we can stack the segments to reveal the whole.
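A minimal sketch of that idea, using the made-up figures above: computing each gender's share of the whole per year is exactly what a stacked (or 100%-stacked) chart shows visually.

```python
# Shares of the whole, per year, for a summable multiseries (made-up data).
male = {2000: 12_534, 2001: 16_452, 2002: 19_362}
female = {2000: 7_261, 2001: 6_553, 2002: 12_349}

for year in male:
    total = male[year] + female[year]
    print(f"{year}: {male[year] / total:.1%} male, {female[year] / total:.1%} female")
```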
One challenge with summable multiseries data is knowing which series go together. Consider the following:

| Year | 2000 | 2001 | 2002 | 2003 | 2004 |
|---|---|---|---|---|---|
| Male | 12,534 | 16,452 | 19,362 | 24,726 | 28,567 |
| Female | 7,261 | 6,553 | 12,349 | 16,002 | 21,873 |
| Regular | 9,929 | 14,021 | 17,364 | 20,035 | 27,854 |
| Decaf | 6,744 | 6,833 | 10,201 | 13,462 | 17,033 |
| Mocha | 3,122 | 2,151 | 4,146 | 7,231 | 5,553 |
There is nothing inherent in these data that tells us how we can combine information. It takes human understanding of data categories to know that Male + Female = a complete set and Regular + Decaf + Mocha = a complete set. Without this knowledge, we can't combine the data, or, worse, we might combine them incorrectly.
Even if we know the meaning of these data and realize they are two separate multiseries tables (one on gender and one on coffee type), we can't explore them deeply. For example, we can't find out how many women drank regular coffee in 2000.
This is a common (and important) mistake. Many people are tempted to say:
36.7% of cups sold in 2000 were sold to women.
And there were 9,929 cups of regular sold in 2000.
Therefore, 3,642.5 cups of regular were sold to women.
But this is wrong. This type of inference can only be made when you know that one category (coffee type) is evenly distributed across another (gender). The fact that the result isn’t even a whole number reminds us not to do this, as nobody was served a half cup.
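To see why, consider a small hypothetical example (the numbers below are invented, not taken from the tables above): two very different breakdowns can produce exactly the same gender totals and coffee-type totals, so the totals alone cannot tell us how many regulars went to women.

```python
# Two hypothetical breakdowns with identical gender totals and identical
# coffee-type totals, but different answers to "how many regulars went to women".
breakdown_a = {("F", "Regular"): 4, ("F", "Decaf"): 0,
               ("M", "Regular"): 1, ("M", "Decaf"): 5}
breakdown_b = {("F", "Regular"): 0, ("F", "Decaf"): 4,
               ("M", "Regular"): 5, ("M", "Decaf"): 1}

def marginals(cells):
    by_gender = {g: sum(v for (gg, _), v in cells.items() if gg == g) for g in "FM"}
    by_type = {t: sum(v for (_, tt), v in cells.items() if tt == t)
               for t in ("Regular", "Decaf")}
    return by_gender, by_type

print(marginals(breakdown_a) == marginals(breakdown_b))          # True: same totals
print(breakdown_a["F", "Regular"], breakdown_b["F", "Regular"])  # 4 vs. 0
```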
The only way to truly explore the data and ask new questions (such as “How many cups of regular were sold to women in 2000?”) is to have the raw data. And then it’s a matter of knowing how to aggregate them appropriately.
Summary records

The following table of summary records looks like the kind of data a point-of-sale system at a café might generate. It includes a column of categorical information (gender, where there are two possible types) and subtotals for each type of coffee, as well as cup totals for each patron and for each type.

| Name | Gender | Regular | Decaf | Mocha | Total |
|---|---|---|---|---|---|
| Bob Smith | M | 2 | 3 | 1 | 6 |
| Jane Doe | F | 4 | 0 | 0 | 4 |
| Dale Cooper | M | 1 | 2 | 4 | 7 |
| Mary Brewer | F | 3 | 1 | 0 | 4 |
| Betty Kona | F | 1 | 0 | 0 | 1 |
| John Java | M | 2 | 1 | 3 | 6 |
| Bill Bean | M | 3 | 1 | 0 | 4 |
| Jake Beatnik | M | 0 | 0 | 1 | 1 |
| Totals | 5M, 3F | 16 | 8 | 9 | 33 |
This kind of table is familiar to anyone who's done basic exploration in a tool like Excel. We can do subcalculations (see the sketch after this list):
There are 5 male drinkers and 3 female drinkers
There were 16 regulars, 8 decafs, and 9 mochas
We sold a total of 33 cups
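These subtotals fall straight out of the summary records. A minimal sketch in Python, using the rows from the table above:

```python
# Subcalculations from the summary records (name, gender, regular, decaf, mocha).
records = [
    ("Bob Smith", "M", 2, 3, 1), ("Jane Doe", "F", 4, 0, 0),
    ("Dale Cooper", "M", 1, 2, 4), ("Mary Brewer", "F", 3, 1, 0),
    ("Betty Kona", "F", 1, 0, 0), ("John Java", "M", 2, 1, 3),
    ("Bill Bean", "M", 3, 1, 0), ("Jake Beatnik", "M", 0, 0, 1),
]

males = sum(1 for _, g, *_ in records if g == "M")    # 5
females = sum(1 for _, g, *_ in records if g == "F")  # 3
regular = sum(r for _, _, r, _, _ in records)         # 16
decaf = sum(d for _, _, _, d, _ in records)           # 8
mocha = sum(m for _, _, _, _, m in records)           # 9
print(males, females, regular, decaf, mocha, regular + decaf + mocha)  # ... 33
```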
But more importantly, we can combine categories of data to ask more exploratory questions. For example: Do women prefer a certain kind of coffee? This is the kind of thing Excel, well, excels at, and it’s often done using a tool called a Pivot Table.
Here’s a table looking at the average number of regular, decaf, and mocha cups consumed by male and female patrons:

| Row Labels | Average of Regular | Average of Decaf | Average of Mocha |
|---|---|---|---|
| F | 2.67 | 0.33 | 0.00 |
| M | 2.00 | 1.75 | 2.00 |
| Grand Total | 2.29 | 1.14 | 1.14 |
Looking at this table, we can see a pretty clear trend: Women like regular; men seem evenly split across all three types of coffee. (There aren't enough data to make a statistically reliable statement like this. But this is all made-up data anyway, so stop thinking so much about coffee consumption.)
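A rough programmatic equivalent of that pivot, assuming pandas is available, is sketched below. Note that a mean computed this way counts patrons who had zero cups of a type, so the exact figures depend on how a spreadsheet pivot treats blank cells and may differ slightly from the table above.

```python
# A sketch of the same kind of pivot: average cups per gender, per coffee type.
import pandas as pd

df = pd.DataFrame(
    [("Bob Smith", "M", 2, 3, 1), ("Jane Doe", "F", 4, 0, 0),
     ("Dale Cooper", "M", 1, 2, 4), ("Mary Brewer", "F", 3, 1, 0),
     ("Betty Kona", "F", 1, 0, 0), ("John Java", "M", 2, 1, 3),
     ("Bill Bean", "M", 3, 1, 0), ("Jake Beatnik", "M", 0, 0, 1)],
    columns=["Name", "Gender", "Regular", "Decaf", "Mocha"],
)

pivot = df.pivot_table(index="Gender", values=["Regular", "Decaf", "Mocha"],
                       aggfunc="mean")
print(pivot)
```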
The thing about these data, however, is they have still been aggregated somehow. We summarized the data along several dimensions—gender and coffee type—by aggregating them by the name of the patron. While this isn’t the raw data, it’s close.
One good thing about this summarization is that it keeps the dataset fairly small. It also suggests ways in which the data might be explored. It is pretty common to find survey data that look like this: for example, a Google Form that asks for a patron's name, gender, and cups of regular, decaf, and mocha consumed would produce the following data in the Google spreadsheet:

| Timestamp | What is your name? | Gender? | Regular | Decaf | Mocha |
|---|---|---|---|---|---|
| 1/17/2014 11:12:47 | Bob Smith | Male | 4 | 3 |  |
When you have summary records or raw data, it's common to aggregate them in order to display them easily. By showing the total coffee consumed (summing up the raw information) or the average number of cups per patron (the mean of the raw information), we make the data easier to understand.

Consider the following records:

| Name | Regular | Decaf | Mocha |
|---|---|---|---|
| Bob Smith | 2 | 3 | 1 |
| Jane Doe | 4 | 0 | 0 |
| Dale Cooper | 1 | 2 | 4 |
| Mary Brewer | 3 | 1 | 0 |
| Betty Kona | 1 | 0 | 0 |
| John Java | 2 | 1 | 3 |
| Bill Bean | 3 | 1 | 0 |
| Jake Beatnik | 0 | 0 | 1 |
| Totals | 16 | 8 | 9 |
| Averages | 2 | 1 | 1.125 |

We can show the average number of cups of each type consumed per patron as a summary graphic.
But averages hide things. Perhaps some people have a single cup of a particular type, and others have many. There are ways to visualize the spread, or variance, of data that indicate the underlying shape of the information, including heat charts, histograms, and scatterplots. When keeping the underlying data, you can wind up with more than one dependent variable for each independent variable.
A better visualization (such as a histogram, which counts how many people fit into each bucket or range of values that made up an average) might reveal that a few people are drinking a lot of coffee, and a large number of people are drinking a small amount.
Consider this histogram of the number of cups per patron. All we did was tally up how many people had one cup, how many had two, how many had three, and so on. Then we plotted how frequently each number occurred, which is why this is called a frequency histogram.
The average number of cups in this dataset is roughly 3. And the mode, or most common number, is 2 cups. But as the histogram shows, there are three heavy coffee drinkers who’ve each consumed 7 cups, pushing up the average.
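A minimal sketch of how such a frequency histogram is tallied follows. The cup counts below are invented to mirror the description above (many light drinkers, a few heavy ones pulling up the average); they are not taken from the earlier tables.

```python
# Tally how many patrons had each number of cups, then plot the frequencies.
from collections import Counter
from statistics import mean, mode

cups_per_patron = [1, 2, 2, 2, 3, 2, 1, 3, 7, 7, 7, 2, 3, 4]  # made-up counts

histogram = Counter(cups_per_patron)
for cups in sorted(histogram):
    print(f"{cups} cups: {'#' * histogram[cups]} ({histogram[cups]} patrons)")

print("mean:", round(mean(cups_per_patron), 1), "mode:", mode(cups_per_patron))
```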
In other words, when you have raw data, you can see the exceptions and outliers, and tell a more accurate story.
Even these data, verbose and informative as they are, aren’t the raw information: they’re still aggregated.
Aggregation happens in many ways. For example, a restaurant receipt usually aggregates orders by table. There’s no way to find out what an individual person at the table had for dinner, just the food that was served and what it cost. To get to really specific exploration, however, we need data at the transaction level.
Individual transactions

Transactional records capture things about a specific event. There's no aggregation of the data along any dimension like someone's name (though their name may be captured). It's not rolled up over time; it's instantaneous.

| Timestamp | Name | Gender | Coffee |
|---|---|---|---|
| 17:00 | Bob Smith | M | Regular |
| 17:01 | Jane Doe | F | Regular |
| 17:02 | Dale Cooper | M | Mocha |
| 17:03 | Mary Brewer | F | Decaf |
| 17:04 | Betty Kona | F | Regular |
| 17:05 | John Java | M | Regular |
| 17:06 | Bill Bean | M | Regular |
| 17:07 | Jake Beatnik | M | Mocha |
| 17:08 | Bob Smith | M | Regular |
| 17:09 | Jane Doe | F | Regular |
| 17:10 | Dale Cooper | M | Mocha |
| 17:11 | Mary Brewer | F | Regular |
| 17:12 | John Java | M | Decaf |
| 17:13 | Bill Bean | M | Regular |
These transactions can be aggregated by any column. They can be cross-referenced by those columns. The timestamps can also be aggregated into buckets (hourly, daily, or annually). Ultimately, the initial dataset we saw of coffee consumption per year results from these raw data, although summarized significantly.
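A minimal sketch of both kinds of aggregation, modeled on the transaction table above (the dates are invented for illustration, since the table shows only times):

```python
# Aggregate raw transactions by a column, and bucket their timestamps.
from collections import Counter
from datetime import datetime

transactions = [
    ("2014-01-17 17:00", "Bob Smith", "M", "Regular"),
    ("2014-01-17 17:01", "Jane Doe", "F", "Regular"),
    ("2014-01-17 17:02", "Dale Cooper", "M", "Mocha"),
    ("2014-01-17 18:15", "Mary Brewer", "F", "Decaf"),
]

# Aggregate by any column, e.g. coffee type or gender.
print(Counter(coffee for _, _, _, coffee in transactions))
print(Counter(gender for _, _, gender, _ in transactions))

# Bucket timestamps, e.g. by hour (the same idea works for days or years).
by_hour = Counter(
    datetime.strptime(ts, "%Y-%m-%d %H:%M").strftime("%Y-%m-%d %H:00")
    for ts, *_ in transactions
)
print(by_hour)
```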
When we roll up data into buckets, or transform it somehow, we take away the raw history. For example, when we turned raw transactions into annual totals:
We anonymized the data by removing the names of patrons when we aggregated it.
We bucketed timestamps, summarizing by year.
Either of these pieces of data could have shown us that someone was a heavy coffee drinker (based on total coffee consumed by one person, or based on the rate of consumption from timestamps). While we might not think about the implications of our data on coffee consumption, what if the data pertained instead to alcohol consumption? Would we have a moral obligation to warn someone if we saw that a particular person habitually drank a lot of alcohol? What if this person killed someone while driving drunk? Are data about alcohol consumption subject to legal discovery in a way that data about coffee consumption needn’t be? Are we allowed to aggregate some kinds of data but not others?
Can we address the inherent biases that result from choosing how we aggregate data before presenting it?
The big data movement is going to address some of this. Once, it was too costly to store and process all the raw transactions; we had to decide how to aggregate things at the moment of collection, and throw out the raw information. But advances in storage efficiency, parallel processing, and cloud computing are making on-the-fly aggregation of massive datasets a reality, which should overcome some amount of aggregation bias.
When trying to turn data into information, the data you start with matter a lot. Data can be simple factoids—of which someone else has done all of the analysis—or raw transactions, where the exploration is left entirely to the user.
To summarize the levels of aggregation:

| Level of aggregation | Number of metrics | Description |
|---|---|---|
| Factoid | Maximum context | Single data point; no drill-down |
| Series | One metric, across an axis | Can compare rate of change |
| Multiseries | Several metrics, common axis | Can compare rate of change, correlation between metrics |
| Summable multiseries | Several metrics, common axis | Can compare rate of change, correlation between metrics; can compare percentages to whole |
| Summary records | One record for each item in a series; metrics in other series have been aggregated somehow | Items can be compared |
| Individual transactions | One record per instance | No aggregation or combination; maximum drill-down |