(Original notebook can be found in this gist)
Averages, sums and counts when grouping by
import pandas as pd
import numpy as np
Assume we have a table where each row corresponds to a transaction and indexed by timestamp.
By re-sampling this table by 1H
frequency and counting the number of transaction per hour, we get a new aggregated view when each row corresponds to an hour and the value is the count of transaction happened within that hour.
As an example, we start with the following table. This demonstrates the result of the re-sampling mentioned above.
np.random.seed(42)
idx = pd.date_range('2017-01-01', '2017-01-14', freq='1H')
df = pd.DataFrame(np.random.choice([1,2,3,4,5,6], size=idx.shape[0]), index=idx, columns=['count'])
df.head()
count | |
---|---|
2017-01-01 00:00:00 | 4 |
2017-01-01 01:00:00 | 5 |
2017-01-01 02:00:00 | 3 |
2017-01-01 03:00:00 | 5 |
2017-01-01 04:00:00 | 5 |
The objective is to measure behavior as depending on the day of the week (Mo., Tue. etc.) First, let us count how many events happened during each weekday. This can be achieved in couple of ways:
df.pivot_table('count', index=df.index.dayofweek, aggfunc='sum')
count | |
---|---|
0 | 161 |
1 | 170 |
2 | 164 |
3 | 133 |
4 | 169 |
5 | 98 |
6 | 172 |
or:
df.groupby(df.index.dayofweek).sum()
count | |
---|---|
0 | 161 |
1 | 170 |
2 | 164 |
3 | 133 |
4 | 169 |
5 | 98 |
6 | 172 |
Lastly, being more explicit (and less Pythonic):
pd.DataFrame(
[
df[df.index.dayofweek==i].sum()[0] for i in range(7)
],
columns=['count']
)
count | |
---|---|
0 | 161 |
1 | 170 |
2 | 164 |
3 | 133 |
4 | 169 |
5 | 98 |
6 | 172 |
Next, we want to compute the average number of transaction per weekday. First, let's do it explicitly, for Monday (day 0). The total number of transactions is:
df[df.index.dayofweek == 0].sum()
count 161
dtype: int64
In the dates range used, there are two Mondays, therefore, the average is:
((df.resample('1d').sum()).index.dayofweek == 0).sum()
2
Therefore, the average is:
df[df.index.dayofweek == 0].sum() / ((df.resample('1d').sum()).index.dayofweek == 0).sum()
count 80.5
dtype: float64
One may think the following syntax is a quicker way:
df.pivot_table('count', index=df.index.dayofweek, aggfunc='mean')
count | |
---|---|
0 | 3.354167 |
1 | 3.541667 |
2 | 3.416667 |
3 | 2.770833 |
4 | 3.520833 |
5 | 3.920000 |
6 | 3.583333 |
But this returns the wrong average. To be more precise, it divides the number of transactions per day-of-week by the number of hours that occurred in this day-of-week in the time range, given by:
df.groupby(df.index.dayofweek).count()
count | |
---|---|
0 | 48 |
1 | 48 |
2 | 48 |
3 | 48 |
4 | 48 |
5 | 25 |
6 | 48 |
Or, put together:
df.groupby(df.index.dayofweek).sum() / df.groupby(df.index.dayofweek).count()
count | |
---|---|
0 | 3.354167 |
1 | 3.541667 |
2 | 3.416667 |
3 | 2.770833 |
4 | 3.520833 |
5 | 3.920000 |
6 | 3.583333 |
A way around it, would be:
count_per_day_df = df.resample('1d').sum()
count_per_day_df.groupby(count_per_day_df.index.dayofweek).mean()
count | |
---|---|
0 | 80.5 |
1 | 85.0 |
2 | 82.0 |
3 | 66.5 |
4 | 84.5 |
5 | 49.0 |
6 | 86.0 |
I mentioned this maybe counter intuitive behavior on SO.