Dr. Dror


Averages, sums and counts when grouping by


(Original notebook can be found in this gist)


import pandas as pd
import numpy as np

Assume we have a table in which each row corresponds to a transaction and is indexed by its timestamp. By resampling this table at a 1-hour frequency and counting the number of transactions per hour, we get a new, aggregated view in which each row corresponds to an hour and the value is the count of transactions that happened within that hour.
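This resampling step itself is not shown in the post; a minimal sketch (the `amount` column and the timestamps are illustrative, not from the original data) might look like:

```python
import pandas as pd

# A toy transaction log: one row per transaction, indexed by timestamp.
ts = pd.to_datetime([
    '2017-01-01 00:05', '2017-01-01 00:40',
    '2017-01-01 01:10', '2017-01-01 02:30', '2017-01-01 02:45',
])
transactions = pd.DataFrame({'amount': [10, 20, 15, 5, 30]}, index=ts)

# Count transactions per hour; each row of the result is one hour.
hourly_counts = transactions.resample('1H').size().to_frame('count')
print(hourly_counts)
```

Here `.size()` counts rows per hourly bin, so hours with no transactions would show up as 0 rather than being dropped.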

As an example, we start with the following (randomly generated) table, which plays the role of the resampled result described above.

np.random.seed(42)
idx = pd.date_range('2017-01-01', '2017-01-14', freq='1H')
df = pd.DataFrame(np.random.choice([1,2,3,4,5,6], size=idx.shape[0]), index=idx, columns=['count'])
df.head()
count
2017-01-01 00:00:00 4
2017-01-01 01:00:00 5
2017-01-01 02:00:00 3
2017-01-01 03:00:00 5
2017-01-01 04:00:00 5

The objective is to measure behavior as a function of the day of the week (Mon., Tue., etc.). First, let us count how many events happened during each weekday. This can be achieved in a couple of ways:

df.pivot_table('count', index=df.index.dayofweek, aggfunc='sum')
count
0 161
1 170
2 164
3 133
4 169
5 98
6 172

or:

df.groupby(df.index.dayofweek).sum()
count
0 161
1 170
2 164
3 133
4 169
5 98
6 172

Lastly, here is a more explicit (and less Pythonic) way:

pd.DataFrame(
    [
        df[df.index.dayofweek == i]['count'].sum() for i in range(7)
    ],
    columns=['count']
)
count
0 161
1 170
2 164
3 133
4 169
5 98
6 172

Next, we want to compute the average number of transactions per weekday. First, let's do it explicitly for Monday (day 0). The total number of transactions on Mondays is:

df[df.index.dayofweek == 0].sum()
count    161
dtype: int64

In the date range used there are two Mondays, as the following count confirms:

((df.resample('1d').sum()).index.dayofweek == 0).sum()
2

Therefore, the average is:

df[df.index.dayofweek == 0].sum() / ((df.resample('1d').sum()).index.dayofweek == 0).sum()
count    80.5
dtype: float64

One might think the following syntax is a quicker way to get the same result:

df.pivot_table('count', index=df.index.dayofweek, aggfunc='mean')
count
0 3.354167
1 3.541667
2 3.416667
3 2.770833
4 3.520833
5 3.920000
6 3.583333

But this returns the wrong average. To be precise, it divides the number of transactions per day-of-week by the number of hours carrying that day-of-week in the time range, given by:

df.groupby(df.index.dayofweek).count()
count
0 48
1 48
2 48
3 48
4 48
5 25
6 48
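The 25 for Saturday (day 5) comes from the endpoints of the range: the index starts on a Sunday and ends exactly at 2017-01-14 00:00, so the second Saturday contributes only its single midnight hour. A quick check:

```python
import pandas as pd

idx = pd.date_range('2017-01-01', '2017-01-14', freq='1H')

# The range opens on a Sunday and closes at midnight on a Saturday.
print(idx[0].day_name(), idx[-1].day_name())

# Hours per day-of-week: 48 everywhere except Saturday (24 + 1).
print(pd.Series(idx.dayofweek).value_counts().sort_index())
```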

Or, put together:

df.groupby(df.index.dayofweek).sum() / df.groupby(df.index.dayofweek).count()
count
0 3.354167
1 3.541667
2 3.416667
3 2.770833
4 3.520833
5 3.920000
6 3.583333
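As a sanity check, the per-hour mean and the per-day mean differ only by the number of hours each weekday contributes: for a weekday made up entirely of full days (which holds for Monday here, but not for Saturday), multiplying the hourly mean by 24 recovers the daily average. A sketch, reusing the setup from above:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
idx = pd.date_range('2017-01-01', '2017-01-14', freq='1H')
df = pd.DataFrame(np.random.choice([1, 2, 3, 4, 5, 6], size=idx.shape[0]),
                  index=idx, columns=['count'])

# Per-hour mean by day-of-week, as pivot_table computes it.
hourly_mean = df.groupby(df.index.dayofweek).mean()

# Monday (day 0) consists of two full days of 24 hours each,
# so hourly mean * 24 gives the per-day average.
print(hourly_mean.loc[0, 'count'] * 24)
```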

A way around this is to first resample to daily counts and only then group by the day of the week:

count_per_day_df = df.resample('1d').sum()
count_per_day_df.groupby(count_per_day_df.index.dayofweek).mean()
count
0 80.5
1 85.0
2 82.0
3 66.5
4 84.5
5 49.0
6 86.0
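For completeness, the same daily-resampled frame can also be fed to pivot_table; this is just a sketch of an equivalent spelling, not a different computation:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
idx = pd.date_range('2017-01-01', '2017-01-14', freq='1H')
df = pd.DataFrame(np.random.choice([1, 2, 3, 4, 5, 6], size=idx.shape[0]),
                  index=idx, columns=['count'])

# Resample to daily totals, then average those totals per day-of-week.
daily = df.resample('1d').sum()
result = daily.pivot_table('count', index=daily.index.dayofweek, aggfunc='mean')
print(result)
```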

I mentioned this possibly counterintuitive behavior on SO.