

Why do we need to divide by n-1?


(Original post can be found in this gist)

It seems to be a recurring question: what is the difference between pandas.std and numpy.std? See, for example, this SO thread. But what does it mean that Pandas uses the unbiased estimator and Numpy does not? In this post I would like to help you understand what the difference is and when each should be used. I will try to achieve this by following the paper "Why Divide by N-1?" by J. L. M. Wilkins (link). In particular, instead of using archaic software, I will redo the examples using Numpy and Pandas.

Background

What are the average and the standard deviation of the height of humans? To answer this question, let us denote by $X$ the whole population of humanity (at the moment); each individual is denoted by $X_i$. The average (or mean), denoted by $\mu$, can be computed by

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

where $N$ is the number of individuals. Next, the standard deviation is given by:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$

Recall that the variance is the square of the standard deviation and is denoted by $\sigma^2$; we will consider the variance from now on.
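As a quick sanity check of the two formulas, here is a minimal sketch that computes $\mu$ and $\sigma^2$ directly from the definitions; the five "heights" are made-up numbers:

import numpy as np

# A toy population of five heights (in cm); purely illustrative values
X = np.array([160.0, 165.0, 170.0, 175.0, 180.0])
N = len(X)

mu = X.sum() / N                    # population mean
sigma2 = ((X - mu) ** 2).sum() / N  # population variance: divide by N
print(mu, sigma2)                   # 170.0 50.0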

The task at hand is, obviously, impossible to accomplish. What you are likely to do is pick a sample $X_{i(1)}, X_{i(2)}, \dots, X_{i(n)}$, where $n < N$ and $i(k)$ is a permutation. For the sake of simplicity, let us denote the sample by $X_i$ where $i = 1, \dots, n$. Using this sample you will compute the sample mean, denoted by $\bar{X}$, and the sample variance.

The first part, that is, the mean of the sample, is easy:

$$\bar{X} = \frac{\sum_{j=1}^{n} X_j}{n}$$

It turns out that $\bar{X}$ is an unbiased estimator of $\mu$; intuitively, this means that it estimates the real mean $\mu$ well. See, for example, this reference.
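To see the unbiasedness in action, here is a small simulation (my own illustration, using the 1-to-100 population that appears later in this post): averaging many sample means recovers $\mu$ almost exactly.

import numpy as np

rng = np.random.default_rng(0)
population = np.arange(1, 101)  # mu = 50.5

# Average 10000 sample means of size 10; an unbiased estimator hovers around mu
sample_means = [rng.choice(population, size=10).mean() for _ in range(10000)]
print(np.mean(sample_means))    # close to 50.5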

Things become trickier when it comes to the sample's variance, denoted by $S^2$. One may try to compute $S^2$ by "generalizing" the formula for $\sigma^2$.

This would be denoted by $S_n^2$:

$$S_n^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}$$

However, this is a biased estimator of $\sigma^2$. The unbiased estimator is

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$$

All this is explained nicely, and in greater depth, in various places; even Wikipedia is a good start. An intuitive explanation can be found here. In this post, I will follow the examples of Wilkins, which give a better motivation for why $S_n^2$ is indeed biased and $S^2$ is not.
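The algebra behind this is worth one extra step (a standard result, stated here for completeness and not derived in the original post): taking expectations of the biased estimator gives

$$\mathbb{E}\left[S_n^2\right] = \frac{n-1}{n}\,\sigma^2$$

so $S_n^2$ systematically underestimates $\sigma^2$, and rescaling by $\frac{n}{n-1}$, that is, dividing by $n-1$ instead of $n$, exactly cancels the bias.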

Let the game begin

Before we start, let us stress that, by default, pandas.std is equivalent to $S^2$ and numpy.std to $S_n^2$.

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import numpy as np
import pandas as pd

#
# In general, tqdm is great and I strongly suggest trying it out if you run this
# notebook locally. However, it doesn't render well in the post generated from this
# notebook, so I define a dummy function which does nothing.
#
# from tqdm import tqdm_notebook as tqdm
def tqdm(X):
    return X

We start with a population a consisting of the numbers 1 to 100 (inclusive).

a = np.arange(1,101)

The mean $\mu$ is

np.mean(a)
50.5

Next, we compute $\sigma^2$; we use the default behavior of np.std.

np.std(a)**2
833.25

We can double check that indeed Numpy delivers:

np.sum(np.power(a - np.mean(a), 2)) / len(a)
833.25

By default, Pandas will fail to match $\sigma^2$:

np.power(pd.Series(a).std(), 2)
841.66666666666663

But this can be fixed:

np.power(pd.Series(a).std(ddof=0), 2)
833.25

Great, this all aligns well with the documentation and the definitions. If the sequence of values you have covers the WHOLE of the population in question, you should use either np.std or pd.std(ddof=0).

Sample the whole population

Time to play the game. As we cannot have the values of all the individuals in our population, we have to use a sample. We will now do so for various sample sizes. At each step we will compute $S^2$ and $S_n^2$ and compare the results. We take sample sizes of $n = 5, 10, 25, 50$ and $100$. For each sample size, we compute $S^2$ and $S_n^2$ 15000 times. Lastly, we average the results for each sample size.

SIZES = [5, 10, 25, 50, 100]
N = 15000

# For each sample size, average N biased estimates S_n^2 (numpy's default, ddof=0)
Sn_s = []
for SIZE in tqdm(SIZES):
    size_res = []
    for i in tqdm(range(N)):
        size_res.append(np.std(np.random.choice(a, size=SIZE))**2)
    Sn_s.append(np.sum(np.array(size_res)) / N)
Sn_s = pd.Series(Sn_s, index=SIZES)

# For each sample size, average N unbiased estimates S^2 (ddof=1)
S_s = []
for SIZE in tqdm(SIZES):
    size_res = []
    for i in tqdm(range(N)):
        size_res.append(np.std(np.random.choice(a, size=SIZE), ddof=1)**2)
    S_s.append(np.sum(np.array(size_res)) / N)

S_s = pd.Series(S_s, index=SIZES)
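As an aside, the two loops above differ only in the ddof argument, so they could be folded into a single helper; here is a sketch (the function name mean_variance_estimate is mine, not part of the original post) that produces statistically equivalent results:

def mean_variance_estimate(population, size, ddof, n_repeats=N):
    # Draw `n_repeats` samples (with replacement) of the given size and
    # average the resulting variance estimates computed with `ddof`.
    estimates = [np.std(np.random.choice(population, size=size), ddof=ddof) ** 2
                 for _ in range(n_repeats)]
    return np.mean(estimates)

# Equivalent to the two loops above:
# Sn_s = pd.Series([mean_variance_estimate(a, s, ddof=0) for s in SIZES], index=SIZES)
# S_s = pd.Series([mean_variance_estimate(a, s, ddof=1) for s in SIZES], index=SIZES)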

Let's collect the results into a table.

res_df = pd.concat([Sn_s, S_s], axis=1, keys=['S_n^2', 'S^2'])
res_df['sigma^2'] = pd.Series(np.std(a) ** 2 * np.ones(res_df.shape[0]), index=res_df.index)
res_df
          S_n^2         S^2  sigma^2
5    665.030288  837.021747   833.25
10   748.445851  836.054854   833.25
25   800.563841  832.693469   833.25
50   816.226962  830.447212   833.25
100  825.145818  833.501119   833.25
res_df.plot(figsize=(20,5), marker='o')
plt.xticks(res_df.index);

[Figure: averaged $S_n^2$ and $S^2$ against sample size, with $\sigma^2$ as reference]

Recall that $\sigma^2 = 833.25$. It is clear that the $n-1$ normalization yields a better approximation of the variance of the population. This approximation is already fairly accurate when using a mere 5% of the population as a sample.

Summary

So, if you know that the series of values you have represents ALL members of your population, you should go ahead and use $\sigma^2$ (using either np.std or pandas.std(ddof=0)). If, however, your series of values is merely a sample of a bigger population, then you should use the unbiased estimator of $\sigma^2$, denoted by $S^2$ and computed by either np.std(ddof=1) or pandas.std.
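To make this rule of thumb concrete, a small wrapper (the function variance below is hypothetical, not from any library) could encode the decision:

def variance(values, full_population=False):
    # Population variance (divide by n) when `values` covers the whole
    # population; otherwise the unbiased estimator S^2 (divide by n-1).
    return np.var(values, ddof=0 if full_population else 1)

variance(a, full_population=True)  # 833.25, i.e. sigma^2
variance(a)                        # 841.666..., i.e. S^2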

An experiment

As a mathematician, I now ask myself how estimators with other denominators would behave. For a denominator $m$, let

$$D_m^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{m}$$

In particular, $D_n^2 = S_n^2$ and $D_{n-1}^2 = S^2$. Note that passing ddof=d to np.std computes $D_{n-d}^2$, so the loop below covers $d = -4, \dots, 4$.

res = []
# ddof=n means the denominator is SIZE - n, i.e., the estimator D_{SIZE-n}^2
for n in tqdm(np.arange(-4, 5)):
    Dn_s = []
    for SIZE in tqdm(SIZES):
        size_res = []
        for i in tqdm(range(N)):
            size_res.append(np.std(np.random.choice(a, size=SIZE), ddof=n)**2)
        Dn_s.append(np.sum(np.array(size_res)) / N)
    Dn_s = pd.Series(Dn_s, index=SIZES, name=str(n))
    res.append(Dn_s)
df_res = pd.DataFrame(res).T
df_res.plot(logy=True, figsize=(20,5))
plt.xticks(df_res.index);

[Figure: averaged $D_{n-d}^2$ for $d = -4, \dots, 4$ against sample size, log-scaled y-axis]