Dr. Dror

Foo is not just a "Bar"

Benchmarking Columns Operations


(Original notebook can be found in this gist)

I recently ran the following experiment because I needed to perform an operation on two columns. Think, for example, of feature engineering: you want to produce a new feature (i.e. a column) that is the ratio of two other columns. The benchmarks below use a product rather than a ratio, but the mechanics are the same.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random(size=(10000,2)), columns=['foo', 'bar'])
df.head()
        foo       bar
0  0.065648  0.593402
1  0.592051  0.123502
2  0.621030  0.629470
3  0.210630  0.462535
4  0.263708  0.807304
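
The ratio feature mentioned in the introduction can be sketched in the same setup. This is only a minimal illustration and not part of the original notebook: the column name foo_over_bar and the fixed seed are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Fixed seed so the example is reproducible (an assumption, not in the original).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random(size=(10000, 2)), columns=["foo", "bar"])

# New feature: element-wise ratio of the two existing columns.
# "foo_over_bar" is a hypothetical column name chosen for illustration.
df["foo_over_bar"] = df["foo"].div(df["bar"])

print(df.head())
```

The same pattern works for any element-wise operation on two columns, which is what the benchmarks in this post compare.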

Iterating over the rows

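Here we iterate over the rows with df.iterrows(), which yields (index, row) pairs. That is why the code uses x[1]: it is the pd.Series representing the row, and we access its entries by column name.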

%%timeit
pd.Series([
    x[1]['foo'] * x[1]['bar'] for x in df.iterrows()
])
518 ms ± 60 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
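
To make the x[1] indexing less cryptic, here is a small sketch of what df.iterrows() hands back. It uses a tiny five-row frame (my choice for the example) and unpacks each pair into hypothetical names index_label and row instead of indexing into the tuple.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(5, 2)), columns=["foo", "bar"])

# iterrows() yields (index_label, row) pairs; each row is a pd.Series
# whose index holds the column names.
index_label, row = next(df.iterrows())
print(index_label, type(row).__name__, list(row.index))

# Unpacking the pair does the same job as x[1] in the benchmark above.
products = pd.Series([r["foo"] * r["bar"] for _, r in df.iterrows()])
print(products)
```

Building a new pd.Series for every row is a plausible reason why this approach is the slowest of the three.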

Applying a function

Here we use pd.DataFrame.apply with axis=1, so that the function is called once per row. This is actually the solution I had to use in a case whose details I no longer remember.

%%timeit
df.apply(lambda x: x['foo'] * x['bar'], axis=1)
299 ms ± 43.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
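
For anything longer than a one-liner, a named function reads better than a lambda. The sketch below assumes a small frame and a hypothetical helper called multiply_row; with axis=1, each call receives one row as a pd.Series.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(5, 2)), columns=["foo", "bar"])


def multiply_row(row):
    """Return foo * bar for one row (hypothetical helper for illustration)."""
    # With axis=1, `row` is a pd.Series indexed by the column names.
    return row["foo"] * row["bar"]


result = df.apply(multiply_row, axis=1)
print(result)
```

The result is a pd.Series that shares the index of df, so it can be assigned straight back as a new column.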

Columns operations

Operating on whole columns at once is the best and fastest approach, at least for this simple case.

%%timeit
df.foo.mul(df.bar)
97.2 µs ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
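
As a sanity check, the following sketch confirms that the column-wise version returns the same values as the row-wise apply version, and that the mul method and the * operator are interchangeable here. The variable names are my own and not from the original notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(10000, 2)), columns=["foo", "bar"])

via_method = df.foo.mul(df.bar)        # method form, as in the benchmark
via_operator = df["foo"] * df["bar"]   # operator form on the same columns

# Both forms operate on whole columns and should give identical results.
pd.testing.assert_series_equal(via_method, via_operator)

# The row-wise apply version should agree too; it just takes much longer.
via_apply = df.apply(lambda x: x["foo"] * x["bar"], axis=1)
pd.testing.assert_series_equal(via_method, via_apply)
```

Bracket access (df["foo"]) is the safer habit in general, since attribute access (df.foo) only works for column names that are valid Python identifiers and do not clash with DataFrame attributes.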

Summary

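Obviously, for a simple multiplication, the last option is the most Pythonic, and also the fastest: about 97 µs, compared with 299 ms for apply and 518 ms for iterrows, which makes it roughly 3,000 and 5,000 times faster respectively. Still, because of an edge case I ran into, I decided to run this test. Maybe someone will find it helpful.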
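
The %%timeit cells only work inside IPython or Jupyter. For anyone who wants to repeat the comparison as a plain script, here is a rough sketch using the standard timeit module. The frame size (1,000 rows), the number of repetitions and the dictionary labels are assumptions chosen to keep the run short, so the absolute numbers will differ from the ones reported above.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller frame than in the post, to keep the script quick (an assumption).
df = pd.DataFrame(np.random.random(size=(1000, 2)), columns=["foo", "bar"])

# Hypothetical labels for the three approaches compared in this post.
candidates = {
    "iterrows": lambda: pd.Series(
        [row["foo"] * row["bar"] for _, row in df.iterrows()]
    ),
    "apply": lambda: df.apply(lambda row: row["foo"] * row["bar"], axis=1),
    "columns": lambda: df["foo"].mul(df["bar"]),
}

repetitions = 5
timings = {}
for name, func in candidates.items():
    # timeit.timeit returns the total seconds for `number` calls.
    timings[name] = timeit.timeit(func, number=repetitions) / repetitions
    print(f"{name}: {timings[name] * 1e3:.3f} ms per call")
```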