Benchmarking Column Operations

Tue 27 June 2017
DS
#python, #pandas

(Original notebook can be found in this gist)

I recently ran the following experiment because I needed to perform operations on two columns. Think, for example, of feature engineering: you want to produce a new feature (i.e. a column) that is the ratio of two other columns.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.random(size=(10000,2)), columns=['foo', 'bar'])

df.head()

	foo	bar
0	0.065648	0.593402
1	0.592051	0.123502
2	0.621030	0.629470
3	0.210630	0.462535
4	0.263708	0.807304

Iterating over the rows

Here we iterate over the rows with iterrows, which yields (index, row) pairs. x[1] is the pd.Series representing the row, and we look up its values by column name.

%%timeit
pd.Series([
    x[1]['foo'] * x[1]['bar'] for x in df.iterrows()
])

518 ms ± 60 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
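To make the x[1] indexing less cryptic, here is a small sketch of what iterrows hands back. The three-row frame `small` and the variable names are only for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# A tiny frame, just for illustration (hypothetical, not the benchmark data)
small = pd.DataFrame(np.random.random(size=(3, 2)), columns=['foo', 'bar'])

# Each item produced by iterrows is a tuple: (index label, row as pd.Series)
first = next(small.iterrows())
idx, row = first

# The row Series is indexed by the column names
print(idx, list(row.index))

# Same computation as above, with the tuple unpacked instead of x[1]
products = pd.Series([r['foo'] * r['bar'] for _, r in small.iterrows()])
print(products)
```

Building a new pd.Series for every row is a large part of why this approach is slow.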

Applying a function

Here we use pd.DataFrame.apply with axis=1, so the function is called once per row. This is actually the solution I had to use in a case whose details I no longer remember.

%%timeit
df.apply(lambda x: x['foo'] * x['bar'], axis=1)

299 ms ± 43.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Column operations

The best and fastest approach, at least for this simple case: operate on whole columns at once.

%%timeit
df.foo.mul(df.bar)

97.2 µs ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
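A speed comparison only makes sense if all three approaches compute the same thing. The following sketch checks that on a smaller frame. The 1000-row size and the `by_*` variable names are assumptions for illustration, and the plain `*` operator is added as an equivalent spelling of mul.

```python
import numpy as np
import pandas as pd

# Smaller frame than in the benchmark, to keep the check quick (assumption)
df = pd.DataFrame(np.random.random(size=(1000, 2)), columns=['foo', 'bar'])

# The three approaches from this post
by_iterrows = pd.Series([row['foo'] * row['bar'] for _, row in df.iterrows()])
by_apply = df.apply(lambda row: row['foo'] * row['bar'], axis=1)
by_mul = df.foo.mul(df.bar)

# The arithmetic operator is equivalent to mul for this case
by_operator = df['foo'] * df['bar']

# Compare the underlying values of each result against the column-wise one
for result in (by_iterrows, by_apply, by_operator):
    assert np.allclose(result.values, by_mul.values)
```

The same pattern covers the ratio feature mentioned at the top: df['foo'] / df['bar'], or df.foo.div(df.bar).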

Summary

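Obviously, for a simple multiplication, the last option is the most pythonic and also the fastest: roughly 97 µs, versus 299 ms for apply and 518 ms for iterrows, a difference of more than three orders of magnitude. Still, due to an edge case, I decided to run this test. Maybe someone will find it helpful.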
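The %%timeit magic only works in IPython or Jupyter. To rerun the comparison from a plain script, the standard library's timeit module does a similar job. This is a rough sketch: the 1000-row frame, the repeat counts, and the `candidates` and `timings` names are assumptions chosen to keep the run short, so the absolute numbers will differ from the ones above.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller frame than in the notebook so the script finishes quickly (assumption)
df = pd.DataFrame(np.random.random(size=(1000, 2)), columns=['foo', 'bar'])

# One zero-argument callable per approach
candidates = {
    'iterrows': lambda: pd.Series(
        [row['foo'] * row['bar'] for _, row in df.iterrows()]
    ),
    'apply': lambda: df.apply(lambda row: row['foo'] * row['bar'], axis=1),
    'mul': lambda: df.foo.mul(df.bar),
}

calls = 5  # calls per repeat (arbitrary choice)
timings = {}
for name, func in candidates.items():
    # Best of 3 repeats, converted to seconds per single call
    timings[name] = min(timeit.repeat(func, repeat=3, number=calls)) / calls

for name, seconds in timings.items():
    print(f'{name:>8}: {seconds * 1e3:.3f} ms per call')
```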