(Original notebook can be found in this gist)
I recently ran the following experiment because I needed to perform an operation on two columns. Think, for example, of feature engineering: you want to produce a new feature (i.e. a column) which is the ratio of two other columns.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random(size=(10000,2)), columns=['foo', 'bar'])
df.head()
| | foo | bar |
|---|---|---|
| 0 | 0.065648 | 0.593402 |
| 1 | 0.592051 | 0.123502 |
| 2 | 0.621030 | 0.629470 |
| 3 | 0.210630 | 0.462535 |
| 4 | 0.263708 | 0.807304 |
Iterating over the rows
Here we iterate over the rows with iterrows, which yields (index, row) pairs, and look up each column by its label in the pd.Series
which represents the row.
%%timeit
pd.Series([
x[1]['foo'] * x[1]['bar'] for x in df.iterrows()
])
518 ms ± 60 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Applying a function
Here we use pd.DataFrame.apply
and provide axis=1, so that the function is called once per row.
This is actually the solution I had to use in a case whose details I no longer remember.
%%timeit
df.apply(lambda x: x['foo'] * x['bar'], axis=1)
299 ms ± 43.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Column operations
The best and fastest approach, at least for this simple case: operate on whole columns at once.
%%timeit
df.foo.mul(df.bar)
97.2 µs ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
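The same column-wise style covers the ratio feature mentioned at the start. The following is a minimal sketch rather than part of the original notebook: the column name `ratio`, the seeded generator and the frame size are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Assumed setup: a small, seeded frame so the example is reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random(size=(1000, 2)), columns=["foo", "bar"])

# The "/" operator works element-wise on whole columns,
# just like .mul() does for multiplication.
df["ratio"] = df["foo"] / df["bar"]  # "ratio" is a hypothetical feature name

# Series.div is the method form of the same operation.
same_as_div = df["ratio"].equals(df["foo"].div(df["bar"]))
print(same_as_div)
```

Bracket access (`df["foo"]`) is used here instead of attribute access (`df.foo`) because it also works for column names that are not valid Python identifiers or that clash with DataFrame attributes.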
Summary
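For a simple multiplication, the column operation is clearly the most pythonic option. At about 97 µs, against roughly 300 ms for apply and 518 ms for iterrows, it is also the fastest by more than three orders of magnitude. Still, an edge case pushed me to run this test, and maybe someone will find it helpful.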
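As a sanity check, the three approaches can be compared on a small frame to confirm that they differ only in speed and not in result. This is a sketch under assumptions: the variable names, the seed and the frame size are illustrative and do not come from the original notebook.

```python
import numpy as np
import pandas as pd

# Assumed setup: a small, seeded frame with the same column names as above.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random(size=(100, 2)), columns=["foo", "bar"])

# 1. Iterating over the rows.
by_rows = pd.Series([row["foo"] * row["bar"] for _, row in df.iterrows()])

# 2. Applying a function row by row.
by_apply = df.apply(lambda row: row["foo"] * row["bar"], axis=1)

# 3. Column operation.
by_columns = df["foo"].mul(df["bar"])

# Each call raises an AssertionError if the results differ.
pd.testing.assert_series_equal(by_rows, by_columns, check_names=False)
pd.testing.assert_series_equal(by_apply, by_columns, check_names=False)
```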