Loving this pattern for keeping pandas DataFrame transformations organized: use .pipe()
to pass either a DataFrame or a Series through a sequence of transformations (example adapted from the docs):
def subtract_federal_tax(df):
    return df * 0.9

def subtract_state_tax(df, rate):
    return df * (1 - rate)

def subtract_national_insurance(df, rate, rate_increase):
    new_rate = rate + rate_increase
    return df * (1 - new_rate)

transformed_df = (
    df.copy()  # if necessary
    .pipe(subtract_federal_tax)
    .pipe(subtract_state_tax, rate=0.12)
    .pipe(subtract_national_insurance, rate=0.05, rate_increase=0.02)
    # etc...
)
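To see what the chain buys you, here is a small sketch comparing it with the equivalent nested calls. The sample data and the column names (salary, bonus) are made up for illustration, and two of the functions are repeated so the snippet runs on its own:

```python
import pandas as pd

def subtract_state_tax(df, rate):
    return df * (1 - rate)

def subtract_national_insurance(df, rate, rate_increase):
    new_rate = rate + rate_increase
    return df * (1 - new_rate)

# Made-up sample data, purely for illustration.
df = pd.DataFrame({"salary": [1000.0, 2000.0], "bonus": [100.0, 200.0]})

# Nested calls read inside-out...
nested = subtract_national_insurance(
    subtract_state_tax(df, rate=0.12), rate=0.05, rate_increase=0.02
)

# ...while .pipe() reads top-to-bottom, in the order the steps run.
piped = (
    df
    .pipe(subtract_state_tax, rate=0.12)
    .pipe(subtract_national_insurance, rate=0.05, rate_increase=0.02)
)

# The same function also works on a single column (a Series).
series_piped = df["salary"].pipe(subtract_state_tax, rate=0.12)
```

Both versions should produce the same values; the piped one is just easier to scan and to reorder.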
Basically…
- The type signature of every transformation function in the chain should be pd.DataFrame -> pd.DataFrame or pd.Series -> pd.Series
- The data should be the first arg the function takes (otherwise you have to hand .pipe() an awkward (function, "arg_name") tuple naming the keyword that receives the data)
- Each transformation can do whatever you like, as long as it returns the updated DataFrame or Series the next step is expecting
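For the "awkward tuple" point, here is a minimal sketch of both call styles. The function names apply_rate and apply_rate_data_first and the sample data are hypothetical, made up just for this comparison:

```python
import pandas as pd

def apply_rate(rate, data):
    # Hypothetical helper where the data is NOT the first parameter.
    return data * (1 - rate)

def apply_rate_data_first(data, rate):
    # Same logic, but with the data as the first parameter.
    return data * (1 - rate)

# Made-up sample data.
df = pd.DataFrame({"salary": [100.0, 200.0]})

# Data is not first, so .pipe() needs a (callable, keyword) tuple
# naming the parameter that should receive the DataFrame.
via_tuple = df.pipe((apply_rate, "data"), rate=0.1)

# Data is first, so the plain form works.
via_first_arg = df.pipe(apply_rate_data_first, rate=0.1)
```

The two results should match; putting the data first just keeps every step in the chain looking the same.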
You’ll need to choose pd.DataFrame -> pd.DataFrame if you need data from multiple columns to resolve the values in the column you’re transforming. You might also prefer defaulting to it as a global pattern for consistency (even when only referencing one column).
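A rough sketch of what that can look like, assuming a made-up payroll table with hypothetical gross and tax_rate columns. The first step needs two columns; the second only touches one but keeps the same DataFrame -> DataFrame shape:

```python
import pandas as pd

def add_net_pay(df):
    # Needs two columns, so it has to take the whole DataFrame.
    return df.assign(net=df["gross"] * (1 - df["tax_rate"]))

def round_net_pay(df, decimals=2):
    # Only touches one column, but keeps the same signature for consistency.
    return df.assign(net=df["net"].round(decimals))

# Made-up sample data.
payroll = pd.DataFrame({"gross": [1000.0, 2000.0], "tax_rate": [0.25, 0.5]})

result = (
    payroll
    .pipe(add_net_pay)
    .pipe(round_net_pay, decimals=2)
)
```

Using .assign() here means each step returns a new DataFrame and leaves payroll itself without the net column.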
Questions:
- Performance penalty? Are intermediate copies created at each step?
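
One way to start poking at this, as a sketch rather than an answer: check whether a step's output shares memory with its input. The sample data is made up, and this only tells you what the arithmetic inside the step does. What .pipe() itself does around the call may differ between pandas versions and Copy-on-Write settings, so treat that part as untested here.

```python
import numpy as np
import pandas as pd

def subtract_federal_tax(df):
    return df * 0.9

# Made-up sample data.
df = pd.DataFrame({"salary": [1000.0, 2000.0]})

result = df.pipe(subtract_federal_tax)

# The multiplication builds a new DataFrame with its own data buffer.
is_new_object = result is not df
shares_buffer = np.shares_memory(
    df["salary"].to_numpy(), result["salary"].to_numpy()
)

print(is_new_object, shares_buffer)
```

If shares_buffer comes back False, then a step written as df * something allocates a fresh result each time, regardless of whether it is called through .pipe() or directly.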