Declarative dataframe transformations with pipe

Loving this pattern for keeping pandas transformations organized: use .pipe() to pass a DataFrame or Series through a sequence of transformations (example adapted from the docs):

# df is assumed to be an existing all-numeric DataFrame (or Series)

def subtract_federal_tax(df):
    # flat 10% rate
    return df * 0.9

def subtract_state_tax(df, rate):
    return df * (1 - rate)

def subtract_national_insurance(df, rate, rate_increase):
    new_rate = rate + rate_increase
    return df * (1 - new_rate)

transformed_df = (
    df.copy()  # if necessary
    .pipe(subtract_federal_tax)
    .pipe(subtract_state_tax, rate=0.12)
    .pipe(subtract_national_insurance, rate=0.05, rate_increase=0.02)
    # etc...
)

Basically…

  1. The type signature of every transformation function in the chain should be pd.DataFrame -> pd.DataFrame or pd.Series -> pd.Series
  2. The data should be the first parameter of each function (otherwise you have to pass .pipe() an awkward (function, "keyword_name") tuple so it knows where to put the data)
  3. Each transformation can do whatever you like, as long as it returns the updated DataFrame or Series the next step is expecting
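For reference, here is a small sketch of the tuple form that point 2 is trying to avoid. The function apply_rate, its parameter names, and the sample data are all made up for illustration; the only pandas feature being shown is that .pipe() accepts a (callable, keyword_name) tuple.

```python
import pandas as pd

def apply_rate(rate, data):
    # Hypothetical helper whose data parameter is NOT first
    return data * (1 - rate)

# Made-up sample data
df = pd.DataFrame({"salary": [1000.0, 2000.0]})

# Tuple form: (callable, name of the keyword argument that should receive the data)
result = df.pipe((apply_rate, "data"), rate=0.5)
```

It works, but putting the data first and writing .pipe(func, rate=0.5) reads a lot cleaner.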

You’ll need to choose pd.DataFrame -> pd.DataFrame if you need data from multiple columns to resolve the values in the column you’re transforming. You might also prefer defaulting to it as a global pattern for consistency (even when only referencing one column).
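A rough sketch of the two signatures side by side, assuming a made-up payroll frame with gross and tax_rate columns (the function names are hypothetical too):

```python
import pandas as pd

def add_net_pay(df):
    # DataFrame -> DataFrame: needs two columns to compute one
    return df.assign(net=df["gross"] * (1 - df["tax_rate"]))

def halve(s):
    # Series -> Series: only ever looks at its own values
    return s * 0.5

# Made-up sample data
payroll = pd.DataFrame({"gross": [1000.0, 2000.0], "tax_rate": [0.5, 0.25]})

with_net = payroll.pipe(add_net_pay)         # whole frame goes through the step
halved_gross = payroll["gross"].pipe(halve)  # a single column goes through the step
```

Using assign here returns a new frame with the extra column, so the original payroll frame is left as it was.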

Questions:

  • Performance penalty? Are intermediate copies created at each step?
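One rough way to start poking at this (a sketch, not a benchmark): as far as I understand it, .pipe() is essentially a thin wrapper that calls func(df, *args, **kwargs), so whether a step copies data depends on what the function itself does. The check below uses numpy.shares_memory to see whether the output of one arithmetic step reuses the input's buffer. The column name and values are made up.

```python
import numpy as np
import pandas as pd

def subtract_federal_tax(df):
    return df * 0.9

# Made-up sample data
df = pd.DataFrame({"salary": [1000.0, 2000.0]})
out = df.pipe(subtract_federal_tax)

# Does the result reuse the input's memory, or was a new array allocated?
same_buffer = np.shares_memory(df["salary"].to_numpy(), out["salary"].to_numpy())

# Is the input left unmodified by the step?
original_untouched = df["salary"].tolist() == [1000.0, 2000.0]

print(same_buffer, original_untouched)
```

If same_buffer comes back False, that step allocated fresh memory for its result, which would suggest every arithmetic step in the chain produces a new intermediate object. Whether that matters in practice probably depends on the size of the data, so timing a real pipeline would be the next thing to try.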