I have the following dataframe:
   index  user  default_shipping_cost     category  shipping_cost  shipping_coalesce  estimated_shipping_cost
0      0     1                      1      clothes            NaN                1.0                      6.0
1      1     1                      1  electronics            2.0                2.0                      6.0
2      2     1                     15    furniture            NaN               15.0                      6.0
3      3     2                     15    furniture            NaN               15.0                     15.0
4      4     2                     15    furniture            NaN               15.0                     15.0
Per user, I want to combine shipping_cost with default_shipping_cost and calculate the mean of the combined values, but only if that user has at least one shipping_cost given. Otherwise the result should be NaN.
Explanation:
- user_1: shipping_cost is given (at least once), so we can calculate the mean
- user_2: there are no shipping_cost values, so I would like to go with NaN
Code:
import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option('display.width', 1000)

df = pd.DataFrame(
    {'user': [1, 1, 1, 2, 2],
     'default_shipping_cost': [1, 1, 15, 15, 15],
     'category': ['clothes', 'electronics', 'furniture', 'furniture', 'furniture'],
     'shipping_cost': [None, 2, None, None, None]
     })
df.reset_index(inplace=True)

# fall back to default_shipping_cost where shipping_cost is missing
df['shipping_coalesce'] = df.shipping_cost.combine_first(df.default_shipping_cost)

# per-user mean of the combined column (currently also filled in for user 2)
dfg_user = df.groupby(['user'])
df['estimated_shipping_cost'] = dfg_user['shipping_coalesce'].transform("mean")
print(df)
Expected output:
   index  user  default_shipping_cost     category  shipping_cost  shipping_coalesce  estimated_shipping_cost
0      0     1                      1      clothes            NaN                1.0                      6.0
1      1     1                      1  electronics            2.0                2.0                      6.0
2      2     1                     15    furniture            NaN               15.0                      6.0
3      3     2                     15    furniture            NaN               15.0                      NaN
4      4     2                     15    furniture            NaN               15.0                      NaN
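
One possible way to get this output, as a minimal sketch rather than a definitive answer: count the non-null shipping_cost values per user and keep the per-user mean only where that count is above zero. The names by_user and has_cost are made up for this illustration, and the sketch assumes that a missing shipping_cost always shows up as None/NaN.

```python
import pandas as pd

df = pd.DataFrame({
    'user': [1, 1, 1, 2, 2],
    'default_shipping_cost': [1, 1, 15, 15, 15],
    'category': ['clothes', 'electronics', 'furniture', 'furniture', 'furniture'],
    'shipping_cost': [None, 2, None, None, None],
})

# fall back to default_shipping_cost where shipping_cost is missing
df['shipping_coalesce'] = df['shipping_cost'].combine_first(df['default_shipping_cost'])

by_user = df.groupby('user')  # hypothetical helper name

# "count" ignores NaN, so this is True for every row of a user
# that has at least one shipping_cost given
has_cost = by_user['shipping_cost'].transform('count') > 0

# per-user mean of the combined column; rows where has_cost is False become NaN
df['estimated_shipping_cost'] = (
    by_user['shipping_coalesce'].transform('mean').where(has_cost)
)
print(df)
```

Using transform for both the count and the mean keeps each result aligned with the original rows, so the boolean mask can be applied directly without a merge.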