I'm working with a fairly large amount of (horse racing!) data for a project, calculating rolling sums of values for various combinations of data - so I need to streamline it as much as possible.
Essentially I am:
- computing a rolling mean of a points field over time
- doing this for various grouped combinations of the data [in this case the combination of horse and trainer]
- taking, for each row, the group's average of the points value over the preceding 180 days
The rolling window calculation below works fine, but takes 8.2s on roughly 1/8 of the total dataset - so the full run would take about 1m 5s. Since I need to repeat this for a number of different combinations of data, speed is of the essence, and I'm looking for ideas on how to streamline the calculation. Thanks.
```python
import pandas as pd
import time

url = 'https://raw.githubusercontent.com/richsdixon/testdata/main/testdata.csv'
df = pd.read_csv(url, parse_dates=True)
df['RaceDate'] = pd.to_datetime(df['RaceDate'], format='mixed')
df.sort_values(by='RaceDate', inplace=True)

df['HorseRaceCount90d'] = (
    df.groupby(['Horse', 'Trainer'], group_keys=False)
      .apply(lambda x: x.rolling(window='180D', on='RaceDate', min_periods=1)['Points'].mean())
)
```
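One direction I've seen suggested (sketched below on a tiny synthetic frame, since the real CSV isn't reproduced here - the column names match my data, the values are made up) is to drop the per-group `apply`/lambda and call `.rolling()` directly on the groupby object, which keeps the work inside pandas' compiled code path rather than a Python-level function per group:

```python
import pandas as pd

# Small synthetic stand-in for the CSV; column names match the question,
# values are invented purely for illustration.
df = pd.DataFrame({
    'RaceDate': pd.to_datetime(['2020-01-01', '2020-03-01', '2021-01-01', '2020-06-01']),
    'Horse':   ['A', 'A', 'A', 'B'],
    'Trainer': ['X', 'X', 'X', 'Y'],
    'Points':  [10.0, 20.0, 30.0, 40.0],
})

df.sort_values('RaceDate', inplace=True)

# Time-based rolling windows need a monotonic DatetimeIndex within each group,
# which the sort above guarantees. groupby(...).rolling(...) avoids the
# Python-level lambda of groupby(...).apply(...).
rolled = (df.set_index('RaceDate')
            .groupby(['Horse', 'Trainer'])['Points']
            .rolling('180D', min_periods=1)
            .mean())

# rolled is indexed by (Horse, Trainer, RaceDate): for horse A / trainer X the
# 2020-03-01 row averages the two races inside its 180-day window, while the
# 2021-01-01 race is more than 180 days later and stands alone.
print(rolled)
```

To put the result back on the original frame, the group levels can be dropped with `rolled.reset_index(level=['Horse', 'Trainer'], drop=True)` and the values aligned to the sorted row order. Whether this actually beats the `apply` version on the full dataset is something I'd want to time, but it's the usual first step.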