I have a large dataset, on the order of 2^15 entries, and I calculate the confidence interval of the mean of the entries with scipy.stats.bootstrap. For a dataset of this size, this takes about 6 seconds on my laptop. I have many datasets, so this takes too long overall (especially when I just want to do a quick test run to debug the plotting etc.). By default, SciPy's bootstrap function resamples the data n_resamples=9999 times. As I understand it, the resampling and computing the mean of each resample should be the most time-consuming part of the process. However, when I reduce the number of resamples by roughly three orders of magnitude (n_resamples=10), the runtime of the bootstrap does not even halve.
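For reference, this is what I had in mind as the expensive part: a bare-bones percentile bootstrap that only resamples with replacement and averages. It is just a sketch of my mental model, not what SciPy actually does (it computes no bias correction, so its interval won't exactly match scipy.stats.bootstrap's default output), but by construction its cost is proportional to n_resamples:

import numpy as np
from time import perf_counter

rng = np.random.default_rng()

def percentile_bootstrap_ci(sample, n_resamples=9999, confidence_level=0.95):
    # Resample with replacement one draw at a time (analogous to batch=1)
    # and record the mean of each resample.
    means = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, sample.size, size=sample.size)
        means[i] = sample[idx].mean()
    # Plain percentile interval of the bootstrap distribution of the mean.
    alpha = (1 - confidence_level) / 2
    return np.quantile(means, [alpha, 1 - alpha])

data = np.random.rand(2**15)
start = perf_counter()
print(percentile_bootstrap_ci(data), perf_counter() - start)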
I'm using Python 3 and SciPy 1.9.3.
import numpy as np
from scipy import stats
from time import perf_counter

data = np.random.rand(2**15)
data = np.array([data])  # bootstrap expects a sequence of samples

# 9999 resamples, batch=1
start = perf_counter()
bs = stats.bootstrap(data, np.mean, batch=1, n_resamples=9999)
end = perf_counter()
print(end - start)

# 10 resamples, batch=1
start = perf_counter()
bs = stats.bootstrap(data, np.mean, batch=1, n_resamples=10)
end = perf_counter()
print(end - start)

# 10 resamples, default batch
start = perf_counter()
bs = stats.bootstrap(data, np.mean, n_resamples=10)
end = perf_counter()
print(end - start)
gives
6.021066904067993
3.9989020824432373
30.46708607673645
To speed up the bootstrapping, I have set batch=1. As I understand it, this is more memory efficient and prevents the data from being swapped out. Using a larger batch (the default, in the third run) increases the runtime, as you can see above.
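My mental model of batch (an assumption based on the parameter description, not on reading SciPy's internals) is that it controls how many resamples are generated and evaluated at once, which would translate roughly into these memory footprints for my data:

import numpy as np

n = 2**15                                   # entries per dataset
n_resamples = 9999
itemsize = np.dtype(np.float64).itemsize    # 8 bytes per value

# batch=None (the default): all resamples materialised at once, as I understand it
print(f"default batch: ~{n_resamples * n * itemsize / 1e9:.1f} GB of resampled data")

# batch=1: only a single resample is held in memory at a time
print(f"batch=1:       ~{n * itemsize / 1e3:.0f} kB per batch")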
How can I make the bootstrapping faster?