I have a dataset called transactions representing shopping carts that I've gotten into the following format:
      member       Date                                            V1
    1   1000 15-03-2015 sausage,whole milk,semi-finished bread,yogurt
    2   1000 24-06-2014                 whole milk,pastry,salty snack
    3   1000 24-07-2015                   canned beer,misc. beverages
    4   1001 25-11-2015                      sausage,hygiene articles
    5   1001 27-05-2015                       soda,pickled vegetables
    6   1001 02-05-2015                              frankfurter,curd
What I need is something that looks like canonical sparse matrix cart data:
    cart sausage whole milk bread yogurt frankfurter  #many more cols
       1    TRUE       TRUE  TRUE   TRUE       FALSE
After a few hours of struggling, I'm currently doing this in a very non-R way. My data frame, transactions, holds all of my 'shopping events' in the format shown in the first code block above.
    ll <- unique(unlist(strsplit(paste0(transactions$V1, collapse=","), ',')))
    txn_df <- data.frame()
    txn_df[c(ll, "cart")] <- list(character(0))

    build_carts <- function(row){
      xs <- sapply(strsplit(row$V1, ","), trimws) # first `strsplit` by comma, then trim whitespace
      tmp <- data.frame(matrix(nrow=1, ncol = length(txn_df))) #new dataframe
      names(tmp) <- names(txn_df) #copy columns
      tmp$cart <- paste(row$Date, row$member, sep="_") #make a new cart ID
      #set present items to TRUE
      for (i in 1:length(xs)) {
        tmp[,which(colnames(tmp)==xs[i])] = TRUE
      }
      tmp <- replace(tmp, is.na(tmp), FALSE) # all other items false
      txn_df <<- rbind(txn_df, tmp) #copy to parent DF
    }

    res <- by(transactions, seq_len(nrow(transactions)), build_carts)
This works but is, as you'd imagine, very, very slow, presumably because every row triggers an rbind onto an ever-growing data frame. Is there a way to speed this up without going too deep into the tidyverse? For example, it would be great if the code could be at least partially legible to a tidyverse noob (for didactic purposes).
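For reference, here is a minimal base-R sketch of the kind of vectorized result I'm after. It is not a tested solution for the full dataset. The toy data frame only mimics the member, Date and V1 columns shown above, and the intermediate names items, all_items and m are made up for illustration. The idea is to split every basket once, collect the distinct items, and build the whole logical matrix in one go instead of growing a data frame row by row.

```r
# Toy data mirroring the layout shown above (assumed columns: member, Date, V1)
transactions <- data.frame(
  member = c(1000, 1000, 1001),
  Date   = c("15-03-2015", "24-06-2014", "25-11-2015"),
  V1     = c("sausage,whole milk,yogurt",
             "whole milk,pastry",
             "sausage,hygiene articles"),
  stringsAsFactors = FALSE
)

# Split each basket string on commas and trim stray whitespace
items <- lapply(strsplit(transactions$V1, ",", fixed = TRUE), trimws)

# Every distinct item becomes one column of the result
all_items <- unique(unlist(items))

# One logical row per cart: TRUE where that cart contains the column's item
m <- do.call(rbind, lapply(items, function(x) all_items %in% x))
colnames(m) <- all_items

# Attach the same cart ID used in the loop version (Date_member);
# check.names = FALSE keeps column names with spaces, e.g. "whole milk"
txn_df <- data.frame(
  cart = paste(transactions$Date, transactions$member, sep = "_"),
  m,
  check.names = FALSE,
  stringsAsFactors = FALSE
)

print(txn_df)
```

This still produces a dense logical data frame, so with thousands of distinct items a genuinely sparse representation would presumably be needed instead.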