Abstract

In this analysis we are given a log of customer transactions over a 9-month period, along with some survey information about the customers. Our goal is to predict the total amount of money each customer will spend in the following 3-month period. Accurately predicting customer attrition can be tricky. I was able to develop a model using random forests that rivaled the performance of the academic-standard approach of building a probability model. By ensembling the two sets of predictions, I was able to exceed the performance of both models during validation.

Exploratory Data Analysis

Let’s take a peek at the customer info and customer transaction log.

library(tidyverse)
library(lubridate)
library(forcats)

customer_info <- read_csv("customer_info.csv")
customer_preds <- read_csv("customer_preds.csv")
customer_trans_log <- read_csv("customer_trans_log.csv")

(customer_info %>% 
   mutate_if(is.character, factor) %>% 
   mutate(`Loyalty Program` = factor(`Loyalty Program`),
          `Shared Email` = factor(`Shared Email`)) -> customer_info)
customer_trans_log

Data Visualization

customer_info %>% 
  select_if(is.factor) %>% 
  gather() %>% 
  ggplot() +
  geom_bar(aes(x = value, fill = key)) +
  facet_wrap(~ key, scales = "free") +
  ggtitle("Frequency distribution of categorical variables") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        plot.title = element_text(color="grey", face="bold", size=16, hjust = .50, vjust = 3),
        legend.position="none")

By inspecting these variables we can get a preliminary idea about the customer base of this ecommerce retailer: the average customer is a middle-aged female who shops organic. The majority signed up for the loyalty program and shared their e-mail upon signing up, indicating that they intended to be repeat customers.

Normally, investigating the relationships between these variables would be a great place to begin an analysis. In the case of predicting customer churn, however, they are a red herring that will lead to dubious guesswork and speculation. We will not be exploring them further.

Sales Trend

Here we create a column for months so we can visualize how sales for this set of customers changed over time.

customer_trans_log %>% 
  mutate(month = month(date, label = TRUE) %>% fct_relevel("Sep", "Oct", "Nov", "Dec")) -> customer_trans_log

sales_growth <- customer_trans_log %>% 
  group_by(month) %>% 
  summarize(revenue = sum(sales),
            transactions = n()) %>% 
  gather(key, value, -month)

ggplot(sales_growth, aes(x = month)) +
  geom_line(aes(group = key, color = key, y = value)) +
  facet_wrap(~ key, scales = "free") +
  theme(legend.position = "none")

Customers began to drop off rather quickly after the initial purchase, but some loyal shoppers remained. Also notice that revenue is essentially a scaled version of transaction frequency, a fact that will simplify our predictive modeling.
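
As a quick sanity check of that claim (a sketch using the columns created above), revenue per transaction should be roughly constant from month to month:

# Sketch: if revenue is just a scaled version of transaction counts,
# monthly revenue per transaction should be nearly constant.
customer_trans_log %>% 
  group_by(month) %>% 
  summarize(revenue = sum(sales),
            transactions = n(),
            revenue_per_transaction = revenue / transactions)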

Segment Customers Based on Their RFM Characteristics

A customer base can be better understood using the following three quantities:

  • RECENCY: the time of the most recent transaction. We will represent this value as a time in seconds since the customer's initial purchase.
  • FREQUENCY: how many transactions the customer made during their lifetime
  • MONETARY VALUE: average transaction value for the customer
start_date <- ymd("2016-09-1")

customer_trans_log %>% 
  mutate(purchase_interval = interval(start_date, date) %>% int_length()) -> customer_trans_log

(customer_trans_log %>% 
  group_by(customer_id) %>% 
  summarize(most_recent = max(purchase_interval),
            transaction_frequency = n(),
            avg_transaction_value = mean(sales)) -> customer_data)

One approach to customer segmentation is to group customers by terciles of each of these quantities. Terciles are calculated using data for repeat customers only, with non-repeat customers belonging to their own group.

The RFM terciles are calculated below (a consolidated helper sketch follows the three code blocks):

  • Most recent transaction
(customer_data %>% 
  filter(most_recent != 0) %>% 
  summarize(terciles = list(quantile(most_recent, p = c(0, .33, .66, 1)))) %>% 
  unlist() -> trc)
  terciles.0%  terciles.33%  terciles.66% terciles.100% 
        86400       3955392      15724800      23500800 
customer_data %>% 
  mutate(recency = case_when(
    most_recent == 0 ~ 0,
    most_recent %>% between(trc[1], trc[2] - 1) ~ 1,
    most_recent %>% between(trc[2], trc[3] - 1) ~ 2,
    most_recent %>% between(trc[3], trc[4]) ~ 3
  )) -> customer_data
  • Transaction Frequency
(customer_data %>% 
  filter(most_recent != 0) %>% 
  summarize(terciles = list(quantile(transaction_frequency, p = c(0, .33, .66, 1)))) %>% 
  unlist() -> trc)
  terciles.0%  terciles.33%  terciles.66% terciles.100% 
            2             3             8           143 
customer_data %>% 
  mutate(frequency = case_when(
    most_recent == 0 ~ 0,
    transaction_frequency %>% between(trc[1], trc[2] - 1) ~ 1,
    transaction_frequency %>% between(trc[2], trc[3] - 1) ~ 2,
    transaction_frequency %>% between(trc[3], trc[4]) ~ 3
  )) -> customer_data
  • Average transaction value
(customer_data %>% 
  filter(most_recent != 0) %>% 
  summarize(terciles = list(quantile(avg_transaction_value, p = c(0, .33, .66, 1)))) %>% 
  unlist() -> trc)
  terciles.0%  terciles.33%  terciles.66% terciles.100% 
            4            22            35           499 
customer_data %>% 
  mutate(monetary_value = case_when(
    most_recent == 0 ~ 0,
    avg_transaction_value %>% between(trc[1], trc[2] - .0001) ~ 1,
    avg_transaction_value %>% between(trc[2], trc[3] - .0001) ~ 2,
    avg_transaction_value %>% between(trc[3], trc[4]) ~ 3
  )) -> customer_data
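
The three blocks above repeat the same tercile logic, so here is a small helper that consolidates it (a sketch only; it expects the columns in customer_data above and is not used in the rest of the analysis):

tercile_score <- function(x, one_timer, offset = 1) {
  # Terciles are computed on repeat customers only; one-time shoppers get 0.
  trc <- quantile(x[!one_timer], probs = c(0, .33, .66, 1))
  case_when(
    one_timer                           ~ 0,
    between(x, trc[1], trc[2] - offset) ~ 1,
    between(x, trc[2], trc[3] - offset) ~ 2,
    between(x, trc[3], trc[4])          ~ 3
  )
}

# Equivalent to the three blocks above (offset of .0001 for the continuous
# monetary value, 1 otherwise):
# customer_data %>% 
#   mutate(recency        = tercile_score(most_recent, most_recent == 0),
#          frequency      = tercile_score(transaction_frequency, most_recent == 0),
#          monetary_value = tercile_score(avg_transaction_value, most_recent == 0, offset = .0001))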

Now transform this data into a pivot table with 28 cells.

(customer_data %>% 
   group_by(recency, frequency, monetary_value) %>% 
   summarize(size = n(),
             expected_value = mean(avg_transaction_value)) %>%
   ungroup() %>% 
   mutate_at(c("recency", "frequency", "monetary_value"), factor) %>% 
   arrange(recency, frequency, monetary_value) -> RFM_table)

And render into a visualization:

ggplot(data = RFM_table %>% filter(recency != 0)) +
  geom_text(aes(x = recency, y = frequency, label = round(expected_value, 1), size = size, color = monetary_value)) +
  facet_wrap(~monetary_value) +
  theme(legend.position = "none")

Here is a visual representation of the relative value of our customer segments. Each customer can be represented as a point in RFM space \(c = (r, f, m) \in \{1, 2, 3\}^3\); one-time shoppers are assigned an RFM value of 0 and are not included in the graphic. The grid displays the average transaction value of the customers in each segment, and the size of the number reflects the size of that segment. This graphic allows us to visually compare the expected contributions of different segments to total revenue. It could be extended into a crude predictive model (see the sketch below), but otherwise we'll leave it as a pretty picture.
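
For illustration, here is what that crude model could look like (a sketch only, not used later): look up each customer's segment in RFM_table and treat the segment's average transaction value as a naive per-transaction estimate. The segment_values name is just illustrative.

# Sketch of a crude lookup model: join each customer to their segment's
# average transaction value. The factor keys in RFM_table are converted
# back to numeric so they match the columns in customer_data.
segment_values <- RFM_table %>% 
  mutate_at(c("recency", "frequency", "monetary_value"), ~ as.numeric(as.character(.)))

customer_data %>% 
  left_join(segment_values, by = c("recency", "frequency", "monetary_value")) %>% 
  select(customer_id, recency, frequency, monetary_value, expected_value)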

Predictive Models

Tri-Periodic Random Forests

Model Description

In a typical machine learning problem, we split the data into training and test sets, fit the model on the training set, and then predict using observations from the test set. The problem we are dealing with here is fundamentally different - a traditional train/test split doesn't make any sense. Instead we divide each customer's behavior into 3-month periods and treat each period as a new set of observations. Then we only need to find that customer's frequency and recency for each period. I will collate all the variables into one data frame, but conceptually we have a relational database with 4 tables, [P1, P2, P3, P4], each following the schema: <ID> <FREQUENCY> <RECENCY>.
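
To make that layout concrete, here is a small illustration of the collated data frame with hypothetical values (the real construction follows in Preparing the Data; P4 is the unobserved future period we want to predict, so it has no columns here):

# Hypothetical rows, for illustration only: one row per customer, with a
# frequency (p) and recency (R) column pair for each observed period.
tribble(
  ~customer_id, ~p1,    ~R1, ~p2,   ~R2, ~p3,     ~R3,
             1,   3, 604800,   1, 86400,   0,       0,
             2,   1, 172800,   0,     0,   2, 2419200
)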

The learning algorithm will work as follows:

  • Train on P1, supervised by P2, with the formula F2 ~ F1 + R1.
  • Use the trained model to predict F3, with F2 and R2 as the new data.
  • Compare to the actual results for F3.

It is "tri-periodic" because the algorithm requires three periods to function. When it comes time to make the final predictions, instead of refitting on the whole dataset we simply slide the window forward and fit the next tri-period (P2 through P4) - note that P1 is then not used at all. The validity of this set-up may be related to the memorylessness property of the exponential distribution; in a later section we see evidence that the inter-transaction times are exponentially distributed.

Preparing the Data

We essentially have to derive this entire relational database from one table with 2 columns! What follows is a herculean exercise in dplyr.

customer_trans_log %>% 
   select(-sales, -purchase_interval) %>% 
  mutate(period = fct_collapse(month,
                               p1 = c("Sep", "Oct", "Nov"),
                               p2 = c("Dec", "Jan", "Feb"),
                               p3 = c("Mar", "Apr", "May")) %>% 
           droplevels()) -> customer_trans_log
p1_start <- ymd("2016-09-01")
p2_start <- ymd("2016-12-01")
p3_start <- ymd("2017-03-01")

customer_trans_log %>% 
  group_by(period) %>% 
  count(customer_id) %>% 
  spread(key = period, value = n) -> frequency_by_period
customer_trans_log %>% 
  mutate(period_start = case_when(
    period == "p1" ~ p1_start,
    period == "p2" ~ p2_start,
    period == "p3" ~ p3_start
  )) -> customer_trans_log
customer_trans_log %>% 
  mutate(purchase_interval = interval(period_start, date) %>% int_length()) -> customer_trans_log
customer_trans_log %>% 
  group_by(period, customer_id) %>%
  summarize(recent = max(purchase_interval)) %>% 
  spread(key = period, value = recent) %>% 
  rename(R1 = p1,
         R2 = p2,
         R3 = p3) -> recency_by_period
(frequency_by_period %>% 
  full_join(recency_by_period, by = "customer_id") %>% 
  replace(is.na(.), 0) %>% 
  select(customer_id, p1, R1, p2, R2, p3, R3) -> freq_rec_data)

We now have our frequency-recency data table, where p counts the number of transactions a customer made during that period and R gives the time of their final purchase in that period, in seconds since the period began. If the customer made no purchases that period, R is 0.

Fitting the Random Forest

Here is an experiment to see whether including “recency” information actually improves prediction of frequency in the next period.

library(randomForest)
randomForest(p2 ~ p1 + R1, freq_rec_data) -> rf1
randomForest(p2 ~ p1, freq_rec_data) -> rf1_minus
freq_rec_data %>% 
  mutate(p2_hat = rf1$predicted,
         p2_hat_minus = rf1_minus$predicted) -> freq_rec_data

freq_rec_data %>% 
  ggplot() +
  geom_density(aes(x = p2_hat - p2), color = "red") +
  geom_density(aes(x = p2_hat_minus - p2), color = "blue") +
  ggtitle("Including recency improves predictions") +
  coord_cartesian(xlim = c(-1.5, 1.5)) +
  xlab("residual")

The residuals for the random forest model that includes both predictors have a much sharper peak at 0, indicating less error. Overall this approach looks promising.
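
A quick numeric complement to the density plot (a sketch): since rf1$predicted and rf1_minus$predicted are out-of-bag predictions, we can compare out-of-bag mean squared errors directly.

# Sketch: out-of-bag MSE with and without the recency predictor.
tibble(model   = c("p1 + R1", "p1 only"),
       oob_mse = c(mean((freq_rec_data$p2_hat - freq_rec_data$p2)^2),
                   mean((freq_rec_data$p2_hat_minus - freq_rec_data$p2)^2)))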

round2 <- tibble(p1 = freq_rec_data$p2,
                 R1 = freq_rec_data$R2)

predict(rf1, newdata = round2) -> p3_hat

freq_rec_data %>% 
  mutate(p3_hat = p3_hat) -> freq_rec_data
freq_rec_data %>% 
  ggplot(aes(x = (p3_hat - p3) / sd(p3_hat - p3))) +
  geom_density() +
  xlab("studentized residuals") +
  coord_cartesian(xlim = (c(-1.5, 1.5)))

We make the assumption that the average transaction value is unique to a customer and independent of all other factors. Therefore, to get the 3-month sales prediction for each customer, I just multiply each p3_hat prediction by that customer's average transaction value, which was calculated in an earlier section. To calculate residuals in dollar terms I would have had to aggregate sales for period 3; instead, we will wait until the Model Comparison section to really evaluate the performance of this model.

freq_rec_data %>% 
  mutate(avg_transaction_value = customer_data$avg_transaction_value,
         p3_sales_estimate = avg_transaction_value * p3_hat) -> freq_rec_data

Buy 'Til You Die

Our next modelling approach uses state-of-the-art statistical packages, courtesy of Zodiac co-founders Dr. McCarthy and Dr. Fader, authors of the BTYD package, and Dr. Platzer, author of BTYDplus. The BTYD family of models assumes that customers make purchases (buy) until they become inactive (die). Specifically, we will use the Pareto/NBD model, in which customers can make purchases at any time. Using four parameters, it describes the rate at which customers make purchases and the rate at which they drop out. We can obtain estimates for these parameters using only the variables of frequency and recency.

library(BTYDplus)
library(ggpubr)

customer_trans_log <- read_csv("customer_trans_log.csv")

(customer_trans_log %>% 
  rename(cust = customer_id) -> customer_trans_log)

We now divide the data set into a calibration (training) period and a holdout period. To validate our model, it makes sense to use periods 1 and 2 for calibration and period 3 as a holdout set. To prepare the data, the package provides a function that converts a customer event log to a CBS data frame.

calibration_end <- "2017-03-01"
(elog2cbs(customer_trans_log, T.cal = calibration_end) -> CBS)

Notes about the customer-by-sufficient-statistic (CBS) data-frame:

  • cust is the unique customer ID
  • x is the number of repeat transactions (frequency)
  • t.x is the time of the last recorded transaction, given in weeks elapsed since the first (recency)
  • litt is the sum over logarithmic intertransaction times (required for estimating regularity)
  • first is the date of the first transaction
  • T.cal is the duration between the first transaction and the end of the calibration period, in weeks
  • T.star is the length of the holdout period, in weeks
  • x.star is the number of repeat transactions during the holdout period

The package comes with functions to determine whether our data is a good candidate for Pareto/NBD modeling.

op <- par(mfrow = c(1, 2))
k.wheat <- estimateRegularity(customer_trans_log, method = "wheat",
                              plot = TRUE, title = "Wheat & Morrison")

k.mle <- estimateRegularity(customer_trans_log, method = "mle",
                            plot = TRUE, title = "Maximum Likelihood")

par(op)

The Wheat & Morrison regularity estimate: {r} k.wheat The Maximum Likelihood regularity estimate {r} k.mle

The "regularity" estimate refers to the shape parameter of a gamma distribution fit to the intertransaction times. Both return values are close to 1, supporting the assumption of exponentially distributed intertransaction times. The Wheat & Morrison plot compares the empirical intertransaction-time distribution against a theoretical exponential distribution.

Pareto/NBD Model

The Pareto/NBD (negative binomial distribution) model draws inferences about a customer's state based on their transaction frequency and the observed elapsed time since the customer's last activity. While active, a customer makes purchases according to a Poisson process with rate λ, where λ is Gamma(r, α)-distributed across customers. The model assumes a customer's lifetime τ to be exponentially distributed with parameter µ, where µ is Gamma(s, β)-distributed across customers. The package provides a function that estimates these parameters from the CBS frame.

params.pnbd <- BTYD::pnbd.EstimateParameters(CBS)
names(params.pnbd) <- c("r", "alpha", "s", "beta")
round(params.pnbd, 3)
    r alpha     s  beta 
 0.58  2.02  0.69  5.67 

r and alpha are unobserved parameters for the NBD transaction process. s and beta are unobserved parameters for the Pareto (exponential gamma) dropout process.
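
To see what these parameters imply, we can sketch the two fitted gamma mixing distributions (assuming the standard Pareto/NBD parameterization, in which alpha and beta are rate parameters, and noting the CBS frame is in weekly units):

# Sketch: fitted heterogeneity of the weekly transaction rate lambda ~ Gamma(r, alpha)
# and the dropout rate mu ~ Gamma(s, beta), using the parameters estimated above.
tibble(x = seq(0.01, 2, by = 0.01)) %>% 
  mutate(`transaction rate` = dgamma(x, shape = params.pnbd["r"], rate = params.pnbd["alpha"]),
         `dropout rate`     = dgamma(x, shape = params.pnbd["s"], rate = params.pnbd["beta"])) %>% 
  gather(process, density, -x) %>% 
  ggplot(aes(x, density, color = process)) +
  geom_line()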

CBS$xstar.pnbd <- BTYD::pnbd.ConditionalExpectedTransactions( 
  params = params.pnbd,
  T.star = CBS$T.star,
  x = CBS$x,
  t.x = CBS$t.x,
  T.cal = CBS$T.cal)


rbind(`Actuals` = c(`Holdout` = sum(CBS$x.star)),
      `Pareto/NBD` = c(`Holdout` = round(sum(CBS$xstar.pnbd))))
           Holdout
Actuals      20325
Pareto/NBD   20582

The ConditionalExpectedTransactions function uses the Pareto/NBD model parameters and a customer’s past transaction behavior to estimate the number of transactions they are expected to make in a given time period. Above is the comparison between the actual and estimated total number of customer transactions during the 13-week holdout period. They are remarkably close.
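
The aggregate total can hide per-customer error, so as an additional quick check (a sketch) we can look at the correlation between actual and expected holdout transactions:

# Sketch: per-customer agreement between actual and expected holdout transactions.
cor(CBS$x.star, CBS$xstar.pnbd)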

(CBS %>% 
  mutate(avg_sale = if_else(x != 0, sales / x, 0),
         total_expected_sales = xstar.pnbd * avg_sale,
         sales_resid = total_expected_sales - sales.star,
         x.star.resid = xstar.pnbd - x.star) -> CBS) 

The package provides some more plots about the data.

op <- par(mfrow = c(1, 2))
PAlive_probs <- with(CBS, BTYD::pnbd.PAlive(params.pnbd, x, t.x, T.cal))

CBS %>% 
  mutate(PAlive = PAlive_probs) -> CBS

CBS$PAlive %>% hist(main = "Survival Probability")

(BTYD::pnbd.PlotDropoutRateHeterogeneity(params.pnbd)) -> dropout_plot

par(op)

The heterogeneity plot shows the distribution, across customers, of the exponential dropout parameter that determines each customer's lifetime. The mean dropout rate is calculated at 12%. This is approximately the mean of the Survival Probability distribution, which agrees with the Pareto/NBD assumption that a customer's lifetime can be modeled with an exponential distribution.

The PAlive function uses the Pareto/NBD model parameters and a customer's past transaction behavior to return the probability that they are still alive at the end of the calibration period. The bimodal distribution is revealing. The Pareto/NBD model predicts that most customers have less than a 20% chance of being alive during the holdout period, but some shoppers will still be alive with a high degree of certainty - there isn't much middle ground.
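
We can quantify that bimodality (a quick sketch): the share of customers placed below a 20% or above an 80% chance of being alive.

# Sketch: how much of the PAlive distribution sits at the two extremes.
CBS %>% 
  summarize(p_alive_below_20 = mean(PAlive < .20),
            p_alive_above_80 = mean(PAlive > .80),
            middle_ground    = mean(between(PAlive, .20, .80)))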

CBS %>% 
  filter(x > 20) %>% 
  ggplot() +
  geom_point(aes(x, PAlive), alpha = .5) +
  ggtitle("High frquency shoppers are given very high probability of surviving")

op <- par(mfrow = c(1, 2))

transaction_plot <- BTYD::pnbd.PlotFreqVsConditionalExpectedFrequency(params.pnbd, T.star = 13, CBS, CBS$x.star, censor = 50)

frequency_plot <- BTYD::pnbd.PlotFrequencyInCalibration(params.pnbd, CBS, censor = 20)

par(op)

Here we can see how well the Pareto/NBD model predicts the distribution of transaction frequencies in the holdout period. For low-frequency shoppers the estimates are very accurate. But notice the sharp increase in variance when the model tries to predict the number of transactions for high-frequency shoppers. This could pose a problem, since the most loyal customers are arguably the ones we want the best estimates for - they spend the most.
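
One rough way to quantify that increase in variance (a sketch, with arbitrary frequency bands): bin customers by calibration frequency and compare the spread of the holdout residuals computed earlier.

# Sketch: spread of holdout transaction residuals by calibration frequency band.
CBS %>% 
  mutate(freq_band = cut(x, breaks = c(-Inf, 5, 20, Inf),
                         labels = c("0-5", "6-20", "21+"))) %>% 
  group_by(freq_band) %>% 
  summarize(customers = n(),
            resid_sd  = sd(x.star.resid))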

ggplot(CBS) +
  geom_histogram(aes(sales_resid), alpha = .7, color = "pink") +
  xlim(-.5, 5) -> p1

ggplot(CBS) +
  geom_density(aes(x.star.resid), color = "blue") +
  xlim(-.1, .5) -> p2

ggarrange(p1, p2)

The residual plots give some evidence of systematic overestimation. It could be caused by the optimistic survival estimates for high-frequency shoppers.
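
A quick check on the direction of that bias (a sketch): the mean residuals should be positive if the model systematically overestimates.

# Sketch: mean residuals for holdout sales and holdout transaction counts.
CBS %>% 
  summarize(mean_sales_resid  = mean(sales_resid),
            mean_x.star_resid = mean(x.star.resid))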

Model Comparison

Now we can check the performance of the tri-periodic random forest model against the state-of-the-art probability model.

comp <- CBS %>% 
  select(cust, actual_sales = sales.star, pnbd_sales = total_expected_sales) %>% 
  mutate(rf_sales = freq_rec_data$p3_sales_estimate)
comp %>% 
  select(-cust) %>% 
  gather() %>% 
  ggplot(aes(color = key)) +
  geom_density(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  coord_cartesian(xlim = c(0, 150))

The random forest density has a lot more mass near zero. Also notice that the actual sales have a much fatter tail than predicted by either model.

RMSE <- function(response, truth = comp$actual_sales) {
  # root mean squared error
  sqrt(mean((response - truth)^2))
}

tibble(rf_rmse = RMSE(comp$rf_sales),
       pnbd_rmse = RMSE(comp$pnbd_sales),
       ensemble_rmse = RMSE(.5 * (comp$rf_sales + comp$pnbd_sales)))

Surprisingly, we averaged in a worse predictor and got out a better one!

Ensembling can improve performance when two models generate good predictions in different ways. Random forests learn a piecewise-constant function on the predictor space, whereas the Pareto/NBD model learns hidden parameters that govern the actual probability distribution of transaction frequency. The two strategies are so different that it's plausible they complement each other's weaknesses.
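
One way to probe that claim (a sketch, not part of the final model): vary the mixing weight and watch the validation RMSE, using the RMSE helper and comp frame defined above.

# Sketch: RMSE of a weighted ensemble w * rf + (1 - w) * Pareto/NBD over a grid of weights.
tibble(w = seq(0, 1, by = .1)) %>% 
  mutate(rmse = map_dbl(w, ~ RMSE(.x * comp$rf_sales + (1 - .x) * comp$pnbd_sales)))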

Making Predictions

Now we refit the Pareto/NBD model to all 9 months of customer data.

elog2cbs(customer_trans_log) -> CBS

params.pnbd <- BTYD::pnbd.EstimateParameters(CBS)
names(params.pnbd) <- c("r", "alpha", "s", "beta")

CBS$xstar.pnbd <- BTYD::pnbd.ConditionalExpectedTransactions( 
  params = params.pnbd,
  T.star = 13,
  x = CBS$x,
  t.x = CBS$t.x,
  T.cal = CBS$T.cal)

CBS %>% 
  mutate(avg_sale = if_else(x != 0, sales / x, 0),
         total_expected_sales = xstar.pnbd * avg_sale) -> CBS

Train the random forest on period 2 features (supervised by period 3 frequency), then feed it period 3 features to estimate period 4.

library(randomForest)
rf2 <- randomForest::randomForest(p3 ~ p2 + R2, data = freq_rec_data)
round3 <- tibble(p2 = freq_rec_data$p3,
                 R2 = freq_rec_data$R3)

p4_estimate <- predict(rf2, newdata = round3)

rf_sales <- p4_estimate * customer_data$avg_transaction_value

Ensemble the predictions.

ensemble_sales <- .5 * (rf_sales + CBS$total_expected_sales)
customer_preds %>% 
  arrange(customer_id) %>% 
  mutate(three_month_pred = ensemble_sales) -> customer_preds

write_csv(customer_preds, "final_predictions.csv")
customer_preds

References

Paper on RFM segmentation for CLV by Dr. Fader

Customer Base Analysis with BTYDplus by Michael Platzer

Documentation for BTYD package by Dr. McCarthy

BTYD Walkthrough