Intro to Markov Chain Attribution

Ben Denis Shaffer · 2021-04-02

What is Attribution Modeling

Attribution modeling is a task that comes up in Digital Marketing. Broadly speaking, the objective of Attribution modeling is to improve the assessment of various Advertising Channels in driving Marketing Objectives. Before jumping into the data analysis, let’s first set the context for the problem that Markov Chain Attribution modeling addresses.

With this post I am aiming to cover the following:

  1. What is Attribution Modeling?

  2. What is Multi-Touch Attribution?

  3. Intro to Markov Chains

  4. Using R for:

  • Simulating Consumer Journeys from a Markov Chain

  • Estimating a Markov Chain from Consumer Journeys

  • Using Markov Chains for Data-Driven Attribution

Many Advertising Channels

When marketers spend money on advertising they strategically allocate budget towards different campaigns, which in turn use a multitude of advertising channels. Some common advertising channels include cable and streaming TV, Programmatic, Social, Paid Search, and Print. How these are defined, and which media partners and platforms are used to spend the planned budget, varies, but the fact that advertisers communicate through a multitude of channels remains.

Goals and KPIs

Let’s keep in mind that marketers don’t advertise for the sake of advertising. They have particular goals and objectives around which a marketing communications strategy is built. One of the roles of Analytics is to help build a measurement plan that defines a set of Key Performance Indicators (KPIs) that serve as a measure of success for the marketing efforts. These KPIs are chosen depending on the goals and technical capabilities, and tools for measuring them are put in place. For example, we may want to:

  • Generate visits to a website or a webpage

  • Generate leads through form submissions

  • Have users perform certain actions on our website

  • Raise awareness for our brand

  • or simply Drive sales

Whatever the KPI is, a unit of a completed KPI is referred to as a Conversion.

ROI and Attribution

While we are running a campaign and spending our budget there is a key question that we need to answer:

How do we evaluate the degree of importance of the different channels in driving our KPI?

Obviously, if our goal was to drive sales but, all else being equal, our sales did not increase, then none of the channels worked well. But if we do succeed in driving incremental sales, which channel is working best? In other words, how do we Attribute the incremental sales to different channels and compute our Return on Investment (ROI, closely related to Return on Advertising Spend, ROAS)? Should we tactically alter our plan because one of the channels is not performing as expected? Answering these questions is one way Analytics adds value to Advertising.
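As a toy illustration of the arithmetic (all spend and revenue figures below are hypothetical), ROI and ROAS per channel are simple ratios of attributed revenue to media spend:

```r
# Hypothetical spend and attributed revenue per channel (for illustration only)
spend   = c(Search = 50000, Display = 30000, TV = 80000)
revenue = c(Search = 120000, Display = 45000, TV = 95000)

roas = revenue / spend            # Return on Advertising Spend
roi  = (revenue - spend) / spend  # Return on Investment

round(rbind(ROAS = roas, ROI = roi), 2)
```

Of course, the hard part is producing the attributed revenue numbers in the first place, which is exactly what attribution modeling is for.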

Last-Touch in the Consumer Path

Digital advertising has made it much easier to measure when and how consumers interact with ads. This trackability allows us to collect data and analyze parts of the Consumer Path to a Conversion. A consumer journey exists whether or not it’s tracked, and no one can ever really track the whole journey. What we can track are some of the Media Touchpoints. Often the easiest way to Attribute Conversions is to simply look at the Last-Touch in the Consumer Path, because by default this is the only touchpoint we do see. But giving the Last-Touch full credit for driving a Conversion results in an unfair assessment of our media effectiveness and does not account for the Synergy that a mix of channels aims to create.

Thankfully, if an adequate effort is made to track our media, we can generate Path Data and attribute in a more sophisticated way, that is, using Multi-Touch Attribution (MTA). This allows us to better judge the effectiveness of the channels used, and make better investment decisions.

Multi-Touch Attribution

There are two main categories of MTA methods: Rule-Based and Data-Driven. Of course, both of these use data, but what distinguishes them is how they assign importance to touchpoints along the consumer path. Rule-Based methods heuristically assign weights to touchpoints based on their position.

Data-Driven methods take a different approach. Instead of assigning weights based on arbitrary positional rules, they establish probabilistic relationships between touchpoints and their role in driving Conversions.
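To make the contrast concrete, here is a minimal sketch (the path itself is made up) of how three common Rule-Based schemes would split credit for a single converting path:

```r
# Rule-Based credit assignment for one hypothetical converting path
path = c("TV", "Search", "Display", "Search")
n = length(path)

first_touch = as.numeric(seq_len(n) == 1)  # all credit to the first touchpoint
last_touch  = as.numeric(seq_len(n) == n)  # all credit to the last touchpoint
linear      = rep(1/n, n)                  # equal credit to every touchpoint

# Total credit per channel for this one conversion under each rule
rowsum(cbind(first_touch, last_touch, linear), group = path)
```

Each column sums to one conversion; the rules only disagree about how to split it.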

Markov Chain Attribution

Markov Chain Attribution is one of the more popular data-driven methods, and as the name suggests it takes advantage of Markov Chains. Unless you studied Operations Research or Math at university you might not know what these chains are, so let’s do a high-level intro.

Markov Chains

The key concept to remember about Markov Chains is the Transition Matrix. Markov Chains themselves are a conceptual tool for describing a (stochastic) system. All this means is that when there is randomness or a probabilistic component to a process, Markov Chains might be a good tool to first describe, and then analyze this process. The process we are concerned with here is Customer Journeys towards Conversions.

The Fundamentals

We can visualize a Markov Chain as a directed network. Each node is a State that a consumer can be in while following a path. Each arrow is a Transition, and each has an associated probability. So there is a predetermined set of touchpoints, but which one comes next is uncertain.

library(visNetwork)
library(tidyverse)
library(gt)
nodes <- 
  data.frame(id = 1:6, 
             label = c("Non-Conversion", "Display", "TV", 
                       "Search", "Conversion","Start"),
             color = c("darkred", "grey", "grey", 
                       "grey", "lightblue", "green"))
edges = expand.grid(from = 1:6, to = 1:6)
edges = edges %>% filter(from != to, to != 5, to != 1, from != 6)

visNetwork(nodes, edges, width = "100%") %>% 
  visEdges(arrows = "from") %>%  # arrows render at the "from" end, flipping the drawn direction
  visIgraphLayout(layout = "layout_with_kk")

The conversion and non-conversion nodes have arrows pointing to them from all channels, but no arrows emanating from them. These are called Absorbing States and for us, they indicate an end of a journey. When we collect all of the probabilities of transitioning from one state to the next, we organize them in a square Transition Matrix.

M = t(matrix(c(1,    0,   0,   0,  0,
               200,   0, 300, 350, 10,
               160, 300,   0, 350,  8,
               150, 200, 200,   0,  20,
               0,    0,   0,   0,  1), nrow = 5))

M = M/rowSums(M)

colnames(M) = c("Non-Conversion", "Display","TV","Search","Conversion")
row.names(M) = c("Non-Conversion", "Display","TV","Search","Conversion")

as.data.frame(round(M,3), row.names = row.names(M)) %>% 
  rownames_to_column() %>% gt::gt() %>%
  gt::tab_header(title = "Theoretical Transition Matrix")  %>%
  tab_style(style = list(cell_fill(color = "#F1CBC3")),
            locations = cells_body(columns = c(2,6), rows = 2:4)) %>%
  tab_style(style = list(cell_fill(color = "#F6A392")),
            locations = cells_body(columns = 3, rows = 3:4)) %>%
  tab_style(style = list(cell_fill(color = "#F6A392")),
            locations = cells_body(columns = 4, rows = c(2,4))) %>%
  tab_style(style = list(cell_fill(color = "#F6A392")),
            locations = cells_body(columns = 5, rows = 2:3))
Theoretical Transition Matrix

                 Non-Conversion  Display     TV  Search  Conversion
Non-Conversion            1.000    0.000  0.000   0.000       0.000
Display                   0.233    0.000  0.349   0.407       0.012
TV                        0.196    0.367  0.000   0.428       0.010
Search                    0.263    0.351  0.351   0.000       0.035
Conversion                0.000    0.000  0.000   0.000       1.000

Above is an example Transition Matrix which I put together for illustration. To make it realistic I set the conversion probabilities to relatively low values. Later we will use this matrix for Simulation!

An important property to keep in mind is that rows sum up to one since we are dealing with probabilities. The rows represent the current state in the journey, and columns represent the next state.

rowSums(M)
## Non-Conversion        Display             TV         Search     Conversion 
##              1              1              1              1              1

There is one more important thing to know about Markov Chains, and that is the Markov Property, or memorylessness. This may be TMI for some, but it is really the reason why it is possible to summarize the whole process in one simple Transition Matrix. The property is a simplifying assumption stating that the probability of where you transition to next depends only on where you are now, not on your whole journey history. To be clear, this doesn’t mean that the journey history doesn’t matter. It absolutely does impact attribution results. What it doesn’t impact is the transition probability at each step. This makes analysis significantly simpler from a mathematical point of view.
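In symbols, writing \(X_t\) for the state at step \(t\), the Markov Property says:

\[
P(X_{t+1} = j \mid X_t = i, X_{t-1}, \dots, X_1) = P(X_{t+1} = j \mid X_t = i) = M_{ij}
\]

so the single entry \(M_{ij}\) of the Transition Matrix is all we need to describe the dynamics of the chain.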

From Chains to Paths (Simulation)

In practice, given a data-set of consumer paths (constructing this data-set is a big task, so ‘given’ is a strong word), we may estimate a Transition Matrix. The reverse is also true: given a Transition Matrix, we can simulate a set of consumer paths. In fact, simulation plays an important role in deriving the actual levels of importance for each channel, so it is worth seeing the mechanics of this process.

Below we define a function that works with the above sample matrix \(M\). Given this matrix and a maximum number of Markov Chain steps num_sim, we do the following:

  1. Randomly select a starting point by sampling from the set of media touchpoints (in practice we would estimate the starting-point probabilities).

  2. For each step:

  • Pick the row of probabilities from \(M\) corresponding to the current state/touchpoint.

  • Take one sample from a Multinomial distribution with probabilities set to that row.

  • Set the sample as the new current state and repeat.

  3. If we hit a Conversion or Non-Conversion, end the path.

simulate_path = function(M, num_sim){
  path = vector(mode = "integer", length = num_sim)
  path[1] = sample(2:(nrow(M)-1), 1)       # uniform start over the channel states
  for(i in 1:(num_sim-1)){
    p = M[path[i],]                        # transition probabilities for the current state
    sn = which(rmultinom(1, 1, p) == 1)    # sample the next state
    path[i+1] = sn
    if(sn == 1 | sn == 5){                 # stop at an absorbing state
      break
    }
  }
  return(path[path > 0])                   # drop unused (zero) slots
}

Simulating One Path

Let’s test out this function. The first example is a non-converter.

set.seed(124)
num_sim = 100
path = simulate_path(M, num_sim)
plot(seq_along(path), path , 
     type = "l", axes=FALSE, ylim = c(1,5),
     main = "Non-Converter", 
     ylab = "Channel Touchpoint",
     xlab = "Touchpoint Number")
points(seq_along(path), path)
axis(side = 2, at = 1:5)
axis(side = 1, at = seq_along(path))

We see that we start out with Display, move to Search, then TV, and carry on the journey until step 6, where we drop out without converting.

This is another example but this time we have a converter.

set.seed(007)
num_sim = 100
path = simulate_path(M, num_sim)
plot(seq_along(path), path , 
     type = "l", axes=FALSE, ylim = c(1,5),
     main = "Converter", 
     ylab = "Channel Touchpoint",
     xlab = "Touchpoint Number")
points(seq_along(path), path)
axis(side = 2, at = 1:5)
axis(side = 1, at = seq_along(path))

This time we started out with TV and converted on step 8 with Search being our last touchpoint.

Simulating a Full Data-set of Journeys

We can now simulate a whole set of paths simply by repeating the function call simulate_path. With a few more data wrangling operations we have a pathdf. Here I will generate 10,000 paths.

num_sim = 100
num_paths = 10000

paths = purrr::map(1:num_paths, ~simulate_path(M, num_sim))

conversion = 
  purrr::map(paths, ~ data.frame(conversion = 
                                   ifelse(.x[length(.x)] == 5, 
                                          "converter",
                                          "non-converter")
                                 )
             ) %>% bind_rows(.id = "path_num")

pathdf = 
  map(paths, ~data.frame(touchpoint = 1:length(.x), channel = .x)) %>%
  bind_rows(.id = "path_num") %>%
  left_join(conversion) %>%
  left_join(
    data.frame(channel_name = colnames(M), channel = 1:5)
  )

head(pathdf, 10) %>% gt::gt() %>%
  gt::tab_header(title = "Simulated Paths")
Simulated Paths
path_num touchpoint channel conversion channel_name
1 1 4 non-converter Search
1 2 2 non-converter Display
1 3 1 non-converter Non-Conversion
2 1 4 non-converter Search
2 2 2 non-converter Display
2 3 3 non-converter TV
2 4 2 non-converter Display
2 5 4 non-converter Search
2 6 2 non-converter Display
2 7 3 non-converter TV

Let’s now plot all of these paths on top of one another to see how this Markov Chain behaves.

plotly::ggplotly(
pathdf %>%
  ggplot(aes(touchpoint, channel, color = conversion, group = path_num)) + 
  geom_line() +
  labs(x = "Touchpoint Number", y = "Channel Touchpoint") +
  theme_minimal()
)

From the plot we see that lines more often drop down to the Non-Conversion state, indicating non-converters. This is as expected. We can see that the simulated conversion rate was just over 8%.

table(conversion$conversion)/nrow(conversion)
## 
##     converter non-converter 
##        0.0808        0.9192

From Paths to Chains

Now that we have simulated a set of consumer paths from our Markov Chain \(M\) we can try to use them to estimate the original \(M\) with \(\hat{M}\) and see how well the estimation method works. Here I will introduce the brilliant ChannelAttribution package authored by Davide Altomare. It is open source and hosted on GitLab. This package contains a function called transition_matrix, which takes in a data-set of paths and estimates the Transition Matrix.

Data Prep

The first order of business is to structure the data in the required format. First we need to strip out the final touchpoint in each path, since it represents the conversion/non-conversion event. Then we summarize the data by grouping by path and counting the number of converters and non-converters for each unique path observed.

named_paths = 
  pathdf %>% group_by(path_num) %>%
  group_split() %>%
  map(~pull(.x,channel_name))

path_trim = map(named_paths, ~.x[.x != "Non-Conversion" & .x != "Conversion"])

journeydf = 
  as_tibble(
    cbind(
      as.data.frame(
        do.call(rbind,
                map(path_trim, ~str_c(.x, collapse = " > "))
                )
        ),conversion$conversion)
    )

names(journeydf) = c("path","conversion")

journeydf = 
  journeydf %>% 
  group_by(path) %>%
  summarise(
    converters = sum(if_else(conversion == "converter", 1, 0)),
    non_converters = sum(if_else(conversion == "non-converter", 1, 0))
    ) %>%
  arrange(-converters, -non_converters)

head(journeydf, 15) %>% gt::gt() %>%
  gt::tab_header(title = "Simulated Journey Data (Top 15 Converters)")
Simulated Journey Data (Top 15 Converters)
path converters non_converters
Search 78 902
Display 60 804
TV 53 588
Search > Display 29 267
TV > Search 27 409
Display > Search 26 329
Search > TV 20 227
Display > TV 20 224
TV > Display 18 269
TV > Search > Display 14 104
Display > TV > Search 13 152
Search > Display > Search 13 127
Display > Search > Display > Search 10 50
TV > Display > Search 8 128
TV > Search > TV 8 95

Here are some quick summary stats about the generated path data.

data.frame(
  metric = c("Converters", 
             "Non Converters", 
             "Number of Paths",
             "Conversion Rate %", 
             "Unique Journeys"),
  value = c(sum(journeydf$converters),
            sum(journeydf$non_converters),
            sum(journeydf$non_converters) + sum(journeydf$converters),
            100*sum(journeydf$converters) / num_paths,
            nrow(journeydf))) %>%
  gt::gt() %>%
  gt::tab_header(title = "Summary")
Summary
metric value
Converters 808.00
Non Converters 9192.00
Number of Paths 10000.00
Conversion Rate % 8.08
Unique Journeys 1501.00

There are many other ways to explore this kind of data. One quick and interesting one is the distribution of path lengths.

path_lengths = map_int(path_trim, ~length(.x))
summary(path_lengths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.002   5.000  31.000
hist(path_lengths, main = "Path Length Distribution")

Transition Matrix Estimation

Finally, now that we have our simulated data-set with a known transition matrix \(M\), we can see how to go from data back to a transition matrix via estimation and compute \(\hat{M}\). Here we apply the ChannelAttribution::transition_matrix() function to our data. The function returns a lookup table for the channel ids and a table of transition probabilities.

library(ChannelAttribution)

tM = transition_matrix(journeydf, 
                       var_path = "path", 
                       var_conv = "converters", 
                       var_null = "non_converters")

tM$transition_matrix %>% gt::gt() %>%
  gt::tab_header(title = "Transition Probabilities")
Transition Probabilities
channel_from channel_to transition_probability
(start) 1 0.33580000
(start) 2 0.33780000
(start) 3 0.32640000
1 (conversion) 0.02380785
1 (null) 0.27083187
1 2 0.35113052
1 3 0.35422977
2 (conversion) 0.02017868
2 (null) 0.22704868
2 1 0.40280345
2 3 0.34996919
3 (conversion) 0.01620316
3 (null) 0.18688167
3 1 0.43694010
3 2 0.35997507

We can format the table and output it in a way that makes it easy to compare the estimated and the true transition matrices.

Estimated Transition Matrix

         (null)  Display     TV  Search  (conversion)
Display   0.227    0.000  0.350   0.403         0.020
TV        0.187    0.360  0.000   0.437         0.016
Search    0.271    0.351  0.354   0.000         0.024

Original Transition Matrix

         Non-Conversion  Display     TV  Search  Conversion
Display           0.233    0.000  0.349   0.407       0.012
TV                0.196    0.367  0.000   0.428       0.010
Search            0.263    0.351  0.351   0.000       0.035

We can see that the estimation worked pretty well, especially for the larger probabilities. Because conversion events are much rarer, there is more variability / less precision in their estimates.

Attribution

So far we introduced a theoretical Markov Chain in the form of a transition matrix \(M\), used it to simulate a data-set of Consumer Journeys, and returned back to a Transition Matrix by estimating \(\hat{M}\). Now it’s time to convert, and move on to the actual attribution model!

Removal Effects

Markov Chains seem to be a good model for Consumer Journeys. But how to use them to actually attribute conversions is not immediately clear.

The main idea here is to compute Removal Effects. A Removal Effect measures how many fewer conversions would be observed if a channel were omitted/removed from the paths. If we compute this for each channel in the Markov Chain, then the Relative Removal Effects give the factors by which we scale the total conversion volume to attribute it across channels.
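To make this concrete, here is a sketch of computing removal effects exactly from the theoretical matrix \(M\) defined earlier, using the fundamental matrix of the absorbing chain (the uniform start over the three channels mirrors our simulation; in practice starting probabilities would be estimated):

```r
# States: 1 Non-Conversion, 2 Display, 3 TV, 4 Search, 5 Conversion
M = t(matrix(c(  1,   0,   0,   0,  0,
               200,   0, 300, 350, 10,
               160, 300,   0, 350,  8,
               150, 200, 200,   0, 20,
                 0,   0,   0,   0,  1), nrow = 5))
M = M/rowSums(M)

# Conversion probability from each transient state, via the fundamental matrix
absorb = function(M, channels){
  Q = M[channels, channels, drop = FALSE]        # transient -> transient
  R = M[channels, 5, drop = FALSE]               # transient -> Conversion
  as.vector(solve(diag(length(channels)) - Q) %*% R)
}

p_all = mean(absorb(M, 2:4))  # uniform start over the three channels

# Removing a channel: paths that pass through it (or start there) never convert
removal_effects = sapply(2:4, function(ch){
  p_removed = sum(absorb(M, setdiff(2:4, ch))) / 3
  1 - p_removed / p_all
})
names(removal_effects) = c("Display", "TV", "Search")
round(removal_effects, 3)
```

The same quantities can also be estimated by simulation, which is the approach ChannelAttribution takes.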

Running Attribution

As we just learned, running a Markov attribution model involves estimating a Transition Matrix and then computing the Removal Effects. To do this we simply use the markov_model function, which takes the data in the prepared format. By specifying out_more = TRUE we also get back the transition matrix and removal effects in addition to the attribution results.

mm_res = 
  markov_model(journeydf, 
             var_path = "path", 
             var_conv = "converters", 
             var_null = "non_converters", 
             out_more = TRUE)
## 
## Number of simulations: 100000 - Convergence reached: 2.32% < 5.00%
## 
## Percentage of simulated paths that successfully end before maximum number of steps (32) is reached: 99.98%

Attribution Results

First, let’s take a look at the results. The total number of conversions was 808. The below table shows how these were attributed by our Data-Driven model.

mm_res$result %>% gt::gt() %>%
  gt::tab_header(title = "Markov Chain Attribution Results")
Markov Chain Attribution Results
channel_name total_conversions
Search 284.9682
Display 263.2859
TV 259.7459

We see that the number of conversions attributed to each channel is similar; however, Search was attributed the most conversions. This is not surprising since the transition probabilities were steered towards Search, but only slightly. In addition, each channel was equally likely to be picked as the first touchpoint in our simulation, which rarely happens in reality.

To show how these results are calculated from the removal effects here is the explicit code:

total_conversions = sum(journeydf$converters)
removal_effects =  mm_res$removal_effects$removal_effects
relative_removal_effects = removal_effects/sum(removal_effects)
attributed_conversions = total_conversions * relative_removal_effects

attributed_conversions
## [1] 284.9682 263.2859 259.7459

The removal effects themselves can be easily accessed and interpreted:

removal_effects
## [1] 0.7806061 0.7212121 0.7115152

From these numbers, we see that about 78% fewer conversions would be observed if Search was removed. This is quite dramatic but not surprising for such a small and interconnected Markov Chain.

Quick Comparison to Rule-Based Models

With the ChannelAttribution package it is easy to also get results of some Rule-Based models and compare them to the Data-Driven results.

mta_res = 
  heuristic_models(journeydf, 
                 var_path = "path", 
                 var_conv = "converters") %>% 
  left_join(mm_res$result)
## Joining, by = "channel_name"
mta_res %>% gt::gt() %>%
  gt::tab_header(title = "MTA Model Result Comparison")
MTA Model Result Comparison
channel_name first_touch last_touch linear_touch total_conversions
Search 282 338 299.0591 284.9682
Display 269 262 262.9943 263.2859
TV 257 208 245.9466 259.7459

We can clearly see the difference between Last-Touch and the Markov Chain model results: the importance of Display, and especially TV, is substantially underestimated. Last-Touch attributed about 50 fewer conversions to TV than the Markov Chain model. This example makes it clear how Markov Chain attribution can correct for the inherent bias of Last-Touch attribution. Linear Touch results are similar here, again because the transition probabilities between the channels are not too far apart.

mta_res %>% 
  pivot_longer(-channel_name, names_to = "MTA_Model", values_to = "Conversions") %>% 
  ggplot(aes(MTA_Model, Conversions, fill = channel_name)) + 
  geom_col(position = "dodge") +
  labs(title = "MTA Model Result Comparison") +
  theme_minimal()

Key Considerations

Hopefully, you find the topic of MTA and Markov Chain Attribution as interesting as I do. Though this introduction is quite lengthy, there are still many aspects of this type of analysis to consider. Here are a few crucial things to consider next:

  1. Data is the key ingredient of any data analysis, so the quality of the analysis hinges on the quality of the data.

  2. Quality of data depends on a few factors:

  • Measurability — how well can a Channel be measured? Digital ads are easily measured, while OOH (out-of-home, e.g. street banners) is hard to track.

  • Measurement — just because something is easy to measure doesn’t mean it is well measured. If proper tools are not implemented, maintained, and managed, data will not be properly collected.

  • Data Management — even if data is properly collected it still needs to be managed, which means naming conventions and data classification have to be in place.

  • Cross-Device / Cross-Platform (Walled Gardens) — in order to stitch together the Customer Journey data we need to be able to identify users across different platforms and devices through which they are reached. This is perhaps the biggest limitation of MTA today.

  3. The Data Engineering aspect is important if MTA results need to be delivered frequently and efficiently. The more flexibility required, the more complexity there will be.

  4. We might also want to consider whether there are better ways to estimate the transition matrix, for example with Bayesian methods.
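As a sketch of what the Bayesian route could look like (the transition counts below are hypothetical), each row of the transition matrix can be given an independent Dirichlet prior, in which case the posterior mean is just the smoothed counts:

```r
# Hypothetical observed transition counts for rows Display, TV, Search
# (columns: Non-Conversion, Display, TV, Search, Conversion)
counts = matrix(c(200,   0, 300, 350, 10,
                  160, 300,   0, 350,  8,
                  150, 200, 200,   0, 20),
                nrow = 3, byrow = TRUE)

alpha = 1  # symmetric Dirichlet(1) prior, i.e. add-one smoothing
# (in practice, structurally impossible transitions such as self-loops
#  would get alpha = 0 so they stay at probability zero)
posterior_mean = (counts + alpha) / rowSums(counts + alpha)
round(posterior_mean, 3)
```

This kind of smoothing is most useful for the rare transitions into Conversion, where raw counts are small and the plug-in estimates are noisy.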