Wednesday, 4 July 2012

Introduction to Predictive Scoring Models - Part 2

In part 1 of my introduction to predictive scoring models I wrote about how some data variables have particular ‘predictive power’. This is based on the correlation they have with the thing we want to predict, the dependent variable.

In my example I attempted to find variables that would predict the likelihood of making a gift, so I compiled a list of the data variables that appeared to have the strongest correlation with being an existing donor. The next stage of the process is to tie this together into a score.

Working with the output involves four key steps:

Step 1 – Produce the output file

You need to produce a report from your database with one row for each constituent in your sample and one column for each of your data variables.

Ask questions of your data that require answers of YES or NO. Code the results as 1 for YES and 0 for NO.

It will look something like this:

The first binary column in the table represents the dependent variable: has this constituent done the thing you are trying to predict? If you have the skills and the tools to produce this file yourself, great. If not, you need to cultivate your relationship with your database team.
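As a rough sketch of what producing that file could look like in Python: the field names (`gift_count`, `email`, `events_attended`, `home_phone`), the three sample records and the yes/no questions are all invented for illustration and are not the variables from my own model. Your report will come from your own database.

```python
import csv
import io

# Hypothetical constituent records; every field name and value is invented.
constituents = [
    {"id": 1, "gift_count": 2, "email": "a@example.org", "events_attended": 1, "home_phone": ""},
    {"id": 2, "gift_count": 0, "email": "", "events_attended": 0, "home_phone": "555-0100"},
    {"id": 3, "gift_count": 0, "email": "c@example.org", "events_attended": 3, "home_phone": "555-0101"},
]


def to_flag(answer):
    """Code a YES/NO answer as 1 for YES and 0 for NO."""
    return 1 if answer else 0


def build_row(record):
    """Ask each yes/no question of one constituent and code the answers."""
    return {
        "id": record["id"],
        # Dependent variable: has this constituent ever made a gift?
        "is_donor": to_flag(record["gift_count"] > 0),
        # Independent variables (illustrative questions only).
        "has_email": to_flag(record["email"]),
        "attended_event": to_flag(record["events_attended"] > 0),
        "has_home_phone": to_flag(record["home_phone"]),
    }


rows = [build_row(record) for record in constituents]

# Write the output file; an in-memory buffer stands in for a real file here.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
output_csv = buffer.getvalue()
print(output_csv)
```

Keeping the dependent variable as the first binary column, straight after the ID, makes it easy to leave out when you come to add up the score.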

Step 2 – Calculate the score

The next thing to do is add up the 1s and 0s for all the independent variables in your file to produce a score for each row (that is, each person). It doesn’t matter how many variables you have included in your model. Don’t include the dependent variable in your score calculation, because it is the thing you are trying to predict.

It will look something like this:
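If you prefer code to a spreadsheet, the same calculation might be sketched like this. The column names and the three rows are made up for illustration; the only real rule is the one above: sum the independent variables and skip the dependent one.

```python
# Columns that must never be added into the score (names are hypothetical).
ID_COLUMN = "id"
DEPENDENT_COLUMN = "is_donor"

# Invented example rows, already coded as 1 for YES and 0 for NO.
rows = [
    {"id": 1, "is_donor": 1, "has_email": 1, "attended_event": 1, "has_home_phone": 0},
    {"id": 2, "is_donor": 0, "has_email": 0, "attended_event": 0, "has_home_phone": 1},
    {"id": 3, "is_donor": 0, "has_email": 1, "attended_event": 1, "has_home_phone": 1},
]


def score(row):
    """Add up the 1s and 0s of the independent variables only."""
    return sum(
        value
        for column, value in row.items()
        if column not in (ID_COLUMN, DEPENDENT_COLUMN)
    )


# Build new rows with a score column, leaving the originals untouched.
scored_rows = [dict(row, score=score(row)) for row in rows]

for row in scored_rows:
    print(row["id"], row["is_donor"], row["score"])
```

Because the score is just a sum of every column that is not excluded, adding another variable to the model only means adding another column to the file.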

Step 3 – Analyse the results

What we want to determine is whether the score has a relationship with the number of donors found. Storing the output columns as numbers lets you sum and group the results by score easily.

Here is the result of my 10,000 records grouped by score:

It is not easy to see from this whether or not the model has produced anything useful. What I need to show is the percentage of constituents at each score who are donors, because there are vastly different numbers of constituents at each score.

I also find it best to show the effectiveness of a score by plotting it on a graph.
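A minimal sketch of that grouping, using a handful of invented (score, is_donor) pairs rather than my 10,000 records, could look like this:

```python
from collections import defaultdict

# Invented (score, is_donor) pairs; not the figures from my sample.
scored = [
    (0, 0), (0, 0), (0, 0), (0, 0),
    (1, 0), (1, 0), (1, 1), (1, 0),
    (2, 1), (2, 0),
    (3, 1),
]


def donor_rate_by_score(pairs):
    """Count constituents and donors at each score, then work out the percentage."""
    totals = defaultdict(int)
    donors = defaultdict(int)
    for points, is_donor in pairs:
        totals[points] += 1
        donors[points] += is_donor  # is_donor is 1 or 0, so this counts donors
    return {
        points: {
            "constituents": totals[points],
            "donors": donors[points],
            "pct_donors": round(100 * donors[points] / totals[points], 1),
        }
        for points in sorted(totals)
    }


by_score = donor_rate_by_score(scored)
for points, stats in by_score.items():
    print(points, stats["constituents"], stats["donors"], stats["pct_donors"])
```

Showing the constituent count next to the percentage matters: in this toy data the top score is 100% donors, but it is only one person.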

This shows quite clearly that the higher the score the higher the percentage of donors found. 

The problem here is that there are very few constituents at the higher levels. Only one constituent scored 8 points and they happened to be a donor. Not much good for segmentation.

Step 4 – Split into percentiles

It is important to look at the results of your model based on percentiles of your sample. What does the top 25% of constituents look like when compared to the bottom 25%?

I broke my sample into four quartiles of roughly equal size:

Looking at the fourth quartile, 23.3% are donors: 551 donors from a possible 2,366 constituents.
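One way to sketch the quartile split in code is to rank everyone by score and cut the ranked list into four groups by position. This is an assumption about method, not necessarily how I built my own quartiles: cutting by position gives equal-sized groups but can place people with the same score in different quartiles, whereas cutting on score boundaries keeps ties together and gives only roughly equal groups. The sample pairs are invented.

```python
# Invented (score, is_donor) pairs; not the figures from my sample.
scored = [(0, 0), (1, 0), (2, 1), (0, 0), (3, 1), (1, 0), (4, 1), (2, 0)]


def quartile_summary(pairs):
    """Rank by score, cut into four groups by position, and summarise donors."""
    ranked = sorted(pairs, key=lambda pair: pair[0])  # lowest scores first
    n = len(ranked)
    summary = {q: {"constituents": 0, "donors": 0} for q in (1, 2, 3, 4)}
    for position, (_, is_donor) in enumerate(ranked):
        quartile = position * 4 // n + 1  # 1 = lowest scores, 4 = highest
        summary[quartile]["constituents"] += 1
        summary[quartile]["donors"] += is_donor
    for stats in summary.values():
        count = stats["constituents"]
        stats["pct_donors"] = round(100 * stats["donors"] / count, 1) if count else 0.0
    return summary


quartiles = quartile_summary(scored)
for quartile, stats in quartiles.items():
    print(quartile, stats["constituents"], stats["donors"], stats["pct_donors"])

# Checking the fourth-quartile figure quoted above: 551 donors out of 2,366.
fourth_quartile_pct = round(100 * 551 / 2366, 1)
print(fourth_quartile_pct)
```

Whichever way you cut the groups, the comparison to make is the same: the donor percentage in the top quartile against the donor percentage in the bottom one.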

What we have done is identify a large group of people who are not donors but look like our donors. We have also identified a group large enough to start segmenting our data for calling or mailing.

Until next time...


1 comment:

  1. Good post. Could have been more helpful with some more examples. But great post nevertheless