Wednesday, 6 June 2012

Introduction to Predictive Scoring Models - Part 1

I recently attended a meeting of prospect researchers from organisations across Scotland, and two things became apparent. Firstly, I was the only data analyst in the room (or at least the only one who thought of themselves that way). Secondly, there appeared to be great interest in what data analysis could bring to the traditional role of the prospect researcher. In particular, there was real enthusiasm to learn more about predictive modelling.

With this in mind I thought it would be useful to put together something about getting started with using data to build a predictive scoring model. Even a simple model can produce fantastic results when applied to appeal mailings or telephone campaigns.

What is a scoring model?

The idea is that if you find those who look like your donors, you will have a better chance of producing more donors. To do this, you gather different data variables about your constituents. When you find a variable that has predictive power, you allocate it a point. When you combine multiple variables and add up the points, you produce a score. The higher the score, the more like a donor the constituent is.
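
To make the points idea concrete, here is a minimal sketch in Python. The variable names and the two constituents are invented purely for illustration, and which variables actually deserve a point is something you have to establish from your own data, as discussed below.

```python
# Hypothetical yes/no variables that we assume have already been
# shown to have predictive power. Each one is worth a single point.
PREDICTIVE_VARIABLES = [
    "has_legacy_pledge",
    "attended_event",
    "has_employment_details",
]


def score(constituent: dict) -> int:
    """Add one point for every predictive variable that is true."""
    return sum(1 for name in PREDICTIVE_VARIABLES if constituent.get(name))


# Two made-up constituents, not real records.
constituents = [
    {"id": "B", "has_legacy_pledge": False, "attended_event": False,
     "has_employment_details": True},
    {"id": "A", "has_legacy_pledge": True, "attended_event": True,
     "has_employment_details": False},
]

# Highest scores first: these are the people who look most like donors.
ranked = sorted(constituents, key=score, reverse=True)
```

Sorting by score gives you a ranked list, so an appeal mailing or telephone campaign can start at the top and work down.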

What gives a variable predictive power?

I have been asked many times what the average donor looks like. Who gives to this fund or that? Just because a characteristic is common among donors does not make it predictive. To build a scoring model we need to uncover distinguishing characteristics, not common characteristics. How a target group differs from a random group is far more interesting statistically than what its members have in common.

The crucial distinction is that we are not looking for similarities between our donors. We are looking for the qualities that distinguish our donors from the rest of our constituents. The average donor may look very similar to the average non-donor.

Consider this example.

From a sample drawn from my general population, I looked at two variables: whether or not a legacy has been pledged, and whether or not an email address exists. Firstly, I looked at the 8% of my sample that are donors.

Looking at donors in isolation

I found that very few donors have pledged to leave a legacy while a majority of donors have an email address recorded. In fact a little over 1 in every 2 donors has an email address recorded.

Does this give email address more predictive power? It looks like a 1 in 2 chance of finding a donor with email. Right?

The simple answer is that we just don’t know. There still isn’t enough information for that. Remember, we need to look for what distinguishes a donor from the rest of our constituents.

Things change when we include all constituents in the calculation

When we look at our sample from the perspective of the data variable, and look at donors alongside non-donors, things change. Email is no longer the significant variable. The vast majority of those with email are non-donors, whereas the opposite is true of those who have pledged a legacy. You have a two in three chance of picking a donor if you focus on those with a legacy pledge.

When we looked at donors alone, it seemed like we had a 1 in 2 chance of finding a donor if we looked at the email variable. When we look at the bigger picture we see that it is more like 1 in 10.

Don’t look at donors in isolation. Compare data for donors with data for everyone.
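
The two ways of looking at the same variable can be sketched in Python. The counts below are made up; I have only chosen them to echo the proportions described above (roughly 1 in 2 donors with email, 1 in 10 email holders being donors, 2 in 3 legacy pledgers being donors), so treat them as an illustration rather than my actual figures.

```python
# Made-up counts for illustration only.
TOTAL_DONORS = 800  # 8% of a 10,000-record sample

counts = {
    # variable: (donors with the variable, all constituents with the variable)
    "email": (420, 4200),
    "legacy_pledge": (20, 30),
}


def share_of_donors(variable: str) -> float:
    """Donors in isolation: what fraction of donors have this variable?"""
    donors_with, _ = counts[variable]
    return donors_with / TOTAL_DONORS


def donor_rate(variable: str) -> float:
    """Everyone: what fraction of constituents with this variable are donors?"""
    donors_with, all_with = counts[variable]
    return donors_with / all_with


for variable in counts:
    print(variable,
          f"share of donors: {share_of_donors(variable):.1%}",
          f"donor rate: {donor_rate(variable):.1%}")
```

The only difference between the two functions is the denominator. Dividing by all donors tells you what donors have in common; dividing by everyone who has the variable tells you how likely you are to find a donor there, which is the question a scoring model needs answered.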

You need to repeat this test for other variables to find those with the strongest predictive power. You could try starting with event attendance, employment details, questionnaire responses, email address and legacy pledge status, as well as age, relationships and location.

Once you have collected your variables, you need to determine how many to use and which show a donor rate significantly higher than the baseline (the percentage of donors in your sample as a whole).

Here is an example:

For my sample of 10,000 records I found a baseline of 8% donors. Focusing on legacy pledgers as a distinct cohort, the percentage of donors is nearly 8 times higher.
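
If your sample is available as one record per constituent, the comparison against the baseline might be sketched like this in Python. The field names (`is_donor`, `has_pledge`, `has_email`) and the ten toy records are assumptions for the sake of the example, and the sketch assumes the sample contains at least one donor.

```python
def donor_rate(records, variable=None):
    """Fraction of donors among the records, optionally restricted
    to the cohort for which `variable` is true."""
    cohort = [r for r in records if variable is None or r[variable]]
    if not cohort:
        return 0.0
    return sum(r["is_donor"] for r in cohort) / len(cohort)


def lift_over_baseline(records, variables):
    """How many times the baseline donor rate each cohort's rate is."""
    baseline = donor_rate(records)  # percentage of donors in the whole sample
    return {v: donor_rate(records, v) / baseline for v in variables}


# Ten toy records (hypothetical), two of which are donors.
records = (
    [{"is_donor": True, "has_pledge": True, "has_email": True}]
    + [{"is_donor": True, "has_pledge": True, "has_email": False}]
    + [{"is_donor": False, "has_pledge": True, "has_email": False}]
    + [{"is_donor": False, "has_pledge": False, "has_email": True}] * 4
    + [{"is_donor": False, "has_pledge": False, "has_email": False}] * 3
)

lifts = lift_over_baseline(records, ["has_pledge", "has_email"])

# Keep only the variables whose cohort beats the baseline.
selected = [v for v, lift in lifts.items() if lift > 1]
```

A lift of 1 means the cohort gives at the same rate as everyone else, so the variable tells you nothing. With real data you would also want each cohort to be large enough that the difference is not just noise.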

Aim to keep the number of variables low. Try to find up to 10 variables that show higher rates of giving than your sample baseline. A final model can still be effective with as few as 6 variables combined.

That’s enough for now. Part 2 will deal with tying this together and reporting the model output.

