Jessica asks, “As a Data Scientist for marketing, how do you decide which variables are important?”
Generally speaking, feature selection or variable/predictor importance is the technique you’d use to make that determination – with the understanding that what you’ll likely get is correlative in nature. You then have to use the scientific method to prove causation.
And that’s if you find a relationship that isn’t spurious. Sometimes, you’ll get spurious correlations – correlations that make no sense at all, which is why you must know your data set well as a subject matter expert. And the worst case scenario is when you get no relationships at all. That means you have to augment or engineer variables.
What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.
In today’s episode, Jessica asks, as a data scientist for marketing, how do you decide which variables are important? So variable importance, also known as feature selection or predictor importance, is a set of techniques and algorithms that you use to essentially try to figure out which of the variables that you have in a data set have a relationship with the outcome that you care about.
So this is typically regression analysis, although there can be variants for classification, but fundamentally, it’s a regression analysis to figure out: is there a mathematical relationship between an outcome and all the data that you have with it? And this is something that we’ve been doing for a very, very long time, right? If you’ve ever run a basic correlation in an Excel spreadsheet, you’re technically doing a type of variable importance or variable selection.
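As a rough illustration of that basic correlation approach, here is a minimal sketch that ranks each variable by the strength of its Pearson correlation with the outcome. The variable names and numbers are made up for the example, and this is only the simplest form of variable importance, not what any particular tool does.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def rank_by_correlation(features, outcome):
    """Return (name, r) pairs sorted by absolute correlation, strongest first."""
    scored = [(name, pearson(values, outcome)) for name, values in features.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)

# Hypothetical data: one value per contact.
features = {
    "emails_opened": [1, 3, 5, 7, 9, 11],
    "twitter_followers": [500, 120, 900, 300, 150, 700],
}
lead_score = [10, 22, 35, 41, 58, 63]

ranking = rank_by_correlation(features, lead_score)
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

The ranking only says which variables move together with the outcome; it says nothing about why.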
What’s different today from doing it in Excel, for example, is that you can use machine learning technology to look at every possible combination of variables, which you would call multiple regression or multiple regression subset analysis, and have machines try and pick the algorithm that would be best suited for that data set, because there are some algorithms that are better suited for looking at categorical or non-numeric data.
There are some algorithms that are good at numeric data.
There are some algorithms that are good at both, but not as good as either one.
And so using machine learning technology allows us to identify those relationships in a much more robust way.
And quite frankly, just a faster way than trying to do it by hand.
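To make "every possible combination of variables" concrete, here is a minimal sketch of an exhaustive subset search: it fits an ordinary least squares regression for each combination of variables and keeps the one with the best adjusted R². The data and variable names are hypothetical, adjusted R² is just one assumed scoring choice, and real tools use far more efficient and more varied methods than this.

```python
from itertools import combinations

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.
    Assumes A is square and non-singular (no collinear variables)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def adjusted_r2(columns, y):
    """Fit y ~ intercept + columns by least squares; return adjusted R^2."""
    n, k = len(y), len(columns)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    XtX = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k + 1)]
           for i in range(k + 1)]
    Xty = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k + 1)]
    beta = solve(XtX, Xty)
    predicted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def best_subset(features, y):
    """Try every non-empty combination of features; keep the best score."""
    names = list(features)
    best = (float("-inf"), ())
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            score = adjusted_r2([features[name] for name in combo], y)
            if score > best[0]:
                best = (score, combo)
    return best

# Hypothetical data: the outcome is driven almost entirely by days_active.
features = {
    "days_active": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "twitter_mentions": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
}
noise = [0.1, -0.1, 0.2, -0.2, 0.0, 0.1, -0.1, 0.2, -0.2, 0.0]
lead_score = [2 * d + e for d, e in zip(features["days_active"], noise)]

score, combo = best_subset(features, lead_score)
print(f"best subset: {combo}, adjusted R^2 = {score:.4f}")
```

The number of combinations doubles with every added variable, which is one reason this is a job for machines rather than spreadsheets.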
Now, what you get with a lot of feature selection techniques is a correlation; regression analysis leads to a correlation.
And that’s important to know because when you have a correlation or an association, you have not proved causation.
Stats 101: correlation is not causation.
So you would use machine learning technology to do a first pass at what are the features that we think are important, and then, ideally, use the scientific method to prove that this has a relationship with the outcome.
Now, that’s if you find that the relationship isn’t spurious.
Sometimes you will get what’s called spurious correlations, correlations that make no sense at all.
They’re variables that have no relationship.
But the machine sees a pattern, even though it’s not valid.
There’s actually a great blog by Tyler Vigen called Spurious Correlations. Go Google “spurious correlations”; it’s hilarious.
It’s all these things that have strong correlations
but clearly no relationship to each other, like the number of people who died from drowning and the number of movies Nicolas Cage has been in, right? They have no relationship to each other.
But there’s a mathematical relationship.
And that’s why you need the scientific method to be able to prove that A causes B.
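One way to see how spurious correlations arise is to check enough unrelated variables against an outcome: some will line up by chance alone. The sketch below is purely illustrative. It generates random noise (nothing here is real marketing data) and reports the strongest correlation it happens to find.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

rng = random.Random(42)  # fixed seed so the run is repeatable
n_rows, n_features = 20, 500

# The outcome and every "feature" are independent random noise.
outcome = [rng.gauss(0, 1) for _ in range(n_rows)]
noise_features = [[rng.gauss(0, 1) for _ in range(n_rows)]
                  for _ in range(n_features)]

correlations = [pearson(column, outcome) for column in noise_features]
strongest = max(correlations, key=abs)
print(f"strongest of {n_features} random variables: r = {strongest:+.2f}")
```

Every variable here is noise by construction, so any correlation that looks meaningful is an accident of having looked at so many candidates.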
This is also why you have to know your data set really well as a subject matter expert. Part of data science is having that subject matter expertise so that you can look at the variables that a machine would say correlate and go,
no, they don’t really correlate. I mean, they mathematically do have a relationship, but it’s not a valid relationship.
And the worst case scenario with a lot of these tools is that you get a whole bunch of nothing. You get a whole bunch of inconclusive answers that tell you that you don’t have enough data, or there’s data missing, or there are relationships missing in the data, which you then have to either augment by bringing in more data or engineer by creating new data from the data you already have.
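As a small example of engineering new data from data you already have, the sketch below derives two new variables from a raw last-active date. The field names, email addresses, and the 30-day cutoff are assumptions for illustration, not fields from any specific marketing automation system.

```python
from datetime import date

def days_since(last_active, today):
    """Number of whole days between a contact's last activity and today."""
    return (today - last_active).days

# Hypothetical contact records with a raw date field.
contacts = [
    {"email": "a@example.com", "last_active": date(2020, 5, 1)},
    {"email": "b@example.com", "last_active": date(2016, 5, 1)},
]
today = date(2020, 5, 31)

for contact in contacts:
    # Engineered variable 1: a number a regression can work with.
    contact["days_since_active"] = days_since(contact["last_active"], today)
    # Engineered variable 2: a simple yes/no flag.
    contact["active_last_30_days"] = contact["days_since_active"] <= 30

for contact in contacts:
    print(contact["email"], contact["days_since_active"],
          contact["active_last_30_days"])
```

A raw date is hard for most algorithms to use directly; a count of days or a recent/not-recent flag is often easier to relate to an outcome.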
So let’s look at an example of this.
I’ll bring this up here.
This is IBM Watson Studio.
And what I’ve done is I’ve taken my lead scoring data from my marketing automation system.
I fed it in here and said, tell me, feature importance wise, of all the data that I’m collecting in my marketing automation system,
what has the highest mathematical relationship to the outcome I care about, which in this case is the points? If you’ve ever worked with a marketing automation or CRM system, lead score or points is one of the indicators that says, hey, this is a high quality lead or this is a low quality lead.
In this case, we see a very strong relationship between when a contact was last active and their lead score.
This makes total sense: the more active you are, and the more frequently you’re active, of course, the higher your points are probably going to be, right? Somebody who was active once four years ago
is not a very good lead.
The second relationship, which is much, much weaker, and I would actually say there’s not a relationship here, is activity on Twitter.
And so this is an example where you have a very good indicator, which is activity, and then you have some indicators that are not so good.
And then you go into the suburbs here, there’s a whole bunch of data that has no relationship whatsoever.
So now we have a relationship.
The question is, could we prove that this relationship leads to a higher lead score? Well, we know intuitively that that probably is the case.
But we want to scientifically prove that. To do that, we could do things like send more emails or run retargeting and remarketing ads to see if we can get people who are not active to be active.
So I would take my data set.
Take everybody who’s been active in the last 30 days and put them out of the data set.
Actually, put them in a control group. Take everybody whose last activity is older than 30 days and put them in the experiment group.
Maybe randomize, mix and match, like 20%.
And then run the same ads to both, saying, hey, come read today’s email, right? And what we’d want to see is, do we see the points increase in the experiment group substantially,
to prove that last active date actually does increase lead scores?
This is a very simple, straightforward way to prove this.
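Here is a minimal sketch of how an experiment like this could be analyzed. It uses a slightly different, more conventional design than the one described above: it randomly splits lapsed contacts into a control group (no ads) and an experiment group (ads), then uses a permutation test to ask whether the experiment group’s lead scores rose more than chance would explain. The group sizes, score changes, and function names are all hypothetical.

```python
import random

def assign_groups(contact_ids, seed=0):
    """Randomly split contacts into (control, experiment) halves."""
    rng = random.Random(seed)
    shuffled = list(contact_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def mean(values):
    return sum(values) / len(values)

def permutation_test(control, experiment, n_permutations=2000, seed=0):
    """Return (observed difference in means, one-sided p-value).

    The p-value estimates how often a random relabeling of the pooled
    values gives a difference at least as large as the observed one."""
    rng = random.Random(seed)
    observed = mean(experiment) - mean(control)
    pooled = list(control) + list(experiment)
    n_control = len(control)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[n_control:]) - mean(pooled[:n_control])
        if diff >= observed:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value

# Hypothetical change in lead score per contact after the campaign.
control_change = [0, 1, 0, 2, 1, 0, 1, 0]
experiment_change = [5, 6, 7, 5, 6, 8, 7, 6]

observed, p_value = permutation_test(control_change, experiment_change)
print(f"mean lift: {observed:.3f} points, p = {p_value:.4f}")
```

Random assignment is what lets a difference between the groups be attributed to the campaign rather than to whatever made some contacts lapse in the first place.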
Here’s the catch.
This is where subject matter expertise comes in.
If my lead scoring algorithm, the way that I’ve chosen to assign points in my marketing automation system, is flawed or makes no sense, then I could be testing and proving something that doesn’t matter.
We would want to, for example, take a step back and analyze: does lead score have a relationship with people who actually bought something? If it doesn’t, then the lead score itself is broken.
And then this analysis doesn’t matter.
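A quick sanity check along those lines, as a minimal sketch: split contacts into low and high lead scores and compare how often each band actually purchased. The scores, purchase flags, and the threshold of 50 are invented for the example; a real check would use your own score scale and far more contacts.

```python
def conversion_rate_by_band(scores, purchased, threshold):
    """Compare purchase rates for contacts below vs. at or above a score."""
    low = [p for s, p in zip(scores, purchased) if s < threshold]
    high = [p for s, p in zip(scores, purchased) if s >= threshold]

    def rate(outcomes):
        # Share of contacts in the band who purchased (NaN if band is empty).
        return sum(outcomes) / len(outcomes) if outcomes else float("nan")

    return {"low": rate(low), "high": rate(high)}

# Hypothetical data: lead score and whether the contact bought (1) or not (0).
scores = [10, 20, 30, 80, 90, 95]
purchased = [0, 0, 1, 1, 1, 0]

rates = conversion_rate_by_band(scores, purchased, threshold=50)
print(f"low-score conversion:  {rates['low']:.0%}")
print(f"high-score conversion: {rates['high']:.0%}")
```

If high-scoring contacts do not buy at a noticeably higher rate than low-scoring ones, that is a hint the scoring itself needs attention before anything built on it.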
So you get a sense when it comes to how to decide what variables are important, there is a lot of technology, but there’s also a lot of business sense.
There’s also a lot of common sense.
Is there a relationship here? Does that relationship matter? So these are the questions that you would need to ask as you do this kind of analysis.
Really good question, challenging question because again, there are so many layers to the onion that you’re going to end up peeling back, but you’ll realize at some point, things may be more broken than you think.
That’s always a challenging place to be in.
If you have follow-up questions, leave them in the comments box below. Subscribe to the YouTube channel and the newsletter. I’ll talk to you soon.
Want help solving your company’s data analytics and digital marketing problems? Visit TrustInsights.ai today and let us know how we can help you.