
You Ask, I Answer: Model and Algorithm Selection?

Katherine asks, “How do you know which is the right algorithm or model to choose for any given data set?”



Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

Christopher Penn 0:13

In today’s episode, Katherine asks, how do you know which is the right algorithm or model to choose for any given data set? That’s a tough question, because there are so many things that go into those choices.

The first, obviously, is the data itself: what kind of data you’ve got, right? If it’s a mix of categorical and continuous, numbers and not numbers, that can shed some light as to what algorithms are just off the table or not.

The big thing, though, is, what is the intended outcome, right? Because there’s two fundamental tasks in data science and machine learning.

There’s regression and classification.

Classification is: hey, we’ve got a bunch of data, we don’t know how to organize it, let’s sort it into categories so that it’s easier to understand the clumps of data, and maybe there’s a way to describe what those clumps are.

Regression is: given a known outcome, what things most closely relate to that outcome, or are most predictive of that outcome?

And within each of those two families, you then have a whole series of techniques, like nearest neighbors or SVM for classification, or gradient boosting, lasso, and ridge regression for regression analysis.
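To make the two task families concrete, here is a minimal sketch, assuming scikit-learn is available; the tiny data sets are invented purely for illustration.

```python
# Classification vs. regression with scikit-learn (toy data,
# invented for illustration only).
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import Ridge

# Classification: predict a category from features.
X_cls = [[1.0], [1.2], [8.0], [8.5]]   # e.g., fruit diameter in cm
y_cls = ["grape", "grape", "apple", "apple"]
clf = KNeighborsClassifier(n_neighbors=1).fit(X_cls, y_cls)
print(clf.predict([[1.1]]))            # nearest neighbor says "grape"

# Regression: predict a continuous outcome from features.
X_reg = [[1.0], [2.0], [3.0], [4.0]]   # e.g., ad spend
y_reg = [10.0, 20.0, 30.0, 40.0]       # e.g., leads generated
reg = Ridge(alpha=0.1).fit(X_reg, y_reg)
print(float(reg.predict([[5.0]])[0]))  # a value near 50
```

The classifier returns a label; the regressor returns a number. That distinction alone rules many algorithms in or out before you look at anything else.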

The question always is: what are the measures of performance that you’re trying to use? So, in classification, the most common metric is called the area under the receiver operating characteristic curve, or AUROC.

And essentially, it’s a measurement to say how good a classification algorithm or model is, how well it performs, right? What percentage of true positives versus false positives it gives off.

It’d be like, you know, you get a bunch of fruit, and you classify, these are apples, these are pears, these are grapes, etc.

And your measure of success is how many things you get right versus wrong. Like, maybe you get some really, really large grapes, and you misclassify a bunch of them as plums.

That would have a lower AUROC score than if you were correctly set up: these are large grapes, but they’re still grapes. That would get you a higher AUROC score.
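The grapes-versus-plums idea can be sketched in a few lines, assuming scikit-learn; the labels and model scores below are invented to mirror the example.

```python
# AUROC rewards models that separate the classes cleanly
# (invented data mirroring the grapes-vs-plums example).
from sklearn.metrics import roc_auc_score

# 1 = plum, 0 = grape (including some unusually large grapes).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]

# A model that scores some large grapes as plum-like...
scores_confused = [0.1, 0.2, 0.8, 0.9, 0.6, 0.7, 0.8, 0.9]
# ...versus one that keeps grapes and plums separated.
scores_clean = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

print(roc_auc_score(y_true, scores_confused))  # lower score
print(roc_auc_score(y_true, scores_clean))     # 1.0, perfect separation
```

An AUROC of 0.5 is no better than guessing; 1.0 means every plum was scored above every grape.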

In regression, the most common measures are the root mean squared error and the R squared number, which describe how closely a result fits a line, right? So if you have this line, or this curve of the regression, how closely does it fit against the existing data? Knowing that lets you know how accurate your analysis was.
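Both regression measures are one function call each in scikit-learn; here is a minimal sketch with invented actual and predicted values.

```python
# RMSE and R squared for a regression fit
# (actual and predicted values invented for illustration).
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_actual = np.array([10.0, 20.0, 30.0, 40.0])
y_predicted = np.array([12.0, 18.0, 31.0, 41.0])

# Root mean squared error: typical distance between the fitted
# values and the real data, in the units of the outcome.
rmse = np.sqrt(mean_squared_error(y_actual, y_predicted))

# R squared: share of the outcome's variance the model explains.
r2 = r2_score(y_actual, y_predicted)

print(round(rmse, 3), round(r2, 3))  # 1.581 0.98
```

Lower RMSE and higher R squared both indicate a tighter fit to the existing data.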

Now, you have a bunch of different tools out there right now that can test to see how different algorithms perform on data.

One of the ones I use a lot is called IBM Watson Studio AutoAI. You give it a dataset, you give it the outcome you’re after, and it tests all the different algorithms and models and says, hey, here are the ones that have the best performance based on the characteristics you’ve specified, like the highest R squared number or the lowest root mean squared error.

Those tools are huge, huge time savers, because otherwise you have to test everything by hand, which I’ve done; it’s not fun.

There’s more and more automated machine learning that does that sort of thing where you give it the outcome and the data, and it will just test out a bunch of things and then let you know, hey, here’s what I found.

And then it’s up to you, as the data scientist, to say: okay, I think this one is the best blend of performance and accuracy, or this one best fits the kind of outputs we need.

For example, there are some regression algorithms that cannot output what’s called variable importance: of all the variables that went into the regression, which ones are the most important, which ones have the strongest relationship to the outcome we care about? In marketing, that kind of output would tell us what channels are working, right?

So if we’re talking about marketing channels, that’s the type of analysis we want, and if there’s an algorithm that doesn’t provide variable importance, its usefulness to us is going to be pretty low. Right? If that’s a key requirement.
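As a sketch of what variable importance looks like when an algorithm does provide it, here is a tree-based example assuming scikit-learn; the channel names and synthetic data are invented, and not every algorithm exposes an equivalent attribute, which is exactly the limitation described above.

```python
# Variable importance from a model that supports it
# (synthetic "marketing channel" data, invented for illustration).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

channels = ["email", "search", "social", "display"]
X, y = make_regression(n_samples=300, n_features=4, noise=5.0,
                       random_state=1)

model = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)

# feature_importances_ sums to 1.0; higher means a stronger
# relationship to the outcome.
for name, importance in sorted(zip(channels, model.feature_importances_),
                               key=lambda pair: -pair[1]):
    print(f"{name}: {importance:.3f}")
```

If a candidate model can’t produce something like this list, and the list is a requirement, the model is out regardless of its accuracy.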

So I guess the long answer to the question is: know what your requirements are, know what your model KPIs are.

And then use the automation software of your choice to test a bunch of things, or do it manually.

I mean, that’s always a viable option

Christopher Penn 5:15

to see which algorithm or model performs best given the data set.

And given the requirements that you need to fit it to.

It’s not easy, right? It’s not fast.

There’s no easy button.

Because even if your software chooses an algorithm that fits well, if anything changes in that data set, you’ve got to rerun the process all over again, possibly multiple times.

So it’s not a one-and-done.

It’s a living, breathing thing.

But good question.

It’s an interesting question, and a very challenging one.

It’s one of the areas where automated machine learning really can offer substantial measurable benefits to folks who are engaging in machine learning practices.

So thanks for asking.

If you liked this video, go ahead and hit that subscribe button.




Want to read more like this from Christopher Penn? Get updates here:

subscribe to my newsletter here


AI for Marketers Book
Get your copy of AI For Marketers

Analytics for Marketers Discussion Group
Join my Analytics for Marketers Slack Group!