You Ask, I Answer: Time Intensive Data Science Tasks?

Warning: this content is older than 365 days. It may be out of date and no longer relevant.

Katherine asks, “What’s the most time intensive part of data science?”

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

Christopher Penn 0:13

In today’s episode, Katherine asks, What’s the most time intensive part of data science? Well, the most time intensive part by far is data prep and feature engineering.

These are the stages where you are taking data, examining it, cleaning it, preparing it for analysis, preparing it for modeling, and doing feature engineering to add new features.

That’s probably 80% of data science, for real.

The actual in-depth analysis, and the actual machine learning if that’s what you’re doing, is like 10 to 15% of your time.

The reason why data prep is so time intensive is that, despite the raft of companies, software, and tools that claim they can automate it all away, you can’t.

You can’t automate it all away.

Because in so many cases, what you’re dealing with is stuff that is different every time.

Right? When you’re looking at a dataset of nutrition data, it’s got certain characteristics. When you’re looking at motor data, when you’re looking at environmental data, when you’re looking at email marketing statistics, all of these things are datasets that are unique and different.

And though there are common processes and techniques for doing data prep and feature engineering, there is no one size fits all.

And there’s certainly no way today to just hand a dataset to a computer and say, Hey, do all my cleanup and prep and feature engineering for me.

Because these machines don’t necessarily know what’s needed.

They don’t necessarily know which procedures would make sense to do and which procedures wouldn’t make much sense.

For example, suppose you have a date field in an email marketing dataset. The software would know to make sure that it’s a date field, that it’s formatted correctly, and things like that.

But it wouldn’t necessarily know that you might want to extract day of week or hour of day. It also wouldn’t know that you don’t typically want day of month or day of year. Those are not necessarily going to lend a whole lot of insight from an email marketing perspective. Maybe they will, maybe they won’t, depending on your email marketing strategy.

But we as the data scientists would know, based on our subject matter expertise, our skills, and our domain knowledge of email marketing, that sometimes those extra engineered features are a good idea, and sometimes they don’t add any extra value.
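As a rough illustration of the date example, here is a minimal Python sketch that pulls day of week and hour of day out of a send timestamp. The field names (`sent_at`, `campaign`, `opens`), the sample rows, and the helper `add_send_time_features` are all hypothetical, made up for illustration, and whether these two features are worth adding is still a judgment call for the person who knows the email program.

```python
from datetime import datetime

# Fixed English names so the output does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def add_send_time_features(rows, field="sent_at"):
    """Return copies of rows with day_of_week and hour_of_day added.

    Assumes each row is a dict whose `field` value is an ISO 8601 string.
    """
    enriched = []
    for row in rows:
        sent = datetime.fromisoformat(row[field])
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row["day_of_week"] = DAY_NAMES[sent.weekday()]  # Monday is 0
        new_row["hour_of_day"] = sent.hour
        enriched.append(new_row)
    return enriched


# Hypothetical sample data, not taken from any real campaign.
sends = [
    {"campaign": "spring_sale", "sent_at": "2022-03-07T09:30:00", "opens": 120},
    {"campaign": "newsletter", "sent_at": "2022-03-12T18:05:00", "opens": 45},
]
features = add_send_time_features(sends)
for row in features:
    print(row["campaign"], row["day_of_week"], row["hour_of_day"])
```

Day of month and day of year are left out on purpose here, matching the point that those usually add little for email marketing. Adding them would be one more line each, but the code cannot tell you whether they belong.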

That’s one of the reasons why data science is so complicated.

And it’s why it is so hard to find talent in the data science field, because you need somebody who is both a subject matter expert in data science and a subject matter expert in whatever industry the datasets you’re studying come from.

Someone looking at, for example, COVID data is going to have a very different understanding of what features are important, based on virology and immunology, than somebody who’s doing data analysis on car engines, right? They’ll have similar techniques, but they’re going to deploy them in very, very different ways.

Someone who’s an expert in engines is going to be looking at factors like mean time between failure, whereas somebody looking at COVID data is probably going to be looking at things like antigenic drift and phylogenetic maps.

Those are very different tasks.

And you need to have the subject matter expertise in that domain to be able to know what features to include and, especially, to know what features are missing.

And then you need to know whether or not you can engineer the dataset to repair some of the missing data.
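To make the idea of repairing missing data concrete, here is a minimal Python sketch of one common technique, median imputation with a flag that records which values were filled in. This is only one option among many and not necessarily what the speaker has in mind. The field names (`engine_id`, `temp_c`), the sample readings, and the helper `impute_with_median` are hypothetical, and whether a median is a sensible stand-in at all is exactly the kind of call that needs domain knowledge.

```python
from statistics import median


def impute_with_median(rows, field):
    """Fill missing (None) values of `field` with the median of observed values.

    Also adds a `<field>_was_missing` flag so later analysis can tell
    real measurements apart from filled-in ones.
    """
    observed = [row[field] for row in rows if row[field] is not None]
    fill_value = median(observed)  # raises StatisticsError if nothing was observed
    repaired = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row[f"{field}_was_missing"] = row[field] is None
        if row[field] is None:
            new_row[field] = fill_value
        repaired.append(new_row)
    return repaired


# Hypothetical sensor readings, invented for illustration only.
readings = [
    {"engine_id": 1, "temp_c": 90.0},
    {"engine_id": 2, "temp_c": None},
    {"engine_id": 3, "temp_c": 94.0},
    {"engine_id": 4, "temp_c": 101.0},
]
repaired = impute_with_median(readings, "temp_c")
for row in repaired:
    print(row)
```

Keeping the flag column is a deliberate choice: it lets a subject matter expert review how much of the dataset was repaired before trusting any model built on it.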

I don’t foresee a day when you can just simply hand a dataset over to a machine and have it do all that cleaning and prep and augmentation and extraction and make it all work seamlessly because it’s different every time.

It’s like being a chef, right? Maybe a chef in a food court.

And there’s just one big restaurant that serves every cuisine.

You don’t know what the next person is going to ask.

Maybe they want chicken chow mein, maybe they want pasta carbonara.

Maybe they want a pretzel.

Right? There’s no way to tell.

And so you’ll have lots of skills and common techniques, but at the same time, every order is going to be different.

Christopher Penn 5:01

So that’s the most time intensive part of data science.

It is data prep and feature engineering.

And that’s not going to get better anytime soon.

The machines can help.

But even then they still need guidance to pull it off.

So, if you are in the field of data science or you are working towards becoming a data scientist, I would expect that’s where you’re going to spend a lot of your time. And frankly, that’s where things go the most wrong, because if you don’t have the right data for any models or insights, it’s like not having the right ingredients to cook with.

Right? If you’re trying to bake bread and you’ve got a bag of sand, it doesn’t matter how good a cook you are.

You’re not making an edible loaf of bread.

Anyway, really good question.

Thanks for asking.

If you like this video, go ahead and hit that subscribe button.