Monina asks, “What steps make up a data science lifecycle? Where do you begin?”
The short answer to this question is to define the problem and hypothesis, prepare your data, explore your data, test your hypothesis, build a model, validate the model, and then deploy and observe. Each stage is composed of multiple sub-steps. Watch the video for a full explanation.
Can’t see anything? Watch it on YouTube here.
Listen to the audio here:
- Got a question for You Ask, I’ll Answer? Submit it here!
- Subscribe to my weekly newsletter for more useful marketing tips.
- Find older episodes of You Ask, I Answer on my YouTube channel.
- Need help with your company’s data and analytics? Let me know!
- Join my free Slack group for marketers interested in analytics!
What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.
In today’s episode, Monina asks, “What steps make up a data science lifecycle? Where do you begin?” So the short answer to this question is that the data science lifecycle is essentially a series of processes that we use to make data science work.
It begins with defining the problem, the question the hypothesis, the early steps of the scientific method.
And then we move into things like exploring your data, testing your hypothesis, building a model, validating it, deploying it, and observing it.
This overlaps closely with the scientific method, as it should, because data science is science done with data.
But each of these stages is composed of multiple sub-steps.
There’s a lot more to unpack in each of these.
So let’s actually bring this up here.
So what you see here is the data science lifecycle. The red part, defining the problem and your hypothesis, is probably the most important part of this entire thing.
Because without great problem definition and a provably true or false statement for the hypothesis, the rest of this stuff doesn’t matter.
The problem definition is also where you take the time to figure out what data you’ll need in order to do the rest of the process.
So the red part is the most important. Then you get to the five steps of data preparation.
First, ingesting the data, getting it from all the different systems it’s in; then analyzing it, not for what the data says, but to make sure the data is in good working condition.
How much is missing? How many anomalies are there? Is there a possibility of bias? Is there corruption in the data? All of those questions go into the data analysis stage.
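To make the analysis stage concrete, here is a minimal sketch, not from the video, using made-up daily session counts, of two of the checks described: counting missing values and flagging anomalies with a simple deviation-from-median rule.

```python
# Hypothetical daily session counts; None marks a day where tracking failed.
sessions = [120, 135, None, 128, 900, 131, 126, None, 133]

# How much is missing?
missing = sum(1 for v in sessions if v is None)
print(f"Missing values: {missing} of {len(sessions)}")

# How many anomalies? A simple check: flag values that deviate from the
# median by more than 50% of the median.
values = sorted(v for v in sessions if v is not None)
median = values[len(values) // 2]
anomalies = [v for v in sessions if v is not None and abs(v - median) > 0.5 * median]
print(f"Anomalies: {anomalies}")
```

The 50% threshold is arbitrary; in practice you would pick a rule suited to your data, but the point is that these checks happen before you trust the data at all.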
After that comes repairing the data, fixing the things that are broken, then cleaning it up, normalizing it if you need to, and getting it into the proper data structures.
And then after the cleaning comes preparing the data, readying it for analysis.
This can be things like encoding, or declaring variables categorical or continuous.
All this stuff is probably the most laborious stage of data science.
But it’s also one of the most important besides problem definition.
Because, again, doing data science means doing science with data.
And if our data is corrupted, we can’t do good science.
After that, you get to the yellow stages.
This is where we start doing what’s called exploratory data analysis.
And that is a whole cycle in and of itself.
But fundamentally, we’re asking: do we need to augment our data with new external data?
We do full exploration.
And we do comparison, looking inside our data to see what potential answers it has.
We have not actually tested our hypothesis yet.
We’re still in the data verification stage, making sure that our data is going to do what we want it to do.
That’s when we get to the green stage, hypothesis assessment, where we make that prediction.
Is our hypothesis true or false? What should we do about it? And then we build a model, a theory.
It’s not fully a theory until it’s proven, but a model of our hypothesis with our data. Then we get to the blue part, hypothesis testing: validating that model. Do our data and our hypothesis work together to answer that provably true-or-false statement? For example, in Google Analytics, our hypothesis could be that website traffic will always be lower on the weekends.
That is a provably true-or-false statement.
It’s a singular condition.
And we would bring in our data, analyze it, repair it, clean it, and prepare it (maybe Google Analytics wasn’t working for one or two days).
Then we augment, explore, and compare. Our prediction is that this is a true statement.
And if it is true, we might want to think about what to do about it. We build that model, a very simple one. In the augmenting stage, we might have added day of week to the data set, because Google Analytics doesn’t give you that out of the box.
It’ll give you the numerical date, but it won’t give you the day of week.
And then you validate: take an average of all the Saturdays, an average of all the Sundays, and an average of all the weekdays. Is your hypothesis true or false? You validate it, and if it’s false, you have to refine it, start over, or throw it away.
And if it’s true, you might want to hop back to the augment stage and get more data: maybe you looked at a year, maybe you should go two years, three years, five years; maybe look at any other sites you have legitimate access to, whatever the case may be.
You refine that hypothesis.
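The weekend-traffic test can be sketched end to end. This is a minimal illustration with fabricated traffic numbers, not real Google Analytics data: we augment each date with its day of week, then compare the Saturday, Sunday, and weekday averages to validate the hypothesis.

```python
from datetime import date, timedelta

# Hypothetical year of daily traffic: weekdays ~200 sessions, weekends ~90.
start = date(2020, 1, 1)
traffic = {}
for i in range(366):  # 2020 is a leap year
    d = start + timedelta(days=i)
    # Augment: weekday() gives the day of week (5 = Saturday, 6 = Sunday).
    traffic[d] = 90 if d.weekday() >= 5 else 200

# Validate: average the Saturdays, the Sundays, and the weekdays.
sat = [v for d, v in traffic.items() if d.weekday() == 5]
sun = [v for d, v in traffic.items() if d.weekday() == 6]
wkd = [v for d, v in traffic.items() if d.weekday() < 5]
sat_avg = sum(sat) / len(sat)
sun_avg = sum(sun) / len(sun)
wkd_avg = sum(wkd) / len(wkd)

# Hypothesis: weekend traffic is lower than weekday traffic.
hypothesis_true = sat_avg < wkd_avg and sun_avg < wkd_avg
print(f"Sat {sat_avg:.0f}, Sun {sun_avg:.0f}, weekday {wkd_avg:.0f}: {hypothesis_true}")
```

With real data you would export the daily sessions from Google Analytics instead of generating them, but the augment-then-average logic is the same.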
And then once you’ve got a working model that you’ve essentially proven, you deploy it. In the case of the insight that your website traffic is lower on the weekends, that deployment would simply be telling your marketing team, “Hey, we want more traffic on the weekends, we’ve got to run some ads,” or, “We don’t care about weekend traffic because no one in the office is around to answer sales questions, so cut our ad spend on the weekends.”
So that deploy stage is really about taking what we prescribed in the green section and rolling it out.
Once we’ve proven our hypothesis true or false, we then observe it, making sure that, yep, our model is working as intended and that we have proven for ourselves whatever our hypothesis was.
That’s the data science lifecycle as a whole.
And again, there are things to unpack in each of these stages, even in this more detailed model.
Just taking something like repairing your data can be a whole series of 10, 15, 20 steps, doing things like missing value imputation, or determining whether your missing data is missing at random or not missing at random. There’s all sorts of things you can do at each of these stages.
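As one example of those sub-steps, mean imputation, one of the simplest missing-value repairs, looks like this. The session counts are made up for illustration; whether mean imputation is even appropriate depends on why the data is missing.

```python
# Hypothetical daily sessions with gaps (None = no data recorded).
sessions = [120, None, 128, 131, None, 126, 133]

# Mean imputation: replace each missing value with the mean of the
# observed values. Reasonable only when data is missing at random.
observed = [v for v in sessions if v is not None]
mean = sum(observed) / len(observed)
imputed = [mean if v is None else v for v in sessions]
```

More sophisticated imputers (interpolation, model-based) exist, which is exactly why this single repair step can balloon into many sub-steps.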
And that’s one of the reasons why data science is so complex.
Because at each of these stages, there are mathematical principles at work.
There are technical principles at work.
There are business principles at work, and there’s domain expertise at work.
So there are all of these things that you have to unpack and be able to do with a data set in order to execute the scientific method and develop a working model that is reliable, repeatable, and defensible.
You know, you submit it to something like peer review, or at the very least have a colleague review it, to make sure that your model is, in fact, valid.
So as you start your journey, one of the things I would recommend you do is take this model and then start with very simple data sets.
Again, the Google Analytics example is a good one because it is compact.
It is mostly clean most of the time, and it allows you to test your knowledge of each of these steps without massive mathematical and technical hurdles at each stage.
Start super simple, and then as you get comfortable running through this lifecycle, you can work with more and more complex data, build harder-to-test hypotheses, and ultimately use this on a regular basis.
But really good question.
If you have follow up questions, leave them in the comments box below.
Subscribe to the YouTube channel and the newsletter, and we’ll talk to you soon. Take care.
Want help solving your company’s data analytics and digital marketing problems? Visit TrustInsights.ai today and learn how we can help you.
You might also enjoy:
- Understand the Meaning of Metrics
- It's Okay to Not Be Okay Right Now
- Transforming People, Process, and Technology - Christopher S. Penn - Keynote Speaker on Marketing Data Science
- Almost Timely News, 17 October 2021: Content Creation Hacks, Vanity Metrics, NFTs
- You Ask, I Answer: Google Tag Manager and Google Analytics Integration?
Want to read more like this from Christopher Penn? Get updates here:
Get your copy of AI For Marketers