Michael asks, “Have you heard of augmented analytics (defined by Gartner)? It seems to me it means your job will get easier in the short run and you’ll be out of business in the long run – if you believe it. I’d be interested in your comments on it.”
Augmented analytics is what the rest of the world calls automated data science. It holds a lot of promise, but there are a few problems with it right now. There are four aspects to the feature engineering part of data science. Some can be automated easily; others will require significantly more research before fully automated solutions are viable. Watch the video for full details.
Subsets of feature engineering:
- Feature extraction – machines can easily do the one-hot encoding, but things like labeling are tricky (limited label data and active learning are helping)
- Feature estimation and selection – machines very easily do variable/predictor importance
- Feature creation – a subset of feature engineering – is still largely a creative task
- Feature imputation – also a subset of feature engineering – is knowing what’s missing from a dataset (for example, the subscriber data I forgot to pull for the Marketing Over Coffee podcast analysis)
These tasks are difficult to automate. Will they ever be automated? Probably, but not for a while, especially the latter parts, which require significant domain expertise. For the most valuable models, these tasks will become automated, but for a great many models it will take a long time, if it ever happens, for automated solutions to be built.
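For contrast, the estimation and selection item in the list above is the kind of step that automates readily. The following is a minimal sketch, not a production method: it uses absolute Pearson correlation as a crude stand-in for variable importance, and the function names (`pearson`, `rank_features`) and the podcast numbers are invented purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / sqrt(var_x * var_y)


def rank_features(columns, target):
    """Rank feature names by absolute correlation with the target, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up numbers for five hypothetical podcast episodes.
episodes = {
    "social_shares": [10, 20, 30, 40, 50],
    "search_clicks": [5, 3, 8, 2, 7],
    "day_of_month": [1, 1, 1, 1, 1],  # constant, so it should rank last
}
downloads = [100, 210, 290, 405, 500]

ranking = rank_features(episodes, downloads)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

A real project would more likely lean on a library's importance measure, but the shape is the same: score every candidate feature against the target and keep the strongest. Nothing in that loop requires domain judgment, which is why machines handle it well.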
What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.
In today’s episode, Michael asks: have you heard of augmented analytics, as defined by Gartner? It seems to me it means your job will get easier in the short run, and you’ll be out of business in the long run, if you believe it. I’d be interested in your comments on it. So I took a look at the article that Michael had shared about augmented analytics. And fundamentally, after you read through it, it is, as consulting firms often do, their branded spin, their branded name on something very common. Augmented analytics is what the rest of the world calls automated data science: the ability to use machine learning and AI technologies to take a data set, transform it, and do a lot of the analysis and insights generation from that data set. Automated data science holds a lot of promise. But the challenge is, when you look at the data science lifecycle, there is a stage where they say in the article that data preparation is 80% of the data scientist’s work, and it’s mundane work, which isn’t really true. That’s something that’s said often by people who are not data scientists.

Feature engineering, as a subset of that, is probably the most important part. So when we think about it, there are sort of three parts to this section of data science: there is getting the data, there’s cleaning the data, and then there’s preparing the data for usage. Getting the data, yes, is something that should be automated, because pulling data out of APIs and things is a very, very programmatic process. Cleaning the data, again, is something that can be automated to some degree. There are a number of good machine learning libraries that can help you clean your data. The hard part is the preparation of the data. This is done in a process called feature engineering. And feature engineering simply means finding ways to make the data set more valuable and more useful for machine learning modeling. And there are four parts to it that are important.
There is feature extraction, which is when you are creating features, or doing processing on features. I should clarify: a feature is nothing more than a dimension. If you think about Google Analytics, for example, there are dimensions and metrics. Metrics are the numbers; dimensions are the aspects. So metrics are: how many visitors did you get to your website? Dimensions are: which website, which sources did they come from, like Facebook or email, and so on and so forth. Dimensions are not numbers; metrics are numbers. So when we’re talking about feature engineering, we’re talking about engineering additional dimensions and metrics from the dimensions and metrics you already have. So for example, in a tweet, a dimension would be the date, right, and you could engineer additional things from that date, such as the year, the month, the day, the day of the year, the day of the month, the day of the quarter, and so on and so forth. There’s simple feature extraction like that, or what’s called one-hot encoding, which is a way of turning words into numbers. So if you had a database of days of the week, Sunday would become a one and Monday would become a two, and so on and so forth. That stuff, yes, machines can easily automate, and it’s something that machines absolutely should do. When it comes to feature extraction, though, things like labeling get very tricky. Again, marketers see this a lot in things like sentiment, when you try to assess: is a tweet positive, neutral, or negative? Well, there’s a lot of judgment that goes into that kind of labeling, and machines are getting better at it, but are still not great at it. And when you have limited labeled data, especially for more complex data sets, yes, again, there are machine learning algorithms like active learning that are starting to help, but they are still very, very limited in what they can do.
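The simple extraction steps described above, splitting a date into parts and encoding day names as numbers, can be sketched in a few lines. This is an illustrative sketch only: the helper names (`date_features`, `one_hot`) and the tweet date are hypothetical. One note on terminology: mapping Sunday to one and Monday to two is usually called ordinal (or label) encoding, while one-hot encoding gives each category its own 0/1 column. The sketch shows the one-hot form.

```python
from datetime import date

# Index matches date.weekday(), where Monday is 0 and Sunday is 6.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def date_features(d):
    """Derive simple calendar features from a single date."""
    quarter = (d.month - 1) // 3 + 1
    quarter_start = date(d.year, 3 * (quarter - 1) + 1, 1)
    return {
        "year": d.year,
        "month": d.month,
        "day_of_month": d.day,
        "day_of_year": d.timetuple().tm_yday,
        "day_of_quarter": (d - quarter_start).days + 1,
        "weekday": WEEKDAYS[d.weekday()],
    }


def one_hot(value, categories):
    """Return one 0/1 column per category; exactly one column is 1."""
    return {f"is_{c.lower()}": int(value == c) for c in categories}


tweet_date = date(2019, 8, 15)  # hypothetical tweet timestamp
features = date_features(tweet_date)
encoded = one_hot(features["weekday"], WEEKDAYS)

print(features)
print(encoded)
```

Everything here is mechanical, which is the point being made: no judgment is needed to produce these columns. Deciding which of them are worth keeping, and what labels a tweet deserves, is where the difficulty starts.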
For example, labeling your data: is this a customer service tweet, is this a sales tweet, is this an advertising-related tweet, who should this tweet go to? I’m using Twitter as an example because it’s very easy to see the applications. Those labels are not something that a machine knows how to do out of the box; you have to provide that labeling. The second aspect of feature engineering is called estimation and selection: what features are relevant to the modeling you’re trying to do? If you’re building a machine learning model and you just throw all the data at it, you’re going to have exponential amounts of compute time required in order to have the model run correctly. So that’s something, again, a machine can very easily do, that kind of estimation and selection. And that is something that you absolutely should not attempt to do by hand.
The third and fourth aspects are the ones where augmented analytics, as Gartner calls it, or automated data science, really starts to run into trouble. Feature creation, which is a subset, really, of extraction in many ways, is largely a creative task. What features should we create? Just because you can create day or week or month, should you? Right? If estimation and selection is about winnowing down the features to the ones that are useful for a model, creation is adding new ones, and knowing which ones to add and which ones not to add, what’s relevant and what’s not relevant. So it’s a very, again, creative task, one where machines will be able to, at some point, do sort of a general best-practices version, but it will be difficult for them to come up with all the possible combinations, at least until machines have much larger data sets to work with and we build those active learning algorithms. The fourth one is one where I think machines have a significant amount of trouble, and will for a long time, and that is feature imputation. This is when you look at a data set and know what’s missing from it. So recently, I was looking at Marketing Over Coffee’s podcast data, and I wanted to run some machine learning models to figure out what drives things like downloads or episode popularity. I had Google Analytics data, and I had our podcast download data. I had search data, and I had social media sharing data. And I forgot one: I forgot to get the subscriber data from FeedBurner, which is a pretty big omission. Clearly I was not having enough coffee that day. I had to know from my domain experience that that data set was missing.
That’s something that machines will have a very difficult time doing. And yes, for the most valuable, most important models, it is likely that machines will be able to do baselines, you know, general best practices: hey, these features should be in a data set like this. But that’s a long way off, and that’s only going to be for the most valuable data sets. If you’re trying to build a podcast importance machine learning model, that’s not super valuable right now, and so there is no out-of-the-box template that a machine could automatically pick up and run with. So that domain expertise, that knowledge, that experience is very difficult to automate, very costly to automate, and the ROI may not be there. You would be better off having a data scientist with some generalized, broad experience of what goes into different types of models being able to provide that feature imputation. So, is augmented analytics, or automated data science, going to put us all out of business? No, not for a while. And by a while I’m talking, you know, five or 10 years, at a minimum.
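The “baseline” idea mentioned above, a best-practice list of features that should be in a data set like this, could look something like the sketch below. It is purely illustrative: the source names and the `missing_sources` helper are assumptions made up for the podcast example, not an existing template or tool.

```python
# Hypothetical baseline: the sources a podcast-performance model "should" include.
# Writing this list is the domain-expertise step; the check itself is trivial.
EXPECTED_SOURCES = {
    "web_analytics",
    "podcast_downloads",
    "search",
    "social_shares",
    "feed_subscribers",
}


def missing_sources(available, expected=EXPECTED_SOURCES):
    """Return the expected sources absent from the data set, sorted by name."""
    return sorted(set(expected) - set(available))


# The four sources that were pulled in the story above; subscribers were forgotten.
loaded = ["web_analytics", "podcast_downloads", "search", "social_shares"]
gaps = missing_sources(loaded)
print("Missing:", gaps)
```

The set difference is the easy part. What a machine cannot supply on its own is the `EXPECTED_SOURCES` list, which someone with experience of this type of model has to write down first.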
Machine learning models and AI models will keep getting better, and they will keep making our lives easier. But there’s still a long way to go. Even with some of the most powerful new tools in the marketplace, like AutoAI from IBM and AutoML from H2O, there’s still a substantial amount of feature engineering that needs to happen up front. And it is as much an art as it is a science, which is frustrating for people like me who like to have processes where you just say, this is the best practice, just do it. No, the best practice gets you the minimum level of competence for any given task, and then you have to add value on top of it. The good news, for all of us who are domain experts in our various fields and occupations, is that our experience, our perspective, and our ability to think creatively still matter, and will still matter for quite some time to come. So great question, Michael, a very, very detailed question. It’s important to understand these distinctions and why automated data science will not just be a magic push of a button. I could go on for hours about all the different examples where this falls down, but that is the short answer. As always, please leave your comments in the comments below, and please subscribe to the YouTube channel and the newsletter. I’ll talk to you soon. Want help solving your company’s data analytics and digital marketing problems? Visit TrustInsights.ai today and let us know how we can help you.
Want to read more like this from Christopher Penn? Get updates here: