#WinWithAI: How Data Preparation Must Change for AI

Warning: this content is older than 365 days. It may be out of date and no longer relevant.

As part of my work with IBM in the Win With AI Summit, one topic I’ve been asked to address is what technologies will impact AI strategies and rollout.

Register for the IBM Win With AI Summit in NYC here.

When we look at the data science lifecycle, we see that a healthy portion of the lifecycle is spent on data preparation. These tasks include:

  • Refactoring & reformatting data
  • One-hot encoding
  • Normalization/denormalization
  • Scaling/centering
  • Decomposition
  • Dimension reduction/PCA
  • Feature engineering

All these tasks are like tools in a toolbox or utensils in a drawer. Right now it takes a skilled, experienced data scientist to understand what to use. As deep learning improves and becomes more accessible through technologies like Watson Studio, we should see a reduction in the manual labor of data preparation for AI. That in turn will mean faster, better results.
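To make two of the tools above concrete, here is a minimal sketch of one-hot encoding and scaling/centering in plain Python. The `channels` and `spend` columns are made-up illustration data, and the function names are hypothetical; a real project would more likely reach for a library, but the logic is the same.

```python
from statistics import mean, pstdev


def one_hot_encode(values):
    """Re-encode a text column as one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


def center_and_scale(values):
    """Subtract the mean and divide by the standard deviation (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread, so every centered value is 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical example data, not from the original post.
channels = ["email", "search", "email", "social"]
spend = [100.0, 200.0, 300.0, 400.0]

encoded = one_hot_encode(channels)
scaled = center_and_scale(spend)

print(encoded[0])  # {'is_email': 1, 'is_search': 0, 'is_social': 0}
```

Deciding whether a column needs either treatment at all is the judgment call that currently falls to the data scientist.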

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

In today's episode, as part of my work with IBM for the Win With AI Summit (full disclosure: I'm compensated to participate in the event), one topic I've been asked to address is what technologies will impact AI strategies and rollout. When you look at the data science lifecycle, we see that a healthy portion of today's data science, which is a mandatory part of preparing data for use by machine learning and artificial intelligence technologies, a huge part is spent on data preparation. We spend 50, 60, 70, 80, 90% of our time on data prep.

And what are we doing? Well, we're doing things like filling in missing values, or imputing missing values, or dealing with them. We are dealing with all sorts of crazy data formats that make no sense. We are dealing with anomaly detection and removal, where it's appropriate to do so. We are tasked with making data relevant to each other; this is a process called scaling and centering, where we need to make the data fit in similar scales.

And there's a whole list of tasks: refactoring and reformatting; one-hot encoding, where we re-encode certain variables with numbers instead of text; normalization or denormalization of tables, depending on how we want to do our analysis; decomposition, where we take data and break it apart into component pieces, which is the opposite of normalization in some ways; dimensionality reduction and principal component analysis, where we're trying to reduce the number of columns (so it's funny: decomposition adds new columns, dimension reduction reduces columns); and identification of key variables: what are the variables that are most impactful to a data set?

And all this really falls under a bucket called feature engineering. This is a huge chunk of time spent by data scientists and AI engineers to make AI and machine learning work properly. It is also one of the biggest obstacles to companies rolling out artificial intelligence initiatives within the company, because in a lot of cases, companies lack good governance. They lack great data or high-quality data; they've got the data, they just don't have it in a format that's accessible and usable for machine learning. So feature engineering, data cleansing, data preparation, all this is stuff that we spend a tremendous amount of time on, and very, very expensive time, right now.

Now, these tasks are all tools in the toolbox or utensils in a drawer. Like a tool or a utensil, right now you need a skilled, experienced data scientist, someone who's got the ability to work with the data, to correctly choose and use the tools. Not every dataset needs, for example, one-hot encoding. Not every dataset needs principal component analysis.
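[Editor's illustration, not part of the spoken transcript.] The missing-value and anomaly handling mentioned earlier in the transcript could look something like the sketch below. The transcript does not say which methods are used, so median imputation and a modified z-score based on the median absolute deviation are assumptions chosen for simplicity, and `daily_visits` is made-up data.

```python
from statistics import median


def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def flag_anomalies(values, threshold=3.5):
    """Flag points whose modified z-score exceeds the threshold.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so a single extreme value cannot hide itself
    by inflating the spread estimate.
    """
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return [False] * len(values)
    return [abs(0.6745 * (v - center) / mad) > threshold for v in values]


# Hypothetical example data: one missing day and one obvious spike.
daily_visits = [120, 130, None, 125, 128, 5000, 122]

filled = impute_missing(daily_visits)
flags = flag_anomalies(filled)
cleaned = [v for v, is_outlier in zip(filled, flags) if not is_outlier]

print(cleaned)  # [120, 130, 126.5, 125, 128, 122]
```

Whether the spike should be removed at all is exactly the "where it's appropriate to do so" judgment: it might be a tracking error, or it might be the one day a campaign went viral.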

Right now we need that human to apply that judgment and then go do the thing, go execute on the activity. Again, with data scientists costing anywhere from $300,000 to $500,000 to $700,000 a year, that gets super expensive, right? That's a data scientist who you're paying $300,000 to $700,000 a year; their hourly bill rate, effectively, is $350 an hour. And $350 an hour to have someone sort of copying and pasting and tuning stuff up is a waste of money.
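[Editor's illustration, not part of the spoken transcript.] The hourly figure follows from the annual one under a common rule of thumb. The 2,000-hour working year below is an assumption; the transcript does not state how the rate was derived.

```python
# Assumption (not stated in the transcript): a full-time year is about
# 50 working weeks of 40 hours each, i.e. 2,000 hours.
HOURS_PER_YEAR = 50 * 40


def hourly_rate(annual_cost):
    """Convert an annual cost into an effective hourly rate."""
    return annual_cost / HOURS_PER_YEAR


rates = {salary: hourly_rate(salary) for salary in (300_000, 500_000, 700_000)}
print(rates)  # {300000: 150.0, 500000: 250.0, 700000: 350.0}
```

Under that assumption, the $350 an hour quoted corresponds to the top of the range, $700,000 a year.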

So when you look at the benefits of AI, of artificial intelligence (acceleration, accuracy, and automation), all three of these are things that can be, should be, and are being applied to data preparation. Through deep learning technologies, we have seen in the last couple of years a tremendous effort towards automated feature engineering, where, with strong deep learning technologies, machines can pre-engineer the data set and then hand it off to a human for final inspection and sampling.

That is still, in many ways, not accessible to the business user. And it is not even accessible to the average data scientist who is not working specifically with machine learning technologies. That's changing, and where we will see new technologies impacting artificial intelligence in the coming year is with these features becoming much more available and much more accessible to non-hardcore machine learning specialists.

A really good example of this, of course, is IBM Watson Studio. Even if you're using Keras and TensorFlow, and you're trying out Auto-Keras and things like that, you're still slinging code. One of the benefits of a service like Watson Studio is that it takes the same system and puts it into a drag-and-drop interface. So now, instead of needing to write the code to set up the deep learning framework, you drag and drop the pieces together. As long as you understand the architecture and you understand the outcome you want, it's a lot faster to get up and running. Things like that will continue to improve, and will continue to be enhanced with technologies like Auto-Keras, so that our preparation process and our preparation time will diminish.

So we will get to our answers faster, and we will get better answers. Because obviously, if you're relying on a human to mix and match the tools, there's no guarantee that the human won't have a bad day. This morning, it took me five minutes to remember the term feature engineering; I kept getting stuck with refactoring.

And so removing the humans from those processes will make the processes faster and more reliable, and will free up those humans to do things like, you know, make extra large cups of coffee as they watch the machines work.


In terms of what we should be looking for in the next year within AI technology, specifically around data, we want to keep our eyes very carefully on automated feature engineering and automated data preparation, because that's where the biggest bang for the buck is: reduce the time to start modeling, reduce the time to start creating outcomes and outputs, while still making sure that we have interpretability of our data and interpretability of our models. And again, services like Watson Studio will help enormously with that. New technologies like Auto-Keras will help enormously with that. And that will eventually let these tools be available to people like you and me, where we are not necessarily PhDs, we are not necessarily multiple-PhD holders; we're folks trying to get something done.

The technology is moving really, really fast right now. Every day there are new innovations, every day there are new improvements, and every so often there are really big breakthroughs that just turn up the dial on how fast we can get access to these technologies. So there's a lot to look forward to in the next year. And it would not surprise me if within a couple of years there are business-user-friendly drag-and-drop interfaces for data preparation where you don't even need a data science degree or certification. You're just your average middle manager; you drag and drop a few things, and out the other end spits a data set ready for modeling. You hand that off to your data team to make stuff work, but it contains the data that you want as a business user.

So I hope to see you at the Win With AI Summit in New York City on September 13. If you're not going to be there, you can tune in online as well. There's a link in the notes to register, and I will talk to you soon. Please subscribe to the YouTube channel and newsletter. Take care.

If you want help with your company's data and analytics, visit TrustInsights.com today and let us know how we can help you.

FTC Disclosure: I am an IBM Champion and am compensated by IBM to support and promote IBM events such as the Win With AI Summit.
