You Ask, I Answer: Best Tools for Cleaning Data?

Jessica asks, “What are the best tools for cleaning data?”

That’s a fairly broad question. It’s heavily dependent on what the data is, but I can tell you one tool that will always be key to data cleansing no matter the data set: the neural network between your ears.

Can’t see anything? Watch it on YouTube here.

Listen to the audio here:

Download the MP3 audio here.

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

In today’s episode, Jessica asks, “What are the best tools for cleaning data?” So, a fairly broad question.

It’s really heavily dependent on what the data is, because every data set and every data type is different.

And our definition of cleaning data also is going to be very different based on what it is we’re trying to do.

There’s a bunch of different types of cleansing you’d want to do.

Do you want to identify anomalies and get rid of them? Do you want to fix missing data? What kinds of things are you looking for? Are you trying to detect corrupted data? All of these different situations require different types of tools. Identifying anomalies is pretty straightforward; you can do that even in Excel. Depending on the size of your data you may not want to, but for smaller data sets, a spreadsheet will do fine for at least identifying anomalies, doing basic exploratory data analysis, and summarizing your tables.

So, things like means, medians, and interquartile ranges: all of these are good for understanding the shape of the data set and what it does.
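As a sketch of that kind of basic exploratory summary, here is what those statistics look like in Python. The data (a hypothetical column of daily website sessions with one obvious outlier) and the 1.5 × IQR anomaly rule are illustrative, not from the video:

```python
import statistics

# Hypothetical data: daily website sessions with one obvious outlier.
sessions = [1203, 1187, 1250, 1174, 1221, 9850, 1198, 1240, 1215, 1189]

mean = statistics.mean(sessions)
median = statistics.median(sessions)

# Interquartile range: the spread of the middle 50% of the values.
q1, _, q3 = statistics.quantiles(sessions, n=4)
iqr = q3 - q1

# A common anomaly flag: anything beyond 1.5 * IQR outside the quartiles.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
anomalies = [s for s in sessions if s < low or s > high]

print(median, anomalies)  # the outlier 9850 gets flagged
```

Notice how the outlier drags the mean well above the median; that gap between the two is itself a quick smell test for anomalies.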

Identifying corrupted data is a lot harder.

That requires sampling and inspection.

So real simple example if you were to go through your email list.

What are the different ways that you could identify bad emails, right? There are going to be some that are obvious, like someone who types in “gmal.com,” forgetting the letter i in gmail.com. That’s something you can programmatically try to address; fixing common misspellings of the most well-known domains would be an obvious thing to do.
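A minimal sketch of that kind of programmatic fix, assuming a small, hand-built map of misspelled domains (real cleaning tools use much longer lists than this):

```python
# Hypothetical lookup table of common misspellings of well-known domains.
COMMON_TYPOS = {
    "gmal.com": "gmail.com",
    "gmial.com": "gmail.com",
    "yaho.com": "yahoo.com",
    "hotmial.com": "hotmail.com",
}

def fix_domain(email: str) -> str:
    """Correct a known domain misspelling; leave everything else alone."""
    local, _, domain = email.rpartition("@")
    if not local:
        return email  # no @ sign at all; nothing to correct here
    return f"{local}@{COMMON_TYPOS.get(domain.lower(), domain)}"

print(fix_domain("jessica@gmal.com"))  # jessica@gmail.com
```

The key design choice is that anything not in the lookup table passes through untouched, so a correction pass like this can never make a valid address worse.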

Other things again, using email as example, you may need specialized tools.

There’s a tool that we use where you upload your email list, it checks the addresses for validity, and it spits back, “Hey, here’s a list of the addresses that have gone bad.”

You will definitely need something like that for that specific use case.

And that’s, again, a very specialized tool.

For missing data, it depends on the type of data: categorical or continuous. Categorical means non-numeric; continuous means numeric. For numeric data, you can do things like predictive mean matching, for example, to try to infer or impute the missing data.

There’s actually a whole bunch of tools that are really good at this.

I use a bunch in R; there’s a bunch in Python as well. They can do everything up to really sophisticated neural networks to essentially guess at what the likely values of the missing data would be.
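For illustration, here is a simplified, single-predictor sketch of predictive mean matching in plain Python. The column names and numbers are hypothetical, and real tools (such as the mice package in R, or scikit-learn’s imputers in Python) handle many predictors at once and iterate:

```python
import statistics

# Hypothetical columns: ad spend (fully observed) and revenue (with gaps).
ad_spend = [100, 200, 350, 400, 520, 600]
revenue  = [210, 390, None, 820, None, 1250]

# Fit a simple least-squares line on the rows where revenue is observed.
observed = [(x, y) for x, y in zip(ad_spend, revenue) if y is not None]
mx = statistics.mean(x for x, _ in observed)
my = statistics.mean(y for _, y in observed)
slope = (sum((x - mx) * (y - my) for x, y in observed)
         / sum((x - mx) ** 2 for x, _ in observed))
intercept = my - slope * mx

def predict(x):
    return slope * x + intercept

# The "matching" in PMM: for each missing cell, borrow the actual value
# from the observed row whose predicted value is closest.
imputed = []
for x, y in zip(ad_spend, revenue):
    if y is None:
        donor = min(observed, key=lambda o: abs(predict(o[0]) - predict(x)))
        imputed.append(donor[1])
    else:
        imputed.append(y)

print(imputed)  # [210, 390, 820, 820, 1250, 1250]
```

Because PMM borrows real observed values rather than raw model predictions, the imputed numbers always look plausible for the data set, which is a large part of its appeal.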

These have flaws, particularly on cumulative data sets.

So if you’re doing a running total, and you’ve got a day or two of missing data, they don’t do well with that.

I’m not sure why.

If you have categorical data, there are tools like random forests that can, again, do that imputation, kind of guessing what the missing label is, with a caveat: the more data that’s missing, the harder it is for these tools to get it right. If you’ve got 1,000 lines in a spreadsheet and six rows are missing an attribute, these tools are probably going to do a pretty decent job of filling in those blanks.
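As a stand-in for a real random forest imputer (which you would get from scikit-learn in Python or missForest in R), here is a toy nearest-neighbor vote that guesses a missing categorical label. The rows and column names are hypothetical:

```python
from collections import Counter

# Hypothetical customer rows: two numeric features and one label column.
rows = [
    {"visits": 2,  "pages": 3,  "segment": "casual"},
    {"visits": 3,  "pages": 4,  "segment": "casual"},
    {"visits": 20, "pages": 35, "segment": "power"},
    {"visits": 22, "pages": 30, "segment": "power"},
    {"visits": 21, "pages": 33, "segment": None},   # missing label
]

def impute_segment(rows, k=3):
    known = [r for r in rows if r["segment"] is not None]
    for r in rows:
        if r["segment"] is None:
            # Vote among the k labeled rows nearest by squared distance.
            nearest = sorted(
                known,
                key=lambda o: (o["visits"] - r["visits"]) ** 2
                            + (o["pages"] - r["pages"]) ** 2,
            )[:k]
            r["segment"] = Counter(
                o["segment"] for o in nearest
            ).most_common(1)[0][0]
    return rows

impute_segment(rows)
print(rows[-1]["segment"])  # power
```

The same caveat from above applies here: the vote only works because most of the labels are present; as the labeled pool shrinks, the guesses degrade quickly.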

If you’ve got 1,000 lines and 500 are missing, you’re going to get a tossed salad back; it’s not going to be of any use, because so much of it is going to be wrong.

The general rule of thumb with a lot of data sets is that if anywhere between 25 and 40% of the data is missing, you’re not going to be able to do imputation well. And again, to the point of detecting bad inputs, it’s gonna be really hard.
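That rule of thumb is easy to check before you even attempt imputation. A hypothetical helper might look like this, using the conservative end (25%) of the range above as the default threshold:

```python
# Quick sanity check before imputing: how much of the column is missing?
def missing_fraction(column):
    return sum(1 for v in column if v is None) / len(column)

def safe_to_impute(column, threshold=0.25):
    """Apply the rough 25-40% rule of thumb at its conservative end."""
    return missing_fraction(column) < threshold

dense  = [1, 2, None, 4, 5, 6, 7, 8]               # 1/8 missing
sparse = [1, None, None, 4, None, None, 7, None]   # 5/8 missing

print(safe_to_impute(dense))   # True
print(safe_to_impute(sparse))  # False
```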

Really, really hard. There’s some stuff that’s gonna be easy, right? If somebody types in test@test.com into your marketing automation system, you can filter those out pretty easily, but non-obviously fake addresses are very difficult to clean out.

That’s going to be a lot of work, especially if they’re valid but incorrect.

So this is something called spiking.

You can have somebody spike a data set; there was a political rally not too long ago where a bunch of K-pop fans and TikTokers reserved a bunch of tickets and flooded the system with bad data.

The challenge is, and this should strike fear into the heart of every marketer:

If you collect spurious data, and it is in violation of a law and you use that data, you are liable.

Right, so, let’s say that my company is based in California, right? (It’s not.) And somebody put my work email into a system like that, but it was harvested, or it was faked.

And you, the marketer, send me email assuming that I signed up for this thing.

And I say I did not sign up for this, and you don’t adhere to, you know, basic best practices for unsubscribes and such, which a lot of political campaigns don’t.

You can be sued, you can be sued under the California Consumer Privacy Act.

So identifying bad data is very important, but also very, very difficult.

That said, maybe not the most powerful or the fastest, but the most important tool for cleaning data is a neural network.

This one right here, right? The tool between your ears is essential for every single one of these scenarios, because you need to bring domain expertise to the data set to understand what needs to be cleaned and what does not.

You need to bring data science experience to the data set to understand what’s possible to clean and what the limitations are.

And you need to bring good old fashioned common sense and the willingness to say, you know what, this isn’t gonna go well.

This is gonna go really badly.

Let’s not do this.

Find some other way to get the data, if you’re allowed to do so.

That’s the hardest part of cleaning data by far. Tools are less important than process.

And that in turn is less important than the people who are doing the work.

Because everything that can go wrong with data, at some point, will, and you’re going to need assistance getting that fixed up.

So, lots.

Lots of challenges in cleaning data.

And cleaning data is one of the things that marketing has traditionally not been really good at.

Hopefully, as more people embrace marketing data science, as more people do work in the field, we will elevate our overall proficiency at cleaning data, and making sure that it is useful and reliable.

The best place to start for learning how to do this, honestly, is with something like a spreadsheet and a small data set: you going in and learning all the ways data can go wrong in a data set you know very well.

So I would start there to teach yourself how to do these things.

And then, as you get into more sophisticated stuff like imputation of missing values, that’s when you’re going to need to bring in extra tools, or different tools.

Chances are, you’ll get to a point where you will need custom tools that you build yourself in order to clean the most complex challenges, so expect to do that at some point.

If you have follow up questions, leave them in the comments box below.

Subscribe to the YouTube channel and to the newsletter. I’ll talk to you soon; take care.

Want help solving your company’s data analytics and digital marketing problems? Visit TrustInsights.ai today and let us know how we can help you.


Want to read more like this from Christopher Penn? Get updates here:

subscribe to my newsletter here


AI for Marketers Book
Get your copy of AI For Marketers

Analytics for Marketers Discussion Group
Join my Analytics for Marketers Slack Group!

