How much data do you need to effectively do data science and machine learning?
The answer to this question depends on what it is you’re trying to do. Are you doing a simple analysis, some exploration to see what you might learn? Are you trying to build a model – a piece of software written by machines – to put into production? The answer depends entirely on the outcome you’re after.
Here’s an analogy. Suppose you’re going to bake cake. What quantities of ingredients do you need?
Well, how many cakes are you going to bake, and how large are they? There is a minimum limit to quantities just for the basic chemistry of baking a cake to happen at all, but there are cakes you can make that are disappointingly small yet are still cakes.
Are you baking a round cake? A sheet cake? Ten sheet cakes? How quickly do you need them?
You start to get the idea, right? If you need to bake 100 cakes in 24 hours, you need a much bigger oven, probably a much bigger mixer, perhaps an extra staff member, and a whole lot of flour, sugar, milk, eggs, and baking powder than if you’re baking a single cake.
The same is true of data science and AI. To do a simple exploratory analysis on a few Tiktok videos requires relatively little data. To build a model for the purposes of analyzing and reverse-engineering Tiktok’s algorithm requires tens of thousands of videos’ data, possibly more.
Some techniques, for example, can use as few as a handful of records. You can do linear regression technically with only three records, that’s the bare minimum amount you need for a simple linear regression to function. Other techniques like neural networks can require tens of thousands of records just to put together a functional model. That’s why it takes some experience in data science and machine learning to know what techniques, what recipes fit not only the outcome you have in mind, but also what ingredients and tools you have on hand.
There’s no firm benchmark about how much data you need, just as there’s no firm benchmark about how much flour you need for a cake. What is necessary is understanding the outputs you’re trying to create and then determining if you have the necessary ingredients for that output.
You might also enjoy:
- What Content Marketing Analytics Really Measures
- Best Practices for Public Speaking Pages
- How To Start Your Public Speaking Career
- Retiring Old Email Marketing Strategies
- The Biggest Mistake in Marketing Data
Want to read more like this from Christopher Penn? Get updates here:
Get your copy of AI For Marketers