Warning: this content is older than 365 days. It may be out of date and no longer relevant.

How Much Data Do You Need For Data Science and AI?

How much data do you need to effectively do data science and machine learning?

The answer to this question depends on what it is you’re trying to do. Are you doing a simple analysis, some exploration to see what you might learn? Are you trying to build a model – a piece of software written by machines – to put into production? The answer depends entirely on the outcome you’re after.

Here’s an analogy. Suppose you’re going to bake a cake. What quantities of ingredients do you need?

Well, how many cakes are you going to bake, and how large are they? There is a minimum quantity needed just for the basic chemistry of baking a cake to happen at all, but you can make cakes that are disappointingly small and are still cakes.

Are you baking a round cake? A sheet cake? Ten sheet cakes? How quickly do you need them?

You start to get the idea, right? If you need to bake 100 cakes in 24 hours, you need a much bigger oven, probably a much bigger mixer, perhaps an extra staff member, and a whole lot more flour, sugar, milk, eggs, and baking powder than if you’re baking a single cake.

The same is true of data science and AI. A simple exploratory analysis of a few TikTok videos requires relatively little data. Building a model to analyze and reverse-engineer TikTok’s algorithm requires data from tens of thousands of videos, possibly more.

Some techniques can work with as few as a handful of records. You can technically do a simple linear regression with only three records. Two points define a line exactly, so a third is the bare minimum you need for the regression to have any error left over to measure. Other techniques, like neural networks, can require tens of thousands of records just to put together a functional model. That’s why it takes some experience in data science and machine learning to know which techniques, which recipes, fit not only the outcome you have in mind, but also the ingredients and tools you have on hand.
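To make the three-record point concrete, here’s a minimal sketch in plain Python. The function name `fit_line` and the sample numbers are made up for illustration; they aren’t from any real dataset. It fits a straight line by ordinary least squares and reports how many residual degrees of freedom are left, which is the number of records minus the two values being estimated (slope and intercept).

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    Returns (slope, intercept, residuals, degrees_of_freedom).
    fit_line is a hypothetical helper written for this illustration.
    """
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must be the same length")
    if n < 2:
        raise ValueError("need at least two records to define a line")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Spread of x around its mean, and how x and y vary together
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # How far each record sits from the fitted line
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

    # Two parameters are estimated, so n - 2 degrees of freedom remain
    return slope, intercept, residuals, n - 2


# Illustrative numbers only
three_records = fit_line([1, 2, 3], [2.0, 4.1, 5.9])
two_records = fit_line([1, 2], [2.0, 4.1])

print("three records:", three_records)
print("two records:  ", two_records)
```

With two records the line passes through both points, so the residuals are effectively zero and there are no degrees of freedom left to judge how good the fit is. The third record is what gives you anything to measure, which is a long way from saying three records make a trustworthy model.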

There’s no firm benchmark for how much data you need, just as there’s no firm benchmark for how much flour you need for a cake. What matters is understanding the output you’re trying to create and then determining whether you have the necessary ingredients for it.

Happy baking!
