
You Ask, I Answer: Company-Level Amazon Ecommerce Datasets?

Steve asks, “I’m looking for a dataset of companies that are actively selling on Amazon. How would you as a marketing data scientist approach this problem?”

That’s an interesting question. To my knowledge, there aren’t publicly available, free datasets of this sort (though please leave a link in the comments if you know one), so you’ll have to do a bit of legwork to create your own. Tools like BuiltWith and HubSpot can be a big help here.

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

In today’s episode, Steve asks, “I’m looking for a data set of companies that are actively selling on Amazon.

How would you as a data scientist approach this problem?” Hmm. Well, that’s an interesting question.

To my knowledge, I don’t know that there are any publicly available, free data sets of this sort that would do this; you’d probably end up building your own.

By the way, if you know of a data set that is publicly available and free, or even one that’s available and just costs money, leave a link in the comments below if you would.

For something like this, you’re gonna have to do a bit of legwork.

You’ve got to create your own. What you’ll have to do first is, if you have a known subset of companies that you know for sure are selling on Amazon, go to their websites and look for indicators that would help you classify those companies as Amazon sellers, and then build a second data set of companies you know are not selling on Amazon.

What you’re looking for is specific characteristics, something you can identify in an automated fashion that indicates that yes, this company is an Amazon seller.

There are really good tools. BuiltWith is one; HubSpot, actually HubSpot’s free CRM, is another that can analyze the most common technologies being used by a company’s website and provide that information to you.

In fact, let’s bring this up here.

So this is what you see.

This is inside of HubSpot.

This is a company; it’s based in Los Angeles.

You can see it has the time zone there, and then it has a box at the bottom called Web Technologies.

And you can see for this particular company on their website, they’ve got Microsoft Exchange for the email, YouTube, Google Tag Manager, Facebook Advertiser Pixel, Office 365, Adobe Analytics, Adobe DTM, reCAPTCHA, Google Analytics, AdRoll, and Outlook.

So this list of technologies is for this particular company.

Now, this is not an Amazon reseller.

This is just some company picked out of the pile randomly.

This company has this particular set of technologies, and these are good indicators of what their martech stack looks like.

So from an analysis perspective, you’re going to want to create a data set of, you know, 50 or 100 known Amazon sellers and 50 or 100 known non-Amazon sellers.

And you’re going to want to extract this data from HubSpot or from BuiltWith, either company’s data is fine, and put it together in some sort of spreadsheet.

Or if you want to get more sophisticated and use some of the more fancy tools like Python or R, you could certainly do that.

But ultimately, what you want to do is build a profile: what are the common technologies in use by Amazon sellers? What are the common technologies in use by non-Amazon sellers? And what’s the difference? Is there a particular technology, or a combination of technologies, that predicts pretty well that a company is an Amazon seller? There are certain things that are just dead giveaways.

Like, that’s what this company does, or this company has.

For example, Amazon has tracking tags, right? There’s tons of tracking tags that they offer for affiliates.

Are those the ones? Is that a good indicator? Or are those tags so prevalent that it’s a misleading signal? You won’t know until you do the data analysis. But once you have that, then you’ll have the key, essentially, to being able to identify a list of companies. Then from there, you load those companies into, you know, BuiltWith or HubSpot or whatever, just willy-nilly.

And as you can see, one of the things that these tools will also do is give you a general sense, mostly for publicly traded companies, of what their annual revenue is, how many employees they have, etc.

And that will really help identify and separate out these different types of companies.

It is going to be a lot of work.

It is a lot, a lot of work.

And it’s very manual work, because you have to hunt down those companies on Amazon, and then equally, pull together a list of other ecommerce companies that are not on Amazon.

But for that training data set, you’re gonna want a good sample; you’re gonna want 50 or 100 companies in either category. That will give you a robust enough data set to see the patterns in it, to see whether there are certain things that almost everybody on Amazon always uses on their websites.

There may not be a pattern. That is a risk with a project like this: there may not be a pattern. But then you know that that is no longer something you can rely on.

And you’ll have to source the data some other way.

That knowledge alone has value.

That knowledge alone, even if there’s not a there there, will tell you: okay, we know that these web technologies, or company size, or number of employees, or year they were founded, or publicly traded or not, are good or bad indicators of whether a company sells on Amazon or not as an ecommerce company.

Pull the data together.

Your best bet is going to be to store it in a spreadsheet initially. And what comes out of HubSpot, at least for the HubSpot API as far as I know, is that all the technologies come out in one big text string, and one of the things you have to do is separate that out into different columns, which is not a lot of fun, but it is doable.

And then what I would suggest doing is turning each of those into flags.

So for example, Google Analytics is a one for yes, zero for no.
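
As a rough illustration of that splitting step, here is a minimal Python sketch. It assumes the technologies arrive as one semicolon-delimited string per company; the delimiter, the company names, and the helper names `parse_technologies` and `to_flag_rows` are all made up for the example, so check what your export actually looks like before relying on it.

```python
def parse_technologies(raw, sep=";"):
    """Split one delimited technology string into a set of clean names."""
    # The ";" separator is an assumption; adjust it to match the real export.
    return {name.strip() for name in raw.split(sep) if name.strip()}


def to_flag_rows(companies, sep=";"):
    """Turn {company: raw_string} into (columns, rows) of 1/0 flags."""
    parsed = {name: parse_technologies(raw, sep) for name, raw in companies.items()}
    # One column per technology seen anywhere in the data set.
    columns = sorted(set().union(*parsed.values()))
    rows = {
        name: {tech: int(tech in techs) for tech in columns}
        for name, techs in parsed.items()
    }
    return columns, rows


# Hypothetical example data, not taken from any real company record.
companies = {
    "Example Co": "Google Analytics; YouTube; Microsoft Exchange",
    "Sample LLC": "Google Analytics; AdRoll",
}
columns, rows = to_flag_rows(companies)
for name, flags in rows.items():
    print(name, flags)
```

Each company ends up with one 1/0 value per technology, which is the same layout as the spreadsheet described here.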

And then you have essentially a spreadsheet with 50 or 100 columns on it.

And then for each company, you would have a field to indicate, like, Amazon seller: yes or no, or one or zero.

And then, you know, Google Analytics, one or zero; Microsoft Exchange, one or zero; YouTube, one or zero. That data format will let you do the analysis very quickly.

Because you can start to add up, count the numbers of, you know, ones and zeros in each of the columns.

And that will give you a much better, more robust analysis.
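
If you do go the Python route, the counting could look something like the sketch below. It assumes each row is a dictionary of 1/0 flags plus a label column; the column name `amazon_seller`, the function name `adoption_rates`, and the four sample rows are invented for illustration, and the function expects at least one company in each group.

```python
def adoption_rates(rows, label="amazon_seller"):
    """Return {tech: (seller_rate, non_seller_rate)} from rows of 1/0 flags."""
    sellers = [r for r in rows if r[label] == 1]
    others = [r for r in rows if r[label] == 0]
    techs = [key for key in rows[0] if key != label]
    rates = {}
    for tech in techs:
        # Summing the 1/0 flags counts the companies using the technology.
        seller_rate = sum(r[tech] for r in sellers) / len(sellers)
        other_rate = sum(r[tech] for r in others) / len(others)
        rates[tech] = (seller_rate, other_rate)
    return rates


# Hypothetical rows; a real data set would have 50 to 100 companies per group.
rows = [
    {"amazon_seller": 1, "Google Analytics": 1, "AdRoll": 1},
    {"amazon_seller": 1, "Google Analytics": 1, "AdRoll": 1},
    {"amazon_seller": 0, "Google Analytics": 1, "AdRoll": 0},
    {"amazon_seller": 0, "Google Analytics": 1, "AdRoll": 1},
]
rates = adoption_rates(rows)

# Rank technologies by how far apart the two groups are.
ranked = sorted(rates, key=lambda t: abs(rates[t][0] - rates[t][1]), reverse=True)
for tech in ranked:
    print(tech, rates[tech])
```

A technology that everyone uses, like Google Analytics in this toy data, shows no gap between the groups, which is the "so prevalent it's misleading" case.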

As I said, it’s going to take some time.

But if you approach it with this methodology, the 50 to 100 companies in your target audience and the things they have in common, and the 50 to 100 that are not in your target audience and the things they have in common, and you look for the intersections between the two, you will get an answer of some kind.
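
One way to judge whether a gap between the two groups is more than chance is a significance test on the counts for a single technology. The sketch below uses Fisher's exact test, which is my choice for small samples like these and not something named in the video; the function name `fisher_exact_two_sided` and the example counts are hypothetical. It needs Python 3.8 or later for `math.comb`.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    a: sellers using the technology      b: sellers not using it
    c: non-sellers using the technology  d: non-sellers not using it
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x sellers using the technology.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    observed = prob(a)
    # Add up every table that is at least as unlikely as the observed one.
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= observed * (1 + 1e-9)
    )


# Hypothetical counts: 40 of 50 sellers vs. 15 of 50 non-sellers use a tag.
p_value = fisher_exact_two_sided(40, 10, 15, 35)
print(f"p-value: {p_value:.6f}")
```

A small p-value suggests the technology really does separate the two groups. With 50 to 100 columns, some will look significant by luck alone, so treat any single result as a lead to check and not as proof.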

If you don’t get that answer, then you also know that for anyone selling a list, there’s a pretty good chance you would have to at least grill them really well.

“Okay, how did you get this information? What criteria did you use? How did you scrape the information?” And if they say, “Well, you know, we looked at their web technologies,” and you’ve already done your own analysis, you can say, “Hmm, I did that too.

I didn’t find anything that was statistically relevant.”

And if they give you an answer like, “Well, it’s a proprietary blend of our own technologies,” and stuff like that:

No.

But in talking to the people who are providing these lists as vendors, doing your own work first gives you much more depth to the questions you can ask to qualify them as a vendor, to say, “Yes, that sounds like something I hadn’t tried.

You might be onto something,” or, “You know, I did that, and I didn’t see what you’re seeing.

So I’m not sure how reliable your data is.”

That way you can avoid spending a whole lot of money without having any results to show for it.

If you have follow-up questions about this, leave them in the comments box below.

This is a challenging data science question.

It’s not really a data science question; it’s a data analysis question, although having the control and having the experiment group does start to lean it towards a scientific question.

It’s an exploratory data analysis problem first: is there even a there there? Before you can form a hypothesis, that’s what this information would help you start to lean towards, in terms of the data that you would need and things like that.

Again, if you have questions, leave them in the comments box below.

Subscribe to the YouTube channel and the newsletter. I’ll talk to you soon.

Take care.

Want help solving your company’s data analytics and digital marketing problems? Visit TrustInsights.ai today and let us know how we can help you.

