There’s been quite a bit of confusion about how Google works when it comes to the process of indexing and ranking our websites, so I thought I’d take a few minutes to lay out the process as best as we know it. Much of this information comes directly from Google’s technical teams – developer interviews, podcasts, and research publications.
Broadly, Google has six main stages of processing when it comes to understanding our websites and what’s on them – and what to show users in search results. Let’s dig through each of these.
Stage 1: Crawl
Google first needs to get information from our websites to process. Their software, GoogleBot, does exactly this. It crawls our site, page by page, and vacuums up the data our site provides into a serialized protocol buffer – essentially taking all the data and converting it into machine-readable formats.
What we see is not what GoogleBot sees; GoogleBot reinterprets our pages and reorders stuff based on its own needs. You can see what GoogleBot sees right from within Google Search Console if you’re curious:
Note the major differences in the code. GoogleBot has taken the source code for my site, slimmed it down, and rewritten it to make it easier for other Google systems to process.
Key action to take: make sure your site is accessible to GoogleBot! Be sure that pages you want to be found are set up to be found – and vice versa: for pages you don’t want crawled, use the appropriate tools like robots.txt to keep GoogleBot away from them.
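Here is a minimal sketch of how you might check a set of URLs against your robots.txt rules, using Python's standard library. The rules and URLs are made-up examples, and the function name crawl_report is my own. Python's parser does simple prefix matching, so treat the result as a rough check rather than an exact replica of how GoogleBot reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; swap in your site's real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /thank-you/
"""

def crawl_report(robots_txt, urls, agent="Googlebot"):
    """Return {url: True if `agent` may fetch it under these rules}."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {url: parser.can_fetch(agent, url) for url in urls}

report = crawl_report(ROBOTS_TXT, [
    "https://www.example.com/blog/how-google-works/",
    "https://www.example.com/admin/settings",
])
for url, allowed in report.items():
    print("ALLOWED" if allowed else "BLOCKED", url)
```

Run it against the pages you most want found; anything that comes back BLOCKED deserves a second look.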
Stage 2: Render
Once GoogleBot has gone through our site and extracted all the information, that specialized version of our site is handed off to a system Google has named Caffeine. Caffeine uses a version of Chrome – like the web browser – to render, or view each page. Some important things that happen in this phase of the process include:
- Converting binary documents like PDFs, spreadsheets, etc. to HTML where applicable
- Normalizing HTML
- Understanding the overall document structure, page headings, syntax, etc.
In interviews with the developer team, they express a ton of frustration about how so many sites are badly coded and fail to conform to even basic good HTML, making the job of the Chrome server farms much harder. Pages and sites that render faster, easier, and more cleanly will do better in Google’s rendering farms.
Check your site in Chrome’s Developer Tools – it’s a powerful set of tools and critically, the same tools and code Google uses in its render farms to understand our pages. What you see in Chrome Developer Tools is what Google sees when it tries to render your page – and things like Core Web Vitals are checked here, which will become ranking signals in 2021.
One critical thing to note is that if a page fails to render properly, Google will make its best effort to try fixing it internally – and that may remove some content that could be used for ranking later.
Key action to take: validate your HTML with a good validator like the W3C validator and fix critical errors. Make your site as fast and as clean as possible.
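As a rough first pass before running a full validator, a sketch like this can flag tags that are opened and never closed, or closed without being opened. It is my own simplified check built on Python's html.parser. It is stricter than the HTML spec (which allows some end tags, like those for paragraphs and list items, to be omitted), and it is not a replacement for the W3C validator.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class TagBalanceChecker(HTMLParser):
    """Tracks open tags on a stack and records mismatches."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag in self.stack:
            # Anything opened after `tag` and still open was never closed.
            while self.stack[-1] != tag:
                self.problems.append(f"<{self.stack.pop()}> never closed")
            self.stack.pop()
        else:
            self.problems.append(f"</{tag}> has no matching open tag")

def check_html(html):
    """Return a list of human-readable tag balance problems."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    leftovers = [f"<{t}> never closed" for t in reversed(checker.stack)]
    return checker.problems + leftovers

print(check_html("<div><span>Hello</div>"))
```

An empty list means the tags balance; it says nothing about speed or Core Web Vitals, which still need Chrome Developer Tools.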
Stage 3: Collapse
The third part of Google’s order of operations is collapse, where they take the rendered data from their massive Chrome server farms and start throwing things out. What gets thrown out? Error pages. Bad redirects. Pointless redirects.
Using some of the training data from raters working with the Google Search Quality Rating Guidelines, Google discards pages at this point that have no value and would just take up space in its servers. They expressly don’t index error pages, and they do attempt to discern even soft error pages.
For example, if your site has a missing page and instead of throwing a 404 error, it redirects people to the homepage (a common trick used by some SEO folks to avoid having 404 errors, but a bad practice), Google will simply discard the original error page entirely.
Key action to take: Instead of tricks to deal with error pages, actually fix broken pages on your site so that they work correctly.
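To audit your own site for this pattern, you could request a URL you know is gone, note the final status code and final URL after redirects, and classify the result. The sketch below shows only the classification step. The function name and the rule it applies are my own heuristic, not Google's detection logic.

```python
from urllib.parse import urlsplit

def classify_missing_page(requested_url, status, final_url):
    """Classify how a site answered a request for a page known to be gone.

    `status` and `final_url` are what you observed after following redirects.
    """
    requested = urlsplit(requested_url)
    final = urlsplit(final_url)
    if status in (404, 410):
        return "hard error"  # an honest error response
    landed_on_home = final.path in ("", "/") and requested.path not in ("", "/")
    if status == 200 and landed_on_home:
        # The missing page was masked by a redirect to the homepage.
        return "likely soft error"
    return "ok"

print(classify_missing_page(
    "https://www.example.com/old-post/", 200, "https://www.example.com/"
))
```

Pages that come back as a likely soft error are the ones to fix properly, either by restoring the content or by returning a real error.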
Stage 4: Extract
The fourth stage in Google’s order of operations is extraction. At this point, they’re looking to pull out all structured data on a site to understand what each page is about, what the contents are, and how they relate to each other.
Google’s servers do entity extraction, likely using both custom code and the machine learning model BERT, to identify entities on a page. Entities include things like people’s names, place names, proper nouns, etc. – parts of speech that give context to a page. They also do more formulaic extraction of things like phone numbers.
Developers emphasize that they look for explicitly declared structured data first as a way to conserve resources, so sites using schema markup, JSON-LD, and other structured data languages will receive preference and cleaner extraction of what the page is about based on that. For example, if you have a page with multiple phone numbers on it but you’ve declared in your structured data that one of those phone numbers is your primary phone number – the one you want customers to call – Google likely will ingest that declared number as the preferred one and show it in things like the OneBox in search.
Key action to take: Use structured data! Your site should absolutely be using JSON-LD or schema markup to tell Google exactly what a page is about. For common page types like articles, recipes, lyrics, etc. the more you tell Google, the better it will extract information from your page.
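As an illustration of the phone number example, here is a sketch that builds a JSON-LD snippet declaring one primary customer phone number using schema.org's Organization and ContactPoint types. The company name, URL, and number are placeholders, and organization_jsonld is a helper name I made up. Whether Google actually surfaces the declared number is up to Google.

```python
import json

def organization_jsonld(name, url, primary_phone):
    """Build a JSON-LD script tag that declares one primary phone number."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "contactPoint": {
            "@type": "ContactPoint",
            "telephone": primary_phone,
            "contactType": "customer service",
        },
    }
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'

# Placeholder values for illustration only.
snippet = organization_jsonld(
    "Example Coffee Co.", "https://www.example.com", "+1-617-555-0100"
)
print(snippet)
```

The printed script tag goes in the page's HTML, typically in the head.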
Once you’ve implemented structured data, use the Rich Results test tool to validate that it’s working:
Stage 5: Index
Up until this point, everything that’s been happening has been part of the crawling process, the part where Google takes in the data and makes use of it. Crawling is the first of the three big operations. Indexing is part two, in which Google takes all its processed data and does something with it.
In indexing, Google adds your site’s page data to its search index. This means that a page is eligible to show up in search results. To get here, your page has to have been crawlable by GoogleBot, been renderable, still had valid results after collapse, and had usable information extracted.
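One way to think about these prerequisites is as a checklist you walk in order, stopping at the first stage that fails. The sketch below is purely a mental model for your own audits. The field and function names are invented, and it does not reflect how Google represents pages internally.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PageStatus:
    """Your own audit findings for one page (hypothetical fields)."""
    crawlable: bool
    renders: bool
    survives_collapse: bool
    has_extractable_content: bool

# Stage name paired with the PageStatus field that gates it, in pipeline order.
STAGES = [
    ("crawl", "crawlable"),
    ("render", "renders"),
    ("collapse", "survives_collapse"),
    ("extract", "has_extractable_content"),
]

def first_blocking_stage(page: PageStatus) -> Optional[str]:
    """Return the first stage that would keep the page out of the index."""
    for stage, field_name in STAGES:
        if not getattr(page, field_name):
            return stage
    return None  # nothing blocks the page from being eligible for indexing

page = PageStatus(crawlable=True, renders=True,
                  survives_collapse=False, has_extractable_content=True)
print(first_blocking_stage(page))
```

Fixing the earliest blocking stage first makes sense, because nothing downstream can happen until it passes.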
What happens in indexing? According to interviews with Google technical folks, in addition to going into the search database, a ton of feature engineering happens at this point with our data. What sort?
- Google SafeSearch attempts to discern if our content is pornographic, and flags it as such.
- Google SafeBrowsing uses data from the render phase to flag a site as containing malware or other security risks.
- Google establishes ranking signals for localization, such as the page’s language and its geographic location so that results that are locally relevant are given some preference in applicable queries (like “coffee shop near me”).
- Other unspecified ranking signals are developed at this point and passed to the ranking engines, which are different than indexing.
What’s critical to understand is that indexing and ranking are different.
Ranking is what order pages show up in a Google search result.
Indexing is whether a page will show up at all.
Site owners should check out their indexing status in Google Search Console to understand what pages are available in search and what aren’t, based on how Google has indexed them:
How long does it take for indexing to occur? According to Google’s technical folks, it can take up to a month for a page to appear in the index and show up in Google Search Console. Remember – crawling and indexing are not the same thing! Crawling can happen in minutes. Indexing – because of all the technical stages before indexing – can take much longer.
Key action to take: Check your index coverage, and fix anything that’s within your control to fix!
Stage 6: Rank
Now we get to the part everyone in marketing is concerned about: how Google ranks pages to show up in search results. When we look for advice about this, we often find lots of contradictory information outside of Google. So the question is, what does Google have to say about it?
Lots of information outside of Google about search ranking isn’t factually correct. For example, the numeric PageRank score that was publicized heavily about 10 years ago (and is still used in crappy spam SEO messages even to this day) was described as a score between 1 and 10, where pages with a 10 score showed up best. According to Google’s Gary Illyes, PageRank was never a 1-10 score, but a 16-bit integer, which allows 65,536 distinct values with a maximum of 65,535.
So, what makes a page rank? Well, here’s the funny thing: no one, including Google, knows exactly what makes a page rank because there are hundreds, possibly thousands of data points that go into its neural networks to decide page ranking – and not all of those features are explicitly declared. Some of the technical aspects we do know:
- High quality incoming links to pages (the original PageRank)
- Relevance to the query
- Mobile usability/page speed
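To make the incoming links idea concrete, here is a sketch of the classic PageRank calculation from the original published formulation, run on a tiny made-up site. The damping factor of 0.85 and the link graph are illustrative assumptions. Google's production systems use many more signals than this, so read it as the intuition only.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    `links` maps each page to the pages it links to. Every link target
    must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank across all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page site; nothing links to "orphan".
site = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(site)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {value:.3f}")
```

The page that collects the most links from other well-linked pages ends up on top, and the page nobody links to ends up at the bottom.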
What else could be in the box? This is the challenge of deep learning neural network models: we don’t actually know. What we do know is that Google has thousands of signals to choose from, but a human being isn’t making those choices. Some of the possible signals include:
- Document-level relevance: with the deployment of BERT and learning-to-rank capabilities, document relevance may be a ranking signal; a page that is high quality and contains relevant information may rank higher even if it doesn’t have many inbound links
- Text and language features: again, with BERT, the ability to identify tons of different entities and text structures could lend hundreds or even thousands of signals to Google’s neural networks
- User behaviors: with Google Analytics data from millions of websites, Google has its choice of data for user experiences – not just in search itself, but also what happens on different candidate websites. This, plus user interactions on Google.com itself, provides tons of user satisfaction signals.
- Human ratings: this is where ranking gets really murky. Google has human beings individually rating a small sample of websites based on their search quality rating guidelines for what makes a highly effective search experience. This 175-page guide is intended as a manual for the humans to help them rate websites and help Google build a training library for its algorithms.
Why are human ratings so murky? Because of the way AI works. Here’s an example of how raters are asked to guide and rate pages:
You can see that it’s a simple sliding scale, which is used as input for machine learning. These ratings provide a neural network with outcomes to look for in what’s probably a semi-supervised learning environment – lots of high-quality data inputs combined with these known outcomes. What happens behind the scenes is that the neural network attempts to build a model out of the complex interactions and then sees which of the many different techniques it uses gets closest to the outcomes provided. The inputs to that model are the hundreds or even thousands of data points generated by the different processes along the way in the crawling and indexing stages.
Here’s why this is murky: the nature of neural networks means we – and Google – don’t necessarily know which variables, alone or in combination, raw or feature-engineered, are used to make up that model of a high quality search result. It’d be like trying to deconstruct a meal that’s already been cooked. You can sort of tell some of the ingredients, but plenty of the process – how fast it was cooked, at what temperature, in what kind of pan, on what kind of stove – is all opaque to the person eating the meal.
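To show the general idea of learning from rated outcomes, here is a deliberately tiny sketch. It fits a plain linear model (not a neural network) to made-up rater scores using gradient descent. The three feature names, the page data, and the hidden preferences that stand in for the raters are all assumptions for illustration, not anything Google has published.

```python
def score(weights, features):
    """Weighted sum of a page's features."""
    return sum(w * x for w, x in zip(weights, features))

def fit_weights(pages, ratings, learning_rate=0.5, epochs=2000):
    """Gradient descent on mean squared error between scores and ratings."""
    weights = [0.0] * len(pages[0])
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for features, rating in zip(pages, ratings):
            error = score(weights, features) - rating
            for j, x in enumerate(features):
                grads[j] += 2 * error * x / len(pages)
        weights = [w - learning_rate * g for w, g in zip(weights, grads)]
    return weights

# Hypothetical features per page: [link quality, relevance, speed], each 0 to 1.
PAGES = [
    [0.9, 0.8, 0.7],
    [0.2, 0.9, 0.5],
    [0.7, 0.3, 0.9],
    [0.1, 0.2, 0.3],
    [0.5, 0.5, 0.1],
]
# Stand-in for the human raters: preferences the model is never shown directly.
HIDDEN_PREFERENCES = [0.3, 0.5, 0.2]
RATINGS = [score(HIDDEN_PREFERENCES, page) for page in PAGES]

learned = fit_weights(PAGES, RATINGS)
print([round(w, 3) for w in learned])
```

With three features and a linear model, we can read the learned weights straight off. With thousands of features feeding a deep network there is no such readout, which is exactly the murkiness described above.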
Once ranking has been computed, that information is then distributed in what’s possibly a gigantic graph network for users to consume. You type in a Google search query, and you get the related results that provide the best experience and relevance to what you asked Google about.
Key actions to take: What does this mean for us? We can only act on the information we know:
- We know PageRank, which is based on things like inbound links, is still relevant. Thus we should keep building relevant, high-quality links.
- We know BERT looks at the contextual relevance of our content and is combined with TF-Ranking, so our content should be rich and topically relevant at the sentence, paragraph, and document levels.
- We know that technical aspects like page load, mobile friendliness, and other web vitals are or will be ranking signals, so our sites should function technically well.
- Finally, we know that the human ratings guidelines are the training data for the neural network models, which means that ideally, we should help our sites meet all of the highest quality rating guidelines to conform to what the machines have been trained to think of as the best content to show to users.
So, to recap: the process of crawling, indexing, and ranking content is composed of multiple steps and there are things marketers can and should be doing to improve their friendliness with Google’s machinery at each of the steps. While following every step won’t guarantee success, not following the steps for basic technical and content SEO will almost certainly harm you.
Appendix and Sources
Sources used in this post:
- Rendering and signal collection
- The Caffeine indexing system
- Learning to Rank with BERT
- Google Search Quality Rating Guidelines
- Scaling Graph Neural Networks with Approximate PageRank
Want to read more like this from Christopher Penn? Get updates here: