A special lesson dedicated to one of the themes that most often troubles SEOs and anyone who works on websites: the crawl budget and, more generally, the analysis of Google's crawling. The Google Search Console Training series is back with an extra appointment, an episode focused on crawling and on the new Crawl Stats report in Google Search Console, which allows you, first of all, to verify Googlebot's ability to crawl a particular site.

Google’s video on crawling and the Crawl Stats report

The lesson is entrusted, as on previous occasions, to Search Advocate Daniel Waisberg, who first provides a brief introduction to how Google crawls pages, defines some related terms such as crawl rate, crawl demand and crawl budget, and then goes on to describe the Crawl Stats report, which provides data on crawl requests, average response time, and more.

As a disclaimer, the Googler explains that these topics are more relevant to those who work on a large website, while those who run a project with just a few thousand pages do not need to worry too much (although, he says, it is “never bad to learn something new, and who knows, your site might become the next big thing”).

Crawling: what it is and how it works for Google

The crawling process starts from a list of URLs from previous crawls and from sitemaps provided by site owners: Google uses web crawlers to visit these addresses, read the information they contain, and follow the links on those pages.

The crawlers revisit the pages already on the list to see whether they have changed, and they also crawl newly discovered pages. During this process the crawlers have to make important decisions, such as prioritizing when and what to crawl, while making sure the website can handle the server requests made by Google.

Pages that are crawled successfully are processed and passed on to Google's indexing systems to prepare the content for serving in search results. Google takes great care not to overload servers, so crawl frequency depends on three factors:

  • Crawl rate: the maximum number of simultaneous connections a crawler can use to crawl a site.
  • Crawl demand: how much content is wanted by Google.
  • Crawl budget: the number of URLs that Google can and wants to crawl.

The importance of crawling

In further detail, Waisberg explains that crawl demand depends on “how much content is desired by Google” and is “influenced by URLs that have not been crawled by Google before, and by Google’s estimate of how often content changes on known URLs”.

Google periodically calculates a site's crawl rate based on the responsiveness of the site itself or, in other words, on the share of crawling traffic it can actually handle: if the site responds quickly and consistently to the crawlers, the rate goes up when there is indexing demand; if the site slows down or responds with server errors, the rate goes down and Google crawls less.

In the rare cases where Google's crawlers overload the servers, you can set a limit on the crawl rate using the settings in Search Console.

Taking crawl rate and crawl demand together, you can “define the crawl budget as the number of URLs that Google can and wants to crawl”, as we said when discussing what crawl budget means to Google.
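As a purely illustrative picture of that relationship (not Google's actual formula), you can think of the effective budget as capped by both what the server can sustain and what Google wants to fetch; the numbers below are invented for the example:

```python
def effective_crawl_budget(crawl_capacity_urls, crawl_demand_urls):
    """Toy illustration: the budget cannot exceed either what the server
    can sustain (crawl rate) or what Google wants to fetch (crawl demand)."""
    return min(crawl_capacity_urls, crawl_demand_urls)

# e.g. a fast site that could serve 50,000 URLs/day but with little fresh
# content (demand for 2,000 URLs/day) still ends up with a small budget
print(effective_crawl_budget(50_000, 2_000))  # -> 2000
```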

When Googlebot is able to crawl a site efficiently, the site can get new content indexed quickly in search results, and Google can more easily discover changes made to existing content.

What the Google Crawl Stats report is and how to use it

To find out how often Google crawls the site and what the responses were, you can use the Crawl Stats report in Google Search Console, which provides statistics on Google’s crawling behavior and helps you understand and optimize crawling.

The new version of this tool, released late last year and also announced in Google Search News back in November 2020, provides data that answers questions such as:

  • What is the general availability of the site?
  • What was the average page response time for a crawl request?
  • How many requests have been made by Google to the site in the last 90 days?

The Crawl Stats report supersedes the one in the old webmaster tools and is only available for root-level properties: site owners can find it by opening Search Console and going to the “Settings” page.

When the report is opened, a summary page appears that includes a chart of crawling trends, details on host status, and a breakdown of crawl requests.

The crawl trends chart

In particular, the crawl trends chart shows information on three metrics:

  • Total crawl requests for URLs on the site, whether successful or not. Requests for resources hosted outside the site are not counted, so if images are served from another domain (such as a CDN) they will not appear here.
  • Total download size from the site during crawling. Page resources that are shared across multiple pages and cached by Google are only requested (and counted) the first time they are fetched.
  • Average page response time for a crawl request to retrieve the page content. This metric does not include fetching page resources such as scripts, images and other linked or embedded content, and does not take page rendering time into account.

When analyzing these data, Waisberg recommends looking for “major peaks, drops and trends over time”: for example, if you notice a significant drop in total crawl requests, make sure that no one has added a new robots.txt file to the site; if the site responds slowly to Googlebot, it could be a sign that the server cannot handle all the requests, and a steady increase in the average response time is another “indicator that the servers might not handle all the load”, even if it may not immediately affect the crawl rate but rather the user experience.
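As a quick external sanity check on those two points, a short script can confirm that robots.txt still answers with 200 or 404 and give a rough idea of how fast the server responds. The following is a minimal Python sketch using only the standard library; the site URL is a placeholder, and the measured times are client-side estimates, not what Googlebot actually records in the report.

```python
import time
import urllib.error
import urllib.request

SITE = "https://www.example.com"  # placeholder: replace with your own site


def check_robots_and_latency(site, samples=5):
    """Fetch robots.txt and time a few homepage requests (rough client-side estimate)."""
    # robots.txt should answer 200 (valid file) or 404 (no file);
    # a 5xx error here would make Googlebot stop crawling the site.
    request = urllib.request.Request(
        f"{site}/robots.txt", headers={"User-Agent": "crawl-stats-check"}
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            print("robots.txt status:", response.status)
    except urllib.error.HTTPError as err:
        print("robots.txt status:", err.code)

    # Average response time over a handful of homepage requests.
    timings = []
    for _ in range(samples):
        start = time.monotonic()
        with urllib.request.urlopen(f"{site}/", timeout=10) as response:
            response.read()
        timings.append(time.monotonic() - start)
    print(f"average response time: {sum(timings) / len(timings):.3f}s")


check_robots_and_latency(SITE)
```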

Host status analysis

Host status data lets you check a site's general availability over the last 90 days. Errors in this section indicate that Google cannot crawl the site for technical reasons.

Again, there are three categories that provide host status details (a rough external approximation of the DNS and connectivity checks is sketched after the list):

  • Robots.txt fetch: the percentage of errors while fetching the robots.txt file. It is not mandatory to have a robots.txt file, but the request must return a 200 or 404 response (a valid populated or empty file, or a non-existent file); if Googlebot has a connection problem, such as a 503, it will stop crawling the site.
  • DNS resolution: indicates when the DNS server did not recognize the host name or did not respond during the crawl. In case of errors, it is suggested to contact the registrar to verify that the site is properly set up and that the server is connected to the Internet.
  • Server connectivity: shows when the server was unresponsive or did not provide a full response for the URL during a crawl. If you notice significant spikes or connectivity problems, it is suggested to talk to the provider to increase capacity or resolve availability problems.
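For a rough, external approximation of the DNS and connectivity checks, you can verify from your own machine that the hostname still resolves and that the server returns a complete response. This minimal Python sketch uses a placeholder hostname and only mirrors the idea of the report; it does not reproduce Googlebot's own measurements.

```python
import socket
import urllib.error
import urllib.request

HOSTNAME = "www.example.com"  # placeholder hostname, replace with your own

# DNS resolution: does the hostname still resolve?
try:
    addresses = {info[4][0] for info in socket.getaddrinfo(HOSTNAME, 443)}
    print("DNS ok, resolves to:", ", ".join(sorted(addresses)))
except socket.gaierror as err:
    print("DNS failure:", err)

# Server connectivity: does the server return a complete response?
try:
    with urllib.request.urlopen(f"https://{HOSTNAME}/", timeout=10) as response:
        body = response.read()
        print(f"server reachable, status {response.status}, {len(body)} bytes received")
except OSError as err:  # covers URLError, timeouts and connection errors
    print("connectivity problem:", err)
```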

A substantial error in any of these categories can result in reduced availability. Three host status values appear in the report. If Google has found at least one such error on the site in the last week, a red icon with an exclamation mark is shown. If the error is older than a week but falls within the last 90 days, a white icon with a green check mark appears, indicating that there were problems in the past (temporary or resolved in the meantime), which can be investigated through the server logs or with a developer. Finally, if there have been no substantial availability problems in the last 90 days, everything is in order and a green icon with a white check mark appears.

Googlebot’s crawl requests

The crawl request cards show several breakdowns of data that help you understand what Google's crawlers found on the site (a log-based approximation is sketched after the list). In this case, there are four breakdowns:

  • Crawl response: the responses Google received while crawling the site, grouped by type, as a percentage of all crawl responses. Common response types are 200, 301, 404 or server errors.
  • Crawled file types: shows the file types returned by the requests (the percentage value refers to the responses received for that type, not to the bytes retrieved); the most common are HTML, images, video or JavaScript.
  • Crawl purpose: shows the reason for crawling the site, such as discovery for a URL that is new to Google or refresh for a re-crawl of a known page.
  • Googlebot type: indicates the type of user agent used to make the crawl request, such as smartphone, desktop, image and others.
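A similar breakdown can be approximated from your own server logs, which is useful for cross-checking what the report shows. The sketch below is a minimal Python example that assumes an Apache/Nginx combined log format and a placeholder access.log path; it identifies Googlebot by user agent string only, which a real verification should complement with a reverse DNS lookup.

```python
import collections
import re

# Assumes an Apache/Nginx "combined" access log; adjust the pattern to your own format.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]+" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_breakdown(log_path):
    """Group requests with a Googlebot user agent by status, file type and device."""
    by_status = collections.Counter()
    by_type = collections.Counter()
    by_device = collections.Counter()
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            match = LOG_LINE.match(line)
            if not match or "Googlebot" not in match["agent"]:
                continue  # caveat: user-agent matching alone can be spoofed
            by_status[match["status"]] += 1
            path = match["path"].split("?", 1)[0]
            by_type[path.rsplit(".", 1)[-1].lower() if "." in path else "html"] += 1
            # very rough split between the smartphone and desktop crawlers
            by_device["smartphone" if "Mobile" in match["agent"] else "desktop"] += 1
    return by_status, by_type, by_device


# "access.log" is a placeholder path for your own server log
statuses, types, devices = googlebot_breakdown("access.log")
print(statuses.most_common(), types.most_common(5), devices.most_common(), sep="\n")
```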

Conclusions and takeaways on crawling

Before concluding, Waisberg summarizes the main information provided in the video.

The crawl budget is the number of URLs that Google can and wants to crawl on a website every day, and it is a parameter that is “relevant for large websites, because Google needs to prioritize what to crawl first, how much to crawl and how frequently to re-crawl”.

To understand and optimize Google's crawling you can use the Crawl Stats report in Google Search Console, starting with the summary page chart to analyze crawl volume and trends, continuing with the host status details to check the general availability of the site and, finally, examining the breakdown of crawl requests to understand what Googlebot finds when it crawls the site.

These are the basics of using the Crawl Stats report to ensure that Googlebot can crawl the site efficiently for Search, to carry out the necessary crawl budget optimization work and, more generally, the interventions needed to make the site stand out in Search.
