Crawl Speed and Frequency

Learning Objective

After going through this module, you should understand what crawl budget is, how it’s allocated to a website and how to optimize for it.

Video Duration

23:27

Answer Quiz

Take current module quiz

Materials

Ready-to-use templates

Resources

Reports and Resources

2.6.1 What Is a Crawl Budget?

Crawl budget is the number of pages on your website a web crawler will crawl within a given timeframe.

Each time you hit the publish button, Google needs to crawl and index the content for it to start appearing in search results. Given the scale and volume of content on the internet, crawling becomes a valuable resource that needs to be budgeted and rationed for the most efficient use.

To put it simply, it’s hard for Google to crawl and index every single page on the internet everyday. So, Google crawls each website according to its assigned budget.

2.6.2 How Is Crawl Budget Assigned to Websites?

Crawl budget is assigned to websites based on two factors — crawl limit and crawl demand.

Crawl Limit

This is a website’s capacity and/or willingness to be crawled.

Not every website is built to be crawled everyday. Crawling involves Googlebot sending requests to your website’s server that, if done too frequently, may stress the server’s capacity.

Also, not every publisher wants their site crawled continuously.

Crawl Demand

Crawl demand is a measure of how often a particular page wants to be (re)crawled. Popular pages or pages that are updated frequently need to be crawled and recrawled more frequently.

2.6.3 Why Should Publishers Care About Crawl Budget?

If Google can’t crawl and index your content, that content will simply not show up on search results.

That being said, crawl budgets are generally only a concern for medium to large publishers who have more than 10,000 pages on their website. Smaller publishers shouldn’t need to worry overly about crawl budgets.

Publishers with 10,000 or more pages on their website, however, will want to avoid Googlebot crawling pages that were not important. Exhausting your crawl budget on irrelevant or less important content means higher value pages may not be crawled.

Moreover, news publishers will want to be careful about wasted crawl budgets given that crawling is one of the three ways that Google News discovers fresh content in a timely manner. The other two are by using sitemaps and Google Publisher Center, which we’ve explored further in our Google News Sitemap and Google Publisher Center modules

2.6.4 Optimizing for Crawlers

Optimizing the frequency and speed with which Googlebot crawls your website involves monitoring a range of variables. We begin by listing the most important factors involved in optimizing crawl budget and frequency.

Monitor Content Crawl

The two most useful tactics for monitoring how your content is being crawled are analyzing log files and Google Search Console’s (GSC) crawl stats report.

1. Log File Analysis

A log file is a text document that records every activity on your website’s server. This includes all data about crawls requests, page requests, image requests, requests for javascript files and any other resource needed to run your website.

For the purposes of technical SEO, log file analysis helps determine a lot of useful information about URL crawling, including but not limited to:

Which URLs have been crawled.

Which URLs are being crawled most frequently.

Identifying if any low-value or non-essential URLs are being crawled, thereby wasting crawl budget.

How To Do This

Log file analysis is a task that requires some degree of technical familiarity with a website’s backend. For this reason, we recommend using log file analyzer software. There are several free and paid log analysis tools available such as Graylog, Loggly, Elastic Stack, Screaming Frog Log Analyzer and Nagios to name a few.

If you are an experienced developer or system administrator, you can manually perform a log file analysis as well.

To do this, follow these steps:

Log into the cPanel of your hosting provider

Navigate to Merics and then Raw Access

Download the log file

Once you have downloaded the log file, you can change the extension to .csv and open it using Microsoft Excel or Google Sheets. As we said, however, this approach requires a certain level of expertise to make sense of the log file.

You can also access the log file using an FTP client by entering the path of the log file. A typical log file path looks something like this:

Server name(for instance, Apache)/var/log/access.log

However, it is much more convenient to use a log analysis tool instead. Once you’ve uploaded the log file into the tool, you can sort the data using several filters. For example, you’ll be able to see which URLs have been accessed most frequently by Googlebot.

You’ll also be able to see if Googlebot has been accessing non-essential or low-value URLs such as faceted navigation URLs, duplicate URLs, etc. Identifying these is important as they are wasting your crawl budget.

Look at the below screenshot, taken from Screaming Frog’s SEO Log File Analyser, to see what we mean.

2. GSC Crawl Stats Report

GSC provides website owners with comprehensive data and insights about how Google crawls their content. This includes detailed reports on:

Which files/pages were crawled and for what purpose.

Which type of Googlebot (news, videos, mobile, etc) undertook the crawling.

Total number of crawl requests, etc.

GSC also makes available easy to understand graphs and charts to provide website owners with even more information. The screenshot below is what a typical crawl stats report on GSC looks like.

Source

The GSC also lets you know if there are any issues with crawling. It checks for several errors and assigns each a code. The most common errors that GSC checks for include:

Server error [5xx]

Redirect error

Submitted URL blocked by robots.txt

Submitted URL marked ‘noindex’

Submitted URL seems to be a soft 404

Submitted URL returned 401

Submitted URL not found (404)

Submitted URL returned 403

Submitted URL blocked due to other 4xx issue

The GSC report also shows how many pages have been affected by each error alongside the validation status.

How To Do This

Here’s how you can access the GSC crawl stats report for your website or web page:

To use GSC, you need to sign up.

Next, you’ll need to prove you own the website you wish to add to GSC. This can be done by adding what Google calls a “property”. A property is anything that needs to be examined in GSC. It could be a single page or an entire website.

To verify site ownership, select open GSC, navigate to Property Selector and click “add new property” from the property selector.

You will then see a list of verification methods that will allow you to verify your ownership of the site/page.

These include:

HTML File Upload
HTML Tag
Google Analytics Tracking Code
Google Tag Manager
Domain Name Provider
Google Sites, Blogger or Domain Account.

Select your verification method. Once your property has been verified, it will appear on GSC and you can view detailed analytics for its crawl budget and activity.

Optimize Crawl Budget

We now know that crawl budget is a valuable resource whose use must be optimized for best results. Here are a few techniques to do this:

1. Fix Duplicate Content

Duplicate content may end up being crawled separately, leading to a wastage of crawl budget. To avoid this from happening, either consolidate duplicate pages on your website into one, or delete duplicate pages.

2. Use Robots.txt

Robots.txt is a file that serves a number of purposes, one of which is to tell Googlebot not to crawl certain pages or sections of pages. This is an important strategy that can be used to prevent Googlebot from crawling low-value content or content that doesn’t need crawling.

Here are a few best practices when using robots.txt to optimize crawl budget:

Assign priority to parameters that need to be blocked from crawling in your order of preference.

Specify robots, directives and parameters that are causing additional crawl using log files.

Block common paths that CMS’ typically have such as 404, admin, login pages, etc.

How To Do This

Creating and executing a robots.txt file to restrict Googlebot access requires some coding knowledge. Here are the steps involved:

A robots.txt file is created using a text editor such as a notepad. The file needs to be named robots.txt for Googlebot to recognize it.

Once a file has been created, we add rules to it, which is to say that we specify a set of instructions telling the file to block access to certain agents under specific conditions.

A typical robots.txt file will have the following elements:

A user-agent, such as Googlebot, to which access is to be denied
A disallow or allow entry, specifying a file or directory to which access is to be restricted for the user-agent
A sitemap to tell Google which content it should crawl.

Below is what a simple robots.txt file looks like.

Source

This code means that a user agent — Googlebot in this instance — is not allowed to crawl any URL that begins with “http://www.example.com/nogooglebot/”.

Once your file has been created and saved on your local machine, it needs to be uploaded to your website. Doing this depends on your server and your hosting provider. You will need to contact your provider to ascertain the exact steps for this.

We suggest seeking expert help if you don’t feel comfortable creating and uploading robots.txt files yourself.

3. Segment XML Sitemaps

A crawl bot arrives at a site with a general allocation of how many pages it will crawl. The XML sitemap effectively directs the bot to read selected URLs, ensuring the effective use of that budget.

Note that a page’s ranking performance depends on several factors including content quality and internal/external links. Consider including only top-tier pages in the map. Images can be allocated their own XML sitemap.

Follow these recommendations to ensure optimal XML sitemap implementation:

Reference the XML sitemap from the robots.txt file.

Create multiple sitemaps for a very large site. Don’t add more than 50,000 URLs to a single XML sitemap.

Keep it clean and only include indexable pages.

Keep the XML sitemap up-to-date.

Keep the file size to less than 50MB.

For a more detailed look at sitemaps, refer to our dedicated module on this topic.

4. Implement an Efficient Internal Linking Strategy

Internal links perform three important functions:

Organizing content around themes, which helps build topical authority.

Spreading page equity between high-value pages to other pages.

Helping users and web crawlers more easily navigate the site.

Thus for efficient crawling, it is important to implement an efficient internal linking strategy. For more on internal linking, refer to our detailed course module here.

5. Upgrade Hosting

If a website runs on a shared hosting platform, crawl budget will be shared with other websites running on said platform. A large publisher may find independent hosting to be a valuable alternative.

Prior to upgrading your hosting to resolve bot traffic overload, there are some factors worth considering that might impact server loads otherwise.

Process images using a separate content distribution network (CDN) that is also optimized to host next gen image formats such as webp.

Consider hosting CPU, disk space based on your website function and requirements.

Monitor activity using solutions like New Relic to monitor excess usage of plugins and bots.

For more on the advantages of CDNs, check out our page experience module.

6. Use Javascript Judiciously

When Googlebot lands on a web page it renders all the assets on said page, including Javascript. While crawling HTML is rather straightforward, Googlebot must process Javascript a number of times in order to be able to render it and understand its content.

This can quickly drain Google’s crawl budget for a website. The solution is to implement Javascript rendering on the server side.

How To Do This

Addressing Javascript in your website’s source code requires coding expertise and we recommend consulting a web developer if you plan to make any such changes. That said, here are a few guidelines on what to look for when trying to optimize the use of Javascript.

Avoid sending Javascript assets to the client for rendering so that crawl bots don’t expend their resources and can work more efficiently

Use browser-level lazy loading instead of it being Javascript based.

Use server side tagging for analytics and third-party tagging, either self-hosted or using solutions such as stape.io.

7. Monitor CWVs

CWVs are a measure of page performance that directly affects how your page performs in search rankings.

The GSC’s CWV report groups URL performance under three categories:

Metric type ( LCP, FID and CLS)

Status

URL groups

CWVs can also impact your crawl budget. For example, slow loading pages can eat up into your crawl budget as Google has a limited amount of time for crawling tasks. If your pages load fast, Google can crawl more of them within the limited time it has. Similarly, too many error status reports can slow crawling down and waste your crawl budget.

For a more thorough examination of CWVs, see our module on page experience.

8. Use a Third-Party Crawler

A third-party crawler such as Semrush, Sitechecker.pro or Screaming Frog allows web developers to audit all of a site’s URLs and identify potential issues.

Third-party crawlers can be used to identify:

Broken links

Duplicate content

Missing page titles

These programs offer a crawl stats report to help highlight problems that Google’s own tools may not.

Improving structured data and cutting down on hygiene issues will streamline Googlebot’s job of crawling and indexing a site.

We recommend the following best practices when using third-party crawlers:

Emulate Googlebot, via search crawl settings, to prevent being blocked from hosting providers and to properly identify and fix all technical issues.

Debug missing pages from a crawl using this great guide from Screaming Frog.

9. Monitor URL Parameters

URL parameters — the section of the web address that follows the “?” — are used on a page for a variety of reasons, including filtering, pagination and searching.

While this can boost the user experience, it can also cause crawling issues when both the base URL and one with parameters return the same content. An example of this would be “http://mysite.com” and “http://mysite.com?id=3” returning the exact same page.

Parameters allow a site to have a near unlimited number of links — such as when a user can select days, months and years on a calendar. If the bot is allowed to crawl these pages, the crawl budget will be used up needlessly.

This can especially be an issue for concern if your website uses faceted navigation or session identifiers that can spawn multiple duplicate pages which, if crawled, could lead to a wastage of crawl budget.

Duplicate URLs can also result if you have localized versions of your webpage in different languages, and the content on these pages has not been translated.

We recommend the following to address this:

Use robots.txt to disallow Googlebot from crawling duplicate pages.

Use the <hreflang> tag to specify the content’s language variations. The hreflang tab tells the crawler that the page is a regional variation of the main content, thus preventing the crawler from registering it as a duplicate in case it hasn’t been translated yet.

Here’s how a simple <hreflang> looks like in your source code:

<link rel=”alternate” hreflang=”es-mx” href=”https://examplesite.com/news/hreflang-tags”/>

This tells the crawler that the specified URL is a Spanish (Mexican) variant of the main URL, and it should not be treated as a duplicate.

2.6.5 Nice to Have

We’ve discussed the essentials of crawl budget management. The pointers listed in this section, though not critical to healthy crawl budget management, go a long way toward supplementing the techniques discussed previously.

Handling Crawling Emergencies

A crawling emergency occurs when Googlebot overwhelms your website with more crawl requests than it can handle. It’s important to identify the issue as quickly as possible, which can be done by closely monitoring server logs and crawl stats in Google Search Console.

If a sudden surge in crawling is not managed in time, it could cause the server to slow down. Server slowdown would increase the average response time for crawlers and, as a result of this high response time, search engines will automatically reduce their crawl rate. This is problematic because reduced crawl rates will lead to a loss in visibility, with new articles not being crawled immediately.

If you notice over crawling is taxing your servers, here a few things you can do:

1. Limit the Crawl Rate

Google has sophisticated algorithms that control the crawl rate. So ideally, one should not tamper with the crawl rate. However, in an emergency situation, you can log into your GSC account and navigate to Crawl Rate Settings for your property.

If you see the crawl rate there as Calculated As Optimal, you won’t be able to change it manually. A special request needs to be filed with Google to change the crawl rate.

If this isn’t the case, you can simply change the crawl rate yourself to your desired value. This value will stay valid for 90 days.

If you do not wish to tamper with crawl rates in the GSC, you can also block access to the page by Googlebot using robots.txt. The procedure to do this has been explained previously.

2. Check the Site’s Crawl Rate

It can take Google up to three days to crawl most sites. The only exceptions are news sites or other sites that publish time-sensitive content which may be crawled daily.

To check how frequently your pages are being crawled, monitor your site log. If you still feel your content is not being crawled as frequently as it should be, follow these steps:

Submit your updated news sitemap to Google. A news sitemap is a sitemap created specifically from Google news.

If you’ve made any changes to your news sitemap, use the ping tool to inform Google of the same, This can be done by sending a GET request from either your command line or your browser to the following address:

https://www.google.com/ping?sitemap=FULL_URL_OF_SITEMAP

Use the <lastmod> tag in sitemaps to show when an indexed URL was last updated or modified.

3. Return a 503 or 429 Error Code

Please note: This step should be viewed as the last action anyone should take as it carries a certain degree of risk. If Googlebot sees 503 and 429 errors then it will start to crawl slower and may stop the crawling, leading to a temporary drop in the number of indexed pages.

A 503 error code means that the server is temporarily down, while 429 means that a user has sent too many requests in a specific amount of time. These codes let Googlebot know that the problem is temporary, and it should return to crawl the page at a later time.

Although a seemingly minor step, this is important because if Googlebot doesn’t know the nature of the problem a web page is experiencing, it assumes the problem is of a long-term nature and may mark the page down as being unresponsive, which can affect SEO.

Creating 503 error codes is done through a PHP file, which is inserted within the existing HTML source code of your page along with an error message. You’ll also need to write a few additional lines of HTML code mentioning when the site is expected to return.

This is what the code for a 503 redirect looks like:

Source

Doing 503 or 429 redirects requires advanced HTML coding skills and we suggest consulting with your web developer before attempting this.

2.6.6 Avoid These Common Pitfalls

We now have a good understanding of what a crawl budget is and how to optimize it. However, knowing what not to do when it comes to crawl budgets is equally important.

Here are some common pitfalls to avoid to ensure you get the most out of your website’s crawl budget:

Trying to Increase Crawl Rate Without a Good Reason

The frequency with which Google crawls your website is determined by its algorithms, which take into account several signals to arrive at an optimal crawl frequency.

Increasing the crawl rate does not necessarily lead to better positions in search results. Crawling frequency or even crawling itself is not a ranking factor in and of itself.

Repeatedly Submitting Old Content for Crawling

Google does not necessarily prefer fresher content over older content. Google ranks pages based on the relevance and quality of the content irrespective of whether it is old or new. So, it is not necessary to keep having them crawled.

Using Crawl-Delay Directive to Control Googlebot

The crawl-delay directive does not help to control Googlebot. If you wish to slow down crawling frequency in response to excessive crawling that is overwhelming your website, refer to the instructions provided in the section above.

Having a Slow Loading Website

Your website’s loading speed can affect your crawl budget. A fast loading page means Google can access more information over the same number of connections.

For tips on loading speed optimization, check out our module on page experience.

Using Only Nofollow Links to Block Crawlers

Nofollow links may still end up affecting your crawl budget as these may still end up being crawled. On the other hand, links that robots.txt have disallowed have no effect on crawl budget.

Also, alternate URLs and Javascript content may end up being crawled, consuming your crawl budget, so it’s important to restrict access to them by either removing them or by using robots.txt.

2.6.7 Actions and Takeaways

Crawl budget is a valuable resource and it’s critical that you optimize for it. Crawling and indexing issues can affect the performance of your content, especially if your website has a large number of pages.

The two most fundamental operations involved in optimizing crawl budget are keeping your sitemap updated and regularly monitoring indexing issues from the GSC crawl stats report and log files.

It’s important to learn how to apply crawl management best practices both during the rollout of new website features and also when one-off errors happen.

Previous Module

Back to Chapter

Next Module

Active now

6

Crawl Speed and Frequency

1

Design and Layout