Google has provided a recommendation for website owners aiming to optimise their site’s performance in search results: host resources, such as images, JavaScript, and CSS files, on content delivery networks (CDNs) or subdomains. This strategic move can help conserve your main website’s crawl budget, allowing Googlebot to focus on crawling and indexing essential pages more efficiently. By doing so, website owners can improve their site’s visibility and ensure critical content is prioritised during crawling.

A noteworthy point shared by Google is the way Googlebot handles caching. Resources fetched by Googlebot are cached for up to 30 days, regardless of the HTTP cache settings configured on your server. Because of this, moving resources to CDNs or subdomains does not make them any less accessible to Google; it simply reduces the likelihood of Googlebot repeatedly crawling unchanged resources, saving the crawl budget for the main site’s primary content.

Google also highlighted the risks of blocking essential resources in the robots.txt file. Many website owners attempt to optimise their site by preventing Googlebot from crawling certain files, such as CSS or JavaScript. However, Google has warned that this approach can backfire, as these resources are often critical for rendering pages correctly. Without access to them, Googlebot may struggle to understand a site’s content, layout, and user experience, potentially harming the site’s rankings in search results.

For websites with extensive content or those frequently hitting crawl budget limits, offloading resources to CDNs can be particularly advantageous. CDNs not only reduce the load on the main website but also ensure faster delivery of resources to users and search engines. Similarly, hosting resources on a subdomain keeps them accessible while separating their crawl demand from the primary site.
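
To make this concrete, here is a minimal Python sketch, using only the standard library and entirely hypothetical URLs, that tallies a page’s resource URLs by hostname so you can see how much of its crawl demand falls on the main site versus a CDN or static subdomain.

```python
# Rough sketch: tally resource URLs by hostname to see how crawl demand is
# split between the main site and a CDN or subdomain. All URLs are hypothetical.
from collections import Counter
from urllib.parse import urlparse

resource_urls = [
    "https://www.example.com/blog/post",          # the HTML page itself
    "https://static.example.com/css/site.css",    # stylesheet on a subdomain
    "https://static.example.com/js/app.js",       # script on a subdomain
    "https://cdn.example-cdn.net/img/hero.jpg",   # image on a third-party CDN
]

hosts = Counter(urlparse(url).netloc for url in resource_urls)
for host, count in hosts.most_common():
    print(f"{host}: {count} request(s)")
```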

This practice aligns with SEO best practices by ensuring that Googlebot focuses on the pages that matter most. It also supports a more efficient crawling process, which is increasingly important for large or complex websites. Furthermore, the use of CDNs and subdomains can improve site speed and user experience, which are vital factors for both search engine rankings and user satisfaction.

In conclusion, Google’s advice underscores the importance of managing your crawl budget wisely. By hosting non-essential resources on CDNs or subdomains, avoiding over-restriction through robots.txt, and leveraging Googlebot’s caching behaviour, you can enhance your site’s SEO performance while maintaining a seamless user experience. Adopting these strategies can help ensure your website remains competitive and visible in the ever-evolving digital landscape.

Google Search Central has launched an exciting new series titled “Crawling December,” which is set to offer in-depth insights into the way Googlebot crawls and indexes webpages. This series is designed to provide webmasters, SEO professionals, and website owners with valuable information that can help optimise their sites for better indexing and ranking in Google’s search results. The series will explore aspects of the crawling process that are often overlooked but can significantly impact how effectively a website is crawled.

Throughout the month of December, Google will publish a fresh article each week, with each post focusing on different, less-discussed elements of the crawling process. While crawling might seem straightforward, there are many behind-the-scenes technicalities that can affect how Googlebot interacts with your website. By shedding light on these areas, Google hopes to provide users with the knowledge they need to fine-tune their websites and improve their visibility on the search engine.

The first post in the series sets the stage by covering the basics of crawling. It explains how Googlebot goes about visiting and indexing webpages, which is essential for SEO success. One of the main topics covered is how Googlebot handles page resources, such as scripts, images, and stylesheets, which are crucial for rendering and understanding a webpage. Understanding how these resources are handled is key to ensuring that Googlebot can effectively crawl and index a site’s content.

Additionally, the article explores the concept of crawl budget, an important yet often misunderstood topic. Crawl budget refers to the number of URLs on a site that Googlebot can and wants to crawl within a given time frame. Managing your crawl budget efficiently ensures that Googlebot focuses on your most important pages and avoids wasting time on less relevant or duplicate content. This is particularly important for larger websites with many pages, where optimising crawl budget can make a significant difference in how quickly and efficiently pages are indexed.

With this series, Google is aiming to empower website owners to make more informed decisions about their site’s structure and crawling strategy, ultimately helping them improve their site’s indexing and visibility on Google Search.

 

Crawling Basics

Modern websites are increasingly complex, primarily due to the use of advanced JavaScript and CSS, which make them harder to crawl compared to the simpler, HTML-only pages of the past. This complexity can present challenges for search engines, but Googlebot, which functions similarly to a web browser, is equipped to handle it—albeit on a different schedule than regular browsing.

When Googlebot visits a webpage, the first thing it does is download the HTML from the main URL. This HTML typically includes links to other essential resources such as JavaScript, CSS, images, and videos that are integral to displaying the full content of the page. At this point, Googlebot has only accessed the core structure of the page, but not the final version seen by users.
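
As a rough illustration of this first step, the Python sketch below (standard library only, with a hypothetical page URL) downloads a page’s initial HTML and lists the scripts, stylesheets and images it references, which are the additional resources fetched later during rendering.

```python
# Minimal sketch: download the initial HTML of a page and list the linked
# resources (scripts, stylesheets, images) that are fetched separately during
# rendering. The page URL is hypothetical.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ResourceLinkParser(HTMLParser):
    """Collects src/href attributes for scripts, stylesheets and images."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            self.resources.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and "stylesheet" in (attrs.get("rel") or "") and attrs.get("href"):
            self.resources.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.resources.append(urljoin(self.base_url, attrs["src"]))


page_url = "https://www.example.com/"  # hypothetical page
html = urlopen(page_url).read().decode("utf-8", errors="replace")

parser = ResourceLinkParser(page_url)
parser.feed(html)

print(f"Resources referenced by the initial HTML of {page_url}:")
for resource in parser.resources:
    print(" -", resource)
```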

Following the initial download, the next step involves Google’s Web Rendering Service (WRS), which works with Googlebot to fetch these linked resources. The WRS enables Googlebot to process these additional components, much like a web browser would, to render the page fully. It’s this rendering process that allows Googlebot to see the webpage as users would, considering elements like styling and interactivity provided by JavaScript.

Once all the resources have been fetched and processed, the final page construction takes place. This results in Googlebot having a fully rendered view of the page, which is then indexed for search purposes. Understanding this multi-step process is crucial for optimising websites, especially when ensuring that all elements of a site are properly crawled and indexed by search engines.

 

Crawl Budget Management

Crawling the additional resources a page references, on top of its main content, consumes part of a site’s crawl budget. This is an important factor for website owners to consider when managing how search engines like Google crawl their sites.

To help mitigate this issue, Google’s Web Rendering Service (WRS) attempts to cache every resource, including JavaScript and CSS, that is used in the pages it renders. By doing this, the system ensures that the same resources don’t need to be fetched again during subsequent crawls, thereby conserving valuable crawl budget.

It is also essential to note that the WRS cache remains in place for up to 30 days. This caching period is fixed and is not influenced by any HTTP caching rules set by the developers of the site. As a result, the WRS can continue to serve cached resources without needing to re-fetch them, which ultimately helps save a site’s crawl budget for other important resources.
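
Purely as a conceptual illustration, and not a description of Google’s actual implementation, the behaviour can be pictured as a cache keyed by URL with a fixed 30-day lifetime that ignores whatever HTTP cache headers the server returns:

```python
# Conceptual illustration only -- not Google's actual code. A cache keyed by
# URL with a fixed 30-day lifetime, ignoring any HTTP cache headers returned
# by the server, roughly mirrors the behaviour described for the WRS cache.
from datetime import datetime, timedelta

CACHE_LIFETIME = timedelta(days=30)
_cache = {}  # url -> (content, fetched_at)


def fetch_resource(url, fetch_func):
    """Return a cached copy if it is younger than 30 days, otherwise re-fetch."""
    now = datetime.now()
    if url in _cache:
        content, fetched_at = _cache[url]
        if now - fetched_at < CACHE_LIFETIME:
            return content  # served from cache: no crawl budget spent
    content = fetch_func(url)  # fetch_func is whatever actually downloads the URL
    _cache[url] = (content, now)
    return content
```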

By effectively managing this caching strategy, websites can ensure more efficient crawling, enabling Googlebot to focus its resources on indexing the most important parts of a website. This can significantly improve a site’s overall SEO performance.

 

Recommendations

This post provides valuable tips for site owners looking to optimise their crawl budget and ensure their websites are efficiently crawled by Googlebot.

Reduce Resource Use: To save crawl budget, it’s important to minimise the number of resources required to create a good user experience. By streamlining the resources needed to render a page, Googlebot can focus its efforts on more important aspects of your site.

Host Resources Separately: Hosting resources on a different hostname, such as a CDN or subdomain, can help to shift the crawl budget burden away from the main website. By doing so, the main site’s crawl budget is preserved for indexing key content, while the separate resources can be crawled independently.

Use Cache-Busting Parameters Wisely: Be cautious when using cache-busting parameters, as changing resource URLs unnecessarily can cause Googlebot to recheck them, even if the content remains the same. This results in wasted crawl budget, which could be better used elsewhere on the site.
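
One common way to avoid this, sketched below in Python with hypothetical filenames, is to derive a static file’s public name from a hash of its content, so the URL changes only when the file itself changes rather than on every release.

```python
# Sketch: derive a static file's public name from a hash of its content so the
# URL changes only when the content itself changes (instead of appending a
# timestamp or version query parameter on every release).
import hashlib


def hashed_filename(name: str, content: bytes) -> str:
    """Return e.g. 'app.3fa94c1b.js' for 'app.js', based on the file's bytes."""
    stem, _, suffix = name.rpartition(".")
    digest = hashlib.sha256(content).hexdigest()[:8]
    return f"{stem}.{digest}.{suffix}"


# Hypothetical usage: the same bytes always map to the same URL, so Googlebot
# has no reason to re-fetch an unchanged file.
print(hashed_filename("app.js", b"console.log('hello');"))
```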

Additionally, Google warns against blocking resource crawling through robots.txt. While it might seem like a good idea to prevent crawling of specific resources, it can be risky. If Googlebot is unable to access crucial resources needed for rendering a page, it may have difficulty understanding the page’s content, which can negatively affect how it is indexed and ranked.
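
If you want to check whether your robots.txt accidentally blocks rendering-critical files, a small script along the lines of the sketch below can help; it uses Python’s standard urllib.robotparser, and all URLs shown are hypothetical.

```python
# Sketch: check whether robots.txt allows Googlebot to fetch the CSS and
# JavaScript files a page needs for rendering. All URLs are hypothetical.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://www.example.com/robots.txt")
robots.read()

critical_resources = [
    "https://www.example.com/assets/css/site.css",
    "https://www.example.com/assets/js/app.js",
]

for url in critical_resources:
    allowed = robots.can_fetch("Googlebot", url)
    status = "allowed" if allowed else "BLOCKED - may harm rendering"
    print(f"{status}: {url}")
```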

By following these best practices, site owners can help ensure that Googlebot crawls and indexes their site more efficiently, ultimately improving SEO performance.

 

Monitoring Tools

The Search Central team recommends reviewing your site’s raw access logs as the best way to monitor which resources Googlebot is crawling. These logs provide detailed information on the resources Googlebot is attempting to access while crawling your site.
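
As a starting point, a log-parsing sketch like the one below, which assumes a combined-format Apache or Nginx log at a hypothetical path, tallies the file types requested by clients identifying themselves as Googlebot; bear in mind that user-agent strings can be spoofed, which is why the IP check in the next paragraph still matters.

```python
# Sketch: summarise which file types requests claiming to be Googlebot are
# hitting, from a combined-format access log. The log path is hypothetical,
# and user-agent strings can be spoofed, so verify IPs as well.
from collections import Counter
from pathlib import Path
from urllib.parse import urlparse

LOG_FILE = Path("/var/log/nginx/access.log")  # hypothetical path

extensions = Counter()
for line in LOG_FILE.read_text(errors="replace").splitlines():
    if "Googlebot" not in line:
        continue
    try:
        # Combined log format: ... "GET /path HTTP/1.1" ...
        request = line.split('"')[1]           # e.g. 'GET /css/site.css HTTP/1.1'
        path = urlparse(request.split()[1]).path
    except IndexError:
        continue
    ext = path.rsplit(".", 1)[-1].lower() if "." in path else "(no extension)"
    extensions[ext] += 1

for ext, count in extensions.most_common():
    print(f"{ext:>16}: {count}")
```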

To identify Googlebot specifically, you can use its IP address. Google provides a list of the IP address ranges used by Googlebot in its developer documentation. By cross-referencing these ranges with your access logs, you can pinpoint exactly which resources Googlebot is crawling, helping you to better understand how Googlebot interacts with your site.
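
The Python sketch below shows the general idea using the standard ipaddress module; the JSON URL is the location documented at the time of writing and should be confirmed against Google’s current developer documentation, and the sample IP is purely illustrative.

```python
# Sketch: check whether an IP address from your access logs falls inside
# Google's published Googlebot ranges. Confirm the JSON location and structure
# against Google's developer documentation; the sample IP is illustrative.
import ipaddress
import json
from urllib.request import urlopen

# Location documented by Google at the time of writing -- verify before use.
RANGES_URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

with urlopen(RANGES_URL) as response:
    data = json.load(response)

networks = []
for prefix in data.get("prefixes", []):
    cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
    if cidr:
        networks.append(ipaddress.ip_network(cidr))


def is_googlebot_ip(ip_string):
    """True if the IP sits inside one of the published Googlebot ranges."""
    ip = ipaddress.ip_address(ip_string)
    return any(ip.version == net.version and ip in net for net in networks)


print(is_googlebot_ip("66.249.66.1"))  # sample IP taken from a log line
```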

 

Why This Matters

This post clarifies three key points that directly impact how Google discovers and processes your site’s content.

Firstly, the management of resources is crucial in preserving your crawl budget. Hosting scripts and styles on CDNs can help save this valuable resource, ensuring that Googlebot spends less time on unnecessary tasks.

Secondly, Google caches resources for 30 days, irrespective of your HTTP cache settings. This caching mechanism helps to conserve crawl budget by reducing the need for repeated downloads of the same resources.

Lastly, blocking critical resources via robots.txt can backfire. If Google is unable to access essential resources, it may struggle to render your pages properly, which can negatively affect your site’s performance in search results.

Understanding these mechanics is essential for SEOs and developers alike. By making informed decisions about resource hosting and accessibility, they can optimise how well Googlebot crawls and indexes their websites.

 

More Digital Marketing BLOGS here: 

Local SEO 2024 – How To Get More Local Business Calls

3 Strategies To Grow Your Business

Is Google Effective for Lead Generation?

What is SEO and How It Works?

How To Get More Customers On Facebook Without Spending Money

How Do I Get Clients Fast On Facebook?

How Do I Retarget Customers?

How Do You Use Retargeting In Marketing?

How To Get Clients From Facebook Groups

What Is The Best Way To Generate Leads On Facebook?

How Do I Get Leads From A Facebook Group?

How To Generate Leads On Facebook For FREE

How Do I Choose A Good SEO Agency?

How Much Should I Pay For Local SEO?
