Crawl Budget: What It Is, Why It Matters, and How to Stop Wasting It

by Francis Rozange | Mar 27, 2026 | SEO

Category: SEO | Reading time: 18 minutes | Last updated: March 2026

Crawl budget is one of those SEO concepts that generates an enormous amount of discussion, most of it unnecessary for the vast majority of websites. John Mueller from Google has been direct about this: “IMO crawl budget is overrated. Most sites never need to worry about this.” Gary Illyes, who wrote Google’s original blog post on crawl budget back in 2017, confirmed in 2025 that the one-million-page threshold where crawl budget starts to matter has not changed. If your site has fewer than 100,000 pages, Mueller has stated that this “is usually not enough to affect crawl budget.” But for large e-commerce stores, sites with faceted navigation, portals with parameterized URLs, or sites with a history of migrations, crawl budget can be the invisible bottleneck that explains why your pages are not appearing in Google despite having good content. This article explains what crawl budget actually is according to Google’s own team, when it genuinely matters, and how to stop wasting it.

What Crawl Budget Actually Is (According to Google)

The concept of crawl budget was not created inside Google. It was coined by the SEO community, and Google eventually adopted a definition to match the external conversation. Gary Illyes explained on the Search Off the Record podcast: “For the longest time we were saying that we don’t have the concept of crawl budget. And it was true. We didn’t have something that could mean crawl budget on its own. And then, because people were talking about it, we tried to come up with something.” The definition Google settled on is “the number of URLs Googlebot can and is willing to crawl for a given site.” This depends on two factors working together: crawl rate limit and crawl demand. The crawl rate limit is the maximum number of simultaneous connections Googlebot will make to your server without causing problems. If your server responds quickly, Googlebot can crawl more pages in the same time window. If your server is slow or returns errors, Googlebot reduces its crawl rate to avoid overloading the server. The crawl demand is how much Google wants to crawl your site based on factors like the popularity and freshness of your content, how often pages change, and the perceived importance of your URLs. A news site that publishes dozens of articles daily will have much higher crawl demand than a corporate brochure site that changes once a quarter.

When Crawl Budget Actually Matters

The Thresholds Google Has Confirmed

According to Google’s official documentation, crawl budget is primarily relevant in two scenarios: sites with more than one million unique pages updated weekly or more frequently, and sites with more than 10,000 pages that change daily. Gary Illyes confirmed in 2025 that the one-million-page threshold has not changed since 2020. Mueller separately confirmed that 100,000 URLs is “usually not enough to affect crawl budget.” For 99 percent of business websites, problems that look like crawl budget issues are actually content quality problems, internal linking problems, or server speed problems. If your site has 5,000 pages and some content is not being indexed, do not look at crawl budget first. Check whether that content has genuine user value, whether it is properly linked from authoritative pages, and whether it has technical issues like noindex tags, canonical tags pointing elsewhere, or slow server responses.

Signs You Actually Have a Crawl Budget Problem

The most reliable indicator of a genuine crawl budget problem is a growing number of pages in Google Search Console’s Coverage report under “Discovered, currently not indexed.” This status means Google has found the URL (through your sitemap or internal links) but has decided not to crawl it yet. If that number is large and growing, you may have a crawling problem worth investigating. Other signs include new pages taking weeks or months to appear in Google’s index despite being in your sitemap and properly linked, important pages whose search snippets still reflect old content (a sign of infrequent recrawling, now that Google no longer offers a public cache to check), and server logs showing Googlebot spending most of its time on low-value URLs rather than your important content. Check your crawl stats in Google Search Console under Settings, then Crawl stats. This shows you how many pages Google is crawling per day, your average response time, and the distribution of crawl activity across your site. If your average server response time is above 500ms, fixing server speed should be your first priority because it directly increases the number of pages Googlebot can crawl within its time budget.

The Top Four Crawl Budget Killers (According to Google)

In the Search Off the Record podcast’s “2025 Wrapped” episode, Gary Illyes and Martin Splitt reviewed Google’s internal data on the most common crawl issues. Nearly 85 percent of major crawl issues stem from structural traps that waste Googlebot’s resources on useless URLs.

Faceted Navigation (50% of All Crawl Issues)

Faceted navigation, the filtering and sorting options on e-commerce category pages, is responsible for half of all crawl budget waste reported to Google’s team. When a category page lets users filter by color, size, price, brand, material, and sort by popularity, price, or rating, each combination creates a unique URL. A single category with 10 colors, 8 sizes, 5 brands, and 3 sort options can generate thousands of URL variations, each showing essentially the same products in a slightly different arrangement. Googlebot attempts to crawl each of these URLs, burning through crawl budget on near-duplicate pages that add no unique value. The fix is to block faceted URLs from crawling using robots.txt disallow rules for the parameter patterns (like Disallow: /*?color= or Disallow: /*?sort=), and to use canonical tags pointing all faceted variations to the clean category URL. For WooCommerce sites, this requires careful configuration because WooCommerce generates filterable URLs by default without proper crawl controls.
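As a sketch, the disallow rules for a setup like that could look like the block below. The parameter names (color, size, sort) are only examples from this scenario; check your own faceted URLs for the names your platform actually generates. Google’s wildcard matching is literal, so a parameter that appears after another one in the query string needs an ampersand-prefixed rule as well.

```
# Faceted navigation parameters (example names; match them to your own URLs)
User-agent: *
# parameter appears first in the query string
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?sort=
# parameter appears after another parameter
Disallow: /*&color=
Disallow: /*&size=
Disallow: /*&sort=
```

One caveat worth keeping in mind: once a URL pattern is blocked in robots.txt, Googlebot cannot fetch those pages to read their canonical tags, so many sites choose one approach or the other per pattern rather than relying on both at once.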

Action Parameters (25% of All Crawl Issues)

Action parameters are URL parameters generated by user actions that Googlebot should never crawl: add-to-cart URLs, wishlist URLs, comparison URLs, and other transactional parameters. Illyes joked that “the things that Googlebot tends not to do is shop around on the internet. It will not buy your weirdo hoodie.” Yet every millisecond Googlebot spends crawling an add-to-cart URL is budget wasted that could have been spent indexing a product page or blog post. Block these URLs in robots.txt with rules like Disallow: /*?add-to-cart= and Disallow: /*?wishlist=. Most e-commerce platforms generate these parameters by default, and most sites never think to block them because they do not appear in the site’s visible navigation.
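As a hedged example, the rules below combine the WooCommerce default add-to-cart parameter with placeholder names for wishlist and comparison parameters; confirm the real names in your server logs before blocking anything.

```
# Action parameters: ?add-to-cart= is the WooCommerce default,
# the wishlist and compare names are placeholders for whatever your plugins generate
User-agent: *
Disallow: /*?add-to-cart=
Disallow: /*&add-to-cart=
Disallow: /*?wishlist=
Disallow: /*?compare=
```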

Session IDs (10% of All Crawl Issues)

Despite being an outdated practice, session IDs appended to URLs still account for 10 percent of crawl issues. When your site appends a unique session identifier to every URL (like ?sid=12345), Googlebot treats every session as a unique page. This creates a massive amount of near-duplicate content that dilutes the value of the main page and wastes crawl budget on temporary, useless URLs. Modern session management should use cookies, not URL parameters. If your site still uses URL-based session IDs, this is a technical debt that needs fixing regardless of crawl budget because it also creates duplicate content problems that affect your rankings directly.

Infinite Spaces (Calendar Widgets, Event Plugins)

Infinite spaces are URLs generated by calendar widgets, date pickers, or endless pagination that let Googlebot follow “next” links forever. If a calendar widget generates a valid URL for every month up to the year 3000, Googlebot may try to crawl each one. Illyes noted instances where plugins generated these infinite traps on every single path of a website, trapping the crawler in a loop of empty content that exhausts the crawl budget before it reaches valuable pages. Audit your site for any feature that generates an infinite series of URLs and block those patterns in robots.txt, as in the sketch below.
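Because these traps are often path-based rather than parameter-based, the blocking pattern differs slightly from the facet rules. The patterns below are purely illustrative; pull the real ones from your server logs or the plugin’s documentation.

```
# Infinite spaces (illustrative patterns; copy the real ones from your logs)
User-agent: *
# a calendar widget that paginates month by month forever
Disallow: /*?month=
Disallow: /*?eventDate=
# or fence off the widget's whole path if it has no search value
Disallow: /calendar/
```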

Gary Illyes’s 2025 Revelation: Speed Matters More Than Size

Perhaps the most important insight from Illyes’s 2025 podcast appearance is that server speed matters more than page count for crawl budget. “If you are making expensive database calls, that’s going to cost the server a lot,” Illyes noted. A website with a few hundred thousand pages but plagued with slow database queries, dynamic rendering issues, or poor server configurations can suffer more in crawlability than a static site with over a million pages. Improving server response time can multiply your daily crawl rate by up to four times because Googlebot can request more pages per minute when each response comes back faster. This aligns with the priority hierarchy we discussed in our site speed optimization article: fix TTFB first, because fast server responses benefit both your users and Googlebot. Illyes also clarified that crawling is not the main bottleneck for Google’s resources. “It’s not crawling that is eating up the resources. It’s indexing and potentially serving, or what you are doing with the data when you are processing that data.” This means that even if Googlebot crawls your page, if the content is low quality, slow to render, or duplicative, Google may choose not to index it, making the crawl a waste from both sides.

How to Optimize Your Crawl Budget

Fix Server Speed First

Based on Illyes’s 2025 insight, the correct priority for crawl budget optimization is server speed first, content quality second, and URL volume third. Reduce your server response time (TTFB) to under 200ms for cached pages and under 600ms for dynamic pages. This alone can dramatically increase the number of pages Googlebot crawls per day. Use LiteSpeed or Nginx instead of Apache, enable server-side caching, optimize database queries, and deploy a CDN. Our site speed optimization article covers the technical details of each of these improvements.
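A quick way to sanity-check those targets is to time a few representative URLs. The sketch below uses Python’s requests library, whose response.elapsed value covers the time from sending the request until the response headers are parsed, a reasonable proxy for TTFB; the URLs are placeholders.

```python
# Rough TTFB check for a few representative pages (a sketch, not a monitoring tool)
import requests

URLS = [
    "https://www.example.com/",                    # placeholders: use your own pages
    "https://www.example.com/category/hoodies/",
    "https://www.example.com/blog/latest-post/",
]

for url in URLS:
    response = requests.get(url, timeout=10)
    # elapsed measures request sent -> headers parsed, close enough to TTFB for triage
    ttfb_ms = response.elapsed.total_seconds() * 1000
    status = "OK  " if ttfb_ms < 500 else "SLOW"
    print(f"{status} {ttfb_ms:7.0f} ms  {url}")
```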

Clean Up Your robots.txt

Your robots.txt file is your primary tool for telling Googlebot what not to crawl. Use disallow rules to block URL patterns that waste crawl budget: faceted navigation parameters, action parameters (add-to-cart, wishlist, comparison), internal search results, admin and staging directories, calendar and date-based infinite spaces, and print-friendly or PDF versions of pages. Be precise with your disallow rules. A rule like Disallow: /*?* blocks all parameterized URLs, which may include legitimate pages. Instead, target specific parameters: Disallow: /*?color=, Disallow: /*?sort=, Disallow: /*?sid=. Remember that robots.txt blocks crawling but not indexing. If a blocked URL has external links pointing to it, Google may still index the URL (showing it in results without a description) even though it cannot crawl the content. For pages that should not appear in search at all, use a meta noindex tag instead of or in addition to robots.txt.
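Putting those categories together, a starting-point robots.txt might look like the sketch below. Every path and parameter here is illustrative (the admin and search paths, for example, assume WordPress): verify each rule against your own URL structure before deploying it, and never block the CSS and JavaScript files your pages need to render.

```
# Illustrative robots.txt, not a drop-in file; verify every pattern against your own site
User-agent: *
# faceted navigation and sorting
Disallow: /*?color=
Disallow: /*&color=
Disallow: /*?sort=
Disallow: /*&sort=
# action parameters
Disallow: /*?add-to-cart=
Disallow: /*&add-to-cart=
# session IDs
Disallow: /*?sid=
Disallow: /*&sid=
# internal search results (?s= is the WordPress default)
Disallow: /*?s=
Disallow: /search/
# admin and staging (WordPress-style paths assumed)
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /staging/
# print-friendly and PDF versions
Disallow: /*/print/
Disallow: /*.pdf$

Sitemap: https://www.example.com/sitemap_index.xml
```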

Optimize Your XML Sitemap

Your XML sitemap should be a curated list of your important pages, not an exhaustive list of every URL on your site. Include only pages that you want Google to index: your main content pages, product pages, category pages, and blog posts. Exclude pages that should not appear in search results: admin pages, thin archive pages, paginated results, filtered views, and any page with a noindex tag. Keep your sitemap fresh. If you add or update content, your sitemap should reflect those changes. Most WordPress SEO plugins (Yoast, Rank Math, AIOSEO) generate sitemaps automatically and update them when content changes. Submit your sitemap to Google Search Console and check the sitemap report regularly to verify that Google can access it and that the number of submitted URLs roughly matches the number of indexed URLs.
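For reference, a minimal sitemap looks like the snippet below; the URLs and dates are placeholders, and lastmod should only change when the page content genuinely changes.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- only canonical, indexable URLs belong here; placeholder entries shown -->
  <url>
    <loc>https://www.example.com/category/hoodies/</loc>
    <lastmod>2026-03-20</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/blog/crawl-budget-guide/</loc>
    <lastmod>2026-03-27</lastmod>
  </url>
</urlset>
```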

Fix Redirect Chains

A redirect chain occurs when URL A redirects to URL B, which redirects to URL C, which finally reaches the destination page. Each hop in the chain consumes a crawl request without delivering any indexable content. While Google has stated that it follows up to 10 redirects in a chain, each one wastes crawl budget and adds latency. Crawl your site with Screaming Frog and look for redirect chains longer than one hop. Fix them by updating the original redirect to point directly to the final destination. Also identify and fix any internal links that point to redirected URLs rather than to the final destination directly.
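If you only need to spot-check a handful of URLs rather than run a full crawl, a short script can surface chains. This sketch uses Python’s requests library and its response.history list of intermediate redirects; the URLs are placeholders.

```python
# Report redirect chains longer than one hop for a small list of URLs (a sketch)
import requests

URLS_TO_CHECK = [
    "http://example.com/old-page",        # placeholders: use URLs exported from your crawler
    "http://www.example.com/category/",
]

for url in URLS_TO_CHECK:
    response = requests.get(url, allow_redirects=True, timeout=10)
    hops = [r.url for r in response.history]   # every URL that answered with a redirect
    if len(hops) > 1:
        print(f"CHAIN ({len(hops)} hops): " + " -> ".join(hops + [response.url]))
    elif len(hops) == 1:
        print(f"single redirect: {hops[0]} -> {response.url}")
    else:
        print(f"no redirect: {url}")
```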

Manage Duplicate Content

Duplicate content wastes crawl budget because Googlebot crawls multiple URLs that all serve the same content. Common sources of duplication include HTTP and HTTPS versions of the same page, www and non-www versions, trailing slash and non-trailing-slash versions, URL parameter variations, and paginated archive pages. Use 301 redirects to resolve protocol and www variations (every site should resolve to one canonical version). Use canonical tags for content that legitimately exists at multiple URLs (like products in multiple categories). Use meta noindex for content that should exist for users but not for search engines (like tag archives or date archives in WordPress).
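At the page level, the last two fixes are a canonical link and a robots meta tag; the URL below is a placeholder.

```html
<!-- a product reachable under several category paths: every variant points at one canonical URL -->
<link rel="canonical" href="https://www.example.com/product/example-hoodie/" />

<!-- a tag or date archive that should stay crawlable for users but out of the index -->
<meta name="robots" content="noindex, follow" />
```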

Handle AI Crawlers

A growing concern for crawl budget in 2025 and 2026 is AI crawlers. Bots like GPTBot (OpenAI), ClaudeBot (Anthropic), and various other AI training and retrieval crawlers are now consuming significant server resources. Some reports indicate that AI crawlers can consume up to 40 percent of a site’s bandwidth, reducing the resources available for Googlebot. If your server logs show heavy AI crawler traffic, consider blocking training crawlers (like GPTBot) while allowing retrieval crawlers that might cite your content in AI search results. The distinction matters: blocking all AI crawlers protects your bandwidth but may reduce your visibility in AI-powered search. A selective approach blocks training bots while allowing the retrieval bots that drive traffic.
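A selective policy is easy to express in robots.txt. The sketch below blocks user-agent tokens the vendors document as crawlers used for training (GPTBot and ClaudeBot from the article, plus CCBot as an assumed addition) while leaving retrieval agents alone; check each vendor’s current documentation before relying on these names, and treat the choice as a business decision rather than a technical default.

```
# Block crawlers used mainly for model training (names as documented by the vendors
# at the time of writing; verify before deploying)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

# Retrieval agents that can cite or link to your pages are deliberately not listed here,
# so the default rules for all other user agents continue to apply to them.
```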

How to Monitor Crawl Budget

Google Search Console Crawl Stats

Google Search Console provides crawl statistics under Settings, then Crawl stats. This report shows the total number of crawl requests per day, the average response time, and the percentage of crawl requests that returned specific HTTP status codes. A healthy crawl stats report shows consistent daily crawl activity without wild fluctuations, average response times under 500ms, a high percentage of 200 (OK) responses, and minimal 404, 500, or redirect responses. If your crawl rate is declining over time without you making changes, it may indicate that Google is reducing its crawl demand because of quality signals or server performance issues. Also check the Coverage report regularly, particularly the “Discovered, currently not indexed” and “Crawled, currently not indexed” categories. The former indicates pages Google found but chose not to crawl (a potential crawl budget issue), while the latter indicates pages Google crawled but chose not to index (a content quality issue, not a crawl budget issue).

Server Log Analysis

For the most detailed view of how Googlebot interacts with your site, analyze your server access logs. Server logs show every request Googlebot makes, including the exact URLs crawled, the response codes returned, the time of each request, and the frequency of visits to specific sections of your site. Tools like Screaming Frog Log Analyzer, Oncrawl, or even custom scripts can parse your logs and reveal whether Googlebot is spending its time on your important content or getting trapped in low-value URL patterns. If your logs show Googlebot crawling thousands of faceted navigation URLs while barely touching your new blog content, you have a clear crawl budget distribution problem that needs addressing through robots.txt rules and internal linking improvements. Log analysis also reveals AI crawler activity: look for user agents like GPTBot, ClaudeBot, CCBot, and other AI crawlers to understand how much of your server resources they are consuming and whether blocking some of them would free up capacity for Googlebot.
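As a starting point, a short script can bucket Googlebot’s requests and tally AI crawler hits from a standard access log. This is a sketch: it assumes the common “combined” log format, the file path is a placeholder, and matching on the user-agent string alone is not verification, since anyone can spoof it (Google publishes IP ranges and a reverse-DNS procedure for strict checks).

```python
# Bucket Googlebot's crawl requests and count AI crawler hits from an access log (a sketch)
import re
from collections import Counter

LOG_FILE = "access.log"  # placeholder path
LINE_RE = re.compile(r'"(?:GET|POST) (?P<path>\S+) HTTP/[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"')

googlebot_buckets = Counter()
ai_crawler_hits = Counter()

with open(LOG_FILE, encoding="utf-8", errors="replace") as handle:
    for line in handle:
        match = LINE_RE.search(line)
        if not match:
            continue
        path, user_agent = match.group("path"), match.group("ua")
        if "Googlebot" in user_agent:
            if "?" in path:
                # bucket parameterized URLs by their first query parameter name
                key = "?" + path.split("?", 1)[1].split("=", 1)[0]
            else:
                # otherwise bucket by the first path segment
                key = "/" + path.lstrip("/").split("/", 1)[0]
            googlebot_buckets[key] += 1
        for bot in ("GPTBot", "ClaudeBot", "CCBot"):
            if bot in user_agent:
                ai_crawler_hits[bot] += 1

print("Googlebot requests by section or first parameter:")
for key, count in googlebot_buckets.most_common(15):
    print(f"  {count:6}  {key}")
print("AI crawler requests:", dict(ai_crawler_hits))
```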

When NOT to Worry About Crawl Budget

If your site has fewer than 10,000 pages, crawl budget is almost certainly not your problem. Even up to 100,000 pages, Mueller has confirmed that Google handles this volume without crawl budget constraints. If your new pages are being indexed within a day or two of publication, your crawl budget is fine. If you are not seeing “Discovered, currently not indexed” growing in Search Console, your crawl budget is fine. Focus your energy on content quality, site speed, internal linking, and the other fundamentals that have a much larger impact on whether your pages rank well. The most common mistake in SEO is attributing indexing problems to crawl budget when the real issue is that the content is thin, duplicative, or not linked from anywhere important on the site. Google does not refuse to crawl pages because of budget limits on small sites. It refuses to index pages because they do not meet the quality threshold for inclusion in its index.

Conclusion

Crawl budget is a real technical concept that matters for large, complex websites. But for the vast majority of sites on the web, it is, as Mueller says, overrated. The correct priority order for crawl budget optimization is: fix server speed first (because fast responses let Googlebot crawl more in less time), clean up structural crawl traps second (faceted navigation, action parameters, session IDs, infinite spaces), and manage URL volume third (through robots.txt, sitemaps, and canonical tags). If your site has fewer than 100,000 pages and your server responds in under 500ms, spend your time on content and link building instead of crawl budget. If you run a large e-commerce store or content portal and you see “Discovered, currently not indexed” growing in Search Console, then crawl budget optimization is worth your attention, and the fixes in this article will help you reclaim the crawl resources that are currently being wasted on URLs that add no value.


LaFactory has been building and optimizing website architectures since 1996. Our technical SEO audits include server log analysis that shows exactly where Googlebot spends its time on your site and how to redirect that attention to your most valuable content. Contact us for an audit that identifies the real cause of your indexing problems.
