Robots.txt Tester
What it does
The Robots.txt Tester fetches the robots.txt for a given domain, parses every rule group by user-agent, and lets you test arbitrary paths against the rules to see whether they would be allowed or blocked for a given crawler. The output explains which rule matched and why — useful when robots logic is non-obvious because of overlapping Allow and Disallow patterns.
Common situations
You have configured robots.txt to block specific paths but pages are still being crawled and indexed. Test the paths against the rules — often the rule pattern doesn’t match what you thought it did. Wildcards and path matching are the source of most robots.txt confusion.
A migration has just gone live and you want to verify the new robots.txt is correct before search engines start re-crawling. Test the canonical URL set against the rules; verify expected pages are allowed and expected blocks are blocked.
A bot is hitting paths you thought were blocked. Test the bot’s user-agent string against the rules. If the file contains a group that names that bot, that group applies instead of the catch-all * group, and its rules may differ.
You are auditing whether a competitor’s site is exposed to AI training crawlers. Fetch their robots.txt, test against GPTBot, ClaudeBot, Google-Extended, etc. The result tells you whether they have opted out of AI training.
A Search Console error reports that a page is “Blocked by robots.txt” but the page should be crawlable. Test the path; the rule that’s blocking it surfaces immediately, even if it’s a wildcard pattern that wasn’t obvious from skimming the file.
What you need to know
Robots.txt is the voluntary protocol that tells crawlers which paths they can and cannot access. Reputable bots honour it; abusive bots ignore it. The matching logic is precise but unforgiving — small pattern mistakes create big differences in what’s actually blocked.
The matching rules:
Longest match wins. When multiple rules match a URL, the rule with the longest path pattern is applied. So Disallow: /private/ blocks /private/page, but Allow: /private/public.html overrides it because /private/public.html is a longer match.
On equal-length matches, Allow wins. When the path patterns are the same length, Allow takes precedence over Disallow. So Disallow: /api/ and Allow: /api/ together → Allow wins.
Wildcards. * matches any sequence of characters; $ anchors to end-of-URL. So:
Disallow: /*?filter= blocks any URL containing ?filter=
Disallow: /*.pdf$ blocks any URL ending in .pdf
Disallow: /tmp/*/old/ blocks /tmp/anything/old/
User-agent grouping is sequential. A User-agent line starts a group; the directives that follow apply to that group until the next User-agent line. The * group applies to any bot not named in a more specific group, wherever that group sits in the file.
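The longest-match, tie-break, and wildcard rules listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tester’s actual implementation: the names pattern_to_regex and decide are hypothetical, and the rules are assumed to be (directive, pattern) pairs already selected for one user-agent group.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a regex matched from the path start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text, then let each '*' stand for any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def decide(rules, path):
    """Return (allowed, winning_rule). rules is a list of (directive, pattern)
    pairs, where directive is 'allow' or 'disallow'."""
    best_key, best_rule = None, None
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path) is None:
            continue
        # Longest pattern wins; on equal length, Allow beats Disallow.
        key = (len(pattern), directive == "allow")
        if best_key is None or key > best_key:
            best_key, best_rule = key, (directive, pattern)
    if best_rule is None:
        return True, None  # no rule matched, so the path is allowed
    return best_rule[0] == "allow", best_rule

rules = [
    ("disallow", "/private/"),
    ("allow", "/private/public.html"),
    ("disallow", "/*?filter="),
    ("disallow", "/*.pdf$"),
    ("disallow", "/tmp/*/old/"),
    ("disallow", "/api/"),
    ("allow", "/api/"),
]

for path in ["/private/page", "/private/public.html", "/docs/report.pdf", "/api/users"]:
    print(path, decide(rules, path))
```

Returning the winning rule alongside the verdict mirrors what the tester reports: not only whether a path is blocked, but which pattern decided it.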
The most common mistake: assuming the * group applies to Googlebot. It doesn’t if Googlebot has its own group anywhere in the file. A bot follows the most specific group that names it and does not add the * group’s rules on top.
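Group selection can be sketched as follows. This is a simplified illustration: parse_groups and rules_for are hypothetical names, the caller is assumed to pass a bare product token such as "Googlebot" rather than a full User-Agent header, and only Allow and Disallow lines are kept. Merging several groups that name the same bot follows the behaviour described in RFC 9309, as far as this sketch goes.

```python
def parse_groups(text):
    """Split robots.txt text into a list of (user_agents, rules) groups."""
    groups = []
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a User-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups

def rules_for(groups, user_agent):
    """Return the rules for a bot: its own group(s) if any, otherwise the * group."""
    token = user_agent.lower()  # user-agent matching is case-insensitive
    if any(token in agents for agents, _ in groups):
        return [rule for agents, rules in groups if token in agents for rule in rules]
    return [rule for agents, rules in groups if "*" in agents for rule in rules]

example = """
# The catch-all group comes first here; position does not matter.
User-agent: *
Disallow: /admin/

User-agent: Googlebot
Disallow: /drafts/
"""

groups = parse_groups(example)
print(rules_for(groups, "Googlebot"))
print(rules_for(groups, "Bingbot"))
```

In this example Googlebot receives only the /drafts/ rule, so /admin/ is open to it even though the * group blocks that path for everyone else.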
The second most common mistake: treating a pattern as an exact match. Disallow: /admin blocks /admin, /admin/, /admin/login, AND /administrative/. The rule is a prefix match, not an exact match. To match /admin exactly, use Disallow: /admin$.
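The difference between a bare prefix, a trailing slash, and a $ anchor is easy to see side by side. The helper below is a deliberately reduced sketch (the name blocks is hypothetical, and it ignores the * wildcard) that compares three patterns against the four paths from the example above.

```python
def blocks(pattern, path):
    """Prefix match with an optional '$' end anchor; '*' is not handled here."""
    if pattern.endswith("$"):
        return path == pattern[:-1]
    return path.startswith(pattern)

paths = ["/admin", "/admin/", "/admin/login", "/administrative/"]
patterns = ["/admin", "/admin/", "/admin$"]

# Print one row per path with a yes/no column for each pattern.
print("path".ljust(20), *[p.ljust(10) for p in patterns])
for path in paths:
    cells = ["yes" if blocks(p, path) else "no" for p in patterns]
    print(path.ljust(20), *[c.ljust(10) for c in cells])
```

Only the bare /admin pattern catches /administrative/; the trailing-slash form covers the directory but not the bare /admin path, and the anchored form covers the bare path alone.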
The tester implements the same matching logic as Google’s crawler. Paste a path, pick a user-agent, and the result tells you whether the path is allowed or blocked and which rule matched.
Frequently asked questions
Why is my page blocked even though there’s an Allow rule?
The Allow rule’s path pattern is shorter than the Disallow that’s blocking the path. Allow only wins when its pattern is longer than the Disallow, or when they’re equal length. Make the Allow pattern more specific.
What does User-agent: * cover?
Every bot that doesn’t match a more specific User-agent group anywhere in the file. If Googlebot has its own group, the * group does not apply to Googlebot; Googlebot’s rules come from its own group only.
Can I block a specific file extension globally?
Yes — Disallow: /*.pdf$ blocks all URLs ending in .pdf. The * is the wildcard for any path, $ anchors to the URL end. Common pattern for blocking auto-generated downloadable files.
How do I block AI crawlers?
Add user-agent groups for the well-known AI crawlers: GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google AI), CCBot (Common Crawl), PerplexityBot (Perplexity). For each, add a two-line group such as:
User-agent: GPTBot
Disallow: /
The major AI companies have publicly committed to honouring these rules.
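As a rough sketch, the groups can be generated and sanity-checked with Python’s standard-library urllib.robotparser. The list AI_CRAWLERS and the function ai_block_groups are hypothetical names; the crawler tokens are the ones listed above. The standard-library parser only does plain prefix matching, which is enough for a blanket Disallow: / rule.

```python
from urllib.robotparser import RobotFileParser

# Crawler tokens taken from the answer above.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot", "PerplexityBot"]

def ai_block_groups(agents=AI_CRAWLERS):
    """Build one 'Disallow: /' group per crawler token."""
    return "\n\n".join(f"User-agent: {agent}\nDisallow: /" for agent in agents) + "\n"

# Append a permissive catch-all group so other bots stay unaffected.
robots_txt = ai_block_groups() + "\nUser-agent: *\nDisallow:\n"

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for agent in AI_CRAWLERS + ["Googlebot"]:
    print(agent, parser.can_fetch(agent, "/any/page"))
```

Each listed crawler should come back as blocked, while Googlebot falls through to the * group and stays allowed.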
Is path matching case-sensitive?
Yes for paths (URL paths are case-sensitive on most servers), no for user-agents (UA matching is case-insensitive in robots.txt). So Disallow: /Admin and Disallow: /admin are different; User-agent: GoogleBot and User-agent: googlebot are the same.
What’s a “wild path”?
A pattern using wildcards. /* matches any path; /*?param= matches any URL whose query string starts with that parameter. Wildcards are an extension to the original protocol but supported by all major bots.
Does robots.txt prevent indexing?
No — only crawling. Pages blocked in robots.txt can still be indexed if other pages link to them, appearing in search as URL-only entries with no description. Use noindex for index control, robots.txt for crawl control.
Can robots.txt have comments?
Yes — lines starting with # are comments and ignored. Useful for documenting why specific rules exist.
Common problems
Problem: Page is blocked in robots.txt but is still indexed in Google.
Robots.txt blocks crawling; it doesn’t remove already-indexed pages or prevent indexing of URLs found via other links. To remove a page from the index, allow it temporarily, add a noindex meta tag, let Google recrawl and process the noindex, then re-disallow.
Problem: Allow rule overrides Disallow on the live site but tester says blocked.
Check that the Allow pattern is at least as long as the Disallow pattern. Disallow: /admin/ and Allow: /admin/ are equal length, so Allow wins. With Disallow: /admin/secret/ and Allow: /admin/, paths under /admin/secret/ stay blocked because the Disallow pattern is longer.
Problem: Different bots are getting different access despite identical rules.
Check whether the file has user-agent-specific groups. The most common cause is a User-agent: Googlebot group with restrictive rules alongside a User-agent: * group with permissive rules; Googlebot uses its own group, ignoring the * rules entirely.
Problem: Rules look correct in the file but bots are accessing blocked paths.
Some bots (especially less-reputable ones) ignore robots.txt. Reputable crawlers (Googlebot, Bingbot, GPTBot, ClaudeBot, etc.) honour it; everything else is at the bot’s discretion. For genuinely sensitive content, use authentication, not robots.txt.
Problem: Pattern matches more than expected because of prefix matching.
Disallow: /api matches /api, /api/, AND /apiclient/. To block everything under /api/ but not /apiclient, use Disallow: /api/ (with trailing slash). That pattern no longer matches the bare /api path; add Disallow: /api$ if the bare path needs blocking too.
Tips
- Always test paths against the rules before relying on robots.txt to block them. Wildcards and prefix matching create non-obvious results.
- The longer the match, the more it overrides shorter rules. To carve out exceptions to broad rules, the exception’s path must be more specific.
- Block AI training crawlers if you don’t want your content used for model training. The major ones (GPTBot, ClaudeBot, Google-Extended, CCBot) honour robots.txt.
- Don’t list “secret” URLs in robots.txt — the file is public. Listing them advertises their existence.
- After every robots.txt change, test the most-trafficked URL paths against the new rules. Verify before deploying.
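The last tip can be turned into a small pre-deploy check. The sketch below is one possible shape, assuming a hypothetical check_expectations helper and a hand-maintained table of paths with their expected verdicts. It uses Python’s standard-library urllib.robotparser, which applies rules in file order as plain prefixes and does not implement wildcards or longest-match precedence, so it only suits files with simple rules; a file that relies on those features needs a matcher with the semantics described earlier.

```python
from urllib.robotparser import RobotFileParser

def check_expectations(robots_txt, user_agent, expectations):
    """Return the paths whose allowed/blocked verdict differs from the expected one."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        path
        for path, expected_allowed in expectations.items()
        if parser.can_fetch(user_agent, path) != expected_allowed
    ]

# Hypothetical new robots.txt and the verdicts it is supposed to produce.
new_robots_txt = "User-agent: *\nDisallow: /cart/\nDisallow: /search\n"
expected = {
    "/": True,
    "/products/widget": True,
    "/cart/checkout": False,
    "/search?q=shoes": False,
}

mismatches = check_expectations(new_robots_txt, "Googlebot", expected)
print("mismatches:", mismatches)
```

An empty list means every listed path behaves as intended; anything else names the paths to investigate before the file goes live.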
Related tools in this suite
The Robots.txt Generator is the source-side companion — when the tester reveals an issue, the generator builds the corrected file. The Sitemap Inspector is useful for verifying the sitemap declared in robots.txt actually exists and is parseable.
What this looks like at scale
For a single site, hand-edited robots.txt is fine. For organisations with multiple environments (production, staging, preview) or many subdomains, robots.txt should be generated dynamically per environment to ensure staging and preview don’t get crawled. The WordPress development service covers this kind of multi-environment configuration during deploy pipeline work.
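A per-environment generator can be very small. The sketch below is illustrative only: the function name robots_txt_for, the environment labels, and the example sitemap URL are assumptions, and the production rules shown are the familiar WordPress defaults rather than a recommendation for any particular site.

```python
def robots_txt_for(environment, sitemap_url=None):
    """Return robots.txt content for a deployment environment.

    Anything other than "production" gets a blanket block so that staging
    and preview hosts are not crawled.
    """
    if environment != "production":
        return "User-agent: *\nDisallow: /\n"
    lines = [
        "User-agent: *",
        "Disallow: /wp-admin/",
        "Allow: /wp-admin/admin-ajax.php",
    ]
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"

print(robots_txt_for("staging"))
print(robots_txt_for("production", "https://example.com/sitemap.xml"))
```

Defaulting to the blanket block for any unrecognised environment means a misconfigured deploy fails closed rather than exposing a preview host to crawlers. Because robots.txt is voluntary, non-production hosts that hold sensitive content still need authentication.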
Take it further
If your robots.txt has grown complex enough that errors are frequent, the underlying issue is usually that crawl-budget management is being done in robots.txt instead of via canonical URLs or noindex meta tags. Talk through the constraints and there’s often a cleaner path.