The robots.txt File Every Large Site Gets Wrong

Most robots.txt problems aren't caused by bad intentions — they're caused by good intentions applied at the wrong layer. As sites scale, configuration drift turns a tidy file into a minefield. Here's what breaks and why it matters more than most teams realise.

I've audited hundreds of sites over the years, and robots.txt is consistently one of the most underestimated files in the technical SEO stack. It's short, plain text, and sits in one place. What could go wrong?

Quite a lot, as it turns out.

Mistake 1: Treating robots.txt as an access control mechanism

The robots.txt spec — formalised as RFC 9309, the Robots Exclusion Protocol — is advisory. That's not a caveat buried in the documentation — it's the central point of the entire protocol. When you add Disallow: /admin/, you are politely asking crawlers not to go there. You are not blocking them.

The practical consequences:

  • Malicious bots and scrapers ignore it entirely
  • Any URL in a disallowed path can still be indexed if it receives inbound links
  • Google may show a disallowed URL in search results with no snippet — visible, but content-free

For anything that genuinely needs protecting — staging environments, admin panels, internal tools — robots.txt is a useful additional signal, but never a substitute for authentication. Treat them as complementary, not equivalent.

The rule: use robots.txt to manage crawl budget and signal intent. Use authentication to enforce access. Never confuse the two.
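As a sketch of that division of labour (robots.txt signals, the server enforces), assuming an nginx front end and illustrative paths:

```
# robots.txt: a polite request that well-behaved crawlers honour
User-agent: *
Disallow: /admin/
```

```nginx
# nginx (sketch): actual enforcement via HTTP Basic auth
location /admin/ {
    auth_basic           "Restricted";
    auth_basic_user_file /etc/nginx/.htpasswd;
}
```

A crawler that ignores the first block still hits a 401 from the second. That's the difference between a signal and a lock.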

Mistake 2: The accidental full block

This one causes real damage and happens more often than it should. A developer deploys a staging config to production. A migration leaves the wrong robots.txt in the root directory. Someone adds Disallow: / to test something and forgets to remove it.

User-agent: *
Disallow: /

Two lines. Every search engine stops crawling your entire site. The damage compounds slowly — Google won't immediately de-index everything, but fresh content stops being crawled, rankings begin to erode, and by the time the problem surfaces in your visibility data it's weeks old.

After any deployment that touches the root directory: verify that https://yourdomain.com/robots.txt returns what you expect. Make it part of your post-deploy checklist without exception.
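That check is easy to automate. A minimal sketch using Python's standard-library robots.txt parser — the function takes the file's content as a string, so it can run in CI against either a fixture or the live file:

```python
from urllib.robotparser import RobotFileParser

def is_fully_blocked(robots_txt: str) -> bool:
    """True if this robots.txt content blocks all agents from the whole site."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # If even the homepage is off-limits to a generic agent, it's a full block.
    return not parser.can_fetch("*", "https://example.com/")

# The two-line accidental full block from above:
assert is_fully_blocked("User-agent: *\nDisallow: /")

# A sane production file passes:
assert not is_fully_blocked("User-agent: *\nDisallow: /admin/")
```

In a real post-deploy check you would fetch the live file first — for example with parser.set_url("https://yourdomain.com/robots.txt") followed by parser.read() — then run the same assertion and fail the pipeline if it trips.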

Mistake 3: Blocking CSS and JavaScript

This was common advice in the early 2010s. Block /wp-content/, block /assets/, reduce crawl overhead. The logic made a kind of sense when Googlebot was text-only.

Google announced back in 2014 that it renders pages much like a modern browser, and since 2019 Googlebot has run an evergreen version of Chromium. When you block the CSS and JS that build your page, Google sees something closer to raw HTML stripped of structure — not the same experience a user gets. Googlebot's rendering evaluation directly informs how it understands and ranks content. Blocking render assets hurts that process.

If you inherited a site with these rules and the rankings look fine, that may mean the blocks aren't actually having a measurable effect. Or it may mean you're leaving performance on the table. Either way, the rules have no legitimate modern justification — remove them.
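If the inherited rules look like this (a WordPress-flavoured example; your paths will differ), the fix is deletion, not adjustment:

```
# Early-2010s pattern still found in the wild. Delete these lines.
User-agent: *
Disallow: /wp-content/
Disallow: /wp-includes/
Disallow: /assets/
```

After removing them, Google Search Console's URL Inspection tool shows the page as Google renders it, which is a quick way to confirm the assets are now reachable.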

Mistake 4: Using it to hide pages from search results

This is the most persistent misconception in technical SEO. Disallowing a URL stops Googlebot from crawling it. It does not stop Google from indexing it.

If a disallowed page receives a single inbound link from anywhere on the web, Google can discover the URL and index it — without ever visiting the page. The result is an indexed URL with no snippet, no title, and no content context. Not hidden. Just empty.

To prevent a page from appearing in search results, you need a noindex meta tag in the HTML head — and you need Google to be able to crawl the page in order to read it. Disallow and noindex are not interchangeable; they work at different stages of the pipeline.
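The tag itself is one line; the only requirement is that crawlers can still reach the page to read it:

```html
<!-- In the <head> of the page you want kept out of results.
     Do NOT also Disallow this URL, or Google can never see the tag. -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the equivalent is the X-Robots-Tag: noindex HTTP response header.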

Mistake 5: Not including the sitemap

The Sitemap: directive in robots.txt is one of the most reliable ways to make sure all major crawlers know where your content is. It's not only Googlebot — Bing, Yandex, and others all read robots.txt and follow the sitemap reference.

A robots.txt without a sitemap URL is a missed opportunity, particularly for large sites where discovery is a real crawl budget consideration.

Sitemap: https://example.com/sitemap.xml

One line. No excuse not to have it.

The common thread

Every mistake on this list shares the same root cause: misunderstanding what robots.txt actually does versus what it appears to do. It's a lightweight convention that most legitimate crawlers follow most of the time. Nothing more, and nothing less.

Used correctly — to manage crawl budget, direct resources away from low-value paths, and signal intent to search engines — it's a clean and effective tool. Used as a security mechanism or an indexing blocker, it fails, quietly and at scale.
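Put together, a file that follows these principles might look like this (domain and paths are illustrative):

```
User-agent: *
# Crawl-budget hygiene, not access control: /admin/ also sits behind auth
Disallow: /admin/
Disallow: /internal-search/
# No Disallow on CSS or JS paths: render assets stay crawlable

Sitemap: https://example.com/sitemap.xml
```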

Mags Sikora
Freelance SEO Consultant, SEO Director

Senior SEO Strategist with 18+ years leading search programmes for enterprise and global digital businesses. Director of SEO at Intrepid Digital.