Image: D J Shin (CC BY-SA 3.0) from Wikimedia Commons
So you’ve mastered the basics of the robots.txt file, but you’ve also realized that there must be more to this handy tool.
Well, it’s time to dig a little deeper!
As you probably know, robots.txt files are primarily used to guide search engine bots by setting rules that block them from, or allow them access to, certain parts of your website.
While the simplest way to use the robots.txt file is to block bots from whole directories, there are several advanced functions that give more granular control over how your site is crawled.
Here are five pro tips for those who want to get a little more advanced in their bot wrangling…
1. Crawl Delay
Say that you run a large website with a high frequency of updates. For argument’s sake, let’s say it is a news website.
Every day, you post dozens of new articles to your home page. Because of the large number of updates, search engine bots are constantly crawling your site, placing a heavy load on your servers.
The robots.txt file gives you a simple way to remedy this: the crawl-delay directive.
The crawl-delay directive instructs bots to wait a certain number of seconds between requests. For example:
User-agent: Bingbot
Crawl-delay: 10
One benefit of this directive is that it allows you to limit the number of URLs crawled per day on larger sites.
If you set your crawl delay to 10 seconds, as in the above example, then Bingbot would crawl a maximum of 8,640 pages per day (60 seconds x 60 minutes x 24 hours = 86,400 seconds per day; 86,400 / 10-second delay = 8,640 requests).
Unfortunately, not all search engines (or bots in general, for that matter) recognize this directive; the most notable holdout is Google, which ignores it.
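If you build your own crawlers, you can also read and respect this directive programmatically. Here's a minimal sketch using Python's standard-library urllib.robotparser (the domain, user agent, and URL list are placeholders):

import time
import urllib.robotparser

# Load the site's robots.txt (placeholder domain)
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# crawl_delay() returns the Crawl-delay value for this user agent, or None
delay = rp.crawl_delay("Bingbot") or 0

for url in ["https://www.example.com/news/1", "https://www.example.com/news/2"]:
    if rp.can_fetch("Bingbot", url):
        # ...fetch the page here...
        time.sleep(delay)  # wait between requests, as the directive asks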
2. Pattern Matching (AKA Wildcards)
Pattern matching, or wildcards, lets you write rules that match whole groups of URLs based on patterns of characters, rather than listing exact paths.
This can be incredibly useful, especially when you need to stop bots from crawling specific types of files or string expressions. It allows much more control than using a broad-brush approach to block whole directories, and saves you from having to list every URL that you want to block individually.
The most basic form of pattern matching uses the asterisk wildcard (*), which matches any sequence of characters. For example, the following directive would block all subdirectories whose names begin with “private” (such as /private/, /private-files/ or /private2/):
User-agent: Googlebot
Disallow: /private*/
You can match the end of a URL using the dollar sign character ($). The following, for example, would block all URLs ending in “.asp”:
User-agent: Googlebot
Disallow: /*.asp$
Another example: to block all URLs that include the question mark character (?), you would use the following directive:
User-agent: *
Disallow: /*?*
You can also use pattern matching to block bots from crawling specific filetypes, in this example .gif files:
User-agent: *
Disallow: /*.gif$
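To make the matching behavior concrete, here is a rough Python sketch of how a crawler might translate these wildcard rules into regular expressions when deciding whether a URL path is blocked. It illustrates the matching logic only and is not Google's actual implementation:

import re

def rule_matches(pattern: str, url_path: str) -> bool:
    # True if a Disallow/Allow pattern such as "/*.gif$" matches url_path
    regex = re.escape(pattern).replace(r"\*", ".*")  # * matches any characters
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # $ anchors the rule to the end of the URL
    else:
        regex += ".*"  # otherwise the rule is a prefix match
    return re.match(regex, url_path) is not None

print(rule_matches("/*.gif$", "/images/photo.gif"))           # True
print(rule_matches("/*.gif$", "/images/photo.gif?v=2"))       # False
print(rule_matches("/private*/", "/private-files/doc.html"))  # True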
3. Allow Directive
If you’ve read this far, you’re probably familiar with the disallow directive. A lesser-known feature is the “allow” directive.
As you might guess from the name, the allow directive works in the opposite way to the disallow directive. Instead of blocking bots, it specifies paths that may be accessed by designated bots.
This can be useful in a number of instances. For example, say you have disallowed a whole section of your site, but still want bots to crawl a specific page within that section.
In the following example, Googlebot is blocked from the entire site but is still allowed access to the “google” directory:
User-agent: Googlebot
Disallow: /
Allow: /google/
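When a URL matches both an Allow and a Disallow rule, Google applies the most specific (longest) matching rule, and Allow wins ties. The following simplified Python sketch illustrates that precedence for the example above; it ignores wildcards and is not a full robots.txt parser:

# Rules mirroring the example above: block everything, allow /google/
RULES = [
    ("disallow", "/"),
    ("allow", "/google/"),
]

def is_allowed(url_path: str) -> bool:
    # Collect every rule whose path is a prefix of the URL
    matches = [(len(path), kind) for kind, path in RULES if url_path.startswith(path)]
    if not matches:
        return True  # no rule applies, so crawling is allowed
    # Longest path wins; on a tie, "allow" beats "disallow"
    _, kind = max(matches, key=lambda m: (m[0], m[1] == "allow"))
    return kind == "allow"

print(is_allowed("/google/page.html"))   # True  -> the longer Allow rule wins
print(is_allowed("/private/page.html"))  # False -> only Disallow: / matches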
4. Noindex Directive
Unlike the disallow directive, the noindex directive will not stop your site from being crawled by search engine bots. However, it will stop search engines from indexing your pages.
Another handy tip: it will also remove pages from the index. This has obvious benefits, for example if you need a page with sensitive information removed from search engine results pages.
Note that the noindex directive in robots.txt has only ever been unofficially supported by Google (which has since announced it will stop honoring it), and it is not supported at all by Bing.
You can combine the disallow and noindex directives to stop pages from being either crawled or indexed by bots:
User-agent: *
Disallow: /private/
Noindex: /private/
5. Sitemap
XML sitemaps are another essential tool for optimizing your site, especially if you want search engine bots to actually find and index your pages!
But before bots can use your sitemap to discover your pages, they first need to find the sitemap itself.
To make absolutely certain that search engine bots find your XML sitemap, you can add its location to your robots.txt file:
Sitemap: https://www.example.com/sitemap.xml
While all major search engines recognize this protocol, you should still submit your sitemap to each search engine via the relevant webmaster console (Google Search Console, Bing Webmaster Tools, etc.).
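You can also confirm that the directive is discoverable by reading it back out of robots.txt. Here's a quick sketch using Python's standard library (the domain is a placeholder; site_maps() requires Python 3.8 or newer):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Returns the listed sitemap URLs, e.g. ['https://www.example.com/sitemap.xml'],
# or None if robots.txt has no Sitemap line
print(rp.site_maps())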
For more great SEO tips, check out my interview with Dawn Anderson on the Marketing Speak podcast.