How to Exclude Query String Parameters from Search Engine Crawling Using robots.txt
Last year we wrote about the problem of excessive crawling from search engine spiders. Search engines, like Google and Bing, aim to index as much content as possible. For ecommerce sites, this often means indexing pages with various query string parameters used for sorting, filtering, or pagination. While these parameters help users navigate your site, they can create several issues:
- Over-Crawling: Search engines may spend excessive time crawling similar pages with different parameters, wasting your crawl budget.
- Duplicate Content: Pages with different parameters can be seen as duplicate content, diluting your SEO efforts.
- Server Load: Excessive crawling can increase server load, slowing down your site and affecting user experience. Search engines typically account for 30-50% of page requests to an ecommerce store. Managing their crawling effectively can have a massive impact on site speed and server spend.
Another common cause of over-crawling is internal search result pages being crawled and indexed.
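For example, if a site's internal search results live under a /search path or are driven by a q query parameter (both names are assumptions here; substitute whatever your platform actually uses), the relevant robots.txt rules might look something like this:
User-agent: *
# Assumed path for internal search result pages
Disallow: /search
# Assumed query parameter for search terms; wildcard rules are explained below
Disallow: /*?*q=*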
In our previous article we mentioned using the webmaster tools provided by Google and Microsoft to manage crawler behaviour by specifying query parameters to be ignored. Since publishing that article, both tools have been updated and the ability to exclude parameters from crawling has been removed.
Differences in Crawling and Indexing
Search engines maintain an 'index' of web pages; it is the pages in this index that appear in search results. To maintain this index, a search engine crawls websites to 'discover' new content and to keep existing entries up to date. Webmasters can control what gets indexed by using various tags or headers in their web pages, including the following (illustrated after the list):
- Canonical Tags can be used to indicate the preferred version of a page. This helps to consolidate link 'juice' and to tell the search engine which URL to index.
- Noindex tags can be used to prevent specific pages from being indexed. This is useful for thank you pages, admin pages or any content you don't want to appear in search results.
- Nofollow links can be used to indicate to a search engine not to pass on SEO value to the linked page.
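For reference, this is roughly what these controls look like in a page's markup (the URLs are placeholders):
<!-- Canonical tag: point parameterised variants at the preferred URL -->
<link rel="canonical" href="https://www.example.com/category/shoes" />
<!-- Noindex: keep this page out of search results (it can also be sent as an X-Robots-Tag HTTP header) -->
<meta name="robots" content="noindex" />
<!-- Nofollow link: don't pass SEO value to the linked page -->
<a href="https://www.example.com/login" rel="nofollow">Log in</a>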
However, controlling what does or does not get indexed does not prevent content from being crawled; the only way to do that is via the robots.txt file. You may be familiar with the Disallow directive in robots.txt, but you may not be aware that you can use wildcards to prevent crawling of URL parameters.
An example...
Consider an ecommerce store that has a category page which can then be customised with the following parameters:
- orderBy
- colors
- brands
- page
- results
These parameters may appear in any order, and the combinations might result in hundreds or even thousands of variations of essentially the same page. Google is fairly smart when presented with this scenario, but Bing... Bing can crawl very aggressively and it likes to try everything. In our example above we may want to stop crawling everything except the page number, in which case an effective way to control crawler behaviour would be:
User-agent: *
Disallow: /*?*orderBy=*
Disallow: /*?*colors=*
Disallow: /*?*brands=*
Disallow: /*?*results=*
We can't do this with a single Disallow rule because the parameters might appear in any order. By including the ? in each pattern, we ensure the rules only match parameter names in the query string, not in the main URL path. This prevents crawlers from wasting your crawl budget and hammering your server resources.
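To see how these rules behave, here are a few hypothetical URLs for such a category page (the paths and values are invented for illustration) and how a crawler honouring the rules above should treat them:
- /shoes: crawled
- /shoes?page=2: crawled, because page is not disallowed
- /shoes?orderBy=price: blocked by Disallow: /*?*orderBy=*
- /shoes?page=2&colors=red: blocked, because the colors parameter matches Disallow: /*?*colors=*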
Conclusion
Search engines can often make up 30-50% of the overall page requests to a website. It's crucial to manage their behaviour to maximise their effectiveness and minimise your server utilisation. Keep an eye on your access logs to watch out for unwanted behaviour and use robots.txt to keep them in line!