Google and Bing: What Do They Know? Do They Know Things? Let's Find Out.
Here at Propellernet, the technical SEO team are constantly running tests and researching how search engines deal with different setups. This can end up being incredibly useful – sometimes testing unusual things can help reveal how search engines like Google and Bing deal with crawling weird edge cases.
One thing that we see from time to time is a site serving the same content when a trailing slash is added to a URL, without a redirect being in place, so example.com/page and example.com/page/ would render the same content. We also often see sites that still render a page when some characters in the URL are in uppercase, so example.com/page and example.com/Page might render the same content (without one URL redirecting to the other).
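As a rough illustration of how you might spot this on your own site, the sketch below builds the trailing-slash and capitalised variants of a path and groups URLs whose response bodies are identical. The function names are our own invention for this example, and fetching the bodies is left out, so the input is assumed to be a dictionary of URL to HTML that you have already collected.

```python
import hashlib


def url_variants(path):
    """Return the plain, trailing-slash and capitalised variants of a path."""
    base = path.rstrip("/")
    # "/page" -> "/Page": keep the leading slash, capitalise the rest.
    capitalised = base[:1] + base[1:].capitalize()
    return [base, base + "/", capitalised]


def group_by_content(bodies):
    """Group URLs whose response bodies are byte-for-byte identical.

    `bodies` maps each URL to the HTML it returned (assumed to be
    fetched elsewhere, without following redirects).
    """
    groups = {}
    for url, body in bodies.items():
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        groups.setdefault(digest, []).append(url)
    return list(groups.values())
```

Any group containing more than one URL is a set of variants serving the same content without a redirect. An exact hash only catches identical bodies, so pages that differ by a timestamp or a session token would need a looser comparison.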
In most cases, search engines detect that it’s duplicate content and only end up serving one of those URLs – but we wanted to see what would happen if that content wasn’t duplicate. What would happen if example.com/page and example.com/page/ or example.com/Page rendered totally different content?
To find out, we set up pages on a test site that did exactly that. There were four test URLs in total: two tested how search engines deal with unique content when the URL is capitalised (/casesensitive and /CASESENSITIVE), and two tested how search engines deal with unique content with a trailing slash (/slash/ and its counterpart without the trailing slash).
In both of these scenarios we followed the same pattern:
It’s also worth mentioning that the XML sitemap is registered with both of those tools, as well as being referenced in the robots.txt file.
We then analysed the results by exploring the log files to see which pages were crawled (and by which search engines), and by checking which pages ended up indexed and which pages were returned for specific queries.
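For anyone wanting to repeat the log file part of this, here is a minimal sketch of the kind of script that would do the job. It assumes the server writes the common "combined" access log format and identifies crawlers purely by user-agent substring; the names `LOG_LINE`, `CRAWLERS` and `crawled_paths` are made up for this example rather than taken from our own tooling.

```python
import re
from collections import defaultdict

# Pulls the request path and user-agent out of a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# User-agent substrings (lowercase) mapped to a label for each crawler.
CRAWLERS = {"googlebot": "google", "bingbot": "bing"}


def crawled_paths(lines):
    """Return {crawler: set of paths requested}.

    Paths are kept exactly as requested, so /page, /page/ and /Page
    are counted separately.
    """
    seen = defaultdict(set)
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # not a GET/HEAD request or not in the assumed format
        agent = match.group("agent").lower()
        for token, name in CRAWLERS.items():
            if token in agent:
                seen[name].add(match.group("path"))
    return dict(seen)
```

Calling `crawled_paths(open("access.log"))` (the filename is a placeholder) gives one set of paths per crawler, which makes it easy to see whether a crawler requested one variant but never the other. User-agent strings can be spoofed, so a stricter version would also verify the requesting IP address.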
How do search engines deal with capital letters in the URL?
For our case-sensitive test, we found that Google handles it fairly well – it crawled both /casesensitive and /CASESENSITIVE and indexed both. For a search on a snippet of unique text (in quotes), it returned the correct page. All of that suggests that Google treats URLs like /casesensitive and /CASESENSITIVE as completely separate resources.
There is one unusual thing, however – a site: search brings up the lowercase URL, but the uppercase URL is filtered out for being too similar to the other displayed URLs and isn’t shown unless the ‘repeat the search with the omitted results included’ link is clicked.
With Bing, however, it was a different story. Monitoring the server logs showed that Bingbot only crawled the lowercase version of the URL. As a result, only the lowercase version is indexed and returned in search results – searches for snippets of text that only appear on the uppercase version return nothing.
How do search engines deal with trailing slashes?
Interestingly enough, we found the same results with trailing slashes as we did with the capitalised URLs. Google crawled both URLs – with the slash and without – and indexed both. It also returns the correct page when unique snippets of text are searched. It has the same issue in a site: search, though: one version is filtered out unless ‘repeat the search with the omitted results included’ is clicked, which suggests that Google probably expects the pages to be duplicates (even though they aren’t).
And again, Bing dealt with the trailing slash and non-trailing slash URLs in the same way as it did with the capitalised URLs test. We found that it only crawled the non-trailing slash URL – the log files show that it never at any point crawled /slash/. As a result, it has only indexed and only returns the non-trailing slash URL.
So what have we learnt?
It looks like Google – for the most part – does treat pairs of URLs like /page and /page/, or /page and /Page, as if they were unique, at least as far as crawling and indexing them goes.
Bing, however, when presented with those options, will only crawl one. Our theory is that Bing expects those URLs to be duplicate and so dedupes them before it crawls – therefore it never discovers that they aren’t, in fact, duplicate.
The main lesson here is that you should always treat URLs like /page, /page/ and /Page as potential duplicate content: pick one canonical version and have the others 301 redirect to it. Google sees those URLs as completely separate resources, so your best bet is to view them in the same way and make sure only one of them actually serves the page.
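To make that concrete, here is a minimal sketch of the redirect decision. It assumes you have picked the lowercase, no-trailing-slash form as canonical, which is only one reasonable choice, and that nothing on the site relies on case-sensitive paths. `canonical_path` and `redirect_for` are hypothetical helper names, and wiring them into a particular server or framework is left out.

```python
def canonical_path(path):
    """Lowercase the path and drop any trailing slash (the root stays '/')."""
    canonical = path.lower().rstrip("/")
    return canonical or "/"


def redirect_for(path):
    """Return (301, target) if the requested path is not canonical, else None."""
    target = canonical_path(path)
    if target != path:
        return 301, target
    return None
```

With this rule, `redirect_for("/Page")` and `redirect_for("/page/")` both point at `/page`, while `redirect_for("/page")` returns `None` and the page is served as normal. Query strings are not handled here and would need to be carried over to the redirect target.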