Many ecommerce sites put session IDs or user IDs in the URLs of their pages. This tends to cause one of two problems: either search engines like Google do not index the pages at all, or they index the same pages over and over, clogging the index with duplicates (a phenomenon called a “spider trap”). Furthermore, all these duplicates cause the site’s importance score, known as PageRank, to be spread out across them (a phenomenon called “PageRank dilution”).
Ironically, Googlebot regularly gets caught in a spider trap while spidering one of Google’s own sites: the Google Store (where they sell branded caps, shirts, umbrellas, etc.). The store’s URLs are not very search engine friendly: they are overly complex and include session IDs. This has resulted in 3,440 duplicate copies of the Accessories page and 3,420 copies of the Office page, for example.
If you have a dynamic, database-driven website and you want to avoid your own site becoming a spider trap, you’ll need to keep your URLs simple. Try to avoid having any ?, &, or = characters in the URLs. And try to keep the number of “parameters” to a minimum. With URLs and search engine friendliness, less is more.
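As a rough illustration of keeping URLs simple, here is a minimal Python sketch that removes session parameters from a URL before it is written into a page. The parameter names in SESSION_PARAMS and the example.com URL are assumptions made for the example, not taken from any particular site; substitute whatever your own application uses.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical session parameter names; adjust to match your own site.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_params(url: str) -> str:
    """Return the URL with any session-related query parameters removed."""
    parts = urlsplit(url)
    # Keep every query parameter whose name is not a known session parameter.
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


if __name__ == "__main__":
    print(strip_session_params("http://example.com/store/caps?sid=abc123&color=red"))
```

Every visitor, human or spider, then sees the same address for the same page, so there is only one copy for a search engine to index.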
There is a rule: never give a spider, including Google, a URL with the session ID included in it. I follow this rule carefully.
But to my surprise, I discover in my logs from time to time that Google somehow manages to grab some URLs that still contain session IDs. Probably some new or unknown spider visited my site, and Google then spidered that spider’s published content and picked up some of my URLs with the session ID included. At least, this is the only explanation I can come up with so far.
In conclusion: try as you might, if you don’t enforce a cookies-only policy (and I don’t), you will occasionally be confronted with Google grabbing URLs that contain the undesired session IDs. So my question is, what is the best practice when this happens:
Give the spider a 404 Page not found
or
get rid of the session ID and redirect the spider with a 302 Redirect to a URL that does not contain the session ID?
Or maybe there is an even better way?!
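To make the second option concrete, here is a minimal Python sketch of the decision I have in mind. It is only an illustration, not a claim about which option is better. The parameter name "sid", the user-agent tokens, and the function name redirect_target are all assumptions made up for this example; a real site would plug in its own values and send the 302 from its own web framework.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAM = "sid"  # hypothetical name of the session parameter
# Illustrative substrings only; real spider detection is less reliable than this.
SPIDER_TOKENS = ("googlebot", "bot", "spider", "crawler")


def redirect_target(url, user_agent):
    """Return the session-free URL to redirect a spider to, or None.

    None means no redirect is needed: either the visitor does not look
    like a spider, or the URL carries no session ID.
    """
    agent = user_agent.lower()
    if not any(token in agent for token in SPIDER_TOKENS):
        return None
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(name, value) for name, value in params if name != SESSION_PARAM]
    if len(kept) == len(params):
        return None  # nothing was removed, so the URL is already clean
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    agent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    target = redirect_target("http://example.com/item?id=7&sid=abc", agent)
    if target is not None:
        print("302 Redirect to", target)
```

When the function returns a URL, the server would answer with the 302 and that URL in the Location header; when it returns None, the page is served as usual.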