Avoiding session IDs when spidering

Posted on April 20th, 2006 in GSA, Google Mini, Spidering by Paul

Many web sites use sessions to keep track of visitors as they browse around. If you do not have cookies turned on in your browser, the session ID may be sent through the URL instead so the site can still track you. This is very useful if the session is storing your shopping basket, but it can have drawbacks.

Unfortunately, session IDs in the URL can upset spidering. A Google Search Appliance or Mini will generally open several ‘connections’ to a web site when it is spidering, which is like having several independent people browsing the site at the same time. Each of these connections receives a different session ID, which makes the URLs look different to the spider. This in turn means each connection may spider pages that have already been covered. Also, if a session times out it may be replaced by a new session when the next page is spidered, which again means the spider will re-read pages it has already found. This is because this:

http://www.gsadeveloper.com/cars.php?phpsessid=6541231687865446

And this:

http://www.gsadeveloper.com/cars.php?phpsessid=6541231AQ7J865KLP

Look like different pages, even though they may turn out to have the same content. To avoid this happening, you can stop the spider reading pages which have session IDs in the URL. You can avoid the most common session IDs by adding these lines to the ‘Do Not Crawl URLs with the Following Patterns:’ section of ‘URLs to Crawl’:

contains:PHPSESSID
contains:ASPSESSIONID
contains:CFSESSION
contains:session-id
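
As a rough illustration of why the two URLs above count as different pages, and of what a ‘contains:’ pattern does to them, here is a minimal Python sketch. It is not how the appliance works internally. The function names are made up for this example, and the case-insensitive comparison is an assumption made so that the lower-case example URLs match; the post does not say how the appliance treats case.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Substrings copied from the do-not-crawl list above.
DO_NOT_CRAWL = ["PHPSESSID", "ASPSESSIONID", "CFSESSION", "session-id"]


def is_excluded(url):
    """Return True if the URL contains any of the do-not-crawl substrings.

    Assumption: compared case-insensitively so the lower-case example
    URLs match; the appliance itself may behave differently.
    """
    lowered = url.lower()
    return any(pattern.lower() in lowered for pattern in DO_NOT_CRAWL)


def strip_session(url, param="phpsessid"):
    """Drop one query parameter to show which page the URL really names."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() != param
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


a = "http://www.gsadeveloper.com/cars.php?phpsessid=6541231687865446"
b = "http://www.gsadeveloper.com/cars.php?phpsessid=6541231AQ7J865KLP"

print(a == b)                                # the spider sees two URLs
print(strip_session(a) == strip_session(b))  # but they name one page
print(is_excluded(a))                        # a contains: pattern skips it
```

Note that the exclusion skips the URL altogether. It does not rewrite it to the session-free form, so the page is only spidered if it can also be reached through a link without a session ID.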

The web sites you are spidering may still use other session IDs. It is worth checking with the site owner whether this is going to be a problem, and keeping an eye on the ‘Inflight URLs’ shown in ‘System Status’ - ‘Crawl Status’ when spidering a site for the first time. If the same URLs are turning up a lot, you may have a session problem. You’ll need to stop spidering the site and work out which bit of the URL you need to ignore; then you can add it to the do not crawl list like the examples above.
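
If you can copy a batch of the repeating URLs out of the crawl status page, a small script can help with working out which bit to ignore. The sketch below is only one possible approach: it assumes the session ID travels as a query parameter, and the function name and the ‘page=2’ URLs are invented for the example. It reports every parameter whose value changes while the rest of the URL stays the same.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def candidate_session_params(urls, min_values=2):
    """Return parameter names whose value varies while the rest of the URL is fixed."""
    # (URL without the parameter, parameter name) -> set of values seen
    seen = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        pairs = parse_qsl(parts.query, keep_blank_values=True)
        for i, (name, value) in enumerate(pairs):
            rest = tuple(pairs[:i] + pairs[i + 1:])
            key = (parts.scheme, parts.netloc, parts.path, rest, name)
            seen[key].add(value)
    return sorted(
        {key[-1] for key, values in seen.items() if len(values) >= min_values}
    )


# The first two URLs are the examples from this post; the page=2 ones are hypothetical.
urls = [
    "http://www.gsadeveloper.com/cars.php?phpsessid=6541231687865446",
    "http://www.gsadeveloper.com/cars.php?phpsessid=6541231AQ7J865KLP",
    "http://www.gsadeveloper.com/cars.php?page=2&phpsessid=6541231687865446",
    "http://www.gsadeveloper.com/cars.php?page=2&phpsessid=6541231AQ7J865KLP",
]

print(candidate_session_params(urls))
```

Treat the result as a list of suspects rather than an answer. A genuine parameter such as a page number will also be reported if the batch contains the same URL with several different values for it, so check each name before adding it to the do not crawl list.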

2 Responses to 'Avoiding session IDs when spidering'

  1. Ty C. said,

    on February 14th, 2007 at 11:37 pm

    What if the website requires the session ID? Is there a way to tell GSA to ignore a specific querystring parameter but still index the page? Otherwise the entire site will be ignored, won’t it?

  2. Paul said,

    on February 15th, 2007 at 11:07 am

    If you don’t exclude the session ID, it will spider the pages. I suggest you set the host load to 1, which might help it keep to the same session ID, as it then effectively uses only one connection to spider the site rather than 4 (the default host load).

    You can’t get it to ignore part of the URL; it only understands inclusion and exclusion based on parameters.

    If you have to have a session ID, then you may need to look at changing the CMS or whatever runs the site so it will always feed the same session ID to the Mini when it is spidering.

    Basically, spiders hate session IDs, so if you have to have one, you’re always going to be in a bit of trouble. Hmm… you could set up a page of links to every page on your site, all with a session ID appended to them, then set that page as the place the Mini starts spidering. That might then allow it access to the whole site without getting too confused. Unfortunately this is just guesswork from me; usually I deal with people who can get rid of the sessions, or places where we exclude them entirely.
