Avoiding session IDs when spidering

April 20, 2006 in Google Mini, GSA, Spidering

Many web sites use sessions to keep track of visitors as they browse around. If you do not have cookies turned on in your browser, the session ID may be passed in the URL instead so the site can still track you. This is very useful if the session is storing your shopping basket information, but it can have drawbacks.

Unfortunately, session IDs in the URL can upset spidering. A Google Search Appliance or Mini will generally open several ‘connections’ to a web site when it is spidering, which is like having several independent people browsing the site at the same time. Each of these connections receives a different session ID, which makes the URLs look different to the spider. This in turn means each connection may spider pages that have already been covered by another. Also, if a session times out it may be replaced by a new session when the next page is spidered, which again means the spider will re-read pages it has already found. This is because this:

http://www.gsadeveloper.com/cars.php?phpsessid=6541231687865446

And this:

http://www.gsadeveloper.com/cars.php?phpsessid=6541231AQ7J865KLP

look like different pages, even though they may turn out to have the same content. To avoid this, you can stop the spider reading pages which have session IDs in the URL. You can exclude the most common session IDs by adding these lines to the ‘Do Not Crawl URLs with the Following Patterns:’ section of ‘URLs to Crawl’:

contains:PHPSESSID
contains:ASPSESSIONID
contains:CFSESSION
contains:session-id
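
As a rough illustration of what those four lines do, here is a minimal Python sketch. It assumes that a 'contains:' line is a plain, case-sensitive substring test against the whole URL. The names DO_NOT_CRAWL and is_excluded are made up for the example, and this is not how the appliance itself is implemented.

```python
# Hypothetical stand-in for the 'Do Not Crawl' list above.
# Assumption: 'contains:X' means "the URL contains the text X", case-sensitively.
DO_NOT_CRAWL = ["PHPSESSID", "ASPSESSIONID", "CFSESSION", "session-id"]


def is_excluded(url, patterns=DO_NOT_CRAWL):
    """Return True if any pattern appears anywhere in the URL."""
    return any(pattern in url for pattern in patterns)


if __name__ == "__main__":
    for url in [
        "http://www.gsadeveloper.com/cars.php?PHPSESSID=6541231687865446",
        "http://www.gsadeveloper.com/cars.php",
    ]:
        print(url, "excluded" if is_excluded(url) else "crawled")
```

If matching really is case-sensitive, the pattern has to use the same case as the site does, so a site that writes the parameter as lowercase phpsessid would need its own line in the list.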

The web sites you are spidering may still use other session IDs. It is worth checking with the site owner whether this is going to be a problem, and keeping an eye on the ‘Inflight URLs’ shown in ‘System Status’ – ‘Crawl Status’ when spidering a site for the first time. If the same URLs are turning up a lot, you may have a session problem. You’ll need to stop spidering the site and work out which bit of the URL you need to ignore, then add it to the do not crawl list like the examples above.
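
If you have copied a batch of suspicious URLs out of the crawl status page, a small script can help with the "work out which bit of the URL" step. The sketch below is only a heuristic, and the function name varying_params is hypothetical: it looks for query parameters that take more than one value on what is otherwise the same URL. A genuine content parameter (a product ID, say) will show up too, so treat the result as a list of candidates to check by hand.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def varying_params(urls):
    """Map parameter name -> largest number of distinct values seen for it
    on URLs that are identical apart from that one parameter."""
    values = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        params = parse_qsl(parts.query, keep_blank_values=True)
        for name, value in params:
            # Everything about the URL except the parameter being examined.
            rest = tuple(sorted(p for p in params if p[0] != name))
            values[(parts.netloc, parts.path, rest, name)].add(value)

    counts = {}
    for (_, _, _, name), seen in values.items():
        if len(seen) > 1:
            counts[name] = max(counts.get(name, 0), len(seen))
    return counts


if __name__ == "__main__":
    sample = [
        "http://www.gsadeveloper.com/cars.php?phpsessid=6541231687865446",
        "http://www.gsadeveloper.com/cars.php?phpsessid=6541231AQ7J865KLP",
    ]
    print(varying_params(sample))
```

Any parameter name it reports that looks like a session ID is a candidate for a new 'contains:' line.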

Comments (4)

  1. Comment by Ty C. — February 14, 2007 @ 11:37 pm

    What if the website requires the session ID? Is there a way to tell GSA to ignore a specific querystring parameter but still index the page? Otherwise the entire site will be ignored, won’t it?

  2. Comment by Paul — February 15, 2007 @ 11:07 am

    If you don’t exclude the session ID, it will spider the pages. I suggest you set the host load to be 1, which might help it keep to the same session ID as it effectively only sends one connection to spider the site rather than 4 (the default host load.)

    You can’t get it to ignore part of the URL; it only understands inclusion and exclusion based on parameters.

    If you have to have a session ID, then you may need to look at changing the CMS or whatever runs the site so it will always feed the same session ID to the Mini when it is spidering.

    Basically, spiders hate session IDs, so if you have to have one, you’re always going to be in a bit of trouble. Hmm… you could set up a page of links to every page on your site, all with a session ID appended to them, then set that page as the place the Mini starts spidering. That might then allow it access to the whole site without getting too confused. Unfortunately this is just guesswork from me. Usually I deal with people who can get rid of the sessions, or places where we exclude them entirely.

  3. Comment by Manu Garg — March 14, 2014 @ 4:23 am

    Thanks for the info, this was very helpful. My question is: how does the GSA create and manage sessions when it crawls URLs? In other words, the stuff explained above deals with the sessions in the destination URLs. I’ve seen that the GSA also creates a new sessionId for each URL, even though all the URLs fall under a specific set of patterns. Isn’t that an overhead on the spider as well? Wouldn’t it be more convenient for the search engine to crawl all the URLs with the same sessionId (dotcomsid) for a single request, irrespective of the number of URLs to be crawled?

  4. Comment by Paul — March 14, 2014 @ 3:00 pm

    Hi Manu, I haven’t used the latest version of the GSA software, but in the versions I have used the GSA never sets the sessionId itself; it has always been created by the server it is spidering.

    If the sessionId that has been assigned to the spider is not automatically put into the links on the page it is crawling, then when the GSA spiders each of those links your server could be giving it a new sessionId every time, because the request would not carry the sessionId that was originally assigned to the GSA spider.

    So what I’m saying is that the way your pages are coded is most likely the reason you’re seeing lots of different sessionIds; it’s not the GSA creating new sessionIds deliberately.

    It would be a lot easier if spiders would keep the same sessionId across a site, but then it’s up to us as developers to make our sites work that way, or just not set sessionIds where they’re not necessary. As spiders can’t tell what is changed on the site in reaction to their session, they are always going to prefer to spider without having a session set, as that’s the most likely state for a searcher to turn up at the site in, whether they use a GSA or other search engine to find the page.

    All that said, it’s still very annoying to have to code around!
