Crawling status in your Google Mini

May 4, 2006 in Google Mini, Spidering

When you’re spidering (or ‘crawling’) with your Google Mini, you get a little box showing its status. At least you do on the old (v1) Mini; I haven’t tried a new Mini yet.

Within the box you get a list of:

Total Inflight URLs – links it has found but whose pages it has not yet read.

Total Crawled URLs – pages that have been read.

Locally Crawled URLs – this may come up when you’re re-spidering. It lists pages which have not changed since the last spidering, so the Mini just reads its internal copy of each one instead of reading the live page again, saving time and bandwidth.

Excluded URLs – pages that have been excluded from being read, either because of a robots.txt exclusion or because of one of the rules you have set in the ‘Configure Crawl’ section.

All of these labels are links; click on one and you get a list of the documents in that section. Under ‘Total Crawled URLs’ the pages are normally labelled ‘New Document’, while under ‘Locally Crawled URLs’ they are labelled ‘Cached Version’.

What makes a ‘New Document’?

By default, the Google Mini will look at the ‘Last-Modified’ header of the page. If that date is more recent than the copy of the page it stored the last time it spidered, it will read and index the live page.
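To make that comparison concrete, here is a minimal Python sketch of the check as I understand it. This is not the Mini's actual code: the function name `needs_recrawl` is made up for illustration, and treating a missing or unreadable header as "changed" is my assumption rather than documented behaviour.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def needs_recrawl(last_modified_header, last_crawled):
    """Return True if the live page should be read and indexed again.

    last_modified_header: raw header text, e.g. "Thu, 04 May 2006 10:00:00 GMT"
    last_crawled: timezone-aware datetime of the stored copy
    """
    if not last_modified_header:
        # Assumption: with no header we cannot tell, so read the live page.
        return True
    try:
        modified = parsedate_to_datetime(last_modified_header)
    except (TypeError, ValueError):
        # Assumption: an unparseable date is treated like a missing one.
        return True
    if modified.tzinfo is None:
        # HTTP dates are GMT; make the value comparable with last_crawled.
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > last_crawled
```

With a stored copy from 1 May 2006, a page reporting 4 May would be read again, and one reporting 29 April would be served from the internal copy.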

If your pages are straight HTML, then Last-Modified will be when the file was last written on the server, which in most cases is when it was uploaded by FTP. However, if your pages are dynamic (PHP or ASP pages, for instance) and use information from a database or even from include files, then their Last-Modified date will usually be the moment they were served up to the Mini, or to anyone visiting your site. This means the Mini will read all of those pages, even if they haven’t changed since it last spidered them, because the date tells it that each page has only just changed. This does not hurt your site in any way, but your spidering will use more time and bandwidth than it needs to.
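The difference between the two cases can be sketched in a few lines of Python. Both function names are hypothetical, and this only imitates what a typical web server does; your own server may be configured differently.

```python
import os
import time
from email.utils import formatdate


def static_last_modified(path):
    """Header value for a plain file: the time the file was last written."""
    return formatdate(os.path.getmtime(path), usegmt=True)


def dynamic_last_modified():
    """Header value for a page built on request: the moment it is served."""
    return formatdate(time.time(), usegmt=True)
```

The static value stays the same until the file is uploaded again, so a spider can skip the page. The dynamic value moves forward on every request, so it is always later than the previous spidering.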

Finding your page’s ‘Last-Modified’ date

If you use Firefox, you can install the Web Developer Toolbar for it. When you are looking at a page, click on ‘Information’ then ‘View Response Headers’, and you’ll see the ‘Last-Modified’ date being reported to your browser and to any passing web spiders.
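If you would rather check from a script, something like the following Python sketch should do the same job using only the standard library. The function names are my own, and it assumes your server answers HEAD requests, which not every server does.

```python
import urllib.request
from email.utils import parsedate_to_datetime


def parse_last_modified(headers):
    """Return the Last-Modified header as a datetime, or None if missing or malformed."""
    raw = headers.get("Last-Modified")
    if not raw:
        return None
    try:
        return parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        return None


def fetch_last_modified(url, timeout=10):
    """Request only the headers of a page (HEAD) and read its Last-Modified date."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_last_modified(response.headers)
```

Call `fetch_last_modified()` with the address of one of your own pages. If it returns `None`, the page sent no usable Last-Modified header; if it returns the current time on every call, the page is being stamped as it is served.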
