Archive for May, 2006

Custom meta tags in search results and full stops

May 4, 2006 in Google Mini,GSA,XML API | Comments (4)

When you’re using custom meta tags on your pages so you can serve up or search very specific information in your Google Mini or Search Appliance, it’s important to choose a meta name that will not conflict with other meta tags that might exist on your site or the public sites you are spidering (as pointed out to me yesterday by Nathan, the host of two Minis I’m working on currently).

You can put a little code of your own before or after your meta tag’s name to make it unique for your project. This is like ‘namespaces’ in programming – where you try to keep your variables separate from anything that might conflict with them and over-write them with different data. For instance the Dublin Core project puts ‘DC.’ in front of their names, so you know what standard it’s related to. So instead of…

<meta name="Publisher" content="Web Positioning Centre" />

you have:

<meta name="DC.Publisher" content="Web Positioning Centre" />

This lets you know they are working within Dublin Core standards, and it’s unlikely any page is already using a tag called ‘DC.Publisher’, whereas it could be using ‘Publisher’ on its own.

If you’re setting up your own meta tags for use with a Mini or GSA, do not use full stops, ‘.’, to separate your code from the general name. When you pull back the results, the appliance uses full stops to separate the different tags that you want to bring back using the ‘getfields’ flag.

So if you wanted to bring back the information in ‘DC.Publisher’ with the rest of the search results data, the appliance would actually try to bring back information from a meta tag named ‘DC’ and another tag named ‘Publisher’.

To avoid this happening, use something else to separate your namespace code (your ‘DC’) from the rest of the name. It would be a good idea not to use anything that needs to be ‘URL escaped’, which pretty much limits you to the following: $-_+!*’(). Personally I tend to use a hyphen, ‘-’, as it’s quite readable and is unlikely to cause problems in programming, unlike $ or ().
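As a rough sketch of the naming rule above, here is a small Python check you could run over your planned meta names before you start spidering. The function names (is_safe_meta_name, split_getfields) are made up for this example and are not part of any Google tool, and the splitting simply mimics the behaviour described in this post.

```python
import string

# Punctuation from the list above that does not need URL escaping.
# The full stop is deliberately left out, because 'getfields' uses it
# as the separator between tag names.
SAFE_PUNCTUATION = "$-_+!*'()"
SAFE_CHARS = set(string.ascii_letters + string.digits + SAFE_PUNCTUATION)


def is_safe_meta_name(name):
    """Return True if the name is non-empty and uses only safe characters."""
    return bool(name) and all(ch in SAFE_CHARS for ch in name)


def split_getfields(value):
    """Split a getfields value on full stops, as the post describes."""
    return value.split(".")
```

Passing ‘DC.Publisher’ to split_getfields gives two separate names, which is exactly the problem, while ‘gsad-site’ passes the is_safe_meta_name check.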

Your meta information with your namespace code could look something like this:

<meta name="gsad-site" content="Spidertest" />
<meta name="gsad-author" content="Web Positioning Centre" />
<meta name="gsad-image" content="http://www.spidertest.com/images/wpc-logo.gif" />

Then you can get back these fields in your search results by using:

&getfields=gsad-site.gsad-author.gsad-image

In the XML of your results, you will get these additional fields:

<MT N="gsad-site" V="Spidertest"/>
<MT N="gsad-author" V="Web Positioning Centre"/>
<MT N="gsad-image" V="http://www.spidertest.com/images/wpc-logo.gif"/>

Crawling status in your Google Mini

in Google Mini,Spidering | Comments (0)

When you’re spidering (or ‘crawling’) using your Google Mini, you get a little box telling you its status – or at least you do on the old/v1 Mini, I haven’t tried a new Mini yet.

Within the box you get a list of:
Total Inflight URLs – these are links it has found, but not read the pages of yet.

Total Crawled URLs – pages that have been read

Locally Crawled URLs – this may come up when you’re re-spidering. It lists pages which have not changed since the last spidering, so the Mini just reads its internal copy instead of reading the live pages again, saving time and bandwidth.

Excluded URLs – these are pages that have been excluded from being read for some reason, either via a robots.txt exclusion, or because of one of the rules you have set in the ‘Configure Crawl’ section.

All of these labels are links. If you click on them you get a list of the various documents in each section. Clicking on ‘Total Crawled URLs’ gives you a list of pages normally labelled ‘New Document’. Within ‘Locally Crawled URLs’ they will be labelled ‘Cached Version’.

What makes a ‘New Document’?

By default, the Google Mini will look at the ‘Last-Modified’ header of the page. If it is more recent than the last spidered version of the page that it has stored, it will read and index the live page.

If your pages are straight HTML, then Last-Modified will be when they were created on the server – so when they were uploaded by FTP in most cases. However, if your pages are dynamic, for instance PHP or ASP pages, and use information from a database or even include files, then their Last-Modified date will usually be the moment they were served up to the Mini, or anyone visiting your site. This means the Mini will read all of the pages, even if they haven’t changed since it last spidered them, because the date from the web server tells it each page has recently changed. This does not hurt your site in any way, but it means you will use up more time and bandwidth in your spidering, as the Mini could be reading pages which haven’t changed since they were last spidered.
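To make the comparison concrete, here is a small Python sketch of the kind of check described above. It is only an illustration of the idea and not how the Mini is actually implemented; the needs_recrawl name is hypothetical, and treating a missing or unreadable header as ‘changed’ is an assumption of this sketch.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def needs_recrawl(last_modified_header, last_crawled):
    """Return True if the live page should be read again.

    last_modified_header is the raw 'Last-Modified' value (or None).
    last_crawled is a timezone-aware datetime of the previous crawl.
    """
    if not last_modified_header:
        # Assumption: with no header there is nothing to compare against,
        # so treat the page as changed.
        return True
    try:
        modified = parsedate_to_datetime(last_modified_header)
    except (TypeError, ValueError):
        # Unparseable date: also treated as changed in this sketch.
        return True
    if modified.tzinfo is None:
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > last_crawled
```

A dynamic page that stamps every response with the current time will always come out as ‘changed’ here, which is the wasted time and bandwidth described above.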

Finding your page’s ‘Last-Modified’ date

If you use Firefox, you can install the Web Developer Toolbar for it. When you are looking at a page, click on ‘Information’ then ‘View Response Headers’ and you’ll see the ‘Last-Modified’ date being reported to your browser and any passing web spiders.
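If you would rather check from a script than from the browser, a HEAD request returns the same response headers without downloading the page body. This is a minimal sketch using Python’s standard library; the function names are my own, and the example address is just the test site used earlier in this archive.

```python
from urllib.request import Request, urlopen


def build_head_request(url):
    """Create a request that asks for the headers only."""
    return Request(url, method="HEAD")


def fetch_last_modified(url, timeout=10):
    """Return the 'Last-Modified' header for a URL, or None if it is absent."""
    with urlopen(build_head_request(url), timeout=timeout) as response:
        return response.headers.get("Last-Modified")
```

Calling fetch_last_modified("http://www.spidertest.com/") would give you the date string the server reports, or None if the server does not send the header at all.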