Archive for the ‘GSA’ Category

When your GSA license runs out

March 27, 2008 in GSA | Comments (3)

In case you’ve wondered, when your Google Search Appliance license runs out, you’ll receive this message in the admin area:

The license has expired. You are in the grace period. The software will stop crawling, indexing, and serving in 14 days. Please contact Google to extend your license.

The GSA will keep working normally for two weeks, giving you a chance to get a new license from Google Enterprise to continue its service, or to finalise plans for moving away from it if you're not going to renew.

Please note: the Google Mini does not have a license that runs out. Once the warranty period is over, it'll keep running for as long as the hardware holds up.

You can’t spider XML with a Google Mini (so far)

January 16, 2007 in Google Mini,GSA | Comments (1)

A question I've seen come up a lot, and one which isn't answered directly by my earlier post, is whether the Google Mini or Search Appliance can spider raw XML. Unfortunately, no, it cannot.

The Mini / Search Appliance can read the XML, but it takes it in as plain text, so any searching you do will match against node names and attributes as well as the content itself.

The best I can suggest is to use some scripting to run an XSL transform on your XML, turning it into a small (or indeed large) site of web pages, then spider those with the appliance.
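As a minimal sketch of that approach in Python, assuming the lxml library is installed (the file names records.xml, records-to-html.xsl and records.html are hypothetical):

# Sketch: turn raw XML into a spiderable HTML page via an XSL transform.
# Assumes lxml is installed; all file names here are hypothetical.
from lxml import etree

source = etree.parse("records.xml")                         # the raw XML you want searchable
transform = etree.XSLT(etree.parse("records-to-html.xsl"))  # your stylesheet

# Apply the transform and write out an HTML page for the appliance to spider.
html = transform(source)
with open("records.html", "wb") as f:
    f.write(etree.tostring(html, method="html", pretty_print=True))

In practice you would loop this over your XML records to generate one page per record (or per sensible group), then point the appliance at the generated pages.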

Pages without titles get blank <T> nodes in XML

January 8, 2007 in Google Mini,GSA,XML API | Comments (1)

When using the XML API to your Google Mini or Search Appliance, if a page in the search results does not have a <title> in its HTML, then it does not have a 'T' node (in GSP/RES/R/T) in the XML returned for the search.

The XSLT controlling the look of the web frontend on the box automatically replaces the missing title with the URL of the page (with the http:// taken off the start). With the XML API you can decide to replace it with anything you like, but the URL fallback is certainly what the clients I've set it up for have preferred. Best of all would be for every page to have a title, but some could well slip through testing (when there is testing), so it's best to be prepared for it.
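A minimal sketch of that fallback in Python, assuming the GSP/RES/R layout with <U> holding each result's URL (results.xml is a hypothetical saved response):

# Sketch: fall back to the URL when a result has no <T> (title) node.
# Assumes the GSA/Mini XML layout of GSP/RES/R, with <U> holding the URL.
import xml.etree.ElementTree as ET

tree = ET.parse("results.xml")      # a saved response from the appliance
for result in tree.findall("RES/R"):
    url = result.findtext("U", default="")
    title = result.findtext("T")
    if not title:
        # Mimic the built-in XSLT: show the URL with the leading http:// removed.
        title = url.replace("http://", "", 1)
    print(title, "-", url)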

Fixing an error like /export/hda3 … (No such file or directory)

September 29, 2006 in Google Mini,GSA,XML API | Comments (1)

While setting up a Google Mini search for a site recently, I kept getting an error message like this:

/export/hda3/4.3.105.M.6/local/conf/frontends/default_Frontend/domain_filter (No such file or directory)

This appeared when I was trying to use the XML API to get the results back. I tried various things, and eventually contacted the host to make sure their setup wasn't blocking anything. In the end I realised the Collection and Front End settings are case sensitive: I had mine in all-lowercase, as I'd been told, whereas the Mini was actually set up with a capitalised first letter. Once I matched what was in the Mini, the error stopped. You can get the same error with the Google Search Appliance.

Simple, but so easy to get caught out by!
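A hypothetical example: if the admin area shows the collection as 'Default_collection' and the front end as 'Default_frontend', your request parameters must match that casing exactly:

&client=Default_frontend&site=Default_collection

Sending &client=default_frontend&site=default_collection instead can produce the error above.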

Ignoring specific content on a page

July 28, 2006 in Google Mini,GSA,Spidering | Comments (2)

If you want your Google Mini or Search Appliance to ignore part of your page, you can use some special tags to stop the content being indexed (and therefore brought back in the search results).

Surround the content you want ignored with the following tags:

<!-- googleoff: index --> <!-- googleon: index -->

So if you have

<!-- googleoff: index --> I like bees <!-- googleon: index -->

on your page and you search for 'bees', the page won't come up, even though it has been spidered. The only people who will find out about your love of buzzing insects will be those who have found the page through other means.

This can be useful for excluding parts of your page that the appliance might find confusing, for instance, 'H' wants to exclude his breadcrumb trail, as in the example below.
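A hypothetical example of hiding a breadcrumb trail from the index (the markup is made up for illustration):

<!-- googleoff: index -->
<p class="breadcrumbs">Home &gt; Products &gt; Hives &gt; Bees</p>
<!-- googleon: index -->

The breadcrumb still renders for visitors; it just doesn't contribute any words to the page's index entry.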

Custom meta tags in search results and full stops

May 4, 2006 in Google Mini,GSA,XML API | Comments (4)

When you're using custom meta tags on your pages so you can serve up or search very specific information in your Google Mini or Search Appliance, it's important to choose a meta name that will not conflict with other meta tags that might exist on your site or on the public sites you are spidering (as pointed out to me yesterday by Nathan, the host of two Minis I'm working on currently).

You can put a little code of your own before or after your meta tag's name to make it unique to your project. This is like 'namespaces' in programming, where you try to keep your variables separate from anything that might conflict with them and overwrite them with different data. For instance, the Dublin Core project puts 'DC.' in front of their names, so you know what standard they relate to. So instead of…

<meta name="Publisher" content="Web Positioning Centre" />

you have:

<meta name="DC.Publisher" content="Web Positioning Centre" />

This lets you know they are working within the Dublin Core standard, and it's unlikely any page is already using a tag called 'DC.Publisher', whereas it could well be using 'Publisher' on its own.

If you're setting up your own meta tags for use with a Mini or GSA, do not use full stops, '.', to separate your code from the general name. When you pull back the results, the appliance uses full stops to separate the different tags you want returned via the 'getfields' flag.

So if you wanted to bring back the information in 'DC.Publisher' with the rest of the search results data, the appliance would actually try to bring back information from a meta tag named 'DC' and another tag named 'Publisher'.

To avoid this happening, use something else to separate your namespace code (your 'DC') from the rest of the name. It would be a good idea not to use anything that needs to be URL-escaped, which pretty much limits you to the following: $-_+!*'(). Personally I tend to use a hyphen, '-', as it's quite readable and unlikely to cause problems in programming, unlike characters such as '$' or '()'.

Your meta information with your namespace code could look something like this:

<meta name="gsad-site" content="Spidertest" />
<meta name="gsad-author" content="Web Positioning Centre" />
<meta name="gsad-image" content="http://www.spidertest.com/images/wpc-logo.gif" />

Then you can get back these fields in your search results by using:

&getfields=gsad-site.gsad-author.gsad-image

In the XML of your results, you will get these additional fields:

<MT N="gsad-site" V="Spidertest"/>
<MT N="gsad-author" V="Web Positioning Centre"/>
<MT N="gsad-image" V="http://www.spidertest.com/images/wpc-logo.gif"/>
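A minimal sketch of collecting those fields in Python, assuming the MT layout above (results.xml is a hypothetical saved response):

# Sketch: gather the custom meta fields returned for each result.
# Assumes the GSP/RES/R layout with <MT N="..." V="..."/> nodes as above.
import xml.etree.ElementTree as ET

tree = ET.parse("results.xml")
for result in tree.findall("RES/R"):
    meta = {mt.get("N"): mt.get("V") for mt in result.findall("MT")}
    print(meta.get("gsad-site"), meta.get("gsad-author"), meta.get("gsad-image"))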

Avoiding session IDs when spidering

April 20, 2006 in Google Mini,GSA,Spidering | Comments (4)

Many web sites use sessions to keep track of visitors as they browse around. If you do not have cookies turned on in your browser, the session ID may be sent through the URL so the site can still track you. This is very useful if it's storing your shopping basket information, but it can have drawbacks.

Unfortunately, sessions in the URL can upset spidering. A Google Search Appliance or Mini will generally open up several 'connections' to a web site when it is spidering, which is like having several independent people browsing the site at the same time. Each of these connections receives a different session ID, which makes the URLs look different to the spider. This in turn means each connection may spider the same pages that have already been covered. Also, if the session times out, it may be replaced by a new session when the next page is spidered, which again means the spider will re-read pages it has already found. This is because this:

/cars.php?phpsessid=6541231687865446

And this:

/cars.php?phpsessid=6541231AQ7J865KLP

look like different pages, even though they may turn out to have the same content. To avoid this happening, you can stop the spider reading pages which have session IDs in the URL. You can catch the most common session IDs by adding these lines to the 'Do Not Crawl URLs with the Following Patterns:' section of 'URLs to Crawl':

contains:PHPSESSID
contains:ASPSESSIONID
contains:CFSESSION
contains:session-id

The web sites you are spidering may still contain other session IDs. It is worth checking with the site owner whether this is going to be a problem, and keeping an eye on the 'Inflight URLs' shown under 'System Status' – 'Crawl Status' when spidering a site for the first time. If the same URLs are turning up a lot, you may have a session problem. You'll need to stop spidering the site and work out which part of the URL you need to ignore; then you can add it to the do-not-crawl list like the examples above.
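For example, Java-based sites often pass the session in a 'jsessionid' parameter (an assumption about the site in question, so check your own URLs first), which you could block with:

contains:jsessionid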

Relevance in Mini and GSA searches

March 7, 2006 in Google Mini,GSA | Comments (8)

A question from Jim Westergren, prompted by an oddity in Google PR reporting, led me to look at the relevance rating in Google Mini search results.

For each search you do on the Mini/GSA, you get back a variety of information for each page. Not all of it is immediately obvious: there's a last-modified date, which is often pretty useless, and there's also a relevancy rating for each page. If you're using the XML API, you'll find the relevancy within <RK>.

Jim asked if the RK rating was always the same, or different for each page. I've done some checking, and it is different in the Mini: the rating for each page depends on what you have searched for. For instance, in one search a page I was following was rated '5'; in another, it was '0'.
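For illustration, a trimmed-down result node might look something like this (a hypothetical example based on the GSP/RES/R layout):

<R N="1">
  <U>http://www.spidertest.com/about.html</U>
  <T>About Spidertest</T>
  <RK>5</RK>
</R>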

This means the RK value could be useful within a set of results, although since the results are ranked by relevancy by default (what the box thinks is most relevant is already at the top), it's probably not greatly useful there. It becomes more useful when sorting by date, as you can try to spot the most relevant page in whatever results you happen to be looking at.

It should not be seen as a version of PageRank (‘PR’) within the search appliances, because the value is not fixed across all searches.

How do I access the XML from the Google Mini / GSA?

January 26, 2006 in Google Mini,GSA,Q&A,XML API | Comments (8)

As well as the standard web interface, the Google Mini and Google Search Appliance have an XML interface which gives you a results set back in XML.

To access the XML, you use a scripting language to make an HTTP GET request to a particular URL:

For XML without a DTD:
http://www.miniaddress.com/search?q=searchphrase&output=xml_no_dtd&client=frontendname&site=collectionname

Where 'www.miniaddress.com' is the address of your search appliance (this can also be an IP address), 'collectionname' is the name of your collection, 'frontendname' is the name of the front end to use, and 'searchphrase' is what you are searching for.

If you want the DTD, change output=xml_no_dtd to output=xml

You can set lots of flags in the URL to do things like change the start number of the results set, or change the encoding of the results coming back to UTF-8 or Latin-1. You can look up the various flags in the GSA XML reference.
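A minimal sketch of such a request in Python, using only the standard library (the address, collection and front end names are the placeholders from above):

# Sketch: fetch search results as XML from a Mini / GSA.
# www.miniaddress.com, frontendname and collectionname are placeholders.
from urllib.parse import urlencode
from urllib.request import urlopen

params = urlencode({
    "q": "searchphrase",
    "output": "xml_no_dtd",
    "client": "frontendname",   # the front end to use
    "site": "collectionname",   # the collection to search
})
with urlopen("http://www.miniaddress.com/search?" + params) as response:
    print(response.read().decode("utf-8"))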

How to spider hidden content

January 25, 2006 in Google Mini,GSA,Q&A,Spidering | Comments (2)

This answers a question I’ve been asked in a couple of different ways…

Q: I have a page that uses many JavaScript popup windows for news articles; how can I index them?
Q: On a webserver I have a directory full of HTML pages. These pages are NOT listed anywhere. Can the Google Mini return search results where these HTML files will also be included?

If you have some pages that are usually off-limits to spiders, you can make sure your Search Appliance or Mini spiders them in a couple of ways:

1. Put the exact URL of each page to be spidered in the list of places to be spidered in the crawling admin – if you have many pages, this will become a maintenance problem.

2. Use a sitemap page which is not indexed by the appliance – the easy way to do it.

To do 2, make a list of all the pages you want spidered that are being missed out because there is no direct route to them for the spider – i.e. JavaScript is getting in the way, or the only route to them is blocked via robots.txt or something similar. This does not need to be a fancy page; it's just a list for the spider to see, and no person will ever need to look at it.

In the HTML of this page, between the <head> tags, put the following line:

<meta name="robots" content="noindex, follow" />

This means the spider will read the page and follow all links on it, but the page itself will not be indexed. If it isn’t indexed, it can’t be shown in the search results, so no-one can find it.
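A minimal sketch of such a sitemap page (the title, file names and URLs are hypothetical):

<html>
<head>
<title>Spider sitemap</title>
<meta name="robots" content="noindex, follow" />
</head>
<body>
<a href="/news/popup-article-1.html">News article 1</a>
<a href="/news/popup-article-2.html">News article 2</a>
</body>
</html>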

Now give the GSA or Mini the address of this sitemap page to crawl. Any time you add new pages to your site that are not getting spidered, you can add their address to the sitemap page.

NB: Google and the other big search engines also follow this 'robots' meta tag command. However, they will need a link to the sitemap page from another part of your site before they will spider the pages, and must find those pages interesting enough to keep in their index. So if you use this trick to expose pages to the public search engines, your sitemap will need to look prettier, as people might click through a link on your website to reach it.

Setting a unique user agent to help control spidering

January 23, 2006 in Google Mini,GSA,Spidering | Comments (1)

When your Google Mini or Google Search Appliance spiders websites, it identifies itself with a 'User Agent'. Practically everything that reads documents on the web has a User Agent; for instance, when I use Internet Explorer to view a web page, it sends this:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

And Firefox sends this:

Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8) Gecko/20051111 Firefox/1.5

When you first get a Mini or Search Appliance, it sends the following by default:

gsa-crawler (Enterprise; [code]; gsa-admin@test.com)

The [code] changes depending on your appliance – the Mini and GSA I use have different codes. The e-mail address is set to the admin person so you can be contacted if the spidering is going too fast. The bit we’re interested in here is the ‘gsa-crawler’.

You can reset this to something more relevant to your project:

In the Mini:
View/Edit the collection, go to Crawler Parameters, type in the new ‘User Agent Name’

In the Google Search Appliance:
Crawl and Index, then HTTP Headers, type in new name in ‘Enter User Agent Name’

When you’ve set it to something unique, you can use the robots.txt file of your website to control where it spiders.

So, say you have set your User Agent to ‘gsadeveloper-spider’, it will now be sending through:

gsadeveloper-spider (Enterprise; [code]; gsa-admin@test.com)

And in your robots.txt file you can exclude areas depending on this name. So on my main work site, Web Positioning Centre, I could set the robots.txt file to exclude the Labs area by putting the following in the file:

User-agent: gsadeveloper-spider
Disallow: /labs/

This will only block documents in the labs directory, and it will only affect my GSA spidering as 'gsadeveloper-spider'.

This can be useful if you want to block off areas of sites for particular projects. For instance, if you're working on a large site and you have the lowest-level Google Mini, you may want to stop irrelevant pages being spidered and using up some of the 100,000 document limit. But it only blocks them for you, not for any other GSAs spidering the site, and not for the big search engines like Google, Yahoo and MSN Search.

When blocking an area, you can make sure pages were actually blocked either by searching for words in those documents or by checking whether they were excluded in the crawling report. In the Mini you get to this through:

View/Edit the collection, go to System Status, then Serving Status, then 'Browse the hierarchy of URLs seen during crawl'. This will show you a list of the websites crawled. Click the number under 'Excluded URLs' for the site where you are blocking content. This gives you a list of URLs which were blocked; pages blocked using robots.txt will have "Crawled with empty body: Disallowed by robots." next to them. Note this only happens for pages which have direct links to them from a crawled part of the site: the report will not show other pages in the blocked section which can only be reached through it, as the spider won't have gone in to find the links to them.

Using 'disallow' in robots.txt stops anything in that directory being read, and it can be used to disallow single documents as well. You can find out more on the robotstxt site, which has an explanation of the standard and example code.
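For example, to block a single document rather than a whole directory (the path here is hypothetical):

User-agent: gsadeveloper-spider
Disallow: /labs/private-report.html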

How does the GSA / Google Mini count documents?

January 8, 2006 in Google Mini,GSA,Q&A,Spidering | Comments (0)

The Google Mini has a document limit of 100,000 documents, and the Google Search Appliance between 500,000 and 2 million, although you can put several GSAs together to raise that limit higher.

Each ‘document’ is a page or file which has a unique address, so

d:\Project Weevils\documents\Squared burrows.doc
http://intranet/
http://webpositioningcentre.co.uk/index.php?article=5435345
and http://webpositioningcentre.co.uk/index.php?article=2947

are all documents. To a developer, the last two URLs may look like the same page showing different content; to the GSA, that doesn't matter. They are unique URLs, so they are different documents and will be counted separately.

So, with the Mini you get to index 100,000 individual addresses. If you have a dynamic site which shows all its content via variables on one address, then each page served dynamically on a different variable is counted as a separate document.

A problem with this is that if you are using multiple variables in your URLs and your page / CMS can show the same content under different URLs, you can have the same content spidered several times under different addresses: one variety of 'spider trap'. If this might happen to your site, you should look at preventing it by giving the spider a path to one set of the content and ignoring the others using regular expressions. I'll try to write more on that soon.
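As a hypothetical example, assuming the appliance's 'regexp:' pattern syntax and a made-up 'sort' variable: if the same articles are reachable both with and without a sort order, you could add a 'Do Not Crawl' pattern so only the default view is spidered:

regexp:index\.php\?.*&sort=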

What ports does a Google Mini have?

in Google Mini,GSA,Q&A | Comments (0)

Although the Google Mini is a server, and is usually used through its web-based interface, it has the following ports available if you want to plug into it directly:

  • Power (standard kettle lead)
  • PS/2 ports for keyboard and mouse
  • 2 x USB ports
  • Printer
  • VGA
  • 9 pin serial
  • 2 x ethernet ports

[Image: the back of the Google Mini]