Archive for the ‘XML API’ Category

Pages without titles get blank <T> nodes in XML

January 8, 2007 in Google Mini,GSA,XML API

When using the XML API on your Google Mini or Search Appliance, if a page in the search results does not have a <title> in its HTML, then it does not have a ‘T’ node (in GSP/RES/R/T) in the XML returned for the search.

The XSLT controlling the look of the web frontend on the box automatically replaces the title with the URL of the page instead (with the http:// taken off the start). With the XML API you can replace it with anything you like, but this behaviour is certainly preferred by the clients I’ve had to set it up for. Best of all would be for all pages to have a title, but some could well slip through testing (when there is testing), so it’s best to be prepared for it.
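As a rough sketch of that fallback, here is how it might look in Python using the standard library's ElementTree. The sample XML is made up for illustration, and I am assuming the result's URL sits in a ‘U’ node next to the ‘T’ node (GSP/RES/R/U); check that against the XML your own appliance returns.

```python
import xml.etree.ElementTree as ET

# Made-up sample response: the second result has no T node at all.
SAMPLE = """<GSP>
  <RES>
    <R N="1"><U>http://www.example.com/about.html</U><T>About us</T></R>
    <R N="2"><U>http://www.example.com/untitled.html</U></R>
  </RES>
</GSP>"""


def display_title(result):
    """Return the T text, or the URL without its scheme if T is missing or blank."""
    title = result.findtext("T")
    if title and title.strip():
        return title
    # Assumption: the result URL is in a U node alongside T.
    url = result.findtext("U", default="")
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            return url[len(scheme):]
    return url


def titles(xml_text):
    """Parse a results document and return one display title per result."""
    root = ET.fromstring(xml_text)  # root is the GSP element
    return [display_title(r) for r in root.findall("./RES/R")]
```

Calling titles(SAMPLE) gives the real title for the first result and the scheme-less URL for the second. The check also covers a ‘T’ node that is present but empty, so it works whichever way your appliance reports an untitled page.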

Fixing an error like /export/hda3 … (No such file or directory)

September 29, 2006 in Google Mini,GSA,XML API

While setting up a Google Mini search for a site recently, I kept getting an error message like this:

/export/hda3/4.3.105.M.6/local/conf/frontends/default_Frontend/domain_filter (No such file or directory)

This happened whenever I tried to use the XML API to get the results back. I tried various things and eventually contacted the host to make sure their setup wasn’t blocking anything. In the end I realised the Collection and Front End settings are case sensitive: I had mine in all-lowercase as I’d been told, whereas the Mini was actually set up with a capitalised first letter. Once I’d matched what they were in the Mini, the error stopped. You can also get the same error with the Google Search Appliance.

Simple, but so easy to get caught out by it!

Custom meta tags in search results and full stops

May 4, 2006 in Google Mini,GSA,XML API

When you’re using custom meta tags on your pages so you can serve up or search very specific information in your Google Mini or Search Appliance, it’s important to choose a meta name that will not conflict with other meta tags that might exist on your site or the public sites you are spidering (as pointed out to me yesterday by Nathan, the host of two Minis I’m working on currently).

You can put a little code of your own before or after your meta tag’s name to make it unique for your project. This is like ‘namespaces’ in programming – where you try to keep your variables separate from anything that might conflict with them and over-write them with different data. For instance the Dublin Core project puts ‘DC.’ in front of their names, so you know what standard it’s related to. So instead of…

<meta name="Publisher" content="Web Positioning Centre" />

you have:

<meta name="DC.Publisher" content="Web Positioning Centre" />

This lets you know they are working within Dublin Core standards, and it’s unlikely any page is already using a tag called ‘DC.Publisher’, whereas it could be using ‘Publisher’ on its own.

If you’re setting up your own meta tags for use with a Mini or GSA, do not use full stops, ‘.’, to separate your code from the general name. When you pull back the results, the appliance uses full stops to separate the different tags you want to bring back with the ‘getfields’ flag.

So if you wanted to bring back the information in ‘DC.Publisher’ with the rest of the search results data, it would actually try to bring back information from a meta tag named ‘DC’ and another tag named ‘Publisher’.

To avoid this happening, use something else to separate your namespace code (your ‘DC’) from the rest of the name. It would be a good idea not to use anything that needs to be URL escaped, which pretty much limits you to the following: $-_+!*'() Personally I tend to use a hyphen, ‘-’, as it’s quite readable and is unlikely to cause problems in programming, unlike $ or ().

Your meta information with your namespace code could look something like this:

<meta name="gsad-site" content="Spidertest" />
<meta name="gsad-author" content="Web Positioning Centre" />
<meta name="gsad-image" content="http://www.spidertest.com/images/wpc-logo.gif" />

Then you can get back these fields in your search results by using:

&getfields=gsad-site.gsad-author.gsad-image

In the XML of your results, you will get these additional fields:

<MT N="gsad-site" V="Spidertest"/>
<MT N="gsad-author" V="Web Positioning Centre"/>
<MT N="gsad-image" V="http://www.spidertest.com/images/wpc-logo.gif"/>
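If you are parsing the XML yourself, here is a small Python sketch of both halves: building the ‘getfields’ value from a list of meta names, and reading the MT nodes back into a dictionary. The function names are my own, and I am assuming the MT nodes sit directly inside each result (R) node, so check that against your own results.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Made-up single result, using the meta tags from the example above.
SAMPLE_RESULT = """<R N="1">
  <U>http://www.spidertest.com/</U>
  <MT N="gsad-site" V="Spidertest"/>
  <MT N="gsad-author" V="Web Positioning Centre"/>
  <MT N="gsad-image" V="http://www.spidertest.com/images/wpc-logo.gif"/>
</R>"""


def getfields_param(names):
    """Join meta tag names with full stops, refusing names that contain one."""
    for name in names:
        if "." in name:
            raise ValueError(f"meta name {name!r} contains a full stop")
    # quote() leaves letters, digits, hyphens and full stops untouched.
    return "getfields=" + quote(".".join(names), safe="")


def meta_fields(result):
    """Map each MT node's N attribute to its V attribute for one result."""
    return {mt.get("N"): mt.get("V") for mt in result.findall("MT")}
```

With the three names above, getfields_param() produces the same query fragment shown earlier, and meta_fields(ET.fromstring(SAMPLE_RESULT)) gives a dictionary keyed by ‘gsad-site’, ‘gsad-author’ and ‘gsad-image’. The ValueError is just a guard against the full stop problem described in this post.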

How to show different looking results based on subcollection

March 9, 2006 in Google Mini,XML API,XSL

A common question is “How can I make the Google Mini show a different results page depending on the subcollection being used?” You might want to do this because you have a search covering an intranet and public area, or just two very different sites.

The Mini can only show one design of results on its own, i.e. through the normal results page it shows, based on the XSL you can set up in the ‘Configure Serving -> Output Format’ section of the admin area. However, as long as you have a scripting language on your web server (e.g. PHP, ASP, ColdFusion, Perl) you can use the XML interface to get the results back, then change the way they look in one of two ways:

  1. Trigger some XSL with the scripting language, making it look the way you want – you can use the admin area of the Mini to design the XSL if it helps
  2. Alternatively, use the scripting language to parse the XML, and display it how you want. This is probably more convenient if you do not know XSL and do not have time to learn it, but do know some programming. Doing this in ColdFusion (MX or 7), ASP (v3 or above) or PHP 5 is relatively straightforward if you already know some XML. It’s slightly harder in PHP 4, but still very possible with a little more effort

These options let you choose the method that best suits you or your developers, and give you a lot of control over the look and feel of search results.
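To give a feel for the second option, here is a minimal Python sketch that parses the results XML and picks a different layout depending on the subcollection. The collection names (‘intranet’ and ‘public’), the CSS class names and the sample XML are all invented for illustration, and I am assuming each result has a ‘U’ node for its URL.

```python
import xml.etree.ElementTree as ET
from html import escape

# Made-up sample response with a single result.
SAMPLE = """<GSP>
  <RES>
    <R N="1"><U>http://www.example.com/</U><T>Home</T></R>
  </RES>
</GSP>"""


def render_intranet(results):
    """Hypothetical intranet layout: one table row per result."""
    rows = "".join(
        f'<tr><td><a href="{escape(url)}">{escape(title)}</a></td></tr>'
        for url, title in results
    )
    return f'<table class="intranet-results">{rows}</table>'


def render_public(results):
    """Hypothetical public layout: a plain list of links."""
    items = "".join(
        f'<li><a href="{escape(url)}">{escape(title)}</a></li>'
        for url, title in results
    )
    return f'<ul class="public-results">{items}</ul>'


# Map each subcollection name to the function that draws its results.
RENDERERS = {"intranet": render_intranet, "public": render_public}


def render(xml_text, collection):
    """Parse the results XML and draw it with the layout for this collection."""
    root = ET.fromstring(xml_text)
    results = []
    for r in root.findall("./RES/R"):
        url = r.findtext("U", default="")
        # escape() treats titles as plain text; real titles may carry markup.
        results.append((url, r.findtext("T") or url))
    # Unknown collections fall back to the public layout.
    return RENDERERS.get(collection, render_public)(results)
```

The collection name you pass to render() would be the same one your script sent to the appliance in the search request, so no extra lookup is needed.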

How do I access the XML from the Google Mini / GSA?

January 26, 2006 in Google Mini,GSA,Q&A,XML API

As well as the standard web interface, the Google Mini and Google Search Appliance have an XML interface which gives you a results set back in XML.

To access the XML, you use a scripting language to make an HTTP GET request to a particular URL:

For XML without a DTD:
http://www.miniaddress.com/search?q=searchphrase&output=xml_no_dtd&client=frontendname&site=collectionname

Where ‘www.miniaddress.com’ is the address of your search appliance (this can also be an IP address), ‘frontendname’ is the name of your front end, ‘collectionname’ is the name of your collection, and ‘searchphrase’ is what you are searching for.

If you want the DTD, change output=xml_no_dtd to output=xml

You can set lots of flags in the URL to do things like change the start number of the results set, or change the encoding of the results coming back to UTF-8 or Latin-1. You can look up the various flags in the GSA XML reference.
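Here is a small Python sketch of building that URL and fetching the results. The function names are my own. It keeps the front end (‘client’) and the collection (‘site’) as separate arguments, since they are separate settings on the appliance; pass the same value for both if yours share a name. Extra flags are passed straight through, so check each flag name against the XML reference before relying on it.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def search_url(host, query, collection, frontend, with_dtd=False, **flags):
    """Build the search URL; urlencode() escapes the search phrase for us."""
    params = {
        "q": query,
        "output": "xml" if with_dtd else "xml_no_dtd",
        "client": frontend,    # case sensitive on the appliance
        "site": collection,    # case sensitive on the appliance
    }
    params.update(flags)  # any extra flags, e.g. start=10
    return f"http://{host}/search?{urlencode(params)}"


def fetch_results(url, timeout=10):
    """Make the HTTP GET request and return the raw XML as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()
```

For example, search_url("www.miniaddress.com", "search phrase", "collectionname", "collectionname", start=10) builds the no-DTD URL shown above with the space in the phrase escaped and start=10 added on the end. Pass the result to fetch_results() to get the XML back.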