Archive for the ‘Google Mini’ Category

Google Mini XSS vulnerability

September 27, 2007 in Google Mini | Comments (0)

The MID series Google Minis have a cross-site scripting vulnerability. Google Enterprise has just released a patch for it in the support area (you’ll need your support login and password to get to it).

If you’re not sure what age Mini you have, there’s a test code on that page you can use to check your Mini. The MID series have ‘MID’ in their user agent when spidering, which might also help you check.

The M2 series, which has been on sale since last summer, and the Google Search Appliance, are not vulnerable to the problem.

If your Mini is public facing, you should patch it straight away. If you only use the XML feed and show results through other code, it’s up to you whether you patch it or not, as you’re less at risk of someone using it for nefarious means.

You can’t spider XML with a Google Mini (so far)

January 16, 2007 in Google Mini,GSA | Comments (1)

A question I’ve seen come up a lot which isn’t answered directly by my earlier post is whether the Google Mini or Search Appliance can spider raw XML. Unfortunately, no, it cannot.

The Mini / Search Appliance can read the XML, but it takes it in as straight text, so any searching you do will look at node names, attributes and content, rather than just content.

The best I can suggest is you have some scripting to run an XSL transform on your XML to turn it in to a small (or indeed large) site of web pages, then spider those with the appliance.
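As a rough illustration of the idea, here is a minimal sketch in Python. It uses plain ElementTree parsing in place of an XSL transform (Python's standard library has no XSLT engine), and the `<items>`, `<item>`, `<name>` and `<body>` names are made up for the example, so swap in whatever your own XML uses.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical source XML; the element names are assumptions for this sketch.
SAMPLE_XML = """<items>
  <item id="1"><name>Singing badgers</name><body>Notes on badgers &amp; song.</body></item>
  <item id="2"><name>Orange-spotted tapirs</name><body>Notes on tapirs.</body></item>
</items>"""


def xml_to_pages(xml_text):
    """Return a dict mapping a file name to the HTML for that page."""
    pages = {}
    for item in ET.fromstring(xml_text).findall("item"):
        # Escape the text so it is safe to drop into HTML.
        name = html.escape(item.findtext("name", default=""))
        body = html.escape(item.findtext("body", default=""))
        pages["item-%s.html" % item.get("id")] = (
            "<html><head><title>%s</title></head>"
            "<body><h1>%s</h1><p>%s</p></body></html>" % (name, name, body)
        )
    return pages


for file_name in sorted(xml_to_pages(SAMPLE_XML)):
    print(file_name)
```

Each page only contains the content, so the appliance indexes the text rather than the node names and attributes. You would still need to write the pages somewhere the appliance can spider them.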

Pages without titles get blank <T> nodes in XML

January 8, 2007 in Google Mini,GSA,XML API | Comments (1)

When using the XML API to your Google Mini or Search Appliance, if a page in the search results does not have a <title> in its HTML, then it does not have a ‘T’ node (in GSP/RES/R/T) in the XML returned for the search.

The XSLT controlling the look of the web frontend on the box automatically replaces the title with the URL of the page instead (with the http:// taken off the start.) With the XML API you can decide to replace it with anything you like, but this behaviour is certainly preferred by the clients I’ve had to set it up for. Best of all would be for all pages to have a title, but there could well be some that slip through testing (when there is testing) so it’s best to be prepared for it.
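If you are parsing the XML yourself, the fallback could look something like this minimal Python sketch. The sample XML is trimmed and made up, and it assumes the page URL sits in a ‘U’ node next to ‘T’, so check that against the XML your own box returns. It treats a missing ‘T’ node and an empty one the same way.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed sample of an XML API response.
# The U node (page URL) is an assumption; only GSP/RES/R/T is named in the post.
SAMPLE_XML = """<GSP>
  <RES>
    <R N="1"><U>http://www.example.com/about.html</U><T>About us</T></R>
    <R N="2"><U>http://www.example.com/untitled.html</U></R>
  </RES>
</GSP>"""


def display_title(result):
    """Return the title for one R node, falling back to the URL."""
    title = result.findtext("T")
    if title and title.strip():
        return title.strip()
    url = result.findtext("U", default="")
    # Mirror the frontend behaviour: take http:// off the start.
    if url.startswith("http://"):
        url = url[len("http://"):]
    return url


def titles(xml_text):
    """Return the display title for every result in the response."""
    root = ET.fromstring(xml_text)
    return [display_title(r) for r in root.findall("./RES/R")]


print(titles(SAMPLE_XML))
```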

Google Mini: Searching Subcollections from the frontend

in Google Mini | Comments (3)

If you are creating new subcollections on your Google Mini (v1), and you want to search them from a drop down menu on the frontend, you will need to refresh the frontend to do it. Sound confusing? Just follow these steps (the same goes for turning on the menu for searching subcollections in the first place):

  1. Login to the admin area and click on the collection name, then ‘Edit’
  2. Click ‘Configure Serving’ then ‘Output Format’
  3. Click the little arrow next to ‘Search Box’
  4. Click the tickbox next to ‘include a menu to search by subcollection’
  5. Click ‘Save Page Layout Code’

Note: This can take a few minutes to have an effect on the frontend, so if you don’t see a change to your search page immediately, just hang on and try again in a few minutes.

If you were just trying to turn on the menu of subcollections, that’s all you need to do. However, if you’re trying to get your new subcollection to show, you need to do these two steps again:

  1. Click the tickbox next to ‘include a menu to search by subcollection’
  2. Click ‘Save Page Layout Code’

That will cause the menu to update. You don’t need to wait for the frontend to refresh when doing this, just go through the first steps to switch the menu off, save it, then re-tick the box and save again.

Fixing an error like /export/hda3 … (No such file or directory)

September 29, 2006 in Google Mini,GSA,XML API | Comments (1)

While setting up a Google Mini search for a site recently, I kept getting an error message like this:

/export/hda3/4.3.105.M.6/local/conf/frontends/default_Frontend/domain_filter (No such file or directory)

This happened whenever I tried to use the XML API to get the results back. I tried various things and eventually contacted the host to make sure their setup wasn’t blocking anything. In the end I realised the Collection and Front End settings are case sensitive: I had mine in all lowercase as I’d been told, whereas the Mini was actually set up with a capitalised first letter. Once I’d matched the names exactly as they were in the Mini, the error stopped. You can also get the same error with the Google Search Appliance.

Simple, but so easy to get caught out by it!
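One way to avoid this is to build the search URL in a single place and never change the case of the names. This is a minimal Python sketch; the host name and the collection and front end names are made up, and the `site`, `client` and `output` parameter names are my assumption about the XML API request rather than something taken from this post, so check them against your own setup.

```python
from urllib.parse import urlencode


def build_search_url(host, query, collection, frontend):
    """Build an XML API search URL, keeping the case of the names as given."""
    params = {
        "q": query,
        "site": collection,      # collection name, exactly as set up on the box
        "client": frontend,      # front end name, exactly as set up on the box
        "output": "xml_no_dtd",  # assumed value for asking for XML results
    }
    return "http://%s/search?%s" % (host, urlencode(params))


# Hypothetical host and names; note the capitalised first letters are kept.
print(build_search_url("mini.example.com", "bees", "Default_collection", "Default_frontend"))
```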

Latest project: Course Searcher

September 15, 2006 in client work,Google Mini | Comments (0)

I’m happy to announce my latest Google Mini project: Course Searcher

It’s a search engine that covers educational courses for 14-19 year olds in Brighton and Hove (the area of the UK I live in). I’ve written up what I did for the project on my freelancing site.

We’ve used custom meta tags on the schools’ websites to allow an advanced search, and also to feed information in to a database powering the ‘Pathways’ system, which leads to customised searching on the schools’ courses.

I’d like to thank the team I worked with on the site, who made it an enjoyable project. Here’s hoping a lot of students find it useful for finding the courses they should be doing next to help their future career.

Ignoring specific content on a page

July 28, 2006 in Google Mini,GSA,Spidering | Comments (2)

If you want your Google Mini or Search Appliance to ignore part of your page, you can use some special tags to stop the content being indexed (and therefore brought back in the search results.)

Surround the content you want ignored with the following tags:

<!-- googleoff: index --> <!-- googleon: index -->

So if you have

<!-- googleoff: index --> I like bees <!-- googleon: index -->

On your page and you search for ‘bees’, it won’t come up, even if the page has been spidered. The only people who will find out about your love of buzzing insects will be those who have found the page through other means.

This can be useful for excluding parts of your page that the appliance might find confusing, for instance ‘H’ wants to exclude his breadcrumb trail.
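If you want a rough preview of what is left on a page once the ignored parts are taken out, a minimal Python sketch like this can help. It only approximates the behaviour with a regular expression, and the appliance’s own handling of the tags may differ, so treat it as a checking aid rather than a description of how the box works.

```python
import re

# Matches everything from a googleoff: index comment to the next googleon: index.
GOOGLEOFF_INDEX = re.compile(
    r"<!--\s*googleoff:\s*index\s*-->.*?<!--\s*googleon:\s*index\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def indexable_text(page_html):
    """Remove every googleoff/googleon index block from the page."""
    return GOOGLEOFF_INDEX.sub("", page_html)


page = (
    "<p>Hello</p>"
    "<!-- googleoff: index --> I like bees <!-- googleon: index -->"
    "<p>Bye</p>"
)
print(indexable_text(page))
```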

Custom meta tags in search results and full stops

May 4, 2006 in Google Mini,GSA,XML API | Comments (4)

When you’re using custom meta tags on your pages so you can serve up or search very specific information in your Google Mini or Search Appliance, it’s important to choose a meta name that will not conflict with other meta tags that might exist on your site or the public sites you are spidering (as pointed out to me yesterday by Nathan, the host of two Minis I’m working on currently).

You can put a little code of your own before or after your meta tag’s name to make it unique for your project. This is like ‘namespaces’ in programming – where you try to keep your variables separate from anything that might conflict with them and over-write them with different data. For instance the Dublin Core project puts ‘DC.’ in front of their names, so you know what standard it’s related to. So instead of…

<meta name="Publisher" content="Web Positioning Centre" />

you have:

<meta name="DC.Publisher" content="Web Positioning Centre" />

Letting you know they are working within Dublin Core standards, and it’s unlikely any page is already using a tag called ‘DC.Publisher’, whereas it could be using ‘Publisher’ on its own.

If you’re setting up your own meta tags for use with a Mini or GSA, do not use full stops, ‘.’, to separate your code from the general name. When you pull back the results, they use full stops to separate different tags that you want to bring back using the ‘getfields’ flag.

So if you wanted to bring back the information in ‘DC.Publisher’ with the rest of the search results data, it will actually try to bring back information from the meta tag named ‘DC’ and another tag named ‘Publisher’

To avoid this happening, use something else to separate your namespace code (your ‘DC’) from the rest of the name. It would be a good idea not to use anything that needs to be ‘URL escaped’, which pretty much limits you to the following: $-_+!*'() – personally I tend to use a hyphen, ‘-’, as it’s quite readable and is unlikely to cause problems in programming, unlike $ or ()

Your meta information with your namespace code could look something like this:

<meta name="gsad-site" content="Spidertest" />
<meta name="gsad-author" content="Web Positioning Centre" />
<meta name="gsad-image" content="http://www.spidertest.com/images/wpc-logo.gif" />

Then you can get back these fields in your search results by using:

&getfields=gsad-site.gsad-author.gsad-image

In the XML of your results, you will get these additional fields:

<MT N="gsad-site" V="Spidertest"/>
<MT N="gsad-author" V="Web Positioning Centre"/>
<MT N="gsad-image" V="http://www.spidertest.com/images/wpc-logo.gif"/>
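To tie the two halves together, here is a minimal Python sketch that builds the ‘getfields’ value and reads the ‘MT’ nodes back out. The function names are made up, and it assumes the ‘MT’ nodes sit directly inside each result’s ‘R’ node, which you should check against your own XML. It refuses any meta name containing a full stop, since that is the separator.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed single result from the XML API.
SAMPLE_RESULT = """<R N="1">
  <MT N="gsad-site" V="Spidertest"/>
  <MT N="gsad-author" V="Web Positioning Centre"/>
</R>"""


def getfields_param(names):
    """Join meta tag names with full stops for the getfields flag."""
    for name in names:
        if "." in name:
            # A full stop would be read as the gap between two tag names.
            raise ValueError("meta name %r contains a full stop" % name)
    return "&getfields=" + ".".join(names)


def meta_fields(result_xml):
    """Return the MT nodes of one result as a dict of name to value."""
    result = ET.fromstring(result_xml)
    return {mt.get("N"): mt.get("V") for mt in result.findall("MT")}


print(getfields_param(["gsad-site", "gsad-author", "gsad-image"]))
print(meta_fields(SAMPLE_RESULT))
```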

Crawling status in your Google Mini

in Google Mini,Spidering | Comments (0)

When you’re spidering (or ‘crawling’) using your Google Mini, you get a little box telling you its status – or at least you do on the old/v1 Mini, I haven’t tried a new Mini yet.

Within the box you get a list of:
Total Inflight URLs – these are links it has found, but not read the pages of yet.

Total Crawled URLs – pages that have been read

Locally Crawled URLs – this may come up when you’re re-spidering. It lists pages which have not changed since the last spidering, so it just reads its internal copy which saves reading the live pages again, saving time and bandwidth.

Excluded URLs – these are pages that have been excluded from being read for some reason, either via a robots.txt exclusion, or because of one of the rules you have set in the ‘Configure Crawl’ section.

All of these labels are links. If you click on them you get a list of the various documents in each section. Clicking on ‘Total Crawled URLs’ gives you a list of pages normally labelled ‘New Document’, whereas within ‘Locally Crawled URLs’ they will be labelled ‘Cached Version.’

What makes a ‘New Document’?

By default, the Google Mini will look at the ‘Last-Modified’ header of the page. If it is more recent than the last spidered version of the page that it has stored, it will read and index the live page.

If your pages are straight HTML, then Last-Modified will be when they were created on the server – so when they were uploaded by FTP in most cases. However, if your pages are dynamic, for instance PHP or ASP pages, and use information from a database or even include files, then their Last-Modified date will usually be the moment they were served up to the Mini, or anyone visiting your site. This means the Mini will read all of the pages, even if they haven’t changed since it last spidered them, as the web server is telling it each one is a recently changed page because of the date. This does not hurt your site in any way, but it means you will use up more time and bandwidth in your spidering as it could be reading pages which haven’t changed since they were last spidered.
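The comparison described above can be sketched in a few lines of Python. This is only an illustration of the date check, not the Mini’s actual code, and the function name and sample dates are made up.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def needs_recrawl(last_modified_header, last_crawled):
    """True if the page reports a change after the stored copy was made."""
    modified = parsedate_to_datetime(last_modified_header)
    return modified > last_crawled


# Hypothetical date of the last spidering.
last_crawled = datetime(2006, 4, 15, tzinfo=timezone.utc)

# A static page uploaded before the last spidering: the stored copy is reused.
print(needs_recrawl("Mon, 10 Apr 2006 09:00:00 GMT", last_crawled))

# A dynamic page stamped with the moment it was served: always read again.
print(needs_recrawl("Thu, 20 Apr 2006 09:00:00 GMT", last_crawled))
```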

Finding your pages’ ‘Last Modified’ date

If you use Firefox, you can install the Web Developer Toolbar for it. When you are looking at a page, click on ‘Information’ then ‘View Response Headers’ and you’ll see the ‘Last-Modified’ date being reported to your browser and any passing web spiders.

Avoiding session IDs when spidering

April 20, 2006 in Google Mini,GSA,Spidering | Comments (4)

Many web sites use sessions to keep track of visitors as they browse around. If you do not have cookies turned on in your browser, the session ID may be sent through the URL so the site can still track you. This is very useful if it’s storing your shopping basket information, but it can have drawbacks.

Unfortunately sessions in the URL can upset spidering. A Google Search Appliance or Mini will generally open several ‘connections’ to a web site when it is spidering, which is like having several independent people browsing the site at the same time. Each of these connections receives a different session ID, which makes the URLs look different to the spiders. This in turn means each connection may spider the same pages that have already been covered. Also, if the session times out it may be replaced by a new session when the next page is spidered, which means that again the spider will re-read pages it has already found. This is because this:

/cars.php?phpsessid=6541231687865446

And this:

/cars.php?phpsessid=6541231AQ7J865KLP

Look like different pages, even though they may turn out to have the same content. To avoid this happening, you can stop the spider reading pages which have session IDs in the URL. You can avoid the most common session IDs by adding these lines to the ‘Do Not Crawl URLs with the Following Patterns:’ section of ‘URLs to Crawl’:

contains:PHPSESSID
contains:ASPSESSIONID
contains:CFSESSION
contains:session-id

The web sites you are spidering may still contain session IDs. It is worth checking with the site owner whether this is going to be a problem, and keeping an eye on the ‘Inflight URLs’ shown in ‘System Status’ – ‘Crawl Status’ when spidering a site for the first time. If the same URLs are turning up a lot, you may have a session problem. You’ll need to stop spidering the site and work out which bit of the URL you need to ignore, then you can add it to the do not crawl list like the examples above.
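If you copy a list of inflight URLs out of the admin area, a minimal Python sketch like this can help spot pages that only differ by session ID. The marker list mirrors the patterns above, and matching them without regard to case is a choice made for this sketch, which may not be how the appliance’s own patterns behave.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Lower-cased versions of the patterns listed above.
SESSION_MARKERS = ("phpsessid", "aspsessionid", "cfsession", "session-id")


def strip_session(url):
    """Return the URL with any session-looking query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not any(marker in key.lower() for marker in SESSION_MARKERS)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def repeated_pages(urls):
    """Return pages that appear more than once once sessions are ignored."""
    counts = Counter(strip_session(u) for u in urls)
    return {url: n for url, n in counts.items() if n > 1}


inflight = [
    "/cars.php?phpsessid=6541231687865446",
    "/cars.php?phpsessid=6541231AQ7J865KLP",
    "/vans.php?page=2",
]
print(repeated_pages(inflight))
```

Any page that turns up in the result is a good candidate for a new do not crawl pattern.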

How to show different looking results based on subcollection

March 9, 2006 in Google Mini,XML API,XSL | Comments (3)

A common question is “How can I make the Google Mini show a different results page depending on the subcollection being used?” You might want to do this because you have a search covering an intranet and public area, or just two very different sites.

The Mini can only show one design of results on its own, i.e. through the normal results page it shows, based on the XSL you can set up in the ‘Configure Serving -> Output Format’ section of the admin area. However, as long as you have a scripting language on your web server (e.g. PHP, ASP, ColdFusion, Perl) you can use the XML interface to get the results back, then change the way they look in one of two ways:

  1. Trigger some XSL with the scripting language, making it look the way you want – you can use the admin area of the Mini to design the XSL if it helps
  2. Alternatively, use the scripting language to parse the XML, and display it how you want. This is probably more convenient if you do not know XSL and do not have time to learn it, but do know some programming. Doing this in ColdFusion (MX or 7), ASP (v3 or above) or PHP 5 is relatively straightforward if you already know some XML. It’s slightly harder in PHP 4, but still very possible with a little more effort

These give you the flexibility of being able to choose a method that best suits you or your developers, and gives you a lot of control over the look and feel of search results.
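As an illustration of the second approach, here is a minimal sketch, written in Python rather than the languages mentioned above. The subcollection names and layouts are made up, and it assumes each result carries its URL in a ‘U’ node and its title in a ‘T’ node, so adjust it to the XML your Mini actually returns.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical subcollection names, each with its own layout for one result.
TEMPLATES = {
    "intranet": '<li class="intranet"><a href="{url}">{title}</a> (staff only)</li>',
    "public": '<li class="public"><a href="{url}">{title}</a></li>',
}


def render_results(xml_text, subcollection):
    """Render the results with the layout chosen for this subcollection."""
    template = TEMPLATES.get(subcollection, TEMPLATES["public"])
    items = []
    for result in ET.fromstring(xml_text).findall("./RES/R"):
        url = result.findtext("U", default="")
        title = result.findtext("T") or url  # fall back to the URL
        items.append(template.format(url=html.escape(url), title=html.escape(title)))
    return "<ul>" + "".join(items) + "</ul>"


sample = '<GSP><RES><R N="1"><U>http://example.com/a</U><T>Page A</T></R></RES></GSP>'
print(render_results(sample, "intranet"))
```

Your script would pass in whichever subcollection the visitor searched, so the same XML can be dressed up differently for each one.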

Can you put new pages or applications on a Google Mini?

in Google Mini,Q&A | Comments (1)

I’ve had these questions myself, and been asked them by a few people: “How do I put my website pages on a Google Mini?” and “Can I install my own application on a Mini, e.g. my web app or webstats?”

The answer to both questions is: no, you cannot put your own web pages on the Mini (nor on the Google Search Appliance) and you can’t install your own applications either. They have a web based interface where you can change the look of their own pages somewhat, but beyond that you can’t do anything else. There’s no way of uploading your own pages on to the box, and there’s no way of installing your own applications.

Basically, these things are a plug-and-play box, they don’t like being fiddled with, and although that’s probably rather different from what most webmasters are used to, it does allow Google to be confident about how the box will work, and to avoid support costs covering people who install leaky web apps on to the server and then complain when it eventually crashes.

What the Google Mini is good at, and what it is not good at

March 7, 2006 in Google Mini,Q&A | Comments (4)

I’ve been talking to various people recently, both by e-mail and at the Mini Google Group, who want to use a Mini for things it really isn’t suited for, so hopefully this will be helpful:

What the Google Mini is good at

It is very good at searching through lots of unstructured information, and finding the document (page) that best matches what you are searching for.

So, it is good for searching all your old sales documents on your intranet, to find the reference you made to ‘Singing badgers’

Or it’s good at searching all the articles on your website, for all the ‘orange-spotted tapirs’ that you have written medical items on.

What the Google Mini is not good at

It is not good at comparing pages for specific pieces of information, or for sorting by anything except relevancy or date the page was made.

For instance, if you have a large database with hundreds of products in, and you want to compare the prices of three or four of them, the Mini (and GSA) cannot do this. You need to change your shop so you can run comparisons using your standard database information.

Although you can put price or other information in to the Mini search results by having special meta tags on your product pages, it cannot sort the search results by anything except the standard relevancy, or by the date the page was last changed and spidered. You cannot force it in to sorting by price, or any other detail. If you want this, you need to update the search on your current database so it will sort by the price field in the SQL query.

If your shop search is working very slowly, try looking at:

  • Setting up indexes in the database to cover the most searched on fields
  • Upgrading your database (i.e. if you’re using Access, look at moving to MySQL or Microsoft SQL Server.)
  • Talk to your host about either moving to a higher grade of server, so the database runs more quickly, or upgrading to have a separate database server which is good enough to handle the load your website is putting on it.
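For the database side, sorting by price and adding an index look something like this minimal sketch, which uses Python’s built-in SQLite so it runs anywhere. The table, column and product names are made up, and your own shop database will have its own schema and SQL dialect.

```python
import sqlite3

# Hypothetical product table, held in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("Badger CD", 9.99), ("Tapir poster", 4.50), ("Badger T-shirt", 14.00)],
)
# An index on a field the shop search uses heavily.
conn.execute("CREATE INDEX idx_products_price ON products (price)")


def search_by_price(term):
    """Find products whose name contains the term, cheapest first."""
    rows = conn.execute(
        "SELECT name, price FROM products WHERE name LIKE ? ORDER BY price",
        ("%" + term + "%",),
    )
    return rows.fetchall()


print(search_by_price("Badger"))
```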

The Mini is a good product, but it’s made for a particular set of circumstances. It would be a waste of money to buy it to do something it’s really not built to do, when you could use the same money elsewhere to get a proper solution.

Relevance in Mini and GSA searches

in Google Mini,GSA | Comments (8)

A question from Jim Westergren caused by looking at an oddity in Google PR reporting prompted me to look at the relevance rating in Google Mini search results.

For each search you do on the Mini/GSA, you get back a variety of information for each page. Not all of this is immediately obvious – there’s a last modified date which is often pretty useless, and also there’s a relevancy rating for each page. If you’re using the XML API, you’ll find the relevancy within <RK>.

Jim asked if the RK rating was always the same, or different for each page. I’ve done some checking, and it is different in the Mini, and the rating for each page depends on what you have searched for. For instance, on one search, a page I was following was rated as ‘5’, in another, it was ‘0’.

This means the RK value could be useful in a set of results. By default the results are ranked by relevancy, so whatever the box thinks is most relevant is already at the top and the value adds little. When sorting by date it becomes more useful, as you can use it to spot the most relevant page in whatever results you happen to be looking at.

It should not be seen as a version of PageRank (‘PR’) within the search appliances, because the value is not fixed across all searches.
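If you are sorting by date and want to pick out the most relevant page from the XML, a minimal Python sketch might look like this. The sample XML is made up and trimmed, and it assumes ‘RK’ and a ‘U’ node for the URL both sit inside each ‘R’ node.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed date-sorted results with their RK values.
SAMPLE_XML = """<GSP>
  <RES>
    <R N="1"><U>http://www.example.com/a.html</U><RK>0</RK></R>
    <R N="2"><U>http://www.example.com/b.html</U><RK>5</RK></R>
    <R N="3"><U>http://www.example.com/c.html</U><RK>2</RK></R>
  </RES>
</GSP>"""


def most_relevant(xml_text):
    """Return (url, rk) for the result with the highest RK value."""
    results = ET.fromstring(xml_text).findall("./RES/R")
    scored = [
        (r.findtext("U", default=""), int(r.findtext("RK", default="0")))
        for r in results
    ]
    return max(scored, key=lambda pair: pair[1]) if scored else None


print(most_relevant(SAMPLE_XML))
```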

New Google Minis available in the UK

February 1, 2006 in Google Mini | Comments (0)

As posted about earlier, there are new Google Minis that can handle 200,000 and 300,000 documents. I was happy to find out today that they can now be bought in the UK for just under £4,000 and £6,000 respectively.