How to spider hidden content

January 25, 2006 in Google Mini, GSA, Q&A, Spidering | Comments (2)

This answers a question I’ve been asked in a couple of different ways…

Q: I have a page that uses many JavaScript popup windows for news articles. How can I index them?
Q: On a web server I have a directory full of HTML pages. These pages are NOT listed anywhere. Can the Google Mini return search results that include these HTML files?

If you have some pages that are usually off-limits to spiders, you can make sure your Search Appliance or Mini spiders them in a couple of ways:

1. Put the exact URL of each page to be spidered in the list of places to be spidered in the crawling admin – if you have many pages, this will become a maintenance problem.

2. Use a sitemap page which is not indexed by the appliance – the easy way to do it.

For option 2, create a page that links to every page you want spidered but which is being missed because the spider has no direct route to it, e.g. JavaScript is getting in the way, or the only route to it is blocked by robots.txt or something similar. This does not need to be a fancy page; it is just a list of links for the spider to follow, and no people will ever need to see it.

In the HTML of this page, between the <head> tags, put the following line:

<meta name="robots" content="noindex, follow" />

This means the spider will read the page and follow all links on it, but the page itself will not be indexed. If it isn’t indexed, it can’t be shown in the search results, so no-one can find it.

Now give the GSA or Mini the address of this sitemap page to crawl. Any time you add new pages to your site that are not getting spidered, you can add their address to the sitemap page.
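
As a rough illustration, a sitemap page along these lines might look something like the sketch below. The page title and the example.com URLs are made up for the example; swap in the addresses of the pages your spider is missing. The only part that matters to the spider is the robots meta tag in the head and the plain links in the body.

```html
<!DOCTYPE html>
<html>
<head>
  <!-- Hypothetical title; nobody is expected to see this page -->
  <title>Spider sitemap</title>
  <!-- Follow the links on this page, but keep the page itself out of the index -->
  <meta name="robots" content="noindex, follow" />
</head>
<body>
  <!-- Hypothetical URLs: one plain link per page the spider cannot otherwise reach -->
  <ul>
    <li><a href="http://www.example.com/news/article-1.html">News article 1</a></li>
    <li><a href="http://www.example.com/news/article-2.html">News article 2</a></li>
    <li><a href="http://www.example.com/unlisted/page.html">Unlisted page</a></li>
  </ul>
</body>
</html>
```

The links are ordinary anchor tags rather than JavaScript calls, since the point of the page is to give the spider a direct route it can follow.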

NB: Google and the other big search engines also obey this ‘robots’ meta tag. However, for them to spider the pages and find them interesting enough to keep in their index, you will need a link to the sitemap page from another part of your site. So if you use this technique to expose pages to the public search engines, your sitemap will need to look prettier, as people might click through to it from a link on your website.

Comments (2)

  1. Comment by elvum — December 13, 2006 @ 2:44 pm

    “In the HTML of this page, between the tags, put the following line…”

    What line?

  2. Comment by Paul — December 13, 2006 @ 3:14 pm

    Apologies, WordPress had removed the line because I hadn’t used the right codes for the brackets at each end. I’ve edited the post and put them back in now, the line is

    meta name="robots" content="noindex, follow" /

    With chevron brackets at each end.

    Thanks for pointing it out.
