Showing posts with label tag cloud. Show all posts

Thursday, July 19, 2007

Lucene Tag Cloud Generator

I wrote a tag cloud generator for Lucene, examples of it include
The generator is written as a Nutch plugin for no good reason ;)

I build the cloud by reading the Lucene index and pruning the resulting term list. The pruning is controlled by a junk words file.
Once I have built the list, I run a JavaScript file, passing in the results, and the JavaScript outputs the cloud.
There are a few files to all of this....

JavaSourceCode
The source code requires Lucene. Though I wrote it as a Nutch plugin, it does not depend on Nutch.

JunkWordsFile
The junk word file contains terms, and some options.
The options are baked into the code.

The words do not support regex; they are matched literally.
Options include:

-numbers - ignore numbers
-smallwords - skip words with three or fewer characters
-dashes - ignore terms with dashes
-# : comments are also supported with #
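
The options above can be sketched as a small filter. This is only an illustration, not the plugin's actual code: the class name `JunkWordFilter`, the one-entry-per-line file layout, treating lines that start with `#` as comments, and reading "numbers" as terms made up only of digits are all assumptions.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical junk word filter; assumes one term, option, or comment per line. */
class JunkWordFilter {
    private final Set<String> junkWords = new HashSet<>();
    private boolean ignoreNumbers;
    private boolean skipSmallWords;
    private boolean ignoreDashes;

    /** Builds the filter from the lines of a junk words file. */
    JunkWordFilter(List<String> lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank line or comment (assumed comment syntax)
            }
            switch (line) {
                case "-numbers":    ignoreNumbers = true;  break;
                case "-smallwords": skipSmallWords = true; break;
                case "-dashes":     ignoreDashes = true;   break;
                default:            junkWords.add(line);   // plain junk term
            }
        }
    }

    /** Returns true if the term should stay in the cloud. */
    boolean keep(String term) {
        if (junkWords.contains(term)) return false;               // literal match, no regex
        if (skipSmallWords && term.length() <= 3) return false;   // three or fewer characters
        if (ignoreDashes && term.indexOf('-') >= 0) return false; // terms with dashes
        if (ignoreNumbers && term.matches("\\d+")) return false;  // assumption: all-digit terms
        return true;
    }
}
```

Each term read from the index would be passed through `keep` before it is counted toward the cloud.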

JavaScriptFile
This file converts the results into HTML and uses CSS to dress the cloud.
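
To give a rough idea of the conversion step, here is a minimal sketch written in Java (to match the generator source) rather than JavaScript. It assumes each term already carries a strength level and emits one SPAN per term. The class name `CloudHtml` and the CSS class names `tagcloud` and `tagN` are made up for illustration; they are not taken from the actual files.

```java
import java.util.Map;

/** Hypothetical renderer: turns term -> strength level into HTML spans. */
class CloudHtml {
    /** Escapes the characters that are unsafe inside HTML text. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Emits one span per term; the CSS file would size classes tag1, tag2, ... */
    static String render(Map<String, Integer> levels) {
        StringBuilder html = new StringBuilder("<div class=\"tagcloud\">\n");
        for (Map.Entry<String, Integer> e : levels.entrySet()) {
            html.append("  <span class=\"tag").append(e.getValue()).append("\">")
                .append(escape(e.getKey()))
                .append("</span>\n");
        }
        return html.append("</div>").toString();
    }
}
```

With this layout the stylesheet only needs one rule per level, for example a larger `font-size` for higher `tagN` classes.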

CloudCssFile
The CSS file is adapted from an example I found in a PHP tag cloud project.

Tuesday, June 19, 2007

Creating a Tag Cloud

At first I was very frustrated. I was attempting to figure out how to generate a 'field' cloud from a Lucene index. I was googling '"tag cloud" generator lucene' and I just could not find my way. My view is that Google at times is losing the simplicity of its core strength. I wasted some time, but eventually tried Google's blog search and finally found a collection of "tag cloud" generators, including a Java cloud generator as a starting point. That one, while interesting, is fully integrated into Hibernate and generates a cloud from the index created behind the data within your application.

So... I gave up and built it myself, but got a head start from two projects:
a. Luke, the Lucene Index Toolbox. A great tool for working with Lucene. On the first page there is a list of terms, so I started by reviewing that code base.
b. A good PHP example with some basic CSS that leverages SPAN tags. I also reviewed the PHP code to see how it randomizes the items and determines the strength of each item within the cloud.
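
The randomizing and strength steps mentioned in (b) can be sketched roughly as follows. This is not the PHP project's code: linear scaling of counts into a fixed number of levels is an assumed technique, and the names `CloudWeights`, `toLevels`, and `shuffledTerms` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical helper: scales term counts into strength levels and shuffles terms. */
class CloudWeights {
    /** Linearly maps each count onto a level from 1 (weakest) to levels (strongest). */
    static Map<String, Integer> toLevels(Map<String, Integer> counts, int levels) {
        Map<String, Integer> result = new LinkedHashMap<>();
        if (counts.isEmpty()) {
            return result;
        }
        int min = Collections.min(counts.values());
        int max = Collections.max(counts.values());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int level = (max == min)
                ? 1 // all counts equal, so every term gets the same level
                : 1 + (int) ((long) (e.getValue() - min) * (levels - 1) / (max - min));
            result.put(e.getKey(), level);
        }
        return result;
    }

    /** Returns the terms in random order so strong terms are spread across the cloud. */
    static List<String> shuffledTerms(Map<String, Integer> counts, long seed) {
        List<String> terms = new ArrayList<>(counts.keySet());
        Collections.shuffle(terms, new Random(seed));
        return terms;
    }
}
```

A fixed seed keeps the layout stable between page loads; passing a different seed each time would reshuffle the cloud.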

I chose to crawl starting from my blog, indexing a few times with -topN 2000, so not a very large crawl, but enough to generate the dataset. From that data I generated the cloud at the top, which is based on the 'content' field, and if you scroll all the way to the bottom there is a cloud based on the 'site' field.

I will post the code once I get it cleaned up. If you need it sooner just let me know.
Here is the example cloud I am able to generate.


 products    software    real    main    blog    tools    links    source    business    news    services    search    policy    privacy    documentation    community    contact    service    view