I wrote a tag cloud generator for Lucene; examples of it include
I build the cloud by reading the Lucene index and pruning the terms down. A junk words file controls which terms get pruned.
There are a few files to all of this.
The source code requires Lucene. Although I wrote it as a Nutch plugin, it does not depend on Nutch.
The junk word file contains terms and a few options.
The set of options is hard-coded.
The words do not support regex; they are matched as plain strings.
-numbers - ignore numbers
-smallwords - ignore words with three or fewer characters
-dashes - ignore terms with dashes
-# - comments are also supported with #
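
To make the pruning rules concrete, here is a minimal sketch of how such a file could be applied. It is not the plugin's actual code. The class name JunkWordFilter, the one-entry-per-line layout, the leading dash on option names, treating "numbers" as all-digit terms, and treating only lines that start with # as comments are all assumptions of mine.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical junk-word filter; the file layout is assumed, not taken from the plugin. */
final class JunkWordFilter {
    private final Set<String> junkWords = new HashSet<>();
    private boolean ignoreNumbers;
    private boolean ignoreSmallWords;
    private boolean ignoreDashes;

    /** Builds a filter from the lines of a junk word file (one entry per line, assumed). */
    static JunkWordFilter parse(List<String> lines) {
        JunkWordFilter filter = new JunkWordFilter();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            switch (line) {
                case "-numbers":    filter.ignoreNumbers = true;    break;
                case "-smallwords": filter.ignoreSmallWords = true; break;
                case "-dashes":     filter.ignoreDashes = true;     break;
                default:            filter.junkWords.add(line);     // plain junk word
            }
        }
        return filter;
    }

    /** Returns true when the term should stay in the cloud. */
    boolean accept(String term) {
        if (ignoreNumbers && term.matches("\\d+")) return false;  // all-digit terms (assumed meaning)
        if (ignoreSmallWords && term.length() <= 3) return false; // three or fewer characters
        if (ignoreDashes && term.contains("-")) return false;     // any term containing a dash
        return !junkWords.contains(term);                         // exact string match, no regex
    }
}
```

Each term read from the index would be passed through accept, and only the terms that return true would go into the cloud.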
This file converts into HTML and uses CSS to dress the cloud.
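
As a rough illustration of the HTML step, the sketch below turns term counts into spans whose CSS class reflects the term's weight. It is only a sketch under my own assumptions: the class name TagCloudHtml, the CSS class names tagcloud and size1 to size5, the alphabetical ordering, and the linear scaling into five buckets are all hypothetical and may differ from what the plugin and its stylesheet actually do.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical renderer: maps term counts to spans with size classes. */
final class TagCloudHtml {
    static final int BUCKETS = 5; // assumed number of font-size classes

    /** Linearly scales a count between min and max into a bucket 1..BUCKETS. */
    static int bucket(int count, int min, int max) {
        if (max == min) return 1; // all terms equally frequent
        return 1 + (int) ((long) (count - min) * (BUCKETS - 1) / (max - min));
    }

    /** Escapes the characters that are unsafe inside HTML text. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Renders the cloud as a div of spans, with terms in alphabetical order. */
    static String render(Map<String, Integer> counts) {
        if (counts.isEmpty()) return "<div class=\"tagcloud\"></div>";
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int c : counts.values()) {
            min = Math.min(min, c);
            max = Math.max(max, c);
        }
        StringBuilder html = new StringBuilder("<div class=\"tagcloud\">\n");
        for (Map.Entry<String, Integer> e : new TreeMap<>(counts).entrySet()) {
            html.append("  <span class=\"size")
                .append(bucket(e.getValue(), min, max))
                .append("\">")
                .append(escape(e.getKey()))
                .append("</span>\n");
        }
        return html.append("</div>").toString();
    }
}
```

The stylesheet would then only need one rule per size class, for example a larger font-size for size5 than for size1.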
The CSS file is adapted from an example I found in a PHP tag cloud project.