More on content intelligence: Stemming

This content intelligence thing seems to be turning into a bit of a mini-series.

The last addition to this example for the foreseeable future is Stemming - the reduction of a word to its root form.

In case you are thinking WTF, think about the relationship between 'training' and 'train', 'discussion' and 'discuss', and so on. If you want to read more about how machines can perform stemming, see http://www.comp.lancs.ac.uk/computing/research/stemming/general/ for details. Personally, I am happy just to be aware of stemming algorithms and the fact that they can be useful when classifying documents.

If you click through to the detail view of any of my blog posts you'll see that some words have been stemmed in the classification output - the root form appears in brackets after the word.

After playing with NClassifier's PorterStemmer, I found it was often too harsh, reducing words to stems that made no sense.

As a solution I applied both the PorterStemmer and the KStemmer, which is available from the lucene.net site, and selected the shortest correctly spelled result as the root word. The results are pretty satisfactory, but there will be mistakes - 'use' gets stemmed to 'us', for example.
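To make the selection rule concrete, here is a minimal Python sketch of the idea. It is not the code running on this site: the two stemmer functions are toy stand-ins for a harsh and a gentle stemmer (not the real Porter or KStem algorithms), and DICTIONARY is a made-up word list standing in for a proper spell checker.

```python
# Hypothetical word list standing in for a real spell checker.
DICTIONARY = {"train", "discuss", "classify", "use", "us"}


def aggressive_stem(word):
    """Toy stand-in for a harsh stemmer: strips the first matching suffix."""
    for suffix in ("ification", "ing", "ion", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word


def gentle_stem(word):
    """Toy stand-in for a gentler stemmer: only accepts dictionary words."""
    for suffix, replacement in (("ification", "ify"), ("ing", ""), ("ion", "")):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in DICTIONARY:
                return candidate
    return word


def choose_root(word, stemmers=(aggressive_stem, gentle_stem), dictionary=DICTIONARY):
    """Run every stemmer and keep the shortest correctly spelled candidate."""
    candidates = [stem(word) for stem in stemmers]
    valid = [c for c in candidates if c in dictionary]
    if not valid:
        return word  # nothing usable, so leave the word unstemmed
    return min(valid, key=len)


print(choose_root("classification"))  # classify ("class" is not in the toy word list)
print(choose_root("training"))        # train
print(choose_root("use"))             # us - the same kind of mistake described above
```

The 'use' case shows why the rule is only a heuristic: both 'us' and 'use' are correctly spelled, so picking the shortest one picks the wrong one.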

So why is this useful? Imagine someone visits your site and searches for the word 'classification'. Using stemming you can reduce 'classification' to 'classify', determine that 'classifier' has the same root word and return documents matching 'classify', 'classification' and 'classifier'.
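The search scenario can be sketched in a few lines of Python. Again, this is an illustration under assumptions rather than what the site does: ROOTS is a hand-written lookup table standing in for a real stemmer, and the document ids and texts are invented.

```python
import re
from collections import defaultdict

# Hypothetical lookup table standing in for a real stemmer.
ROOTS = {
    "classification": "classify",
    "classifier": "classify",
    "classifiers": "classify",
}


def root_of(word):
    """Return the root form of a word, or the word itself if unknown."""
    return ROOTS.get(word, word)


def build_index(documents):
    """Map each root word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            index[root_of(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query the same way as the documents, then look it up."""
    return sorted(index.get(root_of(query.lower()), set()))


# Invented example documents.
documents = {
    "post-1": "Training a classifier on blog posts",
    "post-2": "Document classification with stemming",
    "post-3": "Notes on personalisation",
}
index = build_index(documents)

print(search(index, "classification"))  # ['post-1', 'post-2']
```

The important part is that the query and the documents go through the same stemming step, so a search for one form of the word finds all of them.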

Read it a few times, it does make sense. Honest!

So now what? Well, I'm shelving the content intelligence thing for the moment. As with most of my prototyping, I get it to the stage where I could sell it as a product if I found a client to fund it, and then park it.

If I have some free time I may well try and move all of this prototype functionality into an ActionHandler and then add some personalisation for members who are logged on to my site.
