More on NClassifier and content intelligence

Note: When reading this post, click through to the detail view so that you can see the output that is being discussed.

Ever seen pages like your flickr tag list that use the pretty cool effect of displaying the tag in a size that relates to its frequency?

As well as classifying your posts NClassifier makes it easy to get information on how often words occur in a piece of text. If you click through to any of the detail page of any blog post you'll see that yesterdays classification now lists information about how often words occur in the post. It would be a relatively easy job to randomize the order in which the words appear and apply a font size based on how many times they occur in a document or even across an entire website.

We have to remember that the classifier has no knowledge of the meaning of words, so to improve its output we employ few common search engine methods.

Stop Words: these are a list of words that should not be considered as they are not relevant to the classification of the document e.g. (a, the, if, you, he, she). Stop words are somtimes called noise words.

Custom words: Non dictionary words that you want to be included in classification e.g. (Umbraco, c#, Darren).

Significant words: Words that you want to add weight to, in my case Umbraco, Pho and words that relate to these topics.

You can see all of these in the updated config file. Experience has taught me to also exclude single letter words and numeric strings.

The NClassifier word list is then passed through NetSpell and sorted by number of occurrences to give the output that you see.

Finally, although useful, these classifiers need a lot of training to be useful. As you can probably see, you will end up adding more and more words to your list of stop words and gradually the summarization becomes more concise. To work around the need to train your classifiers you can only consider significant words in the output.

This example is very simple, significant words are not weighted, so the only factor that determines the importance of a word in a document is the number of times that it occurs. Weighting can be very complex, so I'll leave it for now.

As with my first NClassifier example, I think it would be best to employ this as an ActionHandler if being used within Umbraco. To employ site wide word counts you would need a data store with the columns documentId, wordId, occurenceCount in order that you could easily update your information upon re-publish.

Comments

Leave a comment