Automatic linking in blog posts: more complexity

Following on from my previous post, I've begun to realise that automatically linking key phrases in blog content to URLs is a little more complex than I originally imagined.

Take my blog as an example. My posts often contain source code and other technical information, and that source code can contain my key phrases: Umbraco and Perl, for example, appear frequently.

How do I get my auto linker to be smart enough to understand the context of the content it is manipulating and ignore key phrases in code?
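
To make the problem concrete, here is a minimal Python sketch of a naive regex-based linker. The LINKS table, the example.com URLs and the naive_autolink function are all made up for illustration; this is not the actual linker code, just an assumption about how a simple one might behave.

```python
import re

# Hypothetical key phrases and target URLs, for illustration only.
LINKS = {
    "Umbraco": "https://example.com/umbraco",
    "Perl": "https://example.com/perl",
}

# One pattern matching any key phrase as a whole word.
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, LINKS)) + r")\b")


def naive_autolink(html):
    """Wrap every key phrase in an anchor, with no awareness of markup."""
    def to_anchor(match):
        phrase = match.group(1)
        return '<a href="%s">%s</a>' % (LINKS[phrase], phrase)

    return PATTERN.sub(to_anchor, html)


post = "<p>I like Perl.</p><pre>print 'Perl rocks';</pre>"
result = naive_autolink(post)
print(result)
```

Both occurrences of Perl get linked, including the one inside the pre block, which is exactly the behaviour I want to avoid.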

Solution #1 would be to parse only the content of specific tags or, more likely, to exclude pre tags from parsing, since the tag implies that its content is pre-formatted and should not be modified. The problem here is that regular expressions alone would no longer be enough to manipulate the content. We would need to do something more complex, perhaps parsing the post content as XML so that we have a notion of node structure.
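
As a rough sketch of Solution #1, the Python below walks the markup with the standard library's html.parser rather than a strict XML parser (my substitution, since post content is not guaranteed to be well-formed XML). The LINKS table, the example.com URLs, the AutoLinker class and the choice to also skip code and a tags are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical key phrases and target URLs, for illustration only.
LINKS = {
    "Umbraco": "https://example.com/umbraco",
    "Perl": "https://example.com/perl",
}
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, LINKS)) + r")\b")

# Tags whose text is left alone. Only pre comes from the post itself;
# code and a (to avoid nested links) are assumed additions.
SKIP_TAGS = {"pre", "code", "a"}


def to_anchor(match):
    phrase = match.group(1)
    return '<a href="%s">%s</a>' % (LINKS[phrase], phrase)


class AutoLinker(HTMLParser):
    """Re-emits the markup, linking key phrases only outside SKIP_TAGS."""

    def __init__(self):
        # Keep entities such as &amp; untouched so they can be echoed back.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are echoed as written.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.skip_depth == 0:
            data = PATTERN.sub(to_anchor, data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)


def autolink(html):
    parser = AutoLinker()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


post = "<p>I like Perl.</p><pre>print 'Perl rocks';</pre>"
linked = autolink(post)
print(linked)
```

Here only the Perl in the paragraph is linked and the pre block passes through unchanged. One caveat with this sketch: an unclosed pre tag would suppress linking for the rest of the post.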

Solution #2 is to put parsing back into the hands of the user, parsing only the blocks of text that they specifically select in TinyMCE.

I haven't made a call on this yet, though I'm more likely to go with Solution #2. Any thoughts appreciated, as usual.

