Bayes on RSS feeds - Unsuitable?
Krishnan Nair Srijith,
Srijith.net,
Aug 20, 2003
Seb tossed this link to me and I feel like I ought to respond. It begins with the tantalizing idea of applying Bayes' Theorem, by way of some Perl modules, to autocategorize blog content. Nifty idea. Could it work? Well, not according to the critics. It does not take into account the origin of the feed, it does not take into account the placement of the word, and it does not take into account the relative importance of the word (such as whether it appears in a title). One critic writes, "If the author of the feed has already denoted the news item was 'technology', it would be wise to give this match a probability of 1 for the category 'Technology'." Well, hardly. To assume that people will categorize entities correctly is the height of wishful thinking, in my opinion. To make the Bayesian approach work, what designers should do is evaluate not mere strings, but couples, each pairing a field with a string. I would express it like this: title~RSS (which means, roughly, title contains the string 'RSS'). If these are the elements used in the Bayesian calculations then the objections vanish. Mind you, I have just quintupled the number of elements to be considered, so there are other issues to contend with. But all of that said, I'm not ready to go Bayesian just yet. My preference is a type of pattern-detection using Perl regular expressions.
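To make the idea of couples concrete, here is a minimal sketch of a naive Bayes classifier whose elements are (field, token) pairs rather than bare strings, so that title~RSS and description~RSS are counted separately. It is written in Python rather than Perl for brevity, and it is not the code from the linked article. The names `features` and `PairBayes`, the field names, the sample items, and the use of add-one smoothing are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def features(item):
    """Turn an item (a dict of field -> text) into (field, token) couples."""
    pairs = []
    for field, text in item.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            pairs.append((field, token))
    return pairs


class PairBayes:
    """Hypothetical naive Bayes classifier over (field, token) couples."""

    def __init__(self):
        self.cat_counts = Counter()               # items seen per category
        self.feat_counts = defaultdict(Counter)   # couple counts per category
        self.vocab = set()                        # every couple seen in training

    def train(self, item, category):
        self.cat_counts[category] += 1
        for f in features(item):
            self.feat_counts[category][f] += 1
            self.vocab.add(f)

    def classify(self, item):
        """Return the most probable category (call train at least once first)."""
        total = sum(self.cat_counts.values())
        best, best_score = None, float("-inf")
        for cat, n in self.cat_counts.items():
            # Log prior plus log likelihood of each couple, add-one smoothed.
            score = math.log(n / total)
            denom = sum(self.feat_counts[cat].values()) + len(self.vocab)
            for f in features(item):
                score += math.log((self.feat_counts[cat][f] + 1) / denom)
            if score > best_score:
                best, best_score = cat, score
        return best


# Made-up training items: 'RSS' is in the title of one, the body of the other.
model = PairBayes()
model.train({"title": "RSS aggregator released",
             "description": "new software for reading feeds"}, "technology")
model.train({"title": "Election results announced",
             "description": "follow the results by RSS or email"}, "politics")

in_title = model.classify({"title": "RSS"})
in_body = model.classify({"description": "RSS"})
```

With these two training items, the same string 'RSS' pulls an item toward a different category depending on where it appears, which is the distinction the critics say plain word counts lose. The cost is the one noted above: every field multiplies the number of elements the classifier has to track.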