2005-08-03

synchcola
2005-08-03 10:48 pm

Summer project

In an entry a while back, leonardr mentioned a fast text classification technique and linked to a description of it. I thought it sounded pretty neat, and just recently I went back and read about it. So my new project is implementing SIMPL in Python (as an extension to Bayes Motel). I'm going to talk about that some.

It turns out that implementing SIMPL requires me to write two pieces of code: the "hill-climbing" routine to find good axes of projection, and Ross Quinlan's C4.5 algorithm for generating decision trees. I've finished transliterating a very limited version of C4.5 into Python, and that's here. The status of this is dubious because the code I started from isn't GPL or public-domain. Mu~
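For anyone who hasn't met C4.5: its core is the rule for picking which attribute to split on, which is the gain ratio (information gain divided by split information). Here's a minimal sketch of just that rule, for categorical attributes only. It is not the transliterated code mentioned above; the function names and the list-of-dicts data layout are made up for illustration, and real C4.5 also handles continuous attributes, missing values and pruning, none of which appear here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gain_ratio(rows, labels, attr):
    """Gain ratio of splitting on the categorical attribute `attr`.

    `rows` is a list of dicts and `labels` is the parallel list of classes
    (an assumed layout, chosen only to keep the sketch short).
    """
    total = len(rows)
    # Group the class labels by the value each row has for `attr`.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    # Information gain: entropy before the split minus weighted entropy after.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information penalises attributes with many distinct values.
    split_info = -sum(
        (len(g) / total) * log2(len(g) / total) for g in groups.values()
    )
    return gain / split_info if split_info > 0 else 0.0


def best_attribute(rows, labels, attrs):
    """Pick the attribute with the highest gain ratio."""
    return max(attrs, key=lambda a: gain_ratio(rows, labels, a))
```

A tree builder would call `best_attribute` on the current set of rows, split on the winner, and recurse into each group until the labels are pure or no attributes remain.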

The next thing that I need to do should be easier; I just need to write some vector stuff. Also I can't find the Bayes Motel code, gar. And then SIMPL will be loose upon the world!
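Roughly, the "vector stuff" amounts to dot products on sparse document vectors plus a loop that nudges an axis and keeps the nudge if it helps. The sketch below is a generic hill-climber, not SIMPL's actual routine: the `{feature: weight}` dict representation, the random single-feature perturbation, and the caller-supplied `score` function are all assumptions, since the post doesn't say how SIMPL rates an axis.

```python
import random


def dot(u, v):
    """Dot product of two sparse vectors stored as {feature: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(f, 0.0) for f, w in u.items())


def normalize(v):
    """Scale a sparse vector to unit length (zero vectors are returned as-is)."""
    norm = dot(v, v) ** 0.5
    return {f: w / norm for f, w in v.items()} if norm else dict(v)


def hill_climb(axis, score, features, steps=200, step_size=0.5, rng=None):
    """Greedy hill-climbing over projection axes.

    `score(axis)` is a hypothetical quality measure supplied by the caller;
    higher is better. Each step perturbs one randomly chosen feature weight
    and keeps the new axis only if the score strictly improves.
    """
    rng = rng or random.Random(0)
    best = normalize(axis)
    best_score = score(best)
    for _ in range(steps):
        candidate = dict(best)
        f = rng.choice(features)
        candidate[f] = candidate.get(f, 0.0) + rng.uniform(-step_size, step_size)
        candidate = normalize(candidate)
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score
```

Projecting a document onto an axis is then just `dot(axis, doc)`, and those projected values would be the numbers a decision tree gets to split on.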

SIMPL is a lot slower than Bayes, unfortunately: adding n documents and updating the tree and axes takes Θ(documents already in corpus + n) time, so even a single new document costs time proportional to the whole corpus. That's a major disadvantage. ^_^; On the other hand, that won't be a limitation for problems of this size.