Aug. 3rd, 2005

synchcola: (Default)
In an entry a while back, leonardr mentioned a fast text classification technique and linked to a description of it. I thought it sounded pretty neat, and just recently I went back and read about it. So my new project is implementing SIMPL in Python (as an extension to Bayes Motel). I'm going to talk about that some.

It turns out that implementing SIMPL requires me to write two pieces of code: the "hill-climbing" routine to find good axes of projection, and Ross Quinlan's C4.5 algorithm for generating decision trees. I've finished transliterating a very limited version of C4.5 into Python, and that's here. The licensing status of this is dubious because the code I started from isn't GPL or public-domain. Mu~
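For anyone who hasn't met C4.5: the core of it is picking which attribute to split on by "gain ratio", which is information gain divided by the entropy of the split itself. What follows is a minimal sketch of just that step, assuming discrete attributes only. It is not the transliterated code mentioned above, and the names (`entropy`, `gain_ratio`, `best_attribute`) are made up for illustration. Real C4.5 also handles continuous attributes, missing values, and pruning, none of which appear here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def gain_ratio(rows, labels, attr):
    """Gain ratio of splitting on the discrete attribute at index `attr`."""
    total = len(rows)
    # Group the class labels by the value each row has for this attribute.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)

    # Information gain: entropy before the split minus weighted entropy after.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy(labels) - remainder

    # Split information penalises attributes with many distinct values.
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


def best_attribute(rows, labels):
    """Index of the attribute with the highest gain ratio (hypothetical helper)."""
    return max(range(len(rows[0])), key=lambda a: gain_ratio(rows, labels, a))
```

A tree builder would call something like `best_attribute` on the current set of rows, split on the winner, and recurse on each partition until the labels are pure or no attributes are left.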

The next thing that I need to do should be easier; I just need to write some vector stuff. Also I can't find the Bayes Motel code, gar. And then SIMPL will be loose upon the world!
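The "vector stuff" plus a hill-climbing loop could look roughly like the sketch below. To be loud about the assumptions: this is generic random-perturbation hill climbing, not SIMPL's actual procedure. The scoring function (`separation`, the gap between two class means along the axis), the 0/1 labels, and the parameters `steps`, `step_size`, and `seed` are all invented for illustration.

```python
import math
import random


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [a / length for a in v]


def separation(axis, points, labels):
    """Assumed score: gap between the two class means after projecting onto `axis`."""
    proj = {0: [], 1: []}
    for p, y in zip(points, labels):
        proj[y].append(dot(axis, p))
    mean0 = sum(proj[0]) / len(proj[0])
    mean1 = sum(proj[1]) / len(proj[1])
    return abs(mean0 - mean1)


def hill_climb(points, labels, steps=200, step_size=0.1, seed=0):
    """Nudge a unit axis at random, keeping only nudges that improve the score."""
    rng = random.Random(seed)
    dim = len(points[0])
    axis = normalize([rng.gauss(0, 1) for _ in range(dim)])
    best = separation(axis, points, labels)
    for _ in range(steps):
        candidate = normalize([a + rng.gauss(0, step_size) for a in axis])
        score = separation(candidate, points, labels)
        if score > best:
            axis, best = candidate, score
    return axis, best
```

Since a candidate is only accepted when it scores higher, the score never goes down, but like any hill climber this can get stuck on a local maximum with a less friendly scoring function.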

SIMPL is a lot slower than Bayes, unfortunately: it takes Θ(documents already in corpus + n) time to add n documents and update the tree and axes, so even a small addition costs time proportional to the whole corpus. That's a major disadvantage. ^_^; On the other hand, that won't be a limitation for problems of this size.
