2005-08-04

synchcola
2005-08-04 09:45 pm

SIMPL journal

Okay, now I need to debug it and find less inefficient ways of doing things. ^_^

I'd like to store documents as numpy arrays, with doc[j, 0] the word and doc[j, 1] the frequency, instead of as dictionaries.
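A rough sketch of what that conversion might look like. It assumes each word has already been mapped to an integer id, since a numeric array can't hold the strings directly; the `vocab` mapping and the `dict_to_array` helper are made-up names for illustration, not anything that exists in SIMPL yet.

```python
import numpy as np

def dict_to_array(doc, vocab):
    """Convert {word: frequency} into an (n_words, 2) integer array.

    Column 0 holds the word id, column 1 the frequency.
    `vocab` (word -> integer id) is a hypothetical lookup table.
    """
    rows = sorted((vocab[word], count) for word, count in doc.items())
    return np.array(rows, dtype=np.int64).reshape(-1, 2)

# Toy vocabulary and document, purely for illustration.
vocab = {"spam": 0, "ham": 1, "eggs": 2}
doc = dict_to_array({"eggs": 3, "spam": 1}, vocab)

# doc[j, 0] is the word id and doc[j, 1] its frequency.
total_words = int(doc[:, 1].sum())
```

Sorting by word id keeps the rows in a predictable order, which should make it easier to merge or compare two documents later.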

I want some kind of speedup when adding only a few documents. One thing I can do is store the initial Fisher's discriminant and begin the new hill-climbing phase with that. Or I can leave it alone! For large enough documents, I could cache the first two discriminants.
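Here is a minimal sketch of the warm-start idea, not the real thing: the objective is a toy stand-in rather than SIMPL's actual score, and `hill_climb`, `toy_score`, and `cached_start` are all hypothetical names. It only shows that resuming from a stored solution can take fewer iterations than starting from scratch.

```python
def hill_climb(score, start, step=0.1, max_iters=1000):
    """Greedy coordinate-wise hill-climbing.

    Returns (best_point, iterations_used). Only strict improvements
    are accepted, so the loop stops at the first local optimum.
    """
    current = list(start)
    best = score(current)
    for iteration in range(max_iters):
        improved = False
        for i in range(len(current)):
            for delta in (step, -step):
                candidate = list(current)
                candidate[i] += delta
                s = score(candidate)
                if s > best:
                    current, best, improved = candidate, s, True
        if not improved:
            return current, iteration
    return current, max_iters

def toy_score(w):
    # Stand-in objective with its maximum at (1, -2); an assumption for the demo.
    return -((w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2)

# Cold start: climb from the origin.
cold_point, cold_iters = hill_climb(toy_score, [0.0, 0.0])

# Warm start: pretend this was cached from the previous run (hypothetical values).
cached_start = [0.9, -1.9]
warm_point, warm_iters = hill_climb(toy_score, cached_start)
```

If adding a few documents only nudges the optimum, the cached starting point should already be close, which is where the savings would come from.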

I want to add C4.5's capability for determining decision trees from a random subset. (With those two improvements I could really speed up the algorithm when I only add a couple of things, but in that case there wouldn't be much difference from ignoring the new stuff completely! And I would like it to behave nicely if a mailing list suddenly appears -nyo. :P)