SIMPL journal
Aug. 4th, 2005 09:45 pm
I'd like to store documents as numpy arrays, with doc[j, 0] the word and doc[j, 1] the frequency, instead of as dictionaries.
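A minimal sketch of that layout, assuming each word is mapped to an integer id so the array can stay purely numeric. The `vocab` dictionary and the `doc_to_array` helper are hypothetical names for illustration, not anything SIMPL already has:

```python
import numpy as np

def doc_to_array(counts, vocab):
    """Convert a {word: frequency} dict into an (n, 2) integer array.

    Column 0 holds the word id, column 1 the frequency.
    Unseen words are given the next free id in `vocab` (assumed scheme).
    """
    rows = sorted((vocab.setdefault(w, len(vocab)), f) for w, f in counts.items())
    # reshape keeps the (n, 2) shape even for an empty document
    return np.array(rows, dtype=np.int64).reshape(-1, 2)

vocab = {}
doc = doc_to_array({"spam": 3, "eggs": 1}, vocab)
# doc[j, 0] is the word id, doc[j, 1] is the frequency
```

Sorting the rows by word id means two documents can be compared or merged with a single pass over both arrays.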
I want some kind of speedup when adding only a few documents. One thing I can do is store the initial Fisher's discriminant and begin the new hill-climbing phase with that. Or I can leave it alone! For large enough documents, I could cache the first two discriminants.
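Roughly what the warm start could look like, as a sketch only. It assumes a two-class problem with dense feature rows and a generic `score` function to maximise. The names `fisher_direction` and `hill_climb`, the `ridge` term, and the random-perturbation search are all stand-ins, not the actual SIMPL hill-climbing step:

```python
import numpy as np

def fisher_direction(X0, X1, ridge=1e-6):
    """Fisher's linear discriminant direction for two classes (rows are samples)."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter; the small ridge keeps the solve well-posed.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    Sw += ridge * np.eye(Sw.shape[0])
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

def hill_climb(score, w_start, step=0.1, iters=200, seed=0):
    """Random-perturbation hill climbing that keeps only improving moves."""
    rng = np.random.default_rng(seed)
    w, best = w_start.copy(), score(w_start)
    for _ in range(iters):
        cand = w + step * rng.standard_normal(w.shape)
        s = score(cand)
        if s > best:
            w, best = cand, s
    return w, best

# Compute the discriminant once and keep it around ...
X0 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X1 = X0 + 3.0
cached_w = fisher_direction(X0, X1)

# ... then, after adding a few documents, restart the climb from the cached
# vector instead of recomputing it. This toy score is a placeholder.
score = lambda w: -np.sum((w - 1.0) ** 2)
w_new, best = hill_climb(score, cached_w)
```

Since the climb only accepts improving moves, starting from the cached vector can never end up worse than the cached vector itself under the new score.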
I want to add C4.5's ability to build decision trees from a random subset of the training data. (With those two improvements I could really speed up the algorithm when I only add a couple of things, but in that case there wouldn't be much difference from ignoring the new stuff completely! And I would like to behave nicely if a mailing list suddenly appears -nyo. :P)