I’m running some tests on sklearn decision trees, and the lessons learned so far may be interesting.
I’ve put my measurement code at the end. It tracks the percentage of correct predictions, the number of positive and negative predictions, and the false positives and false negatives.
- If you accidentally include the ‘answer’ column in the test columns, every prediction comes out correct, so the false positive/negative percentages in the measurement code at the end divide by zero, which is a good check.
- For my data, running with criterion='entropy' gives a 2% increase in accuracy, but other people I talked to on Twitter have seen the opposite.
- criterion='entropy' is noticeably slower than the default ('gini').
- The default decision tree settings create trees that are very large and deep (~20k nodes for ~100k data points).
- For my use case, I found that limiting the depth of trees and forcing each node to have a large number of samples (50-500) made much simpler trees with only a small decrease in accuracy.
- Forcing nodes to have more samples decreased accuracy by roughly 0-5%, with the loss growing as the minimum number of samples per node went from 50 to 500.
- I found that I needed to remove a lot of my database columns to get a meaningful result. For instance, I originally had ID columns, which let sklearn pick up data created in a certain time window (since the IDs are sequential), but I don’t think this is useful for what I want to do.
- You have to turn categorical (class-based) attribute values into integers (it appears to use a numpy float type internally for performance reasons).
- sklearn appears to only use range-based rules. Combine this with the above and you get a lot of rules like “status > 1.5”.
- The tree could conceivably generate equality conditions within the structure, although it’d be hard to tell (e.g. “status > 1.5” followed by “status < 2.5” would be equivalent to “status = 2” if status is an integer).
- I’m more interested in discovering useful rules than in future predictions; it helps a lot to export the tree as JSON.
- Within the JSON, the “entropy” and “impurity” fields show you how clean the rule is (0 = good). The “value” field shows how many items fit the rule (small numbers are probably not useful, at least for me).
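The points above about tree size and JSON output can be sketched together. This is a minimal illustration, not the code used for this post: the data is synthetic (from `make_classification`), the feature names are made up, `tree_to_dict` is a hypothetical helper, and `max_depth=4` and `min_samples_leaf=50` are just example settings. It compares the node count of a default tree against a constrained one, then walks the fitted tree's arrays (`tree_.feature`, `tree_.threshold`, `tree_.impurity`, `tree_.value`) to build JSON.

```python
import json

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the real data is not shown here).
X, y = make_classification(n_samples=5000, n_features=8, n_informative=4,
                           flip_y=0.1, random_state=0)
feature_names = ["f{0}".format(i) for i in range(X.shape[1])]  # made-up names
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default settings keep splitting until the leaves are pure.
default_clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constrained tree: limited depth and a minimum number of samples per leaf.
small_clf = DecisionTreeClassifier(criterion="entropy", max_depth=4,
                                   min_samples_leaf=50, random_state=0)
small_clf.fit(X_train, y_train)

print("default nodes:", default_clf.tree_.node_count,
      "accuracy:", default_clf.score(X_test, y_test))
print("constrained nodes:", small_clf.tree_.node_count,
      "accuracy:", small_clf.score(X_test, y_test))


def tree_to_dict(clf, names, node=0):
    """Recursively convert a fitted tree into a nested, JSON-friendly dict."""
    tree = clf.tree_
    info = {
        "impurity": float(tree.impurity[node]),
        "samples": int(tree.n_node_samples[node]),
        # Per-class counts or fractions, depending on the sklearn version.
        "value": tree.value[node][0].tolist(),
    }
    left, right = tree.children_left[node], tree.children_right[node]
    if left != right:  # leaves have both children set to -1
        info["rule"] = "{0} <= {1:.3f}".format(names[tree.feature[node]],
                                               tree.threshold[node])
        info["left"] = tree_to_dict(clf, names, left)    # rule is true
        info["right"] = tree_to_dict(clf, names, right)  # rule is false
    return info


tree_json = json.dumps(tree_to_dict(small_clf, feature_names), indent=2)
print(tree_json[:400])
```

Nodes with low impurity and a large sample count are the ones most likely to be useful rules; the right values for `max_depth` and `min_samples_leaf` depend on the data.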
# Measurement code. Assumes clf is a fitted classifier, test is the list of
# test rows, and test_v holds the matching expected labels (0 or 1).
testsRun = 0
testsPassed = 0
testsFalseNegative = 0
testsFalsePositive = 0
testsPositive = 0
testsNegative = 0

for t in test:
    # predict() expects a 2D array, so wrap the single row in a list
    prediction = clf.predict([t])[0]
    if prediction == 0:
        testsNegative = testsNegative + 1
    else:
        testsPositive = testsPositive + 1
    if prediction == test_v[testsRun]:
        testsPassed = testsPassed + 1
    else:
        if prediction == 0:
            testsFalseNegative = testsFalseNegative + 1
        else:
            testsFalsePositive = testsFalsePositive + 1
    testsRun = testsRun + 1

testsFailed = testsFalsePositive + testsFalseNegative
print("Percent Pass: {0}".format(100 * testsPassed / testsRun))
print("Percent Positive: {0}".format(100 * testsPositive / testsRun))
print("Percent Negative: {0}".format(100 * testsNegative / testsRun))
# The next two lines divide by zero when nothing is misclassified
print("Percent False positive: {0}".format(100 * testsFalsePositive / testsFailed))
print("Percent False negative: {0}".format(100 * testsFalseNegative / testsFailed))
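The same counts can also be read from `sklearn.metrics.confusion_matrix` instead of a hand-written loop. The sketch below assumes binary labels of 0 and 1 and uses made-up `actual` and `predicted` arrays in place of `test_v` and the classifier's predictions. It replaces the division by zero with an explicit message, which is one way to keep the "answer column leaked in" check visible.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical example labels; in practice these would be test_v and the
# output of clf.predict(test).
actual = np.array([0, 0, 1, 1, 1, 0, 1, 0])
predicted = np.array([0, 1, 1, 0, 1, 0, 1, 0])

# For labels [0, 1], ravel() yields true negatives, false positives,
# false negatives, true positives, in that order.
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=[0, 1]).ravel()

total = tn + fp + fn + tp
errors = fp + fn
print("Percent pass: {0}".format(100.0 * (tn + tp) / total))
print("Percent positive: {0}".format(100.0 * (tp + fp) / total))
print("Percent negative: {0}".format(100.0 * (tn + fn) / total))

if errors == 0:
    # A perfect score is suspicious: the answer column may have leaked in.
    print("No misclassifications; check the test columns for the answer column")
else:
    print("Percent false positive: {0}".format(100.0 * fp / errors))
    print("Percent false negative: {0}".format(100.0 * fn / errors))
```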
The post Decision Tree Testing Lessons appeared first on Gary Sieling.