Decision Tree Testing Lessons

I’m running some tests on sklearn decision trees, and the lessons learned so far may be interesting.

I've put my measurement code at the end. It tracks the percentage of correct predictions, the number of positive and negative predictions, and the false positives and false negatives.

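When running predictions, if you have a defect where you include the 'answer' column in the test columns, the measurement code at the end fails with a division by zero (every prediction is correct, so there are no false positives or negatives to divide by), which is a good check.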
For my data, running with criterion='entropy' gives a 2% increase in accuracy, but other people I talked to on Twitter have seen the opposite.
criterion='entropy' is noticeably slower than the default ('gini').
The default decision tree settings create trees that are very deep (~20k nodes for ~100k data points).
For my use case, I found that limiting the depth of trees and forcing each node to have a large number of samples (50-500) made much simpler trees with only a small decrease in accuracy.
As I forced nodes to hold more samples, accuracy dropped by ~0-5%, roughly in line with the minimum number of samples required at each node (50-500).
I found that I needed to remove a lot of my database columns to get a meaningful result. For instance, I originally had ID columns, which let sklearn pick up data created in a certain time window (since the IDs are sequential), but I don't think this is useful for what I want to do.
You have to turn class-based (categorical) attribute values into integers (sklearn appears to use a numpy float type internally for performance reasons).
sklearn appears to only use range-based rules. Combine this with the above and you get a lot of rules like "status > 1.5".
The tree could conceivably generate equality conditions within the structure, although it'd be hard to tell (e.g. "status > 1.5" followed by "status < 2.5" would be equivalent to "status = 2" if status is an integer).
I'm more interested in discovering useful rules than in future predictions, so it helps a lot to generate JSON from the tree.
Within the JSON, the "entropy" and "impurity" fields show you how clean the rule is (0 = good). The "value" field shows how many items fit the rule (small numbers are probably not useful, at least for me).
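
To make the points about simpler trees, range rules and JSON output concrete, here is a minimal sketch. It is not the code I used on my real data: the toy data set, the column names ("status", "amount") and the node_to_dict helper are all made up for illustration, and the helper is not part of sklearn. It only relies on the max_depth and min_samples_leaf parameters and on the arrays exposed by the fitted tree_ attribute.

```python
import json
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: an integer-coded "status" column (0..3) and a
# numeric "amount" column. The label is 1 only when status == 2.
feature_names = ["status", "amount"]
X = [[s, a] for s in range(4) for a in range(50)]
y = [1 if s == 2 else 0 for s, a in X]

# Limit the depth and require a minimum number of samples per leaf
# to keep the tree small (the values here are arbitrary).
clf = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=3,
    min_samples_leaf=10,
    random_state=0,
)
clf.fit(X, y)

def node_to_dict(tree, node_id=0):
    """Recursively convert a fitted sklearn tree into nested dicts."""
    node = {
        "impurity": float(tree.impurity[node_id]),       # 0 = clean rule
        "samples": int(tree.n_node_samples[node_id]),    # rows reaching this node
    }
    left = tree.children_left[node_id]
    right = tree.children_right[node_id]
    if left != right:  # internal node; a leaf has both children set to -1
        name = feature_names[tree.feature[node_id]]
        threshold = float(tree.threshold[node_id])
        # Rows satisfying the rule go left, all others go right.
        node["rule"] = "{0} <= {1}".format(name, threshold)
        node["left"] = node_to_dict(tree, left)
        node["right"] = node_to_dict(tree, right)
    return node

tree_json = json.dumps(node_to_dict(clf.tree_), indent=2)
print(tree_json)
print("nodes:", clf.tree_.node_count)
```

On this toy data the tree should split on "status <= 1.5" and then on "status <= 2.5", so the branch that fails the first rule and passes the second is the hidden "status = 2" condition described above.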

# Assumes `clf` is a fitted classifier, `test` holds the feature rows
# and `test_v` holds the expected label (0 or 1) for each row.
testsRun = 0
testsPassed = 0
testsFalseNegative = 0
testsFalsePositive = 0
testsPositive = 0
testsNegative = 0
for t in test:
  # predict() expects a 2D input, so wrap the single row in a list
  prediction = clf.predict([t])[0]
  if prediction == 0:
    testsNegative = testsNegative + 1
  else:
    testsPositive = testsPositive + 1

  if prediction == test_v[testsRun]:
    testsPassed = testsPassed + 1
  else:
    # A wrong prediction of 0 is a false negative, a wrong 1 a false positive
    if prediction == 0:
      testsFalseNegative = testsFalseNegative + 1
    else:
      testsFalsePositive = testsFalsePositive + 1

  testsRun = testsRun + 1

print("Percent Pass: {0}".format(100 * testsPassed / testsRun))
print("Percent Positive: {0}".format(100 * testsPositive / testsRun))
print("Percent Negative: {0}".format(100 * testsNegative / testsRun))
# Shares of the failed tests; this raises ZeroDivisionError when nothing failed
testsFailed = testsFalsePositive + testsFalseNegative
print("Percent False positive: {0}".format(100 * testsFalsePositive / testsFailed))
print("Percent False negative: {0}".format(100 * testsFalseNegative / testsFailed))
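
The same counts can probably be had with less code from sklearn's confusion_matrix function. This is a sketch rather than what I actually ran: the two lists below are made-up stand-ins for test_v and for the output of clf.predict(test), and unlike the loop above it reports accuracy only, not the split of failures.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical data standing in for test_v (actual) and clf.predict(test) (predicted).
actual = [0, 0, 1, 1, 0, 1, 0, 0]
predicted = [0, 1, 1, 0, 0, 1, 0, 1]

# With labels=[0, 1], rows are actual classes and columns are predicted classes,
# so ravel() yields: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=[0, 1]).ravel()

total = tn + fp + fn + tp
accuracy = 100.0 * (tp + tn) / total

print("Percent Pass: {0}".format(accuracy))
print("False positives: {0}, false negatives: {1}".format(fp, fn))
```

Note that this version does not divide by the number of failures, so it will not blow up when every prediction is correct. If you rely on that crash as a leak check, test for fp + fn == 0 explicitly.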

The post Decision Tree Testing Lessons appeared first on Gary Sieling.
