{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook explains the answer to problem 6 on the Spring 2016 471 final exam, Construct a decision tree to predict y given the inputs from this data using the ID3 algorithm that selects the variable at each level that maximizes the information gained. The training data is:\n", "\n", " A1 A2 A3 y\n", " \n", " 1 0 0 0\n", " 1 0 1 0\n", " 0 1 0 0\n", " 1 1 1 1\n", " 1 1 0 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from math import log\n", "from __future__ import division # so 1/2 produces a real, rather than truncating to 0\n", "def log2(n):\n", " # a bit of a hack for n==0 to avoid errors\n", " return 0 if n == 0 else log(n,2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A distribution is a list of real numbers that sum to 1.0, e.g., [.125, .125, .25, .5].\n", "The information (or entropy) in a distibution is defined as follows." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def info(distribution):\n", " assert 1.0 == sum(distribution)\n", " return sum ([x * -log2(x) for x in distribution]) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can compute the amount of information in our training data by noting that we have three 0 values and two 1 values for y, the classification variable" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "info([1.0]) ==> 0.0\n", "info([0.5, 0.5]) ==> 1.0\n", "info([0.0, 1.0]) ==> 0.0\n", "info([0.25, 0.25, 0.25, 0.25]) ==> 2.0\n", "info([0.125, 0.125, 0.25, 0.5]) ==> 1.75\n" ] } ], "source": [ "# here are some examples of information in some distributions\n", "for d in [[1.0], [0.5,0.5], [0.0, 1.0], [0.25,0.25,0.25,0.25], [.125, .125, .25, .5]]:\n", " print \"info({}) ==> {}\".format(d, info(d))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The informaion in our training data is just the distribution of of the values for our classification variable y, which has three 0 values and two 1 values" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.970950594455\n" ] } ], "source": [ "i = info([3/5, 2/5])\n", "print i" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Asking about A1, splits the dataset into two, with A1=0 having one value {0} and A1=1 having four {0,0,1,1}\n", "We can compute the entropy as the weighted sum of the entropy of each subtree. 
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": { "collapsed": false },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Ask A1: resulting entropy is 0.8, information gain is 0.170950594455\n"
     ]
    }
   ],
   "source": [
    "# Information remaining after asking about A1\n",
    "i_a1 = (1/5)*info([1/1, 0/1]) + (4/5)*info([2/4, 2/4])\n",
    "print \"Ask A1: resulting entropy is {}, information gain is {}\".format(i_a1, i - i_a1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Asking about A2 splits the dataset into two subsets: A2=0 has two y values {0, 0}, and A2=1 has three {0, 1, 1}."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": { "collapsed": false },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Ask A2: resulting entropy is 0.550977500433, information gain is 0.419973094022\n"
     ]
    }
   ],
   "source": [
    "# Information remaining after asking about A2\n",
    "i_a2 = (2/5)*info([2/2, 0/2]) + (3/5)*info([1/3, 2/3])\n",
    "print \"Ask A2: resulting entropy is {}, information gain is {}\".format(i_a2, i - i_a2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Asking about A3 splits the dataset into two subsets: A3=0 has three y values {0, 0, 1}, and A3=1 has two {0, 1}."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": { "collapsed": false },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Ask A3: resulting entropy is 0.950977500433, information gain is 0.019973094022\n"
     ]
    }
   ],
   "source": [
    "# Information remaining after asking about A3\n",
    "i_a3 = (3/5)*info([2/3, 1/3]) + (2/5)*info([1/2, 1/2])\n",
    "print \"Ask A3: resulting entropy is {}, information gain is {}\".format(i_a3, i - i_a3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So asking about A2 yields the largest information gain (about 0.42 bits), and ID3 makes A2 the root of the decision tree."
   ]
  },
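  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of the next ID3 step (not part of the original exam write-up): the A2=0 branch is already pure, since both of its examples have y=0, so only the A2=1 branch, with y values {0, 1, 1}, needs a further split. Repeating the information-gain calculation inside that branch (below) shows that A1 separates its examples perfectly while A3 does not, so A1 would be chosen next. The resulting tree predicts 0 when A2=0, 0 when A2=1 and A1=0, and 1 when A2=1 and A1=1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": { "collapsed": false },
   "outputs": [],
   "source": [
    "# Sketch of the second level of the tree, assuming info() from above.\n",
    "# Within the A2=1 branch the remaining examples are:\n",
    "#   A1 A3  y\n",
    "#    0  0  0\n",
    "#    1  1  1\n",
    "#    1  0  1\n",
    "i_branch = info([1/3, 2/3])                          # entropy of the A2=1 branch\n",
    "i_b_a1 = (1/3)*info([1/1]) + (2/3)*info([2/2])       # A1 splits it into {0} and {1, 1}\n",
    "i_b_a3 = (2/3)*info([1/2, 1/2]) + (1/3)*info([1/1])  # A3 splits it into {0, 1} and {1}\n",
    "print \"A2=1 branch: gain(A1) = {}, gain(A3) = {}\".format(i_branch - i_b_a1, i_branch - i_b_a3)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}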