{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook explains the answer to problem 6 on the Spring 2016 471 final exam, Construct a decision tree to predict y given the inputs from this data using the ID3 algorithm that selects the variable at each level that maximizes the information gained. The training data is:\n", "\n", " A1 A2 A3 y\n", " \n", " 1 0 0 0\n", " 1 0 1 0\n", " 0 1 0 0\n", " 1 1 1 1\n", " 1 1 0 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from math import log\n", "from __future__ import division # so 1/2 produces a real, rather than truncating to 0\n", "def log2(n):\n", " # a bit of a hack for n==0 to avoid errors\n", " return 0 if n == 0 else log(n,2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A distribution is a list of real numbers that sum to 1.0, e.g., [.125, .125, .25, .5].\n", "The information (or entropy) in a distibution is defined as follows." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def info(distribution):\n", " assert 1.0 == sum(distribution)\n", " return sum ([x * -log2(x) for x in distribution]) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can compute the amount of information in our training data by noting that we have three 0 values and two 1 values for y, the classification variable" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "info([1.0]) ==> 0.0\n", "info([0.5, 0.5]) ==> 1.0\n", "info([0.0, 1.0]) ==> 0.0\n", "info([0.25, 0.25, 0.25, 0.25]) ==> 2.0\n", "info([0.125, 0.125, 0.25, 0.5]) ==> 1.75\n" ] } ], "source": [ "# here are some examples of information in some distributions\n", "for d in [[1.0], [0.5,0.5], [0.0, 1.0], [0.25,0.25,0.25,0.25], [.125, .125, .25, .5]]:\n", " print \"info({}) ==> {}\".format(d, info(d))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The informaion in our training data is just the distribution of of the values for our classification variable y, which has three 0 values and two 1 values" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.970950594455\n" ] } ], "source": [ "i = info([3/5, 2/5])\n", "print i" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Asking about A1, splits the dataset into two, with A1=0 having one value {0} and A1=1 having four {0,0,1,1}\n", "We can compute the entropy as the weighted sum of the entropy of each subtree. 
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": { "collapsed": false },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Ask A1: resulting entropy is 0.8, information gain is 0.170950594455\n"
     ]
    }
   ],
   "source": [
    "# Information remaining after asking about A1\n",
    "i_a1 = (1/5)*info([1/1, 0/1]) + (4/5)*info([2/4, 2/4])\n",
    "print \"Ask A1: resulting entropy is {}, information gain is {}\".format(i_a1, i - i_a1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Asking about A2 splits the dataset into two subsets: A2=0 has two y values {0, 0}, and A2=1 has three {0, 1, 1}."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": { "collapsed": false },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Ask A2: resulting entropy is 0.550977500433, information gain is 0.419973094022\n"
     ]
    }
   ],
   "source": [
    "# Information remaining after asking about A2\n",
    "i_a2 = (2/5)*info([2/2, 0/2]) + (3/5)*info([1/3, 2/3])\n",
    "print \"Ask A2: resulting entropy is {}, information gain is {}\".format(i_a2, i - i_a2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Asking about A3 splits the dataset into two subsets: A3=0 has three y values {0, 0, 1}, and A3=1 has two {0, 1}."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": { "collapsed": false },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Ask A3: resulting entropy is 0.950977500433, information gain is 0.019973094022\n"
     ]
    }
   ],
   "source": [
    "# Information remaining after asking about A3\n",
    "i_a3 = (3/5)*info([2/3, 1/3]) + (2/5)*info([1/2, 1/2])\n",
    "print \"Ask A3: resulting entropy is {}, information gain is {}\".format(i_a3, i - i_a3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So asking about A2 yields the largest information gain (about 0.42 bits), and ID3 makes A2 the root of the decision tree."
   ]
  },
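  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sketch of the next ID3 step (not part of the original exam write-up): the A2=0 branch is already pure, since both of its examples have y=0, so only the A2=1 branch, with y values {0, 1, 1}, needs a further split. Repeating the information-gain calculation inside that branch (below) shows that A1 separates its examples perfectly while A3 does not, so A1 would be chosen next. The resulting tree predicts 0 when A2=0, 0 when A2=1 and A1=0, and 1 when A2=1 and A1=1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": { "collapsed": false },
   "outputs": [],
   "source": [
    "# Sketch of the second level of the tree, assuming info() from above.\n",
    "# Within the A2=1 branch the remaining examples are:\n",
    "#   A1 A3  y\n",
    "#    0  0  0\n",
    "#    1  1  1\n",
    "#    1  0  1\n",
    "i_branch = info([1/3, 2/3])                          # entropy of the A2=1 branch\n",
    "i_b_a1 = (1/3)*info([1/1]) + (2/3)*info([2/2])       # A1 splits it into {0} and {1, 1}\n",
    "i_b_a3 = (2/3)*info([1/2, 1/2]) + (1/3)*info([1/1])  # A3 splits it into {0, 1} and {1}\n",
    "print \"A2=1 branch: gain(A1) = {}, gain(A3) = {}\".format(i_branch - i_b_a1, i_branch - i_b_a3)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}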