ML6375 ID3 project 1
The goal of this project is to implement a program for inferring imperfect decision trees.
Theory
We discussed in class the classic ID-3 approach for inferring decision trees. The algorithm keeps
splitting nodes as long as the nodes have nonzero entropy and features are available. This project’s
goal is to infer imperfect decision trees with a small number of nodes and with small entropy.
Consider a tree that partitions the instances into k leaves: S = {s_1, s_2, ..., s_k}. The target
attribute is T, and the purpose of the tree is to enable us to infer the target attribute. As in the
case of ID-3, our goal is to minimize:

Entropy(T|S) = \sum_{j=1}^{k} Prob(s_j) Entropy(T|s_j)
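For concreteness, this conditional entropy can be computed directly from counts. The Python sketch below shows one way to do it; the helper names entropy and partition_entropy, and the data layout, are illustrative only and not required by the project.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of target values."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def partition_entropy(target, leaves):
    """Entropy(T|S): weighted entropy of the target over the leaves of a partition.

    target[i] is the target value of instance i; leaves is a list of lists of
    instance indices that together cover all instances exactly once.
    """
    n = sum(len(leaf) for leaf in leaves)
    return sum(
        (len(leaf) / n) * entropy([target[i] for i in leaf])
        for leaf in leaves
    )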
A greedy approach for inferring such a tree is to iteratively determine what node should be partitioned further. The contribution of a single leaf s_j to the total uncertainty is:

E_j = Prob(s_j) Entropy(T|s_j) = (|s_j| / n) Entropy(T|s_j)    (1)
where |s_j| is the number of instances in s_j and n is the total number of instances. Therefore, it
makes sense to choose the s_j with a maximal E_j. A better criterion is obtained using lookahead. Let
A_1, ..., A_m be the attributes. Let T_j be the target attribute restricted to the subset s_j. If we choose
s_j, it is to be partitioned using the attribute A_i that minimizes Entropy(T_j|A_i), computed as:
Entropy(T_j|A_i) = \sum_{l} (|a_l| / |s_j|) Entropy(T_j|a_l)

where the sum runs over the values of A_i and a_l is the subset of s_j in which A_i takes its l-th value.
The total change in the entropy of the tree is:

E_{ji} = Prob(s_j) (Entropy(T_j|A_i) - Entropy(T|s_j)) = -Prob(s_j) Gain(s_j, A_i)

where Gain(s_j, A_i) = Entropy(s_j) - Entropy(s_j|A_i). Therefore, we should choose the s_j that
minimizes E_{ji}, or, equivalently, the one that maximizes:

F = max_{i,j} (|s_j| / n) Gain(s_j, A_i)    (2)
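One possible greedy selection step based on criterion (2) is sketched below in Python, under the assumption that instances, attribute values, and leaves are held in plain lists and dictionaries; the function name choose_split and the data layout are assumptions for illustration, not part of the assignment.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of target values."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def choose_split(data, target, leaves, n_attributes):
    """Pick the (leaf, attribute) pair maximizing F = (|s_j|/n) * Gain(s_j, A_i).

    data[i][a] is the value of attribute a for instance i, target[i] its target
    value, and leaves maps a leaf id to the list of its instance indices.
    Returns (leaf_id, attribute, F) or None if no split has positive gain.
    """
    n = sum(len(ids) for ids in leaves.values())
    best = None
    for leaf_id, ids in leaves.items():
        base = entropy([target[i] for i in ids])           # Entropy(s_j)
        for a in range(n_attributes):
            # group the leaf's instances by the value of attribute a
            groups = {}
            for i in ids:
                groups.setdefault(data[i][a], []).append(i)
            cond = sum(                                     # Entropy(s_j | A_i)
                (len(g) / len(ids)) * entropy([target[i] for i in g])
                for g in groups.values()
            )
            f = (len(ids) / n) * (base - cond)              # (|s_j|/n) * Gain
            if best is None or f > best[2]:
                best = (leaf_id, a, f)
    return best if best is not None and best[2] > 0 else None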
Example
    A1  A2  A3  Target attribute
 1   R   Y   0   a
 2   R   X   0   a
 3   R   X   0   b
 4   R   X   1   a
 5   G   X   1   b
 6   G   X   1   a
 7   G   X   1   a
 8   G   X   1   b
 9   B   Y   1   a
10   B   Y   0   b
Starting with the single leaf {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, this leaf must be chosen as the first one to split.
Entropy(T) = 0.970951
Entropy(T|A1) = 0.924511 Gain(T, A1) = 0.0464393
Entropy(T|A2) = 0.965148 Gain(T, A2) = 0.00580215
Entropy(T|A3) = 0.95098 Gain(T, A3) = 0.0199731
Therefore, the best choice is A1. This produces the following partition into three leaves:
s1 = {1, 2, 3, 4}, s2 = {5, 6, 7, 8}, s3 = {9, 10}
We have:
Entropy(s1) = 0.811278
Entropy(s1|A1) = 0.811278 Gain(s1, A1) = 0
Entropy(s1|A2) = 0.688722 Gain(s1, A2) = 0.122556
Entropy(s1|A3) = 0.688722 Gain(s1, A3) = 0.122556
F1 = 0.4 × 0.122556 = 0.0490224
Entropy(s2) = 1
Entropy(s2|A1) = 1 Gain(s2, A1) = 0
Entropy(s2|A2) = 1 Gain(s2, A2) = 0
Entropy(s2|A3) = 1 Gain(s2, A3) = 0
F2 = 0.4 × 0 = 0
Entropy(s3) = 1
Entropy(s3|A1) = 1 Gain(s3, A1) = 0
Entropy(s3|A2) = 1 Gain(s3, A2) = 0
Entropy(s3|A3) = 0 Gain(s3, A3) = 1
F3 = 0.2 × 1 = 0.2
Therefore, the next leaf to split is s3, and it is split according to A3. This produces the following
partition into four leaves:
s1 = {1, 2, 3, 4}, s2 = {5, 6, 7, 8}, s4 = {9}, s5 = {10}
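As a sanity check, the numbers in the first step of this example can be reproduced with the short Python script below; it is only a verification aid, not part of the required program.

import math
from collections import Counter

def H(labels):
    """Shannon entropy (base 2) of a non-empty list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

target = list("aababaabab")        # Target column for instances 1..10
a1     = list("RRRRGGGGBB")        # A1 column for instances 1..10

n = len(target)
cond = 0.0
for v in set(a1):                  # weighted entropy of the A1-induced subsets
    idx = [i for i in range(n) if a1[i] == v]
    cond += (len(idx) / n) * H([target[i] for i in idx])

print(round(H(target), 6))         # 0.970951 = Entropy(T)
print(round(cond, 6))              # 0.924511 = Entropy(T|A1)
print(round(H(target) - cond, 6))  # 0.046439 = Gain(T, A1)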
Input/Output
The dataset is described in a file containing only integer numbers. The first number in the file is m,
the number of instances; the second is n, the number of features. These two numbers are followed
by m × n feature values. The last attribute is taken as the target attribute. Example:
————————–
10 4
0 0 0 0
0 0 1 0
1 0 2 0
2 2 0 1
0 0 1 1
0 1 2 1
1 1 1 1
1 2 2 0
0 0 1 0
2 0 1 0
————————-
You may assume that a feature has no more than 3 distinct values, so that the numbers in each
column can be either 0, 1, or 2. The target attribute (the last column) is binary, so that its values
are 0 or 1.
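A reader for this dataset format might look roughly like the following sketch; the function name read_dataset and the return layout are placeholders, not part of the specification.

def read_dataset(path):
    """Read the dataset format described above: m, n, then m * n integers.

    Returns (features, target), where features[i] holds the first n-1 column
    values of instance i+1 and target[i] holds its last-column (target) value.
    """
    with open(path) as f:
        tokens = [int(tok) for tok in f.read().split()]
    m, n = tokens[0], tokens[1]
    values = tokens[2:2 + m * n]
    rows = [values[r * n:(r + 1) * n] for r in range(m)]
    features = [row[:-1] for row in rows]   # first n-1 columns are the features
    target = [row[-1] for row in rows]      # last column is the target attribute
    return features, target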
A partition is described in a text file. For a situation where there are m instances, a partition
is described by a sequence of numbers in the range 1 ... m.
Each partition is specified on a separate line. The first string in a row describing a partition is the
partition id.
With 10 instances, here is the file representing the partition {1,10}, {2,3,4,5}, {6,7,8,9}.
The first partition has id = X, the second id = Y2, and the id of the third is Z.
————————–
X 1 10
Y2 2 3 4 5
Z 6 7 8 9
————————-
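Reading and writing this partition format can be done along the following lines; the function names read_partition and write_partition are placeholders.

def read_partition(path):
    """Read a partition file: each line is an id followed by 1-based instance ids."""
    partition = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            partition[fields[0]] = [int(x) for x in fields[1:]]
    return partition

def write_partition(path, partition):
    """Write a partition in the same one-line-per-leaf format."""
    with open(path, "w") as f:
        for pid, ids in partition.items():
            f.write(pid + " " + " ".join(str(i) for i in ids) + "\n")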
Your program should read as input one dataset file and one partition file. It should determine
which partition to split, and according to what attribute. This information should be printed as a
message. The program should then produce as output the new partition file. Example:

MyProgram
Enter names of the files dataset input-partition output-partition

dataset-1.txt partition-2.txt partition-3.txt
Partition Z was replaced with partitions Z1,Z2 using Feature 3
If the file partition-2.txt is as specified above, the file partition-3.txt might be:
————————–
X 1 10
Y2 2 3 4 5
Z1 7
Z2 6 8 9
————————-
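Putting the pieces together, one possible driver consistent with this interaction is sketched below. It assumes the read_dataset, read_partition, write_partition, and choose_split helpers sketched earlier are in scope; the prompt and message wording is only approximated from the example above, and forming the new partition ids by appending 1, 2, ... to the old id is an assumption based on the Z1, Z2 example.

def main():
    print("Enter names of the files dataset input-partition output-partition")
    dataset_file, in_part_file, out_part_file = input().split()

    features, target = read_dataset(dataset_file)    # dataset reader sketched above
    partition = read_partition(in_part_file)         # partition reader sketched above

    # instance ids in the files are 1-based; work with 0-based indices internally
    leaves = {pid: [i - 1 for i in ids] for pid, ids in partition.items()}
    choice = choose_split(features, target, leaves, len(features[0]))
    if choice is None:
        print("No split reduces the entropy")
        return
    leaf_id, attr, _ = choice

    # split the chosen leaf by the chosen attribute; name the parts Z1, Z2, ...
    groups = {}
    for i in leaves[leaf_id]:
        groups.setdefault(features[i][attr], []).append(i + 1)
    new_partition = {pid: ids for pid, ids in partition.items() if pid != leaf_id}
    new_ids = []
    for k, value in enumerate(sorted(groups), start=1):
        new_partition[leaf_id + str(k)] = groups[value]
        new_ids.append(leaf_id + str(k))

    print("Partition %s was replaced with partitions %s using Feature %d"
          % (leaf_id, ",".join(new_ids), attr + 1))
    write_partition(out_part_file, new_partition)     # writer sketched above

if __name__ == "__main__":
    main()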
Requirements:
Submission must be on eLearning.