Description
Assignment 1 COMP 250 with Solution
Market Place
For this assignment you will write several classes to simulate an online Market place. Make sure to
follow the instructions below very closely. Note that in addition to the required methods, you are free
to add as many other private methods as you want (no other additional method is allowed).
[10 points] Write an abstract class MarketProduct which has the following private field:
• A String name
The class must also have the following public methods:
• A constructor that takes a String as input indicating the name of product and uses it
to initialize the corresponding attribute.
• A final getName() method to retrieve the name of this MarketProduct.
• An abstract method getCost() which takes no input and returns an int. This method
should be abstract (thus, not implemented) because how to determine the cost depends
on the type of product.
• An abstract method equals() which takes an Object as an input and returns a
boolean. This method should be abstract as well, since depending on the type of product
different conditions should be met in order for two products to be considered equal.
[25 points] All of the followings must be subclasses of the MarketProduct:
• Write a class Egg derived from the MarketProduct class. The Egg class has the following
private fields:
– An int indicating the number of eggs.
– An int indicating the price per dozen of these eggs. Note that the all the prices
(throughout the assignment) are indicated in cents. For instance, 450 represents the
amount $4.50.
The Egg class has also the following public methods:
– A constructor that takes as input a String with the name of the product, an int
indicating the number required, and an int indicating the price of such product by
the dozen. The constructor uses the inputs to create a MarketProduct and initialize
the corresponding fields.
– A getCost() method that takes no input and returns the cost of the product in
cents. The cost is computed base on the number required and the cost per dozen.
For instance, 4 large brown eggs at 380 cents/dozen cost 126 cents (the cost should be
rounded down to the nearest cent). You may assume that cost of all MarketProducts
fits within an int and therefore doesn’t cause overflow.
– An equals() method which takes as input an Object and return true if input
matches this in type, name, cost and number of eggs. Otherwise the method returns
false.
3
• Write a class Fruit derived from the MarketProduct class. The Fruit class has the
following private fields:
– A double indicating the weight in kg.
– An int indicating the price per kg in cents.
The Fruit class has also the following public methods:
– A constructor that takes as input a String with the name of the product, a double
indicating the weight in kg, and an int indicating the price per kg of such product. The constructor uses the inputs to create a MarketProduct and initialize the
corresponding fields.
– A getCost() method that takes no input and returns the cost of the product in
cents. The cost is computed based on the weight and the price per kilogram. For
instance, 1.25 kgs of asian pears at 530 cents per kg cost 662 cents.
– An equals() method just like the Egg class, which matches type, name, weight and
cost.
• Write a class Jam derived from the MarketProduct class. The Jam class has the following
private fields:
– An int indicating the number of jars.
– An int indicating the price per jar in cents.
The Jam class has also the following public methods:
– A constructor that takes as input a String with the name of the product, an int
indicating the number of jars, and an int indicating the price per jar of such product. The constructor uses the inputs to create a MarketProduct and initialize the
corresponding fields.
– A getCost() method that takes no input and returns the cost of the product in
cents. The cost is computed based on the number of jars and their price. For
instance, 2 jars of Strawberry jam at 475 cents per jar cost 950 cents.
– An equals() method like in the previous classes.
• Write a class SeasonalFruit derived from the Fruit class. The SeasonalFruit class
has no fields, but it has the following public methods:
– A constructor that takes as input a String with the name of the product, a double
indicating the weight in kg, and an int indicating the price per kg of such product.
The constructor uses the inputs to create a Fruit.
– A getCost() method that takes no input and returns the cost of the product in
cents. Since this type of Fruit is in season, its original cost should receive a 15%
discount. For instance, 0.5 kgs of McIntosh apples at 480 cents per kg cost 204 cents.
4
[40 points] Write a class Basket representing a list of market products. Note that the instructions
on how to implement this class are not always very specific. This is intentional, since your
assignment will not be tested on the missing details of the implementation. Note though, that
your choices will make a difference in terms of how efficient your code will be. We will not
be deducting point for inefficient code in Assignment 1. Note once again that, you are NOT
allowed to import any other class (including ArrayList or LinkedList). The class has (at
least) the following private field:
• An array of MarketProducts.
The class must also have the following public methods:
• A constructor that takes no inputs and initialize the field with an empty array.
• A getProducts() method which takes no inputs and returns a shallow copy of the array
(NOT a copy of the reference!) of MarketProducts of this basket (with the elements in
the same order).
• An add() method which takes as input a MarketProduct and does not return any value.
The method adds the MarketProduct at the end of the list of products of this basket.
• A remove() method which takes as input a MarketProduct and returns a boolean. The
method removes the first occurrence of the specified element from the array of products
of this basket. If no such product exists, then the method returns false, otherwise,
after removing it, the method returns true. Note that this method removes a product
from the list if and only if such product is equal to the input received. For example,
it is not possible to remove 0.25 Kg of McIntosh apples for a 0.5 Kg McIntosh apples
MarketProduct. After the product has been remove from the array, the subsequent
elements should be shifted down by one position, leaving no unutilized slots in the
middle array.
• A clear() method which takes no inputs, returns no values, and empties the array of
products of this basket.
• A getNumOfProducts() method that takes no inputs and returns the number of products
present in this basket.
• A getSubTotal() method that takes no inputs and returns the cost (in cents) of all the
products in this basket.
• A getTotalTax() method that takes no inputs and returns the tax amount (in cents) to
be paid based on the product in this basket. Since we are in Quebec, you can use a tax
rate of 15%. The tax amount should be rounded down to the nearest cent. Note that
Egg and Fruit are tax free, so taxes should be paid only for Jam.
• A getTotalCost() method that takes no inputs and returns the total cost (in cents and
after tax) of the products in this basket.
• A toString() method that returns a String representing a receipt for this basket. The
receipt should contain a product per line. On each line the name of the product should
5
appear, followed by its cost separated by a tab character. After all the products have
been listed, the following information should appear on each line:
– An empty line
– The subtotal cost
– The total tax
– An empty line
– The total cost
Note that all the integer number of cents should be transformed into a String formatted
in dollars and cents (you can write a helper method to do so if you’d like). If the number
of cents is represented by an int that is less than or equal to 0, then it should be
transformed into a String containing only the hyphen character (“-“). An example of
a receipt is as follows:
Quail eggs 4.00
McIntosh apples 6.12
Asian Pears 4.24
Blueberry Jam 4.75
Blueberry Jam 4.75
Subtotal 23.86
Total Tax 1.42
Total Cost 25.28
[25 points] Write a class Customer which has the following private fields:
• A String name
• An int representing the balance (in cents) of the customer
• A Basket containing the products the customer would like to buy.
The class must also have the following public methods:
• A constructor that takes as input a String indicating the name of the customer, and
an int representing their initial balance. The constructor uses its inputs and creates an
empty Basket to initialize the corresponding fields.
• A getName() and a getBalance() method which return the name and balance (in cents)
of the customer respectively.
• A getBasket() method which returns the reference to the Basket of the customer (no
copy of the Basket is needed).
• An addFunds() method which takes an int as input representing the amount of cents
to be added to the balance of the customer. If the input received is negative, the method
6
should throw an IllegalArgumentException with an appropriate message. Otherwise,
the method will simply update the balance and return the new balance in cents.
• An addToBasket() method which takes a MarketProduct as input and adds it to the
basket of the customer. This method should not return anything.
• A removeFromBasket() method which takes a MarketProduct as input and removes it
from the basket of the customer. The method returns a boolean indicating whether of
not the operation was successful.
• A checkOut() method which takes no input and returns the receipt for the customer
as a String. If the customer’s balance is not enough to cover the total cost of their
basket, then the method throws an IllegalStateException. Otherwise, the customer
is charged the total cost of the basket, the basket is cleared, and a receipt is returned.
7
Assignment 2 COMP 250 with Solution
Learning Objectives
There are several learning goals for this assignment.
First, you will get some exposure to some simple cryptography. We’ll introduce the idea behind
one-time pad and you will implement an example of a stream cipher.
Second, in this assignment you will also get some experience working with linked lists. You will
implement a data structure to represent a deck of cards. This data structure is implemented as a
circular doubly linked list.
Third, in this assignment we will start to focus also on the efficiency of your algorithms. You will
learn to look at code with a more critical eye, without only focusing on the correctness of your
methods.
Lastly, this assignment will also give you more practice programming in Java! Although COMP
250 is not a course about how to program, programming is a core part of computer science and the
more practice you get, the better you will become at it.
Introduction
In 1917, Vernam patented a cipher now called one-time pad encryption scheme. The point of an
encryption scheme is to transform a message so that only those authorized will be able to read it.
One-time pad was later (in 1949) proved to be perfectly secret. The idea behind one-time pad is
that given a plaintext message of length n, a uniformly random stream of digits of length n (which
is the key) is generated and then used to encode the message. The message is concealed by replacing
each character in the plaintext, with a character obtained combining the original one with one of the
digits in the given key. Of course the message can be retrieved by performing the inverse operation
2
on the characters of the encoded message (the ciphertext). Only those with access to the key can
encode and decode a message. One-time pad is perfectly secret, but it has a number of drawbacks:
for it to be secure, the key is required to be as long as the message, and it can only be used once!
This clearly makes the cipher not a convenient one to use. Unfortunately, it was also proven that
the limitations of one-time-pad are inherent to the definition of perfect secrecy. This means that to
overcome those limitations the security requirements have to be relaxed.
Stream ciphers use the same idea of one-time pad encryption scheme except that a pseudorandom sequence of digits is used as the pad instead of a random one. The idea is to use what are
called ‘pseudorandom generators’ which given a smaller key can generate streams of pseudorandom
digits.
In Neal Stephenson’s novel Cryptonomicon, two of the main characters are able to covertly communicate with one another with a deck of playing cards and knowledge of the Solitaire encryption
algorithm, which was created (in real life) by Bruce Schneier. The novel includes the description of
the algorithm, but you can also find a revised version on the web1
.
The Solitaire encryption algorithm is an example of a stream cipher. The key in this case is the deck
of cards in its initial configuration. If two parties, Alice and Bob, share the same deck, following
the Solitaire encryption algorithm they will be able to communicate by encoding and decoding
messages. Of course, the deck and its configuration (i.e. the key) has to be kept secret to achieve
secrecy. To encode and decode messages, Alice and Bob use the deck to generate a pseudorandom
keystream which is then used as the “pad”.
Encode/Decode with Solitaire
Given a message to encode, we need to first remove all non–letters and convert any lower–case letters
to upper–case. We then use the keystream of values and convert each letter to the letter obtained
by shifting the original one a certain number of positions to the right on the alphabet. This number
is the one found in the keystream in the same position as the character we are encoding.
Decryption is just the reverse of encryption. Using the same keystream that was used to generate
the ciphertext, convert each letter to the letter obtained by shifting the original one the given
number of positions to the left on the alphabet.
For example, let’s say that Alice wants to send the following message:
Is that you, Bob?
Then she will first remove all the non-letters and capitalize all the remaining ones obtaining the
following:
ISTHATYOUBOB
She will then generate a keystream of 12 values. We’ll talk about the keystream generation in the
next section, so let’s assume that the keystream is the following:
11 9 23 7 10 25 11 11 7 8 9 3
1See https://en.wikipedia.org/wiki/Solitaire (cipher), or https://www.schneier.com/academic/solitaire/
3
Finally, she can generate the ciphertext by shifting each letter the appropriate number of positions
to the right in the alphabet. For example, the ‘I’ shifted 11 positions to the right, becomes a ‘T’.
The ‘S’ shifted 9 positions to the right becomes a ‘B’. And so on! The final ciphertext will be:
TBQOKSJZBJXE
Bob, upon receiving the message, will need to generate the keystream. If Alice and Bob shared the
same key and used it to generate the same number of pseudorandom values, then the keystream
generated in this moment by Bob will be equal to that used by Alice to encrypt the message. All
there’s left for Bob to do is convert all the letters by shifting them the appropriate number of
position to the left.
Generating a Keystream Using a Deck of Cards
The harder part of the Solitaire encryption algorithm is generating the keystream. The idea is to
use a deck of playing cards plus two jokers (a red one and a black one). Each card is associated with
a value which depends on its rank and its suit. Cards in order from Ace to King have value 1 to 13
respectively. This value can increase by a multiple of 13 depending on the suit of the card. For this
section let’s assume we’ll use the Bridge ranking for suits: clubs (lowest), followed by diamonds,
hearts, and spades (highest). So, for instance, the Ace of clubs has value 1, while the 5 of diamonds
has value 18, and the Queen of spades has value 51. The jokers have a value that depends on the
number of cards in the deck. If the deck has a total of 54 cards (the 52 playing cards plus the two
jokers), then the jokers have value 53. If the deck has total of 28 cards, then the jokers have value
27. That is, the jokers have both the same value and this value is equal to the total number of
cards in the deck minus one.
The keystream values depend solely on the deck’s initial configuration. We will implement the deck
as a circular doubly linked list with the cards as nodes. This means that the first card (the one on
the top of the deck) is linked to the last card (the one at the bottom of the deck) and the last card
is linked to the first one. As an example, let’s consider a deck with 28 cards: the 13 of both clubs
and diamonds, plus the two jokers. Let’s also consider the following initial configuration2
:
AC 4C 7C 10C KC 3D 6D 9D QD BJ 3C 6C 9C QC 2D 5D 8D JD RJ 2C 5C 8C JC AD 4D 7D 10D KD
The cards are represented with their rank, followed by their suit. For example, 6C denotes the 6 of
clubs, JD the Jack of diamonds, and RJ the red joker.
Here are the steps to take to generate one value of the keystream:
1. Locate the red joker and move it one card down. (That is, swap it with the card beneath it.)
If the joker is the bottom card of the deck, move it just below the top card. There is no way
for it to become the first card. After this step, the deck above will look as follows:
AC 4C 7C 10C KC 3D 6D 9D QD BJ 3C 6C 9C QC 2D 5D 8D JD 2C RJ 5C 8C JC AD 4D 7D 10D KD
2Note that this is the same example you find on the wikipedia page
https://en.wikipedia.org/wiki/Solitaire (cipher) where instead of the cards, they list the values.
4
2. Locate the black joker and move it two cards down. If the joker is the bottom card of the
deck, move it just below the second card. If the joker is one up from the bottom card, move
it just below the top card. There is no way for it to become the first card. After this step,
the deck above will look as follows:
AC 4C 7C 10C KC 3D 6D 9D QD 3C 6C BJ 9C QC 2D 5D 8D JD 2C RJ 5C 8C JC AD 4D 7D 10D KD
3. Perform a “triple cut”: that is, swap the cards above the first joker with the cards below the
second joker. Note that here we use “first” and “second” joker to refer to whatever joker is
nearest to, and furthest from, the top of the deck. Their colors do not matter. Note that the
jokers and the cards between them do not move! If there are no cards in one of the three
sections (either the jokers are adjacent, or one is on top or the bottom), just treat that section
as empty and move it anyway. The deck will now look as follows:
5C 8C JC AD 4D 7D 10D KD BJ 9C QC 2D 5D 8D JD 2C RJ AC 4C 7C 10C KC 3D 6D 9D QD 3C 6C
4. Perform a “count cut”: look at the value of the bottom card. Remove that number of cards
from the top of the deck and insert them just above the last card in the deck. The deck will
now look as follows:
10D KD BJ 9C QC 2D 5D 8D JD 2C RJ AC 4C 7C 10C KC 3D 6D 9D QD 3C 5C 8C JC AD 4D 7D 6C
5. Finally, look at the value of the card on the top of the deck. Count down that many cards.
(Count the top card as number one.) If you hit a joker, ignore it and repeat the keystream
algorithm. Otherwise, use the value of the card you counted to as the next keystream value.
Note that this step does not modify the state of the deck. In our example, the top card is
a 10 of diamonds which has value 23. By counting down to the 24th card we find the Jack
of clubs which has value 11. Hence, 11 would be the first keystream value generated by our
deck.
Instructions and Starter Code
As mentioned in the section before we will use a circular doubly linked list to represent a deck of
cards. The starter code contains two files with five classes which are as follows:
• Deck – This class defines a deck of cards. Most of your work goes into this file. This class
contains three nested classes: Card, PlayingCard, and Joker.
• SolitaireCipher – This class represents a stream cipher that uses the Solitaire algorithm to
generate the keystream and then encode/decode messages.
Please note that we defined all the members of the classes public for testing purposes. In reality, for
better coding style, most of those methods and all of the fields should have been kept private.
Methods you need to implement
For this assignment you need to implement all of the methods listed below. See the starter code
for the full method signatures. Your implementations must be efficient. For each method below,
we indicate the worst case run time using O() notation.
5
• Deck.Deck(int numOfCardsPerSuit, int numOfSuits) : creates a deck with cards from
Ace to numOfCardsPerSuit for the first numOfSuits in the class field suitsInOrder. The
cards should be ordered first by suit, and then by rank. In addition to these cards, a red joker
and a black joker are added to the bottom of the deck in this order. For example, with input
4 and 3, and suitsInOrder as specified in the file, the deck contains the following cards in
this specific order:
AC 2C 3C 4C AD 2D 3D 4D AH 2H 3H 4H RJ BJ
The constructor should raise an IllegalArgumentException if the first input is not a number
between 1 and 13 (both included) or the second input is not a number between 1 and the size
of the class field suitsInOrder. Remember that a deck is a circular doubly linked list so
make sure to set up all the pointers correctly, as well as the instance fields.
• Deck.Deck(Deck d) : creates a deck by making a deep copy of the input deck. Hint: use
the method getCopy from the class Card. Disclaimer: this is not the correct way of making
a deep copy of objects that contain circular references, but it is a simple one and good enough
for our purposes.
• Deck.addCard(Card c) : adds the input card to the bottom of the deck. This method runs
in O(1).
• Deck.shuffle() : shuffles the deck. There are different ways of doing this, but for this assignment you will need to implement an algorithm that uses the Fisher–Yates shuffle algorithm.
The algorithm runs in O(n) using O(n) space, where n is the number of cards in the deck.
To perform a shuffle of the deck follow the steps:
– Copy all the cards inside an array
– Shuffle the array using the following algorithm:
for i from n-1 to 1 do
j <– random integer such that 0 <= j <= i
swap a[j] and a[i]
To generate a random integer use the Random object stored in the class field called gen.
– Use the array to rebuild the shuffled deck.
• Deck.locateJoker(String color) : returns a reference to the joker in the deck with the
specified color. This method runs in O(n).
• Deck.moveCard(Card c, int p) : moves the card c by p positions down the deck. You can
assume that the input card belongs to the deck (which implies that the deck is not empty).
This method runs in O(p).
• Deck.tripleCut(Card firstCard, Card secondCard) : performs a triple cut on the deck
using the two input cards. You can assume that the input cards belong to the deck and the
first one is nearest to the top of the deck. This method runs in O(1).
6
• Deck.countCut() : performs a count cut on the deck. The number used for the cut is the
value of the bottom card modulo the total number of cards in the deck. Note that this means
that if the value of the bottom card is equal to a multiple of the number of cards in the deck,
then the method should not do anything. This method runs in O(n).
• Deck.lookUpCard() : returns a reference to the card that can be found by looking at the
value of the card on the top of the deck, and counting down that many cards. If the card
found is a Joker, then the method returns null, otherwise it returns the card found. This
method runs in O(n).
• Deck.generateNextKeystreamValue() : uses the Solitaire algorithm to generate one value
for the keystream using this deck. This method runs in O(n).
• SolitaireCipher.getKeystream(int size) : generates a keystream of the given size.
• SolitaireCipher.encode(String msg) : encodes the input message by generating a keystream
of the correct size and using it to encode the message as described earlier in the pdf.
• SolitaireCipher.decode(String msg) : decodes the input message by generating a keystream
of the correct size and using it to decode the message as described earlier in the pdf.
Small example
Generate a deck of 12 cards as follows:
AC 2C 3C 4C 5C AD 2D 3D 4D 5D RJ BJ
If you seed the random generator using 10 as the seed, then after shuffling the deck once you will
get the following configuration:
3C 3D AD 5C BJ 2C 2D 4D AC RJ 4C 5D
If you were to use this deck to create a Solitaire cipher and you would try to encode the message
“Is that you, Bob?” you would get the ciphertext MWIKDVZCKSFP obtained using the following
keystream:
4 4 15 3 3 2 1 14 16 17 17 14
Finally, note that the keystream used in the section describing how to encode/decode is the
keystream you would obtain using the deck from the example used on how to generate keystream
values.
7
Assignment 3 COMP 250 with Solution
1 INTRODUCTION
Learning Objectives
In this assignment you will be learning about decision trees and how to use them to solve classification
problems. Working on this problem will allow you to better understand how to manipulate trees and how
to use recursion to exploit their recursive structure.
1 Introduction
Congratulations! You have just landed an internship at a startup software company. This company is
trying to use AI techniques – in particular decision trees – to analyze spatial data. Your first task in this
internship is to write a basic decision tree class in Java. This will demonstrate to your new employer that
you understand what decision trees are and how they work.
From a quick web search, you learn that decision trees are a classical AI technique for classifying objects
by their properties. One typically refers to object attributes rather than object properties, and one typically
refers object labels to say how an object is classified.
As a concrete example, consider a computer vision system that analyzes surveillance video in a large store
and it classifies people seen in the video as being either employees or customers. An example attribute
could be the location of the person in the store. Employees tend to spend their time in different places than
customers. For example, only employees are supposed to be behind the cash register.
For classification problems in general, one defines object attributes with x variables and one defines the
object label as a y variable. In the example that you will work with in this assignment, the attributes will
be the spatial position (x1, x2), and the label y will be a color (red or green).
Let’s get back to decision trees themselves. Decision trees are rooted trees. So they have internal nodes
and external nodes (leaves). To classify a data item (datum) using a decision tree, one starts at the root and
follows a path to a leaf. Each internal node contains an attribute test. This test amounts to a question about
the value of the attribute – for example, the location of a customer in a store. Each child of an internal
node in a decision tree represents an outcome of the attribute test. For simplicity, you will only have to
deal with binary decision trees, so the answers to attribute test questions will be either true or false. A test
might be x1 < 5. The answer determines which child node to follow on the path to a leaf.
The labelling of the object occurs when the path reaches a leaf node. Each leaf node contains a label that
is assigned to any test data object that arrives at that leaf node after traversing the tree from the root. The
label might be red or green, which could be coded using an enum type, or simply 0 or 1. Note that, for any
test data object, the label given is the label of the leaf node reached by that object, which depends on the
outcomes of the attribute tests at the internal nodes.
The reason that this document is larger than usual is that decision trees were not covered the pre-recorded
lectures. This document should give you enough information about decision trees for you to do the assignment. If you wish to learn more about decision treees, then there are ample resources available on the
web. Steer towards resources that are about decision trees in computer science, in particular, in machine
learning or data mining. For example:
https://en.wikipedia.org/wiki/Decision_tree_learning
3
1.1 Creating decision trees 1 INTRODUCTION
Be aware that these resources will contain more information than you need to do this assignment, and so
you would need to sift through it and figure out what is important and what can be ignored. Feel free to use
Piazza to share links to good resources and to resolve questions you might have. The task of understanding
what decision trees are is part of the assignment. The amount of coding you need to do is relatively small,
once you figure out what needs to be done.
1.1 Creating decision trees
To classify objects using a decision tree, we first need to have a decision tree! Where do decision trees
come from? In machine learning, one creates decision trees from a labelled data set. Each data item
(datum) in the given labelled data set has well defined attributes x and label y. We refer to the data set
that is used to create a decision tree as the training set. The basic algorithm for creating a decision tree
using a training set is as follows. This is the algorithm that you will need to implement for fillDTNode()
later.
Data: data set (training)
Result: the root node of a decision tree
MAKE DECISION TREE NODE(data)
if the labelled data set has at least k data items (see below) then
if all the data items have the same label then
create a leaf node with that class label and return it;
else
create a “best” attribute test question; (see details later)
create a new node and store the attribute test in that node, namely attribute and threshold;
split the set of data items into two subsets, data1 and data2, according to the answers to the test
question;
newNode.child1 = MAKE DECISION TREE NODE(data1)
newNode.child2 = MAKE DECISION TREE NODE(data2)
return newNode
end
else
create a leaf node with label equal to the majority of labels and return it;
end
In the program, k is an argument of the decision tree construction minSizeDatalist.
1.2 Classification using decision trees
Once you have a decision tree, you can use it to classify new objects. This is called the testing phase.
For the testing phase, one can use data items from the original data used for training (above) or one can
use new data. Typically when a decision tree is used in practice, the test objects are unlabelled. In the
surveillance example earlier, the system would test a new video and try to classify people as employees
versus customers. Here the idea is that one does not know the correct class for each person. Let’s consider
this general scenario now, that we are given a decision tree and the attributes of some new unlabelled
test object. We will use the decision tree to choose a label for the object. This is done by traversing the
decision tree from the root to a leaf, as follows:
4
2 INSTANTIATING THE DECISION TREE PROBLEM
Data: A decision tree, and an unlabelled data item (datum) to be classified
Result: (Predicted) classification label
CLASSIFY(node, datum) {
if node is a leaf then
return the label of that node i.e. classify;
else
test the data item using question stored at that (internal) node, and determine which child node to go
to, based on the answer ;
return CLASSIFY(child, datum);
end
}
2 Instantiating the decision tree problem
For this assignment, the problem is to classify points based on their position. Each datapoint has an array
of attributes x, and a binary label y (0 or 1). For this section, we will focus on datapoints with only two
attributes.
A graphical representation of example of a data set looks like this. (For the graphs, the attribute value x[0],
is represented as x1 and x[1] as x2.) For those who print out the document in color, the red symbols can
be label 0 and the green symbols can be label 1. For those printing in black and white, the (red) disks are
label 0 and the (green) ×’s are label 1.
Figure 1: x1 is horizontal and x2 is vertical coordinate. Note that the points are intermixed. There is no
way to draw a horizontal or vertical line or any curve for that matter that could split the data.
5
2.1 Finding a good split 2 INSTANTIATING THE DECISION TREE PROBLEM
2.1 Finding a good split
Now that we have an idea of what the data are, let us return to the question of how to split the data into two
sets when creating a node in a decision tree. What makes a ‘good’ split? Intuitively, a split is good when
the labels in each set are as ‘pure’ as possible, that is, each subset is dominated as much as possible by a
single label (and the dominant label differs between subsets). For example, suppose this is our data:
Figure 2: What would be a good split of this data?
Two of many possible splits we could make are shown in Fig. 3. Fig 3-a splits the data into two sets
based on the test condition (x1 < 4), i.e. true or false. (By definition, the green symbol that falls on this
line is considered to be in the right half since the inequality is strict.) This is a good split in that all data
points for which the test condition is false have the same label (green) and all data points for which the
test condition is true have the same label (red), and the labels differ in the two subsets.
The split condition (x1 < 6) in Fig. 3-b is not as good, since the subset for which the condition is true
contains datapoints of both labels.
(a) (b)
Figure 3: Example of different splits on the x1 attribute.
6
2.2 Entropy 2 INSTANTIATING THE DECISION TREE PROBLEM
Splits can be done on either of the attributes. For the example in Fig. 4-a, a good split would be defined
by the test condition (x2 < 4).
The situation is more complicated when the data points cannot be separated by a threshold on x1 or x2, as
in Fig 4-b, however. It is unclear how to decide which of the three splits is best. We need a quantitative
way of deciding how to do this.
(a) (b)
Figure 4: (a) Example of a data set and a split using the x2 attribute, where the two subsets have distinct
labels. (b) An example of a data set in which there is no way to split using either an x1 or x2 value, such
that the two subsets have distinct labels.
2.2 Entropy
To handle more complicated situation, ones needs a quantitative measure of the ‘goodness’ of a split, in
terms of the impurity of the labels in a set of data points. There are many ways to do so. One of the most
common is called entropy.
1 The standard definition2 of entropy is:
H = −
X
i
pi
log2
(pi) (1)
where p(i) is a function such that 0 ≤ pi ≤ 1 and P
i
pi = 1. Note that the minus sign is needed to make
H positive, since log2 pi < 0 when 0 < pi < 1. Also, note that if pi = 0 then pi
log2
(pi) = 0 since that is
the limit of this expression as pi → 0. (Recall l’Hopital’s Rule in Calculus 1.)
For the special case that there are two values only, namely p1 and p2 = 1 − p1, entropy H is between 0 and
1, and we can write H as a function of the value p = p1. For a plot of this function H(p), see:
https://en.wikipedia.org/wiki/Binary_entropy_function
1Entropy is an extremely important concept in science. It has its roots in thermodynamics in the 19th century. In the
20th century, “information entropy” was one of the basic for techniques in electronic communication (telephones, cell phones,
internet, etc). In computer science, information entropy is heavily used in data compression, cryptography, and AI.
2Such functions pi are often used to model probabilities, as you will learn if you take MATH 323 or MATH 203 for example.
7
2.3 Using entropy to define a good split 2 INSTANTIATING THE DECISION TREE PROBLEM
In this assignment, we use entropy to choose the best split for a data set, based on its labels. We borrow
the formula for entropy and apply it to our problem as follows:
H(D) = −
X
y∈L
p(y) log2 p(y) (2)
where
• L is the set of labels, and y is a particular label
• D is a data set; each data point has two attributes and a label i.e. (x1, x2, y)
• H(D) is the entropy of the dataset D
• p(y) is the fraction of data points with label y from dataset D.
Since L consists of only two labels, entropy is between 0 and 1. Entropy is 0 if p(y) takes values 0 and 1
for the two labels. Entropy is 1 if p(y) = 0.5 for both labels. Otherwise it has a value strictly between 0
and 1. See plot in the link above.
2.3 Using entropy to define a good split
During the training phase, when one constructs the decision tree, a node is given a data set D as input.
If D has entropy greater than 0, then we would like to split the data set into two subsets D1 and D2. We
would like the entropy of the subsets to be lower than the entropy of D. The subsets may have different
entropy, however, so we consider the average entropy of the subsets. Moreover, because one subset might
be larger than the other, we would like to give more weight to the larger subset. So we define the average
entropy like this:
H(D1, D2) ≡ w1 × H(D1) + w2 × H(D2)
where wi
is the fraction of the points in subset i,
wi =
number of datapoints in Di
number datapoints in D, namely D1 + D2
and i is either 1 or 2. Note that w1 + w2 = 1.
NOTE: DO NOT use the formula : w2 = 1 – w1, although correct, this leads to a numerical approximation
error. For each of the weights (w1, w2) use the formula mentioned above separately.
ASIDE: (We mention the following because you will likely encounter it in your reading.) When building
a decision tree, one often considers the difference H(D) − H(D1, D2), which is called the information
gain. For example, one may decide whether or not to split a node based on whether the information gain
is sufficiently large. In this assignment, you will instead use a different criterion to decide whether to split
a node when building the decision tree Your criterion will be based on the number of data items in D, as
will be discussed later.
8
2.4 Example 2 INSTANTIATING THE DECISION TREE PROBLEM
2.4 Example
Let us now return to the examples we saw earlier and use entropy to discuss which split is best. Recall the
example of Fig. 4-b. To calculate the entropy of the dataset D before split, note there are two classes with
6 points in each of the classes. So the fraction of points in each class is 0.5 and the entropy is:
H(D) = −
1
2
log2
1
2
−
1
2
log2
1
2
= 1 (3)
• Split 1 breaks the dataset into two sub datasets, where the subdataset on top contains 8 points (5
red dots , 3 green crosses) and the one below has 4 points (1 red dot, 3 green crosses). Calculating
average entropy H(D1, D2) after the split yields:
4
12
−
1
4
log2
1
4
+
3
4
log2
3
4
+
8
12
−
5
8
log2
5
8
+
3
8
log2
3
8
= 0.906 (4)
• Split 2 breaks the dataset into two sub datasets, where the sub dataset on the left has 5 datapoints
(4 red dots, 1 green cross) and the other one has 7 datapoints ( 2 red dots , 5 green crosses). The
average entropy H(D1, D2) after the split is:
5
12
−
1
5
log2
1
5
+
4
5
log2
4
5
+
7
12
−
2
7
log2
2
7
+
5
7
log2
5
7
= 0.803 (5)
• Split 3 breaks the dataset into 2 sub datasets, where the sub dataset on the left has 7 datapoints (5 red
dots, 2 green crosses) and the other one has 5 datapoints ( 1 red dot , 4 green crosses). The average
entropy H(D1, D2) after the split is:
7
12
−
2
7
log2
2
7
+
5
7
log2
5
7
+
5
12
−
1
5
log2
1
5
+
4
5
log2
4
5
= 0.803 (6)
So, split 2 and 3 have lower average entropy than split 1. It is no problem that split 2 and 3 have the
same average entropy. Either one can be chosen as the ‘best split’. For the assignment, select the first
encountered best split when the check for the best split starts from the first attribute (x[0]) and proceeds
from there and for a given attribute the check starts from the first datapoint and proceeds thereafter.
2.5 Finding the best split
We can use entropy to compare the ‘goodness’ of different splits. The simplest way to define the best
split is just to consider all possible splits and choose the one with the lowest average entropy. In this
assignment, you will take this brute force approach. You will consider each attribute x1 and x2 and each
of the values of that attribute for points in the data set. You will compute the average entropy for splitting
on that attribute value, and you will choose the split that gives the minumum.
9
2.5 Finding the best split 2 INSTANTIATING THE DECISION TREE PROBLEM
Data: A Dataset
Result: An attribute and a threshold
FIND BEST SPLIT(data) {
best avg entropy := inf;
best attr := -1;
best threshold := -1;
for each attribute in x do
for each data point in list do
compute split and current avg entropy based on that split;
if best avg entropy > current avg entropy then
best avg entropy := current avg entropy;
best attr := attribute;
best threshold := value;
end
end
end
return (best attr, best threshold)
}
Note the order of the two for loops! If you use the opposite order, you might get a different tree (wrong
answer).Note also that if the minim average entropy you find is equal to the entropy of the input data set,
then there is no point in performing a split. In this case, the node should simply be made a leaf with label
equal to the majority of the labels in the data set.
So, far we have seen how to create the decision tree and what are the intuitions behind the math needed to
get the split to build the tree. Let us consider another example. Consider the data shown below.
A decision tree for this dataset should have splits as shown below.
10
2.6 Preventing Overfitting 3 INSTRUCTIONS AND STARTER CODE
Figure 5: Example of an overfit decision tree on the outlier dataset
But does this seem right? Given that all the points around it are of class 1 and the other points of the same
class are far to the left, it might be possible that the datapoint labeled class 2 at the right side of the graph
(6,6) is an outlier or an anomalous reading in the data. Points like these are most of the time, if not always
present in a dataset and care should be taken so that our decision tree does not try too hard to reduce the
impurity of the sub datasets and in the process actively keep splitting the data so as to try to classify the
outliers into purer groups.
This phenomenon described above is called overfitting. The algorithm – in this case the construction of a
decision tree – tries to find a “model” that accounts for all of the data, even the data which might be garbage
for some reason (noise, or due to some error in the program or device that produced the data).
2.6 Preventing Overfitting
There are different ways to prevent overfitting. The method that you will use is called early stopping,
namely stop splitting nodes in the decision tree when the number of data points in a subset is smaller than
some predetermined number. The issue of overfitting is very important in decision trees and in machine
learning in general, but the details are beyond the scope of this assignment.
3 Instructions and starter code
The starter code contains three classes:
Datum.java
This class holds the information of a single datapoint. It has two variables, x and y. x is an array containing
the attributes and y contains the label.
The class also comes with a method toString() which returns a string representation of the attributes
and label of a single datapoint.
11
3 INSTRUCTIONS AND STARTER CODE
DataReader.java
This class deals with three things. The method read data() reads a dataset from a csv3
, and splits the
read dataset into the training and test set, using splitTrainTestData() file. It also has methods that
deal with reading and writing of “serialized” decision tree objects.
DecisionTree.java
This is the main class which deals with the creation of a decision tree and classification of datapoints using
the created decision tree. You will be implementing some of the methods in this class.
Let us go through the different members and attributes of the class:
• The constructor builds a decision tree by calling the fillDTNode() method on a dataset. It is
given a list of data points and a parameter that specifies the minimum number of datapoints that has
to be present in a given dataset to qualify for a split. This minimum number is used to reduced the
chances of overfitting, as discussed above.
• There is a root node field, called rootDTNode, by which other nodes can be accessed.
• There is a field called minSizeDatalist used to store the minimum number of datapoints that
should be present in the dataset so as to initiate a split.
• There is a subclass class DTNode. This class is used to represent a single node of a decision tree.
There are two types of nodes: the internal nodes which define an attribute and a threshold which
help in classification, and leaf nodes which determine the labels of those data points whose attributes
obey the threshold conditions of the ancestor internal nodes leading up to the leaf node.
The DTNode contains the following members:
– leaf: a boolean variable that indicates whether this node object is a leaf or not.
– label: an integer variable that indicates the label of the node. The label of the node indicates
the class of a datapoint that reaches that particular leaf node after traversing the tree. This is
valid only if the node is a leaf node. The classes for this assignment are simply 0 or 1.
– attribute: The index of the attribute, (i.e. 1, . . . , n) on which the dataset is split in that particular node. The value of the attribute is one of {1, . . . , n}. The attribute value is meaningful
only for an internal node. The value stored in this field is meaningful only for an internal node.
In the code, the attributes are x[0], x[1], .., x[n-1], hence their indices values are
0, 1, . . . , n − 1, respectively.
– threshold: This holds the value of the attribute at which the split is done. This is also only
meaningful for an internal node.
– left, right These two variables are of type DTNode, and they represent the two children
of an internal node. At the classify stage, the left child leads to a decision tree node that handles
3According to https://en.wikipedia.org/wiki/Comma-separated_values, comma-separated values
(CSV) file is a delimited text file that uses a comma to separate values. A CSV file stores tabular data (numbers and text)
in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.
12
3 INSTRUCTIONS AND STARTER CODE
the case that the value of the attribute is less than the threshold, and right handles the case that
the value is greater than or equal to the threshold. For a leaf node, they are both null.
The class DTNode also contains a few member methods that helps in building the tree.
– fillDTNode(): This is the method that does all the heavy lifting in the entire assignment.
Given a list of data points and a minimum size for splitting (see earlier), this recursive function
creates the entire decision tree. A detailed description is present in the comments of the code.
– findMajority(): Given a list of datapoints D, this method returns a label for that set.
This method is called during training (construction of the decision tree). It can happen that
the size of a dataset falls below the minimum size mentioned in minSizeDatalist, but
still contains datapoints from more than one class. In such a case, a leaf node is created. To
determine the label for this leaf node, the method goes through all the points and finds the
majority label, i.e. the most common label for those points. This label will be used later at the
classification phase, when a new data point reaches that leaf node. When choosing the label
for a leaf node, if there is no majority (a tie), then the label with the smallest value is returned.
– classifyAtNode(): At the testing phase, given a datapoint with only its attributes (no
label), this method uses the label specified by the decision tree leaf node (as was determined at
training time by the findMajority() method).
– equals() : Given another node, this method checks if the tree rooted at the given node is
equal to the tree rooted at the calling node. The definition of ‘equality’ is elaborated in the next
section.
• calcEntropy(): Given a dataset, this function calculates the entropy of the dataset.
• classify(): Given a datapoint (without the label), predicts the label of the datapoint. The only
difference between this method and classifyAtNode() is that classifyAtNode() does
the classification on its member DTNode, whereas for classify() the DTNode is the root of the
created decision tree.
• checkPerformance(): Given a dataset where the datapoints have both attributes and labels,
this method runs the classify() function on all of the datapoints (using only the attributes) and
compares the label that is given by the decision tree with the “ground truth” label for that data point.
The method returns the fraction of datapoints that were predicted wrong, in the form of a string.
• equals(): Given two decision trees, this method checks if the two trees are equal or not. It returns
a boolean value.
Your task
You need to implement three methods from the DTNode class. None of the methods depend on each other.
We suggest that you implement equals() first.
1. DTNode.equals() (30 points)
This method compares two DTNodes. Given another DTNode object, it checks if the tree that is
rooted at the calling DTNode is equal to the tree rooted at DTNode object that is passed as the
13
4 DATA
parameter.
Two DTNodes are considered equal if:
(a) a traversal (e.g. preorder) of each of the two trees encounters nodes that are equal;
(b) internal node : the thresholds and attributes should be same;
(c) leaf node : the labels should be same.
Note that the tester you are given uses this equals method to check if the tree you implement in the
second part matches with the actual solution.
2. DTNode.fillDTNode() (50 points)
This method takes in a datalist (i.e. an arraylist of objects of type Datum) and returns the calling
DTNode object as the root of a decision tree trained using the datapoints present in the input datalist.
3. DTNode.classifyAtNode() (20 points)
This method takes in a datapoint (excluding the label) in the form of an array of type double (like
Datum.x) and should return its corresponding label (int).
4 Data
The different datasets used in the assignment are shown below. If you look at the DataReader() class,
you will note that only half of the data in each plot is used in the training. The other half is used to test the
performance of the decision tree. This testing phase is not part of the assignment, but we give you some
performance data so that you can appreciate the differences in the data sets.
For each of the three data sets, we list the performance of the decision tree in classifying the other half
(test set) of the data. We show both the fraction of training points that are misclassified (error) and
the fraction of the test points that are misclassified, and we do so for different values of the variable
minSizeDatalist.
4.1 Highly overlapping data :
Notice that the training error is 0 when the minimum size is 1 and rises as minSizeDatalist increases.
However, the test error is near .5 (50 percent) for all values of minSizeDatalist.
minSizeDatalist : 1 Training error : 0.000 Test error : 0.495
minSizeDatalist : 2 Training error : 0.000 Test error : 0.495
minSizeDatalist : 4 Training error : 0.030 Test error : 0.495
minSizeDatalist : 8 Training error : 0.105 Test error : 0.520
minSizeDatalist : 16 Training error : 0.200 Test error : 0.515
minSizeDatalist : 32 Training error : 0.235 Test error : 0.515
minSizeDatalist : 64 Training error : 0.310 Test error : 0.525
minSizeDatalist : 128 Training error : 0.390 Test error : 0.490
14
4.2 Partially overlapping data : 4 DATA
Figure 6: Plot of data present in data high overlap.csv
4.2 Partially overlapping data :
Figure 7: Plot of data present in data partial overlap.csv
In this case, 10 or 12 of the 200 test data points are misclassified for most values of minSizeDatalist.
The error rises considerably when minSizeDatalist is 128.
minSizeDatalist : 1 Training error : 0.000 Test error : 0.050
minSizeDatalist : 2 Training error : 0.000 Test error : 0.050
minSizeDatalist : 4 Training error : 0.015 Test error : 0.050
minSizeDatalist : 8 Training error : 0.035 Test error : 0.060
minSizeDatalist : 16 Training error : 0.045 Test error : 0.050
15
4.3 Minimal overlapping data : 4 DATA
minSizeDatalist : 32 Training error : 0.045 Test error : 0.050
minSizeDatalist : 64 Training error : 0.075 Test error : 0.050
minSizeDatalist : 128 Training error : 0.255 Test error : 0.245
4.3 Minimal overlapping data :
Figure 8: The data points here have very little overlap. We are loosely calling the overlap “minimal”,
although truly minimal would mean zero overlap. We leave some overlap so that overfitting can potentially
occur.
Note exactly one (of 200) training data points is misclassfied when minSizeDatalist is 4 or more.
The test error is roughly constant. It is slightly larger for small values of minSizeDatalist.
minSizeDatalist : 1 Training error : 0.000 Test error : 0.040
minSizeDatalist : 2 Training error : 0.000 Test error : 0.040
minSizeDatalist : 4 Training error : 0.005 Test error : 0.040
minSizeDatalist : 8 Training error : 0.005 Test error : 0.035
minSizeDatalist : 16 Training error : 0.005 Test error : 0.035
minSizeDatalist : 32 Training error : 0.005 Test error : 0.035
minSizeDatalist : 64 Training error : 0.005 Test error : 0.035
minSizeDatalist : 128 Training error : 0.005 Test error : 0.035
16



