GZ720J: Homework 2: Bag-of-Words for Scene Classification


Overview
The bag-of-words (BoW) approach, which you learned about in class, has been applied to many recognition problems in computer vision, for example object recognition [5, 7] and scene classification [6, 8].¹ Video Google (http://www.robots.ox.ac.uk/~vgg/research/vgoogle/) is a good demonstration of this.
This technique draws connections to Information Retrieval and has been studied extensively. Thus, there are many variants of and extensions to the traditional approach explained in class. For example, two important extensions are pyramid matching [2, 4] and feature encoding [1].
What you will be doing:
1. You will take responses of a filter bank on images and build a dictionary of visual words
(Section 2).
2. You will learn a model for the visual world based on a bag of visual words, and use
nearest-neighbor to predict scene classes in a test set (Section 3).
3. Finally, we ask you to dig deep into visual words by trying to visualize them (Section 4).
If the assignment looks long or difficult to you, don't panic! We provide you with step-by-step instructions to implement a working scene recognition framework. There are not many
lines of code to write. However, it may take a few hours for the base system to run, so make sure
to try each component on a subset of the data set first before putting everything together.
We have provided you with small subsets of the data for such testing (see Section 6).
Image Data You will be working with a subset of the SUN database.² The data set contains 1600 images from 9 scene categories like “kitchen”, “sky”, and “desert”.
A complete submission consists of a zip file with the following folders (please keep each
system in a separate folder):
1. baseline/: the baseline spatial pyramid classification system that you implement through
this homework;
¹ This homework aims at being largely self-contained; however, reading the listed papers (even without trying to truly understand them) is likely to be helpful.
² http://groups.csail.mit.edu/vision/SUN/
2. improved/ (if you do the extra credit): the custom system that you design to boost the
performance of your system;
3. a write-up (.pdf format).
We provide you with a number of functions and scripts in the hopes of alleviating some
tedious or error-prone sections of the implementation.
You can find a list of files provided in Section 6. Please read these descriptions.
Figure 3: The provided multi-scale filter bank
1 Warming up with some theory (10pts)
In this section, you will answer some questions about the most basic Bag-of-Words approach.
A suggested number of lines is provided for each answer.
Question 1.1 (3pts, 2-3 lines)
Given an N × N image and an h × h Gaussian filter, how would you convolve the image and the filter in O(h) instead of O(h²)?
Question 1.2 (2pts, 2-3 lines)
Does the bag-of-words approach respect spatial information? Why?
Question 1.3 (5 pts, 2-3 lines)
For images, why is it a better idea to use filter responses as features rather than raw pixel
values?
2 Representing the World with Visual Words (40pts)
We have provided you with a multi-scale filter bank that you will use to understand the visual
world. You can get it with the following provided function:
Figure 4: An input image and filter responses for each of the 3 channels.
Figure 5: A sample output, rendered with imagesc: (a) color input, (b) output map of visual words.
[filterBank] = createFilterBank()
filterBank is a cell array.³ Open the file createFilterBank.m to see how the filters were created. We have also provided you with a function to extract filter responses; it takes a 3-channel RGB image and a filter bank and returns the responses of the filters on the image.
[filterResponses] = extractFilterResponses(I, filterBank)
filterResponses is an N ×M matrix, where N is the number of pixels in the input image,
and M is the number of filter responses (three times the size of the filter bank, since you are
applying it to a 3-channel image).
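For example, a minimal usage sketch (the image filename below is just a placeholder, not a provided file) might look like:

I = imread('example_image.jpg');                          % placeholder path
filterBank = createFilterBank();                           % cell array of 2D filters
filterResponses = extractFilterResponses(I, filterBank);
size(filterResponses)                                      % (H*W) x (3*numel(filterBank)) for an H x W image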
Question 2.1 (5 pts, 3-4 lines) Theory:
What properties does each of the filter functions pick up? You should group the filters into broad categories (e.g., all the Gaussians). Answer in your write-up.
Creating Visual Words
You will now create a dictionary of visual words from the filter responses using k-means.
After applying k-means, similar filter responses will be represented by the same visual word.
You will use a fixed-size dictionary. Instead of using all of the filter responses (which can exceed the memory capacity of your computer), you will use responses at α random pixels.⁴ If there are T training images, then you should collect a matrix filterResponses over all the images that is αT × M, where M is the number of filter responses. Then, to generate a visual words dictionary with K words, you will cluster the responses with k-means using the built-in MATLAB function kmeans as follows:
[unused, dictionary] = kmeans(filterResponses, K, 'EmptyAction', 'drop');
³ Look at MATLAB's documentation for more details, but filterBank{i} is a 2D matrix, and filterBank{i} and filterBank{j} are not necessarily the same size.
⁴ Try using randperm.
Question 2.2 (15 pts)
You should write the following function to generate a dictionary given a list of images.
[filterBank, dictionary] = getFilterBankAndDictionary(imPaths)
As an input, getFilterBankAndDictionary takes a cell array of strings containing the full
path to all images. You can load each file by iterating from 1:length(imPaths), and doing
imread(imPaths{i}). Generate the αT filter responses over the training files and call k-means.
A sensible initial value to try for K is between 100 and 300, and for α is between 50 and 150,
but they depend on your system configuration and you might want to play with these values.
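For reference, here is a minimal sketch of one way getFilterBankAndDictionary could be structured; the values of α and K are placeholders, and growing the response matrix inside the loop is written for clarity rather than speed, so treat it as a starting point rather than the required implementation:

function [filterBank, dictionary] = getFilterBankAndDictionary(imPaths)
    alpha = 100;                                   % random pixels sampled per image (placeholder)
    K = 200;                                       % dictionary size (placeholder)
    filterBank = createFilterBank();
    pixelResponses = [];                           % will grow to (alpha*T) x M
    for i = 1:length(imPaths)
        I = imread(imPaths{i});
        responses = extractFilterResponses(I, filterBank);    % (H*W) x M
        idx = randperm(size(responses, 1));                   % shuffle pixel indices
        pixelResponses = [pixelResponses; responses(idx(1:alpha), :)];
    end
    [~, dictionary] = kmeans(pixelResponses, K, 'EmptyAction', 'drop');
end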
Question 2.3 (5 pts, 3-4 lines) Theory
How does the dictionary size affect the representation power of the bag-of-words pipeline, e.g.,
if I set K = 10 or K = 10,000? What is the problem with a dictionary that is too small or too large?
Once you are done with getFilterBankAndDictionary, call the provided script
computeDictionary, which will pass in the file names, and go get a coffee. If all goes well, you
will have a .mat file named dictionary.mat that contains the filter bank as well as the dictionary of visual words. Don't worry about “did-not-converge” errors. If it takes more than 10
minutes, reduce the number of clusters and samples. If you have debugging issues, try passing
in a small number of training files manually.
Computing Visual Words
Question 2.4 (15 pts)
We want to map each pixel in the image to its closest word in the dictionary. Create the
following function to do this:
[wordMap] = getVisualWords(I, filterBank, dictionary)
wordMap is a matrix with the same width and height as I, where each pixel in wordMap is
assigned the index of the closest visual word of the filter response at that pixel in I. We will use the standard Euclidean distance; to compute it efficiently, use the MATLAB function pdist2. A result is shown in Fig. 5.
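A minimal sketch, assuming dictionary is the K × M matrix of cluster centers returned by kmeans and that extractFilterResponses orders pixels in MATLAB's default column-major order, could look like:

function [wordMap] = getVisualWords(I, filterBank, dictionary)
    [H, W, ~] = size(I);
    filterResponses = extractFilterResponses(I, filterBank);  % (H*W) x M responses
    D = pdist2(filterResponses, dictionary, 'euclidean');     % (H*W) x K distances to words
    [~, wordIdx] = min(D, [], 2);                              % index of the closest word per pixel
    wordMap = reshape(wordIdx, H, W);                          % relies on column-major pixel ordering
end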
Since this can be slow, we have provided a function
batchToVisualWords(numberOfCores) that will apply your implementation of the function
getVisualWords to every image in the training and testing set. This function will automatically⁵ use as many cores as you tell it to use. For every image “X.jpg” in images/, there will
be a corresponding file named “X.mat” in wordmaps/ containing the variable wordMap. You
are highly encouraged to visualize a few of the resulting word maps; they should look similar
to the ones in Figs. 2 and 5.
⁵ Interested parties should investigate batchToVisualWords.m and the MATLAB commands matlabpool and parfor.
3 Building a Recognition System (95pts)
We have formed a convenient representation for recognition. We will now produce a basic
recognition system with spatial pyramid matching. The goal of the system is presented in Fig.
1: given an image, classify (“name”) the scene where the image was taken.
Traditional classification problems follow two phases: training and testing. During training
time, the computer is given a pile of formatted data (i.e., a collection of feature vectors) with
corresponding labels (e.g., “grass”, “sky”) and then builds a model of how the data relates to
the labels: “if green, then grass”. At test time, the computer takes features and uses these
rules to infer the label: e.g., “this is green, therefore it is grass”.
In this assignment, we will use the simplest classification model: k-nearest neighbor. At
test time, we will simply look at the query's top k nearest neighbors in the training set and transfer the majority label.⁶ In this case, you will be looking at the query image and looking up its k nearest neighbors in a collection of training images whose labels are already known. This approach works surprisingly well given a huge amount of data; see, e.g., the very cool graphics application from [3].
The components of any nearest-neighbor system are: features (how do you represent your
instances?) and similarity (how do you compare instances in the feature space?). You will
implement both.
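To make the nearest-neighbor idea concrete, here is a tiny majority-vote sketch with Euclidean distance; trainFeatures (T × d), trainLabels (T × 1 numeric labels), and testFeature (1 × d) are hypothetical variables, not part of the provided code:

k = 5;                                           % number of neighbors (placeholder)
D = pdist2(testFeature, trainFeatures);          % 1 x T distances to all training images
[~, order] = sort(D, 'ascend');                  % nearest training images first
predictedLabel = mode(trainLabels(order(1:k)));  % majority label among the k nearest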
Extracting Features
We will first represent an image with a bag-of-words approach. In each image, we simply
look at how often each word appears.
Question 3.1 (10 pts)
Create a function getImageFeatures that extracts the histogram⁷ of visual words within the
given image (i.e., the bag of visual words).
[h] = getImageFeatures(wordMap, dictionarySize)
As inputs, the function will take:
• wordMap is an H × W image containing the IDs of the visual words
• dictionarySize is the number of visual words in the dictionary.
As output, the function will return h, a dictionarySize × 1 histogram that is L1-normalized (i.e., Σᵢ hᵢ = 1). You may wish to load a single visual word map, visualize it, and verify that your function is working correctly before proceeding.
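A minimal sketch, assuming word IDs run from 1 to dictionarySize, could be as short as:

function [h] = getImageFeatures(wordMap, dictionarySize)
    h = histc(wordMap(:), 1:dictionarySize);     % dictionarySize x 1 vector of word counts
    h = h / sum(h);                              % L1 normalization so that sum(h) == 1
end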
Question 3.2 Extra credit (5 pts 2-3 lines) Theory:
Why is it a good idea to normalize the histogram of visual words? (Hint: See Section 3.2 of
[4])
⁶ In case of a tie, use an odd k.
⁷ Look at hist in MATLAB.
Multi-resolution: Spatial Pyramid Matching
Bag of words ignores the entire spatial structure of the image. In many cases, this may
not be desirable. One way to alleviate this issue (and thereby often obtain a better representation) is to use spatial pyramid matching. The general idea is to divide the image into a small
number of cells, and concatenate the histogram of these cells to the histogram of the original
image, with a proper weight per cell. We will be using the work of [4] as reference, and will
point you to specific sections as needed.
Here we will implement a popular scheme that cuts the image into 2^l × 2^l cells, where l is the layer number. We treat each cell as a small image and count how often each visual word appears. This results in a histogram for every single cell in every layer. Finally, to represent the entire image, we concatenate all the histograms together after normalization by the total number of features in the image. If there are L layers and K visual words, the resulting vector has dimensionality K Σ_{l=0}^{L} 4^l = K (4^(L+1) − 1)/3.
Now comes the weighting scheme. Note that when concatenating all the histograms, histograms from different levels are assigned different weights. Typically (as in Section 3 of [4]), a histogram from layer l gets half the weight of a histogram from layer l + 1, with the exception of layer 0, which is assigned a weight equal to that of layer 1. A popular choice is to set the weight for layer 0 and layer 1 to 2^(−L), and for the rest to 2^(l−L−1) (e.g., in a three-layer spatial pyramid, L = 2 and the weights are set to 1/4, 1/4, and 1/2 for layers 0, 1, and 2 respectively; see Fig. 6).
Note that the L1 norm (absolute values of all dimensions summed up together) for the final
vector is 1.
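As a quick sanity check of the formulas above (not required code), you can compute the layer weights and the final feature length for a chosen L and a placeholder K:

L = 2;                                           % finest layer of a three-layer pyramid
K = 200;                                         % dictionary size (placeholder)
featureLength = K * (4^(L+1) - 1) / 3;           % K * (4^0 + 4^1 + ... + 4^L)
weights = [2^(-L), 2.^((1:L) - L - 1)];          % layers 0..L; gives [0.25 0.25 0.5] for L = 2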
Question 3.3 Extra credit (5 pts 2-3 lines) Theory
Why do we weight different levels of the histogram differently? (Hint: See Section 3.1 of [4])
Figure 6: Spatial Pyramid Matching: From [4]. Toy example of a pyramid for L = 2. The
image has three visual words, indicated by circles, diamonds, and crosses. We subdivide the
image at three different levels of resolution. For each level of resolution and each channel, we
count the features that fall in each spatial bin. Finally, weight each spatial histogram.
Question 3.4 (20 pts)
Create a function getImageFeaturesSPM that forms a multi-resolution representation of the given image.
[h] = getImageFeaturesSPM(layerNum, wordMap, dictionarySize)
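Based on the pyramid description above, one possible rough structure is sketched below; the cell boundaries, the layer ordering, and the interpretation of layerNum as the number of layers (so the finest layer is L = layerNum − 1) are assumptions, not the required specification:

function [h] = getImageFeaturesSPM(layerNum, wordMap, dictionarySize)
    L = layerNum - 1;                            % assumed: layers are numbered 0..L
    [H, W] = size(wordMap);
    total = H * W;                               % total number of features in the image
    h = [];
    for l = 0:L
        if l == 0
            w = 2^(-L);                          % layer 0 shares layer 1's weight
        else
            w = 2^(l - L - 1);
        end
        nCells = 2^l;                            % this layer is cut into nCells x nCells cells
        rowEdges = round(linspace(0, H, nCells + 1));
        colEdges = round(linspace(0, W, nCells + 1));
        for r = 1:nCells
            for c = 1:nCells
                cellMap = wordMap(rowEdges(r)+1:rowEdges(r+1), colEdges(c)+1:colEdges(c+1));
                counts = histc(cellMap(:), 1:dictionarySize);
                h = [h; w * counts / total];     % weighted, normalized by the total feature count
            end
        end
    end
end

Because the layer weights sum to 1, the final vector keeps an L1 norm of 1, matching the note above.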