16-720B Homework 1 Spatial Pyramid Matching for Scene Classification Solved
1 Overview

The bag-of-words (BoW) approach, which you learned about in class, has been applied to a myriad of recognition problems in computer vision. Two classic examples are object recognition [5, 7] and scene classification [6, 8]. Beyond that, the BoW representation has been the subject of a great deal of study aimed at improving it, and you will see a large number of approaches that remain in the spirit of bag-of-words but improve upon the traditional approach you will implement here. Two important extensions are pyramid matching [2, 4] and feature encoding [1].

An illustrative overview of the homework is shown in Figure 2. In Section 1, we will build the visual words from the training set images. With the visual words, i.e. the dictionary, in Section 2 we will represent an image as a visual-word vector, so that comparisons between images can be made in the visual-word vector space. Finally, we will build a scene recognition system based on the visual bag-of-words approach to classify a given image into 8 types of scenes.

[Figure 2: An overview of the bag-of-words approach to be implemented in the homework. Given the training set of images, visual features are extracted; in our case, these are the responses of the pre-defined filter bank. The visual words, i.e. the dictionary, are the centers of clusters of the visual features. During recognition, an image is first represented as a histogram of visual words; comparisons between images are then made in the visual-word vector space.]
Finally, we will build a scene recognition system that classifies a given image into 8 types of scenes.

What you will be doing: You will implement a scene classification system that uses the bag-of-words approach with its spatial pyramid extension. The paper that introduced the pyramid match kernel [2] is:

K. Grauman and T. Darrell. The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features. ICCV 2005. http://www.cs.utexas.edu/~grauman/papers/grauman_darrell_iccv2005.pdf

Spatial pyramid matching [4] is presented in:

S. Lazebnik, C. Schmid, and J. Ponce. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR 2006. http://www.di.ens.fr/willow/pdfs/cvpr06b.pdf

(This homework aims to be largely self-contained; however, reading the listed papers, even without trying to fully understand them, is likely to be helpful.)

[Figure 3: The provided multi-scale filter bank.]

You will be working with a subset of the SUN database (footnote 2). The data set contains 1600 images from various scene categories like "auditorium", "desert" and "kitchen". To build a recognition system, you will:

• first, take responses of a filter bank on images and build a dictionary of visual words;
• then, learn a model for the visual world based on the bag of visual words (with spatial pyramid matching [4]), and use nearest neighbor to predict scene classes in a test set.

In terms of the number of lines of code, this assignment is fairly small. However, it may take a few hours to finish running the baseline system, so start early so that you have time to debug things. Also, try each component on a subset of the data set before putting everything together. We provide a number of functions and scripts in the hope of alleviating some tedious or error-prone sections of the implementation. You can find a list of the provided files in Section 4.
Notice that we include num_workers as an input for some functions you need to implement. These are not necessary, but can be used with multi-threading Python libraries to significantly speed up your code. This homework was tested using Python 3.5 and PyTorch 0.4.1.

1 Representing the World with Visual Words

1.1 Extracting Filter Responses

We want to run a filter bank on an image by convolving each filter in the bank with the image and concatenating all the responses into a vector for each pixel. In our case, we will be using 20 filters consisting of 4 filter types at 5 scales. The filter types are: (1) Gaussian, (2) Laplacian of Gaussian, (3) derivative of Gaussian in the x direction, and (4) derivative of Gaussian in the y direction. The convolution function scipy.ndimage.convolve() can be used with user-defined filters, but the functions scipy.ndimage.gaussian_filter() and scipy.ndimage.gaussian_laplace() may be useful here for improved efficiency. The 5 scales we will use are 1, 2, 4, 8, and 8√2, in pixel units.

Q1.1.1 (5 points): What properties does each of the filter functions (see Figure 3) pick up? You should group the filters into broad categories (e.g. all the Gaussians). Also, why do we need multiple scales of filter responses? Answer in your write-up.

Footnote 2: http://groups.csail.mit.edu/vision/SUN/

[Figure 4: An input image and filter responses for all of the filters in the filter bank. (a) The input image. (b) The filter responses of the Lab image corresponding to the filters in Figure 3.]

Q1.1.2 (10 points): In your code, loop through the filters and the scales to extract responses. Since color images have 3 channels, you will have a total of 3F filter responses per pixel if the filter bank is of size F. Note that the given dataset contains some gray-scale images. For those, simply duplicate the single channel into three channels (e.g. with numpy.tile). Then output the result as a 3F-channel image.
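The filter loop described in Q1.1.2 can be sketched as follows. This is a simplified sketch of ours, not the reference code: it hard-codes the five scales, deliberately omits the Lab color conversion step, and relies on scipy.ndimage's Gaussian helpers.

```python
import numpy as np
from scipy import ndimage

def extract_filter_responses(image, scales=(1, 2, 4, 8, 8 * np.sqrt(2))):
    """Sketch of Q1.1.2: 4 filter types at 5 scales over 3 channels.

    Returns an H x W x 3F array (F = 4 * len(scales) = 20 by default).
    The Lab conversion (skimage.color.rgb2lab) is omitted here for brevity.
    """
    image = np.asarray(image, dtype=np.float64)
    if image.max() > 1.0:                      # bring into [0, 1] if needed
        image = image / 255.0
    if image.ndim == 2:                        # gray-scale: duplicate channels
        image = np.stack([image] * 3, axis=-1)

    responses = []
    for sigma in scales:
        # keep the 3 channels of each filter adjacent, as the collage helper expects
        for filt in (
            lambda ch, s=sigma: ndimage.gaussian_filter(ch, s),
            lambda ch, s=sigma: ndimage.gaussian_laplace(ch, s),
            lambda ch, s=sigma: ndimage.gaussian_filter(ch, s, order=(0, 1)),  # d/dx
            lambda ch, s=sigma: ndimage.gaussian_filter(ch, s, order=(1, 0)),  # d/dy
        ):
            for c in range(3):
                responses.append(filt(image[:, :, c]))
    return np.dstack(responses)
```

For a 3-channel input this yields 3F = 60 response channels with the default bank.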
Complete the function visual_words.extract_filter_responses(image) and return the responses as filter_responses. We have provided template code with detailed instructions in it. You are required to input a 3-channel RGB or gray-scale image and a filter bank to get the responses of the filters on the image.

Remember to check the input argument image to make sure it is a floating-point type with range [0, 1], and convert it if necessary. Also check the number of input image channels, and convert the image to 3 channels if it is not already. Before applying the filters, use the function skimage.color.rgb2lab() to convert your image into the Lab color space, which was designed to more effectively quantify color differences with respect to human perception. If image is an M × N × 3 matrix, then filter_responses should be a matrix of size M × N × 3F. Make sure your convolution function call handles image padding along the edges sensibly.

Apply all 20 filters on a sample image and visualize the results as an image collage (as shown in Figure 4). You can use the included helper function util.display_filter_responses() (which expects a list of filter responses with those of the Lab channels grouped together with shape M × N × 3) to create the collage. Submit the collage of 20 images in the write-up.

1.2 Creating Visual Words

You will now create a dictionary of visual words from the filter responses using k-means. After applying k-means, similar filter responses will be represented by the same visual word. You will use a dictionary of fixed size. Instead of using all of the filter responses (which can exceed the memory capacity of your computer), you will use responses at α random pixels (footnote 3). If there are T training images, then you should collect a matrix filter_responses over all the images that is αT × 3F, where F is the filter bank size.
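Collecting the αT × 3F matrix can be sketched as below. sample_responses and per_image_responses are illustrative names of ours; numpy.random.permutation() is the sampling routine the handout's footnote suggests.

```python
import numpy as np

def sample_responses(filter_responses, alpha):
    """Sample `alpha` random pixel responses from an H x W x 3F response stack.

    Returns an alpha x 3F matrix of responses at alpha random pixels.
    """
    H, W, C = filter_responses.shape
    flat = filter_responses.reshape(H * W, C)
    idx = np.random.permutation(H * W)[:alpha]   # random pixels, no repeats
    return flat[idx]

# Stacking the samples from T training images gives the alphaT x 3F matrix
# that is fed to k-means, e.g.:
#   all_responses = np.vstack([sample_responses(r, alpha)
#                              for r in per_image_responses])
```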
Then, to generate a visual-words dictionary with K words, you will cluster the responses with k-means using the function sklearn.cluster.KMeans as follows:

kmeans = sklearn.cluster.KMeans(n_clusters=K).fit(filter_responses)
dictionary = kmeans.cluster_centers_

You can additionally pass the n_jobs argument to the KMeans() constructor to utilize parallel computation.

Q1.2 (10 points): You should write the functions

visual_words.compute_dictionary_one_image(args)
visual_words.compute_dictionary()

to generate a dictionary given a list of images. The overall goal of compute_dictionary() is to load the training data, iterate through the paths to the image files to read the images, extract αT filter responses over the training files, and call k-means. This can be slow to run; however, the images can be processed independently and in parallel. Inside compute_dictionary_one_image(), you should read an image, extract the responses, and save them to a temporary file. Here, args is a collection of arguments passed into the function. Inside compute_dictionary(), you should load all the training data and create subprocesses to call compute_dictionary_one_image(). After all the subprocesses are finished, load the temporary files back, collect the filter responses, and run k-means. Sensible initial values to try are K between 100 and 300 and α between 50 and 500, but they depend on your system configuration and you might want to play with these values.

Finally, execute compute_dictionary() and go get a coffee. If all goes well, you will have a file named dictionary.npy that contains the dictionary of visual words. If the clustering takes too long, reduce the number of clusters and samples. If you have debugging issues, try passing in a small number of training files manually.

1.3 Computing Visual Words

Q1.3 (10 points): We want to map each pixel in the image to its closest word in the dictionary.
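A minimal sketch of this pixel-to-word mapping, assuming the filter responses have already been computed (the handout's get_visual_words takes the raw image and dictionary instead):

```python
import numpy as np
from scipy.spatial.distance import cdist

def wordmap_from_responses(filter_responses, dictionary):
    """Map each pixel's 3F-dimensional response to its nearest dictionary word.

    filter_responses: H x W x 3F array; dictionary: K x 3F array.
    Returns an H x W wordmap of word indices in [0, K).
    """
    H, W, C = filter_responses.shape
    flat = filter_responses.reshape(H * W, C)
    dists = cdist(flat, dictionary, metric='euclidean')  # (H*W) x K distances
    return np.argmin(dists, axis=1).reshape(H, W)
```

The single cdist call computes all pixel-to-word Euclidean distances at once, avoiding a per-pixel loop.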
Complete the following function to do this:

visual_words.get_visual_words(image, dictionary)

and return wordmap, a matrix with the same width and height as image, where each pixel in wordmap is assigned the closest visual word of the filter response at the respective pixel in image. We will use the standard Euclidean distance; to compute it efficiently, use the function scipy.spatial.distance.cdist(). Some sample results are shown in Figure 5.

Visualize wordmaps for three images from any one category and submit them in the write-up along with their original RGB images. Also provide your comments about the visualization. They should look similar to the ones in Figure 5.

Footnote 3: Try using numpy.random.permutation().

[Figure 5: Visual words over images. You will use the spatially un-ordered distribution of visual words in a region (a bag of visual words) as a feature for scene classification, with some coarse information provided by spatial pyramid matching [4].]

2 Building a Recognition System

We have formed a convenient representation for recognition. We will now produce a basic recognition system with spatial pyramid matching. The goal of the system is presented in Figure 1: given an image, classify (colloquially, "name") the scene where the image was taken.

Traditional classification problems follow two phases: training and testing. At training time, the computer is given a pile of formatted data (i.e., a collection of feature vectors) with corresponding labels (e.g., "desert", "kitchen") and then builds a model of how the data relates to the labels: "if green, then kitchen". At test time, the computer takes features and uses these rules to infer the label: e.g., "this is green, so therefore it is kitchen".

In this assignment, we will use the simplest classification model: nearest neighbor. At test time, we will simply look at the query's nearest neighbor in the training set and transfer that label.
In this example, you will look at the query image and find its nearest neighbor in a collection of training images whose labels are already known. This approach works surprisingly well given a huge amount of data; for example, it enables a very cool graphics application [3]. The components of any nearest-neighbor system are: features (how do you represent your instances?) and similarity (how do you compare instances in the feature space?). You will implement both.

2.1 Extracting Features

We will first represent an image with a bag-of-words approach. In each image, we simply look at how often each word appears.

Q2.1 (10 points): Write the function

visual_recog.get_feature_from_wordmap(wordmap, dict_size)

that extracts the histogram of visual words within the given image (i.e., the bag of visual words; look into numpy.histogram()). As inputs, the function will take:

• wordmap, an H × W image containing the IDs of the visual words
• dict_size, the maximum visual word ID (i.e., the number of visual words, the dictionary size). Your histogram should have dict_size bins, counting how often each word occurs.

As output, the function will return hist, a dict_size-bin histogram that is L1 normalized (i.e., its entries sum to 1). You may wish to load a single visual word map, visualize it, and verify that your function is working correctly before proceeding.

2.2 Multi-resolution: Spatial Pyramid Matching

A bag of words is simple and efficient, but it discards information about the spatial structure of the image, and this information is often valuable. One way to alleviate this issue is to use spatial pyramid matching [4]. The general idea is to divide the image into a small number of cells and concatenate the histogram of each of these cells to the histogram of the original image, with suitable weights. Here we will implement a popular scheme that chops the image into 2^l × 2^l cells, where l is the layer number.
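For reference, the histogram extraction of Q2.1 can be sketched in a few lines with numpy.histogram (a sketch of ours, not the reference implementation):

```python
import numpy as np

def get_feature_from_wordmap(wordmap, dict_size):
    """L1-normalized histogram of visual-word IDs (the bag of visual words)."""
    # bin edges 0..dict_size give exactly dict_size bins, one per word ID
    hist, _ = np.histogram(wordmap, bins=np.arange(dict_size + 1))
    return hist / hist.sum()
```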
We treat each cell as a small image and count how often each visual word appears. This results in a histogram for every single cell in every layer. Finally, to represent the entire image, we concatenate all the histograms together after normalization by the total number of features in the image. If there are L + 1 layers and K visual words, the resulting vector has dimensionality K · Σ_{l=0}^{L} 4^l = K(4^(L+1) − 1)/3.

Now comes the weighting scheme. Note that when concatenating all the histograms, histograms from different levels are assigned different weights. Typically (as in [4]), a histogram from layer l gets half the weight of a histogram from layer l + 1, with the exception of layer 0, which is assigned a weight equal to that of layer 1. A popular choice is to set the weight of layers 0 and 1 to 2^(−L), and the weight of the remaining layers l to 2^(l−L−1). For example, in a three-layer spatial pyramid, L = 2 and the weights are 1/4, 1/4 and 1/2 for layers 0, 1 and 2 respectively (see Figure 6). Note that the L1 norm (the absolute values of all dimensions summed together) of the final vector is 1.

[Figure 6: Spatial Pyramid Matching, from [4]. Toy example of a pyramid for L = 2. The image has three visual words, indicated by circles, diamonds, and crosses. We subdivide the image at three different levels of resolution. For each level of resolution and each channel, we count the features that fall in each spatial bin. Finally, we weight each spatial histogram.]

Q2.2 (15 points): Create the function

visual_recog.get_feature_from_wordmap_SPM(wordmap, layer_num, dict_size)

that forms a multi-resolution representation of the given image. As inputs, the function will take:

• layer_num, the number of layers in the spatial pyramid, i.e., L + 1
• wordmap, an H × W image containing the IDs of the visual words
• dict_size, the maximum visual word ID (i.e., the number of visual words, the dictionary size)

As output, the function will return hist_all, a vector that is L1 normalized.
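A sketch of the pyramid feature under the weighting scheme above. It uses np.array_split for the cells (so image sizes need not divide evenly) and recomputes each cell's histogram directly rather than aggregating coarse layers from the finest one, so it is a correctness sketch, not the fastest version:

```python
import numpy as np

def get_feature_from_wordmap_SPM(wordmap, layer_num, dict_size):
    """Spatial-pyramid feature: concatenated, weighted cell histograms.

    layer_num = L + 1. Layer l has 2^l x 2^l cells; layer 0 gets weight 2^-L
    and layer l >= 1 gets 2^(l-L-1). Output length is K(4^(L+1) - 1)/3.
    """
    L = layer_num - 1
    features = []
    for l in range(layer_num):
        weight = 2.0 ** (-L) if l == 0 else 2.0 ** (l - L - 1)
        for rows in np.array_split(wordmap, 2 ** l, axis=0):
            for cell in np.array_split(rows, 2 ** l, axis=1):
                hist, _ = np.histogram(cell, bins=np.arange(dict_size + 1))
                # normalize by the total number of features in the image
                features.append(weight * hist / wordmap.size)
    hist_all = np.concatenate(features)
    return hist_all / hist_all.sum()   # final L1 normalization
```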
Please use a 3-layer spatial pyramid (L = 2) for all the following recognition tasks.

One small hint for efficiency: a lot of computation can be saved if you first compute the histograms of the finest layer, because the histograms of coarser layers can then be aggregated from the finer ones. Make sure you normalize the histogram after aggregation.

2.3 Comparing Images

We also need a way of comparing images to find the "nearest" instance in the training data. In this assignment, we will use the histogram intersection similarity. The histogram intersection similarity between two histograms is the sum of the minimum values of each pair of corresponding bins. Note that since this is a similarity, the "nearest" instance is the one with the largest value.

Q2.3 (10 points): Create the function

visual_recog.distance_to_set(word_hist, histograms)

where word_hist is a K(4^(L+1) − 1)/3 vector and histograms is a T × K(4^(L+1) − 1)/3 matrix containing T features from T training samples concatenated along the rows. This function returns the histogram intersection similarity between word_hist and each training sample as a vector of length T. Since this is called every time you want to look up a classification, it needs to be fast, so doing a for-loop over tens of thousands of histograms is a very bad idea.

2.4 Building a Model of the Visual World

Now that we have obtained a representation for each image and defined a similarity measure to compare two spatial pyramids, we want to put everything together. You will need to load the training file names from data/train_data.npz and the filter bank and visual word dictionary from dictionary.npy. You will save everything to a .npz file named trained_system.npz. Included will be:

1. dictionary: your visual word dictionary.
2. features: an N × K(4^(L+1) − 1)/3 matrix containing all of the histograms of the N training images in the data set. A dictionary with 150 words will make a train features matrix of size 1440 × 3150.
3.
labels: an N-vector containing the labels of each of the images (features[i] corresponds to label labels[i]).
4. SPM_layer_num: the number of spatial pyramid layers you used to extract the features for the training images.

We have provided the names of the training images in data/train_data.npz; use the dictionary entry image_names for training. You are also provided the names of the test images in data/test_data.npz, which is structured in the same way as the training data; however, you cannot use the test images for training. For reference, the class names corresponding to the label indices are:

0: auditorium, 1: baseball field, 2: desert, 3: highway, 4: kitchen, 5: laundromat, 6: waterfall, 7: windmill

Q2.4 (15 points): Implement the function visual_recog.build_recognition_system() that produces trained_system.npz. You may include any helper functions you write in visual_recog.py. Also implement

visual_recog.get_image_feature(file_path, dictionary, layer_num, K)

which loads an image, extracts the wordmap from it, computes the SPM feature, and returns the computed feature. Use this function in your visual_recog.build_recognition_system().

2.5 Quantitative Evaluation

Qualitative evaluation is all well and good (and very important for diagnosing performance gains and losses), but we want some hard numbers. Load the test images and their labels, and compute the predicted label of each: compute its similarity to every image in the training set and return the label of the most similar training image as the predicted label. To quantify the accuracy, you will compute a confusion matrix C: given a classification problem, the entry C(i, j) of the confusion matrix counts the number of instances of class i that were predicted as class j. When things are going well, the elements on the diagonal of C are large and the off-diagonal elements are small. Since there are 8 classes, C will be 8 × 8.
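The similarity lookup and confusion-matrix bookkeeping can be sketched as follows. confusion_matrix is a helper name of ours; the handout only names distance_to_set. The similarity is vectorized by broadcasting, with no loop over the T training histograms:

```python
import numpy as np

def distance_to_set(word_hist, histograms):
    """Histogram-intersection similarity of word_hist to each training row.

    word_hist: length-D vector; histograms: T x D matrix.
    Returns a length-T vector of similarities (larger = more similar).
    """
    return np.minimum(word_hist, histograms).sum(axis=1)

def confusion_matrix(true_labels, pred_labels, num_classes=8):
    """C[i, j] counts instances of class i predicted as class j."""
    C = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        C[t, p] += 1
    return C

# For a test feature, the predicted label would be
#   train_labels[np.argmax(distance_to_set(feature, train_features))]
# and accuracy is np.trace(C) / C.sum().
```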
The accuracy, or percentage of correctly classified images, is given by the trace of C divided by the sum of C.

Q2.5 (10 points): Implement the function visual_recog.evaluate_recognition_system() that tests the system and outputs the confusion matrix. Report the confusion matrix and accuracy for your results in your write-up. This does not have to be formatted prettily: if you are using LaTeX, you can simply copy/paste it into a verbatim environment. Additionally, do not worry if your accuracy is low: with 8 classes, chance is 12.5%. To give you a more sensible number, a reference implementation with spatial pyramid matching gives an overall accuracy of around 50%.

2.6 Finding the Failed Cases

Some classes/samples are more difficult to classify than the rest using the bag-of-words approach. As a result, they are classified incorrectly into other categories.

Q2.6 (5 points): List some of these classes/samples and discuss why they are more difficult in your write-up.

3 Deep Learning Features – An Alternative to "Bag of Words"

As we have discussed in class, another powerful method for scene classification in computer vision is the use of convolutional neural networks (CNNs) – sometimes referred to generically as "deep learning". It is important to understand how previously trained (pretrained) networks can be used as another form of feature extraction, and how they relate to classical Bag of Words (BoW) features. We will cover how one chooses the network architecture and training procedures later in the course. For this question, however, you will work with the VGG-16 pretrained network. VGG-16 is a pretrained Convolutional Neural Network (CNN) that has been trained on approximately 1.2 million images from the ImageNet dataset (http://image-net.org/index) by the Visual Geometry Group (VGG) at the University of Oxford. The model can classify images into 1000 object categories (e.g. keyboard, mouse, coffee mug, pencil).
One lesson we want you to take away from this exercise is an understanding of the effectiveness of "deep features" for general classification tasks within computer vision – even when those features have been trained on a different dataset (i.e. ImageNet) and task (i.e. object recognition).

3.1 Extracting Deep Features

To complete this question, you need to have PyTorch 0.4.1 or later installed, along with its torchvision library; PyTorch is a popular Python-based deep learning library. If you are using the Anaconda package manager (https://www.anaconda.com/download/), this can be done with the following command:

conda install pytorch torchvision -c pytorch

To check that you have installed it correctly, make sure that you can import torch in a Python interpreter without errors. Please refer to https://pytorch.org/ for more detailed installation instructions.

Q3.1 (25 points): We want to extract deep features corresponding to the convolutional layers of the VGG-16 network. To load the network, add the line

vgg16 = torchvision.models.vgg16(pretrained=True).double()

followed by

vgg16.eval()

The latter line ensures that the VGG-16 network is in evaluation mode, not training mode. We want you to complete a function in network_layers.py that outputs the VGG-16 network's response at the fc7 layer:

network_layers.extract_deep_feature(x, vgg16_weights)

where x refers to the input image and vgg16_weights is a structure containing the CNN's network parameters. In this function you will need to write sub-functions multichannel_conv2d, relu, max_pool2d, and linear corresponding to the fundamental elements of the CNN: multi-channel convolution, rectified linear units (ReLU), max pooling, and fully-connected weighting.

We have provided a helper function util.get_VGG16_weights() that extracts the weight parameters of VGG-16 and its meta-information. The returned variable is a numpy array of shape L × 3, where L is the number of layers in VGG-16.
The first column of each row is a string indicating the layer type. The second/third columns may contain the learned weights and biases, or other meta-information (e.g. the kernel size of max pooling). Please refer to the function docstring for details.

VGG-16 assumes that all input imagery to the network is resized to 224 × 224 with the three color channels preserved (use skimage.transform.resize() to do this before passing any imagery to the network). Be sure to normalize the image using the suggested mean and std before extracting the feature:

mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

In order to build the extract_deep_feature function, you should run a for-loop through each layer index until layer "fc7", which corresponds to the second linear layer (refer to the VGG structure to see where "fc7" is). Remember: the output of the preceding layer should be passed as the input to the next. Details on the sub-functions needed for extract_deep_feature can be found below. Please use scipy.ndimage.convolve and numpy functions to implement these functions instead of using PyTorch. Please keep speed in mind when implementing your functions; for example, using a double for-loop over pixels is not a good idea.

1. multichannel_conv2d(x, weight, bias): a function that will perform multi-channel 2D convolution, defined as

y^(j) = Σ_{k=1}^{K} x^(k) ∗ h^(j,k) + b[j]    (1)

where ∗ denotes 2D convolution, x = {x^(k)}_{k=1}^{K} is our K-channel input signal, h = {h^(j,k)} is our J × K set of convolutional filters, and y = {y^(j)}_{j=1}^{J} is our J-channel output response. Further, unlike traditional single-channel convolution, CNNs often add a bias vector b whose J elements are added to the J channels of the output response. To implement multichannel_conv2d, use the SciPy convolution function scipy.ndimage.convolve() with for-loops to cycle through the filters and channels.
All the necessary details concerning the number of filters (J), the number of channels (K), the filter weights (h) and the biases (b) can be inferred from the shapes/dimensions of the weights and biases.

2. relu(x): a function that performs the Rectified Linear Unit (ReLU) operation, defined mathematically as

ReLU(x) = max(x, 0)    (2)

and applied independently to each element of the matrix/vector x passed to it.

3. max_pool2d(x, size): a function that performs max pooling over x using a receptive field of size size × size (we assume a square receptive field here for simplicity). If the function receives a multi-channel input, it should apply the max pooling operation to each input channel independently. (Hint: making use of smart array indexing can drastically speed up the code.)

4. linear(x, W, b): a function that computes a node vector where each element is a linear combination of the input nodes:

y[j] = Σ_{k=1}^{K} W[j, k] x[k] + b[j]    (3)

or, more succinctly in vector form, y = Wx + b, where x is the (K × 1) input node vector, W is the (J × K) weight matrix, b is the (J × 1) bias vector, and y is the (J × 1) output node vector. You should not need for-loops to implement this function.

For efficiency, you should check that each sub-function is working properly before putting them all together – otherwise it will be hard to track down errors. You can check the performance of each layer by creating your own single-layer network.

3.2 Building a Visual Recognition System: Revisited

We want to assess how useful deep features are in a visual recognition system.

Q3.2 (5 points): Implement the functions

deep_recog.build_recognition_system(vgg16)
deep_recog.eval_recognition_system(vgg16)

both of which take the pretrained VGG-16 network as an input argument. These functions should follow a very similar pipeline to those in the previous questions, so you are free to reuse much of that code.
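The four sub-functions of Q3.1 might look like the sketch below, under assumptions of ours: inputs are H × W × K numpy arrays, the convolution weights are stored as J × K × h × w, and pooling uses non-overlapping windows. One caveat worth a comment: scipy.ndimage.convolve() performs a true convolution, so kernels stored for cross-correlation (PyTorch's convention) may need to be flipped before use.

```python
import numpy as np
from scipy import ndimage

def multichannel_conv2d(x, weight, bias):
    """Eq. (1): y[:, :, j] = sum_k x[:, :, k] * h(j, k) + b[j].

    x: H x W x K input; weight: J x K x h x w filters; bias: length-J vector.
    ndimage.convolve is a true convolution; flip kernels first if they are
    stored for cross-correlation (as in PyTorch).
    """
    H, W, K = x.shape
    J = weight.shape[0]
    y = np.zeros((H, W, J))
    for j in range(J):
        for k in range(K):
            y[:, :, j] += ndimage.convolve(x[:, :, k], weight[j, k])
        y[:, :, j] += bias[j]
    return y

def relu(x):
    """Eq. (2): element-wise max(x, 0)."""
    return np.maximum(x, 0)

def max_pool2d(x, size):
    """Non-overlapping size x size max pooling, applied per channel."""
    H, W, K = x.shape
    x = x[:H - H % size, :W - W % size]        # crop to a multiple of size
    blocks = x.reshape(H // size, size, W // size, size, K)
    return blocks.max(axis=(1, 3))             # max over each spatial block

def linear(x, W, b):
    """Eq. (3): y = Wx + b, with no loops."""
    return W @ x + b
```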
The former function should produce trained_system_deep.npz as output. Included will be:

1. features: an N × K matrix containing all the deep features of the N training images in the data set.
2. labels: an N-vector containing the labels of each of the images (features[i] corresponds to label labels[i]).

The latter function should produce the confusion matrix, as in the previous question. Instead of using the histogram intersection similarity, write a function that uses the negative Euclidean distance (so that larger values are more similar). Report the confusion matrix and accuracy for your results in your write-up. Comment in your write-up on whether the results are better or worse than classical BoW – why do you think that is?

4 HW1 Distribution Checklist

After unpacking hw1.zip, you should have a folder hw1 containing one folder for the data (data) and one for your code (code). In the code folder, where you will primarily work, you will find:

• visual_words.py: function definitions for extracting visual words.
• visual_recog.py: function definitions for building a visual recognition system.
• network_layers.py: function definitions for implementing deep network layers.
• deep_recog.py: function definitions for building a visual recognition system using deep features.
• util.py: utility functions.
• main.py: main function for running the system.

The data folder contains:

• data/: a directory containing .jpg images from the SUN database.
• data/train_data.npz: a .npz file containing the training set.
• data/test_data.npz: a .npz file containing the test set.
• data/vgg16_list.npy: a .npy file with the weights of VGG-16.

5 HW1 Submission Checklist

The assignment should be submitted to Canvas. The write-up should be submitted as a pdf file named hw1.pdf. The code should be submitted as a zip file named .zip. By extracting the zip file, it should have the following files in the structure defined below.
(Note: failing to follow this structure will incur a large score penalty.) When you submit, remove the folder data/, as well as any large temporary files that we did not ask you to create.

• / # A directory inside the .zip file
  – code/
    ∗ dictionary.npy
    ∗ trained_system.npz
    ∗ <!– all .py files inside code directory >
  – hw1.pdf: make sure you upload this pdf file to Gradescope, and assign the locations of answers to each question on Gradescope.

References

[1] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In British Machine Vision Conference, 2011.
[2] K. Grauman and T. Darrell. The pyramid match kernel: discriminative classification with sets of image features. In IEEE International Conference on Computer Vision (ICCV), volume 2, pages 1458–1465, 2005.
[3] James Hays and Alexei A. Efros. Scene completion using millions of photographs. ACM Transactions on Graphics (SIGGRAPH 2007), 26(3), 2007.
[4] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 2169–2178, 2006.
[5] D. G. Lowe. Object recognition from local scale-invariant features. In IEEE International Conference on Computer Vision (ICCV), volume 2, pages 1150–1157, 1999.
[6] Laura Walker Renninger and Jitendra Malik. When is scene identification just texture recognition? Vision Research, 44(19):2301–2311, 2004.
[7] J. Winn, A. Criminisi, and T. Minka. Object categorization by learned universal visual dictionary. In IEEE International Conference on Computer Vision (ICCV), volume 2, pages 1800–1807, 2005.
[8] Jianxiong Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo.
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3485–3492, 2010.
16720B Homework 2 Feature Descriptors, Homographies & RANSAC Solution
Introduction

In this homework, you will implement an interest point (keypoint) detector, a feature descriptor, and an image stitching tool based on feature matching and homography.

Interest point (keypoint) detectors find particularly salient points in an image. We can then extract a feature descriptor that describes the region around each interest point. SIFT, SURF and BRIEF are all examples of commonly used feature descriptors. Once we have extracted the interest points, we can use feature descriptors to match them (find correspondences) between images to do interesting things like panorama stitching or scene reconstruction.

We will implement the BRIEF feature descriptor in this homework. It has a compact representation, is quick to compute, has a discriminative yet easily computed distance metric, and is relatively simple to implement. This allows for real-time computation, as you have seen in class. Most importantly, as you will see, it is also just as powerful as more complex descriptors like SIFT for many cases.

After matching the features that we extract, we will explore the homography between images based on the locations of the matched features. Specifically, we will look at planar homographies. Why is this useful? In many robotics applications, robots must often deal with tabletops, ground, and walls, among other flat planar surfaces. When two cameras observe a plane, there exists a relationship between the captured images. This relationship is defined by a 3 × 3 transformation matrix called a planar homography. A planar homography allows us to compute how a planar scene would look from a second camera location, given only the first camera image. In fact, we can compute how images of the plane will look from any camera at any location without knowing any internal camera parameters and without actually taking the pictures, all using the planar homography matrix.
1 Keypoint Detector

We will implement an interest point detector similar to SIFT. A good reference for its implementation can be found in [3]. Keypoints are found by using the Difference of Gaussian (DoG) detector. This detector finds points that are extrema in both scale and space of a DoG pyramid. This is described in [1], an important paper in computer vision. Here, we will implement a simplified version of the DoG detector described in Section 3 of [3].

HINT: All the functions to implement here are located in keypointDetect.py.

NOTE: The parameters to use for the following sections are: σ0 = 1, k = √2, levels = [−1, 0, 1, 2, 3, 4], θc = 0.03 and θr = 12.

1.1 Gaussian Pyramid

In order to create a DoG pyramid, we will first need to create a Gaussian pyramid. Gaussian pyramids are constructed by progressively applying a low-pass Gaussian filter to the input image. This function is already provided to you in keypointDetect.py.

GaussianPyramid = createGaussianPyramid(im, sigma0, k, levels)

The function takes as input an image, which is going to be converted to grayscale with intensity between 0 and 1 (hint: cv2.cvtColor(…)), the scale of the zeroth level of the pyramid sigma0, the pyramid factor k, and a vector levels specifying the levels of the pyramid to construct. At level l in the pyramid, the image is smoothed by a Gaussian filter with σ_l = σ0 k^l. The output GaussianPyramid is an R × C × L matrix, where R × C is the size of the input image im and L is the number of levels. An example of a Gaussian pyramid can be seen in Figure 1. You can visualize this pyramid with the provided function displayPyramid(pyramid).

Figure 1: Example Gaussian pyramid for model chickenbroth.jpg
Figure 2: Example DoG pyramid for model chickenbroth.jpg

1.2 The DoG Pyramid (5 pts)

The DoG pyramid is obtained by subtracting successive levels of the Gaussian pyramid.
Specifically, we want to find:

    D_l(x, y, σ_l) = (G(x, y, σ_l) − G(x, y, σ_{l−1})) ∗ I(x, y)   (1)

where G(x, y, σ_l) is the Gaussian filter used at level l and ∗ is the convolution operator. Due to the distributive property of convolution, this simplifies to

    D_l(x, y, σ_l) = G(x, y, σ_l) ∗ I(x, y) − G(x, y, σ_{l−1}) ∗ I(x, y)   (2)
                   = GP_l − GP_{l−1}   (3)

where GP_l denotes level l in the Gaussian pyramid.

Q 1.2: Write the following function to construct a Difference of Gaussian pyramid:

DoGPyramid, DoGLevels = createDoGPyramid(GaussianPyramid, levels)

The function should return DoGPyramid, an R × C × (L − 1) matrix of the DoG pyramid created by differencing the GaussianPyramid input. Note that you will have one level less than the Gaussian pyramid. DoGLevels is an (L − 1) vector specifying the corresponding levels of the DoG pyramid (these should be the last L − 1 elements of levels). An example of the DoG pyramid can be seen in Figure 2.

Figure 3: Interest point (keypoint) detection without (3.a) and with (3.b) edge suppression for model chickenbroth.jpg

1.3 Edge Suppression (10 pts)

The Difference of Gaussian function responds strongly to corners and edges in addition to blob-like objects. However, edges are not desirable for feature extraction, as they are not as distinctive and do not provide sufficiently stable localization for keypoints. Here, we will implement the edge removal method described in Section 4.1 of [3], which is based on the principal curvature ratio in a local neighborhood of a point.
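A minimal sketch of the differencing in Q 1.2 above, assuming the R × C × L pyramid layout from Section 1.1 (the toy pyramid at the bottom is illustrative only):

```python
import numpy as np

def createDoGPyramid(GaussianPyramid, levels):
    """DoG by differencing successive levels: DoG[:, :, l] = GP[:, :, l+1] - GP[:, :, l]."""
    DoGPyramid = GaussianPyramid[:, :, 1:] - GaussianPyramid[:, :, :-1]
    DoGLevels = levels[1:]  # the last L - 1 elements of levels
    return DoGPyramid, DoGLevels

# toy 2 x 2 "image" with three constant levels, just to show the shapes
GP = np.stack([np.full((2, 2), v) for v in [1.0, 3.0, 6.0]], axis=2)
DoG, lvls = createDoGPyramid(GP, [-1, 0, 1])
print(DoG.shape, lvls)  # one fewer level than the Gaussian pyramid
```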
The paper presents the observation that edge points will have a "large principal curvature across the edge but a small one in the perpendicular direction."

Q 1.3: Implement the following function:

PrincipalCurvature = computePrincipalCurvature(DoGPyramid)

The function takes in the DoGPyramid generated in the previous section and returns PrincipalCurvature, a matrix of the same size where each point contains the curvature ratio R for the corresponding point in the DoG pyramid:

    R = Tr(H)² / Det(H) = (λmin + λmax)² / (λmin λmax)   (4)

where H is the Hessian of the Difference of Gaussian function (i.e., one level of the DoG pyramid) computed by using pixel differences, as mentioned in Section 4.1 of [3] (hint: use the Sobel filter cv2.Sobel(…)):

    H = [ Dxx  Dxy ]
        [ Dyx  Dyy ]   (5)

This is similar in spirit to, but different from, the Harris corner detection matrix you saw in class. Both methods examine the eigenvalues λ of a matrix, but the method in [3] performs the test without requiring the direct computation of the eigenvalues. Note that you need to compute each term of the Hessian before being able to take the trace and determinant.

We can see that R reaches its minimum when the two eigenvalues λmin and λmax are equal, meaning that the curvature is the same in the two principal directions. Edge points, in general, will have a principal curvature significantly larger in one direction than in the other. To remove edge points, we simply check against a threshold, R > θr. Figure 3 shows the DoG detector with and without edge suppression.

1.4 Detecting Extrema (10 pts)

To detect corner-like, scale-invariant interest points, the DoG detector chooses points that are local extrema in both scale and space. Here, we will consider a point's eight neighbors in space and its two neighbors in scale (one in the scale above and one in the scale below).
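The curvature-ratio test from Q 1.3 above can be sketched as follows. This is not the required implementation: it uses np.gradient as a stand-in for the hinted cv2.Sobel calls, and the small epsilon guarding against zero determinants is our own addition:

```python
import numpy as np

def computePrincipalCurvature(DoGPyramid):
    """Curvature ratio R = Tr(H)^2 / Det(H) at every pixel of every DoG level."""
    R = np.zeros_like(DoGPyramid)
    for l in range(DoGPyramid.shape[2]):
        D = DoGPyramid[:, :, l]
        Dy, Dx = np.gradient(D)      # first derivatives (axis 0 = rows = y)
        Dxy, Dxx = np.gradient(Dx)   # second derivatives of Dx
        Dyy, _ = np.gradient(Dy)
        trace = Dxx + Dyy
        det = Dxx * Dyy - Dxy * Dxy
        R[:, :, l] = trace ** 2 / (det + 1e-12)  # epsilon guards det == 0
    return R
```

On a quadratic bowl D = x² + y², both principal curvatures are equal, so R takes its minimum value of 4 at interior pixels, matching the discussion of Equation 4.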
Q 1.4: Write the function:

locsDoG = getLocalExtrema(DoGPyramid, DoGLevels, PrincipalCurvature, th_contrast, th_r)

This function takes as input DoGPyramid and DoGLevels from Section 1.2 and PrincipalCurvature from Section 1.3. It also takes two threshold values, th_contrast and th_r. The threshold θc should remove any point that is a local extremum but does not have a Difference of Gaussian (DoG) response magnitude above this threshold (i.e. |D(x, y, σ)| > θc). The threshold θr should remove any edge-like points that have too large a principal curvature ratio specified by PrincipalCurvature. The function should return locsDoG, an N × 3 matrix where the DoG pyramid achieves a local extremum in both scale and space while also satisfying the two thresholds. The first and second columns of locsDoG should be the (x, y) values of the local extremum and the third column should contain the corresponding level of the DoG pyramid where it was detected. (Try to eliminate loops in the function so that it runs efficiently.)

NOTE: In all implementations, we assume the x coordinate corresponds to columns and the y coordinate corresponds to rows. For example, the coordinate (10, 20) corresponds to (row 21, column 11) in the image.

1.5 Putting it together (5 pts)

Q 1.5: Write the following function to combine the above parts into a DoG detector:

locsDoG, GaussianPyramid = DoGdetector(im, sigma0, k, levels, th_contrast, th_r)

The function should take in a grayscale image, im, scaled between 0 and 1, and the parameters sigma0, k, levels, th_contrast, and th_r. It should use each of the above functions and return the keypoints in locsDoG and the Gaussian pyramid in GaussianPyramid. Figure 3 shows the keypoints detected for an example image. Note that we are dealing with real images here, so your keypoint detector may find points with high scores that you do not perceive to be corners. Include the image with the detected keypoints in your report (similar to the one shown in Fig.
3 (b)). You can use any of the provided images.

2 BRIEF Descriptor

Now that we have interest points that tell us where to find the most informative feature points in the image, we can compute descriptors that can be used to match to other views of the same point in different images. The BRIEF descriptor encodes information from a 9 × 9 patch p centered around the interest point at the characteristic scale of the interest point. See the lecture notes to refresh your memory.

HINT: All the functions to implement here are located in BRIEF.py.

2.1 Creating a Set of BRIEF Tests (5 pts)

The descriptor itself is a vector that is n bits long, where each bit is the result of the following simple test:

    τ(P; x, y) := 1 if P[x] < P[y], and 0 otherwise.   (6)

Set n to 256 bits. There is no need to encode the test results as actual bits; it is fine to encode them as a 256-element vector. There are many choices for the 256 test pairs {x, y} (remember x and y are 2D vectors relating to discrete 2D coordinates within the 2D image patch matrix P) used to compute τ(P; x, y) (each of the n bits). The authors describe and test some of them in [2]. Read Section 3.2 of that paper and implement one of these solutions. You should generate a static set of test pairs and save that data to a file. You will use these pairs for all subsequent computations of the BRIEF descriptor.

Q 2.1: Write the function to create the x and y pairs that we will use for comparison to compute τ:

compareX, compareY = makeTestPattern(patchWidth, nbits)

patchWidth is the width of the image patch (usually 9) and nbits is the number of tests n in the BRIEF descriptor. compareX and compareY are linear indices into the patchWidth × patchWidth image patch and are each nbits × 1 vectors. Run this routine for the given parameters patchWidth = 9 and n = 256 and save the results in testPattern.npy. Include this file in your submission.
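One of the sampling strategies described in Section 3.2 of [2] draws both coordinates of each test pair i.i.d. from an isotropic Gaussian with σ = patchWidth/5. A sketch of Q 2.1 under that choice (the rounding and clipping to the patch boundary are our own simplifications):

```python
import numpy as np

def makeTestPattern(patchWidth=9, nbits=256):
    """Sample test pairs from an isotropic Gaussian, one option from [2]."""
    rng = np.random.default_rng(0)   # fixed seed: the pattern must be static
    sigma = patchWidth / 5.0
    # (row, col) offsets around the patch center, rounded and kept in-bounds
    coords = rng.normal(0.0, sigma, size=(2, nbits, 2))
    coords = np.clip(np.round(coords) + patchWidth // 2, 0, patchWidth - 1)
    # convert (row, col) pairs to linear indices into the flattened patch
    compareX = (coords[0, :, 0] * patchWidth + coords[0, :, 1]).astype(int)
    compareY = (coords[1, :, 0] * patchWidth + coords[1, :, 1]).astype(int)
    return compareX, compareY

compareX, compareY = makeTestPattern(9, 256)
# np.save('testPattern.npy', [compareX, compareY])  # persist the static pattern
```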
2.2 Compute the BRIEF Descriptor (10 pts)

Now we can compute the BRIEF descriptor for the detected keypoints.

Q 2.2: Write the function:

locs, desc = computeBrief(im, GaussianPyramid, locsDoG, k, levels, compareX, compareY)

where im is a grayscale image with values from 0 to 1, locsDoG are the keypoint locations returned by the DoG detector from Section 1.5, levels are the Gaussian scale levels that were given in Section 1, and compareX and compareY are the test patterns computed in Section 2.1 and saved into testPattern.npy.

The function returns locs, an m × 3 vector, where the first two columns are the image coordinates of the keypoints and the third column is the pyramid level of the keypoints, and desc, an m × nbits matrix of stacked BRIEF descriptors. m is the number of valid descriptors in the image and will vary. You may have to be careful about the input DoG detector locations, since they may be at the edge of an image where we cannot extract a full patch of width patchWidth. Thus, the number of output locs may be less than the number of input locsDoG.

Note: It is possible that you may not require all the arguments to this function to compute the desired output. They have been provided to permit the use of several different approaches to solve this problem.

2.3 Putting it all Together (5 pts)

Q 2.3: Write a function:

locs, desc = briefLite(im)

which accepts a grayscale image im with values between 0 and 1 and returns locs, an m × 3 vector, where the first two columns are the image coordinates of keypoints and the third column is the pyramid level of the keypoints, and desc, an m × nbits matrix of stacked BRIEF descriptors. m is the number of valid descriptors in the image and will vary; n is the number of bits for the BRIEF descriptor. This function should perform all the necessary steps to extract the descriptors from the image, including:

• Compute the DoG pyramid.
• Get keypoint locations.
• Compute a set of valid BRIEF descriptors.
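For a single keypoint, the test of Equation 6 reduces to a couple of lines. The helper below is illustrative only (its name and the interior-keypoint assumption are ours), not the required computeBrief:

```python
import numpy as np

def brief_descriptor(im, x, y, compareX, compareY, patchWidth=9):
    """One BRIEF descriptor: bit k is 1 iff patch[compareX[k]] < patch[compareY[k]]."""
    r = patchWidth // 2
    # assumes the keypoint is far enough from the border for a full patch
    patch = im[y - r:y + r + 1, x - r:x + r + 1].flatten()
    return (patch[compareX] < patch[compareY]).astype(int)

# toy example: intensity increases left to right, one hand-picked test pair
im = np.tile(np.arange(20.0), (20, 1))
desc = brief_descriptor(im, 10, 10, np.array([0]), np.array([8]))
print(desc)  # the second pixel of the pair is brighter, so the bit is set
```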
2.4 Check Point: Descriptor Matching (5 pts)

A descriptor's strength is in its ability to match to other descriptors generated by the same world point despite change of view, lighting, etc. The distance metric used to compute the similarity between two descriptors is critical. For BRIEF, this distance metric is the Hamming distance. The Hamming distance is simply the number of bits in two descriptors that differ. (Note that the position of the bits matters.)

To perform the descriptor matching mentioned above, we have provided the following function in BRIEF.py:

matches = briefMatch(desc1, desc2, ratio)

which accepts an m1 × nbits stack of BRIEF descriptors from a first image and an m2 × nbits stack of BRIEF descriptors from a second image, and returns a p × 2 matrix of matches, where the first column contains indices into desc1 and the second column contains indices into desc2. Note that m1, m2, and p may be different sizes and p ≤ min(m1, m2).

Q 2.4: Write a test script or utilize the code provided in the main function of BRIEF.py to load two of the chickenbroth images and compute feature matches. Use the provided plotMatches and briefMatch functions to visualize the result:

plotMatches(im1, im2, matches, locs1, locs2)

where im1 and im2 are two colored images stored as uint8, matches is the list of matches returned by briefMatch, and locs1 and locs2 are the locations of keypoints from briefLite. Save the resulting figure and submit it in your PDF report. Also, present results with the two incline images and with the computer vision textbook cover page (the template is in file pf_scan_scaled.jpg) against the other pf_* images. Briefly discuss any cases that perform worse or better. See Figure 4 for an example result.

Suggestion for debugging: A good test of your code is to check that you can match an image to itself.
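briefMatch is provided, but when debugging it helps to know that the underlying Hamming distance is easy to compute yourself. A sketch for small descriptor stacks:

```python
import numpy as np

def hamming_matrix(desc1, desc2):
    """Pairwise Hamming distances between two stacks of binary descriptors."""
    # broadcast (m1, 1, n) against (1, m2, n); sum differing bits over n
    return (desc1[:, None, :] != desc2[None, :, :]).sum(axis=2)

d1 = np.array([[0, 1, 1, 0], [1, 1, 1, 1]])
d2 = np.array([[0, 1, 0, 0], [1, 0, 1, 1]])
print(hamming_matrix(d1, d2))  # entry (i, j): bits differing between d1[i], d2[j]
```

The broadcasting trick builds an m1 × m2 × n boolean array, so for the full 256-bit descriptor stacks of a real image pair a chunked or bit-packed variant would be more memory-friendly.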
Q 2.5: Take the model chickenbroth.jpg test image and match it to itself while rotating the second image (hint: cv2.getRotationMatrix2D(…), cv2.warpAffine(…)) in increments of 10 degrees. Count the number of correct matches at each rotation and construct a bar graph showing rotation angle vs. the number of correct matches. Include this in your PDF and explain why you think the descriptor behaves this way. Create a separate script briefRotTest.py that performs this task.

Figure 4: Example of BRIEF matches for model chickenbroth.jpg and chickenbroth_01.jpg.

3 Planar Homographies: Theory (20 pts)

Suppose we have two cameras looking at a common plane in 3D space. Any 3D point w on this plane generates a projected 2D point located at ũ = [u1, v1, 1]^T on the first camera and x̃ = [x2, y2, 1]^T on the second camera. Since w is confined to a plane, we expect that there is a relationship between ũ and x̃. In particular, there exists a common 3 × 3 matrix H, so that for any w, the following condition holds:

    λx̃ = Hũ   (7)

where λ is an arbitrary scalar weighting. We call this relationship a planar homography. Recall that the ˜ operator implies a vector is employing homogeneous coordinates, such that x̃ = [x, 1]^T. It turns out this homography relationship is also true for cameras that are related by pure rotation, without the planar constraint.

Q 3.1: We have a set of N 2D homogeneous coordinates {x̃1, . . . , x̃N} taken at one camera view, and {ũ1, . . . , ũN} taken at another. Suppose we know there exists an unknown homography H between the two views such that

    λn x̃n = H ũn,  for n = 1 : N   (8)

where again λn is an arbitrary scalar weighting.

(a) Given the N correspondences across the two views and using Equation 8, derive a set of 2N independent linear equations in the form:

    Ah = 0   (9)

where h is a vector of the elements of H and A is a matrix composed of elements derived from the point coordinates. Write out an expression for A.
Hint: Start by writing out Equation 8 in terms of the elements of H and the homogeneous coordinates for ũn and x̃n.

(b) How many elements are there in h?

(c) How many point pairs (correspondences) are required to solve this system? Hint: How many degrees of freedom are in H? How much information does each point correspondence give?

(d) Show how to estimate the elements in h to find a solution to minimize this homogeneous linear least squares system. Step us through this procedure. Hint: Use the Rayleigh quotient theorem (homogeneous least squares).

4 Planar Homographies: Implementation (10 pts)

Note: Implement the method in planarH.py.

Now that we have derived how to find H mathematically in Q 3.1, we will implement it in this section.

Q 4.1 (10 pts): Implement the function

H2to1 = computeH(X1, X2)

Inputs: X1 and X2 should be 2 × N matrices of corresponding (x, y)^T coordinates between two images.

Outputs: H2to1 should be a 3 × 3 matrix encoding the homography that best matches the linear equation derived above for Equation 8 (in the least squares sense). Hint: Remember that a homography is only determined up to scale. The numpy.linalg functions eigh() or svd() will be useful. Note that this function can be written without an explicit for-loop over the data points.

5 RANSAC (15 pts)

Note: Implement the method in planarH.py.

The least squares method you implemented for computing homographies is not robust to outliers. If all the correspondences are good matches, this is not a problem. But even a single false correspondence can completely throw off the homography estimation. When correspondences are determined automatically (using BRIEF feature matching, for instance), some mismatches in a set of point correspondences are almost certain. RANSAC (Random Sample Consensus) can be used to fit models robustly in the presence of outliers.
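One standard way to solve the homogeneous least squares problem in Q 4.1 is the direct linear transform (DLT): build the 2N × 9 matrix A and take the right singular vector with the smallest singular value. The sketch below uses an explicit loop for clarity, although, as noted above, the construction can be vectorized:

```python
import numpy as np

def computeH(X1, X2):
    """Estimate H2to1 such that X1 ~ H @ X2, from 2 x N correspondences (DLT)."""
    A = []
    for i in range(X1.shape[1]):
        x1, y1 = X1[:, i]
        x2, y2 = X2[:, i]
        A.append([-x2, -y2, -1, 0, 0, 0, x1 * x2, x1 * y2, x1])
        A.append([0, 0, 0, -x2, -y2, -1, y1 * x2, y1 * y2, y1])
    # h is the right singular vector with the smallest singular value
    _, _, Vt = np.linalg.svd(np.array(A))
    H2to1 = Vt[-1].reshape(3, 3)
    return H2to1 / H2to1[2, 2]  # fix the arbitrary scale
```

Dividing by the bottom-right entry at the end is one convenient way of handling the hint that a homography is only determined up to scale.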
Q 5.1 (15 pts): Write a function that uses RANSAC to compute homographies automatically between two images:

bestH = ransacH(matches, locs1, locs2, nIter, tol)

The inputs and output of this function should be as follows:

• Inputs: locs1 and locs2 are matrices specifying point locations in each of the images, and matches is a matrix specifying matches between these two sets of point locations. These matrices are formatted identically to the output of the provided briefMatch function.

• Algorithm input parameters: nIter is the number of iterations to run RANSAC for, and tol is the tolerance value for considering a point to be an inlier. Define your function so that these two parameters have reasonable default values.

• Outputs: bestH should be the homography model with the most inliers found during RANSAC.

Figure 5: Example output for Q 6.1: original images img1 (incline_L.jpg, Figure 5.a) and img2 (incline_R.jpg, Figure 5.b), and img2 warped to fit img1 (Figure 5.c). Notice that the warped image clips out of the image. We will fix this in Q 6.2.

6 Stitching it together: Panoramas (40 pts)

NOTE: All the functions to implement here are in panoramas.py.

We can also use homographies to create a panorama image from multiple views of the same scene. This is possible, for example, when there is no camera translation between the views (e.g., only rotation about the camera center). First, you will generate panoramas using matched point correspondences between images using the BRIEF matching from Section 2.4. We will assume that there is no error in your matched point correspondences between images (although there might be some errors, and even small errors can have drastic impacts). In the next section you will extend the technique to deal with the noisy keypoint matches.
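The RANSAC loop of Q 5.1 can be sketched as below. The 4-point minimal solver is inlined here for self-containedness (a real implementation would reuse your computeH), and the default parameter values are illustrative, not prescribed:

```python
import numpy as np

def fit_homography(p1, p2):
    """Direct linear transform on point pairs given as rows of (x, y)."""
    A = []
    for (x1, y1), (x2, y2) in zip(p1, p2):
        A.append([-x2, -y2, -1, 0, 0, 0, x1 * x2, x1 * y2, x1])
        A.append([0, 0, 0, -x2, -y2, -1, y1 * x2, y1 * y2, y1])
    _, _, Vt = np.linalg.svd(np.array(A))
    return Vt[-1].reshape(3, 3)

def ransacH(matches, locs1, locs2, nIter=500, tol=2.0):
    """Robust homography: fit random 4-point samples, keep the most inliers."""
    p1 = locs1[matches[:, 0], :2].astype(float)  # (x, y) locations in image 1
    p2 = locs2[matches[:, 1], :2].astype(float)  # matched locations in image 2
    rng = np.random.default_rng(0)
    bestH, best_count = None, -1
    for _ in range(nIter):
        idx = rng.choice(len(p1), 4, replace=False)
        H = fit_homography(p1[idx], p2[idx])
        # project every image-2 point into image 1 and measure the error
        proj = H @ np.vstack([p2.T, np.ones(len(p2))])
        with np.errstate(divide='ignore', invalid='ignore'):
            proj = (proj[:2] / proj[2]).T
        count = np.sum(np.linalg.norm(proj - p1, axis=1) < tol)
        if count > best_count:
            bestH, best_count = H, count
    return bestH
```

A common refinement, omitted here for brevity, is to refit the homography on all inliers of the best sample before returning.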
You will need to use the perspective warping function from OpenCV:

warp_im = cv2.warpPerspective(im, H, out_size)

which warps image im using the homography transform H. The pixels in warp_im are sampled at coordinates in the rectangle (0, 0) to (out_size[0]−1, out_size[1]−1). The coordinates of the pixels in the source image are taken to be (0, 0) to (im.shape[1]−1, im.shape[0]−1) and transformed according to H. To understand this function, you may review Homework 0.

Q 6.1 (15 pts): In this problem you will implement and use the function:

panoImg = imageStitching(img1, img2, H2to1)

on two images from the Duquesne incline. This function accepts two images and the output from the homography estimation function. This function:

(a) Warps img2 into img1's reference frame using the aforementioned perspective warping function;

(b) Blends img1 and warped img2 and outputs the panorama image.

For this problem, use the provided images incline_L as img1 and incline_R as img2. The point correspondences pts are generated by your BRIEF descriptor matching. Apply your ransacH() to these correspondences to compute H2to1, which is the homography from incline_R onto incline_L. Then apply this homography to incline_R using warpH().

Note: Since the warped image will be translated to the right, you will need a larger target image. Visualize the warped image and save this figure as results/6_1.jpg using OpenCV's cv2.imwrite() function, and save only H2to1 as results/q6_1.npy using NumPy's np.save() function.

Q 6.2 (15 pts): Notice how the output from Q 6.1 is clipped at the edges? We will fix this now. Implement a function

panoImg = imageStitching_noClip(img1, img2, H2to1)

that takes in the same input types and produces the same outputs as in Q 6.1. To prevent clipping at the edges, we instead need to warp both image 1 and image 2 into a common third reference frame in which we can display both images without any clipping.
Specifically, we want to find a matrix M that only does scaling and translation, such that:

warp_im1 = cv2.warpPerspective(im1, M, out_size)
warp_im2 = cv2.warpPerspective(im2, np.matmul(M, H2to1), out_size)

This produces warped images in a common reference frame where all points in im1 and im2 are visible. To do this, we will only take as input either the width or height of out_size and compute the other one based on the given images, such that the warped images are not squeezed or elongated in the panorama image. For now, assume we only take as input the width of the image (i.e., out_size[0]) and should therefore compute the correct height (i.e., out_size[1]).

Hint: The computation will be done in terms of H2to1 and the extreme points (corners) of the two images. Make sure M includes only scale (find the aspect ratio of the full-sized panorama image) and translation. Again, pass incline_L as img1 and incline_R as img2. Save the resulting panorama in results/q6_2_pan.jpg.

Q 6.3 (10 pts): You now have all the tools you need to automatically generate panoramas. Write a function that accepts two images as input, computes keypoints and descriptors for both the images, finds putative feature correspondences by matching keypoint descriptors, estimates a homography using RANSAC, and then warps one of the images with the homography so that they are aligned, and then overlays them:

im3 = generatePanorama(im1, im2)

Run your code on the image pair data/incline_L.jpg, data/incline_R.jpg. However, during debugging, try scaled-down versions of the images to keep the running time low. Save the resulting panorama on the full-sized images as results/q6_3.jpg (see Figure 6 for example output). Include the figure in your writeup.

Figure 6: Final panorama view, with homography estimated using RANSAC.

7 Augmented Reality (25 pts)

Homographies are also really useful for Augmented Reality (AR) applications.
For this AR exercise we will be using the image in Figure 7 with the following data points:

    W = [ 0.0  18.2  18.2   0.0
          0.0   0.0  26.0  26.0
          0.0   0.0   0.0   0.0 ]

are the 3D planar points of the textbook (in cm);

    X = [ 483  1704  2175    67
          810   781  2217  2286 ]

are the corresponding projected points; and

    K = [ 3043.72     0.0  1196.00
             0.0  3043.72  1604.00
             0.0      0.0     1.0 ]

contains the intrinsics matrix for the image.

Figure 7: Image file we will be using in this exercise, where we provide you with the 3D planar points of the corners of the book (W), the 2D projected points (X) in the image, and the camera intrinsic parameters (K).

The relationship between a 3D planar point and its 2D projection can be defined through

    λn x̃n = K [R, t] w̃n   (10)

where xn and wn relate to the n-th column vectors of X and W.

Q 7.1 (10 pts): Using the method from Section 15.4.1 (pp. 335–337) of Prince's textbook, implement the function

R, t = compute_extrinsics(K, H)

where K contains the intrinsic parameters and H contains the estimated homography. As discussed in class, the extrinsic parameters refer to the relative 3D rotation R and translation t of the camera between views. Remember: Prince uses the notation Ω and τ to refer to the extrinsic parameters R and t respectively. He also employs the notation Λ to refer to the intrinsics matrix K.

Q 7.2 (15 pts): Next, implement the function

X = project_extrinsics(K, W, R, t)

that will apply a pinhole camera projection (described in Equation 10) of any set of 3D points using a pre-determined set of extrinsics and intrinsics. Use this function to then project the set of 3D points in the file sphere.txt (which has been fixed to have the same radius as a tennis ball, 6.858 cm) onto the image prince_book.jpg. Specifically, attempt to place the bottom of the ball in the middle of the "o" in the "Computer" text of the book title. Use matplotlib's plot function to display the points of the sphere in yellow. For this question disregard self occlusion (i.e.
points on the projected sphere occluding other points). Capture the image and include it in your written response.

8 Submission and Deliverables

The assignment should be submitted to Canvas as a zip file named .zip and to Gradescope as a pdf file named hw2.pdf. You can use check_files.py to ensure you have all required files ready before submission. The zip file should contain:

• A folder code containing all the .py files you were asked to write.

• A PDF hw2.pdf.

• A folder results containing all the .npy files you were asked to generate.

The PDF hw2.pdf should contain the results, explanations and images asked for in the assignment, along with the answers to the questions on homographies. Submit all the code needed to make your panorama generator run. Make sure all the .py files that need to run are accessible from the code folder without any editing of the path variable. You may leave the data folder in your submission, but it is not needed.

Note: Failing to follow this structure will incur a huge penalty in scores!

Appendix: Image Blending

Note: This section is not for credit and is for informational purposes only.

For overlapping pixels, it is common to blend the values of both images. You can simply average the values, but that will leave a seam at the edges of the overlapping images. Alternatively, you can obtain a blending value for each image that fades one image into the other. To do this, first create a mask like this for each image you wish to blend:

mask = np.zeros((im.shape[0], im.shape[1]))
mask[0,:] = 1
mask[-1,:] = 1
mask[:,0] = 1
mask[:,-1] = 1
mask = scipy.ndimage.morphology.distance_transform_edt(1-mask)
mask = mask/mask.max(0)

The function distance_transform_edt computes the distance transform of the binarized input image, so this mask will be zero at the borders and 1 at the center of the image. You can warp this mask just as you warped your images.
How would you use the mask weights to compute a linear combination of the pixels in the overlap region? Your function should behave well where one or both of the blending constants are zero.

References

[1] P. Burt and E. Adelson. The Laplacian Pyramid as a Compact Image Code. IEEE Transactions on Communications, 31(4):532–540, April 1983.

[2] Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. BRIEF: Binary Robust Independent Elementary Features. In European Conference on Computer Vision (ECCV), 2010.

[3] David G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91–110, November 2004.
16-720B Homework 3 Lucas-Kanade Tracking and Correlation Filters Solution
1 Lucas-Kanade Tracking

In this section you will be implementing a simple Lucas-Kanade tracker with one single template. In the scenario of two-dimensional tracking with a pure translation warp function,

    W(x; p) = x + p.   (1)

The problem can be described as follows: starting with a rectangular neighborhood of pixels N = {x_d}, d = 1, . . . , D, on frame I_t, the Lucas-Kanade tracker aims to move it by an offset p = [p_x, p_y]^T to obtain another rectangle on frame I_{t+1}, so that the pixel squared difference between the two rectangles is minimized:

    p* = argmin_p Σ_{x∈N} ||I_{t+1}(x + p) − I_t(x)||²₂   (2)
       = || [I_{t+1}(x₁ + p), . . . , I_{t+1}(x_D + p)]^T − [I_t(x₁), . . . , I_t(x_D)]^T ||²₂   (3)

Q1.1 (5 points) Starting with an initial guess of p (for instance, p = [0, 0]^T), we can compute the optimal p* iteratively. In each iteration, the objective function is locally linearized by a first-order Taylor expansion,

    I_{t+1}(x′ + ∆p) ≈ I_{t+1}(x′) + (∂I_{t+1}(x′)/∂x′^T)(∂W(x; p)/∂p^T) ∆p   (4)

where ∆p = [∆p_x, ∆p_y]^T is the template offset. Further, x′ = W(x; p) = x + p, and ∂I(x′)/∂x′^T is a vector of the x- and y- image gradients at pixel coordinate x′. In a similar manner to Equation 3, one can incorporate these linearized approximations into a vectorized form such that

    argmin_{∆p} ||A∆p − b||²₂   (5)

with p ← p + ∆p at each iteration.

• What is ∂W(x; p)/∂p^T?
• What is A and what is b?
• What conditions must A^T A meet so that a unique solution to ∆p can be found?

Q1.2 (15 points) Implement a function with the following signature

LucasKanade(It, It1, rect, p0 = np.zeros(2))

that computes the optimal local motion from frame I_t to frame I_{t+1} that minimizes Equation 3. Here It is the image frame I_t, It1 is the image frame I_{t+1}, rect is the 4-by-1 vector that represents a rectangle describing all the pixel coordinates within N within the image frame I_t, and p0 is the initial parameter guess (δx, δy).
The four components of the rectangle are [x1, y1, x2, y2]^T, where [x1, y1]^T is the top-left corner and [x2, y2]^T is the bottom-right corner. The rectangle is inclusive, i.e., it includes all four corners. To deal with fractional movement of the template, you will need to interpolate the image; use the SciPy module ndimage.shift or something similar. You will also need to iterate the estimation until the change in ||∆p||²₂ is below a threshold. In order to perform interpolation, you might find RectBivariateSpline from the scipy.interpolate package useful. Read the documentation on defining the spline (RectBivariateSpline) as well as on evaluating the spline using RectBivariateSpline.ev carefully.

Q1.3 (10 points) Write a script testCarSequence.py that loads the video frames from carseq.npy and runs the Lucas-Kanade tracker that you have implemented in the previous task to track the car. carseq.npy can be located in the data directory and it contains one single three-dimensional matrix: the first two dimensions correspond to the height and width of the frames respectively, and the third dimension contains the indices of the frames (that is, the first frame can be visualized with imshow(frames[:, :, 0])). The rectangle in the first frame is [x1, y1, x2, y2]^T = [59, 116, 145, 151]^T. Report your tracking performance (image + bounding rectangle) at frames 1, 100, 200, 300 and 400 in a format similar to Figure 1. Also, create a file called carseqrects.npy, which contains one single n × 4 matrix, where each row stores the rect that you have obtained for each frame, and n is the total number of frames.

Figure 1: Lucas-Kanade Tracking with One Single Template

Q1.4 (20 points) As you might have noticed, the image content we are tracking in the first frame differs from the one in the last frame. This is understandable, since we are updating the template after processing each frame and the error can accumulate. This problem is known as template drifting.
There are several ways to mitigate this problem. Iain Matthews et al. (2003, https://www.ri.cmu.edu/publication_view.html?pub_id=4433) suggested one possible approach. Write a script testCarSequenceWithTemplateCorrection.py with a similar functionality to Q1.3, but with a template correction routine incorporated. Save the resulting rects as carseqrects-wcrt.npy, and also report the performance at those frames. An example is given in Figure 2.

Figure 2: Lucas-Kanade Tracking with Template Correction. Here the green rectangles are created with the baseline tracker in Q1.3, the yellow ones with the tracker in Q1.4. The tracking performance has been improved non-trivially. Note that you do not necessarily have to draw two rectangles in each frame, but make sure that the performance improvement can be easily visually inspected.

2 Lucas-Kanade Tracking with Appearance Basis

The tracker we have implemented in the first section, with or without template drifting correction, may suffice if the object being tracked is not subject to drastic appearance variance. However, in real life, this can hardly be the case. We have prepared another sequence, sylvseq.npy (the initial rectangle is [101, 61, 155, 107]), with exactly the same format as carseq.npy, on which you can test the baseline implementation and see what happens. In this section, you will implement a variant of the Lucas-Kanade tracker (see Section 3.4 in [2]) to model linear appearance variation in the tracking.

2.1 Appearance Basis

One way to address this issue is to use an eigen-space approach (aka principal component analysis, or PCA). The idea is to analyze the historic data we have collected on the object and produce a few bases, whose linear combination would most likely constitute the appearance of the object in the new frame. This is actually similar to the idea of having a lot of templates, but looking for too many templates may be expensive, so we only worry about the principal templates.
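The eigen-space idea can be illustrated with a quick PCA over vectorized templates. This is purely illustrative; the provided sylvbases.npy already contains such bases:

```python
import numpy as np

def make_bases(templates, K):
    """Top-K orthonormal appearance bases via PCA on vectorized templates."""
    T = templates.reshape(templates.shape[0], -1)  # (num_templates, num_pixels)
    T = T - T.mean(axis=0)                         # center before PCA
    _, _, Vt = np.linalg.svd(T, full_matrices=False)
    return Vt[:K]                                  # rows are orthonormal bases

# toy check on random 8 x 8 "templates"
rng = np.random.default_rng(0)
B = make_bases(rng.normal(size=(10, 8, 8)), K=3)
print(np.allclose(B @ B.T, np.eye(3)))  # → True: the bases form an orthobasis
```

The orthonormality checked at the end is exactly the property the next section relies on when it calls the B_k's orthobases.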
Mathematically, suppose we are given a set of K image bases {B_k}_{k=1}^{K} of the same size. We can approximate the appearance variation of the new frame I_{t+1} as a linear combination of the previous frame I_t and the bases weighted by w = [w1, . . . , wK]^T, such that

I_{t+1}(x) = I_t(x) + Σ_{k=1}^{K} w_k B_k(x)   (6)

Q2.1 (5 points) Express w as a function of I_{t+1}, I_t, and {B_k}_{k=1}^{K}, given Equation 6. Note that since the B_k's are orthobases, they are orthogonal to each other.

2.2 Tracking

Given K bases {B_k}_{k=1}^{K}, our goal is to simultaneously find the translation p = [px, py]^T and the weights w = [w1, . . . , wK]^T that minimize the following objective function:

min_{p,w} Σ_{x∈N} ||I_{t+1}(x + p) − I_t(x) − Σ_{k=1}^{K} w_k B_k(x)||_2^2   (7)

Again, starting with an initial guess of p (for instance, p = [0, 0]^T), one can linearize I_{t+1}(x + p + ∆p) with respect to ∆p. In a similar manner to Equation 5, one can incorporate these linearized approximations into a vectorized form such that

arg min_{∆p,w} ||A∆p − b − Bw||_2^2   (8)

As discussed in Section 3.4 of [2] (ignore the inverse compositional discussion), this can be simplified down to

arg min_{∆p} ||B^⊥(A∆p − b)||_2^2   (9)

where B^⊥ spans the null space of B. Note that ||B^⊥ z||_2^2 = ||z − B B^T z||_2^2 when B is an orthobasis.

Q2.2 (15 points) Implement a function with the following signature

LucasKanadeBasis(It, It1, rect, bases, p0 = np.zeros(2))

where bases is a three-dimensional matrix that contains the bases. It has the same format as frames as described earlier and can be found in sylvbases.npy.

Q2.3 (15 points) Write a script testSylvSequence.py that loads the video frames from sylvseq.npy and runs the new Lucas-Kanade tracker to track the sylv (the toy). The bases are available in sylvbases.npy in the data directory. The rectangle in the first frame is [x1, y1, x2, y2]^T = [101, 61, 155, 107]^T.
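Returning to Q2.1: because the bases are orthonormal, the weights fall out of Equation 6 by projecting the frame difference onto each basis. A minimal sketch (the function name is illustrative; bases is an H × W × K array as in sylvbases.npy):

```python
import numpy as np

def appearance_weights(It1, It, bases):
    # Sketch for Q2.1: with orthonormal bases {B_k},
    #   w_k = <B_k, I_{t+1} - I_t>  (inner product over all pixels).
    diff = (It1 - It).ravel()
    B = bases.reshape(-1, bases.shape[2])  # each column is a flattened basis
    return B.T @ diff
```

Projecting a difference that was synthesized from the bases recovers the mixing weights exactly.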
Please report the performance of this tracker at frames 1, 200, 300, 350 and 400 (the frame + bounding box), in comparison to that of the tracker in the first section. That is, there should be two rectangles for each frame, as exemplified in Figure 3. Also, create a sylvseqrects.npy for all the rects you have obtained for each frame. It should contain one single N × 4 matrix named rects, where N is the number of frames, and each row contains [x1, y1, x2, y2]^T, where [x1, y1]^T is the coordinate of the top-left corner of the tracking box, and [x2, y2]^T the bottom-right corner.

Figure 3: Lucas-Kanade Tracking with Appearance Basis

3 Affine Motion Subtraction

In this section, you will implement a tracker for estimating dominant affine motion in a sequence of images and subsequently identify pixels corresponding to moving objects in the scene. You will be using the images in the file aerialseq.npy, which consists of aerial views of moving vehicles from a non-stationary camera.

3.1 Dominant Motion Estimation

In the first section of this homework we assumed that the motion is limited to pure translation. In this section you shall implement a tracker for affine motion using a planar affine warp function. To estimate dominant motion, the entire image I_t will serve as the template to be tracked in image I_{t+1}; that is, I_{t+1} is assumed to be approximately an affine warped version of I_t. This approach is reasonable under the assumption that a majority of the pixels correspond to stationary objects in the scene whose depth variation is small relative to their distance from the camera. Using a planar affine warp function you can recover the vector ∆p = [p1, . . . , p6]^T:

x' = W(x; p) = [1 + p1, p2; p4, 1 + p5] [x; y] + [p3; p6]   (10)

One can represent this affine warp in homogeneous coordinates as

x̃' = M x̃   (11)

where

M = [1 + p1, p2, p3; p4, 1 + p5, p6; 0, 0, 1]   (12)

Note that M will differ between successive image pairs. Starting with an initial guess of p = 0 (i.e.
M = I), you will need to solve a sequence of least-squares problems to determine ∆p such that p → p + ∆p at each iteration. Note that unlike the previous examples, where the template to be tracked is usually small in comparison with the size of the image, image I_t will almost always not be contained fully in the warped version of I_{t+1}. Hence, one must only consider pixels lying in the region common to I_t and the warped version of I_{t+1} when forming the linear system at each iteration.

Q3.1 (15 points) Write a function with the following signature

LucasKanadeAffine(It, It1)

which returns the affine transformation matrix M, where It and It1 are I_t and I_{t+1} respectively. LucasKanadeAffine should be relatively similar to LucasKanade from the first section (you will probably also find scipy.ndimage.affine_transform helpful).

3.2 Moving Object Detection

Once you are able to compute the transformation matrix M relating an image pair I_t and I_{t+1}, a naive way of determining pixels lying on moving objects is as follows: warp the image I_t using M so that it is registered to I_{t+1}, and subtract it from I_{t+1}; the locations where the absolute difference exceeds a threshold can then be declared as corresponding to locations of moving objects. To obtain better results, you can check out the scipy.ndimage functions binary_erosion and binary_dilation.

Q3.2 (10 points) Using the function you have developed for dominant motion estimation, write a function with the following signature

mask = SubtractDominantMotion(image1, image2)

where image1 and image2 form the input image pair, and the return value mask is a binary image of the same size that indicates which pixels are considered to correspond to moving objects. You should invoke LucasKanadeAffine in this function to derive the transformation matrix M, and produce the aforementioned binary mask accordingly.
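The warp-subtract-threshold-clean pipeline can be sketched as below. This is a sketch, not the required SubtractDominantMotion: the affine matrix is taken as an argument (defaulting to identity) so the snippet is self-contained without LucasKanadeAffine, and the threshold value is illustrative.

```python
import numpy as np
from scipy.ndimage import affine_transform, binary_erosion, binary_dilation

def subtract_dominant_motion(image1, image2, M=None, threshold=0.1):
    # Sketch: register image1 to image2 with an affine warp, threshold the
    # absolute difference, then clean up the mask morphologically.
    if M is None:
        M = np.eye(3)
    # affine_transform maps output coordinates to input coordinates via
    # matrix @ o + offset (row/col convention).
    warped = affine_transform(image1, M[:2, :2], offset=M[:2, 2])
    mask = np.abs(image2 - warped) > threshold
    # Erosion removes isolated noisy pixels; dilation restores object extent.
    mask = binary_erosion(mask)
    mask = binary_dilation(mask)
    return mask
```

With identical inputs the mask is empty; a changed region survives the morphological clean-up.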
Q3.3 (10 points) Write a script testAerialSequence.py that loads the image sequence from aerialseq.npy and runs the motion detection routine you have developed to detect the moving objects. Report the performance at frames 30, 60, 90 and 120 with the corresponding binary masks superimposed, as exemplified in Figure 4. Feel free to visualize the motion detection performance in a way that you prefer, but please make sure it can be visually inspected without undue effort.

Figure 4: Lucas-Kanade Tracking with Appearance Basis

4 Efficient Tracking

4.1 Inverse Composition

The inverse compositional extension of the Lucas-Kanade algorithm (see [1]) has been used in the literature to great effect for the task of efficient tracking. When utilized within tracking, it attempts to linearize the current frame as

I_t(W(x; 0 + ∆p)) ≈ I_t(x) + (∂I_t(x)/∂x^T)(∂W(x; 0)/∂p^T) ∆p   (13)

In a similar manner to the conventional Lucas-Kanade algorithm, one can incorporate these linearized approximations into a vectorized form such that

arg min_{∆p} ||A′∆p − b′||_2^2   (14)

For the specific case of an affine warp, where p ← M and ∆p ← ∆M, this results in the update M = M(∆M)^{-1}.

Q4.1 (15 points) Reimplement the function LucasKanadeAffine(It, It1) as InverseCompositionAffine(It, It1) using the inverse compositional method. In your own words, describe why the inverse compositional approach is more computationally efficient than the classical approach.

4.2 Correlation Filters

Enter the directory Corr-Filters. Run the script file example.py, and observe a visual depiction of the extraction of sub-images from the Lena image (Figure 1) as well as the desired output response from our linear discriminant (Figure 2). Inspecting example.py you can see that it extracts a set of sub-images X = [x1, . . . , xN] from within the Lena image. These sub-images are stored in vector form so that xn ∈ R^D (in the example code all the sub-images are 29 × 45, therefore D = 1305).
Associated with these images are desired output labels y = [y1, . . . , yN], where yn lies between zero and one. For this example we have made D = N on purpose. Please proceed to answer the following questions:

Q4.2 (15 points) A linear least-squares discriminant can be estimated by solving

arg min_g Σ_{n=1}^{N} (1/2) ||yn − xn^T g||_2^2

We can simplify this objective in vector form and include an additional penalty term:

arg min_g (1/2) ||y − X^T g||_2^2 + (λ/2) ||g||_2^2   (15)

Please write down the solution to Equation 15 in terms of the matrices S = XX^T, X and y. Place the answer in your written response.

Q4.3 (15 points) Add your solution to Equation 15 in the example.py code. Visualize the resultant linear discriminant weight vector g for the penalty values λ = 0 and λ = 1 (remember to use the reshape function to convert g back into a 2D array). Apply the filter to the entire Lena image using the scipy.ndimage.correlate function. Visualize the responses for both values of λ. Include these visualizations in your written response document. Can you comment on which value of λ performs best and why? Place your answers and figures in your written response.

Q4.4 (15 points) Visualize the response you get if you attempt to use the 2D convolution function scipy.ndimage.convolve. Why does this give a different response from the one obtained using correlate? How could you use numpy indexing operations to get a response more similar to the one obtained using correlate? Place the answer in your written response.

5 Deliverables

The assignment should be submitted to Canvas. The writeup should be submitted as a pdf named <andrewid>.pdf. The code should be submitted as a zip named <andrewid>.zip. The zip, when uncompressed, should produce the following files:
• LucasKanade.py
• LucasKanadeAffine.py
• LucasKanadeBasis.py
• SubtractDominantMotion.py
• InverseCompositionAffine.py
• testCarSequence.py
• testSylvSequence.py
• testCarSequenceWithTemplateCorrection.py
• testAerialSequence.py
• carseqrects.npy
• carseqrects-wcrt.npy
• sylvseqrects.npy
• aerialseqrects.npy

Do not include the data directory in your submission.
16-720B Homework 4 3D Reconstruction Solution
Part I
Theory
Before implementing our own 3D reconstruction, let’s take a look at some simple theory questions
that may arise. The answers to the below questions should be relatively short, consisting of a few
lines of math and text (maybe a diagram if it helps your understanding).
Q1.1 (5 points) Suppose two cameras fixate on a point P (see Figure 1) in space such that
their principal axes intersect at that point. Show that if the image coordinates are normalized
so that the coordinate origin (0, 0) coincides with the principal point, the F33 element of the
fundamental matrix is zero.
Figure 1: Figure for Q1.1. C1 and C2 are the optical centers. The principal axes intersect at point P.
Q1.2 (5 points) Consider the case of two cameras viewing an object such that the second camera differs from the first by a pure translation that is parallel to the x-axis. Show that the epipolar lines in the two cameras are also parallel to the x-axis. Back up your argument with relevant equations.
Q1.3 (5 points) Suppose we have an inertial sensor which gives us the accurate positions (Ri and ti, where Ri is the rotation matrix and ti the corresponding translation vector) of the robot at time i. What will be the effective rotation (Rrel) and translation (trel) between two frames at different time stamps? Supposing the camera intrinsics (K) are known, express the essential matrix (E) and the fundamental matrix (F) in terms of K, Rrel and trel.
Figure 2: Figure for Q1.3. C1 and C2 are the optical centers. The rotation and the translation are obtained using inertial sensors. Rrel and trel are the relative rotation and translation between the two frames.
Q1.4 (10 points) Suppose that a camera views an object and its reflection in a plane mirror.
Show that this situation is equivalent to having two images of the object which are related by
a skew-symmetric fundamental matrix. You may assume that the object is flat, meaning that
all points on the object are of equal distance to the mirror (Hint: draw the relevant vectors to
understand the relationships between the camera, the object and its reflected image.)
Part II
Practice
1 Overview
In this part you will begin by implementing the two different methods seen in class to estimate
the fundamental matrix from corresponding points in two images (Section 2). Next, given the fundamental matrix and calibrated intrinsics (which will be provided) you will compute the essential
matrix and use this to compute a 3D metric reconstruction from 2D correspondences using triangulation (Section 3). Then, you will implement a method to automatically match points taking
advantage of epipolar constraints and make a 3D visualization of the results (Section 4). Finally,
you will implement RANSAC and bundle adjustment to further improve your algorithm (Section 5).
Figure 3: Temple images for this assignment
Figure 4: displayEpipolarF in helper.py creates a GUI for visualizing epipolar lines
2 Fundamental matrix estimation
In this section you will explore different methods of estimating the fundamental matrix given a pair
of images. In the data/ directory, you will find two images (see Figure 3) from the Middlebury multiview dataset1
, which is used to evaluate the performance of modern 3D reconstruction algorithms.
The Eight Point Algorithm
The 8-point algorithm (discussed in class, and outlined in Section 10.1 of Forsyth & Ponce) is
arguably the simplest method for estimating the fundamental matrix. For this section, you can use the provided correspondences in data/some_corresp.npz.
Q2.1 (10 points) Finish the function eightpoint in submission.py. Make sure you follow
the signature for this portion of the assignment:
F = eightpoint(pts1, pts2, M)
where pts1 and pts2 are N × 2 matrices corresponding to the (x, y) coordinates of the N points in the first and second image respectively. M is a scale parameter.
¹http://vision.middlebury.edu/mview/data/
• You should scale the data as was discussed in class, by dividing each coordinate by M (the
maximum of the image’s width and height). After computing F, you will have to “unscale”
the fundamental matrix.
Hint: If x_normalized = T x, then F_unnormalized = T^T F T.
You must enforce the singularity condition of F before unscaling.
• You may find it helpful to refine the solution by using local minimization. This probably won’t
fix a completely broken solution, but may make a good solution better by locally minimizing
a geometric cost function.
For this we have provided a helper function refineF in helper.py taking in F and the two
sets of points, which you can call from eightpoint before unscaling F.
• Remember that the x-coordinate of a point in the image is its column entry, and the y-coordinate is the row entry. Also note that "eight-point" is just a figurative name; it means that you need at least 8 points, and your algorithm should use an over-determined system (N > 8 points).
• To visualize the correctness of your estimated F, use the supplied function displayEpipolarF
in helper.py, which takes in F, and the two images. This GUI lets you select a point in one
of the images and visualize the corresponding epipolar line in the other image (Figure 4).
• Output: Save your matrix F and the scale M to the file q2_1.npz.
In your write-up: Write your recovered F and include an image of some example output
of displayEpipolarF.
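The scaling, nullspace solve, rank-2 enforcement, and unscaling steps above can be sketched as below; this is a minimal version that omits the optional refineF refinement.

```python
import numpy as np

def eightpoint(pts1, pts2, M):
    # Minimal sketch of the normalized eight-point algorithm.
    T = np.diag([1.0 / M, 1.0 / M, 1.0])
    x1, y1 = pts1[:, 0] / M, pts1[:, 1] / M
    x2, y2 = pts2[:, 0] / M, pts2[:, 1] / M
    # Each correspondence contributes one row of the homogeneous system A f = 0.
    A = np.stack([x2 * x1, x2 * y1, x2, y2 * x1, y2 * y1, y2,
                  x1, y1, np.ones_like(x1)], axis=1)
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce the rank-2 (singularity) constraint before unscaling.
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0
    F = U @ np.diag(S) @ Vt
    # Unscale: F_unnormalized = T^T F T.
    return T.T @ F @ T
```

On noise-free synthetic correspondences, the recovered F satisfies the epipolar constraint x2^T F x1 = 0 to numerical precision.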
The Seven Point Algorithm
Q2.2 (15 points) Since the fundamental matrix only has seven degrees of freedom, it is
possible to calculate F using only seven point correspondences. This requires solving a polynomial
equation. In this section, you will implement the seven-point algorithm (described in class, and outlined in Section 15.6 of Forsyth and Ponce). Manually select 7 points from the provided points in data/some_corresp.npz, and use these points to recover a fundamental matrix F. The function
should have the signature:
Farray = sevenpoint(pts1, pts2, M)
where pts1 and pts2 are 7 × 2 matrices containing the correspondences and M is the normalizer (use the maximum of the images' height and width), and Farray is a list of length either 1 or 3 containing the fundamental matrix/matrices. Use M to normalize the point values to the range [0, 1]
and remember to “unnormalize” your computed F afterwards.
• Use displayEpipolarF to visualize F and pick the correct one.
• Output: Save your matrix F, the scale M, and the 2D points pts1 and pts2 to the file q2_2.npz.
In your write-up: Write your recovered F and include an image of some example output of displayEpipolarF using the seven-point algorithm.
• Hints: You can use Numpy’s function roots(). The epipolar lines may not match exactly
due to imperfectly selected correspondences, and the algorithm is sensitive to small changes
in the point correspondences. You may want to try with different sets of matches.
3 Metric Reconstruction
You will compute the camera matrices and triangulate the 2D points to obtain the 3D scene structure. To obtain the Euclidean scene structure, first convert the fundamental matrix F to an essential
matrix E. Examine the lecture notes and the textbook to find out how to do this when the internal
camera calibration matrices K1 and K2 are known; these are provided in data/intrinsics.npz.
Q3.1 (5 points) Write a function to compute the essential matrix E given F, K1 and K2
with the signature:
E = essentialMatrix(F, K1, K2)
In your write-up: Write your estimated E using F from the eight-point algorithm.
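The conversion is a one-liner once you note that E relates normalized coordinates x_norm = K^{-1} x_pixel while F relates pixel coordinates; a sketch:

```python
import numpy as np

def essential_matrix(F, K1, K2):
    # From x2^T F x1 = 0 with x = K^{-1} x_pixel, it follows that
    # E = K2^T F K1.
    return K2.T @ F @ K1
```

Building F from a known E and the intrinsics, the function recovers E exactly.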
Given an essential matrix, it is possible to retrieve the projective camera matrices M1 and M2
from it. Assuming M1 is fixed at [I, 0], M2 can be retrieved up to a scale and four-fold rotation
ambiguity. For details on recovering M2, see section 7.2 in Szeliski. We have provided you with
the function camera2 in python/helper.py to recover the four possible M2 matrices given E.
Note: The M1 and M2 here are projection matrices of the form M1 = [I | 0] and M2 = [R | t].
Q3.2 (10 points) Using the above, write a function to triangulate a set of 2D coordinates in
the image to a set of 3D points with the signature:
[P, err] = triangulate(C1, pts1, C2, pts2)
where pts1 and pts2 are the N ×2 matrices with the 2D image coordinates and P is an N ×3 matrix
with the corresponding 3D points per row. C1 and C2 are the 3 × 4 camera matrices. Remember
that you will need to multiply the given intrinsics matrices with your solution for the canonical
camera matrices to obtain the final camera matrices. Various methods exist for triangulation –
probably the most familiar for you is based on least squares (see Szeliski Chapter 7 if you want to
learn about other methods):
For each point i, we want to solve for the 3D coordinates Pi = [xi, yi, zi]^T, such that when they are projected back to the two images, they are close to the original 2D points. To project the 3D coordinates back to the 2D images, we first write Pi in homogeneous coordinates, and compute C1 Pi and C2 Pi to obtain the 2D homogeneous coordinates projected onto camera 1 and camera 2, respectively.
For each point i, we can write this problem in the following form:

Ai Pi = 0,

where Ai is a 4 × 4 matrix and Pi is a 4 × 1 vector of the 3D coordinates in homogeneous form. Then, you can obtain the homogeneous least-squares solution (discussed in class) to solve for each Pi.

In your write-up: Write down the expression for the matrix Ai.
Once you have implemented triangulation, check the performance by looking at the reprojection
error:
err = Σ_i ( ||p1i − p̂1i||^2 + ||p2i − p̂2i||^2 )

where p̂1i = Proj(C1, Pi) and p̂2i = Proj(C2, Pi).
Note: C1 and C2 here are projection matrices of the form C1 = K1 M1 = K1 [I | 0] and C2 = K2 M2 = K2 [R | t].
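The homogeneous least-squares triangulation can be sketched as below, using the cross-product construction of Ai (one common choice; your derivation of Ai may order or sign the rows differently):

```python
import numpy as np

def triangulate(C1, pts1, C2, pts2):
    # DLT triangulation sketch: each image point contributes two rows of A_i,
    # derived from x × (C P) = 0; the solution is the right singular vector
    # with the smallest singular value.
    N = pts1.shape[0]
    P = np.zeros((N, 3))
    for i in range(N):
        x1, y1 = pts1[i]
        x2, y2 = pts2[i]
        A = np.stack([x1 * C1[2] - C1[0],
                      y1 * C1[2] - C1[1],
                      x2 * C2[2] - C2[0],
                      y2 * C2[2] - C2[1]])
        _, _, Vt = np.linalg.svd(A)
        P[i] = Vt[-1, :3] / Vt[-1, 3]
    # Reprojection error: sum of squared distances in both images.
    Ph = np.hstack([P, np.ones((N, 1))])
    p1 = C1 @ Ph.T; p1 = (p1[:2] / p1[2]).T
    p2 = C2 @ Ph.T; p2 = (p2[:2] / p2[2]).T
    err = np.sum((pts1 - p1) ** 2) + np.sum((pts2 - p2) ** 2)
    return P, err
```

With exact synthetic projections, the recovered points match the ground truth and the reprojection error is near zero.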
Q3.3 (10 points) Write a script findM2.py to obtain the correct M2 from M2s by testing the four solutions through triangulation. Use the correspondences from data/some_corresp.npz.
Output: Save the correct M2, the corresponding C2, and the 3D points P to q3_3.npz.
4 3D Visualization
You will now create a 3D visualization of the temple images. By treating our two images as a
stereo-pair, we can triangulate corresponding points in each image, and render their 3D locations.
Q4.1 (15 points) Implement a function with the signature:
[x2, y2] = epipolarCorrespondence(im1, im2, F, x1, y1)
This function takes in the x and y coordinates of a pixel on im1 and your fundamental matrix
F, and returns the coordinates of the pixel on im2 which correspond to the input point. The match
is obtained by computing the similarity of a small window around the (x1, y1) coordinates in im1
to various windows around possible matches in im2 and returning the closest.
Instead of searching for the matching point at every possible location in im2, we can use F and
simply search over the set of pixels that lie along the epipolar line (recall that the epipolar line
passes through a single point in im2 which corresponds to the point (x1, y1) in im1).
There are various possible ways to compute the window similarity. For this assignment, simple
methods such as the Euclidean or Manhattan distances between the intensity of the pixels should
suffice. See Szeliski Chapter 11, on stereo matching, for a brief overview of these and other methods.
Implementation hints:
• Experiment with various window sizes.
• It may help to use a Gaussian weighting of the window, so that the center has greater influence
than the periphery.
• Since the two images only differ by a small amount, it might be beneficial to consider matches
for which the distance from (x1, y1) to (x2, y2) is small.
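The search-along-the-epipolar-line idea can be sketched as below; the window size, search range, and SSD similarity are illustrative choices, and Gaussian weighting is omitted for brevity.

```python
import numpy as np

def epipolar_correspondence(im1, im2, F, x1, y1, win=5, search=40):
    # Sketch: walk along the epipolar line l = F x1 in im2 and return the
    # pixel whose window best matches the window around (x1, y1) under SSD.
    a, b, c = F @ np.array([x1, y1, 1.0])
    patch1 = im1[y1 - win:y1 + win + 1, x1 - win:x1 + win + 1]
    best, best_xy = np.inf, (x1, y1)
    for step in range(-search, search + 1):
        if abs(b) >= abs(a):  # line closer to horizontal: step along x
            x2 = x1 + step
            y2 = int(round(-(a * x2 + c) / b))
        else:                 # line closer to vertical: step along y
            y2 = y1 + step
            x2 = int(round(-(b * y2 + c) / a))
        if not (win <= x2 < im2.shape[1] - win and win <= y2 < im2.shape[0] - win):
            continue
        patch2 = im2[y2 - win:y2 + win + 1, x2 - win:x2 + win + 1]
        d = np.sum((patch1 - patch2) ** 2)
        if d < best:
            best, best_xy = d, (x2, y2)
    return best_xy
```

For a pure x-translation pair (where F is the skew matrix of t = (1, 0, 0) and the epipolar lines are horizontal, as in Q1.2 of the theory part), the match is found at the shifted location.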
To help you test your epipolarCorrespondence, we have included a helper function epipolarMatchGUI in python/helper.py, which takes in the two images and the fundamental matrix. This GUI allows you to
click on a point in im1, and will use your function to display the corresponding point in im2. See
Figure 5.
It’s not necessary for your matcher to get every possible point right, but it should get easy
points (such as those with distinctive, corner-like windows). It should also be good enough to
render an intelligible representation in the next question.
Output: Save the matrix F and the points pts1 and pts2 which you used to generate the screenshot to the file q4_1.npz.
In your write-up: Include a screenshot of epipolarMatchGUI with some detected correspondences.
Figure 5: epipolarMatchGUI shows the corresponding point found by calling
epipolarCorrespondence
Q4.2 (10 points) Included in this homework is a file data/templeCoords.npz which contains
288 hand-selected points from im1 saved in the variables x1 and y1.
Now, we can determine the 3D locations of these point correspondences using the triangulate function. These 3D point locations can then be plotted using the Matplotlib or plotly package. Write a script visualize.py, which loads the necessary files from ../data/ to generate the 3D reconstruction using scatter. An example is shown in Figure 6.
Output: Again, save the matrix F and the matrices M1, M2, C1, C2 which you used to generate the screenshots to the file q4_2.npz.
In your write-up: Take a few screenshots of the 3D visualization so that the outline of the temple
is clearly visible, and include them with your homework submission.
5 Bundle Adjustment
Q5.1 (15 points) In some real-world applications, manually determining correspondences is infeasible, and often there will be noisy correspondences. Fortunately, the RANSAC method seen in class can be applied to the problem of fundamental matrix estimation.
Implement the above algorithm with the signature:
[F, inliers] = ransacF(pts1, pts2, M)
where M is defined in the same way as in Section 2 and inliers is a boolean vector of size equivalent
to the number of points. Here inliers is set to true only for the points that satisfy the threshold
defined for the given fundamental matrix F.
We have provided some noisy correspondences in some_corresp_noisy.npz, in which around 75% of the points are inliers. Compare the result of RANSAC with the result of eightpoint when run on the noisy correspondences. Briefly explain the error metrics you used, how you decided which points were inliers, and any other optimizations you may have made.
• Hints: Use the seven-point algorithm to compute the fundamental matrix from the minimal set of points. Then compute the inliers, and refine your estimate using all the inliers.
Q5.2 (15 points)
Figure 6: An example point cloud
So far we have independently solved for the camera matrix Mj and the 3D points Pi. In bundle adjustment, we will jointly optimize the reprojection error with respect to the points Pi and the camera matrix Mj:

err = Σ_ij ||p_ij − Proj(Cj, Pi)||^2

where Cj = Kj Mj, same as in Q3.2.

For this homework we are only going to optimize the extrinsic matrix. To do this we will parametrize the rotation matrix R using the Rodrigues formula to produce a vector r ∈ R^3.

Write a function that converts a Rodrigues vector r to a rotation matrix R

R = rodrigues(r)

as well as the inverse function that converts a rotation matrix R to a Rodrigues vector r

r = invRodrigues(R)
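A NumPy-only sketch of the two conversions follows (scipy.spatial.transform.Rotation offers equivalent functionality via from_rotvec/as_rotvec); the inverse shown here ignores the θ = π edge case:

```python
import numpy as np

def rodrigues(r):
    # Rodrigues vector -> rotation matrix: R = I + sin(t) K + (1 - cos(t)) K^2,
    # where t = |r| and K is the skew matrix of the unit axis k = r / t.
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        return np.eye(3)
    k = r / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def inv_rodrigues(R):
    # Rotation matrix -> Rodrigues vector, valid away from theta = pi.
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < 1e-12:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta * w / (2.0 * np.sin(theta))
```

Round-tripping a vector through both functions returns it, and the resulting matrix is a proper rotation.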
Q5.3 (10 points)
Using this parameterization, write an optimization function
residuals = rodriguesResidual(K1, M1, p1, K2, p2, x)
where x is the flattened concatenation of P, r2, and t2. P contains the 3D points; r2 and t2 are the rotation (in Rodrigues vector form) and translation vectors associated with the projection matrix M2. The residuals are the difference between the original image projections and the estimated projections (the square of the 2-norm of this vector corresponds to the error we computed in Q3.2):

residuals = numpy.concatenate([(p1 - p1_hat).reshape([-1]),
                               (p2 - p2_hat).reshape([-1])])
Use this error function and Scipy’s nonlinear least square optimizer leastsq write a function
to optimize for the best extrinsic matrix and 3D points using the inlier correspondences from
some corresp noisy.npz and the RANSAC estimate of the extrinsics and 3D points as an initialization.
[M2, P] = bundleAdjustment(K1, M1, p1, K2, M2 init, p2, P init)
In your write-up: include an image of the original 3D points and the optimized points, as well as the reprojection error with your initial M2 and P and with the optimized matrices.
6 Deliverables
If your andrew id is bovik, your submission should be the writeup bovik.pdf and a zip file
bovik.zip. Please submit the zip file to Canvas and the pdf file to Gradescope.
The zip file should include the following directory structure:
• submission.py: your implementation of algorithms.
• findM2.py: script to compute the correct camera matrix.
• visualize.py: script to visualize the 3d points.
• q2_1.npz: file with output of Q2.1.
• q2_2.npz: file with output of Q2.2.
• q3_3.npz: file with output of Q3.3.
• q4_1.npz: file with output of Q4.1.
• q4_2.npz: file with output of Q4.2.
16720-B Homework 5 Neural Networks for Recognition with Solution
1 Theory
Q1.1 Theory [2 points] Prove that softmax is invariant to translation, that is,

softmax(x) = softmax(x + c)  ∀c ∈ R

Softmax is defined as below, for each index i in a vector x:

softmax(x_i) = e^{x_i} / Σ_j e^{x_j}

Often we use c = −max_i x_i. Why is that a good idea? (Tip: consider the range of values that the numerator will have with c = 0 and c = −max_i x_i.)
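The stabilization trick asked about in Q1.1 can be sketched in a few lines:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: subtracting max(x) leaves the result
    # unchanged (translation invariance) but keeps every exponent <= 0,
    # avoiding overflow.
    s = np.exp(x - np.max(x))
    return s / np.sum(s)
```

The output sums to one, is unchanged by translation, and stays finite for large inputs.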
Q1.2 Theory [2 points] Softmax can be written as a three-step process, with s_i = e^{x_i}, S = Σ_i s_i, and softmax(x_i) = (1/S) s_i.
• As x ∈ R^d, what are the properties of softmax(x); namely, what is the range of each element? What is the sum over all elements?
• One could say that "softmax takes an arbitrary real valued vector x and turns it into a ______".
• Can you see the role of each step in the multi-step process now? Explain them.
Q1.3 Theory [2 points] Show that multi-layer neural networks without a non-linear
activation function are equivalent to linear regression.
Q1.4 Theory [3 points] Given the sigmoid activation function σ(x) = 1/(1 + e^{−x}), derive the gradient of the sigmoid function and show that it can be written as a function of σ(x) (without having access to x directly).
Q1.5 Theory [12 points] Given y = x^T W + b (or y_j = Σ_{i=1}^{d} x_i W_ij + b_j), and the gradient of some loss J with respect to y, show how to get ∂J/∂W, ∂J/∂x and ∂J/∂b. Be sure to do the derivatives with scalars and re-form the matrix form afterwards. Here are some notational suggestions:

∂J/∂y = δ ∈ R^{k×1},  W ∈ R^{d×k},  x ∈ R^{d×1},  b ∈ R^{k×1}

We won't grade the derivative with respect to b, but you should do it anyway; you will need it later in the assignment.
Q1.6 Theory [4 points] When the neural network applies an elementwise activation function (such as sigmoid), the gradient of the activation function scales the backpropagation update. This follows directly from the chain rule: d/dx f(g(x)) = f'(g(x)) g'(x).

1. Consider the sigmoid activation function for deep neural networks. Why might it lead to a "vanishing gradient" problem if it is used for many layers (consider plotting the gradient you derived in Q1.4)?
2. Often it is replaced with tanh(x) = (1 − e^{−2x}) / (1 + e^{−2x}). What are the output ranges of both tanh and sigmoid? Why might we prefer tanh?
3. Why does tanh(x) have less of a vanishing gradient problem? (Plotting the derivatives helps! For reference: tanh'(x) = 1 − tanh(x)^2.)
4. tanh is a scaled and shifted version of the sigmoid. Show how tanh(x) can be written in terms of σ(x). (Hint: consider how to make them have the same range.)
2 Implement a Fully Connected Network
All of these functions should be implemented in python/nn.py
2.1 Network Initialization
Q2.1.1 Theory [2 points] Why is it not a good idea to initialize a network with all zeros?
If you imagine that every layer has weights and biases, what can a zero-initialized network
output after training?
Q2.1.2 Code [1 point] Implement a function to initialize neural network weights with Xavier initialization [1], where Var[w] = 2 / (n_in + n_out), n_in and n_out are the input and output dimensionalities of the layer, and you use a uniform distribution to sample random numbers (see Eq. 16 in the paper).
Q2.1.3 Theory [1 point] Why do we initialize with random numbers? Why do we scale the initialization depending on layer size (see near Fig. 6 in the paper)?
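Xavier initialization with a uniform distribution can be sketched as below; a uniform variable on [−a, a] has variance a²/3, so a = √(6 / (n_in + n_out)) gives the required Var[w]. The function name and the zero bias are illustrative choices.

```python
import numpy as np

def initialize_weights(in_size, out_size):
    # Xavier initialization sketch: Var[w] = 2 / (n_in + n_out), sampled
    # uniformly on [-a, a] with a = sqrt(6 / (n_in + n_out)).
    a = np.sqrt(6.0 / (in_size + out_size))
    W = np.random.uniform(-a, a, size=(in_size, out_size))
    b = np.zeros(out_size)
    return W, b
```

The empirical variance of a large sampled weight matrix lands close to the 2 / (n_in + n_out) target.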
2.2 Forward Propagation
The appendix (Section 8) has the math for forward propagation; we will implement it here.
Q2.2.1 Code [4 points] Implement sigmoid, along with forward propagation for a single layer with an activation function, namely y = σ(XW + b), returning the output and intermediate results for an N × D input X, with examples along the rows and data dimensions along the columns.
Q2.2.2 Code [3 points] Implement the softmax function. Be sure to use the numerical stability trick you derived in Q1.1.
Q2.2.3 Code [3 points] Write a function to compute the accuracy of a set of labels, along
with the scalar loss across the data. The loss function generally used for classification is the
cross-entropy loss.
L_f(D) = − Σ_{(x,y)∈D} y · log(f(x))

Here D is the full training dataset of data samples x (N × 1 vectors, where N is the dimensionality of the data) and labels y (C × 1 one-hot vectors, where C is the number of classes).
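The loss-and-accuracy computation can be sketched as below, assuming the network outputs (softmax probabilities) are arranged with examples along the rows, matching the convention above; the function name is illustrative.

```python
import numpy as np

def compute_loss_and_acc(y, probs):
    # y: one-hot labels (examples x classes); probs: softmax outputs.
    # Cross-entropy summed over the dataset; accuracy from argmax agreement.
    loss = -np.sum(y * np.log(probs))
    acc = np.mean(np.argmax(probs, axis=1) == np.argmax(y, axis=1))
    return loss, acc
```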
2.3 Backwards Propagation
Q2.3.1 Code [7 points] Compute backpropagation for a single layer, given the original weights, the appropriate intermediate results, and the gradient with respect to the loss. You should return the gradient with respect to X so you can feed it into the next layer. As a size check, your gradients should have the same dimensions as the original objects.
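A single-layer backward pass through y = σ(XW + b) can be sketched as below, using the gradients from Q1.4 and Q1.5; the argument list (with the cached pre-activation passed in) is one possible interface, not the required one.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backwards(delta, W, b, X, pre_act):
    # delta = dJ/dy; pre_act = XW + b cached from the forward pass.
    act = sigmoid(pre_act)
    d_pre = delta * act * (1.0 - act)   # sigma'(z) = sigma(z)(1 - sigma(z))
    grad_W = X.T @ d_pre                # same shape as W
    grad_b = d_pre.sum(axis=0)          # same shape as b
    grad_X = d_pre @ W.T                # propagated to the layer below
    return grad_X, grad_W, grad_b
```

A quick central-difference spot check (in the spirit of Q2.5) agrees with the analytic grad_W.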
2.4 Training Loop
You will tend to see gradient descent in three forms: “normal”, “stochastic” and “batch”.
“Normal” gradient descent aggregates the updates for the entire dataset before changing the
weights. Stochastic gradient descent applies updates after every single data example. Batch
gradient descent is a compromise, where random subsets of the full dataset are evaluated
before applying the gradient update.
Q2.4.1 Code [5 points] Write a training loop that generates random batches, iterates over
them for many iterations, does forward and backward propagation, and applies a gradient
update step.
2.5 Numerical Gradient Checker
Q2.5.1 [5 points] Implement a numerical gradient checker. Instead of using the analytical
gradients computed from the chain rule, add a small offset ε to each element in the weights,
and compute the numerical gradient of the loss with central differences:

    (f(x + ε) − f(x − ε)) / (2ε)

Remember, this needs to be done for each scalar dimension in all of your
weights independently.
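A sketch of the element-by-element checker (the function name is an assumption; `f` is any zero-argument callable that evaluates the loss using the current weights):

```python
import numpy as np

def numerical_gradient(f, W, eps=1e-5):
    # central differences, one scalar weight at a time
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        i = it.multi_index
        old = W[i]
        W[i] = old + eps
        f_plus = f()
        W[i] = old - eps
        f_minus = f()
        W[i] = old                      # restore before moving on
        grad[i] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return grad
```

Comparing this output against your analytical gradients (e.g. with np.allclose at a loose tolerance) catches most backprop bugs.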
3 Training Models
First, be sure to run the script get_data.sh from inside the scripts folder. This will use
wget and unzip to download

https://cmu.box.com/shared/static/ebklocte1t2wl4dja61dg1ct285l65md.zip
https://cmu.box.com/shared/static/147jyd6uewh1fff96yfh6e11s98v11xm.zip

and extract them to the data and image folders.
Since our input images are 32 × 32 images, unrolled into one 1024-dimensional vector that
gets multiplied by W^(1), each row of W^(1) can be seen as a weight image. Reshaping each
row into a 32 × 32 image can give us an idea of what types of images each unit in the hidden
layer has a high response to.
We have provided you three data .mat files to use for this section. The training data in
nist36_train.mat contains samples for each of the 26 upper-case letters of the alphabet
and the 10 digits. This is the set you should use for training your network. The cross-validation
set in nist36_valid.mat contains samples from each class, and should be used
in the training loop to see how the network is performing on data that it is not training
on. This will help you spot overfitting. Finally, the test data in nist36_test.mat contains
testing data, and should be used for the final evaluation on your best model to see how well
it generalizes to new, unseen data.
Q3.1.1 Code [3 points] Train a network from scratch. Use a single hidden layer with 64
hidden units, and train for at least 30 epochs. Modify the script to generate two plots:
one showing the accuracy on both the training and validation sets over the epochs, and the
other showing the cross-entropy loss averaged over the data. The x-axis should represent
the epoch number, while the y-axis represents the accuracy or loss. With these settings, you
should see an accuracy on the validation set of at least 75%.
Q3.1.2 Writeup [2 points] Use your modified training script to train three networks, one
with your best learning rate, one with 10 times that learning rate and one with one tenth
that learning rate. Include all 4 plots in your writeup. Comment on how the learning rates
affect the training, and report the final accuracy of the best network on the test set.
Q3.1.3 Writeup [3 points] Visualize the first layer weights that your network learned
(using reshape and ImageGrid). Compare these to the network weights immediately after
initialization. Include both visualizations in your writeup. Comment on the learned weights.
Do you notice any patterns?
Q3.1.4 Writeup [2 points] Visualize the confusion matrix for your best model. Comment
on the top few pairs of classes that are most commonly confused.
4 Extract Text from Images
Figure 1: Sample image with handwritten characters annotated with boxes around each
character.
Now that you have a network that can recognize handwritten letters with reasonable accuracy,
you can use it to parse text in an image. Given an image with some text on it,
our goal is to have a function that returns the actual text in the image. However, since your
neural network expects a binary image with a single character, you will need to process
the input image to extract each character. There are various approaches for doing this,
so feel free to use any strategy you like.
Here we outline one possible method (another is given in a tutorial):
1. Process the image (blur, threshold, opening morphology, etc. (perhaps in that order))
to classify all pixels as being part of a character or background.
2. Find connected groups of character pixels (see skimage.measure.label). Place a bounding box around each connected component.
3. Group the letters based on which line of the text they are a part of, and sort each
group so that the letters are in the order they appear on the page.
4. Take each bounding box one at a time and resize it to 32 × 32, classify it with your
network, and report the characters in order (inserting spaces when it makes sense).
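The first two steps above can be sketched with skimage (the function name findLetters matches Q4.2 below; the blur sigma, footprint size, and area threshold are illustrative assumptions you will want to tune on the provided images):

```python
import numpy as np
import skimage.filters
import skimage.measure
import skimage.morphology

def findLetters(image):
    # image: RGB float image with values in [0, 1]
    gray = image.mean(axis=2) if image.ndim == 3 else image
    # blur to suppress noise, then pick a threshold automatically
    blurred = skimage.filters.gaussian(gray, sigma=1.0)
    thresh = skimage.filters.threshold_otsu(blurred)
    chars = blurred < thresh                 # characters are dark on light paper
    # morphological closing joins broken strokes within a character
    chars = skimage.morphology.closing(chars, np.ones((5, 5), dtype=bool))
    # label connected components and collect their bounding boxes
    labels = skimage.measure.label(chars, connectivity=2)
    bboxes = []
    for region in skimage.measure.regionprops(labels):
        if region.area >= 100:               # drop tiny specks
            bboxes.append(region.bbox)       # (y1, x1, y2, x2)
    bw = (~chars).astype(float)              # characters black (0), background white (1)
    return bboxes, bw
```

Steps 3 and 4 (grouping boxes into lines and sorting left to right) are then just sorting the returned boxes by their row and column coordinates.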
Since the network you trained likely does not have perfect accuracy, you can expect there
to be some errors in your final text parsing. Whichever method you choose to implement
for the character detection, you should be able to place a box on most of the characters in
the image. We have provided you with 01_list.jpg, 02_letters.jpg, 03_haiku.jpg and
04_deep.jpg to test your implementation on.
Q4.1 Theory [2 points] The method outlined above is pretty simplistic, and makes
several assumptions. What are two big assumptions that the sample method makes? In your
writeup, include two example images where you expect the character detection to fail (either
missing valid letters, or responding to non-letters).
Q4.2 Code [10 points] Find letters in the image. Given an RGB image, this function
should return bounding boxes for all of the located handwritten characters in the image, as
well as a binary black-and-white version of the image im. Each row of the matrix should
contain [y1,x1,y2,x2] the positions of the top-left and bottom-right corners of the box.
The black and white image should be floating point, 0 to 1, with the characters in black and
background in white.
Q4.3 Writeup [3 points] Run findLetters(..) on all of the provided sample images
in images/. Plot all of the located boxes on top of the image to show the accuracy of your
findLetters(..) function. Include all the result images in your writeup.
Q4.4 Code/Writeup [8 points] Now you will load the image, find the character locations, classify each one with the network you trained in Q3.2.1, and return the text contained
in the image. Be sure you try to make your detected images look like the images from the
training set. Visualize them and act accordingly.
Run your run_q4 script on all of the provided sample images in images/. Include the extracted
text in your writeup.
5 Image Compression with Autoencoders
An autoencoder is a neural network that is trained to attempt to copy its input to its output,
but it usually allows copying only approximately. This is typically achieved by restricting
the number of hidden nodes inside the autoencoder; in other words, the autoencoder would
be forced to learn to represent data with this limited number of hidden nodes. This is a
useful way of learning compressed representations. In this section, we will continue using
the NIST36 dataset you have from the previous questions.
5.1 Building the Autoencoder
Q5.1.1 Code [5 points] Due to the difficulty in training auto-encoders, we have to move
to the relu(x) = max(x, 0) activation function. It is provided for you in util.py. Implement
a 2 hidden layer autoencoder where the layers are
• 1024 to 32 dimensions, followed by a ReLU
• 32 to 32 dimensions, followed by a ReLU
• 32 to 32 dimensions, followed by a ReLU
• 32 to 1024 dimensions, followed by a sigmoid (this normalizes the image output for us)
The loss function that you’re using is total squared error for the output image compared to
the input image (they should be the same!).
Q5.1.2 Code [3 points] To help even more with convergence speed, we will implement
momentum. Now, instead of updating W = W − α ∂J/∂W, we will use the update rules

    M_W = 0.9 M_W − α ∂J/∂W
    W = W + M_W

To implement this, populate the parameters dictionary
with zero-initialized momentum accumulators, one for each parameter. Then simply perform
both update equations for every batch.
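The momentum update described above can be sketched as follows (the function name and dictionary keys are assumptions; in the assignment the accumulators would live alongside the weights in your parameters dictionary):

```python
import numpy as np

def momentum_step(params, grads, momentum, lr=1e-3, mu=0.9):
    # params, grads, momentum: dicts keyed by parameter name
    for k in params:
        # M_W = 0.9 * M_W - alpha * dJ/dW, then W = W + M_W
        momentum[k] = mu * momentum[k] - lr * grads[k]
        params[k] = params[k] + momentum[k]
    return params, momentum
```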
5.2 Training the Autoencoder
Q5.2 Writeup/Code [2 points] Using the provided default settings, train the network
for 100 epochs. What do you observe in the plotted training loss curve as it progresses?
5.3 Evaluating the Autoencoder
Q5.3.1 Writeup/Code [2 points] Now let's evaluate how well the autoencoder has been
trained. Select 5 classes from the total of 36 classes in your dataset, and for each selected class
include in your report 2 validation images and their reconstructions. What differences do you
observe in the reconstructed validation images, compared to the original ones?
Q5.3.2 Writeup [2 points] Let's evaluate the reconstruction quality using Peak Signal-to-Noise
Ratio (PSNR). PSNR is defined as

    PSNR = 20 log10(MAX_I) − 10 log10(MSE)    (1)

where MAX_I is the maximum possible pixel value of the image, and MSE (mean squared
error) is computed across all pixels. You may use skimage.measure.compare_psnr for
convenience. Report the average PSNR you get from the autoencoder across all validation
images.
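If you prefer to compute it yourself, Equation (1) is a few lines of numpy (the function name is an assumption; MAX_I is 1.0 for images scaled to [0, 1]):

```python
import numpy as np

def psnr(clean, noisy, max_val=1.0):
    # PSNR = 20 log10(MAX_I) - 10 log10(MSE), with MSE over all pixels
    mse = np.mean((clean - noisy) ** 2)
    return 20 * np.log10(max_val) - 10 * np.log10(mse)
```

Note that PSNR is undefined for identical images (MSE = 0), so only apply it to actual reconstructions.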
6 Comparing against PCA
As a baseline for comparison, we will use one of the most popular methods for data
dimensionality reduction: Principal Component Analysis (PCA). PCA finds the
best low-rank approximation of the data by keeping only a specified number of principal
components. To perform PCA, we will use a factorization method called Singular Value
Decomposition (SVD).
Run SVD on the training data. One of the resulting matrices will be an orthonormal matrix
that indicates the components of your data, sorted by importance. Extract the first 32
principal components and form a projection matrix; you will need to figure out how to do
this from the U, S, V matrices.
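A sketch of the SVD step with numpy (function names are assumptions; note that, following the text above, this runs SVD on the raw data rather than mean-centering it first, which is how PCA is usually defined):

```python
import numpy as np

def pca_projection(X, k=32):
    # X: N x D training data, rows are samples
    # rows of Vt are the principal directions, sorted by singular value
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T        # D x k projection matrix

def reconstruct(X, P):
    # compress each sample to k numbers (X @ P), then map back to D dims
    return X @ P @ P.T
```

Because the columns of the projection matrix are orthonormal, projecting an already-reconstructed sample changes nothing.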
Q6.1 Writeup [2 points] What is the size of your projection matrix? What is its rank?
This projection matrix was “trained” from our training data. Now let's “test” it on our test
data. Use the projection matrix on test data to obtain the reconstructed test images. Note
that these reconstructions can also be represented with only 32 numbers.
Q6.2 Writeup [2 points] Use the classes you selected in Q5.3.1, and for each of these
5 classes, include in your report 2 test images and their reconstructions. You may use test
labels to help you find the corresponding classes. What differences do you observe
in the reconstructed test images, compared to the original ones? How do they compare to
the ones reconstructed from your autoencoder?
Q6.3 Writeup [2 points] Report the average PSNR you get from PCA. Is it better than
your autoencoder? Why?
Q6.4 Writeup [2 points] Count the number of learned parameters for both your autoencoder and the PCA model. How many parameters do the respective approaches learn? Why
is there such a difference in terms of performance?
7 PyTorch
While you were able to derive manual backpropagation rules for sigmoid and fully-connected
layers, wouldn't it be nice if someone did that for lots of useful primitives and made it fast
and easy to use for general computation? Meet automatic differentiation. Since we have
high-dimensional inputs (images) and low-dimensional outputs (a scalar loss), it turns out
reverse-mode AD is very efficient. Popular autodiff packages include PyTorch (Facebook),
TensorFlow (Google), and Autograd (Boston-area academics). Autograd provides its own
replacement for numpy operators and is a drop-in replacement for numpy, except you can ask
for gradients now. The other two can act as front-ends for cuDNN, NVIDIA's library
of GPU-accelerated deep learning primitives. Since GPUs are able to perform large
amounts of math much faster than CPUs, this makes the former two packages very popular
for researchers who train large networks. TensorFlow asks you to build a computational
graph using its API, and then is able to pass data through that graph. PyTorch builds a
dynamic graph and allows you to mix autograd functions with normal Python code much
more smoothly, so it is currently more popular among CMU students.
For extra credit, we will use PyTorch as a framework. Many computer vision projects use
neural networks as a basic building block, so familiarity with one of these frameworks is
a good skill to develop. Here, we basically replicate and slightly expand our handwritten
character recognition networks, but do it in PyTorch instead of doing it ourselves. Feel free
to use any tutorial you like, but we like the official one or this tutorial (in a jupyter notebook)
or these slides (starting from number 35).
For this section, you’re free to implement these however you like. All of the
tasks required here are fairly small and don’t require a GPU if you use small
networks. Including 7.2.
7.1 Train a neural network in PyTorch
Q7.1.1 Code/Writeup [5 points] Re-write and re-train your fully-connected network
on NIST36 in PyTorch. Plot training accuracy and loss over time.
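A minimal sketch of what the PyTorch version might look like (the architecture mirrors the 1024 → 64 → 36 network from Section 3; the hyperparameters and function name are illustrative assumptions):

```python
import torch
import torch.nn as nn

# fully-connected network; dimensions match NIST36 (32x32 inputs, 36 classes)
model = nn.Sequential(
    nn.Linear(1024, 64),
    nn.Sigmoid(),
    nn.Linear(64, 36),   # raw logits; CrossEntropyLoss applies softmax
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def train_step(x, y):
    # x: batch of flattened images, y: integer class labels
    optimizer.zero_grad()
    logits = model(x)
    loss = loss_fn(logits, y)
    loss.backward()       # autograd computes every gradient for us
    optimizer.step()
    return loss.item()
```

Compare this to Section 2: the forward, backward, and update steps are all still here, but autograd derives the backward pass for you.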
Q7.1.2 Code/Writeup [2 points] Train a convolutional neural network with PyTorch
on MNIST. Plot training accuracy and loss over time.
Q7.1.3 Code/Writeup [3 points] Train a convolutional neural network with PyTorch
on the included NIST36 dataset.
Q7.1.4 Code/Writeup [10 points] Train a convolutional neural network with PyTorch
on the EMNIST Balanced dataset and evaluate it on the findLetters bounded boxes from
the images folder.
7.2 Fine Tuning
When training from scratch, a lot of epochs and data are often needed to learn anything
meaningful. One way to avoid this is to instead initialize the weights more intelligently.
These days, it is most common to initialize a network with weights from another deep
network that was trained for a different purpose. This is because, whether we are doing image
classification, segmentation, recognition, etc., most real images share common properties.
Simply copying the weights from the other network to yours gives your network a head start,
so your network does not need to learn these common weights from scratch all over again.
This is also referred to as fine-tuning.
Q7.2.1 Code/Writeup [5 points] Fine-tune a single layer classifier using pytorch on the
flowers 17 (or flowers 102!) dataset using squeezenet1_1, as well as an architecture you've
designed yourself (3 conv layers followed by 2 fc layers; it's the standard design from slide 6)
and trained from scratch. How do they compare?
We include a script in scripts/ to fetch the flowers dataset and extract it in a way that
PyTorch's ImageFolder can consume it (see the example in data/oxford-flowers17). You
should look at how SqueezeNet is defined, and just replace the classifier layer. There exists
a pretty good example for fine-tuning in PyTorch.
References
[1] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep
feedforward neural networks. 2010. http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf.
[2] P. J. Grother. NIST Special Database 19: Handprinted Forms and Characters Database.
https://www.nist.gov/srd/nist-special-database-19, 1995.
8 Appendix: Neural Network Overview
Deep learning has quickly become one of the most applied machine learning techniques in
computer vision. Convolutional neural networks have been applied to many different
computer vision problems such as image classification, recognition, and segmentation with great
success. In this assignment, you will first implement a fully connected feedforward neural
network for handwritten character classification. Then in the second part, you will
implement a system to locate characters in an image, which you can then classify with your deep
network. The end result will be a system that, given an image of handwritten text, will
output the text contained in the image.
8.1 Basic Use
Here we will give a brief overview of the math for a single hidden layer feed forward network.
For a more detailed look at the math and derivation, please see the class slides.
A fully-connected network f, for classification, applies a series of linear and non-linear
functions to an input data vector x of size N × 1 to produce an output vector f(x) of size
C ×1, where each element i of the output vector represents the probability of x belonging to
the class i. Since the data samples are of dimensionality N, this means the input layer has
N input units. To compute the value of the output units, we must first compute the values
of all the hidden layers. The first hidden layer pre-activation a^(1)(x) is given by

    a^(1)(x) = W^(1) x + b^(1)

Then the post-activation values of the first hidden layer h^(1)(x) are computed by applying a
non-linear activation function g to the pre-activation values:

    h^(1)(x) = g(a^(1)(x)) = g(W^(1) x + b^(1))
Subsequent hidden layer (1 < t ≤ T) pre- and post-activations are given by:

    a^(t)(x) = W^(t) h^(t−1)(x) + b^(t)
    h^(t)(x) = g(a^(t)(x))

The output layer pre-activations a^(T)(x) are computed in a similar way:

    a^(T)(x) = W^(T) h^(T−1)(x) + b^(T)

and finally the post-activation values of the output layer are computed with

    f(x) = o(a^(T)(x)) = o(W^(T) h^(T−1)(x) + b^(T))
where o is the output activation function. Please note the difference between g and o! For
this assignment, we will be using the sigmoid activation function for the hidden layers, so:

    g(y) = 1 / (1 + exp(−y))
Figure 2: Samples from NIST Special 19 dataset [2]
where, when g is applied to a vector, it is applied element-wise across the vector.
Since we are using this deep network for classification, a common output activation function
to use is the softmax function. This will allow us to turn the real-valued, possibly negative
entries of a^(T)(x) into a set of probabilities (a vector of positive numbers that sum to 1).
Letting y_i denote the i-th element of the vector y, the softmax function is defined as:

    o_i(y) = exp(y_i) / Σ_j exp(y_j)
Gradient descent is an iterative optimization algorithm used to find a local optimum. To
find a local minimum, we start at a point on the function and move in the direction of the
negative gradient (steepest descent) until some stopping criterion is met.
8.2 Backprop
The update equations for a general weight W^(t)_ij and bias b^(t)_i are

    W^(t)_ij = W^(t)_ij − α ∂L_f(x)/∂W^(t)_ij
    b^(t)_i  = b^(t)_i  − α ∂L_f(x)/∂b^(t)_i

where α is the learning rate. Please refer to the backpropagation slides for more details on how to
derive the gradients. Note that here we are using the softmax loss (which is different from the
least-squares loss in the slides).