Image Classification
Challenges
- Viewpoint variation. A single instance of an object can be oriented in many ways with respect to the camera.
- Scale variation. Visual classes often exhibit variation in their size (size in the real world, not only in terms of their extent in the image).
- Deformation. Many objects of interest are not rigid bodies and can be deformed in extreme ways.
- Occlusion. The objects of interest can be occluded. Sometimes only a small portion of an object (as little as a few pixels) could be visible.
- Illumination conditions. The effects of illumination are drastic on the pixel level.
- Background clutter. The objects of interest may blend into their environment, making them hard to identify.
- Intra-class variation. The classes of interest can often be relatively broad, such as chair. There are many different types of these objects, each with their own appearance.
Data-Driven Approach
- first, accumulate a training dataset of labeled images
- then, develop learning algorithms that look at these examples and learn about the visual appearance of each class
Image Classification Pipeline
- Input: Our input consists of a set of N images, each labeled with one of K different classes. We refer to this data as the training set.
- Learning: Our task is to use the training set to learn what every one of the classes looks like. We refer to this step as training a classifier, or learning a model.
- Evaluation: In the end, we evaluate the quality of the classifier by asking it to predict labels for a new set of images that it has never seen before. We will then compare the true labels of these images to the ones predicted by the classifier. Intuitively, we’re hoping that a lot of the predictions match up with the true answers (which we call the ground truth).
Example Image Classification Dataset: CIFAR-10
- CIFAR-10 dataset
- 60,000 tiny images that are 32 pixels high and wide
- Each image is labeled with one of 10 classes (for example “airplane”, “automobile”, “bird”, etc.)
- These 60,000 images are partitioned into a training set of 50,000 images and a test set of 10,000 images.
- kaggle leaderboard
kNN
- Hyperparameters: $k$, distance function
- Higher $k$: variance $\downarrow$ (more resistant to outliers), bias $\uparrow$ $\Rightarrow$ bias-variance tradeoff
- L2 distance prefers many medium disagreements to one big one.
- L1 distance: depends on the choice of coordinate system
- if the entries of the feature vector have important meaning, L1 may be more natural (a small L1 vs. L2 example follows this list)
- kNN demo
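A tiny sketch (made-up pixel values, not from the notes) of how the two distances behave on a pair of flattened images:
import numpy as np
# hypothetical 4-pixel "images"
I1 = np.array([10.0, 20.0, 30.0, 40.0])
I2 = np.array([12.0, 18.0, 33.0, 25.0])
d_l1 = np.sum(np.abs(I1 - I2))          # L1: 2 + 2 + 3 + 15 = 22
d_l2 = np.sqrt(np.sum((I1 - I2) ** 2))  # L2: sqrt(4 + 4 + 9 + 225) ≈ 15.56
# the single large disagreement (15) dominates the L2 distance far more than the L1 distance
print(d_l1, d_l2)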
Load Data
Xtr_orig, Ytr, Xte_orig, Yte = load_CIFAR10(cifar10_dir) # a magic function we provide
# flatten out all images to be one-dimensional
Xtr = Xtr_orig.reshape(Xtr_orig.shape[0], -1) # Xtr becomes 50000 x 3072
Xte = Xte_orig.reshape(Xte_orig.shape[0], -1) # Xte becomes 10000 x 3072
Train and Evaluate a Classifier
- API
KNearestNeighbor()
knn = KNearestNeighbor() # create a kNN classifier instance
knn.train(Xtr, Ytr) # train the classifier on the training images and labels
Yte_predict = knn.predict(Xte) # predict labels on the test images
# accuracy = fraction of examples that were correctly predicted
accuracy = np.mean(Yte_predict == Yte)
print('accuracy: %f' % accuracy)
Implementation of kNN Classifier with L2 Distance
Vectorization of Distance Computation
$$ A = \begin{pmatrix} a_1\newline \vdots\newline a_N \end{pmatrix} \in \mathbb{R}^{N \times D},\ \ B = \begin{pmatrix} b_1\newline \vdots\newline b_M \end{pmatrix} \in \mathbb{R}^{M \times D} $$
$$ E_{ij} = \Vert b_i- a_j\Vert_2, \ \ E=(E_{ij})\in\mathbb{R}^{M\times N} $$
$$ E_{ij}^2=-2b_i a_j^T +\Vert b_i\Vert_2^2 +\Vert a_j\Vert_2^2 =-2 (BA^T)_{ij} +\Vert b_i\Vert_2^2 +\Vert a_j\Vert_2^2 $$
b = np.sum(B * B, axis = 1, keepdims = True)
a = np.sum(A * A, axis = 1, keepdims = True)
E = np.sqrt(-2 * np.dot(B, A.T) + b + a.T) # numpy broadcasting
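As a quick sanity check (not part of the original notes), the broadcasted formula can be compared against an explicit double loop on small random matrices:
import numpy as np
np.random.seed(0)
A = np.random.randn(5, 3)   # 5 "training" points in R^3
B = np.random.randn(4, 3)   # 4 "test" points in R^3
b = np.sum(B * B, axis = 1, keepdims = True)     # shape (4, 1)
a = np.sum(A * A, axis = 1, keepdims = True)     # shape (5, 1)
E_fast = np.sqrt(-2 * np.dot(B, A.T) + b + a.T)  # shape (4, 5), via broadcasting
E_slow = np.zeros((4, 5))
for i in range(4):
    for j in range(5):
        E_slow[i, j] = np.sqrt(np.sum((B[i] - A[j]) ** 2))
print(np.allclose(E_fast, E_slow))  # True (up to floating-point error)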
import numpy as np
from scipy.stats import mode

class KNearestNeighbor(object):
    """ a kNN classifier with L2 distance """

    def __init__(self):
        pass

    def train(self, X, y):
        """
        X: X.shape == (N, D), N examples, each of dim D
        y: y.shape == (N,)
           y[i] is the label of X[i]
        """
        # the nearest neighbor classifier simply remembers all the training data
        self.Xtrain = X
        self.ytrain = y

    def computeDistances(self, X):
        """
        Compute the distances between each test point in X
        and each training point in self.Xtrain
        Input:
            X: each row is an example we wish to predict a label for
               X.shape == (ntest, D)
        Output:
            dists: dists.shape == (ntest, ntrain)
                   dists[i, j] == L2 distance between X[i] and self.Xtrain[j]
        """
        te = np.sum(X * X, axis = 1, keepdims = True)                      # (ntest, 1)
        tr = np.sum(self.Xtrain * self.Xtrain, axis = 1, keepdims = True)  # (ntrain, 1)
        dists = np.sqrt(-2 * np.dot(X, self.Xtrain.T) + te + tr.T)         # (ntest, ntrain)
        return dists

    def predict(self, X, k = 1):
        """
        Predict labels for test data using this classifier.
        Input:
            X: each row is an example we wish to predict a label for
               X.shape == (ntest, D)
        Output:
            ypred: ypred.shape == (ntest,)
                   ypred[i] is the predicted label for X[i]
        """
        dists = self.computeDistances(X)
        # ith row: indices of the k nearest neighbors of X[i]
        k_idx = dists.argsort(axis = 1)[:, :k]
        # ith row: labels of the k nearest neighbors of X[i]
        closest_y = self.ytrain[k_idx]
        # ith row: most common label in closest_y[i]
        y_pred = mode(closest_y, axis = 1).mode
        return np.squeeze(y_pred)
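A quick sanity check (toy 2D data, not CIFAR-10) that the class behaves as expected on two well-separated clusters:
import numpy as np
np.random.seed(0)
X0 = np.random.randn(20, 2)                            # class 0 around the origin
X1 = np.random.randn(20, 2) + np.array([10.0, 10.0])   # class 1 far away
Xtrain = np.vstack([X0, X1])
ytrain = np.array([0] * 20 + [1] * 20)
knn = KNearestNeighbor()
knn.train(Xtrain, ytrain)
Xtest = np.array([[0.5, -0.5], [9.5, 10.5]])
print(knn.predict(Xtest, k = 3))   # expected: [0 1]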
Hyperparameter Tuning
- Validation Set (50%-90% for training, rest for validation)
- Cross Validation
- in practice, people tend to avoid cross-validation since it can be computationally expensive
- typical # of folds: 3, 5, 10
- If the # of hyperparameters is large you may prefer to use bigger validation splits
- If the # of examples in the validation set is small (perhaps only a few hundred or so), it is safer to use CV
Evaluate on the test set only a single time, at the very end.
Split your training set into a training set and a validation set. Use the validation set to tune all hyperparameters. At the end, run on the test set a single time and report performance.
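A minimal sketch of this protocol, assuming the Xtr/Ytr arrays and the KNearestNeighbor class from above (the 1000-example validation size is arbitrary):
import numpy as np
Xval, Yval = Xtr[:1000], Ytr[:1000]        # hold out 1000 training examples for validation
Xtrain, Ytrain = Xtr[1000:], Ytr[1000:]    # use the rest for training
results = []
for k in [1, 3, 5, 10, 20, 50, 100]:
    knn = KNearestNeighbor()
    knn.train(Xtrain, Ytrain)
    Yval_predict = knn.predict(Xval, k = k)
    acc = np.mean(Yval_predict == Yval)
    results.append((k, acc))
    print('k = %d, validation accuracy = %f' % (k, acc))
best_k = max(results, key = lambda r: r[1])[0]   # tune on validation only; touch the test set once at the very end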
Pros and Cons of kNN
With $N$ examples, how fast are training and prediction?
- Training: $O(1)$, only need to memorize the training data
- Prediction: $O(N)$, need to compare the test example to every sample in the training set
- in practice, we want classifiers that are fast at prediction; slow training is ok
Approximate Nearest Neighbor (ANN) algorithms (e.g. FLANN)
- accelerate the NN lookup
- trade off the correctness of the nearest neighbor retrieval with its space/time complexity during retrieval
- rely on a pre-processing/indexing stage that involves building a kd-tree, or running the k-means algorithm
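An illustrative sketch of the indexing idea using scipy's kd-tree (not FLANN; the data here is made up, and kd-trees only pay off in relatively low dimensions):
import numpy as np
from scipy.spatial import cKDTree
np.random.seed(0)
Xtrain = np.random.randn(10000, 32)   # hypothetical low-dimensional features
Xtest = np.random.randn(100, 32)
tree = cKDTree(Xtrain)                # pre-processing / indexing stage, done once
# eps > 0 allows a returned neighbor to be up to (1 + eps) times farther than the
# true nearest neighbor, trading exactness for faster lookups
dist, idx = tree.query(Xtest, k = 5, eps = 0.5)
print(dist.shape, idx.shape)          # (100, 5) (100, 5)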
Python Code
kNN in Practice (Never Used on Images)
- Normalize the features in your data (e.g. one pixel in images) to have zero mean and unit variance.
- If your data is very high-dimensional, consider using a dimensionality reduction technique such as PCA (wiki ref, CS229 ref, blog ref) or even Random Projections.
- Split your training data randomly into train/val splits.
- Train and evaluate the kNN classifier on the validation data (for all folds, if doing cross-validation) for many choices of $k$ and across different distance types (a sketch of the preprocessing and tuning steps follows this list).
- If your kNN classifier is running too long, consider using an Approximate Nearest Neighbor library (e.g. FLANN) to accelerate the retrieval (at the cost of some accuracy).
- Take note of the hyperparameters that gave the best results. There is a question of whether you should use the full training set with the best hyperparameters, since the optimal hyperparameters might change if you were to fold the validation data into your training set (since the size of the data would be larger). In practice it is cleaner to not use the validation data in the final classifier and consider it to be burned on estimating the hyperparameters. Evaluate the best model on the test set. Report the test set accuracy and declare the result to be the performance of the kNN classifier on your data.
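One possible preprocessing sketch for the first three items, assuming the Xtr/Ytr arrays from above and scikit-learn's PCA (an assumption; any PCA implementation would do):
import numpy as np
from sklearn.decomposition import PCA   # assumption: scikit-learn is available
# 1) normalize features to zero mean and unit variance (statistics from training data only)
mean, std = Xtr.mean(axis = 0), Xtr.std(axis = 0) + 1e-8
Xtr_norm = (Xtr - mean) / std
# 2) reduce dimensionality, e.g. to 100 components
pca = PCA(n_components = 100)
Xtr_pca = pca.fit_transform(Xtr_norm)
# 3) random train/val split (20% validation here, which is arbitrary)
idx = np.random.permutation(Xtr_pca.shape[0])
n_val = len(idx) // 5
Xval, Yval = Xtr_pca[idx[:n_val]], Ytr[idx[:n_val]]
Xtrain, Ytrain = Xtr_pca[idx[n_val:]], Ytr[idx[n_val:]]
# then sweep k and the distance type on (Xval, Yval) as in the earlier sketch, and apply
# the same mean/std/PCA transform to the test set before the single final evaluation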