Neural Networks and Deep Learning

2018/02/27

Categories: neuralNetworks deepLearning Tags: logisticRegression imageClassification

Introduction

What is a Neural Network?

| Input | Output | Application | Model |
|---|---|---|---|
| Home features | Price | Real estate | NN |
| Ad, user info | Click on ad? (0/1) | Online advertising | NN |
| Image | Object | Photo tagging | CNN |
| Audio | Text transcript | Speech recognition | RNN |
| English | Chinese | Machine translation | RNN |
| Image, radar info | Position of other cars | Autonomous driving | Hybrid |


Why is Deep Learning taking off?

Scale: being able to train a big enough neural network on a large enough amount of labeled data.

[Figure: scale]


Logistic Regression as a Neural Network

$$ X = \begin{pmatrix} x^{(1)} &x^{(2)} &\cdots &x^{(m)} \end{pmatrix} \in \mathbb{R}^{n_x \times m} $$

$$ y= \begin{pmatrix} y^{(1)} &y^{(2)} &\cdots &y^{(m)} \end{pmatrix} \in \mathbb{R}^{1 \times m} $$

Given $x\in\mathbb{R}^{n_x}$, want $\hat{y}=\Pr(y=1|x)$

Output: $\hat{y}=\sigma(\omega^Tx+b)$, where sigmoid function $\sigma(z)=\frac{1}{1+e^{-z}}$

Given $\{x^{(i)}, y^{(i)}\}_{i=1}^m$, want $\hat{y}^{(i)}=\sigma(\omega^Tx^{(i)}+b)\approx y^{(i)}$

loss(error) function: $\mathcal{L}(\hat{y},y)=-\left[y\log \hat{y}+(1-y)\log(1-\hat{y})\right]$

cost function: $J(\omega,b)=\frac{1}{m}\sum\limits_{i=1}^m \mathcal{L}(\hat{y}^{(i)}, y^{(i)})=-\frac{1}{m}\sum\limits_{i=1}^m\left[y^{(i)}\log \hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})\right]$

repeat{

$\omega\leftarrow \omega - \alpha\frac{\partial}{\partial\omega} J(\omega,b)$

$b\leftarrow b - \alpha\frac{\partial}{\partial b} J(\omega,b)$

}

One training example:

$z= \omega^Tx+b$

$\hat{y} = a = \sigma(z)$

$\mathcal{L}(a,y)=-(y\log a+(1-y)\log(1-a))$


da := $\frac{d\mathcal{L}}{da}=-\frac{y}{a}+\frac{1-y}{1-a}$

dz := $\frac{d\mathcal{L}}{dz}=\frac{d\mathcal{L}}{da}\frac{da}{dz}=\left[-\frac{y}{a}+\frac{1-y}{1-a}\right]a(1-a)=a-y$

dwi := $\frac{\partial \mathcal{L}}{\partial \omega_i}=x_i\frac{d\mathcal{L}}{dz}$ = xi*dz

db := $\frac{\partial \mathcal{L}}{\partial b}=\frac{d\mathcal{L}}{dz}$ = dz

$\omega_i\leftarrow \omega_i - \alpha$ dwi

$b\leftarrow b - \alpha$ db
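
For concreteness, a minimal numpy sketch of this single-example gradient step (the function name, and the assumption that w and x are column vectors of shape (n_x, 1), are mine, not from the notes):

import numpy as np

def single_example_step(w, b, x, y, alpha):
    # forward pass for one example
    z = np.dot(w.T, x).item() + b        # z = w^T x + b (scalar)
    a = 1 / (1 + np.exp(-z))             # a = sigma(z)
    # backward pass, following the derivatives above
    dz = a - y                           # dL/dz = a - y
    dw = x * dz                          # dL/dw_i = x_i * dz
    db = dz                              # dL/db = dz
    # gradient-descent update
    w = w - alpha * dw
    b = b - alpha * db
    return w, b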

$m$ training examples:

$z^{(i)}= \omega^Tx^{(i)}+b$

$\hat{y}^{(i)} = a^{(i)} = \sigma(z^{(i)})$

$\mathcal{L}(a^{(i)},y^{(i)})=-(y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)}))$

$J(\omega,b)=\frac{1}{m}\sum_{i=1}^m \mathcal{L}(a^{(i)},y^{(i)})$


$\frac{\partial J}{\partial\omega_j} = \frac{1}{m}\sum_{i=1}^m\frac{\partial}{\partial\omega_j}\mathcal{L}(a^{(i)},y^{(i)})$

Computing this naively needs 2 for loops: one over the $m$ training examples, one over the $n$ features

$\Rightarrow$ vectorization, more efficient!

[Figure: scale]


Vectorization

import numpy as np

def sigmoid(u):
    return 1 / (1 + np.exp(-u))

# X: data, n*m, n = number of features, m = number of examples
# Y: labels, 1*m
# w: weights, n*1
# b: bias, scalar
# learning_rate: step size (alpha)

# forward propagation: compute activations
A = sigmoid(np.dot(w.T, X) + b)

# cost
J = -np.mean(np.log(A) * Y + np.log(1 - A) * (1 - Y))

# back propagation (compute gradients); the 1/m factor is folded into dZ
m = X.shape[1]
dZ = (A - Y) / m
db = np.sum(dZ)
dw = np.dot(X, dZ.T)

# update parameters
w -= learning_rate * dw
b -= learning_rate * db
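
Wrapped in a training loop (reusing the sigmoid defined above; the function name and default hyperparameters are illustrative assumptions), the whole procedure might look like:

def train_logistic_regression(X, Y, num_iterations=1000, learning_rate=0.01):
    n, m = X.shape
    w, b = np.zeros((n, 1)), 0.0
    for _ in range(num_iterations):
        A = sigmoid(np.dot(w.T, X) + b)      # forward propagation
        dZ = (A - Y) / m                     # 1/m folded into dZ, as above
        dw = np.dot(X, dZ.T)                 # shape (n, 1)
        db = np.sum(dZ)
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b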

Explanation of Logistic Regression Cost Function

$$ y|x \sim Binom(1, \hat{y}), \hat{y} =\sigma(\omega^Tx+b) $$

$$ \Rightarrow \Pr(y|x)=\hat{y}^y(1-\hat{y})^{1-y}=\begin{cases} 1-\hat{y}, &y=0\newline \hat{y}, &y=1 \end{cases} $$

$$ \Rightarrow \log\Pr(y|x)=y\log\hat{y}+(1-y)\log(1-\hat{y})=-\mathcal{L}(\hat{y},y) $$

Goal: maximize $\Pr(y|x)$ $\Leftrightarrow$ minimize $\mathcal{L}(\hat{y},y)$


Cost on $m$ examples:

maximize $\Pr(\text{labels in training set}) = \prod_{i=1}^m\Pr(y^{(i)}|x^{(i)})$ (assuming i.i.d. training examples)

$\Leftrightarrow$ maximize $\log \prod_{i=1}^m\Pr(y^{(i)}|x^{(i)})=\sum_{i=1}^m\log\Pr(y^{(i)}|x^{(i)})=-\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})$

$\Leftrightarrow$ minimize $J(\omega,b)=\frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})$


Image Classification (cat/non-cat) - Python Code

Jupyter Notebook


Shallow Neural Network

Forward Propagation

| Parameter | Dimension |
|---|---|
| $W^{[1]}$ | $(n^{[1]}, n^{[0]})$ |
| $b^{[1]}$ | $(n^{[1]}, 1)$ |
| $W^{[2]}$ | $(n^{[2]}, n^{[1]})$ |
| $b^{[2]}$ | $(n^{[2]}, 1)$ |

$$ A^{[0]} = X, n^{[2]}=1 $$

$$ \text{Forward Propagation }\begin{cases} Z^{[k]} &= W^{[k]}A^{[k-1]} + b^{[k]}, (\text{linear combination}) \newline A^{[k]}&=g^{[k]}(Z^{[k]}), (activation) \end{cases} \ \ k=1,2,\ldots $$

[Figure: scale]

Activation Function

| Activation | $g(z)$ | $g'(z)$ |
|---|---|---|
| sigmoid | $\frac{1}{1+e^{-z}}$ | $g(z)(1-g(z))$ |
| tanh | $\frac{e^z-e^{-z}}{e^z+e^{-z}}$ | $1-(g(z))^2$ |
| ReLU | $\max(0, z)$ | $\begin{cases}0, &z<0\newline 1, &z\geq 0\end{cases}$ |
| Leaky ReLU | $\max(0.01z, z)$ | $\begin{cases}0.01, &z<0\newline 1, &z\geq 0\end{cases}$ |
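
The same functions and their derivatives in numpy (a small sketch; the ReLU derivative at $z=0$ is set to 1 by the convention in the table):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    g = sigmoid(z)
    return g * (1 - g)

def tanh_grad(z):
    return 1 - np.tanh(z) ** 2            # tanh itself is np.tanh

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    return (z >= 0).astype(float)

def leaky_relu(z):
    return np.maximum(0.01 * z, z)

def leaky_relu_grad(z):
    return np.where(z < 0, 0.01, 1.0)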

Gradient Descent

| Single training example | $m$ training examples |
|---|---|
| $dz^{[2]}=a^{[2]}-y$ | $dZ^{[2]}=\frac{1}{m}(A^{[2]}-Y)$ |
| $dW^{[2]}=dz^{[2]}a^{[1]T}$ | $dW^{[2]}=dZ^{[2]} (A^{[1]})^T$ |
| $db^{[2]}=dz^{[2]}$ | $db^{[2]}=np.sum(dZ^{[2]}, axis=1, keepdims=True)$ |
| $dz^{[1]}=(W^{[2]})^T dz^{[2]} \circ g^{[1]'}(z^{[1]})$ | $dZ^{[1]}=(W^{[2]})^T dZ^{[2]} \circ g^{[1]'}(Z^{[1]})$ |
| $dW^{[1]}=dz^{[1]}x^T$ | $dW^{[1]}=dZ^{[1]} X^T$ |
| $db^{[1]}=dz^{[1]}$ | $db^{[1]}=np.sum(dZ^{[1]}, axis=1, keepdims=True)$ |

Summary

| Forward Propagation | Backward Propagation |
|---|---|
| $Z^{[1]}=W^{[1]}X + b^{[1]}$ | $dZ^{[2]}=\frac{1}{m}(A^{[2]}-Y)$ |
| $A^{[1]}=g^{[1]}(Z^{[1]})$ | $dW^{[2]}= dZ^{[2]} (A^{[1]})^T$ |
| $Z^{[2]}=W^{[2]}A^{[1]}+b^{[2]}$ | $db^{[2]}=np.sum(dZ^{[2]}, axis=1, keepdims=True)$ |
| $A^{[2]}=g^{[2]}(Z^{[2]})=\sigma(Z^{[2]})$ | $dZ^{[1]}=(W^{[2]})^T dZ^{[2]} \circ g^{[1]'}(Z^{[1]})$ (elementwise product) |
|  | $dW^{[1]}= dZ^{[1]} X^T$ |
|  | $db^{[1]}=np.sum(dZ^{[1]}, axis=1, keepdims=True)$ |
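
One full gradient-descent step for this two-layer network as a numpy sketch (assuming $g^{[1]}=$ tanh and $g^{[2]}=$ sigmoid; the function name and argument layout are mine):

import numpy as np

def two_layer_step(X, Y, W1, b1, W2, b2, alpha):
    m = X.shape[1]
    # forward propagation
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)                          # assuming g^[1] = tanh
    Z2 = np.dot(W2, A1) + b2
    A2 = 1 / (1 + np.exp(-Z2))                # g^[2] = sigmoid
    # backward propagation (1/m folded into dZ2, as in the table)
    dZ2 = (A2 - Y) / m
    dW2 = np.dot(dZ2, A1.T)
    db2 = np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)   # g^[1]'(Z1) = 1 - tanh(Z1)^2
    dW1 = np.dot(dZ1, X.T)
    db1 = np.sum(dZ1, axis=1, keepdims=True)
    # gradient descent update
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2
    return W1, b1, W2, b2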

Random Initialization


Planar Data Classification - Python Code

Jupyter Notebook


Deep Neural Network

Notations

[Figure: shallow and deep]


Forward Propagation

Layer $\ell$: $z^{[\ell]} = W^{[\ell]} a^{[\ell-1]} + b^{[\ell]},\ a^{[\ell]} = g^{[\ell]}(z^{[\ell]})$

(Vectorized) layer $\ell$: $Z^{[\ell]} = W^{[\ell]} A^{[\ell-1]} + b^{[\ell]},\ A^{[\ell]} = g^{[\ell]}(Z^{[\ell]})$, with $A^{[0]} = X$
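
A sketch of the vectorized forward pass through all $L$ layers (the parameters dictionary layout and the choice of ReLU hidden units with a sigmoid output are assumptions for illustration):

import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def L_model_forward(X, parameters, L):
    # forward propagation through layers 1..L
    caches = []
    A = X                                           # A^[0] = X
    for l in range(1, L + 1):
        W, b = parameters["W" + str(l)], parameters["b" + str(l)]
        Z = np.dot(W, A) + b                        # Z^[l] = W^[l] A^[l-1] + b^[l]
        A = sigmoid(Z) if l == L else relu(Z)       # A^[l] = g^[l](Z^[l])
        caches.append((Z, A))                       # keep values for backprop
    return A, caches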


Check Dimensions

$W^{[\ell]}, dW^{[\ell]}: (n^{[\ell]}, n^{[\ell-1]})$; $b^{[\ell]}, db^{[\ell]}: (n^{[\ell]}, 1)$; $Z^{[\ell]}, A^{[\ell]}, dZ^{[\ell]}, dA^{[\ell]}: (n^{[\ell]}, m)$ (for a single example, $z^{[\ell]}, a^{[\ell]}: (n^{[\ell]}, 1)$)


Forward and Backward Functions

layer $\ell$

$$ dz^{[\ell]} = da^{[\ell]} * (g^{[\ell]})'(z^{[\ell]}) \text{ (elementwise product)}\\
dW^{[\ell]} = dz^{[\ell]}a^{[\ell-1]T}\\
db^{[\ell]}=dz^{[\ell]}\\
da^{[\ell-1]}=W^{[\ell]T}dz^{[\ell]} \Rightarrow dz^{[\ell]} =W^{[\ell+1]T}dz^{[\ell+1]} * (g^{[\ell]})'(z^{[\ell]}) $$

$$ \text{Vectorized: } dZ^{[\ell]} = dA^{[\ell]} * (g^{[\ell]})'(Z^{[\ell]}) \text{ (elementwise product)}\\
dW^{[\ell]} = dZ^{[\ell]}A^{[\ell-1]T}\\
db^{[\ell]}= np.sum(dZ^{[\ell]}, axis=1, keepdims=True)\\
dA^{[\ell-1]}=W^{[\ell]T}dZ^{[\ell]} $$
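
The corresponding backward function for one layer, as a numpy sketch (the signature, and passing the activation derivative as a function, are my own choices; as elsewhere in these notes, the $\frac{1}{m}$ factor is assumed to be folded into $dZ^{[L]}$ at the output layer):

import numpy as np

def linear_activation_backward(dA, Z, A_prev, W, activation_grad):
    dZ = dA * activation_grad(Z)                  # elementwise product with g^[l]'(Z^[l])
    dW = np.dot(dZ, A_prev.T)                     # dW^[l] = dZ^[l] A^[l-1]T
    db = np.sum(dZ, axis=1, keepdims=True)        # db^[l]
    dA_prev = np.dot(W.T, dZ)                     # dA^[l-1] = W^[l]T dZ^[l]
    return dA_prev, dW, db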

[Figure: input and output]


Parameters and Hyperparameters


Build Deep Neural Network - Image Classification (cat/non-cat) - Python Code

Jupyter Notebook


Practical Aspects of Deep Learning


Regularization


L2 Regularization

$$ J(W^{[1]},b^{[1]},\ldots,W^{[L]},b^{[L]})=\frac{1}{m}\sum_{i=1}^m\mathcal{L}(\hat{y}^{(i)}, y^{(i)})+\frac{\lambda}{2m}\sum_{\ell=1}^L\Vert W^{[\ell]}\Vert_F^2 $$

where $\Vert\cdot\Vert_F$ is the Frobenius norm.
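
In code, the regularized cost adds the Frobenius-norm term, and each $dW^{[\ell]}$ in backprop gets an extra $\frac{\lambda}{m}W^{[\ell]}$ ("weight decay") term. A sketch, assuming the parameters live in a dictionary keyed "W1", "W2", ...:

import numpy as np

def compute_cost_with_l2(A_L, Y, parameters, L, lambd):
    m = Y.shape[1]
    cross_entropy = -np.mean(Y * np.log(A_L) + (1 - Y) * np.log(1 - A_L))
    l2 = (lambd / (2 * m)) * sum(np.sum(np.square(parameters["W" + str(l)]))
                                 for l in range(1, L + 1))
    return cross_entropy + l2

# in backprop, each dW^[l] gains a weight-decay term:
# dW_l = dW_l_from_backprop + (lambd / m) * W_l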

[Figure: tanh]


Dropout Regularization

Illustrated with layer $\ell=3$:

keepProb = 0.8  # probability that a unit is kept
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keepProb   # dropout mask for layer 3
a3 = a3 * d3          # zero out the dropped units (elementwise product)
a3 /= keepProb        # "inverted dropout": keep the expected value of a3 unchanged

Other Regularization Methods

Orthogonalization:

  1. optimize cost function: gradient descent, …
  2. not overfit: regularization, …

[Figure: early stopping]


Speed Up Training

[Figure: normalization]

$z=w_1x_1 +w_2x_2 +\cdots + w_nx_n$ (omit bias)

set $\mathrm{Var}(w_i)=\frac{1}{n}$ ($n =$ number of input nodes)

# default: initialize so that Var(w_i) = 1/node_in (Xavier-style)
W = np.random.randn(node_in, node_out) / np.sqrt(node_in)

# ReLU activation (He initialization):
# W = np.random.randn(node_in, node_out) / np.sqrt(node_in/2)

# tanh activation (Xavier initialization), either of:
# W = np.random.randn(node_in, node_out) / np.sqrt(node_in)
# W = np.random.randn(node_in, node_out) / np.sqrt((node_in + node_out)/2)

Debugging of Backpropagation: gradient checking

for each $i$: $$ d\theta_{approx}[i]:=\frac{J(\theta_1,\theta_2,\ldots,\theta_i+\epsilon,\ldots)-J(\theta_1,\theta_2,\ldots,\theta_i-\epsilon,\ldots)}{2\epsilon}\approx \frac{\partial J}{\partial \theta_i} $$

check $$ \frac{\Vert d\theta_{approx}-d\theta\Vert_2}{\Vert d\theta_{approx}\Vert_2+\Vert d\theta\Vert_2} \approx 0, e.g. < 1e-7 $$
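
A sketch of this check, assuming the parameters have been flattened into a single vector theta and J is a function of that vector (both assumptions of this sketch):

import numpy as np

def gradient_check(J, theta, dtheta, epsilon=1e-7):
    # compare the analytic gradient dtheta with a two-sided numerical estimate
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        dtheta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    diff = (np.linalg.norm(dtheta_approx - dtheta)
            / (np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)))
    return diff   # should be roughly < 1e-7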


Python Code

Initialization

Regularization

Gradient Checking


Optimization Algorithms

Mini-Batch Gradient Descent

[Figure: mini batch]

[Figure: mini batch cost]

[Figure: stochastic gradient descent]
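
The notes only include figures here; as a reference, a minimal sketch of splitting the training set into shuffled mini-batches (with batch_size = 1 this reduces to stochastic gradient descent, with batch_size = m to batch gradient descent):

import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    np.random.seed(seed)
    m = X.shape[1]
    permutation = np.random.permutation(m)     # shuffle the examples
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]
    mini_batches = []
    for k in range(0, m, batch_size):
        mini_batches.append((X_shuffled[:, k:k + batch_size],
                             Y_shuffled[:, k:k + batch_size]))
    return mini_batches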


Momentum

Exponentially Weighted Averages

$$ v_0=0, v_t = \beta v_{t-1}+(1-\beta)\theta_t $$

$$ \Rightarrow v_t = \beta^tv_0 +(1-\beta)\left[ \theta_t + \beta\theta_{t-1} +\cdots+ \beta^{t-1}\theta_1 \right]=(1-\beta)\sum_{i=0}^{t-1}\beta^i\theta_{t-i} $$

Since $$ \beta^{\frac{1}{1-\beta}}=\left(1-(1-\beta)\right)^{\frac{1}{1-\beta}}\approx \frac{1}{e}, $$ $v_t\approx$ average of the last $\frac{1}{1-\beta}$ terms of $\theta$’s

$v_{\theta}:=0$

repeat{

get next $\theta_t$

$v_{\theta}:=\beta v_{\theta} + (1-\beta)\theta_t$

}
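
The same recursion in code (a small sketch; with $\beta=0.9$ this is roughly an average over the last 10 readings). The optional bias correction is the one described in the next subsection:

def exponentially_weighted_average(thetas, beta=0.9, bias_correction=False):
    # running average v_t = beta*v_{t-1} + (1-beta)*theta_t over a sequence of readings
    v = 0.0
    averages = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        v_out = v / (1 - beta ** t) if bias_correction else v   # optional bias correction
        averages.append(v_out)
    return averages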


Bias Correction in Exponentially Weighted Averages

Take $v_t := v_t/(1-\beta^t)$

[Figure: bias correction]


Gradient Descent With Momentum

on iteration $t$ (the update is sketched in code below)

[Figure: momentum]

  1. gradient descent
  2. with momentum (small $\beta$)
  3. with momentum (large $\beta$)
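
The update on each iteration appeared only as an image in the original post; a sketch of the standard momentum step (variable names and defaults are mine; $\beta \approx 0.9$ is a common choice):

def momentum_update(W, b, dW, db, v_dW, v_db, beta=0.9, learning_rate=0.01):
    # v_dW, v_db are exponentially weighted averages of the gradients
    v_dW = beta * v_dW + (1 - beta) * dW
    v_db = beta * v_db + (1 - beta) * db
    W = W - learning_rate * v_dW
    b = b - learning_rate * v_db
    return W, b, v_dW, v_db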

Root Mean Square Prop (RMS Prop)

on iteration $t$
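
Again, the formulas were shown in a figure; a sketch of the standard RMSProp update, where $s$ is an exponentially weighted average of the squared gradients and $\epsilon$ avoids division by zero:

import numpy as np

def rmsprop_update(W, b, dW, db, s_dW, s_db, beta2=0.999, learning_rate=0.01, epsilon=1e-8):
    s_dW = beta2 * s_dW + (1 - beta2) * np.square(dW)
    s_db = beta2 * s_db + (1 - beta2) * np.square(db)
    W = W - learning_rate * dW / (np.sqrt(s_dW) + epsilon)
    b = b - learning_rate * db / (np.sqrt(s_db) + epsilon)
    return W, b, s_dW, s_db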


Adam Optimization Algorithm

$v_{dw}=0, s_{dw}=0,v_{db}=0,s_{db}=0$

on iteration $t$
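
A sketch of the standard Adam step, combining the momentum and RMSProp averages with bias correction (default hyperparameters follow common practice: $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$):

import numpy as np

def adam_update(W, b, dW, db, v_dW, v_db, s_dW, s_db, t,
                learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # one Adam step on iteration t (t starts at 1)
    v_dW = beta1 * v_dW + (1 - beta1) * dW
    v_db = beta1 * v_db + (1 - beta1) * db
    s_dW = beta2 * s_dW + (1 - beta2) * np.square(dW)
    s_db = beta2 * s_db + (1 - beta2) * np.square(db)
    v_dW_corr = v_dW / (1 - beta1 ** t)          # bias correction
    v_db_corr = v_db / (1 - beta1 ** t)
    s_dW_corr = s_dW / (1 - beta2 ** t)
    s_db_corr = s_db / (1 - beta2 ** t)
    W = W - learning_rate * v_dW_corr / (np.sqrt(s_dW_corr) + epsilon)
    b = b - learning_rate * v_db_corr / (np.sqrt(s_db_corr) + epsilon)
    return W, b, v_dW, v_db, s_dW, s_db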


Learning Rate Decay

$$ \alpha =\frac{1}{1+\text{decay-rate}\times \text{epoch-num}}\alpha_0 $$
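
As a quick worked example of the formula:

def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    return alpha0 / (1 + decay_rate * epoch_num)

# e.g. alpha0 = 0.2, decay_rate = 1.0:
# epoch 1 -> 0.1, epoch 2 -> 0.0667, epoch 3 -> 0.05, epoch 4 -> 0.04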


Local Optima in Neural Networks

[Figure: saddle]

[Figure: plateau]


Python Code - Optimization Algorithms

jupyter notebook


Tuning Process

Hyperparameters



Hyperparameters Tuning in Practice

[Figure: panda]

Batch Normalization

[Figure: batch norm]

for $t=1,2,\ldots, $ numMiniBatch
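
The per-layer transform inside this loop was shown only in figures; a sketch of the batch-norm forward step applied to $Z^{[\ell]}$ on one mini-batch, with learnable scale $\gamma$ and shift $\beta$ (at test time, $\mu$ and $\sigma^2$ are typically replaced by exponentially weighted averages kept during training):

import numpy as np

def batch_norm_forward(Z, gamma, beta, epsilon=1e-8):
    # Z^[l] has shape (n^[l], m); normalize each unit over the current mini-batch
    mu = np.mean(Z, axis=1, keepdims=True)          # per-unit mean
    var = np.var(Z, axis=1, keepdims=True)          # per-unit variance
    Z_norm = (Z - mu) / np.sqrt(var + epsilon)      # mean 0, variance 1
    Z_tilde = gamma * Z_norm + beta                 # learnable scale and shift
    return Z_tilde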



Multiclass Classification

Softmax Regression

$$ \text{activation: } t = e^{z^{[L]}} \text{ (elementwise)},\ a^{[L]}=\frac{t}{\sum_{j=1}^C t_j} $$

$$ \mathcal{L}(y,\hat{y})=-\sum_{j=1}^Cy_j\log(\hat{y}_j) $$

$$ J=\frac{1}{m}\sum_{i=1}^m\mathcal{L}(y^{(i)},\hat{y}^{(i)}) $$

$$ dz^{[L]}:=\frac{\partial \mathcal{L}}{\partial z^{[L]}}=\hat{y}-y $$
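
A numpy sketch of the softmax activation and cost (subtracting the max is a standard numerical-stability trick, not part of the formula above):

import numpy as np

def softmax(Z_L):
    # Z^[L] has shape (C, m)
    t = np.exp(Z_L - np.max(Z_L, axis=0, keepdims=True))
    return t / np.sum(t, axis=0, keepdims=True)

def softmax_cost(Y, Y_hat):
    # cross-entropy cost for one-hot labels Y and predictions Y_hat, both (C, m)
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat)) / m

# per the identity above: for each example, dz^[L] = y_hat - y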


Deep Learning Frameworks

Choosing Deep Learning Frameworks


Signs Recognition with Tensorflow

jupyter notebook


ML Strategy

Single Number Evaluation Metric


Satisfying and Optimizing Metric


Train/Dev/Test Distributions


Size of the Dev and Test Sets


When to Change Dev/Test Sets and Metrics

[Figure: cat]


Comparing to Human-Level Performance

[Figure: bayes]

[Figure: human]



Error Analysis

[Figure: error analysis]


Cleaning Up Incorrectly Labeled Data

[Figure: error analysis]


Correcting incorrect dev/test set examples


Mismatched Training and Dev/Test Set

[Figure: option]

[Figure: mismatch]

  1. training error - human-level error = avoidable bias
  2. training-dev error - training error = variance
  3. dev error - training-dev error = data mismatch
  4. test error - dev error = degree of overfitting to the dev set (if large, use a bigger dev set)

[Figure: mismatch]


Addressing Data Mismatch


Transfer Learning

[Figure: transfer learning]

When transfer learning makes sense (Task A $\rightarrow$ Task B)


Multi-task Learning

[Figure: multi task]

When multi-task learning makes sense


End-to-End Deep Learning

pros of end-to-end deep learning

cons

[Figure: end to end]