Introduction
What is a Neural Network?
- ReLU = Rectified Linear Unit
Input | Output | Application | Model |
---|---|---|---|
Home Features | Price | Real Estate | NN |
Ad, User info | Click on ad? 0/1 | Online Advertising | NN |
Image | Object | Photo Tagging | CNN |
Audio | Text transcript | Speech Recognition | RNN |
English | Chinese | Machine Translation | RNN |
Image, Radar info | Position of other cars | Autonomous Driving | Hybrid |
- Image - convolutional neural network, CNN
- sequence data (temporal data, time series) - recurrent neural network, RNN
- Structured data: database
- Unstructured data: audio, image, text
Why is Deep Learning taking off?
Being able to train a big enough neural network:
- Data: huge amount of labeled data
- Computation
- Algorithms:
  - innovations that make NNs run faster
  - e.g. switching from sigmoid to ReLU makes the gradient descent algorithm run much faster
Logistic Regression as a Neural Network
- binary classification problem: $(x,y)$, $x\in \mathbb{R}^{n_x}$, $y\in\{0,1\}$
- $m$ training examples: $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$
$$ X = \begin{pmatrix} x^{(1)} &x^{(2)} &\cdots &x^{(m)} \end{pmatrix} \in \mathbb{R}^{n_x \times m} $$
$$ y= \begin{pmatrix} y^{(1)} &y^{(2)} &\cdots &y^{(m)} \end{pmatrix} \in \mathbb{R}^{1 \times m} $$
- logistic regression:
  Given $x\in\mathbb{R}^{n_x}$, want $\hat{y}=\Pr(y=1|x)$
  Output: $\hat{y}=\sigma(\omega^Tx+b)$, where the sigmoid function is $\sigma(z)=\frac{1}{1+e^{-z}}$
  - if $z$ is large, $\sigma(z)\approx 1$
  - if $z$ is a large negative number, $\sigma(z)\approx 0$
  - $\sigma(0)=0.5$
- cost function:
  Given $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$, want $\hat{y}^{(i)}=\sigma(\omega^Tx^{(i)}+b)\approx y^{(i)}$
  loss (error) function:
$\mathcal{L}(\hat{y},y)=-\left[y\log \hat{y}+(1-y)\log(1-\hat{y})\right]$
cost function:
$J(\omega,b)=\frac{1}{m}\sum\limits_{i=1}^m \mathcal{L}(\hat{y}^{(i)}, y^{(i)})=-\frac{1}{m}\sum\limits_{i=1}^m\left[y^{(i)}\log \hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})\right]$
- gradient descent:
  $J(\omega,b)$ is convex!
  - algorithm:
    repeat {
      $\omega\leftarrow \omega - \alpha\frac{\partial}{\partial\omega} J(\omega,b)$
      $b\leftarrow b - \alpha\frac{\partial}{\partial b} J(\omega,b)$
    }
- computation graph:
- forward propagation: compute current loss
- back propagation: compute current gradient
- logistic regression gradient descent: update parameters
One training example:
$z= \omega^Tx+b$
$\hat{y} = a = \sigma(z)$
$\mathcal{L}(a,y)=-(y\log a+(1-y)\log(1-a))$
`da` $:=\frac{d\mathcal{L}}{da}=-\frac{y}{a}+\frac{1-y}{1-a}$
`dz` $:=\frac{d\mathcal{L}}{dz}=\frac{d\mathcal{L}}{da}\frac{da}{dz}=\left[-\frac{y}{a}+\frac{1-y}{1-a}\right]a(1-a)=a-y$
`dw_i` $:=\frac{\partial \mathcal{L}}{\partial \omega_i}=x_i\frac{d\mathcal{L}}{dz}=$ `x_i * dz`
`db` $:=\frac{\partial \mathcal{L}}{\partial b}=\frac{d\mathcal{L}}{dz}=$ `dz`
$\omega_i\leftarrow \omega_i - \alpha\cdot$ `dw_i`
$b\leftarrow b - \alpha\cdot$ `db`
$m$ training examples:
$z^{(i)}= \omega^Tx^{(i)}+b$
$\hat{y}^{(i)} = a^{(i)} = \sigma(z^{(i)})$
$\mathcal{L}(a^{(i)},y^{(i)})=-(y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)}))$
$J(\omega,b)=\frac{1}{m}\sum_{i=1}^m \mathcal{L}(a^{(i)},y^{(i)})$
$\frac{\partial J}{\partial\omega_i} = \frac{1}{m}\sum_{i=1}^m\frac{\partial}{\partial\omega_i}\mathcal{L}(a^{(i)},y^{(i)})$
2 `for` loops: one over all $m$ examples, one over all $n$ features $\Rightarrow$ vectorization, more efficient!
Vectorization
- Whenever possible, avoid explicit for-loops.
import numpy as np

def sigmoid(u):
    return 1/(1 + np.exp(-u))
# X: data, n*m, n = NO. of features, m = NO. of examples
# Y: labels, 1*m
# w: weights, n*1
# b: bias, scalar
# compute activation
A = sigmoid(np.dot(w.T, X) + b)
# cost
J = - np.mean(np.log(A) * Y + np.log(1-A) *(1-Y))
# back propagation (compute gradient)
m = X.shape[1]
dZ = (A - Y)/m
db = np.sum(dZ)
dw = np.dot(X, dZ.T)
# update params
w -= learning_rate*dw
b -= learning_rate*db
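To make the vectorized update above concrete, here is a minimal sketch of a full training loop (the function name, defaults, and zero initialization are illustrative assumptions, not from the course):

```python
import numpy as np

def sigmoid(u):
    return 1 / (1 + np.exp(-u))

def train_logistic_regression(X, Y, learning_rate=0.01, num_iterations=1000):
    """Vectorized logistic regression. X: (n, m) data, Y: (1, m) labels in {0, 1}."""
    n, m = X.shape
    w, b = np.zeros((n, 1)), 0.0          # zero init is fine for logistic regression
    for _ in range(num_iterations):
        A = sigmoid(np.dot(w.T, X) + b)   # forward pass, shape (1, m)
        dZ = (A - Y) / m                  # dJ/dZ
        dw = np.dot(X, dZ.T)              # dJ/dw, shape (n, 1)
        db = np.sum(dZ)                   # dJ/db, scalar
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b
```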
Explanation of Logistic Regression Cost Function
$$ y|x \sim Binom(1, \hat{y}), \hat{y} =\sigma(\omega^Tx+b) $$
$$ \Rightarrow \Pr(y|x)=\hat{y}^y(1-\hat{y})^{1-y}=\begin{cases} 1-\hat{y}, &y=0\newline \hat{y}, &y=1 \end{cases} $$
$$ \Rightarrow \log\Pr(y|x)=y\log\hat{y}+(1-y)\log(1-\hat{y})=-\mathcal{L}(\hat{y},y) $$
Goal: maximize $\Pr(y|x)$ $\Leftrightarrow$ minimize $\mathcal{L}(\hat{y},y)$
Cost on $m$ examples:
maximize $\Pr(\text{labels in training set}) = \prod_{i=1}^m\Pr(y^{(i)}|x^{(i)})$
$\Leftrightarrow$ maximize $\log \prod_{i=1}^m\Pr(y^{(i)}|x^{(i)})=\sum_{i=1}^m\log\Pr(y^{(i)}|x^{(i)})=-\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})$
$\Leftrightarrow$ minimize $J(\omega,b)=\frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})$
Image Classification (cat/non-cat) - Python Code
Shallow Neural Network
Forward Propagation
- Neural Network representation: the $k$-th layer has activation function $g^{[k]}$
Parameters | Dimension
---|---
$W^{[1]}$ | $(n^{[1]}, n^{[0]})$
$b^{[1]}$ | $(n^{[1]}, 1)$
$W^{[2]}$ | $(n^{[2]}, n^{[1]})$
$b^{[2]}$ | $(n^{[2]}, 1)$
$$ A^{[0]} = X, n^{[2]}=1 $$
$$ \text{Forward Propagation }\begin{cases} Z^{[k]} &= W^{[k]}A^{[k-1]} + b^{[k]}, (\text{linear combination}) \newline A^{[k]}&=g^{[k]}(Z^{[k]}), (\text{activation}) \end{cases} \ \ k=1,2,\ldots $$
Activation Function
- activation function can be sigmoid, tanh, ReLU, Leaky ReLU…
- for hidden layers, tanh almost always works better than sigmoid (the mean of its output is closer to zero, so it centers the data better for the next layer)
- for output layer of binary classification (0/1), sigmoid may be better
- different layers can have different activation functions
- ReLU is increasingly the default choice (will learn faster)
 | function $g(z)$ | derivative $g'(z)$
---|---|---
sigmoid | $\frac{1}{1+e^{-z}}$ | $g(z)(1-g(z))$
tanh | $\frac{e^z-e^{-z}}{e^z+e^{-z}}$ | $1-(g(z))^2$
ReLU | $\max(0, z)$ | $\begin{cases}0, &z<0\newline 1, &z\geq 0\end{cases}$
Leaky ReLU | $\max(0.01z, z)$ | $\begin{cases}0.01, &z<0\newline 1, &z\geq 0\end{cases}$
Gradient Descent
Single Training Example | $m$ Training Examples
---|---
$dz^{[2]}=a^{[2]}-y$ | $dZ^{[2]}=\frac{1}{m}(A^{[2]}-Y)$
$dW^{[2]}=dz^{[2]}a^{[1]T}$ | $dW^{[2]}=dZ^{[2]} (A^{[1]})^T$
$db^{[2]}=dz^{[2]}$ | $db^{[2]}=$ np.sum($dZ^{[2]}$, axis=1, keepdims=True)
$dz^{[1]}=(W^{[2]})^T dz^{[2]} \circ g^{[1]\prime}(z^{[1]})$ | $dZ^{[1]}=(W^{[2]})^T dZ^{[2]} \circ g^{[1]\prime}(Z^{[1]})$
$dW^{[1]}=dz^{[1]}x^T$ | $dW^{[1]}=dZ^{[1]} X^T$
$db^{[1]}=dz^{[1]}$ | $db^{[1]}=$ np.sum($dZ^{[1]}$, axis=1, keepdims=True)
Summary
Forward Propagation | Backward Propagation
---|---
$Z^{[1]}=W^{[1]}X + b^{[1]}$ | $dZ^{[2]}=\frac{1}{m}(A^{[2]}-Y)$
$A^{[1]}=g^{[1]}(Z^{[1]})$ | $dW^{[2]}= dZ^{[2]} (A^{[1]})^T$, $db^{[2]}=$ np.sum($dZ^{[2]}$, axis=1, keepdims=True)
$Z^{[2]}=W^{[2]}A^{[1]}+b^{[2]}$ | $dZ^{[1]}=(W^{[2]})^T dZ^{[2]} \circ g^{[1]\prime}(Z^{[1]})$ (elementwise product)
$A^{[2]}=g^{[2]}(Z^{[2]})=\sigma(Z^{[2]})$ | $dW^{[1]}= dZ^{[1]} X^T$, $db^{[1]}=$ np.sum($dZ^{[1]}$, axis=1, keepdims=True)
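A rough numpy sketch of one gradient step for this 2-layer network (tanh hidden layer, sigmoid output); the function name and signature are illustrative:

```python
import numpy as np

def two_layer_step(X, Y, W1, b1, W2, b2, learning_rate=0.01):
    """Shapes: X (n0, m), Y (1, m), W1 (n1, n0), b1 (n1, 1), W2 (1, n1), b2 (1, 1)."""
    m = X.shape[1]
    # forward propagation
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)                      # g[1] = tanh
    Z2 = W2 @ A1 + b2
    A2 = 1 / (1 + np.exp(-Z2))            # g[2] = sigmoid
    # backward propagation (1/m folded into dZ2, as in the table above)
    dZ2 = (A2 - Y) / m
    dW2 = dZ2 @ A1.T
    db2 = np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)    # tanh'(Z1) = 1 - A1^2
    dW1 = dZ1 @ X.T
    db1 = np.sum(dZ1, axis=1, keepdims=True)
    # gradient descent update
    W1 -= learning_rate * dW1; b1 -= learning_rate * db1
    W2 -= learning_rate * dW2; b2 -= learning_rate * db2
    return W1, b1, W2, b2
```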
Random Initialization
- If weights and biases are initialized to 0: each neuron in the first hidden layer performs the same computation, so even after many iterations of gradient descent every neuron in a layer still computes the same thing as the others (symmetry is never broken).
- If weights are initialized to relatively large values (with tanh activations for all hidden units): the inputs to tanh will also be very large, causing gradients to be close to zero, so optimization becomes slow.
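A minimal initialization sketch for a 2-layer network, assuming the usual small scale factor of 0.01:

```python
import numpy as np

def initialize_parameters(n0, n1, n2, scale=0.01):
    W1 = np.random.randn(n1, n0) * scale   # small random values break symmetry
    b1 = np.zeros((n1, 1))                 # biases can safely start at zero
    W2 = np.random.randn(n2, n1) * scale
    b2 = np.zeros((n2, 1))
    return W1, b1, W2, b2
```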
Planar Data Classification - Python Code
Deep Neural Network
Notations
- $L$: # of layers
- $n^{[\ell]}$: # of units in layer $\ell$
- $a^{[\ell]}$: activations in layer $\ell$, $a^{[\ell]}=g^{[\ell]}(z^{[\ell]})$
- $W^{[\ell]}, b^{[\ell]}$: weights and bias for computing $z^{[\ell]}$
Forward Propagation
Layer $\ell$:
- $a^{[0]}:=x$
- $z^{[\ell]}=W^{[\ell]}a^{[\ell-1]}+b^{[\ell]}$
- $a^{[\ell]}=g^{[\ell]}(z^{[\ell]})$

(Vectorized) layer $\ell$:
- $A^{[0]}:=X$
- $Z^{[\ell]}=W^{[\ell]}A^{[\ell-1]}+b^{[\ell]}$ (numpy broadcasting for $b^{[\ell]}$)
- $A^{[\ell]}=g^{[\ell]}(Z^{[\ell]})$
Check Dimensions
- shape of $W^{[\ell]}$ and $dW^{[\ell]}$: $(n^{[\ell]}, n^{[\ell-1]})$
- shape of $b^{[\ell]}$ and $db^{[\ell]}$: $(n^{[\ell]}, 1)$
- shape of $Z^{[\ell]}, A^{[\ell]}, dZ^{[\ell]}, dA^{[\ell]}$: $(n^{[\ell]}, m)$
Forward and Backward Functions
Layer $\ell$:
- Forward:
  - input: $a^{[\ell-1]}$
  - output: $a^{[\ell]}$, cache $z^{[\ell]}$
- Backward:
  - input: $da^{[\ell]}$
  - output: $da^{[\ell-1]}, dW^{[\ell]}, db^{[\ell]}$
  - computation:
$$
dz^{[\ell]} = da^{[\ell]} * (g^{[\ell]})'(z^{[\ell]}) \text{ (elementwise product)}\\
dW^{[\ell]} = dz^{[\ell]}a^{[\ell-1]T}\\
db^{[\ell]}=dz^{[\ell]}\\
da^{[\ell-1]}=W^{[\ell]T}dz^{[\ell]} \Rightarrow dz^{[\ell]} =W^{[\ell+1]T}dz^{[\ell+1]} * (g^{[\ell]})'(z^{[\ell]})
$$
$$
\text{Vectorized}: dZ^{[\ell]} = dA^{[\ell]} * (g^{[\ell]})'(Z^{[\ell]}) \text{ (elementwise product)}\\
dW^{[\ell]} = dZ^{[\ell]}A^{[\ell-1]T}\\
db^{[\ell]}= np.sum(dZ^{[\ell]}, axis=1, keepdims=True)\\
dA^{[\ell-1]}=W^{[\ell]T}dZ^{[\ell]}
$$
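A sketch of the per-layer forward/backward functions described above; the names and cache layout are illustrative, not the programming-assignment API:

```python
import numpy as np

def layer_forward(A_prev, W, b, activation):
    Z = W @ A_prev + b
    A = activation(Z)
    cache = (A_prev, W, Z)          # cached for the backward pass
    return A, cache

def layer_backward(dA, cache, activation_prime):
    A_prev, W, Z = cache
    dZ = dA * activation_prime(Z)                 # elementwise product
    dW = dZ @ A_prev.T
    db = np.sum(dZ, axis=1, keepdims=True)
    dA_prev = W.T @ dZ
    return dA_prev, dW, db
```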
Parameters and Hyperparameters
- parameters: $W^{[\ell]}, b^{[\ell]}$
- hyperparameters:
  - learning rate $\alpha$
  - # of iterations
  - # of hidden layers $L$
  - # of hidden units $n^{[\ell]}$
  - choice of activation function
  - momentum
  - mini-batch size
  - regularization
Build Deep Neural Network - Image Classification (cat/non-cat) - Python Code
Practical Aspects of Deep Learning
- train/dev/test sets
- traditional rules of thumb ratio: 60/20/20
- big data set: e.g. 98/1/1, 99.5/.4/.1
- mismatched train/test distribution
- guideline: make sure dev and test come from same distribution
- not having a test set is ok (only dev set)
- bias/variance $\Rightarrow$ Basic Recipe:
  - high bias? $\rightarrow$ bigger network, train longer, NN architecture search
  - high variance? $\rightarrow$ get more data, regularization, NN architecture search
- no more bias/variance tradeoff:
  - pre-deep-learning era: could not reduce bias or variance without hurting the other
  - deep-learning big-data era: a bigger network almost always reduces bias without necessarily hurting variance (so long as you regularize properly); getting more data almost always reduces variance without hurting bias much
Regularization
- reduce variance?
- more data: expensive
- use regularization, prevent overfitting
L2 Regularization
- Cost function
$$ J(W^{[1]},b^{[1]},\ldots,W^{[L]},b^{[L]})=\frac{1}{m}\sum_{i=1}^m\mathcal{L}(\hat{y}^{(i)}, y^{(i)})+\frac{\lambda}{2m}\sum_{\ell=1}^L\Vert W^{[\ell]}\Vert_F^2 $$
where $\Vert\cdot\Vert_F$ is the Frobenius norm.
- gradient: $dW^{[\ell]}=$ (from backprop) $+\frac{\lambda}{m}W^{[\ell]}$
- update: $W^{[\ell]} := W^{[\ell]} -\alpha\, dW^{[\ell]}$
  $\Rightarrow W^{[\ell]}:=(1-\frac{\alpha\lambda}{m})W^{[\ell]}-\alpha\cdot$ (from backprop)
  - L2 regularization = weight decay (see the code sketch after this section)
- why does regularization reduce overfitting?
  - large $\lambda$ $\Rightarrow$ $W^{[\ell]}\approx 0$ $\Rightarrow$ "simpler" network (hidden units are not zeroed out, but some of them have a smaller effect)
  - e.g., with activation $g = \tanh$: large $\lambda$ $\Rightarrow$ small $W^{[\ell]}$ $\Rightarrow$ $z^{[\ell]}=W^{[\ell]}a^{[\ell-1]}+b^{[\ell]}$ small $\Rightarrow$ $g(z)\approx$ linear function $\Rightarrow$ nearly linear decision boundary
- debugging: when plotting cost $J$ against # of iterations, use the definition of $J$ that includes the regularization term; otherwise the plotted cost may not decrease monotonically
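A minimal sketch of L2 regularization in code, assuming `parameters` maps names like `"W1"`, `"W2"`, … to weight matrices and `grads` holds the unregularized backprop gradients (these names are assumptions for illustration):

```python
import numpy as np

def l2_cost_term(parameters, lambd, m):
    # (lambda / 2m) * sum of squared Frobenius norms of all weight matrices
    return (lambd / (2 * m)) * sum(
        np.sum(np.square(W)) for name, W in parameters.items() if name.startswith("W")
    )

def add_l2_to_grads(parameters, grads, lambd, m):
    # dW[l] <- dW[l] + (lambda / m) * W[l]; with gradient descent this acts as weight decay
    for name, W in parameters.items():
        if name.startswith("W"):
            grads["d" + name] = grads["d" + name] + (lambd / m) * W
    return grads
```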
Dropout Regularization
- Implement dropout ("inverted dropout"), illustrated with layer $\ell=3$:

    keepProb = 0.8  # probability a unit will be kept
    d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keepProb
    a3 = a3 * d3        # elementwise product: randomly shut units off
    a3 = a3 / keepProb  # "inverted" step: keep the expected value of a3 the same
- make predictions at test time: no dropout
- why does dropout work?
- intuition: cannot rely on any one feature, so have to spread out weights
- shrink weights (similar to L2)
- vary `keepProb` by layer
- downside: the cost function $J$ is no longer well-defined
Other Regularization Methods
- Data augmentation: e.g. image flip/zoom-in/rotate/distortion
- Early stopping:
- cons: can no longer work on 2 steps of orthogonalization independently
- alternative: L2 regularization (cons: more computationally expensive)
Orthogonalization:
- optimize cost function: gradient descent, …
- not overfit: regularization, …
Speed Up Training
- normalizing training sets
- make sure all features are on similar scale, thus making cost function easier and faster to optimize
- use the same mean/sd (computed on the training set) to normalize the test set
- weight initialization for deep networks
- partial solution to vanishing/exploding gradients
- use Xavier initialization, …
- can also be tuned as a hyperparameter
- $z=w_1x_1 +w_2x_2 +\cdots + w_nx_n$ (omitting the bias); set $\mathrm{var}(w_i)=\frac{1}{n}$ ($n=$ # of input nodes)

    W = np.random.randn(node_in, node_out) / np.sqrt(node_in)
    # relu activation
    # W = np.random.randn(node_in, node_out) / np.sqrt(node_in/2)
    # tanh activation
    # W = np.random.randn(node_in, node_out) / np.sqrt(node_in)
    # W = np.random.randn(node_in, node_out) / np.sqrt((node_in + node_out)/2)
Debugging of Backpropagation: gradient checking
- take $W^{[1]},b^{[1]},\ldots,W^{[L]},b^{[L]}$ and reshape into a big vector $\theta$
- take $dW^{[1]},db^{[1]},\ldots,dW^{[L]},db^{[L]}$ and reshape into a big vector $d\theta$
- grad check: for each $i$,
$$ d\theta_{approx}[i]:=\frac{J(\theta_1,\theta_2,\ldots,\theta_i+\epsilon,\ldots)-J(\theta_1,\theta_2,\ldots,\theta_i-\epsilon,\ldots)}{2\epsilon}\approx \frac{\partial J}{\partial \theta_i} $$
then check
$$ \frac{\Vert d\theta_{approx}-d\theta\Vert_2}{\Vert d\theta_{approx}\Vert_2+\Vert d\theta\Vert_2} \approx 0, \text{ e.g. } < 10^{-7} $$
- do not use in training - only to debug
- if algorithm fails grad check, look at components to identify bug
- remember regularization term
- grad check does not work with dropout
  - implement grad check without dropout (turn off dropout, `keepProb = 1`)
- run at random initialization; perhaps again after some training
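A minimal numerical grad-check sketch, assuming `cost` maps the flattened parameter vector $\theta$ to $J(\theta)$ and `grad` is the backprop gradient reshaped into the same vector:

```python
import numpy as np

def gradient_check(cost, theta, grad, epsilon=1e-7):
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy();  theta_plus[i] += epsilon
        theta_minus = theta.copy(); theta_minus[i] -= epsilon
        d_theta_approx[i] = (cost(theta_plus) - cost(theta_minus)) / (2 * epsilon)
    # relative difference: roughly, < 1e-7 is great, > 1e-3 signals a bug
    numerator = np.linalg.norm(d_theta_approx - grad)
    denominator = np.linalg.norm(d_theta_approx) + np.linalg.norm(grad)
    return numerator / denominator
```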
Python Code
Optimization Algorithms
Mini-Batch Gradient Descent
- batch vs mini-batch
- mini-batch $t$: $X^{\{t\}}, y^{\{t\}}$
- mini-batch gradient descent
- choose mini-batch size: hyperparameter
  - size = $m$: batch gradient descent
  - size = 1: stochastic gradient descent, every example is a mini-batch
  - in practice, size between 1 and $m$
    - batch gradient descent: too long per iteration
    - stochastic gradient descent: loses the speedup from vectorization
  - small training set ($m\leq 2000$): use batch gradient descent
  - typical mini-batch sizes: 64, 128, 256, 512
  - make sure a mini-batch fits in CPU/GPU memory
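A sketch of partitioning the training set into shuffled mini-batches (names are illustrative):

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    permutation = rng.permutation(m)          # shuffle examples
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]
    mini_batches = []
    for k in range(0, m, mini_batch_size):    # last batch may be smaller
        mini_batches.append((X_shuffled[:, k:k + mini_batch_size],
                             Y_shuffled[:, k:k + mini_batch_size]))
    return mini_batches
```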
Momentum
Exponentially Weighted Averages
$$ v_0=0, v_t = \beta v_{t-1}+(1-\beta)\theta_t $$
$$ \Rightarrow v_t = \beta^tv_0 +(1-\beta)\left[ \theta_t + \beta\theta_{t-1} +\ldots \beta^{t-1}\theta_1 \right]=(1-\beta)\sum_{i=0}^{t-1}\beta^i\theta_{t-i} $$
Since
$$
\beta^{\frac{1}{1-\beta}}=\left(1-(1-\beta)\right)^{\frac{1}{1-\beta}}\approx \frac{1}{e},
$$
$v_t\approx$ average of the last $\frac{1}{1-\beta}$ terms of the $\theta$'s

$v_{\theta}:=0$
repeat {
  get next $\theta_t$
  $v_{\theta}:=\beta v_{\theta} + (1-\beta)\theta_t$
}
Bias Correction in Exponentially Weighted Averages
Take $v_t := v_t/(1-\beta^t)$
- decrease $\beta$:
  - more oscillation
  - shifts the curve to the left (adapts faster to recent values)
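A small sketch of the exponentially weighted average with optional bias correction:

```python
import numpy as np

def ewa(theta, beta=0.9, bias_correction=True):
    """theta: iterable of observations; returns the running weighted average."""
    v = 0.0
    out = []
    for t, theta_t in enumerate(theta, start=1):
        v = beta * v + (1 - beta) * theta_t
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return np.array(out)
```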
Gradient Descent With Momentum
on iteration $t$:
- compute $dw, db$ on current mini-batch
- $v_{dw} = \beta v_{dw}+(1-\beta)dw$ ($v_{dw}$: velocity, $dw$: acceleration, $\beta$: friction)
- $v_{db} = \beta v_{db}+(1-\beta)db$
- update: $w := w- \alpha v_{dw}$, $b := b- \alpha v_{db}$
- hyperparameters: $\alpha$, $\beta=0.9$
  - no need to do bias correction
(illustration: plain gradient descent vs. momentum with small $\beta$ vs. momentum with large $\beta$)
Root Mean Square Prop (RMS Prop)
on iteration $t$:
- compute $dw, db$ on current mini-batch
- $s_{dw} = \beta s_{dw}+(1-\beta)dw^2$ (elementwise square)
- $s_{db} = \beta s_{db}+(1-\beta)db^2$
- update: $w := w- \alpha \frac{dw}{\sqrt{s_{dw}}+\epsilon}$, $b := b- \alpha \frac{db}{\sqrt{s_{db}}+\epsilon}$
- can then use a larger learning rate (faster learning) without the gradients diverging
- $\epsilon$: avoids division by 0
Adam Optimization Algorithm
- putting it together: momentum + RMS
- adam = adaptive moment estimation
initialize $v_{dw}=0, s_{dw}=0, v_{db}=0, s_{db}=0$
on iteration $t$:
- compute $dw, db$ on current mini-batch
- momentum: $v_{dw} = \beta_1 v_{dw}+(1-\beta_1)dw$, $v_{db} = \beta_1 v_{db}+(1-\beta_1)db$
- RMS: $s_{dw} = \beta_2 s_{dw}+(1-\beta_2)dw^2$, $s_{db} = \beta_2 s_{db}+(1-\beta_2)db^2$
- bias correction: $v_{dw}^{corrected}=\frac{v_{dw}}{1-\beta_1^t}, v_{db}^{corrected}=\frac{v_{db}}{1-\beta_1^t}$, $s_{dw}^{corrected}=\frac{s_{dw}}{1-\beta_2^t}, s_{db}^{corrected}=\frac{s_{db}}{1-\beta_2^t}$
- update: $w := w- \alpha \frac{v_{dw}^{corrected} }{\sqrt{s^{corrected}_{dw}}+\epsilon}$, $b :=b- \alpha \frac{v^{corrected}_{db}}{\sqrt{s^{corrected}_{db}}+\epsilon}$
- hyperparameters:
  - $\alpha$: needs to be tuned
  - $\beta_1$: 0.9 (default)
  - $\beta_2$: 0.999 (default)
  - $\epsilon$: $10^{-8}$ (default)
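A sketch of the Adam update for a single parameter array (defaults follow the values above; the function name is illustrative):

```python
import numpy as np

def adam_update(w, dw, v, s, t, learning_rate=0.001,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """v and s start as zeros; t is the 1-based iteration counter."""
    v = beta1 * v + (1 - beta1) * dw            # first moment (momentum)
    s = beta2 * s + (1 - beta2) * (dw ** 2)     # second moment (RMSprop term)
    v_corrected = v / (1 - beta1 ** t)          # bias correction
    s_corrected = s / (1 - beta2 ** t)
    w = w - learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return w, v, s
```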
Learning Rate Decay
- slowly reduce learning rate
- 1 epoch = 1 pass through the training data
- set
$$ \alpha =\frac{1}{1+\text{decay-rate}\times \text{epoch-num}}\alpha_0 $$
- other learning rate decay methods
  - exponential decay: $\alpha=0.95^{\text{epoch-num}}\alpha_0$
  - $\alpha=\frac{k}{\sqrt{\text{epoch-num}}}\alpha_0$ or $\alpha=\frac{k}{\sqrt{t}}\alpha_0$ ($t=$ mini-batch number)
  - discrete staircase
  - manual decay
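A sketch of the two schedules above (function names are illustrative):

```python
def inverse_decay(alpha0, decay_rate, epoch_num):
    return alpha0 / (1 + decay_rate * epoch_num)

def exponential_decay(alpha0, epoch_num, base=0.95):
    return alpha0 * base ** epoch_num
```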
Local Optima in Neural Networks
- unlikely to get stuck in a bad local optimum
- in a NN, most points of zero gradient are not local optima but saddle points
- plateaus can make learning slow
Python Code - Optimization Algorithms
Tuning Process
Hyperparameters
(in rough order of tuning priority)
- learning rate $\alpha$
- momentum $\beta=0.9$, # of hidden units, mini-batch size
- # of layers, learning rate decay
- Adam hyperparameters (rarely tuned): $\beta_1=0.9, \beta_2=0.999, \epsilon=10^{-8}$
- try random values, do not use a grid
- coarse to fine
- appropriate scale for hyperparameters
  - e.g. use a log scale to sample (samples small values more densely than a linear scale) the learning rate $\alpha$ and $1-\beta$
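A sketch of log-scale sampling for $\alpha$ and $\beta$ (the ranges are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng()
r = rng.uniform(-4, 0)          # exponent
alpha = 10 ** r                 # alpha in [1e-4, 1], log-uniform
r = rng.uniform(-3, -1)
beta = 1 - 10 ** r              # beta in [0.9, 0.999], sampled more densely near 1
```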
Hyperparameters Tuning in Practice
Batch Normalization
- normalizing input features can speed up learning
- batch norm: normalize hidden units
- implementing gradient descent with batch norm:
  for $t=1,2,\ldots,$ numMiniBatch:
  - compute forward prop on $X^{\{t\}}$
    - in each hidden layer, use BN to replace $Z^{[\ell]}$ with $\tilde{Z}^{[\ell]}$
  - use back prop to compute $dW^{[\ell]}, d\beta^{[\ell]}, d\gamma^{[\ell]}$
  - update parameters (works with momentum, RMS prop, Adam)
- batch norm reduces covariate shift
- batch norm has a slight regularization effect
- batch norm at test time:
  - use exponentially weighted averages of $\mu$ and $\sigma^2$ computed across mini-batches during training
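A sketch of the batch-norm transform for one layer's pre-activations (shapes assume the column-per-example convention used throughout; the function name is illustrative):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, epsilon=1e-8):
    """Z: (n, m) pre-activations; gamma, beta: learned per-unit scale/shift, (n, 1)."""
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + epsilon)   # zero mean, unit variance per unit
    Z_tilde = gamma * Z_norm + beta              # restore expressiveness
    return Z_tilde, (Z_norm, mu, var)            # cache mu/var for test-time averages
```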
Multiclass Classification
Softmax Regression
- $C$ = # of classes, $n^{[L]}=C$
- softmax layer:
$$ \text{activation: } t = e^{z^{[L]}},\ a^{[L]}=\frac{e^{z^{[L]}}}{\sum_{j=1}^C t_j} $$
- loss function:
$$ \mathcal{L}(y,\hat{y})=-\sum_{j=1}^Cy_j\log(\hat{y}_j) $$
- cost on the entire training set:
$$ J=\frac{1}{m}\sum_{i=1}^m\mathcal{L}(y^{(i)},\hat{y}^{(i)}) $$
- gradient descent with softmax:
$$ dZ^{[L]}:=\frac{\partial J}{\partial Z^{[L]}}=\hat{y}-y $$
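A sketch of the softmax activation and cross-entropy cost in numpy (the max-subtraction for numerical stability is an addition, not from the notes):

```python
import numpy as np

def softmax(Z_L):
    """Z_L: (C, m) output-layer pre-activations."""
    t = np.exp(Z_L - np.max(Z_L, axis=0, keepdims=True))   # shift for numerical stability
    return t / np.sum(t, axis=0, keepdims=True)

def softmax_cost(A_L, Y):
    """Y: one-hot labels, (C, m)."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(A_L)) / m

# per the formula above, the gradient w.r.t. Z_L is simply A_L - Y
```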
Deep Learning Frameworks
- Caffe/Caffe2
- CNTK
- DL4J
- Keras
- Lasagne
- mxnet
- PaddlePaddle
- Tensorflow
- Theano
- Torch
Choosing Deep Learning Frameworks
- ease of programming (development and deployment)
- running speed
- truly open (open source with good governance)
Signs Recognition with Tensorflow
ML Strategy
Single Number Evaluation Metric
- precision = PPV = TP/(TP + FP) = 1-FDR
- recall = TPR = TP/P = TP/(TP + FN)
- F1 score = harmonic mean of precision and recall
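A tiny sketch computing these metrics from raw counts (true positives, false positives, false negatives):

```python
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)   # harmonic mean
```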
Satisficing and Optimizing Metric
- accuracy vs running time, e.g.:
  - combined metric: cost = accuracy - 0.5 * running time, or:
  - maximize accuracy s.t. running time $\leq$ 100ms
    - accuracy: optimizing metric
    - running time: satisficing metric
- N metrics:
  - pick 1: optimizing
  - others: satisficing (thresholds)
  - e.g. maximize accuracy s.t. false positives $\leq$ threshold
Train/Dev/Test Distributions
- guideline: choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on
- dev set and test set should come from the same distribution
Size of the Dev and Test Sets
- old way of splitting data:
- train/test = 70/30
- train/dev/test = 60/20/20
- deep learning
- train/dev/test = 98/1/1
- set your test set to be big enough to give high confidence in the overall performance of your system
- not having a test set is ok, train/dev
When to Change Dev/Test Sets and Metrics
Comparing to Human-Level Performance
- Bayes optimal error
- human-level performance might not be far from Bayes error
- human-level error as a proxy for Bayes error
- avoidable/unavoidable error
- difference between human-level performance & training error: avoidable bias
- difference between training error & dev error: variance
- surpassing human-level performance: typically problems with structured data (not natural perception) and lots of data to find patterns in, e.g.:
- online advertising
- product recommendations
- logistics (predicting transit time)
- loan approvals
- speech recognition
- some image recognition
- medical: ECG, ….
- two fundamental assumptions of supervised learning
- you can fit the training set pretty well (avoidable bias)
- the training set performance generalized pretty well to dev/test set (variance)
- reduce avoidable bias
- train bigger model
- train longer/ better optimization algo (momentum, RMS prop, Adam)
- NN architecture (CNN, RNN) / hyperparameter search
- reduce variance
- more data
- regularization (L2, dropout, data augmentation)
- NN architecture (CNN, RNN) / hyperparameter search
Error Analysis
Cleaning Up Incorrectly Labeled Data
- DL algorithms are quite robust to random errors in the training set
- incorrectly labeled data in the dev/test set: add one col in error analysis
Correcting incorrect dev/test set examples
- apply same process to dev and test sets to make sure they continue to come from the same distribution
- consider examining examples your algo got right as well as ones it got wrong
- training and dev/test data may now come from slightly different distributions
Mismatched Training and Dev/Test Set
- data mismatch problem vs variance problem?
- training-dev set: same distribution as training set, but not used for training
- training error - human-level error = avoidable error
- training-dev error - training error = variance
- dev error - training error = data mismatch
- test error - dev error = degree of overfitting to the dev set (if large, use a larger dev set)
Addressing Data Mismatch
- collect more data similar to dev/test sets
- artificial synthesized data (e.g. speech recognition)
Transfer Learning
When transfer learning makes sense (Task A $\rightarrow$ Task B):
- Task A and B have the same input
- You have a lot more data for Task A than Task B
- Low level features from A could be helpful for learning B
Multi-task Learning
When multi-task learning makes sense
- training on a set of tasks that could benefit from having shared lower-level features
- usually: amount of data you have for each task is quite similar
- can train a big enough nn to do well on all the tasks
End-to-End Deep Learning
pros of end-to-end deep learning
- let the data speak
- less hand-designing of components needed
cons
- may need large amount of data
- excludes potentially useful hand-designed components