Neural Networks and Deep Learning


What is a Neural Network?

| Input | Output | Application | Model |
| --- | --- | --- | --- |
| Home Features | Price | Real Estate | NN |
| Ad, User info | Click on ad? 0/1 | Online Advertising | NN |
| Image | Object | Photo Tagging | CNN |
| Audio | Text transcript | Speech Recognition | RNN |
| English | Chinese | Machine Translation | RNN |
| Image, Radar info | Position of other cars | Autonomous Driving | Hybrid |

Why is Deep Learning taking off?

Being able to train a big enough neural network:


Logistic Regression as a Neural Network

$$ X = \begin{pmatrix} x^{(1)} &x^{(2)} &\cdots &x^{(m)} \end{pmatrix} \in \mathbb{R}^{n_x \times m} $$

$$ y= \begin{pmatrix} y^{(1)} &y^{(2)} &\cdots &y^{(m)} \end{pmatrix} \in \mathbb{R}^{1 \times m} $$

Given $x\in\mathbb{R}^{n_x}$, want $\hat{y}=\Pr(y=1|x)$

Output: $\hat{y}=\sigma(\omega^Tx+b)$, where sigmoid function $\sigma(z)=\frac{1}{1+e^{-z}}$

Given $\{x^{(i)}, y^{(i)}\}_{i=1}^m$, want $\hat{y}^{(i)}=\sigma(\omega^Tx^{(i)}+b)\approx y^{(i)}$

loss(error) function: $\mathcal{L}(\hat{y},y)=-\left[y\log \hat{y}+(1-y)\log(1-\hat{y})\right]$

cost function: $J(\omega,b)=\frac{1}{m}\sum\limits_{i=1}^m \mathcal{L}(\hat{y}^{(i)}, y^{(i)})=-\frac{1}{m}\sum\limits_{i=1}^m\left[y^{(i)}\log \hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})\right]$


$\omega\leftarrow \omega - \alpha\frac{\partial}{\partial\omega} J(\omega,b)$

$b\leftarrow b - \alpha\frac{\partial}{\partial b} J(\omega,b)$


One training example:

$z= \omega^Tx+b$

$\hat{y} = a = \sigma(z)$

$\mathcal{L}(a,y)=-(y\log a+(1-y)\log(1-a))$

da := $\frac{d\mathcal{L}}{da}=-\frac{y}{a}+\frac{1-y}{1-a}$

dz := $\frac{d\mathcal{L}}{dz}=\frac{d\mathcal{L}}{da}\frac{da}{dz}=\left[-\frac{y}{a}+\frac{1-y}{1-a}\right]a(1-a)=a-y$

dwi := $\frac{\partial \mathcal{L}}{\partial \omega_i}=x_i\frac{d\mathcal{L}}{dz}$ = xi*dz

db := $\frac{\partial \mathcal{L}}{\partial b}=\frac{d\mathcal{L}}{dz}$ = dz

$\omega_i\leftarrow \omega_i - \alpha$ dwi

$b\leftarrow b - \alpha$ db

$m$ training examples:

$z^{(i)}= \omega^Tx^{(i)}+b$

$\hat{y}^{(i)} = a^{(i)} = \sigma(z^{(i)})$

$\mathcal{L}(a^{(i)},y^{(i)})=-(y^{(i)}\log a^{(i)}+(1-y^{(i)})\log(1-a^{(i)}))$

$J(\omega,b)=\frac{1}{m}\sum_{i=1}^m \mathcal{L}(a^{(i)},y^{(i)})$

$\frac{\partial J}{\partial\omega_i} = \frac{1}{m}\sum_{i=1}^m\frac{\partial}{\partial\omega_i}\mathcal{L}(a^{(i)},y^{(i)})$

2 for loops: loop over all entries ($m$), loop over all features ($n$)

$\Rightarrow$ vectorization, more efficient!



import numpy as np

def sigmoid(u):
    return 1/(1 + np.exp(-u))

# X: data, n*m, n = number of features, m = number of examples
# Y: labels, 1*m
# w: weights, n*1
# b: bias, scalar

# compute activation (forward propagation), A: 1*m
A = sigmoid(np.dot(w.T, X) + b)

# cost
J = - np.mean(np.log(A) * Y + np.log(1-A) * (1-Y))

# back propagation (compute gradient)
m = X.shape[1]
dZ = (A - Y)/m
db = np.sum(dZ)
dw = np.dot(X, dZ.T)  # n*1, same shape as w

# update params
w -= learning_rate*dw
b -= learning_rate*db
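
To see the steps above run end to end, here is a minimal sketch that wraps them in a training loop. The function name `train_logistic_regression`, the zero initialization, the default hyperparameters, and the synthetic data are assumptions made for illustration; they are not part of the notes.

```python
import numpy as np

def sigmoid(u):
    return 1 / (1 + np.exp(-u))

def train_logistic_regression(X, Y, learning_rate=0.1, num_iterations=1000):
    """Batch gradient descent. X: (n, m) data, Y: (1, m) labels in {0, 1}."""
    n, m = X.shape
    w = np.zeros((n, 1))   # zero initialization is fine for logistic regression
    b = 0.0
    costs = []
    for _ in range(num_iterations):
        A = sigmoid(np.dot(w.T, X) + b)                       # forward pass, (1, m)
        costs.append(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))
        dZ = (A - Y) / m                                      # backward pass
        dw = np.dot(X, dZ.T)
        db = np.sum(dZ)
        w -= learning_rate * dw                               # parameter update
        b -= learning_rate * db
    return w, b, costs

# Hypothetical toy data: the label is 1 when the two features sum to a positive number.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
Y = (X[0] + X[1] > 0).astype(float).reshape(1, -1)

w, b, costs = train_logistic_regression(X, Y)
accuracy = np.mean((sigmoid(np.dot(w.T, X) + b) > 0.5) == Y)
```

On this linearly separable toy set the cost should fall steadily and the training accuracy should end up close to 1.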

Explanation of Logistic Regression Cost Function

$$ y|x \sim Binom(1, \hat{y}), \hat{y} =\sigma(\omega^Tx+b) $$

$$ \Rightarrow \Pr(y|x)=\hat{y}^y(1-\hat{y})^{1-y}=\begin{cases} 1-\hat{y}, &y=0\newline \hat{y}, &y=1 \end{cases} $$

$$ \Rightarrow \log\Pr(y|x)=y\log\hat{y}+(1-y)\log(1-\hat{y})=-\mathcal{L}(\hat{y},y) $$

Goal: maximize $\Pr(y|x)$ $\Leftrightarrow$ minimize $\mathcal{L}(\hat{y},y)$

Cost on $m$ examples:

maximize $\Pr(\text{labels in training set}) = \prod_{i=1}^m\Pr(y^{(i)}|x^{(i)})$ (assuming the examples are i.i.d.)

$\Leftrightarrow$ maximize $\log \prod_{i=1}^m\Pr(y^{(i)}|x^{(i)})=\sum_{i=1}^m\log\Pr(y^{(i)}|x^{(i)})=-\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})$

$\Leftrightarrow$ minimize $J(\omega,b)=\frac{1}{m}\sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)},y^{(i)})$

Image Classification (cat/non-cat) - Python Code

Shallow Neural Network

Forward Propagation

| Parameters | Dimension |
| --- | --- |
| $W^{[1]}$ | $(n^{[1]}, n^{[0]})$ |
| $b^{[1]}$ | $(n^{[1]}, 1)$ |
| $W^{[2]}$ | $(n^{[2]}, n^{[1]})$ |
| $b^{[2]}$ | $(n^{[2]}, 1)$ |

$$ A^{[0]} = X, n^{[2]}=1 $$

$$ \text{Forward Propagation }\begin{cases} Z^{[k]} &= W^{[k]}A^{[k-1]} + b^{[k]}, (\text{linear combination}) \newline A^{[k]}&=g^{[k]}(Z^{[k]}), (\text{activation}) \end{cases} \ \ k=1,2,\ldots $$


Activation Function

| activation | function $g(z)$ | derivative $g'(z)$ |
| --- | --- | --- |
| sigmoid | $\frac{1}{1+e^{-z}}$ | $g(z)(1-g(z))$ |
| tanh | $\frac{e^z-e^{-z}}{e^z+e^{-z}}$ | $1-(g(z))^2$ |
| ReLU | $\max(0, z)$ | $g'(z)=\begin{cases}0, &z<0\newline 1, &z\geq 0\end{cases}$ |
| Leaky ReLU | $\max(0.01z, z)$ | $g'(z)=\begin{cases}0.01, &z<0\newline 1, &z\geq 0\end{cases}$ |
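
The activation functions and derivatives listed above can be written down directly in NumPy. This is a minimal sketch; the function names and the helper `numeric_derivative` (a central difference used only as a sanity check) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    g = sigmoid(z)
    return g * (1 - g)

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    # 0 for z < 0, 1 for z >= 0 (the convention used in the list above)
    return np.where(z < 0, 0.0, 1.0)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

def leaky_relu_prime(z, slope=0.01):
    return np.where(z < 0, slope, 1.0)

def numeric_derivative(f, z, eps=1e-6):
    # central difference, used only to sanity-check the closed forms
    return (f(z + eps) - f(z - eps)) / (2 * eps)

# Test points away from z = 0, where ReLU and Leaky ReLU have a kink
z = np.array([-2.0, -0.5, 0.5, 2.0])
```

Comparing each `*_prime` function with `numeric_derivative` at the points in `z` is a quick way to confirm the closed-form derivatives.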

Gradient Descent

| Single Training Example | $m$ Training Examples |
| --- | --- |
| $dz^{[2]}=a^{[2]}-y$ | $dZ^{[2]}=\frac{1}{m}(A^{[2]}-Y)$ |
| $dW^{[2]}=dz^{[2]}a^{[1]T}$ | $dW^{[2]}=dZ^{[2]} (A^{[1]})^T$ |
| $db^{[2]}=dz^{[2]}$ | $db^{[2]}=\text{np.sum}(dZ^{[2]},\ \text{axis}=1,\ \text{keepdims}=\text{True})$ |
| $dz^{[1]}=(W^{[2]})^T dz^{[2]} \circ g^{[1]'}(z^{[1]})$ | $dZ^{[1]}=(W^{[2]})^T dZ^{[2]} \circ g^{[1]'}(Z^{[1]})$ |
| $dW^{[1]}=dz^{[1]}x^T$ | $dW^{[1]}=dZ^{[1]} X^T$ |
| $db^{[1]}=dz^{[1]}$ | $db^{[1]}=\text{np.sum}(dZ^{[1]},\ \text{axis}=1,\ \text{keepdims}=\text{True})$ |


| Forward Propagation | Backward Propagation |
| --- | --- |
| $Z^{[1]}=W^{[1]}X + b^{[1]}$ | $dZ^{[2]}=\frac{1}{m}(A^{[2]}-Y)$ |
| $A^{[1]}=g^{[1]}(Z^{[1]})$ | $dW^{[2]}= dZ^{[2]} (A^{[1]})^T$ |
| | $db^{[2]}=\text{np.sum}(dZ^{[2]},\ \text{axis}=1,\ \text{keepdims}=\text{True})$ |
| $Z^{[2]}=W^{[2]}A^{[1]}+b^{[2]}$ | $dZ^{[1]}=(W^{[2]})^T dZ^{[2]} \circ g^{[1]'}(Z^{[1]})$ (elementwise product) |
| $A^{[2]}=g^{[2]}(Z^{[2]})=\sigma(Z^{[2]})$ | $dW^{[1]}= dZ^{[1]} X^T$ |
| | $db^{[1]}=\text{np.sum}(dZ^{[1]},\ \text{axis}=1,\ \text{keepdims}=\text{True})$ |
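
The forward and backward passes above translate almost line for line into NumPy. The sketch below assumes a tanh hidden layer ($g^{[1]}=\tanh$), small random initial weights scaled by 0.01, and a dictionary layout for the parameters; all three are illustrative choices, not something the notes prescribe.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def init_params(n0, n1, n2=1, seed=0):
    # Random weights break the symmetry between hidden units; biases can start at zero.
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((n1, n0)) * 0.01,
        "b1": np.zeros((n1, 1)),
        "W2": rng.standard_normal((n2, n1)) * 0.01,
        "b2": np.zeros((n2, 1)),
    }

def forward(X, p):
    Z1 = p["W1"] @ X + p["b1"]
    A1 = np.tanh(Z1)                 # hidden activation (assumed tanh)
    Z2 = p["W2"] @ A1 + p["b2"]
    A2 = sigmoid(Z2)                 # output activation
    return A1, A2

def backward(X, Y, p, A1, A2):
    m = X.shape[1]
    dZ2 = (A2 - Y) / m
    dW2 = dZ2 @ A1.T
    db2 = np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = (p["W2"].T @ dZ2) * (1 - A1 ** 2)   # tanh'(Z1) = 1 - A1^2, elementwise
    dW1 = dZ1 @ X.T
    db1 = np.sum(dZ1, axis=1, keepdims=True)
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

# Hypothetical data: 3 features, 5 examples
X = np.random.default_rng(1).standard_normal((3, 5))
Y = np.array([[0, 1, 1, 0, 1]], dtype=float)
params = init_params(n0=3, n1=4)
A1, A2 = forward(X, params)
grads = backward(X, Y, params, A1, A2)
```

Each gradient has the same shape as the parameter it belongs to, which is a cheap consistency check before running gradient descent.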

Random Initialization

Planar Data Classification - Python Code

Deep Neural Network


[Figure: shallow and deep networks]

Forward Propagation

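Layer $\ell$: $z^{[\ell]} = W^{[\ell]}a^{[\ell-1]} + b^{[\ell]}$, $a^{[\ell]} = g^{[\ell]}(z^{[\ell]})$

(Vectorized) layer $\ell$: $Z^{[\ell]} = W^{[\ell]}A^{[\ell-1]} + b^{[\ell]}$, $A^{[\ell]} = g^{[\ell]}(Z^{[\ell]})$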

Check Dimensions

Forward and Backward Functions

layer $\ell$

$$ \begin{aligned}
dz^{[\ell]} &= da^{[\ell]} * (g^{[\ell]})'(z^{[\ell]}) \text{ (elementwise product)}\\
dW^{[\ell]} &= dz^{[\ell]}a^{[\ell-1]T}\\
db^{[\ell]} &= dz^{[\ell]}\\
da^{[\ell-1]} &= W^{[\ell]T}dz^{[\ell]} \Rightarrow dz^{[\ell]} =W^{[\ell+1]T}dz^{[\ell+1]} * (g^{[\ell]})'(z^{[\ell]})
\end{aligned} $$

$$ \begin{aligned}
\text{Vectorized}: dZ^{[\ell]} &= dA^{[\ell]} * (g^{[\ell]})'(Z^{[\ell]}) \text{ (elementwise product)}\\
dW^{[\ell]} &= dZ^{[\ell]}A^{[\ell-1]T}\\
db^{[\ell]} &= \text{np.sum}(dZ^{[\ell]},\ \text{axis}=1,\ \text{keepdims}=\text{True})\\
dA^{[\ell-1]} &= W^{[\ell]T}dZ^{[\ell]}
\end{aligned} $$
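
A minimal sketch of the vectorized forward and backward functions for a single layer $\ell$, assuming a ReLU activation. The names `layer_forward` and `layer_backward` are hypothetical, and any $\frac{1}{m}$ factor is assumed to be already folded into the incoming `dA`.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def relu_prime(Z):
    return (Z >= 0).astype(float)

def layer_forward(A_prev, W, b, g=relu):
    """Input A^[l-1]; output A^[l], plus Z^[l] cached for the backward pass."""
    Z = W @ A_prev + b
    return g(Z), Z

def layer_backward(dA, Z, A_prev, W, g_prime=relu_prime):
    """Input dA^[l]; output dA^[l-1], dW^[l], db^[l]."""
    dZ = dA * g_prime(Z)                       # elementwise product
    dW = dZ @ A_prev.T
    db = np.sum(dZ, axis=1, keepdims=True)
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

# Hypothetical layer with 3 inputs, 2 units, and 4 examples
rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 4))
W = rng.standard_normal((2, 3))
b = rng.standard_normal((2, 1))

A, Z = layer_forward(A_prev, W, b)
dA = rng.standard_normal(A.shape)              # stand-in for the upstream gradient
dA_prev, dW, db = layer_backward(dA, Z, A_prev, W)
```

Chaining `layer_forward` from layer 1 to $L$ and `layer_backward` from $L$ back to 1, with the cached `Z` and `A_prev` of each layer, gives the full forward and backward passes.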

[Figure: input and output]

Parameters and Hyperparameters

Build Deep Neural Network - Image Classification (cat/non-cat) - Python Code

Practical Aspects of Deep Learning


L2 Regularization

$$ J(W^{[1]},b^{[1]},\ldots,W^{[L]},b^{[L]})=\frac{1}{m}\sum_{i=1}^m\mathcal{L}(\hat{y}^{(i)}, y^{(i)})+\frac{\lambda}{2m}\sum_{\ell=1}^L\Vert W^{[\ell]}\Vert_F^2 $$

where $\Vert\cdot\Vert_F$ is the Frobenius norm.
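
As a sketch of how the penalty enters the computation: the regularized cost adds $\frac{\lambda}{2m}\sum_\ell\Vert W^{[\ell]}\Vert_F^2$ to the data term, and differentiating that penalty contributes $\frac{\lambda}{m}W^{[\ell]}$ to each $dW^{[\ell]}$. The function names and the numbers in the usage lines are illustrative assumptions.

```python
import numpy as np

def l2_regularized_cost(data_cost, weights, lambd, m):
    """data_cost: unregularized cost; weights: list of W^[l] matrices."""
    penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return data_cost + penalty

def l2_gradient_term(W, lambd, m):
    """Extra term added to dW^[l]: derivative of the penalty with respect to W^[l]."""
    return (lambd / m) * W

# Hypothetical values: squared Frobenius norms are 6 and 8, so the penalty is 0.5/14 * 14 = 0.5
weights = [np.ones((2, 3)), 2 * np.ones((1, 2))]
cost = l2_regularized_cost(data_cost=1.0, weights=weights, lambd=0.5, m=7)
```

Because the extra gradient term is proportional to $W^{[\ell]}$, each update shrinks the weights slightly, which is why L2 regularization is also called weight decay.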


Dropout Regularization

Illustrate with layer $\ell=3$

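keepProb = 0.8 # probability that a unit will be kept
d3 = np.random.rand(*a3.shape) < keepProb # boolean mask; rand takes the dimensions as separate arguments
a3 *= d3 # elementwise product, zeroes the dropped units
a3 /= keepProb # inverted dropout: ensure the expected value of a3 remains the same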

Other Regularization Methods


  1. optimize cost function: gradient descent, …
  2. not overfit: regularization, …

[Figure: early stopping]

Speed Up Training


$z=w_1x_1 +w_2x_2 +\cdots + w_nx_n$ (omit bias)

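set $\mathrm{Var}(w_i)=\frac{1}{n}$ ($n$ = number of input nodes), so that $\mathrm{Var}(z)$ does not grow with $n$ for zero-mean, unit-variance inputs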

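# default: Var(w) = 1/node_in
W = np.random.randn(node_in, node_out) / np.sqrt(node_in)
# relu activation (He initialization): Var(w) = 2/node_in
# W = np.random.randn(node_in, node_out) / np.sqrt(node_in/2)
# tanh activation (Xavier initialization): Var(w) = 1/node_in
# W = np.random.randn(node_in, node_out) / np.sqrt(node_in)
# tanh activation, alternative: Var(w) = 2/(node_in + node_out)
# W = np.random.randn(node_in, node_out) / np.sqrt((node_in + node_out)/2)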

Debugging of Backpropagation: gradient checking

for each $i$: $$ d\theta_{approx}[i]:=\frac{J(\theta_1,\theta_2,\ldots,\theta_i+\epsilon,\ldots)-J(\theta_1,\theta_2,\ldots,\theta_i-\epsilon,\ldots)}{2\epsilon}\approx \frac{\partial J}{\partial \theta_i} $$

check $$ \frac{\Vert d\theta_{approx}-d\theta\Vert_2}{\Vert d\theta_{approx}\Vert_2+\Vert d\theta\Vert_2} \approx 0, \quad \text{e.g. } < 10^{-7} $$
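
The check above can be sketched as follows, assuming all parameters have been flattened into one 1-D vector `theta`. The function `gradient_check` and the toy cost `J` are hypothetical; in practice `J` would be the network's cost and `grad_J` the result of backpropagation.

```python
import numpy as np

def gradient_check(J, grad_J, theta, eps=1e-7):
    """Relative difference between the analytic and the two-sided numerical gradient."""
    theta = theta.astype(float)
    d_theta = grad_J(theta)
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus = theta.copy()
        plus[i] += eps
        minus = theta.copy()
        minus[i] -= eps
        d_theta_approx[i] = (J(plus) - J(minus)) / (2 * eps)
    numerator = np.linalg.norm(d_theta_approx - d_theta)
    denominator = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return numerator / denominator

# Toy cost with a known gradient (illustrative only)
def J(theta):
    return np.sum(theta ** 2) + np.sin(theta[0])

def grad_J(theta):
    g = 2 * theta
    g[0] += np.cos(theta[0])
    return g

def buggy_grad_J(theta):
    return 2 * theta            # forgets the derivative of the sin term

theta = np.array([0.5, -1.0, 2.0])
ok = gradient_check(J, grad_J, theta)
bad = gradient_check(J, buggy_grad_J, theta)
```

A correct gradient gives a tiny relative difference, while the deliberately broken one gives a value several orders of magnitude larger, which is the signal to look for a bug in backpropagation.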

Python Code



Gradient Checking

Optimization Algorithms

Mini-Batch Gradient Descent

[Figure: mini-batch]

[Figure: mini-batch cost]

[Figure: stochastic gradient descent]


Exponentially Weighted Averages

$$ v_0=0, v_t = \beta v_{t-1}+(1-\beta)\theta_t $$

$$ \Rightarrow v_t = \beta^tv_0 +(1-\beta)\left[ \theta_t + \beta\theta_{t-1} +\ldots + \beta^{t-1}\theta_1 \right]=(1-\beta)\sum_{i=0}^{t-1}\beta^i\theta_{t-i} $$

Since $$ \beta^{\frac{1}{1-\beta}}=\left(1-(1-\beta)\right)^{\frac{1}{1-\beta}}\approx \frac{1}{e}, $$ $v_t\approx$ average of the last $\frac{1}{1-\beta}$ terms of $\theta$’s



get next $\theta_t$

$v_{\theta}:=\beta v_{\theta} + (1-\beta)\theta_t$


Bias Correction in Exponentially Weighted Averages

Take $v_t := v_t/(1-\beta^t)$
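
A short sketch of the running average with and without bias correction. The function name `ewa` and the constant test series are illustrative assumptions.

```python
import numpy as np

def ewa(thetas, beta=0.9, bias_correction=True):
    """Exponentially weighted average of a sequence, starting from v_0 = 0."""
    v = 0.0
    out = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return np.array(out)

# A constant series makes the start-up bias easy to see
thetas = np.full(10, 5.0)
corrected = ewa(thetas, beta=0.9, bias_correction=True)
uncorrected = ewa(thetas, beta=0.9, bias_correction=False)
```

For the constant series, the uncorrected average starts at $(1-\beta)\cdot 5 = 0.5$ and only slowly approaches 5, while the corrected average equals 5 from the first step.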

[Figure: bias correction]

Gradient Descent With Momentum

on iteration $t$



  1. gradient descent
  2. with momentum (small $\beta$)
  3. with momentum (large $\beta$)
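
A minimal sketch of one momentum update, assuming the common form in which the velocity is an exponentially weighted average of the gradients: $v_{dw} := \beta v_{dw} + (1-\beta)\,dw$, then $w := w - \alpha v_{dw}$. The function name `momentum_step`, the toy quadratic objective, and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def momentum_step(w, dw, v_dw, alpha=0.01, beta=0.9):
    """One gradient descent step with momentum; returns the new w and velocity."""
    v_dw = beta * v_dw + (1 - beta) * dw
    w = w - alpha * v_dw
    return w, v_dw

# Toy objective J(w) = w_1^2 + 10 w_2^2, an elongated bowl (illustrative only)
def grad(w):
    return np.array([2.0, 20.0]) * w

w = np.array([1.0, 1.0])
v_dw = np.zeros_like(w)
for _ in range(200):
    w, v_dw = momentum_step(w, grad(w), v_dw, alpha=0.05, beta=0.9)
```

The same update applies to $b$ with its own velocity $v_{db}$.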

Root Mean Square Prop (RMS Prop)

on iteration $t$
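
A minimal sketch of one RMSprop update, assuming the usual form: keep an exponentially weighted average of the squared gradients, $s_{dw} := \beta s_{dw} + (1-\beta)\,dw^2$, and divide the step by its square root. The function name `rmsprop_step`, the small constant `eps`, and the numbers in the usage lines are illustrative assumptions.

```python
import numpy as np

def rmsprop_step(w, dw, s_dw, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSprop step; dw ** 2 is an elementwise square."""
    s_dw = beta * s_dw + (1 - beta) * dw ** 2
    w = w - alpha * dw / (np.sqrt(s_dw) + eps)   # eps avoids division by zero
    return w, s_dw

# Two coordinates whose gradients differ by a factor of 100 (illustrative only)
w_old = np.array([1.0, 1.0])
dw = np.array([2.0, 200.0])
w_new, s_dw = rmsprop_step(w_old, dw, np.zeros_like(w_old))
step = w_old - w_new
```

Both coordinates move by about the same amount even though their gradients differ by two orders of magnitude, which is what damps oscillations along steep directions.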

Adam Optimization Algorithm

$v_{dw}=0, s_{dw}=0,v_{db}=0,s_{db}=0$

on iteration $t$
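
A minimal sketch of one Adam update for $w$, assuming the usual combination of momentum ($v_{dw}$), RMSprop ($s_{dw}$), and bias correction. The function name `adam_step` and the default values $\alpha=0.001$, $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$ are commonly used defaults and are stated here as assumptions.

```python
import numpy as np

def adam_step(w, dw, v_dw, s_dw, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on iteration t (t starts at 1)."""
    v_dw = beta1 * v_dw + (1 - beta1) * dw           # momentum term
    s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2      # RMSprop term
    v_hat = v_dw / (1 - beta1 ** t)                  # bias correction
    s_hat = s_dw / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v_dw, s_dw

# First iteration, starting from v_dw = 0 and s_dw = 0 (illustrative values)
w0 = np.array([1.0, -2.0])
dw = np.array([2.0, -40.0])
w1, v_dw, s_dw = adam_step(w0, dw, np.zeros_like(w0), np.zeros_like(w0), t=1)
```

On the first iteration the bias-corrected averages equal $dw$ and $dw^2$, so each coordinate moves by roughly $\alpha$ against the sign of its gradient. The same update applies to $b$ with $v_{db}$ and $s_{db}$.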

Learning Rate Decay

$$ \alpha =\frac{1}{1+\text{decay-rate}\times \text{epoch-num}}\alpha_0 $$
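
The decay schedule above is a one-liner in code. The function name and the values $\alpha_0=0.2$, decay rate $1$ are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)"""
    return alpha0 / (1 + decay_rate * epoch_num)

# Learning rate over the first four epochs (epochs counted from 0)
schedule = [decayed_learning_rate(0.2, 1.0, epoch) for epoch in range(4)]
```

With these values the learning rate halves after the first epoch and keeps shrinking more slowly afterwards.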

Local Optima in Neural Networks



Python Code - Optimization Algorithms

Tuning Process


Hyperparameters Tuning in Practice


Batch Normalization

[Figures: batch norm]

for $t=1,2,\ldots, $ numMiniBatch

Multiclass Classification

Softmax Regression

$$ \text{activation: } t = e^{z^{[L]}},\ a^{[L]}=\frac{e^{z^{[L]}}}{\sum_{j=1}^C t_j} $$

$$ \mathcal{L}(y,\hat{y})=-\sum_{j=1}^Cy_j\log(\hat{y}_j) $$

$$ J=\frac{1}{m}\sum_{i=1}^m\mathcal{L}(y^{(i)},\hat{y}^{(i)}) $$

$$ dz^{[L]}:=\frac{\partial \mathcal{L}}{\partial z^{[L]}}=\hat{y}-y \quad \text{(single example)} $$
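
A minimal sketch of the softmax activation, the cross-entropy cost, and the output-layer gradient for $C$ classes, with one example per column. Subtracting the column maximum before exponentiating is a standard numerical-stability step that is not in the notes; it leaves the result unchanged. The function names are illustrative assumptions.

```python
import numpy as np

def softmax(Z):
    """Z: (C, m) logits; returns (C, m) probabilities, each column summing to 1."""
    T = np.exp(Z - np.max(Z, axis=0, keepdims=True))   # shift for stability (assumption)
    return T / np.sum(T, axis=0, keepdims=True)

def cross_entropy_cost(Y, Y_hat):
    """Y: (C, m) one-hot labels; returns the average loss J over the m examples."""
    return -np.mean(np.sum(Y * np.log(Y_hat), axis=0))

def output_gradient(Y, Y_hat):
    """Per-example gradient of the loss with respect to z^[L], one column per example."""
    return Y_hat - Y

# Hypothetical case: C = 3 classes, m = 2 examples, all logits equal
Z = np.zeros((3, 2))
Y = np.array([[1, 0], [0, 0], [0, 1]], dtype=float)
Y_hat = softmax(Z)
J = cross_entropy_cost(Y, Y_hat)
dZ = output_gradient(Y, Y_hat)
```

With equal logits every class gets probability $\frac{1}{3}$, so the cost equals $\log 3$.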

Deep Learning Frameworks

Choosing Deep Learning Frameworks

Signs Recognition with Tensorflow

ML Strategy

Single Number Evaluation Metric

Satisficing and Optimizing Metric

Train/Dev/Test Distributions

Size of the Dev and Test Sets

When to Change Dev/Test Sets and Metrics




Comparing to Human-Level Performance



Error Analysis

[Figures: error analysis]

Cleaning Up Incorrectly Labeled Data

[Figure: error analysis]

Correcting incorrect dev/test set examples

Mismatched Training and Dev/Test Set



  1. training error - human-level error = avoidable bias
  2. training-dev error - training error = variance
  3. dev error - training-dev error = data mismatch
  4. test error - dev error = degree of overfitting to dev set (remedy: use a larger dev set)
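
The four gaps can be read off with simple arithmetic, with data mismatch measured as dev error minus training-dev error. The sketch below uses made-up error rates; the function name `error_breakdown` and all numbers are hypothetical.

```python
def error_breakdown(human, train, train_dev, dev, test):
    """Return the four gaps between consecutive error rates."""
    return {
        "avoidable bias": train - human,
        "variance": train_dev - train,
        "data mismatch": dev - train_dev,
        "overfitting to dev set": test - dev,
    }

# Hypothetical error rates, in percent
gaps = error_breakdown(human=4.0, train=7.0, train_dev=11.0, dev=12.0, test=12.5)
largest = max(gaps, key=gaps.get)
```

Here the largest gap is the variance (4 points), which suggests working on variance first, for example with regularization or more training data.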


Addressing Data Mismatch

Transfer Learning

[Figure: transfer learning]

When transfer learning makes sense (Task A $\rightarrow$ Task B)

Multi-task Learning

[Figure: multi-task learning]

When multi-task learning makes sense

End-to-End Deep Learning

pros of end-to-end deep learning


[Figure: end-to-end learning]