ML - Neural Networks

2017/11/04

Categories: neuralNetworks Python machineLearning

Model Representation

(Figure: neural network model)

$$ a_i^{[j]} = \text{“activation” of unit $i$ in layer $j$} \\
\Theta^{[j]} = \text{matrix of weights controlling function mapping from layer $j$ to layer $j+1$} $$


Forward Propagation: Vectorized Implementation

$$ [x]=[a^{[1]}] \rightarrow [a^{[2]}] \rightarrow [a^{[3]}]\rightarrow \cdots \rightarrow [a^{[L]}]=h_{\Theta}(x) $$


Non-linear Classification Example: XNOR operator

| $x_1$ | $x_2$ | XNOR (= NOT XOR) |
|:-----:|:-----:|:----------------:|
| 0     | 0     | 1                |
| 0     | 1     | 0                |
| 1     | 0     | 0                |
| 1     | 1     | 1                |

(Figure: XNOR network)

Multiclass Classification


Cost Function

$$ h_{\Theta}(x)\in\mathbb{R}^K,\quad h_{\Theta}(x)_k = k\text{th output} $$

$$ J(\Theta)=-\frac{1}{N}\sum_{i=1}^N\sum_{k=1}^K\left[ y_{k}^{(i)}\log h_{\Theta}(x^{(i)})_k + (1-y_{k}^{(i)})\log (1-h_{\Theta}(x^{(i)})_k) \right]+ \frac{\lambda}{2N}\sum_{\ell=1}^{L-1}\sum_{i=1}^{s_\ell}\sum_{j=1}^{s_{\ell+1}}(\Theta_{ji}^{[\ell]})^2 $$

Back Propagation

To $\min_{\Theta} J(\Theta)$, we need to compute $J(\Theta)$ and the gradients $\frac{\partial J(\Theta)}{\partial \Theta^{[\ell]}_{ij}}$.

Given one training example $(x,y)$

Forward Propagation

Single training example: $$ \begin{aligned} a^{[1]}&=x \ (\text{add } a^{[1]}_0)\newline z^{[\ell]}&=\Theta^{[\ell-1]}a^{[\ell-1]},\ a^{[\ell]}=g(z^{[\ell]}) \ (\text{add } a^{[\ell]}_0),\ \ell=2,3,\ldots,L-1\newline z^{[L]}&=\Theta^{[L-1]}a^{[L-1]},\ a^{[L]}=h_{\Theta}(x)=\sigma(z^{[L]}) \end{aligned} $$


$N$ training examples: $$ X = \begin{pmatrix} x^{(1)T}\newline x^{(2)T}\newline \vdots\newline x^{(N)T} \end{pmatrix} $$

$$ \begin{aligned} A^{[1]}&=\left[\textbf{1}\ X\right] \newline Z^{[\ell]}&=A^{[\ell-1]}\Theta^{[\ell-1]T},\ A^{[\ell]}=\left[\textbf{1}\ g(Z^{[\ell]})\right],\ \ell=2,3,\ldots,L-1\newline Z^{[L]}&=A^{[L-1]}\Theta^{[L-1]T},\ A^{[L]}=h_{\Theta}(X)=\sigma(Z^{[L]}) \end{aligned} $$

Backward Propagation

Single training example:

$$ \begin{aligned} J &= -y^T\log(a^{[L]}) - (1-y)^T\log (1-a^{[L]})\newline \Rightarrow \frac{\partial J}{\partial a^{[L]}_j} &=-\frac{y_j}{a^{[L]}_j} + \frac{1-y_j}{1-a^{[L]}_j}\newline \Rightarrow \frac{\partial J}{\partial z_j^{[L]}} &= \frac{\partial J}{\partial a^{[L]}_j}\frac{d a^{[L]}_j}{dz^{[L]}_j}= \left(-\frac{y_j}{a^{[L]}_j} + \frac{1-y_j}{1-a^{[L]}_j} \right)a^{[L]}_j(1-a^{[L]}_j)=a^{[L]}_j-y_j\newline \Rightarrow \frac{\partial J}{\partial z^{[L]}}&=a^{[L]}-y \end{aligned} $$

By the chain rule for vector derivatives, $$ \begin{aligned} \delta^{[\ell]}&=\frac{\partial J}{\partial z^{[\ell]}}\newline &=\frac{\partial a^{[\ell]}}{\partial z^{[\ell]}} \frac{\partial z^{[\ell+1]}}{\partial a^{[\ell]}} \frac{\partial J}{\partial z^{[\ell+1]}}\newline &=\frac{\partial a^{[\ell]}}{\partial z^{[\ell]}} (\tilde{\Theta}^{[\ell]})^T \delta^{[\ell+1]}, \end{aligned} $$ where $\tilde{\Theta}^{[\ell]}$ denotes $\Theta^{[\ell]}$ with its first (bias) column removed, and

$$ \frac{\partial a^{[\ell]}}{\partial z^{[\ell]}}=\operatorname{diag}\{g'(z^{[\ell]}_1), g'(z^{[\ell]}_2),\ldots,g'(z^{[\ell]}_{s_\ell})\}, $$

$$ \begin{aligned} \Rightarrow \delta^{[\ell]} &=\operatorname{diag}\{g'(z^{[\ell]}_1), g'(z^{[\ell]}_2),\ldots,g'(z^{[\ell]}_{s_\ell})\} (\tilde{\Theta}^{[\ell]})^T \delta^{[\ell+1]}\newline &= (\tilde{\Theta}^{[\ell]})^T \delta^{[\ell+1]} * g'(z^{[\ell]}) \text{ (element-wise multiplication)} \end{aligned} $$

$$ \begin{aligned} \frac{\partial J}{\partial \Theta^{[\ell]}_{ij}}&=\frac{\partial J}{\partial z^{[\ell+1]}_i}\frac{\partial z^{[\ell+1]}_i}{\partial \Theta^{[\ell]}_{ij}}\newline &=\delta^{[\ell+1]}_i a^{[\ell]}_j \end{aligned} $$

$$ \Rightarrow \frac{\partial J}{\partial\Theta^{[\ell]}}=\delta^{[\ell+1]}a^{[\ell]T} $$


$N$ training examples (vectorized):

$$ D^{[\ell]}=\frac{\partial J(\Theta)}{\partial\Theta^{[\ell]}}=\frac{1}{N}\Delta^{[\ell+1]T}A^{[\ell]}+\frac{\lambda}{N}\hat{\Theta}^{[\ell]}, $$

where $\Delta^{[\ell+1]}=\begin{pmatrix}\delta^{[\ell+1](1)T}\newline \vdots\newline \delta^{[\ell+1](N)T}\end{pmatrix}$ stacks the per-example errors row by row, and $\hat{\Theta}^{[\ell]}$ is $\Theta^{[\ell]}$ with its first column set to all 0 (the bias column is not regularized).

NN for Handwritten Digits Recognition - Python Code

jupyter notebook