Sequence Models

2018/03/20

Categories: neuralNetworks deepLearning rnn lstm nlp

Examples of Sequence Data


Recurrent Neural Networks

Forward Propagation

forward propagation

$$ a^{t} = g(W_{aa}a^{t-1} + W_{ax}x^{t} + b_a)\\
\hat{y}^{t}=g(W_{ya}a^{t}+b_y) $$

$$ a^{t} = g(W_a[a^{t-1},x^{t}] + b_a), \text{where} $$

$$ \begin{cases} W_a =[W_{aa}, W_{ax}]\newline [a^{t-1},x^{t}] =\begin{pmatrix} a^{t-1}\newline x^{t} \end{pmatrix} \end{cases}\tag{1} $$

$$ \hat{y}^{t}=g(W_ya^{t}+b_y)\tag{2} $$
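As a quick illustration of equations (1) and (2), here is a minimal NumPy sketch of a single forward step; the function name `rnn_cell_forward` and the shapes (one training example, column vectors) are illustrative assumptions rather than the notebook's exact interface.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(x_t, a_prev, Waa, Wax, Wya, ba, by):
    """One forward step of a vanilla RNN, following equations (1)-(2)."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)   # a<t> = g(Waa a<t-1> + Wax x<t> + ba)
    y_hat_t = softmax(Wya @ a_t + by)              # y_hat<t> = g(Wya a<t> + by)
    return a_t, y_hat_t

# toy dimensions: n_x inputs, n_a hidden units, n_y outputs, single example
n_x, n_a, n_y = 3, 5, 2
x_t, a_prev = np.random.randn(n_x, 1), np.zeros((n_a, 1))
Waa, Wax, Wya = np.random.randn(n_a, n_a), np.random.randn(n_a, n_x), np.random.randn(n_y, n_a)
ba, by = np.zeros((n_a, 1)), np.zeros((n_y, 1))
a_t, y_hat_t = rnn_cell_forward(x_t, a_prev, Waa, Wax, Wya, ba, by)
```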


Back Propagation Through Time

back propagation

$$ \mathcal{L}^t(\hat{y}^t,y^{t})=-y^t\log \hat{y}^t-(1-y^t)\log(1-\hat{y}^t) $$

$$ \mathcal{L}(\hat{y},y)=\sum_{t=1}^{T_y}\mathcal{L}^t(\hat{y}^t,y^{t})\tag{3} $$

Derivative of the softmax loss: $$ p = \text{softmax}(z),\ \ \mathcal{L}=-\sum_{k=1}^Ky_k\log p_k $$

$$ \begin{aligned} \frac{\partial p_k}{\partial z_k}&=\frac{\partial}{\partial z_k}\frac{e^{z_k}}{\sum_j e^{z_j}}\newline &=\frac{e^{z_k}}{\sum_j e^{z_j}}-\left(\frac{e^{z_k}}{\sum_j e^{z_j}}\right)^2=p_k(1-p_k), \end{aligned} $$

$$ \begin{aligned} \forall j\neq k,\ \frac{\partial p_j}{\partial z_k}&=\frac{\partial}{\partial z_k}\frac{e^{z_j}}{\sum_i e^{z_i}}\newline &=-\frac{e^{z_k}e^{z_j}}{(\sum_i e^{z_i})^2}=-p_kp_j \end{aligned} $$

$$ \begin{aligned} \Rightarrow \frac{\partial\mathcal{L}}{\partial{z_k}}&=-\frac{y_k}{p_k}\frac{\partial p_k}{\partial z_k}-\sum_{j\neq k}\frac{y_j}{p_j}\frac{\partial p_j}{\partial z_k}\newline &=-y_k(1-p_k)+\sum_{j\neq k}y_jp_k\newline &=-y_k(1-p_k)+p_k(1-y_k)=p_k-y_k, \end{aligned} $$

$$ \Rightarrow\frac{\partial \mathcal{L}}{\partial z}=p-y $$
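The compact result $\partial\mathcal{L}/\partial z = p - y$ is easy to verify numerically; the snippet below is an illustrative check, with `softmax_loss_grad` a hypothetical helper name.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_loss_grad(z, y):
    """Cross-entropy loss with a softmax output and its gradient dL/dz = p - y."""
    p = softmax(z)
    return -np.sum(y * np.log(p)), p - y

# numerical check of dL/dz against forward differences
z = np.random.randn(4)
y = np.eye(4)[1]                                  # one-hot label for class 1
loss, grad = softmax_loss_grad(z, y)
eps = 1e-6
numeric = np.array([(softmax_loss_grad(z + eps * np.eye(4)[k], y)[0] - loss) / eps
                    for k in range(4)])
assert np.allclose(grad, numeric, atol=1e-4)
```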


backpropagation: single training example

$$ \begin{aligned} \frac{\partial J}{\partial a^t}&= \sum_{u\geq t}\frac{\partial\mathcal{L}(u)}{\partial a^t}\newline &=\sum_{u\geq t+1}\frac{\partial\mathcal{L}(u)}{\partial a^t}+\frac{\partial\mathcal{L}(t)}{\partial a^t}\newline &=\sum_{u\geq t+1}\frac{\partial a^{ t+1 }}{\partial a^{ t }}\frac{\partial\mathcal{L}(u)}{\partial a^{t+1}} +\frac{\partial\mathcal{L}(t)}{\partial a^{ t }}\newline &=\frac{\partial a^{ t+1 }}{\partial a^{ t }}\frac{\partial J}{\partial a^{t+1}} +\frac{\partial\mathcal{L}(t)}{\partial a^{t}} \end{aligned} $$

$$ z^{t+1}:=W_{aa}a^t + W_{ax}x^{t+1}+b_a, $$

$$ \begin{aligned} \frac{\partial a^{ t+1 }}{\partial a^{ t }}\frac{\partial J}{\partial a^{t+1}} &=\frac{\partial z^{t+1}}{\partial a^t}\frac{\partial a^{ t+1 }}{\partial z^{ t+1 }}\frac{\partial J}{\partial a^{t+1}}\newline &=W_{aa}^T \operatorname{diag}\{\tanh'(z^{t+1})_1,\ldots,\tanh'(z^{t+1})_{n_a}\} \frac{\partial J}{\partial a^{t+1}}\newline &=W_{aa}^T\left(\tanh'(z^{t+1})* \frac{\partial J}{\partial a^{t+1}}\right)\newline &\text{ (* stands for element-wise multiplication)} \end{aligned} $$

$$ \begin{aligned} \frac{\partial J}{\partial W_{aa}(i,j)}&=\sum_{t=0}^{T_x-1}\frac{\partial J}{\partial a^{t+1}_i} \frac{\partial a^{t+1}_i}{\partial z^{t+1}_i} \frac{\partial z^{t+1}_i}{\partial W_{aa}(i,j)}\newline &= \sum_{t=0}^{T_x-1}\frac{\partial J}{\partial a^{t+1}_i} \tanh'(z^{t+1}_i) a^t_j \end{aligned} $$

$$ \Rightarrow \frac{\partial J}{\partial W_{aa}}=\sum_{t=0}^{T_x-1}\left(\tanh'(z^{t+1})* \frac{\partial J}{\partial a^{t+1}}\right)(a^t)^T $$

Similarly, $$ \frac{\partial J}{\partial x^{t+1}}=W_{ax}^T\left(\tanh'(z^{t+1})* \frac{\partial J}{\partial a^{t+1}}\right) $$

$$ \frac{\partial J}{\partial W_{ax}}=\sum_{t=0}^{T_x-1}\left(\tanh'(z^{t+1})* \frac{\partial J}{\partial a^{t+1}}\right)(x^{t+1})^T $$

$$ \begin{aligned} \frac{\partial J}{\partial b_a}&=\sum_{t=0}^{T_x-1}\frac{\partial z^{t+1}}{\partial b_a} \frac{\partial a^{t+1}}{\partial z^{t+1}} \frac{\partial J}{\partial a^{t+1}}\newline &=\sum_{t=0}^{T_x-1}I_{n_a} \operatorname{diag}\{\tanh'(z^{t+1})_1,\ldots,\tanh'(z^{t+1})_{n_a}\} \frac{\partial J}{\partial a^{t+1}}\newline &=\sum_{t=0}^{T_x-1} \tanh'(z^{t+1})* \frac{\partial J}{\partial a^{t+1}} \end{aligned} $$
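Putting these pieces together, one backward step through a single RNN cell can be sketched as follows (single example, column vectors; `da_next` stands for $\partial J/\partial a^{t+1}$, and the function name and signature are assumptions for illustration). The per-step contributions to $\partial J/\partial W_{aa}$, $\partial J/\partial W_{ax}$ and $\partial J/\partial b_a$ are summed over $t$ by the caller, matching the sums above.

```python
import numpy as np

def rnn_cell_backward(da_next, a_prev, x_t, z_next, Waa, Wax):
    """Backward pass through one RNN step, given dJ/da<t+1>."""
    dtanh = (1 - np.tanh(z_next) ** 2) * da_next   # tanh'(z<t+1>) * dJ/da<t+1>, element-wise
    dWaa = dtanh @ a_prev.T                        # this step's contribution to dJ/dWaa
    dWax = dtanh @ x_t.T                           # contribution to dJ/dWax
    dba = dtanh                                    # contribution to dJ/dba
    da_prev = Waa.T @ dtanh                        # dJ/da<t> term propagated back through time
    dx_t = Wax.T @ dtanh                           # dJ/dx<t+1>
    return da_prev, dx_t, dWaa, dWax, dba
```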

RNN Step by Step - Python Code

jupyter notebook


Different Types of RNNs

rnns


Language Model and Sequence Generation

Given a sentence represented as a sequence of tokens $$ y^1,y^2,\ldots,y^{T_y}, $$

a language model estimates the probability $$ \Pr(y^1,y^2,\ldots,y^{T_y}) $$

language model

Sampling Novel Sequences

sample


Character Level Language Model - Python Code

jupyter notebook


Vanishing Gradients


Gated Recurrent Unit (GRU)

$$ \tilde{c}^t=\tanh(W_c[c^{t-1},x^t]+b_c),\\
\Gamma_u=\sigma(W_u[c^{t-1},x^t]+b_u),\ \ \text{'u' for 'update': decides when to update } c^t,\\
c^t = \Gamma_u * \tilde{c}^t +(1-\Gamma_u) *c^{t-1},\ \ \text{where } * \text{ denotes element-wise multiplication} $$

gru

$$ \tilde{c}^t=\tanh(W_c[\Gamma_r *c^{t-1},x^t]+b_c),\ \ \text{'r' for 'relevant'},\\
\Gamma_u=\sigma(W_u[c^{t-1},x^t]+b_u),\\
\Gamma_r=\sigma(W_r[c^{t-1},x^t]+b_r),\\
c^t = \Gamma_u * \tilde{c}^t +(1-\Gamma_u) *c^{t-1} $$
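A minimal NumPy sketch of one forward step of the full GRU above (single example, column vectors; the function name and argument layout are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_cell_forward(x_t, c_prev, Wc, Wu, Wr, bc, bu, br):
    """One forward step of a (full) GRU; weights act on [c<t-1>, x<t>] stacked."""
    concat = np.concatenate([c_prev, x_t], axis=0)
    gamma_u = sigmoid(Wu @ concat + bu)                  # update gate
    gamma_r = sigmoid(Wr @ concat + br)                  # relevance gate
    concat_r = np.concatenate([gamma_r * c_prev, x_t], axis=0)
    c_tilde = np.tanh(Wc @ concat_r + bc)                # candidate memory
    c_t = gamma_u * c_tilde + (1 - gamma_u) * c_prev     # gated update
    return c_t
```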


Long Short-Term Memory (LSTM)

lstm
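As a companion to the GRU, here is a minimal sketch of one LSTM forward step under the standard formulation with forget, update and output gates acting on $[a^{t-1}, x^t]$; the function name and argument layout are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, Wf, Wu, Wc, Wo, bf, bu, bc, bo):
    """One LSTM step with forget/update/output gates acting on [a<t-1>, x<t>]."""
    concat = np.concatenate([a_prev, x_t], axis=0)
    gamma_f = sigmoid(Wf @ concat + bf)         # forget gate
    gamma_u = sigmoid(Wu @ concat + bu)         # update gate
    c_tilde = np.tanh(Wc @ concat + bc)         # candidate cell state
    c_t = gamma_f * c_prev + gamma_u * c_tilde  # new cell state
    gamma_o = sigmoid(Wo @ concat + bo)         # output gate
    a_t = gamma_o * np.tanh(c_t)                # new hidden state
    return a_t, c_t
```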


LSTM Step by Step - Python Code

jupyter notebook


Text Generation: Writing Like Shakespeare - Python Code

jupyter notebook


Bidirectional RNNs (BRNN)

brnn


Deep RNNs

For layer $\ell$, $$ a^{[\ell]t}=g(W_a^{[\ell]}[a^{[\ell]t-1}, a^{[\ell-1]t}] +b_a^{[\ell]}) $$

deep rnn
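A sketch of how the layers stack at a single time step: layer $\ell$ receives its own previous activation $a^{[\ell]t-1}$ together with the current activation of the layer below, $a^{[\ell-1]t}$, with the input $x^t$ playing the role of layer 0; names and shapes are illustrative.

```python
import numpy as np

def deep_rnn_step(x_t, a_prev_layers, Wa_layers, ba_layers):
    """One time step of a deep RNN: a[l]<t> = tanh(Wa[l] [a[l]<t-1>, a[l-1]<t>] + ba[l])."""
    a_t_layers = []
    bottom_up = x_t                              # "layer 0" activation is the input x<t>
    for a_prev, Wa, ba in zip(a_prev_layers, Wa_layers, ba_layers):
        concat = np.concatenate([a_prev, bottom_up], axis=0)
        a_t = np.tanh(Wa @ concat + ba)
        a_t_layers.append(a_t)
        bottom_up = a_t                          # this layer feeds the one above it
    return a_t_layers
```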


NLP and Word Embedding

Word Representation

word embedding

tsne


Using Word Embeddings

named entity recognition

face encoding


Properties of Word Embeddings

$$ \arg\max_w \text{sim}(e_w, e_{king}-e_{man}+e_{woman}) $$

analogy

analogy

$$ \text{sim}(u,v)=\frac{u^Tv}{\Vert u\Vert_2 \Vert v\Vert_2} $$
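A sketch of the analogy search using cosine similarity; `word_to_vec` is a hypothetical dict mapping words to their embedding vectors.

```python
import numpy as np

def cosine_similarity(u, v):
    """sim(u, v) = u.v / (||u||_2 ||v||_2)."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(word_a, word_b, word_c, word_to_vec):
    """Solve 'a is to b as c is to ?', e.g. man:woman :: king:? -> e_king - e_man + e_woman."""
    e_a, e_b, e_c = (word_to_vec[w] for w in (word_a, word_b, word_c))
    target = e_b - e_a + e_c
    best_word, best_sim = None, -np.inf
    for w, e_w in word_to_vec.items():
        if w in (word_a, word_b, word_c):        # skip the input words themselves
            continue
        s = cosine_similarity(e_w, target)
        if s > best_sim:
            best_word, best_sim = w, s
    return best_word
```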


Embedding Matrix

$$ E\,O_w = e_w, \ \ \text{where } E \text{ is the embedding matrix and } O_w \text{ the one-hot vector of word } w $$
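In practice the matrix product is never formed explicitly: frameworks just look up the column (or row) of $E$ for word $w$. A small sketch, assuming $E$ has shape $(n_e, v)$ and $O_w$ is a one-hot column vector:

```python
import numpy as np

v, n_e = 10_000, 300                 # vocabulary size, embedding dimension (illustrative)
E = np.random.randn(n_e, v)          # embedding matrix
w = 1234                             # index of word w in the vocabulary
O_w = np.zeros((v, 1)); O_w[w] = 1   # one-hot vector

e_w_product = E @ O_w                # E O_w = e_w (the mathematical definition)
e_w_lookup = E[:, [w]]               # what is actually done: pick column w directly
assert np.allclose(e_w_product, e_w_lookup)
```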


Learning Word Embeddings

learn


Word2Vec

$$ O_c\rightarrow E\rightarrow e_c\rightarrow \text{softmax}\rightarrow \hat{y} $$

$$ \text{softmax}: \Pr(t|c)=\frac{e^{\theta_t^Te_c}}{\sum_{j=1}^{v}e^{\theta_j^Te_c}} $$

$$ \mathcal{L}(\hat{y},y)=-\sum_{i=1}^{v}y_i\log\hat{y}_i $$
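A sketch of the skip-gram forward pass and loss above; `theta` collects the softmax parameters $\theta_j$ as rows, and the shapes are illustrative assumptions.

```python
import numpy as np

def skipgram_forward(c, theta, E):
    """Skip-gram prediction: Pr(t | c) = softmax(theta_t . e_c) over the vocabulary.

    E: (n_e, v) embedding matrix, theta: (v, n_e) softmax parameters, c: context word index.
    """
    e_c = E[:, c]                        # context word embedding
    logits = theta @ e_c                 # theta_j^T e_c for every target word j
    logits -= logits.max()               # numerical stability
    y_hat = np.exp(logits) / np.exp(logits).sum()
    return y_hat                         # y_hat[t] estimates Pr(t | c)

def skipgram_loss(y_hat, t):
    """Cross-entropy loss -sum_i y_i log y_hat_i with a one-hot target t."""
    return -np.log(y_hat[t])
```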


Negative Sampling

negative sampling

$$ \Pr(y=1|c,t)=\sigma(\theta_t^Te_c) $$

$$ \Pr(w_i)=\frac{f(w_i)^{0.75}}{\sum_jf(w_j)^{0.75}}, \ \ f(w_i) = \text{ frequency of word }w_i $$
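A sketch of the heuristic sampling distribution $\Pr(w_i)\propto f(w_i)^{0.75}$ used to pick the $k$ negative words; the function names are illustrative.

```python
import numpy as np

def negative_sampling_distribution(word_freqs):
    """Pr(w_i) proportional to f(w_i)^0.75 (the heuristic in the equation above)."""
    f = np.asarray(word_freqs, dtype=float)
    p = f ** 0.75
    return p / p.sum()

def sample_negatives(probs, k, rng=np.random.default_rng()):
    """Draw k negative word indices from the smoothed unigram distribution."""
    return rng.choice(len(probs), size=k, p=probs)

# example: 5 words with unigram frequencies; draw k = 4 negatives
probs = negative_sampling_distribution([0.5, 0.2, 0.15, 0.1, 0.05])
print(sample_negatives(probs, k=4))
```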


GloVe

$$ X_{ct} = \text{# of times word }c \text{ appears in context of word }t = X_{tc} $$

$$ \min \sum_{c,t=1}^v f(X_{ct})(\theta_c^Te_t+b_c+b'_t-\log X_{ct})^2, $$

where the weighting term $f$ satisfies $f(X_{ct})=0$ for $X_{ct}=0$ (so the $0\log 0$ term drops out), gives rare word pairs a meaningful weight, and avoids over-weighting extremely frequent words such as "the", "of", "a".

$$ e_w^{(final)}:=\frac{\theta_w+e_w}{2} $$
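A sketch of the objective; the specific weighting function $f(x)=\min\left((x/x_{max})^{0.75},\,1\right)$ with $x_{max}=100$ comes from the GloVe paper rather than these notes, and the shapes ($\theta, e \in \mathbb{R}^{v\times n_e}$) are illustrative.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting term f(X_ct): 0 when X_ct = 0, capped for very frequent co-occurrences."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0) * (x > 0)

def glove_objective(X, theta, e, b, b_prime):
    """sum_{c,t} f(X_ct) (theta_c^T e_t + b_c + b'_t - log X_ct)^2."""
    logX = np.log(np.where(X > 0, X, 1.0))      # f is 0 where X_ct = 0, so that log value never matters
    residual = theta @ e.T + b[:, None] + b_prime[None, :] - logX
    return np.sum(glove_weight(X) * residual ** 2)
```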


Sentiment Classification

sentiment

sentiment


Debiasing Word Embeddings

debias


Sequence to Sequence Model

sequence


Image Captioning

caption


Machine Translation as Building a Conditional Language Model

caption

$$ \arg\max_{y^1,y^2,\ldots,y^{T_y}}\Pr(y^1,\ldots,y^{T_y}|x) $$


$$ \Pr(y^1,y^2|x)=\Pr(y^1|x)\Pr(y^2|y^1,x) $$


$$ \arg\max_y\prod_{t=1}^{T_y}\Pr(y^t|x,y^1,\ldots,y^{t-1}) $$

$$ \overset{\text{avoid numerical underflow}}{\Longrightarrow} \arg\max_y\sum_{t=1}^{T_y}\log\Pr(y^t|x,y^1,\ldots,y^{t-1}) $$

Length normalization: $$ \arg\max_y\frac{1}{T_y^{\alpha}}\sum_{t=1}^{T_y}\log\Pr(y^t|x,y^1,\ldots,y^{t-1}) $$
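A sketch of scoring completed beam-search candidates with this length-normalized objective; $\alpha\approx 0.7$ is a common heuristic value, and the candidate sentences and log-probabilities below are made up for illustration.

```python
def normalized_log_likelihood(token_log_probs, alpha=0.7):
    """Score a candidate y by (1 / T_y^alpha) * sum_t log Pr(y<t> | x, y<1..t-1>).

    token_log_probs: list of log-probabilities, one per output token.
    """
    T_y = len(token_log_probs)
    return sum(token_log_probs) / (T_y ** alpha)

# pick the best of several beam-search candidates (log-probs are illustrative)
candidates = {
    "jane visits africa in september": [-0.1, -0.4, -0.2, -0.3, -0.2],
    "jane is visiting africa in september": [-0.1, -0.5, -0.4, -0.2, -0.3, -0.2],
}
best = max(candidates, key=lambda y: normalized_log_likelihood(candidates[y]))
print(best)
```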


error


Bleu Score

$$ \text{Bleu}=\text{BP}\cdot\exp\left(\frac{1}{K}\sum_{n=1}^K \log p_n\right),\ \ p_n=\text{modified precision on }n\text{-grams (typically }K=4) $$

$$ \text{BP (brevity penalty)}:= \begin{cases} 1, &\text{if MT output length} > \text{reference output length}\newline \exp\left(1-\dfrac{\text{reference output length}}{\text{MT output length}}\right),&\text{otherwise} \end{cases} $$
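A sketch of the combined score: it computes the modified (clipped) $n$-gram precisions $p_n$, applies the brevity penalty, and combines the precisions geometrically as above. With a single reference the shortest-reference simplification below is harmless; the function names are illustrative.

```python
from collections import Counter
import numpy as np

def modified_precision(candidate, references, n):
    """Clipped n-gram precision p_n of a candidate against reference translations."""
    ngrams = lambda tokens: Counter(zip(*(tokens[i:] for i in range(n))))
    cand_counts = ngrams(candidate)
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

def bleu(candidate, references, K=4):
    """BP * exp((1/K) * sum_n log p_n) with the brevity penalty defined above."""
    c = len(candidate)
    r = min(len(ref) for ref in references)           # shortest reference length (simplification)
    bp = 1.0 if c > r else np.exp(1 - r / max(c, 1))  # brevity penalty
    precisions = [modified_precision(candidate, references, n) for n in range(1, K + 1)]
    if min(precisions) == 0:                          # guard against log(0)
        return 0.0
    return bp * np.exp(np.mean(np.log(precisions)))

print(bleu("the cat is on the mat".split(),
           ["the cat sits on the mat".split()], K=2))
```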

bleu

bleu

bleu


Attention Model Intuition

attention

attention

attention


Speech Recognition

ctc


Trigger Word Detection

trigger