Examples of Sequence Data
- speech recognition
- music generation
- sentiment classification
- DNA sequence analysis
- machine translation
- video activity recognition
- named entity recognition
Recurrent Neural Networks
- Why not a standard network?
- inputs and outputs can be different lengths in different examples, e.g. sentence as input
- does not share features learned across different positions of text
Forward Propagation
- at time $t$:
$$
a^{t} = g(W_{aa}a^{t-1} + W_{ax}x^{t} + b_a)\\
\hat{y}^{t}=g(W_{ya}a^{t}+b_y)
$$
- simplified notation:
$$ a^{t} = g(W_a[a^{t-1},x^{t}] + b_a), \text{where} $$
$$ \begin{cases} W_a =[W_{aa}, W_{ax}]\newline [a^{t-1},x^{t}] =\begin{pmatrix} a^{t-1}\newline x^{t} \end{pmatrix} \end{cases}\tag{1} $$
$$ \hat{y}^{t}=g(W_ya^{t}+b_y)\tag{2} $$
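A minimal numpy sketch of one forward step, equations (1)–(2), assuming a tanh hidden activation and a softmax output; the function name and shapes are illustrative, not from the course notebook:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z, axis=0, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=0, keepdims=True)

def rnn_cell_forward(x_t, a_prev, Waa, Wax, Wya, ba, by):
    """One step: a<t> = tanh(Waa a<t-1> + Wax x<t> + ba), y_hat<t> = softmax(Wya a<t> + by)."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    y_hat_t = softmax(Wya @ a_t + by)
    return a_t, y_hat_t

# illustrative shapes: n_a hidden units, n_x input features, n_y output classes
n_a, n_x, n_y = 5, 3, 2
rng = np.random.default_rng(0)
Waa, Wax, Wya = rng.normal(size=(n_a, n_a)), rng.normal(size=(n_a, n_x)), rng.normal(size=(n_y, n_a))
ba, by = np.zeros((n_a, 1)), np.zeros((n_y, 1))
a_t, y_hat_t = rnn_cell_forward(rng.normal(size=(n_x, 1)), np.zeros((n_a, 1)), Waa, Wax, Wya, ba, by)
```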
Back Propagation Through Time
- logistic loss:
$$ \mathcal{L}^t(\hat{y}^t,y^{t})=-y^t\log \hat{y}^t-(1-y^t)\log(1-\hat{y}^t) $$
- overall loss:
$$ \mathcal{L}(\hat{y},y)=\sum_{t=1}^{T_y}\mathcal{L}^t(\hat{y}^t,y^{t})\tag{3} $$
Derivative of the softmax loss: $$ p = \text{softmax}(z),\ \ \mathcal{L}=-\sum_{k=1}^Ky_k\log p_k $$
$$ \begin{aligned} \frac{\partial p_k}{\partial z_k}&=\frac{\partial}{\partial z_k}\frac{e^{z_k}}{\sum_j e^{z_j}}\newline &=\frac{e^{z_k}}{\sum_j e^{z_j}}-\left(\frac{e^{z_k}}{\sum_j e^{z_j}}\right)^2=p_k(1-p_k), \end{aligned} $$
$$ \begin{aligned} \forall j\neq k, \frac{\partial p_j}{\partial z_k}&=\frac{\partial}{\partial z_k}\frac{e^{z_j}}{\sum_i e^{z_i}}\newline &=-\frac{e^{z_k}e^{z_j}}{(\sum_i e^{z_i})^2}=-p_kp_j \end{aligned} $$
$$ \begin{aligned} \Rightarrow \frac{\partial\mathcal{L}}{\partial{z_k}}&=-\frac{y_k}{p_k}\frac{\partial p_k}{\partial z_k}-\sum_{j\neq k}\frac{y_j}{p_j}\frac{\partial p_j}{\partial z_k}\newline &=-y_k(1-p_k)+\sum_{j\neq k}y_jp_k\newline &=-y_k(1-p_k)+p_k(1-y_k)=p_k-y_k, \end{aligned} $$
(the last line uses $\sum_{j\neq k}y_j=1-y_k$, since the labels sum to one)
$$ \Rightarrow\frac{\partial \mathcal{L}}{\partial z}=p-y $$
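A quick numerical sanity check of $\partial\mathcal{L}/\partial z=p-y$ against central finite differences (the values of $z$ and $y$ below are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, -1.2, 0.3, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0])            # one-hot label
loss = lambda z: -np.sum(y * np.log(softmax(z)))

analytic = softmax(z) - y                      # the p - y formula derived above
eps = 1e-6
numeric = np.array([(loss(z + eps * np.eye(4)[k]) - loss(z - eps * np.eye(4)[k])) / (2 * eps)
                    for k in range(4)])
assert np.allclose(analytic, numeric, atol=1e-5)
```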
backpropagation: single training example
$\mathcal{L}(t):=\mathcal{L}^t(\hat{y}^t,y^{t}),\ \ J:=\mathcal{L}(\hat{y},y)=\sum_{t=1}^{T_y}\mathcal{L}(t)$
$\forall t<T_x, \frac{\partial J}{\partial a^t}=\frac{\partial a^{t+1}}{\partial a^t}\frac{\partial J}{\partial a^{t+1}}+\frac{\partial \mathcal{L}(t)}{\partial a^t}$
$$ \begin{aligned} \frac{\partial J}{\partial a^t}&= \sum_{u\geq t}\frac{\partial\mathcal{L}(u)}{\partial a^t}\newline &=\sum_{u\geq t+1}\frac{\partial\mathcal{L}(u)}{\partial a^t}+\frac{\partial\mathcal{L}(t)}{\partial a^t}\newline &=\sum_{u\geq t+1}\frac{\partial a^{ t+1 }}{\partial a^{ t }}\frac{\partial\mathcal{L}(u)}{\partial a^{t+1}} +\frac{\partial\mathcal{L}(t)}{\partial a^{ t }}\newline &=\frac{\partial a^{ t+1 }}{\partial a^{ t }}\frac{\partial J}{\partial a^{t+1}} +\frac{\partial\mathcal{L}(t)}{\partial a^{t}} \end{aligned} $$
$\frac{\partial a^{ t+1 }}{\partial a^{ t }}\frac{\partial J}{\partial a^{t+1}}=?$
$$ z^{t+1}:=W_{aa}a^t + W_{ax}x^{t+1}+b_a, $$
$$ \begin{aligned} \frac{\partial a^{ t+1 }}{\partial a^{ t }}\frac{\partial J}{\partial a^{t+1}} &=\frac{\partial z^{t+1}}{\partial a^t}\frac{\partial a^{ t+1 }}{\partial z^{ t+1 }}\frac{\partial J}{\partial a^{t+1}}\newline &=W_{aa}^T \operatorname{diag}\{\tanh'(z^{t+1})_1,\ldots,\tanh'(z^{t+1})_{n_a}\} \frac{\partial J}{\partial a^{t+1}}\newline &=W_{aa}^T\left(\tanh'(z^{t+1})* \frac{\partial J}{\partial a^{t+1}}\right)\newline &\text{ (* stands for element-wise multiplication)} \end{aligned} $$
$\frac{\partial J}{\partial W_{aa}}=?$
$$ \begin{aligned} \frac{\partial J}{\partial W_{aa}(i,j)}&=\sum_{t=0}^{T_x-1}\frac{\partial J}{\partial a^{t+1}_i} \frac{\partial a^{t+1}_i}{\partial z^{t+1}_i} \frac{\partial z^{t+1}_i}{\partial W_{aa}(i,j)}\newline &= \sum_{t=0}^{T_x-1}\frac{\partial J}{\partial a^{t+1}_i} \tanh'(z^{t+1}_i) a^t_j \end{aligned} $$
$$ \Rightarrow \frac{\partial J}{\partial W_{aa}}=\sum_{t=0}^{T_x-1}\left(\tanh'(z^{t+1})* \frac{\partial J}{\partial a^{t+1}}\right)(a^t)^T $$
Similarly, $$ \frac{\partial J}{\partial x^{t+1}}=W_{ax}^T\left(\tanh'(z^{t+1})* \frac{\partial J}{\partial a^{t+1}}\right) $$
$$ \frac{\partial J}{\partial W_{ax}}=\sum_{t=0}^{T_x-1}\left(\tanh'(z^{t+1})* \frac{\partial J}{\partial a^{t+1}}\right)(x^{t+1})^T $$
$\frac{\partial J}{\partial b_a}=?$
$$ \begin{aligned} \frac{\partial J}{\partial b_a}&=\sum_{t=0}^{T_x-1}\frac{\partial z^{t+1}}{\partial b_a} \frac{\partial a^{t+1}}{\partial z^{t+1}} \frac{\partial J}{\partial a^{t+1}}\newline &=\sum_{t=0}^{T_x-1} \operatorname{diag}\{\tanh'(z^{t+1})_1,\ldots,\tanh'(z^{t+1})_{n_a}\} \frac{\partial J}{\partial a^{t+1}}\ \ \left(\text{since } \tfrac{\partial z^{t+1}}{\partial b_a}=I_{n_a}\right)\newline &=\sum_{t=0}^{T_x-1}\tanh'(z^{t+1})* \frac{\partial J}{\partial a^{t+1}} \end{aligned} $$
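A numpy sketch of these BPTT formulas for a single training example; the function name and calling convention are illustrative, and `da_from_y[t]` stands for $\partial\mathcal{L}(t{+}1)/\partial a^{t+1}$ flowing in from the output layer:

```python
import numpy as np

def rnn_backward(xs, a_s, zs, da_from_y, Waa, Wax):
    """BPTT for a single example, accumulating the gradient formulas above.

    xs[t]        : x<t+1>, shape (n_x, 1), for t = 0..Tx-1
    a_s[t]       : a<t>,   shape (n_a, 1), for t = 0..Tx  (a_s[0] is a<0>)
    zs[t]        : z<t+1> = Waa a<t> + Wax x<t+1> + ba, shape (n_a, 1)
    da_from_y[t] : dL(t+1)/da<t+1>, the gradient coming in from y_hat<t+1>
    """
    Tx = len(xs)
    n_a = Waa.shape[0]
    dWaa, dWax, dba = np.zeros_like(Waa), np.zeros_like(Wax), np.zeros((n_a, 1))
    da_next = np.zeros((n_a, 1))              # recurrent part of dJ/da<t+1>, zero past the end
    for t in reversed(range(Tx)):
        da = da_next + da_from_y[t]           # dJ/da<t+1> = recurrent part + output part
        dz = (1 - np.tanh(zs[t]) ** 2) * da   # tanh'(z<t+1>) * dJ/da<t+1>
        dWaa += dz @ a_s[t].T                 # sum over t of (...)(a<t>)^T
        dWax += dz @ xs[t].T                  # sum over t of (...)(x<t+1>)^T
        dba += dz                             # sum over t of tanh'(z<t+1>) * dJ/da<t+1>
        da_next = Waa.T @ dz                  # recurrent contribution to dJ/da<t>
    return dWaa, dWax, dba
```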
RNN Step by Step - Python Code
Different Types of RNNs
- $T_x=T_y$: many-to-many
- $T_x\neq T_y$ (both $>1$): many-to-many, e.g. machine translation
- $T_y=1$: many-to-one, e.g. sentiment classification
- $T_x=1$: one-to-many, e.g. music generation, $x=$ genre
Language Model and Sequence Generation
- input sentence:
$$ y^1,y^2,\ldots,y^{T_y} $$
- language model: to estimate
$$ \Pr(y^1,y^2,\ldots,y^{T_y}) $$
- training set: large corpus of English text
- tokenize: e.g. map each word to one-hot representation
- add <EOS> (end of sentence token)
- replace words not in the vocabulary with <UNK> (unknown word token)
- build RNN model
Sampling Novel Sequences
- given the input, sample according to the predicted softmax distribution, e.g. with np.random.choice (see the sketch below)
- feed the previously sampled output back in as the next input
- keep sampling until the <EOS> token is generated, or cap the sentence length beforehand
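A minimal sampling-loop sketch; `step_fn`, `eos_id` and `max_len` are hypothetical names for the model's one-step forward function, the <EOS> token id and the length cap:

```python
import numpy as np

def sample_sequence(step_fn, a0, x0, eos_id, max_len=50):
    """Sample a sequence of token ids, one at a time, from the model's softmax output.

    step_fn(x_t, a_prev) -> (y_hat, a_t): one RNN forward step returning a probability
    vector over the vocabulary and the new hidden state (illustrative interface).
    """
    a_prev, x_t, tokens = a0, x0, []
    while len(tokens) < max_len:                             # cap the length beforehand
        y_hat, a_prev = step_fn(x_t, a_prev)
        vocab_size = y_hat.shape[0]
        idx = np.random.choice(vocab_size, p=y_hat.ravel())  # sample from the softmax
        tokens.append(idx)
        if idx == eos_id:                                    # stop once <EOS> is generated
            break
        x_t = np.zeros((vocab_size, 1))                      # feed the sample back in
        x_t[idx] = 1.0
    return tokens
```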
Character Level Language Model - Python Code
Vanishing Gradients
- language can have long-term dependencies
- in a deep RNN, it is difficult for gradients at the output to propagate back and affect the weights of earlier layers $\Rightarrow$ vanishing gradient problem
- to address exploding gradients: gradient clipping
  - if gradients exceed some threshold, rescale them
- how to address vanishing gradients?
Gated Recurrent Unit (GRU)
- captures long range connections
- helps with vanishing gradient problems
- simplified GRU unit: memory cell $c^t = a^t$ (need $c^t$ to be highly dependent on $c^{t-1}$)
$$
\tilde{c}^t=\tanh(W_c[c^{t-1},x^t]+b_c),\\
\Gamma_u=\sigma(W_u[c^{t-1},x^t]+b_u),\ \ \text{u for ‘update’, decide when to update } c^t\\
c^t = \Gamma_u * \tilde{c}^t +(1-\Gamma_u) *c^{t-1},\ \ \text{element-wise multiplication}
$$
- full GRU unit (the simplified unit corresponds to $\Gamma_r=1$):
$$
\tilde{c}^t=\tanh(W_c[\Gamma_r *c^{t-1},x^t]+b_c),\ \ \text{‘r’ for ‘relevant’}\\
\Gamma_u=\sigma(W_u[c^{t-1},x^t]+b_u),\\
\Gamma_r=\sigma(W_r[c^{t-1},x^t]+b_r),\\
c^t = \Gamma_u * \tilde{c}^t +(1-\Gamma_u) *c^{t-1}
$$
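A numpy sketch of one step of the full GRU equations above (parameter names mirror the notes; shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(x_t, c_prev, Wc, Wu, Wr, bc, bu, br):
    """One GRU step, following the full-GRU equations above (here c<t> = a<t>)."""
    concat = np.vstack([c_prev, x_t])                       # [c<t-1>, x<t>]
    gamma_u = sigmoid(Wu @ concat + bu)                     # update gate
    gamma_r = sigmoid(Wr @ concat + br)                     # relevance gate
    c_tilde = np.tanh(Wc @ np.vstack([gamma_r * c_prev, x_t]) + bc)
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev      # element-wise blend
    return c_t
```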
Long Short Term Memory (LSTM)
- u, f, o = update, forget, output
- GRU: simpler, making it easier to build deep models; $\Gamma_u$ and $1-\Gamma_u$ play the same role as the LSTM's $\Gamma_u, \Gamma_f$
- LSTM: more powerful (a sketch of the standard equations follows below)
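The notes do not write out the LSTM equations; the sketch below uses the standard formulation with update, forget and output gates ($\Gamma_u,\Gamma_f,\Gamma_o$) in the same $[a^{t-1},x^t]$ notation, as an assumption about the intended cell:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, Wc, Wu, Wf, Wo, bc, bu, bf, bo):
    """One LSTM step with separate cell state c<t> and hidden state a<t>."""
    concat = np.vstack([a_prev, x_t])               # [a<t-1>, x<t>]
    c_tilde = np.tanh(Wc @ concat + bc)             # candidate cell value
    gamma_u = sigmoid(Wu @ concat + bu)             # update gate
    gamma_f = sigmoid(Wf @ concat + bf)             # forget gate
    gamma_o = sigmoid(Wo @ concat + bo)             # output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev      # unlike the GRU, a separate forget gate
    a_t = gamma_o * np.tanh(c_t)
    return a_t, c_t
```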
LSTM Step by Step - Python Code
Text Generation: Writing Like Shakespeare - Python Code
Bidirectional RNNs (BRNN)
- motivation: gain information from the future
Deep RNNs
For layer $\ell$,
$$
a^{[\ell]t}=g(W_a^{[\ell]}[a^{[\ell]t-1}, a^{[\ell-1]t}] +b_a^{[\ell]})
$$
NLP and Word Embedding
Word Representation
- one-hot representation
- featurized representation: word embedding
- t-SNE: visualizing word embeddings
Using Word Embeddings
- named entity recognition
- transfer learning
- learn word embeddings from large text corpus (1-100b words) (or download pre-trained embedding online)
- transfer embedding to new task with smaller training set (say, 100k words)
- optional: continue to fine-tune the word embeddings with new data
- BRNN
- relation to face encoding
Properties of Word Embeddings
- analogies: man is to woman as king is to ?
- find the word $w$:
$$ \arg\max_w sim(e_w, e_{king}-e_{man}+e_{woman}) $$
- similarity function:
- cosine similarity (most common)
$$ sim(u,v)=\frac{u^Tv}{\Vert u\Vert_2 \Vert v\Vert_2} $$
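A small sketch of the analogy search using cosine similarity; `word_to_vec` is a hypothetical dict mapping words to embedding vectors:

```python
import numpy as np

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def complete_analogy(word_a, word_b, word_c, word_to_vec):
    """Solve a:b :: c:? by maximising sim(e_w, e_b - e_a + e_c) over the vocabulary.

    e.g. man:woman :: king:?  ->  target = e_king - e_man + e_woman, as in the formula above.
    """
    e_a, e_b, e_c = (word_to_vec[w] for w in (word_a, word_b, word_c))
    target = e_b - e_a + e_c
    candidates = (w for w in word_to_vec if w not in (word_a, word_b, word_c))
    return max(candidates, key=lambda w: cosine_similarity(word_to_vec[w], target))
```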
Embedding Matrix
- embedding matrix $E$
- embedding for word $w$: $e_w$
- one-hot representation for word $w$: $O_w$
$$ EO_w = e_w $$
- in practice, not efficient to use matrix multiplication to look up an embedding
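A tiny illustration of $EO_w=e_w$ versus a direct lookup, assuming $E$ has shape (embedding dimension, vocab size):

```python
import numpy as np

vocab_size, embedding_dim = 10000, 300
rng = np.random.default_rng(0)
E = rng.normal(size=(embedding_dim, vocab_size))   # embedding matrix

w = 1234                                           # index of some word in the vocabulary
O_w = np.zeros((vocab_size, 1))
O_w[w] = 1.0                                       # one-hot representation

e_w_matmul = E @ O_w                               # E O_w = e_w, as in the notes
e_w_lookup = E[:, [w]]                             # what is done in practice: select a column

assert np.allclose(e_w_matmul, e_w_lookup)
```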
Learning Word Embeddings
- neural language model
- fixed history
- context/target pairs $\rightarrow$ how to choose the context
  - goal: build a language model $\rightarrow$ last 4 words
  - goal: learn embeddings $\rightarrow$ 4 words on left & right, last 1 word, or nearby 1 word
Word2Vec
- skip-grams
- randomly pick a word to be the context word
- randomly pick another word within some window to be the target word
- supervised model
- vocab size $= v$
- context $c\rightarrow$ target $t$
$$ O_c\rightarrow E\rightarrow e_c\rightarrow softmax\rightarrow \hat{y} $$
$$ softmax: \Pr(t|c)=\frac{e^{\theta_t^Te_c}}{\sum_{j=1}^{v}e^{\theta_j^Te_c}} $$
- loss:
$$ \mathcal{L}(\hat{y},y)=-\sum_{i=1}^{v}y_i\log\hat{y}_i $$
- problem: computationally expensive to compute $\Pr(t|c)$ (the $O(v)$ softmax sum; see the sketch after this list)
- solutions:
  - hierarchical softmax:
    - tree of binary classifiers
    - cost: $O(\log v)$, better than the $O(v)$ softmax classifier
    - build an unbalanced tree: common words near the top
  - negative sampling
- how to sample the context $c$? not uniform sampling, otherwise frequent words (the, of, a, …) dominate
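A sketch of the skip-gram forward pass and loss for one (context, target) pair; the sizes are made up, and `Theta` stacks the $\theta_j$ as rows:

```python
import numpy as np

def skipgram_loss(c, t, E, Theta):
    """-log Pr(t|c) with Pr(t|c) = exp(theta_t^T e_c) / sum_j exp(theta_j^T e_c)."""
    e_c = E[:, c]                     # embedding of the context word (lookup instead of E @ O_c)
    logits = Theta @ e_c              # theta_j^T e_c for every word j in the vocabulary
    logits -= logits.max()            # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[t])              # cross-entropy with the one-hot target

vocab_size, dim = 1000, 50
rng = np.random.default_rng(0)
E = rng.normal(size=(dim, vocab_size)) * 0.01
Theta = rng.normal(size=(vocab_size, dim)) * 0.01
print(skipgram_loss(c=42, t=7, E=E, Theta=Theta))   # the O(v) sum over the vocab is the costly part
```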
Negative Sampling
- pick a context word
- generate a positive example:
- for the context word, look around a window
- generate $K$ negative examples:
  - for the same context word, randomly pick a word from the vocabulary
- supervised learning problem
- logistic regression model:
$$ \Pr(y=1|c,t)=\sigma(\theta_t^Te_c) $$
- instead of a large softmax model, train $v$ logistic models
- how to sample the random target words?
- heuristic: between uniform distribution and empirical distribution
$$ \Pr(w_i)=\frac{f(w_i)^{0.75}}{\sum_jf(w_j)^{0.75}}, \ \ f(w_i) = \text{ frequency of word }w_i $$
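A sketch of this sampling heuristic and of drawing $K$ negative target words (the word list and counts are made up):

```python
import numpy as np

def negative_sampling_distribution(word_counts):
    """Pr(w_i) = f(w_i)^0.75 / sum_j f(w_j)^0.75, between uniform and empirical frequency."""
    freqs = np.array(list(word_counts.values()), dtype=float)
    p = freqs ** 0.75
    return p / p.sum()

word_counts = {"the": 5000, "orange": 40, "juice": 35, "king": 12, "durian": 2}
words = list(word_counts)
p = negative_sampling_distribution(word_counts)

K = 4
negatives = np.random.choice(words, size=K, p=p)   # K negative target words for one context word
print(dict(zip(words, np.round(p, 3))), negatives)
```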
GloVe
- Global Vectors for Word Representation
$$ X_{ct} = \text{number of times word }c \text{ appears in the context of word }t = X_{tc} $$
- model: minimize
$$ \sum_{c,t=1}^v f(X_{ct})(\theta_c^Te_t+b_c+b'_t-\log X_{ct})^2, \ \ \text{where the weighting term } f \text{ satisfies} $$
$$ f(X_{ct})=0 \text{ for } X_{ct}=0, \text{ and neither over-weights very frequent words nor under-weights uncommon ones} $$
- since $\theta_c$ and $e_t$ are symmetric:
$$ e_w^{(final)}:=\frac{\theta_w+e_w}{2} $$
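A toy evaluation of the GloVe objective; the specific weighting function below, $f(x)=\min((x/x_{max})^{0.75},1)$ as in the GloVe paper, is an added assumption, since the notes only state its properties:

```python
import numpy as np

def glove_weight(X, x_max=100.0, alpha=0.75):
    """Weighting f(X_ct): 0 when X_ct = 0, grows with the count, capped at 1 (paper's choice)."""
    return np.where(X > 0, np.minimum((X / x_max) ** alpha, 1.0), 0.0)

def glove_loss(X, theta, e, b, b_prime):
    """sum_{c,t} f(X_ct) (theta_c^T e_t + b_c + b'_t - log X_ct)^2."""
    logX = np.log(np.where(X > 0, X, 1.0))             # log X_ct, with a dummy value where X_ct = 0
    pred = theta @ e.T + b[:, None] + b_prime[None, :]
    return float(np.sum(glove_weight(X) * (pred - logX) ** 2))

v, dim = 6, 4
rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(v, v)).astype(float)     # toy co-occurrence counts
theta, e = rng.normal(size=(v, dim)), rng.normal(size=(v, dim))
b, b_prime = np.zeros(v), np.zeros(v)
print(glove_loss(X, theta, e, b, b_prime))
```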
Sentiment Classification
- simple sentiment classification
- (many-to-one) RNN for sentiment classification
Debiasing Word Embeddings
Sequence to Sequence Model
- encoder network
- decoder network
Image Captioning
- use a pre-trained CNN as the encoder
Machine Translation as Building a Conditional Language Model
- language model vs machine translation model
- finding the most likely translation
$$ \arg\max_{y^1,y^2,\ldots,y^{T_y}}\Pr(y^1,\ldots,y^{T_y}|x) $$
- beam search vs greedy search
- greedy search: each time, find the most likely single word
Beam Search
- first word? keep the top $B$ candidates according to $\Pr(y^1|x)$
- parameter $B$: beam width (consider $B$ possibilities at a time); $B=1$ gives greedy search
- second word? keep the pairs $y^1,y^2$ with the top $B$ values of
$$ \Pr(y^1,y^2|x)=\Pr(y^1|x)\Pr(y^2|y^1,x) $$
Refinements to Beam Search
- BS unnaturally prefers shorter sentences
$$ \arg\max_y\prod_{t=1}^{T_y}\Pr(y^t|x,y^1,\ldots,y^{t-1}) $$
$$ \overset{\text{avoid numerical rounding error}}{\Longrightarrow} \arg\max_y\sum_{t=1}^{T_y}\log\Pr(y^t|x,y^1,\ldots,y^{t-1}) $$
- length normalization
  - heuristic, with $\alpha \in (0,1)$:
$$ \arg\max_y\frac{1}{T_y^{\alpha}}\sum_{t=1}^{T_y}\log\Pr(y^t|x,y^1,\ldots,y^{t-1}) $$
- choice of $B$
  - large beam width: considers more possibilities, better results, but slower
  - small beam width: worse results, but faster
- unlike exact search algorithms such as BFS or DFS, beam search runs faster but is not guaranteed to find the exact maximum (a beam-search sketch follows below)
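A generic beam-search sketch using the length-normalised log-probability above; `step_log_probs` is a hypothetical callable returning $\log\Pr(y^t\mid x,y^1,\ldots,y^{t-1})$ over the vocabulary:

```python
import numpy as np

def beam_search(step_log_probs, eos_id, beam_width=3, max_len=20, alpha=0.7):
    """Keep the B most likely partial sentences at each step; score finished
    sentences by (1 / T_y^alpha) * sum_t log Pr(y_t | x, y_1..y_{t-1})."""
    beams = [([], 0.0)]                                     # (token ids, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            log_probs = step_log_probs(tokens)              # log Pr(next word | x, tokens)
            top = np.argsort(log_probs)[-beam_width:]       # only the best few can survive pruning
            for idx in top:
                candidates.append((tokens + [int(idx)], logp + float(log_probs[idx])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, logp in candidates[:beam_width]:        # keep the top B candidates
            if tokens[-1] == eos_id:
                finished.append((tokens, logp / len(tokens) ** alpha))  # length normalisation
            else:
                beams.append((tokens, logp))
        if not beams:
            break
    finished += [(t, lp / len(t) ** alpha) for t, lp in beams]   # unfinished beams as fallback
    return max(finished, key=lambda c: c[1])[0]
```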
Error Analysis in Beam Search
- human translation $y^*$, algorithm output $\hat{y}$
- case 1: $\Pr(y^*|x)>\Pr(\hat{y}|x)$
  - beam search chose $\hat{y}$, but $y^*$ attains a higher $\Pr(y|x)$
  - beam search is at fault
- case 2: $\Pr(y^*|x)<\Pr(\hat{y}|x)$
  - $y^*$ is a better translation than $\hat{y}$, but the RNN predicted $\Pr(y^*|x)<\Pr(\hat{y}|x)$
  - the RNN model is at fault
Bleu Score
- evaluate machine translation
- bleu = bilingual evaluation understudy
- $p_n$ = Bleu score on $n$-grams only
- combined Bleu score (commonly $K=4$; see the sketch below):
$$ BP\,\exp\left(\frac{1}{K}\sum_{n=1}^K \log p_n\right) $$
$$ \text{BP (brevity penalty)}:= \begin{cases} 1, &\text{if MT output length} > \text{reference output length}\newline \exp(1-\text{reference output length}/\text{MT output length}),&\text{otherwise} \end{cases} $$
- download open source implementation
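A sketch of the combined Bleu score with the brevity penalty above, assuming the modified $n$-gram precisions $p_n$ have already been computed (count clipping is omitted):

```python
import numpy as np

def combined_bleu(p, mt_len, ref_len):
    """BP * exp((1/K) * sum_n log p_n), with the brevity penalty defined above."""
    p = np.asarray(p, dtype=float)                 # p[n-1] = modified n-gram precision p_n
    if mt_len > ref_len:
        bp = 1.0
    else:
        bp = np.exp(1.0 - ref_len / mt_len)        # penalise translations that are too short
    return bp * np.exp(np.mean(np.log(p)))

# e.g. unigram..4-gram precisions for a candidate of length 5 against a reference of length 6
print(combined_bleu([0.8, 0.6, 0.4, 0.25], mt_len=5, ref_len=6))
```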
Attention Model Intuition
- problem of long sequences
- BRNN, attention weight
Speech Recognition
- attention models
- CTC cost (connectionist temporal classification)
  - collapse repeated characters not separated by “blank”
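A sketch of the CTC collapsing rule, with `_` standing for the blank symbol (the input string is just an illustrative pattern):

```python
from itertools import groupby

def ctc_collapse(chars, blank="_"):
    """Collapse repeated characters not separated by a blank, then drop the blanks."""
    deduped = [ch for ch, _ in groupby(chars)]     # merge runs of identical characters
    return "".join(ch for ch in deduped if ch != blank)

print(ctc_collapse("ttt_h_eee___ ___qqq__"))       # -> "the q"
```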