Stanford NLP Ch1 Regular Expressions, Edit Distance

2019/05/12

Categories: nlp Tags: nlp regularExpressions editDistance

Regular Expressions

RE Operator Precedence Hierarchy

from highest precedence to lowest precedence:

  1. parenthesis ()
  2. counters * + ? {}
  3. sequences and anchors the ^my end$
  4. disjunction |

Non-greedy Matching

another meaning of ? : placed after * or + (i.e. *? or +?), it makes the operator non-greedy

match as little text as possible
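A quick Python illustration of greedy vs. non-greedy matching (the HTML string is just a made-up example):

```python
import re

# greedy .* consumes as much text as possible; non-greedy .*? as little as possible
html = "<b>one</b><b>two</b>"
print(re.findall(r"<b>.*</b>", html))   # ['<b>one</b><b>two</b>']
print(re.findall(r"<b>.*?</b>", html))  # ['<b>one</b>', '<b>two</b>']
```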


reducing the error rate of an application involves two antagonistic efforts

  1. increase precision = minimize false positives
  2. increase recall = minimize false negatives

Text Normalization

Word Tokenization

Unix Tools for Crude Tokenization

less = display one page at a time - space bar => page down - q => quit

less hamlet.txt

tr = translate, transform

syntax

tr [OPTION] set1 [set2]

options:

  1. -s = squeeze repeated output characters into one
  2. -c = complement set1
  3. -d = delete characters in set1

tr -sc 'A-Za-z' '\n' < hamlet.txt | less

sort options: -n = sort numerically, -r = reverse (descending) order

tr -sc 'A-Za-z' '\n' < hamlet.txt | sort | less

uniq = remove adjacent duplicate lines

uniq -c = display unique lines with count

tr -sc 'A-Za-z' '\n' < hamlet.txt | sort |  uniq -c | less

sort by frequency

tr -sc 'A-Za-z' '\n' < hamlet.txt | sort |  uniq -c | sort -n -r | less
tr 'A-Z' 'a-z' < hamlet.txt | tr -sc 'a-z' '\n' | sort |  uniq -c | sort -n -r | less
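The last pipeline (lowercase, split on non-letters, count, sort by frequency) has a direct Python counterpart; word_freq is an illustrative helper, not from the lecture:

```python
import re
from collections import Counter

def word_freq(text):
    # lowercase, tokenize on runs of letters, count, sort by frequency
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens).most_common()

print(word_freq("To be, or not to be: that is the question")[:3])
# [('to', 2), ('be', 2), ('or', 1)]
```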

Maximum Matching Algorithm

MaxMatch

Given a wordlist of Chinese words and a string:

  1. start a pointer at the beginning of the string
  2. find the longest word in dict that matches the string starting at pointer
  3. move the pointer over the word in string
  4. go to step 2 (until the pointer reaches the end of the string)

Example

the table down there

thetabledownthere

theta bled own there

=> maximum matching doesn’t generally work in English (Chinese has much shorter words than English)
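The steps above can be sketched in Python; the toy wordlist is assumed for illustration and reproduces the "theta bled own there" failure:

```python
def max_match(text, wordlist):
    words = []
    i = 0
    while i < len(text):
        # find the longest dictionary word matching at position i
        for j in range(len(text), i, -1):
            if text[i:j] in wordlist:
                words.append(text[i:j])
                i = j
                break
        else:
            # no match: emit a single character and advance
            words.append(text[i])
            i += 1
    return words

wordlist = {"the", "theta", "table", "bled", "own", "there", "down"}
print(max_match("thetabledownthere", wordlist))
# ['theta', 'bled', 'own', 'there']
```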


to quantify how well a segmenter works, compare its output against a gold segmentation (e.g. via word error rate, the normalized edit distance in words)

Word Normalization

BPE = byte-pair encoding: start from single characters and repeatedly merge the most frequent pair of adjacent symbols into a new subword unit
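As a hedged sketch (not the lecture's code), the core BPE loop — count adjacent symbol pairs across the corpus, merge the most frequent pair, repeat — might look like:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    # each word is a tuple of symbols; learn merge rules greedily
    vocab = Counter(words)
    merges = []
    for _ in range(num_merges):
        # count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # rewrite every word with the best pair merged into one symbol
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

corpus = [tuple("low")] * 5 + [tuple("lower")] * 2 + [tuple("newest")] * 6
print(bpe_merges(corpus, 3))
```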

Unix Tools for Crude Normalization

tr -sc 'A-Za-z' '\n' < hamlet.txt | tr 'A-Z' 'a-z' | grep 'ing$' | sort | uniq -c | sort -n -r | less
tr -sc 'A-Za-z' '\n' < hamlet.txt | tr 'A-Z' 'a-z' | grep '[aeiou].*ing$' | sort | uniq -c | sort -n -r | less

Sentence Segmentation

Minimum Edit Distance

For 2 strings, $X$ of len $n$, $Y$ of len $m$, define $D(i,j)$ as the edit distance between $X[1..i]$ and $Y[1..j]$ (the first $i$ characters of $X$ and the first $j$ characters of $Y$); the edit distance between $X$ and $Y$ is then $D(n,m)$.

Dynamic Programming for Minimum Edit Distance

MinEdit (Levenshtein)

$$ \begin{aligned} D(i,0)&=i\newline D(0,j)&=j \end{aligned} $$

For each $i=1,\ldots,n$, for each $j=1,\ldots,m$:

$$ D(i,j)=\min \begin{cases} D(i-1,j)+1 & \text{(deletion)} \newline D(i,j-1)+1 & \text{(insertion)} \newline D(i-1,j-1)+ \begin{cases} 0 & \text{if } X(i)=Y(j) \newline 2 & \text{otherwise} \end{cases} & \text{(substitution)} \end{cases} $$

Backtrace $$ ptr(i,j)= \begin{cases} \text{down} & \text{(deletion)} \newline \text{left} & \text{(insertion)} \newline \text{diag} & \text{(substitution)} \end{cases} $$

Performance

Time: $O(nm)$; space: $O(nm)$; backtrace: $O(n+m)$.

Python Code

import numpy as np

def levenshtein(str1, str2):
    if str1 == str2: return 0
    ncol, nrow = len(str1) + 1, len(str2) + 1
    dm = np.zeros((ncol, nrow), dtype=int) # distance matrix (int dtype so the result prints as 8, not 8.0)
    # initialization
    for i in range(ncol):
        dm[i, 0] = i
    for j in range(nrow):
        dm[0, j] = j

    # recurrence relation
    for i in range(1, ncol):
        for j in range(1, nrow):
            a = dm[i - 1, j] + 1 # deletion
            b = dm[i, j - 1] + 1 # insertion
            c = dm[i - 1, j - 1] + 2*(str1[i - 1] != str2[j - 1]) # substitution
            dm[i, j] = min(a, b, c)
    # print(dm)
    return dm[ncol - 1, nrow - 1]

print(levenshtein("intention", "execution")) # 8
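The backtrace described above can be bolted onto the same table to recover an alignment; align is an illustrative helper, not part of the lecture, using * for insertions/deletions:

```python
import numpy as np

def align(str1, str2):
    n, m = len(str1), len(str2)
    # fill the distance matrix exactly as in levenshtein (substitution cost 2)
    dm = np.zeros((n + 1, m + 1), dtype=int)
    dm[:, 0] = np.arange(n + 1)
    dm[0, :] = np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if str1[i - 1] == str2[j - 1] else 2
            dm[i, j] = min(dm[i - 1, j] + 1, dm[i, j - 1] + 1, dm[i - 1, j - 1] + sub)
    # follow pointers back from (n, m): diag = (mis)match, down = deletion, left = insertion
    i, j, pairs = n, m, []
    while i > 0 or j > 0:
        sub = 0 if i > 0 and j > 0 and str1[i - 1] == str2[j - 1] else 2
        if i > 0 and j > 0 and dm[i, j] == dm[i - 1, j - 1] + sub:
            pairs.append((str1[i - 1], str2[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dm[i, j] == dm[i - 1, j] + 1:
            pairs.append((str1[i - 1], "*"))  # deletion
            i -= 1
        else:
            pairs.append(("*", str2[j - 1]))  # insertion
            j -= 1
    return list(reversed(pairs))

for a, b in align("intention", "execution"):
    print(a, b)
```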