text normalization => convert text to a more convenient, standard form
- word tokenization = separate out words
- Chinese doesn’t have spaces between words, so word tokenization is more difficult
 
 - lemmatization = determine that 2 words have the same root
- stemming => strip suffixes, simpler version of lemmatization
 
- sentence segmentation: break up a text into individual sentences, using cues like periods or exclamation points
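A toy illustration of tokenization and stemming in Python (the regex tokenizer and suffix list are made up for illustration; real stemmers such as the Porter stemmer are more careful):

import re

def tokenize(text):
    # crude word tokenization: lowercase, then keep runs of letters
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    # toy stemmer: strip a few common suffixes
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(tokenize("The cats were sleeping."))                   # ['the', 'cats', 'were', 'sleeping']
print([stem(w) for w in tokenize("cats sleeping walked")])   # ['cat', 'sleep', 'walk']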
 
edit distance: an algorithm used to compare the similarity between 2 strings
- spell correction
 - computational biology
 - evaluate machine translation and speech recognition
 - named entity extraction and entity coreference
 
Regular Expressions
- disjunction []: range, e.g. [A-Z], [a-z], [0-9]
- negation ^ (caret): [^A-Z], [^e^] = neither e nor ^
- disjunction | (pipe symbol): a|b|c = [abc], [gG]roundhog|[Ww]oodchuck
- ?: optional previous char, e.g. colou?r
- Kleene operators
  - *: 0 or more of previous char
  - +: 1 or more of previous char
- .: any character
- anchors
  - ^: beginning, e.g. ^[A-Z], ^[^A-Za-z]
  - $: end
  - \b: word boundary
  - \B: non-boundary
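A quick check of these operators with Python's re module (the example string is made up):

import re

text = "Groundhog Day: the woodchuck saw 3 groundhogs. Colour or color?"

print(re.findall(r"[gG]roundhog|[Ww]oodchuck", text))  # ['Groundhog', 'woodchuck', 'groundhog']
print(re.findall(r"colou?r", text, re.IGNORECASE))     # ['Colour', 'color']
print(re.findall(r"[0-9]+", text))                     # ['3']
print(re.search(r"^[A-Z]", text) is not None)          # True: the text begins with an uppercase letter
print(re.findall(r"\bsaw\b", text))                    # ['saw'] (word-boundary match)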
RE Operator Precedence Hierarchy
from highest precedence to lowest precedence
- parentheses ()
- counters * + ? {}
- sequences and anchors
- disjunction |
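For example, counters apply only to the immediately preceding item unless parentheses group it first, and | binds loosest (examples made up):

import re

# counters apply only to the previous char: 'na+' is 'n' then one or more 'a'
print(re.findall(r"na+", "naaa nanana"))      # ['naaa', 'na', 'na', 'na']

# parentheses group first, so the counter applies to the whole sequence 'na'
print(re.findall(r"(?:na)+", "naaa nanana"))  # ['na', 'nanana']

# disjunction binds loosest: 'the|any' matches 'the' or 'any', not 'th' + ('e' or 'a') + 'ny'
print(re.findall(r"the|any", "then any"))     # ['the', 'any']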
Non-greedy Matching
another meaning of ?
- *? and +? match as little text as possible
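A quick illustration of greedy vs non-greedy matching with Python's re (the string is made up):

import re

html = "<b>bold</b> and <i>italic</i>"

# greedy: .* grabs as much as possible, spanning across tags
print(re.findall(r"<.*>", html))   # ['<b>bold</b> and <i>italic</i>']

# non-greedy: .*? stops at the first closing '>'
print(re.findall(r"<.*?>", html))  # ['<b>', '</b>', '<i>', '</i>']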
reduce error rate
- increase precision = minimize false positives
 - increase recall = minimize false negatives
 
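A minimal sketch using the standard count-based definitions (TP = true positives, FP = false positives, FN = false negatives); the numbers below are made up:

def precision(tp, fp):
    # fewer false positives => higher precision
    return tp / (tp + fp)

def recall(tp, fn):
    # fewer false negatives => higher recall
    return tp / (tp + fn)

# e.g. a pattern that finds 80 true matches, 20 spurious ones, and misses 40
print(precision(80, 20))  # 0.8
print(recall(80, 40))     # 0.666...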
Text Normalization
- segment/tokenize words in running text
 - normalize word formats
 - segment sentences in running text
 
Word Tokenization
- lemma: cat and cats = same lemma
- wordform: cat and cats = different wordforms
Unix Tools for Crude Tokenization
less = display one page at a time
- space bar => page down
- q => quit
less hamlet.txt
tr = translate, transform
syntax
tr [OPTION] set1 [set2]
options:
- -c = complement of set1
- -s = squeeze, replace repeated characters in set1 with a single occurrence
tr -sc 'A-Za-z' '\n' < hamlet.txt | less
sort
options:
- -n = sort numerically
- -r = reverse order
tr -sc 'A-Za-z' '\n' < hamlet.txt | sort | less
uniq = remove adjacent duplicate lines
uniq -c = display unique lines with count
tr -sc 'A-Za-z' '\n' < hamlet.txt | sort |  uniq -c | less
sort by frequency
tr -sc 'A-Za-z' '\n' < hamlet.txt | sort |  uniq -c | sort -n -r | less
tr 'A-Z' 'a-z' < hamlet.txt | tr -sc 'a-z' '\n' | sort |  uniq -c | sort -n -r | less
Maximum Matching Algorithm
- word tokenization = word segmentation in Chinese (since no spaces between words)
 - average word in Chinese is 2.4 characters long
 - standard baseline segmentation algorithm: maximum matching (also called greedy)
 
MaxMatch
Given a wordlist of Chinese, and a string
- start a pointer at the beginning of the string
 - find the longest word in dict that matches the string starting at pointer
 - move the pointer over the word in string
 - go to step 2
 
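A minimal sketch of MaxMatch in Python, assuming the wordlist is given as a set of strings (the toy English wordlist below is made up):

def max_match(text, wordlist):
    # greedy left-to-right segmentation: always take the longest dictionary word
    words = []
    i = 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in wordlist:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # fall back to a single character (unknown word)
        words.append(match)
        i += len(match)
    return words

wordlist = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", wordlist))  # ['theta', 'bled', 'own', 'there']

With this made-up wordlist the greedy segmenter reproduces the "theta bled own there" failure shown in the example below.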
Example
the table down there
thetabledownthere
theta bled own there
=> maximum matching doesn’t generally work in English (Chinese has much shorter words than English)
quantify how well a segmenter works
- compare output segmentation with a perfect hand-segmented (gold) sentence
 - metric: word error rate = normalized minimum edit distance = edit distance / len of the gold sentence in words
- problem of MaxMatch: unknown words (words not in dict)
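A word-level sketch of this metric, using a simple dynamic-programming edit distance over word lists with unit costs (one common convention; the segmentations come from the example above):

def word_edit_distance(hyp, ref):
    # minimum number of word insertions, deletions, and substitutions (all cost 1)
    n, m = len(hyp), len(ref)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,                              # deletion
                          d[i][j - 1] + 1,                              # insertion
                          d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])) # substitution
    return d[n][m]

def word_error_rate(hyp, ref):
    # edit distance normalized by the length of the gold (reference) segmentation
    return word_edit_distance(hyp, ref) / len(ref)

gold = "the table down there".split()
output = "theta bled own there".split()
print(word_error_rate(output, gold))  # 0.75 (3 substitutions / 4 gold words)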
Word Normalization
BPE = byte-pair encoding
Unix Tools for Crude Normalization
find words ending in -ing
tr -sc 'A-Za-z' '\n' < hamlet.txt | tr 'A-Z' 'a-z' | grep 'ing$' | sort | uniq -c | sort -n -r | less
tr -sc 'A-Za-z' '\n' < hamlet.txt | tr 'A-Z' 'a-z' | grep '[aeiou].*ing$' | sort | uniq -c | sort -n -r | less
Sentence Segmentation
- binary classifier: determine if a word is EndOfSentence/NotEndOfSentence
  - handwritten rules
  - regular expressions
  - machine learning
 
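A crude regex-rule sketch of sentence segmentation (a real segmenter must also handle abbreviations like "Dr.", decimals, quotation marks, etc.):

import re

def segment_sentences(text):
    # split after ., !, or ? when followed by whitespace and an uppercase letter
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

text = "Hello there. How are you? Fine! See you tomorrow."
print(segment_sentences(text))
# ['Hello there.', 'How are you?', 'Fine!', 'See you tomorrow.']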
 
Minimum Edit Distance
- similarity between 2 strings
 - minimum edit distance = minimum # of editing operations needed to transform one into the other
- insertion
 - deletion
 - substitution
 
- e.g. intention vs execution
  - if each operation has a cost of 1, dist = 5
  - if substitutions cost 2, dist = 8 (Levenshtein distance)
 
For 2 strings, $X$ of len $n$ and $Y$ of len $m$, define $D(i,j)$ as
- the edit distance between $X[1...i]$ and $Y[1...j]$
- the edit distance between $X$ and $Y$ is thus $D(n,m)$
Dynamic Programming for Minimum Edit Distance
- tabular computation of $D(n,m)$
- compute $D(i,j)$ for small $i,j$
- compute larger $D(i,j)$ based on previously computed smaller values
 
MinEdit (Levenshtein)
- initialization
 
$$ \begin{aligned} D(i,0)&=i\newline D(0,j)&=j \end{aligned} $$
- recurrence relation
 
For each $i=1,\ldots,n$
For each $j=1,\ldots,m$
$$
D(i,j)=\min
\begin{cases}
D(i-1,j)+1 & \text{(deletion)} \newline
D(i,j-1)+1 & \text{(insertion)} \newline
D(i-1,j-1)+
\begin{cases}
0, & X(i)=Y(j) \newline
2, & \text{otherwise}
\end{cases} & \text{(substitution)}
\end{cases}
$$
- termination: $D(n,m)$ is the distance
Backtrace
$$ ptr(i,j)= \begin{cases} \text{down (deletion)}\newline \text{left (insertion)}\newline \text{diag (substitution)} \end{cases} $$
- every non-decreasing path from $(0,0)$ to $(m,n)$ corresponds to an alignment of the 2 seqs
- an optimal alignment is composed of optimal subalignments
 
Performance
- time: $O(nm)$
- space: $O(nm)$
- backtrace: $O(n+m)$
Python Code
import numpy as np

def levenshtein(str1, str2):
    # minimum edit distance with insertion/deletion cost 1 and substitution cost 2
    n, m = len(str1), len(str2)
    dm = np.zeros((n + 1, m + 1), dtype=int)  # distance matrix D
    # initialization: distance from/to the empty prefix
    for i in range(n + 1):
        dm[i, 0] = i
    for j in range(m + 1):
        dm[0, j] = j
    # recurrence relation
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a = dm[i - 1, j] + 1  # deletion
            b = dm[i, j - 1] + 1  # insertion
            c = dm[i - 1, j - 1] + 2 * (str1[i - 1] != str2[j - 1])  # substitution
            dm[i, j] = min(a, b, c)
    # termination: D(n, m) is the distance
    return dm[n, m]

print(levenshtein("intention", "execution")) # 8
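The function above returns only the distance. A minimal extension that also follows the backtrace pointers (down = deletion, left = insertion, diag = substitution/match) to recover one optimal alignment; the function name and the '*' gap symbol are made up for illustration:

import numpy as np

def levenshtein_alignment(str1, str2):
    # fill the same distance matrix, then walk back from (n, m) to (0, 0)
    n, m = len(str1), len(str2)
    dm = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(n + 1):
        dm[i, 0] = i
    for j in range(m + 1):
        dm[0, j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dm[i, j] = min(dm[i - 1, j] + 1,
                           dm[i, j - 1] + 1,
                           dm[i - 1, j - 1] + 2 * (str1[i - 1] != str2[j - 1]))
    # backtrace: at each cell, move to a neighbour that explains the current value
    aligned1, aligned2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dm[i, j] == dm[i - 1, j - 1] + 2 * (str1[i - 1] != str2[j - 1]):
            # diag: substitution (or match when the characters are equal)
            aligned1.append(str1[i - 1])
            aligned2.append(str2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dm[i, j] == dm[i - 1, j] + 1:
            # down: deletion from str1
            aligned1.append(str1[i - 1])
            aligned2.append("*")
            i -= 1
        else:
            # left: insertion of a character from str2
            aligned1.append("*")
            aligned2.append(str2[j - 1])
            j -= 1
    return "".join(reversed(aligned1)), "".join(reversed(aligned2))

a1, a2 = levenshtein_alignment("intention", "execution")
print(a1)  # one optimal alignment, with '*' marking gaps
print(a2)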