text normalization => convert text to a more convenient, standard form
- word tokenization = separate out words
- Chinese doesn’t have spaces between words, so word tokenization is more difficult
- lemmatization = determine that 2 words have the same root
- stemming => strip suffixes, simpler version of lemmatization
- sentence segmentation: break up a text into individual sentences, using cues like periods or exclamations
edit distance: an algorithm used to compare the similarity between 2 strings
- spell correction
- computational biology
- evaluate machine translation and speech recognition
- named entity extraction and entity coreference
Regular Expressions
- disjunction: []
  - range: -, e.g. [A-Z], [a-z], [0-9]
- negation: caret ^
  - [^A-Z]: not an uppercase letter
  - [^e^]: neither e nor ^
- disjunction: pipe symbol |
  - a|b|c = [abc]
  - [gG]roundhog|[Ww]oodchuck
- ?: optional previous char, e.g. colou?r
- Kleene operators
  - *: 0 or more of previous char
  - +: 1 or more of previous char
- . = any character
- anchors
  - ^: beginning, e.g. ^[A-Z], ^[^A-Za-z]
  - $: end
  - \b: word boundary
  - \B: non-boundary
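As a quick check of these operators, a small sketch using Python's re module (the test strings are my own illustrative examples):

import re

print(re.findall(r'[A-Z]', "Hello World"))                              # ['H', 'W'] (range)
print(re.findall(r'[^A-Za-z]', "Hi there!"))                            # [' ', '!'] (negation)
print(re.findall(r'colou?r', "color colour"))                           # ['color', 'colour'] (optional char)
print(re.findall(r'[gG]roundhog|[Ww]oodchuck', "Groundhog woodchuck"))  # disjunction
print(re.findall(r'\bthe\b', "the other theology"))                     # ['the'] (word boundary)
print(bool(re.search(r'^[A-Z].*!$', "Go now!")))                        # True (anchors, Kleene *, .)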
RE Operator Precedence Hierarchy
from highest precedence to lowest precedence
- parentheses: ()
- counters: * + ? {}
- sequences and anchors
- disjunction: |
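A small illustration of the precedence hierarchy in Python's re (the example strings are my own):

import re

print(re.findall(r'the*', "thee theme th"))   # ['thee', 'the', 'th'] : * binds only to the preceding e
print(re.findall(r'(?:the)*m', "thethem m"))  # ['thethem', 'm'] : parentheses group the sequence before *
print(re.findall(r'^a|b$', "a...b"))          # ['a', 'b'] : | binds last, so this is (^a)|(b$), not ^(a|b)$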
Non-greedy Matching
another meaning of ?
- *? and +? match as little text as possible
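For example (illustrative string of my own):

import re

s = "<a>first</a><a>second</a>"
print(re.findall(r'<a>.*</a>', s))   # ['<a>first</a><a>second</a>'] : greedy, one long match
print(re.findall(r'<a>.*?</a>', s))  # ['<a>first</a>', '<a>second</a>'] : non-greedy, two short matches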
reduce error rate
- increase precision = minimize false positives
- increase recall = minimize false negatives
Text Normalization
- segment/tokenize words in running text
- normalize word formats
- segment sentences in running text
Word Tokenization
lemma: cat and cats = same lemma
wordform: cat and cats = different wordforms
Unix Tools for Crude Tokenization
less = display one page at a time
- space bar => page down
- q => quit
less hamlet.txt | less
tr = translate, transform
syntax: tr [OPTION] set1 [set2]
options:
- c = complement of set1
- s = squeeze (replace repeated characters in set1 with a single occurrence)
tr -sc 'A-Za-z' '\n' < hamlet.txt | less
sort
options:
- n = sort numerically
- r = reverse order
tr -sc 'A-Za-z' '\n' < hamlet.txt | sort | less
uniq = remove adjacent duplicate lines
uniq -c = display unique lines with count
tr -sc 'A-Za-z' '\n' < hamlet.txt | sort | uniq -c | less
sort by frequency
tr -sc 'A-Za-z' '\n' < hamlet.txt | sort | uniq -c | sort -n -r | less
tr 'A-Z' 'a-z' < hamlet.txt | tr -sc 'a-z' '\n' | sort | uniq -c | sort -n -r | less
Maximum Matching Algorithm
- word tokenization = word segmentation in Chinese (since no spaces between words)
- average word in Chinese is 2.4 characters long
- standard baseline segmentation algorithm: maximum matching (also called greedy)
MaxMatch
Given a wordlist of Chinese and a string:
1. start a pointer at the beginning of the string
2. find the longest word in the dictionary that matches the string starting at the pointer
3. move the pointer over that word in the string
4. go to step 2
Example
the table down there
thetabledownthere
theta bled own there
=> maximum matching doesn’t generally work in English (Chinese has much shorter words than English)
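A minimal sketch of MaxMatch in Python; the toy dictionary below is an assumption chosen to reproduce the example above, not a real wordlist:

def max_match(text, wordlist):
    words, i = [], 0
    max_len = max(len(w) for w in wordlist)
    while i < len(text):
        # try the longest candidate first, shrinking until a dictionary hit
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in wordlist or j == i + 1:  # fall back to a single character
                words.append(text[i:j])
                i = j
                break
    return words

dictionary = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", dictionary))  # ['theta', 'bled', 'own', 'there']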
quantify how well a segmenter works
- compare output segmentation with a perfect hand-segmented (gold) sentence
- metric: word error rate = normalized minimum edit distance = edit distance / length of the gold sentence in words (worked example after this list)
- problem of MaxMatch: unknown words (words not in dict)
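Worked example (using the toy segmentation above for illustration): gold "the table down there" vs. MaxMatch output "theta bled own there" differ in 3 of the 4 word positions (3 substitutions), so with unit costs the word error rate is 3/4 = 0.75.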
Word Normalization
BPE = byte-pair encoding
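A minimal sketch of the BPE merge-learning loop, assuming the usual setup of a word-count vocabulary whose words are sequences of symbols ending in an end-of-word marker; the toy corpus and number of merges are my own illustrative choices:

from collections import Counter

def get_pair_counts(vocab):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        for i in range(len(word) - 1):
            pairs[(word[i], word[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    # replace every occurrence of the pair with a single merged symbol
    merged = pair[0] + pair[1]
    new_vocab = {}
    for word, freq in vocab.items():
        symbols, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                symbols.append(merged)
                i += 2
            else:
                symbols.append(word[i])
                i += 1
        new_vocab[tuple(symbols)] = freq
    return new_vocab

# toy corpus: each word split into characters plus an end-of-word marker "_"
vocab = {("l", "o", "w", "_"): 5, ("l", "o", "w", "e", "r", "_"): 2,
         ("n", "e", "w", "e", "s", "t", "_"): 6, ("w", "i", "d", "e", "s", "t", "_"): 3}
for _ in range(5):  # learn 5 merges
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)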
Unix Tools for Crude Normalization
words ending in ing
tr -sc 'A-Za-z' '\n' < hamlet.txt | tr 'A-Z' 'a-z' | grep 'ing$' | sort | uniq -c | sort -n -r | less
tr -sc 'A-Za-z' '\n' < hamlet.txt | tr 'A-Z' 'a-z' | grep '[aeiou].*ing$' | sort | uniq -c | sort -n -r | less
Sentence Segmentation
- binary classifier: determine if a word is EndOfSentence/NotEndOfSentence
- handwritten rules
- regular expressions
- machine learning
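A minimal sketch of the handwritten-rules/regex approach (the boundary pattern and abbreviation list are my own simplifications):

import re

ABBREV = {"Mr.", "Mrs.", "Dr.", "e.g.", "i.e."}

def split_sentences(text):
    sentences, start = [], 0
    # candidate boundary: ., ! or ? followed by whitespace and an uppercase letter
    for m in re.finditer(r'[.!?]\s+(?=[A-Z])', text):
        candidate = text[start:m.end()].strip()
        last_token = candidate.split()[-1] if candidate.split() else ""
        if last_token in ABBREV:  # e.g. "Dr." is probably not a sentence end
            continue
        sentences.append(candidate)
        start = m.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived. He was late! The meeting had started."))
# ['Dr. Smith arrived.', 'He was late!', 'The meeting had started.']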
Minimum Edit Distance
- similarity between 2 strings
- minimum edit distance = minimum # of editing operations needed to transform one into the other
- insertion
- deletion
- substitution
- e.g. intention vs execution
- if each operation has a cost of 1, dist = 5
- if substitutions cost 2, dist = 8 (Levenshtein distance)
For 2 strings $X$ of length $n$ and $Y$ of length $m$, define $D(i,j)$ as
- the edit distance between $X[1...i]$ and $Y[1...j]$
- the edit distance between $X$ and $Y$ is thus $D(n,m)$
Dynamic Programming for Minimum Edit Distance
- tabular computation of $D(n,m)$
- compute $D(i,j)$ for small $i,j$
- compute larger $D(i,j)$ based on previously computed smaller values
- compute MinEdit (Levenshtein)
- initialization
$$ \begin{aligned} D(i,0)&=i\newline D(0,j)&=j \end{aligned} $$
- recurrence relation
For each $i=1,\ldots,n$
For each $j=1,\ldots,m$
$$
D(i,j)=\min
\begin{cases}
D(i-1,j)+1 & \text{(deletion)} \newline
D(i,j-1)+1 & \text{(insertion)} \newline
D(i-1,j-1)+
\begin{cases}
0 & \text{if } X(i)=Y(j)\newline
2 & \text{otherwise}
\end{cases}
& \text{(substitution)}
\end{cases}
$$
- termination: $D(n,m)$ is the distance
Backtrace
$$
ptr(i,j)=
\begin{cases}
\text{down} & \text{(deletion)}\newline
\text{left} & \text{(insertion)}\newline
\text{diag} & \text{(substitution)}
\end{cases}
$$
- every non-decreasing path from $(0,0)$ to $(m,n)$ corresponds to an alignment of the 2 sequences
- an optimal alignment is composed of optimal subalignments
Performance
- time: $O(nm)$
- space: $O(nm)$
- backtrace: $O(n+m)$
Python Code
import numpy as np

def levenshtein(str1, str2):
    if str1 == str2: return 0
    n, m = len(str1), len(str2)
    dm = np.zeros((n + 1, m + 1), dtype=int)  # distance matrix D
    # initialization: D(i,0) = i, D(0,j) = j
    for i in range(n + 1):
        dm[i, 0] = i
    for j in range(m + 1):
        dm[0, j] = j
    # recurrence relation
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a = dm[i - 1, j] + 1  # deletion
            b = dm[i, j - 1] + 1  # insertion
            c = dm[i - 1, j - 1] + 2 * (str1[i - 1] != str2[j - 1])  # substitution (cost 2)
            dm[i, j] = min(a, b, c)
    return dm[n, m]

print(levenshtein("intention", "execution"))  # 8
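A sketch of the backtrace described above, reusing the same matrix fill and then walking from $(n,m)$ back to $(0,0)$ to recover one optimal alignment; the align() name and the "*" gap symbol are my own choices:

import numpy as np

def align(str1, str2):
    n, m = len(str1), len(str2)
    dm = np.zeros((n + 1, m + 1), dtype=int)
    dm[:, 0] = np.arange(n + 1)
    dm[0, :] = np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dm[i, j] = min(dm[i - 1, j] + 1,                                     # deletion
                           dm[i, j - 1] + 1,                                     # insertion
                           dm[i - 1, j - 1] + 2 * (str1[i - 1] != str2[j - 1]))  # substitution
    # backtrace from the bottom-right corner
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dm[i, j] == dm[i - 1, j - 1] + 2 * (str1[i - 1] != str2[j - 1]):
            pairs.append((str1[i - 1], str2[j - 1]))  # diag: substitution or match
            i, j = i - 1, j - 1
        elif i > 0 and dm[i, j] == dm[i - 1, j] + 1:
            pairs.append((str1[i - 1], "*"))          # down: deletion
            i -= 1
        else:
            pairs.append(("*", str2[j - 1]))          # left: insertion
            j -= 1
    return pairs[::-1]

print(align("intention", "execution"))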