text normalization => convert text to a more convenient, standard form
- word tokenization = separate out words
- Chinese doesn’t have spaces between words, so word tokenization is more difficult
- lemmatization = determine that 2 words have the same root
- stemming => strip suffixes, simpler version of lemmatization
- sentence segmentation: break up a text into individual sentences, using cues like periods or exclamations
edit distance: an algorithm used to compare the similarity between 2 strings
- spell correction
- computational biology
- evaluate machine translation and speech recognition
- named entity extraction and entity coreference
Regular Expressions
- disjunction: []
  - range: -, e.g. [A-Z], [a-z], [0-9]
- negation: caret ^
  - [^A-Z]: not an uppercase letter
  - [^e^]: neither e nor ^
- disjunction: pipe symbol |
  - a|b|c = [abc]
  - [gG]roundhog|[Ww]oodchuck
- ?: optional previous char, e.g. colou?r
- Kleene operators
  - *: 0 or more of previous char
  - +: 1 or more of previous char
- . = any character
- anchors
  - ^: beginning, e.g. ^[A-Z], ^[^A-Za-z]
  - $: end
  - \b: word boundary
  - \B: non-boundary
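As a quick check of these operators, a small sketch using Python's re module (the test strings are my own illustrative examples):

import re

print(re.findall(r'[A-Z]', "Hello World"))                              # ['H', 'W'] (range)
print(re.findall(r'[^A-Za-z]', "Hi there!"))                            # [' ', '!'] (negation)
print(re.findall(r'colou?r', "color colour"))                           # ['color', 'colour'] (optional char)
print(re.findall(r'[gG]roundhog|[Ww]oodchuck', "Groundhog woodchuck"))  # disjunction
print(re.findall(r'\bthe\b', "the other theology"))                     # ['the'] (word boundary)
print(bool(re.search(r'^[A-Z].*!$', "Go now!")))                        # True (anchors, Kleene *, .)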
RE Operator Precedence Hierarchy
from highest precedence to lowest precedence
- parentheses: ()
- counters: * + ? {}
- sequences and anchors
- disjunction: |
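A small illustration of the precedence hierarchy in Python's re (the example strings are my own):

import re

print(re.findall(r'the*', "thee theme th"))   # ['thee', 'the', 'th'] : * binds only to the preceding e
print(re.findall(r'(?:the)*m', "thethem m"))  # ['thethem', 'm'] : parentheses group the sequence before *
print(re.findall(r'^a|b$', "a...b"))          # ['a', 'b'] : | binds last, so this is (^a)|(b$), not ^(a|b)$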
Non-greedy Matching
another meaning of ?
- *? and +? match as little text as possible
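For example (illustrative string of my own):

import re

s = "<a>first</a><a>second</a>"
print(re.findall(r'<a>.*</a>', s))   # ['<a>first</a><a>second</a>'] : greedy, one long match
print(re.findall(r'<a>.*?</a>', s))  # ['<a>first</a>', '<a>second</a>'] : non-greedy, two short matches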
reduce error rate
- increase precision = minimize false positives
- increase recall = minimize false negatives
Text Normalization
- segment/tokenize words in running text
- normalize word formats
- segment sentences in running text
Word Tokenization
lemma: cat and cats = same lemma
wordform: cat and cats = different wordforms
Unix Tools for Crude Tokenization
less = display one page at a time
- space bar => page down
- q => quit
less hamlet.txt | less
tr = translate, transform
syntax: tr [OPTION] set1 [set2]
options:
- c = complement of set1
- s = squeeze (replace repeated characters in set1 with a single occurrence)
tr -sc 'A-Za-z' '\n' < hamlet.txt | less
sort
options:
- n = sort numerically
- r = reverse order
tr -sc 'A-Za-z' '\n' < hamlet.txt | sort | less
uniq = remove adjacent duplicate lines
uniq -c = display unique lines with count
tr -sc 'A-Za-z' '\n' < hamlet.txt | sort | uniq -c | less
sort by frequency
tr -sc 'A-Za-z' '\n' < hamlet.txt | sort | uniq -c | sort -n -r | less
tr 'A-Z' 'a-z' < hamlet.txt | tr -sc 'a-z' '\n' | sort | uniq -c | sort -n -r | less
Maximum Matching Algorithm
- word tokenization = word segmentation in Chinese (since no spaces between words)
- average word in Chinese is 2.4 characters long
- standard baseline segmentation algorithm: maximum matching (also called greedy)
MaxMatch
Given a wordlist of Chinese and a string:
1. start a pointer at the beginning of the string
2. find the longest word in the dictionary that matches the string starting at the pointer
3. move the pointer over that word in the string
4. go to step 2
Example
the table down there
thetabledownthere
theta bled own there
=> maximum matching doesn’t generally work in English (Chinese has much shorter words than English)
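A minimal sketch of MaxMatch in Python; the toy dictionary below is an assumption chosen to reproduce the example above, not a real wordlist:

def max_match(text, wordlist):
    words, i = [], 0
    max_len = max(len(w) for w in wordlist)
    while i < len(text):
        # try the longest candidate first, shrinking until a dictionary hit
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in wordlist or j == i + 1:  # fall back to a single character
                words.append(text[i:j])
                i = j
                break
    return words

dictionary = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", dictionary))  # ['theta', 'bled', 'own', 'there']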
quantify how well a segmenter works
- compare output segmentation with a perfect hand-segmented (gold) sentence
- metric: word error rate = normalized minimum edit distance = edit distance / length of the gold sentence in words (worked example after this list)
- problem of MaxMatch: unknown words (words not in dict)
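Worked example (using the toy segmentation above for illustration): gold "the table down there" vs. MaxMatch output "theta bled own there" differ in 3 of the 4 word positions (3 substitutions), so with unit costs the word error rate is 3/4 = 0.75.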
Word Normalization
BPE = byte-pair encoding
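A minimal sketch of the BPE merge-learning loop, assuming the usual setup of a word-count vocabulary whose words are sequences of symbols ending in an end-of-word marker; the toy corpus and number of merges are my own illustrative choices:

from collections import Counter

def get_pair_counts(vocab):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        for i in range(len(word) - 1):
            pairs[(word[i], word[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    # replace every occurrence of the pair with a single merged symbol
    merged = pair[0] + pair[1]
    new_vocab = {}
    for word, freq in vocab.items():
        symbols, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                symbols.append(merged)
                i += 2
            else:
                symbols.append(word[i])
                i += 1
        new_vocab[tuple(symbols)] = freq
    return new_vocab

# toy corpus: each word split into characters plus an end-of-word marker "_"
vocab = {("l", "o", "w", "_"): 5, ("l", "o", "w", "e", "r", "_"): 2,
         ("n", "e", "w", "e", "s", "t", "_"): 6, ("w", "i", "d", "e", "s", "t", "_"): 3}
for _ in range(5):  # learn 5 merges
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)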
Unix Tools for Crude Normalization
words ending in ing
tr -sc 'A-Za-z' '\n' < hamlet.txt | tr 'A-Z' 'a-z' | grep 'ing$' | sort | uniq -c | sort -n -r | less
tr -sc 'A-Za-z' '\n' < hamlet.txt | tr 'A-Z' 'a-z' | grep '[aeiou].*ing$' | sort | uniq -c | sort -n -r | less
Sentence Segmentation
- binary classifier: determine if a word is EndOfSentence/NotEndOfSentence
- handwritten rules
- regular expressions
- machine learning
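A minimal sketch of the handwritten-rules/regex approach (the boundary pattern and abbreviation list are my own simplifications):

import re

ABBREV = {"Mr.", "Mrs.", "Dr.", "e.g.", "i.e."}

def split_sentences(text):
    sentences, start = [], 0
    # candidate boundary: ., ! or ? followed by whitespace and an uppercase letter
    for m in re.finditer(r'[.!?]\s+(?=[A-Z])', text):
        candidate = text[start:m.end()].strip()
        last_token = candidate.split()[-1] if candidate.split() else ""
        if last_token in ABBREV:  # e.g. "Dr." is probably not a sentence end
            continue
        sentences.append(candidate)
        start = m.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived. He was late! The meeting had started."))
# ['Dr. Smith arrived.', 'He was late!', 'The meeting had started.']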
Minimum Edit Distance
- similarity between 2 strings
- minimum edit distance = minimum # of editing operations needed to transform one into the other
- insertion
- deletion
- substitution
- e.g. intention vs execution
- if each operation has a cost of 1, dist = 5
- if substitutions cost 2, dist = 8 (Levenshtein distance)
For 2 strings $X$ of length $n$ and $Y$ of length $m$, define $D(i,j)$ as
- the edit distance between $X[1...i]$ and $Y[1...j]$
- the edit distance between $X$ and $Y$ is thus $D(n,m)$
Dynamic Programming for Minimum Edit Distance
- tabular computation of $D(n,m)$
- compute $D(i,j)$ for small $i,j$
- compute larger $D(i,j)$ based on previously computed smaller values
- compute MinEdit (Levenshtein)
- initialization
$$ \begin{aligned} D(i,0)&=i\newline D(0,j)&=j \end{aligned} $$
- recurrence relation
For each $i=1,\ldots,n$
For each $j=1,\ldots,m$
$$
D(i,j)=\min
\begin{cases}
D(i-1,j)+1 & \text{(deletion)} \newline
D(i,j-1)+1 & \text{(insertion)} \newline
D(i-1,j-1)+
\begin{cases}
0 & \text{if } X(i)=Y(j)\newline
2 & \text{otherwise}
\end{cases}
& \text{(substitution)}
\end{cases}
$$
- termination: $D(n,m)$ is the distance
Backtrace
$$
ptr(i,j)=
\begin{cases}
\text{down} & \text{(deletion)}\newline
\text{left} & \text{(insertion)}\newline
\text{diag} & \text{(substitution)}
\end{cases}
$$
- every non-decreasing path from $(0,0)$ to $(m,n)$ corresponds to an alignment of the 2 sequences
- an optimal alignment is composed of optimal subalignments
Performance
- time: $O(nm)$
- space: $O(nm)$
- backtrace: $O(n+m)$
Python Code
import numpy as np

def levenshtein(str1, str2):
    if str1 == str2: return 0
    n, m = len(str1), len(str2)
    dm = np.zeros((n + 1, m + 1), dtype=int)  # distance matrix D
    # initialization: D(i,0) = i, D(0,j) = j
    for i in range(n + 1):
        dm[i, 0] = i
    for j in range(m + 1):
        dm[0, j] = j
    # recurrence relation
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a = dm[i - 1, j] + 1  # deletion
            b = dm[i, j - 1] + 1  # insertion
            c = dm[i - 1, j - 1] + 2 * (str1[i - 1] != str2[j - 1])  # substitution (cost 2)
            dm[i, j] = min(a, b, c)
    return dm[n, m]

print(levenshtein("intention", "execution"))  # 8
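A sketch of the backtrace described above, reusing the same matrix fill and then walking from $(n,m)$ back to $(0,0)$ to recover one optimal alignment; the align() name and the "*" gap symbol are my own choices:

import numpy as np

def align(str1, str2):
    n, m = len(str1), len(str2)
    dm = np.zeros((n + 1, m + 1), dtype=int)
    dm[:, 0] = np.arange(n + 1)
    dm[0, :] = np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dm[i, j] = min(dm[i - 1, j] + 1,                                     # deletion
                           dm[i, j - 1] + 1,                                     # insertion
                           dm[i - 1, j - 1] + 2 * (str1[i - 1] != str2[j - 1]))  # substitution
    # backtrace from the bottom-right corner
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dm[i, j] == dm[i - 1, j - 1] + 2 * (str1[i - 1] != str2[j - 1]):
            pairs.append((str1[i - 1], str2[j - 1]))  # diag: substitution or match
            i, j = i - 1, j - 1
        elif i > 0 and dm[i, j] == dm[i - 1, j] + 1:
            pairs.append((str1[i - 1], "*"))          # down: deletion
            i -= 1
        else:
            pairs.append(("*", str2[j - 1]))          # left: insertion
            j -= 1
    return pairs[::-1]

print(align("intention", "execution"))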