Cost Function
Logistic regression $$ \min_{\theta}\frac{1}{N}\sum_{i=1}^N \left[y^{(i)}\left(-\log \frac{1}{1+e^{-\theta^T x^{(i)}}}\right)+(1-y^{(i)})\left(-\log \frac{e^{-\theta^T x^{(i)}}}{1+e^{-\theta^T x^{(i)}}}\right)\right] +\frac{\lambda}{2N}\sum_{j=1}^p\theta_j^2 $$
- if $y=1$, want $\theta^Tx\gg 0$
- if $y=0$, want $\theta^Tx\ll 0$
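A minimal numpy sketch of this regularized cost (the function name `logistic_cost` and the data conventions are assumptions: `X` is the $N\times(p+1)$ design matrix with a leading column of ones, `y` holds 0/1 labels):

```python
import numpy as np

def logistic_cost(theta, X, y, lam):
    """Regularized logistic regression cost, matching the formula above."""
    N = X.shape[0]
    z = X @ theta                               # theta^T x^(i) for every example
    h = 1.0 / (1.0 + np.exp(-z))                # sigmoid hypothesis
    data_term = np.mean(-y * np.log(h) - (1.0 - y) * np.log(1.0 - h))
    reg_term = (lam / (2.0 * N)) * np.sum(theta[1:] ** 2)   # theta_0 is not regularized
    return data_term + reg_term
```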
SVM $$ \min_{\theta} C\sum_{i=1}^N \left[y^{(i)}cost_1(\theta^Tx^{(i)}) +(1-y^{(i)})cost_0(\theta^T x^{(i)})\right] +\frac{1}{2}\sum_{j=1}^p\theta_j^2 $$
- if $y=1$, want $\theta^Tx\geq 1$
- if $y=0$, want $\theta^Tx\leq -1$
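The exact shape of $cost_1$ and $cost_0$ is not pinned down above; a common concrete choice is the hinge pair below (that choice and the function names are assumptions, with the same `X`/`y` conventions as the sketch above):

```python
import numpy as np

def cost_1(z):                  # surrogate used when y = 1; zero once z >= 1
    return np.maximum(0.0, 1.0 - z)

def cost_0(z):                  # surrogate used when y = 0; zero once z <= -1
    return np.maximum(0.0, 1.0 + z)

def svm_cost(theta, X, y, C):
    """SVM objective above: C * sum of surrogate losses + (1/2) * sum_{j>=1} theta_j^2."""
    z = X @ theta
    data_term = C * np.sum(y * cost_1(z) + (1.0 - y) * cost_0(z))
    reg_term = 0.5 * np.sum(theta[1:] ** 2)     # theta_0 is not regularized
    return data_term + reg_term
```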
Large Margin Classification
- SVM Decision Boundary
$$
\min_{\theta} \frac{1}{2}\sum_{j=1}^p\theta_j^2\\
\text{s.t. } \begin{cases}
\theta^Tx^{(i)}\geq 1 &\text{if } y^{(i)}=1\newline
\theta^Tx^{(i)}\leq -1 &\text{if } y^{(i)}=0\newline
\end{cases}
$$
Simplification: $\theta_0=0, p=2$
$$
\min_{\theta} \frac{1}{2}\Vert\theta\Vert_2^2\\
\text{s.t. } \begin{cases}
u^{(i)}\Vert\theta\Vert \geq 1 &\text{if } y^{(i)}=1\newline
u^{(i)}\Vert\theta\Vert \leq -1 &\text{if } y^{(i)}=0\newline
\end{cases}
$$
where $u^{(i)}$ is the (signed) projection of $x^{(i)}$ onto $\theta$.
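Why minimizing $\Vert\theta\Vert$ yields a large margin:
$$
u^{(i)}\Vert\theta\Vert \geq 1 \implies u^{(i)} \geq \frac{1}{\Vert\theta\Vert},
\qquad
u^{(i)}\Vert\theta\Vert \leq -1 \implies u^{(i)} \leq -\frac{1}{\Vert\theta\Vert}
$$
so the constraints can only be satisfied with a small $\Vert\theta\Vert$ if every projection $|u^{(i)}|$ is large, i.e. every example lies far from the decision boundary.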
Kernels
kernels = similarity functions
Gaussian kernel: landmark $l^{(k)}$ (sketched in code below)
$$ f_k=\text{similarity}(x,l^{(k)})=\exp\left(-\frac{\Vert x-l^{(k)}\Vert^2}{2\sigma^2}\right) $$
- if $x\approx l^{(k)}$, $f_k\approx 1$
- if $x$ is far from $l^{(k)}$, $f_k\approx 0$
- predict 1 when $\theta^T f\geq 0$
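A one-function numpy sketch of the similarity above (the name `gaussian_similarity` is an assumption):

```python
import numpy as np

def gaussian_similarity(x, l, sigma2):
    """f_k = exp(-||x - l||^2 / (2 * sigma^2)) for a single landmark l."""
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma2))

x = np.array([1.0, 2.0])
print(gaussian_similarity(x, np.array([1.0, 2.1]), sigma2=1.0))  # near the landmark: ~0.995
print(gaussian_similarity(x, np.array([8.0, 9.0]), sigma2=1.0))  # far from it: ~1e-21, i.e. ~0
```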
choice of landmarks
- Given $\{(x^{(i)}, y^{(i)})\}_{i=1}^N$
- Choose $l^{(i)}=x^{(i)},\ i=1,2,\ldots,N$
- Given example $x$, $$ f :=\begin{pmatrix} f_0\newline f_1\newline \vdots\newline f_N \end{pmatrix},\ \ \text{where } f_0:=1,\ f_k :=\text{similarity}(x, l^{(k)})=\text{similarity}(x, x^{(k)}) $$
- hypothesis: given $x$, compute features $f\in\mathbb{R}^{N+1}$. Predict $y=1$ if $\theta^T f\geq 0$.
- training (see the code sketch after this list): $$ \min_{\theta}C\sum_{i=1}^N\left[y^{(i)}cost_1(\theta^T f^{(i)}) + (1-y^{(i)})cost_0(\theta^T f^{(i)})\right]+\frac{1}{2}\sum_{j=1}^N \theta_j^2 $$
- replace $\sum_{j=1}^N \theta_j^2=\theta^T\theta$ (ignoring $\theta_0$) with $\theta^TM\theta$ (a rescaling, for computational efficiency)
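A sketch of this recipe with scikit-learn's liblinear wrapper, on synthetic placeholder data (the helper name `landmark_features` is an assumption, and `LinearSVC` also regularizes the intercept, so it only approximates the objective above where $\theta_0$ is left unpenalized):

```python
import numpy as np
from sklearn.svm import LinearSVC   # liblinear under the hood

def landmark_features(X, landmarks, sigma2):
    """F[i, k] = exp(-||x^(i) - l^(k)||^2 / (2 * sigma^2)); the f_0 = 1
    intercept term is handled by the classifier itself."""
    diff = X[:, None, :] - landmarks[None, :, :]            # (N, K, p)
    return np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * sigma2))

# Synthetic placeholder data: a circular boundary that a linear model cannot fit.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] ** 2 + X_train[:, 1] ** 2 > 1.0).astype(int)

sigma2 = 0.5
landmarks = X_train                                          # l^(i) = x^(i)
F_train = landmark_features(X_train, landmarks, sigma2)      # (N, N) feature matrix

clf = LinearSVC(C=1.0)                                       # linear SVM on the kernel features
clf.fit(F_train, y_train)

# New points are featurized against the *training* landmarks before predicting.
X_new = rng.normal(size=(5, 2))
print(clf.predict(landmark_features(X_new, landmarks, sigma2)))
```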
Hyperparameters
$C=\frac{1}{\lambda}$
- large $C$: lower bias, higher variance
- small $C$: higher bias, lower variance
$\sigma^2$
- large $\sigma^2$: features $f_i$ vary more smoothly; higher bias, lower variance
- small $\sigma^2$: features $f_i$ vary less smoothly; lower bias, higher variance
address overfitting
- decrease $C$
- increase $\sigma^2$
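In scikit-learn's Gaussian (RBF) kernel the bandwidth enters as `gamma` $=\frac{1}{2\sigma^2}$, so increasing $\sigma^2$ means decreasing `gamma`; a small cross-validation sketch over $C$ and $\sigma^2$ on placeholder data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic placeholder data with a nonlinear boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

sigma2_grid = [0.1, 0.5, 2.0, 10.0]
param_grid = {
    "C": [0.1, 1.0, 10.0, 100.0],
    "gamma": [1.0 / (2.0 * s2) for s2 in sigma2_grid],   # gamma = 1 / (2 sigma^2)
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)   # if overfitting: prefer smaller C / larger sigma^2 (smaller gamma)
```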
SVM in Practice
- use SVM software packages (e.g. liblinear, libsvm) to solve for parameters $\theta$ (a scikit-learn sketch follows the list below)
- need to specify
    - choice of $C$
    - choice of kernel
        - no kernel = “linear kernel”, predict $y=1$ if $\theta^Tx\geq 0$
        - Gaussian kernel: need to choose $\sigma^2$, do perform feature scaling beforehand
        - other choices of kernel: need to satisfy Mercer’s Theorem
            - polynomial kernel: $k(x,l)=(x^Tl + c)^d$
            - string kernel
            - chi-square kernel
            - histogram intersection kernel
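A practical sketch of these choices with scikit-learn (which wraps libsvm/liblinear); the hyperparameter values are placeholders:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# No kernel ("linear kernel"): predict y = 1 if theta^T x >= 0.
linear_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))

# Gaussian kernel: choose sigma^2 via gamma = 1/(2 sigma^2); scaling happens first in the pipeline.
gaussian_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.5))

# Polynomial kernel k(x, l) = (x^T l + c)^d: with gamma=1.0 this maps to degree=d, coef0=c.
poly_svm = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3, coef0=1.0, gamma=1.0, C=1.0))
```

Each pipeline is then trained with `.fit(X_train, y_train)` and used with `.predict(...)` as usual.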
Multiclass SVM Classification
- one-versus-all: train one SVM per class (class $k$ versus the rest) and predict the class whose $(\theta^{(k)})^T x$ is largest
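A sketch of one-versus-all with scikit-learn (note that `SVC` itself handles multiclass internally via one-versus-one; the data below is a synthetic placeholder):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic placeholder data with 3 classes (0, 1, 2).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int) + (X[:, 1] > 0).astype(int)

# Trains one binary SVM per class and predicts the class whose
# decision value (theta^(k))^T x is largest.
ovr = OneVsRestClassifier(LinearSVC(C=1.0))
ovr.fit(X, y)
print(ovr.predict(X[:5]))
```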
Logistic Regression versus SVMs
- if # of features is large (relative to # of examples): use logistic regression, or SVM without a kernel (linear kernel)
- if # of features is small, # of examples is intermediate: use SVM with Gaussian kernel
- if # of features is small, # of examples is large: add/create more features, then use logistic regression or SVM without a kernel
- neural networks likely to work well for most of these settings, but may be slower to train
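A rough paraphrase of these guidelines in code; the thresholds are illustrative assumptions, not from the notes:

```python
def suggest_model(n_features, n_examples):
    """Very rough paraphrase of the guidelines above (illustrative thresholds)."""
    if n_features >= n_examples:
        return "logistic regression, or SVM with a linear kernel"
    if n_examples <= 10_000:
        return "SVM with a Gaussian kernel"
    return "add/create more features, then logistic regression or linear-kernel SVM"

print(suggest_model(n_features=10_000, n_examples=1_000))    # many features, few examples
print(suggest_model(n_features=50, n_examples=5_000))        # few features, intermediate examples
print(suggest_model(n_features=50, n_examples=1_000_000))    # few features, many examples
```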