Linear Regression with One Variable
- $X$: space of input values; $Y$: space of output values
- training set: $\{(x^{(i)}, y^{(i)})\}_{i=1}^N$
- goal: given a training set, to learn a function $h : X \rightarrow Y$ so that $h(x)$ is a “good” predictor for the corresponding value of $y$. For historical reasons, this function $h$ is called a hypothesis.
- hypothesis: $h(x)=h_{\theta}(x) = \theta_0 + \theta_1 x$
cost function (squared error function, or mean squared error, MSE):
$$ J(\theta_0, \theta_1)= \frac{1}{2N}\sum _{i=1}^N \left(\hat{y}^{(i)} -y^{(i)} \right)^2 =\frac{1}{2N}\sum _{i=1}^N \left(h(x^{(i)}) - y^{(i)}\right)^2 $$
To break it apart, it is $\frac{1}{2} \bar{x}$, where $\bar{x}$ is the mean of the squares of $h_\theta (x^{(i)}) - y^{(i)}$, i.e. of the differences between the predicted values and the actual values. The mean is halved $\left(\frac{1}{2}\right)$ as a convenience for the computation of gradient descent, since differentiating the squared term produces a factor of 2 that cancels the $\frac{1}{2}$.
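As a quick check of the formula, a minimal NumPy sketch (the data points and parameter values below are made up for illustration):

```python
import numpy as np

# made-up example data, for illustration only
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.5, 3.5])

def cost(theta0, theta1):
    pred = theta0 + theta1 * x           # h_theta(x^(i)) for every example
    return np.mean((pred - y) ** 2) / 2  # halved mean of the squared errors

print(cost(0.0, 1.0))  # J(0, 1)
```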
- optimization: minimize $J(\theta_0, \theta_1)$ w.r.t. the parameters $(\theta_0, \theta_1)$
- Gradient Descent:
Outline:
- Initialize $\theta_0, \theta_1$
- Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum
Algorithm:
repeat until convergence {
$\theta_j \leftarrow \theta_j -\alpha\frac{\partial}{\partial \theta_j}J(\theta_0, \theta_1),\ \ \ j=0,1$
}
$\alpha$: learning rate (step size)
Correct: simultaneous update
$temp0 \leftarrow \theta_0 -\alpha\frac{\partial}{\partial \theta_0}J(\theta_0, \theta_1)$
$temp1 \leftarrow \theta_1 -\alpha\frac{\partial}{\partial \theta_1}J(\theta_0, \theta_1)$
$\theta_0 \leftarrow temp0 $
$\theta_1 \leftarrow temp1 $
Incorrect: WHY? Because $\theta_0$ is updated before the gradient for $\theta_1$ is computed, the $\theta_1$ update uses the new $\theta_0$ instead of the old one, so the two partial derivatives are not evaluated at the same point.
$temp0 \leftarrow \theta_0 -\alpha\frac{\partial}{\partial \theta_0}J(\theta_0, \theta_1)$
$\theta_0 \leftarrow temp0 $
$temp1 \leftarrow \theta_1 -\alpha\frac{\partial}{\partial \theta_1}J(\theta_0, \theta_1)$
$\theta_1 \leftarrow temp1 $
- learning rate $\alpha$:
  - too small: gradient descent can be slow
  - too large: gradient descent may fail to converge, or even diverge
  - as we approach a local minimum, gradient descent will automatically take smaller steps, even with fixed learning rate
- apply gradient descent to linear regression:
$$ \frac{\partial}{\partial \theta_j}J(\theta_0, \theta_1) =\frac{\partial}{\partial \theta_j}\frac{1}{2N}\sum _{i=1}^N \left(h(x^{(i)}) - y^{(i)}\right)^2= \begin{cases} \frac{1}{N}\sum _{i=1}^N \left(h(x^{(i)}) - y^{(i)}\right), & j=0\newline \frac{1}{N}\sum _{i=1}^N \left(h(x^{(i)}) - y^{(i)}\right)x^{(i)}, & j=1 \end{cases} $$
- cost function of linear regression is convex $\Rightarrow$ no local minima (only the global minimum)
- Batch Gradient Descent: each step of gradient descent uses all the training examples
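A minimal sketch of batch gradient descent for the one-variable case, using the two partial derivatives above and a simultaneous update via temporaries; the data, $\alpha$, and the iteration count are arbitrary illustrative choices:

```python
import numpy as np

# made-up training data; alpha and n_iters are arbitrary illustrative choices
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 3.5, 4.5])
alpha, n_iters = 0.05, 1000
theta0, theta1 = 0.0, 0.0

for _ in range(n_iters):
    err = (theta0 + theta1 * x) - y            # h(x^(i)) - y^(i) for all examples
    temp0 = theta0 - alpha * np.mean(err)      # gradient step for theta_0
    temp1 = theta1 - alpha * np.mean(err * x)  # gradient step for theta_1
    theta0, theta1 = temp0, temp1              # simultaneous update
```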
Multivariate Linear Regression
- notations:
  - $p$: # of features
  - $\mathbf{x}^{(i)}$: input of the $i$th training example
  - $x^{(i)}_j$: value of feature $j$ in the $i$th training example
- hypothesis: define $x_0^{(i)}=1,\ \forall i$
$$ h_{\theta}(\mathbf{x}) =\mathbf{\theta}^T \mathbf{x} = \begin{pmatrix} \theta_0 &\theta_1& \cdots& \theta_p \end{pmatrix} \begin{pmatrix} x_0\newline x_1\newline \vdots\newline x_p \end{pmatrix} $$
- cost function:
$$ J(\mathbf{\theta}) = \frac{1}{2N}\sum_{i=1}^N\left(h(\mathbf{x}^{(i)})-y^{(i)}\right)^2 $$
- Gradient Descent:
Algorithm:
repeat until convergence {
$\theta_j \leftarrow \theta_j -\alpha\frac{\partial}{\partial \theta_j}J(\mathbf{\theta})=\theta_j -\alpha\frac{1}{N}\sum _{i=1}^N \left(h(\mathbf{x}^{(i)}) - y^{(i)}\right)x^{(i)}_j$
} (simultaneously update for $j=0,1,...,p$)
$\alpha$: learning rate (step size)
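A small sketch of the vectorized hypothesis with the extra $x_0=1$ column prepended; the feature values and $\mathbf{\theta}$ below are made up:

```python
import numpy as np

# made-up raw features (N x p) and parameters, for illustration only
features = np.array([[2.0, 3.0],
                     [1.0, 5.0],
                     [4.0, 2.0]])
X = np.hstack([np.ones((features.shape[0], 1)), features])  # N x (p+1), first column x_0 = 1
theta = np.array([0.5, 1.0, -0.2])                          # (theta_0, theta_1, ..., theta_p)

h = X @ theta  # h_theta(x^(i)) for every training example at once
```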
Gradient Descent in Practice
Feature Scaling
- idea: make sure features are on a similar scale, i.e. approximately in the $[-1, 1]$ range
Gradient Descent will descend slowly on large ranges, and so will oscillate inefficiently down to the optimum when the feature scales are very uneven.
Feature scaling can speed up Gradient Descent.
- mean normalization: replace $x_i$ with $x_i-\mu_i$ to make features have approximately zero mean
$$ x_i \leftarrow \frac{x_i-\mu_i}{s_i}, \quad s_i \text{ is the range or standard deviation of feature } i $$
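A minimal mean-normalization sketch (made-up feature values; here $s_i$ is taken to be the standard deviation; the $x_0=1$ column would be added after scaling and is not normalized):

```python
import numpy as np

# made-up raw features; each column is one feature
features = np.array([[2104.0, 3.0],
                     [1600.0, 3.0],
                     [2400.0, 4.0]])

mu = features.mean(axis=0)  # per-feature mean mu_i
s = features.std(axis=0)    # per-feature spread s_i (standard deviation here)
features_scaled = (features - mu) / s  # roughly zero mean, similar scale
```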
Learning Rate
- fix $\alpha$, plot $J(\mathbf{\theta})$ versus the no. of iterations
  - $\alpha$ too small: slow convergence
  - $\alpha$ too large: may not converge
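A sketch of the diagnostic plot, assuming `cost_history` holds $J(\mathbf{\theta})$ recorded once per iteration of gradient descent (the values below are placeholders):

```python
import matplotlib.pyplot as plt

# placeholder values standing in for J(theta) recorded once per iteration
cost_history = [5.2, 3.1, 2.0, 1.4, 1.1, 0.95, 0.88, 0.85]

plt.plot(range(len(cost_history)), cost_history)
plt.xlabel("iteration")
plt.ylabel("J(theta)")
plt.title("J should decrease on every iteration if alpha is well chosen")
plt.show()
```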
Vectorization
- compute cost:
$$ J(\theta) = \frac{1}{2N}(X\theta-y)^T(X\theta-y) $$
```python
import numpy as np

# residual
resid = np.dot(X, theta) - y
cost = np.mean(resid ** 2) / 2
```
- gradient descent:
$$
\theta_j \leftarrow \theta_j -\alpha\frac{\partial}{\partial \theta_j}J(\mathbf{\theta})\\
\Rightarrow \mathbf{\theta} \leftarrow \mathbf{\theta}-\alpha\frac{d}{d\mathbf{\theta}}J(\mathbf{\theta})=\mathbf{\theta}-\frac{\alpha}{N}X^T (X\theta -y)
$$
```python
# vectorized update: theta <- theta - (alpha / N) * X^T (X theta - y)
N = X.shape[0]
theta -= alpha / N * np.dot(X.T, resid)
```
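Putting the vectorized cost and update together, a minimal end-to-end sketch (the design matrix, $\alpha$, and the iteration count are made-up illustrative choices):

```python
import numpy as np

# made-up design matrix (first column x_0 = 1) and outputs; alpha, n_iters arbitrary
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0]])
y = np.array([4.0, 6.0, 10.0])
N, alpha, n_iters = X.shape[0], 0.05, 2000
theta = np.zeros(X.shape[1])

cost_history = []
for _ in range(n_iters):
    resid = X @ theta - y
    cost_history.append(np.mean(resid ** 2) / 2)  # J(theta), for the plot above
    theta -= alpha / N * (X.T @ resid)            # simultaneous update of all theta_j
```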
Normal Equation
- design matrix:
$$ X_{N\times(p+1)}=\begin{pmatrix} \mathbf{1} &\mathbf{x}_1 &\cdots &\mathbf{x}_p \end{pmatrix} $$
where $$ \mathbf{x}_j = \begin{pmatrix} x_j^{(1)}\newline x_j^{(2)}\newline \vdots\newline x_j^{(N)} \end{pmatrix} $$
- output:
$$ \mathbf{y}=\begin{pmatrix} y^{(1)}\newline y^{(2)}\newline \vdots\newline y^{(N)} \end{pmatrix} $$
$$ \Rightarrow \hat{\mathbf{\theta}} = (X^TX)^{-1}X^T\mathbf{y} $$
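A minimal NumPy sketch of the normal equation (made-up data; solving the linear system $X^TX\,\theta = X^T\mathbf{y}$ is numerically preferable to forming the inverse explicitly):

```python
import numpy as np

# made-up design matrix (first column x_0 = 1) and outputs
X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0]])
y = np.array([4.0, 6.0, 10.0])

# solve X^T X theta = X^T y instead of computing the inverse explicitly
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```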
| Gradient Descent | Normal Equation |
|---|---|
| need to choose learning rate | no need to choose learning rate |
| need feature scaling | no need to do feature scaling |
| many iterations | no need to iterate |
| $O(kp^2)$, $k$ is the no. of iterations | $O(p^3)$, need to compute $(X^TX)^{-1}$ |
| works well when $p$ is large | slow if $p$ is very large |
Normal Equation Noninvertibility
- what if $X^TX$ is noninvertible (singular/degenerate)?
  - redundant features (linearly dependent)
  - too many features ($N\leq p$) $\Rightarrow$ delete some features, or use regularization
  - use the pseudo-inverse
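A short sketch of the pseudo-inverse fallback (the duplicated column below is made up to make $X^TX$ singular):

```python
import numpy as np

# made-up design matrix with a duplicated (linearly dependent) feature,
# so X^T X is singular and cannot be inverted
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 5.0, 5.0]])
y = np.array([4.0, 6.0, 10.0])

# np.linalg.pinv still returns the minimum-norm least-squares solution
theta_hat = np.linalg.pinv(X) @ y
```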