ML - Linear Regression with Multiple Variables

2017/11/01

Categories: Python machineLearning Tags: linearRegression

Linear Regression with One Variable

$$ J(\theta_0, \theta_1)= \frac{1}{2N}\sum _{i=1}^N \left(\hat{y}^{(i)} -y^{(i)} \right)^2 =\frac{1}{2N}\sum _{i=1}^N \left(h(x^{(i)}) - y^{(i)}\right)^2 $$

To break it apart, it is $\frac{1}{2} \bar{x}$ where $\bar{x}$ is the mean of the squares of $h(x^{(i)}) - y^{(i)}$, the difference between the predicted value and the actual value.

The mean is halved $\left(\frac{1}{2}\right)$ as a convenience for the computation of gradient descent: differentiating the squared term produces a factor of $2$ that cancels the $\frac{1}{2}$.
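As a minimal sketch of this cost in NumPy (assuming 1-D arrays `x` and `y` of equal length and the hypothesis $h(x)=\theta_0+\theta_1 x$; the function name is illustrative):

import numpy as np

def compute_cost(x, y, theta0, theta1):
    # J(theta0, theta1) = 1/(2N) * sum_i (h(x^(i)) - y^(i))^2
    errors = (theta0 + theta1 * x) - y   # h(x^(i)) - y^(i)
    return np.mean(errors ** 2) / 2      # halved mean of squared errors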

Outline:

  1. Initialize $\theta_0, \theta_1$
  2. Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum

Algorithm:

repeat until convergence{

$\theta_j \leftarrow \theta_j -\alpha\frac{\partial}{\partial \theta_j}J(\theta_0, \theta_1),\ \ \ j=0,1$

}

$\alpha$: learning rate (step size)


Correct: simultaneous update

$temp0 \leftarrow \theta_0 -\alpha\frac{\partial}{\partial \theta_0}J(\theta_0, \theta_1)$

$temp1 \leftarrow \theta_1 -\alpha\frac{\partial}{\partial \theta_1}J(\theta_0, \theta_1)$

$\theta_0 \leftarrow temp0 $

$\theta_1 \leftarrow temp1 $


Incorrect: non-simultaneous update

$temp0 \leftarrow \theta_0 -\alpha\frac{\partial}{\partial \theta_0}J(\theta_0, \theta_1)$

$\theta_0 \leftarrow temp0 $

$temp1 \leftarrow \theta_1 -\alpha\frac{\partial}{\partial \theta_1}J(\theta_0, \theta_1)$

$\theta_1 \leftarrow temp1 $

Here $\theta_0$ is already overwritten before $temp1$ is computed, so the partial derivative for $\theta_1$ is evaluated at the new $\theta_0$ rather than at the original $(\theta_0, \theta_1)$; the two updates no longer come from the same point, which is why both parameters must be updated simultaneously.

$$ \frac{\partial}{\partial \theta_j}J(\theta_0, \theta_1) =\frac{\partial}{\partial \theta_j}\frac{1}{2N}\sum _{i=1}^N \left(h(x^{(i)}) - y^{(i)}\right)^2= \begin{cases} \frac{1}{N}\sum _{i=1}^N \left(h(x^{(i)}) - y^{(i)}\right), & j=0\newline \frac{1}{N}\sum _{i=1}^N \left(h(x^{(i)}) - y^{(i)}\right)x^{(i)}, & j=1 \end{cases} $$
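Combining the update rule with these derivatives, a sketch of one-variable gradient descent in NumPy (assuming arrays `x`, `y`, a step size `alpha`, and a fixed iteration count; names are illustrative, not a library API):

import numpy as np

def gradient_descent_1var(x, y, alpha=0.01, n_iters=1000):
    N = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        errors = (theta0 + theta1 * x) - y       # h(x^(i)) - y^(i)
        grad0 = np.sum(errors) / N               # dJ/dtheta0
        grad1 = np.sum(errors * x) / N           # dJ/dtheta1
        # both gradients are computed before either parameter changes: simultaneous update
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1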


Multivariate Linear Regression

$$ h_{\theta}(\mathbf{x}) =\mathbf{\theta}^T \mathbf{x} = \begin{pmatrix} \theta_0 &\theta_1& \cdots& \theta_p \end{pmatrix} \begin{pmatrix} x_0\newline x_1\newline \vdots\newline x_p \end{pmatrix} $$

$$ J(\mathbf{\theta}) = \frac{1}{2N}\sum_{i=1}^N\left(h(\mathbf{x}^{(i)})-y^{(i)}\right)^2 $$

Algorithm:

repeat until convergence{

$\theta_j \leftarrow \theta_j -\alpha\frac{\partial}{\partial \theta_j}J(\mathbf{\theta})=\theta_j -\alpha\frac{1}{N}\sum _{i=1}^N \left(h(\mathbf{x}^{(i)}) - y^{(i)}\right)x^{(i)}_j$

} (simultaneously update for $j=0,1,...,p$)

$\alpha$: learning rate (step size)
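A direct, feature-by-feature sketch of this loop (assuming `X` is the $N\times(p+1)$ design matrix whose first column is all ones and `y` is the target vector; names are illustrative):

import numpy as np

def gradient_descent_multi(X, y, alpha=0.01, n_iters=1000):
    N, n_params = X.shape                        # n_params = p + 1
    theta = np.zeros(n_params)
    for _ in range(n_iters):
        errors = X.dot(theta) - y                # h(x^(i)) - y^(i) for every example
        new_theta = theta.copy()
        for j in range(n_params):                # simultaneous update over j = 0, 1, ..., p
            new_theta[j] = theta[j] - alpha / N * np.sum(errors * X[:, j])
        theta = new_theta
    return theta

The inner loop over $j$ disappears once the update is vectorized, as shown in the Vectorization section below.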


Gradient Descent in Practice

Feature Scaling

Gradient descent descends slowly and oscillates inefficiently on its way down to the optimum when the features span very different ranges, because the contours of $J(\theta)$ become long and narrow.

Feature scaling can speed up Gradient Descent.

$$ x_j \leftarrow \frac{x_j-\mu_j}{s_j} $$

where $\mu_j$ is the mean of feature $j$ and $s_j$ is its range or standard deviation.

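A small sketch of mean normalization (assuming `X` holds the raw feature columns, without the leading column of ones):

import numpy as np

def feature_scale(X):
    mu = X.mean(axis=0)              # per-feature mean
    s = X.std(axis=0)                # per-feature standard deviation (the range also works)
    return (X - mu) / s, mu, s       # keep mu and s to transform new inputs the same way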


Learning Rate

If $\alpha$ is too small, gradient descent converges slowly; if $\alpha$ is too large, $J(\theta)$ may fail to decrease on every iteration and can even diverge. A practical check is to plot $J(\theta)$ against the number of iterations and confirm that it decreases steadily.


Vectorization

$$ J(\theta) = \frac{1}{2N}(X\theta-y)^T(X\theta-y) $$

import numpy as np

# X: N x (p+1) design matrix (first column all ones), theta: parameter vector, y: targets
resid = np.dot(X, theta) - y       # residual vector X*theta - y
cost = np.mean(resid ** 2) / 2     # J(theta) = 1/(2N) * sum of squared residuals

$$ \theta_j \leftarrow \theta_j -\alpha\frac{\partial}{\partial \theta_j}J(\mathbf{\theta}) \newline \Rightarrow \mathbf{\theta} \leftarrow \mathbf{\theta}-\alpha\frac{d}{d\mathbf{\theta}}J(\mathbf{\theta})=\mathbf{\theta}-\frac{\alpha}{N}X^T (X\theta -y) $$

# gradient step; N = number of training examples (rows of X)
theta -= learning_rate / N * np.dot(X.T, resid)
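Putting the two pieces together, a sketch of the complete vectorized loop (assuming `X` already includes the column of ones and is feature-scaled) that also records $J(\theta)$ per iteration for the learning-rate check above:

import numpy as np

def gradient_descent_vectorized(X, y, learning_rate=0.01, n_iters=400):
    N, n_params = X.shape
    theta = np.zeros(n_params)
    cost_history = []
    for _ in range(n_iters):
        resid = np.dot(X, theta) - y                      # residual vector
        cost_history.append(np.mean(resid ** 2) / 2)      # J(theta) at this iteration
        theta -= learning_rate / N * np.dot(X.T, resid)   # vectorized gradient step
    return theta, cost_history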

Normal Equation

$$ X_{N\times(p+1)}=\begin{pmatrix} \mathbf{1} &\mathbf{x}_1 &\cdots &\mathbf{x}_p \end{pmatrix} $$

where $$ \mathbf{x}_j = \begin{pmatrix} x_j^{(1)}\newline x_j^{(2)}\newline \vdots\newline x_j^{(N)} \end{pmatrix} $$

$$ \mathbf{y}=\begin{pmatrix} y^{(1)}\newline y^{(2)}\newline \vdots\newline y^{(N)} \end{pmatrix} $$

$$ \Rightarrow \hat{\mathbf{\theta}} = (X^TX)^{-1}X^T\mathbf{y} $$
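In code, a sketch that solves the normal equations directly; solving the linear system $X^TX\,\theta = X^T\mathbf{y}$ is preferable to forming $(X^TX)^{-1}$ explicitly:

import numpy as np

def normal_equation(X, y):
    # solve (X^T X) theta = X^T y instead of computing the inverse
    return np.linalg.solve(X.T.dot(X), X.T.dot(y))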

| Gradient Descent | Normal Equation |
| --- | --- |
| need to choose learning rate $\alpha$ | no need to choose a learning rate |
| needs feature scaling | no need for feature scaling |
| many iterations | no iteration |
| $O(kp^2)$, $k$ is the number of iterations | $O(p^3)$, needs to compute $(X^TX)^{-1}$ |
| works well even when $p$ is large | slow if $p$ is very large |

Normal Equation Noninvertibility

$X^TX$ can be noninvertible (singular) when features are redundant (linearly dependent, e.g. one feature is a constant multiple of another) or when there are too many features relative to examples ($p \ge N$). Remedies include deleting redundant features, adding regularization, or using a pseudo-inverse.
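A sketch of the pseudo-inverse fallback, which still returns a solution when $X^TX$ is singular:

import numpy as np

def normal_equation_pinv(X, y):
    # np.linalg.pinv is based on the SVD, so it works even when X^T X
    # is not invertible; equivalently, np.linalg.pinv(X).dot(y)
    return np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)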

Code in Python

jupyter notebook