Linear Regression with One Variable
$X$: space of input values; $Y$: space of output values
training set: $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$
goal: given a training set, learn a function $h: X \to Y$ so that $h(x)$ is a "good" predictor for the corresponding value of $y$. For historical reasons, this function $h$ is called a hypothesis.
$$h(x) = h_\theta(x) = \theta_0 + \theta_1 x$$
cost function (squared error function, or mean squared error, MSE):
$$J(\theta_0, \theta_1) = \frac{1}{2N}\sum_{i=1}^{N}\left(\hat{y}^{(i)} - y^{(i)}\right)^2 = \frac{1}{2N}\sum_{i=1}^{N}\left(h(x^{(i)}) - y^{(i)}\right)^2$$
To break it apart, it is $\frac{1}{2}\bar{x}$, where $\bar{x}$ is the mean of the squares of $h_\theta(x^{(i)}) - y^{(i)}$, i.e. of the differences between the predicted and actual values.
The mean is halved ($\frac{1}{2}$) as a convenience for the computation of the gradient descent: differentiating the square term produces a factor of 2 that cancels the $\frac{1}{2}$.
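As a sketch of the definition above (the function name and toy data are my own), the cost can be computed directly:

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(theta0, theta1) = 1/(2N) * sum_i (h(x_i) - y_i)^2."""
    pred = theta0 + theta1 * x           # h_theta(x) for every training example
    return np.mean((pred - y) ** 2) / 2  # halved mean of squared residuals

# toy data lying exactly on y = 1 + 2x, so the cost at (1, 2) is zero
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
print(cost(1.0, 2.0, x, y))  # → 0.0
```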
- optimization: minimize $J(\theta_0, \theta_1)$ w.r.t. the parameters $(\theta_0, \theta_1)$
- Gradient Descent:
Outline:
- Initialize $\theta_0, \theta_1$
- Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum
Algorithm:
repeat until convergence {
$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$, for $j = 0, 1$
}
$\alpha$: learning rate (step size)
Correct: simultaneous update
$\text{temp0} \leftarrow \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
$\text{temp1} \leftarrow \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
$\theta_0 \leftarrow \text{temp0}$
$\theta_1 \leftarrow \text{temp1}$
Incorrect: why? Here $\theta_0$ is overwritten before the gradient for $\theta_1$ is computed, so the $\theta_1$ update evaluates the partial derivative at the new $\theta_0$ rather than the old one; the result is no longer a step along the gradient of $J$.
$\text{temp0} \leftarrow \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
$\theta_0 \leftarrow \text{temp0}$
$\text{temp1} \leftarrow \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
$\theta_1 \leftarrow \text{temp1}$
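The effect of the wrong ordering can be seen numerically with a toy cost whose gradient couples the two parameters (this $J$ is a stand-in for illustration, not the regression cost):

```python
alpha = 0.1
theta0, theta1 = 1.0, 1.0

# gradient of the toy cost J = theta0^2 + theta0*theta1
def grad(t0, t1):
    return 2 * t0 + t1, t0

# correct: both partial derivatives evaluated at the OLD (theta0, theta1)
g0, g1 = grad(theta0, theta1)
simul = (theta0 - alpha * g0, theta1 - alpha * g1)

# incorrect: theta0 is overwritten first, so theta1's partial derivative
# is evaluated at the already-updated theta0
t0 = theta0 - alpha * grad(theta0, theta1)[0]
t1 = theta1 - alpha * grad(t0, theta1)[1]

print(simul)     # ≈ (0.7, 0.9)
print((t0, t1))  # ≈ (0.7, 0.93): a different point
```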
- learning rate $\alpha$:
  - too small: gradient descent can be slow
  - too large: gradient descent may fail to converge, or even diverge
  - as we approach a local minimum, gradient descent automatically takes smaller steps, even with a fixed learning rate
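A one-parameter sketch illustrates all three points above, using the toy cost $J(\theta) = \theta^2$ with gradient $2\theta$ (my own example, not the regression cost):

```python
def descend(alpha, steps=50, theta=1.0):
    # gradient descent on J(theta) = theta**2, whose gradient is 2*theta
    for _ in range(steps):
        theta -= alpha * 2 * theta  # step alpha*|2*theta| shrinks as theta -> 0
    return theta

print(abs(descend(0.1)))   # small alpha: converges toward the minimum at 0
print(abs(descend(1.1)))   # too-large alpha: overshoots and diverges
```

Note that with the fixed $\alpha = 0.1$ the actual step size $\alpha \cdot 2\theta$ shrinks automatically as $\theta$ approaches the minimum.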
- apply gradient descent to linear regression:
$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2N}\sum_{i=1}^{N}\left(h(x^{(i)}) - y^{(i)}\right)^2 = \begin{cases} \frac{1}{N}\sum_{i=1}^{N}\left(h(x^{(i)}) - y^{(i)}\right), & j = 0 \\ \frac{1}{N}\sum_{i=1}^{N}\left(h(x^{(i)}) - y^{(i)}\right)x^{(i)}, & j = 1 \end{cases}$$
- cost function of linear regression is convex ⇒ no local minima (only a single global minimum)
- Batch Gradient Descent: each step of gradient descent uses all the training examples
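Putting the update rule together, a minimal batch gradient descent for the one-variable model might look like this (names and toy data are my own):

```python
import numpy as np

def batch_gd(x, y, alpha=0.1, iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x:
    every step uses ALL N training examples."""
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        resid = theta0 + theta1 * x - y   # h(x_i) - y_i for all i
        g0 = resid.mean()                 # dJ/dtheta0
        g1 = (resid * x).mean()           # dJ/dtheta1
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * g0, theta1 - alpha * g1
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x         # data on the exact line y = 1 + 2x
print(batch_gd(x, y))     # ≈ (1.0, 2.0)
```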
Multivariate Linear Regression
- notations:
- $p$: # of features
- $x^{(i)}$: input of the $i$th training example
- $x^{(i)}_j$: value of feature $j$ in the $i$th training example
- hypothesis: define $x^{(i)}_0 = 1$, $\forall i$
$$h_\theta(x) = \theta^T x = \begin{pmatrix} \theta_0 & \theta_1 & \cdots & \theta_p \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_p \end{pmatrix}$$
- cost function:
$$J(\theta) = \frac{1}{2N}\sum_{i=1}^{N}\left(h(x^{(i)}) - y^{(i)}\right)^2$$
- Gradient Descent:
Algorithm:
repeat until convergence {
$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) = \theta_j - \alpha \frac{1}{N}\sum_{i=1}^{N}\left(h(x^{(i)}) - y^{(i)}\right)x^{(i)}_j$
} (simultaneously update for $j = 0, 1, \ldots, p$)
$\alpha$: learning rate (step size)
Gradient Descent in Practice
Feature Scaling
- idea: make sure features are on a similar scale, i.e. approximately in the $[-1, 1]$ range
Gradient descent descends slowly on features with large ranges, and oscillates inefficiently down to the optimum when the feature scales are very uneven.
Feature scaling can speed up Gradient Descent.
- mean normalization: replace $x_i$ with $x_i - \mu_i$ so that features have approximately zero mean
$$x_i \leftarrow \frac{x_i - \mu_i}{s_i}, \quad s_i = \text{range or standard deviation of feature } i$$
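A minimal sketch of mean normalization with NumPy (the helper name and sample data are my own; here $s_i$ is the standard deviation, but the range would also work):

```python
import numpy as np

def mean_normalize(X):
    """Rescale each column (feature) to roughly zero mean and unit spread."""
    mu = X.mean(axis=0)
    s = X.std(axis=0)
    return (X - mu) / s, mu, s    # keep mu and s to rescale new inputs later

X = np.array([[2104.0, 3.0],      # e.g. house size and number of bedrooms
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])
Xs, mu, s = mean_normalize(X)
print(Xs.mean(axis=0))            # ≈ [0. 0.]
print(Xs.std(axis=0))             # ≈ [1. 1.]
```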
Learning Rate
- fix $\alpha$, plot $J(\theta)$ versus no. of iterations
  - $\alpha$ too small: slow convergence
  - $\alpha$ too large: may not converge
Vectorization
- compute cost:
$$J(\theta) = \frac{1}{2N}(X\theta - y)^T(X\theta - y)$$
```python
import numpy as np

# residual vector: h_theta(x^(i)) - y^(i) for all i
resid = np.dot(X, theta) - y
# J(theta): halved mean of squared residuals
cost = np.mean(resid ** 2) / 2
```
- gradient descent:
$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \;\Rightarrow\; \theta \leftarrow \theta - \alpha \frac{d}{d\theta} J(\theta) = \theta - \frac{\alpha}{N} X^T (X\theta - y)$$
```python
# simultaneous update of the whole parameter vector (N training examples)
theta -= learning_rate / N * np.dot(X.T, resid)
```
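Combining the vectorized cost and update into a full training loop (the function name and toy data are my own):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=2000):
    """Vectorized batch gradient descent; X already contains the x0 = 1 column."""
    N, P = X.shape
    theta = np.zeros(P)
    for _ in range(iters):
        resid = X @ theta - y               # residual vector, shape (N,)
        theta -= alpha / N * (X.T @ resid)  # simultaneous update of all theta_j
    return theta

# toy data generated from y = 1 + 2*x1 - 3*x2
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 2))
X = np.hstack([np.ones((100, 1)), A])
y = X @ np.array([1.0, 2.0, -3.0])
print(gradient_descent(X, y))  # ≈ [ 1.  2. -3.]
```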
Normal Equation
- design matrix:
$$X_{N \times (p+1)} = \begin{pmatrix} \mathbf{1} & x_1 & \cdots & x_p \end{pmatrix}, \quad \text{where } x_j = \begin{pmatrix} x^{(1)}_j \\ x^{(2)}_j \\ \vdots \\ x^{(N)}_j \end{pmatrix}$$
- output:
$$y = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(N)} \end{pmatrix}$$
$$\Rightarrow \hat{\theta} = (X^T X)^{-1} X^T y$$
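A sketch of the normal equation in NumPy (names and toy data are my own); solving the linear system $X^TX\,\theta = X^Ty$ is numerically preferable to forming $(X^TX)^{-1}$ explicitly:

```python
import numpy as np

def normal_equation(X, y):
    # solve X^T X theta = X^T y; np.linalg.solve avoids forming the inverse
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])        # first column is x0 = 1
y = np.array([1.0, 3.0, 5.0])     # exactly y = 1 + 2x
print(normal_equation(X, y))      # ≈ [1. 2.]
```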
| Gradient Descent | Normal Equation |
|---|---|
| need to choose learning rate | no need to choose learning rate |
| need feature scaling | no need to do feature scaling |
| many iterations | no need to iterate |
| $O(kp^2)$, $k$ is the no. of iterations | $O(p^3)$, need to compute $(X^TX)^{-1}$ |
| works well when $p$ is large | slow if $p$ is very large |
Normal Equation Noninvertibility
- what if $X^TX$ is noninvertible (singular/degenerate)?
  - redundant features (linearly dependent)
  - too many features ($N \le p$) ⇒ delete some features, or use regularization
- remedy: use the pseudo-inverse
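A small sketch of the pseudo-inverse remedy, using `np.linalg.pinv` with deliberately redundant features (the data is my own):

```python
import numpy as np

X = np.array([[1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0],
              [1.0, 5.0, 10.0]])  # third column = 2 * second, so X^T X is singular
y = np.array([5.0, 7.0, 11.0])    # consistent with y = 1 + 2*x1

# the pseudo-inverse still yields a (minimum-norm) least-squares solution
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(np.allclose(X @ theta, y))  # → True: the fit is still exact here
```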