This post is just a quick explanation of linear regression, ridge regression, and LASSO from the probabilistic perspective.

Linear Regression

Given data $D = \{X, y\}$ where $X \in \mathbb{R}^{m \times n}$ and $y \in \mathbb{R}^{m \times 1}$, we assume the hypothesis takes the form

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \sum_{i=0}^{n} \theta_i x_i = \theta^\top x, \quad \text{where } x_0 = 1$$

then the objective function of linear regression is

$$\arg\min_\theta J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$
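To make the notation concrete, here is a minimal NumPy sketch (the data and function names are made up for illustration) that evaluates the hypothesis $h_\theta(x)$ and the objective $J(\theta)$ on a toy dataset, prepending a column of ones so that $x_0 = 1$:

```python
import numpy as np

# Toy data: m = 5 examples, n = 2 features (values are arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = rng.normal(size=5)

# Prepend a column of ones so that x_0 = 1 handles the intercept theta_0.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # shape (m, n + 1)

def h(theta, X_aug):
    """Hypothesis h_theta(x) = theta^T x, applied to every row of X_aug."""
    return X_aug @ theta

def J(theta, X_aug, y):
    """Least-squares objective J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2."""
    residuals = h(theta, X_aug) - y
    return 0.5 * np.sum(residuals ** 2)

theta = np.zeros(X_aug.shape[1])
print(J(theta, X_aug, y))
```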

Probabilistic Interpretation

Assume the target variables and the inputs are related via the equation:

$$y^{(i)} = \theta^\top x^{(i)} + \epsilon^{(i)}$$

where $\epsilon^{(i)}$ is an error term that captures either unmodeled effects (some pertinent features we left out) or random noise. Further, we assume the $\epsilon^{(i)}$ are distributed i.i.d. and

$$\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$$

Thus, we could write the probability of the error term as

$$P(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right)$$

Since $\epsilon^{(i)}$ is a random variable, $y^{(i)}$ is also a random variable, which implies that

$$P(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \right)$$

That is

$$y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^\top x^{(i)}, \sigma^2)$$

Therefore, we could write the likelihood function of all the data $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ as

$$P(y \mid X; \theta) = L(\theta; X, y) = \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \right)$$
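As a quick sanity check, the log of this likelihood can be computed either from the explicit Gaussian density or with `scipy.stats.norm.logpdf`; the toy data and the candidate $\theta$ below are arbitrary, chosen only for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
m, n = 50, 3
X = rng.normal(size=(m, n))
theta_true = rng.normal(size=n)
sigma = 0.5
y = X @ theta_true + rng.normal(scale=sigma, size=m)

theta = rng.normal(size=n)           # an arbitrary candidate theta

# Log-likelihood from the explicit Gaussian density.
resid = y - X @ theta
log_lik_manual = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

# Same quantity via scipy: y^(i) | x^(i); theta ~ N(theta^T x^(i), sigma^2).
log_lik_scipy = np.sum(norm.logpdf(y, loc=X @ theta, scale=sigma))

print(np.isclose(log_lik_manual, log_lik_scipy))   # True
```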

We could maximize the log-likelihood to find the $\theta$ that best fits the data (i.e., assigns it the highest probability); hence we have

$$\begin{aligned}
\arg\max_\theta L(\theta; X, y) &= \arg\max_\theta \log L(\theta; X, y) \\
&= \arg\max_\theta \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \right) \\
&= \arg\max_\theta \sum_{i=1}^{m} \log\left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \right) \right] \\
&= \arg\max_\theta \sum_{i=1}^{m} \left[ \log\frac{1}{\sqrt{2\pi\sigma^2}} + \log\exp\left( -\frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \right) \right] \\
&= \arg\max_\theta \sum_{i=1}^{m} \left[ \log\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \right] \\
&= \arg\max_\theta \sum_{i=1}^{m} \log\frac{1}{\sqrt{2\pi\sigma^2}} - \sum_{i=1}^{m} \frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \\
&= \arg\max_\theta -\sum_{i=1}^{m} \frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \\
&= \arg\min_\theta \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2
\end{aligned}$$
  • We can see that minimizing the least-squares loss is the same as maximizing the likelihood in linear regression under the assumption that $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$.
  • Notice that $\sigma^2$ is also a parameter we need to find, so we also apply MLE with respect to $\sigma^2$ (a numerical check of both points follows below).
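Here is a small numerical check of both points (toy data; the variable names are illustrative): fitting $\theta$ by ordinary least squares with `numpy.linalg.lstsq` and by maximizing the Gaussian log-likelihood with `scipy.optimize.minimize` give the same estimate, and the MLE of $\sigma^2$ comes out as the mean squared residual:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
m, n = 200, 4
X = rng.normal(size=(m, n))
theta_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ theta_true + rng.normal(scale=0.3, size=m)

# 1) Least-squares estimate.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2) Maximum-likelihood estimate: maximize the Gaussian log-likelihood,
#    i.e. minimize the negative log-likelihood over (theta, log sigma).
def neg_log_lik(params):
    theta, log_sigma = params[:n], params[n]
    sigma2 = np.exp(2 * log_sigma)
    resid = y - X @ theta
    return 0.5 * m * np.log(2 * np.pi * sigma2) + np.sum(resid**2) / (2 * sigma2)

res = minimize(neg_log_lik, x0=np.zeros(n + 1), method="BFGS")
theta_mle, sigma2_mle = res.x[:n], np.exp(2 * res.x[n])

print(np.allclose(theta_ls, theta_mle, atol=1e-4))                         # same theta
print(np.isclose(sigma2_mle, np.mean((y - X @ theta_ls)**2), atol=1e-4))   # MLE of sigma^2
```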

Ridge Regression and LASSO

The objective of ridge regression is

$$\arg\min_\theta \frac{1}{2} \| y - X\theta \|_2^2 + \lambda \| \theta \|_2^2$$
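Since this objective is differentiable, setting its gradient to zero gives a closed-form solution, $\theta = (X^\top X + 2\lambda I)^{-1} X^\top y$ (the factor of 2 comes from the $\frac{1}{2}$ in front of the squared-error term). A minimal NumPy sketch with made-up data:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize 1/2 * ||y - X theta||^2 + lam * ||theta||^2 in closed form."""
    n = X.shape[1]
    # Setting the gradient X^T (X theta - y) + 2 * lam * theta to zero:
    return np.linalg.solve(X.T @ X + 2 * lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=100)

print(ridge_closed_form(X, y, lam=1.0))
```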

The objective of the least absolute shrinkage and selection operator (LASSO) is

$$\arg\min_\theta \frac{1}{2} \| y - X\theta \|_2^2 + \lambda \| \theta \|_1$$
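The $\ell_1$ term is not differentiable at zero, so LASSO has no closed-form solution. One standard way to minimize this objective (not covered further in this post) is proximal gradient descent with soft-thresholding, sketched below on made-up data:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=5000):
    """Minimize 1/2 * ||y - X theta||^2 + lam * ||theta||_1 by proximal gradient (ISTA)."""
    m, n = X.shape
    theta = np.zeros(n)
    step = 1.0 / np.linalg.norm(X, ord=2) ** 2   # 1 / L, L = largest eigenvalue of X^T X
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y)             # gradient of the smooth part
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
theta_true = np.zeros(10)
theta_true[:3] = [2.0, -1.5, 1.0]                # sparse ground truth
y = X @ theta_true + rng.normal(scale=0.1, size=100)

print(np.round(lasso_ista(X, y, lam=5.0), 3))    # most coefficients shrink to exactly 0
```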

We will show that ridge regression, which imposes $\ell_2$ regularization on the least-squares cost function, is just maximum a posteriori (MAP) estimation with a Gaussian prior, and that LASSO, which imposes $\ell_1$ regularization on the least-squares cost function, is just MAP estimation with a Laplace prior.

Maximum A Posteriori

$$\begin{aligned}
\theta_{MAP} &= \arg\max_\theta P(\theta \mid D) \\
&= \arg\max_\theta \frac{P(\theta) P(D \mid \theta)}{P(D)} \\
&= \arg\max_\theta P(\theta) P(D \mid \theta) \\
&= \arg\max_\theta \log\left[ P(\theta) P(D \mid \theta) \right] \\
&= \arg\max_\theta \log P(D \mid \theta) + \log P(\theta)
\end{aligned}$$

Probabilistic Interpretation of Ridge Regression

Assume the prior on each component of $\theta$ is Gaussian, that is,

$$\theta_j \sim \mathcal{N}(0, \tau^2), \quad j = 1, \dots, n$$

Then, we could write down the MAP estimate as

$$\begin{aligned}
\theta_{MAP} &= \arg\max_\theta \log P(D \mid \theta) + \log P(\theta) \\
&= \arg\max_\theta \log\left[ \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \right) \right] + \log\left[ \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\tau^2}} \exp\left( -\frac{\theta_j^2}{2\tau^2} \right) \right] \\
&= \arg\max_\theta \left[ -\sum_{i=1}^{m} \frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} - \sum_{j=1}^{n} \frac{\theta_j^2}{2\tau^2} \right] \\
&= \arg\max_\theta -\frac{1}{\sigma^2} \left[ \sum_{i=1}^{m} \frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2} + \frac{\sigma^2}{2\tau^2} \sum_{j=1}^{n} \theta_j^2 \right] \\
&= \arg\min_\theta \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 + \frac{\sigma^2}{2\tau^2} \sum_{j=1}^{n} \theta_j^2 \\
&= \arg\min_\theta \frac{1}{2} \| y - X\theta \|_2^2 + \lambda \| \theta \|_2^2
\end{aligned}$$

where $\lambda = \frac{\sigma^2}{2\tau^2}$.
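This correspondence is easy to verify numerically: minimizing the negative log-posterior under the Gaussian prior and solving the ridge problem with $\lambda = \frac{\sigma^2}{2\tau^2}$ give the same $\theta$. A sketch under these assumptions (toy data, arbitrary $\sigma$ and $\tau$):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
m, n = 100, 4
sigma, tau = 0.5, 1.0
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + rng.normal(scale=sigma, size=m)

# MAP: minimize the negative log-posterior (Gaussian likelihood + Gaussian prior).
def neg_log_posterior(theta):
    return (np.sum((y - X @ theta) ** 2) / (2 * sigma**2)
            + np.sum(theta**2) / (2 * tau**2))

theta_map = minimize(neg_log_posterior, x0=np.zeros(n), method="BFGS").x

# Ridge with lambda = sigma^2 / (2 tau^2): closed form from the ridge objective.
lam = sigma**2 / (2 * tau**2)
theta_ridge = np.linalg.solve(X.T @ X + 2 * lam * np.eye(n), X.T @ y)

print(np.allclose(theta_map, theta_ridge, atol=1e-5))   # True
```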

Probabilistic Interpretation of LASSO

Laplace distribution: if a random variable $Z \sim \text{Laplace}(\mu, b)$, then we have

$$P(Z = z \mid \mu, b) = \frac{1}{2b} \exp\left( -\frac{|z - \mu|}{b} \right)$$

Now, let’s assume the prior on each component of $\theta$ follows the Laplace distribution above with $\mu = 0$, that is

$$\theta_j \sim \text{Laplace}(0, b) \;\;\Longrightarrow\;\; P(\theta_j) = \frac{1}{2b} \exp\left( -\frac{|\theta_j|}{b} \right)$$

$$\begin{aligned}
\theta_{MAP} &= \arg\max_\theta \log P(D \mid \theta) + \log P(\theta) \\
&= \arg\max_\theta \log\left[ \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \right) \right] + \log\left[ \prod_{j=1}^{n} \frac{1}{2b} \exp\left( -\frac{|\theta_j|}{b} \right) \right] \\
&= \arg\max_\theta \left[ -\sum_{i=1}^{m} \frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} - \sum_{j=1}^{n} \frac{|\theta_j|}{b} \right] \\
&= \arg\max_\theta -\frac{1}{\sigma^2} \left[ \sum_{i=1}^{m} \frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2} + \frac{\sigma^2}{b} \sum_{j=1}^{n} |\theta_j| \right] \\
&= \arg\min_\theta \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 + \frac{\sigma^2}{b} \sum_{j=1}^{n} |\theta_j| \\
&= \arg\min_\theta \frac{1}{2} \| y - X\theta \|_2^2 + \lambda \| \theta \|_1
\end{aligned}$$

where $\lambda = \frac{\sigma^2}{b}$.
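The same kind of check works here: after multiplying by $\sigma^2$ and dropping $\theta$-independent constants, the negative log-posterior under the Laplace prior is exactly the LASSO objective with $\lambda = \frac{\sigma^2}{b}$, so the two differ by the same constant at every $\theta$ and share their minimizers. A sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 100, 4
sigma, b = 0.5, 2.0
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + rng.normal(scale=sigma, size=m)
lam = sigma**2 / b

def neg_log_posterior(theta):
    """Gaussian likelihood + Laplace(0, b) prior, dropping theta-independent constants."""
    return np.sum((y - X @ theta) ** 2) / (2 * sigma**2) + np.sum(np.abs(theta)) / b

def lasso_objective(theta):
    return 0.5 * np.sum((y - X @ theta) ** 2) + lam * np.sum(np.abs(theta))

# sigma^2 * neg_log_posterior and the LASSO objective differ by the same amount
# (zero here, since the constants were dropped) at every theta.
thetas = [rng.normal(size=n) for _ in range(3)]
diffs = [sigma**2 * neg_log_posterior(t) - lasso_objective(t) for t in thetas]
print(np.allclose(diffs, diffs[0]))   # True
```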

Bayesian Linear Regression

From the Bayesian view, there is uncertainty in our choice of the parameters $\theta$, so we place a probability distribution over $\theta$, typically

$$\theta \sim \mathcal{N}(0, \tau^2 I)$$

Based on Bayes’ rule, we can write down the parameter posterior as

$$P(\theta \mid D) = \frac{P(\theta) P(D \mid \theta)}{P(D)} = \frac{P(\theta) P(D \mid \theta)}{\int_{\hat{\theta}} P(\hat{\theta}) P(D \mid \hat{\theta}) \, d\hat{\theta}}$$

For Bayesian linear regression, given a test input $x$, the output is a probability distribution over $y$ instead of a single numerical value:

$$P(y \mid x, D) = \int_{\theta} P(y \mid x, \theta) P(\theta \mid D) \, d\theta$$
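With the Gaussian prior and Gaussian likelihood above, both integrals are available in closed form (a standard result for Bayesian linear regression): the posterior is $\mathcal{N}(\mu_N, \Sigma_N)$ with $\Sigma_N = (I/\tau^2 + X^\top X/\sigma^2)^{-1}$ and $\mu_N = \Sigma_N X^\top y / \sigma^2$, and the predictive distribution for a test input $x$ is $\mathcal{N}(x^\top \mu_N,\; x^\top \Sigma_N x + \sigma^2)$. A minimal sketch, assuming $\sigma$ and $\tau$ are known:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 50, 3
sigma, tau = 0.3, 1.0
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + rng.normal(scale=sigma, size=m)

# Parameter posterior P(theta | D) = N(mu_N, Sigma_N) for a Gaussian prior/likelihood.
Sigma_N = np.linalg.inv(np.eye(n) / tau**2 + X.T @ X / sigma**2)
mu_N = Sigma_N @ X.T @ y / sigma**2

# Predictive distribution P(y | x, D) for a new test input x: also Gaussian.
x_new = rng.normal(size=n)
pred_mean = x_new @ mu_N
pred_var = x_new @ Sigma_N @ x_new + sigma**2   # parameter uncertainty + observation noise

print(pred_mean, np.sqrt(pred_var))
```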