Probabilistic Interpretation of Ridge Regression and LASSO
This post is a quick explanation of linear regression, ridge regression, and LASSO from a probabilistic perspective.
Linear Regression
Given data $D = \{X, y\}$, where $X \in \mathbb{R}^{m \times n}$ and $y \in \mathbb{R}^{m \times 1}$, we assume the hypothesis has the form

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n = \sum_{i=0}^{n} \theta_i x_i = \theta^\top x, \quad \text{where } x_0 = 1$$

Then the objective function of linear regression is

$$\arg\min_\theta J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Probabilistic Interpretation
Assume the target variables and the inputs are related via the equation:
$$y^{(i)} = \theta^\top x^{(i)} + \epsilon^{(i)}$$

where $\epsilon^{(i)}$ is an error term that captures either unmodeled effects (some pertinent features we left out) or random noise. Further, we assume the $\epsilon^{(i)}$ are distributed i.i.d. and
$$\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$$

Thus, we can write the probability density of the error term as
$$P(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right)$$

Since $\epsilon^{(i)}$ is a random variable, $y^{(i)}$ is also a random variable, which implies that
$$P(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \right)$$

That is,
$$y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^\top x^{(i)}, \sigma^2)$$

Therefore, we can write the likelihood function of all the data $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ as
$$P(y \mid X; \theta) = L(\theta; X, y) = \prod_{i=1}^{m} P(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \right)$$

We maximize the log-likelihood to find the $\theta$ that fits the data with the highest probability:
$$\begin{aligned}
\arg\max_\theta L(\theta; X, y) &= \arg\max_\theta \log L(\theta; X, y) \\
&= \arg\max_\theta \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \right) \\
&= \arg\max_\theta \sum_{i=1}^{m} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \right) \right] \\
&= \arg\max_\theta \sum_{i=1}^{m} \left[ \log \frac{1}{\sqrt{2\pi\sigma^2}} + \log \exp\left( -\frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \right) \right] \\
&= \arg\max_\theta \sum_{i=1}^{m} \left[ \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \right] \\
&= \arg\max_\theta \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi\sigma^2}} - \sum_{i=1}^{m} \frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \\
&= \arg\max_\theta -\sum_{i=1}^{m} \frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \\
&= \arg\min_\theta \frac{1}{2} \sum_{i=1}^{m} (y^{(i)} - \theta^\top x^{(i)})^2
\end{aligned}$$
We can see that minimizing the least squares loss is the same as maximizing the likelihood in linear regression, under the assumption that $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$. Notice that $\sigma^2$ is also a parameter we need to find, so we would apply MLE w.r.t. $\sigma^2$ as well; this does not change the optimal $\theta$, since $\sigma^2$ only scales the squared-error term.
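To check this equivalence numerically, here is a minimal sketch on synthetic data (the data, the value of $\sigma$, and the use of a generic optimizer are all assumptions made for illustration): maximizing the Gaussian log-likelihood recovers the same $\theta$ as ordinary least squares.

```python
# Minimal sketch: MLE under Gaussian noise vs. ordinary least squares on made-up data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m, n = 200, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # first column is x_0 = 1
theta_true = np.array([1.0, 2.0, -3.0, 0.5])
sigma = 0.5
y = X @ theta_true + rng.normal(scale=sigma, size=m)        # y = theta^T x + eps, eps ~ N(0, sigma^2)

def neg_log_likelihood(theta):
    residuals = y - X @ theta
    return 0.5 * m * np.log(2 * np.pi * sigma**2) + np.sum(residuals**2) / (2 * sigma**2)

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(n + 1)).x   # maximize the likelihood
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)                 # minimize the squared error

print(np.allclose(theta_mle, theta_ls, atol=1e-4))               # the two estimates coincide
```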
Ridge Regression and LASSO
The objective of ridge regression is
$$\arg\min_\theta \frac{1}{2} \| y - X\theta \|_2^2 + \lambda \| \theta \|_2^2$$

and the objective of the least absolute shrinkage and selection operator (LASSO) is
$$\arg\min_\theta \frac{1}{2} \| y - X\theta \|_2^2 + \lambda \| \theta \|_1$$

We will show that ridge regression, which imposes $\ell_2$ regularization on the least squares cost function, is just maximum a posteriori (MAP) estimation with a Gaussian prior, while LASSO, which imposes $\ell_1$ regularization on the least squares cost function, is just MAP estimation with a Laplace prior.
Maximum A Posteriori
$$\begin{aligned}
\theta_{\text{MAP}} &= \arg\max_\theta P(\theta \mid D) = \arg\max_\theta \frac{P(\theta) P(D \mid \theta)}{P(D)} \\
&= \arg\max_\theta P(\theta) P(D \mid \theta) \\
&= \arg\max_\theta \log \left[ P(\theta) P(D \mid \theta) \right] \\
&= \arg\max_\theta \log P(D \mid \theta) + \log P(\theta)
\end{aligned}$$

Probabilistic Interpretation of Ridge Regression
Assume the prior on each parameter $\theta_j$ follows a Gaussian distribution, that is,

$$\theta_j \sim \mathcal{N}(0, \tau^2)$$

Then we can write down the MAP estimate as
$$\begin{aligned}
\theta_{\text{MAP}} &= \arg\max_\theta \log P(D \mid \theta) + \log P(\theta) \\
&= \arg\max_\theta \log \left[ \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \right) \right] + \log \left[ \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\tau^2}} \exp\left( -\frac{\theta_j^2}{2\tau^2} \right) \right] \\
&= \arg\max_\theta \left[ -\sum_{i=1}^{m} \frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} - \sum_{j=1}^{n} \frac{\theta_j^2}{2\tau^2} \right] \\
&= \arg\max_\theta -\frac{1}{\sigma^2} \left[ \sum_{i=1}^{m} \frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2} + \frac{\sigma^2}{2\tau^2} \sum_{j=1}^{n} \theta_j^2 \right] \\
&= \arg\min_\theta \frac{1}{2} \sum_{i=1}^{m} (y^{(i)} - \theta^\top x^{(i)})^2 + \frac{\sigma^2}{2\tau^2} \sum_{j=1}^{n} \theta_j^2 \\
&= \arg\min_\theta \frac{1}{2} \| y - X\theta \|_2^2 + \lambda \| \theta \|_2^2
\end{aligned}$$

where $\lambda = \frac{\sigma^2}{2\tau^2}$.
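As a concrete illustration (a minimal sketch, not part of the original derivation), the MAP/ridge estimate has a closed form obtained by setting the gradient of the objective above to zero, $-X^\top(y - X\theta) + 2\lambda\theta = 0$:

```python
# Minimal sketch: ridge regression as MAP with a Gaussian prior, in closed form.
import numpy as np

def ridge_map(X, y, sigma2, tau2):
    """MAP estimate under y^(i) ~ N(theta^T x^(i), sigma2) and theta_j ~ N(0, tau2)."""
    lam = sigma2 / (2.0 * tau2)            # lambda = sigma^2 / (2 tau^2)
    n = X.shape[1]
    # Gradient of 1/2 ||y - X theta||^2 + lam ||theta||^2 set to zero:
    # (X^T X + 2 lam I) theta = X^T y
    return np.linalg.solve(X.T @ X + 2.0 * lam * np.eye(n), X.T @ y)
```

A smaller prior variance $\tau^2$ (a stronger belief that the weights are near zero) gives a larger $\lambda$ and hence more shrinkage.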
Probabilistic Interpretation of LASSO
Laplace distribution: if a random variable $Z \sim \text{Laplace}(\mu, b)$, then we have
$$P(Z = z \mid \mu, b) = \frac{1}{2b} \exp\left( -\frac{|z - \mu|}{b} \right)$$

Now, let's assume the prior on each parameter follows the Laplace distribution described above with $\mu = 0$, that is,
$$\theta_j \sim \text{Laplace}(0, b), \qquad P(\theta_j) = \frac{1}{2b} \exp\left( -\frac{|\theta_j|}{b} \right)$$

$$\begin{aligned}
\theta_{\text{MAP}} &= \arg\max_\theta \log P(D \mid \theta) + \log P(\theta) \\
&= \arg\max_\theta \log \left[ \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} \right) \right] + \log \left[ \prod_{j=1}^{n} \frac{1}{2b} \exp\left( -\frac{|\theta_j|}{b} \right) \right] \\
&= \arg\max_\theta \left[ -\sum_{i=1}^{m} \frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2\sigma^2} - \sum_{j=1}^{n} \frac{|\theta_j|}{b} \right] \\
&= \arg\max_\theta -\frac{1}{\sigma^2} \left[ \sum_{i=1}^{m} \frac{(y^{(i)} - \theta^\top x^{(i)})^2}{2} + \frac{\sigma^2}{b} \sum_{j=1}^{n} |\theta_j| \right] \\
&= \arg\min_\theta \frac{1}{2} \sum_{i=1}^{m} (y^{(i)} - \theta^\top x^{(i)})^2 + \frac{\sigma^2}{b} \sum_{j=1}^{n} |\theta_j| \\
&= \arg\min_\theta \frac{1}{2} \| y - X\theta \|_2^2 + \lambda \| \theta \|_1
\end{aligned}$$

where $\lambda = \frac{\sigma^2}{b}$.
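The $\ell_1$ objective has no closed-form solution, so here is a minimal sketch (not from the post) that solves it with proximal gradient descent (ISTA), one standard solver for this objective; the soft-thresholding step is what drives some coefficients exactly to zero, which is the sparsity effect associated with the Laplace prior.

```python
# Minimal sketch: LASSO as MAP with a Laplace prior, solved by proximal gradient (ISTA).
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (element-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_map(X, y, sigma2, b, n_iter=5000):
    """Approximate argmin of 1/2 ||y - X theta||^2 + lam ||theta||_1 with lam = sigma^2 / b."""
    lam = sigma2 / b
    theta = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2      # 1 / Lipschitz constant of the smooth term
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ theta)           # gradient of the squared-error term
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta
```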
Bayesian Linear Regression
From the Bayesian view, there is uncertainty in our choice of the parameters $\theta$, so we place a probability distribution over $\theta$, typically
$$\theta \sim \mathcal{N}(0, \tau^2 I)$$

Based on Bayes' rule, we can write down the parameter posterior as
$$P(\theta \mid D) = \frac{P(\theta) P(D \mid \theta)}{P(D)} = \frac{P(\theta) P(D \mid \theta)}{\int_{\hat{\theta}} P(\hat{\theta}) P(D \mid \hat{\theta}) \, d\hat{\theta}}$$

For Bayesian linear regression, given a test input $x^*$, the output is a probability distribution over $y^*$ instead of a single numerical value:
$$P(y^* \mid x^*, D) = \int_\theta P(y^* \mid x^*, \theta) \cdot P(\theta \mid D) \, d\theta$$
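In this Gaussian setting both integrals are tractable. The sketch below uses the standard conjugate-Gaussian closed forms (not derived in the post) to compute the posterior $P(\theta \mid D)$ and the predictive distribution of $y^*$; note that the posterior mean coincides with the ridge/MAP estimate from above, while the predictive variance adds parameter uncertainty on top of the noise variance $\sigma^2$.

```python
# Minimal sketch: closed-form posterior and posterior predictive for Bayesian
# linear regression with prior theta ~ N(0, tau2 * I) and noise variance sigma2.
import numpy as np

def posterior(X, y, sigma2, tau2):
    n = X.shape[1]
    Sigma_N = np.linalg.inv(X.T @ X / sigma2 + np.eye(n) / tau2)   # posterior covariance
    mu_N = Sigma_N @ X.T @ y / sigma2                              # posterior mean (= ridge estimate)
    return mu_N, Sigma_N

def predictive(x_star, mu_N, Sigma_N, sigma2):
    mean = x_star @ mu_N
    var = x_star @ Sigma_N @ x_star + sigma2    # parameter uncertainty + observation noise
    return mean, var                            # y* | x*, D ~ N(mean, var)
```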