Understanding Maximum Likelihood Estimation in Supervised Learning

This article demystifies the ML modeling process through the prism of statistics. We will see how our assumptions about the data enable us to formulate meaningful optimization problems. In fact, we will derive commonly used criteria such as cross-entropy in classification and mean squared error in regression. Finally, I will try to answer an interview question that I encountered: what would happen if we used MSE on binary classification?

Likelihood vs probability and probability density

To begin, let's start with a fundamental question: what is the difference between likelihood and probability? The data $x$ are connected to the possible models $\theta$ by means of a probability $P(x,\theta)$ or a probability density function (pdf) $p(x,\theta)$.

In short, a pdf gives the relative likelihood of the different possible values of a continuous random variable; the probability of any single exact value is infinitesimally small, so probabilities come from integrating the density over an interval. We'll stick with the pdf notation here. For any given set of parameters $\theta$, $p(x,\theta)$ is intended to be the probability density function of $x$.

The likelihood $p(x,\theta)$ is defined as the joint density of the observed data viewed as a function of the model parameters. That means, for any given $x$, the likelihood is $p(x=\text{fixed}, \theta)$: the data are held fixed and $\theta$ varies.
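To make the distinction concrete, here is a minimal NumPy sketch with a Gaussian family; the helper `gaussian_pdf` and the sample values are purely illustrative. The same expression acts as a pdf when $\theta$ is fixed and as a likelihood when $x$ is fixed:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # density of N(mu, sigma^2) evaluated at x
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Fixed parameters, varying data: p(x, theta) is a pdf over x
xs = np.linspace(-3.0, 3.0, 5)
print(gaussian_pdf(xs, mu=0.0, sigma=1.0))

# Fixed data, varying parameters: the same expression is a likelihood over mu
mus = np.linspace(-3.0, 3.0, 5)
print(gaussian_pdf(1.2, mus, sigma=1.0))
```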

Notations

We will consider the case where we are given a set $X$ of $m$ data instances $X=\{ \mathbf{x}^{(1)}, \dots, \mathbf{x}^{(m)} \}$.

The Independent and identically distributed assumption

This brings us to the most fundamental assumption of ML: independent and identically distributed (IID) data (random variables). Statistical independence means that for random variables $A$ and $B$, the joint distribution factorizes into the product of the marginals: $P_{A,B}(a,b) = P_A(a)\,P_B(b)$.

Our estimator (model) will have some learnable parameters $\boldsymbol{\theta}$ that define another probability distribution $p_{model}(\mathbf{x}, \boldsymbol{\theta})$.

The essence of ML is to pick a good initial model that exploits the assumptions and the structure of the data; less formally, a model with a decent inductive bias. As the parameters are iteratively optimized, $p_{model}(\mathbf{x}, \boldsymbol{\theta})$ gets closer and closer to the data distribution.

In neural networks, because the iterations happen on mini-batches instead of the whole dataset, $m$ will be the mini-batch size.
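The practical payoff of the IID assumption is that the joint density factorizes, so log-likelihoods simply add up over a batch. A quick sketch (the Gaussian model and batch size are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.normal(loc=0.5, scale=1.0, size=32)  # a mini-batch of m = 32 IID samples

def log_gaussian_pdf(x, mu, sigma):
    # log-density of N(mu, sigma^2)
    return -np.log(sigma) - 0.5 * np.log(2 * np.pi) - (x - mu) ** 2 / (2 * sigma**2)

# Independence turns the joint density into a product,
# i.e. the joint log-likelihood into a simple sum over the batch.
joint_log_likelihood = np.sum(log_gaussian_pdf(batch, mu=0.5, sigma=1.0))
print(joint_log_likelihood)
```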

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) is simply a common, principled method with which we can derive good estimators; it picks the $\boldsymbol{\theta}$ that best fits the data.

To disentangle this concept, let’s observe the formula in the most intuitive form:

$$\boldsymbol{\theta}_{\mathrm{MLE}}= \underset{\text{params}}{\arg \max }\; p_{model}(\text{output} \mid \text{inputs}, \text{params})$$

The optimization problem is to maximize the likelihood of the given data. In the probabilistic view, the model assigns probability to the outputs conditioned on the inputs and parameters. Unconditional MLE means there is no conditioning on outputs, so no labels:

$$\begin{aligned}
\boldsymbol{\theta}_{\mathrm{MLE}} &=\underset{\boldsymbol{\theta}}{\arg \max }\; p_{model}(X, \boldsymbol{\theta}) \\
&=\underset{\boldsymbol{\theta}}{\arg \max } \prod_{i=1}^{m} p_{model}\left(\mathbf{x}^{(i)}, \boldsymbol{\theta}\right)
\end{aligned}$$

In a supervised ML context, we condition on the inputs and maximize the likelihood of the data labels. Taking the logarithm does not change the arg max but conveniently turns the product into a sum:

$$\boldsymbol{\theta}_{\mathrm{ML}}=\underset{\boldsymbol{\theta}}{\arg \max } \sum_{i=1}^{m} \log p_{model}\left(\boldsymbol{y}^{(i)} \mid \boldsymbol{x}^{(i)}, \boldsymbol{\theta}\right)$$
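To see the principle in action, here is a minimal sketch of the unconditional case (all names and values are illustrative): fitting the mean of a Gaussian by gradient ascent on the log-likelihood, whose closed-form MLE is the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=500)  # samples from the "true" distribution

mu = 0.0  # the learnable parameter theta of our Gaussian model (sigma fixed at 1)
lr = 0.5
for _ in range(100):
    # gradient of the average Gaussian log-likelihood w.r.t. mu is mean(x - mu)
    grad = np.mean(data - mu)
    mu += lr * grad  # gradient *ascent*: we maximize the log-likelihood

print(mu, data.mean())  # the estimate converges to the sample mean, the MLE
```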

Quantifying distribution closeness: KL-div

One way to interpret MLE is to view it as minimizing the “closeness” between the training data distribution $p_{data}(\mathbf{x})$ and the model distribution $p_{model}(\mathbf{x}, \boldsymbol{\theta})$. The standard way to quantify this closeness is the Kullback-Leibler (KL) divergence:

$$\begin{gathered}
D_{KL}(p_{data} \| p_{model})=E_{x\sim p_{data}}\left[\log \frac{p_{data}(\mathbf{x})}{p_{model}(\mathbf{x}, \boldsymbol{\theta})}\right] = \\
E_{x\sim p_{data}}\left[\log p_{data}(\mathbf{x}) - \log p_{model}(\mathbf{x}, \boldsymbol{\theta})\right],
\end{gathered}$$

where $E$ denotes the expectation over all possible training data. In general, the expected value is a weighted average of all possible outcomes. We will replace the expectation with a sum, whilst multiplying each term with its “weight” of happening, that is, $p_{data}(\mathbf{x})$.


[Figure] Illustration of the relative entropy for two normal distributions. The typical asymmetry is clearly visible. By Mundhenk at English Wikipedia, CC BY-SA 3.0

Notice that I intentionally avoided using the term distance. Why? Because a distance function is defined to be symmetric. KL-div, on the other hand, is asymmetric, meaning $D_{KL}(p_{data} \| p_{model}) \neq D_{KL}(p_{model} \| p_{data})$.


[Figure] Asymmetry of the KL divergence. Source: Datumorphism

Intuitively, you can think of $p_{data}$ as the reference: since the expectation is taken under $p_{data}$, the divergence penalizes the model most wherever the data distribution assigns high probability but the model does not.
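A tiny numerical check of the asymmetry, with two arbitrary discrete distributions standing in for $p_{data}$ and $p_{model}$:

```python
import numpy as np

def kl_div(p, q):
    # D_KL(p || q) for discrete distributions, in nats
    return np.sum(p * np.log(p / q))

p_data = np.array([0.7, 0.2, 0.1])
p_model = np.array([0.4, 0.4, 0.2])

# The two directions give different values: KL divergence is not a distance.
print(kl_div(p_data, p_model))
print(kl_div(p_model, p_data))
```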

By replacing the expectation $E$ with our lovely sum:

$$\begin{gathered}
D_{KL}(p_{data} \| p_{model})=\sum_{x=1}^{N} p_{data}(\mathbf{x}) \log \frac{p_{data}(\mathbf{x})}{p_{model}(\mathbf{x}, \boldsymbol{\theta})} \\
=\sum_{x=1}^{N} p_{data}(\mathbf{x})\left[\log p_{data}(\mathbf{x})-\log p_{model}(\mathbf{x}, \boldsymbol{\theta})\right]
\end{gathered}$$

When we minimize the KL divergence with respect to the parameters of our estimator, $\log p_{data}(\mathbf{x})$ does not depend on $\boldsymbol{\theta}$, so its gradient vanishes and we are left with:

$$\nabla_{\theta} D_{KL}(p_{data} \| p_{model}) = -\sum_{x=1}^{N} p_{data}(\mathbf{x}) \nabla_{\theta} \log p_{model}(\mathbf{x}, \boldsymbol{\theta}).$$

In other words, minimizing KL-div is mathematically equivalent to minimizing cross-entropy, $H(P, Q)=-\sum_{x} P(x) \log Q(x)$, because the two differ only by the entropy of the data, which is constant in $\boldsymbol{\theta}$:

$$\begin{aligned}
H\left(p_{data}, p_{model}\right) &=H(p_{data})+D_{KL}\left(p_{data} \| p_{model}\right) \\
\nabla_{\theta} H\left(p_{data}, p_{model}\right) &=\nabla_{\theta}\left(H(p_{data})+D_{KL}\left(p_{data} \| p_{model}\right)\right) \\
&=\nabla_{\theta} D_{KL}\left(p_{data} \| p_{model}\right)
\end{aligned}$$

The optimal parameters $\boldsymbol{\theta}$ will, in principle, be the same. Even though the optimization landscapes differ (as defined by the objective functions), maximizing the likelihood is equivalent to minimizing the KL divergence. In this case, the entropy of the data $H(p_{data})$ is a constant that does not affect the optimization.

From the statistical point of view, it is more natural to think in terms of bringing the distributions close together, hence KL-div. From the information-theoretic point of view, cross-entropy might make more sense to you.
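The sketch below verifies this decomposition numerically for two arbitrary discrete distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # data distribution
q = np.array([0.4, 0.4, 0.2])  # model distribution

entropy = -np.sum(p * np.log(p))        # H(p): constant w.r.t. the model
cross_entropy = -np.sum(p * np.log(q))  # H(p, q)
kl = np.sum(p * np.log(p / q))          # D_KL(p || q)

# H(p, q) = H(p) + D_KL(p || q), so optimizing either objective
# moves the model parameters in exactly the same direction.
print(np.isclose(cross_entropy, entropy + kl))  # True
```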

MLE in linear regression

Let's consider linear regression. Imagine that each single prediction $\hat{y}$ is the mean of a Gaussian distribution over the corresponding target $y$.



Now we need an assumption. We hypothesize that the neural network, or any estimator $f$, produces the mean of a Gaussian, $\hat{y}=f(\mathbf{x}, \boldsymbol{\theta})$, with $y$ normally distributed around it with fixed variance $\sigma^2$:

$$\begin{aligned}
\hat{y} &=f(\mathbf{x}, \boldsymbol{\theta}) \\
y & \sim \mathcal{N}\left(\mu=\hat{y}, \sigma^{2}\right) \\
p(y \mid \mathbf{x}, \boldsymbol{\theta}) &=\frac{1}{\sigma \sqrt{2 \pi}} \exp \left(\frac{-(y-\hat{y})^{2}}{2 \sigma^{2}}\right)
\end{aligned}$$

In terms of log-likelihood we can form a loss function:

$$\begin{aligned}
L &=\sum_{i=1}^{m} \log p\left(y^{(i)} \mid \mathbf{x}^{(i)}, \boldsymbol{\theta}\right) \\
&=\sum_{i=1}^{m} \log \frac{1}{\sigma \sqrt{2 \pi}} \exp \left(\frac{-\left(\hat{y}^{(i)}-y^{(i)}\right)^{2}}{2 \sigma^{2}}\right) \\
&=\sum_{i=1}^{m}\left[-\log (\sigma \sqrt{2 \pi})-\frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^{2}}{2 \sigma^{2}}\right] \\
&=\sum_{i=1}^{m}\left[-\log (\sigma)-\frac{1}{2} \log (2 \pi)-\frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^{2}}{2 \sigma^{2}}\right] \\
&=-m \log (\sigma)-\frac{m}{2} \log (2 \pi)-\sum_{i=1}^{m} \frac{\left(\hat{y}^{(i)}-y^{(i)}\right)^{2}}{2 \sigma^{2}}
\end{aligned}$$

By taking the gradient with respect to the parameters, the terms that do not depend on $\boldsymbol{\theta}$ vanish and we are left with the desired MSE:

$$\begin{aligned}
\nabla_{\theta} L &=\nabla_{\theta}\left(-m \log (\sigma)-\frac{m}{2} \log (2 \pi)-\sum_{i=1}^{m} \frac{\left\|\hat{y}^{(i)}-y^{(i)}\right\|^{2}}{2 \sigma^{2}}\right) \\
&=-\nabla_{\theta} \sum_{i=1}^{m} \frac{\left\|\hat{y}^{(i)}-y^{(i)}\right\|^{2}}{2 \sigma^{2}} \\
&=-\frac{m}{2 \sigma^{2}}\, \nabla_{\theta} \operatorname{MSE}
\end{aligned}$$

Since $\operatorname{MSE}=\frac{1}{m} \sum_{i=1}^{m}\left\|\hat{y}^{(i)}-y^{(i)}\right\|^{2}$, maximizing the log-likelihood of the Gaussian model is equivalent to minimizing the mean squared error.
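A quick numerical sanity check of this equivalence, with random vectors standing in for targets and predictions (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=100)
y_pred = y_true + 0.3 * rng.normal(size=100)  # stand-in model predictions
m, sigma = len(y_true), 1.0

mse = np.mean((y_pred - y_true) ** 2)
# negative Gaussian log-likelihood, i.e. -L from the derivation above
nll = m * np.log(sigma) + 0.5 * m * np.log(2 * np.pi) \
      + np.sum((y_pred - y_true) ** 2) / (2 * sigma**2)

# NLL = constant + (m / (2 sigma^2)) * MSE: same minimizer, different scale
constant = m * np.log(sigma) + 0.5 * m * np.log(2 * np.pi)
print(np.isclose(nll, constant + m / (2 * sigma**2) * mse))  # True
```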

MLE in supervised classification

In linear regression, we parametrized $p_{model}(y \mid \mathbf{x}, \boldsymbol{\theta})$ as a Gaussian whose mean is the model's prediction. In classification, we instead parametrize it as a discrete distribution over the class labels.

It is possible to convert linear regression to a classification problem. All we need to do is encode the ground truth labels as one-hot vectors:

$$p_{data}\left(y \mid \mathbf{x}_{i}\right)= \begin{cases}1 & \text{if } y=y_{i} \\ 0 & \text{otherwise}\end{cases},$$

where $i$ refers to a single data instance. With a one-hot target, the per-example cross-entropy collapses to a single term:

$$\begin{aligned}
H_{i}\left(p_{data}, p_{model}\right) &=-\sum_{y \in Y} p_{data}\left(y \mid \mathbf{x}_{i}\right) \log p_{model}\left(y \mid \mathbf{x}_{i}\right) \\
&=-\log p_{model}\left(y_{i} \mid \mathbf{x}_{i}\right)
\end{aligned}$$
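The collapse is easy to verify numerically; the probabilities below are arbitrary stand-ins for a model's predicted class distribution:

```python
import numpy as np

p_model = np.array([0.1, 0.7, 0.2])    # predicted class probabilities
y_one_hot = np.array([0.0, 1.0, 0.0])  # one-hot ground truth, true class is 1

# The full sum over classes vs. the single surviving term:
full_sum = -np.sum(y_one_hot * np.log(p_model))
single_term = -np.log(p_model[1])
print(np.isclose(full_sum, single_term))  # True: only the true class contributes
```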

For simplicity let’s consider the binary case of two labels, 0 and 1.

$$\begin{aligned}
L &=\sum_{i=1}^{n} H_{i}\left(p_{data}, p_{model}\right) \\
&=\sum_{i=1}^{n}-\log p_{model}\left(y_{i} \mid \mathbf{x}_{i}\right) \\
&=-\sum_{i=1}^{n} \log p_{model}\left(y_{i} \mid \mathbf{x}_{i}\right)
\end{aligned}$$

$$\underset{\boldsymbol{\theta}}{\arg \min }\, L = \underset{\boldsymbol{\theta}}{\arg \min }\left(-\sum_{i=1}^{n} \log p_{model}\left(y_{i} \mid \mathbf{x}_{i}\right)\right)$$

This is in line with our definition of conditional MLE:

$$\boldsymbol{\theta}_{\mathrm{ML}}=\underset{\boldsymbol{\theta}}{\arg \max } \sum_{i=1}^{m} \log p_{model}\left(\boldsymbol{y}^{(i)} \mid \boldsymbol{x}^{(i)}, \boldsymbol{\theta}\right)$$

Broadly speaking, MLE can be applied to most (supervised) learning problems by specifying a parametric family of (conditional) probability distributions.

Another way to achieve this in a binary classification problem would be to take the scalar output $y$ of the linear layer and pass it through a sigmoid function. The output will be in the range $[0,1]$, and we define it as the probability $p(y = 1 \mid \mathbf{x}, \boldsymbol{\theta})$:

$$p(y = 1 \mid \mathbf{x}, \boldsymbol{\theta}) = \sigma( \boldsymbol{\theta}^T \mathbf{x}) = \operatorname{sigmoid}( \boldsymbol{\theta}^T \mathbf{x}) \in [0,1]$$

Consequently, $p(y = 0 \mid \mathbf{x}, \boldsymbol{\theta}) = 1 - p(y = 1 \mid \mathbf{x}, \boldsymbol{\theta})$. Training this Bernoulli model with MLE yields the binary cross-entropy loss.
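Putting the pieces together, here is a minimal logistic-regression sketch trained by gradient descent on the binary cross-entropy; the synthetic data, learning rate, and iteration count are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_direction = np.array([1.5, -2.0])
y = (X @ true_direction > 0).astype(float)  # synthetic binary labels

theta = np.zeros(2)
lr = 0.5
for _ in range(500):
    y_hat = sigmoid(X @ theta)         # p(y = 1 | x, theta)
    grad = X.T @ (y_hat - y) / len(y)  # gradient of the mean negative log-likelihood
    theta -= lr * grad

print(theta)  # points roughly along true_direction
```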

Bonus: What would happen if we use MSE on binary classification?

So far I presented the basics. This is a bonus question that I was asked during an ML interview: What if we use MSE on binary classification?

When the ground truth label is $y^{(i)}=0$, with prediction $\hat{y}^{(i)}=\sigma(\boldsymbol{\theta}^T \mathbf{x}^{(i)})$:

$$\operatorname{MSE}=\frac{1}{m} \sum_{i=1}^{m}\left\|\hat{y}^{(i)}-0\right\|^{2}= \frac{1}{m} \sum_{i=1}^{m}\left\|\sigma\left( \boldsymbol{\theta}^T \mathbf{x}^{(i)}\right) \right\|^{2}$$

When $y^{(i)}=1$:

$$\operatorname{MSE}=\frac{1}{m} \sum_{i=1}^{m}\left\|1 -\hat{y}^{(i)}\right\|^{2}= \frac{1}{m} \sum_{i=1}^{m}\left\|1 - \sigma\left( \boldsymbol{\theta}^T \mathbf{x}^{(i)}\right) \right\|^{2}$$

One intuitive way to guess what's happening without diving into the math is this: at the beginning of training, the network will output something very close to 0.5, which gives roughly the same gradient signal for both classes. Below is a more principled derivation, proposed by Jonas Maison after the initial release of the article.

Proposed demonstration by Jonas Maison

Let's assume that we have a simple neural network with weights $\theta$ such that $z=\theta^\intercal x$ and prediction $\hat{y}=\sigma(z)$. By the chain rule, the gradient of any loss $L$ with respect to the weights is:

$$\frac{\partial L}{\partial \theta}=\frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z}\frac{\partial z}{\partial \theta}$$

MSE Loss

$$L(y, \hat{y}) = \frac{1}{2}(y-\hat{y})^2$$
$$\frac{\partial L}{\partial \theta}=-(y-\hat{y})\,\sigma(z)(1-\sigma(z))\,x$$
$$\frac{\partial L}{\partial \theta}=-(y-\hat{y})\,\hat{y}(1-\hat{y})\,x$$

The term $\sigma(z)(1-\sigma(z))$ is the culprit: it approaches 0 when $\sigma(z)$ saturates near 0 or 1, so the MSE gradient vanishes even when the prediction is completely wrong.
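You can see the saturation directly by evaluating this factor at a few logits:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [-8.0, -4.0, 0.0, 4.0, 8.0]:
    s = sigmoid(z)
    # the factor multiplying every MSE gradient; near 0 for saturated outputs
    print(f"z = {z:+.1f}  sigma(z)(1 - sigma(z)) = {s * (1 - s):.6f}")
```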

Binary Cross Entropy (BCE) Loss

$$L(y, \hat{y}) = -y\log(\hat{y})-(1-y)\log(1-\hat{y})$$

For $y=0$:

$$\frac{\partial L}{\partial \theta}=\frac{1-y}{1-\hat{y}}\,\sigma(z)(1-\sigma(z))\,x$$
$$\frac{\partial L}{\partial \theta}=\frac{1-y}{1-\hat{y}}\,\hat{y}(1-\hat{y})\,x$$
$$\frac{\partial L}{\partial \theta}=(1-y)\,\hat{y}\,x$$
$$\frac{\partial L}{\partial \theta}=\hat{y}\,x$$

If the network is right, $\hat{y}=0$ and the gradient vanishes; if it is wrong, $\hat{y} \to 1$ and the gradient approaches $x$, so the update remains strong.

For $y=1$:

$$\frac{\partial L}{\partial \theta}=-\frac{y}{\hat{y}}\,\sigma(z)(1-\sigma(z))\,x$$
$$\frac{\partial L}{\partial \theta}=-\frac{y}{\hat{y}}\,\hat{y}(1-\hat{y})\,x$$
$$\frac{\partial L}{\partial \theta}=-y(1-\hat{y})\,x$$
$$\frac{\partial L}{\partial \theta}=-(1-\hat{y})\,x$$

If the network is right, $\hat{y}=1$ and the gradient vanishes; if it is wrong, $\hat{y} \to 0$ and the gradient approaches $-x$. Unlike MSE, BCE cancels the saturating $\sigma(z)(1-\sigma(z))$ factor, so a confidently wrong prediction still produces a large gradient.
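Finally, a small numerical comparison of the two gradients for a confidently wrong prediction, using the expressions derived above (the values of $y$, $z$, and $x$ are chosen for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A confidently wrong prediction: true label y = 1 but z is very negative.
y, z, x = 1.0, -6.0, 1.0
y_hat = sigmoid(z)  # ~0.0025

grad_mse = -(y - y_hat) * y_hat * (1 - y_hat) * x  # ~ -0.0025: almost no signal
grad_bce = -(1 - y_hat) * x                        # ~ -0.9975: strong signal
print(grad_mse, grad_bce)
```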

Conclusion

This short analysis explains why the objective functions we often choose blindly, such as cross-entropy, are in fact principled. MLE is a principled way to define an optimization problem, and I find it a common discussion topic for backing up design choices during interviews.

