Maximum likelihood estimation

1 Problem

We need to estimate the probability density p(x) of a random variable from its observed values.

Illustration

2 Approach

We will use the idea of parametric distribution estimation: we choose the best parameters of a chosen family of densities p_\theta(x), indexed by a parameter \theta. The idea is very natural: we pick the parameters that maximize the probability (or the logarithm of the probability) of the observed values.

\theta^* = \arg \max\limits_{\theta} \log p_\theta(x)
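
As a minimal numerical illustration of this principle (a sketch, not part of the original text), the snippet below fits the rate \theta of an exponential density p_\theta(x) = \theta e^{-\theta x} by maximizing the log-likelihood; the known closed-form answer \theta^* = 1/\bar{x} serves as a check.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Observed samples (drawn from an exponential distribution just for this demo).
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)  # true theta = 1 / scale = 0.5

# Log-likelihood of the exponential family p_theta(x) = theta * exp(-theta * x).
def log_likelihood(theta):
    return np.sum(np.log(theta) - theta * x)

# Maximize the log-likelihood (equivalently, minimize its negative).
res = minimize_scalar(lambda t: -log_likelihood(t), bounds=(1e-6, 10.0), method="bounded")

print("numerical MLE:", res.x)          # close to the closed-form answer
print("closed form 1/mean:", 1.0 / x.mean())
```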

2.1 Linear measurements with i.i.d. noise

Suppose, we are given the set of observations:

x_i = \theta^\top a_i + \xi_i, \quad i = 1, \ldots, m,

where

  • \theta \in \mathbb{R}^n - unknown vector of parameters
  • a_i \in \mathbb{R}^n - known measurement vectors
  • \xi_i - i.i.d. noise random variables with density p(z)
  • x_i - measurements, x \in \mathbb{R}^m

This leads to the following optimization problem:

\max\limits_{\theta} \log p(x) = \max_\theta \sum\limits_{i=1}^m \log p (x_i - \theta^\top a_i) = \max_\theta L(\theta)

The sum appears because all observations are independent, so p(\xi) = \prod\limits_{i=1}^m p(\xi_i). The objective function L(\theta) is called the log-likelihood function.
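
A direct way to see this objective in code (a sketch with hypothetical names, not part of the original text): given a matrix A whose rows are a_i^\top, the measurements x, and the noise log-density \log p(z), the log-likelihood is just the sum of \log p over the residuals.

```python
import numpy as np

def log_likelihood(theta, A, x, log_p):
    """L(theta) = sum_i log p(x_i - theta^T a_i) for the linear measurement model."""
    residuals = x - A @ theta      # x_i - theta^T a_i for all i at once
    return np.sum(log_p(residuals))
```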

2.1.1 Gaussian noise

p(z) = \dfrac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{z^2}{2 \sigma^2}}

\log p(z) = - \dfrac{1}{2} \log (2 \pi \sigma^2) - \dfrac{z^2}{2 \sigma^2}

\begin{split} L(\theta) &= \sum\limits_{i=1}^m \left[ - \dfrac{1}{2} \log (2 \pi \sigma^2) - \dfrac{(x_i - \theta^\top a_i)^2}{2 \sigma^2} \right] \\ &= - \dfrac{m}{2} \log (2 \pi \sigma^2) - \dfrac{1}{2 \sigma^2} \sum\limits_{i=1}^m (x_i - \theta^\top a_i)^2 \end{split}

This means that the maximum likelihood estimate in the case of Gaussian noise is the least squares solution.
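
A quick numerical check of this equivalence (a sketch on assumed synthetic data, not from the original text): maximizing L(\theta) under Gaussian noise amounts to minimizing \sum_i (x_i - \theta^\top a_i)^2, i.e. an ordinary least squares solve.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 3
A = rng.normal(size=(m, n))                        # rows are a_i^T
theta_true = np.array([1.0, -2.0, 0.5])
x = A @ theta_true + 0.1 * rng.normal(size=m)      # Gaussian noise

# Maximizing L(theta) = minimizing the sum of squared residuals.
theta_ls, *_ = np.linalg.lstsq(A, x, rcond=None)
print(theta_ls)                                    # close to theta_true
```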

2.1.2 Laplacian noise

p(z) = \dfrac{1}{2a} e^{-\frac{|z|}{a}}

\log p(z) = - \log (2a) - \dfrac{|z|}{a}

\begin{split} L(\theta) &= \sum\limits_{i=1}^m \left[ - \log (2a) - \dfrac{|x_i - \theta^\top a_i|}{a} \right] \\ &= - m \log (2 a) - \dfrac{1}{a} \sum\limits_{i=1}^m |x_i - \theta^\top a_i| \end{split}

This means that the maximum likelihood estimate in the case of Laplacian noise is the \ell_1-norm (least absolute deviations) solution.
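
One way to compute this estimate (a sketch, not prescribed by the text) is the standard linear-programming reformulation of \ell_1 regression: minimize \sum_i t_i subject to -t_i \leq x_i - \theta^\top a_i \leq t_i, solved here with scipy.optimize.linprog.

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression(A, x):
    """Minimize sum_i |x_i - theta^T a_i| via an LP in variables z = [theta, t]."""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(m)])        # objective: sum of t_i
    # x_i - a_i^T theta <= t_i   and   a_i^T theta - x_i <= t_i
    A_ub = np.block([[-A, -np.eye(m)],
                     [ A, -np.eye(m)]])
    b_ub = np.concatenate([-x, x])
    bounds = [(None, None)] * n + [(0, None)] * m        # theta free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:n]
```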

2.1.3 Uniform noise

p(z) = \begin{cases} \frac{1}{2a}, & -a \leq z \leq a, \\ 0, & z<-a \text{ or } z>a \end{cases}

\log p(z) = \begin{cases} - \log(2a), & -a \leq z \leq a, \\ -\infty, & z<-a \text{ or } z>a \end{cases}

L(\theta) = \begin{cases} - m\log(2a), & |x_i - \theta^\top a_i| \leq a \text{ for all } i, \\ -\infty, & \text{ otherwise } \end{cases}

This means that the maximum likelihood estimate in the case of uniform noise is any vector \theta that satisfies \vert x_i - \theta^\top a_i \vert \leq a for all i = 1, \ldots, m.
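
Finding such a \theta is a pure feasibility problem; one way to solve it (a sketch, not from the original text) is a linear program with a zero objective and the constraints -a \leq x_i - \theta^\top a_i \leq a.

```python
import numpy as np
from scipy.optimize import linprog

def uniform_mle(A, x, a):
    """Return any theta with |x_i - theta^T a_i| <= a for all i (None if infeasible)."""
    m, n = A.shape
    # -a <= x_i - a_i^T theta <= a rewritten as two sets of linear inequalities:
    #   a_i^T theta <= x_i + a   and   -a_i^T theta <= -(x_i - a)
    A_ub = np.vstack([A, -A])
    b_ub = np.concatenate([x + a, -(x - a)])
    res = linprog(np.zeros(n), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n)
    return res.x if res.success else None
```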

2.2 Binary logistic regression

Suppose we are given a set of binary random variables y_i \in \{0,1\} with corresponding inputs x_i. Let us parametrize the distribution with a sigmoid, using an affine transformation of the input as the argument of the sigmoid.

Picture from Wikipedia

\begin{split} p(y_i = 1) &= \dfrac{\text{exp}(\theta_0^\top x_i + \theta_1)}{1 + \text{exp}(\theta_0^\top x_i + \theta_1)} \\ p(y_i = 0) &= \dfrac{1}{1 + \text{exp}(\theta_0^\top x_i + \theta_1)} \end{split}

Let us assume that the first k observations are ones: y_1, \ldots, y_k = 1 and y_{k+1}, \ldots, y_m = 0. Then the log-likelihood function can be written as follows:

L(\theta_0, \theta_1) = \sum\limits_{i=1}^k (\theta_0^\top x_i + \theta_1) - \sum\limits_{i=1}^m \log(1 + \text{exp}(\theta_0^\top x_i + \theta_1))
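
The sketch below (assuming synthetic data and scipy's BFGS optimizer, neither prescribed by the text) maximizes this log-likelihood by minimizing its negative, using np.logaddexp for a numerically stable \log(1 + \exp(\cdot)).

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, X, y):
    """-L(theta_0, theta_1) for binary logistic regression; params = [theta_0, theta_1]."""
    theta0, theta1 = params[:-1], params[-1]
    z = X @ theta0 + theta1
    # L = sum_{y_i = 1} z_i - sum_i log(1 + exp(z_i));  logaddexp(0, z) = log(1 + exp(z))
    return -(np.sum(z[y == 1]) - np.sum(np.logaddexp(0.0, z)))

# Tiny synthetic example (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X @ np.array([2.0, -1.0]) + 0.5 + rng.normal(size=100) > 0).astype(int)

res = minimize(neg_log_likelihood, x0=np.zeros(3), args=(X, y), method="BFGS")
theta0_hat, theta1_hat = res.x[:2], res.x[2]
print(theta0_hat, theta1_hat)
```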
