Applications: Statistics and optimal estimation

Definition

Mathematical statistics involves methods for collecting, analyzing, and evaluating data. The goal is to derive usable information in the form of statistical metrics and the optimal estimation of relationships fraught with uncertainty.

Optimality in estimating parameters, unknown functional relationships, uncertainties, etc., is achieved by maximizing probabilities conditioned on the available data. Optimal estimation is necessary wherever real-world complexities impede the unambiguous solvability of a problem.

Example

This is often the case when information is to be extracted from measurement data. Measurement data are subject to random and systematic fluctuations, stemming from imperfections in the measurement system and dynamic influences of the environment acting on the system. Data are typically contradictory and must be processed before they are useful. Based on a series of measured values at positions \(z_k, k=1, …, n\), various questions can be relevant:

  1. Regression. Find parameters of a model that best explain the observations.
  2. Interpolation. Estimate measurement values at positions where no measurements have taken place.
  3. Signal separation. Decompose the measured values into systematic and random components.
  4. Uncertainty estimation. Quantify the uncertainties in the information derived from data.

Figure 1: The results of regression, interpolation, and signal separation for fictitious 1D and 2D data.

Explanation: regression

In the example illustrating regression, the values \(l_k, k=1, …, n\) have been observed at the positions \(z_k, k=1, …, n\). Now the parameters \(x\) are to be chosen such that the predictions \(g(x,z_k)\) and observations \(l_k\) match as closely as possible. The predictive model in the depicted 1D case is the linear equation

$$ g(x,z)=x_1+x_2z$$

which predicts the observation \(g(x,z)\) at any position \(z\). More complex models are also possible. They can take the form \(g(x,z)=\sum_{j=1}^{m_1}x_jg_j(z_1, …, z_{m_2})\) with \(m_1\) parameters \(x=[x_1, …, x_{m_1}]\) and \(m_1\) different functions \(g_j(z_1, …, z_{m_2})\) depending on an \(m_2\)-dimensional position variable \(z=[z_1, …, z_{m_2}]\).
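
To make the general form concrete, here is a minimal sketch (not part of the original text) that evaluates such a basis-function model in Python; the particular basis functions, parameter values, and positions are made up purely for illustration.

```python
import numpy as np

# Illustrative basis functions g_j(z) for a 1D position variable (m_2 = 1).
basis_functions = [lambda z: np.ones_like(z),  # g_1(z) = 1
                   lambda z: z,                # g_2(z) = z
                   lambda z: np.sin(z)]        # g_3(z) = sin(z)

def g(x, z):
    """Evaluate g(x, z) = sum_j x_j g_j(z) for parameters x at positions z."""
    return sum(x_j * g_j(z) for x_j, g_j in zip(x, basis_functions))

x = np.array([1.0, 0.5, 2.0])   # m_1 = 3 parameters (made-up values)
z = np.linspace(0.0, 4.0, 5)    # positions at which to predict
print(g(x, z))                  # predicted observations at these positions
```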

The requirement for a probability-maximizing choice of the parameter vector \(x\) can be formalized as the optimization problem

$$ \begin{align} \max_x ~~~&  p(l_1, …, l_n, z_1, …, z_n | x_1, x_2) \end{align}$$

where the objective function \(p(l,z|x)\) indicates the probability of an observation \(l\) at the position \(z\) when the parameters are set to \(x\).

Least squares

Assuming \(l=x_1+x_2z+\epsilon\) with \(\epsilon\) being standard normally distributed noise, the probability can be rewritten as

$$ p(l_1, …, l_n, z_1, …, z_n|x_1,x_2) = \prod_{k=1}^n p(l_k,z_k|x_1,x_2)= c \exp\left(-\sum_{k=1}^n \frac{1}{2}[l_k-x_1-x_2z_k]^2\right).$$

The constant \(c\) is irrelevant for maximizing the probability \(p(l,z|x)\) or, equivalently, minimizing \(-\log p(l,z|x)\), and the following optimization problem results.

$$ \begin{align} \min_{x_1, x_2} ~~~& \sum_{k=1}^n \left[l_k-x_1-x_2z_k\right]^2 \\ =\min_{x_1, x_2} ~~~& \|l-Ax\|_2^2 \\ ~~~& \|l-Ax\|_2^2=(l-Ax)^T(l-Ax) \\ ~~~& A=\begin{bmatrix} 1 & z_1 \\ \vdots & \vdots \\ 1 & z_n \end{bmatrix} ~~~ l = \begin{bmatrix}l_1 \\ \vdots \\ l_n\end{bmatrix} \end{align}$$

This is a simple quadratic program without constraints that can even be solved by hand, yielding the optimal \(x^*=(A^TA)^{-1}A^Tl\). This formulation is known as the least squares problem, since it minimizes the squares of the discrepancies between measured and predicted values.
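
As a minimal sketch with made-up data, the closed-form solution can be computed directly; np.linalg.lstsq is shown as a numerically more robust alternative to forming \(A^TA\) explicitly.

```python
import numpy as np

# Illustrative measurements l_k at positions z_k.
z = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
l = np.array([0.9, 2.1, 2.9, 4.2, 4.8])

# Design matrix A with rows [1, z_k] for the model g(x, z) = x_1 + x_2 z.
A = np.column_stack([np.ones_like(z), z])

# Closed-form least squares solution x* = (A^T A)^{-1} A^T l.
x_closed_form = np.linalg.solve(A.T @ A, A.T @ l)

# Numerically more stable route via a dedicated least squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, l, rcond=None)

print(x_closed_form)  # estimated intercept x_1 and slope x_2
print(x_lstsq)        # agrees with the closed-form solution
```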

Figure 2: Illustration of the objective function in least squares. The optimal solution minimizes the total area of the squared errors. Larger squared errors correspond to lower consistency between parameters and observations.

Other tasks

Interpolation, signal separation, and uncertainty estimation can also be formulated as optimization problems.

Interpolation:
Minimize \(\|x\|_{\mathcal{H}_2}^2\) subject to \(Ax = l\), where \(x \in \mathcal{H}_2\).

Signal separation:
Minimize \(\|Ax-l\|_{\mathcal{H}_1}^2 + \|x\|_{\mathcal{H}_2}^2\) subject to \(x \in \mathcal{H}_2\).

Uncertainty estimation:
Minimize \(\langle \Sigma, P \rangle_F + 2q^T\mu + r\) subject to \(\begin{bmatrix} P & q \\ q^T & r \end{bmatrix} \succeq \tau_i \begin{bmatrix} 0 & a_i/2 \\ a_i^T/2 & -b_i \end{bmatrix}\) and \(\begin{bmatrix} P & q \\ q^T & r \end{bmatrix} \succeq 0\).

More details on the precise meaning of these quadratic and semidefinite programs can be found on the subsequent pages.
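
As a rough illustration of the signal separation formulation only, the following sketch assumes that both Hilbert-space norms reduce to plain Euclidean norms, so the problem becomes ridge-regularized least squares; the data and the regularization weight are made up.

```python
import numpy as np

# Illustrative noisy observations l at positions z.
rng = np.random.default_rng(0)
z = np.linspace(0.0, 4.0, 20)
l = 1.0 + 0.8 * z + 0.3 * rng.standard_normal(z.size)

# Design matrix for the systematic (trend) component.
A = np.column_stack([np.ones_like(z), z])

lam = 1.0  # regularization weight standing in for the H_2 norm term (assumed)

# Minimize ||A x - l||^2 + lam * ||x||^2, i.e. x* = (A^T A + lam I)^{-1} A^T l.
x_star = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ l)

signal = A @ x_star   # systematic component
noise = l - signal    # random component (residual)
print(x_star, noise.std())
```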

Solution procedures

The number of data points to be integrated into the model is typically not overwhelmingly large, and in that case the optimization problems can be solved with publicly available open-source solvers. However, when dealing with several hundred thousand or millions of data points, numerical complications may arise, which can be mitigated by intelligently exploiting underlying problem structures.

To avoid processing huge correlation matrices with \(n^2\) entries (where \(n\) is the number of data points), tensor decomposition and numerical inversion are used. Stochastic gradient descent, well known from machine learning, also bypasses the large matrices encountered in holistic data evaluation by sequentially processing the data. These strategies are rarely necessary when dealing with time series, audio data, or spot measurements but are essential when processing automatically generated multidimensional data from cameras or radar instruments.
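
A minimal sketch of the stochastic gradient descent idea for the least-squares model above: observations are processed one at a time, so no large matrix is ever formed; the simulated data, step size, and single pass over the data are illustrative choices, not a recommendation.

```python
import numpy as np

# Simulated streaming data for the model g(x, z) = x_1 + x_2 z.
rng = np.random.default_rng(1)
z = rng.uniform(0.0, 10.0, size=50_000)
l = 2.0 + 0.5 * z + 0.1 * rng.standard_normal(z.size)

x = np.zeros(2)   # parameter estimate [x_1, x_2]
step = 1e-3       # illustrative step size

# One pass over the data, one observation at a time.
for z_k, l_k in zip(z, l):
    a_k = np.array([1.0, z_k])     # row of the design matrix for this point
    residual = a_k @ x - l_k       # prediction error for this point
    x -= step * residual * a_k     # gradient step on (a_k @ x - l_k)^2 / 2

print(x)  # should end up close to [2.0, 0.5]
```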

Applications

Every practical problem involves unknown quantities and relationships, which is why methods of statistics and optimal estimation are now encountered everywhere. Applications include the optimal estimation of travel times, house prices, ore grades, material properties, building deformations, flight trajectories, and the chemical compositions of distant planets. Furthermore, they include the analysis of winning probabilities in sports or the failure probability of components, the decomposition of measurement data into signal and noise, the identification of objects in images, and model building for the spread of diseases or political opinions. More applications can be found in this list.

Optimal estimation is the response to the ubiquitous presence of data and model uncertainties.

Figure 3: Symbolic illustration of optimal estimation's role as a minimizer of discrepancies between data and model.

Practical aspects

The main challenge in setting up optimal estimation problems with real-world backgrounds lies in how the random elements in the data and models are represented. At a minimum, this requires recourse to probability theory, and the stochastic modeling of random effects requires experience with probability distributions tailored to the various situations; more on this here.

In many cases, the optimal estimation of parameters or function values can be reduced to a least-squares problem and even solved manually. However, when non-normally distributed variables are involved, the probabilities to be maximized can quickly take on a complex form, and dedicated optimization algorithms are required.
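
As one illustration of this point (not from the original text): assuming Laplace-distributed rather than normally distributed noise turns the objective into a sum of absolute deviations, which has no simple closed-form solution; the sketch below hands it to a general-purpose optimizer, with made-up data containing one gross outlier.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up data with one gross outlier at z = 4.
z = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
l = np.array([1.0, 1.5, 2.1, 2.4, 10.0])

# Maximizing a Laplace likelihood amounts to minimizing absolute deviations.
def objective(x):
    return np.sum(np.abs(l - x[0] - x[1] * z))

# No closed form as in least squares: use a derivative-free numerical optimizer.
result = minimize(objective, x0=np.array([1.0, 1.0]), method="Nelder-Mead")
print(result.x)  # estimated [x_1, x_2]; much less affected by the outlier
```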