Optimal estimation: Parameter estimation

Definition

Parameter estimation (also called regression) encompasses all those tasks that can be reduced to data-driven model fitting. The objective is to select parameters that bring the model behavior into the closest possible alignment with the actually observed real-world behavior.

The nature of the data and the complexity of the model to be fitted do not matter. For example, image classification can be considered a parameter estimation problem: the data consists of images annotated with image classes, and the model is a neural network with many parameters that is intended to assign image classes to images. The transition to machine learning is therefore seamless.

Example

Suppose a weather station provides temperature readings \(l_k, k=1, …, n\) at regular intervals, leading to the sequence of measurements \( (l_1, z_1), … , (l_n, z_n)\), where \(z_k, k=1, …, n\) denotes the measurement times. Expecting long-term trends and periodic fluctuations due to time-of-day influences, the following model \(g(x,z) \) is proposed to explain the actually observed temperature data \(l\):

$$ \begin{align} l \approx g(x,z) &= x_1g_1(z) + x_2 g_2(z) + x_3 g_3(z) \\ &= \underbrace{x_1}_{\text{constant}} + \underbrace{x_2z}_{\text{linear trend}} + \underbrace{x_3\sin\left(\frac{2 \pi}{24}z \right)}_{\text{periodic influence}} \end{align}$$

In this model, the parameters \(x_1, x_2, x_3\) are undetermined and are to be chosen such that \(g(x,z_k)\) and \(l_k\) are approximately equal.

Figure 1: The data and the individual components of the model \(g(x,z)=x_1 g_1(z)+ x_2 g_2(z) + x_3 g_3(z)\) are aligned as closely as possible by optimizing over \(x\). The optimal model \(g(x^*,z)\) is also illustrated.

The solution to the optimization problem

$$ \begin{align} \min_x ~~~& \sum_{k=1}^n (l_k-g(x,z_k))^2 \end{align}$$

is the optimal parameter vector \(x^*=[x_1^*, x_2^*, x_3^*]^T\). It allows interpretations of the relative strength of the constant, linear, and periodic effects. The discrepancy \(\sum_{k=1}^n(l_k-g(x^*,z_k))^2\) is an indicator of the overall suitability of the model.
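As a minimal sketch, the optimization problem above can be solved with NumPy's linear least-squares routine. The measurement times, noise level, and "true" parameters below are assumptions made for illustration and are not taken from the tutorial code.

```python
# Minimal sketch: fit g(x,z) = x_1 + x_2*z + x_3*sin(2*pi/24*z) to synthetic
# temperature readings by linear least squares (all numbers are assumptions).
import numpy as np

rng = np.random.default_rng(0)
z = np.arange(0.0, 72.0, 1.0)                        # measurement times in hours
x_true = np.array([15.0, 0.05, 3.0])                 # hypothetical true parameters
G = np.column_stack([np.ones_like(z), z, np.sin(2 * np.pi / 24 * z)])
l = G @ x_true + rng.normal(0.0, 0.5, size=z.size)   # noisy observations l_k

# Solve min_x sum_k (l_k - g(x, z_k))^2
x_star, residual_ss, *_ = np.linalg.lstsq(G, l, rcond=None)
print("estimated parameters x*:", x_star)
print("sum of squared residuals:", residual_ss)
```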

Generalization

The approach above can easily be extended to data \(l_k\) of any dimension \(d_1\), function inputs \(z\) of any dimension \(d_2\), and any number of basis functions \(g_1(z), … , g_m(z)\). The optimization problem to determine the optimal parameters \(x^*\) is then formulated as:

$$ \begin{align} \min_x ~~~& \sum_{k=1}^n \|l_k-\sum_{j=1}^m x_jg_j(z_k)\|_2^2 \end{align}$$

where \(g(x,z)=\sum_{j=1}^m x_jg_j(z)\) is a function from \(\mathbb{R}^{d_2}\) to \(\mathbb{R}^{d_1}\) and \(\|v\|_2^2 = \sum_{k=1}^{d_1}v_k^2\) measures the length of the vector \(v\in \mathbb{R}^{d_1}\). In the form above, without additional constraints or complications, this is a quadratic program whose solution can be explicitly written as:

$$ \begin{align} x^* &=(G^TG)^{-1}G^T l \\ l&\in \mathbb{R}^{n d_1} ~~~~ l=[l_{11}, … , l_{1 d_1}, …, l_{n1}, …, l_{nd_1}]^T \\ G & \in \mathbb{R}^{n d_1 \times m} ~~~~ G=\begin{bmatrix} G_1 \\ \vdots \\ G_n \end{bmatrix} \\ G_k &\in \mathbb{R}^{d_1 \times m} ~~~~ G_k=\begin{bmatrix} g_{11}(z_k) & \cdots & g_{m1}(z_k) \\ \vdots & \ddots & \vdots \\ g_{1d_1}(z_k) & \cdots & g_{md_1}(z_k)\end{bmatrix} \end{align}$$

Here, \(g_{ij}(z_k)\) is the \(j\)-th entry of the vector \(g_i(z_k)\). This formulation allows solving parameter estimation problems for, e.g., two-dimensional trajectories and vector fields; see the illustration.
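As a sketch of the multidimensional case, the stacked matrix \(G\) and the closed-form solution can be assembled as follows for an output dimension \(d_1=2\); the basis functions and data below are invented for illustration and do not correspond to the figure.

```python
# Minimal sketch: closed-form least squares x* = (G^T G)^{-1} G^T l for a
# two-dimensional trajectory (d1 = 2) built from m = 3 basis functions.
import numpy as np

rng = np.random.default_rng(1)
n, d1 = 50, 2
z = np.linspace(0.0, 1.0, n)

def basis(zk):
    # each g_j maps the scalar input z_k to a d1-dimensional vector
    return [np.array([1.0, 0.0]),                                         # g_1: constant offset
            np.array([zk, zk]),                                           # g_2: linear drift
            np.array([np.sin(2 * np.pi * zk), np.cos(2 * np.pi * zk)])]   # g_3: circular motion

# Stack the blocks G_k (shape d1 x m) into G (shape n*d1 x m)
G = np.vstack([np.column_stack(basis(zk)) for zk in z])
x_true = np.array([0.5, 1.0, 0.3])                   # hypothetical true parameters
l = G @ x_true + rng.normal(0.0, 0.05, size=n * d1)  # flattened data vector

x_star = np.linalg.solve(G.T @ G, G.T @ l)           # normal equations
print("estimated parameters x*:", x_star)
```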

Figure 2: Illustration of multidimensional estimation problems where the output dimension is \(d_1=2\) (a), or both the output dimension \(d_1=2\) and the input dimension \(d_2=2\) (b). Data, basis functions, and the optimally fitted combination of basis functions are also displayed.

Interpretation

The previous formulations contain quadratic objective functions of the form \(\sum_{k=1}^n \|l_k-g(x,z_k)\|_2^2\) and are thus least-squares problems. Minimizing these objective functions is sensible when the model class \(g(x,z)=\sum_{j=1}^m x_jg_j(z)\) can completely explain the data except for (standard) normally distributed residuals \(\epsilon\). Then it holds that

$$ l_k=g(x,z_k)+\epsilon_k \Leftrightarrow \epsilon_k= l_k-g(x,z_k). $$

The probability densities of the residuals \(\epsilon_1, …, \epsilon_n\) are then \(p(\epsilon_k)= (2\pi)^{-1/2} \exp\left( -\frac{1}{2} \epsilon_k^2\right)\). The probability of the occurrence of all residuals \(\epsilon_1, …, \epsilon_n\) together is

$$ p(\epsilon_1, …, \epsilon_n)=\prod_{k=1}^n p(\epsilon_k) = (2\pi)^{-n/2} \exp\left( -\frac{1}{2} \sum_{k=1}^n \epsilon_k^2 \right)$$ for statistically independent residuals \(\epsilon_k \coprod \epsilon_j, k \neq j\). A choice of \(x\) such that all the residuals \(\epsilon_1= l_1-g(x,z_1), …, \epsilon_n=l_n-g(x,z_n)\) are as probable as possible leads to

$$ \max_x p(\epsilon_1, …, \epsilon_n) \Leftrightarrow \min_x \sum_{k=1}^n \left(l_k-g(x,z_k)\right)^2. $$
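Taking the negative logarithm of the joint probability makes this equivalence explicit: the logarithm turns the product into a sum and leaves only constants that do not depend on \(x\),

$$ -\log p(\epsilon_1, …, \epsilon_n) = \frac{n}{2}\log(2\pi) + \frac{1}{2} \sum_{k=1}^n \left(l_k-g(x,z_k)\right)^2, $$

so that maximizing the probability over \(x\) is the same as minimizing the sum of squared residuals.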

The previous minimization problems can therefore be written as probability-maximizing estimators for the expected value of data \(l_k=g(x,z_k)+\epsilon_k\) under the assumption of uncorrelated, normally distributed noise \(\epsilon_k\).

Alternatives to least-squares

The assumption of normality is not always realistic. The objective function and constraints may need to be adjusted to appropriately reflect the properties of the underlying data-generating processes. For example, the observed data might be binary indicator variables \(l \in \{0,1\}\) representing the failure of a component. If the failure probability needs to be estimated as a function of variables such as the component's service duration, this leads to logistic regression [1, p. 119] for the direct estimation of the cumulative probability distribution

$$p(\text{Component failure before time } t) = g(x,z)=[1+\exp(-x^Tz)]^{-1}$$

by maximizing the total probability \(\prod_{k=1}^n g(x,z_k)^{l_k}(1-g(x,z_k))^{(1-l_k)}\) of the failure observations \((l_1,z_1), … , (l_n,z_n)\). Multidimensional explanatory variables such as \(z=[\text{service duration, component price}]\) are also permissible and do not change the equations.
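As a rough sketch (using synthetic data and a general-purpose optimizer, not necessarily the approach taken in OE_logistic_regression.py), the parameters can be found by numerically minimizing the negative logarithm of this total probability:

```python
# Minimal sketch: maximum-likelihood logistic regression for binary failure
# indicators l in {0, 1}; data and "true" parameters are assumptions.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 200
z = np.column_stack([np.ones(n), rng.uniform(0.0, 10.0, n)])  # [intercept, service duration]
x_true = np.array([-4.0, 0.8])                                # hypothetical true parameters
p_fail = 1.0 / (1.0 + np.exp(-z @ x_true))
l = rng.binomial(1, p_fail)                                   # observed failure indicators

def neg_log_likelihood(x):
    # negative log of prod_k g(x,z_k)^{l_k} (1 - g(x,z_k))^{1 - l_k}
    g = 1.0 / (1.0 + np.exp(-z @ x))
    g = np.clip(g, 1e-12, 1 - 1e-12)                          # guard against log(0)
    return -np.sum(l * np.log(g) + (1 - l) * np.log(1 - g))

x_star = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS").x
print("estimated parameters x*:", x_star)
```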

Figure 3: Visualization of logistic regression. Based on observations, a cumulative probability distribution is estimated that quantifies the risk of component failure and, for example, enables analyses of average lifespans and system costs.

Applications and practical aspects

Classic least-squares parameter estimation is the workhorse of data analysis and modeling techniques. It is used wherever information needs to be extracted from data or models need to be adapted to real observations. It is easy to understand, well researched, and allows for closed-form, simple-to-implement solution formulas. It is therefore so widely used that any listing of specific applications would be grossly unrepresentative. Other parameter estimation approaches are less widespread but essential in cases involving non-normally distributed data.

From a practical standpoint, the challenge in formulating parameter estimation problems often lies in the stochastic modeling of the real-world process, which needs to be analyzed in terms of its probability distribution. Additionally, if the probability distributions are not from a specific parametric family, the probability-maximizing parameters may be difficult to find, as the associated optimization problem does not belong to one of the well-known classes such as LP, QP, SOCP, or SDP.

Code & Sources

Example code: OE_logistic_regression.py, OE_parameter_estimation_1.py, OE_parameter_estimation_2.py, OE_simulation_support_funs.py in our tutorial folder

[1] Hastie, T., Tibshirani, R., & Friedman, J. (2013). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Berlin Heidelberg: Springer Science & Business Media.