Optimal estimation: Estimation of correlation structures

Definition

Correlation structures describe the relationships between function values \(f(z_1), … , f(z_n)\) without prescribing an explicit functional form of \(f\). The most common are the so-called covariance matrices \(C\). They fully encode the correlations among all function values.

Each entry of \(C\) represents the covariance between two random variables \(f_i, f_j\).

$$ C= \begin{bmatrix} c_{11} & \cdots & c_{1n} \\ \vdots & \ddots & \vdots \\ c_{n1} & \cdots & c_{nn}\end{bmatrix} ~~~~~~ c_{ij}=E[f_if_j]-E[f_i]E[f_j]$$

Here, \(E[f_i]\) is the expected value of the random variable \(f_i\). The equation implies that \(C\) is a symmetric matrix with dimensions \(n \times n\), where \(n\) is the number of random variables.
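As a small numerical illustration (a hypothetical example, not taken from the article; the factorization via a matrix \(B\) anticipates the functional model below), the entries \(c_{ij}=E[f_if_j]-E[f_i]E[f_j]\) can be estimated from repeated samples of a random vector:

```python
import numpy as np

# Sketch: estimate a covariance matrix C from repeated observations of an
# n-dimensional random vector f. All names and sizes are illustrative.
rng = np.random.default_rng(0)
n, n_samples = 3, 10000

# Build samples with a known covariance via a factor B: C_true = B B^T
B = rng.normal(size=(n, n))
C_true = B @ B.T
samples = rng.normal(size=(n_samples, n)) @ B.T   # rows ~ N(0, C_true)

# Empirical version of c_ij = E[f_i f_j] - E[f_i] E[f_j]
C_hat = np.cov(samples, rowvar=False)

# As the text states, C is symmetric and n x n
assert np.allclose(C_hat, C_hat.T)
assert C_hat.shape == (n, n)
```

With enough samples, `C_hat` approaches `C_true`; the symmetry and positive semidefiniteness hold for any sample size.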

Figure 1: Examples of covariance matrices for 3-dimensional random vectors, for random 1D functions, random 2D functions, and random vector fields.

Problem statement

If covariance matrices are fully known, they are an essential part of the equations for optimal estimation. Often, however, this is not the case, and functions \(K(z_1, z_2) = E[f_{z_1}f_{z_2}] - E[f_{z_1}]E[f_{z_2}]\) must be estimated from sporadically available data \(f_1, …, f_n\). These covariance functions \(K(z_1, z_2) : Z \times Z \rightarrow \mathbb{R}\) are then used to generate covariance matrices of all sizes needed for optimal estimation.

Deriving the covariance function \(K\) from data \(f_1, …, f_n\) presents a challenge. Since covariance matrices must be positive semidefinite, \(K\) must satisfy

$$ C \succeq 0 ~~~~~~ C_{ij} = K(z_i, z_j) $$

regardless of the specific sequence of evaluation points \(\{z_i\}_{i=1}^n \subset Z\) [1, p. 11]. Such \(K(\cdot, \cdot)\) are called positive definite kernels. Deriving \(K(\cdot, \cdot)\) from data requires solving a statistical inference problem with constraints on the spectrum of matrices constructible from \(K(\cdot, \cdot)\). This is not a straightforward problem.
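The defining property can be checked empirically. The sketch below uses the Gaussian (squared exponential) kernel, a standard textbook example rather than one specific to this article, and verifies that the resulting matrix \(C_{ij} = K(z_i, z_j)\) is positive semidefinite for an arbitrary set of evaluation points:

```python
import numpy as np

# Sketch: a positive definite kernel yields C >= 0 for ANY evaluation points.
# The Gaussian kernel and its length scale are illustrative choices.
def K(z1, z2, length_scale=0.3):
    return np.exp(-(z1 - z2) ** 2 / (2 * length_scale ** 2))

rng = np.random.default_rng(1)
z = rng.uniform(0, 1, size=20)            # arbitrary evaluation points in Z
C = K(z[:, None], z[None, :])             # Gram matrix C_ij = K(z_i, z_j)

eigenvalues = np.linalg.eigvalsh(C)
assert np.all(eigenvalues >= -1e-8)       # C is PSD up to round-off
```

The inference problem described in the text runs in the opposite direction: instead of checking a given \(K\), one must construct a \(K\) for which this spectral property holds by design.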

Relevance

Despite its complexity, it remains a crucial problem and an essential prerequisite for truly reliable optimal estimation, as demonstrated, for example, by the equations for solving the interpolation problem. The optimal estimator \(f(\cdot)\) based on observations \(l = [f(z_1), …, f(z_n)]^T\) is:

$$ \begin{align}
f(z) & = \operatorname{argmin} \|f\|_{\mathcal{H}_K}^2 \\
& \text{s.t. } f(z_j) = l_j, \quad j = 1, …, n \\
& = \begin{bmatrix} K(z, z_1) & \cdots & K(z, z_n) \end{bmatrix} \begin{bmatrix} K(z_1, z_1) & \cdots & K(z_1, z_n) \\ \vdots & \ddots & \vdots \\ K(z_n, z_1) & \cdots & K(z_n, z_n) \end{bmatrix}^{-1} \begin{bmatrix} l_1 \\ \vdots \\ l_n \end{bmatrix} \\
& = \sum_{j=1}^n \alpha_j K(z, z_j)
\end{align}$$

Thus, it depends only on the data and the covariance function. For stochastic processes, knowledge of the covariance function is often equivalent to knowing the entire probability distribution and is thus a central component of the estimation equations. Consequently, the impact of the covariance function on the optimal estimator is significant.
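The closed-form solution above translates directly into code. The following sketch implements \(f(z) = \sum_j \alpha_j K(z, z_j)\) with \(\alpha = C^{-1} l\); the Gaussian kernel, observation points, and target function are illustrative stand-ins:

```python
import numpy as np

# Sketch of the interpolation estimator: f(z) = sum_j alpha_j K(z, z_j),
# alpha = C^{-1} l with C_ij = K(z_i, z_j). Kernel choice is illustrative.
def K(z1, z2, ell=0.2):
    return np.exp(-(z1 - z2) ** 2 / (2 * ell ** 2))

z_obs = np.array([0.1, 0.4, 0.6, 0.9])    # observation points z_1, ..., z_n
l = np.sin(2 * np.pi * z_obs)             # observed values l_j = f(z_j)

C = np.array([[K(zi, zj) for zj in z_obs] for zi in z_obs])
alpha = np.linalg.solve(C, l)             # coefficients alpha_j

def f_hat(z):
    # evaluate f(z) = [K(z, z_1), ..., K(z, z_n)] @ alpha
    return K(z, z_obs) @ alpha

# The estimator interpolates: it reproduces the observations exactly.
assert np.allclose([f_hat(z) for z in z_obs], l)
```

Between the observation points, the shape of `f_hat` is controlled entirely by the kernel, which is exactly the sensitivity to the covariance function that the text emphasizes.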

Figure 2: The choice of the covariance function determines the shape of the solution to the interpolation problem.

Functional model for covariance functions

Similar to the representation of positive semidefinite matrices \(C\) as a product \(B^TB\) of matrices \(B\), a covariance function can always be written as \(\int_Z B(z_1,z)B(z,z_2)dz\). Under some assumptions [2, pp. 158–163], this implies a decomposability of \(K(z_1,z_2)\) into

$$ K(z_1,z_2)=\sum_{i=1}^{\infty}\sum_{j=1}^{\infty} \gamma_{ij}\varphi_i(z_1)\varphi_j(z_2) $$

where \(\varphi_i(z), i=1, … \) form an orthonormal basis of functions on \(Z\) and \(\gamma \succeq 0\) is a positive semidefinite matrix that approximately follows a Wishart distribution. This model is very flexible and can generate a variety of correlation structures.
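A truncated version of this expansion is easy to realize in code. The sketch below (illustrative choices throughout: a cosine basis on \([0,1]\), a small truncation order, and a Wishart-style sample \(\gamma = GG^T\)) builds a covariance matrix on a grid and confirms it is positive semidefinite by construction:

```python
import numpy as np

# Sketch of K(z1,z2) = sum_ij gamma_ij phi_i(z1) phi_j(z2), truncated.
rng = np.random.default_rng(2)
n_basis = 5

def phi(i, z):
    # orthonormal cosine basis on [0, 1]: phi_0 = 1, phi_i = sqrt(2) cos(i pi z)
    return np.ones_like(z) if i == 0 else np.sqrt(2) * np.cos(i * np.pi * z)

# gamma >= 0, sampled Wishart-style as G G^T
G = rng.normal(size=(n_basis, n_basis))
gamma = G @ G.T

def K(z1, z2):
    Phi1 = np.stack([phi(i, z1) for i in range(n_basis)])  # (n_basis, len(z1))
    Phi2 = np.stack([phi(i, z2) for i in range(n_basis)])
    return Phi1.T @ gamma @ Phi2

z = np.linspace(0.0, 1.0, 50)
C = K(z, z)                                # covariance matrix on a grid
assert np.min(np.linalg.eigvalsh((C + C.T) / 2)) >= -1e-8
```

Because \(C = \Phi^T \gamma \Phi\) with \(\gamma \succeq 0\), every Gram matrix generated this way is automatically positive semidefinite; drawing different \(\gamma\) produces the variety of correlation structures shown in Figure 3.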

Figure 3: Two randomly generated covariance matrices and simulations of processes that exhibit such correlation structures.

Inference equations

Assuming \(K(z_1,z_2)\) is a square \(B^TB = \int_Z B(z_1,z)B(z,z_2)dz\) of normally distributed random functions \(B\), then \(\gamma\) is approximately Wishart distributed. With \(p(l|\gamma)\) being the conditional probability of observations \(l\) given the covariance function dependent on \(\gamma\), and \(p(\gamma)\) the probability of \(\gamma\), according to Bayes' theorem,

$$p(\gamma|l) = c p(l|\gamma)p(\gamma) ~~~~~~ c \text{ a constant}.$$

Here, \(p(\gamma|l)\) is the conditional probability of \(\gamma\) when \(l\) has been observed. Maximizing this probability is equivalent to minimizing \(-\log p(\gamma|l)=-\log p(l|\gamma)-\log p(\gamma)\).

The optimization problem is formulated as

$$ \begin{align} \min_{\gamma} ~~~& \log \det C_{\gamma} + \operatorname{tr} (SC_{\gamma}^+)+ r\left[ -\log \det \gamma + \operatorname{tr}(\Lambda^+\gamma)\right] \\ \text{s.t.}
~~~& A \gamma =b \\ ~~~& \gamma \succeq 0 \end{align}$$

where \(S\) is the empirical covariance matrix of the data \(l\), \(C_{\gamma}\) is the covariance matrix reconstructed using \(\gamma\), \(C_{\gamma}^+\) its pseudoinverse, and \(\Lambda\) a matrix encoding a priori assumptions about \(\gamma\). The regularization constant \(r\) balances prior knowledge and data, and \(A \gamma = b\) are linear conditions on \(\gamma\) representing definite knowledge about the underlying correlations. With information about first and second derivatives of the objective function, the following numerical scheme emerges [2, p. 194]
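One possible reading of this objective can be evaluated numerically. In the sketch below, everything concrete (`Phi`, `S`, `Lam`, `r`, the pseudodeterminant handling) is an illustrative stand-in, not the setup of [2]; \(C_\gamma\) is rebuilt from \(\gamma\) via a truncated basis evaluated at the data points:

```python
import numpy as np

# Sketch: evaluate log det C_gamma + tr(S C_gamma^+) + r[-log det gamma
# + tr(Lambda^+ gamma)] for a candidate gamma. All inputs are illustrative.
rng = np.random.default_rng(3)
n_basis, n_obs, r = 4, 6, 0.5

Phi = rng.normal(size=(n_basis, n_obs))    # basis functions at data points
S = np.cov(rng.normal(size=(n_obs, 50)))   # empirical covariance of the data
Lam = np.eye(n_basis)                      # prior assumption matrix Lambda

def objective(gamma):
    C_gamma = Phi.T @ gamma @ Phi          # covariance rebuilt from gamma
    C_pinv = np.linalg.pinv(C_gamma)
    # C_gamma may be rank-deficient, so use a pseudodeterminant:
    # sum of logs of the nonzero eigenvalues only
    ev = np.linalg.eigvalsh((C_gamma + C_gamma.T) / 2)
    logdet = np.sum(np.log(ev[ev > 1e-10]))
    prior = -np.linalg.slogdet(gamma)[1] + np.trace(np.linalg.pinv(Lam) @ gamma)
    return logdet + np.trace(S @ C_pinv) + r * prior

G = rng.normal(size=(n_basis, n_basis))
gamma0 = G @ G.T                           # a feasible candidate gamma >= 0
val = objective(gamma0)
assert np.isfinite(val)
```

The data term rewards \(\gamma\) that explain the empirical covariance \(S\), while the \(r\)-weighted term pulls \(\gamma\) toward the prior assumptions encoded in \(\Lambda\).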

$$ \begin{align} \gamma &= \gamma +\Delta \gamma - A^T(A\gamma A^T)^+A\Delta \gamma \\ \Delta \gamma & = \frac{1}{1+r} \left[ (r-1)\gamma +S_{\psi} - r\gamma \Lambda^+\gamma\right] \end{align}$$

where \(S_{\psi}\) is a transformed version of the empirical covariance matrix. When the iterations over \(\gamma\) are numerically stabilized and carried out to convergence, the result is a covariance function \(K(z_1,z_2)\) constructed using \(\gamma\). It predicts, for all possible positions \(z_1, z_2\), the associated correlations in a way that is maximally consistent with the point observations \(l\) and the prior assumptions.
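A loose sketch of such an iteration is given below. It should be read as one plausible interpretation, not the exact scheme of [2]: the constraint \(A\gamma = b\) is applied to \(\operatorname{vec}(\gamma)\), and the damping factor and eigenvalue clipping stand in for the numerical stabilization mentioned above:

```python
import numpy as np

# Loose sketch of the damped fixed-point iteration for gamma. S_psi, Lambda,
# r, the constraint A, the step size, and the clipping are all illustrative.
rng = np.random.default_rng(4)
n, r = 4, 0.5

S_psi = np.cov(rng.normal(size=(n, 100)))  # transformed empirical covariance
Lam_pinv = np.linalg.pinv(np.eye(n))       # Lambda^+
A = np.ones((1, n * n))                    # example linear condition on vec(gamma)

gamma = np.eye(n)
for _ in range(200):
    delta = ((r - 1) * gamma + S_psi - r * gamma @ Lam_pinv @ gamma) / (1 + r)
    # remove the component of the update that would violate A vec(gamma) = b
    d = delta.reshape(-1)
    d = d - A.T @ np.linalg.pinv(A @ A.T) @ (A @ d)
    gamma = gamma + 0.1 * d.reshape(n, n)  # damped step for stability
    # stabilization: clip eigenvalues to keep gamma >= 0
    ev, V = np.linalg.eigh((gamma + gamma.T) / 2)
    gamma = V @ np.diag(np.clip(ev, 1e-10, None)) @ V.T

assert np.all(np.isfinite(gamma))
assert np.min(np.linalg.eigvalsh(gamma)) >= -1e-8
```

At a fixed point with the constraint inactive, \(\Delta\gamma = 0\) gives \((1-r)\gamma + r\gamma\Lambda^+\gamma = S_\psi\), i.e. the reconstructed covariance matches the (transformed) data up to the prior-driven correction.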

Results

The optimal estimation of correlation structures using the above method is more flexible than the classic approach based on a subjective choice of parametric families of covariance functions. A variety of real-world processes cannot be represented by simple functions in terms of their correlation structure.

The stochastic modeling of such processes benefits from the flexibility of the above method. This becomes particularly evident when covariance functions derived from various methods are used within the framework of optimal function estimation.

Figure 4: The underlying true covariance function is typically unknown. Only individual measurements are accessible, along with a conjecture about, for example, the smoothness properties of the covariance function. When data and prior knowledge are combined, a covariance function is created that closely resembles the true one. Note that this covariance function leads to significantly better interpolations than a generic covariance function.

Extensions

The estimation procedure can be extended to situations where both means and covariances are unknown and the observations \(l\) fall into \(m\) different groups \((l^1_1, …, l^1_{n_1}, …, l^m_1, …, l^m_{n_m})\) with varying data collection methodologies. The corresponding optimization problem is:

$$ \begin{align} \min_{\beta, \gamma} ~~~& \sum_{i=1}^m \log \det \eta_i + \operatorname{tr} (S_{\phi}^i\eta_i^+)+ r\left[ -\log \det \gamma + \operatorname{tr}(\Lambda^+\gamma)\right] \\ \text{s.t.} ~~~& A \gamma =b \\ &F_i\gamma -\eta_i= 0 ~~~~ i=1, …, m \\ ~~~& \gamma \succeq 0 \\ & \eta_i \succeq 0 ~~~~ i=1, …, m \end{align}$$

where \(\eta_i, i=1, …, m\) are covariance matrices linked to \(\gamma\) through \(F_i\gamma-\eta_i=0\) and \(S_{\phi}^i\) are transformed empirical covariance matrices. The optimization can be performed using a projected Newton method; see the accompanying GitHub link. The extension to multiple dimensions is straightforward.

Figure 5: Exemplary application of the method for optimal estimation of 1D processes, 2D processes, and trajectories.

Practical aspects

The estimation of correlation structures is very important for generating meaningful stochastic models and is therefore a frequent topic of discussion in spatial statistics, the analysis of physical processes, the use of regression models, and machine learning.

If sufficient data is available, a stochastic model created manually from assumptions can be refined, which improves the quality of the statements derived from the model.

Code & Sources

Example code: OE_KI.py, OE_KI_support_funs.py, OE_KI_covariance_impact_1.py, OE_KI_covariance_impact_2.py, OE_KI_interpolate_brownian_bridge.py, OE_KI_interpolate_field.py, OE_KI_interpolate_function.py, OE_KI_interpolate_trajectgory.py in our tutorial folder.

[1] Berlinet, A., & Thomas-Agnan, C. (2011). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Berlin Heidelberg: Springer Science & Business Media.

[2] Butt, J. A. (2019). An RKHS approach to modelling and inference for spatiotemporal geodetic data with applications to terrestrial radar interferometry. Dissertation ETH Zürich, Zürich.