Methods: Dynamic programming

Definition

Dynamic programming (DP) addresses optimization problems where the optimization variables represent sequences of decisions. This sequentiality is used to establish recursive relationships for the value of specific states and actions of the system.

This recursion breaks down the overall problem into a sequence of interdependent subproblems that can be solved efficiently. DP is therefore both a solution approach and a class of problems. For clarity, we interpret DP more narrowly than necessary as the optimization problem

$$ \begin{align} \min_{\pi} ~~~& f(\pi) \\ f(\pi) ~~~&= \sum_{k=1}^{\text{end}} E\left[ r(s_{k+1}, s_k, \pi(s_k))\right] \end{align}$$

where the optimization variable is \(\pi\) [1, p. 23]. The peculiarity lies in the interpretation of \(f(\pi)\) as the expected reward \(E[r]\) in response to actions \(\pi(s_k)\) that depend on the system states \(s_k\). The goal is to find a policy \(\pi\) that maps system states to control inputs that optimally steer the system on average. The quality of a solution is measured by the cumulative amount of reward \(r\) collected.

Example

Imagine, for example, a continuous inventory planning scenario where cost-minimizing decisions need to be made based on current stock levels and sales opportunities that become known over time. Inventory shortages lead to loss of profit, while excess inventory incurs storage costs.

Figure 1: Illustration of a simple dynamic inventory planning problem. A warehouse has stocks that can be sold profitably according to demand. At the end of each week, an assessment is made and a decision is taken on the quantities of products to order. Additionally, demand changes stochastically.

In the above example, it is clear that a constant reorder rate is suboptimal and that decisions on order quantities depend on current inventory levels, demand, and expected changes in both parameters. Formulating the entire matter as a single function \(f\) seems difficult, and indeed, the situation is best represented as a Markov Decision Process (MDP).

MDP

A Markov Decision Process (MDP) is a controllable stochastic process. It consists of a 4-tuple \(\{S, A, P, R\}\) which defines the dynamics of the process.

  • \(S\) is a set of states \(s\).
  • \(A\) is a set of actions \(a\).
  • \(P(s'|s, a)\) are the transition probabilities from \(s\) to \(s'\) when action \(a\) is chosen.
  • \(R(s', s, a)\) is the reward for transitioning from \(s\) to \(s'\) via action \(a\).

Thus, it involves a set of states \(s\) of a system that evolves under the influence of actions \(a\). The subsequent state \(s'\) depends on \(s\), \(a\), and randomness, while the reward depends on \(s'\), \(s\), and \(a\). The goal is to maximize cumulative rewards. To achieve this, the optimal action for each state must be determined. The function \(\pi: s \mapsto a\) is also called a policy.
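As a minimal sketch of the 4-tuple \(\{S, A, P, R\}\), the code below encodes a toy MDP with two states and two actions as NumPy arrays. The numbers are illustrative assumptions and are unrelated to the inventory example.

```python
import numpy as np

# Toy MDP with |S| = 2 states and |A| = 2 actions (illustrative values).
# P[a, s, s'] = probability of moving from s to s' under action a.
P = np.array([
    [[0.9, 0.1],   # action 0, from state 0
     [0.2, 0.8]],  # action 0, from state 1
    [[0.5, 0.5],   # action 1, from state 0
     [0.6, 0.4]],  # action 1, from state 1
])

# R[a, s, s'] = reward for the transition s -> s' under action a.
R = np.array([
    [[1.0, 0.0],
     [0.0, 2.0]],
    [[0.5, 0.5],
     [1.5, 0.0]],
])

# A deterministic policy maps each state to an action, pi: s -> a.
pi = np.array([0, 1])

# Expected one-step reward of following pi in each state.
expected_r = np.array([P[pi[s], s] @ R[pi[s], s] for s in range(2)])
print(expected_r)
```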

Application example

In the context of the inventory planning example, \(S=\mathbb{R}^2\) because the state is characterized by two numbers: inventory and demand. Furthermore, \(A=\mathbb{R}\), \(R(s', s, a)\) represents the balance between sales profit and storage costs, and the transition probabilities reflect \(\text{Inventory}'=\text{Inventory}-\text{Demand}+a\) along with a random shift in demand.
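A brief sketch of these transition dynamics is given below. The profit per sale, the storage cost, and the distribution of the demand shift are assumptions chosen for illustration, and the stock is additionally clamped so that no more can be sold than is available.

```python
import numpy as np

rng = np.random.default_rng(0)

def inventory_step(inventory, demand, a, profit=2.0, storage_cost=1.0):
    """Sample one transition (B, N) -> (B', N') of the inventory MDP."""
    sold = min(inventory, demand)                          # cannot sell more than is in stock
    next_inventory = inventory - sold + a                  # Inventory' = Inventory - Demand + a
    next_demand = max(0, demand + rng.integers(-1, 2))     # random shift in demand
    reward = profit * sold - storage_cost * next_inventory # sales profit minus storage costs
    return (next_inventory, next_demand), reward

state = (2, 1)                         # (B, N), as in Figure 2
next_state, r = inventory_step(*state, a=0)
print(next_state, r)
```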

Figure 2: Illustration of the MDP formalizing the inventory planning example. Note that from \(B=2, N=1\), with \(a=0\), the subsequent state is \(B=1, N\in \{0,1,2,3\}\). The reward consists of a fixed profit for each product sold and a storage cost for each product stored.

The result of DP is the policy \(\pi:s\mapsto a\), which maps states \(s=(B, N)\) to product reorder actions \(a\). In this case, it is a scalar field with an input dimension of 2 (the states) and an output dimension of 1 (the actions).

Figure 3: The optimal policy for the inventory planning example (a). The optimal product reorder quantities primarily depend on the difference \(B-N\). However, this changes in more complex models with non-trivial transition probabilities (b).

A policy generated in this way can now be used in practice to determine the most promising actions for each configuration of inventory and demand. Thus, a policy is more than just a rigid plan: it is an adaptive set of measures.

Applications

DP, especially in the context of MDPs, is primarily used for optimal system control. Even in its simplest form, the formalism is capable of handling nonlinear rewards, discrete control signals, and complex transition probabilities, and of converting them into optimal policies. The unique aspect of policies is that they are flexible action plans that propose sequences of actions for all possible states, thus allowing an appropriate and effortless response to unforeseen events.

Plans obtained with other methods, such as production planning with LP, provide only a single sequence of optimal decisions starting from an initial state and offer less flexibility and adaptivity. Therefore, DP is often used in operational areas where decisions need to be made quickly and data for probabilistic modeling is available. This is the case, for example, with dynamic control of traffic lights, ambulance dispatching, communication networks, supply chains, and water management processes, and extends to predictive scheduling in medicine, occupancy planning in call centers, and the development of fishing policies. Details on these applications can be found in [2].

Solution

In simple cases, characterized by a finite number of states and a manageable number of discrete actions, MDPs can be solved through iterative methods. To maximize the term

$$E_{\pi}\left[ \sum_{k=1}^{\text{end}} r(s_{k+1}, s_k, \pi(s_k))\right]$$

representing the expected total reward, a function \(V_{\pi}(s)\) is typically introduced. This so-called value function measures the expected total reward when starting from state \(s\) and following the policy \(\pi\). For arbitrary states \(s\) and \(s'\), \(V_{\pi}(s)\) and \(V_{\pi}(s')\) are functionally related. Alternating iteration to determine \(V_{\pi}\) and to optimize \(\pi\) is called value iteration [1, p. 37] and leads to the optimal policy \(\pi^*\).
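To make this functional relationship explicit, one common form of the recursion (the Bellman equation) is shown below; the discount factor \(\gamma\) is an assumption added here for the standard infinite-horizon formulation and does not appear in the finite-horizon sum above:

$$V_{\pi}(s) = \sum_{s'} P(s'|s, \pi(s)) \left[ R(s', s, \pi(s)) + \gamma V_{\pi}(s') \right]$$

Value iteration then replaces the evaluation of a fixed policy with a maximization over actions in each sweep:

$$V(s) \leftarrow \max_{a} \sum_{s'} P(s'|s, a) \left[ R(s', s, a) + \gamma V(s') \right]$$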

Figure 4: The iteration progressively refines the estimates for the value function and the policy, deriving the value function \(V_{\pi^*}\) of the optimal policy \(\pi^*\) (a). In (b) and (c), the functional relationship between \(V_{\pi}(s)\) and \(V_{\pi}(s')\) is symbolized. Following policy \(\pi\) in state \(s\) leads to states \(s_1', s_2', ...\), and corresponding rewards \(r_1, r_2, ...\) with probabilities \(p_1, p_2, ...\). This implies functional relationships on the value function. Illustrations are based on [3, p. 87].
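As a minimal sketch of this alternating update, the following function implements value iteration for a finite MDP with transition array P[a, s, s'] and reward array R[a, s, s']. The discount factor, tolerance, and the small random test problem are assumptions for illustration; this is not the code from Methods_mdp.py.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Value iteration for a finite MDP with P[a, s, s'] and R[a, s, s']."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, s] = expected immediate reward plus discounted value of the successor state.
        Q = np.einsum("ast,ast->as", P, R) + gamma * (P @ V)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmax(axis=0)   # value function and greedy policy pi*(s)

# Small random example (3 states, 2 actions) just to exercise the function.
rng = np.random.default_rng(1)
P = rng.random((2, 3, 3))
P /= P.sum(axis=2, keepdims=True)   # each row must be a probability distribution
R = rng.random((2, 3, 3))
V, pi_star = value_iteration(P, R)
print(V, pi_star)
```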

Dynamic Programming (DP) for Markov Decision Processes (MDPs) in infinite-dimensional spaces is also possible and is necessary when more than a finite number of potential states or actions must be considered. This involves solving sequences of linear programs [1, p. 377]. If states are only partially observable or if the model of state transitions is entirely unknown, formulations as Partially Observable Markov Decision Processes (POMDPs) or as Reinforcement Learning problems are possible.

Practical aspects

Dynamic Programming (DP) and Markov Decision Processes (MDPs) are effectively solvable with open-source algorithms such as those provided in the pymdp Toolbox for Python. However, these algorithms are less extensively tested and maintained compared to those available for LP, QP, etc. Solving optimal control problems via DP and MDP has a more experimental character. Nonetheless, it is often the only approach that adequately addresses the complexity of real-world problems [2]. Both modeling the problem, by formalizing state spaces and transition matrices, and the effective solution of MDPs are challenging.
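As a usage sketch, and assuming the toolbox referred to is the pymdptoolbox package (imported as mdptoolbox), the following follows that package's documented quick-start example with its built-in forest management problem and a discount factor of 0.9; if a different library is meant, the names below will differ.

```python
import mdptoolbox.example

# Built-in forest management example MDP: transition and reward arrays.
P, R = mdptoolbox.example.forest()

# Solve the MDP by value iteration with a discount factor of 0.9.
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
vi.run()

print(vi.policy)   # optimal action for each state
print(vi.V)        # value function of the optimal policy
```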

Code & Sources

Example code: Methods_mdp.py in our tutorial folder

[1] Feinberg, E. A., & Shwartz, A. (2012). Handbook of Markov Decision Processes: Methods and Applications. Berlin, Heidelberg: Springer Science & Business Media.

[2] Boucherie, R. J., & Dijk, N. M. (2017). Markov Decision Processes in Practice. Heidelberg, New York, Dordrecht, London: Springer International Publishing.

[3] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. Cambridge: MIT Press.