Optimal control: Reinforcement learning

Definition

Reinforcement learning (RL) is a method for optimal control under very challenging conditions. The state changes are random, the transition probabilities are unknown, and the function to be optimized itself has no closed form. Despite these challenges, an optimal policy is to be found.

A policy is understood to be a plan that specifies, for every possible state, the control signal that leads to the best results in the long run. The range of tasks that can be formulated in this way is so general that everything from machine control and adaptive planning of search-and-rescue missions to the development of independently learning chess algorithms that play at world-champion level can be considered an RL problem.

Example

An example of an optimal control problem that can be instructively solved with RL is the navigation of a maze from the starting point \(A\) to the target point \(B\). The layout of the maze is unknown but locally queryable by the algorithm navigating the maze. Since the algorithm can actively interact with its environment to gain information, it is also referred to as an agent. Describing the above problem as an LP, QP, …, SDP would be enormously difficult, if it is possible at all.

Figure 1: The complete problem model (a) of maze navigation is not accessible to the RL agent. It can only perform different actions at different positions and receives immediate rewards for the action (b). The optimal policy \(\pi^*\) is a vector field that indicates the best action for each state (c).

Exemplary analysis

The solution to the maze problem is a policy \(\pi^*:S \ni s \mapsto \pi^*(s) \in A\), which accepts a position \(s \in S\) as input and generates an action \(a \in A\) as output. This solution proposes not just a rigid plan of positions \(s_A, s_1, …, s_B\), but solves the planning problem for all possible starting positions simultaneously. If the agent is placed at any position \(s_0\) and has to find the shortest way to \(B\) from there, it succeeds by following the policy
$$s_0\stackrel{\pi^*(s_0)=a_0}{\rightarrow} s_1\stackrel{\pi^*(s_1)=a_1}{\rightarrow} …\stackrel{\pi^*(s_n)=a_n}{\rightarrow}s_B .$$
The same applies if the position changes are influenced by chance, so that the exact subsequent position does not follow deterministically from the past positions and the chosen action.

Figure 2: The policy \(\pi^*\) indicates the shortest path from \(A\) to \(B\) (a). Even for other starting positions and in the presence of random effects, it fulfills this role (b), (c). Each pair of position \(s_0\) and action \(a\) can be assigned the still achievable shortest path length, and this function has a compositional property: the value of \((s_0,a)\) is the path length from \(s_0\) to \(s(s_0,a)\) plus the shortest path length from \(s(s_0,a)\) to \(B\) (d).

Formulation of RL tasks

RL problems are formalized similarly to Markov decision processes. As input for RL algorithms, a computer model is needed that (see the sketch below this list):

  • Defines the state space \(S\) and the action space \(A\)
  • Generates a (stochastic) reward \(r(s,a)\) for each pair \((s,a)\in S\times A\)
  • Generates a (stochastic) successor state \(s'(s,a)\) for each pair \((s,a)\in S \times A\).
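
For the maze example, such a computer model can be sketched in a few lines. The class below is an illustrative assumption (the grid encoding, slip probability, and step penalty are not taken from the article's example code); it merely demonstrates the three requirements: it defines \(S\) and \(A\) and generates stochastic rewards and successor states.

```python
# Minimal sketch of a computer model for the maze example (illustrative assumptions).
import random

class MazeModel:
    """Defines S and A and generates rewards r(s, a) and successor states s'(s, a)."""
    ACTIONS = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}  # A: right, left, down, up

    def __init__(self, width, height, walls, goal, slip=0.1):
        self.S = [(i, j) for i in range(height) for j in range(width)]  # state space
        self.A = list(self.ACTIONS)                                     # action space
        self.walls, self.goal, self.slip = set(walls), goal, slip

    def step(self, s, a):
        """Return (successor state, reward) for the pair (s, a)."""
        if random.random() < self.slip:                 # stochastic transition
            a = random.choice(self.A)
        di, dj = self.ACTIONS[a]
        s_next = (s[0] + di, s[1] + dj)
        if s_next in self.walls or s_next not in self.S:
            s_next = s                                  # blocked move: stay in place
        reward = 0.0 if s_next == self.goal else -1.0   # -1 per step rewards short paths
        return s_next, reward
```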

These requirements are somewhat weaker and practically easier to manage than for Markov decision processes. Instead of a tabular, comprehensive capture of the rewards \(r(s,a)\) and transition probabilities \(p(s'|s,a)\) between states, RL merely requires examples. The goal is to determine an optimal policy \(\pi^*\), which maximizes the expected value of cumulative rewards.

$$ \begin{align} \max_{\pi} ~~~&E_{\pi}\left[ \sum_{k=1}^{\text{End}} r(s_k,a_k)\right] \\ \text{s.t.} & a_k=\pi(s_k)\end{align}$$

Here, \(E_{\pi} \left[\sum_{k=1}^{\text{End}}r(s_k,a_k)\right]=\eta(\pi)\) denotes the expected value of cumulative rewards when the policy \(\pi\) is followed.
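
For a fixed policy, \(\eta(\pi)\) can be estimated by averaging cumulative rewards over simulated episodes. The sketch below assumes the MazeModel from the previous listing and a policy given as a plain Python function; it is only meant to make the expectation concrete.

```python
# Sketch: Monte Carlo estimate of eta(pi) = E_pi[sum_k r(s_k, a_k)] for a fixed policy,
# using the illustrative MazeModel from above.
def estimate_eta(model, policy, start, n_episodes=1000, max_steps=200):
    total = 0.0
    for _ in range(n_episodes):
        s, ret = start, 0.0
        for _ in range(max_steps):
            a = policy(s)                 # a_k = pi(s_k)
            s, r = model.step(s, a)       # stochastic transition and reward
            ret += r
            if s == model.goal:           # episode ends at the target
                break
        total += ret
    return total / n_episodes             # estimate of eta(pi)
```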

Solution

There are various approaches to determining \(\pi^*\). What they have in common is that the agent collects data through interaction with the computer model, and the experiences about states, actions, state changes, and rewards are fed into a feedback loop to determine \(\pi^*\).

Figure 3: Feedback scheme for generating \(\pi^*\). A policy \(\pi\) is used to select the action \(a\). This induces state changes and rewards, which are used to adjust \(\pi\).

The function \(\pi:S\rightarrow A\) is often constructed as a neural network \(\pi_{\theta}\) for reasons of flexibility. It can then be adjusted based on the data by, for example, forming the derivatives of \(\eta(\pi)\) with respect to the parameters \(\theta\) of \(\pi\) [1, p. 367].

$$\begin{align} \eta(\pi) & =\int_{(S\times A)^n} \sum_{k=0}^n r(s_k,a_k)\, p_{\theta}(\xi)\, d\xi \\ \nabla_{\theta} \eta &= \int_{(S\times A)^n}\sum_{k=0}^n r(s_k,a_k)\, \nabla_{\theta}\left[ \log p_{\theta}(\xi)\right] p_{\theta}(\xi)\, d\xi \\ \hat{\nabla}_{\theta} \eta & = \frac{1}{m}\sum_{j=1}^m \left( \sum_{k=0}^n r(s_k^j,a_k^j)\right)\nabla_{\theta}\left[ \log p_{\theta}(\xi^j)\right] \\ & =\frac{1}{m} \sum_{j=1}^m \left( \sum_{k=0}^n r(s_k^j,a_k^j)\right)\sum_{k=0}^n\nabla_{\theta}\log \pi_{\theta}(s_k^j,a_k^j)\end{align}$$

Here, \(p_{\theta}(\xi)\) is the probability of occurrence of the sequence \(\xi=[(s_0,a_0), …, (s_n,a_n)]\), and \(\hat{\nabla}_{\theta} \eta\) is a data-driven estimator for the gradient of cumulative reward. Variants of these equations are used, for example, in the TD3 algorithm [2] and are implemented in the Python library Stable Baselines3 [3].
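
The estimator \(\hat{\nabla}_{\theta}\eta\) can be turned directly into a gradient-ascent step for a small stochastic policy network. The sketch below uses PyTorch for the differentiation and assumes a toy discrete environment whose `reset()` returns an integer state index and whose `step(a)` returns the next state, a reward, and a done flag; all names and sizes are illustrative and not taken from the article's example code.

```python
# Sketch of the score-function (policy gradient) estimator hat_nabla_theta eta,
# under the assumed toy-environment interface described above.
import torch

n_states, n_actions = 16, 4                    # assumed sizes of S and A
policy = torch.nn.Sequential(                  # pi_theta: one-hot state -> action logits
    torch.nn.Linear(n_states, 32), torch.nn.Tanh(), torch.nn.Linear(32, n_actions)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def one_hot(s):
    x = torch.zeros(n_states)                  # s is an integer state index (assumption)
    x[s] = 1.0
    return x

def rollout(env, max_steps=50):
    """Run one episode; return its cumulative reward and sum_k log pi_theta(a_k | s_k)."""
    s, log_probs, total_r = env.reset(), [], 0.0
    for _ in range(max_steps):
        dist = torch.distributions.Categorical(logits=policy(one_hot(s)))
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        s, r, done = env.step(a.item())
        total_r += r
        if done:
            break
    return total_r, torch.stack(log_probs).sum()

def gradient_step(env, m=20):
    """Ascend eta using the estimator: average of (cumulative reward) * grad log p_theta."""
    optimizer.zero_grad()
    loss = 0.0
    for _ in range(m):
        total_r, sum_log_prob = rollout(env)
        loss = loss - total_r * sum_log_prob / m   # minimizing -eta maximizes eta
    loss.backward()                                # autograd supplies nabla_theta log pi_theta
    optimizer.step()
```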

Example: Landing control

The landing process of an object is to be controlled so that it lands at the designated coordinate \(0\). The landing process follows the usual differential equations that couple position, velocity, and acceleration, but this is not known to the algorithm. Upon touching the ground after time \(T\), a reward of \(-|x(T)|\) is dispensed, which penalizes the deviation between the actual and the designated landing point and thus defines it as a quantity to be minimized.
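
A hedged sketch of how such a landing task could be set up and trained with Stable Baselines3 [3] is shown below. The environment (simple double-integrator dynamics, a constant descent rate, and a random initial offset) is an assumption for illustration rather than the article's exact model; only the terminal reward \(-|x(T)|\) and the order of 10,000 training steps are taken from the text.

```python
# Illustrative landing environment (assumed dynamics) trained with TD3 from Stable Baselines3.
import numpy as np
import gymnasium as gym
from stable_baselines3 import TD3

class LandingEnv(gym.Env):
    """Observation [x, v, h]; action: horizontal acceleration; reward -|x(T)| at touchdown."""
    def __init__(self, dt=0.1):
        self.dt = dt
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32)
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.x, self.v, self.h = self.np_random.uniform(-5.0, 5.0), 0.0, 10.0
        return np.array([self.x, self.v, self.h], dtype=np.float32), {}

    def step(self, action):
        self.v += float(action[0]) * self.dt           # horizontal acceleration
        self.x += self.v * self.dt
        self.h -= 1.0 * self.dt                        # constant descent rate
        terminated = self.h <= 0.0                     # touchdown at time T
        reward = -abs(self.x) if terminated else 0.0   # penalize |x(T)| only at the end
        obs = np.array([self.x, self.v, self.h], dtype=np.float32)
        return obs, reward, terminated, False, {}

model = TD3("MlpPolicy", LandingEnv(), verbose=0)
model.learn(total_timesteps=10_000)                    # cf. the 10,000 training steps in Figure 4
```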

Figure 4: Problem sketch (a), control behavior of an untrained algorithm (b), and the final control behavior of the optimal policy after 10,000 training steps (c).

Example: Adaptive control of measurement density

A sensor is to measure the temporal motion behavior of an object in such a way that it can be reconstructed as accurately as possible. However, each measurement costs money; therefore, the measurement times must be chosen economically. The computer model of this problem includes randomly generated functions, the option for a motion measurement at any point in time, and, at the end of the trial run, a consideration of measurement costs and reconstruction error as a performance measure.
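
The performance measure at the end of a trial run could be computed roughly as sketched below; the interpolation-based reconstruction and the cost constant are assumptions made for illustration, not details from the article's code.

```python
# Sketch of the episode-end performance measure: measurement costs plus reconstruction error.
import numpy as np

def episode_reward(t_grid, x_true, t_measured, x_measured, cost_per_measurement=0.1):
    """Return the (negative) sum of measurement costs and reconstruction error."""
    # Reconstruct the motion by interpolating between the (sorted) measurement times.
    x_reconstructed = np.interp(t_grid, t_measured, x_measured)
    reconstruction_error = np.mean((x_true - x_reconstructed) ** 2)
    return -(cost_per_measurement * len(t_measured) + reconstruction_error)
```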

Figure 5: Illustrative problem sketch (a) and optimal policies for triggering measurements in different cost situations and for various motion patterns (b), (c).

Example: Adaptive search policy

On a property contaminated with pollutants, the location and severity of the maximum contamination are to be determined. To do this, 9 pollutant measurements can be conducted at freely chosen locations, and the algorithm should adaptively decide on the location of the next measurement based on previous measurements. The training data consist of simulations that are presumed to have a correlation structure similar to the pollutant distribution in the real world.
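
Simulated training fields with such a correlation structure can be generated, for example, by sampling from a Gaussian-process prior; the squared-exponential kernel and length scale below are assumed choices for illustration.

```python
# Sketch: random pollution fields with a prescribed correlation structure (GP samples).
import numpy as np

def sample_pollution_field(grid, length_scale=0.2, rng=None):
    """Draw one random field on `grid` (an n x 2 array of locations) from a GP prior."""
    rng = rng or np.random.default_rng()
    d2 = np.sum((grid[:, None, :] - grid[None, :, :]) ** 2, axis=-1)   # squared distances
    K = np.exp(-0.5 * d2 / length_scale**2)                            # correlation matrix
    return rng.multivariate_normal(np.zeros(len(grid)), K + 1e-8 * np.eye(len(grid)))
```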

Figure 6: Illustration of the search problem. Searching for the pollutant maximum using a grid-like arrangement of measurement points (a) is far less successful than using an adaptive policy (b).

Practical aspects

Reinforcement learning can be used for almost anything, including problems of a high-dimensional nature with high demands on the algorithm's learning ability. However, this flexibility also entails significant disadvantages. RL is enormously data-hungry, has stability issues, and places high modeling requirements on the user. The user must design state and action spaces in such a way that the dynamics have Markov properties and must balance flexibility and regularity through the architecture of the involved artificial neural networks.

If a problem can be formulated as an established convex optimization task, RL should be avoided. For many other problems, despite its pitfalls, RL is the only viable approach. Since it is a method that is still actively and intensively researched, significant advances can be expected in the future.

Code & Sources

Example code: OC_RL_class_def_2D_env.py, OC_RL_def_2D.py, OE_RL_control_trajectory.py, OC_RL_measurement_decisions.py, OC_RL_support_funs.py, OC_RL_trained_benchmark_def_2D.zip in our tutorial folder.

[1] Wiering, M., & Otterlo, M. v. (2012). Reinforcement Learning: State-of-the-Art. Berlin, Heidelberg: Springer Science & Business Media.

[2] Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477.

[3] Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., & Dormann, N. (2021). Stable-Baselines3: Reliable Reinforcement Learning Implementations. Journal of Machine Learning Research, 22(268), 1-8.