Path integral control theory
Overview and history
Control theory is a theory from engineering that gives a formal description of how a system, such as a robot or animal, can
move from a current state to a future state at minimal cost, where cost can mean time spent, energy spent, or any other
quantity. Control theory is traditionally used to control industrial plants, airplanes or missiles, but it is also the natural
framework for modeling intelligent behavior in animals or robots. The mathematical formulation of deterministic control theory
is very similar to that of classical mechanics; in fact, classical mechanics can be viewed as a special case of control theory.
Stochastic control theory uses the language of stochastic differential equations.
The optimal control is usually computed from the Bellman equation, which is a partial differential equation. Solving this equation for high-dimensional systems
is difficult in general, except in special cases, most notably linear dynamics with quadratic control cost or the noiseless deterministic case. Therefore, despite its elegance
and generality, stochastic optimal control (SOC) has not been used much in practice.
In (Fleming and Mitter 1982) it was observed that posterior inference in a certain class of diffusion processes can
be mapped onto a stochastic optimal control problem. These so-called
path integral (PI) control problems (Kappen 2005, Todorov 2006) represent a restricted class of non-linear control problems with
arbitrary dynamics and state cost, but with a linear dependence of the dynamics on the control and a quadratic control cost.
For this class of control problems, the Bellman equation can be transformed into a linear partial differential equation.
The solution for both the optimal control and the optimal cost-to-go can be expressed in closed form as a Feynman-Kac path integral.
The path integral involves an expectation value with respect to a dynamical system. As a result, the optimal control can be
estimated using Monte Carlo sampling or deterministic approximation methods.
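As a concrete illustration, the naive Monte Carlo estimator can be sketched for a one-dimensional toy problem; the specific dynamics, costs, and all names below are our own illustrative choices, not code from the cited papers. The optimal control is estimated as the weighted average of the initial noise over uncontrolled rollouts:

```python
import numpy as np

# Toy problem: dx = u dt + sigma dW on [0, T], control cost u^2/2 per unit
# time, end cost (x_T - 1)^2. For this class, the optimal control at (x0, t0)
# is a Feynman-Kac average over *uncontrolled* rollouts, weighted by
# exp(-cost / lambda) with lambda = sigma^2 (for unit control cost).

rng = np.random.default_rng(0)
sigma, dt, T, n_samples = 0.5, 0.01, 1.0, 20000
lam = sigma**2

def end_cost(x):
    return (x - 1.0)**2

def pi_control(x0, t0):
    """Naive Monte Carlo estimate of the optimal control u*(x0, t0)."""
    n_steps = int(round((T - t0) / dt))
    x = np.full(n_samples, x0)
    first_noise = np.zeros(n_samples)
    for k in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), n_samples)
        if k == 0:
            first_noise = dW
        x = x + sigma * dW            # uncontrolled dynamics (u = 0)
    w = np.exp(-end_cost(x) / lam)    # Feynman-Kac path weights
    w /= w.sum()
    # u* dt equals the weighted average of the first noise increment sigma*dW
    return sigma * np.sum(w * first_noise) / dt

u = pi_control(x0=0.0, t0=0.0)
print(u)   # positive: the estimated control pushes the state toward x = 1
```

For this linear-quadratic instance the exact answer is also available in closed form, which makes the toy useful for checking the estimator; for generic non-linear dynamics and state costs the same sampler applies unchanged, which is the appeal of the method.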
The path integral control method provides a deep link between control, inference and statistical
physics. This statistical physics view of control theory shows that qualitatively different control solutions exist for
different noise levels, separated by phase transitions.
See (Todorov 2009, Kappen 2011, Kappen et al. 2012) for reviews and further references.
Equivalence between optimal control and optimal sampling
The Monte Carlo sampling method
can be made more efficient using importance sampling.
For the path integral control problem, importance sampling takes the form of
a control signal that steers the particle to regions of low cost.
One can use an arbitrary controller, provided that the trajectories are reweighted appropriately.
Different importance samplers, using different controllers, all yield unbiased estimates of the optimal control, but differ in their efficiency.
One can show an intimate relation between optimal importance sampling and optimal
control: the optimal control solution is itself the optimal sampler, and better samplers
(in terms of effective sample size) are better controllers (in terms of control cost) (Thijssen et al. 2015).
Mathematically, this idea is described by the Girsanov change of measure.
This allows the design of an iterative procedure, where in iteration i+1 one generates samples using the controller that
was estimated in iteration i. In this way the efficiency of the sampling increases in each iteration. The controller that
is estimated in iteration i is a function that maps states to actions. In general this is a complex, infinite-dimensional
object, and estimating it would require an infinite number of samples.
For a practical implementation, an important question is therefore how to represent this control function.
In (Kappen, Ruiz 2015) we show how to learn a control function that is parametrized by a finite number of parameters, using the
so-called cross-entropy method. The controller is learned on the basis of self-generated data using a gradient method.
The gradients involve statistics that are estimated from self-generated samples using importance sampling.
The samples generated in the first importance sampling iteration provide very poor statistics for learning,
because the initial importance sampler is poor.
With these samples, only a simple control function can be learned. However, the data generated with
this simple control function yield much better statistics, so that a more complex control function can be learned. In subsequent
iterations, better and more complex controllers are learned, which generate increasingly efficient statistics that in turn allow even
better controllers to be learned. In the limit, the method may estimate the optimal control solution, provided that this solution can be expressed
by the parametrized model.
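A minimal sketch of such an iteration on a one-dimensional toy problem (dx = u dt + sigma dW, end cost (x_T - 1)^2, control cost u^2/2), with a deliberately simple linear feedback parametrization u(x) = theta[0] + theta[1]*x. The setup, parametrization and names are our illustrative assumptions, not the actual implementation of (Kappen, Ruiz 2015). Each iteration samples with the current controller, reweights the trajectories with Girsanov-corrected path weights, and refits the parameters by weighted least squares:

```python
import numpy as np

# Adaptive importance sampling: sample with the current controller,
# reweight, refit. Toy problem: dx = u dt + sigma dW, end cost (x_T - 1)^2,
# control cost u^2/2, lambda = sigma^2. Controller: u(x) = theta[0] + theta[1]*x.

rng = np.random.default_rng(1)
sigma, dt, T, n = 0.5, 0.01, 1.0, 2000
lam = sigma**2
n_steps = int(round(T / dt))
theta = np.zeros(2)              # iteration 0 samples with u = 0
ess_hist = []

def end_cost(x):
    return (x - 1.0)**2

for it in range(5):
    x = np.zeros(n)
    S = np.zeros(n)              # accumulated path cost in the exponent
    feats, targets = [], []
    for k in range(n_steps):
        u = theta[0] + theta[1] * x
        dW = rng.normal(0.0, np.sqrt(dt), n)
        # Girsanov correction for sampling under the controller u
        S += 0.5 * u**2 * dt + u * sigma * dW
        feats.append(x.copy())
        targets.append(u + sigma * dW / dt)   # noisy per-step estimate of u*
        x = x + u * dt + sigma * dW
    S += end_cost(x)
    w = np.exp(-(S - S.min()) / lam)
    w /= w.sum()
    ess_hist.append(1.0 / np.sum(w**2))       # effective sample size
    # Weighted least squares fit of theta to the reweighted control targets
    X = np.stack([np.ones(n * n_steps), np.concatenate(feats)], axis=1)
    y = np.concatenate(targets)
    wl = np.tile(w, n_steps)
    theta = np.linalg.solve(X.T @ (X * wl[:, None]), X.T @ (y * wl))
    print(f"iteration {it}: theta = {theta}, ESS = {ess_hist[-1]:.0f}")
```

The effective sample size printed each iteration should increase as the learned controller approaches the optimal one, which is the sense in which better controllers are better samplers.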
Control and particle smoothing
The adaptive importance sampling procedure and the equivalence between optimal control and optimal sampling can be used in both directions.
For stochastic time-series problems, one can derive an equivalent control formulation and compute approximations to the optimal control
for use as an importance sampler.
Particle smoothing methods are used for inference in stochastic time series
based on noisy observations.
We demonstrate how the smoothing problem can be mapped onto a path integral control problem.
Subsequently, we use an adaptive importance sampling method to drastically improve the effective sample size of
the posterior and the reliability of the estimates of the marginal
smoothing distributions. The method has linear computational
complexity in the number of particles and yields a feedback controller
that makes it possible to sample efficiently from the joint smoothing
distribution.
We show that the proposed method
gives more reliable estimates than state-of-the-art forward filtering backward smoothing methods, in significantly less
time. See (Ruiz et al. 2015).
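The effective sample size referred to above is computed directly from the normalized importance weights; a minimal self-contained sketch (the function name is ours):

```python
import numpy as np

# Effective sample size (ESS) of a set of importance weights, given as
# log-weights: ESS = 1 / sum(w_i^2) with normalized weights w. It ranges
# from 1 (one particle carries all the weight) to N (uniform weights).

def ess(log_w):
    log_w = log_w - log_w.max()   # stabilize before exponentiating
    w = np.exp(log_w)
    w /= w.sum()
    return 1.0 / np.sum(w**2)

rng = np.random.default_rng(0)
print(ess(rng.normal(0.0, 5.0, 1000)))   # poor sampler: widely spread log-weights
print(ess(rng.normal(0.0, 0.1, 1000)))   # good sampler: ESS close to 1000
```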
Quadrotors
In collaboration with UCL, we develop path integral control methods for
quadrotors.
We illustrate the approach in the video below. The video consists of three parts: the control of a holding pattern and the cat-and-mouse scenario in simulation, and real flight with 2 and 3 quadrotors.
- The first simulation addresses the problem of coordinating agents to hold their position near a
point of interest while keeping a safe range of velocities and avoiding crashing into each other.
Such a problem arises for instance when multiple aircraft need to land at the same location, and simultaneous
landing is not possible.
The optimal solution for this problem is a circular flying pattern where units fly equidistantly from each other.
There is an initial transient period during which the agents organize to reach the optimal configuration.
Remarkably, the coordinated circular pattern emerges regardless of the initial positions.
The same flight pattern has been used frequently in the literature; we show for the first time how it can emerge spontaneously as the optimal solution of an SOC problem.
- The second simulation that we consider is the cat-and-mouse scenario.
In this task, a team of quadrotors (the cats) has to catch (get close to) another quadrotor (the mouse).
The cats are controlled, but the mouse has autonomous dynamics: it tries to escape the cats by moving at a velocity inversely proportional to its distance
to the cats. This scenario leads to several interesting dynamical states.
For example, with a large number of cats, the mouse always gets caught.
The optimal control for the cats consists of surrounding the mouse to prevent collision. Once the mouse is surrounded, the cats keep rotating around it, as in the
previous scenario, but with the origin replaced by the mouse position.
For too short a horizon time, the dynamical state in which the cats rotate around the mouse is not
stable, and the mouse escapes.
We emphasize that these different behaviors are observed even under large uncertainty in the form of sensor noise and wind.
- The last part of the video shows the result of the same control method on 2 and 3 real quadrotors coordinated in the holding pattern.
We are able to control up to 20 quadrotors (with 4
dynamical variables each), and probably more. We found no significant performance loss of the control method
when moving from the simulator to the real quadrotors, but many other practical limitations made flying with more than 3 quadrotors difficult.
We thus conclude that the path integral control method is able to compute the state-dependent optimal control for realistic
high-dimensional stochastic non-linear problems in real time.
We also develop decentralized control strategies for UAVs. This is illustrated for the task where a number of UAVs must maintain a holding pattern.
Oblivious3 and Oblivious10 are simulations of decentralized control without communication. Each agent assumes that every other agent will move straight ahead, based
on a measurement of its current position and velocity. Based on that assumption, each agent computes its optimal control.
Communication3 and Communication10
are simulations of decentralized control with communication. Each agent broadcasts its expected future trajectory to all
other agents. Based on this information, each agent computes its optimal control.
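The prediction step of the oblivious scheme can be sketched as follows; the constant-velocity model is as described above, while the function name and numbers are illustrative:

```python
import numpy as np

# Constant-velocity ("straight ahead") prediction used by an oblivious agent:
# given another agent's measured position and velocity, predict its positions
# over the planning horizon; the agent then computes its own optimal control
# against these predicted trajectories.

def predict_straight(pos, vel, horizon, dt):
    """Positions at t = dt, 2*dt, ..., horizon under constant velocity."""
    steps = np.arange(1, int(round(horizon / dt)) + 1)[:, None]
    return pos + steps * dt * vel

# Agent at (1, 0) moving with velocity (0, 1), predicted over 1 second:
traj = predict_straight(np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                        horizon=1.0, dt=0.1)
print(traj[-1])   # predicted position after 1 s: [1. 1.]
```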
Related research
- The group of Stefan Schaal at the University of Southern California has pioneered the application of path integral control methods in robotics and presented the results in dozens of research articles. See for
instance (Theodorou et al. 2010). This research is continued by Evangelos Theodorou in his lab at Georgia Tech; by Jonas Buchli at ETH Zurich; by Jens Kober at TU
Delft; and by Jan Peters at TU Darmstadt.
- Theoretical research on the path integral method has been advanced by Emo Todorov at the University of Washington and applied to motor learning and modeling in computational
neuroscience.
- Other researchers who have applied path integral theory to robotics include Takamitsu Matsubara (ATR Japan) and Kenji Doya (Okinawa).
- Path integral control theory is closely related to research on decision making under uncertainty (Daniel Braun, Daniel Polani) and to Bayesian theories of sensorimotor control (Karl Friston, Naftali Tishby).
References
- Fleming, WH and Mitter, SK. "Optimal control and nonlinear filtering for nondegenerate diffusion processes." Stochastics: An International Journal of Probability and Stochastic Processes 8.1 (1982): 63-77.
- Todorov, E. "Linearly-solvable Markov decision problems." Advances in Neural Information Processing Systems. 2006.
- Kappen, HJ. "Linear theory for control of nonlinear stochastic systems." Physical Review Letters 95.20 (2005): 200201.
- Todorov, E. "Efficient computation of optimal actions." Proceedings of the National Academy of Sciences 106.28 (2009): 11478-11483.
- Kappen, HJ. "Optimal control theory and the linear Bellman equation." Inference and Learning in Dynamic Models (2011): 363-387.
- Kappen, HJ, Gomez, V and Opper, M. "Optimal control as a graphical model inference problem." Machine Learning 87.2 (2012): 159-182.
- Thijssen, S and Kappen, HJ. "Path integral control and state-dependent feedback." Physical Review E 91.3 (2015): 032104.
- Kappen, HJ and Ruiz, HC. "Adaptive importance sampling for control and inference." arXiv preprint arXiv:1505.01874 (2015).
- Theodorou, E, Buchli, J and Schaal, S. "Reinforcement learning of motor skills in high dimensions: A path integral approach." 2010 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2010.