MPC
A worked introduction to Model-Predictive Control for Machine Learners.
If you’re into AI, you might have heard Yann LeCun argue that researchers should reject Reinforcement Learning in favour of Model Predictive Control. But what does that actually mean? The key idea of MPC is using a model to simulate possible futures, choosing the action sequence that yields the lowest cost (or highest reward), and executing the first action of that sequence. Then rinse and repeat. That’s exciting; MPC lets us transmute a predictive model into an action-selection policy, derived entirely at inference-time.
In this post I’ll get into the nuts and bolts of MPC, walking through how it works. For the code-inclined, I’ve also put together a companion Colab notebook with a NumPy implementation.
Out of Control
To understand the control problem, imagine you’ve got a drone. You can control the rotor speed with a knob called $u$, and you want your drone to sit at a fixed reference height $r$. We’ll do that by changing the underlying state $x$. Don’t get spooked — this is just control nomenclature. In RL we’d call $u$ the action and $x$ the state. Like in RL, the state is Markovian, which means it contains all the necessary information to infer future states. In our case, the state contains the drone’s vertical position and velocity $x=[p, \hspace{0.2em} \dot{p}]$. Since we can’t actually know $x$ precisely, we estimate it via measurements $y$. Our goal is to minimise the error $|y-r|$. That error tells us how close the drone is to where we want it to be. With that set up, we can define a basic model of how the drone works
\[\begin{aligned} x_{t+1} &= Ax_t + Bu_t\\ y_t &= Cx_t + Du_t.\\ \end{aligned}\]The matrices $A,B,C,D$ define our model of the system, dictating how things change from timestep to timestep:
- $A$ tells you how $x_t$ transitions to $x_{t+1}$. These are just the laws of physics — given a particular position and velocity, what state would the drone be in on the next timestep?
- $B$ tells you how your control signal $u$ affects the state. If the drone’s motors are weak, $B$ will be small — you need a big $u$ to get a little speed.
- $C$ is an emission matrix — it tells you what $y$ you’ll measure based on the underlying state $x$.
- $D$ tells you whether tweaking the control signal will also directly affect the measurements. $D=0$ in our example; since inertia is a thing, you can’t instantly change position by spinning your rotors faster. But if we were measuring the instantaneous acceleration instead of position, then $u$ would have an immediate impact and $D$ would be nonzero.
This model is deliberately simple. Common extensions include adding noise terms, augmenting the state space ($\tilde{x}=[x, d]$) to manage disturbances, or converting our nice linear matrices into nasty nonlinear functions $x_{t+1} = f(x_t, u_t).$
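To make the linear model concrete, here is a minimal NumPy sketch of the drone. The specific numbers are my assumptions rather than anything from the notebook: a timestep of `dt = 0.1`, unit mass, and a control $u$ that acts as a net vertical acceleration (gravity already cancelled out). The helper `step` is a hypothetical name.

```python
import numpy as np

dt = 0.1  # assumed timestep in seconds

# State x = [position, velocity]; u is treated as a net vertical acceleration.
A = np.array([[1.0, dt],
              [0.0, 1.0]])
B = np.array([[0.5 * dt**2],
              [dt]])
C = np.array([[1.0, 0.0]])   # we only measure position
D = np.array([[0.0]])        # no direct feedthrough from u to y


def step(x, u):
    """Advance the linear model one timestep; return (next state, measurement)."""
    y = C @ x + D @ u
    x_next = A @ x + B @ u
    return x_next, y


x = np.array([0.0, 0.0])     # start at rest at height zero
u = np.array([1.0])          # constant upward acceleration
for _ in range(10):          # simulate one second
    x, y = step(x, u)

print(x)  # approximately [0.5, 1.0]: half a metre up, moving at 1 m/s
```

Under these assumptions the result matches the textbook $p = \tfrac{1}{2}ut^2$, which is a handy sanity check that $A$ and $B$ are wired up correctly.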
Is This Loss?
The keen-eyed reader will notice that everything in our model is deterministic. In other words, if we know the starting state $x$, we can predict the future for any sequence of control values $u_t$. How convenient! This is the basic concept underpinning MPC — at each timestep we’ll unroll our model to predict the future, and choose the values of $u$ that give rise to the best predicted trajectory. But we can’t choose the “best” trajectory without some kind of scoring. For that we’ll use a cost function
\[\begin{aligned} J &= \sum_{\tau=t}^{t+N-1} \bigl [ (y_\tau - r_\tau)^T Q (y_\tau - r_\tau) + (u_{\tau} - u_{\tau - 1})^T S (u_{\tau} - u_{\tau - 1}) \bigr ] \\ &=\bigl [ \underbrace{(Y_t - R_t)^T \bar{Q} (Y_t - R_t)}_\text{Error term} + \underbrace{(\Delta U_t)^T \bar{S} (\Delta U_t)}_\text{Regularisation} \bigr ]. \end{aligned}\]The first “error term” in our MPC cost function defines how strongly the positional error $|y_t - r_t|$ is penalised, and the second term punishes rapid changes in the control signal (to prevent your controller instantly slamming the rotors to infinity). We sum those two terms over every step in the rollout to score a trajectory, using $N$ as the prediction horizon. The second form is just vectorised, by stacking variables into block matrices like $Y_t = [y_t, y_{t+1}, \dots, y_{t+N-1}]$, and so on. You might notice there’s a direct correspondence between the MPC objective and regularised least squares
\[\min_w \bigl [ (Y - Xw)^T (Y - Xw) + \lambda w^T w \bigr].\]Here the control sequence plays the role of the weights $w$. The only real difference is the use of the matrices $\bar{Q}$ and $\bar{S}$ in the inner products, but these are just weighting factors that let you prioritise certain dimensions of the error. In ML, we let the neural net figure that out, but in Control we mathematically solve it ourselves. The Control approach is powerful: when (if) your assumptions hold, you get strong guarantees about the stability and optimality of your solution. The downside is that you need a lot of information to make good assumptions: information about the drone, environment, kinematics, and more, all of which an RL agent figures out for itself.
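The two forms of the MPC cost $J$ can be checked against each other numerically. This is a small sketch with made-up trajectories; the weights `Q` and `S` are arbitrary, and I’m assuming $\bar{Q}$ and $\bar{S}$ are block-diagonal with one copy of $Q$ or $S$ per timestep, which is the usual reading of the stacked form.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5                              # prediction horizon

# Hypothetical scalar weights for the one-dimensional drone example.
Q = np.array([[1.0]])              # penalty on tracking error
S = np.array([[0.1]])              # penalty on changes in control

y = rng.normal(size=(N, 1))        # predicted measurements y_t ... y_{t+N-1}
r = np.ones((N, 1))                # reference height at each step
u = rng.normal(size=(N, 1))        # candidate controls u_t ... u_{t+N-1}
u_prev = np.zeros(1)               # u_{t-1}, the control applied last step

# Form 1: explicit sum over the horizon.
J_sum = 0.0
last_u = u_prev
for k in range(N):
    e = y[k] - r[k]
    du = u[k] - last_u
    J_sum += e @ Q @ e + du @ S @ du
    last_u = u[k]

# Form 2: stacked vectors with block-diagonal weights (assumed structure).
Q_bar = np.kron(np.eye(N), Q)
S_bar = np.kron(np.eye(N), S)
E = (y - r).reshape(-1)
dU = np.diff(np.vstack([u_prev, u]), axis=0).reshape(-1)
J_stacked = E @ Q_bar @ E + dU @ S_bar @ dU

print(np.isclose(J_sum, J_stacked))
```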
Predictive What?
I said in Control we solve it ourselves, so let’s have a go at that derivation. It’s a bit tedious, so feel free to skip. The important thing to know is that, because our system is linear, everything composes into matrix multiplications, and the cost becomes a quadratic function of the action sequence that is easy to minimise.
Maths warning
The trick is realising that you can unroll the measurements as follows
\[Y_t = \Lambda x_{t} + \Phi U_t,\]where $\Lambda, \Phi$ are matrices encoding the repeated application of the update matrices
\[\begin{aligned} \Lambda &= \begin{bmatrix} C \\ CA \\ CA^2 \\ \vdots \\ CA^{N-1} \end{bmatrix}, \quad \Phi = \begin{bmatrix} D & & & & \\ CB & D & & & \\ CAB & CB & D & & \\ \vdots & & & \ddots & \\ CA^{N-2}B & \dots & \dots & CB & D \end{bmatrix}. \end{aligned}\]Note that $\Phi$ has the same block-causal structure as an attention mask in a transformer. This is autoregressive, but we can produce the whole trajectory in one go because our model is linear. These equations are just saying that the trajectory of measurements is a function of our initial state, $x_t$, and our sequence of control values $U_t$. Next we can substitute that stacked trajectory into the cost function. First, define the part of the tracking error that is already fixed at time $t$
\[a_t = \Lambda x_t - R_t.\]From the optimiser’s point of view, $x_t$ and $R_t$ are constants. The only thing it gets to choose is the future control sequence $U_t$, so the tracking error becomes
\[Y_t - R_t = a_t + \Phi U_t.\]The regularisation term is slightly more fiddly, because it penalises changes in control rather than the control values directly. But we can write it with a finite-difference matrix. If
\[U_t = \begin{bmatrix} u_t \\ u_{t+1} \\ \vdots \\ u_{t+N-1} \end{bmatrix},\]then
\[\Delta U_t = \begin{bmatrix} u_t - u_{t-1} \\ u_{t+1} - u_t \\ \vdots \\ u_{t+N-1} - u_{t+N-2} \end{bmatrix} = M U_t - b_t,\]where
\[M = \begin{bmatrix} I & 0 & 0 & \dots & 0 \\ -I & I & 0 & \dots & 0 \\ 0 & -I & I & \dots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \dots & 0 & -I & I \end{bmatrix}, \qquad b_t = \begin{bmatrix} u_{t-1} \\ 0 \\ \vdots \\ 0 \end{bmatrix}.\]For our one-knob drone, $I$ is just the number $1$. For a vector-valued controller, it would be the identity matrix with the same dimension as $u$.
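If you’d rather trust code than algebra, the stacked matrices can be built and checked against a step-by-step rollout. This is a sketch under my own assumptions: the drone matrices use `dt = 0.1`, and `prediction_matrices` and `difference_matrices` are hypothetical helper names, not functions from the notebook.

```python
import numpy as np


def prediction_matrices(A, B, C, D, N):
    """Stack the model over N steps so that Y = Lam @ x + Phi @ U."""
    n_y, n_u = D.shape
    Lam = np.vstack([C @ np.linalg.matrix_power(A, i) for i in range(N)])
    Phi = np.zeros((N * n_y, N * n_u))
    for i in range(N):           # block row: which measurement
        for j in range(i + 1):   # block column: which control (causal, j <= i)
            if i == j:
                block = D
            else:
                block = C @ np.linalg.matrix_power(A, i - j - 1) @ B
            Phi[i * n_y:(i + 1) * n_y, j * n_u:(j + 1) * n_u] = block
    return Lam, Phi


def difference_matrices(N, n_u, u_prev):
    """Return M and b such that Delta U = M @ U - b."""
    I = np.eye(n_u)
    M = np.kron(np.eye(N), I) - np.kron(np.eye(N, k=-1), I)
    b = np.concatenate([u_prev, np.zeros((N - 1) * n_u)])
    return M, b


# Assumed drone model: state [position, velocity], u is a net acceleration.
dt, N = 0.1, 6
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.5 * dt**2], [dt]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])

rng = np.random.default_rng(0)
x0 = np.array([1.0, -0.5])
u_prev = np.array([0.2])
U = rng.normal(size=N)           # an arbitrary control sequence

# One-shot prediction with the stacked matrices.
Lam, Phi = prediction_matrices(A, B, C, D, N)
Y_stacked = Lam @ x0 + Phi @ U

# The same trajectory, unrolled one timestep at a time.
x, Y_loop = x0, []
for k in range(N):
    Y_loop.append(C @ x + D @ U[k:k + 1])
    x = A @ x + B @ U[k:k + 1]
Y_loop = np.concatenate(Y_loop)

M, b = difference_matrices(N, 1, u_prev)
dU = M @ U - b

print(np.allclose(Y_stacked, Y_loop))
print(np.allclose(dU, np.diff(np.concatenate([u_prev, U]))))
```

Both checks compare the matrix form against the obvious loop, so a mistake in the indexing of $\Phi$ or $M$ should show up straight away.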
Plugging both substitutions into the original cost gives
\[\begin{aligned} J(U_t) &= (a_t + \Phi U_t)^T \bar{Q} (a_t + \Phi U_t) + (M U_t - b_t)^T \bar{S} (M U_t - b_t) \\ &= U_t^T \Phi^T \bar{Q} \Phi U_t + 2 U_t^T \Phi^T \bar{Q} a_t + a_t^T \bar{Q} a_t \\ &\quad + U_t^T M^T \bar{S} M U_t - 2 U_t^T M^T \bar{S} b_t + b_t^T \bar{S} b_t. \end{aligned}\]This looks busy, but most of it is bookkeeping. The terms $a_t^T \bar{Q} a_t$ and $b_t^T \bar{S} b_t$ don’t depend on $U_t$, so they won’t change which control sequence minimises the cost. We can drop them and collect the remaining quadratic and linear terms
\[J(U_t) = U_t^T \underbrace{(\Phi^T \bar{Q} \Phi + M^T \bar{S} M)}_\text{quadratic bit} U_t + 2U_t^T \underbrace{(\Phi^T \bar{Q} a_t - M^T \bar{S} b_t)}_\text{linear bit} + \text{constant}.\]Most quadratic-programming solvers expect the problem in the form
\[J(U_t) = \frac{1}{2} U_t^T H U_t + U_t^T f + \text{constant}.\]So, up to the irrelevant constant, we can set
\[\begin{aligned} H &= 2(\Phi^T \bar{Q} \Phi + M^T \bar{S} M), \\ f &= 2(\Phi^T \bar{Q} a_t - M^T \bar{S} b_t). \end{aligned}\]And that’s the whole trick: once the model has turned future controls into future measurements, the cost is just a quadratic bowl in $U_t$.
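Here is a quick numerical check that $H$ and $f$ really do reproduce the original cost. The matrices below are random stand-ins with the right shapes (scalar $u$ and $y$), not a real drone model, and the closed-form solve only applies because this sketch has no constraints.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4

# Stand-in quantities with the right shapes; the values are arbitrary.
Phi = np.tril(rng.normal(size=(N, N)))          # causal prediction matrix
a = rng.normal(size=N)                          # a_t = Lam @ x_t - R_t
M = np.eye(N) - np.eye(N, k=-1)                 # finite-difference matrix
b = np.concatenate([[0.3], np.zeros(N - 1)])    # b_t carries u_{t-1}
Q_bar = 1.0 * np.eye(N)
S_bar = 0.1 * np.eye(N)


def cost(U):
    """Original cost: weighted tracking error plus weighted control changes."""
    e = a + Phi @ U
    dU = M @ U - b
    return e @ Q_bar @ e + dU @ S_bar @ dU


H = 2 * (Phi.T @ Q_bar @ Phi + M.T @ S_bar @ M)
f = 2 * (Phi.T @ Q_bar @ a - M.T @ S_bar @ b)
const = a @ Q_bar @ a + b @ S_bar @ b           # the terms we dropped

# The quadratic form should agree with the original cost for any U.
U = rng.normal(size=N)
quad = 0.5 * U @ H @ U + U @ f + const
print(np.isclose(cost(U), quad))

# With no constraints, the minimiser sets the gradient H U + f to zero.
U_star = np.linalg.solve(H, -f)
print(cost(U_star) <= cost(U))
```

Once constraints enter the picture there is no closed form like this, and `np.linalg.solve` would be swapped for a proper QP solver.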
Continuing down the garden path leads to the following cost function, up to a constant term that doesn’t affect the optimum
\[\boxed{J(U_t) = \frac{1}{2} U_t^T H U_t + U_t^T f}\]$H$ and $f$ are just composed of the matrices we’ve already defined. This cost function is appealing because it’s in quadratic form. As long as $H$ is positive definite (which holds when $\bar{Q}$ is positive semidefinite and $\bar{S}$ is positive definite), the problem is convex, so a solver will find the global optimum, converge quickly, and handle constraints such as bounding the control signal $0 < u < 10{,}000$. From here, we’re done! At each timestep of the system, we construct these matrices, pass them into a quadratic solver to get the optimal control sequence, and then snip off the first control value. So if we have an optimal sequence $U_t^* = (u_t^*, u_{t+1}^*, \dots, u_{t+N-1}^*)$, we’ll use $u_t^*$ as our control value (or action) and proceed to the next timestep.
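Putting the pieces together, the receding-horizon loop might look like the sketch below. To keep it dependency-free I’ve left out constraints, so the "solver" is just `np.linalg.solve`; with bounds on $u$ you would call a QP solver instead. The drone numbers, the weights, and the helper name `mpc_action` are all my assumptions.

```python
import numpy as np

# Assumed drone model: state [position, velocity], u is a net acceleration.
dt, N = 0.1, 20
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.5 * dt**2], [dt]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])

# Prediction matrices (scalar u and y, so each block is a single number).
Lam = np.vstack([C @ np.linalg.matrix_power(A, i) for i in range(N)])
Phi = np.zeros((N, N))
for i in range(N):
    Phi[i, i] = D[0, 0]
    for j in range(i):
        Phi[i, j] = (C @ np.linalg.matrix_power(A, i - j - 1) @ B)[0, 0]

M = np.eye(N) - np.eye(N, k=-1)
Q_bar = 1.0 * np.eye(N)       # arbitrary tracking weight
S_bar = 0.01 * np.eye(N)      # arbitrary smoothness weight
H = 2 * (Phi.T @ Q_bar @ Phi + M.T @ S_bar @ M)   # fixed for a fixed model


def mpc_action(x, u_prev, r):
    """Solve the unconstrained horizon problem; return only the first control."""
    a = Lam @ x - np.full(N, r)
    b = np.zeros(N)
    b[0] = u_prev
    f = 2 * (Phi.T @ Q_bar @ a - M.T @ S_bar @ b)
    U_star = np.linalg.solve(H, -f)   # the whole optimal plan
    return U_star[0]                  # snip off the first value


x = np.array([0.0, 0.0])      # start at rest at height zero
u_prev, r = 0.0, 1.0          # aim for a height of one
heights = []
for t in range(100):          # ten simulated seconds
    u = mpc_action(x, u_prev, r)
    x = A @ x + B.flatten() * u       # here the "real" drone is the model itself
    heights.append(x[0])
    u_prev = u

print(heights[-1])            # final height, which should sit near r
```

Note that only $f$ changes between timesteps, since it depends on the current state and the last control. $H$ can be built once up front.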
Conclusion
At the end of the day, MPC is interesting because it offers a way to do action selection without some of RL’s drawbacks, like sample inefficiency. That said, there are clear downsides. The most obvious one is the inefficiency of the whole scheme. Not only are we computing our optimal plan from scratch at every timestep, we throw most of it away and just keep the first action! And if we have a faulty model, then we’re in big trouble (also true in Model-Based RL), because prediction errors compound: every new step we unroll gets further and further away from the truth.
Plus, if we want to use a nonlinear model (e.g. a neural net), then we can’t use a quadratic solver any more. Instead we need to do that optimisation ourselves via methods like Random Shooting, MPPI or CEM. Running that optimisation at every timestep bumps up inference costs, whereas RL amortises it away in training.
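For a flavour of the sampling-based route, here is a minimal Random Shooting sketch: sample a batch of control sequences, roll each one through the model, and keep the first action of the cheapest. The nonlinear model `f` (a double integrator with quadratic drag) and every parameter value are hypothetical choices of mine for illustration. CEM and MPPI refine the sampling distribution over several iterations, which this sketch does not do.

```python
import numpy as np


def f(x, u, dt=0.1, drag=0.3):
    """Hypothetical nonlinear drone model: double integrator with quadratic drag."""
    p, v = x
    acc = u - drag * v * abs(v)
    return np.array([p + dt * v, v + dt * acc])


def rollout_cost(x, U, r, u_prev, s=0.01):
    """Unroll the model along U; sum tracking error and control-change penalty."""
    cost = 0.0
    for u in U:
        x = f(x, u)
        cost += (x[0] - r) ** 2 + s * (u - u_prev) ** 2
        u_prev = u
    return cost


def random_shooting(x, r, u_prev, rng, horizon=15, n_samples=500, u_max=5.0):
    """Sample control sequences uniformly; return (first action of best, its cost)."""
    candidates = rng.uniform(-u_max, u_max, size=(n_samples, horizon))
    costs = np.array([rollout_cost(x, U, r, u_prev) for U in candidates])
    best = candidates[np.argmin(costs)]
    return best[0], costs.min()


rng = np.random.default_rng(0)
x0 = np.array([0.0, 0.0])
u0, best_cost = random_shooting(x0, r=1.0, u_prev=0.0, rng=rng)

# Baseline for comparison: the cost of doing nothing over the same horizon.
do_nothing_cost = rollout_cost(x0, np.zeros(15), r=1.0, u_prev=0.0)
print(best_cost < do_nothing_cost)
```

Each call runs 500 rollouts of 15 steps, which gives a feel for why doing this at every timestep gets expensive.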
For general learning agents, I suspect the right approach is a mixture of RL and MPC-like (or model-free and model-based) strategies. We can amortise easy stuff into a policy or value function, but for hard problems we’d like the option of test-time search to ‘think ahead’. That’s how systems like AlphaGo worked: learned policy and value networks combined with Monte Carlo Tree Search for extra firepower at decision time. There’s some evidence that this is going on in the brain too: the basal ganglia does simple ‘System 1’ action selection, but the neocortex can perform simulation to enable more complex ‘System 2’-level thinking. I’m currently working on a little side project that gets into these ideas more deeply, which I hope to write up over the next month or two!
I’d like to thank Professor Joaquin Carrasco Gomez at UoM for helping me dig into MPC, and for providing a wealth of helpful resources. Thanks Joaquin!
DISCLAIMER: I’m only learning bits and bobs about Control on the side, and am by no means an expert! Please do let me know if you find any errors.
Posted on May 12, 2026