DYLAN FOSTER: Thanks for inviting me.
TINGLE: Let’s start with a brief overview of this article. Tell us about the problem addressed by this work and why the research community should know about it.
FOSTER: So this is a kind of theoretical work on reinforcement learning, or RL. When I talk about reinforcement learning, generally speaking, it refers to the question of how we can design AI agents that can, for example, interact with unknown environments and learn to solve problems through trial and error. So this is part of a broader program that we have developed on the theoretical foundations of RL. And the key questions that we’re looking at here are what’s called exploration and sample efficiency. That just means that we are trying to understand, for example, what are the principles of algorithm design that allow you to explore an unfamiliar environment and learn as quickly as possible. What we’re doing in this paper is we’re looking at, sort of, how to most effectively solve reinforcement learning problems when you’re dealing with very high-dimensional observations, but the underlying dynamics, the underlying state of the system you are interacting with, are simple. So this is a setting that occurs in many natural reinforcement learning and control problems, particularly in the context of, for example, embodied decision making. So if you think about, let’s say, games like Pong, you know, the game state is extremely simple. It’s just, you know, what’s the position and velocity of the ball and, like, where are the paddles? But what we’d like to be able to do is learn, you know, how to control or solve games like this from raw pixels or images, kind of in the same way that a human would, just solve them from vision. So if you look at these kinds of problems, you know, we call them, like, RL with rich observations or RL with latent dynamics. You know, these are interesting because they kind of force you to explore the system, but they also require, you know, learning representations. For example, you want to be able to use neural networks to learn a mapping, say, from the images you see to the latent state of the system. This is a fairly interesting and non-trivial algorithmic problem. And, in a way, what we’re doing in this work is taking a first step toward something like a unified understanding of how to solve these kinds of RL problems with rich observations or latent dynamics.
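[Editor’s note: a minimal, purely illustrative sketch of the setup Foster describes, assuming a toy Pong-like problem. The latent state, dynamics, rendering, and dimensions below are hypothetical placeholders, not the model from the paper.]

```python
import numpy as np

rng = np.random.default_rng(0)

def latent_step(state, action):
    """Simple, low-dimensional dynamics: state = (ball_x, ball_vx)."""
    ball_x, ball_vx = state
    ball_x = (ball_x + ball_vx) % 1.0           # ball moves and wraps around
    ball_vx = ball_vx + 0.01 * (action - 1)     # action in {0, 1, 2} nudges velocity
    return np.array([ball_x, ball_vx])

def render(state, dim=64 * 64):
    """Emit a high-dimensional, noisy observation from the tiny latent state."""
    obs = np.zeros(dim)
    idx = int(state[0] * dim) % dim
    obs[idx] = 1.0                              # crude "pixel" encoding of ball position
    return obs + 0.05 * rng.standard_normal(dim)

state = np.array([0.5, 0.02])
for _ in range(10):
    action = rng.integers(0, 3)                 # placeholder policy: act at random
    state = latent_step(state, action)
    observation = render(state)                 # the agent only ever sees this
# The representation-learning problem: recover (an encoding of) the 2-D latent
# state from these 4096-dimensional observations well enough to explore and plan.
```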
TINGLE: So how did you go about developing this theoretical framework?
FOSTER: Yeah, so if you look at these kinds of RL problems with latent dynamics, it’s something that’s actually been studied a lot in theory. And a lot of that goes back to, sort of, early work from our lab from around 2016, 2017. There are some really interesting results here, but progress has largely been on a case-by-case basis, meaning there are many different ways to try to model the latent dynamics of your problem, and, you know, each of them leads in one way or another to a different algorithm, right. So, you know, you think very seriously about one of these modeling assumptions, you think about what an optimal algorithm would look like, and you end up, you know, writing a whole paper about it. And there’s nothing wrong with that per se, but if you want to be able to iterate quickly and, sort of, try different modeling assumptions and see what works in practice, you know, it’s not really tenable. It’s just too slow. And so the starting point of this work was, sort of, trying to take a different, more modular approach. So the idea is, you know, that there are many, many kinds of systems, or modeling assumptions for the dynamics, that have already been studied in depth and are the topic of entire papers in their own right, but in a simpler setting in which you can directly see the state of the system. And so what we wanted to ask here is: is it possible to use these existing results in a more modular way? For example, if someone has already written a paper on how to optimally solve a particular type of MDP, or Markov decision process, can we just take their algorithm as is and maybe plug it into some sort of meta-algorithm that can directly, sort of, combine it with representation learning and use it to solve the corresponding rich-observation or latent-dynamics RL problem?
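[Editor’s note: a rough, hypothetical sketch of the modular idea Foster describes: decode rich observations down to a small latent state space, then hand the decoded trajectories to an off-the-shelf tabular solver. The naive clustering decoder and the count-based "base algorithm" below are stand-ins for illustration only, not the paper’s actual meta-algorithm.]

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LATENT_STATES, NUM_ACTIONS = 5, 3

def base_mdp_algorithm(transitions):
    """Stand-in for an existing tabular MDP algorithm: count (state, action) visits
    and, as a crude exploration heuristic, pick the least-tried action per state."""
    counts = np.zeros((NUM_LATENT_STATES, NUM_ACTIONS))
    for (s, a, _s_next) in transitions:
        counts[s, a] += 1
    return np.argmin(counts, axis=1)            # one action per latent state

def fit_decoder(observations):
    """Stand-in for representation learning: a few rounds of naive k-means that map
    each rich observation to one of a handful of latent states."""
    centers = observations[rng.choice(len(observations), NUM_LATENT_STATES, replace=False)]
    for _ in range(10):
        labels = np.argmin(((observations[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(NUM_LATENT_STATES):
            if np.any(labels == k):
                centers[k] = observations[labels == k].mean(axis=0)
    return lambda obs: int(np.argmin(((obs - centers) ** 2).sum(-1)))

# Meta-algorithm loop (one round): gather rich observations, decode them to latent
# states, and hand the decoded trajectories to the base algorithm unchanged.
observations = rng.standard_normal((200, 32))   # placeholder logged observations
actions = rng.integers(0, NUM_ACTIONS, size=199)
decoder = fit_decoder(observations)
decoded = [decoder(o) for o in observations]
transitions = list(zip(decoded[:-1], actions, decoded[1:]))
policy = base_mdp_algorithm(transitions)        # policy over decoded latent states
```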
TINGLE: What were your main findings? What did you learn during this process?
FOSTER: We started by asking the question exactly the way I just phrased it, right. Like, can we take existing algorithms and use them to solve rich-observation RL problems in a modular way? And that turned out to be really tricky. For example, there are many natural algorithms you could try that seem promising at first but don’t really work. And what that, sort of, led us to, and, sort of, the first main result of this paper, is actually a negative result. So what we actually show is that most of the well-studied types of systems, or MDPs, that have been considered in the prior RL literature, even though they are tractable when you can directly see the state of the system, can become statistically intractable once you add high-dimensional observations to the picture. And statistically intractable here means that the amount of interaction you need, like the number, sort of, of attempts to explore the system you need in order to learn a good decision-making policy, becomes, like, very, very large, like much, much larger than the corresponding complexity if you could directly see the states of the system. You know, you could look at that and say, I guess we’re out of luck. You know, maybe there’s just no hope of solving these kinds of problems. But that might be a little too pessimistic. You know, the way you should interpret this result is just that you need more assumptions. And that is precisely, sort of, the second result that we obtain in this paper. So our second result shows that you can, sort of, get around this impossibility result and, you know, get truly modular algorithms under different kinds of additional assumptions.
TINGLE: Dylan, I’d like to know – and I’m sure our audience would too – what this work means in terms of real-world application. What impact will this have on the research community?
FOSTER: Yeah, so maybe I’ll answer that, uh, with two different points. The first is a broader point: why is it important to understand this problem of exploration and sample efficiency in reinforcement learning? If you look at the setting that we’re studying in this paper – you know, like RL or decision making with high-dimensional observations – on the empirical side, people have made huge progress on this problem through deep reinforcement learning. This is what has led to these incredible advances in solving games like Atari over the last decade. But if you look at these results, the gains, sort of, come more from the inductive bias or the generalization capabilities of deep learning and not necessarily from the specific algorithms. So current algorithms don’t explore very deliberately, and as a result their sample efficiency is quite poor. It’s hard to make a one-to-one comparison, but you can argue that they need a lot more experience than a human to solve these kinds of problems. So it’s not clear that we are really close to the ceiling of what can be achieved in terms of efficiency, like, how efficiently an agent can learn to solve new problems through trial and error. And I think better algorithms here could potentially be transformative in many different areas. Coming back to this specific work, I think there are a few takeaways for researchers. The first is that by giving this impossibility result, which shows that RL with latent dynamics is impossible without further assumptions, we are, in a sense, narrowing down the space in which other researchers can search for efficient algorithms. The second takeaway is that we show this problem becomes solvable when you make additional assumptions. But I view these more as a proof of concept. In a way, we show for the first time that it is possible to do something non-trivial, but I think a lot more work and research will be needed to, you know, build on that and turn it into something that can lead to practical algorithms.
TINGLE: Well, Dylan Foster, thank you for joining us today to discuss your paper on reinforcement learning with latent dynamics. We certainly appreciate it.
FOSTER: Thank you so much. Thanks for having me.
(MUSIC)
TINGLE: And to our listeners, thank you all for listening. If you would like to read Dylan’s paper, you can find a link at aka.ms/abstracts. You can also find the paper on arXiv and on the NeurIPS conference website. I’m Amber Tingle from Microsoft Research and we hope you’ll join us next time on Abstracts!
(MUSIC FADE IN)