The Reinforcement Learning Algorithm

There are two general types of Reinforcement Learning algorithms: model-based and model-free. Model-based algorithms learn a transition model $P(s'\vert s,a)$ for the environment, where $s$ is the state of the environment at time $t$, $a$ is an action to be executed, and $s'$ is the resulting state of the environment at time $t+1$. Model-based algorithms also learn a reward model $R(s,a,s')$, which gives the expected one-step reward of performing action $a$ in state $s$ and making a transition to state $s'$. Once these models have been learned, dynamic programming algorithms [6] can be applied to compute the optimal value function $V^*$ and the optimal policy $\pi^*$ for choosing actions.
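
To make this concrete, the following is a minimal value-iteration sketch in Python, assuming the two models are stored as tabular arrays P[s, a, s'] and R[s, a, s']. The array names, their shapes, and the optional discount factor gamma are illustrative assumptions of this example; the backup formulas in this section are written without a discount factor, which corresponds to setting gamma = 1.

\begin{verbatim}
import numpy as np

# Illustrative value-iteration sketch: compute V* and a greedy policy from
# tabular models P[s, a, s'] and R[s, a, s'].  The discount factor gamma is
# an assumption of this example; gamma = 1 matches the undiscounted formulas.
def value_iteration(P, R, gamma=0.95, tol=1e-6):
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Backed-up value of every (s, a) pair: sum_{s'} P * (R + gamma * V)
        Q = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmax(axis=1)   # optimal values and a greedy policy
\end{verbatim}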

In contrast, model-free methods (such as Q learning and SARSA($\lambda$)) learn a value function (in these cases, the action-value function $Q$) directly, by repeatedly interacting with the environment, without first learning transition or reward models. They rely on the environment to ``model itself''. For robot learning, however, model-free methods are impractical, because they require many more interactions with the environment to obtain good results. They make sense in simulated worlds, where the cost of performing an action can be much less than the cost of storing the transition and reward models, particularly if the environment is evolving over time. With a real robot, by contrast, the cost of performing an experimental action is very high.

Hence, for our experiments, we have chosen the model-based algorithm known as Prioritized Sweeping [49]. Prioritized Sweeping works as follows. At each time step, the learner observes the state $s$ of the environment, chooses an action $a$, performs the action, receives a one-step reward $r$, and observes the resulting state $s'$. The learner then updates its estimate of $P(s'\vert s,a)$ and of $R(s,a,s')$ using the observed result state $s'$ and the observed reward $r$. Finally, the learner performs the $k$ most important Bellman backups to update its estimate of the value function $V$. A Bellman backup in state $s$ is computed as follows:

\begin{displaymath}V(s) := \max_a \sum_{s'} P(s'\vert s,a) [R(s,a,s') + V(s')]\end{displaymath}

This is essentially a one-step lookahead that considers all possible actions $a$ and all possible resulting states $s'$, computes the expected backed-up value of each $a$, and assigns the maximum such value to be the new estimate of $V$ at state $s$.
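
A minimal sketch of this backup, assuming the same tabular arrays P and R and the value table V introduced in the earlier example (the function name and the optional discount factor are illustrative, not part of the thesis):

\begin{verbatim}
import numpy as np

# Sketch of a single Bellman backup at state s, using the tabular arrays
# P[s, a, s'] and R[s, a, s'] and the value table V from the example above.
def bellman_backup(V, P, R, s, gamma=1.0):
    # Expected backed-up value of each action a at state s.
    backed_up = np.einsum('jk,jk->j', P[s], R[s] + gamma * V[None, :])
    V[s] = backed_up.max()   # keep the value of the best action
    return V[s]
\end{verbatim}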

Prioritized Sweeping maintains a max-priority queue of states in which it believes a Bellman backup should be performed. First, it performs a Bellman backup for the most recent state $s$. For each Bellman backup, it computes the magnitude of the change in the value $V(s)$ produced by the backup:

\begin{displaymath}\Delta(s) = \left\vert V(s) - \max_a \sum_{s'} P(s'\vert s,a) [R(s,a,s') + V(s')]\right\vert\end{displaymath}

After performing the Bellman backup, Prioritized Sweeping considers every state $s^-$ that is a known predecessor of $s$ and computes the potential impact $C(s^-)$ that the change in $V(s)$ could have on the value of $s^-$, according to

\begin{displaymath}C(s^-) = \sum_{a} P(s\vert s^-,a) \Delta(s)\end{displaymath}

It then places each such state $s^-$ on the priority queue with priority $C(s^-)$. Finally, Prioritized Sweeping performs $k-1$ iterations in which it pops off the state with the maximum potential impact, performs a Bellman backup in that state, and then computes the potential impact of that backup on all of its predecessor states. In our experiments, $k=5$. (In our implementation, we actually use the state-action, or $Q$, representation of the value function rather than the state value function $V$; we describe the method using $V$ in order to simplify the presentation.)
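
The following sketch shows one way these $k$ prioritized backups might be organized. It assumes the tabular arrays P and R, the value table V, the bellman_backup helper sketched earlier, and a predecessor map pred[s] recording which states have been observed to transition into $s$; these names, and the use of Python's heapq module, are illustrative choices rather than details of the actual implementation.

\begin{verbatim}
import heapq

# Sketch of the k prioritized backups performed after each real experience.
# P, R, V and bellman_backup are as in the earlier sketches; pred[s] is a
# set of states s- observed to transition into s.  All names are illustrative.
def prioritized_backups(V, P, R, pred, s, k=5, gamma=1.0):
    # heapq is a min-heap, so priorities are negated to pop the largest first.
    # Seeding the most recent state s with infinite priority makes the first
    # pop a backup of s itself, followed by k - 1 prioritized backups.
    queue = [(-float('inf'), s)]
    for _ in range(k):
        if not queue:
            break
        _, state = heapq.heappop(queue)
        old_value = V[state]
        bellman_backup(V, P, R, state, gamma)
        delta = abs(V[state] - old_value)
        # Potential impact on each known predecessor:
        #   C(s-) = sum_a P(state | s-, a) * delta
        for s_minus in pred.get(state, ()):
            impact = P[s_minus, :, state].sum() * delta
            if impact > 0:
                # A full implementation would raise the priority of an entry
                # already in the queue; pushing a duplicate is a simplification.
                heapq.heappush(queue, (-impact, s_minus))
    return V
\end{verbatim}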

Prioritized Sweeping is essentially an incremental form of value iteration in which the most important updates are performed first. Because every interaction with the environment is used to update the model, Prioritized Sweeping makes maximum use of its experience. It is also an ``off-policy'' learning algorithm: during the learning process, any exploration policy can be employed to choose the actions to execute. If the exploration policy guarantees that every action is tried in every state sufficiently often, then Prioritized Sweeping will converge to the optimal action-selection policy. We employ $\epsilon$-greedy exploration. In this form of exploration, when the robot reaches state $s$, it executes a random action with probability $\epsilon$. With probability $1-\epsilon$, it executes the action that is believed to be optimal (according to the current value function $V$). Ties are broken randomly.
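
A minimal sketch of $\epsilon$-greedy action selection, written over a tabular action-value estimate Q[s, a] (consistent with the $Q$ representation mentioned above); the function name and the random-number generator are assumptions for this example:

\begin{verbatim}
import numpy as np

# Sketch of epsilon-greedy action selection over a tabular estimate Q[s, a].
def epsilon_greedy(Q, s, epsilon=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return rng.integers(n_actions)           # explore: random action
    best = np.flatnonzero(Q[s] == Q[s].max())    # all maximizing actions
    return rng.choice(best)                      # exploit, break ties randomly
\end{verbatim}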

We represent both the transition model $P(s'\vert s,a)$ and the reward model $R(s,a,s')$ by three-dimensional matrices with one cell for each combination of $s$, $s'$, and $a$. This technique will only work if the state and action spaces are small. There are two reasons for this. First, the tables must fit into memory. Second, the time required for learning is proportional to the number of cells in these tables, because the Learning Agent must experience multiple visits to each state $s$ so that it can perform each action $a$ several times and gather enough data to estimate $P(s'\vert s,a)$ and $R(s,a,s')$. Hence, the most challenging aspect of applying Reinforcement Learning is the proper design of the state representation.
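
The sketch below illustrates one way such tables could be maintained from experience, estimating $P(s'\vert s,a)$ as the empirical transition frequency and $R(s,a,s')$ as the running mean of observed rewards; the class and method names are assumptions for this example, not the thesis implementation.

\begin{verbatim}
import numpy as np

# Sketch of the tabular model representation: one cell for every (s, a, s')
# combination, with P estimated from visit counts and R as the running mean
# of observed rewards.  Class and method names are illustrative.
class TabularModel:
    def __init__(self, n_states, n_actions):
        self.counts = np.zeros((n_states, n_actions, n_states))
        self.reward_sum = np.zeros((n_states, n_actions, n_states))

    def update(self, s, a, r, s_next):
        # One call per real interaction with the environment.
        self.counts[s, a, s_next] += 1
        self.reward_sum[s, a, s_next] += r

    def P(self):
        # Empirical P(s' | s, a); rows with no data are left at zero.
        totals = self.counts.sum(axis=2, keepdims=True)
        return np.divide(self.counts, totals,
                         out=np.zeros_like(self.counts), where=totals > 0)

    def R(self):
        # Mean observed reward for each (s, a, s') with at least one visit.
        return np.divide(self.reward_sum, self.counts,
                         out=np.zeros_like(self.reward_sum),
                         where=self.counts > 0)
\end{verbatim}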


