There are two general types of Reinforcement Learning algorithms:
Model-based and Model-free. Model-based algorithms learn a transition
model $P(s'|s,a)$ for the environment, where $s$ is the state of the
environment at time $t$, $a$ is an action to be executed, and $s'$ is
the resulting state of the environment at time $t+1$. Model-based
algorithms also learn a reward model $R(s,a,s')$, which gives the
expected one-step reward of performing action $a$ in state $s$ and
making a transition to state $s'$. Once these models have been
learned, dynamic programming algorithms [6] can be
applied to compute the optimal value function $V^*$ and the optimal
policy $\pi^*$ for choosing actions.
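To make the dynamic-programming step concrete, the following is a minimal
value-iteration sketch over tabular models. The array layout (P[s, a, s2]
and R[s, a, s2]), the discount factor gamma, and the function name are
illustrative assumptions, not the code used in our experiments.
\begin{verbatim}
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """P[s, a, s2] = transition probability, R[s, a, s2] = expected reward."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = sum_{s2} P(s2|s,a) * (R(s,a,s2) + gamma * V(s2))
        Q = np.einsum('ijk,ijk->ij', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmax(axis=1)   # value function and greedy policy
\end{verbatim}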
In contrast, model-free methods (such as Q learning and
SARSA($\lambda$)) directly learn a value function $Q(s,a)$ by repeatedly
interacting with the environment without first learning transition or
reward models. They rely on the environment to ``model itself''. For
robot learning, however, model-free methods are impractical, because
they require many more interactions with the environment than
model-based methods do to obtain good results. They make sense in
simulated worlds where the cost of
performing an action can be much less than the cost of storing the
transition and reward models, particularly if the environment is
evolving over time. But the cost of performing an experimental action
with a real robot is very high.
Hence, for our experiments, we have chosen the model-based algorithm
known as Prioritized Sweeping [49].
Prioritized Sweeping works as follows. At each time step, the learner
observes the state $s_t$ of the environment, chooses an action $a_t$,
performs the action, receives a one-step reward $r_t$, and observes
the resulting state $s_{t+1}$. The learner then updates its estimate of
$P(s'|s,a)$ and of $R(s,a,s')$ using the observed result state
$s_{t+1}$ and the observed reward $r_t$. Finally, the learner performs the
$k$ most important Bellman backups to update its estimate of the value
function $V$. A Bellman backup in state $s$ is computed as follows:
\[
V(s) := \max_{a} \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V(s')\right],
\]
where $\gamma$ is the discount factor.
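In code, a single backup of this form might look like the sketch below,
which reuses the tabular P and R arrays from the value-iteration sketch
above (all names are again illustrative):
\begin{verbatim}
def bellman_backup(s, V, P, R, gamma=0.95):
    """Perform one Bellman backup at state s in place; return the change."""
    n_actions, n_states = P.shape[1], P.shape[2]
    q_values = [sum(P[s, a, s2] * (R[s, a, s2] + gamma * V[s2])
                    for s2 in range(n_states))
                for a in range(n_actions)]
    new_v = max(q_values)
    delta = abs(new_v - V[s])
    V[s] = new_v
    return delta
\end{verbatim}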
Prioritized Sweeping maintains a maximizing priority queue of states
in which it believes a Bellman backup should be performed. First, it
performs a Bellman backup for the most recent state $s_t$. In each
Bellman backup, it computes the change in the value $V(s)$ resulting
from the backup:
\[
\Delta = \left| V_{\mathrm{new}}(s) - V_{\mathrm{old}}(s) \right|.
\]
Each state-action pair $(\bar{s},\bar{a})$ known to lead to $s$ is then
inserted into the priority queue with priority $P(s|\bar{s},\bar{a}) \cdot
\Delta$, and the remaining backups are performed on the states popped from
the front of the queue.
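One way to render this bookkeeping with a binary heap is sketched below. It
assumes a predecessors table mapping each state to the (state, action) pairs
observed to lead to it, and it reuses bellman_backup from the previous
sketch; the backup budget n_backups and the pruning threshold min_priority
are our own illustrative parameters.
\begin{verbatim}
import heapq

def prioritized_sweeping_update(s_t, V, P, R, predecessors, gamma=0.95,
                                n_backups=5, min_priority=1e-4):
    """Back up the most recent state, then the highest-priority states."""

    def push_predecessors(queue, s, delta):
        # A large change at s makes its predecessors' values stale; queue
        # them with priority proportional to P(s | s_prev, a_prev) * delta.
        for (s_prev, a_prev) in predecessors[s]:
            priority = P[s_prev, a_prev, s] * delta
            if priority > min_priority:
                heapq.heappush(queue, (-priority, s_prev))  # min-heap: negate

    queue = []
    delta = bellman_backup(s_t, V, P, R, gamma)   # most recent state first
    push_predecessors(queue, s_t, delta)
    for _ in range(n_backups):
        if not queue:
            break
        _, s = heapq.heappop(queue)
        delta = bellman_backup(s, V, P, R, gamma)
        push_predecessors(queue, s, delta)
    return V
\end{verbatim}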
Prioritized Sweeping is essentially an incremental form of value
iteration, in which the most important updates are performed first.
Because every interaction with the environment is applied to update
the model, Prioritized Sweeping makes maximum use of all of its
experience with the environment. Prioritized Sweeping is an
``off-policy'' learning algorithm. During the learning process, any
exploration policy can be employed to choose actions to execute. As
long as the exploration policy is guaranteed to try every action in
every state sufficiently often, Prioritized Sweeping will converge to the
optimal action-selection policy. We employ $\epsilon$-greedy
exploration. In this form of exploration, when the robot reaches
state $s$, it executes a random action with probability $\epsilon$.
With probability $1-\epsilon$, it executes the action that is believed
to be optimal (according to the current value function $V$). Ties are
broken randomly.
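In code, the rule might look like the following sketch, which scores each
action under the current model and value function and breaks ties randomly;
epsilon and the function name are our own:
\begin{verbatim}
import numpy as np

def epsilon_greedy_action(s, V, P, R, gamma=0.95, epsilon=0.1, rng=None):
    """Random action with probability epsilon, else a greedy action."""
    if rng is None:
        rng = np.random.default_rng()
    n_actions = P.shape[1]
    if rng.random() < epsilon:
        return rng.integers(n_actions)
    # One-step lookahead under the current model and value function.
    q = np.einsum('jk,jk->j', P[s], R[s] + gamma * V[None, :])
    best = np.flatnonzero(np.isclose(q, q.max()))
    return rng.choice(best)            # break ties randomly
\end{verbatim}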
We represent both the transition model $P(s'|s,a)$ and the reward
model $R(s,a,s')$ by three-dimensional matrices with one cell for each
combination of $s$, $a$, and $s'$. This technique will only work if
the state and action spaces are small. There are two reasons for
this. First, the tables must fit into memory. Second, the time
required for learning is proportional to the number of cells in these
tables, because the Learning Agent must experience multiple visits to
each state $s$ so that it can perform each action $a$ several times
and gather enough data to estimate $P(s'|s,a)$ and $R(s,a,s')$.
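A straightforward way to maintain these tables is to keep per-cell visit
counts and reward sums and form maximum-likelihood estimates from them. The
sketch below assumes that representation; the function and array names are
illustrative.
\begin{verbatim}
import numpy as np

def init_model(n_states, n_actions):
    # One cell per (s, a, s') combination.
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions, n_states))
    return counts, reward_sums

def update_model(counts, reward_sums, s, a, r, s2):
    """Record one observed transition (s, a) -> s2 with reward r."""
    counts[s, a, s2] += 1
    reward_sums[s, a, s2] += r

def estimate_model(counts, reward_sums):
    """Maximum-likelihood estimates of P(s'|s,a) and R(s,a,s')."""
    n_sa = counts.sum(axis=2, keepdims=True)
    P = np.divide(counts, n_sa, out=np.zeros_like(counts), where=n_sa > 0)
    R = np.divide(reward_sums, counts,
                  out=np.zeros_like(reward_sums), where=counts > 0)
    return P, R
\end{verbatim}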
Hence, the most challenging aspect of applying Reinforcement Learning
is the proper design of the state representation.