As mentioned, each of the agents within the Navigation system has a bidding function that is controlled by a set of internal parameters. These parameters need to be tuned to achieve the best performance of the Navigation system and of the overall system. Although, as shown in the previous section, we achieved good results with hand-tuned parameters, we wanted to explore whether other parameter configurations would lead to better performance. Adjusting these parameters manually can be very difficult, particularly because of the tradeoffs confronting the top-level agents. An alternative to manual tuning is to employ Machine Learning techniques, specifically Reinforcement Learning methods [64]. In this section, we describe some experiments to test the feasibility of applying Reinforcement Learning within this multiagent system.
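To illustrate what is meant by a parameterized bidding function, the following is a minimal, hypothetical sketch (the names, parameters, and functional form are for illustration only and do not correspond to the actual agents): tuning an agent amounts to choosing the numerical parameters that shape how its internal "need" for a resource is turned into a bid.

```python
from dataclasses import dataclass

@dataclass
class BidParams:
    """Hypothetical internal parameters of one agent's bidding function."""
    urgency_gain: float = 1.0   # how quickly the bid grows with the agent's need
    max_bid: float = 1.0        # upper bound on any bid the agent may place

def compute_bid(need: float, params: BidParams) -> float:
    """Map the agent's current need for a resource (e.g. the camera)
    to a bid in [0, max_bid]; tuning means choosing the parameters."""
    return min(params.max_bid, params.urgency_gain * need)
```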
Reinforcement Learning is one of the most commonly used learning techniques in Robotics. In Behavior-based architectures, learning can be applied at two levels: at the coordination level, where learning is applied to the coordination system [44,28], or at the behavior level, where learning is applied to the individual behaviors of the system [45,14]. In our case, we have taken the latter approach [10,11].
Ideally, we would like to apply Reinforcement Learning to tune all of the parameters of all of the agents in the system. However, this is a very difficult problem, and it is not clear that Reinforcement Learning is the best solution at every level of the system. Instead, we have chosen to focus on a particular learning problem within the Navigation system. Reinforcement Learning is most needed, and most appropriate, in cases where there is a complex, quantitative tradeoff between behaviors. In such cases, manual tuning is difficult, and the quantitative criterion of maximizing expected reward, which is the goal of Reinforcement Learning, provides a natural way to represent the tradeoff.
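For reference, the standard objective of Reinforcement Learning is to find a policy $\pi$ that maximizes the expected discounted return

$$J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t}\right], \qquad 0 \le \gamma < 1,$$

where $r_t$ is the reward received at step $t$ and $\gamma$ is the discount factor; the particular reward used by our Learning Agent is introduced below.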
Within the Navigation system, such a tradeoff exists between the Target Tracker agent, the Risk Manager, and the Distance Estimator (recall that we use the initial version of the system, described in Section 5.1). The Target Tracker wants to know the exact heading and distance to the target at all times; this can be achieved by pointing the camera at the target and moving towards it. The Risk Manager wants to ensure that the robot is surrounded by a rich network of landmarks so that it does not get lost; this can be achieved by pointing the camera in various directions around the robot to identify and track landmarks. Finally, the Distance Estimator needs accurate estimates of the distance to the target landmark; this can be achieved by pointing the camera in the direction of the target while moving the robot orthogonally to that direction. In addition to this conflict, the Navigation system must not monopolize the camera, because the Pilot also needs it for obstacle avoidance.
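To make the conflict concrete, the following hypothetical sketch (the real agents and their bidding details are considerably more elaborate) shows how each agent, acting alone, would prefer to point the camera and steer the robot:

```python
import math

def tracker_preference(target_bearing: float) -> tuple[float, float]:
    """Target Tracker: look at the target and drive straight towards it.
    Returns a (camera pan, robot heading) preference."""
    return target_bearing, target_bearing

def risk_manager_preference(step: int, robot_heading: float) -> tuple[float, float]:
    """Risk Manager: sweep the camera around the robot to keep many landmarks
    in view (hypothetical sweep); indifferent to the robot's heading."""
    pan = (step * math.radians(30.0)) % (2.0 * math.pi)
    return pan, robot_heading

def distance_estimator_preference(target_bearing: float) -> tuple[float, float]:
    """Distance Estimator: look at the target while moving orthogonally
    to it, which improves triangulation of the target's distance."""
    return target_bearing, target_bearing + math.pi / 2.0
```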
Instead of trying to learn appropriate values for each of these agents' parameters, we propose to replace the Target Tracker, the Risk Manager, and the Distance Estimator with a new Learning Agent that learns its behavior through Reinforcement Learning. We formulate the reward function for this agent so that it is rewarded for reaching the current target location while minimizing the use of the camera. The two remaining agents have very different roles. The Map Manager maintains the beta-coefficient map but does not bid on actions. The only remaining bidding agent is the Rescuer, which is responsible for the higher-level choice of diverting targets whenever the robot becomes blocked. This activity is better implemented by path planning algorithms than by Reinforcement Learning, so we have not included the Rescuer's responsibilities within the Learning Agent. The modified architecture for the Navigation system is shown in Figure 5.4.
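As a rough illustration of the kind of reward function described above (the constants and the exact form are hypothetical, not the ones used in our experiments), the Learning Agent can be given a large bonus for reaching the current target and charged a small cost whenever it claims the camera, so that the learned policy trades progress towards the target against camera usage:

```python
def reward(reached_target: bool, used_camera: bool,
           goal_reward: float = 100.0, camera_cost: float = 1.0) -> float:
    """Hypothetical per-step reward for the Learning Agent:
    a bonus on reaching the current target location, and a small
    penalty each step in which the camera is used."""
    r = 0.0
    if reached_target:
        r += goal_reward
    if used_camera:
        r -= camera_cost
    return r
```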