In this study, we examine how the complexity of evolved neural controllers depends on the environment, using the foraging behavior of the Cyber Rodent in two different environments. In the first environment, each fruit can be seen only from limited directions, and different groups of fruits become ripe in different periods. In the second environment, fruits inside a zone are rewarding and those outside are aversive. After evolution, agents with recurrent neural controllers outperformed those with feed-forward controllers by effectively using the memory of border passage. Simulation and experimental results with the Cyber Rodent robot confirmed that evolution selects neural controllers of appropriate complexity, in both size and structure.
The performance of learning systems depends critically on a number of meta-parameters that control how the detailed system parameters change with learning. In most previous approaches, the meta-parameters are determined based on the experience of the experimenters. However, humans and animals can learn novel behaviors in a wide variety of environments without the help of experimenters or supervisors. To build fully autonomous learning agents, it is important to develop methods for adjusting these meta-parameters to match the demands of the task and the environment. In this study, we propose a new method, based on an evolutionary approach, for determining the values of meta-parameters such as the learning rate and the temperature for exploration.
In our method, every individual in the population encodes a set of meta-parameters. Each individual learns its own behavior using standard reinforcement learning with its own meta-parameters, and its fitness value is computed from the agent's resulting performance. The meta-parameters optimized in simulation are then applied to learning the behavior on the real hardware system. Although gaps remain between the computer simulation and the real environment, the implementation shows good results.
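The loop described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the implementation used in the study: the two-armed bandit task, the truncation selection scheme, the log-normal mutation, and all function names (`softmax_action`, `run_lifetime`, `evolve`) are hypothetical. Each genome is a pair (learning rate, temperature), and fitness is the reward collected during one learning lifetime.

```python
import math
import random


def softmax_action(q_values, temperature, rng):
    """Sample an action index from a Boltzmann (softmax) distribution."""
    best = max(q_values)  # subtract the maximum for numerical stability
    prefs = [math.exp((q - best) / temperature) for q in q_values]
    threshold = rng.random() * sum(prefs)
    acc = 0.0
    for action, pref in enumerate(prefs):
        acc += pref
        if threshold <= acc:
            return action
    return len(prefs) - 1


def run_lifetime(alpha, temperature, rng, steps=200):
    """Fitness: total reward collected while learning a toy bandit task."""
    reward_prob = [0.2, 0.8]  # assumed toy task, not from the study
    q = [0.0, 0.0]
    total = 0.0
    for _ in range(steps):
        action = softmax_action(q, temperature, rng)
        reward = 1.0 if rng.random() < reward_prob[action] else 0.0
        q[action] += alpha * (reward - q[action])  # standard value update
        total += reward
    return total


def evolve(generations=20, pop_size=20, seed=0):
    """Evolve (alpha, temperature) pairs by truncation selection and mutation."""
    rng = random.Random(seed)
    pop = [(rng.uniform(0.01, 1.0), rng.uniform(0.05, 5.0))
           for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda g: run_lifetime(g[0], g[1], rng),
                        reverse=True)
        elite = ranked[: max(1, pop_size // 4)]
        pop = list(elite)
        while len(pop) < pop_size:
            alpha, tau = rng.choice(elite)
            # Log-normal mutation keeps both meta-parameters positive.
            alpha = min(1.0, max(0.01, alpha * math.exp(rng.gauss(0, 0.2))))
            tau = min(5.0, max(0.05, tau * math.exp(rng.gauss(0, 0.2))))
            pop.append((alpha, tau))
    return max(pop, key=lambda g: run_lifetime(g[0], g[1], rng))
```

Calling `evolve()` returns one (learning rate, temperature) pair. Because fitness is measured from a single noisy lifetime, a more careful version would average several lifetimes per individual.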
Hierarchical structure is often introduced into reinforcement learning to cope with large-scale problems, because the expected learning time is exponential in the size of the state space. However, a limitation of hierarchical reinforcement learning algorithms is that the structure has to be given by the designer in advance. We present an evolutionary approach to the automatic construction of the structure in hierarchical reinforcement learning.
Our method combines the MAXQ hierarchical reinforcement learning method with genetic programming (GP). The MAXQ method learns the policy based on the hierarchy obtained by GP, while GP searches for appropriate hierarchies using the results of the MAXQ method. The simulation results show a strong connection between the complexity of the evolved hierarchies and the complexity of the environment.
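For reference, the value function decomposition on which the MAXQ method rests can be written as below. This is the standard textbook form of the decomposition, stated here as background rather than as a result of this study; the symbols $i$ (parent subtask), $a$ (child subtask or primitive action), $s$ (state), and $C$ (completion function) follow common usage and are not defined in the text above.

```latex
\begin{align}
Q(i, s, a) &= V(a, s) + C(i, s, a), \\
V(i, s) &=
\begin{cases}
\max_{a} Q(i, s, a) & \text{if subtask } i \text{ is composite}, \\
\sum_{s'} P(s' \mid s, i)\, R(s' \mid s, i) & \text{if } i \text{ is a primitive action}.
\end{cases}
\end{align}
```

Here $C(i, s, a)$ is the expected discounted reward for completing subtask $i$ after child $a$ terminates. The hierarchy determines which children $a$ are available to each subtask $i$, and that hierarchy is the structure GP searches over.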
Minsky, in his popular book ``Society of Mind,'' suggested that intelligent behavior is the result of cooperation in a society of primitive agents. These agents cannot perform any thought processes, but when combined into ``societies,'' true intelligence arises. Neuroscience research has revealed heterogeneous learning modules working in parallel for fast and robust learning of skilled behaviors.
The speed and performance of learning depend on the complexity of the learner. A simple learner with few parameters and no internal states can quickly obtain a reactive policy, but its performance is limited. A learner with many parameters and internal states may eventually achieve high performance, but it may take an enormous amount of time to learn. Therefore, it is difficult to decide in advance which architecture and algorithm should be used for a new task.
In this study, we propose a new framework that selects an appropriate policy from a set of heterogeneous reinforcement learning modules and, using importance sampling, correctly improves the policies of all learning modules, including those not selected. In this framework, multiple heterogeneous learning modules sharing the same sensory-motor system compete to act and cooperate to learn, allowing the overall learning system to reach good performance faster. We show, in a simulation of a partially observable pole-balancing task and in robotic experiments on battery-pack foraging and partially observable T-maze tasks, that a complex learning module trained with the proposed method can actually learn faster than when it is trained alone, by exploiting task-relevant episodes generated by suboptimal but fast-learning modules.
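The role of importance sampling can be illustrated with a minimal sketch. It assumes a one-step task with two actions and fixed action probabilities; the function names (`importance_weight`, `off_policy_value`, `collect`) and the task are hypothetical and are not taken from the framework itself. A module that was not selected (the target policy) evaluates its own expected reward from actions generated by the selected module (the behavior policy) by reweighting each sample with the ratio of the two action probabilities.

```python
import random


def importance_weight(target_probs, behavior_probs, action):
    """Ratio pi_target(a) / pi_behavior(a) for the action actually taken."""
    return target_probs[action] / behavior_probs[action]


def off_policy_value(samples, target_probs, behavior_probs):
    """Ordinary importance-sampling estimate of the target policy's reward."""
    total = 0.0
    for action, reward in samples:
        total += importance_weight(target_probs, behavior_probs, action) * reward
    return total / len(samples)


def collect(behavior_probs, rewards, n, rng):
    """Generate (action, reward) pairs by acting with the behavior policy."""
    actions = range(len(behavior_probs))
    samples = []
    for _ in range(n):
        action = rng.choices(actions, weights=behavior_probs)[0]
        samples.append((action, rewards[action]))
    return samples


if __name__ == "__main__":
    rng = random.Random(0)
    behavior = [0.5, 0.5]  # selected module: explores both actions equally
    target = [0.9, 0.1]    # unselected module: prefers action 0
    data = collect(behavior, rewards=[1.0, 0.0], n=1000, rng=rng)
    print(off_policy_value(data, target, behavior))
```

The weight corrects for the mismatch between the policy that acted and the policy being improved, which is what allows a module to learn from episodes it did not generate. In a sequential task the per-step ratios would be multiplied along the episode, which this one-step sketch omits.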