MODELING THE MIRROR: GRASP LEARNING AND

ACTION RECOGNITION

 

by

 

Erhan Oztop

 

A Dissertation Presented to the

FACULTY OF THE GRADUATE SCHOOL

UNIVERSITY OF SOUTHERN CALIFORNIA

In Partial Fulfillment of the

Requirements for the Degree

DOCTOR OF PHILOSOPHY

(COMPUTER SCIENCE)

 

 

August 2002

TABLE OF CONTENTS

DEDICATION. ii

ACKNOWLEDGEMENTS. iii

LIST OF FIGURES. x

1    INTRODUCTION. 1

2    CHAPTER II: BIOLOGICAL BACKGROUND. 4

2.1     Abbreviations. 4

2.2     Premotor areas. 4

2.2.1      Area F5. 5

2.2.2      Area F4. 10

2.2.3      Areas F2 and F7 (dorsolateral prefrontal cortex) 11

2.2.4      Area F1 (the primary motor cortex) 12

2.2.5      Areas F3 (SMA proper), F6 (pre-SMA) 12

2.3     The superior temporal sulcus. 12

2.4     Parietal Areas. 13

2.4.1      The anterior intraparietal area (AIP) 13

2.4.2      The caudal intraparietal sulcus (c-IPS) 15

2.4.3      Areas VIP, MIP and LIP. 17

2.4.4      Areas 7a and 7b (PG and PF) 18

2.5     Connectivity and other brain regions. 19

2.6     Mirror neurons in humans. 23

2.7     Summary. 25

3    CHAPTER III: MIRROR NEURON SYSTEM MODEL. 26

3.1     The mirror neuron system for grasping and FARS model 26

3.2     The hand-state hypothesis. 29

3.2.1      Virtual fingers. 29

3.2.2      The hand-state hypothesis. 30

3.3     The MNS (mirror neuron system) model 31

3.3.1      Overall function. 33

3.3.2      Schemas explained. 33

3.4     Schema implementation. 36

3.4.1      Grand schema 1: reach and grasp. 36

3.4.2      Grand schema 2: visual analysis of hand state. 38

3.4.3      Grand Schema 3: core mirror circuit. 42

3.5     Simulation results. 46

3.5.1      Non-explicit affordance coding experiments. 46

3.5.2      Explicit affordance coding experiments. 50

3.5.3      Justifying the visual analysis of hand state schema. 54

3.6     Discussion and predictions. 56

3.6.1      The hand state hypothesis. 56

3.6.2      Neurophysiological predictions. 57

4    CHAPTER IV: MULTILAYER SUPERVISED HEBBIAN LEARNING AND PROBABILITY CODING    60

4.1     Neural coding. 60

4.2     Operation of the proposed network. 61

4.3     Testing the proposed architecture. 64

4.3.1      Deterministic environment. 64

4.3.2      Stochastic environment. 66

4.3.3      Combining multiple layers. 68

4.4     Summary. 69

5    CHAPTER V: INFANT GRASP LEARNING. 71

5.1     Motivation. 71

5.2     Infant reach and grasp. 71

5.3     Neural maturation versus interactive learning. 74

5.4     Infant Learning to Grasp Model (ILGM) 75

5.4.1      Layers of infant learning to grasp model 76

5.4.2      Functional description of ILGM layers. 77

5.5     Joy of grasping. 78

5.5.1      Mechanical grasp stability. 78

5.5.2      Implementing the grasp stability. 79

5.6     Learning approach direction with palm orienting behavior. 80

5.6.1      Simulation results. 80

5.6.2      Conclusions and predictions. 82

5.7     Is infant palm orienting learned or innate? Learning the wrist orientation. 82

5.7.1      Simulation results. 83

5.7.2      Conclusions and predictions. 84

5.8     Task constraints shape infant grasping. 84

5.8.1      Simulation results. 85

5.8.2      Conclusions and predictions. 86

5.9     Affordance input matters. 87

5.9.1      Simulation results. 88

5.9.2      Comparison of ILGM with Lockman et al. (1984) 88

5.9.3      ILGM kinematics analysis (five months of age) 88

5.9.4      ILGM kinematics analysis (nine months of age) 89

5.10       Summary and conclusion. 91

5.11       Discussion. 92

6    CHAPTER VI: NEUROPHYSIOLOGICAL VIEW OF LEARNING TO GRASP. 93

6.1     Grasp learning circuit and mirror neurons are complementary networks. 93

6.2     Introduction to primate grasping. 94

6.3     Neural correlates of infant reach and grasp. 95

6.4     Primate grasp development hypotheses. 97

6.4.1      Hypothesis I: two coexistent grasping circuits. 97

6.4.2      Hypothesis II: single grasping circuit. 98

6.5     Affordance-based learning to grasp model (LGM) 98

6.5.1      Localizing learning to grasp model in primate cortex. 99

6.5.2      What does cerebral cortex know about a grasp?. 101

6.5.3      Simulation level description of LGM layers. 102

6.5.4      Why LGM is relevant: good model versus bad model 102

6.6     Wrist orientation-learning revisited: neural level analysis. 103

6.6.1      Neural level analysis. 103

6.6.2      LGM represents a ‘menu’ of grasps in terms of neural activity. 104

6.6.3      Predictions and discussion. 106

6.7     Object axis selectivity: neural level analysis. 106

6.7.1      Neural level analysis. 107

6.7.2      Conclusions and neurophysiological predictions. 108

6.8     Object size selectivity. 109

6.8.1      Simulation results. 109

6.8.2      Conclusions and neurophysiological predictions. 111

6.9     Generalization: learning to plan based on object location. 113

6.9.1      Simulation results. 113

6.9.2      Summary and Conclusion. 115

7    CHAPTER VII: BIOLOGICALLY REALISTIC F5 VISUAL SERVO CIRCUITS FOR GRASPING AND EMERGENCE OF MIRROR NEURONS. 117

7.1     Motivation. 117

7.2     The link between the mirror neuron system and grasp learning. 118

7.2.1      Two Visual Control Hypotheses. 119

7.2.2      Mirror neurons in feed-forward control (alternative I) 121

7.2.3      Mirror Neurons in feedback control (alternative II) 122

7.2.4      The target of implementation. 122

7.2.5      The visual servo task. 123

7.2.6      The feed-forward model learning. 123

7.3     Implementation: F5 manual visual control circuit for 2D arm. 123

7.3.1      A leaky integrator model for F5 manual visual feedback circuit. 124

7.3.2      Simulation: visual feedback control with leaky integrators. 127

7.3.3      Simulation: feedback and lower motor centers. 128

7.3.4      F5 Feed-forward visual control and mirror neurons. 130

7.3.5      Simulation: trajectory planning and controller performance. 135

7.4     Feed-forward unit activity and mirror neurons. 137

8    CHAPTER VIII: CONCLUSIONS. 142

8.1     Mirror neurons. 142

8.2     Grasp learning: infant development. 144

8.3     Grasp learning: monkey neurophysiology. 144

8.4     Grasp learning: neural architecture. 145

9    CHAPTER IX: FUTURE WORK. 146

9.1     Simultaneous learning in MNS model and LGM. 146

9.2     Going beyond grasping: learning the ‘hand state’ 146

9.3     Tactile feedback for grasping. 146

9.4     Learning to extract the right affordance. 147

9.5     More realistic models of the limb and the brain regions. 147

9.6     Sensitivity analyses for the simulation parameters. 147

10      REFERENCES. 149

11      APPENDIX. 164

11.1       Mirror neuron system model (MNS) 164

11.1.1    Color segmentation. 164

11.1.2    Reach and grasp schema precision grasp planning and execution. 165

11.2       Learning to grasp models (ILGM and LGM) 165

 

 

 


 

 

 

Copyright 2002                                                                           Erhan Oztop

 


DEDICATION

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

To my Grandmother

 

 


ACKNOWLEDGEMENTS

The time I had at USC during the Ph.D. route was a very enriching period of my life. I had the opportunity to work in an exciting and stimulating research environment, led by Michael Arbib.

I would like to present my deepest gratitude to The Scientific and Technical Research Council of Turkey (TUBITAK) for providing me the scholarship that made it possible for me to attempt and complete the Ph.D. study presented in this thesis. The study would not have been possible if TUBITAK did not provide support for the very first semester and the final semester at USC.

I would like to thank Michael Arbib for guiding and educating me throughout my years at USC. He has been an extraordinary advisor and mentor, to whom I owe all the brain theory I learned. I would also like to present my gratitude to Michael Arbib for providing support via HFSP and USCBP grants. Without HFSP and USCBP support, this study would not have been possible.

Stefan Schaal is a great mentor, who has been a constant support and source of inspiration with never-ending energy. Besides introducing me to robotics, he and his colleague, Sethu Vijayakumar, were very influential in maturing the concept of machine learning in my mind.

I also owe a great debt of gratitude to Nina Bradley for being a source of positive energy and mentoring me, especially in infant motor development. Without her, the thesis would be lacking a major component.

I am also full of gratitude to my Ph.D. qualification exam committee members Maja Mataric and Christoph von der Malsburg for their guidance and support.

I am greatly indebted to Giacomo Rizzolatti for enabling me to visit his lab and providing the opportunity to communicate not only with himself but also with Vittorio Gallese and Leonardo Fogassi, who have provided invaluable insights about mirror neurons. In addition, I am very thankful to Massimo Matelli and Giuseppe Luppino for the first-hand information on the mirror neuron system connectivity, and to Luciano Fadiga for stimulating discussions. I would also like to thank Christian Keysers and Evelyne (Kohler) for not only actively involving me in their recording sessions but also offering their sincere friendship.

I am very thankful to Hideo Sakata for giving me the opportunity to visit his lab in Tokyo and interact with many researchers including Murata-san with whom I had very stimulating discussions.

I am deeply thankful to Mitsuo Kawato, for giving me the opportunity to interact with various researchers in Japan by having me at ATR during the summer of 2001. My research experience at ATR was very rewarding; I greatly expanded my knowledge on motor control and motor learning. I would like to salute the staff at ATR for all their help. I also would like to present my thanks to the friends at ATR for welcoming me and making me feel at home.

I would like to present my appreciation and thanks to my mentors at Middle East Technical University in Turkey. I present my sincere thanks to my master's advisor Marifi Guler for introducing me to neural computation and to Volkan Atalay for introducing me to computer vision and supporting my Ph.D. application. I would especially like to present my gratitude to Fatos Yarman Vural for her guidance and support during my master's study and for preparing me for the Ph.D. work presented in this dissertation. Other influential Computer Science professors to whom I am grateful for educating me are Nese Yalabik, Gokturk Ucoluk and Sinan Neftci.

I would like to present my gratitude to Tosun Terzioglu, Albert Ekip, Turgut Onder and Semih Koray who were professors of the Mathematics Department at the Middle East Technical University. They taught me how to think ‘right’ and exposed the beauty of mathematics to me.

Throughout these six years at USC, I had the pleasure to meet several valuable people who contributed to this dissertation. Firstly, I am very thankful to Aude Billard and Auke Jan Ijspeert for all their support and scientific interaction and feedback. They had a huge role in helping me get through the tough times during my Ph.D. work. In addition, I would like to thank Aude Billard for providing me the computing hardware and helping me have a nice working environment, which was essential for the progress of my Ph.D. study. I am thankful to my great friend Sethu Vijayakumar for his support and stimulating discussions. Jun Nakanishi, Jun Mei Zu, Aaron D’Souza, Jan Peters and Kazunori Okada, besides being of great support, were always there to discuss issues and helped me overcome various obstacles. I am deeply thankful to Shrija Kumari for offering not only her smile and friendship but also her energy to proofread my manuscript. I owe a lot to Jun Mei Zu: she has always been there as a great friend and has always offered her help and support when I needed it most. I am indebted to my ex-officemate and very valuable friend Mihail Bota for his constant support, for interactions that improved the thesis, and for providing me the psychological support to overcome many obstacles throughout my Ph.D. years. Finally, I would like to thank Juan Salvador Yahya Marmol for being a good friend and sharing my workload during various periods of my Ph.D. I count myself very lucky to have these great friends and colleagues, to whom I once again present my gratitude: Thank you guys!

Not a Hedco Neuroscience inhabitant but a very valuable friend, Lay Khim Ong was always there to offer her help both psychologically and physically (Hey Khim: thank you for your great editing!). I would like to thank other great friends who supported me (in spite of my negligence in keeping in touch with them). Kyle Reynolds, Awe Vinijkul, Aye Vinijkul Reynolds, Alper Sen, Ebru Dincel: please accept my sincere thanks and appreciation.

I owe deep gratitude to my wife Mika Satomi for her patience in dealing with me in difficult times.  She was always there. Her contribution to this thesis is indispensable. I especially celebrate and thank her for the artistic talents and hard work that she generously offered me throughout the Ph.D. study.

I am greatly indebted to Paulina Tagle and Laura Lopez for their support and help over all these years. I also would like to thank Laura Lopez and Yolanda Solis for their kind friendship and support. My gratitude to Luigi Manna, who helped me with the hardware and software issues during the Ph.D. study.

I would like to thank Laurent Itti for his generous help in improving our lab environment and providing partial support for my research. Also, I would like to salute his lab members for their support and friendship. Florence Miau, in particular, always offered her warm friendship during her internship at USC.

I would like to present my appreciation to the good things in life; in particular, I would like to thank the ocean for comforting and rejuvenating me during difficult times.

Finally, I am deeply indebted to my family, to whom I owe much more than what can be expressed. This work would not have been possible without the help of my parents. (Mom and Dad, Evrim and Nurdan: thank you very, very much for everything!)

 

 



LIST OF FIGURES

Figure 2.1 Lateral view of macaque brain showing the areas of agranular frontal cortex and posterior parietal cortex (adapted from Geyer et al. 2000). The naming conventions: frontal regions, Matelli et al. (1991); parietal regions, Pandya and Seltzer (1982) 5

Figure 2.2 A canonical neuron response during grasping of various objects in the dark (left to right and top to bottom: plate, ring, cube, cylinder, cone and sphere). The rasters and histograms are aligned with object presentation. Small grey bars in each raster mark the onset of key press, go signal, key release, onset of object pulling, release signal, and object release, respectively. The peaks in the ring and sphere object cases correspond to the grasping of the object by the monkey (adapted from Murata et al. 1997a) 6

Figure 2.3 The motor responses of the same neuron shown in Figure 2.2. The motor preference of the neuron is also carried over to the visual preference (compare the ring and sphere histograms of both figures) (adapted from Murata et al. 1997a) 7

Figure 2.4 Activity of a cell during action observation (left) and action execution (right). There is no activity in presentation of the object during both initial presentation and bringing the tray towards the monkey. The vertical line over the histogram indicates the hand-object contact onset. (from Gallese et al., 1996). 8

Figure 2.5 Visual response of a mirror neuron. A. Precision grasp B. power grasp C. mimicking of precision grasp. The vertical lines over the histograms indicate the hand-object contact onset. (adapted from Gallese et al., 1996) 9

Figure 2.6 Example of a strictly congruent manipulating mirror neuron: A) The experimenter retrieved the food from a well in a tray. B) Same action, but performed by the monkey. C) The monkey grasped a small piece of food using a precision grip. The vertical lines over the histograms indicate the hand-object contact onset (adapted from Gallese et al., 1996). 9

Figure 2.7 The classification of area F5 neurons derived from published literature (Dipellegrino et al. 1992; Gallese 2002; Gallese et al. 1996; Murata et al. 1997a; Murata et al. 1997b; Rizzolatti et al. 1996a; Rizzolatti and Gallese 2001). All F5 neurons fire in response to some motor action. In addition, canonical neurons fire for object presentation while the mirror neurons fire for action observation. The majority of hand related F5 neurons are purely motor (Gallese 2002) (labelled as Motor Neurons in the figure) 10

Figure 2.8 The macaque parieto-frontal projections from mesial parietal cortex, medial bank of the intraparietal sulcus and the surface of the superior parietal lobule (adapted from Rizzolatti et al. 1998). Note that Brodmann’s area 7m corresponds to Pandya and Seltzer's (1982) area PGm. 11

Figure 2.9 The intraparietal sulcus opened to show the anatomical location of AIP in the macaque (adapted from Geyer et al. 2000) 13

Figure 2.10 An AIP visual-dominant neuron activity under three task conditions: object manipulation in the light, object manipulation in the dark and object fixation in the light. The neuron is active during the fixation and holding phases when the action is performed in the light. However, during grasping in the dark the neuron shows no activity. The fixation of the object alone without grasping also produces a discharge (adapted from Sakata et al. 1997a) 14

Figure 2.11 Activity of the same neuron as in Figure 2.10 during fixation of different objects. The neuron shows selectivity for the horizontal plate (adapted from Sakata et al. 1997a) 14

Figure 2.12 An AIP visual-dominant neuron’s axis orientation tuning and object fixation response are shown. The neuron fires maximally during the fixation of a vertical bar or a cylinder. The tuning is demonstrated in the lower half of the figure (adapted from Sakata et al. 1999) 15

Figure 2.13 Response of an axis-orientation-selective (AOS) neuron in the caudal part of the lateral bank of the intraparietal sulcus (c-IPS) to a luminous bar tilted 45° forward (left) or 45° backward (right) in the sagittal plane. The monkey views the bar with binocular vision. The line segment under the histograms marks the fixation start and a period of 1 second (adapted from Sakata et al. 1999) 16

Figure 2.14 The response of the same neuron in Figure 2.13, for monocular vision conditions for the left and right eyes. (adapted from Sakata et al. 1999) 16

Figure 2.15 Orientation tuning of a surface-orientation selective (SOS) neuron. First row: Stimuli presented. Middle row: responses of the cell with binocular view. Last row: responses of the cell with monocular view (adapted from Sakata et al. 1997a) 17

Figure 2.16 The reconstructed connectivity of area 7a. The thickness of the arrows represents the strength of the connection (adapted from Bota 2001) 20

Figure 2.17 The reconstructed connectivity of area 7b. The thickness of the arrows represents the strength of the connection (adapted from Bota 2001) 21

Figure 2.18 The reconstructed connectivity of area AIP. The thickness of the arrows represents the strength of the connection (adapted from Bota 2001) 22

Figure 3.1 Lateral view of the monkey cerebral cortex (IPS, STS and lunate sulcus opened). The visuomotor stream for hand action is indicated by arrows (adapted from Sakata et al., 1997) 27

Figure 3.2 AIP extracts the affordances and F5 selects the appropriate grasp from the AIP ‘menu’. Various biases are sent to F5 by Prefrontal Cortex (PFC) which relies on the recognition of the object by Inferotemporal Cortex (IT). The dorsal stream through AIP to F5 is replicated in the MNS model 28

Figure 3.3 Each of the 3 grasp types here is defined by specifying two "virtual fingers", VF1 and VF2, which are groups of fingers or a part of the hand such as the palm which are brought to bear on either side of an object to grasp it. The specification of the virtual fingers includes specification of the region on each virtual finger to be brought in contact with the object. A successful grasp involves the alignment of two "opposition axes": the opposition axis in the hand joining the virtual finger regions to be opposed to each other, and the opposition axis in the object joining the regions where the virtual fingers contact the object. (Iberall and Arbib 1990) 29

Figure 3.4 The components of hand state F(t) = (d(t), v(t), a(t), o1(t), o2(t), o3(t), o4(t)). Note that some of the components are purely hand configuration parameters (namely v,o3,o4,a) whereas others are parameters relating hand to the object. 31

Figure 3.5 The MNS (Mirror Neuron System) model. (i) Top diagonal: a portion of the FARS model. Object features are processed by cIPS and AIP to extract grasp affordances; these are sent on to the canonical neurons of F5 that choose a particular grasp. (ii) Bottom right: recognizing the location of the object provides parameters to the motor programming area F4, which computes the reach. The information about the reach and the grasp is taken by the motor cortex M1 to control the hand and the arm. (iii) New elements of the MNS model: bottom left are two schemas, one to recognize the shape of the hand, and the other to recognize how that hand is moving. (iv) Just to the right of these is the schema for hand-object spatial relation analysis. It takes information about object features, the motion of the hand and the location of the object to infer the relation between hand and object. (v) The center two regions marked by the gray rectangle form the core mirror circuit. This complex associates the visually derived input (hand state) with the motor program input from region F5 canonical neurons during the learning process for the mirror neurons. The grand schemas introduced in section 3.2 are illustrated as follows. The “Core Mirror Circuit” schema is marked by the center grey box; the “Visual Analysis of Hand State” schema is outlined by solid lines just below it; and the “Reach and Grasp” schema is outlined by dashed lines. (Solid arrows: established connections; dashed arrows: postulated connections) 32

Figure 3.6 (a) For purposes of simulation, we aggregate the schemas of the MNS (Mirror Neuron System) model of Figure 3.5 into three "grand schemas" for Visual Analysis of Hand State, Reach and Grasp, Core Mirror Circuit. (b) For detailed analysis of the Core Mirror Circuit, we dispense with simulation of the other two grand schemas and use other computational means to provide the three key inputs to this grand schema. 34

Figure 3.7 (Left) The final state of arm and hand achieved by the reach/grasp simulator in executing a power grasp on the object shown. (Right) The hand state trajectory read off from the simulated arm and hand during the movement whose end-state is shown at left. The hand state components are: d(t), distance to target at time t; v(t), tangential velocity of the wrist; a(t), Index and thumb finger tip aperture; o1(t), cosine of the angle between the object axis and the (index finger tip – thumb tip) vector; o2(t), cosine of the angle between the object axis and the (index finger knuckle – thumb tip) vector; o3(t), The angle between the thumb and the palm plane; o4(t), The angle between the thumb and the index finger. 37

Figure 3.8 Grasps generated by the simulator. (a) A precision grasp. (b) A power grasp. (c) A side grasp. 38

Figure 3.9 (a) Training the color expert, based on colored images of a hand whose joints are covered with distinctively colored patches. The trained network will be used in the subsequent phase for segmenting images. (b) A hand image (not from the training sample) is fed to the augmented segmentation program. The color decision during segmentation is made by consulting the Color Expert. Note that a smoothing step (not shown) is performed before segmentation. 40

Figure 3.10 Illustration of the model matching system. Left: markers located by the feature extraction schema. Middle and Right: initial and final stages of model matching. After matching is performed, a number of parameters for the hand configuration are extracted from the matched 3D model. 41

Figure 3.11 The scaling of an incomplete input to form the full spatial representation of the hand state. As an example, only one component of the hand state, the aperture, is shown. When 66 percent of the action is completed, the pre-processing we apply effectively causes the network to receive the stretched hand state (the dotted curve) as input, as a re-representation of the hand state information accessible up to that time (represented by the solid curve; the dashed curve shows the remaining, unobserved part of the hand state) 44

Figure 3.12 The solid curve shows the effective input that the network receives as the action progresses. At each simulation cycle the scaled curves are sampled (30 samples each) to form the spatial input for the network. Towards the end of the action the network's input gets closer to the final hand state. 45

Figure 3.13 (a) A single grasp trajectory viewed from three different angles to clearly show its 3D pattern. The wrist trajectory during the grasp is marked by square traces, with the distance between any two consecutive trace marks traveled in equal time intervals. (b) Left: The input to the network. Each component of the hand state is labelled. (b) Right: How the network classifies the action as a power grasp: squares: power grasp output; triangles: precision grasp; circles: side grasp output. Note that the response for precision and side grasp is almost zero. 47

Figure 3.14 Power and precision grasp resolution. The conventions used are as in the previous figure. (a) The curves for power and precision cross towards the end of the action, showing the change of decision of the network. (b) The left shows the initial configuration and the right shows the final configuration of the hand. 48

Figure 3.15 (Top) Strong precision grip mirror response for a reaching movement with a precision pinch. (Bottom) Spatial location perturbation experiment. The mirror response is greatly reduced when the grasp is not directed at a target object. (Only the precision grasp related activity is plotted. The other two outputs are negligible.) 48

Figure 3.16 Altered kinematics experiment. Left: The simulator executes the grasp with bell-shaped velocity profile. Right: The simulator executes the same grasp with constant velocity. Top row shows the graphical representation of the grasps and the bottom row shows the corresponding output of the network. (Only the precision grasp related activity is plotted. The other two outputs are negligible.) 49

Figure 3.17 Grasp and object axes mismatch experiment. Rightmost: the change of the object from cylinder to a plate (an object axis change of 90 degrees). Leftmost: the output of the network before the change (the network turns on the precision grip mirror neuron). Middle: the output of the network after the object change. (Only the precision grasp related activity is plotted. The other two outputs are negligible.) 50

Figure 3.18 The plots show the level of mirror responses of the explicit affordance coding object for an observed precision pinch for four cases (tiny, small, medium, big objects). The filled circles indicate the precision activity while the empty squares indicate the power grasp related activity. 52

Figure 3.19 The solid curve: the precision grasp output, for the non-explicit affordance case, directed to a tiny object. The dashed curve: the precision grasp output of the model to the explicit affordance case, for the same object. 52

Figure 3.20 Empty squares indicate the precision grasp related cell activity, while the filled squares represent the power grasp related cell activity. The grasps show the effect of changing the object affordance, while keeping a constant hand state trajectory. In each case, the hand-state trajectory provided to the network is appropriate to the medium-sized object, but the affordance input to the network encodes the size shown. In the case of the biggest object affordance, the effect is enough to overwhelm the hand state’s precision bias. 53

Figure 3.21 The graph is drawn to show the decision switch time versus object size. The minimum is not at the boundary, that is, the network will detect a precision pinch quickest with a medium object size. Note that the graph does not include a point for "Biggest object" since there is no resolution point in this case (see the final panel of Figure 3.19) 54

Figure 3.22 The precision grasp action used to test our visual system is depicted by superimposed frames (not all the frames are shown) 54

Figure 3.23 The video sequence used to test the visual system is shown together with the 3D hand matching result (over each frame). Again, not all the frames are shown. 55

Figure 3.24 The plot shows the output of the MNS model when driven by the visual recognition system while observing the action depicted in Figure 3.22. It must be emphasized that the training was performed using the synthetic data from the grasp simulator while testing is performed using the hand state extracted by the visual system only. Dashed line: Side grasp related activity; Solid line: Precision grasp related activity. Power grasp activity is not visible as it coincides with the time axis. 56

Figure 4.1 The elevated circular region corresponds to the area defined by the inequality (x*x + y*y) < 0.25. The environment returns +1 as the reward if the action falls into the circular region; otherwise –1 is returned. 64

Figure 4.2 The adaptation of the firing potential of the stochastic units is shown as a series of evolving 3D maps (left to right and top to bottom) 65

Figure 4.3 The normalized histogram of the actions generated over 60000 samples. Note that the actions generated captured the environment’s reward distribution (see Figure 4.1). 66

Figure 4.4 The stochastic environment’s double peaked reward distribution (see text for the explanation) 66

Figure 4.5 Some snapshots showing the phases of learning of the layer in the stochastic environment where the reward distribution has two peaks (see Figure 4.4). 67

Figure 4.6 The normalized histogram of 60000 data points (actions) generated by the trained layer in the stochastic environment depicted in Figure 4.4. 68

Figure 5.1 Infant grip configurations can be divided in two categories: power and precision grips. Infants tend to switch from power grips to precision grips as they grow (adapted from Butterworth et al. 1997) 73

Figure 5.2 The structure of the Infant Learning to Grasp Model. The individual layers are trained based on somatosensory feedback. 76

Figure 5.3 Hand Position layer specifies the approach direction of the hand towards the object. The representation is allocentric (centred on the object). Geometrically the space around the object can be uniquely specified with the vector (azimuth, elevation, radius). The Hand Position layer generates the vector by a local population vector computation. The locus of the local neighbourhood is determined by the probability distribution represented in the firing potential of Hand Position layer neurons (see Chapter 4, for details) 77

Figure 5.4 The grasp stability we used in the simulations is illustrated for a hypothetical precision pinch grip (note that this is a simplification; the actual hand used in the simulations has five fingers) 79

Figure 5.5 The trained model’s Hand Position layer is shown as a 3D plot. One dimension is summed to reduce the 4D map to a 3D map. Intuitively, the map says: ‘when the object is above the shoulder and in front, grasp it from the bottom’.

Figure 5.6 The output of the trained model’s target position layer is shown as a 3D plot. One dimension is summed to reduce the 4D map to a 3D map. The object is on the left side of the (right-handed) arm. Intuitively, the map says: ‘when the object is on the left side, grasp it from the right side of the object’.

Figure 5.7 The learning evolution of the distribution of the Hand Position layer is shown as a 3D plot. Note that the 1000 neurons shown represent the probability distribution of approach directions. Initially, the layer is not trained and responds in a random fashion to the given input. As the learning progresses, the neurons gain specificity for this object location.

Figure 5.8 ILGM planned and performed a power grasp after learning. Note the supination (and to a lesser extent extension) of the wrist required to grasp the object from the bottom side.

Figure 5.9 Two learned precision grips (left: three-fingered; right: four-fingered) are shown. Note the wrist configuration in each case. ILGM learned to combine the wrist location with the correct wrist rotations to secure the object.

Figure 5.10 ILGM was able to generate two-fingered precision grips. However, these were fewer than the three- or four-fingered grips.

Figure 5.11 The cube-on-the-table simulation setup. ILGM interacts with the object under the physical constraint that it has to avoid collision with the table.

Figure 5.12 ILGM learned a ‘menu’ of precision grips with the common property that the wrist was placed well above the object. The orientation of the hand and the contact points on the object showed some variability. Two example precision grips are shown in the figure.

Figure 5.13 ILGM acquired precision grips in which the thumb opposes the index finger.

Figure 5.14 The three cylinder orientations and grasp attempts in the poor-vision condition.

Figure 5.15 The orientation match of the hand and the cylinder is illustrated. Dashed line with diamonds: 5-month-old infants; solid line with diamonds: 9-month-old infants; dashed line with circles: ILGM with no affordance; solid line with circles: ILGM with affordance (infant data from Lockman et al. (1984)). The right panel illustrates the object orientation used for the simulation and for the infants in this comparison.

Figure 5.16 The hand orientation and cylinder orientation difference curves for individual trials. The columns from left to right correspond to horizontal, diagonal and vertical orientations. The upper row shows the flat class of error curves; the lower row shows the non-flat class (see text for explanation).

Figure 5.17 The hand orientation and cylinder orientation difference curves while ILGM was executing four types of grasp in the full-vision condition. The left two panels show two typical error curves for the horizontal cylinder. Note that the two horizontal-case error patterns reflect the two possible grasps: from the bottom and from the top. The third and fourth panels show typical error curves for the diagonal and vertical cylinders, respectively.

Figure 5.18 The grasps performed after ILGM learned the association between the wrist rotations and the object affordance (orientation).

Figure 6.1 The overall MNS model. The grey background rectangle shows the focus of this chapter. In addition to the areas shown, area F2 will be posited as being involved in grasp planning.

Figure 6.2 Left: precision grasp (pad opposition); Middle: power grasp (palm opposition); Right: side grasp (side opposition). Each of the 3 grasp types here is defined by specifying two ‘virtual fingers’, VF1 and VF2, which are groups of fingers or a part of the hand, such as the palm, that are brought to bear on either side of an object to grasp it. The specification of the virtual fingers includes specification of the region on each virtual finger to be brought in contact with the object. A successful grasp involves the alignment of two "opposition axes": the opposition axis in the hand joining the virtual finger regions to be opposed to each other, and the opposition axis in the object joining the regions where the virtual fingers contact the object (adapted from Iberall and Arbib 1990).

Figure 6.3 The two possible organizations of the learning-to-grasp circuit are shown. According to Hypothesis I, two grasping circuits exist: the phylogenetically older one located in area F1 (hatched background) and the newer one in the premotor cortex (solid background). According to Hypothesis II, F1 is involved only in executing the movements instructed by the premotor cortex. LGM is based on the latter hypothesis. The details of LGM are shown in Figure 6.4. Note that we introduced area F2 to complement the MNS structures. The visual input to area F2 originates from MIP (not shown) and V6A.

Figure 6.4 The Learning to Grasp Model. F5 is implicated in all grasp-related parameters. Dashed connections indicate the direct corticospinal projections of premotor areas. Area F5 works with areas F2 and F4 to transform visual affordances signalled by parietal areas into a grasp plan. The grasp plan is then relayed to the primary motor cortex (F1) and spinal cord for execution. The tactile feedback of the action is integrated in the first somatosensory cortex (SI), which mediates the adaptation of the parietal-premotor and inter-premotor connections.

Figure 6.5 The top-left panel shows the Hand Position layer output summed over the radius (approach direction is encoded in spherical coordinates) as a 3D plot. The top-centre panel shows the sample generated from the Hand Position distribution. The bottom-left panel shows the Wrist Rotation layer output summed over the heading axis as a 3D plot. The bottom-centre panel shows the parameters picked from the Wrist Rotation layer distribution. Note that the Wrist Rotation layer distribution depends on the sample picked from the Hand Position layer (i.e. it represents a conditional distribution). The rightmost panel shows the executed grasp.

Figure 6.6 Using the same LGM used for Figure 6.5, another grasp plan is generated (left four panels). The resulting grasp is shown on the right. Comparing the grasp plan shown in the left four panels with that of Figure 6.5, we see how the selection of a different approach direction (see the centre-top panels of both figures) changed the Wrist Orientation distribution.

Figure 6.7 Two very different grasps generated by the same LGM. Upper panel: grasping with maximum wrist extension and some pronation. Lower panel: grasping with maximum wrist supination and small wrist extension. Note that the Wrist Layer probability map is the same in both cases since the same approach direction was chosen (the small dots in the rightmost panels).

Figure 6.8 The grasps performed after LGM learned the association of hand rotations with the object orientation input (full-vision condition). Note that the left panel shows a bottom-side grasp. All of the grasp configurations shown satisfied the grasp stability criterion.

Figure 6.9 In the poor-vision case, the hand rotation neurons in LGM show the same response for horizontal (left panel), diagonal (centre panel) and vertical (right panel) object presentations because of the lack of axis orientation input.

Figure 6.10 When LGM has access to axis orientation information, the Hand Rotation neurons represent different plans in response to horizontal (left panel), diagonal (centre panel) and vertical (right panel) object presentations.

Figure 6.11 The small object presentation produced two peaks of activity in the Hand Position layer, corresponding to the probability distribution of approach directions. The right panel shows the executed grasp when the data generation was localized in the area pointed to by the leftmost arrow.

Figure 6.12 A large cube was grasped by securing the object between the thumb and the other fingers (right panel). The Hand Position layer activity is shown in the left panel. The neuron with the largest activity is marked with an arrow.

Figure 6.13 The largest object presentation and grasping. The Hand Position layer reflects a single reach direction, as indicated by the arrow.

Figure 6.14 The Hand Position layer activities are superimposed to demonstrate that the loci of maximum activity are separated for each object, indicating selectivity for object size.

Figure 6.15 The trained Learning to Grasp Model executed grasps to objects located at nine different locations in the workspace. The grasp locations were not used in the training. All of the grasps shown were stable.

Figure 6.16 The same model used in generating Figure 6.15 was used to generate a different set of grasps. Again, all the grasps were stable.

Figure 6.17 The internal mechanisms of representing and generating multiple grasp plans are shown. Solid arrows (except object encoding) denote learned connections, while empty arrows indicate data generation. The flow of operation starts with the presentation of the object (bottom centre) and follows the arrows. At the top centre, the data generation can yield multiple approach directions. The two possible approach directions are shown creating two streams (left column and right column), each of which yields a different grasp execution (bottom pictures of the left and right columns).

Figure 7.1 The MNS model repartitioned to show the focus of this chapter. The grey background marks the area of interest.

Figure 7.2 One alternative visual control structure for manipulation is shown within the MNS framework. The mirror neurons generate feed-forward commands.

Figure 7.3 Another alternative visual control structure for manipulation is shown within the MNS framework (compare with Figure 7.2). The mirror neurons generate feedback commands.

Figure 7.4 The feedback and feed-forward control view of the F5 grasping circuit, alternative I: F5 mirror neurons learn to generate the feed-forward command. The desired state is assumed to be available and is converted to a correction motor command by F5 motor-only units using stochastic gradient descent. F5 canonical neurons gate the feed-forward and feedback pairs.

Figure 7.5 The feedback and feed-forward control view of the F5 grasp circuit, alternative II: F5 mirror neurons learn to compute the error. The error is then converted to a correction motor command by F5 motor-only units. F5 canonical neurons generate the feed-forward command signal.

Figure 7.6 The schema level view of the feedback controller. The visual processing encapsulates the process of extracting an error based on the vision of the hand and the object. Lower Motor Centers encapsulates the functionality involved in transforming the motor signal into actual commands sent to muscles.

Figure 7.7 The leaky integrator implementation of the feedback circuit that solves the inverse kinematics problem for precision grasping. See text for the explanation.

Figure 7.8 Three grasping tasks executed by the proposed feedback circuit are shown in the upper half of the figure. The change of arm/hand configuration during the execution is illustrated by snapshots of the arm/hand. Each hand figure is accompanied (lower half) by the error plot. The grasp execution is stopped (success) when the sum of the finger distances to their targets is less than 2 mm.

Figure 7.9 The visual feedback circuit generating desired trajectories for the ‘lower level motor centre’ (implemented as a PD controller).

Figure 7.10 The slow (2 seconds; lower half) and fast (0.5 seconds; upper half) performance of the ‘visual feedback servo’ + ‘PD controller’ system is shown. The right-hand graphs show the tracking error (of the wrist) versus time. In the slow case the object can be grasped, but in the fast case it is missed.

Figure 7.11 The F5 mirror neurons viewed as the memory-based feed-forward controller. The arrows below the sheet of neurons indicate outputs, while the arrows arriving from above the sheet indicate inputs.

Figure 7.12 The arm configurations that were acquired during 6 object grasping actions are shown. Each of the superimposed configurations is represented by a unit in the feed-forward layer.

Figure 7.13 The trajectory generation with feedback and feed-forward control is illustrated for comparison with Figure 7.8 (feedback-only system). In the lower panel, the error graphs are plotted as error versus iteration. The error is the sum of squared distances of the fingertips to their targets. The rightmost object was not grasped in the training (a novel object/location). Thus, the system could not make use of the feed-forward signal after approximately iteration 25 and switched to feedback-only mode, resulting in slower positioning of the fingers on the target locations.

Figure 7.14 The feedback, feed-forward and lower level motor servo circuits and the dynamic arm were simulated together. Upper half: the grasp lasted 0.5 seconds. Lower half: the grasp lasted 2 seconds. The error of the fast movement was reduced by a factor of 6, while that of the slow movement was reduced by a factor of 10 (compare with Figure 7.10).

Figure 7.15 The top row demonstrates two trajectory-planning examples for grasping without an obstacle. The bottom row demonstrates how new trajectories are formed by the introduction of an obstacle as a local inhibition on the feed-forward controller units.

Figure 7.16 The feed-forward unit activations for four grasp observations shown as unit versus time. Each graph consists of 157 neurons acquired during the learning phase (the rows). The columns represent the time.

Figure 7.17 A mirror neuron recorded during a grasp observation. On the left, the raster plots; on the right, the histogram. The recording data shown span 2 seconds. The hand starts to move at approximately time = 1 second, indicated by the vertical bar at the centre of the raster panel (Rizzolatti and Gallese 2001).

Figure 7.18 Top row: real mirror neuron recording during a precision grasp. Bottom row: the response of one of the feed-forward controller units to the vision of a grasping action. In the left panels, each raster row corresponds to a trial (Poisson spike generation for the model). The right panels show the histograms. The rasters are aligned according to the contact of the hand with the object.

Figure 7.19 The similarity of a real neuron and a model unit is demonstrated. The left two panels are the real mirror neuron rasters and histogram. The right two panels are the model-generated rasters and histogram. A slowly increasing activity is observed in both cases.

Figure 7.20 Left: a sharp mirror neuron activity, which could only be replicated with our simulator by reducing the receptive field. Right: Similar response profile obtained from one of the feed-forward module units.

Figure 7.21 The population activity of feed-forward units with smaller receptive fields. The feed-forward unit we used to match the real mirror neuron firing profile in Figure 7.20 is marked with an ellipse.


1          INTRODUCTION

Imagine two friends chatting over a coffee table. At the instant when one of them wants a sip of coffee, he effortlessly reaches for and grasps his cup, shaping his hand according to the properties of the cup. If it were a mug, he would reach for and grab the handle so as to counteract gravity; if it were a paper cup, he would probably grasp it with his whole hand covering its surface, unless the coffee were too hot, forcing him to grab the cup by its outer rim. When he starts reaching for the cup, his friend easily understands that he wants to drink coffee even before his hand contacts the cup. Probably the way he reaches and shapes his hand conveys the information that he is not aiming at other objects on the table. The friend could possibly even infer that the coffee is hot before the cup is grabbed. The situation for the two is reciprocal; they switch the roles of observer and actor as they sip their coffee.

When considered individually, each of them is engaged in two tasks: grasping and observing. The former is a goal-directed movement while the latter is a perceptual task. The grasping task requires the integration of information from a variety of sources (MacKenzie and Iberall 1994). The context and the visual analysis of an object determine the general grasp plan. Proprioceptive (during reach), tactile and kinesthetic information (after contact) are used to guide the hand and arm to complete the grasp. Thus, the task of grasping involves, at the least, the determination of the following information:

1.       How to conform the hand to the object. Based on object properties such as shape, orientation and size, the questions ‘which side of the object should the hand approach’, ‘which fingers should be engaged’ and ‘what should the appropriate wrist rotation be’ must be answered.

2.       How to control the limbs. According to the physical properties of the environment and the biomechanics of the limbs, a control mechanism must engage muscles to transport the hand and shape it to match the specifications given by (1).

The former is the problem of grasp planning; the latter is the problem of grasp execution, which involves dynamics, that is, the adjustment of forces that the muscles must exert to achieve the specified plan. Many other information sources are integrated to refine the reach and grasp plan. For example, obstacle avoidance, speed, and accuracy requirements affect the reach component while the force requirements to secure a heavy slippery object or to handle a delicate object affect the grasp component.

The perceptual task, in essence, does not involve determining movement-related parameters, as no movement has to be made. Nevertheless, the observer recognizes the action even before it is complete. Thus, the observer has to analyze the motion of the hand and its relation to the target to determine whether the hand approach and preshape would yield a grasping behavior. It is interesting to note that there is some similarity between action recognition and action generation in terms of the underlying computations.

In fact, if one could compare the activity of the observer’s brain while he is engaged in grasping with its activity while he is observing his peer grasping, one would see that the motor-related regions of the observer’s brain were active during both observation and execution. Thus, one could conclude that the observer’s brain mirrored the action of his peer by establishing an equivalence between the observed action and his own grasp plan.

The mechanism I caricatured above is the focus of this thesis. The execution-observation matching system introduced above does exist in the monkey. There is strong evidence that the human brain is also endowed with a similar mechanism.

To be precise, this thesis investigates the cortical mechanisms involved in:

1.       Translation of a visual description of an object into an appropriate grasp plan, that is, learning to make motor plans that yield grasps appropriate for the target objects

2.       Mapping of observed grasp actions into internal motor representations

3.       Developmental processes shaping neural circuits to provide the functionality (1)

4.       Developmental processes of (2), that is, learning to recognize observed grasp actions based on self-executed grasps

The thesis analyzes (human and monkey) behavior and monkey neurophysiology from a developmental perspective, and constantly asks the questions: what is the underlying factor that gives rise to such mechanisms? How does it shape the basic schemas of newborns into a functional form? The aim of the thesis is to answer these questions via computational models that learn and adapt starting from minimal bootstrapping behaviors. The models and the hypotheses developed in the thesis are based on:

1.       Human infant motor development studies

2.       Human behavioral and neuroimaging studies

3.       Monkey neurophysiology and neuroanatomy studies

The thesis also presents significant predictions that can be tested experimentally, in the hope that experimentalists will be stimulated to conduct the suggested experiments or to design new ones that test the model predictions and further uncover details of the cortical mechanisms of action recognition (mirror neurons), visuomotor learning and motor planning. The results of these experiments would then feed back into the modeling presented here, leading to validation (or rejection) and refinement of the models developed.

The organization of the thesis is as follows:

Chapter II presents the basic biological background with an emphasis on the brain areas that will be the focus of modeling. ‘Mirror Neurons’ (Dipellegrino et al. 1992; Gallese et al. 1996; Rizzolatti et al. 1996a) of the monkey premotor cortex are introduced in this chapter. The research on locating mirror neurons in humans is also reviewed in Chapter II.

Chapter III develops the Mirror Neuron System (MNS) model based on the hypothesis that self-observation of grasping movements mediates the adaptation of parietal-premotor and premotor-premotor connections. Using simulation results, the chapter presents predictions on the timing of mirror neuron responses (and others) and suggests neurophysiological experiments for testing the predictions. In addition, Chapter III introduces the grand schemas of Visual Analysis and Reach and Grasp. The simulated 3D arm/hand of the Reach and Grasp schema is used in other chapters to graphically display the learned grasp actions.

Chapter IV develops a reinforcement learning based neural architecture that is capable of representing multiple values of a variable in terms of its probability distribution. The probability distributions are shaped through interaction with the environment so as to reflect the reward distribution in the environment. Chapter IV shows how layers can be connected to build multilayer reinforcement networks that are capable of representing conditional probability distributions. The architecture presented in Chapter IV is used in Chapter V and Chapter VI.
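The idea of shaping a probability distribution by reward can be illustrated with a toy sketch. It is not the architecture developed in Chapter IV: the grid of units, the additive update rule, the learning rate and the floor value are all assumptions made for illustration, and only the reward function (+1 inside the disc x*x + y*y < 0.25, –1 outside) is taken from the environment of Figure 4.1.

```python
import random

def reward(x, y):
    """Reward of the Figure 4.1 environment: +1 inside the disc, -1 outside."""
    return 1.0 if x * x + y * y < 0.25 else -1.0

def train_layer(n=21, trials=10000, lr=0.1, floor=0.01, seed=0):
    """Hypothetical layer: one 'firing potential' per grid unit over [-1, 1]^2.

    Actions are sampled in proportion to the potentials, and the sampled
    unit's potential is nudged by the reward (assumed update rule).
    """
    rng = random.Random(seed)
    centers = [(-1.0 + 2.0 * i / (n - 1), -1.0 + 2.0 * j / (n - 1))
               for i in range(n) for j in range(n)]
    potentials = [1.0] * len(centers)
    indices = range(len(centers))
    for _ in range(trials):
        k = rng.choices(indices, weights=potentials)[0]  # sample one action
        x, y = centers[k]
        # Keep potentials positive so they remain valid sampling weights.
        potentials[k] = max(floor, potentials[k] + lr * reward(x, y))
    return centers, potentials

def mass_inside(centers, potentials):
    """Fraction of the total potential held by units in the rewarded disc."""
    inside = sum(p for (x, y), p in zip(centers, potentials) if reward(x, y) > 0)
    return inside / sum(potentials)
```

Calling `mass_inside(*train_layer())` before and after training shows the sampling distribution concentrating on the rewarded region, which is the qualitative behavior the chapter summary describes.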

Chapter V develops the Infant Learning to Grasp Model (ILGM) based on the infant literature. ILGM is a schema level behavioral model that reproduces many infant behaviors and produces testable hypotheses through simulation results. Its notable property is that it starts with a very basic repertoire of actions, mimicking neonates, and shows how a range of grasping categories can emerge via explorative interaction with the environment.

Chapter VI introduces the neurophysiological Learning to Grasp Model (LGM) as a neural level instantiation of ILGM constrained by monkey neurophysiology and neuroanatomy. LGM replicates existing premotor cortex findings, such as the object selectivity of F5 canonical neurons, and yields testable predictions about the grasp learning circuit in monkeys. This chapter also poses serious questions to neurophysiologists about the assessment used to relate neuron firing to behavior. In particular, Chapter VI argues, using simulation results and examples from the literature, that behavior-neural activity correlation is not an appropriate measure for investigating brain mechanisms of movement planning. The chapter suggests an experimental methodology for deciphering the neural substrates of movement planning, suitable for finding the neuron populations that encode the true movement control variables (parameters).

Chapter VII asks the question, ‘why do the mirror neurons exist?’ The hypothesis of the chapter (introduced in Chapter III) is that mirror neurons evolved initially to provide visual feedback for manual manipulative actions, and later became effective in recognizing the actions of others. Chapter VII presents a simplified model of grasping (planar arm/hand) with increasing degrees of complexity in its control. First, a biologically realistic stochastic gradient-following visual feedback circuit that can perform precision grips is developed. Then a memory-based neural feed-forward circuit is introduced to augment the visual feedback servo circuit. After assessing the behavioral properties of the integrated visual servo circuit, the chapter analyzes the activities of individual memory units in the feed-forward circuit. The activities are converted into spike patterns using a Poisson model of neural firing for a direct comparison with mirror neuron data. The results indicate that the firing patterns of some of the feed-forward units are very similar to those of mirror neurons, in spite of the fact that the feed-forward activity is characterized by a grasp error map. This supports the hypothesis that the mirror neuron system initially evolved to serve as a visual servo circuit for manual manipulation.
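The conversion of a unit's activity into spike rasters and a histogram can be sketched as follows. This is a minimal illustration of Poisson spike generation in general, assuming that the unit's activity has already been scaled to a firing rate in spikes per second; the function names, the bin width and the scaling are hypothetical and are not taken from the simulator used in Chapter VII.

```python
import random

def poisson_rasters(rates_hz, dt=0.001, n_trials=10, seed=0):
    """Draw spike rasters from a time-varying firing rate.

    rates_hz holds one firing rate (spikes/s) per time bin of width dt
    seconds. Each bin receives a spike with probability rate * dt, the
    usual small-bin approximation of an inhomogeneous Poisson process.
    """
    rng = random.Random(seed)
    rasters = []
    for _ in range(n_trials):
        row = []
        for rate in rates_hz:
            p = min(1.0, max(0.0, rate * dt))  # clamp to a valid probability
            row.append(1 if rng.random() < p else 0)
        rasters.append(row)
    return rasters

def spike_histogram(rasters, bins_per_bar, dt=0.001):
    """Average the rasters over trials into a histogram in spikes/s."""
    n_trials = len(rasters)
    n_bins = len(rasters[0])
    bars = []
    for start in range(0, n_bins, bins_per_bar):
        stop = min(start + bins_per_bar, n_bins)
        count = sum(sum(row[start:stop]) for row in rasters)
        bars.append(count / (n_trials * (stop - start) * dt))
    return bars
```

Each row returned by `poisson_rasters` plays the role of one trial in a raster plot, and `spike_histogram` gives the trial-averaged rate that can be set beside a recorded histogram.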

Chapter VIII summarizes the key results and presents the predictions with high impact potential.

Chapter IX concludes the thesis by pointing out some open questions and recommending future research directions.


2          CHAPTER II: BIOLOGICAL BACKGROUND

In this chapter, we review the literature on the brain areas involved in the functioning of the mirror neuron system and in the grasp-related visuomotor transformation. We present major findings from the neurophysiological and neuroanatomical (connectivity) literature to form a background for the modeling chapters. The main regions of interest are the intraparietal sulcus and the premotor areas. A supplementary review of the superior temporal sulcus and other parietal areas is included. The posterior parietal cortex is involved in sensory-motor transformations, combining various sensory inputs and computing representations that are used by the motor system to generate movements. In turn, the premotor cortex is involved in generating motor plans based on parietal representations of object affordances. The superior temporal sulcus performs visual analysis of motion and form, including biological motion, providing visual input to the parietal areas.

2.1         Abbreviations


7a, parietal area 7a; area PG

7b, parietal area 7b; area PF

7ab, parietal area 7ab; area PFG

7ip, parietal area 7ip

7m, mesial part of area 7; area PGm

F1, primary motor cortex; M1

F2, caudal part of dorsal premotor area; PMdc

F3, supplementary motor area; SMA

F4, caudal part of ventral premotor area; PMvc

F5, rostral part of ventral premotor area; PMvr

F6, pre-supplementary motor area; pre-SMA

F7, rostral part of dorsal premotor area; PMdr

SPL, superior parietal lobule

IPL, inferior parietal lobule

STS, superior temporal sulcus

IPS, intraparietal sulcus

IT, inferotemporal cortex

AIP, anterior intraparietal area; part of area 7

cIPS, caudal intraparietal sulcus; part of area 7

LIP, lateral intraparietal area; part of area 7

MIP, medial intraparietal area; part of area 5/7

VIP, ventral intraparietal area; part of area 5/7

MST, medial superior temporal area; part of STS

MT, middle temporal area; part of STS

PE, parietal area PE; part of area 5

PEa, parietal area PEa; part of area 5

PEc, parietal area PEc; part of area 5

PO, parieto-occipital cortex

V6A, visual area 6A, area 19

S1, first somatosensory cortex; SI

S2, second somatosensory cortex; SII

 


 

2.2         Premotor areas

The macaque inferior premotor cortex is located ventral to the spur of the arcuate sulcus (see Figure 2.1) and is considered to be involved in reaching and grasping movements (Rizzolatti et al. 1988). This region has been further partitioned into two sub-regions: F5, the rostral part, located along the arcuate sulcus, and F4, the caudal part (see Figure 2.1).

Figure 2.1 Lateral view of the macaque brain showing the areas of agranular frontal cortex and posterior parietal cortex (adapted from Geyer et al. 2000). The naming conventions: frontal regions, Matelli et al. (1991); parietal regions, Pandya and Seltzer (1982).

The neurons in F4 appear to be primarily involved in the control of proximal movements (Gentilucci et al. 1988), whereas the neurons of F5 are involved in distal control (Rizzolatti et al. 1988).

2.2.1        Area F5

Area F5 is one of the various agranular frontal areas and is of particular interest due to its complex function (Matelli 1986). In the monkey, this area lies immediately caudal to the inferior arm of the arcuate sulcus. Stimulation and recording experiments showed that F5 is concerned with both hand and mouth movements. Hand movements are represented mostly in its dorsal part, while mouth movements tend to be represented ventrally (Rizzolatti et al. 1988). Little is known about the functional properties of mouth neurons; hand neurons, however, have been studied extensively.

2.2.1.1              Motor properties

Hand neurons discharge during specific goal-related movements such as grasping, tearing, manipulating and holding (Rizzolatti et al. 1988). Many of them are specific for a particular type of hand movement (Rizzolatti et al. 1988). In addition, some F5 neurons become active at the presentation of three-dimensional objects, in the absence of any overt movement, similar to AIP neurons that become active when the monkey fixates on a presented object. In many cases, this visually triggered discharge requires congruence between the presented object and the grip coded by the neuron (Murata et al. 1997a).

Rizzolatti et al. (1988) found that the firing of most F5 neurons correlated with specific goal-related distal motor acts rather than with single movements made by the animal[1]. Using the motor acts as the classification criterion, they subdivided the neurons into classes such as grasping-with-the-hand-and-the-mouth, grasping-with-the-hand and holding neurons. The discharge of many F5 neurons depended on the way in which the hand was shaped during the motor act. For example, the three main types of neurons found by Rizzolatti et al. (1988) were precision grip, finger prehension and whole hand prehension neurons. Furthermore, almost all of the neurons discharged when the action was performed with either hand. In addition, Rizzolatti et al. (1988) reported that 20% of the recorded neurons had visual response properties and required motivationally meaningful visual stimuli to be triggered. They also observed that, in the case of distal neurons, there was a relationship between the type of prehension coded by the cells and the size of the stimulus (presented object) effective in triggering the neurons. Note, however, that purely motor-related neurons constitute the majority of F5 neurons (Gallese 2002).

2.2.1.2              Visual properties: canonical neurons

Murata et al. (1997a) studied the properties of the object-related activity of F5 neurons. The results of their study indicate that some F5 neurons encode object shapes in motor terms. That is, every time an object is presented, its visual features are automatically translated into an internal motor representation. The translation takes place whether a motor response is required or not. Therefore, these neurons are not intention related.

Figure 2.2 A canonical neuron response during grasping of various objects in the dark (left to right and top to bottom: plate, ring, cube, cylinder, cone and sphere). The rasters and histograms are aligned with object presentation. Small grey bars in each raster mark the onset of key press, go signal, key release, onset of object pulling, release signal, and object release, respectively. The peaks in the ring and sphere cases correspond to the grasping of the object by the monkey (adapted from Murata et al. 1997a).

The similarity of the AIP and F5 visual neuron responses suggests that they may be part of a visuomotor transformation circuit. This view is supported by the reciprocal connections between F5 and AIP (Sakata et al. 1997a). Figure 2.2 shows a canonical neuron’s response during motor execution. To test whether the motor-related activity was due to the vision of the object, the trial was performed in the dark. The neuron was primarily responsive to ring grasping and, to a lesser extent, sphere grasping. The same neuron’s response during object fixation, without any subsequent grasp requirement, is shown in Figure 2.3. It is important to note that the motor preference of the neuron is reflected in the visual fixation condition as well.

Figure 2.3 The responses of the same neuron shown in Figure 2.2 during object fixation. The motor preference of the neuron is also carried over to the visual preference (compare the ring and sphere histograms of both figures) (adapted from Murata et al. 1997a).

2.2.1.3              Visual properties: mirror neurons

Recording studies of the rostral part of inferior area 6 (area F5) showed that some of the visual neurons were responsive to action observation (Gallese et al. 1996; Rizzolatti et al. 1996a; Dipellegrino et al. 1992). The cells with the action observation property have been located on the convexity of the bank of the arcuate sulcus. Like other F5 neurons, mirror neurons were active when the monkey performed a particular class of actions. In addition, however, the mirror neurons became active when the monkey observed the experimenter or another monkey performing an action (Gallese et al. 1996; Rizzolatti et al. 1996a; Dipellegrino et al. 1992). In most of the mirror neurons, there was a clear relation between the coded observed and executed action. The actions studied so far include grasping, manipulating and placing. The congruence between the observed and executed action varied. For some of the mirror neurons, the congruence was quite loose; for others, the general action (e.g. grasping) and the way the action was executed (e.g. power grasp) had to match in order to activate the neuron (Gallese et al. 1996; Rizzolatti et al. 1996a). An important observation was that mirror neurons required an interaction between the experimenter and the object; the sight of the experimenter or the object alone was not enough to trigger mirror activity (Gallese et al. 1996; Rizzolatti et al. 1996a). All the neurons were studied by examining their discharge while the experimenter performed a series of motor actions in front of the monkey. These actions were related to the grasping and manipulation of food and of other objects. In order to verify whether the recorded neuron coded specifically hand-object interactions, a series of actions such as mimicking grasping without any object, prehension actions with tools, and mimicking a grasp with a spatially separated object were performed. All of the experimenter’s actions were repeated at different positions (e.g. left-right, far-close).
Of the 532 recorded neurons, 92 showed mirror properties (i.e. they discharged both when the monkey made active movements and when it observed specific meaningful actions performed by the experimenter) (Gallese et al. 1996).

Two important aspects of the mirror neurons are that (1) they are robust, in that they do not habituate, and (2) the distance of the experimenter from the monkey does not affect the response intensity of the cell. Most of the neurons are active during observation of a single action. For example, in the study of Gallese et al. (1996), 51/92 of the cells preferred only a single action, 38/92 preferred two or three actions, and 3/92 were active for both hand and mouth grasps. The motor properties of these neurons were indistinguishable from those of other F5 neurons. They had preferences for certain actions: 60/92 cells responded when the animal performed only a grasping action, 9/92 cells fired when the animal grasped with his mouth, and 11/92 cells fired for both hand and mouth grasps (Gallese et al. 1996). The remaining 12 neurons were distributed as follows: tearing (2), bringing to the mouth (2), manipulating (8). Light and dark conditions were employed for 14 cells to test whether the motor property was a result of self-hand vision. All the tested neurons confirmed that the discharge was not due to self-vision (Gallese et al. 1996).

Figure 2.4 Activity of a cell during action observation (left) and action execution (right). There is no activity in presentation of the object during both initial presentation and bringing the tray towards the monkey. The vertical line over the histogram indicates the hand-object contact onset. (from Gallese et al., 1996).

Figure 2.4 shows the dual response property of mirror neurons. The recorded neuron in the figure was silent during the presentation of the object, but started firing when the experimenter picked up the object. Interestingly, the neuron did not fire while the tray was moved towards the monkey, and it started firing again only when the monkey picked up the object. Note that during the period when the tray was moved towards the monkey, the monkey could predict that he would grasp the object (Gallese et al. 1996).

Figure 2.5 shows the specificity of a grasp related mirror neuron where the experimenter performed (A) a precision grip, (B) a whole hand prehension, and (C) mimicked a precision grip. The notable property of this neuron is that miming the action was not effective in activating the neuron.

 

Figure 2.5 Visual response of a mirror neuron. A. Precision grasp B. power grasp C. mimicking of precision grasp. The vertical lines over the histograms indicate the hand-object contact onset. (adapted from Gallese et al., 1996)

In most mirror neurons, there is a relationship between the visual action they respond to and the motor action they code. The mirror neurons studied by Gallese et al. (1996) were divided into three classes according to their visuomotor congruence: strictly congruent, broadly congruent and non-congruent. A neuron was labeled as strictly congruent when the effective observed and executed actions matched both in terms of general action type (e.g. grasp) and in terms of how the action was executed (e.g. power grasp). Figure 2.6 shows a strictly congruent neuron.

Figure 2.6 Example of a strictly congruent manipulating mirror neuron: A) The experimenter retrieved the food from a well in a tray. B) Same action, but performed by the monkey. C) The monkey grasped a small piece of food using a precision grip. The vertical lines over the histograms indicate the hand-object contact onset (adapted from Gallese et al., 1996).

The number of strictly congruent neurons found was 29/92. The number of broadly congruent neurons was 56/92 (Gallese et al. 1996). In the case of broadly congruent neurons, there was a link between the executed action and the observed preferred action. These neurons were further sub-classified according to their motor strictness: If a broadly congruent neuron fired only for one motor act (e.g. grasp) with only a single hand configuration (e.g. precision) then it would be of the first type. On the other hand, if the neuron fired for one motor act but the way the action was performed did not affect the firing then it would be of the second type. The third and last type of broadly congruent neurons appear to be activated by the goal of the observed action (Gallese et al. 1996). Finally, the neurons with no apparent congruence were labeled as non-congruent (7/92). Figure 2.7 shows the classification of F5 neurons including the mirror neuron types discussed.

Figure 2.7 The classification of area F5 neurons derived from published literature (Dipellegrino et al. 1992; Gallese 2002; Gallese et al. 1996; Murata et al. 1997a; Murata et al. 1997b; Rizzolatti et al. 1996a; Rizzolatti and Gallese 2001). All F5 neurons fire in response to some motor action. In addition, canonical neurons fire for object presentation while the mirror neurons fire for action observation. The majority of hand related F5 neurons are purely motor (labelled as Motor Neurons in the figure) (Gallese 2002).

Fogassi et al. (1998) found that area F5 was not the only area that had mirror neurons. The rostral part of the inferior parietal lobule of the macaque monkey (area 7b or PF) also has neurons with similar mirror properties. Although some neurons with strict congruence of the executed and observed action have been found, the majority of the neurons studied had limited congruence (similarity) or no congruence at all (Fogassi et al. 1998). Of the PF mirror neurons, 8/43 were strictly congruent, 9/43 had a low level of congruence (a similarity), and the majority (26/43) were non-congruent (Fogassi et al. 1998). The main cortical input to area F5 comes from the inferior parietal lobe, and in particular areas AIP and PF (Matelli 1986). The similarity of F5 canonical neurons to AIP neurons, and of F5 mirror neurons to PF neurons, suggests that these three areas work together for visuomotor transformation and action recognition.
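The congruence counts reported above for F5 (Gallese et al. 1996) and PF (Fogassi et al. 1998) can be put side by side as proportions. The short sketch below only tabulates the counts quoted in the text; the dictionary labels and the helper name `fractions` are chosen here for illustration and do not come from the cited studies.

```python
# Congruence counts as quoted in the text above.
f5_counts = {  # 92 F5 mirror neurons (Gallese et al. 1996)
    "strictly congruent": 29,
    "broadly congruent": 56,
    "non-congruent": 7,
}
pf_counts = {  # 43 PF mirror neurons (Fogassi et al. 1998)
    "strictly congruent": 8,
    "low congruence": 9,
    "non-congruent": 26,
}

def fractions(counts):
    """Return the total number of neurons and the fraction in each class."""
    total = sum(counts.values())
    return total, {label: n / total for label, n in counts.items()}

for area, counts in (("F5", f5_counts), ("PF", pf_counts)):
    total, frac = fractions(counts)
    print(f"{area} (n = {total})")
    for label, value in frac.items():
        print(f"  {label}: {value:.0%}")
```

Read this way, non-congruent cells are a small minority of the F5 sample (7 of 92) but the majority of the PF sample (26 of 43).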

2.2.2        Area F4

Area F4 (see Figure 2.1) is connected with area F3 and, to a lesser extent, with area F6 (Geyer et al. 2000). Area F4 projects to primary motor cortex (F1). The main parietal input to area F4 comes from VIP (Geyer et al. 2000). In area F4, space is coded in a body-part-centred coordinate frame (e.g. centred on the hand) (Fogassi et al. 1996). When the body part moves, the coordinate system follows, but when the gaze moves, the coordinate frame stays anchored on the body part (Fogassi et al. 1996). Many F4 neurons fire during reaching movements of the proximal arm but not during movements of the distal arm. The neurons usually have somatosensory receptive fields that match the movement direction of the limb (Gentilucci et al. 1988). It is suggested that the VIP-F4 circuit transforms object locations into motor plans to reach towards them, as area F4 sends descending projections to the brain stem and spinal cord (Rizzolatti et al. 1998).
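The body-part-centred coding described above can be illustrated with a small sketch. It is a deliberately simplified picture and not a model of F4 neurons: positions are plain 3-D points in arbitrary units, only translations are considered (rotations of the hand or eye are ignored), and the function names `hand_centred` and `eye_centred` are hypothetical.

```python
def subtract(a, b):
    """Component-wise difference of two 3-D points."""
    return tuple(x - y for x, y in zip(a, b))

def hand_centred(stimulus, hand):
    """Stimulus location expressed relative to the hand."""
    return subtract(stimulus, hand)

def eye_centred(stimulus, gaze_point):
    """Stimulus location expressed relative to the point of gaze."""
    return subtract(stimulus, gaze_point)

stimulus = (30.0, 10.0, 40.0)
hand = (20.0, 0.0, 35.0)
gaze_a = (0.0, 0.0, 50.0)
gaze_b = (15.0, 5.0, 50.0)

# A gaze shift changes the eye-centred location of the stimulus ...
print(eye_centred(stimulus, gaze_a), eye_centred(stimulus, gaze_b))
# ... but the hand-centred location does not depend on gaze at all.
print(hand_centred(stimulus, hand))

# Moving the hand carries the hand-centred frame with it.
moved_hand = (25.0, 0.0, 35.0)
print(hand_centred(stimulus, moved_hand))
```

The invariance to gaze is built in by construction here, since gaze never enters the hand-centred computation; the sketch is only meant to make the distinction between the two frames concrete.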

2.2.3        Areas F2 and F7 (dorsolateral prefrontal cortex)

Area F2 (see Figure 2.1) neurons can be grouped into three different classes: (1) signal-related neurons, (2) set-related neurons, and (3) movement-related neurons (see Geyer et al. 2000 for a review). Signal-related neurons are activated right after visual instruction stimuli and have a phasic response. Set-related neurons show sustained activity after the instruction stimulus during the delay period. Movement-related neurons start firing after the trigger signal. Area F2 receives somatosensory input from areas PEip and PEc, and visual input from areas MIP and V6A. Rizzolatti et al. (1998) suggested that F2 can use the MIP and V6A inputs in controlling arm position during the transport of the hand to spatial targets.

Figure 2.8 The macaque parieto-frontal projections from mesial parietal cortex, medial bank of the intraparietal sulcus and the surface of the superior parietal lobule (adapted from Rizzolatti et al. 1998). Note that the Brodmann’s area 7m corresponds to Pandya and Seltzer's (1982) area PGm

Area F7 receives inputs from area 7m (Ferraina et al. 1997a; Ferraina et al. 1997b) (see Figure 2.8). The neurons in area F7 fire in response to arm movements (Caminiti et al. 1991; Crammond and Kalaska 1996) or visual stimuli (Shen and Alexander 1997b). However, in contrast to area F2, the area F7 visual response does not depend on a pending movement (di Pellegrino and Wise 1991). It appears that the 7m-F7 circuit is important for conditional movement selection (Geyer et al. 2000). The other projection to area F7 is from LIP (Lewis and Van Essen 2000), where saccade related target memory activity is represented. The neuronal activity in area LIP can be modulated by attention and eye position (see Colby and Duhamel 1996 for a review of LIP neuron responses). Thus, the LIP-F7 circuit may be important for complex saccade control (Geyer et al. 2000).

2.2.4        Area F1 (the primary motor cortex)

Area F1 (see Figure 2.1) is organized somatotopically, where the body parts that require finer movements are represented over a larger cortical surface than the body parts that require less precision. Each neuron may contribute to multiple spinal neuron pools. The motor parameters that are encoded by F1 neurons are usually a combination of the following physical parameters: force, rate of change of force, joint position or the velocity of the movement (Pandya and Seltzer 1982). However, it is possible to get meaningful physical parameters using a population of F1 neurons. Georgopoulos et al. (1982) trained monkeys to perform radial outward reaches to a target light. Recording over a population of primary motor cortex neurons, they showed that each neuron fired maximally for one direction (its preferred direction), and fired less and less as the movement direction deviated from the preferred direction. Given a population, the weighted sum of the preferred direction vectors, the population vector, predicted the monkey's reaching direction.
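The population-vector idea described above can be sketched in a few lines. This is a minimal illustration rather than the analysis of Georgopoulos et al. (1982): the cosine tuning curve, the eight evenly spaced preferred directions, and the baseline and gain values are all assumptions made for the example, and the function names are hypothetical.

```python
import math

def firing_rate(movement_deg, preferred_deg, baseline=20.0, gain=15.0):
    """Assumed cosine tuning: the rate peaks at the preferred direction
    and falls off as the movement direction deviates from it."""
    delta = math.radians(movement_deg - preferred_deg)
    return baseline + gain * math.cos(delta)

def population_vector(rates, preferred_degs, baseline=20.0):
    """Sum of preferred-direction unit vectors, each weighted by the
    neuron's rate change relative to baseline."""
    x = sum((r - baseline) * math.cos(math.radians(p))
            for r, p in zip(rates, preferred_degs))
    y = sum((r - baseline) * math.sin(math.radians(p))
            for r, p in zip(rates, preferred_degs))
    return x, y

def decode_direction(rates, preferred_degs, baseline=20.0):
    """Direction of the population vector in degrees, in [0, 360)."""
    x, y = population_vector(rates, preferred_degs, baseline)
    return math.degrees(math.atan2(y, x)) % 360.0

# Eight model neurons with preferred directions every 45 degrees.
preferred = [i * 45.0 for i in range(8)]
# Simulated rates for a reach towards 100 degrees.
rates = [firing_rate(100.0, p) for p in preferred]
decoded = decode_direction(rates, preferred)
print(f"decoded reach direction: {decoded:.1f} degrees")
```

Under these assumptions no single model neuron has 100 degrees as its preferred direction, yet the direction of the population vector recovers it, which is the sense in which the population carries a meaningful physical parameter.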

The subcortical input to F1 is relayed by the thalamus (see Matelli et al. 1989 for the distinct nuclei projecting to F1). The corticocortical inputs to the hand area of F1 come primarily from the supplementary motor area (SMA) and, to a lesser extent, from the lateral premotor cortex. The other inputs are from areas 1, 2 and 5 (Ghosh et al. 1987). Approximately half of the corticospinal projections are formed by area F1 neurons (Dum and Strick 1991).

2.2.5        Areas F3 (SMA proper), F6 (pre-SMA)

Area F3 is somatotopically organized where arm and leg representations run as two oblique dorsorostral-to-ventrocaudal directions (see Figure 2.1). In addition, area F3 has an orofacial representation, while area F6 has only an arm representation (Luppino et al. 1991).

Areas F3 and F6 have different patterns of thalamic input, indicating that they are part of different motor loops with different functions (Luppino et al. 1991). Cortical input to area F3 originates mainly from areas F2, F4, F5, F6 and F7, the primary and secondary somatosensory cortices, the posterior parietal areas PE and PEci, the cingulate cortex and the primary motor cortex. On the other hand, area F6 is mainly connected with areas F5 and F7, followed by the prefrontal and cingulate cortex, F2, F3 and F4, and to a lesser extent with the posterior areas PG and PFG and the superior temporal sulcus (Geyer et al. 2000).

2.3         The superior temporal sulcus

In the macaque’s brain, the posterior parietal cortex and the cortex of the caudal superior temporal sulcus (STS) have been subdivided into numerous areas mainly involved in spatial analysis of the visual environment and in the control of spatially oriented behaviour (Maioli et al. 1998). The cortex of the STS contains neurons that are selective for the observation of biological motion such as limb movements and full body motion. Perrett et al. (1990b) reported STS neurons that were responsive to goal directed hand motion (Perrett et al. 1990b; Perrett et al. 1990a). PET studies showed that the STS in humans is strongly activated by biologically meaningful visual stimuli (Bonda et al. 1996), including goal-directed hand actions. In monkeys, some of the STS neurons that are triggered by biologically meaningful stimuli have two notable properties. Firstly, these neurons respond to goal directed hand motion in a translation/scale/rotation invariant way (Perrett et al. 1990b; Perrett et al. 1990a). Secondly, these neurons do not require a pictorially realistic stimulus; they respond to point light stimuli (Perrett et al. 1990b; Perrett et al. 1990a), where the stimulus is just the movement of a small number of points. Bonda et al. (1996) also used this kind of stimulus - 3 lights for the arm and 2 for each finger - when they scanned the subjects during action observation.

2.4         Parietal Areas

Based on cytoarchitectonic and connectional criteria, the inferior parietal lobule (Brodmann’s area 7) includes areas 7a, 7b and 7ip (Cavada and Goldman-Rakic 1989). Area 7 reaches its highest development in primates (Cavada and Goldman-Rakic 1989). Damage to this area can cause impairments in spatial perception, neglect of sensory stimuli contralateral to the damaged side, and defects in visually guided reaching and oculomotor control (Ratcliff 1991; Stein 1991). Cavada and Goldman-Rakic (1989) divide area 7 into the sub-areas 7m, 7a, 7b, and 7ip. Area 7m is located on the medial surface of the hemisphere and corresponds to Pandya and Seltzer's (1982) area PGm. Areas 7a and 7b lie on the convexity of the posterior parietal lobule (Cavada and Goldman-Rakic 1989). These regions correspond to Pandya and Seltzer's (1982) PG and PF, respectively. Pandya and Seltzer (1982) also distinguish the subdivisions PGop and PFop in the lateral opercular part of PG and PF, and area Opt in caudal PG. Area 7ip is situated in the posterior bank of the intraparietal sulcus and is referred to as POa by Pandya and Seltzer (1982). In addition, the posterior half of 7ip corresponds to the functionally defined areas VIP (Maunsell and Van Essen 1983) and LIP (Andersen et al. 1985). Figure 2.9 shows the intraparietal sulcus (opened) and neighbouring parietal regions using the nomenclature of Pandya and Seltzer (1982).

2.4.1        The anterior intraparietal area (AIP)

The anterior part of the lateral bank of the intraparietal sulcus (area AIP) (see Figure 2.9) is involved in extracting visual properties of objects relevant for grasping (Sakata et al. 1997a; Sakata et al. 1998; Sakata et al. 1995; Murata et al. 1996).

Figure 2.9 The intraparietal sulcus opened to show the anatomical location of AIP in the macaque (adapted from Geyer et al. 2000)

Neurons in area AIP are active either in relation to the grasping behavior alone or in relation to the vision of objects (Sakata et al. 1998; Sakata et al. 1997b; Taira et al. 1990). Some of the latter type are active exclusively for visual fixation. In one study, 21% of the cells studied responded to simply fixating an object (visual-related), while others (37%) were active only when a movement was being made to manipulate the object (motor-related) (Taira et al. 1990). However, many cells (37%) fell somewhere between these two extremes (visual-dominant) (Taira et al. 1990). Figure 2.10 shows the response of a visual-dominant neuron during different experimental conditions.

Figure 2.10 An AIP visual-dominant neuron activity under three task conditions: Object manipulation in the light, object manipulation in the dark and object fixation in the light. The neuron is active during fixation and holding phase when the action is performed in light condition. However, during grasping in dark the neuron shows no activity. The fixation of the object alone without grasping also produces a discharge (adapted from Sakata et al. 1997a)

The neuron shown in Figure 2.10 is active during fixation and holding phase when the action is performed in light condition. However, in grasping-in-dark condition the neuron shows no activity. The fixation of the object alone without grasping also produces a discharge, however the activity is less than the grasping-in-light condition.


Figure 2.11 Activity of the same neuron as in Figure 2.10 during fixation of different objects. The neuron shows selectivity for the horizontal plate (adapted from Sakata et al. 1997a).

In addition, some of these neurons show object specificity (object-type visual-dominant neurons) and respond to the sight of complex objects such as a knob-in-groove and a plate-in-groove (Sakata et al. 1997a). Figure 2.11 shows the response profile of the same neuron as in Figure 2.10 for different objects during fixation. The neuron has a strong preference for the plate shaped object.

 

Figure 2.12 An AIP visual-dominant neuron’s axis orientation tuning and object fixation response is shown. The neuron fires maximally during the fixation of a vertical bar or a cylinder. The tuning is demonstrated in the lower half of the figure (adapted from Sakata et al. 1999)

Furthermore, some object-type visual-dominant neurons show tuning according to the orientation of the longitudinal axis or the surface orientation of flat objects (Sakata et al. 1999; Murata et al. 2000). An example of an object-type visual-dominant neuron that showed tuning for the axis orientation regardless of the shape is presented in Figure 2.12. The top row shows the strong response of the neuron to a vertical cylinder, a square column, and a vertical knob-in-groove in the fixation condition. The bottom row of Figure 2.12 demonstrates the tuning of the neuron for different axis orientations.

The muscimol-induced lesions of area AIP lead to a significant deficit in monkey's ability to grasp objects (Sakata et al. 1997a; Gallese et al. 1994). The grasping movements become clumsy and uncoordinated, and as a result, the monkey is unable to shape his hand and orient his wrist appropriately for objects that are presented. However, the monkey can still execute the basic sequence of the task employed (Sakata et al. 1997a; Gallese et al. 1994).

 

2.4.2        The caudal intraparietal sulcus (c-IPS)

The caudal part of the lateral bank of the intraparietal sulcus (c-IPS) is involved in three-dimensional analysis of objects (Sakata et al. 1997a; Sakata et al. 1999). Some of these binocular visual neurons are selective for the orientation of the axis of the objects (AOS neurons) and some are selective for the surface orientation of the objects (SOS neurons) (Sakata et al. 1997a; Sakata et al. 1999). AOS neurons prefer long and thin objects as visual stimuli and are tuned to the three-dimensional axis orientation of the objects in space. Figure 2.13 shows the response of an AOS neuron when the object is viewed binocularly. Figure 2.14 shows the same neuron’s response when the visual information is limited to the left or right eye, indicating that binocular cues are important for driving the AOS neuron shown. SOS neurons prefer broad and flat objects as visual stimuli (Sakata et al. 1999). Complementary to AOS neurons, they are tuned to the surface orientation of objects in three-dimensional space (see Figure 2.15).

Figure 2.13 Response of an axis-orientation-selective (AOS) neuron in the caudal part of the lateral bank of the intraparietal sulcus (c-IPS) to a luminous bar tilted 45° forward (left) or 45° backward (right) in the sagittal plane. The monkey views the bar with binocular vision. The line segment under the histograms marks the fixation start and a period of 1 second (adapted from Sakata et al. 1999).

It is suggested that c-IPS is a higher center for stereopsis, which integrates various binocular disparity signals received from the V3 complex and other prestriate areas to represent the neural code for geometric features of objects (Sakata et al. 1997a; Sakata et al. 1997b).

Figure 2.14 The response of the same neuron in Figure 2.13, for monocular vision conditions for the left and right eyes. (adapted from Sakata et al. 1999)

Sakata et al. (1997a) suggested that c-IPS could send projections to AIP and thus contribute to the visual adjustment of the shape of the hand grip and/or hand orientation for manipulation and grasping. Figure 2.15 shows an SOS neuron that is selective for a surface tilted 135 degrees around the sagittal axis.

Figure 2.15 Orientation tuning of a surface-orientation selective (SOS) neuron. First row: Stimuli presented. Middle row: responses of the cell with binocular view. Last row: responses of the cell with monocular view (adapted from Sakata et al. 1997a)

2.4.3        Areas VIP, MIP and LIP

The intraparietal regions VIP, MIP and LIP (see Figure 2.9) encode the space around the animal with multiple reference frames for different movement purposes (Colby and Goldberg 1999). A crude separation is that VIP is involved in ultra-near space (less than 5cm from the face) (Colby et al. 1993b), MIP with stimuli within reaching distance (Colby and Duhamel 1991) and LIP with far visual stimuli (Colby and Goldberg 1999).

LIP coding has been implicated as attentional (Gottlieb et al. 1998), decision-related (Shadlen and Newsome 2001; Shadlen and Newsome 1996), visual target memory related (Gnadt and Andersen 1988) and motor intention related (Snyder et al. 2000; Snyder et al. 1997). Colby and Goldberg (1999) suggested a unifying functional role for LIP that it encodes the representation of salient spatial locations (with attentional tuning). They noted the distinctive property of neurons in LIP that their firing was not tied to any particular modality and the representation was limited to attended objects and their locations.

Neurons in LIP have retinotopic receptive fields and carry visual, memory, and saccade-related signals that describe stimuli in terms of the distance and direction of the stimulus or saccade location relative to the center of gaze (Colby and Goldberg 1999). VIP neurons represent visual locations using a continuum of spatial reference frames ranging from eye-centred to head-centred (Bremmer et al. 1999; Duhamel et al. 1997). Eskandar and Assad (1999) found neurons with reaching-related activity encoding stimulus features, such as location and direction of stimulus motion. In addition, MIP neurons maintained the memory of a reach target during the delay period of a memory-guided reach task or when the target was obscured (Eskandar and Assad 1999; Snyder et al. 1997). When the hand direction and the visual target direction were dissociated through a well designed setup[2], it was found that MIP neuron activity correlated more with the hand direction than with the object location. The opposite was true for LIP neurons (Eskandar and Assad 1999).

2.4.4        Areas 7a and 7b (PG and PF)

The experimental findings indicate that area 7a, together with other inferior parietal lobule sectors, is involved in spatial coding. Researchers suggested various types of spatial encoding for area 7a. Stein (1991) suggested that area 7a represented extra-personal space. Andersen et al. (1999) suggested that area 7a represents targets in a world-centered coordinate frame. It has been shown that area 7a neurons are involved in the analysis of motion evoked during locomotion or by the manipulation of objects by the hands (Siegel and Read 1997). The different interpretation of area 7a responses can be due to either the non-homogenous functional distributions of neurons or due to the experimental setup differences (see the reviews: Andersen et al. 1997; Wise et al. 1997).

It has been found that reach-related activity in area 7a signaled specific phases of the motor performance (MacKay 1992). Further, it has been suggested that this activity could be used by the frontal lobe to facilitate upcoming elements of a motor sequence, including terminal corrections (MacKay 1992). Motter and colleagues identified visually sensitive and insensitive neurons in area 7a (Motter and Mountcastle 1981; Motter et al. 1987). The neurons insensitive to visual stimuli comprised the fixation, oculomotor, and projection-manipulation classes, which were suggested to be involved in initiatives toward action (Motter and Mountcastle 1981). Most of the visually sensitive neurons were activated from large and bilateral response areas that excluded the foveal region. The visually sensitive neurons were responsive to stimulus movement and direction over a wide range of velocities. The movement vectors pointed either inward toward the center or outward toward the perimeter of the visual field. For bilaterally activated neurons, the vectors pointed in opposite directions in the two half-fields (opponent vector organization). Motter and Mountcastle (1981) suggested that the neurons could signal motion in the immediate surround.

Constantinidis and Steinmetz (1996) showed that a population of neurons in area 7a was active during the delay period of a spatial memory task that did not require a motor response directed toward the stimulus. Thus, it is suggested that the activity could represent a short-term memory trace for the spatial location of the stimuli (Constantinidis and Steinmetz 1996). In accordance with the spatial memory hypotheses, Maunsell (1995) indicated that the object location coding in area 7a was capable of representing visual stimuli without ever falling into the corresponding receptive field.

Another functional aspect of area 7a, attentional tuning, was studied by Constantinidis and Steinmetz (2001). Their results indicate that area 7a neurons represent the location of the stimulus that attracts the animal's attention and can provide the spatial information required for directing attention to a salient stimulus in a complex scene (Constantinidis and Steinmetz 2001).

In our view, the fundamental and unifying property of area 7a neurons is that they can potentially be used to monitor the relation of body parts with respect to objects once they are fixated. A population of neurons that detect the motion of visual stimuli inwards to (or outwards from) the fixation point can encode the kinematic aspects (e.g. proximity) of a movement to satisfy a goal such as reaching or grasping. There is evidence that when humans perform reaching movements, they fixate on target objects or obstacles to plan reach actions (Johansson et al. 2001), which can be thought of as registering the relevant locations in area 7a as a saliency map. This proposal is supported by the fact that the removal of areas 7a, 7ab and LIP caused marked inaccuracy in reaching in the light to visual targets but had no effect on reaching in the dark (Rushworth et al. 1997). In contrast, the removal of areas 5, 7b and MIP caused misreaching in the dark, but had little effect on reaching in the light. Therefore, Rushworth et al. (1997) suggested that the two divisions of the parietal cortex organize limb movements in distinct spatial coordinate systems: areas 7a/7ab/LIP are essential for the spatial coordination of visuomotor transformations, whereas areas 5/7b/MIP are essential for the spatial coordination of arm movements in relation to proprioceptive and efference copy information.
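The proposal at the start of the paragraph above, that motion inwards to or outwards from the fixation point can signal how a body part is approaching a fixated object, can be made concrete with a small sketch. It is an illustration only and not the model developed in later chapters: the visual field is treated as a flat 2-D plane with the fixation point at the origin, and the function names and the tolerance are assumptions.

```python
import math

def radial_speed(position, velocity, fixation=(0.0, 0.0)):
    """Component of the stimulus velocity along the line joining the
    fixation point to the stimulus. Negative values mean motion toward
    the fixation point, positive values mean motion away from it."""
    rx = position[0] - fixation[0]
    ry = position[1] - fixation[1]
    distance = math.hypot(rx, ry)
    if distance == 0.0:
        return 0.0  # the stimulus is at the fixation point
    return (velocity[0] * rx + velocity[1] * ry) / distance

def motion_class(position, velocity, fixation=(0.0, 0.0), tol=1e-9):
    """Label the motion as inward, outward, or tangential."""
    speed = radial_speed(position, velocity, fixation)
    if speed < -tol:
        return "inward"
    if speed > tol:
        return "outward"
    return "tangential"

# A hand 10 units to the right of the fixated object, moving toward it.
print(motion_class((10.0, 0.0), (-2.0, 0.0)))
# The same hand moving away from the object.
print(motion_class((10.0, 0.0), (2.0, 0.0)))
```

In this simplified picture, the distance to the fixation point gives the proximity of the hand to the fixated object, and the sign of the radial speed indicates whether that proximity is increasing or decreasing.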

Other parietal areas that may be involved in signaling hand-object relations are area 7m (Ferraina et al. 1997a; Ferraina et al. 1997b), area V6A and area PEc (Caminiti et al. 1999; Battaglia-Mayer et al. 2000; Ferraina et al. 2001; Marconi et al. 2001).

Conventionally, area 7b is considered to be a somatosensory area (Andersen et al. 1990). Robinson and Burton (1980b) studied the somatic response properties of neurons from area SII and area 7b. One-half of the recorded 7b neurons responded only to somatic stimulation. Many neurons in the lateral parts of area 7b were vigorously activated by tactile stimulation. Despite the predominance of somatic responses, some visual responses from area 7b were noted. The visual responses of 7b neurons were not studied in detail, either because they were not the focus of interest (as in Robinson and Burton 1980a) or because of their complex response properties. In fact, it is possible to find a considerable number of unimodal visual 7b neurons, that is, neurons that respond only to visual stimulation (Dong et al. 1994). The visual responses of 7b neurons may be based on the signals carried by the small projections from the visual cortical areas (Andersen et al. 1990).

Fogassi et al. (1998) studied the visual properties of some area 7b neurons. They found that the activity of some neurons was triggered by the observation of various hand actions performed by the experimenter. The neurons had motor properties similar to those of the mirror neurons of area F5 (see section 2.2.1.3). The congruence between the action performed by the monkey and the observed action was usually low. The connection of area F5 with area 7b (Fogassi et al. 1998) indicates an intimate relation between 7b and F5 mirror neurons. Currently there are no detailed data on 7b mirror activity. However, unpublished results (Fogassi 1999) indicate that, in addition to those neurons that have properties similar to F5 mirror neurons, there exist mirror-like neurons that fire for the observation of simple arm/hand movements (in contrast to complete actions).

2.5         Connectivity and other brain regions

According to Cavada and Goldman-Rakic (1989), areas 7m, 7a and 7ip are extensively connected with a number of visual areas located on the medial surface of the hemisphere and in the depths of the parieto-occipital and intraparietal sulci. Areas 7m, 7a, 7ip, and to a much lesser extent 7b, are reciprocally connected with the visual temporal cortex, principally with the cortex of the superior temporal sulcus (STS) (Cavada and Goldman-Rakic 1989). Although the density of 7b connections with the visual motion cortex of the STS is largely surpassed by the extensive connections of 7b with somatosensory areas, the interconnections of 7b with the visual regions are established through anterior 7ip and the transitional cortex 7ab between 7a and 7b (Cavada and Goldman-Rakic 1989).

Figure 2.16 The reconstructed connectivity of area 7a. The thickness of the arrows represents the strength of the connection. (adapted from Bota 2001)

Findings from the same study also confirm that AIP is connected with area 7b. Area 7ip is unique among posterior parietal areas in its direct and indirect connections with the IT cortex (Cavada and Goldman-Rakic 1989) and may form one of the object information channels to area AIP (Sakata et al. 1997b). Figure 2.16 and Figure 2.17 show the reconstructed connectivity of areas 7a and 7b, while Figure 2.18 shows the reconstructed connectivity of AIP (Bota 2001).

Figure 2.17 The reconstructed connectivity of area 7b. The thickness of the arrows represents the strength of the connection. (adapted from Bota 2001)

Andersen et al. (1990) suggest two types of processing for area 7a, each following a different path. The first path originates from visual area V4, which is believed to have an important role in pattern and color processing, and reaches area 7a. The second path is the motion processing input originating from the middle temporal area (MT) and relayed via the medial superior temporal area (MST) or LIP (Andersen et al. 1990). MT lies on the posterior bank of the superior temporal sulcus, while MST lies on the anterior bank of the same sulcus (Kandel et al. 2000; Maioli et al. 1998). MT projects to MST and to other areas in the parietal cortex concerned with visuospatial function. The preprocessed visual input from V1 is further elaborated in MT, where the firing pattern of neurons reflects the speed and direction of motion of visual targets (Kandel et al. 2000). Barnes and Pandya (1992) report that area 7a (PG-Opt) is reciprocally connected to STS and suggest that the visuospatial analysis that is associated with the posterior intraparietal lobule could be amplified in the multimodal regions of STS. Therefore, the neurons of multimodal areas of the STS could be involved in analyzing the position of the body in relation to the environment (Barnes and Pandya 1992).

Figure 2.18 The reconstructed connectivity of area AIP. The thickness of the arrows represents the strength of the connection. (adapted from Bota 2001)

AIP receives input from other areas of the posterior parietal cortex such as 7b (Neal et al. 1990). In addition, this region has very significant recurrent cortico-cortical projections with area F5 of the inferior premotor cortex (Matelli et al. 1994; Sakata et al. 1997b). Figure 3.1 illustrates the visuomotor stream for hand action as well as other related structures. See also Figure 2.18 for the reconstructed connectivity diagram of AIP (Bota 2001).

The anterior cingulate cortex is somatotopically organized and has direct connections with the motor and premotor cortices, suggesting that among the subdivisions of area 7, 7b has preferential access to motor centers (Cavada and Goldman-Rakic 1989). Area 7b is distinguished from the other subdivisions by its prominent connections with somatosensory-related areas including S1, S2, the vestibular cortex, area 5 and the granular insular cortex. The only subdivision of area 7 that is connected to the primary somatosensory cortex (S1) is area 7b (Cavada and Goldman-Rakic 1989). The heaviest connection of area 7b is with S2; it is likely that the entire body representation in S2 is connected to 7b (Cavada and Goldman-Rakic 1989). The connection to the granular insular cortex, which contains a high proportion of somatic-sensitive neurons, is widespread. The connections of area 5 with 7b are topographic: the region of area 5 buried in the anterior bank of IPS, which is involved in forelimb mechanisms, is the source of the strongest projection from area 5 to area 7b (Cavada and Goldman-Rakic 1989).

We conclude our discussion of anatomical connections by summarizing the connectivity of functionally defined intraparietal regions. Area 7a receives input from LIP (Andersen et al. 1990; Lewis and Van Essen 2000), MIP (Boussaoud et al. 1990; Lewis and Van Essen 2000; Bota 2001) and VIP (reviewed in Maunsell 1995; Lewis and Van Essen 2000). Interested readers can find more details about these connections at the NeuroHomology Database Website[3] (Bota 2001 and citations therein). The premotor projections of these intraparietal areas include the regions F2 and F4 (Luppino et al. 1999), as reviewed by Geyer et al. (2000).

2.6         Mirror neurons in humans

There is an unsettled debate about the function of mirror neurons. It has been suggested that mirror neurons may form the basis of understanding (Fadiga et al. 2000; Umilta et al. 2001), imitation (Arbib 2001; Rizzolatti and Arbib 1999) and even language in humans (Rizzolatti and Arbib 1998). Thus, research on the existence of mirror neurons in humans became necessary to support the idea of mirror neuron involvement in cognitive tasks.

Grafton et al. (1996), using positron emission tomography (PET), scanned subjects under three conditions, one of them being the control condition (object viewing). The other two were observing grasping actions of common objects and imagining themselves doing the same grasp actions. Grafton et al. (1996) used only precision grasps. The imagined-minus-control and observation-minus-control results were compared. The activation pattern was different. In their analysis, they categorized the activations into lateral activations and medial/dorsal activations. The lateral activation is relevant for our discussion[4]. In the observation condition, the activity locations were left rostral superior temporal sulcus (STS), left inferior frontal cortex (area 45), and the left rostral inferior parietal cortex (area 40). In addition, there was some activation found in the rostral part of the left intraparietal sulcus. However, the imagined grasping activated the left inferior frontal (Broca’s area or area 44) and middle frontal cortex, left caudal inferior parietal cortex (area 40)[5]. Based on these findings, Grafton et al. (1996) suggested that the areas active during grasping observation might form a circuit for recognition of hand-object interactions, whereas the areas active during imagined grasping might be a human homologue of the action observation and execution matching system found in monkeys (mirror neurons). Their conclusion was that humans, like monkeys, have a cortical circuit that is involved in representing observed grasping. Unfortunately, Grafton et al. (1996) did not include the self-execution condition in the experimental setup. Therefore, it cannot be concluded that the areas activated in this study have the dual property of the mirror neurons (the activation during self-action and observation of the same action performed by the demonstrator). In addition, note the discrepancy that the human homologue of the monkey F5, namely Broca’s area (Rizzolatti and Arbib 1998), was not activated during grasp observation but only during imagined grasping.

In another study, Grafton et al. (1997) used positron emission tomography (PET) imaging to test whether the observation of tools activates premotor areas without any overt motor demand[6]. Tool observation strongly activated the left dorsal premotor cortex. Silent tool-use naming activated Broca's area, the left dorsal premotor cortex (more than the observation case), the left supplementary motor area and the left ventral premotor cortex. These data indicate that, in humans, neurons of the F5 canonical type may exist in the left ventral premotor cortex, which can be triggered by object observation.

Iacoboni et al. (1999) used functional magnetic resonance imaging (fMRI) to study the brain regions involved in imitation. Their paradigm had three observation conditions and three observation-execution conditions. In the observation-execution conditions, imitative and non-imitative behavior of simple finger movements was compared. In the imitative condition, participants had to execute the observed finger movement. In the two non-imitative conditions, participants had to execute the same movement in response to spatial or symbolic cues. The imitation task, when contrasted to the non-imitative tasks, activated three areas: the left frontal operculum (Broca’s area or area 44), the right anterior parietal region, and the right parietal operculum. Broca’s area and the right anterior parietal region were also active during the observation conditions. Iacoboni et al. (1999) argued that Broca's area was activated due to action observation, as Broca’s area is the human homologue of monkey area F5 (Rizzolatti and Arbib 1998).

However, the data are not conclusive, since Broca’s area was active for all observation cases, not only action observation. Krams et al. (1998), in a similar study, found that Broca’s region was more active during action preparation compared to action preparation-and-execution conditions. In both conditions the visual stimuli presented were the same and consisted of a hand drawing with a mark on a finger indicating the action to be prepared for. Krams et al. (1998) argued that Broca’s area was involved in action suppression (see Krams et al. 1998 for a detailed discussion). However, in both studies, the actions were intransitive; they did not involve an object to be manipulated. In contrast, the majority of mirror neurons require an object and the action together; the miming of the action is not effective (Gallese et al. 1996).

In one study the motor cortex was stimulated using transcranial magnetic stimulation while the subjects (1) observed an experimenter grasping 3D objects, (2) looked at the same 3D objects, (3) observed an experimenter tracing geometrical figures in the air with his arm and (4) detected the dimming of a light (Fadiga et al. 1995). During conditions (1)-(4) the motor evoked potentials were recorded from the hand muscles. Fadiga et al. (1995) found that motor evoked potentials increased when the subjects observed movements. The motor evoked potential patterns reflected the pattern of muscle activity recorded when the subjects executed the observed actions. Therefore Fadiga et al. (1995) concluded that in humans there is an action observation and execution matching system, which is similar to the monkey action recognition system (mirror neurons). This study showed that the effects of executing an action and of observing the same action performed by others are similar. However, the localization of the action observation and execution matching system was not possible with motor evoked potential recordings.

Hari et al. (1998), using a different technique (magnetoencephalography), showed that the observation of object manipulation activated the primary motor cortex. Hari et al. (1998) recorded neuromagnetic oscillatory activity of the human precentral cortex while subjects were (1) idle, (2) manipulating a small object, and (3) observing another individual performing the same task. The left and right median nerves were stimulated alternately (interstimulus interval, 1.5 s) at intensities exceeding motor threshold, and the poststimulus rebound of the rolandic 15- to 25-Hz activity was quantified (Hari et al. 1998). The rebound was diminished during action observation, as it was during action execution (the observation suppression magnitude was 31-46% of the suppression during object manipulation). Hari et al. (1998) concluded that the human primary motor cortex was activated during observation as well as execution of the motor tasks, since the 15- to 25-Hz activity mainly originates from the precentral motor cortex.

Nishitani and Hari (2000) showed that the inferior frontal area was active during both execution and observation of hand actions, which confirmed the existence of a mirror system in humans. In contrast to several PET studies (e.g. Grafton et al. 1996; Decety et al. 1997; Rizzolatti et al. 1996b), Broca’s area was active during action observation while area 45 was not active. Therefore, the study of Nishitani and Hari (2000) not only shows that the human brain is endowed with a mirror neuron system but also supports the hypothesis that Broca’s area is the locus of the action observation and execution matching system, which is consistent with the homology between Broca’s area and area F5 (Rizzolatti and Arbib 1998).

Buccino et al. (2001) used fMRI to localize action recognition circuitry in humans for actions performed with different effectors. The subjects were presented with transitive and intransitive actions performed with mouth, hand and foot. Observation of both object- and non-object-related actions produced a somatotopically organized activation of premotor cortex. In addition, Buccino et al. (2001) found that during the observation of object-related actions, an activation, also somatotopically organized, was present in the posterior parietal lobe. They argued that when individuals observe object-related actions, an internal replica of the motor act and the result of an object-related analysis are automatically generated in the ventral premotor cortex and the parietal lobe respectively. This result suggests that the observation and execution matching system is not constrained to hand actions but could be a general strategy used in the primate brain for interacting with the environment.

2.7         Summary

The posterior parietal cortex is involved in sensory-motor transformations, combining various sensory inputs and computing representations that are used by the motor system to generate movements. In particular, AIP extracts object features relevant for grasping. Other parietal areas such as VIP, MIP and LIP are involved in spatial aspects of object representations. These areas project to motor and premotor cortices, enabling specific movement planning. Area F5 is involved in grasp planning while F4 is involved in reaching movement planning. The visual areas in the superior temporal sulcus perform visual analysis of form and motion, including biological stimuli, and provide parietal networks with motion-related and, for some sectors, highly processed visual input. Chapter 3 will factor in the connectivity specified in this chapter when developing the Mirror Neuron System (MNS) model. The neurophysiology of area F5 will guide the modelling presented throughout this thesis. We implicate AIP and c-IPS as coding the object affordances that serve as inputs to the MNS model and to the Learning to Grasp Models (LGM) of Chapters 5 and 6. We take the target location schema to be represented in areas MIP/VIP/LIP, without specifying a region-level assignment. Area 7a will combine the hand and object related visual inputs into an internal representation on which area 7b and F5 can be adapted to form mirror neurons.


3          CHAPTER III: MIRROR NEURON SYSTEM MODEL

Mirror neurons within a monkey's premotor area F5 fire not only when the monkey performs a certain class of actions but also when the monkey observes another monkey (or the experimenter) perform a similar action. It has thus been argued that these neurons are crucial for understanding of actions by others. This chapter offers the ‘hand-state’ hypothesis as a new explanation of the evolution of this capability: the basic functionality of the F5 mirror system is to elaborate the appropriate feedback – what we call the hand state – for opposition-space based control of manual grasping of an object. Given this functionality, the social role of the F5 mirror system in understanding the actions of others may be seen as an exaptation gained by generalizing from self-hand to other's-hand. In other words, mirror neurons first evolved to augment the ‘canonical’ F5 neurons by providing visual feedback on ‘hand state’, relating the shape of the hand to the shape of the object.

First, we introduce the MNS (Mirror Neuron System) model of F5 and related brain regions in terms of basic schemas. Then we aggregate them into three ‘grand schemas’ - Visual Analysis of Hand State, Reach and Grasp, and the Core Mirror Circuit - for each of which we present a useful implementation. The MNS model shows how the mirror system can learn to recognize actions already in the repertoire of the F5 canonical neurons. The chapter, in particular, shows how the connectivity pattern of mirror neuron circuitry can be established through training, and that the resultant network can exhibit a range of novel, physiologically interesting, behaviors during the process of action recognition.

3.1         The mirror neuron system for grasping and FARS model

The macaque inferior premotor cortex has been identified as being involved in reaching and grasping movements (Rizzolatti et al. 1988). This region has been further partitioned into two sub-regions: F5, the rostral region, located along the arcuate, and F4, the caudal part (see Figure 3.1). The neurons in F4 appear to be primarily involved in the control of proximal movements (Gentilucci et al. 1988), whereas the neurons of F5 are involved in distal control (Rizzolatti et al. 1988). Rizzolatti and colleagues discovered a subset of F5 hand cells, which they called mirror neurons (Gallese et al. 1996; Rizzolatti et al. 1996a). Like other F5 neurons, mirror neurons are active when the monkey performs a particular class of actions, such as grasping, manipulating and placing. However, in addition, the mirror neurons become active when the monkey observes the experimenter or another monkey performing an action. The term F5 canonical neurons is used to distinguish the F5 hand cells which do not possess the mirror property but are instead responsive to visual input concerning a suitably graspable object. The canonical neurons are indistinguishable from the mirror neurons with respect to their firing during self-action. However, they differ in their visual properties: they respond to object presentation, not action observation per se (Murata et al. 1997a).

Figure 3.1 Lateral view of the monkey cerebral cortex (IPS, STS and lunate sulcus opened). The visuomotor stream for hand action is indicated by arrows (adapted from Sakata et al., 1997)

 

Most mirror neurons exhibit a clear relation between the observed and executed actions for which they are active. The congruence between the observed and executed action varies. For some of the mirror neurons, the congruence is quite loose; for others, not only must the general action (e.g., grasping) match but also the way the action is executed (e.g., power grasp) must match as well. To be triggered, the mirror neurons require an interaction between the hand motion and the object. The vision of the hand motion or the object alone does not trigger mirror activity (Gallese et al. 1996).

It has thus been argued that the importance of mirror neurons is that they provide a neural representation that is common to execution and observation of grasping actions and thus that these neurons are crucial to the social interactions of monkeys, providing the basis for understanding of actions by others through their linkage of action and perception (Rizzolatti and Fadiga 1998). Below, we offer the Hand-State Hypothesis, suggesting that this important role is an exaptation of a more primitive role, namely that of providing feedback for visually-guided grasping movements. By exaptation we mean the exploitation of an adaptation of a system to serve a different purpose (in this case for social understanding) than it initially developed for (in this case, visual control of grasping). We will then develop the MNS (Mirror Neuron System) model and show that the system can exploit its ability to relate self-hand movements to objects to recognize the manual actions being performed by others, thus yielding the mirror property. We also conduct a number of simulation experiments with the model and show that these yield novel predictions, suggesting new neurophysiological experiments to further probe the monkey mirror system. However, before introducing the Hand-State Hypothesis and the MNS model, we first outline the FARS model of the circuitry that includes the F5 canonical neurons and provides the conceptual basis for the MNS model.

Studies of the anterior intraparietal sulcus (AIP; Figure 3.1) revealed cells that were activated by the sight of objects for manipulation. In addition, this region has very significant recurrent cortico-cortical projections with area F5 (Matelli 1984; Sakata et al. 1997a). In their computational model for primate control of grasping (the FARS model, for Fagg-Arbib-Rizzolatti-Sakata), Fagg and Arbib (1998) analyzed these findings of Sakata and Rizzolatti to show how F5 and AIP may act as part of a visuo-motor transformation circuit, which carries the brain from sight of an object to the execution of a particular grasp. In the FARS model, the findings of Sakata (on AIP) and Rizzolatti (on F5) were interpreted as showing that AIP represents the grasps afforded by the object while F5 selects and drives the execution of the grasp (Fagg and Arbib 1998). The term affordance (adapted from Gibson 1966) refers to parameters for motor interaction that are signaled by sensory cues without invocation of high-level object recognition processes. The model also suggests how F5 may use task information and other constraints encoded in prefrontal cortex (PFC) to resolve the action opportunities provided by multiple affordances. Here we emphasize the essential components of the model (Figure 3.2) that will ground the version of the MNS model presented below. We focus on the linkage between viewing an affordance of an object and the generation of a single grasp.

Figure 3.2 AIP extracts the affordances and F5 selects the appropriate grasp from the AIP ‘menu’. Various biases are sent to F5 by Prefrontal Cortex (PFC) which relies on the recognition of the object by Inferotemporal Cortex (IT). The dorsal stream through AIP to F5 is replicated in the MNS model

(1) The dorsal visual stream (parietal cortex) extracts parametric information about the object being attended. It does not "know" what the object is; it can only see the object as a set of possible affordances. The ventral stream (from primary visual cortex to inferotemporal cortex, IT), by contrast, recognizes what the object is and passes this information to prefrontal cortex (PFC), which can then, on the basis of the current goals of the organism and the recognition of the nature of the object, bias F5 to choose the affordance appropriate to the task at hand.

(2) AIP is hypothesized as playing a dual role in the seeing/reaching/grasping process, not only computing affordances exhibited by the object but also, as one of these affordances is selected and execution of the grasp begins, serving as an active memory of the one selected affordance and updating this memory to correspond to the grasp that is actually executed.

(3) F5 is hypothesized as first being responsible for integrating task constraints with the set of grasps that are afforded by the attended object in order to select a single grasp. After selection of a single grasp, F5 unfolds this represented grasp in time to govern the role of primary motor cortex (F1) in its execution.

(4) In addition, the FARS model represents the way in which F5 may accept signals from areas F6 (pre-SMA), 46 (dorsolateral prefrontal cortex), and F2 (dorsal premotor cortex) to respond to task constraints, working memory, and instruction stimuli, respectively, and how these in turn may be influenced by object recognition processes in IT (see Fagg and Arbib 1998 for more details), but these aspects of the FARS model are not included in the MNS model.
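
The selection step described in points (1) and (3) can be illustrated with a minimal sketch. It is not the FARS implementation: the Affordance class, the salience values and the additive form of the PFC bias are assumptions introduced here purely for illustration of how a bias can change which item of the AIP 'menu' is chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Affordance:
    grasp: str       # label of the afforded grasp, e.g. "precision" or "power"
    salience: float  # hypothetical strength of the AIP signal for this grasp


def select_grasp(affordances, pfc_bias):
    """Pick one grasp from the AIP 'menu'.

    pfc_bias maps grasp labels to an additive bias (an assumed form of the
    PFC influence on F5); grasps without an entry receive no bias.
    """
    if not affordances:
        raise ValueError("no affordances available")

    def score(a):
        return a.salience + pfc_bias.get(a.grasp, 0.0)

    return max(affordances, key=score).grasp


# Hypothetical menu for a single object affording two grasps.
menu = [Affordance("precision", 0.6), Affordance("power", 0.7)]
unbiased = select_grasp(menu, {})
biased = select_grasp(menu, {"precision": 0.3})
```

Without a bias the more salient power grasp wins; with the task bias toward precision the selection flips, which is the qualitative behavior the text attributes to the PFC influence on F5.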

3.2         The hand-state hypothesis

The key notion of the MNS model is that the brain augments the mechanisms modeled by the FARS model, for recognizing the grasping-affordances of an object (AIP) and transforming these into a program of action, by mechanisms which can recognize an action in terms of the hand state which makes explicit the relation between the unfolding trajectory of a hand and the affordances of an object. Our radical departure from all prior studies of the mirror system is to hypothesize that this system evolved in the first place to provide feedback for visually-directed grasping, with the social role of the mirror system being an exaptation as the hand state mechanisms become applied to the hands of others as well as to the hand of the animal itself.

3.2.1        Virtual fingers

As background for the Hand-State Hypothesis, we first present a conceptual analysis of grasping. Iberall and Arbib (1990) introduced the theory of virtual fingers and opposition space. The term virtual finger is used to describe the physical entity (one or more fingers, the palm of the hand, etc.) that is used in applying force and thus includes specification of the region to be brought in contact with the object (what we might call the ‘virtual fingertip’). Figure 3.3 shows three types of opposition: those for the precision grip, power grasp, and side opposition. Each of the grasp types is defined by specifying two virtual fingers, VF1 and VF2, and the regions on VF1 and VF2 which are to be brought into contact with the object to grasp it. Note that the "virtual fingertip" for VF1 in palm opposition is the surface of the palm, while that for VF2 in side opposition is the side of the index finger.

Figure 3.3 Each of the 3 grasp types here is defined by specifying two "virtual fingers", VF1 and VF2, which are groups of fingers or a part of the hand such as the palm which are brought to bear on either side of an object to grasp it. The specification of the virtual fingers includes specification of the region on each virtual finger to be brought in contact with the object. A successful grasp involves the alignment of two "opposition axes": the opposition axis in the hand joining the virtual finger regions to be opposed to each other, and the opposition axis in the object joining the regions where the virtual fingers contact the object. (Iberall and Arbib 1990)

The grasp defines two ‘opposition axes’: the opposition axis in the hand joining the virtual finger regions to be opposed to each other, and the opposition axis in the object joining the regions where the virtual fingers contact the object. Visual perception provides affordances (different ways to grasp the object); once an affordance is selected, an appropriate opposition axis in the object can be determined. The task of motor control is to preshape the hand to form an opposition axis appropriate to the chosen affordance, and to so move the arm as to transport the hand to bring the hand and object axes into alignment. During the last stage of transport, the virtual fingers move down the opposition axis (the ‘enclose’ phase) to grasp the object just as the hand reaches the appropriate position.

3.2.2        The hand-state hypothesis

We assert as a general principle of motor control that if a motor plant is used for a task, then a feedback system will evolve to better control its performance in the face of perturbations. We thus ask, as a sequel to the work of Iberall and Arbib (1990), what information would be needed by a feedback controller to control grasping in the manner described in the previous section. Modeling of this feedback control is presented in Chapter 7, using a simplified hand/arm. In this chapter, our aim is to show how the availability of such feedback signals in the primate cortex for self-action for manual grasping can provide the action recognition capabilities which characterize the mirror system. Specifically, we offer the following hypothesis.

The hand-state hypothesis: The basic functionality of the F5 mirror system is to elaborate the appropriate feedback – what we call the hand state – for opposition-space based control of manual grasping of an object. Given this functionality, the social role of the F5 mirror system in understanding the actions of others may be seen as an exaptation gained by generalizing from self-hand to other's-hand.

The key to the MNS model, then, is the notion of hand state as encompassing data required to determine whether the motion and preshape of a moving hand may be extrapolated to culminate in a grasp appropriate to one of the affordances of the observed object. Basically a mirror neuron must fire if the preshaping of the hand conforms to the grasp type with which the neuron is associated; and the extrapolation of hand state yields a time at which the hand is grasping the object along an axis for which that affordance is appropriate.

Our current representation of hand state defines a 7-dimensional trajectory

F(t) = (d(t), v(t), a(t), o1(t), o2(t), o3(t), o4(t))

with the following components (see Figure 3.4):

Three components are hand configuration parameters:

a(t): Index finger-tip and thumb-tip aperture

o3(t), o4(t): The two angles defining how close the thumb is to the hand as measured relative to the side of the hand and to the inner surface of the palm

The remaining four parameters relate the hand to the object. The o1 and o2 components represent the orientation of different components of the hand relative to the opposition axis for the chosen affordance in the object, whereas d and v represent the kinematic properties of the hand with reference to the target location.

o1(t): The cosine of the angle between the object axis and the (index finger tip – thumb tip) vector

o2(t): The cosine of the angle between the object axis and the (index finger knuckle – thumb tip) vector

d(t): distance to target at time t

v(t): tangential velocity of the wrist

Figure 3.4 The components of hand state F(t) = (d(t), v(t), a(t), o1(t), o2(t), o3(t), o4(t)). Note that some of the components are purely hand configuration parameters (namely v,o3,o4,a) whereas others are parameters relating hand to the object
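
As a rough illustration of the definitions above, the following sketch computes one sample of F(t) from 3D landmark positions. Several choices are assumptions made here and are not specified in the text: the distance d is measured from the wrist to the target, the velocity v is a finite difference of successive wrist positions, and the thumb angles o3 and o4 are supplied directly rather than derived from a hand model. All function and parameter names are hypothetical.

```python
import math


def sub(p, q):
    """Component-wise difference p - q of two 3D points."""
    return tuple(pi - qi for pi, qi in zip(p, q))


def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(sum(ui * ui for ui in u))


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    nu, nv = norm(u), norm(v)
    if nu == 0.0 or nv == 0.0:
        raise ValueError("zero-length vector")
    return sum(ui * vi for ui, vi in zip(u, v)) / (nu * nv)


def hand_state(index_tip, thumb_tip, index_knuckle, wrist, prev_wrist,
               target, object_axis, o3, o4, dt):
    """Return (d, v, a, o1, o2, o3, o4) for one time step."""
    d = norm(sub(target, wrist))              # assumed: wrist-to-target distance
    v = norm(sub(wrist, prev_wrist)) / dt     # assumed: finite-difference speed
    a = norm(sub(index_tip, thumb_tip))       # index tip to thumb tip aperture
    o1 = cosine(object_axis, sub(index_tip, thumb_tip))
    o2 = cosine(object_axis, sub(index_knuckle, thumb_tip))
    return (d, v, a, o1, o2, o3, o4)


# Hypothetical hand pose (metres) approaching an object whose axis is along y.
state = hand_state(
    index_tip=(0.0, 0.04, 0.0), thumb_tip=(0.0, 0.0, 0.0),
    index_knuckle=(0.03, 0.04, 0.0), wrist=(0.0, 0.0, -0.1),
    prev_wrist=(0.0, 0.0, -0.11), target=(0.0, 0.0, 0.2),
    object_axis=(0.0, 1.0, 0.0), o3=0.5, o4=0.2, dt=0.01)
```

Because o1 and o2 are cosines relative to the object's opposition axis, the same computation applies whether the landmarks come from the monkey's own hand or from an observed hand, which is the property the hand-state hypothesis relies on.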

In considering the last four variables, note that only one or two of them will be relevant in generating a specific type of grasp, but they all must be available to monitor a wide range of possible grasps. We have chosen a set of variables of clear utility in monitoring the successful progress of grasping an object, but do not claim that these and only these variables are represented in the brain. Indeed, the brain's actual representation will be a distributed neural code, which we predict will correlate with such variables, but will not be decomposable into a coordinate-by-coordinate encoding. However, we believe that the explicit definition of hand state offered here will provide a firm foundation for the design of new experiments in kinesiology and neurophysiology.

The crucial point is that the availability of the hand state to provide feedback for visually-directed grasping makes action recognition possible. Notice that we have carefully defined the hand state in terms of relationships between hand and object (though the form of the definition must be subject to future research). This has the benefit that it will work just as well for measuring how the monkey’s own hand is moving to grasp an object as for observing how well another monkey’s hand is moving to grasp the object. This, we claim, is what allows self-observation by the monkey to train a system that can be used for observing the actions of others and recognizing just what those actions are.

3.3         The MNS (mirror neuron system) model

We now present a high-level view of the MNS (Mirror Neuron System) model of F5 and related brain regions in terms of the set of interacting schemas (functional units: Arbib 1981; Arbib et al. 1998) shown in Figure 3.5. The connectivity shown in Figure 3.5 is constrained by the existing neurophysiology and neuroanatomy of the monkey brain (reviewed in Chapter 2). We have already introduced area AIP and area F5, dividing the F5 grasp-related neurons into (i) F5 mirror neurons which are, when fully developed, active during certain self-movements of grasping by the monkey and during the observation of a similar grasp executed by others, and (ii) F5 canonical neurons, namely those active during self-movement and object vision but not for recognition of the action of others. Other brain regions also play an important role in the functioning of the mirror neuron system in the macaque brain; readers are referred to Chapter 2 for details.

 

 

Figure 3.5 The MNS (Mirror Neuron System) model. (i) Top diagonal: a portion of the FARS model. Object features are processed by cIPS and AIP to extract grasp affordances; these are sent on to the canonical neurons of F5 that choose a particular grasp. (ii) Bottom right: Recognizing the location of the object provides parameters to the motor programming area F4, which computes the reach. The information about the reach and the grasp is taken by the motor cortex M1 to control the hand and the arm. (iii) New elements of the MNS model: Bottom left are two schemas, one to recognize the shape of the hand, and the other to recognize how that hand is moving. (iv) Just to the right of these is the schema for hand-object spatial relation analysis. It takes information about object features, the motion of the hand and the location of the object to infer the relation between hand and object. (v) The center two regions marked by the gray rectangle form the core mirror circuit. This complex associates the visually derived input (hand state) with the motor program input from the F5 canonical neurons during the learning process for the mirror neurons. The grand schemas introduced in Section 3.3.2 are illustrated as follows. The “Core Mirror Circuit” schema is marked by the center gray box; the “Visual Analysis of Hand State” schema is outlined by solid lines just below it; and the “Reach and Grasp” schema is outlined by dashed lines. (Solid arrows: established connections; dashed arrows: postulated connections.)

The subsystem of the MNS model responsible for the visuo-motor transformation of objects into affordances and grasp configurations, linking AIP and F5 canonical neurons, corresponds to a key subsystem of the FARS model reviewed above. Our task is to complement the visual pathway via AIP by pathways directed toward F5 mirror neurons which allow the monkey to observe arm-hand trajectories and match them to the affordances and location of a potential target object. We will then show how the mirror system may learn to recognize actions already in the repertoire of the F5 canonical neurons. In short, we will provide a mechanism whereby the actions of others are ‘recognized’ based on the circuitry involved in performing such actions. Section 3.4 provides the details of the implemented schemas, and Section 3.5 confronts the overall model with virtual experiments and produces testable predictions.

3.3.1        Overall function

In general, the visual input to the monkey represents a complex scene. However, we here sidestep much of this complexity (including attentional mechanisms) by assuming that the brain extracts two salient sub-scenes, a stationary object and in some cases a (possibly) moving hand. The overall system operates in two modes:

(i) Prehension: In this mode, the view of the stationary object is analyzed to extract affordances; then under prefrontal influence F5 may choose one of these to act upon, commanding the motor apparatus to perform the appropriate reach and grasp based on parameters supplied by the parietal cortex. The FARS model captures the linkage of F5 and AIP with PFC, prefrontal cortex (Figure 3.2). In the MNS model, we incorporate the F5 and AIP components from FARS (top diagonal of schemas in Figure 3.5), but omit IT and PFC from the present analysis.

(ii) Action recognition: In this mode, the view of the stationary object is again analyzed to extract affordances, but now the initial trajectory and preshape of an observed moving hand must be extrapolated to determine whether the current motion of the hand can be expected to culminate in a grasp of the object appropriate to one of its affordances.

We do not prespecify all the details of the MNS schemas. Instead, we offer a learning model which, given a grasp that is already in the motor repertoire of the F5 canonical neurons, can yield a set of F5 mirror neurons trained to be active during such grasps as a result of self-observation of the monkey's own hand grasping the target object. (How such grasps may be acquired in the first place is a topic of current research.) Consistent with the hand-state hypothesis, the result will be a system whose mirror neurons can respond to similar actions observed being performed by others. The current implementation of the MNS model exploits learning in artificial neural nets.

The heart of the learning model is provided by the Object affordance-hand state association schema and the Action recognition (mirror neurons) schema. These form the core mirror (learning) circuit, marked by the gray slanted rectangle in Figure 3.5, which mediates the development of mirror neurons via learning. The simulation results of this chapter will focus on this part of the model. Section 3.4.3.1 presents in detail the neural network structure of the core circuit. As we note further in the Discussion (Section 3.6), this leaves open many problems for further research, including the development of a basic action repertoire by F5 canonical neurons through trial-and-error in infancy and the expansion and refinement of this repertoire throughout life.

3.3.2        Schemas explained

As shown in the caption of Figure 3.5, we encapsulate the schemas shown there into the three ‘grand schemas’ of Figure 3.6(a). These guide our implementation of MNS. Our earlier review of the neuroscience literature in Chapter 2 justifies our initial hypotheses, made explicit in Figure 3.5, as to where these finer-grain schemas are realized in the monkey brain. However, after we explain these finer-grain schemas, we will then turn to our present simulation of the three grand schemas which is based on overall functionality. Nonetheless, the neural structure of Core Mirror Circuit yields interesting predictions for further neurophysiological experimentation.

3.3.2.1              Grand schema 1: reach and grasp

Object features schema: The output of this schema provides a coarse coding of geometrical features of the observed object. It thus provides suitable input to AIP and other regions/schemas.

Object affordance extraction schema: This schema transforms its input, the coarse coding of geometrical features of the observed object provided by the Object features schema, into a coarse coding for each affordance of the observed object.

Motor program (grasp) schema: We identify this schema with the canonical F5 neurons, as in the FARS model. Input is provided by AIP's coarse coding of affordances for the observed object. We assume that the output of the schema encodes a generic motor program for the AIP-coded affordances. This output serves as the learning signal to the Action-recognition (Mirror neurons) schema and drives the hand control functions of the Motor execution schema.

Figure 3.6 (a) For purposes of simulation, we aggregate the schemas of the MNS (Mirror Neuron System) model of Figure 3.5 into three "grand schemas" for Visual Analysis of Hand State, Reach and Grasp, Core Mirror Circuit. (b) For detailed analysis of the Core Mirror Circuit, we dispense with simulation of the other two grand schemas and use other computational means to provide the three key inputs to this grand schema

Object location schema: The output of this schema provides, in some body-centered coordinate frame, the location of the center of the opposition axis for the chosen affordance of the observed object.

Motor program (reach) schema: The input is the position coded by the Object location schema, while the output is the motor command required to transport the arm to bring the hand to the indicated location. This drives the arm control functions of the Motor execution schema.

The motor execution schema determines the course of movements via activity in primary motor cortex M1 and "lower" regions.

We next review the schemas which (in addition to the previously presented Object features and Object affordance extraction schemas) implement the visual system of the model:

3.3.2.2              Grand Schema 2: Visual Analysis of Hand State

The hand shape recognition schema takes as input a view of a hand, and its output is a specification of the hand shape, which thus forms some of the components of the hand state. In the current implementation these are a(t), o3(t) and o4(t). Note also that we implicitly assume that the schema includes a validity check to verify that the scene does contain a hand.

The hand motion detection schema takes as input a sequence of views of a hand and returns as output the wrist velocity, supplying the v(t) component of the hand state.
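As an illustration of the output of this schema, the tangential wrist speed can be estimated from successive wrist positions by finite differences. This is a minimal sketch, not the implementation used in the model: the function name `wrist_speed`, the list-of-tuples input format and the fixed frame interval `dt` are all assumptions made for the example.

```python
import math

def wrist_speed(positions, dt):
    """Tangential wrist speed between consecutive frames.

    positions: wrist positions (x, y, z), one per frame (hypothetical input format).
    dt: time between frames; assumed constant.
    Returns len(positions) - 1 speed estimates.
    """
    return [math.dist(p, q) / dt for p, q in zip(positions, positions[1:])]

# Made-up wrist track sampled every 0.5 time units.
track = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
speeds = wrist_speed(track, dt=0.5)
```

Each entry of `speeds` is the distance travelled between two frames divided by the frame interval, so a sequence of such values traces out the v(t) profile.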

The hand-object spatial relation analysis schema receives object-related signals from the Object features schema, as well as input from the Object Location, Hand shape recognition and Hand motion detection schemas. Its output is a set of vectors relating the current hand preshape to a selected affordance of the object. The schema computes such parameters as the distance of the object to the hand, and the disparity between the opposition axes of the object and the hand. Thus the hand state components o1(t), o2(t), and d(t) are supplied by this schema. The Hand-Object spatial relation analysis schema is needed because, for almost all mirror neurons in the monkey, a hand mimicking a matching grasp would fail to elicit the mirror neuron's activity unless the hand's trajectory were taking it toward an object with a grasp that matches one of the affordances of the object. The output of this visual analysis is relayed to the Object affordance-hand state association schema which drives the F5 mirror neurons whose output is a signal expressing confidence that the observed trajectory will extrapolate to match the observed target object using the grasp encoded by that mirror neuron.

3.3.2.3              Grand Schema 3: Core Mirror Circuit

The action recognition schema, which is meant to correspond to the mirror neurons of area F5, receives two inputs in our model. One is the motor program selected by the Motor program schema; the other comes from the Object affordance-hand state association schema. This schema works in two modes: learning and recognition. When a self-executed grasp is taking place, the schema is in learning mode, and the association between the observed hand state (Object affordance-hand state association schema) and the motor program (Motor program schema) is learned. In recognition mode, the motor program input is not active and the schema acts as a recognition circuit. If satisfactory learning (in terms of generalization and the range of actions learned) has taken place via self-observation, then the schema will respond correctly while observing others’ grasp actions.

The object affordance-hand state association schema combines all the hand-related information as well as the object information available. Thus the inputs to the schema are from the Hand shape recognition (components a(t), o3(t), o4(t)), Hand motion detection (component v(t)), Hand-Object spatial relation analysis (o1(t), o2(t), d(t)) and Object affordance extraction schemas. As will be explained below, the schema needs a learning signal (mirror feedback). This signal is relayed by the Action recognition schema and is basically a copy of the motor program passed to the Action recognition schema itself. The output of this schema is a distributed representation of the object and hand state match (in our implementation the representation is not pre-specified but shaped by the learning process). The idea is to match the object and the hand state as the action progresses during a specific observed reach and grasp. In the current implementation, time is unfolded into a spatial representation of ‘the trajectory until now’ at the input of the Object affordance-hand state association schema, and the Action recognition schema decodes the distributed representation to form the mirror response (again, the decoding is not pre-specified but is the result of the back-propagation learning). In any case, the schema has two operating modes. The first is the learning mode, in which the schema tries to adjust its efferent and afferent weights to ensure the right activity in the Action recognition schema. The second is the forward mode, in which it maps the hand state and the object affordance into a distributed representation to be used by the Action recognition schema.

The key question for this chapter’s modeling will be to account for how learning mechanisms may shape the connections to mirror neurons in such a way that an action in the motor program repertoire of the F5 canonical neurons may become recognized by the mirror neurons when performed by others. In Chapter 5 and Chapter 6 we will present models that can learn a repertoire of grasping actions.

To conclude this section, we note that our modeling is subject to two quite different tests: (i) its overall efficacy in explaining behavior and its development, which can be tested at the level of the schemas (functional units) presented in this chapter; and (ii) its further efficacy in explaining and predicting neurophysiological data. As we shall see below, certain neurophysiological predictions are possible given the current work, even though the present implementation relies on relatively abstract artificial neural networks.

3.4         Schema implementation

Having indicated the functionality and possible neural basis for each of the schemas that will make up each grand schema, we now turn to the implementation of these three grand schemas. We implement the three grand schemas so that each functions correctly in terms of its input-output relations, and so that the Core Mirror Circuit contains model neurons whose behavior can be tested against neurophysiological data and yield predictions for novel neurophysiological experiments. The Core Mirror Circuit is thus the heart of the MNS model that enables us to produce testable predictions (Figure 3.6b), but in order to study it, there must be an appropriate context, necessitating the construction of the kinematically realistic Reach and Grasp Simulator and the Visual Analyzer for Hand State. The latter will first be implemented as an analyzer of views of human hands, and then will have its output replaced by simulated hand state trajectories to reduce computational expense in our detailed analysis of the Core Mirror Circuit.

3.4.1        Grand schema 1: reach and grasp

We first discuss the Reach and Grasp Simulator that corresponds to the whole reach and grasp command system shown at the right of the MNS diagram (Figure 3.5). The simulator lets us move from the representation of the shape and position of a (virtual) 3D object and the initial position of the (virtual) arm and hand to a trajectory that successfully results in simulated grasping of the object. In other words, the simulator plans a grasp and reach trajectory and executes it in a simulated 3D world (see Chapters 5 and 6 for a neural realization of this schema). Trajectory planning (for example Kawato and Gomi 1992; Kawato et al. 1987; Jordan and Rumelhart 1992; Karniel and Inbar 1997; Breteler et al. 2001) and control of prehension (Hoff and Arbib 1993; see Wolpert and Ghahramani 2000 for a review), and their adaptation, have been widely studied. However, our simulator is not adaptive; its sole purpose is to create kinematically realistic actions. A similar reach and grasp system was proposed (Rosenbaum et al. 2001; Rosenbaum et al. 1999) in which a movement is planned based on a constraint hierarchy, relying on obstacle avoidance and candidate posture evaluation processes (Meulenbroek et al. 2001). However, that arm and hand model was much simpler than ours, as the arm was modeled as a 2D kinematic chain. Our Reach/Grasp Simulator is a non-neural extension of FARS model functionality to include the reach component. It controls a virtual 19-DOF arm/hand (3 DOFs at the shoulder, 1 for elbow flexion/extension, 3 for wrist rotation, 2 for the joints of each finger, and 2 additional DOFs for the thumb: one to allow the thumb to move sideways and the other for the last joint of the thumb) and provides routines to perform realistic grasps. This kinematic realism is based on the literature of primate reach and grasp experiments (Jeannerod et al. 1995; for human see Hoff and Arbib 1993 and citations therein; for monkey see Roy et al. 2000).
During a typical reach-to-grasp movement, the hand follows a ‘bell-shaped’ velocity profile (a single-peaked velocity curve). The kinematics of the aperture between the fingers used for grasping also exhibits typical characteristics. The aperture first reaches a maximum value that is larger than the aperture required for grasping the object, and then, as the hand approaches the target, the hand closes to match the aperture actually required for the object. It is also important to note that in grasping tasks the temporal pattern of reaching and grasping is similar in monkey and human (Roy et al. 2000). Of course, there is inter-subject and inter-trial variability in both velocity and aperture profiles (Marteniuk and MacKenzie 1990). Therefore, in our simulator we captured the qualitative aspects of typical reach and grasp actions, namely that the velocity profiles have single peaks and that the hand aperture has a maximum value which is larger than the object size (see Figure 3.7, curves a(t) and v(t), for sample aperture and velocity profiles generated by our simulator). A grasp is planned by first setting the operational space constraints (e.g., points of contact of fingers on the object) and then finding the arm-hand configuration that fulfills the constraints. The latter is the inverse kinematics problem. The simulator solves the inverse kinematics problem by simulated gradient descent with noise added to the gradient (see Appendix 11.1.2 for a grasp planning example). Once the hand-arm configuration is determined for a grasp action, the trajectory is generated by warping time using a cubic spline. The parameters of the spline are fixed and were determined empirically to satisfy the aperture and velocity profile requirements. Within the simulator, the target identity, position and size can be adjusted manually using a GUI, or automatically by the simulator as, for example, in training set generation.
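The two planning steps just described (inverse kinematics by noisy gradient descent, then a cubic time warp) can be illustrated on a toy problem. The sketch below is not the simulator: it assumes a planar two-link arm instead of the 19-DOF arm/hand, a numerical gradient, Gaussian noise that is annealed to zero, and the cubic warp 3u^2 - 2u^3 standing in for the empirically tuned spline, whose parameters are not given here. All names and constants are illustrative.

```python
import math
import random

L1, L2 = 1.0, 1.0  # link lengths of the toy planar arm (assumption)

def end_point(q):
    """Forward kinematics: joint angles (shoulder, elbow) -> end-point (x, y)."""
    x = L1 * math.cos(q[0]) + L2 * math.cos(q[0] + q[1])
    y = L1 * math.sin(q[0]) + L2 * math.sin(q[0] + q[1])
    return x, y

def cost(q, target):
    """Squared distance between the end-point and the desired contact point."""
    x, y = end_point(q)
    return (x - target[0]) ** 2 + (y - target[1]) ** 2

def noisy_gradient_descent(q0, target, steps=2000, rate=0.05, sigma=0.2, seed=0):
    """Gradient descent on cost() with Gaussian noise added to the gradient."""
    rng = random.Random(seed)
    q = list(q0)
    eps = 1e-6
    for k in range(steps):
        anneal = 1.0 - k / steps  # noise fades out so the search can settle
        grad = []
        for i in range(len(q)):
            plus, minus = q[:], q[:]
            plus[i] += eps
            minus[i] -= eps
            grad.append((cost(plus, target) - cost(minus, target)) / (2 * eps))
        q = [qi - rate * (gi + rng.gauss(0.0, sigma * anneal))
             for qi, gi in zip(q, grad)]
    return q

def warp(u):
    """Cubic time warp on [0, 1]; its derivative 6u(1 - u) has a single peak at u = 0.5."""
    return 3 * u ** 2 - 2 * u ** 3

def trajectory(q_start, q_goal, n=21):
    """Joint-space path from q_start to q_goal sampled at n warped time steps."""
    return [[a + warp(k / (n - 1)) * (b - a) for a, b in zip(q_start, q_goal)]
            for k in range(n)]

q_start = [0.3, 0.3]
target = (1.0, 1.0)
q_goal = noisy_gradient_descent(q_start, target)
path = trajectory(q_start, q_goal)
```

Because the warp starts and ends with zero slope, the joint velocities along `path` rise to a single peak mid-movement and fall back to zero, which is the qualitative bell shape the text requires.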

Figure 3.7 (Left) The final state of arm and hand achieved by the reach/grasp simulator in executing a power grasp on the object shown. (Right) The hand state trajectory read off from the simulated arm and hand during the movement whose end-state is shown at left. The hand state components are: d(t), distance to target at time t; v(t), tangential velocity of the wrist; a(t), Index and thumb finger tip aperture; o1(t), cosine of the angle between the object axis and the (index finger tip – thumb tip) vector; o2(t), cosine of the angle between the object axis and the (index finger knuckle – thumb tip) vector; o3(t), The angle between the thumb and the palm plane; o4(t), The angle between the thumb and the index finger

Figure 3.7 (left) shows the end state of a power grasp, while Figure 3.7 (right) shows the time series for the hand state associated with this simulated power grasp trajectory. For example, the curve labeled d(t) shows the distance from the hand to the object decreasing until the grasp is completed, while the curve labeled a(t) shows how the aperture of the hand first increases to yield a safety margin larger than the size of the object and then decreases until the hand contacts the object.
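To make the geometric components of the hand state concrete, the following sketch computes d, a, o1 and o2 for a single time step from 3D points, following the definitions in the caption of Figure 3.7. It is only an illustration: the function name, the argument names, the use of the wrist as the reference point for the distance to the target, and the sample coordinates are assumptions.

```python
import math

def sub(p, q):
    """Vector from q to p."""
    return tuple(a - b for a, b in zip(p, q))

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def hand_state_geometry(wrist, index_tip, index_knuckle, thumb_tip, target, object_axis):
    return {
        "d": norm(sub(target, wrist)),                          # distance to target
        "a": norm(sub(index_tip, thumb_tip)),                   # index-thumb aperture
        "o1": cosine(object_axis, sub(index_tip, thumb_tip)),   # axis vs. tip-to-tip vector
        "o2": cosine(object_axis, sub(index_knuckle, thumb_tip)),  # axis vs. knuckle-to-tip vector
    }

# Made-up coordinates for one frame.
state = hand_state_geometry(
    wrist=(0.0, 0.0, 0.0),
    index_tip=(1.0, 1.0, 0.0),
    index_knuckle=(0.0, 1.0, 0.0),
    thumb_tip=(1.0, -1.0, 0.0),
    target=(0.0, 0.0, 5.0),
    object_axis=(0.0, 1.0, 0.0),
)
```

Evaluating this function at every time step of a movement would yield the d(t), a(t), o1(t) and o2(t) curves; v(t), o3(t) and o4(t) need the wrist track and the thumb and palm geometry, which are left out of this sketch.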

Figure 3.8 Grasps generated by the simulator. (a) A precision grasp. (b) A power grasp. (c) A side grasp

Figure 3.8(a) shows the virtual hand/arm holding a small cube in a precision grip in which the index finger (or a larger "virtual finger") opposes the thumb. The power grasp (Figure 3.8(b)) is usually applied to big objects and characterized by the hand’s covering the object, with the fingers as one virtual finger opposing the palm as the other. In a side grasp (Figure 3.8(c)), the thumb opposes the side of another finger. To clarify the type of heuristics we use to generate the grasp, Appendix 11.1.2 outlines the grasp planning and execution for a precision pinch.

3.4.2        Grand schema 2: visual analysis of hand state

The Visual Analysis of Hand State schema is a non-neurophysiological implementation of a visual analysis system to validate the extraction of hand parameters from a view of a hand, by recovering the configuration of a model of the hand being seen. The hand model is a three-dimensional kinematic model with 14 degrees of freedom (DOF): a 3-DOF joint for the wrist, two 1-DOF joints (metacarpophalangeal and distal interphalangeal) for each of the four fingers, and, for the thumb, a 1-DOF metacarpophalangeal joint and a 2-DOF carpometacarpal joint. Note the distinction between the ‘hand configuration’, which gives the joint angles of the hand considered in isolation, and the ‘hand state’, which comprises 7 parameters relevant to assessing the motion and preshaping of the hand relative to an object. Thus, the hand configuration provides some, but not all, of the data needed to compute the hand state.
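The joint layout of the hand model can be written down as a small table, which also confirms the total of 14 DOF. The dictionary keys and finger names below are illustrative labels, not identifiers from the implementation.

```python
# Joint layout of the 14-DOF hand model; finger names are illustrative labels.
FINGERS = ("index", "middle", "ring", "little")

hand_model_dof = {"wrist": 3}
for finger in FINGERS:
    hand_model_dof[f"{finger}_mcp"] = 1   # metacarpophalangeal joint
    hand_model_dof[f"{finger}_dip"] = 1   # distal interphalangeal joint
hand_model_dof["thumb_mcp"] = 1           # metacarpophalangeal joint of the thumb
hand_model_dof["thumb_cmc"] = 2           # carpometacarpal joint of the thumb

total_dof = sum(hand_model_dof.values())  # 3 + 4 * 2 + 1 + 2
```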

To lighten the load of building a visual system to recognize hand features, we marked the wrist and the articulation points of the hand with colors. We then used this color-coding to help recognize key portions of the hand and used this result to initiate a process of model matching. Thus, the first step of the vision problem was color segmentation, after which the three dimensional hand shape was recovered.

3.4.2.1              Color segmentation and feature extraction

One needs color segmentation to locate the colored regions on the image. Gray level segmentation techniques cannot be used in a straightforward way because of the vectorial nature of color images (Lambert and Carron 1999). Split-and-Merge is a well-known image segmentation technique in image processing (Sonka et al. 1993), recursively splitting the image into smaller pieces until some homogeneity criterion is satisfied as a basis for reaggregation into regions. In our case, the criterion is having similar color throughout a region. However, RGB (Red-Green-Blue) space is not well suited for this purpose. HSV (Hue-Saturation-Value) space is better suited since hue in segmentation processes usually corresponds to human perception and ignores shading effects (Russ 1998 Chapters 1 and 6). However, the segmentation system we implemented with HSV space, although better than the RGB version, was not satisfactory for our purposes. Therefore, we designed a system that can learn the best color space.
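The claim that hue is largely insensitive to shading can be checked with a small example using Python's standard `colorsys` module. The two RGB triples are made-up values standing for the same colored patch under bright and dim illumination.

```python
import colorsys

lit = (0.8, 0.2, 0.2)     # a red patch under bright light (made-up values)
shaded = (0.4, 0.1, 0.1)  # the same patch at half the brightness

h_lit, s_lit, v_lit = colorsys.rgb_to_hsv(*lit)
h_shaded, s_shaded, v_shaded = colorsys.rgb_to_hsv(*shaded)

# Scaling all three RGB channels by the same factor changes only V;
# hue and saturation stay the same, whereas every RGB channel changes.
same_hue = h_lit == h_shaded
```

In RGB space the two samples are far apart, which is why a homogeneity criterion based on raw RGB distance tends to split a shaded patch into several regions.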

Figure 3.9(a) shows the training phase of the color expert system, which is a (one hidden-layer) feed-forward network with sigmoidal activation function. The learning algorithm is back-propagation with momentum and adaptive learning rate. The given image is put through a smoothing filter to reduce noise in the image before training. Then the network is given around 100 training samples each of which is a vector of ((R, G, B), perceived color code) values. The output color code is a vector consisting of all zeros except for one component corresponding to the perceived color of the patch. The training builds an internal non-linear color space from which it can unambiguously tell the perceived color. This training is done only at the beginning of a session to learn the colors used on the particular hand. Then the network is fixed as the hand is viewed in a variety of poses.

 

Figure 3.9 (a) Training the color expert, based on colored images of a hand whose joints are covered with distinctively colored patches. The trained network will be used in the subsequent phase for segmenting images. (b) A hand image (not from the training sample) is fed to the augmented segmentation program. The color decision during segmentation is made by consulting the Color Expert. Note that a smoothing step (not shown) is performed before segmentation

Figure 3.9(b) illustrates the actual segmentation process using the ‘color expert’ to find each region of a single (perceived) color (see Appendix 11.1.1 for details). The output of the algorithm is then converted into a feature vector with a corresponding confidence vector giving a confidence level for each component in the feature vector. Each finger is marked with two patches of the same color. Sometimes it may not be possible to determine which patch corresponds to the fingertip and which to the knuckle. In those cases, the confidence value is set to 0.5. If a color is not found (e.g., the patch may be obscured), a zero value is given for the confidence. If a unique color is found without any ambiguity then the confidence value is set to 1. The segmented centers of regions (color markers) are taken as the approximate articulation point positions. To convert the absolute color centers into a feature vector we simply subtract the wrist position from all the centers found and put the resulting relative (x,y) coordinate into the feature vector (but the wrist is excluded from the feature vector as the positions are specified with respect to the wrist position).
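The construction of the feature and confidence vectors can be sketched as follows. The input format (a list of named markers, each with an image position or `None` and an ambiguity flag) and the choice to store zeros for a missing marker are assumptions for the example; only the confidence values 1, 0.5 and 0 and the subtraction of the wrist position come from the description above.

```python
def build_feature_vector(wrist, markers):
    """Build wrist-relative features and per-component confidences.

    wrist: (x, y) image position of the wrist marker.
    markers: list of (name, center, ambiguous), where center is (x, y)
             or None if the color was not found (hypothetical format).
    """
    features, confidence = [], []
    for _name, center, ambiguous in markers:
        if center is None:
            rel, conf = (0.0, 0.0), 0.0          # marker not found (e.g. occluded)
        else:
            rel = (center[0] - wrist[0], center[1] - wrist[1])
            conf = 0.5 if ambiguous else 1.0     # tip/knuckle ambiguity halves confidence
        features.extend(rel)
        confidence.extend([conf, conf])          # one confidence per coordinate
    return features, confidence

# Made-up pixel coordinates.
wrist = (100, 200)
markers = [
    ("index_tip", (130, 120), False),
    ("index_knuckle", (120, 160), False),
    ("middle_tip", (110, 110), True),       # the two middle-finger patches
    ("middle_knuckle", (112, 150), True),   # could not be told apart
    ("thumb_tip", None, False),             # patch obscured
]
features, confidence = build_feature_vector(wrist, markers)
```

The wrist itself does not appear in `features`, since all positions are expressed relative to it.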

3.4.2.2              3D hand model matching

Our model matching algorithm uses the feature vector generated by the segmentation system to attain a hand configuration and pose that would result in a feature vector as close as possible to the input feature vector (Figure 3.10). The scheme we use is a simplified version of Lowe’s (1991); see Holden (1997) for a review of other hand recognition studies.

Figure 3.10 Illustration of the model matching system. Left: markers located by feature extraction schema. Middle and Right: initial and final stages of model matching. After matching is performed a number of parameters for the Hand configuration are extracted from the matched 3D model

The matching algorithm is based on minimization of the distance between the input feature vector and the model feature vector, where the distance is a function of the two vectors and the confidence vector generated by the segmentation system. Distance minimization is realized by hill climbing in feature space. The method can handle occlusions by starting with ‘don't cares’ for any joints whose markers cannot be clearly distinguished in the current view of the hand.
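A toy version of this matching procedure is sketched below. It is not the hand matcher itself: the model is a planar two-joint finger rather than the 14-DOF hand, the distance is assumed to be a Euclidean distance in which each term is weighted by the product of the two confidences, and the search is a simple coordinate-wise hill climb with step halving. Function names and parameter values are illustrative.

```python
import math

def model_features(params):
    """Marker positions (knuckle-side marker, then fingertip) of a toy planar finger."""
    t1, t2 = params
    kx, ky = math.cos(t1), math.sin(t1)
    return [kx, ky, kx + math.cos(t1 + t2), ky + math.sin(t1 + t2)]

def distance(f, g, cf, cg):
    """Assumed form: Euclidean distance with each term weighted by both confidences."""
    return math.sqrt(sum(a * b * (x - y) ** 2 for x, y, a, b in zip(f, g, cf, cg)))

def match_model(observed, conf, start, step=0.5, tol=1e-6):
    """Coordinate-wise hill climbing on the model parameters."""
    model_conf = [1.0] * len(observed)  # the model's own features are always known
    params = list(start)
    best = distance(observed, model_features(params), conf, model_conf)
    while step > tol:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = params[:]
                trial[i] += delta
                d = distance(observed, model_features(trial), conf, model_conf)
                if d < best:  # accept only strict improvements
                    params, best, improved = trial, d, True
        if not improved:
            step /= 2         # refine the search once no neighbor is better
    return params, best

true_params = [0.4, 0.3]
observed = model_features(true_params)
fit_all, err_all = match_model(observed, [1.0] * 4, start=[0.0, 0.0])

# Fingertip marker occluded: its confidence is 0, so it acts as a 'don't care'.
occluded = observed[:2] + [0.0, 0.0]
fit_occ, err_occ = match_model(occluded, [1.0, 1.0, 0.0, 0.0], start=[0.0, 0.0])
```

With all markers visible both joint angles are recovered. With the fingertip occluded, the first angle is still recovered while the second stays at its starting value, because no change to it can reduce the weighted distance.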

The distance between two feature vectors F and G is computed as follows:

where subscripting denotes components and Cf, Cg denote the confidence vectors associated with F and G. Given this result of the visual processing (our hand shape recognition schema), we can read off the following components of the hand state, F(t):

a(t): aperture of the virtual fingers involved in grasping

o3(t), o4(t): the two angles defining how close the thumb is to the hand as measured relative to the side of the hand and to the inner surface of the palm (see Figure 3.4).

The remaining components can easily be computed once the object affordance and location are known. The computation of the components:

d(t): distance to target at time t, and

v(t): tangential velocity of the wrist

o1(t): Angle between the object axis and the (index finger tip – thumb tip) vector

o2(t): Angle between the object axis and the (index finger knuckle – thumb tip) vector

constitute the tasks of the hand-object spatial relation analysis schema and the hand motion detection schema. These require visual inspection of the relation between hand and target, and visual detection of wrist motion, respectively. Section 3.5.3 justifies the visual analysis of hand state schema by showing MNS performance when the hand state was extracted by the described visual recognition system from a real video sequence. However, when we turn to modeling the Core Mirror Circuit in the next section, we will not use this implementation of visual analysis of hand state but instead, to simplify computation, we will use synthetic output generated by the reach/grasp simulator to emulate the values that could be extracted with this visual system. Specifically, we use the reach/grasp simulator to produce both (i) the visual appearance of such a movement for our inspection (Figure 3.7, left), and (ii) the hand state trajectory associated with the movement (Figure 3.7, right). In particular, training requires generating and processing a large number of grasp actions, which makes it impractical to use the visual processing system without special hardware, as the computation time required is too high. Nevertheless, we need to show the similarity of the data from the visual system and the simulator. We have already shown that the grasp simulator generates aperture and velocity profiles that are similar to those in real grasps. Of course, there is still the question of how well our visual system can extract these features and, more importantly, how similar the other components of the hand state are, which we did not specifically craft to match the real data. Positive evidence will be presented in Section 3.5.3.

3.4.3        Grand Schema 3: core mirror circuit

As diagrammed in Figure 3.6(b), our detailed analysis of the core mirror circuit does not require simulation of the visual analysis of hand state and of reach and grasp so long as we ensure that it receives the appropriate inputs. Thus, we supply the object affordance and grasp command directly to the network at each trial. (Actually, we conduct experiments to compare performance with and without an explicit input which codes object affordance.) For the hand state input, rather than providing visual input to the visual analysis of hand state schema and having it compute the hand state input to the core mirror circuit, we use our reach and grasp simulator to simulate the performance of the observed primate – and from this simulation we extract (as in Figure 3.7) both a graphical display of the arm and hand movement that would be seen by the observing monkey, and the hand state trajectory that would be generated in its brain. We thus use the time-varying hand state trajectory generated in this way to provide the input to the model of the core mirror circuit of the observing monkey without having to simultaneously model its visual analysis of hand state. Thus, we have implemented the core mirror circuit in terms of neural networks using as input the synthetic data on hand state that we gather from our reach and grasp simulator (however, see Section 3.5.3 for a simulation with real data extracted by our visual system). Figure 3.13 shows an example of the recognition process together with the type of information supplied by the simulator.

3.4.3.1              Neural network details

In our implementation, we used a feed-forward neural network with one hidden layer. In contrast to the previous sections, we can here identify the parts of the neural network as Figure 3.5 schemas in a one-to-one fashion. The hidden layer of the model neural network corresponds to the object affordance-hand state association schema, while the output layer of the network corresponds to the action recognition schema (i.e., we identify the output neurons with the F5 mirror neurons). In the following formulation MR (mirror response) represents the output of the action recognition schema, MP (motor program) denotes the target of the network (copy of the output of motor program (grasp) schema). X denotes the input vector applied to the network, which is the transformed Hand State (and the object affordance). The transformation applied is described in the next subsection. The learning algorithm used is back propagation (Rumelhart et al. 1986) with momentum term. The formulation is adapted from (Hertz et al. 1991).

Activity propagation (forward pass)

Learning weights from input to hidden layer

Learning weights from hidden to output layer

The squashing function g we used was g(x) = 1/(1 + e^(-x)). η and α are the learning rate and the momentum coefficient, respectively. In our simulations, we adapted η during training such that if the output error was consistently decreasing then we increased η; otherwise, we decreased η. We kept α constant at 0.9. W is the 3x(6+1) matrix of real numbers representing the hidden-to-output weights. w is the 6x(210+1) (6x(220+1) in the explicit affordance coding case) matrix of real numbers representing the input-to-hidden weights, and X is the 210+1 (220+1 in the explicit affordance coding case) component input vector representing the hand state (trajectory) information. (The extra +1 comes from the fact that the formulation we used hides the bias term required for computing the output of a unit in the incoming signals, as a fixed input clamped to 1.)
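The forward pass and the two weight updates named above can be illustrated in code. The following is a minimal sketch of standard back-propagation with momentum for a network of the stated size, not the implementation used in the simulations. It assumes a logistic squashing function, a summed squared error between MP and MR, batch updates, a learning rate called `eta` and a momentum coefficient called `alpha` (fixed at 0.9 as in the text). The class name, weight initialization range, initial learning rate, the factors 1.05 and 0.5, and the reading of "consistently decreasing" as three decreases in a row are all assumptions. The training patterns are random placeholders, not hand state data.

```python
import math
import random

def g(x):
    """Logistic squashing function (assumed)."""
    return 1.0 / (1.0 + math.exp(-x))

class CoreMirrorNet:
    def __init__(self, n_in=210, n_hidden=6, n_out=3, seed=0):
        rng = random.Random(seed)
        # w: hidden x (inputs + 1), W: outputs x (hidden + 1); the +1 column is the bias.
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)] for _ in range(n_out)]
        self.dw = [[0.0] * (n_in + 1) for _ in range(n_hidden)]   # previous updates
        self.dW = [[0.0] * (n_hidden + 1) for _ in range(n_out)]  # (momentum memory)

    def forward(self, x):
        xb = list(x) + [1.0]  # bias as an input clamped to 1
        hidden = [g(sum(a * b for a, b in zip(row, xb))) for row in self.w] + [1.0]
        mr = [g(sum(a * b for a, b in zip(row, hidden))) for row in self.W]
        return xb, hidden, mr

    def train_epoch(self, patterns, eta, alpha=0.9):
        """One batch update; returns the summed squared error before the update."""
        grad_W = [[0.0] * len(r) for r in self.W]
        grad_w = [[0.0] * len(r) for r in self.w]
        error = 0.0
        for x, mp in patterns:
            xb, hidden, mr = self.forward(x)
            error += 0.5 * sum((t - o) ** 2 for t, o in zip(mp, mr))
            d_out = [(t - o) * o * (1.0 - o) for t, o in zip(mp, mr)]
            for i, di in enumerate(d_out):
                for j, hj in enumerate(hidden):
                    grad_W[i][j] += di * hj
            for j in range(len(self.w)):
                back = sum(d_out[i] * self.W[i][j] for i in range(len(self.W)))
                dj = back * hidden[j] * (1.0 - hidden[j])
                for k, xk in enumerate(xb):
                    grad_w[j][k] += dj * xk
        for weights, grads, prev in ((self.W, grad_W, self.dW), (self.w, grad_w, self.dw)):
            for r in range(len(weights)):
                for c in range(len(weights[r])):
                    prev[r][c] = eta * grads[r][c] + alpha * prev[r][c]
                    weights[r][c] += prev[r][c]
        return error

def train(net, patterns, epochs=60, eta=0.05):
    """Adapt eta: grow it after three consecutive decreases, halve it otherwise."""
    errors, streak = [], 0
    for _ in range(epochs):
        errors.append(net.train_epoch(patterns, eta))
        if len(errors) < 2:
            continue
        if errors[-1] < errors[-2]:
            streak += 1
            if streak >= 3:
                eta *= 1.05
        else:
            streak = 0
            eta *= 0.5
    return errors

# Placeholder data: three random 210-component inputs, one per grasp class.
rng = random.Random(1)
patterns = [([rng.random() for _ in range(210)],
             [1.0 if i == k else 0.0 for i in range(3)]) for k in range(3)]
net = CoreMirrorNet()
errors = train(net, patterns)
mirror_response = net.forward(patterns[0][0])[2]
```

In recognition mode only `forward` is used: the three outputs play the role of MR, while the target MP is needed only during learning.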

3.4.3.2              Temporal to spatial transformation

The input to the network was formed so as to encode temporal information without the use of a dynamic neural network, which also solved the time-scaling problem. The input at any time represented the entire input from the start of the action until the present time t. To form the input vector, each of the seven components of the hand state trajectory up to time t is fitted by a cubic spline (see Kincaid and Cheney 1991 for a formulation), and the splines are then sampled at 30 uniformly spaced intervals. The hand state input is then a vector with 210 components: 30 samples from the time-scaled spline fitted to each of the 7 components of the hand-state time series. Note that no matter what fraction t is of the total time T of the entire trajectory, the input to the network at time t comprises 30 samples of the hand state uniformly distributed over the interval [0, t]. Thus the samples are spread more sparsely across the trajectory-to-date as t increases from 0 to T.
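The prefix-stretching step can be sketched as follows. This is a minimal illustration under stated assumptions: linear interpolation stands in for the cubic spline fit, the 30 sample points are taken to include both ends of the interval [0, t], and the function names resample_prefix and hand_state_input are hypothetical.

```python
def resample_prefix(samples, n_out=30):
    """Stretch a trajectory prefix onto n_out uniformly spaced points.

    `samples` holds the values of one hand-state component observed from
    action onset up to the present time t. Linear interpolation is used
    here in place of the cubic spline fit (an assumption of this sketch).
    """
    if not samples:
        raise ValueError("need at least one observed sample")
    if len(samples) == 1:
        return [samples[0]] * n_out
    last = len(samples) - 1
    out = []
    for k in range(n_out):
        pos = k * last / (n_out - 1)   # fractional index into the prefix
        i = min(int(pos), last - 1)    # left neighbour, kept in range
        frac = pos - i
        out.append((1.0 - frac) * samples[i] + frac * samples[i + 1])
    return out


def hand_state_input(trajectory, n_out=30):
    """Concatenate the resampled components into one spatial input vector.

    `trajectory` is a list of per-component sample lists; with the seven
    components of the text this gives 7 * 30 = 210 inputs.
    """
    vec = []
    for component in trajectory:
        vec.extend(resample_prefix(component, n_out))
    return vec
```

Whatever the length of the observed prefix, the output always has the same number of components, which is what lets a static network receive it.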

An alternative approach would be to use an SRN (simple recurrent neural network) style architecture to recognize hand state trajectories. However, this would require an extra quantization or segmentation step to convert the continuous hand state trajectories to discrete states. With our approach, we avoid this extra step because the quantization is implicitly handled by the learning process.

Figure 3.11 The scaling of an incomplete input to form the full spatial representation of the hand state. As an example, only one component of the hand state, the aperture, is shown. When 66 percent of the action is completed, the pre-processing we apply effectively causes the network to receive the stretched hand state (the dotted curve) as input, as a re-representation of the hand state information accessible up to that time (represented by the solid curve; the dashed curve shows the remaining, unobserved part of the hand state).

Figure 3.11 demonstrates the preprocessing we use to transform time-varying hand state components into a spatial code. In the figure, only a single component (the aperture) is shown as an example. The curve drawn by the solid line indicates the information available when 66% of the grasp action is completed. In reality a digital computer (and thus the simulator) runs in discrete time steps, so we construct the continuous curve by fitting a cubic spline to the collected samples for the value represented (the aperture value in this case). Then we resample 30 points from the (solid) curve to form a vector of size 30. In effect, this presents the network with the stretched spline shown by the dotted curve. This method has the desirable property of avoiding the time-scaling problem: it establishes the equivalence of actions of different durations, as is the case for a grasp of an object far from the hand compared with a grasp of a closer object. By comparing the dotted curve (what the network sees at t = 0.66) with the ‘solid + dashed’ curve (the overall trajectory of the aperture) we can see how much the network’s input is distorted. As the action gets closer to its end, the discrepancy between the curves tends to zero. Thus, our preprocessing gives rise to an approximation to the final representation once a certain portion of the input has been seen. Figure 3.12 samples the temporal evolution of the spatial input the network receives.

Figure 3.12 The solid curve shows the effective input that the network receives as the action progresses. At each simulation cycle the scaled curves are sampled (30 samples each) to form the spatial input for the network. Towards the end of the action the network’s input gets closer to the final hand state.

3.4.3.3              Neural network training

The training set was constructed by making the simulator perform various grasps in the following way.

(1) The objects used were a cube of changing size (a generic size cube scaled by a random scale factor between 0.5 and 1.5), a disk (approximated as a thin prism), and a ball (approximated as a dodecahedron), the latter again scaled randomly by a factor between 0.75 and 1.5. In this particular trial, we did not change the disk size. In forming the training set, a given object always received the same grasp (unlike the testing case).

(2) The target locations were chosen from a surface patch of a sphere centered on the shoulder joint. The patch is defined by bounding meridian (longitude) and parallel (latitude) lines. The extent of the meridian and parallel lines was from -45° to 45°. The step chosen was 15°. Thus the simulator made 7x7 = 49 grasps per object. The unsuccessful grasp attempts were discarded from the training set. For each successful grasp, two negative examples were added to the training set in the following way. For the first, the inputs (a group of 30) for each parameter were randomly shuffled. In this way, the network was forced to learn the order of activity within a group rather than learning the averages of the inputs (note that the shuffling does not change the mean and variance). The second negative pattern was used to stress that the distance to target was important: the target location was perturbed and the grasp was repeated (to the original target position).
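The grid of target locations and the shuffled negative example described in (2) can be sketched as follows. The function names target_grid and shuffled_negative are hypothetical, and the sketch assumes the input vector is laid out as consecutive groups of 30 samples, one group per hand state component; how the original simulator ordered its inputs is not stated.

```python
import random


def target_grid(lo=-45, hi=45, step=15):
    """Meridian/parallel angle pairs (in degrees) covering the patch."""
    angles = list(range(lo, hi + 1, step))
    return [(lon, lat) for lon in angles for lat in angles]


def shuffled_negative(input_vec, group=30, rng=random):
    """Shuffle the samples inside each consecutive group of `group` values.

    The mean and variance of every group are unchanged; only the temporal
    order within the group is destroyed.
    """
    neg = []
    for start in range(0, len(input_vec), group):
        chunk = list(input_vec[start:start + group])
        rng.shuffle(chunk)
        neg.extend(chunk)
    return neg
```

With the default arguments, target_grid() returns the 7x7 = 49 angle pairs, and shuffled_negative keeps each group's set of values intact, so a network can only separate the negative example from the original by using sample order.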

Finally, our last modification to the back-propagation training algorithm was to introduce a random input pattern (totally random; no shuffling) on the fly during training and ask the network to produce zero output for those patterns. In this way we not only biased the network to be as silent as possible during ambiguous input presentation but also gave the network a higher chance of reaching a global minimum.

It should be emphasized that the network was trained using the complete trajectory of the hand state (analogous to adjusting synapses after the self-grasp is completed). During testing, in contrast, the prefixes of a trajectory were used (analogous to the predictive response of mirror neurons while observing a grasp action). The network thus yielded a time-course of activation for the mirror neurons. As we shall see in the Results section, initial prefixes yield little or no mirror neuron activity, and ambiguous prefixes may yield transient activity of the ‘wrong’ mirror neurons.

We thus need to make two points to highlight the contribution of this study:

(1) It is, of course, trivial to train a network to pair complete trajectories with the final grasp type. What is interesting here is that we can train the system on the basis of final grasp but then observe the whole time course of mirror neuron activity, yielding predictions for neurophysiological experiments by highlighting the importance of the timing of mirror neuron activity.

(2) It is commonly understood that the training method used here, namely back-propagation, is not intended to be a model of the cellular learning mechanisms employed in cerebral cortex. This might be a matter of concern were we intending to model the time course of learning, or analyze the effect of specific patterns of neural activity or neuromodulation on the learning process. However, our aim here is quite different: we want to show that the connectivity of mirror neuron circuitry can be established through training, and that the resultant network can exhibit a range of novel, physiologically interesting, behaviors during the process of action recognition. Thus, the actual choice of training procedure is purely a matter of computational convenience, and the fact that the method chosen is non-physiological does not weaken the importance of our predictions concerning the timing of mirror neuron activity.

3.5         Simulation results

In this study, we experimented with two types of network. The first has only the hand state as the network input. We call this version the non-explicit affordance coding network, since the hand state will often imply the object affordance in our simple grasp world. The second network we experimented with, the explicit affordance coding network, has affordance coding as one set of its inputs. In each case the network had 6 hidden layer units and 3 output units, each output unit corresponding to a recognized grasp.

3.5.1        Non-explicit affordance coding experiments

We first present results with the MNS model implemented without an explicit object affordance input to the core mirror circuit. We then study the effects of supplying an explicit object affordance input.

3.5.1.1              Grasp resolution

In Figure 3.13, we let the (trained) model observe a grasp action. Figure 3.13(a) demonstrates the executed grasp by giving the views from three different angles to show the reader the 3D trajectory traversed. Figure 3.13(b) shows the extracted hand state (left) and the response of the (trained) core mirror network (right). In this example, the network was able to infer the correct grasp without any ambiguity, as a single curve corresponding to the observed grasp reaches a peak and the other two units’ outputs are close to zero during the whole action. The horizontal axis for both figures is such that the onset of the action and the completion of the grasp are scaled to 0 and 1 respectively. The vertical axis in the hand state plot represents a normalized (min=0, max=1) value for the components of the hand state, whereas the output plot represents the average firing rate of the neurons (no firing = 0, maximum firing = 1). The plotting scheme that is used in Figure 3.13 will be used in later simulation results as well.

Figure 3.13 (a) A single grasp trajectory viewed from three different angles to clearly show its 3D pattern. The wrist trajectory during the grasp is marked by square traces, with the distance between any two consecutive trace marks traveled in equal time intervals. (b) Left: The input to the network. Each component of the hand state is labelled. (b) Right: How the network classifies the action as a power grasp: squares: power grasp output; triangles: precision grasp; circles: side grasp output. Note that the response for precision and side grasp is almost zero

It is often impossible (even for humans) to classify a grasp at a very early phase of the action. For example, the initial phases of a power grasp and precision grasp can be very similar. Figure 3.14 demonstrates this situation, where the model changes its decision during the action and finally reaches the correct result towards the end of the action. To create this result we used the "outer limit" of the precision grasp by having the model perform a precision grasp for a wide object (using the wide opposition axis). Moreover, the network had not been trained using this object for precision grasp. In Figure 3.14(b), the curves for power and precision grips cross towards the end of the action, which shows the change of decision of the network.

Figure 3.14 Power and precision grasp resolution. The conventions used are as in the previous figure. (a) The curves for power and precision cross towards the end of the action showing the change of decision of the network. (b) The left shows the initial configuration and the right shows the final configuration of the hand

3.5.1.2              Spatial perturbation

We next analyze how the model performs if the observed grasp action does not meet the object. Since we constructed the training set to stress the importance of distance from hand to object, we expected that network response would decrease with increased perturbation of target location.

Figure 3.15: (Top) Strong precision grip mirror response for a reaching movement with a precision pinch. (Bottom) Spatial location perturbation experiment. The mirror response is greatly reduced when the grasp is not directed at a target object. (Only the precision grasp related activity is plotted. The other two outputs are negligible.)

Figure 3.15 shows an example of such a case. However, the network's performance was not homogeneous over the workspace: for some parts of the space the network would yield a strong mirror response even with comparatively large perturbation. This could be due to the small size of the training set. Interestingly, however, the network’s response had some specificity in terms of the direction of the perturbation. If the object’s perturbation direction was similar to the direction of hand motion, then the network was more likely to disregard the perturbation (since the trajectory prefix would then approximate a prefix of a valid trajectory) and signal a good grasp. Note that the network reduces its output rate as the perturbation increases; however, the decrease is not linear, and after a critical point the output drops sharply to zero. The critical perturbation level also depends on the position in space.

3.5.1.3              Altered kinematics

Normally, the simulator produces bell-shaped velocity profiles along the trajectory of the wrist. In our next experiment, we tested action recognition by the network for an aberrant trajectory generated with constant arm joint velocities.

Figure 3.16 Altered kinematics experiment. Left: The simulator executes the grasp with bell-shaped velocity profile. Right: The simulator executes the same grasp with constant velocity. Top row shows the graphical representation of the grasps and the bottom row shows the corresponding output of the network. (Only the precision grasp related activity is plotted. The other two outputs are negligible.)

The change in the kinematics does not change the path generated by the wrist. However, the trajectory (i.e., time course along the path) is changed and the network is capable of detecting this change (Figure 3.16). The notable point is that the network acquired this property without our explicit intervention (i.e. the training set did not include any negative samples for altered velocity profiles). This is because the input to the network at any time comprises 30 evenly spaced samples of the trajectory up to that time. Thus, changes in velocity can change the pattern of change exhibited across those 30 samples. The extent of this property is again dependent on spatial location.
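Why evenly spaced samples are sensitive to the velocity profile can be seen in a one-dimensional sketch. The minimum-jerk polynomial used below for the bell-shaped profile is an assumption made for illustration (the simulator's actual profile is not specified here), and the aberrant case is simplified to constant speed along the path rather than constant joint velocities. All function names are hypothetical.

```python
def bell_profile(tau):
    """Normalized path position for a bell-shaped speed profile.

    Minimum-jerk polynomial; an assumption of this sketch.
    """
    return 10 * tau ** 3 - 15 * tau ** 4 + 6 * tau ** 5


def constant_profile(tau):
    """Normalized path position when the speed along the path is constant."""
    return tau


def sample_profile(profile, n=30):
    """Position along the path at n evenly spaced times in [0, 1]."""
    return [profile(k / (n - 1)) for k in range(n)]


bell = sample_profile(bell_profile)
const = sample_profile(constant_profile)
# Same path and same end points, but a different pattern across the samples.
largest_gap = max(abs(b - c) for b, c in zip(bell, const))
```

Both sample vectors start at 0 and end at 1, yet they differ in between, so an input built from 30 evenly timed samples carries the velocity profile even though the path is unchanged.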

It must be stressed that all the virtual experiments presented in this section used a single trained network. No new training samples were added to the training set for any virtual experiment.

3.5.1.4              Grasp and object axes mismatch

The last virtual experiment we present with non-explicit affordance coding explores the model’s behavior when the object opposition axis does not match the hand opposition axis. This example emphasizes that the response of the network is affected by the opposition axis of the object being grasped. Figure 3.17 shows the axis orientation change for the object and the effect of this perturbation on the output of the network. The arm simulator first performed a precision grasp to a thin cylinder. The mirror neuron model’s response to this action observation is shown in Figure 3.17, leftmost panel. As can be seen from the plot, the network confidently activated the mirror neuron coding precision grip. The middle panel shows the output of the network when the object is changed to a flat plate but the kinematics of the hand are kept the same. The response of the network declined to almost zero in this case. This is an extreme example: the objects in Figure 3.17 (rightmost panel) have opposition axes 90° apart, enabling the network to detect the mismatch between the hand (action) and the object. With a smaller change in the axis the network would give a higher response and, if the opposition axes of the objects were coincident, the network would respond to both actions (with different levels of confidence depending on other parameters).

Figure 3.17 Grasp and object axes mismatch experiment. Rightmost: the change of the object from cylinder to a plate (an object axis change of 90 degrees). Leftmost: the output of the network before the change (the network turns on the precision grip mirror neuron). Middle: the output of the network after the object change. (Only the precision grasp related activity is plotted. The other two outputs are negligible.)

3.5.2        Explicit affordance coding experiments

Now we switch our attention to the explicit affordance coding network. Here we want to see the effect of object affordance on the model’s behavior. The new model is similar to that given before except that it not only has inputs encoding the current prefix of the hand state trajectory (which includes hand-object relations), but also has a constant input encoding the relevant affordance of the object under current scrutiny. Thus, both the training of the network, and the performance of the trained network will exhibit effects of this additional, affordance, input.

Due to the simple nature of the objects studied here, the affordance coding used in the present study only encodes the object size. In general, one object will have multiple affordances. The ambiguity then would be solved using extra cues such as the contextual state of the network. We chose a coarse coding of object size with 10 units. Each unit has a preferred value; the firing of a unit is determined by the difference of the preferred value and the value being encoded. The difference is passed through a non-linear decay function by which the input is limited to the 0 to 1 range (the larger the difference, the smaller the firing rate). Thus, the explicit affordance coding network has 220 inputs (210 hand state inputs, plus 10 units coarse coding the size). The number of hidden layer units was again chosen as 6 and there were again 3 output units, each one corresponding to a recognized grasp.
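The coarse coding of object size can be sketched as follows. The text specifies only that there are 10 units, each with a preferred value, and that a non-linear decay of the difference limits the firing to the 0 to 1 range; the Gaussian decay, the spacing of preferred values over 0.5 to 1.5, the width of 0.15 and the function name coarse_code_size are all assumptions of this sketch.

```python
import math


def coarse_code_size(size, lo=0.5, hi=1.5, n_units=10, width=0.15):
    """Encode a scalar object size as the firing rates of n_units units.

    Preferred values are spread uniformly over [lo, hi]. A Gaussian of the
    difference between preferred and encoded value serves as the decay
    function (an assumption), so every rate lies between 0 and 1 and
    falls off as the difference grows.
    """
    step = (hi - lo) / (n_units - 1)
    preferred = [lo + k * step for k in range(n_units)]
    return [math.exp(-((size - p) / width) ** 2) for p in preferred]
```

Appending these 10 values to the 210 hand state inputs gives the 220-component input of the explicit affordance coding network.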

We have seen that the MNS model without explicit affordance input displayed a biasing effect of object size in the Grasp resolution subsection of Section 3.5.1; the network was biased toward power grasp while observing a wide precision pinch grasp (the network initially responded with a power grasp activity even though the action was a precision grasp). The model with full affordance replicates the grasp resolution behavior seen in Figure 3.14. However, we can now go further and ask how the temporal behavior of the model with explicit affordance coding reflects the fact that object information is available throughout the action. Intuitively, one would expect that the object affordance would speed up the grasp resolution process (which is actually the case, as will be shown in Figure 3.19).

In the following two subsections we look at the effect of affordance information in two cases: (i) where we study the response to precision pinch trajectories appropriate to a range of object sizes; and (ii) where on each trial we use the same time-varying hand state trajectory but modify the object affordance part of the input. In each case, we are studying the response of a network that has been previously trained on a set of normal hand-state trajectories coupled with the corresponding object affordance (size) encoding.

3.5.2.1              Temporal effects of explicit affordance coding

To observe the temporal effects of supplying the model with explicit affordance coding, we chose a range of object sizes, and then for each size drove the (previously trained) network with both affordance (object size) information and the hand-state trajectory of a precision pinch grasp appropriate to that size of object. For each case we looked at the model’s response. Figure 3.18 shows the resultant level of mirror responses for 4 cases (tiny, small, medium, big objects). The filled circles indicate the precision activity while the empty squares indicate the power grasp related activity. When the object to be grasped is small, the model turns on the precision mirror response more quickly and with no ambiguity (Figure 3.18, top two panels). The vertical bar drawn at time 0.6 shows the temporal effect of object size (affordance). The curves representing the precision grasps are shifted towards the end (time = 1) as the object size gets bigger. Our interpretation is that the model gained the property of predicting that a small object is more likely to be grasped with a precision pinch than with a power grasp. Thus the larger the object, the more of the trajectory had to be seen before a confident estimation could be made that it was indeed leading to a precision pinch. In addition, as we indicated earlier, the explicit affordance coding network displays the grasp resolution behavior during the observation of a precision grip being applied to large objects (Figure 3.18, bottom two panels: the graph labeled big object grasp and, to a lesser degree, the graph labeled medium object grasp).

 

[Figure 3.18 panels: Tiny object grasp; Small object grasp; Medium object grasp; Big object grasp]

 

Figure 3.18 The plots show the level of mirror responses of the explicit affordance coding network for an observed precision pinch for four cases (tiny, small, medium, big objects). The filled circles indicate the precision activity while the empty squares indicate the power grasp related activity.

We also compared the general response time of the non-explicit affordance coding implementation with the explicit coding implementation. The network with affordance input is faster to respond than the previous one.

Figure 3.19 The solid curve: the precision grasp output for the non-explicit affordance case, directed to a tiny object. The dashed curve: the precision grasp output of the model for the explicit affordance case, for the same object.

Moreover, it appears that, when affordance and grasp type are well correlated, having access to the object affordance from the beginning of the action not only lets the system make better predictions but also smooths out the neuron responses. Figure 3.19 summarizes this: it shows the precision response of both the explicit and non-explicit affordance cases for a tiny object (dashed and solid curves respectively).

Figure 3.20: Empty squares indicate the precision grasp related cell activity, while the filled squares represent the power grasp related cell activity. The graphs show the effect of changing the object affordance while keeping a constant hand state trajectory. In each case, the hand-state trajectory provided to the network is appropriate to the medium-sized object, but the affordance input to the network encodes the size shown. In the case of the biggest object affordance, the effect is enough to overwhelm the hand state’s precision bias.

3.5.2.2              Teasing apart the hand state and object affordance components

We now look at the case where the hand state trajectory is incompatible with the affordance of the observed object. In Figure 3.20, the plot labeled medium object shows the system output for a precision grasp directed to a medium-sized object whose affordance is supplied to the network. We then repeatedly input the hand state trajectory generated for this particular action, but in each trial use an object affordance discordant with the observed trajectory (i.e., using a reduced or increased size of the object). The plots in Figure 3.20 show the change of the output of the model due to the change in the affordance. The results shown in these plots tell us two things. First, the recognition process becomes fuzzier as the object gets bigger, because larger object sizes bias the network towards the power grasp. In the extreme case the object affordance can even overwhelm the hand state and switch the network decision to power grasp (Figure 3.20, graph labeled biggest object). Moreover, for large objects, the large discrepancy between the observed hand state trajectory and the size of the object prevents the network from converging on a confident assessment for either grasp.

Secondly, the resolution point (the crossing-point of the precision and power curves) shows an interesting temporal behavior. It may be intuitive to think that as the object gets smaller the network’s precision decision gets quicker and quicker (similar to what we have seen in the previous section). However, although this is the case when the object size changes from big to medium, it is not the case when the object size changes from medium to tiny (i.e., the crossing time has a local minimum between the two extreme object sizes, as opposed to being at the tiny object extreme). Our interpretation is that the network learned an implicit parameter related to the absolute value of the difference between the hand aperture and the object size, such that the maximum firing is achieved when the difference is smallest, that is, when the hand trajectory matches best with the object. This would explain why the network has its quickest resolution for a size between the biggest and the smallest sizes.
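The resolution point can be read off two output curves with a few lines of code. This is a minimal sketch: the function name resolution_time is hypothetical, and locating the crossing by linear interpolation between successive samples is an assumption about how such a point might be computed, not a description of how the plotted values were obtained.

```python
def resolution_time(times, precision, power):
    """Time of the last upward crossing of the precision output over power.

    `times`, `precision` and `power` are equal-length sequences sampled
    over the normalized action time. Returns None when the precision
    curve never rises above the power curve from below, i.e. when there
    is no resolution point.
    """
    crossing = None
    for i in range(1, len(times)):
        before = precision[i - 1] - power[i - 1]
        after = precision[i] - power[i]
        if before <= 0.0 < after:
            # linear interpolation of the zero crossing between samples
            frac = before / (before - after)
            crossing = times[i - 1] + frac * (times[i] - times[i - 1])
    return crossing
```

Evaluating this function on the output curves for each object size would give the kind of decision switch time versus object size relation discussed here, with None for sizes at which the power output stays on top.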

Figure 3.21 The graph is drawn to show the decision switch time versus object size. The minimum is not at the boundary, that is, the network will detect a precision pinch quickest with a medium object size. Note that the graph does not include a point for "Biggest object" since there is no resolution point in this case (see the final panel of Figure 3.20).

Figure 3.21 shows the time of resolution versus object size in graphical form. We emphasize that the model easily executes the grasp recognition task when hand-state trajectory matches object affordance. We do not include all the results of these control trials, as they are similar to the cases mentioned in the previous section.

3.5.3        Justifying the visual analysis of hand state schema

Before closing the results of this chapter, we would like to present a simulation run using a real video input to justify our claim that hand state can be extracted from real video and used to drive the core mirror circuit.

Figure 3.22 The precision grasp action used to test our visual system is depicted by superimposed frames (not all the frames are shown)

Figure 3.23 The video sequence used to test the visual system is shown together with the 3D hand matching result (over each frame). Again not all the frames are shown

The object affordances are supplied manually, as we did not address object recognition in our visual system. However, the rest of the hand state is extracted by the hand recognition system as described in Section 3.4.2. Figure 3.22 depicts the precision grasp action used as input video for the simulation. The result of the 3D hand matching is illustrated in Figure 3.23. The color extraction is performed as described in the Visual Analysis of Hand State section but is not shown in the figure. It would be very rewarding to perform all our MNS simulations using this system. However, the quality of the video equipment available and the computational power requirements did not allow us to collect enough grasp examples to train the core mirror circuit. Nevertheless, we did test the hand state extracted by our visual system from this real video sequence on the MNS model that had already been trained with the synthetic grasp examples.

Figure 3.24 The plot shows the output of the MNS model when driven by the visual recognition system while observing the action depicted in Figure 3.22. It must be emphasized that the training was performed using the synthetic data from the grasp simulator while testing is performed using the hand state extracted by the visual system only. Dashed line: Side grasp related activity; Solid line: Precision grasp related activity. Power grasp activity is not visible as it coincides with the time axis

Figure 3.24 shows the recognition result when the actual visual recognition system provided the hand state based on the real video sequence shown in Figure 3.23. Although the output of the network did not reach a high level of confidence for any grasp type, we can clearly see that the network favored the precision grasp over the side and power grasps. It is also interesting to note that a competition (this time between the side and precision grasp outputs) took place, similar to the one we saw in Figure 3.14 when the grasp action was ambiguous.

3.6         Discussion and predictions

3.6.1        The hand state hypothesis

Because the mirror neurons within monkey premotor area F5 fire not only when the monkey performs a certain class of actions but also when the monkey observes similar actions, it has been argued that these neurons are crucial for understanding of actions by others. We agree with the importance of this role and indeed have built upon it elsewhere, as we now briefly discuss. Rizzolatti et al. (1996b) used a PET study to show that both grasping observation and object prehension yield highly significant activation in the rostral part of Broca's area (a significant part of the human language system) as compared to the control condition of object observation. Moreover, Massimo Matelli (in Rizzolatti and Arbib 1998) demonstrated a homology between monkey area F5 and area 45 in the human brain (Broca's area comprises areas 44 and 45). Such observations led Rizzolatti and Arbib (1998), building on Rizzolatti et al. (1996a), to formulate:

The Mirror System Hypothesis: Human Broca’s area contains a mirror system for grasping which is homologous to the F5 mirror system of monkey, and this provides the evolutionary basis for language parity - i.e., for an utterance to mean roughly the same for both speaker and hearer. This adds a neural “missing link” to the tradition that roots speech in a prior system for communication based on manual gesture.

Arbib (2001) then refines this hypothesis by showing how evolution might have bridged from an ancestral mirror system to a ‘language ready’ brain via increasingly sophisticated mechanisms for imitation of manual gestures as the basis for similar skills in vocalization and the emergence of protospeech. In some sense, then, the present chapter can be seen as extending these evolutionary concerns back in time. Our central aim was to give a computational account of the monkey mirror system by asking (i) What data must the rest of the brain supply to the mirror system? and (ii) How could the mirror system learn the right associations between classification of its own movements and the movement of others? In seeking to ground the answer to (i) in earlier work on the control of hand movements (Iberall and Arbib 1990) we were led to extend our evolutionary understanding of the mirror system by offering:

The hand state hypothesis: The basic functionality of the F5 mirror system is to elaborate the appropriate feedback – what we call the hand state – for opposition-space based control of manual grasping of an object. Given this functionality, the social role of the F5 mirror system in understanding the actions of others may be seen as an exaptation gained by generalizing from self-hand to other's-hand.

The hand state hypothesis provides a new explanation of the evolution of the ‘social capability’ of mirror neurons, hypothesizing that these neurons first evolved to augment the ‘canonical’ and ‘pure motor’ F5 neurons by providing visual feedback on ‘hand state’, relating the shape of the hand to the shape of the object.

3.6.2        Neurophysiological predictions

We introduced the MNS (Mirror Neuron System) model of F5 and related brain regions as an extension of the FARS model of circuitry for visually-guided grasping of objects that links parietal area AIP with F5 canonical neurons. The MNS model diagrammed in Figure 3.5 includes hypotheses as to how different brain regions may contribute to the functioning of the mirror system. Chapter 6 undertakes the neural implementation of Grasp Learning (area F4, F2 and F5). This chapter focused on the Core Mirror Circuit by aggregating the other functionality into three ‘grand schemas’ - visual analysis of hand state, reach and grasp. Thus we only claim that the core mirror circuit is relevant for neurophysiological predictions. We developed the visual analysis of hand state schema to the point of demonstrating algorithms powerful enough to take actual video input of a hand (though we simplified the problem by using colored patches) and produce hand state information. The reach and grasp schema then represented all the functionality for taking the location and affordance of an object and determining the motion of a hand and arm to grasp it (however, see Chapter 6 for a detailed neural implementation of this circuit grounded in neurophysiology and infant behavior). As the main aim of this chapter was to analyze the core mirror circuit, we showed that if we used the reach and grasp schema to generate an observed arm-hand trajectory (i.e., to represent the reach and grasp generator of the monkey or human being observed), then that simulation could directly supply the corresponding hand-state trajectory; we thus used these data to analyze the core mirror circuit schema (Figure 3.6(b)) in isolation from the visual analysis of hand state. However, note that we have also justified the visual analysis of hand state schema by showing in a simulation that the core mirror circuit can be driven by the proposed vision system without any synthetic data from the reach and grasp schema.

Moreover, the hand state input (whether synthetic or real) was presented to the network in a way that avoids the need for a dynamic neural network. To form the input vector, each of the seven components of the hand state trajectory, up to the present time t, is fitted by a cubic spline. This spline is then sampled at 30 uniformly spaced points; i.e., no matter what fraction t is of the total time T of the entire trajectory, the input to the network at time t comprises 30 samples of the hand state uniformly distributed over the interval [0, t]. The network is trained using the full trajectory of the hand state in a specific grasp; the training set pairs each such hand state history as input with the final grasp type as output. In contrast, when testing the model with various grasp observations, the input to the network was the hand state trajectory available up to that instant. This exactly parallels the way the biological system (the monkey) receives visual (object and hand) information: when the monkey performs a grasp, learning can take place after observation of the complete self-generated visual stimulus. In the observation case, on the other hand, the monkey mirror system predicts the grasp action based on the partial visual stimulus (i.e., before the grasp is completed). The network thus yields a time course of activation for the mirror neurons, and hence predictions for neurophysiological experiments that highlight the importance of the timing of mirror neuron activity. We saw that initial prefixes will yield little or no mirror neuron activity, and ambiguous prefixes may yield transient activity of the ‘wrong’ mirror neurons.
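
The input construction described above can be sketched as follows. This is a minimal illustration in Python, not the code used for the simulations. The function names (natural_cubic_spline, hand_state_input), the choice of natural boundary conditions for the spline, the component-by-component ordering of the output vector, and the use of the last observed frame at or before t as the end of the sampling interval are all assumptions made for the sake of the example.

```python
import bisect


def natural_cubic_spline(xs, ys):
    """Return a function that evaluates the natural cubic spline through (xs, ys).

    Assumption: natural boundary conditions (zero second derivative at both
    ends). The text only specifies "a cubic spline".
    """
    n = len(xs) - 1  # number of intervals
    if n < 1:
        raise ValueError("need at least two points")
    h = [xs[i + 1] - xs[i] for i in range(n)]
    # Right-hand side of the tridiagonal system for the coefficients c[i].
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = (3.0 * (ys[i + 1] - ys[i]) / h[i]
                    - 3.0 * (ys[i] - ys[i - 1]) / h[i - 1])
    # Forward sweep of the tridiagonal solve.
    l = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2.0 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    # Back substitution gives the polynomial coefficients per interval.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2.0 * c[j]) / 3.0
        d[j] = (c[j + 1] - c[j]) / (3.0 * h[j])

    def evaluate(x):
        # Locate the interval containing x, clamped to the valid range.
        j = min(max(bisect.bisect_right(xs, x) - 1, 0), n - 1)
        dx = x - xs[j]
        return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

    return evaluate


def hand_state_input(times, hand_state, t, n_samples=30):
    """Build the fixed-length network input from the trajectory seen up to time t.

    times      -- strictly increasing observation times, starting at 0
    hand_state -- one tuple of components per observation time
    t          -- current time; only observations with time <= t are used
    Returns a flat list: n_samples values for component 0, then component 1, ...
    (this ordering is an assumption).
    """
    k = bisect.bisect_right(times, t)  # number of frames observed so far
    if k < 2:
        raise ValueError("need at least two observed frames to fit a spline")
    ts = times[:k]
    span = ts[-1] - ts[0]
    sample_times = [ts[0] + span * i / (n_samples - 1) for i in range(n_samples)]
    vector = []
    for comp in range(len(hand_state[0])):
        spline = natural_cubic_spline(ts, [frame[comp] for frame in hand_state[:k]])
        vector.extend(spline(s) for s in sample_times)
    return vector
```

Under these assumptions, training would call hand_state_input with t equal to the total time T, while testing would call it for each intermediate t. In both cases a seven-component hand state yields a vector of 7 x 30 = 210 values, however much of the trajectory has been observed.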

Our aim was to show that the connectivity of mirror neuron circuitry can be established through training, and that the resultant network can exhibit a range of novel, physiologically interesting behaviors during the process of action recognition. The actual choice of training procedure is therefore purely a matter of computational convenience, and the fact that the method chosen, namely back-propagation, is non-physiological does not weaken the importance of our predictions concerning the timing of mirror neuron activity.

With this we turn to neurophysiological predictions made in our treatment of the Core Mirror Circuit, namely the ‘grounding assumptions’ concerning the nature of the input patterns received by the circuit and the actual predictions on the timing of mirror neuron activity yielded by our simulations.

Grounding assumptions: The key to the MNS model is the notion of hand state as encompassing data required to determine whether the motion and preshape of a moving hand may be extrapolated to culminate in a grasp appropriate to one of the affordances of the observed object. Basically a mirror neuron must fire if the preshaping of the hand conforms to the grasp type with which the neuron is associated; and the extrapolation of hand state yields a time at which the hand is grasping the object along an axis for which that affordance is appropriate. What we emphasize here is not the specific decomposition of the hand state F(t) into the seven specific components (d(t), v(t), a(t), o1(t), o2(t), o3(t), o4(t)) used in our simulation, but rather that the input neural activity will be a distributed neural code which carries information about the movement of the hand toward the object, the separation of the virtual fingertips and the orientation of different components of the hand relative to the opposition axis in the object. The further claim is that this code will work just as well for measuring how well another monkey’s hand is moving to grasp an object as for observing how the monkey’s own hand is moving to grasp the object, allowing self-observation by the monkey to train a system that can be used for observing the actions of others and recognizing just what those actions are.

We provided experiments comparing the performance of the Core Mirror Circuit with and without explicit affordance information (in this case the size of the object) to strengthen our claim that it is indeed adaptive for the system to have this additional input available, as shown in Figure 3.6(b). Note that the "grasp command" input shown in the figure serves here as a training input and will, of course, play no role in the recognition of actions performed by others.

We have also justified the visual analysis of hand state schema by showing in a simulation that the core mirror circuit can be driven with the visual system we implemented, without requiring the Reach and Grasp simulator to provide synthetic data.

Novel Predictions: Experimental work to date tends to emphasize the actions to be correlated with the activity of each individual mirror neuron, while paying little attention to the temporal dynamics of mirror neuron response. By contrast, our simulations make explicit predictions on how a given (hand state trajectory, affordance) pair will drive the time course of mirror neuron activity – with non-trivial response possibly involving activity of other mirror neurons in addition to those associated with the actual grasp being observed. For example, a grasp with an ambiguous prefix may drive the mirror neurons in such a way that the system will, in certain circumstances, at first give weight to the wrong classification, with only the late stages of the trajectory sufficing for the incorrect mirror neuron to be vanquished.

To obtain this prediction we created a scene in which the observed action consisted of grasping a wide object with a precision pinch (thumb and index finger opposing each other). Usually this grasp is applied to small objects (imagine grasping a pen along its long axis versus grasping it along its thin center axis). The response of our core mirror circuit was interesting. At first, while the action was still in progress, the system recognized the action as a power grasp (which is characterized by enclosing the hand over large objects, e.g. grasping an apple), but as the action progressed the model unit representing precision pinch became active and the power grasp activity started to decline. Eventually the core mirror circuit settled on the precision pinch. This particular prediction is testable and indeed suggests a whole class of experiments. The monkey has to be presented with unusual or ambiguous grasp actions that require a ‘grasp resolution’. For example, the experimenter can grasp a section of banana along its long axis using a precision pinch. We would then expect to see activity from power grasp related mirror cells, followed by a decrease of that activity accompanied by increasing activity from precision pinch related mirror cells.
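
A grasp resolution of this kind could be read off a set of activity time courses, whether simulated output units or recorded cells, along the following lines. This is only an illustrative sketch: the function names, the dictionary layout of the activity data and the activation threshold of 0.5 are assumptions, not part of the model.

```python
def leading_grasp_over_time(activity, threshold=0.5):
    """Label each time step with the most active grasp unit.

    activity  -- dict mapping a grasp name to its list of activations,
                 one value per time step (all lists of equal length)
    threshold -- hypothetical activation level below which no grasp is
                 considered recognized (an assumption for illustration)
    Returns a list with one grasp name, or None, per time step.
    """
    names = list(activity)
    n_steps = len(activity[names[0]])
    labels = []
    for i in range(n_steps):
        best = max(names, key=lambda g: activity[g][i])
        labels.append(best if activity[best][i] >= threshold else None)
    return labels


def grasp_switches(labels):
    """Return (step, previous_grasp, new_grasp) for every change of the leader.

    Steps labeled None (nothing recognized) are skipped, so a switch is only
    reported between two recognized grasps.
    """
    switches = []
    previous = None
    for step, label in enumerate(labels):
        if label is None:
            continue
        if previous is not None and label != previous:
            switches.append((step, previous, label))
        previous = label
    return switches
```

With these helpers, the prediction above corresponds to a label sequence that begins with no recognized grasp, passes through power grasp, and ends with precision pinch, so that grasp_switches reports a single switch from the power grasp unit to the precision pinch unit.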

The other simulations we made lead to different testable predictions, such as the mirror response in the case of a spatial perturbation (showing the monkey a fake grasp in which the hand does not really meet the object) and of altered kinematics (performing the grasp with different kinematics than usual). The former in particular serves as a justification of the model, since it has been reported in the mirror neuron literature that spatial contact of the hand and the object is usually required for the mirror response (Gallese et al. 1996). The altered kinematics result, on the other hand, predicts that an alteration of the kinematics will cause a decrease in the mirror response. We have also noted how a discrepancy between hand state trajectory and object affordance may block or delay the system from classifying the observed movement.

In summary, we have conducted a range of simulation experiments – on grasp resolution, spatial perturbation, altered kinematics, temporal effects of explicit affordance coding, and analysis of compatibility of the hand state to object affordance – which demonstrate that the present model is not only of value in providing an implemented high-level view of the logic of the mirror system, but also serves to provide interesting predictions ripe for neurophysiological testing, as well as suggesting new questions to ask when designing experiments on the mirror system.

 


4          CHAPTER IV: MULTILAYER SUPERVISED HEBBIAN LEARNING AND PROBABILITY CODING

This chapter introduces a learning and data generation model that can be employed in multi-layered circuits. The architecture that we develop in this chapter will be used in the Grasp Learning Models of Chapters 5 and 6. The adaptation of the network weights is performed in a Hebbian fashion based on a reinforcement signal. In the general reinforcement learning framework, the learning problem is formulated as an agent acting in an environment that returns rewards based on the actions of the agent and the state of the environment (Sutton and Barto 1998). By acting, the agent can (and usually does) change the state of the environment. The goal of the agent is to maximize its total reward in the long run, possibly in infinite future