.B COMMENTS ON THE SYMBOL GROUNDING PROBLEM
.R
Andrzej J. Pindor, University of Toronto, Computers and Communications
pindor@utirc.utoronto.ca
.QP
.B Abstract
.R
A solution to the symbol grounding problem proposed by Harnad requires giving a system both linguistic and sensorimotor capacities indistinguishable from those of a human. The symbols are then grounded by the fact that analog sensorimotor projections on transducer surfaces, coming from real-world objects, and subsequently formed sensory invariants of nonarbitrary shape constrain the symbol combinations over and above what is imposed by syntax, and tie the symbols to those real objects. It is argued here that the full sensorimotor capacity may indeed be a crucial factor, since it is capable of providing the symbols (corresponding to language terms) with a deep underlying structure, which creates a network of intricate correlations among them at the level of primitive symbols based on inputs from the transducers. On the other hand, the nonarbitrary shapes of sensory invariants, as well as the analog nature of sensorimotor projections, seem to be of no consequence. Grounding is then seen as coming from this low-level correlation structure and, once known, could in principle be programmed into a system without the need for transducers.
.RE

In a series of papers Stevan Harnad has suggested a solution to the "symbol grounding" problem (Harnad 1990, Harnad 1993, Harnad 1993a). The essence of the problem is that symbols manipulated by digital computers or even neural nets (in SIM or IMP implementations, see Harnad 1993a) do not seem to be about anything in particular - their only meaning comes from the mind of an interpreter. The symbols themselves are manipulated according to syntactic rules, on the basis of their shapes only, these shapes being unrelated to what the symbols can be interpreted as standing for. This lack of meaning of the symbols is, Harnad claims, evident for instance from the fact that one cannot learn Chinese from a Chinese-Chinese dictionary (Harnad 1993a). Consequently, there is no guarantee that a TT-passing system, say in Chinese, really understands Chinese - it may simply be manipulating symbols (Chinese characters) syntactically, without any regard for what these symbols are _about_.

Harnad suggests that the symbols of a system can be grounded if (and only if) the system can pass the Total Turing Test, i.e. have both linguistic _and_ sensorimotor capacity totally indistinguishable from our own. Such a system would have to be equipped with a full range of transducers giving it a complete, as he calls it, robotic capacity. Harnad then proposes a more detailed model describing how the robotic capacity leads to the grounding of symbols. He argues that inputs from the senses (or sensors, for a robot) in the form of analog "sensory projections" connect to symbols of the system (i.e. language terms) through sensory invariants of _nonarbitrary shape_ (these invariants, he suggests, could be extracted from the sensory projections, in the case of a robot, using neural nets). This fact, according to him, puts additional constraints on symbol combinations, over and above syntactic constraints, and results in grounding. Symbols are about the real-world objects whose sensory projection invariants they correspond to (Harnad 1993a).

Before discussing Harnad's suggested solution to the symbol grounding problem it may be appropriate to comment on his use of the term "symbol".
In most of his arguments this word is used in the sense of a computational token capable of being interpreted as being _about something_, i.e. corresponding to a language term. However, he also talks about "...the manipulations of physical 'symbol tokens' on the basis of syntactic rules that operate only on the 'shapes' of the symbols (...), as in a digital computer or its idealization, a Turing machine manipulating, say, 0's and 1's" (Harnad 1993a). This indicates that he considers a digital computer's 0's and 1's also as (perhaps primitive) symbols, out of which the higher-level symbols, capable of being interpreted as corresponding to language terms, are built. This is an important point, relevant to my discussion below of Harnad's stress on the analog nature of 'sensory projections'.

The main idea of Harnad's model, the need for a system to have the full sensorimotor capability of a human being in order for its symbols to be grounded, expresses the fact that the terms of a language we use are not defined by their relationships to (correlations with) other language terms only - they are defined by a cooperation, so to speak, of all the sensory inputs at our disposal. When we say, for instance, "cat", understanding of this term involves all the experiences we have had with cats - through vision, touch, smell, etc. A single-language dictionary (like the Chinese-Chinese one in Harnad's example) can only relate language terms among themselves. Relating them to real-world objects requires full sensorimotor capacity indistinguishable from our own (Harnad 1993). It is no surprise that a TT-passing system which demonstrably does not have such a capacity (for instance Searle's Chinese Room, see Searle 1980) can be suspect with respect to its understanding of the language it seems so expertly to use. After all, we do not expect a person blind from birth to understand how colours influence our interaction with the world, regardless of the amount of verbal explanation. How the sensorimotor inputs lead to the grounding of top-level symbols (i.e. language terms) is another story, and below I criticize two aspects of Harnad's model.

The first aspect which I would like to discuss is his claim of "nonarbitrary shapes" of the sensory invariants extracted from sensory projections from real-world objects onto a system's transducer surfaces. The word "shape" above has to be interpreted in a somewhat generic sense - in the case of senses other than vision and touch it must mean some particular feature of the sensory invariants which is somehow fixed by the nature of the object (phenomenon) it corresponds to. This "nonarbitrariness of shape", in Harnad's eyes, imposes a constraint on the system's symbols assigned to represent such an invariant. To what extent are the shapes of sensorimotor projections 'nonarbitrary'? I will consider below several examples indicating that the shapes of the sensorimotor projections seem to be to a large extent dependent on the physical nature of the transducers, which are, in a sense, the results of evolutionary 'accidents' (naturally optimized within an accessible range of physical parameters) and are thus to a large degree arbitrary.

1. Colours. Colour vision depends on six types of cells in the visual system sensitive to light in various parts of the spectrum (DeValois, Abramov and Jacobs 1966, DeValois and Jacobs 1968). Two of these have to do with the perception of blue and yellow, two with the perception of red and green, and two are sensitive only to the intensity of light within the 'visible range'. The terms 'blue', 'yellow', 'green' and 'red' refer to various ranges of light wavelength covering the visible portion of the spectrum. Now, it is most likely an evolutionary 'accident' how the visible spectrum is divided into these four regions. With a somewhat different chemistry the ranges of sensitivity of the colour cells might have been different, resulting in a different colour perception. One can also conceivably imagine that, had the evolution of the human eye gone somewhat differently, we might have ended up with a colour vision mechanism distinguishing three or five colours. Consequently, sensory projections of real objects, coming from the colour vision system, would have different "colour shapes", which are to a large extent determined by the physical nature of the _transducers_ and not by the objects themselves.
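The point can be made concrete with a minimal numerical sketch (an illustration added in this commentary, not a calculation from Harnad's papers): the "colour shape" of a sensory projection is modelled simply as the vector of responses obtained by integrating the light reaching the transducer surface against each transducer's sensitivity curve; the Gaussian sensitivity curves and their centre wavelengths are purely hypothetical assumptions. One and the same object yields different "colour shapes" under the two assumed sets of transducers, while nothing about the object itself changes.
.DS
import numpy as np

wavelengths = np.linspace(400, 700, 301)          # visible range, in nm

def sensitivity(center, width):
    # Hypothetical Gaussian sensitivity curve of one colour transducer.
    return np.exp(-0.5 * ((wavelengths - center) / width) ** 2)

def colour_shape(spectrum, transducers):
    # The 'colour shape' of a projection: one integrated response per transducer.
    return np.array([np.trapz(spectrum * s, wavelengths) for s in transducers])

# One and the same object: a fixed spectral signal reaching the transducer surface.
object_spectrum = np.exp(-0.5 * ((wavelengths - 580) / 40) ** 2)

# Two evolutionary 'accidents': different (assumed) sets of colour transducers.
transducers_a = [sensitivity(c, 30) for c in (450, 530, 610)]
transducers_b = [sensitivity(c, 25) for c in (430, 490, 560, 630)]

print(colour_shape(object_spectrum, transducers_a))   # one 'colour shape'
print(colour_shape(object_spectrum, transducers_b))   # a different one, same object
.DE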
2. Visual shapes. Due to the nature of the human eye's optics, projections of real objects onto the eye's retina are already distorted - for instance, many straight lines in the outside world project onto the retina as curved lines. In addition, as is well known, these projections are upside down. The fact that we see real-world objects "right way up" is a result of the brain learning to _correlate_ the shapes of sensory projections from the visual system with other sensory projections. If we perform an experiment in which subjects are made to wear glasses inverting the image falling on the retina (so that it now corresponds to the "real", right-way-up orientation), the subjects are at first very confused and have difficulty moving around, grabbing objects, etc. However, after a certain time there seems to be a discontinuous transition to a state in which the subjects report that they see everything "normally" and have no further problems performing tasks requiring vision. Evidently, their brains have learned to _correlate_ the new "shapes" of the sensory projections from the vision system with other sensorimotor projections. A similar effect arises if we try to trace a pattern (say with a stylus) while looking not at the pattern itself but at its reflection in a mirror. Initially we are quite confused, but if we persist at the task, after a while it becomes as natural as tracing the pattern viewed directly - the brain learns to compensate for the reversal of left and right. One could also speculate that if, in the distant past, evolution had chosen a slightly different route, we might have ended up with eyes more like those of insects - the sensory projections of our visual system, coming from real-world objects, would be very different, and there is no reason to doubt that our brains would learn to deal with such a situation. We see again that the shapes of the sensory projections are in some sense arbitrary, determined by the physical nature of the transducers.

3. Touch. Let us perform a very simple experiment - we cross the index finger with the middle finger of our right hand in such a way that the tip of the middle finger is to the left of the tip of the index finger (and vice versa). Now if we touch a small round object with these two fingers simultaneously (i.e. the object touches the left side of the tip of the index finger and the right side of the tip of the middle finger), we have the impression that we are touching two objects and not one.
We see that even such basic information about real objects as whether we are dealing with a single object or with two separate objects cannot be reliably extracted from a single sensory projection - we need _correlations_ with other sensory projections to form a picture which makes sense.

The above examples seem to cast doubt on Harnad's claim that the "nonarbitrary shapes" of sensorimotor projections from real objects onto transducer surfaces are a crucial element of symbol grounding. The shapes of the sensorimotor projections turn out to be arbitrary to a large extent, and it is the _correlations_ among these projections which appear to play the dominant role. Harnad illustrates the categorization process leading to the grounding of category names by an imaginary example of learning to distinguish between edible and poisonous mushrooms (Harnad 1993). It is interesting to note that in his example the grounding of the mushroom names ("mushrooms" for the edible ones and "toadstools" for the poisonous ones) takes place on the basis of _correlations_ between various sensory projections. The _shapes_ of the projection invariants do not enter in any way.

The second aspect of Harnad's model is his claim that the sensorimotor projections coming from the system's transducers, fed subsequently to a neural net for the purpose of categorisation (extraction of invariants), are analog. For instance he writes (Harnad 1993): "...it [Harnad's model] is 3-way (analog-connectionist-symbolic) with the connectionist component just a place-holder for any mechanism able to learn invariants in the analog sensorimotor projections that allow the system to do categorisation" and further down: "...performance requirements of such a T3 [i.e. TTT] -scale robot depend essentially on analog and other nonsymbolic forms of internal structure and function." However, nowhere in his arguments does Harnad convincingly show that this analog character of the input (in the form of sensorimotor projections) to the neural nets which do the invariant extraction is, in fact, essential. Any analog signal can be approximated with arbitrary accuracy by a digital signal. Since neural nets can have only finite sensitivity, whether they are fed an analog signal or a correspondingly finely graded digitized signal cannot matter for further processing. Once we accept this, these digitized signals from the transducers (sensorimotor projections) can be viewed as primitive symbols, in the same spirit as the 0's and 1's of a Turing machine. All further processing can then be considered as symbol manipulation which, one way or another, leads to the construction of high-level symbols representing language terms (category names). This may very well happen with the use of neural nets to extract invariants from the sensory projections and perhaps perform categorization. Since any neural net may be emulated on a suitably programmed digital computer, all these steps can be achieved without the need for analog devices.
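The claim that a sufficiently finely digitized signal is indistinguishable, for a mechanism of finite sensitivity, from its analog original can be illustrated with a small sketch (again an illustration added here, with an arbitrary fixed network standing in for the invariant-extracting mechanism): the same "sensory projection" is passed through the network in analog form and in quantized form, and the difference between the two outputs falls below any fixed sensitivity threshold once enough bits are used.
.DS
import numpy as np

rng = np.random.default_rng(0)

# A small fixed feedforward net standing in for the invariant-extracting mechanism.
W1 = rng.normal(size=(16, 64))
W2 = rng.normal(size=(4, 16))

def net(x):
    # Two-layer net with tanh units; the weights are arbitrary but fixed.
    return np.tanh(W2 @ np.tanh(W1 @ x))

# An 'analog' sensory projection, here a smooth signal sampled at 64 points.
t = np.linspace(0.0, 1.0, 64)
analog = np.sin(2 * np.pi * 3 * t) * np.exp(-t)

def quantize(signal, bits):
    # Uniform quantization of a signal in [-1, 1] to 2**bits discrete levels.
    levels = 2 ** bits
    return np.round((signal + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1

for bits in (2, 4, 8, 12, 16):
    diff = np.max(np.abs(net(analog) - net(quantize(analog, bits))))
    print(f"{bits:2d} bits: max output difference = {diff:.2e}")
.DE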
The above analysis suggests that the full robotic capacity of a system might provide high-level symbols with a deeper structure, based on correlations among the primitive symbols whose sources are the inputs from the sensorimotor transducers. Symbol grounding would then be achieved by the presence of such an underlying structure, which would give the symbols a much richer (and more intricate) set of relationships than can be offered by a (single-language) dictionary. These relationships mirror the experiences of interacting with the real world, making the symbols effective in such interactions and justifying the claim that the symbols are grounded.

It is nevertheless worth pointing out that there does not seem to be a reason why the underlying structure discussed above, once established, could not be built (programmed) into a symbolic system, without the need to give the system the full robotic capacity. Such a system would be capable of passing the TT and should perhaps also be considered to possess understanding of the language it uses.

There is one more aspect of the grounding problem, as discussed above, which requires mentioning. There are situations in which we deal with concepts defined solely using language, without reference to sensorimotor projections from real-world objects. Such situations arise, for instance, in mathematics. If we consider abstract set theory or abstract group theory, we define objects (sets, group elements) purely syntactically and then proceed to draw all possible conclusions concerning the consequences of these definitions. In spite of the fact that the symbols we manipulate do not require grounding in sensorimotor projections from real-world objects, and the manipulations depend only on the shapes of these symbols (which are completely arbitrary), we do talk about "understanding" mathematics (abstract set theory, abstract group theory, etc.). It is clear that understanding in this case means a knowledge of (or ability to deduce) _correlations_ among symbols of increasing complexity, arising from the definitions of the basic symbols from which these higher-level symbols are constructed.
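As a minimal sketch of this purely syntactic kind of "understanding" (an illustration added here, not an example taken from the literature under discussion), a small group can be specified as nothing but a multiplication table over arbitrary tokens; the axioms, the uniqueness of the identity and the inverses are then deduced by pure symbol manipulation, the tokens 'e', 'a', 'b' having no meaning beyond their role in the table.
.DS
from itertools import product

# The cyclic group of order 3, defined purely syntactically as a table over
# arbitrary tokens; nothing about the tokens' shapes matters.
elements = ["e", "a", "b"]
table = {
    ("e", "e"): "e", ("e", "a"): "a", ("e", "b"): "b",
    ("a", "e"): "a", ("a", "a"): "b", ("a", "b"): "e",
    ("b", "e"): "b", ("b", "a"): "e", ("b", "b"): "a",
}

def mul(x, y):
    return table[(x, y)]

# Consequences of the definition, deduced by exhaustive symbol manipulation.
associative = all(mul(mul(x, y), z) == mul(x, mul(y, z))
                  for x, y, z in product(elements, repeat=3))
identities = [i for i in elements
              if all(mul(i, x) == x and mul(x, i) == x for x in elements)]
inverses = {x: [y for y in elements if mul(x, y) == identities[0]]
            for x in elements}

print("associative:", associative)        # True
print("identity elements:", identities)   # ['e'] - the identity is unique
print("inverses:", inverses)              # each element has exactly one inverse
.DE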
In conclusion, it is argued above that, even though two aspects of Harnad's model for symbol grounding seem unjustified:
- the shapes of sensorimotor projections from real objects onto transducer surfaces do not appear to be relevant and hence cannot play a role in restricting symbol combinations;
- the importance of the analog nature of the sensorimotor projections, fed subsequently to neural nets for invariant feature extraction, is not apparent (there are reasons to think that these projections might just as well be digitized, leaving us with pure symbol manipulations);
the main idea of the model - TTT capacity - may nevertheless be crucial for symbol grounding. It may be the combination of various sensorimotor experiences with real objects which leads to the formation of a deep structure underlying the high-level symbols and which provides the (epistemological) meaning of language terms. This structure underlying the symbols may be somewhat akin to the semantic structure of language which J. Katz is attempting to establish in "The Metaphysics of Meaning" (Katz 1990), although he takes a decidedly Platonic view, whereas the structure referred to here has a very specific sensorimotor basis.

There also appears to be a possibility that if a symbolic system works with digitized inputs, corresponding to the sensorimotor projections coming from transducers, as its basic symbols, it might possess understanding without TTT capability. The possibility of ascribing understanding to a purely symbolic system seems in accordance with the use of the term "understanding" in the case of abstract mathematics, where the (mathematical) terms used are described verbally only, without recourse to the full sensorimotor capacities of a human being.

.B References
.R
DeValois, R.L., Abramov, I. and Jacobs, G.H. (1966) Analysis of Response Patterns of LGN Cells. Journal of the Optical Society of America 56: 966-977.
DeValois, R.L. and Jacobs, G.H. (1968) Primate Color Vision. Science 162: 533-540.
Harnad, S. (1990) The Symbol Grounding Problem. Physica D 42: 335-346.
Harnad, S. (1993) Symbol Grounding is an Empirical Problem: Neural Nets are Just a Candidate Component. Proceedings of the Fifteenth Annual Meeting of the Cognitive Science Society. NJ: Erlbaum.
Harnad, S. (1993a) Grounding Symbols in the Analog World with Neural Nets. Think 2: 12-78 (Special Issue on "Connectionism versus Symbolism", D.M.W. Powers & P.A. Flach, eds.).
Katz, J.J. (1990) The Metaphysics of Meaning. MIT Press, Cambridge, Massachusetts.
Searle, J.R. (1980) Minds, brains and programs. Behavioral and Brain Sciences 3: 417-424.