From: dstamp@watserv1.waterloo.edu (Dave Stampe-Psy+Eng)
Subject: Re: TECH: of costs and Convolvotrons.
Date: Sun, 15 Mar 1992 23:21:06 GMT
Message-ID: <1992Mar15.232106.10576@watserv1.waterloo.edu>
Organization: University of Waterloo


cs_d476@kingston.ac.uk (Leaback P D) writes:

>I think you are being (unintentionally) slightly misleading. The paper
>you give a reference to concerns itself with seeking out a virtual sound
>source. Being able to seek a sound source is not a good test of the
>ability to simulate localisation cues. For example, varying the volume of
>sound presented over a single earphone is sufficient to seek out a
>virtual sound source (if coupled with a head tracking device).

Are you sure we're talking about the same paper here?  The techniques used
included interaural delay, volume, high-frequency attenuation, and
diffuse sound sources-- I have yet to see the last implemented in a
convolution system!  About all it was missing was pinna comb-filtering.

I included this reference only to show that even simple systems allow
sounds to be located. It does not give any data on the "out-of-head"
quality of the sound, but experiments have shown that even simple cues
linked to head motion cause the sound to be perceived as outside the
head.  Conversely, even good "convolved" sound images fall back into 
the head if they fail to follow head motions.

>The filter system you describe fails to simulate
>
>        * The frequency dependent interaural time delay function.
>        * The frequency dependent head transfer function.
>        * The high frequency Pinna filter function.
>        * Shoulder bounce.
>        * The inverse of the headphone filter function.

The system I discussed does ALL of these except shoulder bounce, which is
difficult to simulate in any case, due to variability (head tilt,
clothing, etc).  The headphone transfer function can be a problem, but it
has been shown that a few seconds of head movement quickly nulls out
residual headphone effects (retraining of the location system's
expectations).

Looking at how the ear processes data, there are two main bands of
sound important for directional hearing:  below 1000 Hz, where phase
(interaural delay) is important, and above 2000 Hz, where amplitude/
peak/dip effects are important (head shadow, pinna comb filtering,
frontal "peaking" etc).  Using a process model of sound outside the ear
and knowledge of what factors are important for directional hearing,
you can isolate important processes to simulate.
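
As a rough illustration of what a process-based design means, here is a
minimal Python sketch of three of the cues above: interaural delay,
level difference, and high-frequency attenuation of the far ear.  It is
not the system from the paper.  The Woodworth spherical-head delay
formula, the head radius, the 6 dB level drop and the 16 kHz to 2 kHz
shadow cutoff are all assumptions chosen for illustration, and pinna
comb filtering is left out entirely.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s (assumed)
HEAD_RADIUS = 0.0875    # m (assumed average head radius)

def interaural_delay(azimuth_deg):
    """Woodworth spherical-head delay in seconds.
    Azimuth runs from 0 (straight ahead) to 90 (fully to one side)."""
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))

def one_pole_lowpass(samples, cutoff_hz, sample_rate):
    """Crude one-pole low-pass standing in for head shadow."""
    a = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, y = [], 0.0
    for x in samples:
        y = (1.0 - a) * x + a * y
        out.append(y)
    return out

def render(mono, azimuth_deg, sample_rate=50000):
    """Return (near_ear, far_ear) sample lists for a source at azimuth_deg."""
    side = math.sin(math.radians(azimuth_deg))  # 0 ahead, 1 at the side
    delay = round(interaural_delay(azimuth_deg) * sample_rate)
    gain = 10.0 ** (-6.0 * side / 20.0)  # hypothetical 6 dB maximum level drop
    cutoff = 16000.0 - 14000.0 * side    # hypothetical head-shadow cutoff in Hz
    shadowed = one_pole_lowpass(mono, cutoff, sample_rate)
    far = [0.0] * delay + [gain * s for s in shadowed]  # late, quiet, dull
    near = list(mono) + [0.0] * delay                   # padded to equal length
    return near, far
```

To link this to head motion, azimuth_deg would be recomputed from the
tracker on every update, so only a delay, a gain and a cutoff change
rather than a whole impulse response.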

>A small FIR has a good stab at ALL the localisation cues. I gave up on
>the approximation techniques because I came to the conclusion that even
>very modest DSP chips can support large enough FIR filters to produce
>effects that are more impressive than the approximation techniques could
>ever achieve.

Well, it depends how you look at the problem.  From a DSP point of
view, you need a large number of impulse functions to cover the
"angle sphere" (ideally separated by 1 sample of delay each, to
reduce frequency notch loss).  About 100-200 would be needed for this
case, and to get smooth performance, you must interpolate 2 or 4
impulse functions to get exact positions.  Sound localization is better
than 2 degrees straight ahead, so you should be able to set the transfer
functions to this precision.
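
The interpolation step and the notch problem can be sketched in a few
lines of Python.  This assumes plain linear blending of two measured
impulse functions, which is only one possible interpolation scheme, and
all the function names are made up for the example.  Blending two
impulses that sit D samples apart gives a comb filter whose first null
is at fs/(2*D): at D = 1 the null is pushed up to the Nyquist frequency,
while at D = 4 it lands at fs/8, well inside the band used for
localization.

```python
import math

def interpolate_impulse(h_a, h_b, angle_a, angle_b, angle):
    """Linearly blend two equal-length impulse functions measured at
    angle_a and angle_b to approximate the one at angle."""
    if len(h_a) != len(h_b):
        raise ValueError("impulse functions must have equal length")
    w = (angle - angle_a) / (angle_b - angle_a)
    return [(1.0 - w) * a + w * b for a, b in zip(h_a, h_b)]

def convolve(signal, h):
    """Direct-form FIR convolution of signal with impulse function h."""
    out = [0.0] * (len(signal) + len(h) - 1)
    for i, s in enumerate(signal):
        for j, c in enumerate(h):
            out[i + j] += s * c
    return out

def magnitude(h, freq_hz, sample_rate):
    """Magnitude response of impulse function h at a single frequency."""
    re = sum(c * math.cos(2.0 * math.pi * freq_hz * n / sample_rate)
             for n, c in enumerate(h))
    im = -sum(c * math.sin(2.0 * math.pi * freq_hz * n / sample_rate)
              for n, c in enumerate(h))
    return math.hypot(re, im)

# Two idealized impulse functions that differ only by one sample of delay.
close = interpolate_impulse([1.0, 0.0], [0.0, 1.0], 0.0, 10.0, 5.0)
# The same pair, but four samples apart.
wide = interpolate_impulse([1.0, 0, 0, 0, 0], [0, 0, 0, 0, 1.0], 0.0, 10.0, 5.0)
```

Evaluating magnitude(wide, 6250, 50000) shows the halfway blend wiping
out 6.25 kHz completely, whereas the one-sample pair only loses energy
right at 25 kHz.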

Too small an FIR will NOT do better than a process-based design.  You will
lose many of the precise cues to position.  For example, at least a
40 point FIR (at a 50 kHz sample rate) would be needed to do interaural
delays.  Other effects depend on how well you wish to reproduce them--
again, you must go back to a process-based model.
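
The arithmetic behind the 40 point figure is easy to check.  The sketch
below assumes a maximum interaural delay of about 0.65 ms, a commonly
quoted ballpark for an adult head rather than a number from this
thread, and the helper names are invented for the example.

```python
import math

def fir_max_delay(taps, sample_rate_hz):
    """Longest pure delay (seconds) an FIR with this many taps can hold."""
    return (taps - 1) / sample_rate_hz

def taps_for_delay(delay_s, sample_rate_hz):
    """Smallest tap count whose last tap sits at or beyond delay_s."""
    return math.ceil(delay_s * sample_rate_hz) + 1

MAX_ITD = 0.00065  # seconds (assumed maximum interaural delay)

span = fir_max_delay(40, 50000)            # 0.78 ms covered by 40 taps
needed_50k = taps_for_delay(MAX_ITD, 50000)
needed_44k = taps_for_delay(MAX_ITD, 44100)
```

Under that assumption about 34 taps go to the delay alone at 50 kHz,
which leaves only a handful of the 40 for any spectral shaping.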

I sort of object to the term "approximation" too.  With the convolution
systems you're using someone else's ear transfer functions as an 
"approximation" to your own, aren't you?  And what about the studies
that showed that some people have better location sense using someone
else's pinna recordings?  The point is, there are quite a few localization
cues coming in.  Not all possible ones are used, so simulating the 
major ones IN REAL TIME AND LINKED TO HEAD MOTION should suffice.  If
the sound "pops out" and is well-localized with head movement, it
doesn't matter what the impulse function looks like.  Besides, averaged
impulse functions have lost 90% of their data content already.

>However, one advantage of the approximation techniques is that one does
>not need access to a Head Related Transfer Function which are not easy to
>come by!

This is certainly true.  Every group I've talked to that has measured
them treats them like gold.  Then again, no one will measure all the
sample points required unless they have a business reason, anyway.

One solution would be to get hold of a dummy binaural recording head
(European, the American ones suck).  Then rent an anechoic chamber and
test equipment and...  Naah.


--------------------------------------------------------------------------
| My life is Hardware,                    |                              | 
| my destiny is Software,                 |         Dave Stampe          |
| my CPU is Wetware...                    |                              | 
| Anybody got a SDB I can borrow?         | dstamp@watserv1.uwaterloo.ca |
__________________________________________________________________________