Manual Reference Pages - GENSAI (1)
NAME
gensai - generate stabilised auditory image
CONTENTS
Synopsis/syntax
Description
Strobed Temporal Integration
Options
I. Display Options For The Auditory Image
II. Storage Options For The Auditory Image
III. Options For The Auditory Image
Examples
References
Files
See Also
Copyright
Acknowledgements
SYNOPSIS/SYNTAX
gensai [ option=value | -option ] filename
DESCRIPTION
Periodic sounds give rise to static, rather than oscillating,
perceptions indicating that temporal integration is applied to the NAP
in the production of our initial perception of a sound -- our auditory
image (Patterson et al., 1992b). Traditionally, auditory temporal
integration is represented by a simple leaky integration process, and
AIM provides a bank of lowpass filters to enable the user to generate
auditory spectra, excitation patterns (see Patterson, 1994a; genasa
and genepn), and auditory spectrograms (see Patterson et al., 1992a,
1993; gensgm and gencgm). However, leaky integrators remove the
phase-locked fine structure observed in the NAP, and this conflicts
with perceptual data indicating that the fine structure plays an
important role in determining sound quality and source identification
(Patterson, 1994b; Yost et al., 1996; Patterson et al., 1996). As a
result, AIM includes two modules which preserve much of the
time-interval information in the NAP during temporal integration, and
which produce a better representation of our auditory images. The
functional version of AIM uses a form of Strobed Temporal Integration
(STI) (Patterson et al., 1992a,b), and this is the primary topic of
this manual entry.

In the physiological version of AIM, the auditory image is constructed
with a bank of autocorrelators and the multi-channel result is
referred to as a correlogram (Lyon, 1984; Slaney and Lyon, 1990;
Meddis and Hewitt, 1991). The correlogram module is an aimTool rather
than an integral part of the main program gen. The name of the tool
is acgram and there are man pages for all the tools. An extended
example involving correlograms is presented in the script gtmdcg_pk.
The correlogram module extracts periodicity information and preserves
intra-period fine structure by autocorrelating the function in each
channel of the NAP. It was originally introduced as a model of pitch
perception by Licklider (1951). It is not yet known whether STI or
autocorrelation is more realistic, or more efficient, as a means of
simulating our perceived auditory images. At present, the purpose is
to provide a software package that can be used to compare these
auditory representations in a way not previously possible.

There are a large number of Silent Options associated with printing
AIM displays (see docs/aimSilentOptions). They are particularly useful
when printing correlograms, summary correlograms, and summary auditory
images, where some of the default axes and labels are incorrect.
Examples of how to use these silent options are presented in
docs/aimR8demo.
STROBED TEMPORAL INTEGRATION
In strobed temporal integration (STI), a bank of delay lines is used
to form a buffer store for the NAP, one delay line per channel. As
the NAP proceeds along the buffer, it decays linearly with time, at
about 2.5 %/ms, so there is no activity beyond about 40 ms in the NAP
buffer. Each channel of the buffer is assigned a strobe unit which
monitors activity in that channel looking for local maxima in the
stream of NAP pulses. When one is found, the unit initiates temporal
integration in that channel; that is, it transfers a copy of the
entire NAP function in that channel at that instant, to the
corresponding channel of an image buffer, where it adds the NAP
function point-for-point with whatever is already in that channel of
the image buffer. The local maximum itself is mapped to the 0-ms
point in the image buffer. The multi-channel version of this STI
process is AIM's representation of our auditory image of a sound.
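The single-channel transfer described above can be sketched in outline. This is an illustrative reconstruction, not the gensai implementation: the function name, the simple three-point local-maximum test, and the linear 2.5 %/ms weighting of the buffer contents are assumptions based on the description.

```python
import numpy as np

def strobed_integration(nap, fs, buffer_ms=40.0, image=None):
    """Sketch of single-channel strobed temporal integration (STI).

    nap: non-negative NAP samples for one channel; fs: sample rate in Hz.
    On each local maximum, the recent NAP history is added
    point-for-point into the image buffer, with the maximum itself
    mapped to the 0-ms point of the image.
    """
    n = int(buffer_ms * 1e-3 * fs)          # NAP buffer length (~40 ms)
    if image is None:
        image = np.zeros(n)
    for t in range(1, len(nap) - 1):
        # strobe on a local maximum in the stream of NAP pulses
        if nap[t] > nap[t - 1] and nap[t] >= nap[t + 1]:
            start = max(0, t - n + 1)
            segment = nap[start:t + 1]      # NAP up to the strobe point
            # NAP decays linearly at about 2.5 %/ms along the buffer
            ages_ms = np.arange(len(segment) - 1, -1, -1) / fs * 1e3
            weights = np.clip(1.0 - 0.025 * ages_ms, 0.0, None)
            # reverse so the strobe point lands at image index 0 (0 ms)
            image[:len(segment)] += (segment * weights)[::-1]
    return image
```

Running this independently in every channel of a filterbank output would give a multi-channel image of the kind the text describes.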
Periodic and quasi-periodic sounds typically produce a single local
maximum per cycle, per channel of the NAP. In any given channel, this
leads to regular strobing and the transfer into the auditory image of
a sequence of NAP functions which are all virtually identical. As a
result, the auditory images of periodic sounds lead to static auditory
images, and quasi-periodic sounds lead to nearly static images. These
images, however, have the same temporal resolution as the NAP.
Dynamic sounds are represented as a sequence of auditory image
frames. If the rate of change in a sound is not too rapid, as in
diphthongs, features are seen to move smoothly as the sound proceeds,
much as objects move smoothly in animated cartoons.

The primary difference between the auditory images produced with STI
and autocorrelation is that STI preserves much more of the temporal
asymmetry that sounds generate in the NAP (Allerhand and Patterson,
1992). Indeed the features that appear in correlograms are virtually
symmetric. The preservation of asymmetry in STI is heavily dependent
on the criterion that the strobe mechanism uses to identify points at
which to initiate temporal integration. If it successfully identifies
local maxima as the integration points, asymmetry is preserved
(Allerhand and Patterson, 1992). If it initiates temporal integration
on every pulse in the NAP, the features lose their asymmetry and the
resulting image is quite similar to the corresponding correlogram. A
detailed description of the STI process is presented in
/docs/aimStrobeCriterion with examples provided by a companion script
(/scripts/aimStrobeCriterion). The discussion of strobe criteria
begins with the simplest criterion for initiating temporal integration
-- strobe on every non-zero NAP point. This converts sharp,
asymmetric NAP features into rough-edged, largely-symmetric features
in the auditory image. From there the discussion proceeds to more
restrictive criteria which gradually reduce the rough edges and
restore the asymmetry of the features in the auditory image. The
degree of restriction in the strobe criterion is an option in AIM
(stcrit_ai) which enables the user to experiment with the relationship
between STI and autocorrelation.

It is important to emphasise that the strobing in a given channel is
independent of that in all other channels so far as the mechanism
itself is concerned. It is this aspect of the strobing process, and
the fact that the local maximum is mapped to 0 ms in the auditory
image, that causes the alignment of channels in the auditory image.
This passive alignment of channels in turn enables AIM to explain
monaural phase perception as set out in Patterson (1987).

The auditory image has the same vertical dimension as the neural
activity pattern (filter centre frequency). The continuous time
dimension of the neural activity pattern becomes a local,
time-interval dimension in the auditory image; specifically, it is
"the time interval between a given pulse and the succeeding strobe
pulse". In order to preserve the direction of asymmetry of features
that appear in the NAP, the time-interval origin is plotted towards
the right-hand edge of the image, with increasing, positive time
intervals proceeding towards the left.
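For readers plotting stored .sai data themselves, the axis convention above can be reproduced with a small helper. This is a hypothetical illustration (image_axis is not part of AIM): it assumes image index 0 holds the 0-ms strobe point and larger indices hold larger positive time intervals.

```python
import numpy as np

def image_axis(n_points, fs):
    """Hypothetical helper: time-interval axis for one image channel.

    Returns negated intervals so that, plotted left-to-right in
    ascending order, 0 ms sits at the right-hand edge and positive
    time intervals increase towards the left, as in gensai displays.
    """
    intervals_ms = np.arange(n_points) / fs * 1e3   # 0, 1/fs, 2/fs, ... in ms
    return -intervals_ms

axis = image_axis(5, 1000.0)   # at 1 kHz, one sample per millisecond
```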
OPTIONS
I. DISPLAY OPTIONS FOR THE AUDITORY IMAGE
The options that control the positioning of the window in which the
auditory image appears are the same as those used to set up the
earlier windows (i.e. those with the suffix _win), as are the options
that control the level of the image within the display (top, bottom
and magnitude). In addition, there are three new options that are
required to present this new auditory representation. The options are
frstep_aid, pwidth_aid, and nwidth_aid; the suffix "_aid" means "auditory
image display". These options are described here before the options
that control the image construction process itself, as they occur
first in the options list. There are also three extra display options
for presenting the auditory image in its spiral form; these options
have the suffix "_spd" for "spiral display"; they are described in the
manual entry for genspl.
frstep_aid The frame step interval, or the update interval for the auditory image display. Default units: ms. Default value: 16 ms.
Conceptually, the auditory image exists continuously in time. The
simulation of the image produced by AIM is not continuous, however;
rather it is like an animated cartoon. Frames of the cartoon are
calculated at discrete points in time, and then the sequence of frames
is replayed to reveal the dynamics of the sound, or the lack of
dynamics in the case of periodic sounds. When the sound is changing
at a rate where we hear smooth glides, the structures in the simulated
auditory image move much like objects in a cartoon. frstep_aid
determines the time interval between frames of the auditory image
cartoon. Frames are calculated at time zero and integer multiples of
frstep_aid.

The default value (16 ms) is reasonable for musical sounds and speech
sounds. For a detailed examination of the development of the image of
brief transient sounds frstep_aid should be decreased to 4 or even 2
ms.

pwidth_aid The maximum positive time interval presented in the display of the auditory image (to the left of 0 ms). Default units: ms. Default value: 35 ms.

nwidth_aid The maximum negative time interval presented in the display of the auditory image (to the right of 0 ms). Default units: ms. Default value: -5 ms.
Note that the minus sign is required when entering nwidth_aid.
animate Present the frames of the simulated auditory image as a cartoon. Switch. Default off.
With reasonable resolution and a reasonable frame rate, the auditory
cartoon for a second of sound will require on the order of 1 Mbyte of
storage. As a result, auditory cartoons are only stored at the
specific request of the user. When the animate flag is set to on,
the bit maps that constitute the frames of the auditory cartoon are
stored in computer memory. They can then be replayed as an auditory
cartoon by pressing carriage return. To exit the instruction, type
"q" for quit or "control c". The bit maps are discarded unless
option bitmap is set to on.

There is a silent option review associated with animate. When
review=on, AIM pauses between the frames of the cartoon and waits
for a return.
II. STORAGE OPTIONS FOR THE AUDITORY IMAGE
A record of the auditory image can be stored in two ways depending on
the purpose for which it is stored. The actual numerical values of
the auditory image can be stored as previously, by setting output=on.
In this case, a file with a .sai suffix will be created in accordance
with the conventions of the software. These values can be recalled
for further processing with the aimTools. In this regard the SAI
module is like any previous module.It is also possible to store the bit maps which are displayed on the
screen for the auditory image cartoon. The bit maps require less
storage space and reload more quickly, so this is the preferred mode
of storage when one simply wants to view the auditory image.
bitmap Produce a bit-map storage file. Switch. Default value: off.
When the bitmap option is set to on, the bit maps are stored in a
file with the suffix .ctn. The bitmaps are reloaded into memory using
the instructions review, or xreview, followed by the file name without
the suffix .ctn. The auditory image can then be replayed, as with
animate, by typing carriage return. xreview is the newer and
preferred display routine. It enables the user to select subsets of
the cartoon and to change the rate of play via a convenient control
window. It does, however, require an ANSI C compiler like gcc.
III. OPTIONS FOR THE AUDITORY IMAGE
There are six options with the suffix "_ai", short for auditory
image. The first option, napdecay_ai, controls the decay rate for
the NAP while it flows down the NAP buffer and before it is
transferred to the auditory image. In point of fact, then, napdecay_ai
is a NAP option rather than an auditory image option. But its effects
are only observed in the auditory image and so it is grouped with the
other _ai options. The next four options control the STI process --
stdecay_ai, stcrit_ai, stlag_ai and decay_ai. The final option,
stinfo_ai, is a switch that causes the software to produce information
about the current STI analysis for demonstration or diagnostic
purposes.

The strobe mechanism is conceptually simple. An adaptive strobe
threshold is set up for each channel of the NAP and its value is set
to the height of the first NAP pulse when it occurs. Thereafter, the
strobe threshold decays exponentially with time in the absence of
suprathreshold NAP pulses. The rate of decay is an option, decay_ai,
which is specified as the half life of the image; the default half
life is 30 ms. When a NAP pulse next exceeds the strobe threshold,
the level of the strobe threshold is reset to the height of the peak
of the NAP pulse, and the time of the peak of the NAP pulse is
recorded as a potential temporal integration time. There is then a
short lag before integration, to see if another larger NAP pulse is
about to occur, since the production of a stable image depends on
identifying local maxima in the NAP. The strobe lag is an option,
stlag_ai, whose default value is 5 ms. If a larger NAP pulse occurs
within stlag_ai ms, its peak time becomes the new potential
integration time, the strobe threshold level is reset to the height
of the peak of the new NAP pulse, and the strobe lag is reset to
stlag_ai. In the event of a continuous stream of rising NAP pulses,
the total strobe lag after the occurrence of the first suprathreshold
NAP pulse is limited to twice stlag_ai.
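The threshold-and-lag logic described above can be sketched as follows. This is a minimal illustration of the described behaviour, not the gensai source: the function name, the (time, height) peak-list input, and the exact handling of the twice-stlag_ai cap are assumptions.

```python
def find_strobes(nap_peaks, stdecay=5.0, stlag=5.0):
    """Sketch of the adaptive strobe-threshold mechanism.

    nap_peaks: list of (time_ms, height) NAP pulse peaks, in time order.
    stdecay:   threshold decay rate, in % of the defining peak per ms.
    stlag:     lag before integration, in ms (default 5 ms).
    Returns the peak times chosen as temporal-integration points.
    """
    strobes = []
    threshold = 0.0        # current strobe threshold
    thr_peak = 0.0         # peak height the threshold was last set to
    thr_time = None        # time the threshold was last reset
    candidate = None       # (time, deadline) of pending integration point
    first_supra = None     # first suprathreshold pulse in a rising run

    for t, h in nap_peaks:
        if thr_time is not None:
            # threshold decays towards zero in the absence of new pulses
            threshold = max(0.0,
                            thr_peak * (1.0 - stdecay / 100.0 * (t - thr_time)))
        if candidate is not None and t > candidate[1]:
            strobes.append(candidate[0])       # lag expired: commit strobe
            candidate, first_supra = None, None
        if h > threshold:
            # reset threshold to the new peak and restart the strobe lag
            threshold, thr_peak, thr_time = h, h, t
            if first_supra is None:
                first_supra = t
            # total lag capped at twice stlag after the first suprathreshold pulse
            deadline = min(t + stlag, first_supra + 2.0 * stlag)
            candidate = (t, deadline)
    if candidate is not None:
        strobes.append(candidate[0])
    return strobes
```

For a continuously rising run of pulses, the candidate keeps being replaced by the newer, larger peak until the capped deadline passes, so integration is initiated on the local maximum rather than on every pulse.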
napdecay_ai Decay rate for the neural activity pattern (NAP). Default units: %/ms. Default value: 2.5 %/ms.
napdecay_ai determines the rate at which the information in the neural
activity pattern decays as it proceeds along the auditory buffer that
stores the NAP prior to temporal integration.

stdecay_ai Strobe threshold decay rate. Default units: %/ms. Default value: 5 %/ms.
stdecay_ai determines the rate at which the strobe threshold decays.
At 5 %/ms, strobe threshold returns to zero from a NAP peak of any
height in 40 ms, in the absence of further suprathreshold NAP pulses.
Note that, in absolute terms, strobe threshold decays faster after
large NAP peaks than after small NAP peaks.
stcrit_ai Strobe criterion. Switch: Default value: 5.
The stabilisation of NAP features in the auditory image depends on the degree of restriction in the strobe criterion. The levels are:
1 Strobe on every non-zero point in the NAP.
2 Strobe on the peak of every NAP pulse.
3 Avoid strobing on NAP peaks in the temporal shadow of a large peak.
4 Avoid strobing on peaks followed by larger peaks.
5 Do not wait more than twice stlag_ai before initiating integration.

stlag_ai Auditory image strobe lag time. Default units: ms. Default value: 5 ms.
For strobe criterion levels 4 and above, following detection of a suprathreshold NAP pulse, integration is deferred by stlag_ai in case a larger NAP pulse is about to occur. General purpose pitch mechanisms based on peak picking are notoriously unreliable, and the strobe lag makes the identification of local maxima more robust.

decay_ai Auditory image half life. Default units: ms. Default value: 30 ms.
When the input sound terminates, the auditory image must decay. The rate of decay is specified as the half life of the image.

stinfo_ai Strobe threshold information. (Values: off, on, filename.) Switch: Default value: off.
When the switch is on, gensai outputs the strobe threshold function for the current STI analysis, for demonstration or diagnostic purposes.
EXAMPLES
This section presents a pair of examples intended to illustrate the
predominant forms of motion that dynamic sounds produce in the
auditory image, and the fact that structures and features can be
tracked across the image provided the rate of change is not excessive.
The first example is a pitch glide for a note with fixed timbre; it
produces predominantly horizontal motion in the auditory image. The
second example is a timbre glide for a note with fixed pitch; it
produces predominantly vertical motion in the auditory image.
A Pitch Glide in the Auditory Image
To this point, the discussion has focussed on how to convert a NAP
from a periodic sound with a repeating pattern into a stabilised
auditory image without smearing the fine structure of the NAP pattern.
The mechanism is not, however, limited to periodic sounds. The sound
file cegc contains a set of click trains that produce four musical
notes referred to as C3, E3, G3, and C4, along with glides from one
note to the next. The notes are relatively long (300 ms) and the
pitch glides are relatively slow (300 ms for 3-5 semitones). As a
result, each note forms a stabilised auditory image and there is
smooth motion in the image over the 300-ms interval as the sound
glides from one note to the next. The pitch of musical notes is
determined by the lower harmonics when they are present and so the
frequency range is limited to 2000 Hz. The demonstration is generated
and stored with the instruction

> gensai channels=40 max=2000 input=cegc bitmap=on
It can then be replayed at will with either review cegc or xreview
cegc. (Click on the image with the middle mouse button to pull up
the control window for xreview.) For brevity, the example can be
limited to the transition from C to E near the start of the sound using
the instruction

> gensai channels=40 max=2000 start=150 length=600 input=cegc
(In point of fact, the click train associated with the first note has
a period of 8 ms; so this "C" is actually a little below the musical
note C3.)

When the note comes on, a stable image of the first note forms over
the first 4-6 cycles of the note. The vertical structure that repeats
four times across the image is the time-interval pattern that
identifies a click-train sound. When the transition begins, in the
lower channels associated with the first and second harmonic, the
individual SAI pulses move from left to right. At the
same time, they move up in frequency as these resolved harmonics
move up into filters with higher centre frequencies. In these low
channels the motion is relatively smooth because the SAI pulses have a
duration which is a significant proportion of the period of the sound.
As the pitch rises and the periods get shorter, each new NAP cycle
contributes a NAP pulse which is shifted a little to the right
relative to the corresponding SAI pulse. This increases the leading
edge of the SAI pulse without contributing to the lagging edge. As a
result, the leading edge builds, the lagging edge decays, and the SAI
pulse moves to the right. The SAI pulses are asymmetric during the
motion, with the trailing edge more shallow than the leading edge, and
the effect is greater towards the left edge of the image because the
discrepancies over four cycles are larger than the discrepancies over
one cycle. The effects are larger for the second harmonic than for
the first harmonic because the width of the pulses of the second
harmonic are a smaller proportion of the period. During the pitch
glide the SAI pulses have a reduced peak height because the activity
is distributed over more channels and over time intervals.

The SAI pulses associated with the higher harmonics are relatively
narrow with respect to the changes in period during the pitch glide.
As a result there is more blurring of the image during the glide in
the higher channels. Towards the right-hand edge, for the column that
shows correlations over one cycle, the blurring is minimal. Towards
the left-hand edge the details of the pattern are blurred and we see
mainly activity moving in broad vertical bands from left to right.
When the glide terminates the fine structure reforms from right to
left across the image and the stationary image for the note E appears.

The details of the motion are more readily observed when the image is
played in slow motion, using review or xreview and one of the
slow-down options.
A Timbre Glide in the Auditory Image
The vowels of speech are quasi-periodic sounds and the period for the
average male speaker is on the order of 8 ms. As the articulators
change the shape of the vocal tract during speech, formants appear in
the auditory image and move about. The position and motion of the
formants is an important part of the information conveyed by the
voiced parts of speech. When the speaker uses a monotone voice, the
pitch remains relatively steady and the motion of the formants is
essentially in the vertical dimension. An example of monotone voiced
speech is provided in the file leo which is the acoustic waveform of
the word leo. The auditory image of leo can be produced and viewed
with the instruction

> gensai input=leo bitmap=on animate=on
It can be replayed under user control with either
> review leo or
> xreview leo
The dominant impression on first observing the auditory image of
leo is the motion in the formation of the "e" sound, the
transition from "e" to "o", and the formation of the "o" sound.

The vocal cords come on at the start of the "l" sound but the tip of
the tongue is pressed against the roof of the mouth just behind the
teeth and so it restricts the air flow and the start of the "l" does
not contain much energy. As a result, in the auditory image, the
presence of the "l" is primarily observed in the transition from the
"l" to the "e". That is, as the three formants in the auditory image
of the "e" come on and grow stronger, the second formant glides into
its "e" position from below, indicating that the second formant was
recently at a lower frequency for the previous sound. The details of
the motion are more readily observed when the image is played in slow
motion, using review or xreview and one of the slow-down options.

In the "e", the first formant is low, centred on the third
harmonic at the bottom of the auditory image. The second formant
is high, up near the third formant. The lower portion of the
fourth formant shows along the upper edge of the image.
Recognition systems that ignore temporal fine structure often
have difficulty determining whether a high frequency
concentration of energy is a single broad formant or a pair of
narrower formants close together. This makes it more difficult
to distinguish "e". In the auditory image, information about the
pulsing of the vocal cords is maintained and the temporal
fluctuation of the formant shapes makes it easier to distinguish
that there are two overlapping formants rather than a single
large formant.

As the "e" changes into the "o", the second formant moves back
down onto the eighth harmonic and the first formant moves up to
a position between the third and fourth harmonics. The third and
fourth formants remain relatively fixed in frequency but they
become softer as the "o" takes over. During the transition, the
second formant becomes fuzzy as it moves down the vertical
ridges of the glottal pulse.
REFERENCES
Assman, P. F. and Q. Summerfield (1990). "Modelling the perception of concurrent vowels: Vowels with different fundamental frequencies," J. Acoust. Soc. Am. 88, 680-697.

Licklider, J. C. R. (1951). "A duplex theory of pitch perception," Experientia 7, 128-133. Reprinted in E. D. Schubert (ed.), Psychological Acoustics. Stroudsburg, PA: Dowden, Hutchinson and Ross Inc. (1979).

Lyon, R. F. (1984). "Computational models of neural auditory processing," in Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, San Diego, CA, March 1984.

Meddis, R. and M. J. Hewitt (1991). "Virtual pitch and phase sensitivity of a computer model of the auditory periphery: I. Pitch identification," J. Acoust. Soc. Am. 89, 2866-2882.

Patterson, R. D. (1987). "A pulse ribbon model of monaural phase perception," J. Acoust. Soc. Am. 82, 1560-1586.

Patterson, R. D., Holdsworth, J. and Allerhand, M. (1992a). "Auditory models as preprocessors for speech recognition," in The Auditory Processing of Speech: From the Auditory Periphery to Words, M. E. H. Schouten (ed.), Mouton de Gruyter, Berlin, 67-83.

Patterson, R. D., Robinson, K., Holdsworth, J.,

Patterson, R. D. (1994a). "The sound of a sinusoid: Spectral models," J. Acoust. Soc. Am. 96, 1409-1418.

Patterson, R. D. (1994b). "The sound of a sinusoid: Time-interval models," J. Acoust. Soc. Am. 96, 1419-1428.

Patterson, R. D. and Akeroyd, M. A. (1995). "Time-interval patterns and sound quality," in Advances in Hearing Research: Proceedings of the 10th International Symposium on Hearing, G. Manley, G. Klump, C. Koppl, H. Fastl and H. Oeckinghaus (eds.), World Scientific, Singapore, 545-556.

Patterson, R. D., Allerhand, M. and Giguere, C. (1995). "Time-domain modelling of peripheral auditory processing: A modular architecture and a software platform," J. Acoust. Soc. Am. 98-3 (in press).

Slaney, M. and Lyon, R. F. (1990). "A perceptual pitch detector," in Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, Albuquerque, New Mexico.
FILES
.gensairc The options file for gensai.
SEE ALSO
genspl, acgram
COPYRIGHT
Copyright (c) Applied Psychology Unit, Medical Research Council, 1995
Permission to use, copy, modify, and distribute this software without
fee is hereby granted for research purposes, provided that this
copyright notice appears in all copies and in all supporting
documentation, and that the software is not redistributed for any fee
(except for a nominal shipping charge). Anyone wanting to incorporate
all or part of this software in a commercial product must obtain a
license from the Medical Research Council.

The MRC makes no representations about the suitability of this
software for any purpose. It is provided "as is" without express or
implied warranty.

THE MRC DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING
ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL
THE A.P.U. BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES
OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
SOFTWARE.
ACKNOWLEDGEMENTS
The AIM software was developed for Unix workstations by John
Holdsworth and Mike Allerhand of the MRC APU, under the direction of
Roy Patterson. The physiological version of AIM was developed by
Christian Giguere. The options handler is by Paul Manson. The revised
SAI module is by Jay Datta. Michael Akeroyd extended the postscript
facilites and developed the xreview routine for auditory image
cartoons.

The project was supported by the MRC and grants from the U.K. Defence
Research Agency, Farnborough (Research Contract 2239); the EEC Esprit
BR Programme, Project ACTS (3207); and the U.K. Hearing Research Trust.
SunOS 5.6 | GENSAI (1) | 9 August 1995