

Manual Reference Pages  - GENSAI (1)


gensai - generate stabilised auditory image


Strobed Temporal Integration
     I. Display Options For The Auditory Image
     II. Storage Options For The Auditory Image
     III. Options For The Auditory Image
See Also


gensai [ option=value | -option ] filename


Periodic sounds give rise to static, rather than oscillating, perceptions indicating that temporal integration is applied to the NAP in the production of our initial perception of a sound -- our auditory image (Patterson et al., 1992b). Traditionally, auditory temporal integration is represented by a simple leaky integration process, and AIM provides a bank of lowpass filters to enable the user to generate auditory spectra, excitation patterns (see Patterson, 1994a; genasa and genepn), and auditory spectrograms (see Patterson et al., 1992a, 1993; gensgm and gencgm). However, leaky integrators remove the phase-locked fine structure observed in the NAP, and this conflicts with perceptual data indicating that the fine structure plays an important role in determining sound quality and source identification (Patterson, 1994b; Yost et al., 1996; Patterson et al., 1996). As a result, AIM includes two modules which preserve much of the time-interval information in the NAP during temporal integration, and which produce a better representation of our auditory images. The functional version of AIM uses a form of Strobed Temporal Integration (STI) (Patterson et al., 1992a,b), and this is the primary topic of this manual entry.

In the physiological version of AIM, the auditory image is constructed with a bank of autocorrelators and the multi-channel result is referred to as a ’correlogram’ (Lyon, 1984; Slaney and Lyon, 1990; Meddis and Hewitt, 1991). The correlogram module is an aimTool rather than an integral part of the main program ’gen’. The name of the tool is ’acgram’ and there are man pages for all the tools. An extended example involving correlograms is presented in the script ’gtmdcg_pk’. The correlogram module extracts periodicity information and preserves intra-period fine structure by autocorrelating the function in each channel of the NAP. It was originally introduced as a model of pitch perception by Licklider (1951). It is not yet known whether STI or autocorrelation is more realistic, or more efficient, as a means of simulating our perceived auditory images. At present, the purpose is to provide a software package that can be used to compare these auditory representations in a way not previously possible.
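The per-channel autocorrelation described above can be illustrated with a short Python sketch. It is not the acgram tool: the function name and the toy pulse train are assumptions made for this example, and the sum is left unnormalised.

```python
def autocorrelate(channel, max_lag):
    """Unnormalised autocorrelation of one NAP channel for lags 0..max_lag."""
    n = len(channel)
    return [
        sum(channel[t] * channel[t + lag] for t in range(n - lag))
        for lag in range(max_lag + 1)
    ]

# Toy channel (hypothetical data): one asymmetric pulse, fast rise and
# slow fall, repeated every 8 samples.
pulse = [0.0, 4.0, 3.0, 2.0, 1.0, 0.0, 0.0, 0.0]
channel = pulse * 6

acf = autocorrelate(channel, max_lag=16)

# The function peaks at the period (lag 8), and the values either side of
# that peak are equal even though the pulse itself is asymmetric.
period_peak = acf[8]
left_of_peak, right_of_peak = acf[7], acf[9]
```

In this toy case the feature around the period lag comes out symmetric, which is the property that distinguishes correlogram features from the asymmetric NAP pulses that produced them.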

There are a large number of Silent Options associated with printing AIM displays (see docs/aimSilentOptions). They are particularly useful when printing correlograms, summary correlograms, and summary auditory images, where some of the default axes and labels are incorrect. Examples of how to use these silent options are presented in docs/aimR8demo.



In strobed temporal integration (STI), a bank of delay lines is used to form a buffer store for the NAP, one delay line per channel. As the NAP proceeds along the buffer, it decays linearly with time, at about 2.5 %/ms, so there is no activity beyond about 40 ms in the NAP buffer. Each channel of the buffer is assigned a strobe unit which monitors activity in that channel looking for local maxima in the stream of NAP pulses. When one is found, the unit initiates temporal integration in that channel; that is, it transfers a copy of the entire NAP function in that channel at that instant to the corresponding channel of an image buffer, where it adds the NAP function point-for-point with whatever is already in that channel of the image buffer. The local maximum itself is mapped to the 0-ms point in the image buffer. The multi-channel version of this STI process is AIM’s representation of our auditory image of a sound. Periodic and quasi-periodic sounds typically produce a single local maximum per cycle, per channel of the NAP. In any given channel, this leads to regular strobing and the transfer into the auditory image of a sequence of NAP functions which are all virtually identical. As a result, periodic sounds produce static auditory images, and quasi-periodic sounds produce nearly static images. These images, however, have the same temporal resolution as the NAP. Dynamic sounds are represented as a sequence of auditory image frames. If the rate of change in a sound is not too rapid, as in diphthongs, features are seen to move smoothly as the sound proceeds, much as objects move smoothly in animated cartoons.
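The transfer step described above can be sketched in Python for a single channel. This is a minimal illustration rather than the AIM implementation: the function name, the 1-ms sampling step and the toy NAP are assumptions, the strobe times are supplied by hand, and the decay of the image itself (decay_ai) is left out.

```python
def integrate_on_strobes(nap, strobe_times, width_ms=35,
                         napdecay_pct_per_ms=2.5, dt_ms=1.0):
    """Add a decayed copy of the NAP history into an image buffer at each strobe.

    strobe_times are sample indices. image[k] holds activity that occurred
    k samples before a strobe, so the strobe point itself always lands at
    index 0 (the 0-ms point of the image).
    """
    n_bins = int(width_ms / dt_ms) + 1
    image = [0.0] * n_bins
    for s in strobe_times:
        for k in range(n_bins):
            t = s - k                      # sample that is k steps older than the strobe
            if t < 0:
                break
            # Linear decay along the NAP buffer: the weight reaches zero
            # after 100 / napdecay_pct_per_ms milliseconds.
            weight = max(0.0, 1.0 - napdecay_pct_per_ms / 100.0 * k * dt_ms)
            image[k] += weight * nap[t]    # point-for-point addition
    return image

# Toy NAP channel (hypothetical data): one pulse every 8 ms, peak at sample 1.
pulse = [0.0, 4.0, 3.0, 2.0, 1.0, 0.0, 0.0, 0.0]
nap = pulse * 6
strobes = [1 + 8 * j for j in range(6)]    # one strobe per cycle, on the local maximum

image = integrate_on_strobes(nap, strobes)
```

Because every strobe falls on the same point of the cycle, each transfer adds the same pattern in the same place, and the pulse one period back (index 8) keeps its slow fall on the side nearer the origin.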

The primary difference between the auditory images produced with STI and autocorrelation is that STI preserves much more of the temporal asymmetry that sounds generate in the NAP (Allerhand and Patterson, 1992). Indeed the features that appear in correlograms are virtually symmetric. The preservation of asymmetry in STI is heavily dependent on the criterion that the strobe mechanism uses to identify points at which to initiate temporal integration. If it successfully identifies local maxima as the integration points, asymmetry is preserved (Allerhand and Patterson, 1992). If it initiates temporal integration on every pulse in the NAP, the features lose their asymmetry and the resulting image is quite similar to the corresponding correlogram. A detailed description of the STI process is presented in /docs/aimStrobeCriterion with examples provided by a companion script (/scripts/aimStrobeCriterion). The discussion of strobe criteria begins with the simplest criterion for initiating temporal integration -- strobe on every non-zero NAP point. This converts sharp, asymmetric NAP features into rough-edged, largely-symmetric features in the auditory image. From there the discussion proceeds to more restrictive criteria which gradually reduce the rough edges and restore the asymmetry of the features in the auditory image. The degree of restriction in the strobe criterion is an option in AIM (stcrit_ai) which enables the user to experiment with the relationship between STI and autocorrelation.

It is important to emphasise that the strobing in a given channel is independent of that in all other channels so far as the mechanism itself is concerned. It is this aspect of the strobing process, and the fact that the local maximum is mapped to 0 ms in the auditory image, that causes the alignment of channels in the auditory image. This passive alignment of channels in turn enables AIM to explain monaural phase perception as set out in Patterson (1987).

The auditory image has the same vertical dimension as the neural activity pattern (filter centre frequency). The continuous time dimension of the neural activity pattern becomes a local, time-interval dimension in the auditory image; specifically, it is "the time interval between a given pulse and the succeeding strobe pulse". In order to preserve the direction of asymmetry of features that appear in the NAP, the time-interval origin is plotted towards the right-hand edge of the image, with increasing, positive time intervals proceeding towards the left.





The options that control the positioning of the window in which the auditory image appears are the same as those used to set up the earlier windows (i.e. those with the suffix _win), as are the options that control the level of the image within the display (top, bottom and magnitude). In addition, there are three new options that are required to present this new auditory representation. The options are frstep_aid, pwidth_aid, and nwidth_aid; the suffix "_aid" means "auditory image display". These options are described here before the options that control the image construction process itself, as they occur first in the options list. There are also three extra display options for presenting the auditory image in its spiral form; these options have the suffix "_spd" for "spiral display"; they are described in the manual entry for ’genspl’.

frstep_aid The frame step interval, or the update interval for the auditory image display

Default units: ms. Default value: 16 ms.

Conceptually, the auditory image exists continuously in time. The simulation of the image produced by AIM is not continuous, however; rather it is like an animated cartoon. Frames of the cartoon are calculated at discrete points in time, and then the sequence of frames is replayed to reveal the dynamics of the sound, or the lack of dynamics in the case of periodic sounds. When the sound is changing at a rate where we hear smooth glides, the structures in the simulated auditory image move much like objects in a cartoon. frstep_aid determines the time interval between frames of the auditory image cartoon. Frames are calculated at time zero and integer multiples of frstep_aid.

The default value (16 ms) is reasonable for musical sounds and speech sounds. For a detailed examination of the development of the image of brief transient sounds frstep_aid should be decreased to 4 or even 2 ms.
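The frame schedule described above amounts to simple arithmetic, sketched here in Python. The function name is hypothetical, and whether gensai emits a frame that falls exactly on the end of the input is an assumption of this sketch.

```python
def frame_times(duration_ms, frstep_aid_ms=16.0):
    """Frame times: time zero and integer multiples of the frame step."""
    n_frames = int(duration_ms // frstep_aid_ms) + 1
    return [i * frstep_aid_ms for i in range(n_frames)]

default_frames = frame_times(100)        # default 16-ms step
fine_frames = frame_times(100, 4.0)      # finer step for brief transients
```

For a 100-ms sound the default step gives 7 frames, while a 4-ms step gives 26, which is why the smaller values are reserved for detailed examination of transients.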

pwidth_aid The maximum positive time interval presented in the display of the auditory image (to the left of 0 ms).

Default units: ms. Default value: 35 ms.

nwidth_aid The maximum negative time interval presented in the display of the auditory image (to the right of 0 ms).

Default units: ms. Default value: -5 ms.

Note that the minus sign is required when entering nwidth_aid.
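The sign conventions for pwidth_aid and nwidth_aid can be made concrete with a small Python sketch that maps a time interval to a horizontal position. The function name and the window width in pixels are assumptions for illustration only; AIM's own display code is not shown in this manual entry.

```python
def interval_to_x(interval_ms, width_px, pwidth_aid=35.0, nwidth_aid=-5.0):
    """Horizontal pixel for a time interval.

    +pwidth_aid maps to the left edge and nwidth_aid (a negative number)
    maps to the right edge, so the 0-ms origin sits near the right.
    """
    span = pwidth_aid - nwidth_aid
    return (pwidth_aid - interval_ms) / span * (width_px - 1)

left_edge = interval_to_x(35.0, 401)
origin = interval_to_x(0.0, 401)
right_edge = interval_to_x(-5.0, 401)
```

With the default values the origin lies seven eighths of the way across the display, leaving the remaining eighth for negative time intervals.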

animate Present the frames of the simulated auditory image as a cartoon.

Switch. Default value: off.

With reasonable resolution and a reasonable frame rate, the auditory cartoon for a second of sound will require on the order of 1 Mbyte of storage. As a result, auditory cartoons are only stored at the specific request of the user. When the animate flag is set to ‘on’, the bit maps that constitute the frames of the auditory cartoon are stored in computer memory. They can then be replayed as an auditory cartoon by pressing ‘carriage return’. To exit the instruction, type "q" for ‘quit’ or "control c". The bit maps are discarded unless option bitmap is set to on.

There is a silent option ’review’ associated with animate. Setting review=on causes AIM to pause between the frames of the cartoon and wait for a ’return’.



A record of the auditory image can be stored in two ways depending on the purpose for which it is stored. The actual numerical values of the auditory image can be stored as previously, by setting output=on. In this case, a file with a .sai suffix will be created in accordance with the conventions of the software. These values can be recalled for further processing with the aimTools. In this regard the SAI module is like any previous module.

It is also possible to store the bit maps which are displayed on the screen for the auditory image cartoon. The bit maps require less storage space and reload more quickly, so this is the preferred mode of storage when one simply wants to view the auditory image.

bitmap Produce a bit-map storage file

Switch. Default value: off.

When the bitmap option is set to ‘on’, the bit maps are stored in a file with the suffix .ctn. The bitmaps are reloaded into memory using the instructions review, or xreview, followed by the file name without the suffix .ctn. The auditory image can then be replayed, as with animate, by typing ‘carriage return’. xreview is the newer and preferred display routine. It enables the user to select subsets of the cartoon and to change the rate of play via a convenient control window. It does, however, require an ANSI C compiler like gcc.



There are six options with the suffix "_ai", short for ’auditory image’. The first option, napdecay_ai, controls the decay rate for the NAP while it flows down the NAP buffer and before it is transferred to the auditory image. In point of fact, then, napdecay_ai is a NAP option rather than an auditory image option. But its effects are only observed in the auditory image and so it is grouped with the other _ai options. The next four options control the STI process -- stdecay_ai, stcrit_ai, stlag_ai and decay_ai. The final option, stinfo_ai, is a switch that causes the software to produce information about the current STI analysis for demonstration or diagnostic purposes.

The strobe mechanism is conceptually simple. An adaptive strobe threshold is set up for each channel of the NAP and its value is set to the height of the first NAP pulse when it occurs. Thereafter, the strobe threshold decays with time in the absence of suprathreshold NAP pulses. The rate of decay is an option, stdecay_ai, which is specified in %/ms; the default rate is 5 %/ms. When a NAP pulse next exceeds the strobe threshold, the level of the strobe threshold is reset to the height of the peak of the NAP pulse, and the time of the peak of the NAP pulse is recorded as a potential temporal integration time. There is then a short lag before integration, to see if another larger NAP pulse is about to occur, since the production of stable images depends on identifying local maxima in the NAP. The strobe lag is an option, stlag_ai, whose default value is 5 ms. If a larger NAP pulse occurs within stlag_ai ms, its peak time becomes the new potential integration time, the strobe threshold level is reset to the height of the peak of the new NAP pulse, and the strobe lag is reset to stlag_ai. In the event of a continuous stream of rising NAP pulses, the total strobe lag after the occurrence of the first suprathreshold NAP pulse is limited to twice stlag_ai.
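The behaviour described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the gensai source: the NAP peaks are assumed to be already extracted as (time, height) pairs, the threshold is assumed to decay linearly at the %/ms rate given for stdecay_ai, smaller peaks that arrive during the lag are simply ignored, and the function name is hypothetical.

```python
def find_strobes(peaks, stdecay_pct_per_ms=5.0, stlag_ms=5.0):
    """Pick integration times from (time_ms, height) NAP peaks in one channel."""
    strobes = []
    candidate = None             # (time, height) of the pending integration point
    first_time = None            # first suprathreshold peak of the current lag
    thr_height, thr_time = 0.0, 0.0

    def threshold(t):
        # Linear decay as a percentage of the height that last reset the threshold.
        elapsed = t - thr_time
        return max(0.0, thr_height * (1.0 - stdecay_pct_per_ms / 100.0 * elapsed))

    for t, h in peaks:
        if candidate is not None:
            lag_expired = t - candidate[0] > stlag_ms
            cap_reached = t - first_time > 2 * stlag_ms
            if lag_expired or cap_reached:
                strobes.append(candidate[0])    # commit the pending candidate
                candidate = None
        if candidate is not None:
            if h > candidate[1]:                # larger peak inside the lag
                candidate = (t, h)
                thr_height, thr_time = h, t
        elif h > threshold(t):                  # new suprathreshold peak
            candidate = (t, h)
            first_time = t
            thr_height, thr_time = h, t
    if candidate is not None:
        strobes.append(candidate[0])
    return strobes

# Hypothetical periodic channel: small precursor, main peak, smaller follower,
# repeating every 8 ms.
periodic = [(8 * p + dt, h) for p in range(4) for dt, h in ((0, 2.0), (2, 4.0), (4, 3.0))]
periodic_strobes = find_strobes(periodic)

# Hypothetical stream of steadily rising peaks, 3 ms apart.
rising = [(3 * i, float(i + 1)) for i in range(6)]
rising_strobes = find_strobes(rising)
```

In the periodic case the sketch strobes once per cycle, on the main peak. In the rising case the cap of twice stlag_ms forces integration on the largest peak seen so far instead of waiting indefinitely.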

napdecay_ai Decay rate for the neural activity pattern (NAP)

Default units: %/ms. Default value: 2.5 %/ms.

napdecay_ai determines the rate at which the information in the neural activity pattern decays as it proceeds along the auditory buffer that stores the NAP prior to temporal integration.

stdecay_ai Strobe threshold decay rate

Default units: %/ms. Default value: 5 %/ms.

stdecay_ai determines the rate at which the strobe threshold decays. At 5 %/ms, strobe threshold returns to zero from a NAP peak of any height in 20 ms, in the absence of further suprathreshold NAP pulses. Note that, in absolute terms, strobe threshold decays faster after large NAP peaks than after small NAP peaks.
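A decay rate quoted in %/ms can be read as a linear fall expressed as a percentage of the peak height per millisecond. Under that assumption (the function names are hypothetical), the arithmetic looks like this in Python:

```python
def strobe_threshold(peak_height, elapsed_ms, stdecay_pct_per_ms=5.0):
    """Threshold level elapsed_ms after a peak, falling linearly as a percentage of the peak."""
    remaining = 1.0 - stdecay_pct_per_ms / 100.0 * elapsed_ms
    return max(0.0, peak_height * remaining)

def time_to_zero_ms(rate_pct_per_ms):
    """Time for a linear percentage decay to reach zero, whatever the starting height."""
    return 100.0 / rate_pct_per_ms

# Absolute fall over the first 4 ms after a large and after a small peak.
large_drop = 8.0 - strobe_threshold(8.0, 4.0)
small_drop = 2.0 - strobe_threshold(2.0, 4.0)
```

The time to reach zero depends only on the rate, whereas the absolute fall per millisecond scales with the peak height, so the threshold drops four times as far after the larger peak in this example.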

stcrit_ai Strobe criterion

Switch: Default value: 5

The stabilisation of NAP features in the auditory image, and preservation of their asymmetry, only occurs if the strobe unit successfully identifies local maxima in the NAPs of periodic and quasi-periodic sounds (Allerhand and Patterson, 1992). A detailed description of the STI process is presented in /docs/aimStrobeCriterion. It begins with the simplest criterion for initiating temporal integration -- strobe on every non-zero NAP point. This converts sharp, asymmetric NAP features into rough-edged, largely-symmetric features in the auditory image. From there the discussion proceeds to more restrictive criteria which gradually reduce the rough edges and restore the asymmetry of the features in the auditory image. The degree of restriction in the strobe criterion is an option in AIM (stcrit_ai) which enables the user to experiment with the relationship between STI and autocorrelation. There are five levels of restriction as noted in the following table:

1 Strobe on every non-zero point in the NAP.

2 Strobe on the peak of every NAP pulse.

3 Avoid strobing on NAP peaks in the temporal shadow of a large peak.

4 Avoid strobing on peaks followed by larger peaks.

5 Do not wait more than twice stlag_ai before initiating integration.
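The two least restrictive levels in the table can be sketched directly on a sampled NAP channel. The Python below is an illustration only: the function name and the toy data are assumptions, and levels 3 to 5, which depend on the strobe threshold and lag, are deliberately not covered.

```python
def strobe_points(nap, stcrit):
    """Sample indices selected by strobe criterion 1 or 2."""
    if stcrit == 1:
        # Level 1: every non-zero point in the NAP.
        return [i for i, v in enumerate(nap) if v != 0]
    if stcrit == 2:
        # Level 2: the peak of every NAP pulse (a positive local maximum).
        return [
            i for i in range(1, len(nap) - 1)
            if nap[i] > 0 and nap[i - 1] < nap[i] >= nap[i + 1]
        ]
    raise ValueError("only criteria 1 and 2 are sketched here")

# Toy channel (hypothetical data): one asymmetric pulse every 8 samples.
nap = [0, 4, 3, 2, 1, 0, 0, 0] * 3
every_point = strobe_points(nap, 1)
pulse_peaks = strobe_points(nap, 2)
```

Level 1 strobes four times per pulse in this example, which is what smears the asymmetric pulse shape in the image; level 2 strobes once per pulse.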

stlag_ai Auditory image strobe lag time (in ms)

Default units: ms. Default value: 5 ms.

For strobe criterion levels 4 and above, following detection of a NAP peak that exceeds strobe threshold, there is a short lag (stlag_ai) before integration, to see if another larger NAP pulse is about to occur. If a larger NAP pulse occurs within stlag_ai ms, its peak time becomes the new potential integration time, the strobe threshold level is reset to the height of the peak of the new NAP pulse, and the strobe lag is reset to stlag_ai (stcrit_ai=4). In the event of a continuous stream of rising NAP pulses, the total strobe lag after the occurrence of the first suprathreshold NAP pulse is limited to twice stlag_ai (stcrit_ai=5). The strobe lag default value is 5 ms. See Sections 4 and 5 of /docs/aimStrobeCriterion for an illustration of the effects of these criteria on damped and ramped sinusoids.

General purpose pitch mechanisms based on peak picking are notoriously difficult to design, and the strobe mechanism just described would not work well on an arbitrary acoustic waveform. The reason that this simple strobe mechanism is sufficient for the construction of the auditory image is that NAP functions are highly constrained. The microstructure reveals a function that rises from zero to a local maximum smoothly and returns smoothly back to zero where it stays for more than half of a period of the centre frequency of that channel. On the longer time scale, the amplitude of successive peaks changes only relatively slowly with time. As a result, for periodic sounds there tends to be one clear maximum per period in all but the lowest channels where there is an integer number of maxima per period. The simplicity of the NAP functions follows from the fact that the acoustic waveform has passed through a narrow band filter and so it has a limited number of degrees of freedom. In all but the highest frequency channels, the output of the auditory filter resembles a modulated sine wave whose frequency is near the centre frequency of the filter. Thus the neural activity pattern is largely restricted to a set of peaks which are modified versions of the positive halves of a sine wave, and the remaining degrees of freedom appear as relatively slow changes in peak amplitude and relatively small changes in peak time (phase).

decay_ai Auditory image half life

Default units: ms. Default value: 30 ms.

When the input sound terminates, the auditory image must decay. In AIM the form of the decay is exponential and the decay rate is specified as the time taken for the image to reduce in level by half. In addition, decay_ai determines the rate at which the strength of the auditory image increases when a sound comes on, and the level to which it asymptotes if the sound continues indefinitely at a fixed level. In an exponential process, the asymptote is reached when the increment provided by each new cycle of the sound equals the amount that the image decays over the same period.
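The build-up and asymptote described above follow from a simple recurrence, sketched here in Python. The function names and the unit increment are assumptions for illustration; the level is measured just after each strobed cycle has been added.

```python
def image_level(n_cycles, period_ms, increment=1.0, decay_ai_ms=30.0):
    """Image level after n strobed cycles, each adding `increment` after one period of decay."""
    keep = 0.5 ** (period_ms / decay_ai_ms)    # fraction surviving one period
    level = 0.0
    for _ in range(n_cycles):
        level = level * keep + increment
    return level

def asymptote(period_ms, increment=1.0, decay_ai_ms=30.0):
    """Level at which the decay over one period equals the per-cycle increment."""
    keep = 0.5 ** (period_ms / decay_ai_ms)
    return increment / (1.0 - keep)

steady_state = asymptote(8.0)            # a sound with an 8-ms period
after_many = image_level(500, 8.0)       # converges on the same value
```

A longer half life raises the asymptote and slows both the onset and the decay of the image; when the period equals the half life, the asymptote is exactly twice the per-cycle increment.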

stinfo_ai Strobe threshold information. (Values: off, on, filename)

Switch. Default value: off.

When the switch is on, gensai outputs the strobe threshold function either to the terminal (stinfo_ai=on) or to a designated file (stinfo_ai=<filename>). It also appends the times at which temporal integration would be initiated. This pair of data streams can then be combined with the NAP to produce a display (x11plot) that illustrates the operation of the strobe threshold. See the script StrobeCriterionDisplay for a demonstration. It is this script which is used to produce the strobe threshold figures in aimStrobeCriterion.


This section presents a pair of examples intended to illustrate the predominant forms of motion that dynamic sounds produce in the auditory image, and the fact that structures and features can be tracked across the image provided the rate of change is not excessive. The first example is a pitch glide for a note with fixed timbre; it produces predominantly horizontal motion in the auditory image. The second example is a timbre glide for a note with fixed pitch; it produces predominantly vertical motion in the auditory image.

    A Pitch Glide in the Auditory Image

To this point, the discussion has focussed on how to convert a NAP from a periodic sound with a repeating pattern into a stabilised auditory image without smearing the fine structure of the NAP pattern. The mechanism is not, however, limited to periodic sounds. The sound file cegc contains a set of click trains that produce four musical notes referred to as C3, E3, G3, and C4, along with glides from one note to the next. The notes are relatively long (300 ms) and the pitch glides are relatively slow (300 ms for 3-5 semitones). As a result, each note forms a stabilised auditory image and there is smooth motion in the image over the 300-ms interval as the sound glides from one note to the next. The pitch of musical notes is determined by the lower harmonics when they are present and so the frequency range is limited to 2000 Hz. The demonstration is generated and stored with the instruction

> gensai channels=40 max=2000 input=cegc bitmap=on

It can then be replayed at will with either ’review cegc’ or ’xreview cegc’. (Click on the image with the middle mouse button to pull up the control window for xreview.) For brevity, the example can be limited to the transition from C to E near the start of the sound using the instruction

> gensai channels=40 max=2000 start=150 length=600 input=cegc

(In point of fact, the click train associated with the first note has a period of 8 ms; so this "C" is actually a little below the musical note C3.)
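The size of this discrepancy can be checked with a few lines of Python. The reference tuning (equal temperament with A4 = 440 Hz) is an assumption of this sketch; the manual does not say which tuning is intended.

```python
import math

period_ms = 8.0
click_rate_hz = 1000.0 / period_ms            # repetition rate of the click train

# Equal-tempered C3, which lies 21 semitones below A4 = 440 Hz (assumed tuning).
c3_hz = 440.0 * 2.0 ** (-21 / 12)

# Interval between the click train and C3 in semitones (negative means flat).
offset_semitones = 12.0 * math.log2(click_rate_hz / c3_hz)
```

Under that assumption the click train repeats at 125 Hz, a little under a semitone flat of C3.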

When the note comes on, a stable image of the first note forms over the first 4-6 cycles of the note. The vertical structure that repeats four times across the image is the time-interval pattern that identifies a click-train sound. When the transition begins, in the lower channels associated with the first and second harmonic, the individual SAI pulses move from left to right. At the same time, they move up in frequency as these resolved harmonics move up into filters with higher centre frequencies. In these low channels the motion is relatively smooth because the SAI pulses have a duration which is a significant proportion of the period of the sound. As the pitch rises and the periods get shorter, each new NAP cycle contributes a NAP pulse which is shifted a little to the right relative to the corresponding SAI pulse. This increases the leading edge of the SAI pulse without contributing to the lagging edge. As a result, the leading edge builds, the lagging edge decays, and the SAI pulse moves to the right. The SAI pulses are asymmetric during the motion, with the trailing edge more shallow than the leading edge, and the effect is greater towards the left edge of the image because the discrepancies over four cycles are larger than the discrepancies over one cycle. The effects are larger for the second harmonic than for the first harmonic because the width of the pulses of the second harmonic is a smaller proportion of the period. During the pitch glide the SAI pulses have a reduced peak height because the activity is distributed over more channels and over more time intervals.

The SAI pulses associated with the higher harmonics are relatively narrow with respect to the changes in period during the pitch glide. As a result there is more blurring of the image during the glide in the higher channels. Towards the right-hand edge, for the column that shows correlations over one cycle, the blurring is minimal. Towards the left-hand edge the details of the pattern are blurred and we see mainly activity moving in broad vertical bands from left to right. When the glide terminates the fine structure reforms from right to left across the image and the stationary image for the note E appears.

The details of the motion are more readily observed when the image is played in slow motion, using review or xreview and one of the ’slow down’ options.

    A Timbre Glide in the Auditory Image

The vowels of speech are quasi-periodic sounds and the period for the average male speaker is on the order of 8 ms. As the articulators change the shape of the vocal tract during speech, formants appear in the auditory image and move about. The position and motion of the formants is an important part of the information conveyed by the voiced parts of speech. When the speaker uses a monotone voice, the pitch remains relatively steady and the motion of the formants is essentially in the vertical dimension. An example of monotone voiced speech is provided in the file leo which is the acoustic waveform of the word ’leo’. The auditory image of leo can be produced and viewed with the instruction

> gensai input=leo bitmap=on animate=on

It can be replayed under user control with either

> review leo    or

> xreview leo

The dominant impression on first observing the auditory image of leo is the motion in the formation of the "e" sound, the transition from "e" to "o", and the formation of the "o" sound.

The vocal cords come on at the start of the "l" sound but the tip of the tongue is pressed against the roof of the mouth just behind the teeth and so it restricts the air flow and the start of the "l" does not contain much energy. As a result, in the auditory image, the presence of the "l" is primarily observed in the transition from the "l" to the "e". That is, as the three formants in the auditory image of the "e" come on and grow stronger, the second formant glides into its "e" position from below, indicating that the second formant was recently at a lower frequency for the previous sound. The details of the motion are more readily observed when the image is played in slow motion, using review or xreview and one of the ’slow down’ options.

In the "e", the first formant is low, centred on the third harmonic at the bottom of the auditory image. The second formant is high, up near the third formant. The lower portion of the fourth formant shows along the upper edge of the image. Recognition systems that ignore temporal fine structure often have difficulty determining whether a high frequency concentration of energy is a single broad formant or a pair of narrower formants close together. This makes it more difficult to distinguish "e". In the auditory image, information about the pulsing of the vocal cords is maintained and the temporal fluctuation of the formant shapes makes it easier to distinguish that there are two overlapping formants rather than a single large formant.

As the "e" changes into the "o", the second formant moves back down onto the eighth harmonic and the first formant moves up to a position between the third and fourth harmonics. The third and fourth formants remain relatively fixed in frequency but they become softer as the "o" takes over. During the transition, the second formant becomes fuzzy as it moves down the vertical ridges of the glottal pulse.


Assmann, P. F. and Q. Summerfield (1990). "Modelling the perception of concurrent vowels: Vowels with different fundamental frequencies," J. Acoust. Soc. Am. 88, 680-697.

Licklider, J. C. R. (1951). "A duplex theory of pitch perception," Experientia, 7, 128-133. Reprinted in E.D. Schubert (ed.), Psychological Acoustics. Stroudsburg, PA: Dowden, Hutchinson and Ross Inc. (1979).

Lyon, R.F. (1984). "Computational models of neural auditory processing," In: Proc. IEEE Int. Conf. Acoust. Speech Signal Processing. San Diego, CA. March 1984.

Meddis, R. and M. J. Hewitt (1991a). "Virtual pitch and phase sensitivity of a computer model of the auditory periphery: I pitch identification," J. Acoust. Soc. Am. 89, 2866-82.

Patterson, R.D. (1987b). "A pulse ribbon model of monaural phase perception," J. Acoust. Soc. Am. 82, 1560-1586.

Patterson, R.D., Holdsworth, J. and Allerhand M. (1992a). "Auditory Models as preprocessors for speech recognition," In: The Auditory Processing of Speech: From the auditory periphery to words, M.E.H. Schouten (ed), Mouton de Gruyter, Berlin, 67-83.

Patterson, R.D., Robinson, K., Holdsworth, J., McKeown, D., Zhang, C. and Allerhand M. (1992b) "Complex sounds and auditory images," In: Auditory physiology and perception, Y Cazals, L. Demany, K. Horner (eds), Pergamon, Oxford, 429-446.

Patterson, R.D. (1994a). "The sound of a sinusoid: Spectral models," J. Acoust. Soc. Am. 96, 1409-1418.

Patterson, R.D. (1994b). "The sound of a sinusoid: Time-interval models," J. Acoust. Soc. Am. 96, 1419-1428.

Patterson, R.D. and Akeroyd, M. A. (1995). "Time-interval patterns and sound quality," in: Advances in Hearing Research: Proceedings of the 10th International Symposium on Hearing, G. Manley, G. Klump, C. Koppl, H. Fastl, & H. Oeckinghaus, (Eds). World Scientific, Singapore, 545-556.

Patterson, R.D., Allerhand, M., and Giguere, C., (1995). "Time-domain modelling of peripheral auditory processing: A modular architecture and a software platform," J. Acoust. Soc. Am. 98-3, (in press).

Slaney, M. and Lyon, R.F. (1990). "A perceptual pitch detector," in Proc. IEEE Int. Conf. Acoust. Speech Signal Processing, Albuquerque, New Mexico.


.gensairc The options file for gensai.


genspl, acgram


Copyright (c) Applied Psychology Unit, Medical Research Council, 1995

Permission to use, copy, modify, and distribute this software without fee is hereby granted for research purposes, provided that this copyright notice appears in all copies and in all supporting documentation, and that the software is not redistributed for any fee (except for a nominal shipping charge). Anyone wanting to incorporate all or part of this software in a commercial product must obtain a license from the Medical Research Council.

The MRC makes no representations about the suitability of this software for any purpose. It is provided "as is" without express or implied warranty.



The AIM software was developed for Unix workstations by John Holdsworth and Mike Allerhand of the MRC APU, under the direction of Roy Patterson. The physiological version of AIM was developed by Christian Giguere. The options handler is by Paul Manson. The revised SAI module is by Jay Datta. Michael Akeroyd extended the postscript facilities and developed the xreview routine for auditory image cartoons.

The project was supported by the MRC and grants from the U.K. Defence Research Agency, Farnborough (Research Contract 2239); the EEC Esprit BR Programme, Project ACTS (3207); and the U.K. Hearing Research Trust.

SunOS 5.6 GENSAI (1) 9 August 1995