01dB White Paper on PsychoAcoustics

AURALLY RELEVANT ANALYSIS BY SYNTHESIS: A NEW SOFTWARE APPROACH TO SOUND DESIGN

Peter Daniel – 01dB GmbH, Regensburg, Germany
Jacky DUMAS – 01dB-METRAVIB, Limonest France

Summary

A new tool for the visual perception of sound quality and sound design is presented. Visualizing sound quality requires a signal analysis that models the information processing of the ear. The first step of the signal processing implemented in the software calculates an auditory spectrogram with a filter bank adapted to the time and frequency resolution of the human ear. The second step removes redundant information by extracting time and frequency contours from the auditory spectrogram, in analogy to contours in the visual system. In a third step, contours and/or auditory spectrograms can be resynthesized, confirming that only aurally relevant information was extracted. The visualization by contours makes it possible to grasp the important components of a signal intuitively.

The contribution of individual parts of a signal to the overall quality can easily be auralized by editing and resynthesizing them. Resynthesis of the time contours alone, for example, makes it possible to auralize impulsive components separately from the tonal components. Further processing of the contours determines tonal parts in the form of tracks.

 

1. Introduction

The first step toward efficient sound design is a visualization of the perceived sound quality. Besides the time function, sounds are normally visualized with spectrograms or waterfall diagrams. A spectrogram displays the power of a sound signal, color coded, as a function of frequency and time. Instead of color, a waterfall diagram displays the power on a third axis. For most sounds a waterfall diagram becomes quite intricate, so spectrograms are used in most cases. To visualize the perceived spectral and temporal characteristics of a sound, the algorithms that derive the spectrogram from the time signal should first model the frequency resolution and time selectivity of the human ear.

Frequency resolution

Conventional spectrograms and waterfall displays use a linear or logarithmic frequency axis. The frequency scaling of the human ear, however, is modeled by the Bark scale [1], which is neither linear nor logarithmic. In addition, the analysis bandwidth of the ear is not constant, unlike that of a Fourier analysis. A numerical table and formulas for transforming frequency in Hz into frequency in Bark are given in the appendix Critical-Band Rate Scale.

 

Temporal Selectivity

The temporal selectivity of the human ear has been investigated in numerous studies. Results depend on the selected signals and the selected scaling methods. In general one can state that the temporal selectivity of the human ear lies in the range of one millisecond [1]. 

Fourier Analysis

Fourier analysis is a good tool for investigating the physical cause of a signal, but to visualize perceived spectral and temporal features, the time window of the Fourier transform always has to be adapted to the time and frequency characteristics of the signal being analyzed. This is often a problem: the characteristics may not be known, changing the time window repeatedly is inconvenient, or it is not possible at all.

Consider, for example, an amplitude-modulated sine. At low modulation frequencies the modulation is perceived as a change in loudness, at higher modulation frequencies the tone becomes rough, and at very high modulation frequencies the ear detects the side bands as separate tones. In the Fourier analysis, however, the side bands can be resolved as separate tones in all of these cases. Which window is the correct one?
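The point about side bands can be checked numerically. The following is a minimal sketch, not part of the dBSONIC software: the sampling rate, carrier, modulation frequency and modulation depth are arbitrary assumptions chosen so that all components fall exactly on FFT bins.

```python
import numpy as np

FS = 8000                          # sampling rate in Hz (assumed)
FC, FM, DEPTH = 1000.0, 70.0, 0.5  # carrier, modulation frequency, depth (assumed)

def strongest_bins(duration_s, count=3):
    """Frequencies (Hz, rounded) of the strongest FFT bins of an AM sine."""
    n = int(round(FS * duration_s))
    t = np.arange(n) / FS
    x = (1.0 + DEPTH * np.cos(2 * np.pi * FM * t)) * np.sin(2 * np.pi * FC * t)
    magnitude = np.abs(np.fft.rfft(x))       # rectangular window
    freqs = np.fft.rfftfreq(n, d=1.0 / FS)
    top = np.argsort(magnitude)[-count:]
    return sorted(int(round(f)) for f in freqs[top])

# A 1 s window has a bin spacing of 1 Hz: carrier and both side bands are resolved.
long_window = strongest_bins(1.0)
print(long_window)

# A 5 ms window has a bin spacing of 200 Hz, wider than the 70 Hz side band offset.
short_window_spacing_hz = FS / int(round(FS * 0.005))
print(short_window_spacing_hz)
```

With the long window the three strongest bins are the carrier and its two side bands at FC ± FM, regardless of whether a listener would hear a loudness fluctuation, roughness or separate tones. With the short window the bin spacing exceeds the side band offset, so the modulation can only show up as a level change over time.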

Selecting the correct time window to model perception becomes more complex, or even impossible, in cases of impulsive excitation.
As an example, Figure 1 shows an impulse train, perceived as a rough harmonic tone, analyzed with FFTs using different time windows. Depending on the time window, either the impulsive/rough structure or the harmonics are resolved.
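The trade-off between the two FFT settings quoted in the caption of Figure 1 follows directly from the window length. The sketch below assumes a sampling rate of 44.1 kHz and window lengths of 2048 and 128 samples; neither is stated in the text, but together they reproduce the quoted Δf and Δt values.

```python
FS_HZ = 44100  # assumed sampling rate; not stated in the text

def fft_resolution(n_window, fs_hz=FS_HZ):
    """Return (frequency spacing in Hz, window duration in ms) of an N-point FFT."""
    delta_f_hz = fs_hz / n_window
    delta_t_ms = 1000.0 * n_window / fs_hz
    return delta_f_hz, delta_t_ms

for n in (2048, 128):  # assumed window lengths
    delta_f, delta_t = fft_resolution(n)
    print(f"N = {n:4d}: delta_f = {delta_f:6.1f} Hz, delta_t = {delta_t:5.1f} ms")
```

Because Δf · Δt is fixed for a given window, a single FFT setting cannot offer both the fine frequency spacing of the long window and the fine time resolution of the short one.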

Auditory Spectrograms

In our new software approach to sound design, the dBSONIC PerceptualXplorer, frequency scaling and analysis bandwidth are adapted to the frequency and time selectivity of the human ear, forming an aurally adequate signal representation: the auditory spectrogram. In contrast to the FFT analyses of the pulse train displayed in Figure 1, the auditory spectrogram shows the harmonic and the impulsive character of the pulse train in one picture.

 

Contours

In a waterfall diagram, the aurally adequate signal representation of a pure sine would form a mountain range, resembling the excitation of the basilar membrane of the ear. But a pure sine signal is heard as a pure sine. A visual analogy is a single line, the ridge of the mountain range! Thus the ear performs some kind of contouring of the sound representation found at the stage of the basilar membrane. In dBSONIC this is modeled in a second stage: maxima extracted in each spectrum of the auditory spectrogram form frequency contours, and maxima extracted in each filter channel form time contours. Frequency contours contain the tonal components of a sound; time contours represent impulsive components (Figure 2).
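The two kinds of contours can be illustrated with a simple local-maximum search. This is only a sketch under the assumption that the auditory spectrogram is available as a 2-D array (rows are filter channels, columns are time frames); the function names and the toy data are hypothetical, and the actual dBSONIC contouring may apply further criteria.

```python
import numpy as np

def frequency_contours(asp):
    """Mark local maxima along the frequency axis of each spectrum (column)."""
    mask = np.zeros(asp.shape, dtype=bool)
    mask[1:-1, :] = (asp[1:-1, :] > asp[:-2, :]) & (asp[1:-1, :] >= asp[2:, :])
    return mask

def time_contours(asp):
    """Mark local maxima along the time axis of each filter channel (row)."""
    mask = np.zeros(asp.shape, dtype=bool)
    mask[:, 1:-1] = (asp[:, 1:-1] > asp[:, :-2]) & (asp[:, 1:-1] >= asp[:, 2:])
    return mask

# Toy auditory spectrogram: 7 channels x 9 frames
asp = np.zeros((7, 9))
asp[2, :] = 0.5
asp[3, :] = 1.0   # stationary tone: a "mountain range" with its ridge in channel 3
asp[4, :] = 0.5
asp[:, 5] += 2.0  # click: broadband energy in frame 5

tonal = frequency_contours(asp)
impulsive = time_contours(asp)
print(np.flatnonzero(tonal.any(axis=1)))      # channels carrying a frequency contour
print(np.flatnonzero(impulsive.any(axis=0)))  # frames carrying a time contour
```

In the toy example the frequency contour is the ridge of the tone in channel 3, while the time contour picks out the click in frame 5 across all channels.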

 

 

 

 

 


Figure 1: Top left: time signal of the pulse train. Top right: FFT analysis of the pulse train with Δf = 21.5 Hz and Δt = 46.4 ms. Bottom left: FFT with Δf = 344.5 Hz and Δt = 2.9 ms. Bottom right: auditory spectrogram of the same signal with Δf = 5% of a Bark and Δt = 1 ms, adapted to the time and frequency resolution of the human ear.

Resynthesis

Auditory spectrograms, frequency contours and time contours can be manipulated and resynthesized, either individually or overlaid, as shown in Figure 3.

Auditory spectrograms and contours are excellent tools for the visual exploration and analysis of sounds, as demonstrated in many studies, especially in the field of sound quality [9, 13, 14, 15], but also in musical acoustics [8, 16, 17], speech processing [6, 10, 11] and auditory scene analysis [8]. Resynthesis of the complete contour set results in sounds nearly indistinguishable from the original. Thus the contours contain all relevant acoustical information of the original time signal.

 

 

 

 

 

 


Figure 2: Contourization. Top left: frequency contours of the impulse train of Figure 1. Top right: further processing of the frequency contours into frequency tracks, the tonal part of the signal. Bottom left: time contours representing the impulsive part of the signal. Bottom right: overlaid frequency and time contours representing the whole auditory information of the signal.

 

Figure 3: possible ways of analysis and resynthesis with our Software

 

2.  Psychoacoustic foundation

Auditory analysis and contourization in the software are based on Terhardt's approach to human auditory information processing [4, 6, 7]. The basic question in this approach is: which information in the audio signal is extracted by the human ear? A stationary signal can be composed of sinusoidal components, or part tones. Each part tone is characterized by three parameters: frequency, amplitude and phase. Assuming that frequencies and amplitudes change relatively slowly over time, time-varying signals can also be described by this approach.
The amplitudes and phases of the Fourier coefficients of the source signal are normally changed drastically on their way to the receiving human ear. The only robust parameters are the frequencies of the time-variable part tones. Only the robustness of these part-tone frequencies enables listeners to hear the true acoustic properties of a source. Psychoacoustic experiments [1, 7] have shown that the ear in fact works like a frequency analyzer. Within its temporal resolution of a few milliseconds, the frequencies and amplitudes of most auditory signals do not change.

2.1 Gestalt interpretation

Human information processing consists of hierarchical steps of decision processes [6]. This is necessary because, for example, the highest stage ('consciousness') has very limited processing power compared with the wealth of information flowing in through the senses. Decision in information processing corresponds to contouring in perception [6, 7]. From these contours auditory objects ('gestalten') are formed. Basic gestalt perception is already implemented in the software in the form of tracks. These connect neighboring contour elements as long as their frequency spacing is within a range found in subjective grouping experiments. Tracks must have a minimum length in order to be recognized as tonal. Components not belonging to tracks are rejected as noise, so track building can also be used for noise reduction.
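Track building as described above can be sketched as a greedy linking of contour elements from frame to frame. The code below is an illustration only: the maximum frequency jump of 0.2 Bark and the minimum length of 3 frames are placeholder assumptions (the text gives no numbers), and `build_tracks` is a hypothetical name, not a dBSONIC function.

```python
def build_tracks(frames, max_jump=0.2, min_length=3):
    """Link contour frequencies (in Bark) of successive frames into tracks.

    frames: list with one entry per time frame, each a list of frequencies.
    Returns the tracks as lists of (frame_index, frequency) that reach
    min_length; shorter chains are rejected as noise.
    """
    finished, active = [], []
    for i, peaks in enumerate(frames):
        unused = list(peaks)
        still_active = []
        for track in active:
            last = track[-1][1]
            best = min(unused, key=lambda f: abs(f - last), default=None)
            if best is not None and abs(best - last) <= max_jump:
                track.append((i, best))       # continue the track
                unused.remove(best)
                still_active.append(track)
            else:
                finished.append(track)        # no neighbor close enough
        still_active.extend([(i, f)] for f in unused)  # start new tracks
        active = still_active
    finished.extend(active)
    return [t for t in finished if len(t) >= min_length]

# One slowly gliding tone near 5 Bark plus two isolated contour elements
frames = [[5.0, 12.0], [5.05], [5.1, 9.0], [5.1], []]
tracks = build_tracks(frames)
print(tracks)
```

Only the gliding component survives as a track; the isolated elements at 12 and 9 Bark are too short and are dropped, which is the noise-reduction effect mentioned in the text.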

3. Calculation

The auditory analysis implemented in the software is based on a customized STFT (short-term Fourier transform):

F(ωA, t) = ∫ f(τ) e^(−jωA·τ) h(ωA, t − τ) dτ     (1)

Looking at the STFT spectrum versus time at an analysis frequency ωA, equation (1) can be seen as a convolution of the window function h(ωA, t) at the constant frequency ωA with the product of the signal f(t) and the complex oscillation e^(−jωA·t).

The multiplication with the complex oscillation e^(−jωA·t) shifts the frequency of the signal f(t) by −ωA. Convolution with the window h(ωA, t) corresponds in the frequency domain to a multiplication of the shifted spectrum with the transfer function HωA(ω) of the window. If HωA(ω) is an appropriate low-pass filter, the output represents the signal content around the analysis frequency ωA, selected with the analysis bandwidth of the low-pass filter. Suitable analysis low-pass filters of order n are defined by:

with n poles at px/ω3dB = a and with
.
The corresponding window function is defined by

with ω3dB = π·B3dB(ωA), where B3dB(ωA) is the two-sided 3 dB analysis bandwidth of the low-pass for the analysis frequency ωA. The analysis bandwidth is selected proportional to the critical bandwidth ΔfG of the human ear (appendix Critical-Band Rate Scale).
Filters of 4th, 3rd, 2nd and 1st order may be selected. The default is the 4th order filter. Computation time increases with filter order. 1st order filters result in the time windows of Terhardt's Fourier-Time-Transformation [3, 4]. 2nd order filters are used by [9, 13-17]. Mummert [10] has shown that some audible components are masked when 2nd order filters are used, and that 4th order filters significantly improve the separation of transient events from stationary parts as well as the resynthesis quality. Thus time contouring is only suitable for order n > 2 [10].
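The analysis at a single frequency can be sketched as follows. Since the filter expressions are not reproduced in this copy, the code makes explicit assumptions: the low-pass has n coincident real poles at a·ω3dB, the factor a = 1/sqrt(2^(1/n) − 1) is derived from the −3 dB condition at ω3dB, and the window is the corresponding impulse response. The function names, the test signal and the bandwidth of 100 Hz are hypothetical.

```python
import math
import numpy as np

def pole_factor(n):
    """a such that n coincident real poles at a*w3dB give -3 dB at w3dB (assumed form)."""
    return 1.0 / math.sqrt(2.0 ** (1.0 / n) - 1.0)

def window(t, b3db_hz, n):
    """Impulse response p^n t^(n-1) exp(-p t) / (n-1)! for t >= 0, else 0."""
    p = pole_factor(n) * math.pi * b3db_hz   # w3dB = pi * B3dB
    t = np.asarray(t, dtype=float)
    tc = np.clip(t, 0.0, None)
    h = p ** n * tc ** (n - 1) * np.exp(-p * tc) / math.factorial(n - 1)
    return np.where(t >= 0.0, h, 0.0)

def analyze(x, fs, f_a_hz, b3db_hz, n=4):
    """One channel: shift the signal by -f_a, then low-pass by convolution."""
    t = np.arange(len(x)) / fs
    shifted = x * np.exp(-2j * np.pi * f_a_hz * t)
    h = window(t, b3db_hz, n) / fs           # sampled window, scaled by dt
    return np.convolve(shifted, h)[: len(x)]

fs = 16000
t = np.arange(int(0.2 * fs)) / fs
x = np.cos(2 * np.pi * 1000.0 * t)           # 1 kHz test tone

on_tone = abs(analyze(x, fs, 1000.0, 100.0)[-1])   # channel centered on the tone
off_tone = abs(analyze(x, fs, 2000.0, 100.0)[-1])  # channel 1 kHz away
print(on_tone, off_tone)
```

The channel centered on the tone settles at about half the tone amplitude (the other half sits at the negative frequency), while the channel 1 kHz away is almost empty. A full auditory spectrogram would repeat this for every analysis frequency, with the bandwidth following the critical bandwidth.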

Smoothing: The resulting auditory spectrogram (ASP) can be smoothed by filtering it with a first order low-pass filter before contouring. The bandwidth of this low-pass filter is adjusted via the bandwidth ratio, defined as the analysis bandwidth divided by the bandwidth of the smoothing low-pass filter. This ratio is applied up to the frequency limit, which is 3 kHz by default. Above the frequency limit, the bandwidth of the low-pass filter is proportional to the bandwidth of the analysis filter at the frequency limit. The default bandwidth ratio is 0, in which case no smoothing is applied. For analysis filter orders higher than 2, smoothing is not necessary and may even worsen resynthesis quality. With 1st order analysis filters, smoothing is necessary. A bandwidth ratio of 0.2 is proposed by [3] for 1st order analysis filters combined with an analysis bandwidth of 0.1 Bark. In this case a smoothing bandwidth of 0.5 Bark is applied, which is 5 times the analysis bandwidth.
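The relation between bandwidth ratio and smoothing bandwidth is a one-line calculation. The helper below is a hypothetical illustration of the definition given above, not a dBSONIC function.

```python
def smoothing_bandwidth(analysis_bw_bark, bandwidth_ratio):
    """Smoothing bandwidth from: ratio = analysis bandwidth / smoothing bandwidth.

    A ratio of 0 (the default) means that no smoothing is applied.
    """
    if bandwidth_ratio == 0:
        return None
    return analysis_bw_bark / bandwidth_ratio

print(smoothing_bandwidth(0.1, 0.2))  # settings proposed for 1st order filters
print(smoothing_bandwidth(0.1, 0))    # default: no smoothing
```

For the proposed 1st order settings this gives a smoothing bandwidth of 0.5 Bark, matching the value quoted in the text.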

Group Delay Compensation: Like the basilar membrane, the auditory filters applied in the calculation of the auditory spectrogram (ASP) cause a delay. The auditory system compensates for this delay, so a listener perceives events at different frequencies simultaneously although they occur at different times at the stage of the basilar membrane. The software performs exact delay compensation, too. The default compensation is 0.75 times the filter group delay for the default 4th order analysis filter, 0.667 times for the 3rd order filter, 0.5 times for the 2nd order filter and 0 times for the 1st order filter.
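To give a feeling for the magnitudes involved, the sketch below turns the default factors into a delay in milliseconds. It assumes a low-pass with n coincident real poles at p = ω3dB / sqrt(2^(1/n) − 1), whose group delay at the center of the band is n/p; this filter form is an assumption, as is the example bandwidth of 100 Hz.

```python
import math

# Default compensation factors per filter order, taken from the text
COMPENSATION = {4: 0.75, 3: 0.667, 2: 0.5, 1: 0.0}

def compensation_delay_ms(b3db_hz, n):
    """Compensated delay in ms for an n-th order low-pass (assumed filter form)."""
    w3db = math.pi * b3db_hz                        # two-sided 3 dB bandwidth
    p = w3db / math.sqrt(2.0 ** (1.0 / n) - 1.0)    # assumed pole position in rad/s
    group_delay_s = n / p                           # group delay at band center
    return 1000.0 * COMPENSATION[n] * group_delay_s

for n in (4, 3, 2, 1):
    print(n, round(compensation_delay_ms(100.0, n), 2))
```

Under this assumption the default factors equal (n − 1)/n, so the compensated delay is (n − 1)/p, which is the time at which a window of the form t^(n−1)·e^(−pt) reaches its maximum.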

Phases: In addition to the level spectrogram by default the phases of the auditory spectrogram (ASP) are stored, too. Thus the resynthesis of the sound from the ASP with phases is possible.

 

4. Analysis example: door slam sound

A car door slam with poor quality is analyzed. Sound quality is improved by removing a tonal component and by editing and shifting the transients in the auditory spectrogram.

Figure 4: Auditory spectrogram of a door slam noise with an audible whistle marked at the beginning

Figure 4 shows an auditory spectrogram of the door slam noise. The marked part causes a whistling noise. To remove it, a polygon of arbitrary shape can be drawn in the software with a few mouse clicks, and the contents of the shape can be deleted or reduced in level. The resulting spectrogram, or any part of it, can then be resynthesized so that the specific effect of the edit can be heard.
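The polygon edit can be pictured as a mask on the level spectrogram. The following sketch is not the dBSONIC implementation: it assumes the spectrogram is a 2-D array of levels in dB (rows are channels, columns are time frames), uses a standard even-odd point-in-polygon test, and the attenuation of 20 dB is an arbitrary example value.

```python
import numpy as np

def polygon_mask(shape, vertices):
    """Boolean mask of cells (row, col) lying inside a polygon (even-odd rule).

    vertices: list of (row, col) corner points in drawing order.
    """
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    inside = np.zeros(shape, dtype=bool)
    count = len(vertices)
    for k in range(count):
        (r1, c1), (r2, c2) = vertices[k], vertices[(k + 1) % count]
        if r1 == r2:
            continue  # an edge along a row never crosses a ray along that row
        crosses = (r1 > rows) != (r2 > rows)
        col_at_row = c1 + (rows - r1) * (c2 - c1) / (r2 - r1)
        inside ^= crosses & (cols < col_at_row)
    return inside

levels_db = np.full((6, 8), 60.0)                        # toy level spectrogram
whistle = [(0.5, 1.5), (0.5, 4.5), (3.5, 4.5), (3.5, 1.5)]  # marked region
mask = polygon_mask(levels_db.shape, whistle)

edited_db = levels_db.copy()
edited_db[mask] -= 20.0   # reduce the marked region in level
print(mask.sum(), edited_db.min(), edited_db.max())
```

Deleting the region instead of attenuating it would correspond to setting the masked cells to a very low level before resynthesis.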

In the next step, the audible vibration of the locking process shall be reduced. Therefore the marked impulse in the middle is removed.


Figure 5: After reduction of the initial whistling the second impulse is marked for removal.

After deleting the second impulse marked in Figure 5 and resynthesizing the whole ASP, the vibration is no longer audible. The door slam now sounds more solid but too hard, and the last impulse becomes audible as a completely separate event.

In the next step the third impulse is moved towards the first impulse, as shown in Figure 6. After resynthesis the sound quality of the door slam is much improved. The initial vibration and whistling are eliminated, and a solid target sound is easily obtained.

These audible differences between the sounds can also be verified by the loudness analysis shown in Figure 7. The removal of the second impulse reduces the overall loudness significantly. In the final target sound, the overall loudness shows a slight dip between the two remaining impulses. The last impulse is no longer detected as a single event, but the small dip is deep enough to tell the ear that the lock snapped correctly.



Figure 6: the last impulse is moved towards the first impulse after deleting the second impulse.




Figure 7: Loudness analysis of original and resynthesized sounds. Top left: specific loudness of the original sound. Bottom left: specific loudness after removal of the 2nd impulse. Top right: specific loudness after shifting of the 3rd impulse. Bottom right: overall loudness of the original sound (blue curve), after removal of the 2nd impulse (green curve) and after shifting of the 3rd impulse (red curve).

References

[1] Zwicker, E., Fastl, H.: Psychoacoustics - Facts and Models. Springer Verlag Berlin. 1990.

[2] Zwicker, E. and Terhardt, E.: Analytical expressions for critical-band rate and critical bandwidth as a function of frequency. J. Acoust. Soc. Am. 68, 1523, 1980.

[3] Terhardt, E.: Fourier transformation of time signals: conceptual revision, Acustica, 57: 242-256, 1985.

[4] Heinbach, W.: Aurally adequate signal representation: The Part-Tone-Time-Pattern. Acustica, 67, S. 113-121, 1988.

[6] Terhardt, E.: From speech to language: On auditory information processing. In: Schouten, M. E. H., Editor, The Auditory Processing of Speech: from Sounds to Words, S. 363-380. Mouton de Gruyter, Berlin, (1992).

[7] Terhardt, E.: Akustische Kommunikation. Grundlagen mit Hörbeispielen. Springer Verlag Berlin – Heidelberg 1998.

[8] Baumann, U.: Ein Verfahren zur Erkennung und Trennung multipler akustischer Objekte. Herbert Utz Verlag Wissenschaft, Dissertation, Munich, 1995.

[9] Heldmann, K.: Wahrnehmung, gehörgerechte Analyse und Merkmalsextraktion technischer Schalle. Dissertation, Munich 1994.

[10] Mummert, M.: Sprachcodierung durch Konturierung eines gehörangepaßten Spektrogramms und ihre Anwendung zur Datenreduktion. Fortschr.-Ber. VDI Reihe 10, VDI-Verlag Düsseldorf. Dissertation, Munich, 1998.

[11] Horn, T.: Image processing of speech with Auditory Magnitude Spectrograms. Acustica Vol. 84, 175-177, 1998.

[12] Terhardt, E., Stoll, G., Seewann, M.: Algorithm for extraction of pitch and pitch salience from complex tonal signals. J. Acoust. Soc. Am., Vol. 71, 679-688, 1982.

[13] Daniel, P., Ellermeier, W., Leclerc, P.: Tonalness and Unpleasantness of tire sounds: methods of assessment and psychoacoustical modeling. Euro-noise 98, 627-632, 1998.

[14] Vormann, M., Weber, R.: Gehörgerechte Darstellung von instationären Umweltgeräuschen mittels Fourier-Time-Transformation (FTT). Fortschritte der Akustik, DAGA 95, 1191-1194, 1995.

[15] Heldmann, K., Keiper, W.: Analyse von instationären technischen Geräuschen. Fortschritte der Akustik, DAGA 91, 761-764, 1991.

[16] Valenzuela, M.N.: Untersuchungen und Berechnungsverfahren zur Klangqualität von Klaviertönen. Dissertation, Herbert Utz Verlag, Munich 1998.

[17] Fleischer, H.: Schwingung und Schall von Glocken. Fortschritte der Akustik, DAGA 2000.

 

Appendix: Critical-Band Rate Scale

 

The auditory analysis uses a critical-band rate scale given in Bark whereas a linear frequency scale in Hz is used in conventional Fourier analysis. The Bark scale reflects the nonlinear frequency transformation of the human ear.


Table 1: Critical-band rate z as a function of frequency

In order to transform frequencies given in Hz into Bark the following approximation given in [1] is commonly used.

z/Bark = 13 arctan(0.76 f/kHz) + 3.5 arctan[(f/7.5 kHz)^2]

However, as shown in [2], deviations of up to 0.2 Bark may occur between the transformed values and the values tabulated in [1]. For the default frequency interval of 0.05 Bark, a difference of 0.2 Bark amounts to 4 frequency channels.
Therefore dBSONIC PerceptualXplorer uses a more precise approximation as proposed by [8], with f in Hz:
z1/Bark = 26.81 * f / (1960 + f) - 0.53
for z1 < 2.0:    z = z1*2/2.53 + 1.06/2.53
for z1 > 20.1:  z = z1*1.22 - 4.422
otherwise:        z = z1
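Both transformations are easy to compare numerically. The sketch below implements the two formulas as written above; the function names are hypothetical, and the piecewise formula is read as applying z = z1 between the two correction ranges.

```python
import math

def bark_arctan(f_hz):
    """Critical-band rate in Bark from the arctan approximation given in [1]."""
    f_khz = f_hz / 1000.0
    return 13.0 * math.atan(0.76 * f_khz) + 3.5 * math.atan((f_khz / 7.5) ** 2)

def bark_precise(f_hz):
    """Critical-band rate in Bark from the piecewise approximation attributed to [8]."""
    z1 = 26.81 * f_hz / (1960.0 + f_hz) - 0.53
    if z1 < 2.0:
        return z1 * 2.0 / 2.53 + 1.06 / 2.53   # low-frequency correction
    if z1 > 20.1:
        return z1 * 1.22 - 4.422               # high-frequency correction
    return z1

for f in (0.0, 100.0, 1000.0, 4000.0, 16000.0):
    print(f, round(bark_arctan(f), 2), round(bark_precise(f), 2))

# Number of 0.05 Bark channels spanned by a 0.2 Bark deviation
channels = round(0.2 / 0.05)
```

The low-frequency correction maps 0 Hz exactly to 0 Bark, and both corrections join the uncorrected formula continuously at z1 = 2.0 and z1 = 20.1.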

The analysis bandwidth of the hearing system – the critical bandwidth - as a function of frequency in Hz is evaluated by the following formula:

ΔfG/Hz = 25 + 75 [1 + 1.4 (f/kHz)^2]^0.69
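As a quick check of the formula, the following sketch evaluates the critical bandwidth at a few frequencies. The function name is hypothetical.

```python
def critical_bandwidth_hz(f_hz):
    """Critical bandwidth in Hz as a function of frequency in Hz."""
    f_khz = f_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

for f in (0.0, 500.0, 1000.0, 4000.0):
    print(f, round(critical_bandwidth_hz(f), 1))
```

At low frequencies the bandwidth approaches a constant 100 Hz, and it grows steadily towards higher frequencies, which is why the analysis bandwidth of the ear cannot be matched by the constant bandwidth of a Fourier analysis.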

 

 

 

 
