
Stereo ambience effect for mono/stereo sources

(LONG, TECHNICAL, MATHEMATICS of complex numbers)

I have some ideas from an experiment in adding natural-sounding ambience to a mono recording (stereo with both left and right channels identical) or adding new ambience to a stereo recording, with little or no effect on the tonality.

It includes the use of complex numbers (Argand diagrams) to represent stereo waveforms, and how this might allow an alternative, safer method of creating monaural recordings from stereo sound (though this part isn't well developed).

The ambience effect (which probably isn't all new) relies on taking the left channel waveform to be the real part and the right to be the imaginary part. This represents the sample values of both channels at a given moment as a single complex number, which you can treat as cartesian (x + iy) or polar {r * exp(i * theta)}.
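As a small illustration of this representation, here is a sketch in Python, which has complex numbers built in. The helper name to_complex is only a label chosen for this example.

```python
import cmath

def to_complex(left, right):
    """Pack one stereo sample pair: left = real part, right = imaginary part."""
    return complex(left, right)

z = to_complex(0.5, 0.5)        # a centred sample (same value on both channels)
r, theta = cmath.polar(z)       # polar form: z = r * exp(i * theta)
back = cmath.rect(r, theta)     # and back to cartesian (x + iy)
# For a centred sample theta is pi/4 (45 degrees); a left-only sample has theta = 0.
```

Here the angle theta encodes where the sample sits between the two channels, and r is its overall size.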

Echoes sound like ambient reflections of the direct sound if the delay is less than about 30-40 ms (IIRC, this is known as the fusion zone). So you can delay a sound by about this much before a listener hears it as a separate sound located in a different stereo position, rather than as ambient reflection around the original stereo location, which gives cues to depth, room size and distance from the sound source. (If you go to long delays such as 100 ms, it sounds like a public address system at an outdoor event with multiple speakers around a stadium or along a race course/track.) Another aspect of ambient reflections is the reflectance of the walls or other objects that scatter the delayed sound to your ears. Furthermore, the delays to the left and right ears differ, and there are phase shifts too.

For 44.1 kHz, 22.05 and 11.025 kHz, MP3 format uses approx. 26.122 ms frames (= 2 granules = 1152 samples at 44.1 kHz), so 26.122 ms seems like a good delay for each successive pass of ambient reflections: it remains within the fusion zone, so the brain will associate it with the original direct sound, and it also places each reflected component within a different MP3 frame for standard sample rates.

You could also choose 1152 samples at 48 kHz = 24.000 ms, or 1152 samples at 32 kHz = 36 ms, though the latter might be too close to the edge of the fusion zone for safety, in case anybody perceives the fusion zone slightly differently. In that case you might choose 18 ms (one granule at 32 kHz) instead, or forget about matching MP3 granule boundaries entirely.
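The delay figures are just samples divided by sample rate. A quick sketch of the arithmetic (the function and variable names are made up for this example):

```python
def delay_ms(samples, sample_rate_hz):
    """Duration of `samples` samples at the given rate, in milliseconds."""
    return 1000.0 * samples / sample_rate_hz

FRAME = 1152  # samples in one frame of two 576-sample granules
delays = {rate: round(delay_ms(FRAME, rate), 3) for rate in (44100, 48000, 32000)}
# delays == {44100: 26.122, 48000: 24.0, 32000: 36.0}
one_granule_32k = delay_ms(576, 32000)   # 18.0 ms
```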

I believe MPC and Ogg Vorbis have much more flexibility over frame size or granule size, so there's no possible advantage to choosing any particular delay other than what sounds good, so I decided that MP3 frames at 44.1 kHz rates seemed like a good choice for this example. Perceptually transparent encoding should still sound the same, but there might be a saving in bits by sticking to frames or granules.

Anyhow, for each delayed signal, the idea is to multiply the complex number representing our stereo source signal (or upmixed mono source signal) by a complex factor, k, whose magnitude is a scaling factor (representing the reflectance of the surface) and whose angle, theta, represents a rotation in the polar notation. For the first pass, the multiplier is simply k, then on the second pass it's k^2, then k^3 and so on (or you could take the first pass signal, then delay it and multiply it by k for the second pass, and pass that on to the third pass, etc.).

You then add all the delayed, transformed signals to the undelayed signal, and you might also scale the resulting values to avoid any chance of clipping (and dither when finally converting back to 16-bit, of course). This adds a certain amount of ambience to the original.
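The whole procedure can be sketched in a few lines of Python. This is only a minimal illustration under some assumptions: the signal is a list of complex samples (left = real, right = imaginary), the delay is a whole number of samples, and the name add_ambience is invented here. Scaling against clipping and the final dither are left out.

```python
def add_ambience(stereo, k, delay, passes):
    """Return the dry signal plus `passes` delayed reflections.

    Pass n is delayed by n * delay samples and multiplied by k ** n.
    """
    out = list(stereo) + [0j] * (delay * passes)   # extra room for the tail
    for n in range(1, passes + 1):
        gain = k ** n               # k, k^2, k^3, ...
        offset = n * delay
        for i, z in enumerate(stereo):
            out[offset + i] += gain * z
    return out

# A left-channel impulse, 50% reflection at 90 degrees, toy delay of 3 samples:
impulse = [1 + 0j, 0j]
wet = add_ambience(impulse, 0.5j, 3, 2)
# wet[0] is the direct sound, wet[3] is 0.5j (on the right), wet[6] is -0.25 (left, inverted)
```

For real material the delay would be something like 1152 samples rather than 3.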

So you have essentially three factors to control to modify the effect:

1. The delay between passes
This is related to the size of the room and the difference in audible path length. An MP3 frame or granule size within the fusion zone (< 30-40 ms) seems to work pretty well for this, so in practice it's likely to be fixed.

2. The reflection coefficient (magnitude) of k
This determines how echoey and reverberant the room sounds. I believe a low value is referred to as "dry" and a high value is referred to as "wet".

3. The complex angular shift per pass (theta in polar notation) of k
This determines how the stereo effect moves around in the complex stereo space.

A special case, which simplifies the mathematics, is when the angular change per pass, theta, is 90° or -90°.

To demonstrate the effect graphically, if k = 0.5 exp(i.{pi/2}), which means we have 50% reflection (-6.02 dB) per pass and a 90° (= pi/2) value for theta per pass, see the X marks on the following diagram.



The red curve moving in toward the centre shows the progression of an impulse (value 1) on the left channel as it passes through the angular change and attenuation. The impulse on the right channel would have value (0 + i) and is shown by the green curve.

The special case for theta = 90° makes the mathematics simpler, as follows:

• On the red line, the red X point at (1.0 + 0.0i) is the unchanged direct sound on the left channel. 0 ms (0 sample) delay.
• Point (0 + 0.5i) is after the first pass (switched to the right channel and halved in amplitude). 26.122 ms (1152 sample) delay for 44.1 kHz sampling rate.
• Point (-0.25 + 0.0i) is after the second pass (switched back from right to left, halved in amplitude again and inverted). 52.245 ms (2304 sample) delay.
• Point (0 - 0.125i) is after the third pass (switched back from left to right and again halved in amplitude). 78.367 ms (3456 sample) delay.
• Point (0.0625 + 0.0i) is after the fourth pass (back on left channel, halved and inverted again, so it's now a sixteenth of the original signal and added to the original channel). 104.49 ms (4608 sample) delay.
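The X-mark positions and their delays can be recomputed directly as powers of k. A short sketch (the variable names are chosen for this example only):

```python
import cmath

k = 0.5 * cmath.exp(1j * cmath.pi / 2)    # 50% reflection, 90 degrees per pass
delay_samples, rate = 1152, 44100

points = []
for n in range(5):
    z = k ** n                             # left-channel impulse after n passes
    ms = 1000.0 * n * delay_samples / rate
    points.append((n, z, ms))
# Up to floating-point rounding, z runs through 1, 0.5i, -0.25, -0.125i, 0.0625
```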

The red and green curves were generated by spreadsheet in 5° steps, with -0.33 dB attenuation per step over numerous steps, to match the curve on which the X marks lie (18 steps is 90° and -6.02 dB, which is one step of the X marks).

With 50% (-6 dB) reaching the ears from the first reflection, this graph would represent a very echoey sound, like a tiled shower cubicle or a tunnel, but it illustrates the effect graphically rather well.

25% reflection per pass (-12 dB) is quite a full ambient sound (possibly a little too wet) and 12.5% (-18 dB) adds presence but is fairly subtle. There's no reason to stick to simple divisions either, and something around 18% (-15 dB) might be a good starting point too.
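For converting between a reflection percentage and decibels, the usual amplitude formula is 20 * log10(ratio). A small sketch, with helper names made up for this example:

```python
import math

def ratio_to_db(fraction):
    """Amplitude ratio (e.g. 0.125 for 12.5%) to decibels."""
    return 20.0 * math.log10(fraction)

def db_to_ratio(db):
    """Decibels back to an amplitude ratio."""
    return 10.0 ** (db / 20.0)

levels = {f: round(ratio_to_db(f), 2) for f in (0.5, 0.25, 0.125, 0.18)}
# levels == {0.5: -6.02, 0.25: -12.04, 0.125: -18.06, 0.18: -14.89}
fifteen_db = db_to_ratio(-15.0)    # about 0.178, i.e. roughly 18%
```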

An added advantage of a +/-90° phase shift (k is an imaginary number with no real component) is that, for a mono source (or any centred content, where left and right are identical), the first pass of multiplying by k fully cancels when downmixing to mono, as do the third pass and other odd-numbered passes. (For content that differs between the channels, what survives the downmix of an odd pass is the delayed, scaled difference between the channels.) The second pass would downmix to a low-depth, ultrafine comb filter (thanks to the long delay), so would have practically no effect on the timbre of the sound (in fact I suspect it exactly matches the periodicity of points on an FFT, so has no effect on timbre when used on a sampled signal in the digital domain).
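The mono-downmix behaviour can be checked on single sample values. This sketch assumes the simplest downmix, left + right, and a 12.5% reflection at +90°. It also shows that the exact cancellation applies to centred (identical left/right) content, while a hard-left sample leaves a residue.

```python
def mono(z):
    """Simple downmix: left + right (real part + imaginary part)."""
    return z.real + z.imag

k = 0.125j                        # 12.5% reflection, +90 degrees
centre = complex(1.0, 1.0)        # mono source: same value on both channels
left_only = complex(1.0, 0.0)     # hard-left source

first_centre = mono(k * centre)        # 0.0: the first pass cancels
first_left = mono(k * left_only)       # 0.125: no cancellation for one-sided content
second_centre = mono(k * k * centre)   # -0.03125: inverted, 1/64 of the dry downmix (2.0)
```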

It's also very simple to apply, since the 90° phase shift only involves cross-coupling the channels and inverting one of them on each pass, rather than taking sine and cosine components of each (since cos 90° = 0, sin 90° = 1). Whatever was in the left channel from the previous pass goes to the right, delayed and scaled down in amplitude. Whatever was in the right channel from the previous pass goes to the left, delayed, scaled in amplitude and inverted in polarity. Many audio editors can do this sort of effect very simply. It's even possible to represent it as a fairly simple FIR filter (with only a few discrete points that are non-zero), assuming the channel-crossing can be represented in the filter (I don't know of any DSP plugins that can do this sort of cross-channel FIR for Winamp or CoolEdit, but I'd love to know of any).
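Without any complex arithmetic, one +90° pass on separate left/right sample lists might look like the following sketch. The function name cross_pass and the list-based layout are assumptions made for this example.

```python
def cross_pass(left, right, gain, delay):
    """One +90 degree pass: swap channels, scale, delay, invert the new left."""
    pad = [0.0] * delay
    new_left = pad + [-gain * r for r in right]   # old right -> left, inverted
    new_right = pad + [gain * l for l in left]    # old left -> right
    return new_left, new_right

# Same result as multiplying (left + i * right) by gain * i:
l1, r1 = cross_pass([1.0, 0.25], [0.5, -1.0], 0.5, 2)
# l1 == [0.0, 0.0, -0.25, 0.5] and r1 == [0.0, 0.0, 0.5, 0.125]
```

Feeding the output of one call into the next gives the second pass, and so on.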

The 90 degree version is so simple to implement (a bit of cross-mixing and delaying) it may be in a load of stereo effect plugins for Winamp already! I'm not very well read in these things, but I don't know if the complex representation has been used elsewhere before.

It's not worth adding passes forever as they get progressively quieter. With four passes of -18 dB (12.5%), the delayed signal is 72 dB quieter, so practically inaudible, and rapidly approaching the noise floor. In fact, a single pass or just two or three can still add plenty of ambience to the sound.
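To estimate how many passes are worth computing for a given reflection level, one can divide a chosen level floor by the attenuation per pass. A sketch follows; the floor of -72 dB is just the figure from the example above, not a recommendation, and the function name is invented.

```python
import math

def passes_until_below(fraction, floor_db):
    """Smallest number of passes whose level is at or below floor_db."""
    per_pass_db = 20.0 * math.log10(fraction)   # negative for fraction < 1
    return math.ceil(floor_db / per_pass_db)

n_dry = passes_until_below(0.125, -72.0)   # 4 passes at 12.5%
n_wet = passes_until_below(0.5, -72.0)     # 12 passes at 50%
```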

It's also not essential to keep the delays constant from one pass to the next, nor to keep k constant. One might change the magnitude or angle of k from one pass to another and still achieve a natural stereo effect, or perhaps even a more natural effect than with a fixed value.

I tried out most of this technique using 90° per pass and varying the reflected proportion per pass (magnitude of k) just because it's so simple.

The simplest method I found was to use CoolEdit 96, saving the original sound as a stereo format file (and adding a little silence at the end or fading it out), then using Delay to create a delayed copy and using Channel Mixer to swap left and right, scale to 12.5%, inverting only the New Left Channel. Then I'd save that as pass 1 and use it as the input for the second pass. After perhaps four passes, I'd stop. Then I'd take the original sound and add in the first pass using mix paste (from file). I'd then listen in stereo and mono (using headphones with a mono switch to hear the effect come and go), then I'd add the remaining passes and listen between, before saving the ambient-enhanced file.

Obviously, this does add more dither noise per pass, but with four passes it should remain practically inaudible at 16-bits. It would be better to do all the passes at high resolution then dither back to 16-bits only at the end.
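One way to keep the dither to a single stage is to run all passes in floating point and quantise only once at the end. The following is a sketch of that last step, assuming samples scaled to the range -1.0 to 1.0 and triangular (TPDF) dither of up to 1 LSB either side. The function name and the fixed seed are choices made for this example.

```python
import random

def to_int16_with_dither(samples, seed=0):
    """Quantise floats in [-1.0, 1.0] to 16-bit integers with TPDF dither."""
    rng = random.Random(seed)      # seeded only so the sketch is repeatable
    out = []
    for x in samples:
        dither = rng.random() - rng.random()     # triangular PDF within (-1, 1) LSB
        v = int(round(x * 32767.0 + dither))
        out.append(max(-32768, min(32767, v)))   # clip to the 16-bit range
    return out

pcm = to_int16_with_dither([0.0, 0.5, 1.0, -1.0])
```

Each channel would be converted this way after the final mix, so the rounding noise is added once rather than once per pass.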

This works for stereo/binaural and monaural sources of sound. The ambience allows the brain to focus binaurally on the initial location of the sound, and may make it easier to discern (it certainly seems like it). If you have binaural recordings already (like Jim Bamford's) then it's not going to make them more real, but it's not likely to destroy the position information about the initial sound either, because the delay is so long.

The effect adds a kind of airy quality, with what seems like an analogue graininess and presence. It seems almost like there's an added rush of air and there might be a high-frequency hiss being introduced.

Yet this airy sound quality, which could so easily be assumed to come from a broad frequency response, is present even within the 5.5 kHz limit imposed by the 11.025 kHz sample rate of a monaural speech recording I applied this effect to (while at 11.025 kHz). I wanted to ensure no aliasing on my soundcard could be causing this effect, and indeed the effect was still present after upsampling to 44.1 kHz. I made certain to bandwidth-limit (post-filter) when upsampling to 44.1 kHz in CoolEdit. It may be that phase differences even at levels below the sample period are being reproduced adequately after D-to-A conversion, and that an impression of small interaural delays is involved in this airy feeling (but that's just idle speculation on my part).

It does actually seem to lend a feeling of quality / precision / clarity to low sampling rate and monaural material and poor recordings, almost fooling me into believing they are better recorded. Using headphones, at any rate, it seems perceptually slightly easier to pick sounds out, somehow. The speech I refer to was recorded from an audience microphone onto audio cassette after passing through a very boomy, reverberant PA system.

The effect seems to be preserved by high quality (perceptually transparent) psychoacoustic encoding, like MusePack (.MPC) standard, lame --alt-preset standard (.MP3) or Ogg Vorbis -q 5 (.OGG), as you'd expect, given that they are perceptually transparent and the delay is of the order of the frame size. It's also rather well preserved by much lower settings, like Ogg Vorbis (-q 0), which can use unsafe stereo coupling when encoding.

If I prepare any samples for you to listen to, I'll post links here, but I think there's enough info for you to try it yourself if you have a suitable audio editor.

You might want to try it to add a little life to dead studio recordings where stereo is placed by the balance knob, or to restore a little spaciousness to audio cassettes that have suffered cross-talk, narrowing the stereo field.

Just don't overdo it and apply it to everything! Too much of anything can become bland and unappealing.

Regards,

Dick Darlington

P.S. I'm wondering if the complex notation would allow a way of converting stereo to mono without the risk of comb-filtering or cancellation caused by simple downmixing. (Most music is mastered to ensure it's fairly mono-compatible, however, so it's not always a problem.) I suggested in a thread last month that this might be done using the power spectrum or real part of the Fourier Transform, throwing away the phase within each "frame" or "granule" of music. Perhaps the magnitude of the complex notation would offer another way to tackle it. However, in the notation used here, a wave about zero on the left channel oscillates between the positive and negative real axis, so the angle component, theta, would flip between 0 and 180° during the cycle. Trying to take the magnitude alone would lead to a rectified wave, introducing high frequency harmonic components that weren't in the original signal. Perhaps another transformation of the problem space could enable a shortcut that's better than downmixing. This is getting into higher mathematics that is beyond me.
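The rectification problem is easy to see numerically. In this sketch a left-only sine wave (8 samples per cycle, an arbitrary choice for the example) is packed into complex samples and only the magnitude is kept:

```python
import math

n = 8                                               # samples per cycle
left = [math.sin(2 * math.pi * i / n) for i in range(n)]
magnitude = [abs(complex(x, 0.0)) for x in left]    # right channel silent

# The second half-cycle of `left` is the negative of the first,
# but in `magnitude` it is a repeat of the first: the wave has been rectified.
half = n // 2
```

The magnitude therefore repeats twice per original cycle, which is where the added harmonic content comes from.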