(Sorry for bumping an old thread, but I thought this was an interesting topic to post on.)

To my knowledge, wavelets in audio coding have shown no particular advantage over subband coding, hence there is less research literature on wavelet audio coding than on image and video coding, where wavelets have indeed found a place. The EZW (embedded zerotree wavelet) and SPIHT (set partitioning in hierarchical trees) algorithms used in image coding represent the state of the art.
Wavelets are a special type of function that can be dilated (stretched or compressed) and shifted. They serve as basis functions, much like sines and cosines in the DFT (discrete Fourier transform) and cosines in the DCT (discrete cosine transform). However, sines and cosines are rather constrained: they aren't flexible enough to track transient changes and frequency content at the same time. I.e., a short analysis window is needed to localize a sharp transient, but that sacrifices frequency resolution and the view of long-term trends, and vice versa. Once the window length is chosen, the compromise between time and frequency resolution is fixed.
Wavelets, on the other hand, allow multiple resolutions via different scalings (dilations and contractions). Short wavelets pinpoint the sharp transients, while long wavelets capture the long-term trends in a signal. Thus multiple resolutions of the signal are shown together, so we can see the forest as well as the trees, so to speak.
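
To make the "forest and trees" idea concrete, here is a minimal sketch in Python. It uses the Haar wavelet, which is my choice because it is the simplest one (nothing above depends on it), and the function names and the test signal are made up for illustration. The signal is a slow ramp with a click added at sample 5. The finest-scale details pinpoint the click, while the coarse approximation keeps only the trend.

```python
import math

def haar_step(x):
    """One level of the orthonormal Haar transform (len(x) must be even).

    Returns (approximation, detail): scaled pairwise sums and differences.
    """
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_multires(x, levels):
    """Repeat haar_step on the approximation to get several resolutions.

    details[0] is the finest scale (shortest wavelets),
    details[-1] is the coarsest scale (longest wavelets).
    """
    approx = list(x)
    details = []
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)
    return approx, details

# Hypothetical test signal: slow ramp plus a click at sample 5.
signal = [0.1 * n for n in range(16)]
signal[5] += 4.0

approx, details = haar_multires(signal, 3)

# The largest finest-scale coefficient marks the pair of samples
# that contains the click (coefficient k covers samples 2k and 2k+1).
finest = details[0]
click_pos = max(range(len(finest)), key=lambda k: abs(finest[k]))
print("click is inside samples", 2 * click_pos, "and", 2 * click_pos + 1)
print("coarse trend:", approx)
```

Each level halves the time resolution of the approximation, so the short wavelets at the first level localize the click to within two samples, and the two remaining approximation values only describe the overall rise of the ramp.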

It turns out that the discrete version of the wavelet decomposition resembles the filter bank algorithm of subband coding (SBC), which is why SBC and wavelets are often used synonymously. The only differences are the background theory (one is a band splitter, the other is a multiresolution basis) and the filter coefficients.
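
As a rough illustration of that resemblance, the sketch below writes one level of a Haar wavelet decomposition as a plain two-band filter bank: filter with a low-pass and a high-pass FIR filter, then keep every second sample. The Haar coefficients and all names here are assumptions I picked for the example. A real subband coder would use longer filters, but the structure is the same.

```python
import math

S = 1.0 / math.sqrt(2.0)
LOWPASS = [S, S]     # Haar low-pass (averaging) filter
HIGHPASS = [-S, S]   # Haar high-pass (differencing) filter

def fir_filter(x, h):
    """Causal FIR filter: y[n] = sum over k of h[k] * x[n - k]."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def analysis_bank(x):
    """Split x into two half-rate bands: filter, then downsample by 2."""
    low = fir_filter(x, LOWPASS)[1::2]
    high = fir_filter(x, HIGHPASS)[1::2]
    return low, high

def synthesis_bank(low, high):
    """Rebuild the input from the two bands (perfect reconstruction)."""
    x = []
    for a, d in zip(low, high):
        x.append((a + d) * S)  # even-indexed sample of the pair
        x.append((a - d) * S)  # odd-indexed sample of the pair
    return x

def wavelet_tree(x, levels):
    """Discrete wavelet decomposition: re-split only the low band."""
    low = list(x)
    bands = []
    for _ in range(levels):
        low, high = analysis_bank(low)
        bands.append(high)
    return low, bands

x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
low, high = analysis_bank(x)
rebuilt = synthesis_bank(low, high)
coarse, bands = wavelet_tree(x, 3)
```

Swapping in different `LOWPASS` and `HIGHPASS` coefficients changes which wavelet (or which subband filter) is being used, but `synthesis_bank` as written only inverts the Haar pair.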