Can audio encoders target quality w/o caring about bit rate/file size?

The x264 video encoder has an encoding mode called Constant Rate Factor. In this mode a number (16, 17, etc.) defines the desired quality (lower means better quality and a higher bitrate), and the encoder does not care about bitrate, only about keeping the rate factor constant. Why has nobody invented something similar for audio encoding (except lossyWAV, which needs too much bitrate for acceptable quality)?

I think every encoder with real vbr (not abr) does that? Lame has V(0-9), QT AAC has --tvbr (0-127), Vorbis has -q((-2)-10). The bitrate may vary a lot with these settings between different songs/genres.
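
The quality scales named above map onto command-line flags roughly as follows. This is a minimal sketch that only builds argument lists and runs nothing; `vbr_command` is a hypothetical helper, and `qaac` is assumed here as the command-line front end for the encoder the post calls "QT AAC".

```python
def vbr_command(encoder, quality, src, dst):
    """Build an argv list for a quality-targeted (true VBR) encode.

    Hypothetical helper for illustration; the quality scales differ per
    encoder (LAME -V 0..9, oggenc -q -2..10, qaac --tvbr 0..127).
    """
    q = str(quality)
    templates = {
        "lame": ["lame", "-V", q, src, dst],
        "oggenc": ["oggenc", "-q", q, "-o", dst, src],
        "qaac": ["qaac", "--tvbr", q, "-o", dst, src],
    }
    if encoder not in templates:
        raise ValueError(f"unknown encoder: {encoder}")
    return templates[encoder]


if __name__ == "__main__":
    for name, quality, ext in [("lame", 2, "mp3"), ("oggenc", 5, "ogg"), ("qaac", 91, "m4a")]:
        print(" ".join(vbr_command(name, quality, "in.wav", f"out.{ext}")))
```

None of these commands takes a bitrate; the number only selects a point on the encoder's own quality scale.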

Reply #1
Opus has that too. It just calculates a quality factor from the given target bitrate, based on statistics. The resulting bitrate may therefore vary depending on the audio source; in true VBR mode, Opus will not try to approximate the given target bitrate.

As far as I read from earlier posts.

Reply #2
I think every encoder with real vbr (not abr) does that? Lame has V(0-9), QT AAC has --tvbr (0-127), Vorbis has -q((-2)-10). The bitrate may vary a lot with these settings between different songs/genres.

Opus has that too. It just calculates a quality factor from the given target bitrate, based on statistics. The resulting bitrate may therefore vary depending on the audio source; in true VBR mode, Opus will not try to approximate the given target bitrate.

Well, if you mix an audiobook and complex electronic music in one file, then which bitrate will you use for this file? Opus at 64 kbps will give good quality for the part that contains the audiobook, but the quality of the musical part will be very low. And 176 kbps will give good quality for the music, but that bitrate will be excessive for the audiobook. I would like to have an encoder that takes "good quality" from me as an input option and gives ~64 kbps for the audiobook part of the file and ~176 kbps for the musical part. None of the modern audio encoders can do this.

Reply #3
Quote
Opus 1.1 Alpha has some bugs[…] [pictures of spectrograms]
Right. But how does it sound? Not that I expect transparency at 32 kbps, but visual images are of scant relevance to audio.

Why has nobody invented something similar for audio encoding (except lossyWAV, which needs too much bitrate for acceptable quality)?
Um, what. Plenty of people have done this, for decades.

I would like to have an encoder that takes "good quality" from me as an input option and gives ~64 kbps for the audiobook part of the file and ~176 kbps for the musical part. None of the modern audio encoders can do this.
I think you must be doing something wrong. Any competent modern encoder should do exactly this when configured to use its VBR mode.

Reply #4
I would like to have an encoder that takes "good quality" from me as an input option and gives ~64 kbps for the audiobook part of the file and ~176 kbps for the musical part. None of the modern audio encoders can do this.
I think you must be doing something wrong. Any competent modern encoder should do exactly this when configured to use its VBR mode.


I guess that rules out Opus babyeater (25 and 64 kb/s), LAME (V5), and Vorbis aoTuV (q1). I just encoded a set of 3 tracks: one voice introduction and 2 music. In all cases the plain speech encoded at the highest bitrate.


Reply #5
Hmmm, maybe I was assuming wrongly… in which case I apologise, but it wasn’t an illogical assumption, I don’t think?

Does that affect only lower-quality settings?

As I don’t want to imply that these encoders aren’t competent, I have to presume there’s a reason for this, and I’d be interested to know what it is. Maybe I’m overlooking something really obvious.

Reply #6
VBR gives you constant quality. 

Having a file with two different quality levels is not what VBR is meant to do. I think to have that work you'd have to have some kind of filter or processing that attempted to classify the signal as speech or music and then adjusted the encoder's parameters from frame to frame. It's probably not too hard to do, but it's also a very strange thing to want to implement, so maybe no one has done so.

Reply #7
Does that affect only lower-quality settings?

As I don’t want to imply that these encoders aren’t competent, I have to presume there’s a reason for this, and I’d be interested to know what it is. Maybe I’m overlooking something really obvious.


I tried again with opus bitrate=170.  This time the speech track was between the 2 music tracks at 191.  I can't say for sure why, except maybe we judge speech on how clear it is rather than whether it is ABX'able.

Reply #8
Right, I see now that DonP didn’t encode a single track with three parts, but three separate tracks. That’s not what softrunner was asking about, then. But DonP’s results and possible explanation still have a good degree of relevance.

My previous posts were written under the assumption that we were talking about large regions of broadly differing complexity/amplitude/whatever in the same file encoded at a single quality, such as speech and music. Barring some other aspect of speech that makes the encoder think it’s similarly complex to music, I would have expected a sizeable difference in bitrate between the two sections.

VBR gives you constant quality. 

Having a file with two different quality levels is not what VBR is meant to do. I think to have that work you'd have to have some kind of filter or processing that attempted to classify the signal as speech or music and then adjusted the encoder's parameters from frame to frame. It's probably not too hard to do, but it's also a very strange thing to want to implement, so maybe no one has done so.
As above, what I think softrunner was asking about, and certainly what I was talking about, was a single file with two different parts and the possibility for an encoder to provide significantly differing bitrates if the two parts differ in complexity. Again, perhaps my presumption that there should be a big difference was incorrect. I have no experience with the specific scenario and vanishingly small experience with encoded speech.

Reply #9
Right, I see now that DonP didn’t encode a single track with three parts, but three separate tracks. That’s not what softrunner was asking about, then. But DonP’s results and possible explanation still have a good degree of relevance.


I went with speech and music in separate tracks (from the same CD) because (1) it would be easier to find the rate for each part, and (2) I figured the bit rate is only based on a small window around the current instant (correct?) so it wouldn't matter that the other type of content is in a different file as long as the quality parameter is unchanged.

Yes, the poser of the question had mixed content in one file.
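
For anyone repeating this kind of per-track comparison, the average bitrate of an encoded file can be estimated from its size and duration. A minimal sketch; `average_bitrate_kbps` is a hypothetical helper, and the figure it returns includes container and tag overhead, so it slightly overstates the audio bitrate for short files.

```python
def average_bitrate_kbps(size_bytes, duration_s):
    """Average bitrate in kbps (1 kbps = 1000 bits/s) from file size and length."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return size_bytes * 8 / duration_s / 1000


if __name__ == "__main__":
    # A 1,500,000-byte file lasting 60 seconds averages 200 kbps.
    print(average_bitrate_kbps(1_500_000, 60))
```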



Reply #10
Mods, could we get a threadsplit for these quality level posts? I really think this deserves its own thread.

If you have a mixed-content file, then for an encoder to do a good job of targeting a bitrate for the whole file while providing "constant quality" it would have to do a two-pass type deal. It has no other way of knowing, when you ask for 64kbps, if e.g. half is effectively-mono speech (both channels identical) and half is stereo music and thus it can target 32kbps for the speech, 96kbps for the music, and give you basically the bitrate you asked for.

For almost all purposes, it'd be better to let the user specify the quality level. With a single user-specified quality level, a file that was all speech could come out as ~32kbps, a file that was all stereo music could come out as ~96kbps, and the mixed-content file above could come out as ~64kbps, with constant quality and without having to do two passes.
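
The arithmetic behind those figures is a duration-weighted average. A minimal sketch, assuming the mixed file is half speech and half music by duration (the segment lengths below are illustrative, not from any real file):

```python
def mixed_average_kbps(segments):
    """Duration-weighted average bitrate.

    segments: list of (bitrate_kbps, duration_seconds) pairs.
    """
    total_kbit = sum(kbps * seconds for kbps, seconds in segments)
    total_seconds = sum(seconds for _, seconds in segments)
    if total_seconds <= 0:
        raise ValueError("total duration must be positive")
    return total_kbit / total_seconds


if __name__ == "__main__":
    # 10 minutes of speech at ~32 kbps followed by 10 minutes of music at ~96 kbps.
    print(mixed_average_kbps([(32, 600), (96, 600)]))  # 64.0
```

If the split were not 50/50 the file average would move accordingly, which is why a single-pass encoder cannot hit a file-wide bitrate target and hold quality constant at the same time.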

The "VBR bitrate setting=quality level" idea we've heard so many times says an ideal VBR encoder is supposed to encode things at a constant quality level which averages out to the target bitrate across some generic ideal reference collection. But it really makes no sense to try to say how much of an ideal reference collection is mono speech. In the opus-tools suggestion thread, NullC mentioned the "bitrate setting is for fullband stereo equivalent quality" idea i.e. considering the ideal reference collection to consist entirely of FB stereo music. As he said there, the downside is that someone encoding just mono speech ends up with their files encoded at ~1/3 of their target rate. If you shift the balance of the ideal collection you ameliorate that but give those encoding music some of the opposite problem. Multichannel users have to guess at how their bitrate translates to a stereo equivalent bitrate too.

The user specified target bitrate thus becomes sufficiently unhinged from the end result's bitrate and quality that it would no longer make sense to tell people it's a target bitrate; instead you just call it a quality mode and provide some kind of table of what range of result bitrates to expect given channel count, bandwidth, and speech vs. music.

Even if your content is not mixed but is in separate files, having such a quality setting would enable people to encode mixed collections of files- whether just tracks of the same CD (changing quality settings for different tracks when ripping=ugh) or their entire audio collection- with a single setting without worrying that they're either bloating the speech files or starving the music.

VBR gives you constant quality. Having a file with two different quality levels is not what VBR is meant to do.
But if you have a file with 64kbps effectively-mono speech and 64kbps stereo music, those are two vastly different quality levels. Being adaptive and constant-quality rather than constant-bitrate most definitely is, as you admit, what VBR is meant to do.
I think to have that work you'd have to have some kind of filter or processing that attempted to classify the signal as speech or music and then adjusted the encoder's parameters from frame to frame. It's probably not too hard to do, but it's also a very strange thing to want to implement, so maybe no one has done so.
The Opus encoder already tries, from frame to frame, to classify the audio as speech or music, determine its bandwidth, and determine channel separation. Wanting this analysis to show up in giving lower bitrates for speech and higher bitrates for music is not very strange or even slightly strange.

Reply #11
If you have a mixed-content file, then for an encoder to do a good job of targeting a bitrate for the whole file while providing "constant quality" it would have to do a two-pass type deal. It has no other way of knowing, when you ask for 64kbps, if e.g. half is effectively-mono speech (both channels identical) and half is stereo music and thus it can target 32kbps for the speech, 96kbps for the music, and give you basically the bitrate you asked for.

The "VBR bitrate setting=quality level" idea we've heard so many times says an ideal VBR encoder is supposed to encode things at a constant quality level which averages out to the target bitrate across some generic ideal reference collection

At least for me, I was specifically not referring to any quality setting that targets a bitrate: I was thinking about settings that target a given quality (level of noise, whatever) without any obligation to meet a particular bitrate and therefore might be predicted to allocate different bitrates based upon material.

If I fed separate speech and loud music files to a single encoder using ‘pure’ VBR and no target bitrate, I would expect the music to come out at a higher bitrate. Extending that reasoning, I would assume to expect the same thing if the two were placed adjacently in one file. I don’t see how two passes would be necessary if no bitrate is being targeted.

Sure, it’s nice to have some vague idea of what sort of bitrate to expect from a given setting, but IMHO, a proper VBR setting should just work with psychoacoustics and not worry about bitrate.

I think this may be exactly the point you’re making, so I don’t mean to sound like I’m repeating or contradicting you; this is just to clarify what I was trying to convey in my posts.

Reply #12
I think this may be exactly the point you’re making
*nod*

I edited my post, adding a paragraph between the two parts you quoted, to try to make that a little more clear, but your reply came faster than my edit

Reply #13
I think to have that work you'd have to have some kind of filter or processing that attempted to classify the signal as speech or music and then adjusted the encoder's parameters from frame to frame. It's probably not too hard to do, but it's also a very strange thing to want to implement, so maybe no one has done so.

Yes, something like this, and it is the most desirable thing to have, because it frees the user from checking every separate file (or set of files) in an effort to decide which bitrate to use. As with the x264 video encoder: the user says "I want rate factor 21.00" and gets it, the same quality for every input file, regardless of the content: just a voice, simple piano music, rock/pop music, electronic music, or something else.

Reply #14
Look, I don’t know whether you’ve actually read any of the other posts in this thread. Existing codecs with true VBR modes already target quality without caring about meeting a specific bitrate. You tell the encoder to use a level of quality that is defined by a number on a sliding scale, and then it can vary the bitrate as much as it wants based upon the signal that you supply to it. Have you done any real testing of this?

DonP had some samples of speech that did not encode to a much lower bitrate than samples of music, but this in no way proves that quality-based algorithms are unique to H.264. Please, try some more tests. Read some documentation on encoders such as LAME, oggenc, VBR AAC, etc. Then see whether you can continue to claim that no audio encoder offers a mode that targets only quality without worrying about bitrate.
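
The difference between a quality target and a bitrate target can be shown with a toy model. This is only a sketch under a strong assumption: that each frame has a known bitrate "demand" needed to reach the chosen quality. Real encoders derive that from a psychoacoustic model, and the numbers below are hypothetical.

```python
def constant_quality(demand_kbps):
    """Toy true-VBR: every frame gets exactly the bitrate it needs."""
    return list(demand_kbps)


def constant_bitrate(demand_kbps, budget_kbps):
    """Toy CBR: every frame gets the same bitrate regardless of content."""
    return [budget_kbps] * len(demand_kbps)


def starved_frames(demand_kbps, allocation_kbps):
    """Count frames that received less than they needed for the target quality."""
    return sum(1 for d, a in zip(demand_kbps, allocation_kbps) if a < d)


if __name__ == "__main__":
    demand = [40, 40, 180, 180]  # hypothetical: two easy frames, two hard frames
    vbr = constant_quality(demand)
    cbr = constant_bitrate(demand, sum(demand) / len(demand))
    print(sum(vbr) / len(vbr), starved_frames(demand, vbr))
    print(sum(cbr) / len(cbr), starved_frames(demand, cbr))
```

Both allocations average the same bitrate in this toy case, but only the quality-targeted one gives every frame what it needs; the fixed-rate one overspends on the easy frames and underspends on the hard ones.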

Reply #15
Just to add my two pence to a question already worked out by other respectable fellows: my music collection, about 15000 tracks encoded with QT AAC at a target quality of 110, ranges from 74kbps (Edwin Fischer's Bach WTC, mono piano from the early thirties) to 314kbps (Henk Van Twiller's transcription for solo saxophone of Bach's cello sonatas; BTW this is quite surprising to me!). The average bitrate is around 256kbps (reached exactly by about 300 tracks). About 5000 tracks are < 240kbps, and about 2500 are > 270kbps.
Of course we are speaking of VBR, so those values for a single track are still average bitrate, as shown in iTunes column browser.

This poor man's statistical analysis demonstrates, first, that when targeting quality the encoder couldn't care less about bitrate; second, that to reach quality 110 the Apple AAC encoder uses on average about 256kbps; and third, that if I had chosen to target bitrate instead of quality, say 256kbps CBR instead of quality 110 (which some people consider practically equivalent), it would have been overkill (sometimes by a lot!) for the former tracks and inadequate for the latter.
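
A summary like this is easy to reproduce for any collection once the per-track average bitrates are in a list. A minimal sketch; `summarize` is a hypothetical helper, and the sample list is made up for illustration (only the 74 and 314 extremes echo the figures above).

```python
from statistics import mean


def summarize(bitrates_kbps, low, high):
    """Basic spread statistics for per-track average bitrates."""
    return {
        "min": min(bitrates_kbps),
        "max": max(bitrates_kbps),
        "mean": mean(bitrates_kbps),
        "below_low": sum(1 for b in bitrates_kbps if b < low),
        "above_high": sum(1 for b in bitrates_kbps if b > high),
    }


if __name__ == "__main__":
    sample = [74, 180, 230, 256, 262, 280, 314]  # illustrative values only
    print(summarize(sample, low=240, high=270))
```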

On second thought: if I'm not wrong, iTunes store sells everything at 256 CBR. Maybe this is "transparent enough" for everything, but defeats the proven efficiency of their own very good encoder.

On third thought: one of these days I'm going to try to encode those two extremes at 256 CBR and try to ABX them from both the q110 and the lossless ones (though I'm rather sure I will fail all of the times... ).
... I live by long distance.

Reply #16
Thanks for providing data!

if I'm not wrong, iTunes store sells everything at 256 CBR. Maybe this is "transparent enough" for everything, but defeats the proven efficiency of their own very good encoder.
A previous analysis of files from the iTunes Store suggests that iTunes Plus is ‘constrained VBR’, which I interpret to mean ABR:
iTunes' standard setting is identical to Quicktime's ABR setting at medium encoding quality.
iTunes' VBR setting is identical to Quicktime's VBR constrained setting at medium encoding quality.
iTunes Plus is identical to Quicktime's VBR constrained 256kbit/s setting at maximum encoding quality.

Reply #17
A previous analysis of files from the iTunes Store suggests that iTunes Plus is ‘constrained VBR’, which I interpret to mean ABR:

Actually "ABR" and "Constrained VBR" are two separate modes in Apple AAC. I don't know why there are two constrained VBR modes but the ABR mode is more constrained (closer to CBR) than the CVBR mode.

Reply #18
I think to have that work you'd have to have some kind of filter or processing that attempted to classify the signal as speech or music and then adjusted the encoder's parameters from frame to frame. It's probably not too hard to do, but it's also a very strange thing to want to implement, so maybe no one has done so.

Yes, something like this, and it is the most desirable thing to have, because it frees the user from checking every separate file (or set of files) in an effort to decide which bitrate to use. As with the x264 video encoder: the user says "I want rate factor 21.00" and gets it, the same quality for every input file, regardless of the content: just a voice, simple piano music, rock/pop music, electronic music, or something else.


Did you read any of what you just agreed with? I don't think so. What you are asking for is VBR. Every modern codec has it.

That's not what I was describing though.

 

Reply #19
In repeatedly saying that every codec does what he's talking about and deriding him for asking about it, you seem to be misunderstanding or misconstruing what he's saying and you seem uncivil. English isn't his first language (Russian is, apparently), there have been some communication failures, and he's got some misconceptions that aren't central to what he's saying. Rather than pouncing on him for his misconceptions or his mode of expression and ignoring his main point, we should either have the patience and charity to try to understand and respond to the main thrust of what he's saying, or we should leave it alone.

Yes, VBR encoders generally attempt to do constant quality, they vary their bitrate substantially, especially at higher quality targets, and we have no particular reason to believe that their rate allocation for e.g. different genres of music deviates dramatically from the constant-quality ideal.

But speech is considerably easier to code than music, especially for a codec like Opus which has LP capabilities, and if you can show me a single encoder that dramatically scales back its bitrate when presented with speech, especially in a mixed-content file, I will be quite surprised. Just as an example, I ran some mono samples (music, speech, and mixed content) through LAME -V6, oggenc -q 1, and opusenc --bitrate 60. In all cases the speech content was given a bitrate within 1kbps of the average music bitrate. In that respect, these encoders' rate allocation is more like what one would expect of ABR than of ideal constant-quality VBR. I'm fairly certain that Vorbis and LAME can both achieve speech quality equivalent to their 60kbps music quality at below 48kbps and that opusenc can do considerably better still.

As NullC said in the opus-tools thread, dramatic bitrate changes between speech and music is an idea worth trying. He warns that the speech / music classification in opus isn't yet accurate enough to avoid some audible problems*. But it certainly wouldn't have to be perfect to improve on what we have now, esp. for content that's mostly speech but has occasional music and sound effects.

*There's been some indication that in the future there may be significant relatively-low-hanging fruit in improving non-realtime use of Opus by using greater lookahead, esp. to improve the accuracy of all the kinds of additional analysis the master branch is doing. But for now the devs are continuing their focus on real-time use.

Reply #20
In repeatedly saying that every codec does what he's talking about and deriding him for asking about it, you seem to be misunderstanding or misconstruing what he's saying and you seem uncivil. English isn't his first language (Russian is, apparently), there have been some communication failures, and he's got some misconceptions that aren't central to what he's saying. Rather than pouncing on him for his misconceptions or his mode of expression and ignoring his main point, we should either have the patience and charity to try to understand and respond to the main thrust of what he's saying, or we should leave it alone.


If he doesn't understand, he should ask rather than just ignore.

If you understand his misconception, then you should help him to understand, rather than complain that no one else is doing what you also do not do.

But speech is considerably easier to code than music, especially for a codec like Opus which has LP capabilities, and if you can show me a single encoder that dramatically scales back its bitrate when presented with speech, especially in a mixed-content file, I will be quite surprised.


I'm curious what your source is for this statement? The fact that codecs do not decrease the bitrate as much as you expect them to suggests that transparent speech is harder to encode than you think.


Reply #21
Speech isn't that easy to code. http://research.nokia.com/files/public/%5B..._Opus_Codec.pdf

Opus uses hybrid mode only at very low bitrates. Speech requires a bitrate comparable to music's for (near-)transparent or high quality. There is no such thing as a smart encoder that does "64 kbps for speech and 128 kbps for music".
Suffice it to say that Opus 1.1 alpha (--bitrate 64) produces bitrates considerably >64 kbps on speech. It doesn't go any lower.

Reply #22
Speech isn't that easy to code. http://research.nokia.com/files/public/%5B..._Opus_Codec.pdf […] Speech requires a bitrate comparable to music's for (near-)transparent or high quality.
Thanks for this! It supports earlier suppositions that bitrates for speech that are similar to music point to speech being more complex than we estimate, not to any failing in VBR modes.

I guess we’re conditioned to think of speech as requiring low bitrates, when in fact it’s often just a case of people forcing low bitrates due to constraints upon bandwidth or capacity, or even just habit. I can appreciate that actually encoding speech at a level that matches music may be more of a challenge than is assumed. That was the case for me, anyway.

Reply #23
The fact that codecs do not decrease the bitrate as much as you expect them to suggests that transparent speech is harder to encode than you think.

Speech isn't that easy to code. Opus uses hybrid mode only at very low bitrates. Speech requires a bitrate comparable to music's for (near-)transparent or high quality.

If you think of audio quality only in terms of the binary transparent vs non-transparent distinction, you divorce yourselves from the realities of non-trained everyday listening by the non-golden-eared public, you exclude a whole host of uses for which people would prefer not to pay the extra bitrate costs for very rapidly diminishing quality returns, and you will be forever chasing corner cases and ephemeral differences.

24-32kbps Opus speech is quite good. While trained listeners can frequently distinguish 32kbps hybrid mode mono Opus speech from the original in careful repeated listening in controlled environments, both the Google and Nokia 2011 Opus listening tests showed that in MOS, MUSHRA, or ABC/HR testing, people rate 32kbps Opus practically on a level with the originals. It's true that those tests showed 32kbps had nonoverlapping error bars with the originals, but that's not true for 40kbps, and remember that's with a two-year-old Opus encoder and there's been plenty of improvement since then. If for mono speech current 32kbps Opus doesn't qualify as "high quality" then I don't care in the slightest what high quality is.

On the other hand, while low-bitrate Opus doesn't totally mangle music like most speech codecs, the difference is considerably more clear. I don't see any large-scale mono music tests readily available to back up my personal listening tests and observations, and if there were such their test setup wouldn't be designed for making cross-sample quality comparisons with speech. But if you look at the Google tests you'll see that subjects rated 64kbps stereo music to have a much much greater quality difference from the original than 32kbps mono speech, even though you'd expect channel coupling to have a very major benefit.

Though the difference is less dramatic in other codecs which don't have speech-oriented technologies, it's still there. Part of this is because a lowpass that butchers the sound of many music samples will not have objectionable - or, often, readily-noticed - effects on speech. (The bitrate->lowpass cutoff maps in LAME and Vorbis were designed for music content - in fact, the one in LAME isn't even well tuned for mono, basically just naively scaling the target bitrate by the arbitrary factor of 3/2 before plugging it into a table which is tuned for stereo - and overriding the lowpass can enable them to do considerably better with speech at <56kbps bitrates.) There are many other factors.

On top of that, recorded music is more likely to have important stereo separation, while for speech we're generally listening to a single source at a time and so most recorded material is either mono, "stereo" with both channels practically identical (e.g. identical except for dithering), or easily representable by intensity stereo. Any decent VBR encoder will manage to reduce its bitrate substantially when stereo separation is practically nil but if such content were in a separate file you'd be well-advised to explicitly tell it to downmix, saving a little bitrate and avoiding the possibility of some nonoptimal encoder decisions. Opus has the fairly unique capacity to switch to a true mono mode and back within the same stream, but opusenc doesn't use it, and at low bitrates it doesn't seem to reduce its bitrate as much for such content as one might anticipate.

Thanks for this! It supports earlier suppositions that bitrates for speech that are similar to music point to speech being more complex than we estimate, not to any failing in VBR modes.
It does no such thing. It has no tests relating to music quality, and most definitely no tests where people were asked to directly compare the quality of encoded speech samples to that of encoded music. It tells us that 40kbps hybrid Opus with a two-year-old still-under-heavy-development encoder was statistically tied with the fullband original speech.

Reply #24
The fact that codecs do not decrease the bitrate as much as you expect them to suggests that transparent speech is harder to encode than you think.

Speech isn't that easy to code. Opus uses hybrid mode only at very low bitrates. Speech requires a bitrate comparable to music's for (near-)transparent or high quality.

If you think of audio quality only in terms of the binary transparent vs non-transparent distinction, you divorce yourselves from the realities of non-trained everyday listening by the non-golden-eared public, you exclude a whole host of uses for which people would prefer not to pay the extra bitrate costs for very rapidly diminishing quality returns, and you will be forever chasing corner cases and ephemeral differences.

I do think you are simply mistaking transparency for intelligibility: the fact that you can perfectly understand someone talking on the phone doesn't mean the phone is transparent to speech, not the way this term is used in perceptual codec evaluation, at least.

BTW: have you ever tried to understand someone speaking a foreign language on a slightly noisy line?
... I live by long distance.