Full Version: Problems with BOMs and decoding
Hydrogenaudio Forums > Hosted Forums > foobar2000 > Support - (fb2k)
robinbowes
Hi,

I'm working on decoding MP3 tags using Perl, and I've got some problems decoding frames written by fb2k.

Let me illustrate this with an example.

Consider a TXXX frame:

Description: MusicBrainz Sortname
Text: All About Eve

This is written by fb2k like this (this is the raw frame as returned by MP3::Tag):
CODE
^A<FF><FE>M^@u^@s^@i^@c^@B^@r^@a^@i^@n^@z^@ ^@S^@o^@r^@t^@n^@a^@m^@e^@^@^@<FF><FE>A^@l^@l^@ ^@A^@b^@o^@u^@t^@ ^@E^@v^@e^@


My problem is with the BOM that is prepended to the second null-terminated string.
Here's how I decode the string:

First, strip off and inspect the first byte: 0x01 => Unicode.
Next, decode the remaining bytes as UTF-16 using Encode::decode.

This produces the following string:
CODE
MusicBrainz Sortname^@\x{feff}All About Eve
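
For anyone who wants to reproduce this outside Perl, here is a minimal Python sketch of the same two steps. The frame bytes are rebuilt by hand from the dump above, and the variable names are my own; Python's "utf-16" codec stands in for Encode::decode, so treat it as an illustration rather than what MP3::Tag does internally.

```python
# Rebuild the raw TXXX frame body shown above: encoding byte, then two
# UTF-16LE strings, each with its own BOM, separated by a 00 00 terminator.
frame = (
    b"\x01"
    + b"\xff\xfe" + "MusicBrainz Sortname".encode("utf-16-le")
    + b"\x00\x00"
    + b"\xff\xfe" + "All About Eve".encode("utf-16-le")
)

# Step 1: strip off and inspect the first byte (0x01 => Unicode).
encoding_byte, payload = frame[0], frame[1:]

# Step 2: decode everything that remains as a single UTF-16 string.
# Only the leading BOM is consumed; the second one survives as U+FEFF.
text = payload.decode("utf-16")
description, value = text.split("\x00")

print(repr(description))
print(repr(value))
```

The value comes back with a stray U+FEFF in front of it, which is exactly the problem described above.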


Can anyone tell me why fb2k writes the BOM at the start of all (both) strings in the frame and not just at the start of the frame?

Thanks,

R.
Peter
From ID3v2 - Informal standard:
QUOTE
Unicode strings must begin with the Unicode BOM ($FF FE or $FE FF) to identify the byte order.

In our understanding, BOM is inserted per-string rather than per-frame.
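
Under that per-string reading, a decoder would split the frame body into its individual strings first and only then decode each one with its own BOM. A rough Python sketch of that idea follows; the function names are made up for illustration, and it only handles encoding byte 0x01. Note that the 00 00 terminator has to be searched for on 2-byte boundaries, otherwise the high byte of a character such as "e" (65 00) followed by the terminator would be matched one byte too early.

```python
def split_utf16_strings(data: bytes) -> list:
    """Split on 00 00 terminators, looking only at 2-byte-aligned offsets."""
    parts = []
    start = 0
    for i in range(0, len(data) - 1, 2):
        if data[i:i + 2] == b"\x00\x00":
            parts.append(data[start:i])
            start = i + 2
    if start < len(data):
        # The last string in a frame is not necessarily terminated.
        parts.append(data[start:])
    return parts


def decode_unicode_frame(frame: bytes) -> list:
    """Decode a frame body whose encoding byte is 0x01 (UTF-16 with BOM)."""
    if frame[0] != 0x01:
        raise ValueError("this sketch only handles encoding byte 0x01")
    # Each string carries its own BOM, which the 'utf-16' codec consumes.
    return [part.decode("utf-16") for part in split_utf16_strings(frame[1:])]


frame = (
    b"\x01"
    + b"\xff\xfe" + "MusicBrainz Sortname".encode("utf-16-le")
    + b"\x00\x00"
    + b"\xff\xfe" + "All About Eve".encode("utf-16-le")
)
fields = decode_unicode_frame(frame)
print(fields)
```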
robinbowes
Ah, but the initial byte that specifies ISO-8859-1 or Unicode occurs only once, i.e. it applies to the whole frame. So I would suggest that the BOM should also occur only once per frame, shouldn't it?

R.
robinbowes
One other thing I've noticed is that if the "always write BOM" checkbox is not selected, fb2k writes Unicode tags with no BOM. This makes them difficult to decode: are they UTF-16LE or UTF-16BE? It's a guess.

From the same reference Peter quoted:

QUOTE
Unicode strings must begin with the Unicode BOM ($FF FE or $FE FF) to identify the byte order.


I would say that fb2k shouldn't do this, i.e. it should *always* write a BOM if it writes Unicode tags/strings.

R.
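
To make the guessing concrete, here is a small Python sketch of the fallback a decoder ends up needing for BOM-less strings. The function name and the `default` parameter are hypothetical; defaulting to little-endian is an assumption (it matches what is said later in this thread about fb2k writing UTF-16 little endian), not something the tag itself can tell you.

```python
import codecs


def decode_utf16_field(raw: bytes, default: str = "utf-16-le") -> str:
    """Decode one UTF-16 string, honouring a BOM if there is one."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[2:].decode("utf-16-le")
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[2:].decode("utf-16-be")
    # No BOM: the byte order is unknown, so fall back to an assumed default.
    return raw.decode(default)


with_le_bom = b"\xff\xfe" + "All About Eve".encode("utf-16-le")
with_be_bom = b"\xfe\xff" + "All About Eve".encode("utf-16-be")
without_bom = "All About Eve".encode("utf-16-le")

print(decode_utf16_field(with_le_bom))
print(decode_utf16_field(with_be_bom))
print(decode_utf16_field(without_bom))
```

If the default is wrong for a given file, the result is garbage rather than an error, which is why a mandatory BOM is so much nicer to work with.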
Florian
QUOTE(robinbowes @ Feb 15 2006, 03:41 AM)
Ah, but the initial byte that specifies ISO-8859-1 or Unicode occurs only once, i.e. it applies to the whole frame, therefore I would suggest that the BOM should also only occur once per frame, shouldn't it?

No. Every separate string in an ID3v2 frame that is written in Unicode must begin with the BOM.

QUOTE(robinbowes @ Feb 15 2006, 03:57 AM)
I would say that fb2k shouldn't do this, i.e. it should *always* write a BOM if it writes Unicode tags/strings.

You're right. These options were removed in the early foobar2000 0.9 betas, and foobar2000 0.9 now always writes a BOM.
robinbowes
Thanks for the clarification. That consistency will make it much easier to decode the raw frames, though I'll still need to work around the previous behaviour. Grrr.

I look forward to the general release of 0.9 to fix the problem, or at least stop it spreading!

Thanks again,

R.
robinbowes
One further thing...

The maintainer of MP3::Info has asked me to ask you to write UTF-8 frames for ID3v2.4. Or do you already do this?

R.
Florian
No, but we're planning to do so :)
robinbowes
"Planning to" as in "it will be in release 0.9"?

I'd be happy to help with testing as I'm getting down and dirty with mp3 frame encoding right now and currently have far more knowledge of this rattling around my brain than is healthy!

R.
Peter
We generally agree that UTF-8 is the way to go (allows further simplifications regarding unsync handling). However, making changes like this before 0.9 release will delay the release and require further compatibility testing.
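
For readers wondering what unsync has to do with the encoding: as far as I understand the ID3v2 unsynchronisation scheme, a zero byte is inserted after every $FF that is followed by a byte with the top three bits set (or by $00), so that nothing in the tag looks like an MPEG sync pattern. The UTF-16 BOM $FF FE is such a pattern, whereas valid UTF-8 never contains $FF at all. A minimal Python sketch, with a made-up function name, that ignores the special case of a trailing $FF:

```python
def unsynchronise(data: bytes) -> bytes:
    """Insert 0x00 after each 0xFF that precedes 0x00 or a byte >= 0xE0."""
    out = bytearray()
    for i, byte in enumerate(data):
        out.append(byte)
        if byte == 0xFF and i + 1 < len(data):
            following = data[i + 1]
            if following >= 0xE0 or following == 0x00:
                out.append(0x00)
    return bytes(out)


utf16_text = b"\xff\xfe" + "Eve".encode("utf-16-le")
utf8_text = "Eve".encode("utf-8")

# The UTF-16 string is altered because of its BOM; the UTF-8 one is not.
print(unsynchronise(utf16_text) != utf16_text)
print(unsynchronise(utf8_text) == utf8_text)
```

So with UTF-8 text, a writer only has to worry about unsync in binary frames, not in every Unicode string.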
robinbowes
OK, so does this mean you're only writing v2.3 tags (as v2.4 tags require ISO-8859-1 or UTF-8 encoding)?

Out of interest, is there a schedule for 0.9 release? How long after that will utf-8 support be added?

R.
Peter
We write v2.3 with UTF-16 / little endian, which seemed to be the most compatible across different apps until the iTunes unsync handling problem popped up.
I don't see anything in the specification that says that v2.4 requires UTF-8 or ISO 8859-1; only setting the "character encoding restrictions" flag in the extended header would imply so.
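
For reference, the text encoding byte values can be summarised in a small lookup. This is a Python sketch based on my reading of the ID3v2.3 and ID3v2.4 documents (v2.3 defines $00 and $01; v2.4 adds $02 and $03); the table and function names are invented for illustration.

```python
# encoding byte -> (Python codec name, terminator length in bytes)
TEXT_ENCODINGS = {
    0x00: ("latin-1", 1),    # ISO-8859-1, terminated with 00
    0x01: ("utf-16", 2),     # UTF-16 with BOM, terminated with 00 00
    0x02: ("utf-16-be", 2),  # UTF-16BE without BOM (v2.4 only), 00 00
    0x03: ("utf-8", 1),      # UTF-8 (v2.4 only), terminated with 00
}


def codec_for(encoding_byte: int, minor_version: int):
    """Return (codec, terminator length) for an ID3v2.<minor_version> frame."""
    allowed = (0x00, 0x01) if minor_version <= 3 else tuple(TEXT_ENCODINGS)
    if encoding_byte not in allowed:
        raise ValueError(
            f"encoding byte {encoding_byte:#04x} is not defined for "
            f"ID3v2.{minor_version}"
        )
    return TEXT_ENCODINGS[encoding_byte]


print(codec_for(0x01, 3))
print(codec_for(0x03, 4))
```

In other words, UTF-16 with a BOM remains a legal choice in v2.4; UTF-8 is an additional option rather than a requirement.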