C++Talk.NET Forum Index C++Talk.NET
C++ language newsgroups
 
seeking in a stream with nontrivial codecvt facet

 
Ulrich Eckhardt
Posted: Wed Apr 07, 2004 7:30 pm    Post subject: seeking in a stream with nontrivial codecvt facet

Hi!
I want to open a std::ifstream and skip past the first three bytes
(nonbreaking zero-width space encoded in UTF-8, a.k.a. byte order marker).
The first primitive approach was to simply use 'in.seekg(3);', which works
with STLport. However, GCC's libstdc++ chokes on that.

Investigating further, I see that it looks at the return-value of
codecvt<>::encoding(), and fails for all values '<= 0'. It doesn't even try
to move.

My questions:
- is GCC's behaviour a bug or a valid interpretation of the standard?
- my interpretation of codecvt::encoding() was that variable-width encodings
like UTF-8 should return 0 here, right?
- Reading [1], I see that the streambuffer has two methods seekoff() and
seekpos(). The former fails when encoding() returns <= 0, the latter has no
guaranteed effects when the position wasn't determined by a call to a
positioning function. Is there a way to achieve what I want portably?

cheers
Uli

[1] C++ IOStreams and Locales, by Langer and Kreft
--
FAQ: http://parashift.com/c++-faq-lite/
/* bittersweet C++ */
default: break;

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Paolo Carlini
Posted: Thu Apr 08, 2004 10:41 am    Post subject: Re: seeking in a stream with nontrivial codecvt facet

Ulrich Eckhardt wrote:
Quote:
Hi!
I want to open a std::ifstream and skip past the first three bytes
(nonbreaking zero-width space encoded in UTF-8, a.k.a. byte order marker).
The first primitive approach was to simply use 'in.seekg(3);', which works
with STLport. However, GCC's libstdc++ chokes on that.

Which version of gcc are you using?

Only the forthcoming 3.4 has complete support for encoding zero.

Paolo.

kanze@gabi-soft.fr
Posted: Thu Apr 08, 2004 6:06 pm    Post subject: Re: seeking in a stream with nontrivial codecvt facet

Ulrich Eckhardt <doomster (AT) knuut (DOT) de> wrote


Quote:
I want to open a std::ifstream and skip past the first three bytes
(nonbreaking zero-width space encoded in UTF-8, a.k.a. byte order
marker).

Just curious, but why do you have a byte order marker in UTF-8? The
whole idea behind UTF-8 is to provide a byte oriented encoding in which
byte order of larger elements doesn't matter.

Quote:
The first primitive approach was to simply use 'in.seekg(3);', which
works with STLport. However, GCC's libstdc++ chokes on that.

Investigating further, I see that it looks at the return-value of
codecvt<>::encoding(), and fails for all values '<= 0'. It doesn't
even try to move.

My questions:
- is GCC's behaviour a bug or a valid interpretation of the standard?

I think it is valid. Calling seekg with a single parameter resolves to
calling seekpos in filebuf: according to §27.8.1.4/14, "If sp has not
been obtained by a previous successful call to one of the positioning
functions (seekoff or seekpos) on the same file the effects are
undefined." An implementation is allowed to make it work, but is not
required to.

Note that while the standard (Table 88 of §27.4.3.2) requires that an
int convert implicitly to a pos_type, it doesn't specify any semantics
for this, and it is in fact very difficult to imagine what it is
supposed to mean in the case of variable length multi-byte encodings (or
with any file not opened in binary mode, for that matter). In practice,
I would expect that converting 0 and using it will get me to the
beginning of the file, but would not expect anything sensible otherwise
except for files opened in binary mode and imbued with a non-converting
locale. (In fact, the entire definition of fpos contains internal
contradictions and mathematically impossible requirements.)

That said, I don't quite see why g++ would look at
codecvt<>::encoding(). The position is either valid, or it isn't --
since the results are undefined, whether constructing a pos_type from an
arbitrary int results in a valid position or not is pretty much random,
but that's another story. (I can think of two reasonable
implementations here. In one, pos_type will do exactly what you want,
in this one particular case, but could give some pretty strange results
in the general case. In the other, pos_type will be invalid for all
integral values, except maybe 0, which would be treated as a special
case. The first is by far the simplest.)

It might be worth writing a small program which reads a couple of
characters, calls tellg, reads further, then calls seekg with the results
of tellg. If this fails, then g++ has an error. (And if your
interpretation is correct, that g++ systematically refuses any seekg if
codecvt<>::encoding() returns a non-positive value, it will fail.)
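
A minimal sketch of such a test program, under stated assumptions: the helper names seek_round_trip_works and self_check are hypothetical, and the in-memory check only exercises the helper itself. To test the behaviour discussed here, the same helper would be passed a file stream imbued with the locale whose codecvt facet is in question.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: returns true if a position saved with tellg() can be
// restored with seekg() and the same characters are then read again.
bool seek_round_trip_works(std::istream& in)
{
    char c;
    if (!in.get(c) || !in.get(c))            // read a couple of characters
        return false;

    const std::istream::pos_type mark = in.tellg();
    if (mark == std::istream::pos_type(-1))  // tellg itself failed
        return false;

    std::string first, second;
    if (!std::getline(in, first))            // read further
        return false;
    if (!in.seekg(mark))                     // go back to the saved position
        return false;
    if (!std::getline(in, second))           // read the same stretch again
        return false;
    return first == second;
}

// In-memory sanity check of the helper (no code conversion involved).
bool self_check()
{
    std::istringstream in("abcdef\nrest\n");
    return seek_round_trip_works(in);
}
```

Only positions that came from tellg() are passed to seekg() here, which is the one case the quoted paragraph of the standard covers.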

Quote:
- my interpretation of codecvt::encoding() was that variable-width
encodings like UTF-8 should return 0 here, right?

That's what I would think. UTF-8 isn't state-dependent, at least not in
the sense that I think is meant here, but the number of external
characters in an internal character is not a constant.
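
For reference, a small sketch of how the return value of encoding() is read: -1 for a state-dependent encoding, 0 for a variable-width one, and N > 0 for a fixed N external characters per internal character. The function names describe_encoding and classic_char_encoding are hypothetical. The only facet queried is the non-converting codecvt<char, char, mbstate_t> of the "C" locale, for which the standard specifies 1.

```cpp
#include <cassert>
#include <cwchar>    // std::mbstate_t
#include <locale>
#include <string>

// Hypothetical helper: turns the result of codecvt<>::encoding() into words.
std::string describe_encoding(int enc)
{
    if (enc < 0)
        return "state-dependent";
    if (enc == 0)
        return "variable width";
    return "fixed width";
}

// Queries the non-converting facet of the classic "C" locale.
int classic_char_encoding()
{
    typedef std::codecvt<char, char, std::mbstate_t> facet_type;
    return std::use_facet<facet_type>(std::locale::classic()).encoding();
}
```

A UTF-8 facet would be expected to fall into the "variable width" case, which is the value that the libstdc++ version discussed above reportedly refuses to seek with.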

Quote:
- Reading [1], I see that the streambuffer has two methods seekoff()
and seekpos(). The former fails when encoding() returns <= 0, the
latter has no guaranteed effects when the position wasn't determined
by a call to a positioning function. Is there a way to achieve what I
want portably?

Sort of, but I'm not sure you'll like the solution. Basically, you need
to create a filebuf, opened in binary mode and imbued with the "C"
locale, and use that to access the actual file. Next, you need to write
your own, filtering streambuf, which basically uses this filebuf as a
source, and uses the imbued codecvt to do the code translation --
duplicating the code in filebuf, in sum. You can then seek on the
underlying filebuf, then read through the filtering streambuf which
takes care of the rest of the code translation.
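
A partial sketch of the first half of this idea, assuming a hypothetical helper named open_past_utf8_bom: the filebuf is imbued with the "C" locale and opened in binary mode, so positioning works on raw bytes. The filtering streambuf that would then do the code translation is not shown.

```cpp
#include <cassert>
#include <cstring>
#include <fstream>
#include <ios>
#include <locale>

// Hypothetical helper: opens `path` as raw bytes and leaves the buffer
// positioned just past a leading EF BB BF, if the file starts with one.
// Returns false if the file cannot be opened or repositioned.
bool open_past_utf8_bom(std::filebuf& fb, const char* path)
{
    assert(path != 0);
    fb.pubimbue(std::locale::classic());   // non-converting codecvt facet
    if (!fb.open(path, std::ios_base::in | std::ios_base::binary))
        return false;

    char head[3] = {0, 0, 0};
    const std::streamsize got = fb.sgetn(head, 3);
    const bool has_bom =
        got == 3 && std::memcmp(head, "\xEF\xBB\xBF", 3) == 0;
    if (has_bom)
        return true;                       // already positioned after it

    // No signature: rewind with an offset relative to the beginning.
    const std::filebuf::pos_type pos =
        fb.pubseekoff(0, std::ios_base::beg, std::ios_base::in);
    return pos != std::filebuf::pos_type(-1);
}
```

The bytes read from the filebuf afterwards are still UTF-8; an istream constructed on it sees them unconverted.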

Alternatively, for this one specific case, I would guess that just
calling ignore(1) should do the trick. I think, formally, that an
implementation would be allowed to check the Unicode value resulting
from codecvt::encoding(), and reject it as an error if it isn't a valid
Unicode character (which the byte order mark isn't), but I cannot
imagine an implementation actually doing this.

--
James Kanze GABI Software mailto:kanze (AT) gabi-soft (DOT) fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Ben Hutchings
Posted: Fri Apr 09, 2004 1:49 am    Post subject: Re: seeking in a stream with nontrivial codecvt facet

kanze (AT) gabi-soft (DOT) fr wrote:
Quote:
Ulrich Eckhardt <doomster (AT) knuut (DOT) de> wrote in message
news:<c510gk$2o1ueh$1 (AT) ID-178288 (DOT) news.uni-berlin.de>...

I want to open a std::ifstream and skip past the first three bytes
(nonbreaking zero-width space encoded in UTF-8, a.k.a. byte order
marker).

Just curious, but why do you have a byte order marker in UTF-8? The
whole idea behind UTF-8 is to provide a byte oriented encoding in which
byte order of larger elements doesn't matter.
snip


This is a Windows convention. It is a means to distinguish UTF-8
text files from text files that use a local encoding. Of course
this is unreliable and not extensible, but what do you expect
from [rest of rant deleted]

P.J. Plauger
Posted: Fri Apr 09, 2004 12:31 pm    Post subject: Re: seeking in a stream with nontrivial codecvt facet

"Ben Hutchings" <do-not-spam-benh (AT) bwsint (DOT) com> wrote


Quote:
kanze (AT) gabi-soft (DOT) fr wrote:
Ulrich Eckhardt <doomster (AT) knuut (DOT) de> wrote in message
news:<c510gk$2o1ueh$1 (AT) ID-178288 (DOT) news.uni-berlin.de>...

I want to open a std::ifstream and skip past the first three bytes
(nonbreaking zero-width space encoded in UTF-8, a.k.a. byte order
marker).

Just curious, but why do you have a byte order marker in UTF-8? The
whole idea behind UTF-8 is to provide a byte oriented encoding in which
byte order of larger elements doesn't matter.
snip

This is a Windows convention. It is a means to distinguish UTF-8
text files from text files that use a local encoding. Of course
this is unreliable and not extensible, but what do you expect
from [rest of rant deleted]

You mean Unicode, Inc.? There's a description of this convention at:

http://www.unicode.org/faq/utf_bom.html#2

Doesn't say anything about Windows there, but what do you expect from
[rest of rant deleted]?

Honi soit qui mal y pense.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


Ulrich Eckhardt
Posted: Fri Apr 09, 2004 2:23 pm    Post subject: Re: seeking in a stream with nontrivial codecvt facet

kanze (AT) gabi-soft (DOT) fr wrote:
Quote:
Ulrich Eckhardt <doomster (AT) knuut (DOT) de> wrote
I want to open a std::ifstream and skip past the first three bytes
(nonbreaking zero-width space encoded in UTF-8, a.k.a. byte order
marker).

Just curious, but why do you have a byte order marker in UTF-8? The
whole idea behind UTF-8 is to provide a byte oriented encoding in which
byte order of larger elements doesn't matter.

In UCS2, the character can be a BOM. Unicode only calls it a 'non-breaking
zero-width space'. Writing this at the beginning of a UTF-8 file is just
one way to mark it as a UTF-8 file. Right, my statement was confusing...

Quote:
Alternatively, for this one specific case, I would guess that just
calling ignore(1) should do the trick.
I think, formally, that an implementation would be allowed to check
the Unicode value resulting from codecvt::encoding(), and reject it
as an error if it isn't a valid Unicode character (which the byte
order mark isn't), but I cannot imagine an implementation actually
doing this.

[assuming you meant codecvt::in() and not codecvt::encoding()]

No, the BOM _is_ a valid character and that's the whole point of it. It is
a zero width space. Its meaning for display is pretty void, but if
byte-swapped, it becomes a character that is guaranteed by Unicode not
ever to be valid, thereby allowing automatic detection of the byte order.
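
The detection described here can be sketched as follows. The names Signature and detect_signature are hypothetical, and the sketch only distinguishes UTF-8 and the two UTF-16 byte orders. It does not look for UTF-32 signatures, so a UTF-32 little-endian file (FF FE 00 00) would be reported as UTF-16 little-endian.

```cpp
#include <cassert>
#include <cstddef>

enum Signature {
    sig_none,
    sig_utf8,
    sig_utf16_big_endian,
    sig_utf16_little_endian
};

// Hypothetical helper: classifies the first bytes of a buffer by the
// encoded form of U+FEFF that they contain, if any.
Signature detect_signature(const unsigned char* data, std::size_t size)
{
    assert(data != 0 || size == 0);
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return sig_utf8;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return sig_utf16_big_endian;      // U+FEFF, high byte first
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return sig_utf16_little_endian;   // would read as U+FFFE if taken as big-endian
    return sig_none;
}
```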

Anyhow, even if I could read that single codepoint from a stream, I could
not read it from a stream<char>(assuming an internal ASCII or something
derived therefrom), because I can't convert it. I wonder if ignore() is
supposed to do the job... else, I fear I'll either not allow it or make a
special case and convert it to a normal space.

Thank you for your other comments, too. I'll check these after easter.

Uli

--
Questions ?
see C++-FAQ Lite: http://parashift.com/c++-faq-lite/ first !


Ben Hutchings
Posted: Wed Apr 14, 2004 6:39 am    Post subject: Re: seeking in a stream with nontrivial codecvt facet

P.J. Plauger wrote:
Quote:
"Ben Hutchings" <do-not-spam-benh (AT) bwsint (DOT) com> wrote in message
news:slrnc7bebg.gkv.do-not-spam-benh (AT) shadbolt (DOT) i.decadentplace.org.uk...

kanze (AT) gabi-soft (DOT) fr wrote:
Ulrich Eckhardt <doomster (AT) knuut (DOT) de> wrote in message
news:<c510gk$2o1ueh$1 (AT) ID-178288 (DOT) news.uni-berlin.de>...

I want to open a std::ifstream and skip past the first three bytes
(nonbreaking zero-width space encoded in UTF-8, a.k.a. byte order
marker).

Just curious, but why do you have a byte order marker in UTF-8? The
whole idea behind UTF-8 is to provide a byte oriented encoding in which
byte order of larger elements doesn't matter.
snip

This is a Windows convention. It is a means to distinguish UTF-8
text files from text files that use a local encoding. Of course
this is unreliable and not extensible, but what do you expect
from [rest of rant deleted]

You mean Unicode, Inc.? There's a description of this convention at:

http://www.unicode.org/faq/utf_bom.html#2

Doesn't say anything about Windows there, but what do you expect from
[rest of rant deleted]?

That would appear to be documentation of existing practice, not a
recommendation. The actual standard says that:

"Use of a BOM is neither required nor recommended for UTF-8, but
may be encountered in contexts where UTF-8 data is converted from
other encoding forms that use a BOM, or where the BOM is used as a
UTF-8 signature."

Quote:
Honi soit qui mal y pense.

I really can't be bothered to come up with a riposte. I can only
suggest that you examine the result of relying too much on such
heuristics for encoding detection:
<http://weblogs.asp.net/cumpsd/archive/2004/02/27/81098.aspx>.

kanze@gabi-soft.fr
Posted: Thu Apr 15, 2004 12:52 am    Post subject: Re: seeking in a stream with nontrivial codecvt facet

Ulrich Eckhardt <doomster (AT) knuut (DOT) de> wrote

Quote:
kanze (AT) gabi-soft (DOT) fr wrote:
Ulrich Eckhardt <doomster (AT) knuut (DOT) de> wrote
I want to open a std::ifstream and skip past the first three bytes
(nonbreaking zero-width space encoded in UTF-8, a.k.a. byte order
marker).

Just curious, but why do you have a byte order marker in UTF-8? The
whole idea behind UTF-8 is to provide a byte oriented encoding in
which byte order of larger elements doesn't matter.

In UCS2, the character can be a BOM. Unicode only calls it a
'non-breaking zero-width space'. Writing this at the beginning of a
UTF-8 file is just one way to mark it as a UTF-8 file. Right, my
statement was confusing...

OK. I knew it was special (and that it was used to determine byte
order), but I was a little confused as to how. I was under the
impression that it was only used to distinguish between UTF-16BE and
UTF-16LE, but of course, it would also have a distinctive encoding in
UTF-8 (or UTF-32 for that matter), different from any of the UTF-16
encodings.
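
As a worked example of that distinctive encoding, the sketch below encodes a single code point as UTF-8. The function name encode_utf8 is hypothetical, and the input is assumed to be a valid code point (at most 0x10FFFF and not a surrogate). For U+FEFF it produces the three bytes EF BB BF mentioned at the start of the thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes one code point as UTF-8.
// Assumes cp <= 0x10FFFF and that cp is not a surrogate.
std::string encode_utf8(unsigned long cp)
{
    assert(cp <= 0x10FFFFul);
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx and three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For 0xFEFF the three-byte branch applies: the top four bits (1111) give EF, the middle six (111011) give BB, and the low six (111111) give BF.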

Quote:
Alternatively, for this one specific case, I would guess that just
calling ignore(1) should do the trick.
I think, formally, that an implementation would be allowed to check
the Unicode value resulting from codecvt::encoding(), and reject it
as an error if it isn't a valid Unicode character (which the byte
order mark isn't), but I cannot imagine an implementation actually
doing this.

[assuming you meant codecvt::in() and not codecvt::encoding()]

Yes.

Quote:
No, the BOM _is_ a valid character and that's the whole point of
it. It is a zero width space. Its meaning for display is pretty void,
but if byte-swapped, it becomes a character that is guaranteed by
Unicode not ever to be valid, thereby allowing automatic detection of
the byte order.

I see. I was under the impression that it was a special illegal
character, which should be thrown out once read, but what you describe
is much more logical.

Quote:
Anyhow, even if I could read that single codepoint from a stream, I
could not read it from a stream<char>(assuming an internal ASCII or
something derived therefrom), because I can't convert it.

OK. I had just assumed that you were reading through a wstream.

I can't find anything in the standard which says what the implementation
is supposed to do if it encounters a character it cannot translate; I
would presume set an error, but I don't see where it says this. The
problem is that the code translation occurs in the filebuf, which simply
doesn't have many possibilities for error reporting. If the separation
between the istream and the streambuf is to be maintained, the filebuf
either has to throw an exception, or the istream will report end of
file, with no indication that it is actually due to an error, and not
running out of input characters.

I have some vague memories of words requiring the return of an
implementation specified character, but I can't find them, and it really
doesn't make sense. While not the case with UTF-8, with some multibyte
encodings, encountering an illegal encoding might result in
desynchronization of your input state, and you have no way of
resynchronizing.

Quote:
I wonder if ignore() is supposed to do the job...

I doubt it. ignore() bases itself on the return values of
streambuf::sgetc(). At that level, it either gets a valid character, or
EOF.

Quote:
else, I fear I'll either not allow it or make a special case and
convert it to a normal space.

If you need special handling, writing your own codecvt is theoretically
the way to go. Either I've missed something, however, or it isn't an
easy job.

One possibility might be a filtering streambuf, reading from a wfilebuf,
and doing whatever special processing you want (like ignoring BOM
characters).
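
A minimal sketch of such a filter, with hypothetical names (BomSkippingBuf, read_all). It wraps any wide streambuf, which could be a wfilebuf, and drops one leading U+FEFF. It is unbuffered, input-only, and does not support seeking.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical filtering streambuf: forwards characters from `source`,
// skipping a U+FEFF if it is the very first character.
class BomSkippingBuf : public std::wstreambuf {
public:
    explicit BomSkippingBuf(std::wstreambuf* source)
        : source_(source), at_start_(true), ch_(0)
    {
        assert(source != 0);
    }

protected:
    int_type underflow()
    {
        int_type c = source_->sbumpc();
        if (at_start_) {
            at_start_ = false;
            if (!traits_type::eq_int_type(c, traits_type::eof())
                && traits_type::to_char_type(c) == static_cast<wchar_t>(0xFEFF))
                c = source_->sbumpc();     // drop the leading U+FEFF
        }
        if (traits_type::eq_int_type(c, traits_type::eof()))
            return traits_type::eof();
        ch_ = traits_type::to_char_type(c);
        setg(&ch_, &ch_, &ch_ + 1);        // one-character get area
        return c;
    }

private:
    std::wstreambuf* source_;
    bool at_start_;
    wchar_t ch_;
};

// Reads everything from `source` through the filter.
std::wstring read_all(std::wstreambuf& source)
{
    BomSkippingBuf filter(&source);
    std::wistream in(&filter);
    std::wstring out;
    wchar_t c;
    while (in.get(c))
        out += c;
    return out;
}
```

With a wfilebuf as the source, the code translation still happens in the wfilebuf, so this only helps when the imbued facet can deliver U+FEFF as a wide character in the first place.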

--
James Kanze GABI Software mailto:kanze (AT) gabi-soft (DOT) fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

P.J. Plauger
Posted: Thu Apr 15, 2004 6:41 am    Post subject: Re: seeking in a stream with nontrivial codecvt facet

"Ben Hutchings" <do-not-spam-benh (AT) bwsint (DOT) com> wrote


Quote:
P.J. Plauger wrote:
"Ben Hutchings" <do-not-spam-benh (AT) bwsint (DOT) com> wrote in message
news:slrnc7bebg.gkv.do-not-spam-benh (AT) shadbolt (DOT) i.decadentplace.org.uk...

kanze (AT) gabi-soft (DOT) fr wrote:
Ulrich Eckhardt <doomster (AT) knuut (DOT) de> wrote in message
news:<c510gk$2o1ueh$1 (AT) ID-178288 (DOT) news.uni-berlin.de>...

I want to open a std::ifstream and skip past the first three bytes
(nonbreaking zero-width space encoded in UTF-8, a.k.a. byte order
marker).

Just curious, but why do you have a byte order marker in UTF-8? The
whole idea behind UTF-8 is to provide a byte oriented encoding in which
byte order of larger elements doesn't matter.
snip

This is a Windows convention. It is a means to distinguish UTF-8
text files from text files that use a local encoding. Of course
this is unreliable and not extensible, but what do you expect
from [rest of rant deleted]

You mean Unicode, Inc.? There's a description of this convention at:

http://www.unicode.org/faq/utf_bom.html#2

Doesn't say anything about Windows there, but what do you expect from
[rest of rant deleted]?

That would appear to be documentation of existing practice, not a
recommendation.

Exactly. My (narrow) point was that it's a practice not directly tied
to Windows, so there's no reason/excuse to rant yet again about the
perceived shortcomings of Windows/Microsoft/Bill Gates, or whomever
you were vaguely alluding to.

Quote:
The actual standard says that:

"Use of a BOM is neither required nor recommended for UTF-8, but
may be encountered in contexts where UTF-8 data is converted from
other encoding forms that use a BOM, or where the BOM is used as a
UTF-8 signature."

Honi soit qui mal y pense.

I really can't be bothered to come up with a riposte. I can only
suggest that you examine the result of relying too much on such
heuristics for encoding detection:
<http://weblogs.asp.net/cumpsd/archive/2004/02/27/81098.aspx>.

I personally don't care one way or the other whether people use
such headers for UTF-8 files. Our CoreX library has a codecvt
facet that can be configured to optionally strip off such a header,
for customers who want that feature. Or not. And it works even on
non-Windows systems, amazingly enough.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


kanze@gabi-soft.fr
Posted: Thu Apr 15, 2004 7:32 pm    Post subject: Re: seeking in a stream with nontrivial codecvt facet

Ben Hutchings <do-not-spam-benh (AT) bwsint (DOT) com> wrote

Quote:
P.J. Plauger wrote:
"Ben Hutchings" <do-not-spam-benh (AT) bwsint (DOT) com> wrote in message
news:slrnc7bebg.gkv.do-not-spam-benh (AT) shadbolt (DOT) i.decadentplace.org.uk...
kanze (AT) gabi-soft (DOT) fr wrote:
Ulrich Eckhardt <doomster (AT) knuut (DOT) de> wrote in message
news:<c510gk$2o1ueh$1 (AT) ID-178288 (DOT) news.uni-berlin.de>...

I want to open a std::ifstream and skip past the first three
bytes (nonbreaking zero-width space encoded in UTF-8,
a.k.a. byte order marker).

Just curious, but why do you have a byte order marker in UTF-8?
The whole idea behind UTF-8 is to provide a byte oriented
encoding in which byte order of larger elements doesn't matter.
snip

This is a Windows convention. It is a means to distinguish UTF-8
text files from text files that use a local encoding. Of course
this is unreliable and not extensible, but what do you expect
from [rest of rant deleted]

You mean Unicode, Inc.? There's a description of this convention at:

http://www.unicode.org/faq/utf_bom.html#2

Doesn't say anything about Windows there, but what do you expect
from [rest of rant deleted]?

That would appear to be documentation of existing practice, not a
recommendation. The actual standard says that:

"Use of a BOM is neither required nor recommended for UTF-8, but
may be encountered in contexts where UTF-8 data is converted from
other encoding forms that use a BOM, or where the BOM is used as a
UTF-8 signature."

The version I'm looking at says a little more. Concerning a BOM in
UTF-8, "Its usage at the beginning of a UTF-8 data stream is neither
required nor recommended by the Unicode Standard, but its presence does
not affect conformance to the UTF-8 encoding scheme. Identification of
the <EF BB BF> byte sequence at the beginning of a data stream can,
however, be taken as near-certain indication that the data stream is
using the UTF-8 encoding scheme."

This seems very conformant to what Microsoft is trying to do.

Quote:
Honi soit qui mal y pense.

I really can't be bothered to come up with a riposte. I can only
suggest that you examine the result of relying too much on such
heuristics for encoding detection:
<http://weblogs.asp.net/cumpsd/archive/2004/02/27/81098.aspx>.

It depends, and Microsoft (like just about everyone else) is trying to
do the best they can in an impossible situation. As the Unicode
standard says, if the first three bytes of a file are <EF BB BF>, there
is a very good chance that the file is UTF-8. It probably isn't EBCDIC,
since none of these bytes are normally assigned in EBCDIC (but I
wouldn't exclude someone using an extended EBCDIC with accented
characters where they were assigned). In ISO 8859-1, it corresponds to
the sequence of letters "ï»¿" -- not a very common sequence in the
languages I'm familiar with. I can't comment on the likelihood of its
occurrence in the other ISO 8859 codes, since I'm not fluent in the
languages for which they are used. All in all, it would seem like UTF-8
is a good bet, at least in my locale.

As for the problem in the link you give: I don't know the exact
algorithm Microsoft is using to guess the encoding, but most pairs of
ASCII letters are indistinguishable from a single UTF-16 CJK ideograph; a
space won't change that, if it happens to end up on the low order byte
of the UTF-16.
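
A small worked example of that ambiguity, assuming an ASCII execution character set and little-endian UTF-16; the function names are hypothetical. The letters "hi" are the bytes 68 69, which read as the single code unit 0x6968, and that value lies in the CJK Unified Ideographs block (U+4E00 to U+9FFF).

```cpp
#include <cassert>

// Hypothetical helper: combines two bytes into one UTF-16 code unit,
// little-endian (first byte is the low-order byte).
unsigned utf16le_unit(unsigned char first, unsigned char second)
{
    return static_cast<unsigned>(first) | (static_cast<unsigned>(second) << 8);
}

// True if the code unit lies in the basic CJK Unified Ideographs block.
bool in_cjk_unified_block(unsigned unit)
{
    return unit >= 0x4E00 && unit <= 0x9FFF;
}
```

Any lowercase ASCII letter (61 to 7A) in the high-order byte lands in that block whatever the low-order byte is, which is why a space in the low-order position does not help.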

What I *would* do (and Microsoft is apparently not doing) would be to
condition my evaluation on the global locale. If the locale says I'm
in North America or Europe, ASCII (or one of the 8859 code sets) would
seem a better guess than UTF-16 with CJK characters; if I were in
eastern Asia, on the other hand... But this also has its limitations:
there's nothing in locale itself which might help, and not all locales
have a name. And of course, the name "C" doesn't help much either, nor
do many of the traditional names. Modern Unix incorporates the ISO
country code as the 4th and 5th letters of the locale name, but this
isn't systematic either. And sometimes, the locale name will
incorporate the actual code set, which is a good hint too. If you
really wanted to get fancy, you might be able to pick up information
about the time zone as well...

But even with all that, you're still guessing. And you will be until
everyone converts to the same code (UTF-8 seems the best candidate at
present).

Note that the problem isn't new. Ever since I can remember, Unix has had
a program "file" which displayed the type of contents of the files whose
names were given. And which often got confused between sources for AWK,
C or csh. (For that matter... The version on my current system doesn't
seem to know anything about C++; it reports "ascii text" for most of my
C++ and GNU makefile sources.)

--
James Kanze GABI Software mailto:kanze (AT) gabi-soft (DOT) fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

P.J. Plauger
Posted: Fri Apr 16, 2004 6:05 am    Post subject: Re: seeking in a stream with nontrivial codecvt facet

<kanze (AT) gabi-soft (DOT) fr> wrote


Quote:
Ulrich Eckhardt <doomster (AT) knuut (DOT) de> wrote in message
news:<c560rr$2q0nji$1 (AT) ID-178288 (DOT) news.uni-berlin.de>...
kanze (AT) gabi-soft (DOT) fr wrote:
Ulrich Eckhardt <doomster (AT) knuut (DOT) de> wrote
I want to open a std::ifstream and skip past the first three bytes
(nonbreaking zero-width space encoded in UTF-8, a.k.a. byte order
marker).

Just curious, but why do you have a byte order marker in UTF-8? The
whole idea behind UTF-8 is to provide a byte oriented encoding in
which byte order of larger elements doesn't matter.

In UCS2, the character can be a BOM. Unicode only calls it a
'non-breaking zero-width space'. Writing this at the beginning of a
UTF-8 file is just one way to mark it as a UTF-8 file. Right, my
statement was confusing...

OK. I knew it was special (and that it was used to determine byte
order), but I was a little confused as to how. I was under the
impression that it was only used to distinguish between UTF-16BE and
UTF-16LE, but of course, it would also have a distinctive encoding in
UTF-8 (or UTF-32 for that matter), different from any of the UTF-16
encodings.

Yes, BOM can be used to identify, and specify the byte order of,
any of UTF-8, UTF-16*E, or UTF-32*E. The usual convention is to discard
the header deep inside the codecvt facet, after it adapts (as
needed) to the advertised byte order. Similarly, such a facet should
generate such a header when it generates output.

And, of course, you'd like some way to convince the codecvt facet
*not* to look for, or generate, such headers if you don't want
the clutter.

Quote:
Alternatively, for this one specific case, I would guess that just
calling ignore(1) should do the trick.
I think, formally, that an implementation would be allowed to check
the Unicode value resulting from codecvt::encoding(), and reject it
as an error if it isn't a valid Unicode character (which the byte
order mark isn't), but I cannot imagine an implementation actually
doing this.

[assuming you meant codecvt::in() and not codecvt::encoding()]

Yes.

No, the BOM _is_ a valid character and that's the whole point of
it. It is a zero width space. Its meaning for display is pretty void,
but if byte-swapped, it becomes a character that is guaranteed by
Unicode not ever to be valid, thereby allowing automatic detection of
the byte order.

I see. I was under the impression that it was a special illegal
character, which should be thrown out once read, but what you describe
is much more logical.

And that's why you can't always treat the leading BOM as a header --
it may indeed be a non-breaking space in some guy's universe.
Headers and transparent files don't always go hand in hand.

Quote:
Anyhow, even if I could read that single codepoint from a stream, I
could not read it from a stream<char>(assuming an internal ASCII or
something derived therefrom), because I can't convert it.

OK. I had just assumed that you were reading through a wstream.

I can't find anything in the standard which says what the implementation
is supposed to do if it encounters a character it cannot translate; I
would presume set an error, but I don't see where it says this. The
problem is that the code translation occurs in the filebuf, which simply
doesn't have many possibilities for error reporting. If the separation
between the istream and the streambuf is to be maintained, the filebuf
either has to throw an exception, or the istream will report end of
file, with no indication that it is actually due to an error, and not
running out of input characters.

It's pretty ugly handling the case when a codecvt facet gets out of
sync with the data stream. All you can really do is report the
situation the same as a read error.

Quote:
I have some vague memories of words requiring the return of an
implementation specified character, but I can't find them, and it really
doesn't make sense. While not the case with UTF-8, with some multibyte
encodings, encountering an illegal encoding might result in
desynchronization of your input state, and you have no way of
resynchronizing.

Right. Just keep returning char_traits<T>::eof().

Quote:
I wonder if ignore() is supposed to do the job...

I doubt it. ignore() bases itself on the return values of
streambuf::sgetc(). At that level, it either gets a valid character, or
EOF.

else, I fear I'll either not allow it or make a special case and
convert it to a normal space.

If you need special handling, writing your own codecvt is theoretically
the way to go. Either I've missed something, however, or it isn't an
easy job.

No. It's what you have to do and it ain't easy. That's why we developed
our CoreX library.

Quote:
One possibility might be a filtering streambuf, reading from a wfilebuf,
and doing whatever special processing you want (like ignoring BOM
characters).

CoreX also has such a filter, mostly to avoid deficiencies in the
basic_filebuf offered with other Standard C++ libraries besides ours.
Still another class converts string to string, using a codecvt
facet, if you want to convert purely inside the program without all
the hassle of setting up a basic_stringbuf.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


kanze@gabi-soft.fr
Posted: Mon Apr 19, 2004 6:42 pm    Post subject: Re: seeking in a stream with nontrivial codecvt facet

"P.J. Plauger" <pjp (AT) dinkumware (DOT) com> wrote


[...]
Quote:
Anyhow, even if I could read that single codepoint from a stream,
I could not read it from a stream<char>(assuming an internal
ASCII or something derived therefrom), because I can't convert
it.

OK. I had just assumed that you were reading through a wstream.

I can't find anything in the standard which says what the
implementation is supposed to do if it encounters a character it
cannot translate; I would presume set an error, but I don't see
where it says this. The problem is that the code translation occurs
in the filebuf, which simply doesn't have many possibilities for
error reporting. If the separation between the istream and the
streambuf is to be maintained, the filebuf either has to throw an
exception, or the istream will report end of file, with no
indication that it is actually due to an error, and not running out
of input characters.

It's pretty ugly handling the case when a codecvt facet gets out of
sync with the data stream. All you can really do is report the
situation the same as a read error.

I'm just wondering, but shouldn't this be considered a defect of some
sort? It certainly looks like a design error to me -- traditionally,
actual read errors show up as EOF (which could also be considered a
design error, IMHO), but errors in the format of what was read show up
as fail() && ! eof().

Quote:
I have some vague memories of words requiring the return of an
implementation specified character, but I can't find them, and it
really doesn't make sense. While not the case with UTF-8, with
some multibyte encodings, encountering an illegal encoding might
result in desynchronization of your input state, and you have no
way of resynchronizing.

Right. Just keep returning char_traits<T>::eof().

Except that as a user, I want to display a format error message, and not
continue processing as if I'd reached the normal end of file.

About the best thing I can think of is to implement an extended
interface for streambuf (using multiple inheritance), with extended
error information, then use rdbuf() and dynamic_cast to access this
extended interface. Not a very simple or elegant solution, IMHO, but
better than nothing.
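
A rough sketch of that idea, with every name hypothetical (DecodeStatus, CheckedBuf, stopped_on_format_error, read_all). The toy buffer reads from an in-memory string and treats the byte 0xFF, which never occurs in UTF-8, as a format error. A real implementation would sit on top of a filebuf and a codecvt facet.

```cpp
#include <cassert>
#include <istream>
#include <streambuf>
#include <string>

// Hypothetical extended interface; not part of the standard library.
class DecodeStatus {
public:
    virtual ~DecodeStatus() {}
    virtual bool format_error() const = 0;
};

// Toy streambuf that also implements the extended interface.
class CheckedBuf : public std::streambuf, public DecodeStatus {
public:
    explicit CheckedBuf(const std::string& bytes)
        : bytes_(bytes), next_(0), bad_(false), ch_(0) {}

    bool format_error() const { return bad_; }

protected:
    int_type underflow()
    {
        if (bad_ || next_ == bytes_.size())
            return traits_type::eof();
        if (static_cast<unsigned char>(bytes_[next_]) == 0xFF) {
            bad_ = true;                   // remember why input stopped
            return traits_type::eof();     // the only signal streambuf offers
        }
        ch_ = bytes_[next_++];
        setg(&ch_, &ch_, &ch_ + 1);
        return traits_type::to_int_type(ch_);
    }

private:
    std::string bytes_;
    std::string::size_type next_;
    bool bad_;
    char ch_;
};

// Recovers the extended interface through rdbuf() and dynamic_cast.
bool stopped_on_format_error(std::istream& in)
{
    const DecodeStatus* status = dynamic_cast<const DecodeStatus*>(in.rdbuf());
    return status != 0 && status->format_error();
}

// Reads until the stream reports end of input.
std::string read_all(std::istream& in)
{
    std::string out;
    char c;
    while (in.get(c))
        out += c;
    return out;
}
```

After a read loop ends, the caller can ask stopped_on_format_error(in) to tell a genuine end of file from a decoding failure. For streams whose buffer does not implement the interface, the function simply returns false.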

--
James Kanze GABI Software mailto:kanze (AT) gabi-soft (DOT) fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

P.J. Plauger
Posted: Tue Apr 20, 2004 7:27 pm    Post subject: Re: seeking in a stream with nontrivial codecvt facet

<kanze (AT) gabi-soft (DOT) fr> wrote


Quote:
It's pretty ugly handling the case when a codecvt facet gets out of
sync with the data stream. All you can really do is report the
situation the same as a read error.

I'm just wondering, but shouldn't this be considered a defect of some
sort? It certainly looks like a design error to me -- traditionally,
actual read errors show up as EOF (which could also be considered a
design error, IMHO), but errors in the format of what was read show up
as fail() && ! eof().

I agree that it's a weak specification. I didn't design it -- I'm just an
implementor.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com

