Char literals inside string literals

 
Carlos Moreno
Posted: Thu Dec 21, 2006 11:54 pm    Post subject: Char literals inside string literals



[[[ Related to comp.os.linux.development.apps recent post ]]]

I know that I'm perhaps being excessive lately with my recurrent theme
of criticizing the current standard.... But:

Am I the only one profoundly disturbed by this paragraph from section
2.13.2 of the standard?

"The escape \ooo consists of the backslash followed by one, two, or
three octal digits [...]. The escape \xhhh consists of the backslash
followed by x followed by one or more hexadecimal digits that are
taken to specify the value of the desired character. There is no
limit to the number of digits in a hexadecimal sequence. A sequence
of octal or hexadecimal digits is terminated by the first character
that is not an octal digit or a hexadecimal digit, respectively."


Recently, I was badly bitten by this, when I needed to embed a special
character (in LATIN1 encoding) in a string containing Spanish text.

The character was \xED, and it was in a word that continued with f ...
So, the literal character was \xEDf. If the word had continued with
a letter s, for instance, then it would have been ok.

I was speechless when discovering that it was not a bug in the
compiler (after someone pointed out what was happening, which would
have *never* occurred to me!) --- I understand the need to allow for
multi-byte sequences, but the implementation, from my point of view,
is severely broken; even though there's an obvious workaround (insert
two double-quote characters in a row at the point where the sequence
finishes --- as in "...\xED""f..."), this workaround is, at best, a
horrible hack to get around an *ambiguity* in the definition of hex
escaped characters.
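
A minimal sketch of the workaround described above, assuming a C++11 compiler. The sample word and the names kWord and byte_at are illustrative and not from the post. It only inspects byte values and makes no claim about how they are displayed.

```cpp
#include <cassert>
#include <cstddef>

// Hex escapes are greedy: written as one literal, the 'f' that follows \xED
// would be read as a third hex digit (value 0xEDF) instead of a letter.
// Closing the literal right after the escape ends the digit scan, and the
// compiler then concatenates the adjacent literals into one string.
constexpr char kWord[] = "cient" "\xED" "fico";  // Latin-1 bytes, 10 characters

// sizeof counts the terminating '\0' as well.
static_assert(sizeof(kWord) == 11, "10 characters plus the terminator");

// Returns the byte at index i as an unsigned value, whatever the signedness of char.
constexpr unsigned byte_at(std::size_t i) {
    return static_cast<unsigned char>(kWord[i]);
}
```

Here byte_at(5) is the 0xED byte and byte_at(6) is the letter f, so the split literal keeps the two apart.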

I can't think of any conceivable point of view from which the
definition can be considered correct --- and it's not that there is
no unambiguous solution so the "lesser evil" had to be chosen; I mean,
longer sequences could be obtained as several hex-escaped sequences,
if they were defined as exactly two --- if I really needed EDF, I
could write \xED\x0F without the ambiguity of depending on the
character that happens to follow after the special, hex-escaped one).
Or, if the idea is really to make it easy to put multi-digit hex
sequences, it could have been something like \x{EDF} --- the curly
braces completely disambiguate and fix the problem; and the rule
could be extremely simple; if the character after the x in the \x
is an opening curly brace, then the hex-sequence is finished by the
(required) closing curly-brace; otherwise, it is a two-digit (exactly
two digits) --- I could encode a single-digit sequence as either
\x0F or as \x{F}.
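
The proposed rule is small enough to sketch as a decoder. This is only an illustration of the rule as worded above, not of anything the standard quoted in this post specifies. HexEscape, hex_value and parse_proposed_hex are hypothetical names, and the function expects the text that follows the "\x".

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Outcome of decoding one escape body: success flag, decoded value, and the
// index of the first character after the escape.
struct HexEscape {
    bool ok;
    unsigned long value;
    std::size_t next;
};

// Value of a single hex digit, or -1 if c is not a hex digit.
inline int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Decodes what follows "\x", starting at s[pos], under the PROPOSED rule
// (this is not what the standard specifies):
//   '{' digits '}'  : one or more digits, ended by the closing brace
//   anything else   : exactly two digits
// Overflow of very long digit runs is not checked in this sketch.
inline HexEscape parse_proposed_hex(const std::string& s, std::size_t pos) {
    const HexEscape failure = {false, 0, pos};
    if (pos < s.size() && s[pos] == '{') {
        unsigned long value = 0;
        std::size_t i = pos + 1;
        while (i < s.size() && hex_value(s[i]) >= 0) {
            value = value * 16 + static_cast<unsigned long>(hex_value(s[i]));
            ++i;
        }
        const bool has_digits = i > pos + 1;
        if (!has_digits || i >= s.size() || s[i] != '}') return failure;
        const HexEscape result = {true, value, i + 1};
        return result;
    }
    if (pos + 1 < s.size() && hex_value(s[pos]) >= 0 && hex_value(s[pos + 1]) >= 0) {
        const unsigned long value =
            static_cast<unsigned long>(hex_value(s[pos]) * 16 + hex_value(s[pos + 1]));
        const HexEscape result = {true, value, pos + 2};
        return result;
    }
    return failure;
}
```

With this rule, "EDf" decodes to 0xED and leaves the f alone, while "{EDF}" decodes to 0xEDF.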

The octal sequences are also, IMHO, ambiguously specified, but at
least we can get around the problem by an adopted habit of *always*
coding three-digit sequences.
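
A small sketch of why the three-digit habit helps, assuming a C++11 compiler. The names kPadded and kShort are illustrative. The goal is the value 7 followed by the digit character 1.

```cpp
#include <cassert>

// An octal escape takes at most three digits, so a fourth digit is ordinary text.
constexpr char kPadded[] = "\0071";  // \007 (value 7) followed by the character '1'
constexpr char kShort[]  = "\71";    // a single character with octal value 71 (decimal 57)

static_assert(sizeof(kPadded) == 3, "two characters plus the terminator");
static_assert(kPadded[0] == 7 && kPadded[1] == '1', "value 7, then the digit 1");
static_assert(sizeof(kShort) == 2, "one character plus the terminator");
static_assert(kShort[0] == 071, "octal 71");
```

The short form silently swallows the following digit. The padded form cannot, because the escape has already used its three digits.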

Anyway, I know I'm just ranting/rabbling, and possibly just annoying
the people in charge of the language ... But, can you honestly blame
me after reading section 2.13.2 of the standard? (and yes, I know
that most likely it works this way because it was inherited
directly from C --- but still: rabble-rabble-rabble!!)

Comments, anyone?

Carlos
--

--
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Greg Herlihy
Posted: Fri Dec 22, 2006 10:10 am    Post subject: Re: Char literals inside string literals



Carlos Moreno wrote:
Quote:
[[[ Related to comp.os.linux.development.apps recent post ]]]

I know that I'm perhaps being excessive lately with my recurrent theme
of criticizing the current standard.... But:

Am I the only one profoundly disturbed by this paragraph from section
2.13.2 of the standard?

"The escape \ooo consists of the backslash followed by one, two, or
three octal digits [...]. The escape \xhhh consists of the backslash
followed by x followed by one or more hexadecimal digits that are
taken to specify the value of the desired character. There is no
limit to the number of digits in a hexadecimal sequence. A sequence
of octal or hexadecimal digits is terminated by the first character
that is not an octal digit or a hexadecimal digit, respectively."


Recently, I was badly bitten by this, when I needed to embed a special
character (in LATIN1 encoding) in a string containing Spanish text.

The character was \xED, and it was in a word that continued with f ...
So, the literal character was \xEDf. If the word had continued with
a letter s, for instance, then it would have been ok.

Character literals are meant to be used as character values for
character variables - and were not particularly intended for use within
string literals. To encode an extended character within a string
literal, use a universal character name. For example, instead of the
messy "cient""\xed""fico", prefer "cient\u00edfico".
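
A minimal sketch of the universal-character-name form. It uses a C++11 char16_t literal so that the stored value is the same on every platform. That choice and the name kWord16 are assumptions made for this illustration.

```cpp
#include <cassert>

// \u is always followed by exactly four hex digits, so the 'f' after \u00ed
// is an ordinary letter and no literal splitting is needed.
constexpr char16_t kWord16[] = u"cient\u00edfico";

static_assert(sizeof(kWord16) / sizeof(kWord16[0]) == 11, "10 code units plus the terminator");
static_assert(kWord16[5] == 0x00ED, "U+00ED LATIN SMALL LETTER I WITH ACUTE");
static_assert(kWord16[6] == u'f', "the letter after the escape is kept");
```

In a narrow string literal the bytes produced for \u00ed depend on the execution character set. With a UTF-8 execution character set, for instance, it becomes two bytes rather than the single Latin-1 byte 0xED, so check the compiler's setting if a specific byte value is required.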

Greg


James Kanze
Posted: Fri Dec 22, 2006 8:38 pm    Post subject: Re: Char literals inside string literals



Carlos Moreno wrote:
Quote:
Am I the only one profoundly disturbed by this paragraph from section
2.13.2 of the standard?

"The escape \ooo consists of the backslash followed by one, two, or
three octal digits [...]. The escape \xhhh consists of the backslash
followed by x followed by one or more hexadecimal digits that are
taken to specify the value of the desired character. There is no
limit to the number of digits in a hexadecimal sequence. A sequence
of octal or hexadecimal digits is terminated by the first character
that is not an octal digit or a hexadecimal digit, respectively."

It's not a nice situation, but history pretty much doesn't leave
a choice. In this case, C++ is just doing what C does, and
standard C did it like that because that was the existing
practice when C was standardized.

Note that it's not really obvious what the correct behavior
should be. You can't say that \x is systematically followed by
two digits, because that wouldn't work on a machine with nine
bit bytes. And you don't want to require more than two (how
many?---there are machines with 32 bit char's) either, because
that would seriously bother people on machines with eight bit
bytes.
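
To make the arithmetic concrete, here is a small sketch that computes how many hex digits a char of a given width needs. The names hex_digits_for_bits and kHexDigitsPerChar are illustrative, and C++11 is assumed.

```cpp
#include <cassert>
#include <climits>

// Number of hex digits needed to write any value of a char that is `bits` wide.
constexpr int hex_digits_for_bits(int bits) {
    return (bits + 3) / 4;  // round up to whole 4-bit digits
}

static_assert(hex_digits_for_bits(8) == 2, "8-bit char: two digits");
static_assert(hex_digits_for_bits(9) == 3, "9-bit char: three digits");
static_assert(hex_digits_for_bits(32) == 8, "32-bit char: eight digits");

// Digits needed on the platform this is compiled for.
constexpr int kHexDigitsPerChar = hex_digits_for_bits(CHAR_BIT);
```

A fixed count of two is enough only when CHAR_BIT is 8 or less, which the language does not guarantee.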

Quote:
Recently, I was badly bitten by this, when I needed to embed a special
character (in LATIN1 encoding) in a string containing Spanish text.

The character was \xED, and it was in a word that continued with f ...
So, the literal character was \xEDf. If the word had continued with
a letter s, for instance, then it would have been ok.

C++ added a new feature exactly for this. For non-ASCII
characters, you should be using a universal character name (\uXXXX),
with exactly four hexadecimal digits, always.

In C90, of course, you didn't have such, and the standard
practice was to always stop a string literal after an octal or a
hex escape.

Quote:
I was speechless when discovering that it was not a bug in the
compiler (after someone pointed out what was happening, which would
have *never* occurred to me!) --- I understand the need to allow for
multi-byte sequences, but the implementation, from my point of view,
is severely broken; even though there's an obvious workaround (insert
two double-quote characters in a row at the point where the sequence
finishes --- as in "...\xED""f..."), this workaround is, at best, a
horrible hack to get around an *ambiguity* in the definition of hex
escaped characters.

But you can't limit the language to only two hexadecimal
characters. After all, some machines DO have nine bit bytes,
even today.

Quote:
I can't think of any conceivable point of view from which the
definition can be considered correct --- and it's not that there is
no unambiguous solution so the "lesser evil" had to be chosen; I mean,
longer sequences could be obtained as several hex-escaped sequences,
if they were defined as exactly two --- if I really needed EDF, I
could write \xED\x0F without the ambiguity of depending on the
character that happens to follow after the special, hex-escaped one).

Except that each escape sequence is a separate character. If I
want a single character with the value 0x123 (possible on e.g.
a Unisys 2200), how do I write it?
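
The same point can be seen on common hardware with wide characters, used here as a stand-in for a machine with wide bytes. This sketch assumes C++11 and a wchar_t of at least 16 bits. The names kOne and kTwo are illustrative.

```cpp
#include <cassert>

// One escape, one character: three hex digits form a single wide character.
constexpr wchar_t kOne[] = L"\x123";
// Two escapes, two characters: this is a different string.
constexpr wchar_t kTwo[] = L"\x01\x23";

static_assert(sizeof(kOne) / sizeof(kOne[0]) == 2, "one character plus the terminator");
static_assert(kOne[0] == 0x123, "a single character with value 0x123");
static_assert(sizeof(kTwo) / sizeof(kTwo[0]) == 3, "two characters plus the terminator");
static_assert(kTwo[0] == 0x01 && kTwo[1] == 0x23, "two separate characters");
```

Splitting the value across two fixed-width escapes yields two characters, so it cannot replace a single longer escape.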

Quote:
Or, if the idea is really to make it easy to put multi-digit hex
sequences, it could have been something like \x{EDF} --- the curly
braces completely disambiguate and fix the problem;

That is probably what should have been done to begin with. It
wasn't, however, and by the time the C committee got around to
standardizing, they doubtlessly felt that it would break too
much code to change this.

Quote:
and the rule
could be extremely simple; if the character after the x in the \x
is an opening curly brace, then the hex-sequence is finished by the
(required) closing curly-brace; otherwise, it is a two-digit (exactly
two digits) --- I could encode a single-digit sequence as either
\x0F or as \x{F}.

The "exactly two digits" simply won't fly. You're assuming that
bytes (char's) are exactly 8 bits, and this just isn't the case.

Quote:
The octal sequences are also, IMHO, ambiguously specified, but at
least we can get around the problem by an adopted habit of *always*
coding three-digit sequences.

I rather suspect that the committee would have liked to allow
more, too, but were worried about breaking existing code. (I
suspect that octal is max. three digits because Ritchie never had
to contend with a machine with bytes of more than 9 bits.
Although I sort of suspect that the CDC mainframes of that
period would have used 10 bit bytes.)

Quote:
Anyway, I know I'm just ranting/rabbling, and possibly just annoying
the people in charge of the language ... But, can you honestly blame
me after reading section 2.13.2 of the standard? (and yes, I know
that most likely it works this way because it was inherited
directly from C --- but still: rabble-rabble-rabble!!)

Comments, anyone?

It's all history. And the C++ committee's solution (for your
problem, at least) was universal character names.

--
James Kanze (Gabi Software) email: james.kanze (AT) gmail (DOT) com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


Carlos Moreno
Posted: Sun Dec 24, 2006 1:15 am    Post subject: Re: Char literals inside string literals

James Kanze wrote:

Quote:
[...]
Comments, anyone?

It's all history. And the C++ committee's solution (for your
problem, at least) was universal character names.

Thanks for pointing this out --- I didn't know C++ featured this;
I had seen it in other languages, but thought it had been adopted
mostly in web-related languages due to the stronger (apparently,
at least) pressure to produce "international-clean" applications.

Thanks!

Carlos
All times are GMT