Defect report: handling of extended source characters in str

 
Martin Vejnár
Posted: Mon May 08, 2006 5:22 pm    Post subject: Defect report: handling of extended source characters in str



[ Note: Forwarded to C++ Committee. -sdc ]

Consider the following code:

#include <iostream>
int main()
{
    std::cout << "\\u00e1" << std::endl;

    // Following line contains Unicode character
    // "latin small letter a with acute" (U+00E1)
    std::cout << "\á" << std::endl;
}

The first statement in main outputs characters "u00e1" preceded by a
backslash.

The Standard says:
[2.1 - Phases of translation, paragraph 1.1]
Physical source file characters are mapped, in an
implementation-defined manner, to the basic source character set
(introducing new-line characters for end-of-line indicators) if
necessary. Trigraph sequences (2.3) are replaced by corresponding
single-character internal representations. Any source file character not
in the basic source character set (2.2) is replaced by the
universal-character-name that designates that character. (An
implementation may use any internal encoding, so long as an actual
extended character encountered in the source file, and the same extended
character expressed in the source file as a universal-character-name
(i.e. using the \uXXXX notation), are handled equivalently.)

During this translation phase, the foreign character in the second
statement is replaced by a universal-character-name. The resulting
statement resembles the first and outputs one of the following:

\u00e1
\u00E1
\U000000e1
\U000000E1
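
To make the replacement step concrete, here is a minimal sketch that models the literal wording of phase 1 as a function over code points. The names in_basic_source_set and phase1_replace are hypothetical and not taken from any implementation, and the lowercase \uXXXX spelling is only one of the four spellings listed above.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: true if c is one of the 96 characters of the basic
// source character set described in [2.2/1].
bool in_basic_source_set(char32_t c) {
    static const std::u32string basic =
        U" \t\v\f\n"
        U"abcdefghijklmnopqrstuvwxyz"
        U"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        U"0123456789"
        U"_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";
    return basic.find(c) != std::u32string::npos;
}

// Hypothetical model of the literal wording of phase 1: every character
// outside the basic set is replaced by the spelling of a
// universal-character-name. Trigraphs and end-of-line mapping are ignored.
std::string phase1_replace(const std::u32string& source) {
    std::string out;
    for (char32_t c : source) {
        if (in_basic_source_set(c)) {
            out += static_cast<char>(c);
        } else {
            char buf[11];  // "\U" + 8 hex digits + terminating null
            if (c <= 0xFFFF)
                std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(c));
            else
                std::snprintf(buf, sizeof buf, "\\U%08x", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}
```

Applied to the second string literal (a quote, a backslash, U+00E1, a quote), the function returns a quote, two backslashes, u00e1 and a quote, which is character for character the first literal. The first literal itself consists only of basic characters and passes through unchanged.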

C99 (at least in the draft I have available) avoids this problem by not
introducing any universal character names and not restricting the
(basic) source character set to 96 characters as C++ does.

--
Martin Vejnár


[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.comeaucomputing.com/csc/faq.html ]
James Kanze
Posted: Mon May 08, 2006 11:21 pm    Post subject: Re: Defect report: handling of extended source characters in



Martin Vejnár wrote:

Quote:
[ Note: Forwarded to C++ Committee. -sdc ]

Consider the following code:

#include <iostream>
int main()
{
    std::cout << "\\u00e1" << std::endl;

    // Following line contains Unicode character
    // "latin small letter a with acute" (U+00E1)
    std::cout << "\á" << std::endl;
}

The first statement in main outputs characters "u00e1"
preceded by a backslash.

Which is perfectly legal, as the program has undefined
behavior according to §2.13.2/3: "If the character following a
backslash is not one of those specified, the behavior is
undefined." As you point out, in this case, the \u00e1 is a
single character at all points beyond translation phase 1.

The correct way to output the sequence u00e1, preceded by a
backslash, is:

std::cout << "\\" "u00e1" << std::endl ;
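
A minimal sketch of why the split literal sidesteps the question: adjacent string literals are only concatenated in translation phase 6, after escape sequences have been converted in phase 5, so the backslash and the "u00e1" never sit next to each other while universal-character-names are being recognized. The function name is hypothetical.

```cpp
#include <string>

// Hypothetical helper returning the six characters \ u 0 0 e 1.
// "\\" is a complete literal holding one backslash; "u00e1" is a second
// literal of five ordinary characters. They are joined in phase 6.
std::string backslash_then_u00e1() {
    return "\\" "u00e1";
}
```

The result has length 6 and starts with a single backslash under either reading of phase 1.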

Note that this is similar to the way a backslash preceding a
newline is handled. In both cases, the backslash is removed in
a very early phase, before any of the usual escape sequences in
a string or a character constant are considered, and thus,
before it can be escaped itself. Consider for example:

std::cout << "\\
a" ;

This is a perfectly legal piece of code -- a somewhat obfuscated
way of outputting an audible signal. It is NOT an illegal
string constant with a newline in it.
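
The splicing step can be modeled in a few lines. This is a simplified sketch of phase 2 only, with a hypothetical function name; it ignores trigraphs and every other phase.

```cpp
#include <cstddef>
#include <string>

// Hypothetical model of translation phase 2: delete every backslash that is
// immediately followed by a new-line, together with that new-line.
std::string splice_lines(const std::string& source) {
    std::string out;
    for (std::size_t i = 0; i < source.size(); ++i) {
        if (source[i] == '\\' && i + 1 < source.size() && source[i + 1] == '\n') {
            ++i;  // skip the new-line as well
            continue;
        }
        out += source[i];
    }
    return out;
}
```

Fed the source text of the example (quote, backslash, backslash, new-line, a, quote), it returns quote, backslash, a, quote. Later phases then read the remaining backslash and the a as the alert escape sequence.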

Quote:
The Standard says:
[2.1 - Phases of translation, paragraph 1.1]
Physical source file characters are mapped, in an
implementation-defined manner, to the basic source character set
(introducing new-line characters for end-of-line indicators) if
necessary. Trigraph sequences (2.3) are replaced by corresponding
single-character internal representations. Any source file character not
in the basic source character set (2.2) is replaced by the
universal-character-name that designates that character. (An
implementation may use any internal encoding, so long as an actual
extended character encountered in the source file, and the same extended
character expressed in the source file as a universal-character-name
(i.e. using the \uXXXX notation), are handled equivalently.)

That's an interesting formulation. Does it mean that all later
phases must see "\u00E1", even if e.g. the implementation uses
UTF-32 internally. In which case, the behavior is well defined,
and must be that which you see. My interpretation of §2.2/2
("The universal-character-name construct provides a way to name
other characters.") is that this is not the intent; that \u00E1
is a single character, and must be treated as such.

Quote:
During this translation phase, the foreign character in the
second statement is replaced by a universal-character-name.
Such statement resembles the first and outputs one of the
following:

\u00e1
\u00E1
\U000000e1
\U000000E1

Or anything else -- you have undefined behavior.

What you don't have is an escape sequence "\\...". The "\u00e1"
doesn't exist beyond the first phase of translation, and the
first \ is followed by a character that "is not one of those
specified". As a quality of implementation issue, I would
expect an error from the compiler -- this is an undefined
behavior which the compiler can easily detect. (Except, of
course, in the unlikely event that the implementation has
defined this as an additional escape sequence.)

Of course, if the intent is for the universal character name to
behave as a sequence of 6 (or 10) characters in the later
translation phases -- and the description of phase one of the
translation can easily be interpreted this way -- then if 'á' is
understood by the implementation as being the same character as
\u00E1 (which would be the case if e.g. the implementation
accepted ISO 8859-1 as its input encoding), then it would have
to output one of the variants you indicate. My own opinion is
that the text in §2.2 makes it clear that this is not the
intent; that \u00E1 should be treated as a single character, a
latin small letter a with acute.

Quote:
C99 (at least in the draft I have available) avoids this
problem by not introducing any universal character names and
not restricting the (basic) source character set to 96
characters as C++ does.

The final C99 does contain universal character names, in almost
exactly the same language as the C++ standard.

--
James Kanze kanze.james (AT) neuf (DOT) fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France +33 (0)1 30 23 00 34

Martin Vejnár
Posted: Tue May 09, 2006 3:21 pm    Post subject: Re: Defect report: handling of extended source characters in



James Kanze wrote:
Quote:
Martin Vejnár wrote:
Consider the following code:

#include <iostream>
int main()
{
    std::cout << "\\u00e1" << std::endl;

    // Following line contains Unicode character
    // "latin small letter a with acute" (U+00E1)
    std::cout << "\á" << std::endl;
}

The first statement in main outputs characters "u00e1"
preceded by a backslash.

Which is perfectly legal, as the program has undefined
behavior according to §2.13.2/3: "If the character following a
backslash is not one of those specified, the behavior is
undefined." As you point out, in this case, the \u00e1 is a
single character at all points beyond translation phase 1.

On the contrary, I was trying to point out that after phase 1 is
complete, the *second* statement no longer contains letter 'á'. Instead,
that character is replaced by a universal character name. So, after
phase 1 is complete, the code looks like this:

#include <iostream>
int main()
{
    std::cout << "\\u00e1" << std::endl;

    // Following line contains Unicode character
    // "latin small letter a with acute" (U+00E1)
    std::cout << "\\u00e1" << std::endl;
}

I understand that what you're saying is probably the original intent of
[2.1/1.1]. However, the current wording of the paragraph in question says
something different.

Quote:
The correct way to output the sequence u00e1, preceded by a
backslash, is:

std::cout << "\\" "u00e1" << std::endl ;

The grammar given for string-literal in [2.13.4] is pretty unambiguous
about this. Just as "\\n" isn't a backslash followed by a new-line,
"\\u00e1" isn't a backslash followed by a universal-character-name.

Note that when tokenization begins, there cannot be a "\á" in the source,
since after phase 1, all characters are from the basic source character
set as defined in [2.2/1].

Quote:
Note that this is similar to the way a backslash preceding a
newline is handled. In both cases, the backslash is removed in
a very early phase, before any of the usual escape sequences in
a string or a character constant are considered, and thus,
before it can be escaped itself. Consider for example:

std::cout << "\\
a" ;

This is a perfectly legal piece of code -- a somewhat obfuscated
way of outputting an audible signal. It is NOT an illegal
string constant with a newline in it.

The line splicing has nothing to do with phase 1. Phase 2 (line
splicing) removes backslash and newline pairs, while phase 1 replaces
foreign characters (by which I mean characters outside the basic source
character set) with their respective universal character names.

Quote:
The Standard says:
[2.1 - Phases of translation, paragraph 1.1]
Physical source file characters are mapped, in an
implementation-defined manner, to the basic source character set
(introducing new-line characters for end-of-line indicators) if
necessary. Trigraph sequences (2.3) are replaced by corresponding
single-character internal representations. Any source file character not
in the basic source character set (2.2) is replaced by the
universal-character-name that designates that character. (An
implementation may use any internal encoding, so long as an actual
extended character encountered in the source file, and the same extended
character expressed in the source file as a universal-character-name
(i.e. using the \uXXXX notation), are handled equivalently.)

That's an interesting formulation. Does it mean that all later
phases must see "\u00E1", even if e.g. the implementation uses
UTF-32 internally. In which case, the behavior is well defined,
and must be that which you see. My interpretation of §2.2/2
("The universal-character-name construct provides a way to name
other characters.") is that this is not the intent; that \u00E1
is a single character, and must be treated as such.

If the second sentence is a question, then yes, I believe that the
Standard says (but shouldn't say) so.

Even if an implementation used UTF-32 as an internal representation, the
"as-if" rule applies. Although I guess that the intent is to allow
implementations to use whatever internal encoding they want, the wording of
[2.1/1.1] effectively *prohibits* converting universal character names
to their respective characters: doing so would change the meaning of
the second statement and introduce undefined behavior.

Quote:
[snip]

Of course, if the intent is for the universal character name to
behave as a sequence of 6 (or 10) characters in the later
translation phases -- and the description of phase one of the
translation can easily be interpreted this way -- then if 'á' is
understood by the implementation as being the same character as
\u00E1 (which would be the case if e.g. the implementation
accepted ISO 8859-1 as its input encoding), then it would have
to output one of the variants you indicate. My own opinion is
that the text in §2.2 makes it clear that this is not the
intent; that \u00E1 should be treated as a single character, a
latin small letter a with acute.

The intent probably does not correspond to the wording of [2.1/1.1].
That's a good enough reason for a defect report, isn't it?

Quote:
C99 (at least in the draft I have available) avoids this
problem by not introducing any universal character names and
not restricting the (basic) source character set to 96
characters as C++ does.

The final C99 does contain universal character names, in almost
exactly the same language as the C++ standard.

I cannot argue about that, since I don't have the final C99 available.
However, the draft says

[C99 draft: 5.1.1.2/1.1]
Physical source file multibyte characters are mapped to the source
character set (introducing new-line characters for end-of-line
indicators) if necessary. Trigraph sequences are replaced by
corresponding single-character internal representations.

Note the difference between "basic source character set" in C++ and
"source character set" in C99. Also note that no conversion of either
foreign characters or universal character names occurs. Both foreign
characters and universal character names are then subject to the
grammar, which makes

printf("\\u00e1");

output "u00e1" preceded by a backslash and

printf("\á");

introduce undefined behavior. That, I believe, is a correct and
intuitive approach and a possible solution to the problem.

--
Martin

Greg Herlihy
Posted: Tue May 09, 2006 4:21 pm    Post subject: Re: Defect report: handling of extended source characters in

James Kanze wrote:
Quote:
Martin Vejnár wrote:

[ Note: Forwarded to C++ Committee. -sdc ]

Consider the following code:

#include <iostream>
int main()
{
    std::cout << "\\u00e1" << std::endl;

    // Following line contains Unicode character
    // "latin small letter a with acute" (U+00E1)
    std::cout << "\á" << std::endl;
}

The first statement in main outputs characters "u00e1"
preceded by a backslash.

Which is perfectly legal, as the program has undefined
behavior according to §2.13.2/3: "If the character following a
backslash is not one of those specified, the behavior is
undefined." As you point out, in this case, the \u00e1 is a
single character at all points beyond translation phase 1.

The Standard states that in phase 1 of source file translation:

"Any source file character not in the basic source character set is
replaced by the universal-character-name that designates that
character."

Since the characters '\', 'u', '0', 'e', '1' are all in the basic
source character set, they are not replaced in phase 1 - a point later
reiterated:

"Note: in translation phase 1, a universal-character-name is introduced
whenever an actual extended character is encountered in the source
text." §2.13.2/5.

Clearly á and not \u00e1 is the "actual extended character" so
processing the string literal in the first line must wait until phase 5
when:

"Each source character set member, escape sequence, or
universal-character-name in character literals and string literals is
converted to the corresponding member of the execution character set"

At this stage the compiler translates the entire string literal,
\\u00e1. And since the two backslashes, \\, form a valid escape
sequence (for the backslash character itself), the string translates to
\u00e1. So the program output for the first statement is both correct
and well-defined by the Standard.

Now the second line is more interesting. The character á is clearly
not a character in the basic source character set, so unlike the first
line, this character is replaced by \u00e1 in phase 1. And now the
behavior of the program does become undefined since \á is not a valid
escape sequence. And in fact, gcc reports the invalid escape sequence
as an error. Not surprisingly, gcc accepts the first line as legal and
outputs the string "\u00e1" as expected.
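
The phase 5 step for the two literals can be sketched with a small decoder. It is an illustration only: decode_simple_escapes is a hypothetical name, it handles just the simple escape sequences of Table 5 and reports everything else (including octal, hexadecimal and \u escapes, which it does not model) as a failure, and it assumes á reaches it as the single ISO 8859-1 byte 0xE1.

```cpp
#include <cstddef>
#include <string>

// Hypothetical decoder for the body of a string literal (without the quotes).
// Returns false when a backslash is followed by a character that is not one
// of the simple escape sequences, the case the Standard leaves undefined.
bool decode_simple_escapes(const std::string& body, std::string& out) {
    out.clear();
    for (std::size_t i = 0; i < body.size(); ++i) {
        if (body[i] != '\\') {
            out += body[i];
            continue;
        }
        if (++i == body.size()) return false;  // lone trailing backslash
        switch (body[i]) {
            case 'n':  out += '\n'; break;
            case 't':  out += '\t'; break;
            case 'v':  out += '\v'; break;
            case 'b':  out += '\b'; break;
            case 'r':  out += '\r'; break;
            case 'f':  out += '\f'; break;
            case 'a':  out += '\a'; break;
            case '\\': out += '\\'; break;
            case '?':  out += '?';  break;
            case '\'': out += '\''; break;
            case '"':  out += '"';  break;
            default:   return false;  // not a simple escape sequence
        }
    }
    return true;
}
```

A body of two backslashes followed by u00e1 decodes to one backslash followed by u00e1. A body of one backslash followed by the byte 0xE1 is rejected. Which of these two inputs the second statement actually presents to this step is exactly what the thread is disputing.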

Greg

