Unicode plane detection

 
Roger
Posted: Mon Oct 13, 2003 8:47 pm    Post subject: Unicode plane detection



I have a routine that is supposed to detect plane one unicode
characters (for a publisher - they want to detect them) embedded in an
xml file.
So far I've assumed UTF-8 encoding, so I am checking for the first
byte in a utf-8 multi-byte sequence. The code is as follows:
code starts >>
unsigned char inpx=0;
while ( (inp = getc(fd)) != EOF ) {
    const int PLANE_ONE_UNICODE = 0x01;
    // first byte of unicode sequence
    inpx=inp;
    if ( inpx >= 0xc0 && inpx <= 0xfd) {
        int nb=0;
        // find out how long the whole sequence is
        if (inp & 0x80) {
            nb++;
        }
        if (inp & 0x40) {
            nb++;
        }
        if (inp & 0x20) {
            nb++;
        }
        if (inp & 0x10) {
            nb++;
        }
        printf("%d byte unicode is found %X\n",nb,inpx);
        if (inp & PLANE_ONE_UNICODE ) {
            printf("plane 1 char found\n");
        }
    }
}
code Ends

This routine does not take account of surrogate pairs of utf-16
characters/bytes. Does anyone know of a C++ library that'll enable me
to process utf-8/16 encoded unicode? or can they offer some advice.
This code has to work on unix and W2k so I dare say there are byte
order issues to deal with too. The client has some Python code which
seems to have some unicode routines built in (e.g. ord()) - I've not
been able to find any equivalent routines for C++.
Anyone out there help me please? Virtual pint to those who can.

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Ben Hutchings
Posted: Thu Oct 16, 2003 2:32 pm    Post subject: Re: Unicode plane detection



In article <dd7a565c.0310130435.67fab35b (AT) posting (DOT) google.com>,
Roger wrote:
Quote:
I have a routine that is supposed to detect plane one unicode
characters (for a publisher - they want to detect them) embedded in an
xml file.
So far I've assumed UTF-8 encoding, so I am checking for the first
byte in a utf-8 multi-byte sequence. The code is as follows:
code starts
unsigned char inpx=0;
while ( (inp = getc(fd)) != EOF ) {

Presumably inp is declared with type int?

Quote:
const int PLANE_ONE_UNICODE = 0x01;
// first byte of unicode sequence
inpx=inp;
if ( inpx >= 0xc0 && inpx <= 0xfd) {
int nb=0;
// find out how long the whole sequence is
if (inp & 0x80) {
nb++;
}

This test is pointless, given the condition of the enclosing if
statement.

Quote:
if (inp & 0x40) {
nb++;
}
if (inp & 0x20) {
nb++;
}
if (inp & 0x10) {
nb++;
}

These if statements must be nested, because you must count only
contiguous 1-bits.
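
A minimal sketch of counting only the contiguous leading 1-bits, for illustration. The function name utf8SequenceLength is invented here, and the sketch assumes sequences of at most four bytes; it does not reject the overlong lead bytes 0xC0 and 0xC1 or the out-of-range leads 0xF5 to 0xF7.

```cpp
#include <cstdint>

// Length in bytes of a UTF-8 sequence, judged from its lead byte alone.
// Returns 0 for a byte that cannot start a sequence: a continuation byte
// (0x80..0xBF) or anything with five or more leading 1-bits.
int utf8SequenceLength(std::uint8_t lead)
{
    if ((lead & 0x80) == 0x00) {
        return 1;                       // 0xxxxxxx: plain ASCII
    }
    int ones = 0;
    // Walk down from the top bit and stop at the first 0-bit, so that
    // only the contiguous run of 1-bits is counted.
    for (std::uint8_t mask = 0x80; mask != 0 && (lead & mask) != 0; mask >>= 1) {
        ++ones;
    }
    return (ones >= 2 && ones <= 4) ? ones : 0;
}
```

With independent tests, a lead byte such as 0xD0 (binary 11010000) would be counted as three bytes because bit 0x10 is set; stopping at the first 0-bit gives the correct length of two.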

Quote:
printf("%d byte unicode is found %X\n",nb,inpx);
if (inp & PLANE_ONE_UNICODE ) {
printf("plane 1 char found\n");
}

This doesn't make sense; you're testing the first byte of the
sequence and not the code that the whole sequence represents.
You need to test that nb == 4 (if it is smaller, the code value
must be in plane 0), and then combine the first two bytes of the
sequence to find whether the code is in plane 1.
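
One way to combine the first two bytes could look like the sketch below. The name planeFromFirstTwoBytes is hypothetical. It relies on the standard UTF-8 layout, in which a four-byte sequence carries bits 20 to 18 of the code point in the lead byte and bits 17 to 12 in the second byte, so the plane number (code point divided by 0x10000) needs only those two bytes.

```cpp
#include <cstdint>

// Plane number encoded by a four-byte UTF-8 sequence, computed from its
// first two bytes. Returns -1 if b0 is not a four-byte lead (11110xxx)
// or b1 is not a continuation byte (10xxxxxx).
// Note: an overlong sequence such as F0 80 .. yields plane 0 here; a
// stricter decoder would reject it instead.
int planeFromFirstTwoBytes(std::uint8_t b0, std::uint8_t b1)
{
    if ((b0 & 0xF8) != 0xF0 || (b1 & 0xC0) != 0x80) {
        return -1;
    }
    // Three payload bits from b0, then the top two payload bits of b1.
    return ((b0 & 0x07) << 2) | ((b1 >> 4) & 0x03);
}
```

For example, U+1D11E is encoded as F0 9D 84 9E, and the first two bytes alone give plane 1.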

Quote:
}
}
code Ends

This routine does not take account of surrogate pairs of utf-16
characters/bytes.

The codes reserved for surrogates are illegal in UTF-8, but you
may want to detect them anyway in case the file is encoded
wrongly.
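
As a rough illustration of such a check: the surrogate range U+D800 to U+DFFF, if someone wrongly encodes it as UTF-8, always appears as a three-byte sequence starting with 0xED followed by a byte in 0xA0 to 0xBF. The helper name isEncodedSurrogate is made up for this sketch.

```cpp
#include <cstdint>

// True if the first two bytes of a three-byte UTF-8 sequence encode a
// code point in the surrogate range U+D800..U+DFFF, which well-formed
// UTF-8 never contains.
bool isEncodedSurrogate(std::uint8_t b0, std::uint8_t b1)
{
    return b0 == 0xED && b1 >= 0xA0 && b1 <= 0xBF;
}
```

U+D7FF (ED 9F BF) and U+E000 (EE 80 80), the neighbours on either side of the range, are not flagged.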

Quote:
Does anyone know of a C++ library that'll enable me
to process utf-8/16 encoded unicode? or can they offer some advice.
snip


Try IBM's open source International Components for Unicode
<http://oss.software.ibm.com/icu/>.

Rob Lugt
Posted: Fri Oct 17, 2003 9:45 am    Post subject: Re: Unicode plane detection



Hi Roger,

OpenTop [1] contains excellent support for Unicode. Here is a mini program
which uses OpenTop streams and will probably do what you're after. It
prints the character position (not byte position) of each Unicode
character above the Basic Multilingual Plane (BMP):

#include <ot/io/FileInputStream.h>
#include <ot/io/InputStreamReader.h>
#include <iostream>
using namespace std;
using namespace ot;
using namespace ot::io;

int main(int argc, char* argv[])
{
    try {
        RefPtr<InputStream> rpIS(new FileInputStream("input.txt"));
        RefPtr<Reader> rpRdr(new InputStreamReader(rpIS.get()));
        Character ch;
        long count=0;
        // readAtomic() returns one whole character per call
        while( (ch = rpRdr->readAtomic()) != Character::EndOfFileCharacter)
        {
            // anything above 0xFFFF lies outside the BMP
            if(ch.toUnicode() > 0xFFFF)
                cout << "Unicode character " << ch.toUnicode()
                     << " in character position " << count << endl;
            count++;
        }
    }
    catch(Exception& e) {
        cout << e.toString() << endl;
    }
    return 0;
}

Best regards
Rob Lugt

[1] http://www.elcel.com/products/opentop

"Roger"
Quote:
I have a routine that is supposed to detect plane one unicode
characters (for a publisher - they want to detect them) embedded in an
xml file.
So far I've assumed UTF-8 encoding, so I am checking for the first
byte in a utf-8 multi-byte sequence. The code is as follows:
code starts



kanze@gabi-soft.fr
Posted: Fri Oct 17, 2003 12:05 pm    Post subject: Re: Unicode plane detection

Ben Hutchings <do-not-spam-benh (AT) bwsint (DOT) com> wrote

Quote:
In article <dd7a565c.0310130435.67fab35b (AT) posting (DOT) google.com>,
Roger wrote:

I have a routine that is supposed to detect plane one unicode
characters (for a publisher - they want to detect them) embedded in
an xml file. So far I've assumed UTF-8 encoding, so I am checking
for the first byte in a utf-8 multi-byte sequence. The code is as
follows: code starts

unsigned char inpx=0;
while ( (inp = getc(fd)) != EOF ) {

Presumably inp is declared with type int?

That is the standard idiom.

Quote:
const int PLANE_ONE_UNICODE = 0x01;
// first byte of unicode sequence
inpx=inp;
if ( inpx >= 0xc0 && inpx <= 0xfd) {
int nb=0;
// find out how long the whole sequence is
if (inp & 0x80) {
nb++;
}

This test is pointless, given the condition of the enclosing if
statement.

The whole approach seems wrong to me. When I had to do the same thing, I
just created a table with 256 ints, and indexed into it. I then used
the number of bytes as an index into a second table with the mask for
the first byte. Roughly:

int ch = source.get() ;
while ( ch != EOF ) {
    int byteCount = lengthTable[ ch ] ;
    if ( byteCount <= 0 ) {
        throw IllegalCharacter() ;
    }
    -- byteCount ;
    uint32_t result = ch & maskTable[ byteCount ] ;
    while ( byteCount > 0 ) {
        ch = source.get() ;
        if ( ch == EOF ) {
            throw IncompleteCharacter() ;
        } else if ( (ch & 0xC0) != 0x80 ) {
            throw IllegalCharacter() ;
        }
        result = (result << 6) | (ch & 0x3F) ;
        -- byteCount ;          // one continuation byte consumed
    }
    store( result ) ;
    ch = source.get() ;         // lead byte of the next character
}

(The code was actually more complicated, because it was designed to
implement a codecvt -- and I had to store intermediate states in a state
type, so that partial reads could work.)
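
The two tables are not shown in the post. A plausible reconstruction, offered only as an assumption, is sketched below: lengthTable and maskTable keep the names used in the loop, makeLengthTable is a helper invented here, and the contents follow the UTF-8 bit layout restricted to sequences of at most four bytes (code points up to U+10FFFF).

```cpp
#include <array>

// Hypothetical contents for the length table: total sequence length for
// each possible lead byte, with 0 marking a byte that cannot start one.
std::array<int, 256> makeLengthTable()
{
    std::array<int, 256> table{};                      // zero-filled
    for (int b = 0x00; b <= 0x7F; ++b) table[b] = 1;   // ASCII
    for (int b = 0xC2; b <= 0xDF; ++b) table[b] = 2;   // 0xC0 and 0xC1 are overlong
    for (int b = 0xE0; b <= 0xEF; ++b) table[b] = 3;
    for (int b = 0xF0; b <= 0xF4; ++b) table[b] = 4;   // above 0xF4 exceeds U+10FFFF
    return table;
}

const std::array<int, 256> lengthTable = makeLengthTable();

// Mask that extracts the payload bits of the lead byte, indexed by the
// number of continuation bytes still to read (0..3). This matches the
// loop, which decrements byteCount before the lookup.
const int maskTable[4] = { 0x7F, 0x1F, 0x0F, 0x07 };
```

Continuation bytes (0x80 to 0xBF) map to 0, so a stray one in lead position triggers the IllegalCharacter branch of the loop.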

[...]

Quote:
Does anyone know of a C++ library that'll enable me
to process utf-8/16 encoded unicode? or can they offer some advice.
snip

Try IBM's open source International Components for Unicode
<http://oss.software.ibm.com/icu/>.

IBM's open source effort is to be applauded, but it apparently got
started before Unicode 3.2 -- it only supports the UCS-2 (or UTF-16, I'm
not sure), and not full Unicode.

I have a somewhat sketchy and experimental version in the Experimental
section at my site; some of the code might be usable as well.

--
James Kanze GABI Software mailto:kanze (AT) gabi-soft (DOT) fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16

Andy Heninger
Posted: Sat Oct 18, 2003 9:25 am    Post subject: Re: Unicode plane detection

kanze (AT) gabi-soft (DOT) fr wrote:

Quote:
Ben Hutchings <do-not-spam-benh (AT) bwsint (DOT) com> wrote in message

Try IBM's open source International Components for Unicode
<http://oss.software.ibm.com/icu/>.

IBM's open source effort is to be applauded, but it apparently got
started before Unicode 3.2 -- it only supports the UCS-2 (or UTF-16, I'm
not sure), and not full Unicode.


The ICU project tries very hard to stay completely up to date with the
Unicode standard. The latest, Unicode 4.0, was released in April of
this year; support for it in ICU came with version 2.6, released in June.

http://www.unicode.org/versions/Unicode4.0.0/

It's got a lot of characters,

Quote:
Graphic 96,248
Format 134
Control 65
Private Use 137,468
Surrogate 2,048
Noncharacter 66
Reserved 878,083


ICU uses UTF-16 internally, meaning that the non-plane-0 code points are
represented by a pair of 16 bit values.
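
For readers unfamiliar with the pairing, the arithmetic is sketched below. This is the standard UTF-16 surrogate mapping written out by hand, not ICU's own code, and the function names are invented for the example.

```cpp
#include <cstdint>
#include <utility>

// A high (lead) surrogate is 0xD800..0xDBFF, a low (trail) one 0xDC00..0xDFFF.
bool isHighSurrogate(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool isLowSurrogate(std::uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a valid surrogate pair into a code point in U+10000..U+10FFFF.
// Each surrogate contributes 10 bits; the caller must check validity first.
std::uint32_t combineSurrogates(std::uint16_t high, std::uint16_t low)
{
    return 0x10000u
         + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00u);
}

// Split a code point in U+10000..U+10FFFF into its (high, low) pair.
std::pair<std::uint16_t, std::uint16_t> splitCodePoint(std::uint32_t cp)
{
    cp -= 0x10000u;
    return std::make_pair(
        static_cast<std::uint16_t>(0xD800u + (cp >> 10)),
        static_cast<std::uint16_t>(0xDC00u + (cp & 0x3FFu)));
}
```

The plane of a pair is then combineSurrogates(high, low) >> 16, so any well-formed surrogate pair in a UTF-16 stream marks a character outside plane 0.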

Returning to the original topic of this thread, ICU includes C #define
macros for working with plain UTF-8 and UTF-16 strings (how un-C++ can
you get?). Borrowing code from these would be a good choice for anyone
needing to pick out characters from UTF-8 encoded char * strings, even
if no other part of ICU is wanted. The file of interest is
icu/source/common/unicode/utf8.h.

Historically, the choice of UTF-16 as the primary storage format for ICU
does go way back to the very early days of Unicode, and to Java, with
its 16 bit Unicode chars.

ICU began life as a Java library, and was the original source for much
of the i18n support in the JDK. The Java classes were subsequently
ported to C++ and became ICU4C, and from there, parallel plain C APIs
were developed for use from applications that are allergic to C++ for
one reason or another. (There are still many of these around.)


-- Andy Heninger
heninger (AT) us (DOT) ibm.com


Aaron Bentley
Posted: Sat Oct 18, 2003 1:36 pm    Post subject: Re: Unicode plane detection

kanze (AT) gabi-soft (DOT) fr wrote:

Quote:
Does anyone know of a C++ library that'll enable me
to process utf-8/16 encoded unicode? or can they offer some advice.

snip


Try IBM's open source International Components for Unicode
<http://oss.software.ibm.com/icu/>.


IBM's open source effort is to be applauded, but it apparently got
started before Unicode 3.2 -- it only supports the UCS-2 (or UTF-16, I'm
not sure), and not full Unicode.

ICU 2.6 supports Unicode 4.0, but since its storage is 16 bit, it's too
tricky to modify strings. If read-only access is adequate, it can be
used. (Oh, and the OP apparently just wants utf-16 support.)

Right now I'm using the ICU C library with my vector<UChar32> and it
works, but I can imagine ditching it later for something that supports
direct utf8->utf32 conversions.
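
A direct conversion of that kind might look roughly like the sketch below. It is a hand-written illustration, not ICU code: utf8ToUtf32 is a hypothetical name, char32_t (which assumes a C++11 compiler) stands in for UChar32, and errors are reported with std::runtime_error purely for brevity.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Decode a complete UTF-8 string into one 32-bit value per character.
std::vector<char32_t> utf8ToUtf32(const std::string& in)
{
    // Smallest code point that legitimately needs 0, 1, 2 or 3
    // continuation bytes; anything smaller is an overlong encoding.
    static const char32_t minimum[4] = { 0x0, 0x80, 0x800, 0x10000 };

    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i++]);
        char32_t cp = 0;
        int trailing = 0;
        if (lead < 0x80)                { cp = lead;        trailing = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trailing = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trailing = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trailing = 3; }
        else throw std::runtime_error("illegal lead byte");

        for (int k = 0; k < trailing; ++k) {
            if (i == in.size()) throw std::runtime_error("incomplete sequence");
            const unsigned char c = static_cast<unsigned char>(in[i++]);
            if ((c & 0xC0) != 0x80) throw std::runtime_error("illegal continuation byte");
            cp = (cp << 6) | (c & 0x3F);   // append six payload bits
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF.
        if (cp < minimum[trailing] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw std::runtime_error("invalid code point");
        }
        out.push_back(cp);
    }
    return out;
}
```

With the text in this form, the original question reduces to scanning the vector for values greater than 0xFFFF, and the index of each hit is its character position.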

Aaron
--
Aaron Bentley
www.aaronbentley.com

James Kanze
Posted: Sun Oct 19, 2003 9:45 pm    Post subject: Re: Unicode plane detection

Andy Heninger <heninger (AT) us (DOT) ibm.com> writes:

Quote:
kanze (AT) gabi-soft (DOT) fr wrote:

IBM's open source effort is to be applauded, but it apparently
got started before Unicode 3.2 -- it only supports the UCS-2 (or
UTF-16, I'm not sure), and not full Unicode.

The ICU project tries very hard to stay completely up to date with
the Unicode standard. The latest, Unicode 4.0, was released in
April of this year; support for it in ICU came with version 2.6,
released in June.

http://www.unicode.org/versions/Unicode4.0.0/

It's got a lot of characters,

Graphic 96,248
Format 134
Control 65
Private Use 137,468
Surrogate 2,048
Noncharacter 66
Reserved 878,083

ICU uses UTF-16 internally, meaning that the non-plane-0 code points
are represented by a pair of 16 bit values.

Thanks for the information. I wasn't aware what the current status was;
all I did know was that their "Unicode" characters were 16 bits, and
that true effective Unicode support requires at least 21 bits (and so in
practice, 32 bits on most machines).

Quote:
Returning to the original topic of this thread, ICU includes C
#define macros for working with plain UTF-8 and UTF-16 strings (how
un-C++ can you get?). Borrowing code from these would be a good
choice for anyone needing to pick out characters from UTF-8 encoded
char * strings, even if no other part of ICU is wanted. The file of
interest is icu/source/common/unicode/utf8.h.

Borrowing working code (when it is legal) is always a good idea:-). On
the other hand, converting between UTF-8 and full 32 bit ISO 10646 is
rather simple. Much of the value of ICU lies in the enormous amount of
other support it offers.
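
To give an idea of how little is involved, here is a sketch of the 32-bit to UTF-8 direction. The function name toUtf8 is invented, and returning an empty string for an invalid input is just one convenient convention for the example.

```cpp
#include <cstdint>
#include <string>

// Encode one code point as UTF-8. Returns an empty string for surrogate
// code points (U+D800..U+DFFF) and for values above U+10FFFF.
std::string toUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;                 // surrogates are not characters
        }
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Only the last branch produces four bytes, which is why a four-byte sequence in UTF-8 always denotes a character outside plane 0.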

Quote:
Historically, the choice of UTF-16 as the primary storage format for
ICU does go way back to the very early days of Unicode, and to Java,
with its 16 bit Unicode chars.

I know the historical reasons. It's a real problem. Still, if I were
starting a new project in C++ today, and had a choice, I'd go with 32
bit wide characters.

--
James Kanze mailto:kanze (AT) gabi-soft (DOT) fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France +33 1 41 89 80 93

All times are GMT