Portable way to write binary data

 
Holger Sebert
Posted: Mon Nov 21, 2005 9:48 am    Post subject: Portable way to write binary data



Hi all,

I was shocked when I read the thread "For binary files use only read() and
write()??" above, in which it was stated that using read()/write() for binary data
is unportable and may lead to undefined behaviour (!!).

I always thought myself to be on the safe side by doing things the following way:

- Use std::ofstream/std::ifstream together with read()/write()

- Only use types of standardized size, i.e. float, double, long, ...
(they _are_ standardized, aren't they?? I'm slowly becoming unsure of almost
everything concerning portable C++ *sigh*)

- Store information about endianness elsewhere and, when reading binary data, flip the
bytes if necessary.

Where are the pitfalls following this procedure?

How should I do binary i/o instead to achieve portability?

Note: Unfortunately I cannot use the portable boost libraries ... (because they
don't compile on one of my target architectures, what a funny world)

Many thanks in advance,
Holger

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Ulrich Eckhardt
Posted: Mon Nov 21, 2005 12:41 pm    Post subject: Re: Portable way to write binary data



Holger Sebert wrote:
Quote:
I was shocked when I read the thread "For binary files use only read() and
write()??" above, in which it was stated that using read()/write() for binary
data is unportable and may lead to undefined behaviour (!!).

I always thought myself to be on the safe side by doing things the
following way:

- Use std::ofstream/std::ifstream together with read()/write()

You also need the appropriate codecvt facet (the one from std::locale::classic())
and the ios_base::binary flag.
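
A minimal sketch of that setup, assuming a C++11 compiler: the stream is imbued with std::locale::classic() before the file is opened, and the file is opened with ios_base::binary. The names write_bytes and read_bytes are made up for this illustration. This only keeps the stream from translating newlines and character codes; it does nothing about type sizes or byte order.

```cpp
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Write raw bytes through an ofstream that uses the classic "C" locale
// and binary mode. Returns true if every byte was written.
bool write_bytes(const std::string& path, const std::vector<char>& bytes)
{
    std::ofstream out;
    out.imbue(std::locale::classic());  // imbue before open() so the filebuf sees it
    out.open(path.c_str(),
             std::ios_base::out | std::ios_base::binary | std::ios_base::trunc);
    if (!out)
        return false;
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    out.close();                        // a failed flush/close sets failbit
    return !out.fail();
}

// Read the whole file back, byte for byte, with the same settings.
std::vector<char> read_bytes(const std::string& path)
{
    std::ifstream in;
    in.imbue(std::locale::classic());
    in.open(path.c_str(), std::ios_base::in | std::ios_base::binary);
    std::vector<char> bytes;
    char c;
    while (in.get(c))                   // get() is unformatted: no whitespace skipping
        bytes.push_back(c);
    return bytes;
}
```

Bytes such as '\n', '\r' or 0x1A are the usual victims of text-mode translation, so they make good test data for a round trip.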

Quote:
- Only use types of standardized size, i.e. float, double, long, ...
(they _are_ standardized, aren't they?? I'm slowly becoming unsure of
almost everything concerning portable C++ *sigh*)

No. Neither their size nor their layout is standardized. There are a few
minimum requirements but that's all.

Of course, there is also the unwarranted assumption that CHAR_BIT == 8. While I
have seen a machine where it isn't (a DSP from Texas Instruments), I haven't seen
the need to write portable software for it.
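
If you decide to rely on such assumptions anyway, it may be worth stating them in code so that a port to an unusual machine fails at compile time instead of producing unreadable files. A minimal sketch, assuming C++11 (static_assert, constexpr); the particular checks, including the IEEE 754 ones, and the helper bits_in_long are only examples of what a project might assert.

```cpp
#include <climits>
#include <limits>

// Fail at compile time rather than silently write files nobody else can read.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");
static_assert(sizeof(float) == 4 && std::numeric_limits<float>::is_iec559,
              "float is assumed to be a 32-bit IEEE 754 value");
static_assert(sizeof(double) == 8 && std::numeric_limits<double>::is_iec559,
              "double is assumed to be a 64-bit IEEE 754 value");

// Width of long on this implementation: the standard only guarantees
// at least 32 bits, so this value differs between platforms.
constexpr int bits_in_long()
{
    return static_cast<int>(sizeof(long) * CHAR_BIT);
}
```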

Uli


Simon Bone
Posted: Tue Nov 22, 2005 1:42 am    Post subject: Re: Portable way to write binary data



On Mon, 21 Nov 2005 04:48:03 -0500, Holger Sebert wrote:

Quote:
Hi all,

I was shocked when I read the thread "For binary files use only read() and
write()??" above, in which it was stated that using read()/write() for binary data
is unportable and may lead to undefined behaviour (!!).

I always thought myself to be on the safe side by doing things the following way:

- Use std::ofstream/std::ifstream together with read()/write()


The stream classes do formatting. You would use the streambuf classes if
you don't need that.

Quote:
- Only use types of standardized size, i.e. float, double, long, ...
(they _are_ standardized, aren't they?? I'm slowly becoming unsure of
almost everything concerning portable C++ *sigh*)


C++ standardizes minimum sizes for fundamental types. Implementations are
always free to use larger types if they think it makes sense for their
customers. For example, there is currently some variation in whether long
is 32 bits (the minimum allowed) or 64 bits (the widest native integral
type on many common processors).

In addition to this, there is some variation allowed in the format of the
types. E.g. integral types can be twos-complement, ones-complement or
signed-magnitude. You certainly do not want bit-for-bit copying of one of
these to another, since that would change the value and might even lead to
a trap representation. On the bright side, twos-complement is so common you
can probably rely on it.
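
One way to sidestep the representation question is to put the value on disk as an unsigned number and rebuild the signed value arithmetically. The sketch below assumes C++11 and <cstdint>; encode_i32 and decode_i32 are names invented here. (C99 only provides int32_t on twos-complement platforms, so with that typedef the question is already settled; the same arithmetic works for plain int or long.)

```cpp
#include <cstdint>

// Signed -> unsigned conversion is defined by the standard as reduction
// modulo 2^32, whatever the host's representation of negative numbers.
std::uint32_t encode_i32(std::int32_t v)
{
    return static_cast<std::uint32_t>(v);
}

// Rebuild the signed value with arithmetic only, so no step depends on
// how the host converts out-of-range unsigned values to signed.
std::int32_t decode_i32(std::uint32_t u)
{
    if (u <= 0x7FFFFFFFu)
        return static_cast<std::int32_t>(u);
    // u stands for u - 2^32; computed as -(2^32 - 1 - u) - 1 to avoid overflow.
    return -static_cast<std::int32_t>(0xFFFFFFFFu - u) - 1;
}
```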

Quote:
- Store information about endianness elsewhere and, when reading binary data,
flip the bytes if necessary.


Bear in mind that there are some perverse choices possible. A 4 byte datum
could be written 1234 or 4321 or 2143...
And what about an 8 byte datum?

Quote:
Where are the pitfalls following this procedure?


You might cover enough for all the platforms you develop and test on, and
then find yourself asked to support a platform where all your assumptions
break down. How likely that is depends on your application.

If or when it happens you can possibly write a special program to convert
the data files you have already created to the new platform's expectations.
This is often hard, and with a legacy application where the original
source has become convoluted through long haphazard maintenance (or
just been lost), it is darn-near impossible. Most of us curse applications
that put us through this, so consider whether it is likely for your
applications.

Quote:
How should I do binary i/o instead to achieve portability?


At the least, use typedef names for the types you write/read, such as
int8_t, int32_t etc. from the C99 stdint.h. This encapsulates your
assumption about the sizes of the types.

Your approach to include information about endian-ness in the file is OK,
but usually you can define a fixed format for the file. The time spent
waiting for IO to complete is likely to dwarf any time spent marshalling
the data to or from this format. If you are doing that limited formatting
anyway, you might consider going one extra step and ditching binary IO
altogether too. The advantage of a file format that can be used on any
platform is a big one.

Quote:
Note: Unfortunately I cannot use the portable boost libraries ...
(because they don't compile on one of my target architectures, what a
funny world)


There are many others out there. The boost library is worth looking at to
see how this can be done well. But also look at the serialization section
in the FAQ at http://www.parashift.com/c++-faq-lite/ for more ideas.

HTH

Simon Bone


Le Chaud Lapin
Posted: Tue Nov 22, 2005 2:09 am    Post subject: Re: Portable way to write binary data


Holger Sebert wrote:
Quote:
Where are the pitfalls following this procedure?

How should I do binary i/o instead to achieve portability?

Your views seem good to me.

I implemented a serialization package (which turned out to be oddly
similar to one in Boost) that basically defined Source and Target
repositories for serializing the 13 scalar types in C++ and the 13
vector types. Source and Target have virtual functions that can be
overridden by any derived I/O class. I use this model extensively for
my inter-process distributed communication.

With regard to data format, you're right. It's better to follow the
receiver-makes-right rule, because in the vast majority of distributed
data sharing, the source and target architectures are identical
(PC-to-PC, SPARC-to-SPARC, etc.). For cases where they are not, I
include at beginning of transmission stream an object that completely
characterizes the format of the fundamental C++ types on the source
machine, so that any target machine can do a conversion if necessary.
One would be surprised at how compact this object can be made for the
13 fundamental C++ scalar types.

To do the same for files, I would simply put this descriptor object at
the beginning of the file, but I am not doing that yet.
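
To give an idea of what such a descriptor could look like, here is a deliberately small sketch. It is not the actual object from my package: FormatDescriptor, describe_host and same_format are hypothetical names, only a handful of types are covered, and it says nothing about the floating point encoding itself. It assumes C++11 and 8-bit bytes.

```cpp
#include <climits>
#include <cstdint>
#include <cstring>

static_assert(sizeof(std::uint32_t) == 4, "sketch assumes 8-bit bytes");

// Hypothetical descriptor: a few bytes that tell the receiver how the
// sender lays out its fundamental types.
struct FormatDescriptor
{
    unsigned char char_bits;
    unsigned char size_int;
    unsigned char size_long;
    unsigned char size_float;
    unsigned char size_double;
    unsigned char byte_order[4];  // memory image of the value 0x01020304
};

// Fill the descriptor from the machine the program is running on.
FormatDescriptor describe_host()
{
    FormatDescriptor d;
    d.char_bits   = static_cast<unsigned char>(CHAR_BIT);
    d.size_int    = static_cast<unsigned char>(sizeof(int));
    d.size_long   = static_cast<unsigned char>(sizeof(long));
    d.size_float  = static_cast<unsigned char>(sizeof(float));
    d.size_double = static_cast<unsigned char>(sizeof(double));
    const std::uint32_t probe = 0x01020304u;
    std::memcpy(d.byte_order, &probe, sizeof d.byte_order);
    return d;
}

// The receiver only needs to convert when the descriptors differ.
// All members are unsigned char, so the struct has no padding to compare.
bool same_format(const FormatDescriptor& a, const FormatDescriptor& b)
{
    return std::memcmp(&a, &b, sizeof a) == 0;
}
```

Storing the byte image of a known constant, rather than a single big/little flag, also covers mixed orders such as 3412.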

Finally, since any aggregate can be recursively and ultimately
decomposed into scalar objects, it is trivial to serialize complex
types.

Caveats, which you are certainly aware of:

1. Polymorphic objects are intractable
2. If structure of an object changes, you're in big trouble with all
that old-format data everywhere. Boost gets around this with embedded
versioning. I decided not to take this route, as I felt it would be
pushing the limit on what makes one type distinct from another. And
also, it raises the standard for defining nice clean data types. I
hear a little voice in my head as I write the serialization code..."You
sure you got the structure of this class right? Huh..huh...huh? You'll
suffer if you didn't."

-Le Chaud Lapin-


kanze
Posted: Tue Nov 22, 2005 4:34 pm    Post subject: Re: Portable way to write binary data

Simon Bone wrote:

Quote:
On Mon, 21 Nov 2005 04:48:03 -0500, Holger Sebert wrote:


Quote:
I was shocked when I read the thread "For binary files use
only read() and write()??" above, in which it was stated that
using read()/write() for binary data is unportable and may
lead to undefined behaviour (!!).


Quote:
I always thought myself to be on the safe side by doing
things the following way:


Quote:
- Use std::ofstream/std::ifstream together with read()/write()


Quote:
The stream classes do formatting. You would use the streambuf
classes if you don't need that.


basic_ios also does error handling. What there is of it,
anyway. Use the streambuf if you don't need that.

The streambuf does character code translation. Don't use
streambuf if you don't want that.

In fact, it's a trade off, which has to be evaluated each time.


Quote:
- Only use types of standardized size, i.e. float, double,
long, ... (they _are_ standardized, aren't they?? I'm
slowly becoming unsure of almost everything concerning
portable C++ *sigh*)



Quote:
C++ standardizes minimum sizes for fundamental types.
Implementations are always free to use larger types if they
think it makes sense for their customers. For example, there
is currently some variation in whether long is 32 bits (the
minimum allowed) or 64 bits (the widest native integral type
on many common processors).


There are also machines with 32 bit char's, and at least one
with 9 bit char's and 36 bit 1's complement int's.

Not everybody has to deal with them, of course.


Quote:
In addition to this, there is some variation allowed in the
format of the types. E.g. integral types can be
twos-complement, ones-complement or signed-magnitude. You
certainly do not want bit-for-bit copying of one of these to
another, since that would change the value and might even lead
to a trap representation. On the bright side, twos-complement
is so common you can probably rely on it.


Probably. There's always the Unisys 2200's, but that's a pretty
small market.

Floating point is trickier, since the mainframe IBM's also have
a different format (and I've been told that IEEE isn't always
compatible between vendors, at least where NaN's are concerned).


Quote:
- Store information about endianness elsewhere and, when reading
binary data, flip the bytes if necessary.


Quote:
Bear in mind that there are some perverse choices possible. A
4 byte datum could be written 1234 or 4321 or 2143... And what
about a 8 byte datum?


I've actually used systems where long's were 3412. The
processor was Intel, and the compiler Microsoft, so I don't
think we can speak of obscure niche players, either.


Quote:
Where are the pitfalls following this procedure?


Quote:
You might cover enough for all the platforms you develop and
test on, and then find yourself asked to support a platform
where all your assumptions break down. How likely that is
depends on your application.


Quote:
If or when it happens you can possibly write a special program
to convert the data files you have already created to the new
platform's expectations. This is often hard, and with a legacy
application where the original source has become convoluted
through long haphazard maintenance (or just been lost), it is
darn-near impossible. Most of us curse applications that put
us through this, so consider whether it is likely for your
applications.


The problem isn't so much writing the code to read the format,
once you know it. The problem is finding out what the format
was to begin with. Especially if the data written contained
struct's -- who knows where the original compiler inserted
padding?
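
To make the padding point concrete, here is a small sketch (C++11 assumed; Sample, member_bytes and padding_bytes are invented for the illustration). Writing the struct with a single write() call puts the padding bytes, wherever the compiler chose to place them, into the file along with the data.

```cpp
#include <cstddef>
#include <cstdint>

// An innocent-looking record that someone might dump with one write().
struct Sample
{
    char          tag;
    double        value;
    std::uint16_t count;
};

// Bytes the members need by themselves.
constexpr std::size_t member_bytes()
{
    return sizeof(char) + sizeof(double) + sizeof(std::uint16_t);
}

// Bytes the compiler added as padding. The amount and the position are
// implementation-specific: for instance 13 where double is 8 bytes and
// 8-byte aligned (7 after tag, 6 after count).
constexpr std::size_t padding_bytes()
{
    return sizeof(Sample) - member_bytes();
}
```

Writing the members one at a time, in a documented order and size, keeps the padding out of the file format.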


Quote:
How should I do binary i/o instead to achieve portability?


Quote:
At the least, use typedef names for the types you write/read.
Such as int8_t, int32_t etc from the C99 stdint.h. This
encapsulates your assumption about the sizes of the types.


Quote:
Your approach to include information about endian-ness in the
file is OK, but usually you can define a fixed format for the
file.


I'd say that you have to do it anyway. You have to document the
exact format on disk; otherwise, sooner or later, it will be
unreadable. Given that, you might as well document endian-ness,
and stick to it. (And it is easy to write portably to a given
endianness.)


Quote:
The time spent waiting for IO to complete is likely to dwarf
any time spent marshalling the data to or from this format. If
you are doing that limited formatting anyway, you might
consider going one extra step and ditching binary IO
altogether too. The advantage of a file format that can be
used on any platform is a big one.


In theory at least, any file format can be used on any platform.
I'll admit that I've never tested the extreme cases -- writing a
file on a machine with 9 bit char's, then trying to read it on
one with 8 bit char's, for example. But I regularly read and
write binary files which are shared between Sparc's (in both 32
bit and 64 bit modes) and PC's under Linux and Windows, using
the exact same code on every platform (no conditional byte
swapping).

Note that while globally, I agree with your recommendation for
using text whenever possible (it sure makes debugging easier),
it's worth pointing out that you need to define a few details of
the format there as well -- Unix and Windows typically expect
different line separators, and mainframe IBM's still use EBCDIC.


Quote:
Note: Unfortunately I cannot use the portable boost
libraries ... (because they don't compile on one of my
target architectures, what a funny world)


Join the club:-(.


Quote:
There are many others out there. The boost library is worth
looking at to see how this can be done well.


Sort of. The Boost libraries have different goals than normal
production code, and I would certainly never introduce so much
genericity in something that I knew would only be used for a
short time in one project.


Quote:
But also look at the serialization section in the FAQ at
http://www.parashift.com/c++-faq-lite/ for more ideas.


--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


Holger Sebert
Posted: Thu Nov 24, 2005 8:05 am    Post subject: Re: Portable way to write binary data

Hi,

thank you all for your answers.

The data I have to write are huge blocks of floating point data (both of type
float or double) coming out of numerical applications. So writing them in text
format is out of the question.

These data blocks are generated on some big computer (whose architecture may be
mysterious) and then are post processed on ordinary PCs for, e.g.,
visualisation. Up to now everything has worked fine with only endianness to
take care of, but I see the danger that things could get messed up in the future.

A general purpose serialization library might be overkill, or not specialized
enough (furthermore I am obliged to keep the library dependencies as small as
possible).

Does anyone know what in total I have to consider when dealing portably with
binary floating point data and could give a link or something?

Or perhaps it's sufficient just using some typedefs and hope there won't be 80
bit floats that have to be read on a 64 bit float machine ... ?

Regards,
Holger

kanze
Posted: Thu Nov 24, 2005 5:19 pm    Post subject: Re: Portable way to write binary data

Holger Sebert wrote:

Quote:
The data I have to write are huge blocks of floating point
data (both of type float or double) coming out of numerical
applications. So writing them in text format is out of the
question.

Be careful here. A good binary representation of floating point
is a lot trickier than the other types. I'd still consider
text, although it's true that conversion of floating point to
text (and vice versa) is also a lot more costly than for other
formats.

Quote:
These data blocks are generated on some big computer (whose
architecture may be mysterious) and then are post processed on
ordinary PCs for, e.g., visualisation. Up to now everything has
worked fine with only endianness to take care of, but I see the
danger that things could get messed up in the future.

And how. You don't give any information concerning the "some
big computer", but be aware that some big computers use a
floating point format that is completely incompatible with that
on a PC.

Quote:
A general purpose serialization library might be overkill, or
not specialized enough (furthermore I am obliged to keep the
library dependencies as small as possible).

Does anyone know what in total I have to consider when dealing
portably with binary floating point data and could give a link
or something?

I'm familiar with BER format, but it might be overkill; it can
also be very expensive to decode. About the only other portable
floating point format I know is text.

I would give careful consideration to the set of machines over
which the code must work. Most new machines will support IEEE;
your only real risk is legacy architectures, like IBM 390, and
even these are moving toward IEEE. (Java requires it.) If you
can assume that all of the machines use IEEE, and that NaN's are
never transmitted, then I think you can make do with viewing the
double as if it were an unsigned long long, and transmitting
that.
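
A sketch of what that could look like, under exactly those assumptions (both sides use 64-bit IEEE doubles, no NaN's), plus two more that I am adding: 8-bit bytes, and a host that stores doubles and 64 bit integers in the same byte order. The function names and the choice of big-endian for the external format are arbitrary.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<double>::is_iec559 && sizeof(double) == 8,
              "sketch assumes a 64-bit IEEE 754 double");

// View the double as a 64-bit unsigned integer and emit it
// most significant byte first.
void encode_double_be(double d, unsigned char out[8])
{
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);  // well-defined way to get at the bits
    for (int i = 0; i < 8; ++i)
        out[i] = static_cast<unsigned char>((bits >> (56 - 8 * i)) & 0xFF);
}

// Inverse of encode_double_be.
double decode_double_be(const unsigned char in[8])
{
    std::uint64_t bits = 0;
    for (int i = 0; i < 8; ++i)
        bits = (bits << 8) | in[i];
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}
```

For huge blocks you would call this in a loop over a buffer and write the buffer in one go; the shifts are cheap compared to the I/O.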

Quote:
Or perhaps it's sufficient just using some typedefs and hope
there won't be 80 bit floats that have to be read on a 64 bit
float machine ... ?

The only 80 bit floats that I know are IEEE extended precision,
which would be mapped to long double, if they are supported at
all. In the past, one of the most important big machines for
numeric work was the CDC's, which used a 60 bit float (and a 120
bit double), but if you don't currently have to support these,
you almost certainly won't in the future.

But the number of bits isn't the only problem. IBM 390's use
natively a base 16 format, rather than a base 2. Some years
back, IBM introduced IEEE support on this hardware as an option;
at least on the early machines supporting it, IEEE was
significantly slower than the native format, and even if that is
no longer a problem, there is still the issue of files written
in the native format which could force use of it rather than
IEEE.

Trying to be portable to every legal format is probably a waste
of time. The IEEE format (used on PC's, Sparcs, HP's PA and
IBM's Power PC architectures) has become more or less a
standard; unless you have concrete reasons to assume that you
will have to support something else, I'd limit my support to
that until more became a concrete necessity. (Of course, I
would encapsulate the "conversion" routines, so that if more
became necessary, I know where the changes have to be made, and
they won't ripple through the entire program.)

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


Christopher Yeleighton
Posted: Fri Nov 25, 2005 6:22 pm    Post subject: Re: Portable way to write binary data


"Holger Sebert" <holger.sebert (AT) ruhr-uni-bochum (DOT) de> wrote

Quote:
Hi,

thank you all for your answers.

The data I have to write are huge blocks of floating point data (both of
type float or double) coming out of numerical applications. So writing them
in text format is out of the question.

These data blocks are generated on some big computer (whose architecture
may be mysterious) and then are post processed on ordinary PCs for, e.g.,
visualisation. Up to now everything has worked fine with only endianness
to take care of, but I see the danger that things could get messed up in
the future.

A general purpose serialization library might be overkill, or not
specialized enough (furthermore I am obliged to keep the library
dependencies as small as possible).

Does anyone know what in total I have to consider when dealing portably
with binary floating point data and could give a link or something?


http://webstore.ansi.org/ansidocstore/product.asp?sku=INCITS%2FISO%2FIEC+8825%2D1%2D1998

Chris



Carl Barron
Posted: Sat Nov 26, 2005 2:32 pm    Post subject: Re: Portable way to write binary data

kanze <kanze (AT) gabi-soft (DOT) fr> wrote:

Quote:
basic_ios also does error handling. What there is of it,
anyway. Use the streambuf if you don't need that.

The streambuf does character code translation. Don't use
streambuf if you don't want that.

Stream buffer classes can do char code translation. But I
don't see any requirement that they always do so. In fact
stringbuf usually does not :)

This is a legal stream buffer class:

#include <streambuf>

struct membuf : public std::streambuf
{
    // A simple sequential read of the memory block [a, a+n).
    membuf(char* a, int n) { setg(a, a, a + n); }
};

Definitely do not use filebuf for binary data without a specific
do-nothing codecvt...
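
For completeness, a sketch of how a buffer like membuf can be put behind an ordinary std::istream. The helper read_all is invented for the illustration, and the size parameter is std::size_t here rather than int. Since membuf only sets up a get area over existing memory, the bytes come back exactly as stored, with no code translation involved.

```cpp
#include <cstddef>
#include <istream>
#include <streambuf>
#include <string>

// Stream buffer over an existing memory block [a, a+n); read-only use.
struct membuf : public std::streambuf
{
    membuf(char* a, std::size_t n) { setg(a, a, a + n); }
};

// Read the whole block back through an istream, unformatted.
std::string read_all(char* data, std::size_t n)
{
    membuf buf(data, n);
    std::istream in(&buf);
    std::string out(n, '\0');
    in.read(&out[0], static_cast<std::streamsize>(n));
    out.resize(static_cast<std::size_t>(in.gcount()));  // bytes actually read
    return out;
}
```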

James Kanze
Posted: Sun Nov 27, 2005 4:08 am    Post subject: Re: Portable way to write binary data

Carl Barron wrote:
Quote:
kanze <kanze (AT) gabi-soft (DOT) fr> wrote:

basic_ios also does error handling. What there is of it,
anyway. Use the streambuf if you don't need that.

The streambuf does character code translation. Don't use
streambuf if you don't want that.

Stream buffer classes can do char code translation. But I
don't see any requirement that they always do so. In fact
stringbuf usually does not :)

There is a requirement that filebuf do code translation. The
potential is there.

Quote:
Definitely do not use filebuf for binary data without a specific
do-nothing codecvt...

That's the work-around. It's a fragile solution, but it is the
only one available to us.

It would be nicer if there were a class with an abstraction
which didn't include code translation.

--
James Kanze mailto:james.kanze (AT) free (DOT) fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 pl. Pierre Sémard, 78210 St.-Cyr-l'École, France +33 (0)1 30 23 00 34


Paul Floyd
Posted: Sun Nov 27, 2005 9:58 pm    Post subject: Re: Portable way to write binary data

On 24 Nov 2005 12:19:53 -0500, kanze <kanze (AT) gabi-soft (DOT) fr> wrote:
Quote:
Holger Sebert wrote:

A general purpose serialization library might be overkill, or
not specialized enough (furthermore I am obliged to keep the
library dependencies as small as possible).

Does anyone know what in total I have to consider when dealing
portably with binary floating point data and could give a link
or something?

I'm familiar with BER format, but it might be overkill; it can
also be very expensive to decode. About the only other portable
floating point format I know is text.

Assuming that you are referring to the Basic Encoding Rules of ASN.1,
there exist other rules that have better performance, especially the
PER, Packed Encoding Rules.

There are commercial libraries that perform encoding and decoding, with
C and C++ bindings.

A bientot
Paul
--
Paul Floyd http://paulf.free.fr (for what it's worth)
Surgery: ennobled Gerald.

elazro
Posted: Mon Nov 28, 2005 9:03 pm    Post subject: Re: Portable way to write binary data

It sounds like HDF (Hierarchical Data Format) is something you might
wish to look into. HDF was developed for dealing with (large)
scientific datasets, and supports binary IO, compression, and
blocking/tiling for improved IO speeds. It handles endianness and
other issues for you, and it also compiles on a number of
high-performance platforms, so it may work where boost::serialization
fails.

It may be overkill for your situation, it's got a non-trivial learning
curve, and it is mainly C libraries (though there are C++ wrappers).
However, it is quite a good library, widely deployed, and is suitable
for production code.

Otherwise, if you would rather hand-roll it, I'd just put an endianness
flag in, use the types from C99's stdint.h for ints and longs (int32_t, etc.),
and encapsulate the conversion routines so that if you run into a
target architecture where the preceding isn't enough, you'll have an
easier time fixing it.
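
A rough sketch of the hand-rolled route, assuming C++11, 8-bit bytes and only plain big- or little-endian machines (so no 3412-style orders). All names are invented for the example. The file's byte order would be stored as a one-byte flag in a header, and every load goes through one small function, which is where a fix would go if a stranger platform turned up.

```cpp
#include <cstdint>
#include <cstring>

// Value of the one-byte flag stored in the file header.
enum ByteOrder : unsigned char { LittleEndian = 0, BigEndian = 1 };

// Detect the byte order of the machine we are running on.
ByteOrder host_byte_order()
{
    const std::uint16_t probe = 1;
    unsigned char first;
    std::memcpy(&first, &probe, 1);  // lowest-addressed byte of the probe
    return first == 1 ? LittleEndian : BigEndian;
}

// Reverse the four bytes of v.
std::uint32_t swap_u32(std::uint32_t v)
{
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8)
         | ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
}

// Load a 32-bit value from four raw file bytes, swapping only when the
// file's byte order differs from the host's.
std::uint32_t load_u32(const unsigned char* p, ByteOrder file_order)
{
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return file_order == host_byte_order() ? v : swap_u32(v);
}
```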

-matt

