C++Talk.NET Forum Index C++Talk.NET
C++ language newsgroups
 
Archives   FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Strategies for indexing very large arrays
Goto page 1, 2  Next
 
Post new topic   Reply to topic    C++Talk.NET Forum Index -> C++ Language (Moderated)
View previous topic :: View next topic  
Author Message
Rune Allnor
Guest





PostPosted: Sun Sep 24, 2006 3:12 pm    Post subject: Strategies for indexing very large arrays Reply with quote



Hi all.

I am implementing this application for bathymetry terrain modeling.
The application basically takes a list of (x,y,z) data points, and
manipulates these. Present-day sonar technologies deliver
data in the order of 5-10 million data points per data set.

Emerging sonar technologies could well increase the spatial sampling
density such that near-future bathymetry data sets extend into some
100 million - 1 billion points/data set.

With 64-bit OSs, it seems that the HW to deal with such data
sets will be available in the near future, which means I have to plan
my software accordingly.

The 'long int' data type can be used to index byte arrays up to 2Gb
length, the 'unsigned long int' can be used to index byte arrays of
4G length. This is more than sufficient today; it may not continue
to be in, say, a 5-10 year perspective.

What would be the alternative ways to prepare for implementing
arrays of more than 2 billion elements under the emerging 64 bit OSs?
Use templates everywhere? Use 'long long int's? Something else?
What are the pros and cons of the different ways?

I do my programming on a Win XP laptop, but make efforts to
be able to use the same source code on other systems and OSs.
The code obviously have to be recompiled when ported, so
it is no need to make the executables portable. I am only
concerned about source code maintenance.

I want my program to run fast right now, at the same time I do
not want to implement everything from scratch in three years,
when the program is ported to a larger computer system.

Rune


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Back to top
Maxim Yegorushkin
Guest





PostPosted: Sun Sep 24, 2006 4:57 pm    Post subject: Re: Strategies for indexing very large arrays Reply with quote



Rune Allnor wrote:

[]

Quote:
The 'long int' data type can be used to index byte arrays up to 2Gb
length, the 'unsigned long int' can be used to index byte arrays of
4G length. This is more than sufficient today; it may not continue
to be in, say, a 5-10 year perspective.

On linux and many unixes long has the size of void*, i.e. long is 32
bit on a 32-bit platform and 64 bit on a 64-bit platform. This may be
not true for windoze.

Quote:
What would be the alternative ways to prepare for implementing
arrays of more than 2 billion elements under the emerging 64 bit OSs?
Use templates everywhere? Use 'long long int's? Something else?
What are the pros and cons of the different ways?

Use size_t.


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Back to top
andrew_nuss@yahoo.com
Guest





PostPosted: Sun Sep 24, 2006 7:53 pm    Post subject: Re: Strategies for indexing very large arrays Reply with quote



Maxim Yegorushkin wrote:
Quote:
Rune Allnor wrote:


Use size_t.


Aside from using size_t for the index, and using a template-based
array, I would consider using a btree of depth 2, with the first index
being the high bits of size_t and the second index being the low bits
of size_t, costing a shift+mask+2 array indexes versus 1 array index.
I am doing this for my very large arrays, though it is just a little
slower to index, it dramatically reduces the problem of heap
fragmentation.


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Back to top
Rune Allnor
Guest





PostPosted: Sun Sep 24, 2006 9:42 pm    Post subject: Re: Strategies for indexing very large arrays Reply with quote

andrew_nuss (AT) yahoo (DOT) com skrev:
Quote:
Maxim Yegorushkin wrote:
Rune Allnor wrote:


Use size_t.


Aside from using size_t for the index, and using a template-based
array, I would consider using a btree of depth 2, with the first index
being the high bits of size_t and the second index being the low bits
of size_t, costing a shift+mask+2 array indexes versus 1 array index.
I am doing this for my very large arrays, though it is just a little
slower to index, it dramatically reduces the problem of heap
fragmentation.

Interesting.

Could you elaborate, please? First, I can't really see how this
solves the 2G indexing limit (unless one can hide some
'long long long int' data types inside a template class)?
Second, I get horrible flashbacks to the times of MSDOS
and 64k segment size limits...

Rune


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Back to top
Jens Kilian
Guest





PostPosted: Sun Sep 24, 2006 10:07 pm    Post subject: Re: Strategies for indexing very large arrays Reply with quote

"Maxim Yegorushkin" <maxim.yegorushkin (AT) gmail (DOT) com> writes:
Quote:
Use size_t.

IIRC, the recommended index type for an overloaded operator[] is ptrdiff_t.
--
mailto:jjk (AT) acm (DOT) org As the air to a bird, or the sea to a
fish,
http://www.bawue.de/~jjk/ so is contempt to the contemptible.
[Blake]
http://del.icio.us/jjk

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Back to top
Maxim Yegorushkin
Guest





PostPosted: Mon Sep 25, 2006 12:26 am    Post subject: Re: Strategies for indexing very large arrays Reply with quote

andrew_nuss (AT) yahoo (DOT) com wrote:
Quote:
Maxim Yegorushkin wrote:
Rune Allnor wrote:


Use size_t.


Aside from using size_t for the index, and using a template-based
array, I would consider using a btree of depth 2, with the first index
being the high bits of size_t and the second index being the low bits
of size_t, costing a shift+mask+2 array indexes versus 1 array index.
I am doing this for my very large arrays, though it is just a little
slower to index, it dramatically reduces the problem of heap
fragmentation.

Agree. This description sounds somewhat similar to a sparse array. Is
it another one of the kind?

http://goog-sparsehash.sourceforge.net/doc/sparsetable.html might be a
good start for the original poster.


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Back to top
Maxim Yegorushkin
Guest





PostPosted: Mon Sep 25, 2006 12:26 am    Post subject: Re: Strategies for indexing very large arrays Reply with quote

Jens Kilian wrote:
Quote:
"Maxim Yegorushkin" <maxim.yegorushkin (AT) gmail (DOT) com> writes:
Use size_t.

IIRC, the recommended index type for an overloaded operator[] is ptrdiff_t.

ptrdiff_t is signed which might be a problem if your array is bigger
than a half of the virtual address space.


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Back to top
andrew_nuss@yahoo.com
Guest





PostPosted: Mon Sep 25, 2006 2:41 am    Post subject: Re: Strategies for indexing very large arrays Reply with quote

Rune Allnor wrote:
Quote:
Interesting.

Could you elaborate, please? First, I can't really see how this
solves the 2G indexing limit (unless one can hide some
'long long long int' data types inside a template class)?
Second, I get horrible flashbacks to the times of MSDOS
and 64k segment size limits...

Rune


I think size_t will be 64 bits on those machines, but I'm not sure.
Aside from this, here's the rudiments of my template class:

template <typename T>
class Array {

enum { ELEMSIZE = sizeof(T) };
enum { LEAFMASK = BINARY_FLOOR(Heap::CHUNK_SIZE / ELEMSIZE) - 1 };
enum { LEAFSHIFT = BINARY_LOG(LEAFMASK + 1) };
enum { NODESIZE = sizeof(T*) };
enum { NODEMASK = BINARY_FLOOR(Heap::CHUNK_SIZE / NODESIZE) - 1};

T** ar;
size_t len; // logical len
Heap* heap; // free blocks

public:

T& operator [] (size_t index) const
{
return *(ar[index >> LEAFSHIFT] + (index & LEAFMASK));
}
};

Assuming a very big array, your top-level block will be in bytes,
worst-case:

NODESIZE * (NODEMASK+1)

And your bottom level blocks will be in bytes, worst-case:

ELEMSIZE * (LEAFMASK+1)

You can arrange things so that these blocks are around 100000*sizeof(T)
and then have a very large address space. The only hard part about
this array is the constructor/destructor and making decisions about
your chunk sizes. Your maximum chunksize sets the following limit on
the address space:

index < (CHUNKSIZE*CHUNKSIZE) / (sizeof(T) * sizeof(T*))

The result is roughly this, if you get to use blocks that are much much
smaller for the same logical array, roughly the squareroot of the
"flat" approach. Your algorithm will survive a higher degree of heap
fragmentation.


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Back to top
Michael Tiomkin
Guest





PostPosted: Mon Sep 25, 2006 2:41 am    Post subject: Re: Strategies for indexing very large arrays Reply with quote

Maxim Yegorushkin wrote:
Quote:
Jens Kilian wrote:
"Maxim Yegorushkin" <maxim.yegorushkin (AT) gmail (DOT) com> writes:
Use size_t.

IIRC, the recommended index type for an overloaded operator[] is ptrdiff_t.

ptrdiff_t is signed which might be a problem if your array is bigger
than a half of the virtual address space.

This will be a problem only if the object size is 1. It seems that
the size of the (x,y,z) data points is at least 12 (3 floats). For a
32-bit address space the maximal index will be less than 400,000,000.
For a 31-bit address space (32-bit XP) it will be less than
200,000,000.

The types 'ptrdiff_t', 'size_t' and 'unsigned long long' will be
enough in this case. You might have a problem with 'unsigned long
long' for a machine with 128-bit address space, but I think that these
machines would not be available before 2020.

The solution of 'long int' will be wrong for a machine with 64-bit
address space and 32-bit long.

Michael


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Back to top
Maxim Yegorushkin
Guest





PostPosted: Mon Sep 25, 2006 8:02 pm    Post subject: Re: Strategies for indexing very large arrays Reply with quote

kanze wrote:
Quote:
Maxim Yegorushkin wrote:
Jens Kilian wrote:
"Maxim Yegorushkin" <maxim.yegorushkin (AT) gmail (DOT) com> writes:
Use size_t.

IIRC, the recommended index type for an overloaded
operator[] is ptrdiff_t.

ptrdiff_t is signed which might be a problem if your array is
bigger than a half of the virtual address space.

Which is highly unlikely if the size of each element is larger
than one. Generally speaking, a size_t can contain the number
of bytes in any single object, and a ptrdiff_t has an upper
limit of half that. So if the elements in an array are larger
than char's, it should be sufficient. I'd make the choice
dependent on whether I wanted to use an unsigned type or not.
(Normally, given the peculiarities of C++'s unsigned integral
types, it's better to avoid them for numerical values. On the
other hand, if you're already stuck with them in some contexts,
such as the STL, mixing signed and unsigned is a sure guarantee
of some surprises down the road.)

That's interesting. What is so peculiar with C++ unsigned integers?


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Back to top
Greg Herlihy
Guest





PostPosted: Mon Sep 25, 2006 10:21 pm    Post subject: Re: Strategies for indexing very large arrays Reply with quote

Maxim Yegorushkin wrote:
Quote:
...That's interesting. What is so peculiar with C++ unsigned integers?

Sign promotion for one. Consider this example program from a 64-bit
programming transition guide:

int a = -2;
unsigned int b = 1;
long c = a + b;
long long d = c; // to get a consistent size for printing.

printf("%lld\n", d);

On 32-bit systems, d is -1.

On any 64-bit system with 64-bit longs (such as any LP64 system
including Linux), d is 4294967295.

Greg


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Back to top
Seungbeom Kim
Guest





PostPosted: Mon Sep 25, 2006 11:48 pm    Post subject: Re: Strategies for indexing very large arrays Reply with quote

Maxim Yegorushkin wrote:
Quote:

That's interesting. What is so peculiar with C++ unsigned integers?

In many cases, unsigned integers used in an expression will convert other
signed integers in the same expression into unsigned (as if they were
"contagious"), yielding surprises. For example:

signed int a = -1;
unsigned int b = 1;
if (a > b) std::cerr << "BANG!\n";

Contrary to the intuition, the condition will be evaluated as true.

Though many people concentrate only on the "unsignedness" or "non-
negativeness" of unsigned integer *values*, unsigned integer types also
differ from signed integer types in the semantics of the *operations*.

--
Seungbeom Kim

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Back to top
Seungbeom Kim
Guest





PostPosted: Mon Sep 25, 2006 11:49 pm    Post subject: Re: Strategies for indexing very large arrays Reply with quote

Maxim Yegorushkin wrote:
Quote:
Jens Kilian wrote:
"Maxim Yegorushkin" <maxim.yegorushkin (AT) gmail (DOT) com> writes:
Use size_t.
IIRC, the recommended index type for an overloaded operator[] is ptrdiff_t.

ptrdiff_t is signed which might be a problem if your array is bigger
than a half of the virtual address space.

Then how is the built-in operator[], specified in 13.6/13 as taking
ptrdiff_t as one of its operands, supposed to work for such an array?
And what about 'ptrdiff_t operator-(T, T);' specified in 13.6/14,
when the two objects are farther away from each other than what ptrdiff_t
can represent? Does this mean that arrays shouldn't be bigger than std::
numeric_limits<std::ptrdiff_t>::max()?

--
Seungbeom Kim

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Back to top
Markus Grueneis
Guest





PostPosted: Tue Sep 26, 2006 12:20 am    Post subject: Re: Strategies for indexing very large arrays Reply with quote

Greg Herlihy schrieb:
Quote:
Maxim Yegorushkin wrote:
...That's interesting. What is so peculiar with C++ unsigned integers?

Sign promotion for one. Consider this example program from a 64-bit
programming transition guide:

int a = -2;
unsigned int b = 1;
long c = a + b;
long long d = c; // to get a consistent size for printing.

printf("%lld\n", d);

On 32-bit systems, d is -1.

On any 64-bit system with 64-bit longs (such as any LP64 system
including Linux), d is 4294967295.


This is quite an interesting example, as I don't really see on which
step this should happen. Can you please provide some keyword to search
for such issues? (beyond 64bit compatibility or transition). Or do you
know some good resource which lists some of this specialities?

With a (very) quick search I found some links, but they are more
discussing the easy things like changed constants, -1 issues, MAX_* + 1
and so... Or do I miss something obvious in the above example?


Thank You very much,
-- Markus

Quote:
Greg


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Back to top
Markus Grueneis
Guest





PostPosted: Tue Sep 26, 2006 1:30 am    Post subject: Re: Strategies for indexing very large arrays Reply with quote

Markus Grueneis schrieb:
Quote:
Greg Herlihy schrieb:
Maxim Yegorushkin wrote:
...That's interesting. What is so peculiar with C++ unsigned integers?
Sign promotion for one. Consider this example program from a 64-bit
programming transition guide:

[example about signed/unsigned problem]

[resolved question]


Don't bother anymore, as Seungbeom Kim wrote:
Quote:
[...]
In many cases, unsigned integers used in an expression will convert other
signed integers in the same expression into unsigned (as if they were
[...]

I've never asked this question myself and therefore wrongly assumed that
unsigned values will be converted to signed ones, not the other way
round. Therefore my previous question is answered, as the unsigned int
value of the expression fits well into a 64bit long, and therefore the
bit-pattern won't get reinterpreted as a signed value.


Thanks,
-- Markus



Quote:

Greg


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Back to top
Display posts from previous:   
Post new topic   Reply to topic    C++Talk.NET Forum Index -> C++ Language (Moderated) All times are GMT
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2006 phpBB Group
SEO toolkit © 2004-2006 webmedic.