 |
C++Talk.NET C++ language newsgroups
|
| View previous topic :: View next topic |
| Author |
Message |
Andy Guest
|
Posted: Tue Aug 30, 2005 1:22 pm Post subject: Very fast delimited record parsing with boost |
|
|
We need to process a very large amount of delimited variable length
ASCII data in files as large as 3-4 gigs. We need a high performance
parser for this and as always, we have no money to buy one. We are ok
with building one as long as that can be done quick enough and I was
wondering if Boost has a panacea for us. Can anyone help with their
ideas / experience.
I am also very open to any suggestions outside Boost. Any outline on
how to build such a parser would be very welcome. If some comparative
performance figures can be mentioned, it would be of tremendous help.
Any fast C++ library would be of help.
We develop a market analytics tool on HP-UX and Linux on 32/64 bits.
Cheers,
Andy
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
|
|
| Back to top |
|
 |
Tony Delroy Guest
|
Posted: Tue Aug 30, 2005 9:59 pm Post subject: Re: Very fast delimited record parsing with boost |
|
|
Hi Andy,
Could be a classic case of premature optimisation: if you're
processing 3-4 gigs per file, you may find that the disk I/O is the
bottleneck, and if so it's likely to be several orders of magnitude
slower than any parser (that wasn't written either as a joke and/or
designed by a large committee of middle and senior managers). Quite
honestly, parsing delimited records is childs play: do it with
streaming operators or even (s/f)scanf to get you started (depending on
whether you use memory mapping for file I/O or not), and optimise if
needed. If performance turns out to be an issue, and it's not disk
I/O, then my guess is you'll find your data processing takes more time
than your parsing.
Regards,
Tony
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
|
|
| Back to top |
|
 |
David Abrahams Guest
|
Posted: Tue Aug 30, 2005 10:04 pm Post subject: Re: Very fast delimited record parsing with boost |
|
|
"Andy" <garth_rockett (AT) yahoo (DOT) com> writes:
| Quote: | We need to process a very large amount of delimited variable length
ASCII data in files as large as 3-4 gigs. We need a high performance
parser for this and as always, we have no money to buy one. We are ok
with building one as long as that can be done quick enough and I was
wondering if Boost has a panacea for us. Can anyone help with their
ideas / experience.
|
Did you cross-post your question on one of the boost mailing lists?
http://www.boost.org/more/mailing_lists.htm
--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
|
|
| Back to top |
|
 |
Calum Grant Guest
|
Posted: Tue Aug 30, 2005 10:27 pm Post subject: Re: Very fast delimited record parsing with boost |
|
|
Andy wrote:
| Quote: | We need to process a very large amount of delimited variable length
ASCII data in files as large as 3-4 gigs. We need a high performance
parser for this and as always, we have no money to buy one. We are ok
with building one as long as that can be done quick enough and I was
wondering if Boost has a panacea for us. Can anyone help with their
ideas / experience.
I am also very open to any suggestions outside Boost. Any outline on
how to build such a parser would be very welcome. If some comparative
performance figures can be mentioned, it would be of tremendous help.
Any fast C++ library would be of help.
We develop a market analytics tool on HP-UX and Linux on 32/64 bits.
|
I'd use Flex, which you will have *free* with your Linux distro.
http://www.gnu.org/software/flex/
It is table-driven, and extremely fast. It also works with Bison -
which is also table-driven and fast, but by the sound of things you
probably need Flex without Bison.
You may run into performance issues with Spirit - I mean it's fast
enough for most things but it uses recursive descent which is slower
than table-driver parsers. Your bottleneck may be the module consuming
the data.
Calum
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
|
|
| Back to top |
|
 |
Gene Bushuyev Guest
|
Posted: Wed Aug 31, 2005 8:42 am Post subject: Re: Very fast delimited record parsing with boost |
|
|
"Andy" <garth_rockett (AT) yahoo (DOT) com> wrote
| Quote: | We need to process a very large amount of delimited variable length
ASCII data in files as large as 3-4 gigs. We need a high performance
parser for this and as always, we have no money to buy one. We are ok
with building one as long as that can be done quick enough and I was
wondering if Boost has a panacea for us. Can anyone help with their
ideas / experience.
|
You don't need no parser generator. In fact all you need is to get familiar with
istream::ignore() function.
We routinely parse multi-gigabyte data files, and the parser (even for complex
syntax) was never a bottleneck. It may seem counterintuitive, but the bigger the
data size, the smaller is the contribution of the parser, because usually
parsing portion has a linear complexity, while most non-trivial data processing
algorithms tend to have super-linear complexity.
- gene
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
|
|
| Back to top |
|
 |
kanze@gabi-soft.fr Guest
|
Posted: Wed Aug 31, 2005 10:40 am Post subject: Re: Very fast delimited record parsing with boost |
|
|
Calum Grant wrote:
This depends on the tables -- if the regular expressions are
complex, either the tables are very big (causing page faults),
or they are compressed somehow (which slows things down).
| Quote: | It also works with Bison - which is also table-driven and
fast, but by the sound of things you probably need Flex
without Bison.
You may run into performance issues with Spirit - I mean it's
fast enough for most things but it uses recursive descent
which is slower than table-driver parsers.
|
First, as you say, it sounds like he needs Flex. Supposing he
needs something that complicated at all -- my impression is that
std::find might even be enough. I'm not familiar with Spirit,
but if it generates a recursive descent parser, it's closer to
Bison than to Flex (and my experience is that hand written
recursive descent parsers are often faster than what Bison
generates). If std::find is not enough, he might want to look
into boost::regex, although I suspect that there, he will pay a
significant performance price for generality that he probably
doesn't need.
FWIW: my regular expression class builds a DFA lazily; it also
has a function to build the complete DFA at the start, and
functions to extract the DFA elements, in order to put them into
a separate pre-compiled table, which can then be loaded by
another program. It pays for this capability by being a lot
less flexible than boost::regex, and I've often found myself
regretting that I cannot (yet) replace it with the Boost version
in my own code. But it could be that Andy has stumbled on one
of the rare cases where my class would be preferable.
| Quote: | Your bottleneck may be the module consuming the data.
|
IO *is* likely to be a bottleneck. Probably the bottleneck.
For starters, he'd probably gain by doing raw, system level IO;
for sequencial reading, this may even be faster than memory
mapping. I had a similar problem once in the distant past,
where a program had to read lines of text (processed as lines)
very, very quickly. The solution I used was to read the file in
large blocks, scanning from the start to find the 'n', and
processing in place. (The "readLine" function return a pointer
to where the line was located in the buffer.) If I didn't find
a 'n', I copied what was left of the buffer down to the
beginning, and read exactly behind it. On the system I was
using at the time, it also turned out that it was faster if 1) I
ensured that the target address for the system reads was
properly aligned, and 2) all system reads had a size which was a
multiple of the segment size on the disk. (For example: if I
had 13 bytes left without a 'n', I would copy them to the
address buffer + 3, then read buffer size - disk segment size of
data at the address buffer + 16. Also, I simply punted on lines
which were longer than the buffer size, and refused to process
them. Given that the actual lines were normally less than 80
characters, and never more than 136, and the buffer size was
something like 16 KBytes -- very big back then, this was a
reasonable restriction.)
--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
|
|
| Back to top |
|
 |
wpcmame@hotmail.com Guest
|
Posted: Wed Aug 31, 2005 10:42 am Post subject: Re: Very fast delimited record parsing with boost |
|
|
Have a look at the split_iterator in the boost string_algo library.
Unfortunatly it only works with forward iterators.
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
|
|
| Back to top |
|
 |
Frank Chang Guest
|
Posted: Wed Aug 31, 2005 2:55 pm Post subject: Re: Very fast delimited record parsing with boost |
|
|
Gene, Your idea is excellent. I proposed a similar idea in response to
a different OP who had a similar question on Aug 19, 2005 but the OP
never responded.
"slyi, Actually the std::istringstream class can handle more than
white-space character delimiters, as my earlier example attempted to
show:
istringstream iss(line);
while (iss)
{
iss >> word;
iss.ignore(1, ' '); // the delimiter does not have to be white space
}
The ignore member function takes two arguments, the first is the number
of characters to be extracted and ignored and the second is the
delimiter character. In effect, the class istringstream gives you an
elegant tokenizer DFA for free so that you don't have to write your own
tokenizer C++ class."
Gene, I am wondering about what you mean by super-complexity? Is that
like 0(n log n) or 0(n^3)? Thank you.
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
|
|
| Back to top |
|
 |
Gene Bushuyev Guest
|
Posted: Wed Aug 31, 2005 9:01 pm Post subject: Re: Very fast delimited record parsing with boost |
|
|
"Frank Chang" <FrankChang91 (AT) gmail (DOT) com> wrote
....
| Quote: | Gene, I am wondering about what you mean by super-complexity? Is that
like 0(n log n) or 0(n^3)? Thank you.
|
Superlinear complexity means that it grows faster than linear. When your
application is processing gigabytes of data, you can't really afford anything
worse than O(n log n) or you will be out of business in a matter of a few years.
- gene
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
|
|
| Back to top |
|
 |
Calum Grant Guest
|
Posted: Thu Sep 01, 2005 9:16 am Post subject: Re: Very fast delimited record parsing with boost |
|
|
[email]kanze (AT) gabi-soft (DOT) fr[/email] wrote:
| Quote: | Calum Grant wrote:
Your bottleneck may be the module consuming the data.
IO *is* likely to be a bottleneck. Probably the bottleneck.
|
Yes - this is generally the case. Without knowing more about the
problem, it's not possible to know whether the bottleneck will be IO in
or IO out. If the task is to find the sum of a large table of numbers,
then that will be input-bound. If the results are reformatted and
written to a database, then the bottleneck will be in the output.
I've had very good experience with memory-mapped files (e.g.
http://lightwave2.com/persist) where the performance is identical to
normal RAM. So I would say that this would be the ultimate way to go
the squeeze that extra little bit of performance out of the parser.
Instead of parsing a file, point the parser to the memory-mapped data.
However with 3-4 Gigs it's not going to fit into a 32-bit address space,
so you'd need to parse the file in chunks.
Cheers, Calum
| Quote: | For starters, he'd probably gain by doing raw, system level IO;
for sequencial reading, this may even be faster than memory
mapping. I had a similar problem once in the distant past,
where a program had to read lines of text (processed as lines)
very, very quickly. The solution I used was to read the file in
large blocks, scanning from the start to find the 'n', and
processing in place. (The "readLine" function return a pointer
to where the line was located in the buffer.) If I didn't find
a 'n', I copied what was left of the buffer down to the
beginning, and read exactly behind it. On the system I was
using at the time, it also turned out that it was faster if 1) I
ensured that the target address for the system reads was
properly aligned, and 2) all system reads had a size which was a
multiple of the segment size on the disk. (For example: if I
had 13 bytes left without a 'n', I would copy them to the
address buffer + 3, then read buffer size - disk segment size of
data at the address buffer + 16. Also, I simply punted on lines
which were longer than the buffer size, and refused to process
them. Given that the actual lines were normally less than 80
characters, and never more than 136, and the buffer size was
something like 16 KBytes -- very big back then, this was a
reasonable restriction.)
--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
|
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
|
|
| Back to top |
|
 |
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
|