C++Talk.NET Forum Index C++Talk.NET
C++ language newsgroups
 
Archives   FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Rabin-Karp and STL find

 
Post new topic   Reply to topic    C++Talk.NET Forum Index -> C++ Language (Moderated)
View previous topic :: View next topic  
Author Message
glenlow@pixelglow.com
Guest





PostPosted: Tue Dec 14, 2004 8:30 pm    Post subject: Rabin-Karp and STL find Reply with quote



Just rereading Sedgewick's Algorithms and came across a fairly recent
innovation for string searching. Rabin-Karp turns the search string
into a hash value and then computing a hash value for the string and
sliding it along. I was wondering if any of the STL implementations use
this for std::find and string::find and if not, any particular reasons?
Cheers,
Glen Low, Pixelglow Software
www.pixelglow.com


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Back to top
Sebastian Redl
Guest





PostPosted: Wed Dec 15, 2004 11:29 am    Post subject: Re: Rabin-Karp and STL find Reply with quote



[email]glenlow (AT) pixelglow (DOT) com[/email] wrote:

Quote:
Just rereading Sedgewick's Algorithms and came across a fairly recent
innovation for string searching. Rabin-Karp turns the search string
into a hash value and then computing a hash value for the string and
sliding it along. I was wondering if any of the STL implementations use
this for std::find and string::find and if not, any particular reasons?
Cheers,
Glen Low, Pixelglow Software
www.pixelglow.com

Not for std::find. The main issue lies in the architecture of the algorithm
functions. std::find is declared like so:

template <class InputIterator>
InputIterator find(InputIterator start, InputIterator end,
typename InputIterator::const_reference value);

It is impossible to specialize this declaration for InputIterator<char>,
because such a declaration wouldn't be meaningful. You would need a
specialization for char*, one for vector<char>::iterator, one for
list<char>::iterator, ..., and another for each user-defined type.
It is therefore impossible to specialize on the type of searched object. The
only algorithm that works in such a generic environment is the simple one.

I don't know about string::find. Again, the problem is that, in theory,
basic_string could be instantiated with every type that fulfills the
requirements, and I don't think "can be used in whatever continuous hashing
algorithm the STL implementors come up with" is among these.
Specializations for char and wchar_t would be possible in this case,
though. I think the simplest way of finding out would be to check the
source. STLPort is freely available, SGI's STL probably does the same as
STLPort, Dinkumware's source is viewable if you have Visual Studio (look in
the xstring header).
Ok, after checking GCC's STL source, I just realized that the hashing method
is not implementable given the requirements of std::basic_string. The
reason is that equality between characters is not defined in terms of
numerical value, but in terms of Traits::eq. This means that two identical
strings (given the traits) could produce different hashes and the hash
method would fail. Since Traits needs not have either a hash producer or a
character normalizer (that is, given a character c, return the
representative of the equivalence group c is part of), it is not possible
to reliably create a hash.

There are other searching algorithms though, I wonder if anyone has thought
of using them. They only require the eq test.

--
Sebastian Redl

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]


Back to top
Chris Jefferson
Guest





PostPosted: Wed Dec 15, 2004 1:05 pm    Post subject: Re: Rabin-Karp and STL find Reply with quote



[email]glenlow (AT) pixelglow (DOT) com[/email] wrote:
Quote:
Just rereading Sedgewick's Algorithms and came across a fairly recent
innovation for string searching. Rabin-Karp turns the search string
into a hash value and then computing a hash value for the string and
sliding it along. I was wondering if any of the STL implementations use
this for std::find and string::find and if not, any particular reasons?
Cheers,
Glen Low, Pixelglow Software
www.pixelglow.com

Rabin-Karp would actually not be useful for "std::find", as it is only

really useful when you are looking to match a string of
characters/ints/etc, not a single element.

Even when you are trying to search for a substring of a string, it isn't
the most efficent search algorithm available. It only really comes into
it's own when you are looking through one string for multiple
substrings, which the STL doesn't currently support (as it's quite a
specialised kind of thing to want to do).

There are of course the other problems that you'd have to be careful
about where you invoked it.

It is true that for things like search_n, and substrings searching, many
implementation of the STL use somewhat simplistic algorithms. This is
improving over time! :)

Chris

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Back to top
kanze@gabi-soft.fr
Guest





PostPosted: Wed Dec 15, 2004 1:07 pm    Post subject: Re: Rabin-Karp and STL find Reply with quote

Sebastian Redl wrote:
Quote:
glenlow (AT) pixelglow (DOT) com wrote:

Just rereading Sedgewick's Algorithms and came across a fairly
recent innovation for string searching. Rabin-Karp turns the search
string into a hash value and then computing a hash value for the
string and sliding it along.

Offhand, this sounds like it would be significantly slower that
Boyer-Moore.

Quote:
I was wondering if any of the STL implementations use
this for std::find and string::find and if not, any particular
reasons?
Cheers,

Not for std::find. The main issue lies in the architecture of the
algorithm functions. std::find is declared like so:

template <class InputIterator
InputIterator find(InputIterator start, InputIterator end,
typename InputIterator::const_reference value);

It is impossible to specialize this declaration for
InputIterator meaningful. You would need a specialization for char*, one for
vector<char>::iterator, one for list<char>::iterator, ..., and
another
for each user-defined type. It is therefore impossible to specialize
on the type of searched object. The only algorithm that works in such
a generic environment is the simple one.

I don't know about string::find. Again, the problem is that, in
theory, basic_string could be instantiated with every type that
fulfills the requirements, and I don't think "can be used in whatever
continuous hashing algorithm the STL implementors come up with" is
among these.
Specializations for char and wchar_t would be possible in this case,
though. I think the simplest way of finding out would be to check the
source. STLPort is freely available, SGI's STL probably does the same
as STLPort, Dinkumware's source is viewable if you have Visual Studio
(look in the xstring header).

Ok, after checking GCC's STL source, I just realized that the hashing
method is not implementable given the requirements of
std::basic_string. The reason is that equality between characters is
not defined in terms of numerical value, but in terms of
Traits::eq.

The the behavior of Traits::eq for std::string is defined as == on the
characters. That's not the problem.

I think that the main reason for just using brute force in std::string
is that all of the other techniques involve some significant set up
time. With BM, you get this back in spades as soon as the string is of
any length, but I suspect that there is an implied assumption that
std::string will be fairly short -- typically too short to justify the
setup time.

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Back to top
glenlow@pixelglow.com
Guest





PostPosted: Wed Dec 15, 2004 6:17 pm    Post subject: Re: Rabin-Karp and STL find Reply with quote

[email]kanze (AT) gabi-soft (DOT) fr[/email] wrote:
Quote:
Sebastian Redl wrote:
[email]glenlow (AT) pixelglow (DOT) com[/email] wrote:

Just rereading Sedgewick's Algorithms and came across a fairly
recent innovation for string searching. Rabin-Karp turns the
search
string into a hash value and then computing a hash value for the
string and sliding it along.

Offhand, this sounds like it would be significantly slower that
Boyer-Moore.

Not necessarily although I have no benchmarks to back me up. The
innovation is that sliding the hash value doesn't involve recomputing
it for the entire substring to be tested, just the next character. The
hash function used is: treat the string as a integer of base 256, then
do a modulo prime number. (I happen to think it would work even better
on 64-bit machines because of this -- who says 64-bit is just for
address space!)

It also has less startup cost than Boyer-Moore since you don't actually
keep a hash table, part strings or state machines around. The expense
is in doing a integer modulus per character, as opposed to rescanning
(in memory) with the brute force method -- a tradeoff of CPU power vs.
memory access speed.

Quote:
I think that the main reason for just using brute force in
std::string
is that all of the other techniques involve some significant set up
time. With BM, you get this back in spades as soon as the string is
of
any length, but I suspect that there is an implied assumption that
std::string will be fairly short -- typically too short to justify
the
setup time.

Cheers,
Glen Low, Pixelglow Sofware
www.pixelglow.com


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Back to top
Allan W
Guest





PostPosted: Thu Dec 16, 2004 1:42 pm    Post subject: Re: Rabin-Karp and STL find Reply with quote

Sebastian Redl wrote:
Quote:
I don't know about string::find. Again, the problem is that, in
theory,
basic_string could be instantiated with every type that fulfills the
requirements, and I don't think "can be used in whatever continuous
hashing
algorithm the STL implementors come up with" is among these.

I started to draft a reply to this, pointing out that
std::basic_string<>
is only supposed to be instantiated with char and wchar_t. But I looked
it up in the standard first, and I was surprised!

21.3 says that the argument for std::basic_string is any "char-like
type."
But 21/1 says that "char-like types" are "characters ... of any POD
type"!
It's legal to have a string of unsigned longs, doubles, or even
structures!

Being a curious type, I immediately tried to construct a string of
doubles.
I had no trouble instantiating and populating it, but of course there
was
no way I could write it out to std::cout! (It isn't defined).

Quote:
Specializations for char and wchar_t would be possible in this case,
though.

Seems reasonable. If we can specialize output for these types, we can
also specialize std::find for these types.

Quote:
Ok, after checking GCC's STL source, I just realized that the hashing
method
is not implementable given the requirements of std::basic_string. The
reason is that equality between characters is not defined in terms of
numerical value, but in terms of Traits::eq. This means that two
identical
strings (given the traits) could produce different hashes and the
hash
method would fail. Since Traits needs not have either a hash producer
or a
character normalizer (that is, given a character c, return the
representative of the equivalence group c is part of), it is not
possible
to reliably create a hash.

But perhaps std::find could be specialized for std::string and
std::wstring -- in other words, for normal basic_strings using the
default traits and allocators.


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Back to top
Luís Pedro Coelho
Guest





PostPosted: Thu Dec 16, 2004 1:43 pm    Post subject: Re: Rabin-Karp and STL find Reply with quote

[email]glenlow (AT) pixelglow (DOT) com[/email] wrote:
Quote:
I was wondering if any of the STL implementations use
this for std::find and string::find

It would be possible for string::find or std::search, not std::find.

ciau,
luispedro


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Back to top
A.G.McDowell
Guest





PostPosted: Thu Dec 16, 2004 1:43 pm    Post subject: Re: Rabin-Karp and STL find Reply with quote

In article <1103114432.012649.143480 (AT) z14g2000cwz (DOT) googlegroups.com>,
[email]kanze (AT) gabi-soft (DOT) fr[/email] writes
Quote:

I think that the main reason for just using brute force in std::string
is that all of the other techniques involve some significant set up
time. With BM, you get this back in spades as soon as the string is of
any length, but I suspect that there is an implied assumption that
std::string will be fairly short -- typically too short to justify the
setup time.

--
My choice would be Knuth-Morris-Pratt, which I believe has worst case

time O(size of pattern) + O(size of string to be searched). It requires
only tests for equality. This would be consistent with other STL
requirements for worst case complexity to be considered.

However, it does require additional space proportional to the size of
the pattern to be searched for, which might also show up in the setup
time, not to mention that you will probably end up throwing an exception
if you can't find it.
--
A.G.McDowell


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Back to top
kanze@gabi-soft.fr
Guest





PostPosted: Thu Dec 16, 2004 2:03 pm    Post subject: Re: Rabin-Karp and STL find Reply with quote

[email]glenlow (AT) pixelglow (DOT) com[/email] wrote:
Quote:
kanze (AT) gabi-soft (DOT) fr wrote:
Sebastian Redl wrote:
[email]glenlow (AT) pixelglow (DOT) com[/email] wrote:

Just rereading Sedgewick's Algorithms and came across a
fairly recent innovation for string
searching. Rabin-Karp turns the search string into a
hash value and then computing a hash value for the
string and sliding it along.

Offhand, this sounds like it would be significantly slower that
Boyer-Moore.

Not necessarily although I have no benchmarks to back me
up. The innovation is that sliding the hash value doesn't
involve recomputing it for the entire substring to be tested,
just the next character.

I understand that, but you still have to look at every
character. With BM, you typically only look a n/m characters,
where n is the length of the search space, and m the length of
the search string.

Quote:
The hash function used is: treat the string as a integer of
base 256, then do a modulo prime number. (I happen to think it
would work even better on 64-bit machines because of this --
who says 64-bit is just for address space!)

On my machine, an integral modulo (or division) is about the
most expensive operation there is. Almost as expensive as
reading a value from memory. So the speed should be about the
same as a KMP search (which involves reading only a single value
from memory for each character) -- maybe a little faster when
there is a cache miss.

I've benchmarked BM vs. KMP in the past -- BM was something
between 5 and 10 times faster than KMP. So if your algorithm is
only a little faster than KMP, BM will still beat it.

Quote:
It also has less startup cost than Boyer-Moore since you don't
actually keep a hash table, part strings or state machines
around.

I understand that. The startup cost of BM isn't outrageous, but
it is enough to make it not worthwhile for very short strings
(say less than a 100 characters). Ditto KMP, of course -- my
benchmarks involved lookups in hundreds of thousands of
characters, where the startup cost was amortized.

Quote:
The expense is in doing a integer modulus per character, as
opposed to rescanning (in memory) with the brute force method
-- a tradeoff of CPU power vs. memory access speed.

OK. I think I understand. It's significantly faster than brute
force, with a startup time of almost zero, so it can be used on
very short strings. So it might be an interesting solution for
things like string::find.

Might, because while the brute force method theoretically
involves rescanning the same character multiple times, in
practice, you will generally get a mismatch on the first, or
possibly the second character.

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Back to top
kanze@gabi-soft.fr
Guest





PostPosted: Fri Dec 17, 2004 1:30 pm    Post subject: Re: Rabin-Karp and STL find Reply with quote

A.G.McDowell wrote:
Quote:
In article <1103114432.012649.143480 (AT) z14g2000cwz (DOT) googlegroups.com>,
[email]kanze (AT) gabi-soft (DOT) fr[/email] writes

I think that the main reason for just using brute force in
std::string is that all of the other techniques involve some
significant set up time. With BM, you get this back in spades
as soon as the string is of any length, but I suspect that
there is an implied assumption that std::string will be
fairly short -- typically too short to justify the setup
time.

My choice would be Knuth-Morris-Pratt, which I believe has
worst case time O(size of pattern) + O(size of string to be
searched). It requires only tests for equality. This would be
consistent with other STL requirements for worst case
complexity to be considered.

And you also use heap sort instead of quick sort:-).

Seriously, there is an issue here -- typically, BM is
O(searchSpaceSize/patternSize), whereas KMP is
O(searchSpaceSize+patternSize). Given that patternSize is almost
always greater than 1, BM wins by a wide margin. Typically:
you're right in pointing out that KMP has a worst case which is
equal to its typical case, whereas BM has a theoretical worst
case of O(searchSpaceSize*patternSize), just as quick sort has a
worst case of O(n^2). In practice, the worst cases don't
occur -- I don't know if you could even construct one for BM.
But if you do need an absolute upper bound on time, you do
choose KMP and heap sort, rather than BM and quick sort.

KMP has a couple of other interesting advantages, too: you never
back up which was a real advantage when you were searching on
mag tape, but is still somewhat of an advantage today if the
data in on disk -- and a definite advantage as well if you're
scanning data eavesdropping on an Internet connection:-). And it
is trivial to extend KMP to search for several strings in
parallel -- if the patternSize is small enough to be ignored
compared to the searchSpaceSize, there is essentially no
difference in time between searching for one string, and
searching for 100.

Note that while the worst case for brute force is
O(searchSpaceSize*patternSize), the typical complexity is also
much less, since you will typically hit a mismatch in the first
couple of characters in the pattern. In many cases, I suspect
that the complexity will in fact be O(searchSpaceSize), and
there will only be a constant factor of difference between brute
force and KMP.

Quote:
However, it does require additional space proportional to the
size of the pattern to be searched for, which might also show
up in the setup time, not to mention that you will probably
end up throwing an exception if you can't find it.

This is true for both KMP and BM, and means that neither are
probably appropriate for std::string::find. The setup time for
Rabin-Karp is significantly less (the time to calculate the hash
of the pattern), and it uses no additional memory. I'm still
sceptical whether it is worth it for something like
std::string::find, where I would expect that the search space is
typically very, very small.

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]


Back to top
kanze@gabi-soft.fr
Guest





PostPosted: Fri Dec 17, 2004 1:34 pm    Post subject: Re: Rabin-Karp and STL find Reply with quote

Allan W wrote:
Quote:
Sebastian Redl wrote:

I don't know about string::find. Again, the problem is that,
in theory, basic_string could be instantiated with every
type that fulfills the requirements, and I don't think "can
be used in whatever continuous hashing algorithm the STL
implementors come up with" is among these.

I started to draft a reply to this, pointing out that
std::basic_string<> is only supposed to be instantiated with
char and wchar_t. But I looked it up in the standard first,
and I was surprised!

21.3 says that the argument for std::basic_string is any
"char-like type." But 21/1 says that "char-like types" are
"characters ... of any POD type"! It's legal to have a string
of unsigned longs, doubles, or even structures!

Sort of. For anything other than char or wchar_t, you have to
provide a traits class as well. The standard does NOT require
the implementation from providing a generic version of
char_traits. And you are only allowed to specialize char_traits
on a user defined type.

(In practice, I know that both g++ and Dinkumware DO provide a
generic version of char_traits. Incompatible, of course, since
the standard doesn't give any hint as to what should go in if
the implementation does provide it.)

Quote:
Being a curious type, I immediately tried to construct a
string of doubles. I had no trouble instantiating and
populating it, but of course there was no way I could write it
out to std::cout! (It isn't defined).

Of course not. You have to instantiate a basic_ostream with the
"character type". Have fun:-).

FWIW: the implementation in g++ will (or would, I haven't tried
it lately) core dump on a Sparc if you instantiated basic_string
with double or long long, because of alignment problems.

Quote:
Specializations for char and wchar_t would be possible in
this case, though.

Seems reasonable. If we can specialize output for these types,
we can also specialize std::find for these types.

Ok, after checking GCC's STL source, I just realized that
the hashing method is not implementable given the
requirements of std::basic_string. The reason is that
equality between characters is not defined in terms of
numerical value, but in terms of Traits::eq. This means that
two identical strings (given the traits) could produce
different hashes and the hash method would fail. Since
Traits needs not have either a hash producer or a character
normalizer (that is, given a character c, return the
representative of the equivalence group c is part of), it is
not possible to reliably create a hash.

But perhaps std::find could be specialized for std::string and
std::wstring -- in other words, for normal basic_strings using
the default traits and allocators.

It could be. More interesting would be some sort of partial
specialization on Iterator::value_type. Well, not really a
partial specialization, but you should be able to do some
template meta-programming tricks to specialize in this case. (I
think that the technique suggested will work with most, if not
all integral types.)

Whether it is worth it is another question.

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Back to top
Display posts from previous:   
Post new topic   Reply to topic    C++Talk.NET Forum Index -> C++ Language (Moderated) All times are GMT
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2006 phpBB Group
SEO toolkit © 2004-2006 webmedic.