
raw strings in C++0x

 
Johan Hahn
Posted: Thu Oct 16, 2003 2:38 pm    Post subject: raw strings in C++0x



Hi

(I posted this two days ago but it didn't show up. FWIW I didn't even
receive a moderator reception confirmation.)

As I understand it, regular expressions will be added to the next
standard. However, I don't see a proposal for a raw string format, like
r""-strings in Python. Clearly that would be very nice to have, since
regular expressions do contain an awful lot of backslashes. I would even
go as far as saying it is essential for regex compatibility with other
languages. Have I missed it, or is it waiting for N1429 to finish, anyone?

(By raw strings I mean simple cstrings that can contain special
characters which get translated by the compiler.)

.....johahn


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
Ron Natalie
Posted: Thu Oct 16, 2003 8:42 pm    Post subject: Re: raw strings in C++0x




"Johan Hahn" <johahn2003 (AT) home (DOT) se> wrote


Quote:
(By raw strings I mean simple cstrings that can contain special
characters
which gets translated by the compiler.)

You are asking for a new type of STRING LITERAL that
ignores the meaning of backslash? Literals are really the
only place they have special meaning. Once converted
to a char (or arrays of them) the strings are raw.



Johan Hahn
Posted: Fri Oct 17, 2003 3:39 pm    Post subject: new string literal (was: raw strings in C++0x)



Ron Natalie wrote:
Quote:

"Johan Hahn" <johahn2003 (AT) home (DOT) se> wrote in message
news:cHrjb.29400$mU6.76542 (AT) newsb (DOT) telia.net...

(By raw strings I mean simple cstrings that can contain special
characters which gets translated by the compiler.)

You are asking for a new type of STRING LITERAL that
ignores the meaning of backslash? Literals are really the
only place they have special meaning. Once converted
to a char (or arrays of them) the strings are raw.

Forget the raw string term... I had it all backwards :)

Yes, a new string literal is what I'm looking for, that ignores
the meaning of backslash. (In Python they are prefixed with
an r as in r"c:\program files\whatever".) The primary use would not
be to allow Windows-style path names as in the example above.
It would make using regular expressions across languages and
from resources like www.regexlib.com a little easier.

The problem is that the characters ( ) [ ] ? + - * . all have a special
meaning in regular expression syntax. To express them literally you
have to put a backslash in front. In C++ the backslash is used as
an escape character in string literals for denoting newline etc., so
you then have to prefix every special regex character
with two backslashes. This makes the syntax incompatible
with other languages, and I believe this is why Python added
raw strings.
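
A minimal sketch of the doubling, assuming a C++11 compiler. The R"(...)" form used here is the raw literal that C++11 later standardized; it is not the r"..." syntax proposed in this post, and the variable names are only illustrative.

```cpp
#include <string>

// Regex text the engine should see: digits, a literal dot, digits.
//   \d+\.\d+

// Ordinary literal: each backslash is doubled for the compiler.
const std::string escaped = "\\d+\\.\\d+";

// C++11 raw literal: everything between R"( and )" is taken verbatim.
const std::string raw = R"(\d+\.\d+)";

// Both spellings produce the same eight characters.
bool same_pattern() { return escaped == raw; }
```

The raw form can be pasted into another language or a regex tester unchanged, which is the usability point made above.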

For example, to test if a user supplied string is a valid email
address it must not contain any illegal characters from RFC822
§3.3. (This is not sufficient but it makes a good example...)

#include <algorithm>
#include <string>
#include <boost/regex.hpp>

bool is_legal(const std::string& email)
{
    typedef std::string::const_iterator ci;
    // Iterators are held by value: a non-const reference cannot bind
    // to the temporaries returned by begin(), end() and std::find().
    const ci beg = email.begin();
    const ci end = email.end();
    const ci at = std::find(beg, end, '@');
    if (at == end) return false;

    // The illegal characters are: ( ) < > @ , ; : \ " . [ ]
    // Every backslash of the regex is doubled for the C++ compiler.
    static const boost::regex not_illegal("^[^\\(\\)<>@,;:\\\\\"\\.\\[\\]]+$");
    return boost::regex_match(beg, at, not_illegal) &&
           boost::regex_match(at + 1, end, not_illegal);
}

The pattern called not_illegal above is very ugly and somewhat
hard to read. Instead, I want to be able to write:

static boost::regex not_illegal(r"^[^\(\)<>@,;:\\\"\.\[\]]+$");
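
For comparison, a sketch of the same check with the facilities C++11 eventually shipped: std::regex and the R"delim(...)delim" raw literal. It assumes the default ECMAScript grammar; the function name is_legal_cxx11 and the delimiter "re" are arbitrary choices, not part of the original code.

```cpp
#include <algorithm>
#include <regex>
#include <string>

// Rejects addresses whose local part or domain contains any of
// ( ) < > @ , ; : \ " . [ ]   (deliberately simplistic, as in the post).
bool is_legal_cxx11(const std::string& email)
{
    const auto at = std::find(email.begin(), email.end(), '@');
    if (at == email.end()) return false;

    // Raw literal: only the regex-level escapes remain (\\ for a
    // backslash, \[ and \] for the brackets); the quote needs none.
    static const std::regex not_illegal(R"re(^[^()<>@,;:\\".\[\]]+$)re");
    return std::regex_match(email.begin(), at, not_illegal) &&
           std::regex_match(at + 1, email.end(), not_illegal);
}
```

Because "." is in the illegal set, this accepts "user@example" but rejects "user@example.org", which matches the caveat that the check is only an illustration.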

This is perhaps not the most critical addition to C++, but it is a
simple and IMHO useful one. Or perhaps it could be
accomplished with N1511 "Literals for user-defined types"?
I didn't quite get that one on the first read.

http://std.dkuug.dk/jtc1/sc22/wg21/docs/papers/2003/n1511.pdf

.....johahn


Ron Natalie
Posted: Fri Oct 17, 2003 4:46 pm    Post subject: Re: new string literal (was: raw strings in C++0x)


"Johan Hahn" <johahn2003 (AT) home (DOT) se> wrote

Quote:
(In Python they are prefixed with
an r as in r"c:\program files\whatever".)

I guess as long as you don't need a quote itself in the string.



WW
Posted: Sat Oct 18, 2003 9:02 am    Post subject: Re: new string literal (was: raw strings in C++0x)

Johan Hahn wrote:
[SNIP]
Quote:
Yes, a new string literal is what I'm looking for, that ignores
the meaning of backslash. (In Python they are prefixed with
an r as in r"c:\program files\whatever".) Primary use would not
be to allow windows style path names as in the example above.

IMO that is something Microsoft decided (to use backslashes instead of
slashes), so I honestly think it is not the C++ language's task to fix that.

Quote:
It would make using regular expression across languages and
from resources like www.regexlib.com a little easier.
[SNIP]


IMO that one is a valid argument. Regular expressions are ancient, and
their syntax (just like anything from UNIX) uses the backslash as an escape
character. However there are two questions here to ask: how often does one
need the backslash in the regular expressions and how frequently will those
regular expressions be (or have to be) in the program text?

Should these new strings be C-style? I mean with the \0 at the end? Or
should we just enable some sort of multi-character-literal?

I find it very rare that a path should be embedded into a C++ program as a
literal. I also find it rare in my use of regexps (in NEdit, egrep, perl)
that I have to use \ to escape something. But of course this might be only
my special case, that is why I ask.

--
WW aka Attila



Serve La
Posted: Sat Oct 18, 2003 11:02 pm    Post subject: Re: raw strings in C++0x

"Johan Hahn" <johahn2003 (AT) home (DOT) se> wrote

Quote:
As I understand it, regular expressions will be added to the next
standard.

Where did you read that?



Carl Barron
Posted: Sun Oct 19, 2003 10:34 am    Post subject: Re: raw strings in C++0x

Serve La <i (AT) bleat (DOT) nospam.com> wrote:

Quote:
"Johan Hahn" <johahn2003 (AT) home (DOT) se> wrote in message
news:cHrjb.29400$mU6.76542 (AT) newsb (DOT) telia.net...
As I understand it, regular expressions will be added to the next
standard.

Where did you read that?


There is a proposal on a TR1 page regarding regular expression
recognition; maybe that is what he means. I think this is the URL:
http://std.dkuug.dk/jtc1/sc22/wg21/docs/papers/2003/n1429.htm

Johan Hahn
Posted: Mon Oct 20, 2003 8:59 pm    Post subject: Re: new string literal (was: raw strings in C++0x)

WW wrote:

Quote:
Johan Hahn wrote:
Yes, a new string literal is what I'm looking for, that ignores
the meaning of backslash. (In Python they are prefixed with
an r as in r"c:\program files\whatever".)
[...]
It would make using regular expression across languages and
from resources like www.regexlib.com a little easier.

IMO that one is a valid argument. Regular expressions are ancient,
and their syntax (just like anything from UNIX) uses the backslash as
an escape character. However there are two questions here to ask:
how often does one need the backslash in the regular expressions?

I could think of the following places where backslashes are used in
regular expressions:
[1] To match against any character with special meaning in regex
syntax. For example: ( ) [ ] . | ? * + -
[2] To denote special regex escape sequences. (\w \s \< \> \b etc...)
[3] To match original escape sequences. (\a \n \t etc...)
[4] To denote groups. (as in \1)
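
The four uses listed above can be sketched with C++11 std::regex, assuming its default ECMAScript grammar. The function names are made up for illustration, and \< and \> are left out because that grammar does not define them. Every backslash is doubled because these are ordinary string literals.

```cpp
#include <regex>
#include <string>

// [1] Escaping a metacharacter: \. matches a literal dot.
bool has_decimal(const std::string& s)
{
    return std::regex_search(s, std::regex("\\d+\\.\\d+"));
}

// [3] An escape the regex engine interprets itself: \t matches a tab.
bool has_tab_between_a_and_b(const std::string& s)
{
    return std::regex_search(s, std::regex("a\\tb"));
}

// [2] and [4] The classes \w and \s, the boundary \b and the
// back-reference \1: finds a word that is immediately repeated.
bool has_doubled_word(const std::string& s)
{
    return std::regex_search(s, std::regex("\\b(\\w+)\\s+\\1\\b"));
}
```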

Quote:
and how frequently will those regular expressions be (or have to be)
in the program text?

I wrote a script that fetched all 379 patterns from www.regexlib.com
to gather some statistics. It showed that backslashes are used in 79%
of the patterns and make up 5.8% of all characters. This makes the
backslash the 6th most common character in those regular expressions,
after [ ] ( ) and -. These are quite high numbers, though I don't know
how representative this subset is of all regular expressions.
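
The script itself was not posted. As a rough sketch of how two such figures could be computed, assuming the patterns are already available as strings (the names BackslashStats and count_backslashes are hypothetical):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct BackslashStats {
    double patterns_with_backslash; // fraction of patterns containing a backslash
    double backslash_share;         // backslashes divided by all characters
};

BackslashStats count_backslashes(const std::vector<std::string>& patterns)
{
    std::size_t with = 0, slashes = 0, total = 0;
    for (const std::string& p : patterns) {
        const std::size_t n =
            static_cast<std::size_t>(std::count(p.begin(), p.end(), '\\'));
        if (n > 0) ++with;
        slashes += n;
        total += p.size();
    }
    BackslashStats s;
    s.patterns_with_backslash =
        patterns.empty() ? 0.0 : static_cast<double>(with) / patterns.size();
    s.backslash_share =
        total == 0 ? 0.0 : static_cast<double>(slashes) / total;
    return s;
}
```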

Quote:
Should these new string be C-style? I mean with the \0 at the end?
Or should we just enable some sort of multi-character-literal?

I don't understand what you mean by multi-character-literal. My
thought was that the new literals would be translated into plain old
C-style strings by the compiler, so it would be almost like a macro
that escaped every single \ with another \ (except where \ is followed
by "). That would mean r"foo\n" is exactly the same as "foo\\n".
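
The translation rule described above can be sketched as a small function working on the text between the quotes. This only illustrates the proposed rule, it is not how any compiler implements raw literals, and the name to_ordinary_literal_body is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Turns the body of a proposed r"..." literal into the body of the
// equivalent ordinary literal: every backslash is doubled, except a
// backslash directly followed by a double quote, which is kept as \".
std::string to_ordinary_literal_body(const std::string& raw_body)
{
    std::string out;
    for (std::size_t i = 0; i < raw_body.size(); ++i) {
        const char c = raw_body[i];
        if (c == '\\' && i + 1 < raw_body.size() && raw_body[i + 1] == '"') {
            out += "\\\"";  // keep the escaped quote unchanged
            ++i;            // skip the quote that was just copied
        } else if (c == '\\') {
            out += "\\\\";  // one backslash becomes two
        } else {
            out += c;
        }
    }
    return out;
}
```

Given the five characters foo\n it returns the six characters foo\\n, which is the equivalence stated in the post.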

Quote:
I find it very rare that a path should be embedded into a C++
program as a literal. I also find it rare in my use of regexps (in
NEdit, egrep, perl) that I have to use \ to escape something. But
of course this might be only my special case, that is why I ask.

Then you probably search and replace parts of words most of the
time, eh? Whereas a programmer uses regular expressions for
validating input such as urls, filenames, email addresses, social
security numbers and such.

I think if the feature was added everyone would use it, as is done
in Python and that is why I asked if it was planned for or not.

.....johahn


Johan Hahn
Posted: Mon Oct 20, 2003 9:01 pm    Post subject: Re: raw strings in C++0x

Serve La wrote:
Quote:
"Johan Hahn" <johahn2003 (AT) home (DOT) se> wrote in message
news:cHrjb.29400$mU6.76542 (AT) newsb (DOT) telia.net...
As I understand it, regular expressions will be added to the next
standard.

Where did you read that?

It has not been finally decided upon yet since the proposal is still
being revised. However, based on the amount of work put into
it and a report by Herb Sutter [1] from one of the WG21/J16
meetings I think the intentions are pretty clear:

"There are three broad categories of things we know we'd like to
add to the C++ Standard library:
1. C99 compatibility. [...]
2. Filling in gaps. [...]
3. Useful facilities. Now, just because a facility is useful doesn't
mean it has to be standardized. But some facilities, such as strings,
are so widely used that it would be embarrassing to fail to have
them in a standard. We do in fact have strings in C++98 (unlike
pre-standard C++) for just this reason; what we don't have are
things like standard support for regular expression matching and
tokenization, both of which are common tasks we want to
perform on strings in particular and on iterator ranges and streams
in general. [...] "

And later in that article about what implementation to choose
(boost or GRETA):

"I personally think it's likely that one (or some combination) of
them will be adopted into the C++ Standard library, but at this
point the field is wide open."

[1] http://www.cuj.com/documents/s=7984/cujcexp2004sutter/sutter.htm

.....johahn



Johan Hahn
Posted: Mon Oct 20, 2003 9:04 pm    Post subject: Re: new string literal (was: raw strings in C++0x)

Ron Natalie wrote:
Quote:

"Johan Hahn" <johahn2003 (AT) home (DOT) se> wrote

(In Python they are prefixed with
an r as in r"c:\program files\whatever".)

I guess as long as you don't need a quote itself in the string.

Yes, there are subtleties... quotes would still require an
escaping backslash, and other special characters such as
newline and tab would have to be written as hex values.

But let's focus on the feature as a way to simplify work with
regular expressions. Otherwise a C++ programmer has to work
with them at a "higher level of distraction" than necessary.
This might be enough to distract a beginner.

For example, as a pattern to find all links within an HTML file
to .org domains one might write:

const char* link = "<a href=\"(\\w+?\\.org)\">\\w*</a>";
or:
const char* link = r"<a href=\"(\w+?\.org)\">\w*</a>";

If the end tag comes on a new line we should be able to
build a C-style string out of several smaller ones, just like normal:

const char* link = r"<a href=\"(\w+?\.org)\">\w*" "\n</a>";
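
A sketch of the same pattern with the raw literal that C++11 later adopted, assuming std::regex with its default ECMAScript grammar. This pattern contains the sequence )" so the plain R"(...)" form would end too early; the delimiter "re" is an arbitrary choice, and first_org_link is a hypothetical helper.

```cpp
#include <regex>
#include <string>

// Ordinary literal: quotes and backslashes both need escaping.
const std::string link_escaped = "<a href=\"(\\w+?\\.org)\">\\w*</a>";

// Raw literal with a custom delimiter: no escaping for the compiler at all.
const std::string link_raw = R"re(<a href="(\w+?\.org)">\w*</a>)re";

// Adjacent literals are concatenated, raw or not, so an ordinary "\n"
// supplies the newline that a raw literal cannot spell as an escape.
const std::string link_multiline =
    R"re(<a href="(\w+?\.org)">\w*)re" "\n</a>";

// Returns the first captured .org target, or an empty string if none matches.
std::string first_org_link(const std::string& html)
{
    static const std::regex re(link_raw);
    std::smatch m;
    return std::regex_search(html, m, re) ? m[1].str() : std::string();
}
```

Unlike the r"..." proposal, this form needs no backslash before the embedded quotes, at the cost of choosing a delimiter.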

The question is how portable regular expressions really are
between languages anyway. I don't think Java and .NET
follow Perl or ECMAScript syntax like boost::regex does.
And how big a portability win is it when these languages
don't provide this feature and only use old-style strings?

So the real gain should be usability and not portability.

.....johahn


Ben Liddicott
Posted: Sat Oct 25, 2003 8:32 am    Post subject: Re: new string literal (was: raw strings in C++0x)

"WW" <wolof (AT) freemail (DOT) hu> wrote

Quote:
Johan Hahn wrote:
Yes, a new string literal is what I'm looking for, that ignores
the meaning of backslash. (In Python they are prefixed with
an r as in r"c:\program files\whatever".) Primary use would not
be to allow windows style path names as in the example above.

IMO that is something Microsoft decided about (to use backslashes instead of
slash) so I honestly think it is not the C++ languages task to fix that.

To be fair, that was back in 1983 before C was confirmed as the dominant language over Pascal. In 1989 when I was learning to program, Pascal
was still the language of choice where I was. Object Pascal is still a really good language too (though it's not my particular cup of tea).

You need to address your bias. Windows isn't a broken Unix. It is different: It isn't unix at all, it has an entirely different lineage. C, on
the other hand, is tightly bound up with unix.

--
Cheers,
Ben Liddicott


WW
Posted: Sun Nov 09, 2003 11:28 am    Post subject: Re: new string literal (was: raw strings in C++0x)

Ben Liddicott wrote:
Quote:
"WW" <wolof (AT) freemail (DOT) hu> wrote in message
news:bmp6lb$49e$1 (AT) phys-news1 (DOT) kolumbus.fi...
Johan Hahn wrote:
Yes, a new string literal is what I'm looking for, that ignores
the meaning of backslash. (In Python they are prefixed with
an r as in r"c:\program files\whatever".) Primary use would not
be to allow windows style path names as in the example above.

IMO that is something Microsoft decided about (to use backslashes
instead of slash) so I honestly think it is not the C++ languages
task to fix that.

To be fair, that was back in 1983 before C was confirmed as the
dominant language over Pascal. In 1989 when I was learning to
program, Pascal was still the language of choice where I was. Object
Pascal is still a really good language too (though it's not my
particular cup of tea).

Exactly the same here, even the dates agree.

Quote:
You need to address your bias.

I have no bias. I have spent most of my adult life programming for MS OSes.

Quote:
Windows isn't a broken Unix.

;-)

Quote:
It is different:

Being different can be good or bad. Being different for the mere reason of
being different IMO is a bad choice. And I know of no reason why DOS had
to choose \ over / - other than to be different. I - of course - may be
wrong.

Quote:
It isn't unix at all, it has an entirely different
lineage.

Please point out where I stated that Windows is UNIX! You - at least -
seem to imply that I did.

Quote:
C, on the other hand, is tightly bound up with unix.

I would argue with that. UNIX is tightly bound up with C, that I can
accept. But AFAIK C is used on many more platforms than UNIX. Including the
many different versions of Windows (and its sources) as well as embedded
systems without any kind of OS.

C (and C++) has its own ISO standard. UNIX OTOH is POSIX AFAIK. So I still
believe that it is not the job of C or C++ to fix that MS has chosen a
character for path element separator which is traditionally used as an
escape character by programming languages. I also think that because in any
useful SW a path will usually not be embedded in the code, but taken from
configuration - which is system specific. And configuration is usually set
up by some sort of system specific scripting/installation utility - which
can choose not to use \ as an escape character.

--
WW aka Attila



kanze@gabi-soft.fr
Posted: Mon Nov 10, 2003 7:33 pm    Post subject: Re: new string literal (was: raw strings in C++0x)

"WW" <wolof (AT) freemail (DOT) hu> wrote

Quote:
Ben Liddicott wrote:
"WW" <wolof (AT) freemail (DOT) hu> wrote in message
news:bmp6lb$49e$1 (AT) phys-news1 (DOT) kolumbus.fi...
Johan Hahn wrote:
Yes, a new string literal is what I'm looking for, that
ignores the meaning of backslash. (In Python they are
prefixed with an r as in r"c:\program files\whatever".)
Primary use would not be to allow windows style path names as
in the example above.
IMO that is something Microsoft decided about (to use
backslashes instead of slash) so I honestly think it is not the
C++ languages task to fix that.

[...]

Quote:
Windows isn't a broken Unix.

;-)

It is different:

Being different can be good or bad. Being different for the mere
reason of being different IMO is a bad choice. And I know about no
reason why DOS had to choose \ over / - other than to be different. I
- of course - may be wrong.

MS-DOS introduced hierarchical directories in version 2.0. It did not
choose '/' as the default directory separator because that character was
already taken as the option identifier (the '-' in Unix). MS-DOS chose
'/' as the option identifier (from 1.0 on) because that is what CP/M
used. CP/M used '/' because that is what some of the PDP-11 OS's used.

The decision goes back at least to the middle 70's, at which time most
people had never even heard of Unix or C, and hierarchical directory
structures weren't an issue.

MS-DOS operating system requests (e.g. open), and those of Windows, have
always, from the very first, accepted either '\' or '/' as a directory
separator, and still do today. There is really not the slightest need
to use '\' for this in a string constant.

For many years, too, MS-DOS had a system request to read and to change
the defaults -- it was trivial to write a program which changed the
defaults, and once done, all other programs which used the request to
respect the option would use the Unix option. Once this was done, all
of the MS-DOS built-in utilities used '-' for the option id, and '/' for
the directory separator, both with regards to input and display. As far
as I know, no programs written outside of Microsoft ever used this
request -- the fact that it was undocumented probably had something to
do with this:-) -- and Microsoft quietly dropped it somewhere around 4.0
or 5.0.

Finally (and to get a little bit back on topic), in practice, about the
only place I find that I have file names in strings is in #include's.
And the "..." in an include isn't a string literal, and '\' aren't
expanded in it. In theory, the standard allows an implementation to do
just about any mapping it wants. So a compiler which did expand '\'s
here would be conformant, just as would be a compiler which rot-13
encoded them, or truncated parts of them. In practice, none of the
compilers I have access to (Sun CC, g++) do -- #include "\n" looks for a
file with the two character name '\', 'n', and not for one with a single
character name.

Practically speaking, all other file names will be obtained from a
configuration file, command line options or environment variables (or
the registry, for Windows programs). Where the presence of a '\'
doesn't make the slightest problem.

Quote:
It isn't unix at all, it has an entirely different lineage.

Please point out where did I state that Windows is UNIX! You - at
least - seem to imply that I did.

C, on the other hand, is tightly bound up with unix.

I would argue with that. UNIX is tightly bound up with C, that I can
accept. But AFAIK C is used on many more platform than UNIX.
Including the many different versions of Windows (and its sources) as
well as embedded systems without any kind of OS.

Unix existed before C. C was initially designed for Unix. It was later
ported to other platforms. The original versions of C, and especially
of the C library, were very much based on Unix. (This is most obviously
reflected in the requirement that a line be terminated internally by a
single character.)

Quote:
C (and C++) has its own ISO standard. UNIX OTOH is POSIX AFAIK. So I
still believe that it is not the job of C or C++ to fix that MS has
chosen a character for path element separator which is traditionally
used as an escape character by programming languages.

MS didn't choose a character that was traditionally used as an escape
character by programming languages. There is, as far as I can tell, no
tradition in this respect, except for C -- the usual way of embedding a
" or a ' in a string was to double it, and the usual way of generating a
new line or a form feed was to use a separate function to output (and
new lines weren't physically present on the disk). And traditionally,
filenames could only be made up of a very small set of characters: the
alphanum's, and perhaps one or two others. And there was only one
directory per disk, and you never named it: a "filename" might look like
":dsk1:somethin.txt", or "c:whatever.txt"

Quote:
I also think that because in any useful SW a path will usually not be
embedded in the code, but taken from configuration - which is system
specific. And configuration is usually set up by some sort of system
specific scripting/installation utility - which can choose not to use
\ as an escape character.

Exactly. I find that having a possible C: at the start of an absolute
file name is more of a problem than the '/' or '\' question.

The original poster, if I recall correctly, mentioned the problem with
regards to regular expressions, and not file names. This is a more
valid issue: while most regular expressions will be read from a
configuration file, there are exceptions -- those used to parse the
configuration file, for example, or simple things like recognizing
numerical fields (and determining their base). Still, it's not the end
of the world, and the number of times it is a problem hardly seems
worthy of a language extension.

--
James Kanze GABI Software mailto:kanze (AT) gabi-soft (DOT) fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16

Ben Hutchings
Posted: Mon Nov 10, 2003 7:57 pm    Post subject: Re: new string literal (was: raw strings in C++0x)

WW wrote:
Quote:
Ben Liddicott wrote:
"WW" <wolof (AT) freemail (DOT) hu> wrote in message
news:bmp6lb$49e$1 (AT) phys-news1 (DOT) kolumbus.fi...
Johan Hahn wrote:
Yes, a new string literal is what I'm looking for, that ignores
the meaning of backslash. (In Python they are prefixed with
an r as in r"c:\program files\whatever".) Primary use would not
be to allow windows style path names as in the example above.

IMO that is something Microsoft decided about (to use backslashes
instead of slash) so I honestly think it is not the C++ languages
task to fix that.
snip
Being different can be good or bad. Being different for the mere reason of
being different IMO is a bad choice. And I know about no reason why DOS had
to choose \ over / - other than to be different. I - of course - may be
wrong.
snip


DOS commands and many NT commands use "/" to introduce options (though
some early versions of DOS permitted the use of "-" as an alternative).
This is why it could not be used as a directory separator on the command
line. However, at the OS level both "\" and "/" are accepted as
directory separators, so there is no need to use "\" when writing file
names in source code. (The same does not go for registry paths,
unfortunately.)
