 |
C++Talk.NET C++ language newsgroups
|
David Given Guest
|
Posted: Sun Feb 25, 2007 10:11 am Post subject: C parser wanted |
|
|
I'm trying to write a C source transformer, that will read in ANSI C source
code, mutate it in various bizarre ways, and then write it out again.
Unfortunately, C is notoriously hard to parse due to context-sensitive tokens.
Does anyone know where I can find a simple-as-possible toolkit that will
construct a parse tree and emit it again? Ideally I need something actually
*in* C, which lets out cil and the antlr-based grammar. I have found ctree,
but it's not really open source and it's a bit cryptic.
Does anyone know of anything else I should look at?
--
┌── dg@cowlark.com ─── http://www.cowlark.com ───────────────────
│
│ "The first 90% of the code takes the first 90% of the time. The other 10%
│ takes the other 90% of the time." --- Anonymous
--
comp.lang.c.moderated - moderation address: clcm (AT) plethora (DOT) net -- you must
have an appropriate newsgroups line in your header for your mail to be seen,
or the newsgroup name in square brackets in the subject line. Sorry. |
|
|
 |
Douglas A. Gwyn Guest
|
Posted: Tue Feb 27, 2007 3:02 am Post subject: Re: C parser wanted |
|
|
"David Given" <dg (AT) cowlark (DOT) com> wrote in message
news:clcm-20070225-0008 (AT) plethora (DOT) net...
| Quote: | I'm trying to write a C source transformer, that will read in ANSI C source
code, mutate it in various bizarre ways, and then write it out again.
Unfortunately, C is notoriously hard to parse due to context-sensitive tokens.
|
Actually the tokens aren't "context-sensitive" in any significant sense.
The main trick is in parsing declarations (already tokenized), where
typedef names have to be considered, to do the job right.
| Quote: | Does anyone know where I can find a simple-as-possible toolkit that will
construct a parse tree and emit it again? Ideally I need something actually
*in* C, which lets out cil and the antlr-based grammar. I have found ctree,
but it's not really open source and it's a bit cryptic.
|
Printing a tree isn't too hard. GCC 4.x has options to dump the parse
tree in various formats, including at least one that is very close to C
code.
Indeed, you might consider basing your project on GCC.
| Quote: | Does anyone know of anything else I should look at?
|
http://www.quut.com/c/ANSI-C-grammar-l.html
http://www.quut.com/c/ANSI-C-grammar-y.html
Obviously, you have to add code to the yacc spec to build the parse
tree, but at least the hard part (parsing) is taken care of.
If you don't have lex and yacc, you can get work-alikes flex and bison
from the GNU project.
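
As a rough sketch of the "add code to the yacc spec" step: the actions need some node type to allocate and a routine that walks the tree to write C text back out. Everything named here (`struct node`, `node_new`, `node_emit`, `node_free`, the `NODE_*` kinds) is made up for illustration and is not part of the linked grammar; a real tree would need a node kind for every construct in the grammar.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node kinds; only expressions are covered in this sketch. */
enum node_kind { NODE_IDENT, NODE_CONST, NODE_BINARY };

struct node {
    enum node_kind kind;
    char *text;                 /* identifier, constant, or operator spelling */
    struct node *left, *right;  /* operands; NULL for leaves */
};

/* Allocate a node. A yacc action could call it along these lines
 * (assuming the %union carries a struct node *):
 *   additive_expression '+' multiplicative_expression
 *       { $$ = node_new(NODE_BINARY, "+", $1, $3); }
 */
struct node *node_new(enum node_kind kind, const char *text,
                      struct node *left, struct node *right)
{
    assert(text != NULL);
    struct node *n = malloc(sizeof *n);
    char *copy = malloc(strlen(text) + 1);
    if (n == NULL || copy == NULL)
        abort();
    strcpy(copy, text);
    n->kind = kind;
    n->text = copy;
    n->left = left;
    n->right = right;
    return n;
}

/* Append the C text for the tree to buf, which must already hold a
 * NUL-terminated string and have room for cap bytes in total.
 * Output is fully parenthesized and silently truncated if it does not fit. */
void node_emit(const struct node *n, char *buf, size_t cap)
{
    size_t len = strlen(buf);
    if (n->kind == NODE_BINARY) {
        snprintf(buf + len, cap - len, "(");
        node_emit(n->left, buf, cap);
        len = strlen(buf);
        snprintf(buf + len, cap - len, " %s ", n->text);
        node_emit(n->right, buf, cap);
        len = strlen(buf);
        snprintf(buf + len, cap - len, ")");
    } else {
        snprintf(buf + len, cap - len, "%s", n->text);
    }
}

/* Release a whole tree. */
void node_free(struct node *n)
{
    if (n == NULL)
        return;
    node_free(n->left);
    node_free(n->right);
    free(n->text);
    free(n);
}
```

Emitting full parentheses sidesteps precedence handling on output; a transformer that wants to preserve the original layout would have to record more than this sketch does.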
|
 |
Hans-Bernhard Bröker Guest
|
Posted: Tue Feb 27, 2007 3:02 am Post subject: Re: C parser wanted |
|
|
David Given wrote:
| Quote: | I'm trying to write a C source transformer, that will read in ANSI C source
code, mutate it in various bizarre ways, and then write it out again.
|
So you want to write yet another auto-obfuscator?
| Quote: | Unfortunately, C is notoriously hard to parse due to context-sensitive tokens.
|
Pardon the french, but that's nonsense. Given adequate tools (the
obvious choice being lex and yacc), C is about as easy to parse as it
gets. The yacc grammar's right there in the standard text, for crying
out loud.
If you want notoriously hard to parse, go have a look at C++ or Fortran.
|
 |
Keith Thompson Guest
|
Posted: Thu Mar 01, 2007 8:33 am Post subject: Re: C parser wanted |
|
|
Hans-Bernhard Bröker <HBBroeker@t-online.de> writes:
| Quote: | David Given wrote:
I'm trying to write a C source transformer, that will read in ANSI C source
code, mutate it in various bizarre ways, and then write it out again.
So you want to write yet another auto-obfuscator?
Unfortunately, C is notoriously hard to parse due to
context-sensitive tokens.
Pardon the french, but that's nonsense. Given adequate tools (the
obvious choice being lex and yacc), C is about as easy to parse as it
gets. The yacc grammar's right there in the standard text, for crying
out loud.
|
("Nonsense" is French? })
In my experience, parsing C is made rather tricky by the existence of
typedefs. A typedef name in effect becomes a reserved word, but only
within the scope of the declaration.
C-without-typedefs can be parsed straightforwardly (generating, say, a
simple parse tree) using just a yacc grammar, with no symbol table
required. With typedefs, the parser needs to communicate with the
symbol table to determine whether a given identifier currently is a
typedef name or not.
I can't think of a specific ambiguity off the top of my head, but I'm
fairly sure some exist.
--
Keith Thompson (The_Other_Keith) kst-u (AT) mib (DOT) org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
|
 |
Derek M. Jones Guest
|
Posted: Thu Mar 01, 2007 8:34 am Post subject: Re: C parser wanted |
|
|
Hans-Bernhard Bröker,
| Quote: | Unfortunately, C is notoriously hard to parse due to context-sensitive tokens.
Pardon the french, but that's nonsense. Given adequate tools (the
obvious choice being lex and yacc), C is about as easy to parse as it
gets. The yacc grammar's right there in the standard text, for crying
out loud.
|
I think the point Doug was trying to make is that it is difficult to
create a unique parse for C.
For instance:
x (y) ;
can be parsed in two ways, as can:
(x) - y ;
and:
x * y ;
A symbol table is needed to decide which of the two possible
parses to use.
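
To make the two readings concrete, here is a small sketch in which the same three statement shapes compile under both interpretations. The function names (`typedef_reading`, `expression_reading`, `f`) and the values are invented for illustration; the expression statements are turned into assignments only so that their effect can be observed.

```c
#include <assert.h>

/* Reading 1: x names a type, so the statements are declarations or a cast. */
int typedef_reading(void)
{
    typedef int x;
    int y = 5;
    int r;

    r = (x) - y;      /* cast: unary minus applied to y, converted to x */
    {
        x (y);        /* declaration: a new int y, shadowing the outer one */
        y = 7;
        r += y;
    }
    {
        x * y;        /* declaration: y is a pointer to x */
        y = &r;
        *y += 1;
    }
    return r;         /* -5 + 7 + 1 */
}

static int calls;
static int f(int v) { calls += v; return v; }

/* Reading 2: x names an object or function, so the statements are expressions. */
int expression_reading(void)
{
    int y = 5;
    int r = 0;

    calls = 0;
    {
        int (*x)(int) = f;
        x (y);        /* function call through x */
    }
    {
        int x = 12;
        r = (x) - y;  /* subtraction */
        r += x * y;   /* multiplication */
    }
    return r + calls; /* 7 + 60 + 5 */
}
```

Nothing in the token sequence distinguishes the two functions' inner statements; only the earlier declaration of `x` does.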
| Quote: | If you want notoriously hard to parse, go have a look at C++ or Fortran.
|
Actually Fortran is very easy to parse, but it is very difficult to
lexically analyse.
|
 |
David Given Guest
|
Posted: Thu Mar 01, 2007 8:35 am Post subject: Re: C parser wanted |
|
|
Hans-Bernhard Bröker wrote:
[...]
| Quote: | So you want to write yet another auto-obfuscator?
|
No, an Objective C compiler.
[...]
| Quote: | Unfortunately, C is notoriously hard to parse due to context-sensitive tokens.
Pardon the french, but that's nonsense. Given adequate tools (the
obvious choice being lex and yacc), C is about as easy to parse as it
gets. The yacc grammar's right there in the standard text, for crying
out loud.
|
Right, except that it expects the lexer to magically know whether an
identifier's a typedef name or not. And since the only way of figuring out
whether an identifier is a typedef or not is to actually *understand* the
input text, a C parser is going to need full symbol table and scoping support.
Which is not particularly hard, but is complex and extremely fiddly to get right.
I'm hoping to find a minimal implementation of a fully working parser and
serializer so that I don't need to reproduce other people's mistakes.
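
For reference, the bookkeeping being described can be sketched in a few dozen lines: a stack of scopes that records whether each declared name is a typedef, which the lexer consults before deciding what token to return. All names here (`scope_push`, `scope_pop`, `declare`, `classify`, the `TOK_*` constants, the fixed table sizes) are hypothetical, and the sketch ignores the separate name spaces for tags, labels and members that a full implementation would have to handle.

```c
#include <assert.h>
#include <string.h>

enum token_kind { TOK_IDENTIFIER, TOK_TYPE_NAME };

#define MAX_SYMBOLS 256
#define MAX_SCOPES  32

struct symbol {
    const char *name;   /* not copied: the caller keeps the string alive */
    int is_typedef;
};

static struct symbol symbols[MAX_SYMBOLS];
static int symbol_count;
static int scope_start[MAX_SCOPES];  /* first symbol index of each open scope */
static int scope_depth;

/* Called by the parser on '{' (or at the start of a function body). */
void scope_push(void)
{
    assert(scope_depth < MAX_SCOPES);
    scope_start[scope_depth++] = symbol_count;
}

/* Called on the matching '}': forget everything declared in that scope. */
void scope_pop(void)
{
    assert(scope_depth > 0);
    symbol_count = scope_start[--scope_depth];
}

/* Called by the parser once a declarator's name is known. */
void declare(const char *name, int is_typedef)
{
    assert(symbol_count < MAX_SYMBOLS);
    symbols[symbol_count].name = name;
    symbols[symbol_count].is_typedef = is_typedef;
    symbol_count++;
}

/* Called by the lexer for every identifier-shaped token. The innermost
 * declaration wins, so an ordinary variable can shadow a typedef name. */
enum token_kind classify(const char *name)
{
    for (int i = symbol_count - 1; i >= 0; i--) {
        if (strcmp(symbols[i].name, name) == 0)
            return symbols[i].is_typedef ? TOK_TYPE_NAME : TOK_IDENTIFIER;
    }
    return TOK_IDENTIFIER;  /* undeclared names are plain identifiers */
}
```

The fiddly part is less the table itself than the timing: `declare` has to run before the lexer is asked about the next use of the name, which constrains where the actions sit in the grammar.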
--
┌── dg@cowlark.com ─── http://www.cowlark.com ───────────────────
│
│ "The first 90% of the code takes the first 90% of the time. The other 10%
│ takes the other 90% of the time." --- Anonymous
|
 |
David Spencer Guest
|
Posted: Thu Mar 01, 2007 8:35 am Post subject: Re: C parser wanted |
|
|
David Given <dg (AT) cowlark (DOT) com> writes:
| Quote: | I'm trying to write a C source transformer, that will read in ANSI C source
code, mutate it in various bizarre ways, and then write it out again.
Unfortunately, C is notoriously hard to parse due to context-sensitive tokens.
|
Fortunately, C is notoriously easy to parse. There are yacc and lex
sources freely available on the net; you just need to write the actions.
Typedefs require some very simple bookkeeping on the side. It may be
offensive to a yacc purist, but fortunately it doesn't detain anyone in
practice. There is also plenty of guidance for that on the net and in
the comp.compilers archive.
Also, very fortunately, you don't need any fancy parse or scan error
reporting or recovery: If the code passes a real compiler or lint,
yacc's built-in "parse error" should be sufficient (as it shouldn't
occur).
I have several utilities that parse C code and put out other C
code. It's very straightforward.
--
dhs spencer (AT) panix (DOT) com
|
 |
Ira Baxter Guest
|
Posted: Tue Mar 06, 2007 2:41 am Post subject: Re: C parser wanted |
|
|
"David Given" <dg (AT) cowlark (DOT) com> wrote in message
news:clcm-20070225-0008 (AT) plethora (DOT) net...
| Quote: | I'm trying to write a C source transformer, that will read in ANSI C source
code, mutate it in various bizarre ways, and then write it out again.
Unfortunately, C is notoriously hard to parse due to context-sensitive tokens.
Does anyone know where I can find a simple-as-possible toolkit that will
construct a parse tree and emit it again? Ideally I need something actually
*in* C, which lets out cil and the antlr-based grammar. I have found ctree,
but it's not really open source and it's a bit cryptic.
Does anyone know of anything else I should look at?
|
However, our DMS Software Reengineering Toolkit is capable of
parsing C source in a variety of dialects (ANSI, GCC2/3/4, MSVC6, ...),
has a full expanding preprocessor as well as the ability to not expand
most directives and yet still parse, building complete abstract syntax trees,
performing full name and (expression) type resolution, carrying out control
and data flow analysis, accepting and applying either procedural or
source to source transformations to the ASTs, and then regenerating
compilable text from the ASTs.
It has been used on systems of code with 7500 compilation units
to carry out some pretty weird transformations.
See http://www.semanticdesigns.com/Products/FrontEnds/CFrontEnd.html
I don't know about "as simple as possible".
The complexity of the task of parsing/analyzing/transforming
real languages forces a certain degree of complexity on the
underlying infrastructure.
DMS is as simple as we could make it, and still do these tasks.
--
Ira Baxter, CTO
www.semanticdesigns.com
|
 |
Douglas A. Gwyn Guest
|
Posted: Tue Mar 06, 2007 2:41 am Post subject: Re: C parser wanted |
|
|
"David Given" <dg (AT) cowlark (DOT) com> wrote in message
news:clcm-20070228-0010 (AT) plethora (DOT) net...
| Quote: | Right, except that it expects the lexer to magically know whether an
identifier's a typedef name or not.
|
When creating tokens, it doesn't need to know anything about typedefs.
They're lexically just identifiers. Typedefs have to be taken into account
when parsing certain declarations.
| Quote: | I'm hoping to find a minimal implementation of a fully working parser and
serializer so that I don't need to reproduce other people's mistakes.
|
The GCC front end does all that. It also supports Objective-C.
|
 |
Keith Thompson Guest
|
Posted: Sat Mar 10, 2007 6:39 pm Post subject: Re: C parser wanted |
|
|
"Douglas A. Gwyn" <DAGwyn (AT) null (DOT) net> writes:
| Quote: | "David Given" <dg (AT) cowlark (DOT) com> wrote in message
news:clcm-20070228-0010 (AT) plethora (DOT) net...
Right, except that it expects the lexer to magically know whether an
identifier's a typedef name or not.
When creating tokens, it doesn't need to know anything about typedefs.
They're lexically just identifiers. Typedefs have to be taken into account
when parsing certain declarations.
[...] |
I think you can choose to treat a typedef name either as an
identifier, or as a special kind of token (effectively a keyword). As
long as the behaviors of the lexer and parser are consistent with each
other, either will work. Treating typedef names as keywords *might*
make the whole thing easier by simplifying the parser's job (it's been
a while since I've worked on this stuff).
--
Keith Thompson (The_Other_Keith) kst-u (AT) mib (DOT) org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"