Difference between revisions of "Development/KDevelop-PG-Qt Introduction"

Jump to: navigation, search
(Store it in the Parsetree)
 
(56 intermediate revisions by 6 users not shown)
Line 1: Line 1:
{{Template:I18n/Language Navigation Bar|Development/KDevelop-PG-Qt Introduction}}
 
  
KDevelop-PG-Qt is the parser-generator from KDevplatform. It is used for some KDevelop-languagesupport-plugins (Ruby, PHP, Java...).
+
<languages />
 +
<translate>
  
It uses Qt classes internally. There's also the original KDevelop-PG parser, which used types from the STL, but has since been superseeded by KDevelop-PG-Qt. Most of the features are the same, though it could be that the ...-Qt parser generator is more up to date and feature rich than the plain STL style generator. The ...-Qt version should be used to write parsers for KDevelop language plugins.
+
== Preface == <!--T:1-->
  
= In-Depth information  =
+
<!--T:2-->
 +
'''KDevelop-PG-Qt''' is the parser-generator from ''KDevplatform''. It is used for some ''KDevelop-languagesupport-plugins'' (Ruby, PHP, CSS...).
  
This document is not supposed to be a full-fledged and in-depth resource for all parts of KDevelop-PG. Instead it is intended to be a short introduction and, more importantly, a reference for developers.  
+
<!--T:3-->
 +
It uses Qt classes internally. There's also the original '''KDevelop-PG parser''', which used types from the STL, but has since been superceeded by '''KDevelop-PG-Qt'''. Most of the features are the same, though it could be that the ...-Qt parser generator is more up to date and feature rich than the plain STL style generator. The ...-Qt version should be used to write parsers for KDevelop language plugins.  
  
To get and in-depth introduction, read Jakob Petsovits' excellent Bachelor thesis. You find it in the Weblinks section at the bottom of this page.
+
== In-Depth information  == <!--T:4-->
  
= The Application  =
+
<!--T:5-->
 +
This document is not supposed to be a full-fledged and in-depth resource for all parts of '''KDevelop-PG'''. Instead it is intended to be a short introduction and, more importantly, a reference for developers.
  
== Usage  ==
+
<!--T:6-->
 +
To get an in-depth introduction, read Jakob Petsovits' excellent Bachelor thesis. You find it in the Weblinks section at the bottom of this page. However, it does not cover recently added features.
  
You can find KDevelop-PG-Qt in [http://websvn.kde.org/trunk/playgroun/devtools/kdevelop-pg-qt SVN] . Also included in the source are three example packages.<br>  
+
== The Application  == <!--T:7-->
<pre>svn co svn://anonsvn.kde.org/home/kde/trunk/playground/devtools/
+
 
</pre>  
+
=== Usage  === <!--T:8-->
 +
 
 +
<!--T:9-->
 +
You can find '''KDevelop-PG-Qt''' in [https://projects.kde.org/projects/extragear/kdevelop/utilities/kdevelop-pg-qt git]. Four example packages are also included in the sources.<br />  
 +
To download it try:
 +
{{Input|1=git clone git://anongit.kde.org/kdevelop-pg-qt.git}}
 +
or:
 +
{{Input|1=git clone kde:kdevelop-pg-qt}} (when having setup '''git''' with kde:-prefix)
 +
 
 +
<!--T:10-->
 
The program itself requests a .g file, a so called grammar, as input:  
 
The program itself requests a .g file, a so called grammar, as input:  
<pre>./kdev-pg-qt --output=''prefix'' syntax.g
+
{{Input|1=./kdev-pg-qt --output=''prefix'' syntax.g}}
</pre>  
+
 
The value of the ''--ouput'' switch decides the prefix of the output files and additionally the namespace for the generated code.  
+
<!--T:11-->
 +
The value of the ''--output'' switch decides the prefix of the output files and additionally the namespace for the generated code.
 +
'''Kate''' provides elementary highlighting for '''KDevelop-PG-Qt's''' grammar-files.
  
== Output Format  ==
+
=== Output Format  === <!--T:12-->
  
 +
<!--T:13-->
 
While evaluating the grammar and generating its parser files, the application will output information about so called ''conflicts'' to STDOUT. As said above, the following files will actually be prefixed.  
 
While evaluating the grammar and generating its parser files, the application will output information about so called ''conflicts'' to STDOUT. As said above, the following files will actually be prefixed.  
  
=== ast.h  ===
+
==== ast.h  ==== <!--T:14-->
  
AST stands for [http://en.wikipedia.org/wiki/Abstract_syntax_tree ''Abstract Syntax Tree'']. It defines the data structure in which the parse tree is saved. Each node is a struct with the postfix ''Ast'', which contains members that point to any possible sub elements.  
+
<!--T:15-->
 +
AST stands for [http://en.wikipedia.org/wiki/Abstract_syntax_tree Abstract Syntax Tree]. It defines the data structure in which the parse tree is saved. Each node is a struct with the postfix ''Ast'', which contains members that point to any possible sub elements.  
  
=== parser.h and parser.cpp  ===
+
==== parser.h and parser.cpp  ==== <!--T:16-->
  
One important part of ''parser.h'' is the definition of the parser tokens, the ''TokenType'' enum. The TokenStream of your lexer should to use this. You have to write your own lexer or let one generate by Flex. See also the part about Tokenizers/Lexers below.
+
<!--T:17-->
 +
One important part of ''parser.h'' is the definition of the parser tokens, the ''TokenType'' enum. The TokenStream of your lexer should to use this. You have to write your own lexer or let one generate by '''Flex'''. See also the part about Tokenizers/Lexers below.
  
Having the token stream available, you create your root item and call the parser on the parse method for the top-level AST item, e.g. DocumentAst* =&gt; parseDocument(&amp;root). On success, root will contain the AST.<br>
+
<!--T:18-->
 +
Having the token stream available, you create your root item and call the parser on the parse method for the top-level AST item, e.g. DocumentAst* =&gt; parseDocument(&amp;root). On success, root will contain the AST.<br />
  
 +
<!--T:19-->
 
The parser will have one parse method for each possible node of the AST. This is nice for e.g. an expression parser or parsers that should only parse a sub-element of a full document.
 
The parser will have one parse method for each possible node of the AST. This is nice for e.g. an expression parser or parsers that should only parse a sub-element of a full document.
  
=== visitor.h and visitor.cpp  ===
+
==== visitor.h and visitor.cpp  ==== <!--T:20-->
  
 +
<!--T:21-->
 
The Visitor class provides an abstract interface to walk the AST. Most of the time you don't need to use this directly, the DefaultVisitor takes some work off your shoulders.
 
The Visitor class provides an abstract interface to walk the AST. Most of the time you don't need to use this directly, the DefaultVisitor takes some work off your shoulders.
  
=== defaultvisitor.h and defaultvisitor.cpp  ===
+
==== defaultvisitor.h and defaultvisitor.cpp  ==== <!--T:22-->
  
The DefaultVisitor is an implementation of the abstract Visitor interface and automatically visits each node in the AST. Hence, this is probably the best candidate for a base class for your personal visitors. Most language plugins use these in their Builder classes to create the DUChain.<br>
+
<!--T:23-->
 +
The DefaultVisitor is an implementation of the abstract Visitor interface and automatically visits each node in the AST. Hence, this is probably the best candidate for a base class for your personal visitors. Most language plugins use these in their Builder classes to create the DUChain.<br />
  
== Command-Line-Options  ==
+
=== Command-Line-Options  === <!--T:24-->
  
*--namespace=''namespace'' - sets the C++ namespace for the generated sources independently from the file prefix. When this option is set, you can also use / in the --ouput option<br>
+
<!--T:25-->
*--no-ast - don't create the ast.h file, more to that below<br>
+
* --namespace=''namespace'' - sets the C++ namespace for the generated sources independently from the file prefix. When this option is set, you can also use / in the --ouput option
*--debug-visitor - generates a debug visitor that prints the AST<br>
+
* --no-ast - don't create the ast.h file, more to that below
*--serialize-visitor - generates code for serialization via a QIODevice  
+
* --debug-visitor - generates a debug visitor that prints the AST
*--symbols - all possible nodes from the AST (not the leafs) will be written into the file ''kdev-pg-symbol.''  
+
* --serialize-visitor - generates code for serialization via a QIODevice
*--rules - all grammar rules with informationen about their syntactic correlations will be written into a file called ''kdev-pg-rules''. useful for debugging and solving conflicts
+
* --terminals - all tokens will be written into the file ''kdev-pg-terminals''
*--help - a so far not really helpful help text ;-)<br>
+
* --symbols - all possible nodes from the AST (not the leafs) will be written into the file ''kdev-pg-symbols''  
 +
* --rules - all grammar rules with informationen about their syntactic correlations will be written into a file called ''kdev-pg-rules''. useful for debugging and solving conflicts
 +
* --token-text - generates a function to map token-numbers onto token-names
 +
* --help - print usage information
  
= Tokenizers/Lexers<br> =
+
== Tokenizers/Lexers == <!--T:26-->
  
As mentioned, KDevelop-PG-Qt requires an existing Tokenizer. You can either write one per hand, as was done for C++ and PHP, or you can use tools like Flex. With the existing examples, it shouldn't be too hard to write such a lexer. Between most languages, especially those ''"inheriting"'' C, there are many common syntactic elements. Especially comments and literals can be handled just the same way over and over again. Adding a simple token is trivial:
+
<!--T:27-->
<pre>"special-command"    return Parser::Token_SPECIAL_COMMAND;
+
As mentioned, '''KDevelop-PG-Qt''' requires a Tokenizer. You can either let '''KDevelop-PG-Qt''' generate one for you, write one per hand, as it has been done for C++ and PHP, or you can use external tools like '''Flex'''.
</pre>
+
That's pretty much it, take a look at eg. ''java.ll'' for an excellent example.
+
  
The tokenizer's job, in princeple, boils down to:<br>
+
<!--T:28-->
 +
The tokenizer's job, in principle, boils down to:
  
*converting keywords and chars with special meanings to tokens<br>
+
<!--T:29-->
*converting literals and identifier to tokens<br>
+
* converting keywords and chars with special meanings to tokens
*clean out anything that doesn't change the semantics, e.g. comments or whitespace (the latter of course not in Python)<br>
+
* converting literals and identifier to tokens
*while doing the above, handling character encoding (we recommend using UTF8 as much as possible)<br>
+
* clean out anything that doesn't change the semantics, e.g. comments or whitespace (the latter of course not in Python)
 +
* while doing the above, handling character encoding (we recommend using UTF8 as much as possible)
  
 +
<!--T:30-->
 
The rest, e.g. actually building the tree and evaluating the semantics, is part of the parser and the AST visitors.<br>
 
The rest, e.g. actually building the tree and evaluating the semantics, is part of the parser and the AST visitors.<br>
  
= How to write Grammar-Files  =
+
=== Using KDevelop-PG-Qt === <!--T:31-->
  
== Chomsky Type-2 Grammars  ==
+
<!--T:32-->
 +
'''KDevelop-PG-Qt''' can generate lexers being well integrated into its architecture (you do not have to create a token-stream-class invoking lex or something like that). See examples/foolisp in the code for a simplistic example, there is also an incomplete PHP-Lexer for demonstration purposes.
  
KDevelop-PG-Qt uses so called [http://en.wikipedia.org/wiki/Chomsky_hierarchy Type-2-grammars] use a concept of non-terminals (nodes) and terminals(tokens). While writing the grammar for the basic structure of your language, you should try to mimic the semantics of the language. Lets take a look at an example:
+
==== Regular Expressions ==== <!--T:33-->
 +
Regular expressions are used to write rules using the KDevelop-PG-Qt, we use the following syntax (α and β are arbitrary regular expressions, a and b characters):
 +
*α|β accepts any word accepted by α or accepted by β
 +
*α&β accepts any word accepted by both α and β
 +
*α^β accepts any word accepted by a but not by β
 +
*~α accepts any word not accepted by α
 +
*?α like α, but also accepts the empty word
 +
*α* accepts any (maybe empty) sequence of words accepted by α
 +
*α+ accepts any nonempty sequence of words accepted by α (equivalent to αα*)
 +
*α@β accepts any nonempty sequence of words accepted by α separated by words accepted by β (equivalent to α(βα)*)
 +
*αβ accepts words consisting of a word accepted by α followed by a word accepted by β
 +
*[…] switches to “disjunctive” environment, αβ will get interpreted as α|β, you can use (…) inside the brackets to go back to normal mode
 +
*. accepts any single character
 +
*a-b accepts a single character between a and b (including a and b) in the Unicode (of course only characters that can be represented in the used encoding)
 +
*"…" will accept the word enclosed by the quotation marks, escape sequences will still get interpreted
 +
*a accepts the word consisting of the single character a
 +
*Any escape sequence (see below), accepts the word consisting of the character represented by the escape sequence
 +
*{⟨name⟩} accepts any word accepted by the regex named ⟨name⟩
  
 +
<!--T:178-->
 +
All regular expressions are case sensitive. Sorry, there is currently no way for insensitivity.
 +
 +
==== Known Escape Sequences ==== <!--T:179-->
 +
There are several escape sequences which can be used to encode special characters:
 +
*\n, \t, \f, \v, \r, \0, \b, \a like in C
 +
*\x, \X, \u or \U followed by hex digits: character represented by this Unicode value (in hex)
 +
*\d, \D followed by decimal digits, same, but in decimal representation
 +
*\o, \O followed by octal digits, same, but in octal representation
 +
*\y, \Y followed by binary digits, same, but in binary representation
 +
 +
==== Predefined named regex ==== <!--T:180-->
 +
Some regexes are predefined and can be used using braces {⟨name⟩}. They get imported from the official Unicode data, some important regexes:
 +
*{alphabetic} any alphabetic character
 +
*{num} any numeric character
 +
*{ascii-range} any character representable in ASCII
 +
*{latin1-range} any character representable in Latin 1 (8 Bit)
 +
*{uppercase}
 +
*{lowercase}
 +
*{math}
 +
 +
==== Rules ==== <!--T:181-->
 +
Rules can be written as:
 +
{{Input|1=
 +
⟨regular expression⟩ TOKEN;
 +
}}
 +
 +
<!--T:182-->
 +
Then the Lexer will generate the token TOKEN for lexemes matching the given regular expression. Which token will be chosen if there are multiple options? We use the ''first longest match'' rule: It will take the longest possible match (eating as many characters as possible), if there are multiple of those matches, it will take the first one.
 +
 +
<!--T:183-->
 +
Rules can perform code actions and you can also omit tokens (then no token will be generated):
 +
{{Input|1=
 +
⟨regular expression⟩ [: ⟨code⟩ :] TOKEN;
 +
⟨regular expression⟩ [: ⟨code⟩ :];
 +
⟨regular expression⟩ ;
 +
}}
 +
 +
<!--T:184-->
 +
There is rudimentary support for ''lookahead'' and so called (our invention) ''barriers'':
 +
{{Input|1=
 +
⟨regular expression⟩ %la(⟨regular expression⟩);
 +
⟨regular expression⟩ %ba(⟨regular expression⟩);
 +
}}
 +
The first rule will only accept words if they match the first regular expression and are followed by anything matching the expression specified using %la. The second rule will accept words matched by the first regular expression but will never run into a character sequence matching the regex specified by %ba. However, currently only rules with fixed length are allowed in %la and %ba (for example foo|bar, but not qux|garply). When applying the “first longest match” rule the %la/%ba expressions count, too.
 +
 +
<!--T:185-->
 +
You can create your own named regexes using an arrow:
 +
{{Input|1=
 +
⟨regular expression⟩ -> ⟨identifier⟩;
 +
}}
 +
The first character of the identifier should not be upper case.
 +
 +
<!--T:186-->
 +
Additionally there are two special actions:
 +
{{Input|1=
 +
⟨regular expression⟩ %fail;
 +
⟨regular expression⟩ %continue;
 +
}}
 +
 +
<!--T:187-->
 +
%fail will stop tokenization. %continue will make the matched characters part of the next token.
 +
 +
==== Rulesets ==== <!--T:188-->
 +
 +
<!--T:189-->
 +
A grammar file can contain multiple ''rulesets''. A ruleset is a set of rules, as described in the previous section. It gets declared using:
 +
{{Input|1=
 +
%lexer "name" ->
 +
  ⟨rules⟩
 +
  ;
 +
}}
 +
 +
<!--T:190-->
 +
For your main-ruleset you omit the name (the name will be “start”).
 +
 +
<!--T:191-->
 +
Usually the start-ruleset will be used. But you can change the ruleset in code actions using the macro lxSET_RULE_SET(⟨name⟩). You can specify code to be executed when entering or leaving a ruleset by using %enter [: ⟨code⟩ :]; or %leave [: ⟨code⟩ :]; respectively inside the definition of the ruleset.
 +
 +
==== Further Configuration and Output ==== <!--T:192-->
 +
The standard statements %lexer_declaration_header and %lexer_bits_header are available to include files in the generated lexer.h/lexer.cpp.
 +
 +
<!--T:193-->
 +
By using %lexer_base you can specify the baseclass for the lexer-class, by default it is the TokenStream class defined by KDevelop-PG-Qt.
 +
 +
<!--T:194-->
 +
After %lexerclass(bits) you can specify code to be inserted in lexer.cpp.
 +
 +
<!--T:195-->
 +
You have to specify an encoding the lexer should work with internally using %input_encoding "⟨encoding⟩", possible values:
 +
*ASCII (7-Bit)
 +
*Latin 1 (8-Bit)
 +
*UTF-8 (8-Bit, full Unicode)
 +
*UCS-2 (16-Bit, UCS-2 part of Unicode)
 +
*UTF-16 (16-Bit, full Unicode)
 +
*UTF-32 (32-Bit, full Unicode)
 +
 +
<!--T:196-->
 +
With %input_stream you can specify which class the lexer should use to get the characters to process, there are some predefined classes:
 +
*QStringIterator, reads from QString, required (internal) encoding: UTF-16 or UCS-2
 +
*QByteArrayIterator, reads from QByteArray, required encoding: ASCII, Latin-1 or UTF-8
 +
*QUtf16ToUcs4Iterator, reads from UTF-16 QString, required encoding: UTF-32 (UCS-4)
 +
*QUtf8ToUcs4Iterator, reads from UTF-8 QByteArray, required encoding: UTF-32 (UCS-4)
 +
*QUtf8ToUcs2Iterator, reads from UTF-8 QByteArray, required encoding: UCS-2
 +
*QUtf8ToUtf16Iterator, reads from UTF-8 QByteArray, required encoding: UTF-16
 +
*QUtf8ToAsciiIterator, reads from UTF-8 QByteArray, will ignore all non-ASCII characters, reqired encoding: ASCII
 +
 +
<!--T:197-->
 +
Whether you choose UTF-8, UTF-16 or UTF-32 is irrelevant for functionality, but it may significantly affect compile-time and run-time performance (you may want to test your Lexer with ASCII if compilation takes too long). For example you want to work with a QByteArray containing UTF-8 data, and you want to get full Unicode support: You could either use the QByteArrayIterator and UTF-8 as internal encoding, or the QUtf8ToUtf16Iterator and UTF-16, or the QUtf8ToUcs4Iterator and UTF-32.
 +
 +
<!--T:198-->
 +
You can also choose between %table_lexer; and %sequence_lexer; In the first case transitions between states of the lexer will get represented by big tables while generating the lexer (cases in the generated code). In the second case sequences will get stored in a compressed data structure and transitions will get represented by nested if-statements. For UTF-32 %table_lexer is infeasible, thus there %sequence_lexer is the onliest option.
 +
 +
<!--T:199-->
 +
Inside your actions of the lexer you can use some predefined macros:
 +
{{Input|1=
 +
lxCURR_POS  // position in the input (some kind of iterator or pointer)
 +
lxCURR_IDX  // index of the position in the input
 +
            // (it is the index as presented in the input, for example: input is a QByteArray, index incrementation per byte, but the lexer may operate on 32-bit codepoints)
 +
lxCONTINUE  // like %continue, add the current lexeme to the next token
 +
lxLENGTH    // length of the current lexeme (as presented in the input)
 +
lxBEGIN_POS // position of the first character of the current lexeme
 +
lxBEGIN_IDX  // corresponding index
 +
lxNAMED_TOKEN(⟨name⟩, ⟨type⟩) // create a variable representing named ⟨name⟩ with the token type ⟨type⟩
 +
lxTOKEN(⟨type⟩)              // create such a variable named “token”
 +
lxDONE      // return the token generated before
 +
lxRETURN(X) // create a token of type X and return it
 +
lxEOF      // create the EOF-token
 +
lxFINISH    // create the EOF-token and return it  (will stop tokenization)
 +
lxFAIL      // raise the tokenization error
 +
lxSKIP      // continue with the next lexeme (do not return a token, you should not have created one before)
 +
lxNEXT_CHR(⟨chr⟩)            // set the variable ⟨chr⟩ to the next char in the input
 +
yytoken    // current token
 +
}}
 +
 +
=== Using Flex === <!--T:34-->
 +
 +
<!--T:35-->
 +
With the existing examples, it shouldn't be too hard to write such a lexer. Between most languages, especially those ''"inheriting"'' C, there are many common syntactic elements. Especially comments and literals can be handled just the same way over and over again. Adding a simple token is trivial:
 +
 +
<!--T:36-->
 +
{{Input|1="special-command"    return Parser::Token_SPECIAL_COMMAND; }}
 +
 +
<!--T:37-->
 +
That's pretty much it, take a look at eg. ''java.ll'' for an excellent example. However, it is quite tricky and ugly to handle UTF-8 with Flex.
 +
 +
== How to write Grammar-Files  == <!--T:38-->
 +
 +
=== Context-Free Grammars  === <!--T:39-->
 +
 +
<!--T:40-->
 +
'''KDevelop-PG-Qt''' uses so called [http://en.wikipedia.org/wiki/Context-free_grammars context-free grammars] using a concept of non-terminals (nodes) and terminals(tokens). While writing the grammar for the basic structure of your language, you should try to mimic the semantics of the language. Lets take a look at an example:
 +
 +
<!--T:41-->
 
C++-document consists of lots of declarations and definitions, a class definition could be handled e.g. in the following way:
 
C++-document consists of lots of declarations and definitions, a class definition could be handled e.g. in the following way:
  
 +
<!--T:42-->
 
#''CLASS-token''
 
#''CLASS-token''
 
#a ''identifier''
 
#a ''identifier''
Line 88: Line 287:
 
#and finally the '';-token''
 
#and finally the '';-token''
  
 +
<!--T:43-->
 
The ''member-declarations-list'' is of course not a part of any C++ description, it is just a ''helper'' to explain the structure of a given semantic part of your language. The grammar could then define how exactly such helper might look like.
 
The ''member-declarations-list'' is of course not a part of any C++ description, it is just a ''helper'' to explain the structure of a given semantic part of your language. The grammar could then define how exactly such helper might look like.
  
== Basic Syntax  ==
+
=== Basic Syntax  === <!--T:44-->
  
 +
<!--T:45-->
 
Now let us have a look at a basic example, a declaration in C++, as described in grammar syntax:
 
Now let us have a look at a basic example, a declaration in C++, as described in grammar syntax:
  
<pre>
+
<!--T:46-->
 +
{{Input|1=
 
   class_declaration
 
   class_declaration
 
  | struct_declaration
 
  | struct_declaration
Line 102: Line 304:
 
  | typedef_declaration
 
  | typedef_declaration
 
  | extern_declaration
 
  | extern_declaration
-> declaration ;;
+
-> declaration ;
</pre>
+
}}
  
 +
<!--T:47-->
 
This is called a ''rule'' definition. Every lower-case string in the grammar file references such a rule. Our case above defines what a ''declaration'' looks like. The ''|''-char stands for a logical ''or'', all rules have to end on two semicolons.  
 
This is called a ''rule'' definition. Every lower-case string in the grammar file references such a rule. Our case above defines what a ''declaration'' looks like. The ''|''-char stands for a logical ''or'', all rules have to end on two semicolons.  
  
 +
<!--T:48-->
 
In the example we reference other rules which also have to be defined. Here's for example the ''class_declaration'', note the tokens in all-upper-case:
 
In the example we reference other rules which also have to be defined. Here's for example the ''class_declaration'', note the tokens in all-upper-case:
  
<pre>
+
<!--T:49-->
 +
{{Input|1=
 
   CLASS IDENTIFIER SEMICOLON
 
   CLASS IDENTIFIER SEMICOLON
 
  | CLASS IDENTIFIER LBRACE class_declaration* RBRACE SEMICOLON
 
  | CLASS IDENTIFIER LBRACE class_declaration* RBRACE SEMICOLON
-> class_declaration ;;
+
-> class_declaration ;
</pre>
+
}}
  
 +
<!--T:50-->
 
There is a new char in there: The asterisk has the same meaning as in regular expressions, i.e. that the previous rule can occur arbitrarily often or not at all.
 
There is a new char in there: The asterisk has the same meaning as in regular expressions, i.e. that the previous rule can occur arbitrarily often or not at all.
  
In a grammar ''0'' stands for an empty token. Using it in addition with parenthesizing and the logical ''or'' from above, you can express optional elements:
+
<!--T:51-->
 +
In a grammar <code> 0 </code> stands for an empty token. Using it in addition with parenthesizing and the logical ''or'' from above, you can express optional elements:
  
<pre>
+
<!--T:52-->
 +
{{Input|1=
 
   some_required_rule SOME_TOKEN
 
   some_required_rule SOME_TOKEN
     ( some_optional_stuff | some_other_stuff | 0 )
+
     <nowiki>( some_optional_stuff | some_other_stuff | 0 )</nowiki>
-> my_rule ;;
+
-> my_rule ;
</pre>
+
}}
  
== Store it in the Parsetree  ==
+
<!--T:53-->
 +
All symbols never occuring on the left side of a rule are start-symbols. You can use one of them to start parsing.
  
 +
=== Making matched rules available to Visitors  === <!--T:54-->
 +
 +
<!--T:55-->
 
The simple rule above could be used to parse the token stream, yet no elements would be saved in the parsetree. This can be easily done though:
 
The simple rule above could be used to parse the token stream, yet no elements would be saved in the parsetree. This can be easily done though:
  
<pre>
+
<!--T:56-->
 +
{{Input|1=
 
   class_declaration=class_declaration
 
   class_declaration=class_declaration
 
  | struct_declaration=struct_declaration
 
  | struct_declaration=struct_declaration
Line 137: Line 350:
 
  | typedef_declaration=typedef_declaration
 
  | typedef_declaration=typedef_declaration
 
  | extern_declaration=extern_declaration
 
  | extern_declaration=extern_declaration
-> declaration ;;
+
-> declaration ;
</pre>
+
}}
  
The DeclarationAst struct now contains pointers to each of these elements. During the parse process the pointer for each found element gets set, all others become NULL. To store lists of elements, prepend the identifier with a hash (''#''):
+
<!--T:57-->
 +
The DeclarationAst struct now contains pointers to each of these elements. During the parse process the pointer for each found element gets set, all others become NULL. To store lists of elements, prepend the identifier with a hash <keycap> # </keycap>:
  
<pre>
+
<!--T:58-->
 +
{{Input|1=
 
   CLASS IDENTIFIER SEMICOLON
 
   CLASS IDENTIFIER SEMICOLON
 
  | CLASS IDENTIFIER LBRACE (#class_declaration=class_declaration)* RBRACE SEMICOLON
 
  | CLASS IDENTIFIER LBRACE (#class_declaration=class_declaration)* RBRACE SEMICOLON
-> class_declaration ;;
+
-> class_declaration ;
</pre>
+
}}
  
 +
<!--T:59-->
 
'''TODO: internal structure of the list, important for Visitors'''
 
'''TODO: internal structure of the list, important for Visitors'''
  
 +
<!--T:60-->
 
Identifier and targets can be used in more than one place:
 
Identifier and targets can be used in more than one place:
  
<pre>
+
<!--T:61-->
 +
{{Input|1=
 
   #one=one (#one=one)*
 
   #one=one (#one=one)*
-> one_or_more ;;
+
-> one_or_more ;
</pre>
+
}}
  
 +
<!--T:62-->
 
In the example above, all matches to the rule ''one'' will be stored in one and the same list ''one''.
 
In the example above, all matches to the rule ''one'' will be stored in one and the same list ''one''.
  
== Angabe der Tokens  ==
+
=== Defining available Tokens  === <!--T:63-->
  
An einer Stelle müssen alle Tokens aufgezählt werden. Aus ihnen wird ein enum in parser.h erstellt. Der zusätzlich anzugebende Name wird zur Anzeige von Fehlermeldungen u.ä. verwendet: <code>
+
<!--T:64-->
%token T1 ("T1-Name"), T2 ("T2-Name"), COMMA (";"), SEMICOLON (";") ;;
+
Somewhere in the grammar (you should probably put it near the head) you'll have to define a list of available Tokens. From this list, the ''TokenType'' enum in ''parser.h'' will be created. Additionally to the enum value names you should define an explanation name which will e.g. be used in error messages. Note that the representation of a Token inside the source code is not required for the grammar/parser as it operates on a TokenStream, see Lexer/Tokenizer section above.
</code>
+
  
== Special Characters...  ==
+
<!--T:65-->
 +
{{Input|1=
 +
%token T1 ("T1-Name"), T2 ("T2-Name"), COMMA (";"), SEMICOLON (";") ;
 +
}}
  
=== ...to use inside of Rules  ===
+
<!--T:66-->
 +
It is possible to use ''%token'' multiple times to group tokens in the grammar. Though all tokens will still be put into the same ''TokenType'' enum.
  
Einige weitere Zeichen erleichtern die Arbeit: <code>
+
<!--T:67-->
 +
'''TODO: explain process of using the parser Tokens'''
 +
 
 +
=== Special Syntax...  === <!--T:68-->
 +
 
 +
==== ...to use inside Rules  ==== <!--T:69-->
 +
 
 +
===== List of one or more elements ===== <!--T:70-->
 +
 
 +
<!--T:71-->
 +
Alternatively to the asterisk (''*'') you can use a plus sign (''+'') to mark lists of one-or-more elements:
 +
 
 +
<!--T:72-->
 +
{{Input|1=
 
   (#one=one)+
 
   (#one=one)+
-> one_or_more ;;
+
-> one_or_more ;
</code> ...entspricht genau obigem Beispiel.<br> Ebenfalls häufig verwendet ist das '@': <code>
+
}}
 +
 
 +
==== Separated lists ==== <!--T:73-->
 +
 
 +
<!--T:74-->
 +
Using the ''#rule @ TOKEN'' syntax you can mark a list of ''rule'', separated by ''TOKEN'':
 +
 
 +
<!--T:75-->
 +
{{Input|1=
 
   #item=item @ COMMA
 
   #item=item @ COMMA
-> comma_separated_list ;;
+
-> comma_separated_list ;
</code> ...sorgt genau dafür, dass zwischen den Items ein Komma stehen muss.<br> Dieses Syntax-Element ist zur Zeit leider auskommentiert: <code>
+
}}
  ?(item=item)
+
-> optional_item ;;
+
</code> ...entspricht genau: <code>
+
  (0 | item=item)
+
-> optional_item ;;
+
</code><br> Verwendet man an Stelle des Gleichheitszeichens den Doppelpunkt ('':''), so wird lediglich während des Parsens eine lokale Variable für den Unterbaum eingerichtet wird.
+
  
=== ...to add Hand-Written Code  ===
+
==== Optional items ==== <!--T:76-->
  
Manchmal muss im Parser manuell geschriebener Code eingebaut werden. Beispiele sind:  
+
<!--T:77-->
 +
Alternatively to the above mentioned ''(item=item | 0)'' syntax you can use the following to mark optional items:
  
*Eine eigene Fehler-Behandlung
+
<!--T:78-->
*Erzeugen der Tokens (wenn man es nicht von außerhalb machen möchte)
+
{{Input|1=
*Setzen bestimmter Variablen in den Nodes und dem Parser-Objekt (beispielsweise ließe sich bei einem C++-Parser zwischenspeichern, ob sich der Parser gerade in einem ''private'', ''protected'' oder ''public'' Abschnitt befindet. In den Nodes der Elemente ließe sich eine solche Information speichern.
+
  ?item=item
*Zusätzliche Prüfungen, die sich am schnellsten in C++ durchführen lassen
+
-> optional_item ;
 +
}}
  
<code>
+
<!--T:79-->
 +
{{Attention|1= ''<nowiki>(0|x)</nowiki>'' is equivalent to ''0''}}
 +
 
 +
==== Local variables for the parse-process ==== <!--T:80-->
 +
 
 +
<!--T:81-->
 +
Using a colon '''<keycap> : </keycap>''' instead of the equal sign '''<keycap>  =  </keycap> you can store the sub-AST in a local variable that will only be available during parsing, and only in the current rule.
 +
 
 +
<!--T:82-->
 +
'''TODO: need example'''
 +
 
 +
==== Inlining ==== <!--T:83-->
 +
 
 +
<!--T:84-->
 +
Instead of a name you can also place a dot <keycap> . </keycap> before the equal sign <keycap> = </keycap>. Then the AST will inherit all class-members from the sub-AST and the parsing-method for the sub-AST will be merged into the parent-AST. An example:
 +
 
 +
<!--T:85-->
 +
{{Input|1=
 +
  op=PLUS | op=MUL | op=MOD
 +
-> operator ;;
 +
  val1=number .=op val2=number
 +
-> simpleArithmetics ;
 +
}}
 +
 
 +
<!--T:86-->
 +
In SimpleArithmeticsAst there will be the fields ''val1'', ''val2'' (storing the operands) and ''op'' (storing the operator-token). ''parseSimpleArithmetics'' will not have to call ''parseOperator''. Obviously recursive inlining is not allowed.
 +
 
 +
=== ...to add Hand-Written Code  === <!--T:87-->
 +
 
 +
<!--T:88-->
 +
Sometimes it is required to integrate hand-written code into the generated parser. Instead of editing the end-result (**never** do that!) you should put this code into the grammar at the correct places. Here are a few examples when you'd need this:
 +
 
 +
<!--T:89-->
 +
* custom error handling / error recovery, i.e. to prevent the parser to stop at the first error
 +
* creating tokens, if you don't want to do that externally
 +
* setting custom variables, especially for state tracking. e.g. in C++ you could save whether you are inside a ''private'', ''protected'' oder ''public'' section. then you could save this information inside each node of the class elements.
 +
* additional verifications, lookaheads etc.
 +
 
 +
==== General Syntax ==== <!--T:90-->
 +
 
 +
<!--T:91-->
 +
{{Input|1=
 
[:
 
[:
// Code
+
// here be dragons^W code ;-)
 
:]
 
:]
</code> ...fügt der Datei parser.cpp den angegebenen Code hinzu. Wird diese Konstrukt innerhalb einer Regel verwendet, so wird der Code an der entsprechenden Stelle des Parsens eingefügt. <code>
+
}}
?[:
+
 
// Boolescher Ausdruck
+
<!--T:92-->
 +
The code will be put into the generated ''parser.cpp'' file. If you use it inside a grammar rule, it will be put into the correct position during the parse process. You can access the current node via the variable ''yynode'', it will have the type 'XYZAst**'.
 +
 
 +
==== Global Code ==== <!--T:93-->
 +
 
 +
<!--T:94-->
 +
In '''KDevelop''' language plugins, you'll see that most grammars start with something like:
 +
 
 +
<!--T:95-->
 +
{{Input|1=
 +
[:
 +
 
 +
<!--T:96-->
 +
#include <QtCore/QString>
 +
#include <kdebug.h>
 +
#include <tokenstream.h>
 +
#include <language/interfaces/iproblem.h>
 +
#include "phplexer.h"
 +
 
 +
<!--T:97-->
 +
namespace KDevelop
 +
{
 +
    class DUContext;
 +
}
 +
 
 +
<!--T:98-->
 
:]
 
:]
</code> ...sorgt dafür, dass der Boolesche Ausdruck Voraussetzung für die aktuelle Regel wird. Innerhalb dieser Code-Segmente kann man auf die aktuell bearbeitete Node über die Variable ''yynode'' des Typs 'XYZAst**' zugreifen. <code>
+
}}
 +
 
 +
<!--T:99-->
 +
This is a code section, that will be put at the beginning of ''ast.h'', i.e. into the global context.
 +
 
 +
<!--T:100-->
 +
But there are also some newer statements available to include header-files:
 +
 
 +
<!--T:101-->
 +
{{Input|1=
 +
%ast_header "header.h"
 +
%parser_declaration_header "header.h"
 +
%parser_bits_header "header.h"
 +
}}
 +
 
 +
<!--T:102-->
 +
The include-statement will be inserted into ast.h, parser.h respectively parser.cpp.
 +
 
 +
==== Namespace Code ==== <!--T:103-->
 +
 
 +
<!--T:104-->
 +
Also it's very common to define a set of enumerations e.g. for operators, modifiers, etc. pp. Here's an stripped example from PHP, note that the code will again be put into the generated ''parser.h'' file:
 +
 
 +
<!--T:105-->
 +
{{Input|1=
 +
%namespace
 +
[:
 +
    enum ModifierFlags {
 +
        ModifierPrivate      = 1,
 +
        ModifierPublic      = 1 << 1,
 +
        ModifierProtected    = 1 << 2,
 +
        ModifierStatic      = 1 << 3,
 +
        ModifierFinal        = 1 << 4,
 +
        ModifierAbstract    = 1 << 5
 +
    };
 +
...
 +
    enum OperationType {
 +
        OperationPlus = 1,
 +
        OperationMinus,
 +
        OperationConcat,
 +
        OperationMul,
 +
        OperationDiv,
 +
        OperationMod,
 +
        OperationAnd,
 +
        OperationOr,
 +
        OperationXor,
 +
        OperationSl,
 +
        OperationSr
 +
    };
 +
:]
 +
}}
 +
 
 +
<!--T:106-->
 +
When you write code at the end of the grammar-file simply between ''[:'' and '':]'', it will be added to ''parser.cpp''.
 +
 
 +
==== Additional AST member ==== <!--T:107-->
 +
 
 +
<!--T:108-->
 +
To add additional members to _every_ AST variable, use the following syntax:
 +
 
 +
<!--T:109-->
 +
{{Input|1=
 +
%ast_extra_members
 +
[:
 +
  KDevelop::DUContext* ducontext;
 +
:]
 +
}}
 +
 
 +
<!--T:110-->
 +
You can also specify the base-class for the nodes:
 +
 
 +
<!--T:111-->
 +
{{Input|1=
 +
%ast_base symbol_name "BaseClass"
 +
}}
 +
 
 +
<!--T:112-->
 +
''BaseClass'' has to inherit from ''AstNode''.
 +
 
 +
==== Additional parser class members ==== <!--T:113-->
 +
 
 +
<!--T:114-->
 +
Instead of polluting the global context with state tracker variables, and hence destroying the whole advantages of OOP, you can add additional members to the parser class. It's also very convenient to define functions for error reporting etc. pp. Again a stripped example from PHP:
 +
 
 +
<!--T:115-->
 +
{{Input|1=
 +
%parserclass (public declaration)
 +
[:
 +
  enum ProblemType {
 +
      Error,
 +
      Warning,
 +
      Info
 +
  };
 +
  void reportProblem( Parser::ProblemType type, const QString& message );
 +
  QList<KDevelop::ProblemPointer> problems() {
 +
      return m_problems;
 +
  }
 +
  ...
 +
  enum InitialLexerState {
 +
      HtmlState = 0,
 +
      DefaultState = 1
 +
  };
 +
:]
 +
}}
 +
 
 +
<!--T:116-->
 +
Note, that we used ''%parserclass (public declaration)'', we could instead have used ''private'' or ''protected'' declaration.
 +
 
 +
<!--T:117-->
 +
{{Input|1=
 
%parserclass ( [private|protected|public] declaration)
 
%parserclass ( [private|protected|public] declaration)
 
[:
 
[:
 
// Code
 
// Code
 
:]
 
:]
</code> ...fügt der Parser-Klasse in der Datei parser.h im entsprechenden Abschnitt Code hinzu. Wer Elemente hinzufügt, braucht auch Zugriff auf Konstruktor und Destruktor: <code>
+
}}
 +
 
 +
<!--T:118-->
 +
There is also a statement to specify a base-class:
 +
 
 +
<!--T:119-->
 +
{{Input|1=
 +
%parser_base "ClassName"
 +
}}
 +
 
 +
==== Initializing additional parser class members ==== <!--T:120-->
 +
 
 +
<!--T:121-->
 +
When you add member variables to the class, you'll have to initialize and or destroy them as well. Here's how (either use ctor or dtor, of course):
 +
 
 +
<!--T:122-->
 +
{{Input|1=
 
%parserclass ( [constructor|desctructor] )
 
%parserclass ( [constructor|desctructor] )
 
[:
 
[:
 
// Code
 
// Code
 
:]
 
:]
</code> ...fügt den entsprechenden Funktionen den Code hinzu. Um in der Auswertung einer Regel eine lokale Variable zu definieren, gibt es folgede Syntax: <code>
+
}}
 +
 
 +
==== Boolean Checks ==== <!--T:123-->
 +
 
 +
<!--T:124-->
 +
{{Input|1=
 +
?[:
 +
// some bool expression
 +
:]
 +
}}
 +
 
 +
<!--T:125-->
 +
The following rule will only apply if the boolean expression evaluates to ''true''. Here's an advanced example, which also shows that you can use the pipe symbol ('|') as logical or, i.e. essentially this is a ''if... else...' conditional:
 +
 
 +
<!--T:126-->
 +
{{Input|1=
 +
    ?[: someCondition :] SOMETOKEN ifrule=myVar
 +
  | elserule=myVar
 +
}}
 +
 
 +
<!--T:127-->
 +
This is especially convenient together with lookaheads (see below).
 +
 
 +
==== Defining local variables inside rules ==== <!--T:128-->
 +
 
 +
<!--T:129-->
 +
You can set up the grammar to define local variables whenever a rule gets applied:
 +
 
 +
<!--T:130-->
 +
{{Input|1=
 
   ...
 
   ...
 
-> class_member [:
 
-> class_member [:
 
   enum { A_PUBLIC, A_PROTECTED, A_PRIVATE } access;
 
   enum { A_PUBLIC, A_PROTECTED, A_PRIVATE } access;
:];;
+
:];
</code> Hierdurch wird jedoch lediglich eine lokale Variable eingeführt, die innerhalb der Regel verwendet wird. Möchte man dagegen im Parsetree zusätzliche Informationen speichern, so benötigt man diese Syntax: <code>
+
}}
[
+
 
   member|temporary variable name: type
+
<!--T:131-->
]
+
This variable is local to the rule ''class_member''.
</code> Korrekt wäre folglich: <code>
+
 
 +
==== Defining additional variables for the parse tree ==== <!--T:132-->
 +
 
 +
<!--T:133-->
 +
Similar to the syntax above, you can define members whenever a rule gets applied:
 +
 
 +
<!--T:134-->
 +
{{Input|1=
 +
  ...
 +
-> class_member [
 +
   [member|temporary] variable yourName: yourType
 +
];
 +
}}
 +
 
 +
<!--T:135-->
 +
For example:
 +
 
 +
<!--T:136-->
 +
{{input|1=
 
   ...
 
   ...
 
-> class_member [
 
-> class_member [
 
       member variable access : AccessType;
 
       member variable access : AccessType;
];;
+
];
</code> Der ''AccessType'' muss dann an anderer Stelle definiert werden. Verwendet man statt ''member'' das Wort ''temporary'', so erhält man eine äquivalente Formulierung für obigen Code. <code>
+
}}
%ast_extra_members
+
[:
+
// Code
+
:]
+
</code> ...fügt jeder Knoten-Klasse Code hinzu. (ast.h)
+
  
= Conflicts  =
+
<!--T:137-->
 +
Of course ''AccessType'' has to be defined somewhere else, see e.g. the ''Additional parser class members'' section above.
  
Erste Versuche, so einen Parser zu erzeugen, werden wahrscheinlich fehlschlagen. Es wird ein Konflikt angezeigt und das Parsen funktioniert nicht richtig. Ein Beispiel für einen sogenannten FIRST/FIRST-Konflikt: <code>
+
<!--T:138-->
 +
Using ''temporary'' or ''member'' is equivalent.
 +
 
 +
== Conflicts  == <!--T:139-->
 +
 
 +
<!--T:140-->
 +
The first time you write a grammar, you'll potentially fail: Since '''KDevelop-PG-Qt''' is a ''LL(1)'' parser generator, conflicts can occur and have to be solved by hand. Here's an example for a FIRST/FIRST conflict:
 +
 
 +
<!--T:141-->
 +
{{Input|1=
 
   CLASS IDENTIFIER SEMICOLON
 
   CLASS IDENTIFIER SEMICOLON
-> class_declaration ;;
+
-> class_declaration ;
  
   CLASS IDENTIFIER LBRACE class_content RBRACE SEMICOLON
+
   <!--T:142-->
-> class_definition ;;
+
CLASS IDENTIFIER LBRACE class_content RBRACE SEMICOLON
 +
-> class_definition ;
  
   class_declaration
+
   <!--T:143-->
 +
class_declaration
 
  | class_definition
 
  | class_definition
-> class_expression ;;
+
-> class_expression ;
</code> Ausgabe:
+
}}
<pre>** WARNING found FIRST/FIRST conflict in  "class_exp"
+
</pre>
+
Manchmal kann man mit Warnungen leben, deshalb wird Code erzeugt, eine ''class_expression'' wird jedoch nicht korrekt ausgewertet werden. Enthält der Code z.B. eine ''class_definition'', so wird der Parser zuerst in die Funktion für die ''class_declaration'' hineinspringen, da er das führende Token ''CLASS'' identifiziert hat. Das Semikolon wird jedoch nicht gefunden und es kommt zu einer Fehlermeldung.
+
  
== Backtracking  ==
+
<!--T:144-->
 +
KDevelop-PG-Qt will output:
  
Wer solche Grammatiken nur von der Theorie her kennt, den wird diese Vorgehensweise vielleicht verwundern. Denn in der BNF wäre so eine Angabe beispielsweise vollkommen in Ordnung. Der Parser könnte zum Beispiel zurück springen und die nächste Alternative ausprobieren. Dies ist jedoch nicht immer erforderlich und für eine gesteigerte Effizienz wird nach Möglichkeit darauf verzichtet. Es gibt jedoch eine Möglichkeit, die ganzen Konflikte zu umgehen, denn KDevelop-PG-Qt unterstützt besagtes Backtracking: <code>
+
<!--T:145-->
 +
{{Output|1=
 +
  ** WARNING found FIRST/FIRST conflict in  "class_expression"
 +
}}
 +
 
 +
<!--T:146-->
 +
Sometime's it's these warnings can be ignored, but most of the time it will lead to problems in parsing. In our example the ''class_expression'' rule won't be evaluated properly: When you try to parse a ''class_definition'', the parser will see that the first part (''CLASS'') of ''class_declaration'' matches, and jumps think's that this rule is to be applied. Since the next part of the rule does ''not'' match, an error will be reported. It does _not_ try to evaluate ''class_definition'' automatically, but there are different ways to solve this problem:
 +
 
 +
=== Backtracking  === <!--T:147-->
 +
 
 +
<!--T:148-->
 +
In theory such behavior might be unexpected: In BNF e.g. the above syntax would be enough and the parser would automatically jump back and retry with the next rule, here ''class_definition''. But in practice such behaviour is ''- most of the time -'' not necessary and would slow down the parser. If however such behavior is explicitly what you want, you can use an explicit backtracking syntax in your grammar:
 +
 
 +
<!--T:149-->
 +
{{Input|1=
 
   try/rollback(class_declaration)
 
   try/rollback(class_declaration)
 
       catch(class_definition)
 
       catch(class_definition)
-> class_expression ;;
+
-> class_expression ;
</code> So lassen sich alle Konflikte auflösen, dieses Vorgehensweise geht allerdings auf Kosten der Effizienz. Außerdem sollte die Reihenfolge beachtet werden (zur Effizienzsteigerung und zum korrekten Aufbau des Parsetrees).
+
}}
  
== Look ahead  ==
+
<!--T:150-->
 +
In theory, every FIRST/FIRST could be solved this way, but keep in mind that the more you use this feature, the slower your parser gets. If you use it, sort the rules in order of likeliness. Also, there could be cases where the sort order could make the difference between correct and wrong parsing, especially with rules that "extend" others.
  
KDevelop-PG-Qt bietet eine Möglichkeit an, andere Stellen des Token-Streams zu berücksichtigen, ohne gleich in eine tiefe Backtracking-Struktur einzutauchen. Hierfür gibt es eine Funktion LA(qint64). LA(1) gibt das aktuelle Token zurück, LA(2) das nächste, LA(0) das vorherige usw. (die seltsame Indizierung wurde von der Bezeichnung verschiedener Parser-Typen übernommen) <code>
+
== Look ahead  == <!--T:151-->
 +
 
 +
<!--T:152-->
 +
'''KDevelop-PG-Qt''' has an alternative to the - ''potentially'' slow - backtracking mechanism: Look ahead. You can use the ''LA(qint64)'' function in embedded C++ code (see sections above). ''LA(1)'' will return the current token, you most probably never need to use that. Accordingly, ''LA(2)'' returns the next and ''LA(0)'' the previous token. (If you wonder where these somewhat uncommon indexes come from: Read the thesis, or make yourself acquainted with ''LA(k)'' parser theory.)
 +
 
 +
<!--T:153-->
 +
{{Input|1=
 
   (?[: LA(2) == Token_LBRACE :] class_definition)
 
   (?[: LA(2) == Token_LBRACE :] class_definition)
 
  | class_declaration
 
  | class_declaration
 
-> class_expression
 
-> class_expression
</code> Es wird weiterhin ein Konflikt angezeigt, dieser wurde so jedoch manuell gelöst. Man sollte diese manuelle Auflösung in einem Kommentar erwähnen, bevor man ihn später als Fehlerquelle ansieht.  
+
}}
 +
 
 +
<!--T:154-->
 +
Note: The conflict will still be output, but it is ''manually'' resolved. You should always document this somewhere with a comment, for future contributors to know that this is a false-positive.
 +
 
 +
== Elegant Solutions  == <!--T:155-->
  
== Elegant Solutions  ==
+
<!--T:156-->
 +
Often you can solve the conflict all-together with an elegant solution. Like in our example:
  
In sehr vielen Fällen finden sich elegantere Lösungen. So auch in unserem Beispiel: <code>
+
<!--T:157-->
 +
{{Input|1=
 
   LBRACE class_content RBRACE
 
   LBRACE class_content RBRACE
-> class_definition ;;
+
-> class_definition ;
  
   CLASS IDENTIFIER (0 | class_definition) SEMICOLON
+
   <!--T:158-->
-> class_expression ;;
+
CLASS IDENTIFIER ?class_definition SEMICOLON
</code> Nun gibt es keine Konflikte mehr.
+
-> class_expression ;
 +
}}
  
== FIRST/FOLLOW-Conflicts  ==
+
<!--T:159-->
 +
No conflicts, fast: everybody is happy!
  
Von FIRST/FOLLOW-Konflikten spricht man dann, wenn es uneindeutig ist, wo ein Symbol endet und wo das Eltern-Symbol forgesetzt wird. Ein stupides Beispiel: <code>
+
== FIRST/FOLLOW-Conflicts  == <!--T:160-->
 +
 
 +
<!--T:161-->
 +
A FIRST/FOLLOW conflict says, that it is undefined where a symbol ends and the parent symbol continues. A pretty stupid example is:
 +
 
 +
<!--T:162-->
 +
{{Input|1=
 
   item*
 
   item*
-> item_list ;;
+
-> item_list ;
  
   item_list item*
+
   <!--T:163-->
-> items ;;
+
item_list item*
</code> Die Uneindeutigkeit ist offensichtlich. ''try/rollback'' hilft bei ernstzunehmenden Problemen (der Parser funktioniert nicht), oft lassen sich diese Konflikte allerdings auch ignorieren, jedoch sollte man sich darüber im klaren sein, dass der Parser ''greedy'' ist, also die ''item_list'' den größtmöglichen Platz einnehmen wird. Führt dies zu einer Unterteilung, die später zu Widersprüchen führt, helfen nur ''try/rollback'', Überprüfungen mit manuellem Code oder eine Umstrukturierung.
+
-> items ;
 +
}}
  
=== Changing the Greedy Behaviour  ===
+
<!--T:164-->
 +
You probably see the glaring error. ''try/rollback'' helps with serious problems (e.g. parser doesn't work), though often you can ignore these conflicts. But if you do so, be aware that the parser is ''greedy'': In our example ''item_list'' will always contain ''all'' ''item'' elements, and ''items'' will ''never'' have an ''item'' member. If this is an issue that leads to later conflicts, only ''try/rollback'', custom manual code to check which rule should be used, or a refactoring of your grammar structure.
  
Manchmal ist die Greedy-Verhaltensweise unerwünscht. Bei Widerholungen ist die Unterbrechung mit manuell geschriebenem Code möglich. Dieses Beispiel einer Deklaration eines Array mit fester Größe zeigt, wie eine Wiederholung beschränkt werden kann: <code>
+
=== Changing the Greedy Behaviour  === <!--T:165-->
 +
 
 +
<!--T:166-->
 +
Alternatively it is sometimes helpful/required to change the greedy behaviour. In lists you can use manual code to force a break at a given position. Here's an example that shows this on a declaration of an array with fixed size:
 +
 
 +
<!--T:167-->
 +
{{Input|1=
 
   typed_identifier=typed_identifier LBRACKET UNSIGNED_INTEGER
 
   typed_identifier=typed_identifier LBRACKET UNSIGNED_INTEGER
 
   [: count = static_cast<MyTokenStream*>(token)->strForCurrentToken().toUInt(); :]
 
   [: count = static_cast<MyTokenStream*>(token)->strForCurrentToken().toUInt(); :]
Line 292: Line 818:
 
   (#expression=expression [: if(--count == 0) break; :] @ COMMA) ?[: count == 0 :]
 
   (#expression=expression [: if(--count == 0) break; :] @ COMMA) ?[: count == 0 :]
 
   RBRACE SEMICOLON
 
   RBRACE SEMICOLON
-> initialized_fixed_array [ temporary variable count: uint; ];;
+
-> initialized_fixed_array [ temporary variable count: uint; ];
</code> Über den Code ''return false'' ist ein vorzeitiger Abbruch der momentanen Regelauswertung möglich. Über ''return true'' eine vorzeitige Rückkehr.
+
}}
  
== try/recover  ==
+
<!--T:168-->
 +
'''TODO: return true/false or break??'''
  
<code>
+
<!--T:169-->
 +
Via ''return false'' you can enforce a premature stop (== error) of the evaluation of the current rule. Using ''return true'' you stop the evaluation prematurely, but signalize that the parse process was successful.
 +
 
 +
== try/recover  == <!--T:170-->
 +
 
 +
<!--T:171-->
 +
<syntaxhighlight lang="text">
 
   try/recover(expression)
 
   try/recover(expression)
 
-> symbol ;;
 
-> symbol ;;
</code>
+
</syntaxhighlight>
 
This is approximately the same as:
 
This is approximately the same as:
<code>
+
<syntaxhighlight lang="text">
 
   [: ParserState *state = copyCurrentState(); :]
 
   [: ParserState *state = copyCurrentState(); :]
 
   try/rollback(expression)
 
   try/rollback(expression)
 
   catch( [: restoreState(state); :] )
 
   catch( [: restoreState(state); :] )
-> symbol ;;
+
-> symbol ;
</code>
+
</syntaxhighlight>
Hence you have to implement the member-functions copyCurrentState and restoreState and yaou have to define a type called ParseState. You do not have to write the declaration of those functions in the header-file, it is generated automatically if you use ''try/recover''. This concept seems to be useful if there are additional states used while parsing. The Java-parser takes usage from it very often. But I do not know a lot about this feature and it seems unimportant for me. (I guess, it is not) I would be happy when somebody could explain it to me.
+
 
 +
<!--T:172-->
 +
Hence you have to implement the member-functions copyCurrentState and restoreState and you have to define a type called ParseState. You do not have to write the declaration of those functions in the header-file, it is generated automatically if you use ''try/recover''. This concept seems to be useful if there are additional states used while parsing. The Java-parser takes usage from it very often. But I do not know a lot about this feature and it seems unimportant for me. (I guess, it is not) I would be happy when somebody could explain it to me.
 +
 
 +
== Operator Expression Parsing == <!--T:173-->
 +
 
 +
<!--T:174-->
 +
'''TODO'''
 +
 
 +
== Weblinks  == <!--T:175-->
  
= Weblinks  =
+
<!--T:176-->
 +
* [http://quickgit.kde.org/?p=kdevelop-pg-qt.git&a=blob&f=kdev-pg/kdev-pg-parser.yy  The KDevelop-PG-Qt-Grammar (it is a Bison/Yacc-grammar-file)]
 +
* [https://projects.kde.org/projects/extragear/kdevelop/utilities/kdevelop-pg-qt Project at projects.kde.org]
 +
* [http://the-user.org/category/computing/kde/kdev-pg-qt Some blog-posts by “The User” about KDevelop-PG-Qt (informative for recently added features)]
 +
* [http://stud4.tuwien.ac.at/~e0127236/thesis/jpetso-javaparser.pdf Jakob Petsovits' bachelor thesis about using KDevelop-PG for Java Java-Parsers]- It is a good in-depth introduction to everything you might want to know for writing your own grammar. Keep in mind that it is partly outdated. In doubt, refer to this page for updated syntax. Also some of the shortcomings of KDevelop-PG layed out in the thesis have been fixed in the meantime.
  
*[http://websvn.kde.org/trunk/playground/devtools/kdevelop-pg-qt/kdev-pg/kdev-pg-parser.yy?view=markup] - The KDevelop-PG-Qt-Grammar (it is a Bison/Yacc-grammar-file)
+
<!--T:177-->
*[http://websvn.kde.org/trunk/playground/devtools/kdevelop-pg-qt/] - WebSVN
+
[[Category:Getting_Started]]
*[http://stud4.tuwien.ac.at/~e0127236/thesis/jpetso-javaparser.pdf] - Jakob Petsovits' bachelor thesis about using KDevelop-PG for Java Java-Parsers - It is a good in-depth introduction to everything you might want to know for writing your own grammar. Keep in mind that it is partly outdated. In doubt, refer to this page for updated syntax. Also some of the shortcomings of KDevelop-PG layed out in the thesis have been fixed in the meantime. <br>
+
[[Category:Programming]]
 +
</translate>

Latest revision as of 14:47, 18 July 2012

Other languages:English 100% • ‎Russian 6%

Contents

[edit] Preface

KDevelop-PG-Qt is the parser-generator from KDevplatform. It is used for some KDevelop-languagesupport-plugins (Ruby, PHP, CSS...).

It uses Qt classes internally. There's also the original KDevelop-PG parser, which used types from the STL, but has since been superceeded by KDevelop-PG-Qt. Most of the features are the same, though it could be that the ...-Qt parser generator is more up to date and feature rich than the plain STL style generator. The ...-Qt version should be used to write parsers for KDevelop language plugins.

[edit] In-Depth information

This document is not supposed to be a full-fledged and in-depth resource for all parts of KDevelop-PG. Instead it is intended to be a short introduction and, more importantly, a reference for developers.

To get an in-depth introduction, read Jakob Petsovits' excellent Bachelor thesis. You find it in the Weblinks section at the bottom of this page. However, it does not cover recently added features.

[edit] The Application

[edit] Usage

You can find KDevelop-PG-Qt in git. Four example packages are also included in the sources.
To download it try:

git clone git://anongit.kde.org/kdevelop-pg-qt.git

or:

git clone kde:kdevelop-pg-qt
(when having setup git with kde:-prefix)

The program itself requests a .g file, a so called grammar, as input:

./kdev-pg-qt --output=prefix syntax.g

The value of the --output switch decides the prefix of the output files and additionally the namespace for the generated code. Kate provides elementary highlighting for KDevelop-PG-Qt's grammar-files.

[edit] Output Format

While evaluating the grammar and generating its parser files, the application will output information about so called conflicts to STDOUT. As said above, the following files will actually be prefixed.

[edit] ast.h

AST stands for Abstract Syntax Tree. It defines the data structure in which the parse tree is saved. Each node is a struct with the postfix Ast, which contains members that point to any possible sub elements.

[edit] parser.h and parser.cpp

One important part of parser.h is the definition of the parser tokens, the TokenType enum. The TokenStream of your lexer should to use this. You have to write your own lexer or let one generate by Flex. See also the part about Tokenizers/Lexers below.

Having the token stream available, you create your root item and call the parser on the parse method for the top-level AST item, e.g. DocumentAst* => parseDocument(&root). On success, root will contain the AST.

The parser will have one parse method for each possible node of the AST. This is nice for e.g. an expression parser or parsers that should only parse a sub-element of a full document.

[edit] visitor.h and visitor.cpp

The Visitor class provides an abstract interface to walk the AST. Most of the time you don't need to use this directly, the DefaultVisitor takes some work off your shoulders.

[edit] defaultvisitor.h and defaultvisitor.cpp

The DefaultVisitor is an implementation of the abstract Visitor interface and automatically visits each node in the AST. Hence, this is probably the best candidate for a base class for your personal visitors. Most language plugins use these in their Builder classes to create the DUChain.

[edit] Command-Line-Options

  • --namespace=namespace - sets the C++ namespace for the generated sources independently from the file prefix. When this option is set, you can also use / in the --ouput option
  • --no-ast - don't create the ast.h file, more to that below
  • --debug-visitor - generates a debug visitor that prints the AST
  • --serialize-visitor - generates code for serialization via a QIODevice
  • --terminals - all tokens will be written into the file kdev-pg-terminals
  • --symbols - all possible nodes from the AST (not the leafs) will be written into the file kdev-pg-symbols
  • --rules - all grammar rules with informationen about their syntactic correlations will be written into a file called kdev-pg-rules. useful for debugging and solving conflicts
  • --token-text - generates a function to map token-numbers onto token-names
  • --help - print usage information

[edit] Tokenizers/Lexers

As mentioned, KDevelop-PG-Qt requires a Tokenizer. You can either let KDevelop-PG-Qt generate one for you, write one per hand, as it has been done for C++ and PHP, or you can use external tools like Flex.

The tokenizer's job, in principle, boils down to:

  • converting keywords and chars with special meanings to tokens
  • converting literals and identifier to tokens
  • clean out anything that doesn't change the semantics, e.g. comments or whitespace (the latter of course not in Python)
  • while doing the above, handling character encoding (we recommend using UTF8 as much as possible)

The rest, e.g. actually building the tree and evaluating the semantics, is part of the parser and the AST visitors.

[edit] Using KDevelop-PG-Qt

KDevelop-PG-Qt can generate lexers being well integrated into its architecture (you do not have to create a token-stream-class invoking lex or something like that). See examples/foolisp in the code for a simplistic example, there is also an incomplete PHP-Lexer for demonstration purposes.

[edit] Regular Expressions

Regular expressions are used to write rules using the KDevelop-PG-Qt, we use the following syntax (α and β are arbitrary regular expressions, a and b characters):

  • α|β accepts any word accepted by α or accepted by β
  • α&β accepts any word accepted by both α and β
  • α^β accepts any word accepted by a but not by β
  • ~α accepts any word not accepted by α
  • ?α like α, but also accepts the empty word
  • α* accepts any (maybe empty) sequence of words accepted by α
  • α+ accepts any nonempty sequence of words accepted by α (equivalent to αα*)
  • α@β accepts any nonempty sequence of words accepted by α separated by words accepted by β (equivalent to α(βα)*)
  • αβ accepts words consisting of a word accepted by α followed by a word accepted by β
  • […] switches to “disjunctive” environment, αβ will get interpreted as α|β, you can use (…) inside the brackets to go back to normal mode
  • . accepts any single character
  • a-b accepts a single character between a and b (including a and b) in the Unicode (of course only characters that can be represented in the used encoding)
  • "…" will accept the word enclosed by the quotation marks, escape sequences will still get interpreted
  • a accepts the word consisting of the single character a
  • Any escape sequence (see below), accepts the word consisting of the character represented by the escape sequence
  • {⟨name⟩} accepts any word accepted by the regex named ⟨name⟩

All regular expressions are case sensitive. Sorry, there is currently no way for insensitivity.

[edit] Known Escape Sequences

There are several escape sequences which can be used to encode special characters:

  • \n, \t, \f, \v, \r, \0, \b, \a like in C
  • \x, \X, \u or \U followed by hex digits: character represented by this Unicode value (in hex)
  • \d, \D followed by decimal digits, same, but in decimal representation
  • \o, \O followed by octal digits, same, but in octal representation
  • \y, \Y followed by binary digits, same, but in binary representation

[edit] Predefined named regex

Some regexes are predefined and can be used using braces {⟨name⟩}. They get imported from the official Unicode data, some important regexes:

  • {alphabetic} any alphabetic character
  • {num} any numeric character
  • {ascii-range} any character representable in ASCII
  • {latin1-range} any character representable in Latin 1 (8 Bit)
  • {uppercase}
  • {lowercase}
  • {math}

[edit] Rules

Rules can be written as:

⟨regular expression⟩ TOKEN;

Then the Lexer will generate the token TOKEN for lexemes matching the given regular expression. Which token will be chosen if there are multiple options? We use the first longest match rule: It will take the longest possible match (eating as many characters as possible), if there are multiple of those matches, it will take the first one.

Rules can perform code actions and you can also omit tokens (then no token will be generated):

⟨regular expression⟩ [: ⟨code⟩ :] TOKEN;
⟨regular expression⟩ [: ⟨code⟩ :];
⟨regular expression⟩ ;

There is rudimentary support for lookahead and so called (our invention) barriers:

⟨regular expression⟩ %la(⟨regular expression⟩);
⟨regular expression⟩ %ba(⟨regular expression⟩);

The first rule will only accept words if they match the first regular expression and are followed by anything matching the expression specified using %la. The second rule will accept words matched by the first regular expression but will never run into a character sequence matching the regex specified by %ba. However, currently only rules with fixed length are allowed in %la and %ba (for example foo|bar, but not qux|garply). When applying the “first longest match” rule the %la/%ba expressions count, too.

You can create your own named regexes using an arrow:

⟨regular expression⟩ -> ⟨identifier⟩;

The first character of the identifier should not be upper case.

Additionally there are two special actions:

⟨regular expression⟩ %fail;
⟨regular expression⟩ %continue;

%fail will stop tokenization. %continue will make the matched characters part of the next token.

[edit] Rulesets

A grammar file can contain multiple rulesets. A ruleset is a set of rules, as described in the previous section. It gets declared using:

%lexer "name" ->
  ⟨rules⟩
  ;

For your main-ruleset you omit the name (the name will be “start”).

Usually the start-ruleset will be used. But you can change the ruleset in code actions using the macro lxSET_RULE_SET(⟨name⟩). You can specify code to be executed when entering or leaving a ruleset by using %enter [: ⟨code⟩ :]; or %leave [: ⟨code⟩ :]; respectively inside the definition of the ruleset.

[edit] Further Configuration and Output

The standard statements %lexer_declaration_header and %lexer_bits_header are available to include files in the generated lexer.h/lexer.cpp.

By using %lexer_base you can specify the baseclass for the lexer-class, by default it is the TokenStream class defined by KDevelop-PG-Qt.

After %lexerclass(bits) you can specify code to be inserted in lexer.cpp.

You have to specify an encoding the lexer should work with internally using %input_encoding "⟨encoding⟩", possible values:

  • ASCII (7-Bit)
  • Latin 1 (8-Bit)
  • UTF-8 (8-Bit, full Unicode)
  • UCS-2 (16-Bit, UCS-2 part of Unicode)
  • UTF-16 (16-Bit, full Unicode)
  • UTF-32 (32-Bit, full Unicode)

With %input_stream you can specify which class the lexer should use to get the characters to process, there are some predefined classes:

  • QStringIterator, reads from QString, required (internal) encoding: UTF-16 or UCS-2
  • QByteArrayIterator, reads from QByteArray, required encoding: ASCII, Latin-1 or UTF-8
  • QUtf16ToUcs4Iterator, reads from UTF-16 QString, required encoding: UTF-32 (UCS-4)
  • QUtf8ToUcs4Iterator, reads from UTF-8 QByteArray, required encoding: UTF-32 (UCS-4)
  • QUtf8ToUcs2Iterator, reads from UTF-8 QByteArray, required encoding: UCS-2
  • QUtf8ToUtf16Iterator, reads from UTF-8 QByteArray, required encoding: UTF-16
  • QUtf8ToAsciiIterator, reads from UTF-8 QByteArray, will ignore all non-ASCII characters, reqired encoding: ASCII

Whether you choose UTF-8, UTF-16 or UTF-32 is irrelevant for functionality, but it may significantly affect compile-time and run-time performance (you may want to test your Lexer with ASCII if compilation takes too long). For example you want to work with a QByteArray containing UTF-8 data, and you want to get full Unicode support: You could either use the QByteArrayIterator and UTF-8 as internal encoding, or the QUtf8ToUtf16Iterator and UTF-16, or the QUtf8ToUcs4Iterator and UTF-32.

You can also choose between %table_lexer; and %sequence_lexer; In the first case transitions between states of the lexer will get represented by big tables while generating the lexer (cases in the generated code). In the second case sequences will get stored in a compressed data structure and transitions will get represented by nested if-statements. For UTF-32 %table_lexer is infeasible, thus there %sequence_lexer is the onliest option.

Inside your actions of the lexer you can use some predefined macros:

lxCURR_POS  // position in the input (some kind of iterator or pointer)
lxCURR_IDX  // index of the position in the input
            // (it is the index as presented in the input, for example: input is a QByteArray, index incrementation per byte, but the lexer may operate on 32-bit codepoints)
lxCONTINUE  // like %continue, add the current lexeme to the next token
lxLENGTH    // length of the current lexeme (as presented in the input)
lxBEGIN_POS // position of the first character of the current lexeme
lxBEGIN_IDX  // corresponding index
lxNAMED_TOKEN(⟨name⟩, ⟨type⟩) // create a variable representing named ⟨name⟩ with the token type ⟨type⟩
lxTOKEN(⟨type⟩)               // create such a variable named “token”
lxDONE      // return the token generated before
lxRETURN(X) // create a token of type X and return it
lxEOF       // create the EOF-token
lxFINISH    // create the EOF-token and return it  (will stop tokenization)
lxFAIL      // raise the tokenization error
lxSKIP      // continue with the next lexeme (do not return a token, you should not have created one before)
lxNEXT_CHR(⟨chr⟩)             // set the variable ⟨chr⟩ to the next char in the input
yytoken     // current token

[edit] Using Flex

With the existing examples, it shouldn't be too hard to write such a lexer. Between most languages, especially those "inheriting" C, there are many common syntactic elements. Especially comments and literals can be handled just the same way over and over again. Adding a simple token is trivial:

"special-command"    return Parser::Token_SPECIAL_COMMAND;

That's pretty much it, take a look at eg. java.ll for an excellent example. However, it is quite tricky and ugly to handle UTF-8 with Flex.

[edit] How to write Grammar-Files

[edit] Context-Free Grammars

KDevelop-PG-Qt uses so called context-free grammars using a concept of non-terminals (nodes) and terminals(tokens). While writing the grammar for the basic structure of your language, you should try to mimic the semantics of the language. Lets take a look at an example:

C++-document consists of lots of declarations and definitions, a class definition could be handled e.g. in the following way:

  1. CLASS-token
  2. a identifier
  3. the {-token
  4. a member-declarations-list
  5. the }-token
  6. and finally the ;-token

The member-declarations-list is of course not a part of any C++ description, it is just a helper to explain the structure of a given semantic part of your language. The grammar could then define how exactly such helper might look like.

[edit] Basic Syntax

Now let us have a look at a basic example, a declaration in C++, as described in grammar syntax:

 struct_declaration
 

This is called a rule definition. Every lower-case string in the grammar file references such a rule. Our case above defines what a declaration looks like. The |-char stands for a logical or, all rules have to end on two semicolons.

In the example we reference other rules which also have to be defined. Here's for example the class_declaration, note the tokens in all-upper-case:

 CLASS IDENTIFIER LBRACE class_declaration* RBRACE SEMICOLON
-> class_declaration ;

There is a new char in there: The asterisk has the same meaning as in regular expressions, i.e. that the previous rule can occur arbitrarily often or not at all.

In a grammar 0 stands for an empty token. Using it in addition with parenthesizing and the logical or from above, you can express optional elements:

some_required_rule SOME_TOKEN
    ( some_optional_stuff | some_other_stuff | 0 )
-> my_rule ;

All symbols never occuring on the left side of a rule are start-symbols. You can use one of them to start parsing.

[edit] Making matched rules available to Visitors

The simple rule above could be used to parse the token stream, yet no elements would be saved in the parsetree. This can be easily done though:

class_declaration=class_declaration

The DeclarationAst struct now contains pointers to each of these elements. During the parse process the pointer for each found element gets set, all others become NULL. To store lists of elements, prepend the identifier with a hash # :

CLASS IDENTIFIER SEMICOLON

TODO: internal structure of the list, important for Visitors

Identifier and targets can be used in more than one place:

#one=one (#one=one)*
-> one_or_more ;

In the example above, all matches to the rule one will be stored in one and the same list one.

[edit] Defining available Tokens

Somewhere in the grammar (you should probably put it near the head) you'll have to define a list of available Tokens. From this list, the TokenType enum in parser.h will be created. Additionally to the enum value names you should define an explanation name which will e.g. be used in error messages. Note that the representation of a Token inside the source code is not required for the grammar/parser as it operates on a TokenStream, see Lexer/Tokenizer section above.

%token T1 ("T1-Name"), T2 ("T2-Name"), COMMA (";"), SEMICOLON (";") ;

It is possible to use %token multiple times to group tokens in the grammar. Though all tokens will still be put into the same TokenType enum.

TODO: explain process of using the parser Tokens

[edit] Special Syntax...

[edit] ...to use inside Rules

[edit] List of one or more elements

Alternatively to the asterisk (*) you can use a plus sign (+) to mark lists of one-or-more elements:

(#one=one)+
-> one_or_more ;

[edit] Separated lists

Using the #rule @ TOKEN syntax you can mark a list of rule, separated by TOKEN:

#item=item @ COMMA
-> comma_separated_list ;

[edit] Optional items

Alternatively to the above mentioned (item=item | 0) syntax you can use the following to mark optional items:

?item=item
-> optional_item ;
Warning.png
 
Attention
(0|x) is equivalent to 0


[edit] Local variables for the parse-process

Using a colon  : instead of the equal sign = you can store the sub-AST in a local variable that will only be available during parsing, and only in the current rule.

TODO: need example

[edit] Inlining

Instead of a name you can also place a dot . before the equal sign = . Then the AST will inherit all class-members from the sub-AST and the parsing-method for the sub-AST will be merged into the parent-AST. An example:

op=PLUS

In SimpleArithmeticsAst there will be the fields val1, val2 (storing the operands) and op (storing the operator-token). parseSimpleArithmetics will not have to call parseOperator. Obviously recursive inlining is not allowed.

[edit] ...to add Hand-Written Code

Sometimes it is required to integrate hand-written code into the generated parser. Instead of editing the end-result (**never** do that!) you should put this code into the grammar at the correct places. Here are a few examples when you'd need this:

  • custom error handling / error recovery, i.e. to prevent the parser to stop at the first error
  • creating tokens, if you don't want to do that externally
  • setting custom variables, especially for state tracking. e.g. in C++ you could save whether you are inside a private, protected oder public section. then you could save this information inside each node of the class elements.
  • additional verifications, lookaheads etc.

[edit] General Syntax

[:
// here be dragons^W code ;-)
:]

The code will be put into the generated parser.cpp file. If you use it inside a grammar rule, it will be put into the correct position during the parse process. You can access the current node via the variable yynode, it will have the type 'XYZAst**'.

[edit] Global Code

In KDevelop language plugins, you'll see that most grammars start with something like:

[:

#include <QtCore/QString>
#include <kdebug.h>
#include <tokenstream.h>
#include <language/interfaces/iproblem.h>
#include "phplexer.h"

namespace KDevelop
{
    class DUContext;
}

:]

This is a code section, that will be put at the beginning of ast.h, i.e. into the global context.

But there are also some newer statements available to include header-files:

%ast_header "header.h"
%parser_declaration_header "header.h"
%parser_bits_header "header.h"

The include-statement will be inserted into ast.h, parser.h respectively parser.cpp.

[edit] Namespace Code

Also it's very common to define a set of enumerations e.g. for operators, modifiers, etc. pp. Here's an stripped example from PHP, note that the code will again be put into the generated parser.h file:

%namespace
[:
    enum ModifierFlags {
        ModifierPrivate      = 1,
        ModifierPublic       = 1 << 1,
        ModifierProtected    = 1 << 2,
        ModifierStatic       = 1 << 3,
        ModifierFinal        = 1 << 4,
        ModifierAbstract     = 1 << 5
    };
...
    enum OperationType {
        OperationPlus = 1,
        OperationMinus,
        OperationConcat,
        OperationMul,
        OperationDiv,
        OperationMod,
        OperationAnd,
        OperationOr,
        OperationXor,
        OperationSl,
        OperationSr
    };
:]

When you write code at the end of the grammar-file simply between [: and :], it will be added to parser.cpp.

[edit] Additional AST member

To add additional members to _every_ AST variable, use the following syntax:

%ast_extra_members
[:
  KDevelop::DUContext* ducontext;
:]

You can also specify the base-class for the nodes:

%ast_base symbol_name "BaseClass"

BaseClass has to inherit from AstNode.

[edit] Additional parser class members

Instead of polluting the global context with state tracker variables, and hence destroying the whole advantages of OOP, you can add additional members to the parser class. It's also very convenient to define functions for error reporting etc. pp. Again a stripped example from PHP:

%parserclass (public declaration)
[:
  enum ProblemType {
      Error,
      Warning,
      Info
  };
  void reportProblem( Parser::ProblemType type, const QString& message );
  QList<KDevelop::ProblemPointer> problems() {
      return m_problems;
  }
  ...
  enum InitialLexerState {
      HtmlState = 0,
      DefaultState = 1
  };
:]

Note, that we used %parserclass (public declaration), we could instead have used private or protected declaration.

protected

There is also a statement to specify a base-class:

%parser_base "ClassName"

[edit] Initializing additional parser class members

When you add member variables to the class, you'll have to initialize and or destroy them as well. Here's how (either use ctor or dtor, of course):

desctructor] )
[:
// Code
:]

[edit] Boolean Checks

?[:
// some bool expression
:]

The following rule will only apply if the boolean expression evaluates to true. Here's an advanced example, which also shows that you can use the pipe symbol ('|') as logical or, i.e. essentially this is a if... else...' conditional:

?[: someCondition :] SOMETOKEN ifrule=myVar

This is especially convenient together with lookaheads (see below).

[edit] Defining local variables inside rules

You can set up the grammar to define local variables whenever a rule gets applied:

...
-> class_member [:
   enum { A_PUBLIC, A_PROTECTED, A_PRIVATE } access;
:];

This variable is local to the rule class_member.

[edit] Defining additional variables for the parse tree

Similar to the syntax above, you can define members whenever a rule gets applied:

temporary] variable yourName: yourType
];

For example:

...
-> class_member [
      member variable access : AccessType;
];

Of course AccessType has to be defined somewhere else, see e.g. the Additional parser class members section above.

Using temporary or member is equivalent.

[edit] Conflicts

The first time you write a grammar, you'll potentially fail: Since KDevelop-PG-Qt is a LL(1) parser generator, conflicts can occur and have to be solved by hand. Here's an example for a FIRST/FIRST conflict:

 class_definition
-> class_expression ;

KDevelop-PG-Qt will output:

** WARNING found FIRST/FIRST conflict in  "class_expression"

Sometime's it's these warnings can be ignored, but most of the time it will lead to problems in parsing. In our example the class_expression rule won't be evaluated properly: When you try to parse a class_definition, the parser will see that the first part (CLASS) of class_declaration matches, and jumps think's that this rule is to be applied. Since the next part of the rule does not match, an error will be reported. It does _not_ try to evaluate class_definition automatically, but there are different ways to solve this problem:

[edit] Backtracking

In theory such behavior might be unexpected: In BNF e.g. the above syntax would be enough and the parser would automatically jump back and retry with the next rule, here class_definition. But in practice such behaviour is - most of the time - not necessary and would slow down the parser. If however such behavior is explicitly what you want, you can use an explicit backtracking syntax in your grammar:

try/rollback(class_declaration)
      catch(class_definition)
-> class_expression ;

In theory, every FIRST/FIRST could be solved this way, but keep in mind that the more you use this feature, the slower your parser gets. If you use it, sort the rules in order of likeliness. Also, there could be cases where the sort order could make the difference between correct and wrong parsing, especially with rules that "extend" others.

[edit] Look ahead

KDevelop-PG-Qt has an alternative to the - potentially slow - backtracking mechanism: Look ahead. You can use the LA(qint64) function in embedded C++ code (see sections above). LA(1) will return the current token, you most probably never need to use that. Accordingly, LA(2) returns the next and LA(0) the previous token. (If you wonder where these somewhat uncommon indexes come from: Read the thesis, or make yourself acquainted with LA(k) parser theory.)

 class_declaration
-> class_expression

Note: The conflict will still be output, but it is manually resolved. You should always document this somewhere with a comment, for future contributors to know that this is a false-positive.

[edit] Elegant Solutions

Often you can solve the conflict all-together with an elegant solution. Like in our example:

LBRACE class_content RBRACE
-> class_definition ;

   CLASS IDENTIFIER ?class_definition SEMICOLON
-> class_expression ;

No conflicts, fast: everybody is happy!

[edit] FIRST/FOLLOW-Conflicts

A FIRST/FOLLOW conflict says, that it is undefined where a symbol ends and the parent symbol continues. A pretty stupid example is:

item*
-> item_list ;

   item_list item*
-> items ;

You probably see the glaring error. try/rollback helps with serious problems (e.g. parser doesn't work), though often you can ignore these conflicts. But if you do so, be aware that the parser is greedy: In our example item_list will always contain all item elements, and items will never have an item member. If this is an issue that leads to later conflicts, only try/rollback, custom manual code to check which rule should be used, or a refactoring of your grammar structure.

[edit] Changing the Greedy Behaviour

Alternatively it is sometimes helpful/required to change the greedy behaviour. In lists you can use manual code to force a break at a given position. Here's an example that shows this on a declaration of an array with fixed size:

typed_identifier=typed_identifier LBRACKET UNSIGNED_INTEGER
   [: count = static_cast<MyTokenStream*>(token)->strForCurrentToken().toUInt(); :]
   RBRACKET EQUAL LBRACE
   (#expression=expression [: if(--count == 0) break; :] @ COMMA) ?[: count == 0 :]
   RBRACE SEMICOLON
-> initialized_fixed_array [ temporary variable count: uint; ];

TODO: return true/false or break??

Via return false you can enforce a premature stop (== error) of the evaluation of the current rule. Using return true you stop the evaluation prematurely, but signalize that the parse process was successful.

[edit] try/recover

   try/recover(expression)
-> symbol ;;

This is approximately the same as:

   [: ParserState *state = copyCurrentState(); :]
   try/rollback(expression)
   catch( [: restoreState(state); :] )
-> symbol ;

Hence you have to implement the member-functions copyCurrentState and restoreState and you have to define a type called ParseState. You do not have to write the declaration of those functions in the header-file, it is generated automatically if you use try/recover. This concept seems to be useful if there are additional states used while parsing. The Java-parser takes usage from it very often. But I do not know a lot about this feature and it seems unimportant for me. (I guess, it is not) I would be happy when somebody could explain it to me.

[edit] Operator Expression Parsing

TODO

[edit] Weblinks


This page was last modified on 18 July 2012, at 14:47. This page has been accessed 27,422 times. Content is available under Creative Commons License SA 3.0 as well as the GNU Free Documentation License 1.2.
KDE® and the K Desktop Environment® logo are registered trademarks of KDE e.V.Legal