Revision as of 14:04, 22 October 2009

Development/KDevelop-PG-Qt Introduction

KDevelop-PG-Qt is the parser generator from KDevplatform. It is used by several KDevelop language support plugins (Ruby, PHP, Java, ...).

It uses Qt classes internally. The original KDevelop-PG parser generator used types from the STL and has since been superseded by KDevelop-PG-Qt. Most features are the same, although the Qt version is likely to be more up to date and feature-rich than the plain STL-style generator. Use the Qt version to write parsers for KDevelop language plugins.

In-Depth information

This document is not supposed to be a full-fledged and in-depth resource for all parts of KDevelop-PG. Instead it is intended to be a short introduction and, more importantly, a reference for developers.

For an in-depth introduction, read Jakob Petsovits' excellent bachelor thesis. You can find it in the Weblinks section at the bottom of this page.

The Application

Usage

You can find KDevelop-PG-Qt in SVN. The source also includes three example packages.

svn co svn://anonsvn.kde.org/home/kde/trunk/playground/devtools/

The program itself takes a .g file, a so-called grammar, as input:

./kdev-pg-qt --output=prefix syntax.g

The value of the --output switch determines the prefix of the output files and additionally the namespace for the generated code.

Output Format

While evaluating the grammar and generating its parser files, the application will output information about so-called conflicts to STDOUT. As mentioned above, the names of the following files will actually be prefixed.

ast.h

AST stands for Abstract Syntax Tree. ast.h defines the data structure in which the parse tree is stored. Each node is a struct whose name ends in Ast and which contains members that point to any possible sub-elements.

parser.h and parser.cpp

One important part of parser.h is the definition of the parser tokens, the TokenType enum. The TokenStream of your lexer should use it. You have to write your own lexer or have one generated by Flex. See also the section on Tokenizers/Lexers below.

With the token stream available, you create your root item and call the parser's parse method for the top-level AST item, e.g. for a DocumentAst* the call is parseDocument(&root). On success, root will contain the AST.

The parser has one parse method for each possible node of the AST. This is useful e.g. for an expression parser or for parsers that should only parse a sub-element of a full document.
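The calling convention described above can be sketched in C++ as follows. The generated parser is not available here, so Parser, DocumentAst, the two token values and the helper countWords are hand-written stand-ins (assumptions for illustration only). The bool return value is also an assumption of this sketch; only the parseDocument(&root) call shape is taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for generated code; all names are illustrative only.
enum TokenType { Token_EOF = 0, Token_WORD = 1 };

struct DocumentAst {
    std::size_t wordCount = 0;   // made-up payload so the result can be inspected
};

class Parser {
public:
    explicit Parser(const std::vector<TokenType>& tokens) : m_tokens(tokens) {}

    // One parse method per AST node type. On success the new node is
    // stored in *node, mirroring the parseDocument(&root) call shape.
    bool parseDocument(DocumentAst** node) {
        assert(node != nullptr);
        DocumentAst* ast = new DocumentAst;
        std::size_t i = 0;
        while (i < m_tokens.size() && m_tokens[i] == Token_WORD) {
            ++ast->wordCount;
            ++i;
        }
        // The stream must end with exactly one Token_EOF.
        if (i + 1 != m_tokens.size() || m_tokens[i] != Token_EOF) {
            delete ast;
            return false;
        }
        *node = ast;
        return true;
    }

private:
    std::vector<TokenType> m_tokens;
};

// Hypothetical helper: returns the number of words, or -1 if parsing failed.
int countWords(const std::vector<TokenType>& tokens) {
    Parser parser(tokens);
    DocumentAst* root = nullptr;          // create the root item
    if (!parser.parseDocument(&root)) {   // call the top-level parse method
        return -1;
    }
    const int n = static_cast<int>(root->wordCount);
    delete root;
    return n;
}
```

The caller only owns a pointer to the root node and lets the parser fill it in, which is why the parse method takes a pointer to a pointer.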

visitor.h and visitor.cpp

The Visitor class provides an abstract interface to walk the AST. Most of the time you don't need to use this directly; the DefaultVisitor takes some work off your shoulders.

defaultvisitor.h and defaultvisitor.cpp

The DefaultVisitor is an implementation of the abstract Visitor interface and automatically visits each node in the AST. Hence, this is probably the best candidate for a base class for your personal visitors. Most language plugins use these in their Builder classes to create the DUChain.
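The relationship between Visitor, DefaultVisitor and your own visitors can be sketched as follows. The classes below are hand-written stand-ins that only approximate what the text describes; the node types, the member names and the ClassCounter visitor are assumptions made for this example, not generated code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the generated ast.h, visitor.h and defaultvisitor.h.
struct ClassDeclarationAst;
struct DocumentAst;

class Visitor {
public:
    virtual ~Visitor() {}
    virtual void visitDocument(DocumentAst* node) = 0;
    virtual void visitClassDeclaration(ClassDeclarationAst* node) = 0;
};

struct ClassDeclarationAst {
    std::vector<ClassDeclarationAst*> nested;
};

struct DocumentAst {
    std::vector<ClassDeclarationAst*> classes;
};

// Walks every child node; subclasses override only what they care about.
class DefaultVisitor : public Visitor {
public:
    void visitDocument(DocumentAst* node) override {
        assert(node != nullptr);
        for (ClassDeclarationAst* c : node->classes) visitClassDeclaration(c);
    }
    void visitClassDeclaration(ClassDeclarationAst* node) override {
        assert(node != nullptr);
        for (ClassDeclarationAst* c : node->nested) visitClassDeclaration(c);
    }
};

// A "personal" visitor: counts classes, then lets the base class descend.
class ClassCounter : public DefaultVisitor {
public:
    int count = 0;
    void visitClassDeclaration(ClassDeclarationAst* node) override {
        ++count;
        DefaultVisitor::visitClassDeclaration(node);
    }
};

// Hypothetical helper that runs the counter over a whole document.
int countClasses(DocumentAst* doc) {
    ClassCounter counter;
    counter.visitDocument(doc);
    return counter.count;
}
```

Because the base class already knows how to reach every child, ClassCounter needs no traversal code of its own.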

Command-Line-Options

--namespace=namespace - sets the C++ namespace for the generated sources independently of the file prefix. When this option is set, you can also use / in the --output option
--no-ast - don't create the ast.h file, more on that below
--debug-visitor - generates a debug visitor that prints the AST
--serialize-visitor - generates code for serialization via a QIODevice
--symbols - all possible nodes of the AST (not the leaves) will be written into the file kdev-pg-symbol.
--rules - all grammar rules with information about their syntactic correlations will be written into a file called kdev-pg-rules; useful for debugging and solving conflicts
--help - a so far not really helpful help text ;-)

Tokenizers/Lexers

As mentioned, KDevelop-PG-Qt requires an existing tokenizer. You can either write one by hand, as was done for C++ and PHP, or you can use tools like Flex. With the existing examples, it shouldn't be too hard to write such a lexer. Most languages, especially those "inheriting" from C, share many syntactic elements. Comments and literals in particular can be handled the same way over and over again. With Flex, adding a simple token is trivial:

"special-command"    return Parser::Token_SPECIAL_COMMAND;

That's pretty much it; take a look at e.g. java.ll for an excellent example.

The tokenizer's job, in principle, boils down to:

converting keywords and characters with special meanings to tokens
converting literals and identifiers to tokens
cleaning out anything that doesn't change the semantics, e.g. comments or whitespace (the latter of course not in Python)
while doing the above, handling character encoding (we recommend using UTF-8 as much as possible)

The rest, e.g. actually building the tree and evaluating the semantics, is part of the parser and the AST visitors.
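A hand-written tokenizer covering the first three points of the list above might look like the following minimal sketch. The token kinds, the Token struct and the tokenize function are assumptions made for this example; a real lexer would use the generated TokenType enum and fill a TokenStream. Encoding is not handled here, the input is simply treated as ASCII bytes.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds; a real lexer would use the generated TokenType enum.
enum TokenKind { Token_CLASS, Token_IDENTIFIER, Token_NUMBER,
                 Token_LBRACE, Token_RBRACE, Token_SEMICOLON, Token_INVALID };

struct Token {
    TokenKind kind;
    std::string text;
};

std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) {                        // drop whitespace
            ++i;
            continue;
        }
        if (c == '/' && i + 1 < src.size() && src[i + 1] == '/') {
            while (i < src.size() && src[i] != '\n') ++i;   // drop // comments
            continue;
        }
        if (std::isalpha(c) || c == '_') {            // keyword or identifier
            const std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) {
                ++i;
            }
            assert(i > start);                        // at least one character consumed
            const std::string word = src.substr(start, i - start);
            out.push_back({word == "class" ? Token_CLASS : Token_IDENTIFIER, word});
            continue;
        }
        if (std::isdigit(c)) {                        // integer literal
            const std::size_t start = i;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            out.push_back({Token_NUMBER, src.substr(start, i - start)});
            continue;
        }
        TokenKind kind = Token_INVALID;               // single-character tokens
        if (c == '{') kind = Token_LBRACE;
        else if (c == '}') kind = Token_RBRACE;
        else if (c == ';') kind = Token_SEMICOLON;
        out.push_back({kind, std::string(1, src[i])});
        ++i;
    }
    return out;
}
```

Unknown characters become Token_INVALID instead of being dropped, so a parser built on top of this can still report them.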

How to write Grammar-Files

Chomsky Type-2 Grammars

KDevelop-PG-Qt uses so-called Type-2 grammars, which are built on the concept of non-terminals (nodes) and terminals (tokens). While writing the grammar for the basic structure of your language, you should try to mimic the semantics of the language. Let's take a look at an example:

A C++ document consists of lots of declarations and definitions. A class definition could be handled e.g. in the following way:

the CLASS token
an identifier
the { token
a member-declarations-list
the } token
and finally the ; token

The member-declarations-list is of course not part of any C++ description; it is just a helper to explain the structure of a given semantic part of your language. The grammar can then define what exactly such a helper looks like.

Basic Syntax

Now let us have a look at a basic example, a declaration in C++, as described in grammar syntax:

   class_declaration
 | struct_declaration
 | function_declaration
 | union_declaration
 | namespace_declaration
 | typedef_declaration
 | extern_declaration
-> declaration ;;

This is called a rule definition. Every lower-case string in the grammar file references such a rule. Our case above defines what a declaration looks like. The | character stands for a logical or. All rules have to end with two semicolons.

In the example we reference other rules which also have to be defined. Here's for example the class_declaration, note the tokens in all-upper-case:

   CLASS IDENTIFIER SEMICOLON
 | CLASS IDENTIFIER LBRACE class_declaration* RBRACE SEMICOLON
-> class_declaration ;;

There is a new character in there: the asterisk has the same meaning as in regular expressions, i.e. the preceding element can occur arbitrarily often or not at all.

In a grammar, 0 stands for an empty token. Combining it with parentheses and the logical or from above, you can express optional elements:

  some_required_rule SOME_TOKEN
    ( some_optional_stuff | some_other_stuff | 0 )
-> my_rule ;;

Making matched rules available to Visitors

The simple rule above could be used to parse the token stream, yet no elements would be saved in the parse tree. This can easily be changed, though:

   class_declaration=class_declaration
 | struct_declaration=struct_declaration
 | function_declaration=function_declaration
 | union_declaration=union_declaration
 | namespace_declaration=namespace_declaration
 | typedef_declaration=typedef_declaration
 | extern_declaration=extern_declaration
-> declaration ;;

The DeclarationAst struct now contains pointers to each of these elements. During the parse process the pointer for each found element gets set; all others become NULL. To store lists of elements, prefix the identifier with a hash (#):
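To make the pointer members more concrete, here is a hand-written approximation of such a node, reduced to three of the alternatives. The real struct is generated into the prefixed ast.h and may differ in detail; the empty child structs and the helper matchedAlternative are assumptions made for this sketch.

```cpp
#include <cassert>
#include <string>

// Hand-written approximation; the generated struct may differ in detail.
struct ClassDeclarationAst {};
struct StructDeclarationAst {};
struct FunctionDeclarationAst {};

struct DeclarationAst {
    ClassDeclarationAst*    class_declaration    = nullptr;
    StructDeclarationAst*   struct_declaration   = nullptr;
    FunctionDeclarationAst* function_declaration = nullptr;
};

// Hypothetical helper: reports which alternative of the rule matched
// by looking for the one pointer that is not null.
std::string matchedAlternative(const DeclarationAst& node) {
    const int setCount = (node.class_declaration ? 1 : 0)
                       + (node.struct_declaration ? 1 : 0)
                       + (node.function_declaration ? 1 : 0);
    assert(setCount <= 1);   // only the matched alternative is expected to be set
    if (node.class_declaration)    return "class";
    if (node.struct_declaration)   return "struct";
    if (node.function_declaration) return "function";
    return "none";
}
```

A visitor would typically perform the same null checks before descending into a child.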

   CLASS IDENTIFIER SEMICOLON
 | CLASS IDENTIFIER LBRACE (#class_declaration=class_declaration)* RBRACE SEMICOLON
-> class_declaration ;;

TODO: internal structure of the list, important for Visitors

Identifiers and targets can be used in more than one place:

   #one=one (#one=one)*
-> one_or_more ;;

In the example above, all matches to the rule one will be stored in one and the same list one.

Defining available Tokens

Somewhere in the grammar, preferably near the top, you have to define the list of available tokens. From this list, the TokenType enum in parser.h will be created. In addition to the enum value names you should define a descriptive name, which will be used e.g. in error messages. Note that the representation of a token in the source code is not needed by the grammar/parser, as it operates on a TokenStream; see the Tokenizers/Lexers section above.

%token T1 ("T1-Name"), T2 ("T2-Name"), COMMA (","), SEMICOLON (";") ;;

It is possible to use %token multiple times to group tokens in the grammar, though all tokens will still be put into the same TokenType enum.

TODO: explain process of writing Lexer/Tokenizer and using the parser Tokens

Special Syntax...

...to use inside Rules

list of one or more elements

As an alternative to the asterisk (*), you can use a plus sign (+) to mark lists of one or more elements:

   (#one=one)+
-> one_or_more ;;

separated lists

Using the #rule @ TOKEN syntax you can mark a list of rule elements separated by TOKEN:

   #item=item @ COMMA
-> comma_separated_list ;;

optional items

TODO: is this available or commented out?

As an alternative to the (item=item | 0) syntax mentioned above, you can use the following to mark optional items:

   ?(item=item)
-> optional_item ;;

local variables for the parse-process

Using a colon (:) instead of the equal sign (=) you can store local variables that will only be available during parsing, and only for the sub-tree.

TODO: need example

...to add Hand-Written Code

Sometimes it is necessary to integrate hand-written code into the generated parser. Instead of editing the end result (**never** do that!) you should put this code into the grammar at the correct places. Here are a few examples of when you'd need this:

custom error handling / error recovery, i.e. to prevent the parser from stopping at the first error
creating tokens, if you don't want to do that externally
setting custom variables, especially for state tracking. E.g. in C++ you could save whether you are inside a private, protected or public section, and then store this information inside each node of the class elements.
additional verifications, lookaheads etc.

General Syntax

[:
// here be dragons^W code ;-)
:]

The code will be put into the generated parser.cpp file. If you use it inside a grammar rule, it will be put into the correct position during the parse process. You can access the current node via the variable yynode, it will have the type 'XYZAst**'.

Global Code

In KDevelop language plugins, you'll see that most grammars start with something like:

[:

#include <QtCore/QString>
#include <kdebug.h>
#include <tokenstream.h>
#include <language/interfaces/iproblem.h>
#include "phplexer.h"

namespace KDevelop
{
    class DUContext;
}

:]

This is a code section that will be put at the beginning of parser.h, i.e. into the global context.

Namespace Code

It's also very common to define a set of enumerations, e.g. for operators, modifiers, etc. Here's a stripped-down example from PHP; note that the code will again be put into the generated parser.h file:

%namespace
[:
    enum ModifierFlags {
        ModifierPrivate      = 1,
        ModifierPublic       = 1 << 1,
        ModifierProtected    = 1 << 2,
        ModifierStatic       = 1 << 3,
        ModifierFinal        = 1 << 4,
        ModifierAbstract     = 1 << 5
    };
...
    enum OperationType {
        OperationPlus = 1,
        OperationMinus,
        OperationConcat,
        OperationMul,
        OperationDiv,
        OperationMod,
        OperationAnd,
        OperationOr,
        OperationXor,
        OperationSl,
        OperationSr
    };
:]
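The power-of-two values of ModifierFlags suggest that several modifiers are meant to be combined in a single integer. The following sketch shows how such a combined value could be built and queried. The enum is copied from the excerpt above; the helpers hasModifier and publicStatic are hypothetical and not part of the PHP plugin.

```cpp
#include <cassert>

// Copied from the grammar excerpt above.
enum ModifierFlags {
    ModifierPrivate   = 1,
    ModifierPublic    = 1 << 1,
    ModifierProtected = 1 << 2,
    ModifierStatic    = 1 << 3,
    ModifierFinal     = 1 << 4,
    ModifierAbstract  = 1 << 5
};

// Hypothetical helper: true if the bit of `flag` is set in `modifiers`.
inline bool hasModifier(unsigned modifiers, ModifierFlags flag) {
    assert(flag != 0);   // a zero flag would match trivially
    return (modifiers & static_cast<unsigned>(flag)) == static_cast<unsigned>(flag);
}

// Hypothetical helper: the modifiers "public static" combined into one value.
inline unsigned publicStatic() {
    return ModifierPublic | ModifierStatic;
}
```

OperationType, in contrast, uses consecutive values, so its entries are mutually exclusive and are compared with ==.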

Additional AST member

To add additional members to _every_ AST node, use the following syntax:

%ast_extra_members
[:
  KDevelop::DUContext* ducontext;
:]

Additional parser class members

Instead of polluting the global context with state-tracking variables, and thereby giving up the advantages of OOP, you can add additional members to the parser class. It's also very convenient to define functions for error reporting etc. Again a stripped-down example from PHP:

%parserclass (public declaration)
[:
  enum ProblemType {
      Error,
      Warning,
      Info
  };
  void reportProblem( Parser::ProblemType type, const QString& message );
  QList<KDevelop::ProblemPointer> problems() {
      return m_problems;
  }
  ...
  enum InitialLexerState {
      HtmlState = 0,
      DefaultState = 1
  };
:]

Note that we used %parserclass (public declaration); we could instead have used a private or protected declaration.

%parserclass ( [private|protected|public] declaration)
[:
// Code
:]

Initializing additional parser class members

When you add member variables to the class, you'll have to initialize and/or destroy them as well. Here's how (use either constructor or destructor, of course):

%parserclass ( [constructor|destructor] )
[:
// Code
:]

Boolean Checks

?[:
// some bool expression
:]

The rule that follows will only apply if the boolean expression evaluates to true. Here's an advanced example, which also shows that you can use the pipe symbol ('|') as a logical or, i.e. this is essentially an if ... else ... conditional:

    ?[: someCondition :] SOMETOKEN ifrule=myVar
  | elserule=myVar

This is especially convenient together with lookaheads (see below).

defining local variables inside rules

You can set up the grammar to define local variables whenever a rule gets applied:

   ...
-> class_member [:
   enum { A_PUBLIC, A_PROTECTED, A_PRIVATE } access;
:];;

This variable is local to the rule class_member.

defining additional variables for the parse tree

Similar to the syntax above, you can define members whenever a rule gets applied:

   ...
-> class_member [
   [member|temporary] variable yourName: yourType;
];;

For example:

   ...
-> class_member [
      member variable access : AccessType;
];;

Of course AccessType has to be defined somewhere else, see e.g. the Additional parser class members section above.

Using temporary or member is equivalent.

Conflicts

First attempts at generating a parser like this will probably fail. A conflict is reported and parsing does not work properly. Here is an example of a so-called FIRST/FIRST conflict:

   CLASS IDENTIFIER SEMICOLON
-> class_declaration ;;

   CLASS IDENTIFIER LBRACE class_content RBRACE SEMICOLON
-> class_definition ;;

   class_declaration
 | class_definition
-> class_expression ;;

Output:

** WARNING found FIRST/FIRST conflict in  "class_exp"

Sometimes you can live with warnings, which is why code is generated anyway. A class_expression, however, will not be evaluated correctly. If the code contains e.g. a class_definition, the parser will first jump into the function for class_declaration, because it has identified the leading token CLASS. The semicolon will not be found, however, and an error is reported.

Backtracking

Those who know such grammars only from theory may be surprised by this behaviour, since in BNF, for example, such a definition would be perfectly fine. The parser could, for instance, jump back and try the next alternative. This is not always necessary, however, and for the sake of efficiency it is avoided where possible. There is nevertheless a way to work around all of these conflicts, because KDevelop-PG-Qt supports this kind of backtracking:

   try/rollback(class_declaration)
     catch(class_definition)
-> class_expression ;;

All conflicts can be resolved this way, but the approach comes at the cost of efficiency. You should also pay attention to the order of the alternatives (for better efficiency and for a correctly built parse tree).
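The idea behind backtracking can be sketched with a small hand-written recursive-descent parser: remember the position in the token stream, try the first alternative, and on failure rewind and try the second. This is not the code KDevelop-PG-Qt generates; MiniParser, the token kinds and the helper accepts are assumptions made for this example, and class_content is left empty for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum TokenKind { CLASS, IDENTIFIER, LBRACE, RBRACE, SEMICOLON };

// Hypothetical hand-written parser illustrating the rollback idea.
class MiniParser {
public:
    explicit MiniParser(const std::vector<TokenKind>& t) : m_tokens(t), m_pos(0) {}

    // CLASS IDENTIFIER SEMICOLON -> class_declaration
    bool parseClassDeclaration() {
        return expect(CLASS) && expect(IDENTIFIER) && expect(SEMICOLON);
    }

    // CLASS IDENTIFIER LBRACE RBRACE SEMICOLON -> class_definition
    bool parseClassDefinition() {
        return expect(CLASS) && expect(IDENTIFIER) && expect(LBRACE)
            && expect(RBRACE) && expect(SEMICOLON);
    }

    // try/rollback(class_declaration) catch(class_definition) -> class_expression
    bool parseClassExpression() {
        const std::size_t saved = m_pos;   // remember where we started
        if (parseClassDeclaration()) return true;
        m_pos = saved;                     // roll back the consumed tokens
        return parseClassDefinition();     // then try the alternative
    }

    std::size_t position() const { return m_pos; }

private:
    // Consumes the current token if it has the expected kind.
    bool expect(TokenKind k) {
        assert(m_pos <= m_tokens.size());
        if (m_pos < m_tokens.size() && m_tokens[m_pos] == k) {
            ++m_pos;
            return true;
        }
        return false;
    }

    std::vector<TokenKind> m_tokens;
    std::size_t m_pos;
};

// Hypothetical helper: true if the whole input is one class_expression.
bool accepts(const std::vector<TokenKind>& tokens) {
    MiniParser p(tokens);
    return p.parseClassExpression() && p.position() == tokens.size();
}
```

For a class definition the tokens CLASS and IDENTIFIER are read twice, once per alternative, which is the efficiency cost mentioned above.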

Look ahead

KDevelop-PG-Qt offers a way to take other positions of the token stream into account without diving into a deep backtracking structure. For this purpose there is a function LA(qint64). LA(1) returns the current token, LA(2) the next one, LA(0) the previous one, and so on. (The odd indexing was adopted from the naming of various parser types.)

   (?[: LA(2) == Token_LBRACE :] class_definition)
 | class_declaration
-> class_expression ;;

A conflict will still be reported, but it has been resolved manually this way. You should mention this manual resolution in a comment, so that the warning is not later mistaken for a source of errors.
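The indexing convention of LA can be illustrated with a small stand-alone sketch. MiniTokenStream and its members are assumptions made for this example and are not the KDevelop-PG-Qt token stream; long long stands in for qint64, and returning Token_EOF for positions outside the stream is a choice of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum TokenKind { Token_EOF, Token_CLASS, Token_IDENTIFIER, Token_LBRACE };

// Hypothetical token stream with a cursor pointing at the current token.
struct MiniTokenStream {
    std::vector<TokenKind> tokens;
    std::size_t index = 0;

    // LA(1) is the current token, LA(2) the next one, LA(0) the previous one.
    // Positions outside the stream yield Token_EOF (an assumption of this sketch).
    TokenKind LA(long long k) const {
        assert(index <= tokens.size());   // cursor is inside the stream or one past the end
        const long long pos = static_cast<long long>(index) + k - 1;
        if (pos < 0 || pos >= static_cast<long long>(tokens.size())) {
            return Token_EOF;
        }
        return tokens[static_cast<std::size_t>(pos)];
    }
};
```

Looking ahead never moves the cursor, so a boolean check based on LA has no side effects on the parse.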

Elegant Solutions

In very many cases there are more elegant solutions. This is also true for our example:

   LBRACE class_content RBRACE
-> class_definition ;;

   CLASS IDENTIFIER (0 | class_definition) SEMICOLON
-> class_expression ;;

Now there are no conflicts any more.

FIRST/FOLLOW-Conflicts

A FIRST/FOLLOW conflict occurs when it is ambiguous where a symbol ends and where the parent symbol continues. A silly example:

   item*
-> item_list ;;

   item_list item*
-> items ;;

The ambiguity is obvious. try/rollback helps with serious problems (the parser does not work), but often these conflicts can simply be ignored. You should be aware, however, that the parser is greedy, i.e. the item_list will take up as much as it can. If this leads to a split that later causes contradictions, the only remedies are try/rollback, checks with hand-written code, or restructuring the grammar.
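The greedy behaviour in the items example can be made visible with a short sketch. The functions below are hand-written for illustration and are not generated code; the token kinds and function names are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

enum TokenKind { ITEM, OTHER };

// Greedy item*: consume items for as long as the next token is an ITEM.
// Returns the number of items consumed and advances pos accordingly.
std::size_t parseItemStar(const std::vector<TokenKind>& tokens, std::size_t& pos) {
    assert(pos <= tokens.size());
    std::size_t consumed = 0;
    while (pos < tokens.size() && tokens[pos] == ITEM) {
        ++pos;
        ++consumed;
    }
    return consumed;
}

// item_list item* -> items
// Returns (items taken by item_list, items taken by the trailing item*).
std::pair<std::size_t, std::size_t> parseItems(const std::vector<TokenKind>& tokens) {
    std::size_t pos = 0;
    const std::size_t inList = parseItemStar(tokens, pos);    // item_list, i.e. item*
    const std::size_t trailing = parseItemStar(tokens, pos);  // nothing is left for it
    return std::make_pair(inList, trailing);
}
```

Whatever the input, the trailing item* comes away empty, because item_list has already taken every item it could see.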

Changing the Greedy Behaviour

Sometimes the greedy behaviour is undesirable. For repetitions, it is possible to break out with hand-written code. This example, the declaration of a fixed-size array, shows how a repetition can be limited:

   typed_identifier=typed_identifier LBRACKET UNSIGNED_INTEGER
   [: count = static_cast<MyTokenStream*>(token)->strForCurrentToken().toUInt(); :]
   RBRACKET EQUAL LBRACE
   (#expression=expression [: if(--count == 0) break; :] @ COMMA) ?[: count == 0 :]
   RBRACE SEMICOLON
-> initialized_fixed_array [ temporary variable count: uint; ];;

With the code return false you can abort the evaluation of the current rule prematurely. With return true you can return from it early.

try/recover

   try/recover(expression)
-> symbol ;;

This is approximately the same as:

   [: ParserState *state = copyCurrentState(); :]
   try/rollback(expression)
   catch( [: restoreState(state); :] )
-> symbol ;;

Hence you have to implement the member functions copyCurrentState and restoreState, and you have to define a type called ParserState. You do not have to write the declarations of those functions in the header file; they are generated automatically if you use try/recover. This concept seems to be useful if additional state is used while parsing. The Java parser makes use of it very often. But I do not know a lot about this feature and it seems unimportant to me. (I guess it is not.) I would be happy if somebody could explain it to me.

Weblinks

[1] - The KDevelop-PG-Qt grammar (it is a Bison/Yacc grammar file)
[2] - WebSVN
[3] - Jakob Petsovits' bachelor thesis about using KDevelop-PG for Java parsers - It is a good in-depth introduction to everything you might want to know for writing your own grammar. Keep in mind that it is partly outdated. When in doubt, refer to this page for updated syntax. Also, some of the shortcomings of KDevelop-PG laid out in the thesis have been fixed in the meantime.