This article is about working on KDevelop-PG-Qt itself, for an introduction to KDevelop-PG-Qt see Development/KDevelop-PG-Qt Introduction, but it might also be useful for advanced KDevelop-PG-Qt users.
Ideas behind KDevelop-PG-Qt
A Recursive Descent Parser
The idea behind the generated parser is call "Recursive Descent". There is exactly one function for every symbol. The parser will try to replace the start-symbol with other symbols, the process ends with a terminal used in a rule. It is some kind of depth-first-search, for every sub-symbol the parse will invoke the corresponding function. More precisely, KDevelop-PG-Qt generates a LL(1) parser, that means it looks at exactly 1 token to decide, which sub-symbols to choose.
Inject your own hand-written code
A strength of KDevelop-PG-Qt is the possibility to "inject" hand-written code nearly everywhere. That is why developers of KDevelop-PG-Qt should take care about the following aspects:
- When implementing new constructs in the grammar try to make it possible to use custom code where it could make sense.
- Allow the user to understand the generated code and to forecast the output: When you ignore this advise it will not be possible for the user to inject his code quickly.
- Do not manipulate the rules, this would be against the last point. E.g. it would not be good to shift symbols and tokens to resolve conflicts automatically without asking the user. For the user it would not be obvious any longer where his code will be inserted.
- Do not change the names of functions and variables in the generated classes.
- Even AST's etc. sometimes need custom code.
There is also another aspect: Grammar-constructs are better than C++-constructs, feel free to replace common C++-code with KDevelop-PG-Qt-constructs!
KDevelop-PG-Qt's generated sources follow the visitor pattern. That is the focus of it: It should be possible to visit the parse-tree more than once and it should be simple to implement visitors.
Recursive Descent Parsers have a serious problem: Imagine operators in C++: They have as many as 18 different priorities (from "," to "::"). A RDP would theoretically (there are certainly some tricks to reduce it for C++) need 18 different symbols (from "CommaExpression" to "Name") and every expression would be parsed by invoking 18 functions down to the terminal, even if the expression would be a plain numeric-literal. But this is maybe the smaller problem: The simple number will also need 18 AST-Nodes. Each list node would need a type-identifier, the start-position, the end-position, a pointer to a list of the one and only child and the ListNode, on a 64-bit system you would need 48 bytes for each node. The PHP plugin also introduces a DUContext-pointer, this means 56 bytes per node, that is a lot ;).
But there is an alternative: Walk through the tokens and construct the tree beginning on the bottom by adding each token to the tree in the right way. The tree will grow but some leaps will stay, that is why it is called "Bottom-Up Parsing". It works well, but only very few people have tested it. Months after having written it I noticed that the algorithm used is also called “Shunting-yard-algorithm“.
That does not mean that parsers should use signals and slots and templates are forbidden, but KDevelop-PG-Qt was designed to be easily integratable into Qt. So it should use Qt's classes like QString and QList.
Structure of the code
- Item: The base-class
- Zero: 0/ε, only one central object globalSystem.zero()
- Plus: Repetition (at least once)
- Star: Kleene-Star
- Symbol: Central non-terminals (one symbol for one name)
- Action: Hand-written code
- Cons: Concatenation
- Evolve: A rule (expression -> symbol)
- TryCatch: try/rollback or try/recover statement
- Alias: Not implemented
- Terminal: A terminal?token
- NonTerminal: Reference to a symbol, used in expressions (may contain extra-parameters)
- Annotation: Assignment
- VariableDeclaration (in rules)
- Operator: Definition of operator-expressions
- InlinedNonTerminal: Like non-terminal, but inlined using ‘.=‘
Use create<Model::XYZItem>() to allocate an AST. The BNFVisitor uses some custom stuff for better performance. The same memory-pools like in the generated parsers are used.
There is a special function-template nodeCast, it behaves like dynamic_cast and qobject_cast, but it should only be used for pointers to ASTs.
In kdevelop-pg-qt/tests you can find tests and benchmarks for the classes located in kdevelop-pg-qt/include, these are the onliest test-cases being properly integrated with CMake/CTest, you can simply use `make test`. In kdevelop-pg-qt/examples there are three basic examples for parsing (more or less) real programming languages. kdevelop-pg-qt/kdev-pg/tests contains test-cases for errors and warnings, when changing something being somehow related to the different kinds of problems you should use test.sh to ensure that everything is correct. Additionally there are two examples, Op and Op2, for parsing operator-expressions. In Op a parse-tree for more or less classical expressions will be built. Op2 tries to be very complex and cover a lot of features of KDevelop-PG-Qt, in the inputs for Op2 you can configure the syntax for operator-expressions immediately before testing an expression.
We have a 0.9 and a 0.9.5 release (0.9.5 was just to fix a crash). There may be 1.0 soon. Ask apaku about git- or release-stuff, ask The_User, if you want to contribute.
Ideas for the future
- Complete the alias-support (Difficulty: ?, probably medium, priority: low, don't know what it should be used for)
- Make it more OOPish, remove global variables (Difficulty: high, priority: ask milian)
- Abstraction: Input of rules -> Algorithms for Parser-Generation -> Code-Generation in a target-language (Difficulty: high, priority: low, but it would be awesome, such refactoring could also improve performance, because data would have to be stored properly)
- Use QList etc. instead of ListNodes, but take care about compatibility, it should be optional (Difficulty: low, priority: nice to have, it is for optimization)
- Use object-oriented Bison (Difficulty: low, priority: low)
- Bootstrapping (Difficulty: high, priority: low/ask milian)
- tree2tree translators, e.g. make it easy to create DUChain-stuff (Difficulty: high, priority: low)
- Support other languages than C++, see AntLR (Difficulty: high, priority: low, would need more OOP)
- Resolve conflicts automatically (but ask the user!), maybe give hints how to resolve them (Difficulty: medium/high, priority: low)
This would probably need macros mentioned below.
- Use unions, be careful, should be optional (Difficulty: medium, priority: medium, it is for optimization and you could save a lot of memory, there is already an algorithm to compute the perfect layout in git, see https://barney.cs.uni-potsdam.de/mailman/private/kdevelop-devel/2010-April/037096.html for some details for implementation)
- A KDevelop-plugin (Difficulty: medium, priority: nice to have, maybe it would be much easier with more OOP)
- Implement --symbol-text (like --token-text) (Difficulty: low, priority: would be good for debugging)
- Rule-inheritance/abstract rules (Difficulty: low-medium, priority: low) (kinda implemented by inlining)
- Default-implementation for expectedToken and expectedSymbol (would require --symbol-text, difficulty: low, priority: medium)
- LL(k)-parsing to fix conflicts (Difficulty: medium-high, it is easier because of the new BnfVisitor, priority: low)
How to do it:
For each conflict try to resolve it by increasing k and computing the new FIRST- and FOLLOW-sets until a limit is reached. Maybe there could be also a detection for non-LL(k)-grammars (a conflict can be resolved if there is a common possibly infinite beginning)
- PEG: “PEG” is in fact just a buzzword, you just have to implement infinite look-ahead conditions (& and ! operator) (Difficulty: low, priority: low)
- LR(0)-, SLR-, LALR(1), LR(1), GLR, Earley, CYK-parsing (Difficulty: ultra-high, priority: low)
- AST forward-declarations (Difficulty: low, priority: nice to have for optimizations)
- Check if files really changed (Difficulty: low, priority: nice to have for optimizations)
- Macros like BREAK (to break a @, * or + loop), FAIL(message) (to indicate a failure in the rule) and DONE (to leave the rule) (Difficulty: low, priority: low, but good for code-stability)
Maybe it should not be done with macros because a lot of #define, #undef etc. would look messy. Do not forget the macro for the current node! Maybe $-signs could be used, if replacement should be done at generation-time.
- Port KDev-Cmg (Difficulty: high, priority: low)
- Good serialization-visitor (Difficulty: low-medium, priority: low, could be useful for optimizations in KDev-plugins)
- Mark conflicts as fixed (Difficulty: medium/low, priority: high)
- Line-End etc. (lexer)
- On-context-switch etc. (lexer)
- Deactivate/activate lexer-rules (have to keep the bitsets of possible final states)
You like one of these ideas? Contact us at #email@example.com or firstname.lastname@example.org! You have another idea? Add it to the list! You think a feature is more important than "low"? Contact us!
- Write unit-tests (Difficulty: low, priority: medium, in progress, some fields are covered well)
- Lookahead operator (lexer) (it is partially implemented)
- 1.0 release (Difficulty: :D, priority: medium, in progress, now we have had 0.9 and 0.9.5)
- Correct line numbers :D
- Complete Shunting-yard-Bottom-Up Parsing (Difficulty: high, priority: high)
- Add a short-form for (expression|0) (Difficulty: low, priority: nice to have)
- Line-numbers in generated source-code to make compiler-messages usefull
- Use a switch-statement in Visitor::visitNode (Difficulty: low)
- Inlining of non-terminals (Difficulty: low, priority: low), there is a suggestion to implement it with rule-inheritance and a proxy keyword, I think there should be simply an option for the = operator to inline the non-terminal
- Implement a tokenizer (Difficulty: medium/high, priority: medium)
It should make tokenizing easier, e.g. UTF-8 support with Unicode-character-classes etc. Could be a bit tricky to keep conditions small when constructing the DFA, e.g. expressions like "unicode-letter, without chinese, without b, x and z" would be needed as conditions in the transition-graph (we do not want to care about all chinese letters in the transition-graph!), but short stuff should of course be expanded, e.g. "unicode-letter without (unicode-letters without b, x and z)" to "b, x, z". The current prototype contains different implementations for handling sets of characters: traditional, table based, and range based.
- Use forward-declarations in parser.h (Difficulty: low, priority: low)
- Integrate existing lexers, quex seems to provide everything you need, flex is simply standard. (Difficulty: definitely lower than implementing it yourself, priority: medium) Unfortunately quex has a strange license (LGPL but no military usage) and I do not like the syntax. Most other lexer-generators are just shit compared to quex.