Development/KDevelop-PG-Qt Introduction
== Preface ==
'''KDevelop-PG-Qt''' is the parser generator from ''KDevplatform''. It is used by some ''KDevelop language support plugins'' (Ruby, PHP, CSS...).
It uses Qt classes internally. There is also the original '''KDevelop-PG''' parser generator, which used types from the STL, but it has since been superseded by '''KDevelop-PG-Qt'''. Most of the features are the same, though the ...-Qt parser generator may be more up to date and feature-rich than the plain STL-style generator. The ...-Qt version should be used to write parsers for KDevelop language plugins.
== In-Depth information ==
This document is not supposed to be a full-fledged and in-depth resource for all parts of '''KDevelop-PG'''. Instead it is intended to be a short introduction and, more importantly, a reference for developers.
To get an in-depth introduction, read Jakob Petsovits' excellent Bachelor thesis. You can find it in the Weblinks section at the bottom of this page. However, it does not cover recently added features.
== The Application ==
=== Usage ===
You can find '''KDevelop-PG-Qt''' in [https://projects.kde.org/projects/extragear/kdevelop/utilities/kdevelop-pg-qt git]. Four example packages are also included in the sources.<br /> To download it, try: {{Input|1=git clone git://anongit.kde.org/kdevelop-pg-qt.git}} or: {{Input|1=git clone kde:kdevelop-pg-qt}} (if you have set up '''git''' with the kde: prefix)
The program itself expects a .g file, a so-called grammar file, as input: {{Input|1=./kdev-pg-qt --output=''prefix'' syntax.g}}
The value of the ''--output'' switch determines the prefix of the output files and also the namespace for the generated code. '''Kate''' provides elementary highlighting for '''KDevelop-PG-Qt''' grammar files.
=== Output Format ===
While evaluating the grammar and generating its parser files, the application will print information about so-called ''conflicts'' to STDOUT. As noted above, the following files will be prefixed with the value of ''--output''.
==== ast.h ====
AST stands for [http://en.wikipedia.org/wiki/Abstract_syntax_tree Abstract Syntax Tree]. It defines the data structure in which the parse tree is saved. Each node is a struct with the postfix ''Ast'' that contains members pointing to all possible sub-elements.
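To give an idea of the generated code, here is a rough hand-written sketch of the node a rule like ''declaration'' (see the grammar examples below) could produce. The names and members are illustrative; the real generated header additionally contains details such as a node-kind enum and the generator's own list type:
{{Input|1=<nowiki>
#include <QtGlobal> // for qint64

// Illustrative sketch only -- the real ast.h is generated by kdev-pg-qt
// and differs in detail.
struct AstNode
{
    qint64 startToken; // index of the first token covered by this node
    qint64 endToken;   // index of the last token covered by this node
};

struct ClassDeclarationAst;
struct FunctionDeclarationAst;

// Node for a rule "... -> declaration ;;" with several alternatives:
// after a successful parse exactly one member points to the matched
// sub-element, the others are null.
struct DeclarationAst : AstNode
{
    ClassDeclarationAst    *classDeclaration;
    FunctionDeclarationAst *functionDeclaration;
    // ... one pointer per further alternative ...
};
</nowiki>}}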
==== parser.h and parser.cpp ====
One important part of ''parser.h'' is the definition of the parser tokens, the ''TokenType'' enum. The TokenStream of your lexer should use these values. You have to write your own lexer or have one generated by '''Flex'''. See also the section about Tokenizers/Lexers below.
Having the token stream available, you create your root item and call the parse method for the top-level AST item, e.g. DocumentAst* root = 0; parseDocument(&root). On success, root will contain the AST.<br />
The parser has one parse method for each possible node of the AST. This is convenient e.g. for an expression parser, or for parsers that should only parse a sub-element of a full document.
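As a minimal sketch of how these pieces fit together (a hedged example: it assumes the grammar was generated with --output=mylang and has a top-level rule ''document''; the exact type of the token stream depends on your grammar, and filling it is your lexer's job, see below):
{{Input|1=<nowiki>
#include "mylangparser.h" // generated parser.h, prefixed with "mylang"
#include "mylangast.h"    // generated ast.h

#include <kdev-pg-memory-pool.h> // pool the parser allocates AST nodes from

// The pool must outlive the returned AST, so it is passed in from outside.
mylang::DocumentAst* parse(mylang::TokenStream *tokenStream, KDevPG::MemoryPool *pool)
{
    mylang::Parser parser;
    parser.setTokenStream(tokenStream); // tokens produced by your lexer
    parser.setMemoryPool(pool);

    mylang::DocumentAst *root = 0;
    if (!parser.parseDocument(&root)) // one parse method per AST node type
        return 0; // syntax error in the input

    return root; // on success, root contains the AST
}
</nowiki>}}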
==== visitor.h and visitor.cpp ====
The Visitor class provides an abstract interface to walk the AST. Most of the time you don't need to use this directly; the DefaultVisitor takes some work off your shoulders.
==== defaultvisitor.h and defaultvisitor.cpp ====
The DefaultVisitor is an implementation of the abstract Visitor interface that automatically visits each node in the AST. Hence, this is probably the best candidate for a base class for your own visitors. Most language plugins use these in their Builder classes to create the DUChain.<br />
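For illustration, here is a hedged sketch of such a visitor, again assuming the prefix mylang and a ''class_declaration'' rule as in the grammar examples below (the generated methods follow the visitXyz naming pattern):
{{Input|1=<nowiki>
#include "mylangdefaultvisitor.h" // generated defaultvisitor.h, prefixed

// Reacts to class declarations; DefaultVisitor walks everything else
// automatically.
class ClassVisitor : public mylang::DefaultVisitor
{
public:
    virtual void visitClassDeclaration(mylang::ClassDeclarationAst *node)
    {
        // ... evaluate the node, e.g. build DUChain declarations ...

        // keep the default behaviour so child nodes are visited, too
        DefaultVisitor::visitClassDeclaration(node);
    }
};

// Usage: ClassVisitor visitor; visitor.visitNode(root);
</nowiki>}}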
=== Command-Line-Options ===
* --namespace=''namespace'' - sets the C++ namespace for the generated sources independently of the file prefix. When this option is set, you can also use / in the --output option
* --no-ast - don't create the ast.h file, more on that below
* --debug-visitor - generates a debug visitor that prints the AST
* --serialize-visitor - generates code for serialization via a QIODevice
* --terminals - all tokens will be written into the file ''kdev-pg-terminals''
* --symbols - all possible nodes from the AST (not the leafs) will be written into the file ''kdev-pg-symbols''
* --rules - all grammar rules with information about their syntactic correlations will be written into a file called ''kdev-pg-rules'', useful for debugging and solving conflicts
* --token-text - generates a function to map token numbers onto token names
* --help - print usage information
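For example, a typical invocation combining some of these switches could look like this (prefix and grammar file name are of course hypothetical): {{Input|1=./kdev-pg-qt --output=mylang --debug-visitor --token-text mylang.g}}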
== Tokenizers/Lexers ==
As mentioned, '''KDevelop-PG-Qt''' requires a tokenizer. You can either let '''KDevelop-PG-Qt''' generate one for you, write one by hand, as has been done for C++ and PHP, or use external tools like '''Flex'''.
The tokenizer's job, in principle, boils down to:
* converting keywords and chars with special meanings to tokens
* converting literals and identifiers to tokens
* cleaning out anything that doesn't change the semantics, e.g. comments or whitespace (the latter of course not in Python)
* while doing the above, handling character encoding (we recommend using UTF-8 as much as possible)
The rest, e.g. actually building the tree and evaluating the semantics, is part of the parser and the AST visitors.<br>
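To make those responsibilities concrete, here is a small self-contained sketch of a hand-written tokenizer. It uses its own token type purely for illustration; a real lexer would emit the ''TokenType'' values from the generated ''parser.h'' and feed them into the parser's token stream:
{{Input|1=<nowiki>
#include <QByteArray>
#include <QVector>
#include <cctype>

// Illustrative token type; real lexers use the generated Parser::TokenType.
enum TokenKind { TK_CLASS, TK_IDENTIFIER, TK_LBRACE, TK_RBRACE, TK_EOF };

struct Token { TokenKind kind; int begin; int end; };

QVector<Token> tokenize(const QByteArray &input)
{
    QVector<Token> tokens;
    int i = 0;
    while (i < input.size()) {
        const unsigned char c = input.at(i);
        if (std::isspace(c)) { ++i; continue; } // whitespace carries no semantics here
        if (c == '/' && i + 1 < input.size() && input.at(i + 1) == '/') {
            while (i < input.size() && input.at(i) != '\n') ++i; // drop line comments
            continue;
        }
        if (c == '{') { tokens.append(Token{TK_LBRACE, i, i + 1}); ++i; continue; }
        if (c == '}') { tokens.append(Token{TK_RBRACE, i, i + 1}); ++i; continue; }
        // keywords and identifiers
        const int begin = i;
        while (i < input.size() && (std::isalnum(static_cast<unsigned char>(input.at(i))) || input.at(i) == '_'))
            ++i;
        if (i == begin) { ++i; continue; } // unknown character: skip (or report an error)
        const QByteArray word = input.mid(begin, i - begin);
        tokens.append(Token{word == "class" ? TK_CLASS : TK_IDENTIFIER, begin, i});
    }
    tokens.append(Token{TK_EOF, i, i});
    return tokens;
}
</nowiki>}}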
=== Using KDevelop-PG-Qt ===
'''KDevelop-PG-Qt''' can generate lexers that are well integrated into its architecture (you do not have to create a token-stream class invoking lex or anything like that). See examples/foolisp in the code for a simplistic example; there is also an incomplete PHP lexer for demonstration purposes.
==== Regular Expressions ====
Regular expressions are used to write lexer rules in KDevelop-PG-Qt. We use the following syntax (α and β are arbitrary regular expressions, a and b characters):
* α|β accepts any word accepted by α or accepted by β
* α&β accepts any word accepted by both α and β
* α^β accepts any word accepted by α but not by β
* ~α accepts any word not accepted by α
* ?α like α, but also accepts the empty word
* α* accepts any (possibly empty) sequence of words accepted by α
* α+ accepts any nonempty sequence of words accepted by α (equivalent to αα*)
* α@β accepts any nonempty sequence of words accepted by α separated by words accepted by β (equivalent to α(βα)*)
* αβ accepts words consisting of a word accepted by α followed by a word accepted by β
* […] switches to the “disjunctive” environment, where αβ gets interpreted as α|β; you can use (…) inside the brackets to go back to normal mode
* . accepts any single character
* a-b accepts a single character between a and b (including a and b) in the Unicode order (of course only characters that can be represented in the used encoding)
* "…" accepts the word enclosed by the quotation marks; escape sequences will still get interpreted
* a accepts the word consisting of the single character a
* any escape sequence (see below) accepts the word consisting of the character represented by the escape sequence
* {⟨name⟩} accepts any word accepted by the regex named ⟨name⟩
All regular expressions are case sensitive. There is currently no way to make matching case-insensitive.
==== Known Escape Sequences ====
There are several escape sequences which can be used to encode special characters:
* \n, \t, \f, \v, \r, \0, \b, \a like in C
* \x, \X, \u or \U followed by hex digits: the character represented by this Unicode value (in hex)
* \d, \D followed by decimal digits: the same, but in decimal representation
* \o, \O followed by octal digits: the same, but in octal representation
* \y, \Y followed by binary digits: the same, but in binary representation
==== Predefined named regexes ====
Some regexes are predefined and can be referenced with braces: {⟨name⟩}. They get imported from the official Unicode data. Some important ones:
* {alphabetic} any alphabetic character
* {num} any numeric character
* {ascii-range} any character representable in ASCII
* {latin1-range} any character representable in Latin-1 (8 bit)
* {uppercase}
* {lowercase}
* {math}
==== Rules ====
Rules can be written as: {{Input|1= ⟨regular expression⟩ TOKEN; }}
The lexer will then generate the token TOKEN for lexemes matching the given regular expression. Which token will be chosen if there are multiple options? We use the ''first longest match'' rule: the lexer takes the longest possible match (eating as many characters as possible); if there are multiple such matches, it takes the one from the rule listed first. For example, if there are rules for both < and <=, the input <= produces the single token for <=, not < followed by =.
Rules can perform code actions and you can also omit tokens (then no token will be generated): {{Input|1=
⟨regular expression⟩ [: ⟨code⟩ :] TOKEN;
⟨regular expression⟩ [: ⟨code⟩ :];
⟨regular expression⟩ ;
}}
There is rudimentary support for ''lookahead'' and so-called (our invention) ''barriers'': {{Input|1=
⟨regular expression⟩ %la(⟨regular expression⟩);
⟨regular expression⟩ %ba(⟨regular expression⟩);
}} The first rule will only accept words that match the first regular expression and are followed by something matching the expression specified with %la. The second rule will accept words matched by the first regular expression, but will never run into a character sequence matching the regex specified by %ba. However, currently only regexes of fixed length are allowed in %la and %ba (for example foo|bar, but not qux|garply). When applying the “first longest match” rule, the %la/%ba expressions count, too.
You can create your own named regexes using an arrow: {{Input|1= ⟨regular expression⟩ -> ⟨identifier⟩; }} The first character of the identifier should not be upper case.
Additionally there are two special actions: {{Input|1= ⟨regular expression⟩ %fail; ⟨regular expression⟩ %continue; }}
%fail will stop tokenization. %continue will make the matched characters part of the next token.
==== Rulesets ====
A grammar file can contain multiple ''rulesets''. A ruleset is a set of rules, as described in the previous section. It gets declared using: {{Input|1= %lexer "name" -> ⟨rules⟩ ; }}
For your main ruleset you omit the name (the name will be “start”).
Usually the start-ruleset will be used. But you can change the ruleset in code actions using the macro lxSET_RULE_SET(⟨name⟩). You can specify code to be executed when entering or leaving a ruleset by using %enter [: ⟨code⟩ :]; or %leave [: ⟨code⟩ :]; respectively inside the definition of the ruleset.
==== Further Configuration and Output ====
The standard statements %lexer_declaration_header and %lexer_bits_header are available to include files in the generated lexer.h/lexer.cpp.
By using %lexer_base you can specify the base class for the lexer class; by default it is the TokenStream class defined by KDevelop-PG-Qt.
After %lexerclass(bits) you can specify code to be inserted in lexer.cpp.
You have to specify the encoding the lexer should work with internally using %input_encoding "⟨encoding⟩". Possible values:
* ASCII (7 bit)
* Latin-1 (8 bit)
* UTF-8 (8 bit, full Unicode)
* UCS-2 (16 bit, UCS-2 part of Unicode)
* UTF-16 (16 bit, full Unicode)
* UTF-32 (32 bit, full Unicode)
With %input_stream you can specify which class the lexer should use to get the characters to process. There are some predefined classes:
* QStringIterator, reads from QString, required (internal) encoding: UTF-16 or UCS-2
* QByteArrayIterator, reads from QByteArray, required encoding: ASCII, Latin-1 or UTF-8
* QUtf16ToUcs4Iterator, reads from UTF-16 QString, required encoding: UTF-32 (UCS-4)
* QUtf8ToUcs4Iterator, reads from UTF-8 QByteArray, required encoding: UTF-32 (UCS-4)
* QUtf8ToUcs2Iterator, reads from UTF-8 QByteArray, required encoding: UCS-2
* QUtf8ToUtf16Iterator, reads from UTF-8 QByteArray, required encoding: UTF-16
* QUtf8ToAsciiIterator, reads from UTF-8 QByteArray, will ignore all non-ASCII characters, required encoding: ASCII
Whether you choose UTF-8, UTF-16 or UTF-32 is irrelevant for functionality, but it may significantly affect compile-time and run-time performance (you may want to test your lexer with ASCII if compilation takes too long). Suppose, for example, you want to work with a QByteArray containing UTF-8 data and get full Unicode support: you could either use the QByteArrayIterator with UTF-8 as internal encoding, the QUtf8ToUtf16Iterator with UTF-16, or the QUtf8ToUcs4Iterator with UTF-32.
You can also choose between %table_lexer; and %sequence_lexer;. In the first case transitions between states of the lexer get represented by big tables (case statements in the generated code). In the second case character sequences get stored in a compressed data structure and transitions get represented by nested if-statements. For UTF-32 %table_lexer is infeasible, so there %sequence_lexer is the only option.
Inside your lexer actions you can use some predefined macros: {{Input|1=
lxCURR_POS   // position in the input (some kind of iterator or pointer)
lxCURR_IDX   // index of the position in the input
             // (it is the index as presented in the input, for example: input is a QByteArray,
             // index incrementation per byte, but the lexer may operate on 32-bit codepoints)
lxCONTINUE   // like %continue, add the current lexeme to the next token
lxLENGTH     // length of the current lexeme (as presented in the input)
lxBEGIN_POS  // position of the first character of the current lexeme
lxBEGIN_IDX  // corresponding index
lxNAMED_TOKEN(⟨name⟩, ⟨type⟩) // create a variable named ⟨name⟩ representing a token of type ⟨type⟩
lxTOKEN(⟨type⟩) // create such a variable named “token”
lxDONE       // return the token generated before
lxRETURN(X)  // create a token of type X and return it
lxEOF        // create the EOF-token
lxFINISH     // create the EOF-token and return it (will stop tokenization)
lxFAIL       // raise the tokenization error
lxSKIP       // continue with the next lexeme (do not return a token, you should not have created one before)
lxNEXT_CHR(⟨chr⟩) // set the variable ⟨chr⟩ to the next char in the input
yytoken      // current token
}}
=== Using Flex ===
With the existing examples, it shouldn't be too hard to write such a lexer. Most languages, especially those ''"inheriting"'' C, share many common syntactic elements. Comments and literals in particular can be handled the same way over and over again. Adding a simple token is trivial:
{{Input|1="special-command" return Parser::Token_SPECIAL_COMMAND; }}
That's pretty much it; take a look at e.g. ''java.ll'' for an excellent example. However, handling UTF-8 with Flex is quite tricky and ugly.
== How to write Grammar-Files ==
=== Context-Free Grammars ===
'''KDevelop-PG-Qt''' uses so-called [http://en.wikipedia.org/wiki/Context-free_grammars context-free grammars] with a concept of non-terminals (nodes) and terminals (tokens). While writing the grammar for the basic structure of your language, you should try to mimic the semantics of the language. Let's take a look at an example:
A C++ document consists of lots of declarations and definitions; a class definition, for example, could be handled in the following way:
# the ''CLASS-token''
# an ''identifier''
# the ''{-token''
# a ''member-declarations-list''
# the ''}-token''
# and finally the '';-token''
The ''member-declarations-list'' is of course not part of any C++ specification; it is just a ''helper'' to describe the structure of a given semantic part of your language. The grammar then defines what exactly such a helper looks like.
=== Basic Syntax ===
Now let us have a look at a basic example, a declaration in C++, as described in grammar syntax:
{{Input|1=
   class_declaration
 | struct_declaration
 | function_declaration
 | union_declaration
 | namespace_declaration
 | typedef_declaration
 | extern_declaration
-> declaration ;;
}}
This is called a ''rule'' definition. Every lower-case string in the grammar file references such a rule. Our case above defines what a ''declaration'' looks like. The ''|''-char stands for a logical ''or'', and all rules have to end with two semicolons.
In the example we reference other rules which also have to be defined. Here's for example the ''class_declaration'', note the tokens in all-upper-case:
{{Input|1=
   CLASS IDENTIFIER SEMICOLON
 | CLASS IDENTIFIER LBRACE class_declaration* RBRACE SEMICOLON
-> class_declaration ;;
}}
There is a new char in there: the asterisk has the same meaning as in regular expressions, i.e. the preceding element can occur arbitrarily often or not at all.
In a grammar <code>0</code> stands for an empty token. Combined with parentheses and the logical ''or'' from above, you can express optional elements:
{{Input|1= some_required_rule SOME_TOKEN <nowiki>( some_optional_stuff | some_other_stuff | 0 )</nowiki> -> my_rule ;; }}
All symbols never occurring on the left side of a rule are start symbols. You can use any of them to start parsing.
=== Making matched rules available to Visitors ===
The simple rule above could be used to parse the token stream, yet no elements would be saved in the parse tree. This can easily be achieved though:
{{Input|1=
   class_declaration=class_declaration
 | struct_declaration=struct_declaration
 | function_declaration=function_declaration
 | union_declaration=union_declaration
 | namespace_declaration=namespace_declaration
 | typedef_declaration=typedef_declaration
 | extern_declaration=extern_declaration
-> declaration ;;
}}
The DeclarationAst struct now contains pointers to each of these elements. During the parse process the pointer for each found element gets set; all others remain NULL. To store lists of elements, prepend the identifier with a hash (''#''):
{{Input|1=
   CLASS IDENTIFIER SEMICOLON
 | CLASS IDENTIFIER LBRACE (#class_declaration=class_declaration)* RBRACE SEMICOLON
-> class_declaration ;;
}}
'''TODO: internal structure of the list, important for Visitors'''
Identifiers and targets can be used in more than one place:
{{Input|1= #one=one (#one=one)* -> one_or_more ;; }}
In the example above, all matches of the rule ''one'' will be stored in one and the same list ''one''.
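As a hedged sketch of how such a list is typically consumed, here is the iteration idiom as it appears in generated default visitors (the ''Sequence'' member name follows the usual naming of generated #-lists; assume this method lives in a visitor subclass):
{{Input|1=<nowiki>
// "oneSequence" is assumed to be the member generated for "#one=one".
// The generated list is circular: iteration stops once it reaches the
// front element again.
void visitOneOrMore(OneOrMoreAst *node)
{
    if (!node->oneSequence)
        return; // the list pointer is null if nothing was matched

    const KDevPG::ListNode<OneAst*> *it = node->oneSequence->front();
    const KDevPG::ListNode<OneAst*> *end = it;
    do {
        visitNode(it->element); // handle each matched "one"
        it = it->next;
    } while (it != end);
}
</nowiki>}}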
=== Defining available Tokens ===
Somewhere in the grammar (you should probably put it near the head) you'll have to define the list of available tokens. From this list, the ''TokenType'' enum in ''parser.h'' will be created. In addition to the enum value names you should define a descriptive name, which will e.g. be used in error messages. Note that the representation of a token inside the source code is not required for the grammar/parser, as it operates on a TokenStream; see the Lexer/Tokenizer section above.
{{Input|1= %token T1 ("T1-Name"), T2 ("T2-Name"), COMMA (","), SEMICOLON (";") ;; }}
It is possible to use ''%token'' multiple times to group tokens in the grammar. However, all tokens will still be put into the same ''TokenType'' enum.
'''TODO: explain process of using the parser Tokens'''
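For orientation, a rough sketch of what the enum generated from the ''%token'' block above could look like (illustrative only; the exact content and layout of the generated ''parser.h'' may differ):
{{Input|1=<nowiki>
// Sketch of the token enum generated from the %token declaration above;
// the enum value names are taken directly from the grammar.
class Parser
{
public:
    enum TokenType {
        Token_COMMA,
        Token_SEMICOLON,
        Token_T1,
        Token_T2,
        Token_EOF // the generator adds bookkeeping tokens such as end-of-file
        // ...
    };

    // ... one parse method per rule, see above ...
};
</nowiki>}}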
=== Special Syntax... ===
==== ...to use inside Rules ====
===== List of one or more elements =====
As an alternative to the asterisk (''*'') you can use a plus sign (''+'') to mark lists of one or more elements:
{{Input|1= (#one=one)+ -> one_or_more ;; }}
===== Separated lists =====
Using the ''#rule @ TOKEN'' syntax you can mark a list of ''rule'' elements separated by ''TOKEN'':
{{Input|1= #item=item @ COMMA -> comma_separated_list ;; }}
===== Optional items =====
As an alternative to the ''(item=item | 0)'' syntax mentioned above, you can use the ''?'' operator to mark optional items: {{Input|1= some_required_rule ?item=item -> my_rule ;; }}