The BNF converter takes a file in "Bovine Normal Form", which is similar to "Backus-Naur Form". If you have ever used yacc or bison, you will find it familiar. The BNF form used by semantic, however, does not include token precedence rules or several other features found in real parser generators.
It is important to have an Emacs Lisp file with a variable ready to take the output of your table (see section 5, Preparing a bovine table for your language). Also, make sure that the file `semantic-bnf.el' is loaded. Give your language file the extension `.bnf' and you are ready.
The comment character is #.
When you want to test your file, use the keyboard shortcut C-c C-c to parse the file, generate the variable, and load the new definition in. It will then use the settings specified above to determine what to do. Use the shortcut C-c c to do the same thing, but spend extra time indenting the table nicely.
Make sure that you create the variable specified in the %parsetable token before trying to convert the BNF file. A simple definition like this is sufficient:

(defvar semantic-toplevel-lang-bovine-table nil
  "Table for use with semantic for parsing LANG.")
If you use tokens (created with the %token specifier), also make sure you have a keyword table available, like this:

(defvar semantic-lang-keyword-table nil
  "Table for use with semantic for keywords.")
Specify the name of the keyword table with the %keywordtable specifier.
The BNF file has two sections. The first is the settings section, and the second is the language definition, or list of semantic rules.
6.1 Settings                        Setup for a language
6.2 Rules                           Create rules to parse a language
6.3 Optional Lambda Expressions     Actions to take when a rule is matched
6.4 Examples                        Simple Samples
6.5 Semantic Token Style Guide      What the tokens mean, and how to use them.
A setting is a keyword starting with a %. (This syntax is taken from yacc and bison; see the bison manual.)
There are several settings that can be made in the settings section. They are:
%start
Specify the nonterminal used as the starting point when parsing a buffer. The default is bovine-toplevel. (See below)

%scopestart
Specify the nonterminal used when parsing the inside of a scoped block. The default is bovine-inner-scope.

%token
Specify a new token symbol and the text it matches. During lexical analysis that text is converted into one of these language keywords.

%keywordtable
Specifies a lisp variable into which the output of a keyword table is stored. This obarray is used to turn symbols into keywords when applicable.

%put
Takes a keyword previously created with %token, plus a property and a value, to apply properties to that keyword (see 4. Preparing your language for Lexing).

%outputfile
Specifies the Emacs Lisp file into which the generated tables are written. Make sure the %parsetable and %keywordtable specified variables are available in the file specified by %outputfile.

%parsetable
Specifies a lisp variable into which the generated parse table is stored.

%setupfunction
Specifies a function that will receive generated setup code.
It will be inserted between two specifier strings, or added to
the end of the function.
When working inside %( ... )% tokens, any lisp expression can be entered; it will be placed inside the setup function. In general, you will want to set variables that tell Semantic and related tools how the language works.
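For illustration, a minimal settings section for a hypothetical language LANG might look like the following sketch. Every name here (the output file, the table variables, the setup function, the IF keyword) is a placeholder, not part of any shipped grammar:

# Hypothetical settings section for a language called LANG
%start         bovine-toplevel
%outputfile    semantic-lang.el
%parsetable    semantic-toplevel-lang-bovine-table
%keywordtable  semantic-lang-keyword-table
%setupfunction semantic-default-lang-setup

%token IF "if"
%put IF summary "if (condition) { code }"

%(setq semantic-case-fold nil)%

The defvar forms shown earlier in this chapter must define the %parsetable and %keywordtable variables in the file named by %outputfile.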
Here are some variables that control how different programs will work with your language.
semantic-number-expression
Regular expression used to identify numbers during lexical analysis. If nil, no number extraction is done during lex. Symbols which match this expression are returned as number tokens instead of symbol tokens. The default value for this variable should work in most languages.

semantic-flex-extensions
Buffer-local extensions to the lexical analyzer: an alist of regular expressions and functions. Each function should move point and return a lexical token of the form

( TYPE START . END)

nil is also a valid return. TYPE can be any type of symbol, as long as it doesn't occur as a nonterminal in the language definition.

semantic-flex-syntax-modifications
Changes to the syntax table which are active only while the buffer is being flexed. This is a list where each element is of the form

(CHAR CLASS)

semantic-flex-enable-newlines
When non-nil, the lexer reports 'newlines as syntactic elements. Useful for languages where the newline is a special case terminator. Only set this on a per mode basis, not globally.

semantic-ignore-comments
t means to strip comments when flexing. nil means to keep comments as part of the token stream.

semantic-symbol->name-assoc-list
Association list mapping token type symbols to the strings used to describe them, so that a language can display, for example, its include tokens as Imports.

semantic-case-fold
Value used for case-fold-search when parsing, for languages that are not case sensitive.

semantic-expand-nonterminal
Function called with each token produced by the parser. It should return a list of tokens derived from the one passed in, or nil if it does not need to be expanded. Languages with compound definitions should use this function to expand from one compound symbol into several. For example, in C the definition

int a, b;

declares two separate variables that should each be returned as a token.
Within the language definition (the `.bnf' sources), it is often useful to set the NAME slot of a token with a list of items that distinguish each element in the compound definition.
This list can then be detected by the function set in
semantic-expand-nonterminal
to create multiple tokens.
This function has one additional duty of managing the overlays created
by semantic. It is possible to use the single overlay in the compound
token for all your tokens, but this can pose problems identifying
all tokens covering a given definition.
Please see `semantic-java.el' for an example of managing overlays when expanding a token into multiple definitions.
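Purely as a sketch, and assuming the grammar stored a list of names in the NAME slot as suggested above, an expansion function could look like this. The name lang-expand-nonterminal is hypothetical; real implementations such as the one in `semantic-java.el' are more involved and must also manage an overlay for each new token.

(defun lang-expand-nonterminal (token)
  "Expand TOKEN into one token per name when its name slot is a list.
Return nil when TOKEN needs no expansion."
  (let ((names (car token)))          ; NAME is the first slot of a token
    (when (listp names)
      (mapcar (lambda (name)
                ;; Reuse the remaining slots; a real implementation
                ;; should also manage the overlay for each new token.
                (cons name (cdr token)))
              names))))

;; Installed from the setup function:
;; (setq semantic-expand-nonterminal 'lang-expand-nonterminal)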
Available override symbols:
SYMBOL | PARAMETERS | DESCRIPTION |
find-dependency | (token) | Find the dependency file |
find-nonterminal | (token & parent) | Find token in buffer. |
find-documentation | (token & nosnarf) | Find doc comments. |
abbreviate-nonterminal | (token & parent) | Return summary string. |
summarize-nonterminal | (token & parent) | Return summary string. |
prototype-nonterminal | (token) | Return a prototype string. |
concise-prototype-nonterminal | (tok & parent color) | Return a concise prototype string. |
uml-abbreviate-nonterminal | (tok & parent color) | Return a UML standard abbreviation string. |
uml-prototype-nonterminal | (tok & parent color) | Return a UML like prototype string. |
uml-concise-prototype-nonterminal | (tok & parent color) | Return a UML like concise prototype string. |
prototype-file | (buffer) | Return a file in which prototypes are placed |
nonterminal-children | (token) | Return first rate children. These are children which may contain overlays. |
nonterminal-external-member-parent | (token) | Parent of TOKEN |
nonterminal-external-member-p | (parent token) | Non-nil if TOKEN has PARENT, but is not in PARENT. |
nonterminal-external-member-children | (token & usedb) | Get all external children of TOKEN. |
nonterminal-protection | (token & parent) | Return protection as a symbol. |
nonterminal-abstract | (token & parent) | Return if TOKEN is abstract. |
nonterminal-leaf | (token & parent) | Return if TOKEN is leaf. |
nonterminal-static | (token & parent) | Return if TOKEN is static. |
beginning-of-context | (& point) | Move to the beginning of the current context. |
end-of-context | (& point) | Move to the end of the current context. |
up-context | (& point) | Move up one context level. |
get-local-variables | (& point) | Get local variables. |
get-all-local-variables | (& point) | Get all local variables. |
get-local-arguments | (& point) | Get arguments to this function. |
end-of-command | | Move to the end of the current command. |
beginning-of-command | | Move to the beginning of the current command. |
ctxt-current-symbol | (& point) | List of related symbols. |
ctxt-current-assignment | (& point) | Variable being assigned to. |
ctxt-current-function | (& point) | Function being called at point. |
ctxt-current-argument | (& point) | The index to the argument of the current function the cursor is in. |
Parameters mean:

&
Parameters after this point are optional.

buffer
The buffer in which the override operates.

token
A semantic nonterminal token.

parent
A token which is the parent of the token being operated on.
imenu-create-index-function
This configures Imenu to use semantic parsing. It should be a function that takes no arguments and returns an index of the current buffer as an alist.
Simple elements in the alist look like `(INDEX-NAME . INDEX-POSITION)'.
Special elements look like `(INDEX-NAME INDEX-POSITION FUNCTION ARGUMENTS...)'.
A nested sub-alist element looks like `(INDEX-NAME SUB-ALIST)'.
The function imenu--subalist-p tests an element and returns t if it is a sub-alist.
This function is called within a save-excursion.
The variable is buffer-local.
These are specific to the document tool.

document-comment-start
The string that starts a documentation comment.

document-comment-line-prefix
The prefix placed at the start of each line within a comment.

document-comment-end
The string that ends a documentation comment.
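As a hedged sketch of how these settings come together for a C-like language (the comment strings are illustrative, and semantic-create-imenu-index is assumed to come from semantic-imenu.el), the setup function might contain:

;; Inside the %setupfunction body (or a %( ... )% block):
(setq imenu-create-index-function 'semantic-create-imenu-index
      document-comment-start "/*"
      document-comment-line-prefix "*"
      document-comment-end "*/")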
Writing the rules should be very similar to bison for basic syntax. Each rule is of the form
RESULT : MATCH1 (optional-lambda-expression)
       | MATCH2 (optional-lambda-expression)
       ;
RESULT is a non-terminal, or a token synthesized in your grammar. MATCH is a list of elements that are to be matched if RESULT is to be made. The optional lambda expression is a list containing simplified rules for concocting the parse tree.
In bison, each time an element of a MATCH is found, it is "shifted" onto the parser stack (the stack of matched elements). When all of MATCH1's elements have been matched, it is "reduced" to RESULT. (See the Algorithm section of the bison manual.)
The first RESULT written into your language specification should be bovine-toplevel, or the symbol specified with %start. When starting a parse for a file, this is the default token iterated over. You can use any token you want in place of bovine-toplevel if you specify what that nonterminal will be with a %start token in the settings section.
MATCH is made up of symbols and strings. A symbol such as foo means that a syntactic token of type foo must be matched. A string in the mix means that the previous symbol must have the additional constraint of exactly matching it. Thus, the combination:
symbol "moose"
means that a symbol must first be encountered, and then it must string-match "moose". Be especially careful to remember that the string is a regular expression. The code:
punctuation "."
will match any punctuation character, since `.' in a regular expression matches anything.
For the above example, bison would use a LEX rule to create a new token MOOSE, and the grammar would then match that MOOSE token directly. For the bovinator, this task is mixed into the language definition to simplify implementation, though bison's technique is more efficient.
To make a symbol match explicitly (for keywords, for example), you can use the %token command in the settings section to create new symbols.
%token MOOSE "moose"

find_a_moose: MOOSE ;
will match "moose" explicitly, unlike the previous example where moose need only appear in the symbol. This is because "moose" will be converted to MOOSE in the lexical analysis stage. Thus the symbol MOOSE won't be available any other way.
If we specify our token in this way:
%token MOOSE symbol "moose"

find_a_moose: MOOSE ;
then MOOSE will match the string "moose" explicitly, but it won't do so at the lexical level, allowing use of the text "moose" in other forms of regular expressions.
Non symbol tokens are also allowed. For example:
%token PERIOD punctuation "."

filename : symbol PERIOD symbol ;
will explicitly match one period when used in the above rule.
The OLE (Optional Lambda Expression) is converted into a bovine lambda (see section 5, Preparing a bovine table for your language). This lambda has special short-cuts to simplify reading the Emacs BNF definition. An OLE like this:
( $1 )
results in a lambda return which consists entirely of the string or object found by matching the first (zeroth) element of match. An OLE like this:
( ,(foo $1) )
executes `foo' on the first argument, and then splices its return into the return list whereas:
( (foo $1) )
executes foo, and that is placed in the return list.
Here are other things that can appear inline:
$1
The object matched by the first element of MATCH (a string, or a token list if that element was a nonterminal).

,$1
The matched object, spliced into the return list (useful when it is itself a list).

'$1
The matched object, quoted rather than evaluated.

foo
The literal symbol foo, placed in the return list.

(foo)
A call to the function foo; its return value is placed in the return list.

,(foo)
A call to the function foo; its return value is spliced into the return list.

'(foo)
The quoted list (foo), placed in the return list without being treated as a function call.
(EXPAND $1 nonterminal depth)
Performs a recursive parse on the matched object (commonly a semantic list), starting with the named nonterminal at the given depth.

(EXPANDFULL $1 nonterminal depth)
Like EXPAND, but iterates the named nonterminal over the matched object the same way the parser iterates over bovine-toplevel. This lets you have much simpler rules in this specific case, and also lets you have positional information in the returned tokens, and error skipping.
(ASSOC symbol1 value1 symbol2 value2 ... )
Creates an association list from the symbol/value pairs, of the form:

( ( symbol1 . value1) (symbol2 . value2) ... )
If the symbol %quotemode backquote is specified, then use ,@ to splice a list in, and , to evaluate the expression. This lets you send $1 as a symbol into a list instead of having it expanded inline.
The rule:
SYMBOL : symbol

is equivalent to

SYMBOL : symbol ( $1 )

which, if it matched the string "A", would return

( "A" )

If this rule were used like this:

ASSIGN: SYMBOL punctuation "=" SYMBOL ( $1 $3 )

it would match "A=B", and return

( ("A") ("B") )

The letters A and B come back in lists because SYMBOL is a nonterminal, not an actual lexical element.
To get a better result with nonterminals, use , to splice lists in, like this:

ASSIGN: SYMBOL punctuation "=" SYMBOL ( ,$1 ,$3 )

which would return

( "A" "B" )
In order for a generalized program using Semantic to work with
multiple languages, it is important to have a consistent meaning for
the contents of the tokens returned. The variable
semantic-toplevel-bovine-table
is documented with the complete
list of tokens that a functional or OO language may use. While any
given language is free to create its own tokens, such a language
definition would not produce a stream of tokens usable by a
generalized tool.
In general, all tokens returned from a parser should be generated with the following form:
("NAME" type-symbol ... "DOCSTRING" PROPERTIES OVERLAY)
NAME and type-symbol are the only syntactic elements of a nonterminal which are guaranteed to exist. This means that a parser which uses nil for either of these two slots, or some value which is not type consistent, is wrong.
NAME is also guaranteed to be a string. This string represents the name of the nonterminal, usually a named definition which the language will use elsewhere as a reference to the syntactic element found.
type-symbol is a symbol representing the type of the nonterminal. Valid type-symbols can be anything, as long as it is an Emacs Lisp symbol.
DOCSTRING is a required slot in the nonterminal, but it can be nil. Some languages have the documentation saved in a comment nearby; in these cases DOCSTRING is nil, and the function `semantic-find-documentation' can be used to fetch the comment.
PROPERTIES is a slot generated by the semantic parser harness, and need not be provided by a language author. Access nonterminal properties programmatically with semantic-token-put and semantic-token-get.
OVERLAY represents positional information for this token. It is automatically generated by the semantic parser harness, and need not be provided by the language author, unless they provide a nonterminal expansion function via semantic-expand-nonterminal.
The OVERLAY property is accessed via several functions returning the beginning, end, and buffer of a token. Use these functions unless the overlay is really needed (see 9.1 Token Queries). Depending on the overlay in a program can be dangerous, because sometimes the overlay is replaced with an integer pair of the form:

[ START END ]
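For instance, a small sketch of that advice (lang-token-region is a hypothetical helper; semantic-token-start and semantic-token-end are the kind of query functions referred to above):

(defun lang-token-region (token)
  "Return a (START . END) pair for TOKEN via the token query functions."
  (cons (semantic-token-start token)
        (semantic-token-end token)))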
If a parser produces tokens for a functional language, then the following token formats are available.
("NAME" variable "TYPE" DEFAULT-VALUE EXTRA-SPEC
"DOCSTRING" PROPERTIES OVERLAY)
TYPE is a string naming the type of the variable, or nil for untyped languages. Languages which support variable declarations without an explicit type (such as C) should supply a string representing the default type for that language.
DEFAULT-VALUE can be a string, or something pre-parsed and language specific. Hopefully this slot will be better defined in future versions of Semantic.
EXTRA-SPEC are extra specifiers. See below.
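For illustration only (this is not actual parser output; the PROPERTIES and OVERLAY slots are shown as nil because the parser harness fills them in), a C-like declaration such as const int x = 5; could be represented as:

("x" variable "int" "5" ((const . t)) nil nil nil)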
("NAME" function "TYPE" ( ARG-LIST ) EXTRA-SPEC
"DOCSTRING" PROPERTIES OVERLAY)
TYPE is the return type of the function. It can be nil for untyped languages, or for procedures in languages which support functions with no return data. See above for more.
ARG-LIST is a list of arguments passed to this function. Each element in the arg list can be a simple string naming the argument, or a full variable token as described above.
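Similarly, and again purely as an illustration with the arguments simplified to plain strings, a prototype like int add(int a, int b); might yield:

("add" function "int" ("a" "b") ((prototype . t)) nil nil nil)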
("NAME" type "TYPE" ( PART-LIST ) ( PARENTS ) EXTRA-SPEC
"DOCSTRING" PROPERTIES OVERLAY)
PART-LIST is the list of individual entries inside compound types. Structures, for example, can contain several fields which can be represented as variables. Valid entries in a PART-LIST are plain strings, or full nonterminal tokens such as the variable tokens described above (and function tokens, for methods).
PARENTS represents a list of parents of this type. Parents are used in two situations.
The structure of the PARENTS list is of this form:
( EXPLICIT-PARENTS . INTERFACE-PARENTS )
EXPLICIT-PARENTS names the parent (or parents, for multiple inheritance) explicitly inherited from; it can be nil. INTERFACE-PARENTS is a list of strings representing the names of all INTERFACES, or abstract classes, inherited from. It can also be nil.
This slot can be interesting because the form:

( nil "string")

is valid: it describes a type with no explicit parent that nevertheless inherits from (or implements) an interface.
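As one more hedged illustration, assuming the PART-LIST holds variable tokens for the fields, a C struct such as struct point { int x; int y; }; might be represented roughly as:

("point" type "struct"
 (("x" variable "int" nil nil nil nil nil)
  ("y" variable "int" nil nil nil nil nil))
 nil nil nil nil nil)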
("FILE" include SYSTEM "DOCSTRING" PROPERTIES OVERLAY)
An include represents a reference to another file, such as an #include statement in C.
In this case, instead of NAME, a FILE is specified.
FILE can be a subset of the actual file to be loaded.
SYSTEM is true if this include is part of a set of system includes. This field isn't currently being used and may be eliminated.
("NAME" package DETAIL "DOCSTRING" PROPERTIES OVERLAY)
A package token represents a package statement, or a provide statement in Emacs Lisp.
DETAIL might be an associated file name, or some other language specific bit of information.
Some default token types have a slot EXTRA-SPEC, for extra specifiers. These specifiers provide additional details not commonly used, or not available in all languages. This list is an alist; if a given key's value would be nil, the pair is simply omitted from the list, saving space. Some valid extra specifiers are:
(parent . "text")
The name of an enclosing parent, such as the class to which an externally defined method belongs.

(dereference . INT)
The level of dereference, i.e. the number of array specifiers following a variable.
(pointer . INT)
The level of pointerness, i.e. the number of * characters in the declaration.
(typemodifiers . ( "text" ... ))
A list of type modifiers, such as `register' and `volatile' in C.
(suffix . "text")
(const . t)
Non-nil if the item is declared constant.

(throws . ( "text" ... ))
A list of exceptions the function may throw.
(destructor . t)
Non-nil if the function is a destructor.

(constructor . t)
Non-nil if the function is a constructor.
(user-visible . t)
Non-nil if the item is directly visible to the user, such as an interactive command.
(prototype . t)
Non-nil if this is a prototype (forward declaration) rather than a full definition; for example, an Emacs Lisp autoload statement creates prototypes.