
4. Preparing your language for Lexing

In order to reduce a source file to a token list, it must first be converted into a token stream. Tokens are syntactic elements such as whitespace, symbols, strings, lists, and punctuation.

The lexer uses the major mode's syntax table for this conversion. @xref{Syntax Tables,,,elisp}. As long as the syntax table is set up correctly (along with the important comment-start and comment-start-skip variables), the lexer should already work for your language.

The primary entry point of the lexer is the semantic-flex function shown below. Normally, you do not need to call this function. It is usually called by semantic-bovinate-toplevel for you.

Function: semantic-flex start end &optional depth length
Using the syntax table, do something roughly equivalent to flex. Scan between START and END. Optional argument DEPTH indicates at what level to scan over entire lists. The return value is a token stream; each element is a list of the form (symbol start-position . end-position). END does not mark the end of the text scanned, only the end of the beginning of text scanned. Thus, if a token such as a string extends past END, the end position of the returned token will be larger than END. To truly restrict scanning, use `narrow-to-region'. The optional argument LENGTH specifies that semantic-flex should return at most LENGTH tokens.
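For example, the function can be invoked directly in a source buffer. A minimal sketch, assuming the semantic package is loaded and the buffer's major mode has its syntax table set up:

```lisp
;; Lex the accessible portion of the current buffer at depth 0
;; and look at the first token of the resulting stream.
(let ((stream (semantic-flex (point-min) (point-max))))
  ;; Each element has the form (SYMBOL START . END).
  (car stream))
```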

4.1 Lexer Overview  
4.2 Lexer Output  
4.3 Lexer Options  
4.4 Keywords  
4.5 Standard Keyword Properties  



4.1 Lexer Overview

The Semantic lexer breaks up the content of an Emacs buffer into a list of tokens. This process is based mostly on regular expressions, which in turn depend on the syntax table of the buffer's major mode being set up properly. @xref{Major Modes,,,emacs}. @xref{Syntax Tables,,,elisp}. @xref{Regexps,,,emacs}.

Specifically, the following regular expressions, which rely on syntax tables, are used:

\\s-
whitespace characters
\\sw
word constituent
\\s_
symbol constituent
\\s.
punctuation character
\\s<
comment starter
\\s>
comment ender
\\s\\
escape character
\\s)
close parenthesis character
\\s$
paired delimiter
\\s\"
string quote
\\s\'
expression prefix

In addition, Emacs' built-in features such as comment-start-skip, forward-comment, forward-list, and forward-sexp are employed.
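These syntax-class regexps can be tried interactively. A minimal sketch, assuming point is on an identifier in a buffer whose mode defines word and symbol constituents as usual:

```lisp
;; Match a complete symbol at point using the same class regexps
;; the lexer uses.  Returns the symbol's text, or nil if point is
;; not on a word or symbol constituent.
(when (looking-at "\\(\\sw\\|\\s_\\)+")
  (match-string 0))
```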



4.2 Lexer Output

The lexer, semantic-flex, scans the content of a buffer and returns a token list. Let's illustrate this with a simple example.

 
00: /*
01:  * Simple program to demonstrate semantic.
02:  */
03:
04: #include <stdio.h>
05:
06: int i_1;
07:
08: int
09: main(int argc, char** argv)
10: {
11:     printf("Hello world.\n");
12: }

Evaluating (semantic-flex (point-min) (point-max)) in a buffer containing the code above returns the following token list. The input line and the text that produced each token are shown after each semicolon.

 
((punctuation     52 .  53)     ; 04: #
 (INCLUDE         53 .  60)     ; 04: include
 (punctuation     61 .  62)     ; 04: <
 (symbol          62 .  67)     ; 04: stdio
 (punctuation     67 .  68)     ; 04: .
 (symbol          68 .  69)     ; 04: h
 (punctuation     69 .  70)     ; 04: >
 (INT             72 .  75)     ; 06: int
 (symbol          76 .  79)     ; 06: i_1
 (punctuation     79 .  80)     ; 06: ;
 (INT             82 .  85)     ; 08: int
 (symbol          86 .  90)     ; 08: main
 (semantic-list   90 . 113)     ; 08: (int argc, char** argv)
 (semantic-list  114 . 147)     ; 09-12: body of main function
 )

As shown above, the token list is a list of "tokens". Each token in turn is a list of the form

 
(TOKEN-TYPE BEGINNING-POSITION . ENDING-POSITION)

where TOKEN-TYPE is a symbol, and the other two elements are integers indicating the buffer positions that delimit the token, such that

 
(buffer-substring BEGINNING-POSITION ENDING-POSITION)

would return the string form of the token.
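For example, a small helper can recover the text of any token in the stream. The helper name below is hypothetical, used only for illustration; Semantic itself also provides accessor macros for flex tokens:

```lisp
(defun my-flex-token-text (token)
  "Return the buffer text delimited by TOKEN.
TOKEN has the form (TYPE START . END).  Hypothetical helper
for illustration only."
  (buffer-substring (car (cdr token)) (cdr (cdr token))))

;; Given the stream above, (my-flex-token-text '(symbol 62 . 67))
;; would return "stdio" in that buffer.
```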

Note that one line (line 4 above) can produce seven tokens while the whole body of the function produces a single token. This is because the depth parameter of semantic-flex was not specified. Let's see the output when depth is set to 1. Evaluate (semantic-flex (point-min) (point-max) 1) in the same buffer. Note the third argument of 1.

 
((punctuation    52 .  53)     ; 04: #
 (INCLUDE        53 .  60)     ; 04: include
 (punctuation    61 .  62)     ; 04: <
 (symbol         62 .  67)     ; 04: stdio
 (punctuation    67 .  68)     ; 04: .
 (symbol         68 .  69)     ; 04: h
 (punctuation    69 .  70)     ; 04: >
 (INT            72 .  75)     ; 06: int
 (symbol         76 .  79)     ; 06: i_1
 (punctuation    79 .  80)     ; 06: ;
 (INT            82 .  85)     ; 08: int
 (symbol         86 .  90)     ; 08: main

 (open-paren     90 .  91)     ; 08: (
 (INT            91 .  94)     ; 08: int
 (symbol         95 .  99)     ; 08: argc
 (punctuation    99 . 100)     ; 08: ,
 (CHAR          101 . 105)     ; 08: char
 (punctuation   105 . 106)     ; 08: *
 (punctuation   106 . 107)     ; 08: *
 (symbol        108 . 112)     ; 08: argv
 (close-paren   112 . 113)     ; 08: )

 (open-paren    114 . 115)     ; 10: {
 (symbol        120 . 126)     ; 11: printf
 (semantic-list 126 . 144)     ; 11: ("Hello world.\n")
 (punctuation   144 . 145)     ; 11: ;
 (close-paren   146 . 147)     ; 12: }
 )

The depth parameter "peeled away" one more level of "list" delimited by matching parentheses or braces. The depth parameter can be any number; however, the parser must then be able to handle the extra tokens.

This is an interesting benefit of the lexer having the full resources of Emacs at its disposal. Skipping over matched parenthesis is achieved by simply calling the built-in functions forward-list and forward-sexp.

All common token symbols are enumerated below. The lexer can generate additional token symbols if the user option semantic-flex-extensions is set; it is then up to you to add matching extensions to the parser to handle them. An example use of semantic-flex-extensions is in `semantic-make.el', where it is set to the value of semantic-flex-make-extensions, which may generate shell-command tokens.



4.2.1 Default syntactic tokens if the lexer is not extended

bol
Empty string matching the beginning of a line. This token is produced only if the user sets semantic-flex-enable-bol to non-nil.
charquote
String sequences that match \\s\\+.
close-paren
Characters that match \\s). These are typically ), }, ], etc.
comment
A comment chunk. These tokens are not produced by default; they are produced only if the user sets semantic-ignore-comments to nil.
newline
Characters matching \\s-*\\(\n\\|\\s>\\). This token is produced only if the user sets semantic-flex-enable-newlines to non-nil.
open-paren
Characters that match \\s(. These are typically (, {, [, etc. Note that these are not usually generated unless the depth argument to semantic-flex is greater than 0.
punctuation
Characters matching \\(\\s.\\|\\s$\\|\\s'\\).
semantic-list
A string delimited by matching parentheses, braces, etc., that the lexer skipped over because the depth parameter to semantic-flex was not high enough.
string
Quoted strings, i.e., string sequences that start and end with characters matching \\s\". The lexer relies on forward-sexp to find the matching end.
symbol
String sequences that match \\(\\sw\\|\\s_\\)+.
whitespace
Characters that match the \\s-+ regexp. This token is produced only if the user sets semantic-flex-enable-whitespace to non-nil. If semantic-ignore-comments is also non-nil, comments are treated as whitespace.



4.3 Lexer Options

Although most lexer functions are called for you by other Semantic functions, there are ways to extend or customize the lexer. The variables shown below serve this purpose.

Variable: semantic-flex-unterminated-syntax-end-function
Function called when unterminated syntax is encountered. This should be set to one function, which should take three parameters: SYNTAX, the type of syntax which is unterminated; SYNTAX-START, where the broken syntax begins; and FLEX-END, where the lexical analysis was asked to end. This function can be used for languages that can intelligently fix up broken syntax, or to exit lexical analysis via `throw' or `signal' when unterminated syntax is found.

Variable: semantic-flex-extensions
Buffer-local extensions to the lexical analyzer. This should be an alist whose keys are regular expressions and whose values are functions. Each function should both move point and return a lexical token of the form:

 
(TYPE START . END)

nil is also a valid return value. TYPE can be any symbol, as long as it does not occur as a nonterminal in the language definition.
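Based on this description, an extension entry might look like the following sketch. The token name hash is made up for illustration, and the exact way the matched regexp is consumed is an assumption:

```lisp
;; Hypothetical extension: emit a `hash' token for a `#' character.
;; The attached function must move point past the matched text and
;; return a token of the form (TYPE START . END), or nil.
(setq semantic-flex-extensions
      '(("#" . (lambda ()
                 (goto-char (match-end 0))
                 (cons 'hash
                       (cons (match-beginning 0) (match-end 0)))))))
```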

Variable: semantic-flex-syntax-modifications
Changes the syntax table for a given buffer. These changes are active only while the buffer is being flexed. This is a list where each element has the form
 
(CHAR CLASS)

CHAR is the char passed to `modify-syntax-entry', and CLASS is the string also passed to `modify-syntax-entry' to define what syntax class CHAR has.

 
(setq semantic-flex-syntax-modifications '((?. "_")))

This makes the period (.) a symbol constituent, which may be necessary in languages where filenames are prevalent, such as Makefiles.

Variable: semantic-flex-enable-newlines
When flexing, report 'newlines as syntactic elements. Useful for languages where the newline is a special-case terminator. Only set this on a per-mode basis, not globally.

Variable: semantic-flex-enable-whitespace
When flexing, report 'whitespace as syntactic elements. Useful for languages where the syntax is whitespace-dependent. Only set this on a per-mode basis, not globally.

Variable: semantic-flex-enable-bol
When flexing, report beginning of lines as syntactic elements. Useful for indentation-sensitive languages like Python. Only set this on a per-mode basis, not globally.
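Since these options should be set per mode, the usual place to do so is a mode hook. A sketch, where my-lang-mode-hook is a hypothetical hook for your language's major mode:

```lisp
(defun my-lang-setup-lexer ()
  "Enable extra lexer tokens buffer-locally.
Hypothetical setup function for illustration."
  (set (make-local-variable 'semantic-flex-enable-newlines) t)
  (set (make-local-variable 'semantic-flex-enable-bol) t))

(add-hook 'my-lang-mode-hook 'my-lang-setup-lexer)
```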

Variable: semantic-number-expression
Regular expression for matching a number. If this value is nil, no number extraction is done during lexing. The default expression matches C- and Java-style numbers:

 
DECIMAL_LITERAL:
    [1-9][0-9]*
  ;
HEX_LITERAL:
    0[xX][0-9a-fA-F]+
  ;
OCTAL_LITERAL:
    0[0-7]*
  ;
INTEGER_LITERAL:
    <DECIMAL_LITERAL>[lL]?
  | <HEX_LITERAL>[lL]?
  | <OCTAL_LITERAL>[lL]?
  ;
EXPONENT:
    [eE][+-]?[0-9]+
  ;
FLOATING_POINT_LITERAL:
    [0-9]+[.][0-9]*<EXPONENT>?[fFdD]?
  | [.][0-9]+<EXPONENT>?[fFdD]?
  | [0-9]+<EXPONENT>[fFdD]?
  | [0-9]+<EXPONENT>?[fFdD]
  ;
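A language with simpler number syntax could install a smaller expression. A sketch that matches only plain decimal integers, set buffer-locally since number syntax is language specific:

```lisp
;; Recognize only plain decimal integers as number tokens.
(set (make-local-variable 'semantic-number-expression)
     "\\<[0-9]+\\>")
```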



4.4 Keywords

Another important piece of the lexer is the keyword table (see 6.1 Settings). Your language will want to set up a keyword table for fast conversion of symbol strings to language terminals.

The keywords table can also be used to store additional information about those keywords. The following programming functions can be useful when examining text in a language buffer.

Function: semantic-flex-keyword-p text
Return non-nil if TEXT is a keyword in the keyword table.

Function: semantic-flex-keyword-put text property value
For keyword TEXT, set PROPERTY to VALUE.

Function: semantic-token-put-no-side-effect token key value
For TOKEN, put the property KEY on it with VALUE without side effects. If VALUE is nil, then remove the property from TOKEN. All cons cells in the property list are replicated so that there are no side effects if TOKEN is in shared lists.

Function: semantic-flex-keyword-get text property
For keyword TEXT, get the value of PROPERTY.

Function: semantic-flex-map-keywords fun &optional property
Call function FUN on every semantic keyword. If optional PROPERTY is non-nil, call FUN only on every keyword which has a PROPERTY value. FUN receives a semantic keyword as argument.

Function: semantic-flex-keywords &optional property
Return a list of semantic keywords. If optional PROPERTY is non-nil, return only keywords which have PROPERTY set.

Keyword properties can be set up in a BNF file for ease of maintenance. While examining the text in a language buffer, this can provide an easy and quick way of storing details about text in the buffer.
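For example, a property could attach a short summary string to a keyword. The property name summary is used here only for illustration, and the keyword "include" is assumed to exist in the current buffer's keyword table:

```lisp
;; Store a property on the keyword "include", then read it back.
(semantic-flex-keyword-put "include" 'summary "#include <filename>")
(semantic-flex-keyword-get "include" 'summary)
```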



4.5 Standard Keyword Properties

Add known properties here when they are known.



This document was generated by XEmacs Webmaster on October, 2 2007 using texi2html