In order to parse a source file, it must first be converted into a token stream. Tokens are syntactic elements such as whitespace, symbols, strings, lists, and punctuation.
The lexer uses the major-mode's syntax table for conversion.
@xref{Syntax Tables,,,elisp}.
As long as that is set up correctly (along with the important comment-start and comment-start-skip variables), the lexer should already work for your language.
The primary entry point of the lexer is the semantic-flex function shown below. Normally, you do not need to call this function. It is usually called by semantic-bovinate-toplevel for you.
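As a minimal sketch of the setup the lexer expects, a major mode need only install the comment variables alongside its syntax table. The mode name and comment syntax below are hypothetical; only the two buffer-local variables matter here:

```elisp
;; Hypothetical mode setup -- the syntax table comes from the major
;; mode as usual; the lexer additionally needs the comment variables.
(defun my-lang-mode-setup ()
  "Prepare the current buffer for the semantic lexer."
  ;; How a comment starts in this (hypothetical) language.
  ;; Both variables automatically become buffer-local when set.
  (setq comment-start "#")
  ;; Regexp that skips over a comment starter and trailing whitespace.
  (setq comment-start-skip "#+\\s-*"))
```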
4.1 Lexer Overview
4.2 Lexer Output
4.3 Lexer Options
4.4 Keywords
4.5 Standard Keyword Properties
The Semantic lexer breaks up the content of an Emacs buffer into a list of tokens. This process is based mostly on regular expressions, which in turn depend on the syntax table of the buffer's major mode being set up properly. @xref{Major Modes,,,emacs}. @xref{Syntax Tables,,,elisp}. @xref{Regexps,,,emacs}.
Specifically, the following regular expressions, which rely on syntax tables, are used:

\\s-    whitespace characters
\\sw    word constituents
\\s_    symbol constituents
\\s.    punctuation characters
\\s<    comment starters
\\s>    comment enders
\\s\\    escape characters
\\s)    close-parenthesis characters
\\s$    paired delimiters
\\s\"    string quotes
\\s\'    expression prefixes
In addition, Emacs' built-in features such as comment-start-skip, forward-comment, forward-list, and forward-sexp are employed.
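The following sketch illustrates how these building blocks behave when evaluated by hand; c-mode is used here only because it supplies a well-known syntax table:

```elisp
(with-temp-buffer
  (c-mode)                              ; any mode with a proper syntax table
  (insert "foo_bar (1 + 2) /* note */")
  (goto-char (point-min))
  ;; \sw and \s_ match word and symbol constituents, so this
  ;; regexp covers the identifier "foo_bar".
  (looking-at "\\(\\sw\\|\\s_\\)+")
  (goto-char (match-end 0))
  ;; Skip whitespace-syntax characters, like the \s- regexp does.
  (skip-syntax-forward "-")
  ;; forward-sexp jumps over the whole parenthesized list at once.
  (forward-sexp)
  ;; forward-comment skips whitespace and the trailing comment.
  (forward-comment 1))
```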
The lexer, semantic-flex, scans the content of a buffer and returns a token list. Let's illustrate this with a simple example.
00: /*
01:  * Simple program to demonstrate semantic.
02:  */
03:
04: #include <stdio.h>
05:
06: int i_1;
07:
08: int
09: main(int argc, char** argv)
10: {
11:   printf("Hello world.\n");
12: }
Evaluating (semantic-flex (point-min) (point-max)) in a buffer containing the code above returns the following token list. The input line and string that produced each token are shown after each semicolon.
((punctuation   52 . 53)    ; 04: #
 (INCLUDE       53 . 60)    ; 04: include
 (punctuation   61 . 62)    ; 04: <
 (symbol        62 . 67)    ; 04: stdio
 (punctuation   67 . 68)    ; 04: .
 (symbol        68 . 69)    ; 04: h
 (punctuation   69 . 70)    ; 04: >
 (INT           72 . 75)    ; 06: int
 (symbol        76 . 79)    ; 06: i_1
 (punctuation   79 . 80)    ; 06: ;
 (INT           82 . 85)    ; 08: int
 (symbol        86 . 90)    ; 08: main
 (semantic-list 90 . 113)   ; 08: (int argc, char** argv)
 (semantic-list 114 . 147)  ; 09-12: body of main function
)
As shown above, the token list is a list of "tokens". Each token in turn is a list of the form
(TOKEN-TYPE BEGINNING-POSITION . ENDING-POSITION)
where TOKEN-TYPE is a symbol, and the other two are integers indicating the buffer positions that delimit the token, such that

(buffer-substring BEGINNING-POSITION ENDING-POSITION)
would return the string form of the token.
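A small helper, sketched below, makes that buffer-substring relationship concrete; the function name is invented for illustration and is not part of semantic:

```elisp
(defun my-token-text (token)
  "Return the buffer text covered by TOKEN.
TOKEN has the form (TYPE START . END), as returned by `semantic-flex'.
Illustrative helper only; not part of semantic."
  ;; (TYPE START . END) is a list whose cadr is START and whose
  ;; cddr is the bare END position.
  (buffer-substring-no-properties (nth 1 token) (cddr token)))

;; Example use: pair each token type with its text.
;; (mapcar (lambda (tok) (cons (car tok) (my-token-text tok)))
;;         (semantic-flex (point-min) (point-max)))
```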
Note that one line (line 4 above) can produce seven tokens, while the whole body of the function produces a single token. This is because the depth parameter of semantic-flex was not specified.
Let's see the output when depth is set to 1. Evaluate (semantic-flex (point-min) (point-max) 1) in the same buffer; note the third argument, 1.
((punctuation   52 . 53)    ; 04: #
 (INCLUDE       53 . 60)    ; 04: include
 (punctuation   61 . 62)    ; 04: <
 (symbol        62 . 67)    ; 04: stdio
 (punctuation   67 . 68)    ; 04: .
 (symbol        68 . 69)    ; 04: h
 (punctuation   69 . 70)    ; 04: >
 (INT           72 . 75)    ; 06: int
 (symbol        76 . 79)    ; 06: i_1
 (punctuation   79 . 80)    ; 06: ;
 (INT           82 . 85)    ; 08: int
 (symbol        86 . 90)    ; 08: main
 (open-paren    90 . 91)    ; 08: (
 (INT           91 . 94)    ; 08: int
 (symbol        95 . 99)    ; 08: argc
 (punctuation   99 . 100)   ; 08: ,
 (CHAR          101 . 105)  ; 08: char
 (punctuation   105 . 106)  ; 08: *
 (punctuation   106 . 107)  ; 08: *
 (symbol        108 . 112)  ; 08: argv
 (close-paren   112 . 113)  ; 08: )
 (open-paren    114 . 115)  ; 10: {
 (symbol        120 . 126)  ; 11: printf
 (semantic-list 126 . 144)  ; 11: ("Hello world.\n")
 (punctuation   144 . 145)  ; 11: ;
 (close-paren   146 . 147)  ; 12: }
)
The depth parameter "peeled away" one more level of "list", delimited by matching parentheses or braces. The depth parameter can be any non-negative number; however, the parser must then be able to handle the extra tokens.
This is an interesting benefit of the lexer having the full resources of Emacs at its disposal: skipping over matched parentheses is achieved simply by calling the built-in functions forward-list and forward-sexp.
All common token symbols are enumerated below. Additional token symbols aside from these can be generated by the lexer if the user option semantic-flex-extensions is set. It is up to the user to add matching extensions to the parser to deal with the lexer extensions. An example use of semantic-flex-extensions is in `semantic-make.el', where it is set to the value of semantic-flex-make-extensions, which may generate shell-command tokens.
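In the same spirit, a lexer extension associates regular expressions with handler functions. The sketch below is hypothetical: the regexp, the handler, and the shell-command token name are invented for illustration, on the assumption that each handler runs with the match data set and returns a token or nil:

```elisp
;; Each entry is (REGEXP . FUNCTION).  When REGEXP matches at point,
;; FUNCTION is called and may return a token of the form
;; (TYPE START . END), or nil to decline the match.
(setq semantic-flex-extensions
      '(("\\$([a-z]+" .
         (lambda ()
           ;; Hypothetical token type for illustration only.
           (cons 'shell-command
                 (cons (match-beginning 0) (match-end 0)))))))
```

The parser must be taught to expect any token type produced this way; the lexer alone has no opinion about what `shell-command` means.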
bol
  Regular expression: nil.

charquote
  Regular expression: \\s\\+.

close-paren
  Regular expression: \\s). These are typically ), }, ], etc.

comment
  Regular expression: nil.

newline
  Regular expression: \\s-*\\(\n\\|\\s>\\). This token is produced only if the user sets semantic-flex-enable-newlines to non-nil.

open-paren
  Regular expression: \\s(. These are typically (, {, [, etc. Note that these are not usually generated unless the depth argument to semantic-flex is greater than 0.

punctuation
  Regular expression: \\(\\s.\\|\\s$\\|\\s'\\).

semantic-list
  No single regular expression; produced for a balanced expression, such as a parenthesized list or a function body, nested deeper than the depth argument.

string
  Regular expression: \\s\". The lexer relies on forward-sexp to find the matching end.

symbol
  Regular expression: \\(\\sw\\|\\s_\\)+.

whitespace
  Regular expression: nil. If semantic-ignore-comments is also non-nil, comments are considered whitespace.
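To see these token types in practice, one can tabulate the output of semantic-flex. The helper below is a sketch and not part of semantic itself:

```elisp
(defun my-count-token-types (&optional depth)
  "Count how many tokens of each type `semantic-flex' finds in the buffer.
DEPTH is passed through to `semantic-flex'.  Illustrative helper only."
  (let ((counts nil))
    (dolist (tok (semantic-flex (point-min) (point-max) depth))
      ;; Each token is (TYPE START . END); tally by TYPE.
      (let ((cell (assq (car tok) counts)))
        (if cell
            (setcdr cell (1+ (cdr cell)))
          (push (cons (car tok) 1) counts))))
    counts))

;; Example: (my-count-token-types 1) in the C buffer shown earlier
;; would include entries for punctuation, symbol, open-paren, etc.
```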
Although most lexer functions are called for you by other semantic functions, there are ways for you to extend or customize the lexer. The variables described below serve this purpose.
semantic-flex-extensions
  Associates regular expressions with functions for the current major mode. When a regular expression matches, the corresponding function is called and should return a token of the form

(TYPE START . END)

  nil is a valid return value. TYPE can be any type of symbol, as long as it doesn't occur as a nonterminal in the language definition.
semantic-flex-syntax-modifications
  Changes to the syntax table for this buffer. Each entry is of the form

(CHAR CLASS)

  where CHAR is the character passed to `modify-syntax-entry', and CLASS is the string also passed to `modify-syntax-entry' to define what syntax class CHAR has. For example:

(setq semantic-flex-syntax-modifications '((?. "_")))

  This makes the period, ., a symbol constituent. This may be necessary if filenames are prevalent, such as in Makefiles.
semantic-flex-enable-newlines
  When non-nil, the lexer returns 'newlines as syntactic elements. Useful for languages where the newline is a special-case terminator. Only set this on a per-mode basis, not globally.
semantic-flex-enable-whitespace
  When non-nil, the lexer returns 'whitespace as syntactic elements. Useful for languages where the syntax is whitespace dependent. Only set this on a per-mode basis, not globally.
semantic-number-expression
  Regular expression used to extract numbers. If this value is nil, no number extraction is done during lexing. The default expression tries to match C- and Java-like numbers:

DECIMAL_LITERAL:
    [1-9][0-9]*
  ;
HEX_LITERAL:
    0[xX][0-9a-fA-F]+
  ;
OCTAL_LITERAL:
    0[0-7]*
  ;
INTEGER_LITERAL:
    <DECIMAL_LITERAL>[lL]?
  | <HEX_LITERAL>[lL]?
  | <OCTAL_LITERAL>[lL]?
  ;
EXPONENT:
    [eE][+-]?[0-9]+
  ;
FLOATING_POINT_LITERAL:
    [0-9]+[.][0-9]*<EXPONENT>?[fFdD]?
  | [.][0-9]+<EXPONENT>?[fFdD]?
  | [0-9]+<EXPONENT>[fFdD]?
  | [0-9]+<EXPONENT>?[fFdD]
  ;
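For a language with simpler numeric literals, one might install a reduced expression. The value below is a sketch covering only decimal integers and simple floats with exponents; a real mode would set it buffer-locally in its setup function:

```elisp
;; Simplified number matcher: covers literals like 42, 3.14, and 1e-9.
;; Signs and suffixes are left to the parser.  Illustrative only.
(setq semantic-number-expression
      "\\([0-9]+\\(\\.[0-9]*\\)?\\([eE][-+]?[0-9]+\\)?\\)")
```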
Another important piece of the lexer is the keyword table (see 6.1 Settings). Your language will want to set up a keyword table for fast conversion of symbol strings into language terminals.
The keyword table can also be used to store additional information about those keywords. The following functions can be useful when examining text in a language buffer.
semantic-flex-keyword-p
  Return non-nil if TEXT is a keyword in the keyword table.
A companion function attaches a property and value to a token; when the value given is nil, the property is instead removed from TOKEN. All cons cells in the property list are replicated so that there are no side effects if TOKEN is in shared lists.
Keyword properties can be set up in a BNF file for ease of maintenance. While examining the text in a language buffer, this can provide an easy and quick way of storing details about text in the buffer.
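Assuming the semantic 1.4 keyword API (semantic-flex-make-keyword-table and semantic-flex-keywords-obarray, as used by BNF-generated parsers), a keyword table setup might be sketched like this; the terminal names and the summary property value are invented for illustration:

```elisp
;; Sketch: map keyword strings to terminal symbols and attach a
;; property to one of them.  Terminal names are illustrative.
(setq semantic-flex-keywords-obarray
      (semantic-flex-make-keyword-table
       '(("if"     . IF)
         ("else"   . ELSE)
         ("return" . RETURN))
       ;; Optional property specs: (KEYWORD PROPERTY VALUE)
       '(("return" summary "return <expression>;"))))

;; (semantic-flex-keyword-p "if")               ; non-nil for keywords
;; (semantic-flex-keyword-get "return" 'summary)
```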
Add known properties here when they are known.