3. Implementation notes


3.1 Data Types and Operations


3.1.1 Element Type

Data type: eltype
Data type representing the information about an element type. An eltype has information from `ELEMENT' and `ATTLIST' declarations. It can also store data for the application.

The element types are symbols in a special oblist. The oblist is the table of element types. The symbol's name is the GI (generic identifier), its value is used to store three flags, and its function definition holds the content model. Other information about the element type is stored on the property list.

Function: sgml-eltype-name et
The name (a string) of the element type et.

Function: sgml-eltype-appdata et prop
Get application data from element type et with name prop. prop should be a symbol; the reserved names are: flags, model, attlist, includes, excludes, conref-regexp, mixed, stag-optional, etag-optional.

This function can be used as a place in setf, push and other functions from the CL library.

Function: sgml-eltype-all-miscdata eltype
A list of all data properties for eltype except for flags, model, includes and excludes. This function filters the property list of eltype. Used when saving the parsed DTD.

Function: sgml-eltype-set-all-miscdata eltype miscdata
Append the miscdata data properties to the properties of eltype.

Function: sgml-eltype-attlist et
The attribute specification list for the element type et.

Function: sgml-eltype-completion-table eltypes
Make a completion table from a list, eltypes, of element types.

Function: sgml-eltype-stag-optional et
True if the element type et has an optional start-tag.

Function: sgml-eltype-etag-optional et
True if the element type et has an optional end-tag.

Function: sgml-eltype-excludes et
The list of excluded element types for element type et.

Function: sgml-eltype-includes et
The list of included element types for element type et.

Function: sgml-eltype-flags et
Contains three flags as a number. The flags are stag-optional, etag-optional and mixed.
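
A minimal sketch of how three flags can be packed into one number. The bit values 1, 2 and 4 are the ones listed for the saved DTD format in section 3.5; that the in-memory flags number uses the same bits is an assumption, and the Python names are illustrative only, not PSGML functions.

```python
# Bit values as listed for the saved DTD format (section 3.5).
# Assumption: the in-memory flags number uses the same bits.
STAG_OPTIONAL = 1
ETAG_OPTIONAL = 2
MIXED = 4


def pack_flags(stag_optional, etag_optional, mixed):
    """Combine three booleans into one small integer in the range 0-7."""
    return ((STAG_OPTIONAL if stag_optional else 0)
            | (ETAG_OPTIONAL if etag_optional else 0)
            | (MIXED if mixed else 0))


def unpack_flags(flags):
    """Return (stag_optional, etag_optional, mixed) as booleans."""
    return (bool(flags & STAG_OPTIONAL),
            bool(flags & ETAG_OPTIONAL),
            bool(flags & MIXED))
```

Under these assumptions, an element type with an optional start-tag and mixed content would carry the flags value 5.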

Function: sgml-eltype-mixed et
True if element type et has mixed content.

Function: sgml-eltype-model et
The content model of element type et. The content model is either the start state in the DFA for the content model or a symbol identifying a declared content.

Function: sgml-eltype-shortmap et
The name of the shortmap associated with element type et. This can also be the symbol empty (if declared with `<!USEMAP gi #EMPTY>') or nil (if there is no associated map).

Function: sgml-eltype-token et
Return a token for the element type et.

Function: sgml-eltypes-in-state state tree
List of element types valid in state and tree.


3.1.2 DTD

The DTD data type is realised as a lisp vector using defstruct.

There are two additional fields for internal use: dependencies and merged.

Function: sgml-dtd-dependencies dtd
The list of files used to create this DTD.

Function: sgml-dtd-merged dtd
The pair (file . merged-dtd), if the DTD has had a precompiled DTD merged into it. file is the file containing the compiled DTD and merged-dtd is the DTD loaded from that file.


3.1.3 Element and Tree

Data Type: tree
This is the data type for the nodes in the tree built by the parser.

The tree nodes are represented as lisp vectors, using defstruct to define basic operations.

The Element data type is a view of the tree built by the parser.


3.2 Parsing model

PSGML uses finite state machines and a stack to parse SGML. Every element type has an associated DFA (deterministic finite automaton). This DFA is constructed from the content model.

SGML restricts the allowed content models in such a way that it is easy to directly construct a DFA.

To be able to determine when a start-tag can be omitted, the DFA needs to contain more information than a traditional DFA. In PSGML a DFA has a set of states and two sets of edges. The edges are associated with tokens (corresponding to SGML's primitive content tokens). I call these moves. One set of moves, the optional moves, represents optional tokens. I call the other set required moves. The correspondence to the SGML definitions is: if there is precisely one required move from a state, then the associated token is required. A state is final if there is no required move from that state.
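
The two sets of moves can be pictured with a small model. This is a sketch in Python, not PSGML's Lisp representation; the class and method names are hypothetical, and the content model `(head, body?)` is an invented example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class State:
    """A DFA state with two sets of moves, each keyed by token."""
    name: str
    opt_moves: dict = field(default_factory=dict)  # token -> State
    req_moves: dict = field(default_factory=dict)  # token -> State

    def is_final(self):
        # A state is final if there is no required move from it.
        return not self.req_moves

    def required_token(self):
        # Precisely one required move means its token is required.
        if len(self.req_moves) == 1:
            return next(iter(self.req_moves))
        return None

    def next_state(self, token):
        # Look in both move sets; None means the token is not allowed here.
        if token in self.opt_moves:
            return self.opt_moves[token]
        return self.req_moves.get(token)


# Content model (head, body?): head is required, body is optional.
s2 = State("after-body")
s1 = State("after-head", opt_moves={"body": s2})
s0 = State("start", req_moves={"head": s1})
```

In the start state the only required move is on `head`, so `head` is the required token there, which is the information needed to decide that its start-tag may be omitted. The state after `head` has only an optional move and is therefore final.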

The SGML construct `(...&...&...)' (AND-group) is another problem. There is a simple translation to sequence- and or-connectors. For example, `(a & b & c)' can be translated to:

 
((a, ((c, b) | (b, c))) | 
 (b, ((a, c) | (c, a))) | 
 (c, ((a, b) | (b, a))) )

But this grows too fast to be of direct practical use. PSGML represents an AND-group with one DFA for every (SGML) token in the group. During parsing of an AND-group there is a pointer to a state in one of the group's DFAs, and a list of the DFAs for the tokens not yet satisfied. Most of this is hidden by the primitives for the state type. The parser only sees states in a DFA and moves.
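
The growth, and the alternative of tracking the tokens not yet satisfied, can be sketched as follows. This is a simplified Python model with hypothetical names: each member of the group is taken to be a single required token, so the per-token DFA collapses to one move, whereas PSGML keeps a real DFA for every token.

```python
from math import factorial


def expanded_sequences(n):
    """Number of token orderings in the expansion of an AND-group of n tokens."""
    return factorial(n)


class AndGroupState:
    """Tracks an AND-group as the set of member tokens not yet satisfied."""

    def __init__(self, tokens):
        self.remaining = set(tokens)

    def allowed(self):
        # Any member that has not occurred yet may come next.
        return set(self.remaining)

    def move(self, token):
        if token not in self.remaining:
            raise ValueError(f"{token!r} not allowed here")
        self.remaining.discard(token)

    def is_final(self):
        # The group is complete when every member has been seen.
        return not self.remaining
```

For three tokens the expansion has 3! = 6 orderings, matching the six sequences in the translation above. For ten tokens it already has 3628800, while the tracking state never holds more than ten entries.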


3.3 Entity manager

Function: sgml-push-to-entity entity &optional ref-start type
Set current buffer to a buffer containing the entity entity. entity can also be a file name. Optional argument ref-start should be the start point of the entity reference. Optional argument type overrides the entity type in entity lookup.

Function: sgml-pop-entity
Should be called after sgml-push-to-entity (or a similar function). Restore the current buffer to the buffer that was current when the push to this buffer was made.

Function: sgml-push-to-string string
Create an entity from string and push it on the top of the entity stack. After this the current buffer will be a scratch buffer containing the text of the new entity with point at the first character.

Use sgml-pop-entity to exit from this buffer.
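
The push/pop discipline can be modelled as a stack of saved buffers. The sketch below is a toy Python model with hypothetical names, in which dictionaries stand in for Emacs buffers; it does not reproduce how PSGML stores entities.

```python
class EntityStack:
    """Toy model of the entity stack; dicts stand in for buffers."""

    def __init__(self, initial_buffer):
        self.current = initial_buffer
        self._saved = []  # buffers to return to, innermost last

    def push_to_string(self, text):
        # Remember the current buffer, then switch to a scratch "buffer"
        # holding the entity text, with point at the first character.
        self._saved.append(self.current)
        self.current = {"text": text, "point": 0}

    def pop_entity(self):
        # Restore the buffer that was current when the push was made.
        if not self._saved:
            raise RuntimeError("pop without a matching push")
        self.current = self._saved.pop()
```

Nested pushes unwind in reverse order, so every push needs exactly one matching pop.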


3.4 Parser functions

Function: sgml-need-dtd
This makes sure that the buffer has a DTD and sets global variables needed by the parsing routines. One global variable is sgml-dtd-info, which contains the DTD (type dtd).

Function: sgml-parse-to goal &optional extra-cond quiet
This is the low level interface to the parser.

Parse until (at least) goal, a buffer position. Optional argument extra-cond should be a function. This function is called in the parser loop, and the loop is exited if the function returns t. If the third argument quiet is non-nil, no `Parsing...' message will be displayed.

Function: sgml-reparse-buffer shortref-fun
Reparse the buffer and let shortref-fun take care of short references. shortref-fun is called with the entity as argument, with sgml-markup-start pointing to the start of the short reference and point pointing to the end.


3.5 Saved DTD Format

 
File =         Comment,
               File version,
               S-expression --dependencies--,
               Parameter entities,
               Document type name,
               Elements,
               General entities,
               S-expression --shortref maps--,
               S-expression --notations--

Elements =     Counted Sequence of S-expression --element type name--,
               Counted Sequence of Element type description

File version = "(sgml-saved-dtd-version 5)
"

Comment =      (";",
                (CASE
                  OF [0-9]
                  OF [11-255])*,
                [10] --end of line marker--)*

Element type description = S-expression --Misc info--,
               CASE
                OF [0-7] --Flags 1:stag-opt, 2:etag-opt, 4:mixed--,
                    Content specification,
                    Token list --includes--,
                    Token list --excludes--
                OF [128] --Flag undefined element--

Content specification = CASE
                OF [0] --cdata--
                OF [1] --rcdata--
                OF [2] --empty--
                OF [3] --any--
                OF [4] --undefined--
                OF [128] --model follows--,
                    Model --nodes in the finite state automaton--

Model =        Counted Sequence of Node

Node =         CASE
                OF Normal State
                OF And Node

Normal State = Moves --moves for optional tokens--,
               Moves --moves for required tokens--

Moves =        Counted Sequence of (Token,
                    OCTET --state #--)

And Node =     [255] --signals an AND node--,
               Number --next state (node number)--,
               Counted Sequence of Model --set of models--

Token =        Number --index in list of elements--

Number =       CASE
                OF [0-250] --Small number, 0-250--
                OF [251-255] --Big number, first octet--,
                    OCTET --Big number, second octet--

Token list =   Counted Sequence of Token

Parameter entities = S-expression --internal representation of parameter entities--

General entities = S-expression --internal representation of general entities--

Document type name = S-expression --name of document type as a string--

S-expression = OTHER

Counted Sequence = Number_a --length of sequence--,
               (ARG_1)^a
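
The grammar above can be read with a handful of small routines. The following Python sketch is not PSGML's reader: the function names are hypothetical, and the arithmetic for two-octet numbers is an assumption, since the grammar gives only the octet ranges and not how the two octets combine.

```python
import io

MAX_SINGLE_OCTET = 250  # small numbers 0-250 occupy one octet

# Content specification codes, as listed in the grammar.
CONTENT_CODES = {0: "cdata", 1: "rcdata", 2: "empty", 3: "any", 4: "undefined"}
MODEL_FOLLOWS = 128


def read_octet(stream):
    data = stream.read(1)
    if not data:
        raise EOFError("unexpected end of saved DTD data")
    return data[0]


def read_number(stream):
    """Read a Number: one octet for 0-250, two octets otherwise."""
    first = read_octet(stream)
    if first <= MAX_SINGLE_OCTET:
        return first
    second = read_octet(stream)
    # ASSUMPTION: this way of combining the two octets is not stated
    # in the grammar; it is one possible scheme, chosen for illustration.
    return (first - (MAX_SINGLE_OCTET + 1)) * 256 + second + MAX_SINGLE_OCTET


def read_counted_sequence(stream, read_item):
    """Read a length (a Number), then that many items."""
    length = read_number(stream)
    return [read_item(stream) for _ in range(length)]


def read_token_list(stream):
    # Token list = Counted Sequence of Token; a Token is a Number.
    return read_counted_sequence(stream, read_number)


def read_moves(stream):
    # Moves = Counted Sequence of (Token, OCTET --state #--)
    return read_counted_sequence(
        stream, lambda s: (read_number(s), read_octet(s)))
```

A token list would then be read with `read_token_list(io.BytesIO(data))`, where `data` holds the octets that follow the S-expressions in the file.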



This document was generated by XEmacs Webmaster on October, 2 2007 using texi2html