Introduction

I show a simple grammar for SVG files.


A grammar is a "pattern matcher".


PEG is as easy to use as REGEXP.


PEG is more powerful (it allows nesting, whereas raw REGEXP does not).


github

https://github.com/guitarvydas/glue


repo: glue

branch: dev0


Grammar

SVGSwitchAndForeign {

  Svg = XMLHeader DOCTypeHeader SvgElement

  XMLHeader = "<?" stuff* "?>" 

  DOCTypeHeader = "<!DOCTYPE" stuff* ">"  

  SvgElement = "<svg" attribute* ">" EmptyDefs Element+ "</svg>"

  EmptyDefs = "<defs/>" 

  Element = ElementWithSwitch | ElementWithForeign | ElementWithElements | ElementWithoutElements

  ElementWithSwitch =  "<switch>" Element Element "</switch>"

  ElementWithForeign = "<foreignObject" attribute* ">" Element "</foreignObject>"

  ElementWithElements = "<" name stuff* ">" (Element+ | text*) "</" name ">"

  ElementWithoutElements = "<" name stuff* "/>"

  stuff = ~">" ~"/>" ~"<" ~"?>" any

  text = stuff

  attribute = stuff

  name = name1st nameFollow*

  name1st = "a" .. "z" | "A" .. "Z"

  nameFollow = "0" .. "9" | name1st

}



Grammar Syntax

I am using Ohm-js.  It is a PEG grammar processor.


In Ohm, you need to surround the grammar with a name, and brace brackets.


I've named this grammar "SVGSwitchAndForeign".  I intend to use this grammar for a larger project (but not very large) and this will require breaking out switches and foreignObjects.


The grammar is compose of rules.  


Rules are free-form (spaces don't matter).


There happen to be 16 rules in this grammar.


Ohm special-cases rules that have names beginning with capital letters.


The first rule is   


Svg = XMLHeader DOCTypeHeader SvgElement


This rule is called "Svg" and is composed of calls to 3 other rules XMLHeader, DOCTypeHeader and SvgElement.


Items in double-quotes are constants and are parsed literally, for example "<svg" parses 4 letters "<", "s", "v" and "g".


Rules and literals can be suffixed with syntax similar to REGEXP, i.e.


Alternation is written with "|" (in the original PEG thesis, "/" was used).


Negative matching is specified as "~" (whereas in the original PEG thesis "!" was used).


Any single character is specified with the keyword "any".  In REGEXP, this is "." and is also "." in many other PEG libraries.


Character classes are formed with two constants separated by ellipsis, i.e. "A" .. "Z" means capital-A through capital-Z.  This is often written as [A-Z] in REGEXP syntax (and some other PEG libraries).


Ohm-js separates match-variables from the grammar.  In Ohm-js, each match has a corresponding variable, i.e. the rule


Svg = XMLHeader DOCTypeHeader SvgElement


needs a corresponding Javascript function declared as:


svg = function (xmlHeader, docTypeHeader, svgElement) { … }


(where the parameter names are arbitrary — I tend to use _1, _2, and _3 as parameter names).


In other PEG parsers, matches are labelled explicitly, e.g.


Svg = xmlHeader:XMLHeader docTypeHeader:DOCTypeHeader svgElement:SvgElement


(I favour Ohm-js' choice - it leaves the grammar unadorned and more readable (IMO).  This is helpful when one is — as I am — concerned with DI (Design Intent, aka Architecture.  I intend to use PEG as a way to design SCLs (Solution Centric Languages - DSLs, but tighter) — clarity matters to me).


In Ohm-js, if a rule begins with a capital letter, the rule skips all whitespace.  The builtin rule space is used.  One can augment the space rule (see the Ohm documentation).[1]


Rules can be in any order, except that Ohm-js takes the first rule as the main rule and starts[2] with it.   In other PEG parsing libraries, the main rule is specified in the call to the parser.  


Lookahead matching is specified with "&", e.g. "&rule".  Lookahead succeeds only if the match succeeds, but lookahead does not consume input characters.

Multiple Matches

Ohm-js returns a Javascript array of matches when ?/*/+ are used.


The array has 0 length if there were 0 matches.


A simple call to a rule, e.g.


rule+


returns an array of matches as specified by the rule rule.


In parenthesized matches, e.g.


(rule1 rule2)*


An array is returned for each sub-rule.  For example, the above would return two arrays — one for rule1 and the other for rule2.


Constant matches follow the above convention, for example:


("abc" "def")*


returns two arrays (the first with a bunch of matches for "abc" and the second for "def").

Negative Lookahead Match

Negative lookahead is specified by "~", for example "~rule".


Example: to match any character that is not an @, we would write:


~"@" any


Negative lookahead does not consume input, but, any consumes one character.  The end result of this pattern is that we consume one character if it is not @.


The ~rule parse tree does not appear in the CST.

Lookahead Match

Lookahead matching is specified with "&", e.g. "&rule".  Lookahead succeeds only if the match succeeds, but lookahead does not consume input characters.


Example: to match any character that is preceded by an @, we would write:


&"@" any


(the any keyword consumes one character, but &"@" consumes no characters).


The &rule creates a parse tree, although it consumes none of the characters.  The &rule tree appears in the CST.

Rule Order

PEG — and Ohm-js — matches rules in order of appearance.


For example, the two similar rules:


Rule = A B C


Rule = B C A


might match differently depending on the input.  


PEG creates parsers.  If you desire theoretical detail, see LR(1) theory and YACC.  In fact, some people have been known to "test" their grammar by running YACC on it, then building a TDPL[3] parser.

Code Generation

The Ohm-js parser generator inputs a grammar and produces a tree — CST (Concrete Syntax Tree — often conflated with AST).  


After a successful parse, the tree can be further annotated with Javascript code.  This code is called the "semantics".

 

Arity Checking

The Ohm-js parser generator checks that the semantics object contains a set of functions that correspond to the grammar.  See the Ohm documentation for further detail and exceptions.


Ohm-js checks the number of parameters — called the arity — of each javascript function to ensure that each function matches the grammar.

Grammar Reading (Detailed Description)


Below, I explain every line of the grammar:


  Svg = XMLHeader DOCTypeHeader SvgElement


Creates the first rule called "Svg".  It consists of calls to 3 other rules.


  XMLHeader = "<?" stuff* "?>" 


Creates the rule called "XMLHeader".  It matches "<" then "?" then calls the rule stuff for zero-or-more matches, then matches "?" then ">".  


  DOCTypeHeader = "<!DOCTYPE" stuff* ">"  


Creates the rule called "DOCTypeHeader".  It matches the string "<!DOCTYPE", followed by a call to the rule stuff (for 0-or-more matches) then matches the string ">".


  SvgElement = "<svg" attribute* ">" EmptyDefs Element+ "</svg>"


Creates the rule called "SvgElement".  This rule has 6 sub-matches ("<svg", attribute*, ">", EmptyDefs, Element+, "</svg>").  Matches #2 and #5 return arrays of matches (attribute* and Element+, resp.)


  EmptyDefs = "<defs/>" 


Creates the rule called "EmptyDefs".  It matches one string "<defs/>".  We might wish to expand this rule in future projects, but I believe in the YAGNI[4] principle and don't need more than this for my current project.


  Element = ElementWithSwitch | ElementWithForeign | ElementWithElements | ElementWithoutElements


Creates the rule called "Element".  This is an alternation rule that returns one result.  It has 4 possibilities, but matches only one of the possibilities.


  ElementWithSwitch =  "<switch>" Element Element "</switch>"


Creates the rule called "ElementWithSwitch" that has 4 sub-matches.


  ElementWithForeign = "<foreignObject" attribute* ">" Element "</foreignObject>"


Creates the rule called "ElementWithForeign" that has 5 sub-matches.  Match #2 returns an array.


  ElementWithElements = "<" name stuff* ">" (Element+ | text*) "</" name ">"


Creates the rule called "ElementWithElements" that has 8 sub-matches ("<", name, stuff*, ">", (…), "</", name, ">").


  ElementWithoutElements = "<" name stuff* "/>"


Creates the rule called "ElementWithoutElements" that has 4 sub-matches.


  stuff = ~">" ~"/>" ~"<" ~"?>" any


Creates the first rule called "stuff".  The rule begins with a lower-case letter ("stuff" not "Stuff") which tells Ohm to read every character instead of skipping over spaces.  This rule has 4 negative matches and consumes only one character.  It matches any character unless the character is ">" or "<".  It also checks for negative matches for the two strings "/>" and ">?".  The built-in rule any consumes one character, if all of the negative matches succeed.  The rule, basically, says "if the next character is not '>' and if the next two characters are not '/>' and if the next character is not '<' and if the next two characters are not '?>', then consume one character.


  text = stuff


Creates the first rule called "text" (lower-case, space-skipping disabled).  This rule calls one other rule stuff and then returns the result.  [This is a useful trick — rules that do nothing but call another rule allow the semantics code to see what context the match occurred in.  For example, "stuff" is a generic rule, but "stuff" where text is expected might generate different code.] 


  attribute = stuff


Creates the first rule called "attribute".  Again, we see the use of the pattern matching trick of  calling a more generic rule in a specific context.  In this context, the parser is matching SVG elements.[5]


  name = name1st nameFollow*


Creates the rule called "name".  It consumes a first character followed by zero-or-more follow characters.  In most PLs, a name must begin with a letter, but the rest of the name might contain digits.


  name1st = "a" .. "z" | "A" .. "Z"


Creates the rule called "name1st".  It matches one character, either a lower-case letter (between "a" and "z") or an upper-case letter.


  nameFollow = "0" .. "9" | name1st


Creates the rule called "nameFollow".  It says that following characters in a name can contain digits, or any character allowed at the front of a name.  [As I'm writing this, I see that I probably wanted to include underscores.  I can fix this later — getting something to work is the most important activity at this early-stage of this project — divide and conquer.] 


[1] https://github.com/harc/ohm

[2] See the Ohm documentation for how to change this assumption.

[3] TDPL means Top Down Parsing Language - recursive descent parsing.

[4] YAGNI means "You aren't going to need it".

[5] Again, due to the use of YAGNI, this rule doesn't perform a very thorough match of attributes.  I named this rule to leave room for future detailing.  This kind of detail is not needed in this project (hence, I don't waste human intellect on getting it "right").  PEG parsers let me cut such corners.  The result is the ability to build parsers fairly quickly without having to dot all of the i's and cross all of the t's.  The grammar is not wrong, but it skips over uninteresting stuff.