In this essay, I develop a little language (SCL)[1] in about 1 day.
Then I use the little language (SCL) to filter SVG files to help me build another essay.
This SCL was built in layers. The glue SCL layer reads a spec, then generates code that can be used in a lower layer. Both layers happen to use Ohm-JS. This example is very simple, hence it contains only 2 layers. Ideally, all SCLs should be this simple.
The code associated with this essay can be found at
https://github.com/guitarvydas/glue/tree/master
The various branches — dev0, dev1, dev2, and dev3 — show the project at different stages (described below).
The final test is in branch foreignFilter.
All branches have been collapsed into branch master.
The glue language and tool were developed in less than one day.
The goal of this SCL is to help me write PEG grammars and associated code.
I want to use parsers the way most people use REGEXPs.
Note that REGEXPs are not "type checked" in most languages and editors. Likewise, pattern matching in this SCL is not "type checked". The programmer is responsible for writing the code correctly.
Note that pattern matching is already a kind of type check (pattern matching is used in FP languages), so the lack of type-checking is not as big a problem here as it might be in general-purpose programming languages.
This is a fundamental principle of SCL design - YAGNI. Save development time by skipping hoary operations, like type-checking.
The goal is to create something that will generate useful code in less than a day of work (undercutting one of the principal reasons why DSLs are not used frequently).
"Type checking" will come later, if this tool gets used frequently.
I believe that it will be easier to type-check Glue programs than it would be to type-check REGEXPs.
I want to write a little language, an SCL, that lets me pattern match a text file, then rearrange it and output it in some other way.
For example, I have the problem that I'm writing essays that include diagrams. I use Draw.io to create my diagrams. Draw.io does not save in SVG format. One must ask Draw.io nicely to export the drawing in SVG format.
The exported SVG diagram contains a lot of noise. I just want to see pure SVG, without the noise.
The exported SVG file contains "<switch>…</switch>" clauses that contain "<foreignObject>…</foreignObject>". This stuff overwhelms the .SVG file and I can't see the stuff that I really want to see (the rects, the ellipses, the paths, etc.).
Can I write a filter to remove the noise?
That is my goal. Write a filter. Delete the noise.
Divide and conquer — on steroids — is recursive design.
Chop every problem up into two pieces.
Treat each piece separately.
If you don't know how to solve a piece, chop it up into two pieces.
Keep doing this until you know how to accomplish every piece.
As you go, new ideas will pop up. The new ideas can modify the problem at any level.
My divide & conquer for this (simple) problem went something like this:
The above is recursive design. Each step reduces the original problem until the termination case is encountered. Recursive break-down is used (recursively) on each branch of the problem — there are many termination cases.
[I think that the current crop of PLs — e.g. Python, JavaScript, etc. — show a distinct lack of divide-and-conquer mentality. IMO, everything should be a function call until the termination case is encountered in the recursive design. It should not be possible to use operators other than function calls except at the leaf levels. For example, "+" should not appear in any code except the very lowest-level code. Likewise, "cons" and array operations. Compilers can optimize-away function calls by making them into inline (macro) calls. AFAICT, most PLs allow unrestricted use of low-level operations (like "+", cons, arrays) at any level of the design.]
I wrote a first cut of the grammar in https://github.com/guitarvydas/glue (branch: dev0).
I got it running. That was a "major" hurdle, since it required me to understand how to use Ohm-JS, how to read a file in JS, etc., etc.
[The "hurdle" decreases every time I use Ohm-JS and JavaScript, but I didn't need to wait to go down the respective learning curves.]
I started to use Ohm-JS and JavaScript right away.
[Ohm-JS knows how to do its magic in HTML scripts. I've explored that possibility in https://computingsimplicity.neocities.org/blogs/OhmInSmallSteps.pdf.]
Then, I refined my ideas and re-cut the SVG grammar. Branch: dev3.
SVGSwitchAndForeign {
Svg = XMLHeader DOCTypeHeader SvgElement
XMLHeader = "<?" stuff* "?>"
DOCTypeHeader = "<!DOCTYPE" stuff* ">"
SvgElement = "<svg" attribute* ">" EmptyDefs Element+ "</svg>"
EmptyDefs = "<defs/>"
Element = ElementWithSwitch | ElementWithForeign | ElementWithElements | ElementWithoutElements
ElementWithSwitch = "<switch>" Element Element "</switch>"
ElementWithForeign = "<foreignObject" attribute* ">" Element "</foreignObject>"
ElementWithElements = "<" name stuff* ">" (Element+ | text*) "</" name ">"
ElementWithoutElements = "<" name stuff* "/>"
stuff = ~">" ~"/>" ~"<" ~"?>" any
text = stuff
attribute = stuff
name = name1st nameFollow*
name1st = "a" .. "z" | "A" .. "Z"
nameFollow = "0" .. "9" | name1st
}
Problem (2.1) is that of creating an identity grammar for SVG.
I break this problem down into 2 parts
Using Ohm-JS, I need to
I want a tool which makes it easy to pattern-match SVG and to re-arrange the matched bits.
[In Ohm-JS, you write a grammar to do the pattern-matching, and you write some JavaScript code to do the re-arranging. There is one JavaScript function for each rule in the grammar. The matches are passed in as function parameters.]
Roughly, I want a tool that does something like:
pattern matcher —> javascript code[3].
The pattern matcher portion is handled by Ohm-JS. It's called the grammar. The syntax is well-documented in https://github.com/harc/ohm.
The JavaScript re-arranging code is just a mess of JavaScript code. This is called the semantics in Ohm-JS documentation.
I've done this before. Writing the semantics code can be very repetitive and boring. In fact, all that I need is some way to tie grammar rules to JavaScript `…${v}…` strings.
I want:
.SVG —> Ohm-JS grammar —> string-language —> same .SVG
Actually, I would be happy if the string-language was simply the same as JavaScript `…${v}…` strings. So, I would settle for:
.SVG —> Ohm-JS grammar —> JavaScript `…` strings —> same .SVG
OK, so I want to write a grammar in Ohm-JS, then I want a mini-language that lets me rewrite pattern matches using JavaScript strings. I could have done this in raw JavaScript, but I didn't want to write details when I could automate[4].
Ohm-JS gives me each match (in some internal format)[5] as function parameters.
Now, what I want is:
.SVG —> Ohm-JS grammar —> JavaScript variables —> JavaScript `…` strings —> same as original .SVG
So, my requirements boil down to:
I need to write a tool in Ohm-JS to spit out code that could be used with Ohm-JS to eat and spit out .SVG files.
My tool syntax is something like:
grammarRuleName variables —> javascript
One more complication: some of the grammar rules match one thing, but some grammar rules match multiple things. The +/*/? operators in the grammar match multiple things.
In Ohm-JS, single matches are returned as JavaScript variables, and multiple matches are returned as JavaScript arrays (see the appendix). So, for this mini-language, I need to differentiate between the two kinds of things and generate different code for each kind of thing (single vs. multiple).
If the grammar has +/*/? in it, then we need more JavaScript code. If there is no +/*/?, then we still need JavaScript code, but less of it.
If we choose to use strings, then JavaScript has the .join('') operator, which makes handling of arrays of strings particularly easy.
One grammar rule can have both types of matches (singles and arrays).
So, for example, for the grammar rule:
R = A B+ C
we need to create a function like:
R = function (a, bArray, c) { return a.semcode () + bArray.semcode ().join ('') + c.semcode (); },
[There are more details, but I'm going to skip over them for now. See the final source code. Details kill.]
[Note that I like to leave spaces before parameter lists (it results in cleaner-looking code after you get used to it). I write "fn(a,b)" as "fn (a, b)". Compressing whitespace is so 1950's.]
My first cut at the SCL (mini language) was to imagine a language where statements like:
XMLHeader [1 2s 3] = $1 @2s $3
would generate JavaScript, like:
XMLHeader = function (p1, p2s, p3) { return p1.glue () + p2s.glue ().join ('') + p3.glue () };
I generated a prototype and made it run (branches "dev0" and "dev1").
[Why did I use brackets instead of parentheses for the parameter lists? To remind me that this isn't JavaScript — and — to remind me that I was trying to create an SCL that was declarative — i.e. I was allowed to put "operators" in the left-hand side as well as on the right-hand side.]
While tinkering with the details, I realized that I could reduce this language to something with statements like:
XMLHeader [x @y z] = abc${x}def${y}ghi${z}jkl
that would generate (JavaScript) code like:
XMLHeader = function (_x, _y, _z) {
var x = _x.glue ();
var y = _y.glue ().join ('');
var z = _z.glue ();
return `abc${x}def${y}ghi${z}jkl`
}
[In JavaScript, "_" is just a normal character. It is a convention to use "_" as a prefix for untouchable data (unexported).]
SemanticsSCL {
Semantics = SemanticsStatement+
SemanticsStatement = RuleName "[" Parameters "]" "=" Rewrites
RuleName = letter1 letterRest*
Parameters = Parameter+
Parameter = treeparameter | flatparameter
flatparameter = fpws | fpd
fpws = pname ws+
fpd = pname delimiter
treeparameter = "@" tflatparameter
tflatparameter = tfpws | tfpd
tfpws = pname ws+
tfpd = pname delimiter
pname = letterRest letterRest*
Rewrites = rwstring
letter1 = "_" | "a" .. "z" | "A" .. "Z"
letterRest = "0" .. "9" | letter1
ws = "\n" | " " | "\t" | ","
delimiter = &"]" | &"="
rwstring = stringchar*
stringchar = ~"\n" any
}
branch: dev3
I wrote code that is very repetitive, on purpose. For example, in SemanticsStatement I evaluated each match, although only 1, 3 and 6 are used.
I name each parameter _n. (Underscore is not special — it is just another character).
I name each local (temporary) variable as __n. (Two underscores and a digit).
To "walk the tree" (evaluate the CST by applying semantics functions), I needed to call the function _glue () on each match.
I chose to make every function return a string. I use JavaScript `…` strings to build the results.
In places where Ohm-JS returns an array, I also call the .join ('') function. For example, in RuleName, the second parameter is an array. I collapse it with the following code:
var __2s = _2s._glue ().join ('')
Use run.bash to run the GLUE language.
The final JavaScript code is:
// npm install ohm-js
function ohm_parse (grammar, text) {
var ohm = require ('ohm-js');
var parser = ohm.grammar (grammar);
var cst = parser.match (text);
if (cst.succeeded ()) {
return { parser: parser, cst: cst };
} else {
console.log (parser.trace (text).toString ());
throw "Ohm matching failed";
}
}
function getNamedFile (fname) {
var fs = require ('fs');
if (fname === undefined || fname === null || fname === "-") {
return fs.readFileSync (0, 'utf-8');
} else {
return fs.readFileSync (fname, 'utf-8');
}
}
var varNameStack = [];
function addSemantics (sem) {
sem.addOperation ('_glue', {
Semantics: function (_1s) {
var __1 = _1s._glue ().join ('');
return `sem.addOperation ('_glue', {${__1}});`;
},
SemanticsStatement: function (_1, _2, _3, _4, _5, _6) {
varNameStack = [];
var __1 = _1._glue ();
var __2 = _2._glue ();
var __3 = _3._glue ();
var __4 = _4._glue ();
var __5 = _5._glue ();
var __6 = _6._glue ();
return `
${__1} : function (${__3}) {
${varNameStack.join ('\n')}
return \`${__6}\`;
},
`;
},
RuleName: function (_1, _2s) { var __1 = _1._glue (); var __2s = _2s._glue ().join (''); return __1 + __2s; },
Parameters: function (_1s) { var __1s = _1s._glue ().join (','); return __1s; },
Parameter: function (_1) {
var __1 = _1._glue ();
return `${__1}`;
},
flatparameter: function (_1) {
var __1 = _1._glue ();
varNameStack.push (`var ${__1} = _${__1}._glue ();`);
return `_${__1}`;
},
fpws: function (_1, _2s) { var __1 = _1._glue (); var __2s = _2s._glue ().join (''); return __1; },
fpd: function (_1, _2) { var __1 = _1._glue (); var __2 = _2._glue (); return __1; },
treeparameter: function (_1, _2) {
var __1 = _1._glue ();
var __2 = _2._glue ();
varNameStack.push (`var ${__2} = _${__2}._glue ().join ('');`);
return `_${__2}`;
},
tflatparameter: function (_1) {
var __1 = _1._glue ();
return `${__1}`;
},
tfpws: function (_1, _2s) { var __1 = _1._glue (); var __2s = _2s._glue ().join (''); return __1; },
tfpd: function (_1, _2) { var __1 = _1._glue (); var __2 = _2._glue (); return __1; },
pname: function (_1, _2s) { var __1 = _1._glue (); var __2s = _2s._glue ().join (''); return __1 + __2s;},
Rewrites: function (_1) { var __1 = _1._glue (); return __1; },
letter1: function (_1) { var __1 = _1._glue (); return __1; },
letterRest: function (_1) { var __1 = _1._glue (); return __1; },
ws: function (_1) { var __1 = _1._glue (); return __1; },
delimiter: function (_1) { return ""; },
rwstring: function (_1s) { var __1s = _1s._glue ().join (''); return __1s; },
stringchar: function (_1) { var __1 = _1._glue (); return __1; },
_terminal: function () { return this.primitiveValue; }
});
}
function main () {
// usage: node glue <file
// reads grammar from "glue.ohm"
var text = getNamedFile ("-");
var grammar = getNamedFile ("glue.ohm");
var { parser, cst } = ohm_parse (grammar, text);
var sem = {};
var outputString = "";
if (cst.succeeded ()) {
sem = parser.createSemantics ();
addSemantics (sem);
outputString = sem (cst)._glue ();
}
return { cst: cst, semantics: sem, resultString: outputString };
}
var { cst, semantics, resultString } = main ();
console.log(resultString);
It is better to do something rather than just sitting around and thinking.
It is OK to throw intermediate results away.
Sometimes the intermediate results generate new ideas.
This is called brainstorming in songwriting and is such a reliable technique that several teachers teach you to do this before creating every song.
The brainstorming techniques in songwriting get you to think outside of the box and to fill-in the story with more detail.
In software development, brainstorming helped me make the glue SCL even more simple.
Tinkering with code produces results similar to "shower time". Menial tasks move the project forward while allowing time for deeper thought. Deeper thought, applied to bits of the working project, resulted in out-of-the-box thoughts that would not have occurred to me if I hadn't made the base levels work. Thinking works better when it has "something to latch onto".
I used the glue tool to remove <switch> and <foreignObject…> from a sample SVG file (generated by Draw.io).
My final spec is:
SVGSwitchAndForeign {
svg = xmlHeader docTypeHeader svgElement
xmlHeader = "<?" stuff* "?>" ws*
docTypeHeader = "<!DOCTYPE" stuff* ">" ws*
svgElement = "<svg" attribute* ">" ws* emptyDefs element+ "</svg>" ws*
emptyDefs = "<defs/>" ws*
element = (elementWithSwitch | elementWithForeign | elementWithelements | elementWithoutelements) ws*
elementWithSwitch = "<switch>" ws* element element "</switch>" ws*
elementWithForeign = "<foreignObject" attribute* ">" ws* element "</foreignObject>" ws*
elementWithelements = "<" name stuff* ">" ws* (element+ | text*) "</" name ">" ws*
elementWithoutelements = "<" name stuff* "/>"
stuff = ~">" ~"/>" ~"<" ~"?>" any
text = stuff
attribute = stuff
name = name1st nameFollow*
name1st = "a" .. "z" | "A" .. "Z"
nameFollow = "0" .. "9" | name1st
ws = " " | "\n" | "\t"
}
Svg [a b c] = ${a}${b}${c}
XMLHeader [a @b c] = ${a}${b}${c}
DOCTypeHeader [a @b c] = ${a}${b}${c}
SvgElement [a @b c d @e f] = ${a}${b}${c}${d}${e}${f}
EmptyDefs [a] = ${a}
Element [a] = ${a}
ElementWithSwitch [a b c d] = ${a}${b}${c}${d}
ElementWithForeign [a @b c d e] = ${a}${b}${c}${d}${e}
ElementWithElements [a b @c d @e f g h] = ${a}${b}${c}${d}${e}${f}${g}${h}
ElementWithoutElements [a b @c d] = ${a}${b}${c}${d}
stuff [a] = ${a}
text [a] = ${a}
attribute [a] = ${a}
name [a @b] = ${a}${b}
name1st [a] = ${a}
nameFollow [a] = ${a}
Which is a lot less code[6] than what is written in raw JavaScript.
This code chops up the problem, divide and conquer style, into two obvious parts:
Each part does one thing only — the first part does pattern matching, the second part does rearranging. Each part is described by its own SCL (DSL). Pattern matching is best described as a grammar, while rearranging is best described as JavaScript `…` syntax.
[I don't try to force-fit everything into one paradigm. Pattern matchers don't make for good code rearrangers, JavaScript strings don't make for good pattern matchers. General purpose languages don't make for good anything. Except details. Details kill.]
The glue code that corresponds to the SVG grammar is:
Svg [a b c] = ${a}${b}${c}
XMLHeader [a @b c] = ${a}${b}${c}
DOCTypeHeader [a @b c] = ${a}${b}${c}
SvgElement [a @b c d @e f] = ${a}${b}${c}${d}${e}${f}
EmptyDefs [a] = ${a}
Element [a] = ${a}
ElementWithSwitch [a b c d] = ${a}${b}${c}${d}
ElementWithForeign [a @b c d e] = ${a}${b}${c}${d}${e}
ElementWithElements [a b @c d @e f g h] = ${a}${b}${c}${d}${e}${f}${g}${h}
ElementWithoutElements [a b @c d] = ${a}${b}${c}${d}
stuff [a] = ${a}
text [a] = ${a}
attribute [a] = ${a}
name [a @b] = ${a}${b}
name1st [a] = ${a}
nameFollow [a] = ${a}
This says:
Svg is a grammar rule.
Svg = XMLHeader DOCTypeHeader SvgElement
When the Svg grammar rule is matched, the matches are provided (as CSTs) in parameters a, b, and c. Combine the three parameters using JavaScript `…` string syntax and return that string result.
XMLHeader is another grammar rule. The grammar rule is
XMLHeader = "<?" stuff* "?>"
In this case, the grammar matches 3 items ("<?", stuff* and "?>"). The second item, though, has a zero-or-more operator (*), which means that the grammar returns an array (for zero items, the array has length 0). The fact that the second item — b — is non-scalar (an array) is denoted by writing @b on the left-hand side of the GLUE statement. The right-hand side uses simple JavaScript `…` notation where the tool has collapsed the second item into the final variable called b.
The programmer is responsible for writing the LHS's correctly.
There is no "type checking". This tool language is more like an editor operation than a DSL. Comparison: REGEXPs are not type-checked (yet) in languages that use them.
In Ohm-JS, each grammar rule returns <something> after it is finished.
If the rule is something like:
R = A B C
then the grammar rule called "R" returns a single thing — a combination of the return values from A and B and C. In this case A maps to a JS variable and B maps to a JS variable and C maps to a JS variable. Each variable contains one <thing>[7].
But, if the rule is something like:
R = A B* C
then the B maps to an array of <something>s. This is easy to handle in JavaScript, but you — the programmer — need to know when to expect a single thing or an array of things.
A side-note on how notation affects thinking… In ESRAP, B* returns a list (a tree).
It took me a while to reconcile what I expected (coming from Lisp to JS). JS wants you to express details in arrays, whereas Lisp makes it easy to think in terms of trees (aka lists).
Ohm-JS could have returned JS objects, but it returned arrays instead. The creator(s) of Ohm-JS were influenced by JS to use arrays instead of objects.
The creator of ESRAP was influenced by Lisp to return lists.[8]
The difference is made more clear in something like
(A* B* C*)
where ESRAP returns one list and Ohm-JS returns three arrays.
[Note that ESRAP rewrites this as (and (* A) (* B) (* C)) which is less clear, if you are thinking in terms of pattern matching. This is yet another orthogonal conversation — see my essay https://guitarvydas.github.io/2021/03/16/Triples.html]
[1] SCL means Solution Specific Language. Like the original idea behind DSLs.
[2] I've written parsers many times before. Each time I learned something new. This time, I can apply a subset of what I learned, with confidence.
[3] The JavaScript code that hangs off of the grammar is called "semantics" in Ohm-JS. This term comes from compiler technology, but you don't really need to know about this stuff to simply use it.
[4] See my essay "Details Kill"
[5] The format is a CST - a concrete syntax tree. CSTs are often conflated with ASTs, but there is nothing "abstract" about CSTs. ASTs define the universe of possibilities, but CSTs represent the actual incoming code.
[6] and less detail - details kill
[7] A CST, to be exact. A CST is represented as an Object with certain format (see Ohm-JS source code for exact details).
[8] One could argue that arrays are just optimized lists, but that's beside the point.