Little Language

Introduction

In this essay, I develop a little language (SCL)^[1] in about 1 day.

Then I use the little language (SCL) to filter SVG files to help me build another essay.

This SCL was built in layers. The glue SCL layer reads a spec, then generates code that can be used in a lower layer. Both layers happen to use Ohm-JS. This example is very simple, hence it contains only 2 layers. Ideally, all SCLs should be this simple.

Github

The code associated with this essay can be found at

https://github.com/guitarvydas/glue/tree/master

The various branches — dev0, dev1, dev2, and dev3 — show the project at different stages (described below).

The final test is in branch foreignFilter.

All branches have been collapsed into branch master.

Quick

The glue language and tool were developed in less than one day.

Goal

The goal of this SCL is to help me write PEG grammars and associated code.

I want to use parsers the way most people use REGEXPs.

Note that REGEXPs are not "type checked" in most languages and editors. Likewise, pattern matching in this SCL is not "type checked". The programmer is responsible for writing the code correctly.

Note that pattern matching is already a kind of type check (pattern matching is used in FP languages), so the lack of type-checking is not as big a problem here as it might be in general-purpose programming languages.

This is a fundamental principle of SCL design - YAGNI. Save development time by skipping hoary operations, like type-checking.

The goal is to create something that will generate useful code in less than a day of work (undercutting one of the principal reasons why DSLs are not used frequently).

"Type checking" will come later, if this tool gets used frequently.

I believe that it will be easier to type-check Glue programs than it would be to type-check REGEXPs.

Emitter

I want to write a little language (an SCL) that lets me pattern match a text file, then rearrange it and output it in some other way.

For example, I have the problem that I'm writing essays that include diagrams. I use Draw.io to create my diagrams. Draw.io does not save in SVG format. One must ask Draw.io nicely to export the drawing in SVG format.

The exported SVG diagram contains a lot of noise. I just want to see pure SVG, without the noise.

The exported SVG file contains "<switch>…</switch>" clauses that contain "<foreignObject>…</foreignObject>". This stuff overwhelms the .SVG file and I can't see the stuff that I really want to see (the rects, the ellipses, the paths, etc.).

Can I write a filter to remove the noise?

That is my goal. Write a filter. Delete the noise.

Divide and Conquer

Divide and conquer — on steroids — is recursive design.

Chop every problem up into two pieces.

Treat each piece separately.

If you don't know how to solve a piece, chop it up into two pieces.

Keep doing this until you know how to accomplish every piece.

As you go, new ideas will pop up. The new ideas can modify the problem at any level.

My divide & conquer for this (simple) problem went something like this:

0 main problem / main goal - remove <switch> and <foreignObject> elements from the Drawio generated SVG files
This can't be done in REGEXP, so I decided to use something more powerful — a parser — which allowed me to use nesting^[2]
- 1 SVG grammar — I know how to do this (PEG grammar in Ohm-JS)
- 2 other stuff - I need to write the JavaScript to accompany the grammar
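
To make the nesting point concrete, here is a minimal sketch in plain JavaScript (no Ohm-JS). The sample string and the helper name matchBalanced are made up for illustration. It shows a non-greedy REGEXP pairing an open tag with the wrong close tag, and a depth counter (roughly what a recursive grammar rule does for free) pairing them correctly.

```javascript
// Hypothetical sample: one <g> element nested inside another.
const nested = '<g><g><rect/></g><switch/></g>';

// A non-greedy regex pairs the first "<g>" with the FIRST "</g>" it sees,
// so the captured body is cut off in the middle of the inner element.
const regexBody = nested.match (/<g>(.*?)<\/g>/)[1];

// Counting depth finds the close tag that really belongs to the first
// open tag. Assumes s begins with the open tag.
function matchBalanced (s, open, close) {
  let depth = 0;
  let i = 0;
  while (i < s.length) {
    if (s.startsWith (open, i)) {
      depth += 1;
      i += open.length;
    } else if (s.startsWith (close, i)) {
      depth -= 1;
      i += close.length;
      if (depth === 0) { return s.slice (0, i); }
    } else {
      i += 1;
    }
  }
  return null; // unbalanced input
}

const balanced = matchBalanced (nested, '<g>', '</g>');
console.log (regexBody);
console.log (balanced);
```

The regex capture stops at the inner close tag, while the depth counter returns the whole outer element.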

The above is recursive design. Each step reduces the original problem until the termination case is encountered. Recursive break-down is used (recursively) on each branch of the problem — there are many termination cases.

[I think that the current crop of PLs — e.g. Python, JavaScript, etc. — show a distinct lack of divide-and-conquer mentality. IMO, everything should be a function call until the termination case is encountered in the recursive design. It should not be possible to use operators other than function calls except at the leaf levels. For example, "+" should not appear in any code except the very lowest-level code. Likewise, "cons" and array operations. Compilers can optimize-away function calls by making them into inline (macro) calls. AFAICT, most PLs allow unrestricted use of low-level operations (like "+", cons, arrays) at any level of the design.]

The SVG Grammar

I wrote a first cut of the grammar in https://github.com/guitarvydas/glue (branch: dev0).

I got it running. That was a "major" hurdle, since it required me to understand how to use Ohm-JS, how to read a file in JS, etc., etc.

[The "hurdle" decreases every time I use Ohm-JS and JavaScript, but I didn't need to wait to go down the respective learning curves.]

I started to use Ohm-JS and JavaScript right away.

[Ohm-JS knows how to do its magic in HTML scripts. I've explored that possibility in https://computingsimplicity.neocities.org/blogs/OhmInSmallSteps.pdf.]

Then, I refined my ideas and re-cut the SVG grammar. Branch: dev3.

The Final SVG Grammar

SVGSwitchAndForeign {

Svg = XMLHeader DOCTypeHeader SvgElement

XMLHeader = "<?" stuff* "?>"

DOCTypeHeader = "<!DOCTYPE" stuff* ">"

SvgElement = "<svg" attribute* ">" EmptyDefs Element+ "</svg>"

EmptyDefs = "<defs/>"

Element = ElementWithSwitch | ElementWithForeign | ElementWithElements | ElementWithoutElements

ElementWithSwitch = "<switch>" Element Element "</switch>"

ElementWithForeign = "<foreignObject" attribute* ">" Element "</foreignObject>"

ElementWithElements = "<" name stuff* ">" (Element+ | text*) "</" name ">"

ElementWithoutElements = "<" name stuff* "/>"

stuff = ~">" ~"/>" ~"<" ~"?>" any

text = stuff

attribute = stuff

name = name1st nameFollow*

name1st = "a" .. "z" | "A" .. "Z"

nameFollow = "0" .. "9" | name1st

}

The glue Tool

Problem (2.1) is that of creating an identity grammar for SVG.

I break this problem down into 2 parts

match SVG and leave hooks
rearrange the matches.

Using Ohm-JS, I need to

write a grammar (for SVG)
write so-called "semantics" code to do the rearranging.

I want a tool which makes it easy to pattern-match SVG and to re-arrange the matched bits.

[In Ohm-JS, you write a grammar to do the pattern-matching, and you write some JavaScript code to do the re-arranging. There is one JavaScript function for each rule in the grammar. The matches are passed in as function parameters.]

Roughly, I want a tool that does something like:

pattern matcher —> javascript code ^[3].

The pattern matcher portion is handled by Ohm-JS. It's called the grammar. The syntax is well-documented in https://github.com/harc/ohm.

The javascript re-arranging code is just a mess of JavaScript code. This is called the semantics in Ohm-JS documentation.

I've done this before. Writing the semantics code can be very repetitive and boring. In fact, all that I need is some way to tie grammar rules to JavaScript `…${v}…` strings.

I want:

.SVG —> Ohm-JS grammar —> string-language —> same .SVG

Actually, I would be happy if the string-language was simply the same as JavaScript `…${v}…` strings. So, I would settle for:

.SVG —> Ohm-JS grammar —> JavaScript `…` strings —> same .SVG

OK, so I want to write a grammar in Ohm-JS, then I want a mini-language that lets me rewrite pattern matches using JavaScript strings. I could have done this in raw JavaScript, but I didn't want to write details when I could automate ^[4].

Ohm-JS gives me each match (in some internal format)^[5] as function parameters.

Now, what I want is:

.SVG —> Ohm-JS grammar —> JavaScript variables —> JavaScript `…` strings —> same as original .SVG

So, my requirements boil down to:

use Ohm-JS to write a grammar for SVG (YAGNI: I don't need to handle all of SVG, just enough for my current problem. I can get away with 95% of SVG. The last 5% is usually a killer, so avoid it.)
use another tool to build my JavaScript re-arranger code
run Ohm-JS+JS-rearranger-code to input a .SVG and spit it back out unchanged (but leaving me hooks for later).

I need to write a tool in Ohm-JS to spit out code that could be used with Ohm-JS to eat and spit out .SVG files.

My tool syntax is something like:

grammarRuleName variables —> javascript

One more complication: some of the grammar rules match one thing, but some grammar rules match multiple-things. The +/*/? operators in the grammar match multiple things.

In Ohm-JS, single matches are returned as JavaScript variables, and multiple matches are returned as JavaScript arrays (see the appendix). So, for this mini-language, I need to differentiate between the two kinds of things and generate different code for each kind of thing (single vs. multiple).

If the grammar has +/*/? in it, then we need more JavaScript code. If there is no +/*/?, then we still need JavaScript code, but less of it.

If we choose to use strings, then JavaScript has the .join('') operator, which makes handling of arrays of strings particularly easy.

One grammar rule can have both types of matches (singles and arrays).

So, for example, for the grammar rule:

R = A B+ C

we need to create a function like:

R = function (a, bArray, c) { return a.semcode () + bArray.semcode (). join('') + c.semcode (); },

[There are more details, but I'm going to skip over them for now. See the final source code. Details kill.].
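
The single-versus-array distinction can be tried out without Ohm-JS. The sketch below uses stand-in objects instead of real CST nodes: single and many are hypothetical helpers, not part of the Ohm-JS API, and they only mimic the behaviour described above (a single match answers with a string, an iteration answers with an array).

```javascript
// Stand-ins for Ohm-JS CST nodes (hypothetical helpers, NOT the Ohm-JS API).
// A single match answers semcode () with a string;
// an iteration (+, *, ?) answers with an array of strings.
function single (text) {
  return { semcode: function () { return text; } };
}
function many (texts) {
  return { semcode: function () { return texts.slice (); } };
}

// The function for the rule  R = A B+ C
const R = function (a, bArray, c) {
  return a.semcode () + bArray.semcode ().join ('') + c.semcode ();
};

const result = R (single ('<'), many (['x', 'y', 'z']), single ('>'));
console.log (result);
```

Because .join ('') on an empty array gives the empty string, the same function body also works unchanged for a zero-length match of B*.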

[Note that I like to leave spaces before parameter lists (it results in cleaner-looking code after you get used to it). I write "fn(a,b)" as "fn (a, b)". Compressing whitespace is so 1950's.]

My first cut at the SCL (mini language) was to imagine a language where statements like:

XMLHeader [1 2s 3] = $1 @2s $3

would generate JavaScript, like:

XMLHeader = function (p1, p2s, p3) { return p1.glue () + p2s.glue ().join('') + p3.glue () };

I generated a prototype and made it run (branches "dev0" and "dev1").

[Why did I use brackets instead of parentheses for the parameter lists? To remind me that this isn't JavaScript — and — to remind me that I was trying to create an SCL that was declarative — i.e. I was allowed to put "operators" in the left-hand side as well as on the right-hand side.]

While tinkering with the details, I realized that I could reduce this language to something with statements like:

XMLHeader [x @y z] = abc${x}def${y}ghi${z}jkl

that would generate (JavaScript) code like:

XMLHeader = function (_x, _y, _z) {

var x = _x.glue ();

var y = _y.glue ().join('');

var z = _z.glue ();

return `abc${x}def${y}ghi${z}jkl`

}

[In JavaScript, "_" is just a normal character. It is a convention to use "_" as a prefix for untouchable data (unexported).]
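
The translation from one glue statement to one JavaScript function is mechanical enough to sketch in a few lines. The version below is only an illustration: it uses a REGEXP and string splitting instead of an Ohm-JS grammar, the name glueToJs is hypothetical, and it handles a single one-line statement. It calls glue () before .join (''), since glue is a function.

```javascript
// Sketch only: a regex-based stand-in for the Ohm-JS version of the tool.
// glueToJs is a hypothetical name; it handles one statement per call.
function glueToJs (statement) {
  const m = statement.match (/^\s*(\w+)\s*\[([^\]]*)\]\s*=\s?(.*)$/);
  if (m === null) { throw new Error ('not a glue statement: ' + statement); }
  const ruleName = m[1];
  const params = m[2].split (/[\s,]+/).filter (function (p) { return p !== ''; });
  const rewrite = m[3];

  const formals = []; // _x, _y, ... (the raw matches)
  const locals = [];  // var x = ...; (the evaluated strings)
  params.forEach (function (p) {
    const isArray = p.startsWith ('@');
    const name = isArray ? p.slice (1) : p;
    formals.push ('_' + name);
    const join = isArray ? ".join ('')" : '';
    locals.push ('var ' + name + ' = _' + name + '.glue ()' + join + ';');
  });

  return ruleName + ' = function (' + formals.join (', ') + ') {\n' +
    locals.join ('\n') + '\n' +
    'return `' + rewrite + '`\n' +
    '}';
}

const generated = glueToJs ('XMLHeader [x @y z] = abc${x}def${y}ghi${z}jkl');
console.log (generated);
```

The "@" prefix is the only thing that changes the generated code: it adds the .join ('') call for that parameter.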

The Final glue Tool Grammar

SemanticsSCL {

Semantics = SemanticsStatement+

SemanticsStatement = RuleName "[" Parameters "]" "=" Rewrites

RuleName = letter1 letterRest*

Parameters = Parameter+

Parameter = treeparameter | flatparameter

flatparameter = fpws | fpd

fpws = pname ws+

fpd = pname delimiter

treeparameter = "@" tflatparameter

tflatparameter = tfpws | tfpd

tfpws = pname ws+

tfpd = pname delimiter

pname = letterRest letterRest*

Rewrites = rwstring

letter1 = "_" | "a" .. "z" | "A" .. "Z"

letterRest = "0" .. "9" | letter1

ws = "\n" | " " | "\t" | ","

delimiter = &"]" | &"="

rwstring = stringchar*

stringchar = ~"\n" any

}

Emitter

branch: dev3

I wrote code that is very repetitive, on purpose. For example, in SemanticsStatement I evaluated each match, although only 1, 3 and 6 are used.

I name each parameter _n. (Underscore is not special — it is just another character).

I name each local (temporary) variable as __n. (Two underscores and a digit).

To "walk the tree" (evaluate the CST by applying semantics functions), I needed to call the function _glue () on each match.

I chose to make every function return a string. I use JavaScript `…` strings to build the results.

In places where Ohm-JS returns an array, I also call the .join ('') function. For example, in RuleName, the second parameter is an array. I collapse it with the following code:

var __2s = _2s._glue ().join ('')

Use run.bash to run the GLUE language.

The final JavaScript code is:

// npm install ohm-js

function ohm_parse (grammar, text) {

var ohm = require ('ohm-js');

var parser = ohm.grammar (grammar);

var cst = parser.match (text);

if (cst.succeeded ()) {

return { parser: parser, cst: cst };

} else {

console.log (parser.trace (text).toString ());

throw "Ohm matching failed";

}

}

function getNamedFile (fname) {

var fs = require ('fs');

if (fname === undefined || fname === null || fname === "-") {

return fs.readFileSync (0, 'utf-8');

} else {

return fs.readFileSync (fname, 'utf-8');

}

}

var varNameStack = [];

function addSemantics (sem) {

sem.addOperation ('_glue', {

Semantics: function (_1s) {

var __1 = _1s._glue ().join ('');

return `sem.addOperation ('_glue', {${__1}});`;

},

SemanticsStatement: function (_1, _2, _3, _4, _5, _6) {

// evaluating _3 (the parameters) fills varNameStack as a side effect

varNameStack = [];

var __1 = _1._glue ();

var __2 = _2._glue ();

var __3 = _3._glue ();

var __4 = _4._glue ();

var __5 = _5._glue ();

var __6 = _6._glue ();

return `

${__1} : function (${__3}) {

${varNameStack.join ('\n')}

return \`${__6}\`;

},

`;

},

RuleName: function (_1, _2s) { var __1 = _1._glue (); var __2s = _2s._glue ().join (''); return __1 + __2s; },

Parameters: function (_1s) { var __1s = _1s._glue ().join (','); return __1s; },

Parameter: function (_1) {

var __1 = _1._glue ();

return `${__1}`;

},

flatparameter: function (_1) {

var __1 = _1._glue ();

varNameStack.push (`var ${__1} = _${__1}._glue ();`);

return `_${__1}`;

},

fpws: function (_1, _2s) { var __1 = _1._glue (); var __2s = _2s._glue ().join (''); return __1; },

fpd: function (_1, _2) { var __1 = _1._glue (); var __2 = _2._glue (); return __1; },

treeparameter: function (_1, _2) {

var __1 = _1._glue ();

var __2 = _2._glue ();

varNameStack.push (`var ${__2} = _${__2}._glue ().join ('');`);

return `_${__2}`;

},

tflatparameter: function (_1) {

var __1 = _1._glue ();

return `${__1}`;

},

tfpws: function (_1, _2s) { var __1 = _1._glue (); var __2s = _2s._glue ().join (''); return __1; },

tfpd: function (_1, _2) { var __1 = _1._glue (); var __2 = _2._glue (); return __1; },

pname: function (_1, _2s) { var __1 = _1._glue (); var __2s = _2s._glue ().join (''); return __1 + __2s;},

Rewrites: function (_1) { var __1 = _1._glue (); return __1; },

letter1: function (_1) { var __1 = _1._glue (); return __1; },

letterRest: function (_1) { var __1 = _1._glue (); return __1; },

ws: function (_1) { var __1 = _1._glue (); return __1; },

delimiter: function (_1) { return ""; },

rwstring: function (_1s) { var __1s = _1s._glue ().join (''); return __1s; },

stringchar: function (_1) { var __1 = _1._glue (); return __1; },

_terminal: function () { return this.primitiveValue; }

});

}

function main () {

// usage: node glue <file

// reads grammar from "glue.ohm"

var text = getNamedFile ("-");

var grammar = getNamedFile ("glue.ohm");

var { parser, cst } = ohm_parse (grammar, text);

var sem = {};

var outputString = "";

if (cst.succeeded ()) {

sem = parser.createSemantics ();

addSemantics (sem);

outputString = sem (cst)._glue ();

}

return { cst: cst, semantics: sem, resultString: outputString };

}

var { cst, semantics, resultString } = main ();

console.log(resultString);

Brainstorming

It is better to do something rather than just sitting around and thinking.

It is OK to throw intermediate results away.

Sometimes the intermediate results generate new ideas.

This is called brainstorming in songwriting and is such a reliable technique that several teachers teach you to do this before creating every song.

The brainstorming techniques in songwriting get you to think outside of the box and to fill in the story with more detail.

In software development, brainstorming helped me make the glue SCL even simpler.

Tinkering with code produces results similar to "shower time". Menial tasks move the project forward while allowing time for deeper thought. Deeper thought, applied to bits of the working project, resulted in out-of-the-box thoughts that would not have occurred to me if I hadn't made the base levels work. Thinking works better when it has "something to latch onto".

The Test

Test Use Case

I used the glue tool to remove <switch> and <foreignObject…> from a sample SVG file (generated by Drawio).

I used the SVG grammar "as is".
I wrote a glue script to generate the extra JavaScript "semantics" code.
I ran the glue tool and pasted the result into my boilerplate.
I ran frun.bash. (This ran the glue tool using semantics.glue, then ran the result using input file test.svg).
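
Going from the identity transform to a filter only means changing the right-hand side of the rules that should drop something. The sketch below illustrates the idea on plain strings rather than Ohm-JS nodes. The filtering right-hand side is an assumption made for illustration (the rewrite actually used in the repository is not shown in this essay), and the SVG fragment is a simplified, hypothetical Draw.io-style sample.

```javascript
// Identity rewrite, as in  ElementWithSwitch [a b c d] = ${a}${b}${c}${d}
// (parameters here are already-glued strings, not Ohm-JS nodes).
const identitySwitch = function (a, b, c, d) { return `${a}${b}${c}${d}`; };

// ASSUMED filtering rewrite (not taken from the repository):
// drop the <switch> wrapper and its first child, keep the second child.
const filteredSwitch = function (a, b, c, d) { return `${c}`; };

// Hypothetical, simplified Draw.io-style fragment, split the way the
// grammar rule  "<switch>" Element Element "</switch>"  would split it.
const parts = [
  '<switch>',
  '<foreignObject><div>label</div></foreignObject>',
  '<text>label</text>',
  '</switch>'
];

const unchanged = identitySwitch (...parts);
const filtered = filteredSwitch (...parts);
console.log (unchanged);
console.log (filtered);
```

All of the other rules can stay as identity rewrites, which is what "leaving hooks" buys.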

Transpiler Spec

My final spec is:

SVGSwitchAndForeign {

svg = xmlHeader docTypeHeader svgElement

xmlHeader = "<?" stuff* "?>" ws*

docTypeHeader = "<!DOCTYPE" stuff* ">" ws*

svgElement = "<svg" attribute* ">" ws* emptyDefs element+ "</svg>" ws*

emptyDefs = "<defs/>" ws*

element = (elementWithSwitch | elementWithForeign | elementWithelements | elementWithoutelements) ws*

elementWithSwitch = "<switch>" ws* element element "</switch>" ws*

elementWithForeign = "<foreignObject" attribute* ">" ws* element "</foreignObject>" ws*

elementWithelements = "<" name stuff* ">" ws* (element+ | text*) "</" name ">" ws*

elementWithoutelements = "<" name stuff* "/>"

stuff = ~">" ~"/>" ~"<" ~"?>" any

text = stuff

attribute = stuff

name = name1st nameFollow*

name1st = "a" .. "z" | "A" .. "Z"

nameFollow = "0" .. "9" | name1st

ws = " " | "\n" | "\t"

}

Svg [a b c] = ${a}${b}${c}

XMLHeader [a @b c] = ${a}${b}${c}

DOCTypeHeader [a @b c] = ${a}${b}${c}

SvgElement [a @b c d @e f] = ${a}${b}${c}${d}${e}${f}

EmptyDefs [a] = ${a}

Element [a] = ${a}

ElementWithSwitch [a b c d] = ${a}${b}${c}${d}

ElementWithForeign [a @b c d e] = ${a}${b}${c}${d}${e}

ElementWithElements [a b @c d @e f g h] = ${a}${b}${c}${d}${e}${f}${g}${h}

ElementWithoutElements [a b @c d] = ${a}${b}${c}${d}

stuff [a] = ${a}

text [a] = ${a}

attribute [a] = ${a}

name [a @b] = ${a}${b}

name1st [a] = ${a}

nameFollow [a] = ${a}

Which is a lot less code ^[6] than what is written in raw JavaScript.

This code chops the problem up (divide and conquer) into two obvious parts:

breathe in — pattern match the .SVG
breathe out — rearrange the matched code and spit it out.

Each part does one thing only — the first part does pattern matching, the second part does rearranging. Each part is described by its own SCL (DSL). Pattern matching is best described as a grammar, while rearranging is best described as JavaScript `…` syntax.

[I don't try to force-fit everything into one paradigm. Pattern matchers don't make for good code rearrangers, JavaScript strings don't make for good pattern matchers. General purpose languages don't make for good anything. Except details. Details kill.]

Test glue Code

The glue code that corresponds to the SVG grammar is:

Svg [a b c] = ${a}${b}${c}

XMLHeader [a @b c] = ${a}${b}${c}

DOCTypeHeader [a @b c] = ${a}${b}${c}

SvgElement [a @b c d @e f] = ${a}${b}${c}${d}${e}${f}

EmptyDefs [a] = ${a}

Element [a] = ${a}

ElementWithSwitch [a b c d] = ${a}${b}${c}${d}

ElementWithForeign [a @b c d e] = ${a}${b}${c}${d}${e}

ElementWithElements [a b @c d @e f g h] = ${a}${b}${c}${d}${e}${f}${g}${h}

ElementWithoutElements [a b @c d] = ${a}${b}${c}${d}

stuff [a] = ${a}

text [a] = ${a}

attribute [a] = ${a}

name [a @b] = ${a}${b}

name1st [a] = ${a}

nameFollow [a] = ${a}

This says:

Svg is a grammar rule.

Svg = XMLHeader DOCTypeHeader SvgElement

When the Svg grammar rule is matched, the matches are provided (as CSTs) in parameters a, b, and c. Combine the three parameters using JavaScript `…` string syntax and return that string result.

XMLHeader is another grammar rule. The grammar rule is

XMLHeader = "<?" stuff* "?>"

In this case, the grammar matches 3 items ("<?", stuff* and "?>"). The second item, though, has a zero-or-more operator (*), which means that the grammar returns an array (for zero items, the array has length 0). The fact that the second item — b — is non-scalar (an array) is denoted by writing @b on the left-hand side of the GLUE statement. The right-hand side uses simple JavaScript `…` notation where the tool has collapsed the second item into the final variable called b.

The programmer is responsible for writing the LHS's correctly.

There is no "type checking". This tool language is more like an editor operation than a DSL. Comparison: REGEXPs are not type-checked (yet) in languages that use them.

Appendix

Details About Matching

In Ohm-JS, each grammar rule returns <something> after it is finished.

If the rule is something like:

R = A B C

then the grammar rule called "R" returns a single thing — a combination of the return values from A and B and C. In this case A maps to a JS variable and B maps to a JS variable and C maps to a JS variable. Each variable contains one <thing>^[7].

But, if the rule is something like:

R = A B* C

then the B maps to an array of <something>s. This is easy to handle in JavaScript, but you — the programmer — need to know when to expect a single thing or an array of things.

Notation Affects Thinking

A side-note on how notation affects thinking… In ESRAP, B* returns a list (a tree).

It took me a while to reconcile what I expected (coming from Lisp to JS). JS wants you to express details in arrays, whereas Lisp makes it easy to think in terms of trees (aka lists).

Ohm-JS could have returned JS objects, but it returned arrays instead. The creators of Ohm-JS were influenced by JS to use arrays instead of Objects.

The creator of ESRAP was influenced by Lisp to return lists.^[8]

The difference is made more clear in something like

(A* B* C*)

where ESRAP returns one list and Ohm-JS returns three arrays.

[Note that ESRAP rewrites this as (and (* A) (* B) (* C)) which is less clear, if you are thinking in terms of pattern matching. This is yet another orthogonal conversation — see my essay https://guitarvydas.github.io/2021/03/16/Triples.html]
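
A rough JavaScript model of the two shapes follows. It is a sketch for illustration only: the ESRAP result is modelled as nested JS arrays, which is an assumption (ESRAP itself produces Lisp lists), and both function names are made up.

```javascript
// Ohm-JS style: A*, B* and C* arrive as three separate arrays,
// one per parameter.
function glueThreeArrays (as, bs, cs) {
  return as.join ('') + bs.join ('') + cs.join ('');
}

// ESRAP style, modelled here as nested JS arrays (an assumption made for
// illustration): one list whose three members are themselves lists.
function glueOneList (list) {
  return list.map (function (sub) { return sub.join (''); }).join ('');
}

const fromArrays = glueThreeArrays (['a', 'a'], [], ['c']);
const fromList = glueOneList ([['a', 'a'], [], ['c']]);
console.log (fromArrays, fromList);
```

Both produce the same string; the difference is whether the grouping shows up in the parameter list or inside the data.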

[1] SCL means Solution Specific Language. Like the original idea behind DSLs.

[2] I've written parsers many times before. Each time I learned something new. This time, I can apply a subset of what I learned, with confidence.

[3] The JavaScript code that hangs off of the grammar is called "semantics" in Ohm-JS. This term comes from compiler technology, but you don't really need to know about this stuff to simply use it.

[4] See my essay "Details Kill"

[5] The format is a CST - a concrete syntax tree. CSTs are often conflated with ASTs, but there is nothing "abstract" about CSTs. ASTs define the universe of possibilities, but CSTs represent the actual incoming code.

[6] and less detail - details kill

[7] A CST, to be exact. A CST is represented as an Object with certain format (see Ohm-JS source code for exact details).

[8] One could argue that arrays are just optimized lists, but that's beside the point.