Part 2

[This is a follow-on to part 1 of "Recursive Design, Iterative Design By Example".]


Goal: transpile a file of ASON text to Lisp items.


Sub-goal: instead of trying to implement ASON and Rebol, I start by converting ASON text into tokens.  I will see, later,  how to progress from there…


In general, the parser inputs a stream of text and outputs a stream of ASON tokens.


Specifically, when emitting a stream of Lisp tokens, the ASON tokens will appear as Lisp atoms and lists (s-expressions).


[I am trying to capture my design thoughts as they occur, kind of like a live-stream video.  This is not a paper, nor a YouTube video nor a Twitch live-stream.  The result is probably uneven - I will continue to experiment with various thought-capture methodologies.]


Discovery

Discovery / Observation: the 7th pass, called blocks, is unnecessary.


The block tokens [], {}, () simply emit Lisp lists.


The grammar does not need to parse compound blocks, since emitting Lisp lists (function calls) is sufficient.  Lisp will read the emitted compound blocks into its internal (list) format.
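
For example (my sketch, not part of the pipeline): once a compound block has been emitted as text, the standard Lisp reader recovers the nested list structure directly, because the emitted text is ordinary s-expression syntax.

  ;; A minimal sketch in Common Lisp, assuming emitted text of the
  ;; shape shown later in this post.
  (with-input-from-string (s "(ason-array-block (mmdd july 23) 333.53)")
    (read s))
  ;; => (ASON-ARRAY-BLOCK (MMDD JULY 23) 333.53)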


The Lisp REPL applies different semantics to lists than the Rebol REPL / compiler.  I will deal with interpretation of the stream later.  


For now, I just want to tokenize the input.


ASON has many basic data types and I want to experiment with tokenizing them using PEG.
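
To give a flavor of what such rules look like (my sketch in Ohm-style PEG, not the actual grammar):

  integer = digit+
  float   = digit+ "." digit+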

Restructuring Token Layers

The grammar has become a tokenizer consisting of 2 layers:


  1. very basic tokens, and,
  2. compound tokens.


The 3rd layer is the emitter.


I built a generic emitter and tested it.


Then, I built a Lisp emitter and tested it.  I "RY'ed" the generic emitter (copied and modified it, the opposite of DRY) to create the Lisp emitter.
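
As an illustration of the token-to-Lisp mapping (a minimal sketch in Lisp, using the token tags that appear in the sample below; the actual emitter is a grammar pass, not this code):

  (defun emit-token (tag args)
    ;; Map a few of the token tags onto Lisp text.
    (cond ((string= tag "e!") (format t "(define-word ~a) " (first args)))
          ((string= tag "e?") (format t "(use-word ~a) " (first args)))
          ((string= tag "I")  (format t "~a " (first args)))
          ((string= tag "bF") (format t "~a.~a " (first args) (second args)))
          ((string= tag "D[") (format t "(ason-array-block "))
          ((string= tag "D]") (format t ") "))))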



Lisp Emitter Sample

The sample input code:

     { year: 2021
       tx: [
         [ jul-23 ??? -333.53 pppppp ]
       ]
     }

is converted into the token stream:

D{ {
e! year
I 2021
e! tx
D[ [
D[ [
cP july 23
U
D- -
bF 333 53 
e? pppppp
D] ]
D] ]
D} }

which is then converted into Lisp code:

((ason-object-block
  (define-word year)
  2021
  (define-word tx)
  (ason-array-block
   (ason-array-block
    (mmdd july 23)
    ???
    #\-
    333.53
    (use-word pppppp)))))


Next Step

I want "define-word" to be a function of 2 arguments, i.e. the stream should be

((ason-object-block
  (define-word year 2021)
  (define-word tx
    (ason-array-block
     (ason-array-block
      (mmdd july 23)
      ???
      #\-
      333.53
      (use-word pppppp))))))
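

In Lisp terms, a 2-argument define-word behaves like an ordinary binding form.  A minimal sketch, assuming a global hash-table environment (my assumption, not the author's implementation):

  (defparameter *words* (make-hash-table :test #'eq))

  (defmacro define-word (name value)
    ;; Bind NAME (unevaluated) to the evaluated VALUE.
    `(setf (gethash ',name *words*) ,value))

  (defmacro use-word (name)
    ;; Look NAME up in the word environment.
    `(gethash ',name *words*))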


This could be done in 

  1. Lisp, or,
  2. the Grammar.


I haven't decided, yet, how to do this.


Option (1) would need no change to the grammar.  Lisp would simply read the list of tokens and consume the tokens that are needed.


Option (2) would require a change to the grammar, but would make the Lisp code simpler.


This looks like a simple pattern match, which is the domain of the grammar/parser.  I will start there.


Currently, the tokenizer emits something like

((ason-object-block
  (define-word year) 2021
  (define-word tx)
  (ason-array-block
   (ason-array-block
    (mmdd july 23)
    ???
    #\-
    333.53
    (use-word pppppp)))))


Bug

PEG allows one to write grammars that are "forgiving".


I am choosing to write grammars explicitly, to

  1. make the explanations of what is going on clearer, and,
  2. double-check the data in the pipeline (the grammar can be loosened up after most of the passes are working).


At the moment, some of the passes are failing on parse errors, due to an extra blank character inserted into some of the tokens.  Unfortunately, the error is caused by an invisible character, which has made it harder to detect the nature of the failure.


It seems that the token


bF 333 53 


has an extra space appended to it (bF is a floating point token containing the two integer parts of a float value).


I currently save the output of each pass in a temp file (_temp[0-9]).  This is, of course, inefficient, but helps me debug the current problem.  I see that _temp3 does not contain the extra space, but _temp4 does.  Looking at run.bash, I see that the pass date.grasem inputs _temp3 and outputs _temp4.  This pass needs to be examined more closely.
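
Standard tools can make such invisible characters visible (my debugging suggestion, not part of the pipeline):

  diff _temp3 _temp4   # locate the lines that changed between the passes
  cat -A _temp4        # GNU cat: mark line ends with '$', exposing trailing spaces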


The first step in debugging is to grab the grammar (only) from date.grasem and the file _temp3 and put them into Ohm-editor[1], for a sanity check.  


Debugging question #1: is the token being parsed correctly?


It is immediately obvious that the token is being parsed by the rule numericalToken, but with only one text field.  The definition of the rule probably allows spaces in the first text field — something I don't intend to allow.


Indeed, the rule says


  numericalToken = ntag nsubtag whiteSpace text text? eol
  text = whiteSpace* textChar+
  textChar = ~eol any


textChar accepts any character except newline.  I need to tighten this up to exclude whitespace.
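
One plausible tightening (my sketch; the actual fix may differ):

  textChar = ~(eol | whiteSpace) any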


[Note to self: check if other passes use this same code snippet.  Fix it, if so.[2]]


Debugging question #2: does the code emission for rule numericalToken look OK?  (A: Yes.  Maybe the extra space was a result of mis-parsing.  Try the test, again.)


Fixing that parse rule has made the problem go away.


Groups

Design Upgrade / Change:


While working on this, I see a simplification. 


It is not necessary to emit ASON blocks as separate tokens.


I use "G ident" to denote the beginning of a block.  The ident supplies information about the block type. 


I have ason-arrays, ason-objects and ason-expressions.  


To this, I add define-word.  This addition makes define-word easier to parse.


The parser now emits a "G ident" token at the start of each group, instead of separate block tokens.

Negative Integer

I found that simple integers are not enough — not a surprise, but I didn't catch this requirement earlier.


Fixing this problem is simple — I need to add another pass.


The only tricky part of this problem is that several ASON types require dashes between integers (e.g. dates).


I want to leave the earlier parser passes alone (they are meant to be isolated) and insert a new pass near the end of the pipeline.


[see https://github.com/guitarvydas/ason/blob/master/negativenum.grasem]
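
The shape of the rewrite might be something like this (my reconstruction; see the linked file for the actual rule):

  negativeNumber = dashToken numberToken   // merge into a single negated number token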

Bug 2

I added the line

      balance : 1234.56

to the small test. 


This is causing me to flip-flop on what the correct solution is.


[The grammar, as it stands, expects no pairs after the list of transactions.  The above simple addition violates this original assumption.]


I am learning something new about the problem.


The line causes a pair of tokens to be passed to the emitter:

e! balance
bF 1234 56


  1. At first, I thought that e! should be weeded out by the expr pass (which immediately precedes the emitter pass).
  2. Then, I thought that this sequence of tokens is legal.
  3. Then, I thought that this sequence of tokens is illegal and should be weeded out by the expr pass before being passed to the emitter passes.


I am going back-and-forth in understanding the details of this problem.


This is "nothing special".  


I did not understand all of the details of the problem when I started.


The deeper understanding of the problem does not wreck my earlier design and code.


Layering some of the details — and eliding details — has made it possible for me to see more nuance in the problem.


At first, I thought of variable-setting as occurring only at the beginning of object blocks.  This new addition shows that it is reasonable for variable-setting to occur elsewhere in a block, too.


Now that I see this, I wonder "how could I have missed this?".  Layering — and moving forward with the easy stuff — has revealed more nuance to me.


I've seen such revelations happen in work that took 2 years to build.  By the time the revelation was made, the design had calcified and the new revelation(s) could not be taken into account in the existing code.


The idea of scripting and generating code makes it easy to incorporate such new revelations into the code base.



Breakdown of Bug 2

The problem now becomes "why didn't the expr pass capture and rewrite this line and its two tokens?".


Theories:

  1. the grammar did not catch the pair of tokens that form the defineWord sequence
  2. a blunder
  3. something else.


Theory (1) might indicate that the rule ident is picking off tags, preventing the rule defineWordGroup from succeeding.


This leads to an examination of the rules tag and ident.


In anticipation of needing to tighten up tag vs. ident matching in all passes, I have split some of the code (from expr) into a boilerplate file.  I include this boilerplate in the expr pass using the m4[3] tool.
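
m4's standard include mechanism is all that is needed; for illustration (the file names here are my assumption):

  include(`boilerplate.grasem')

and the pass is regenerated with something like "m4 expr.m4 > expr.grasem".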


First, I test the file split and the use of m4 to see that I get the same result (the result, at this point, is an error message).


Indeed — the two tokens are being parsed as two separate tokens and the defineWordGroup rule is not matching.


The rule defineWordGroup is defined as

  defineWordGroup = defineWordToken expr expr

whereas it should be

  defineWordGroup = defineWordToken expr

(There is one expr too many in the rule.)


Layering Solutions

One needs to roll the problem around in one's mind to fully understand it.  


Some people "see" the details of the problem more rapidly than others, but, at some point this "seeing" is overwhelmed by masses of detail.  


I theorize that this always happens, but at different degrees of detail for different people.


At some point, the details overwhelm one's ability to reason about a problem.


The point of Recursive Design and FDD[4] is to create layers for details and to make it possible to reason about larger-and-larger problems, regardless of one's ability to cut through details at any level.  


I may need more layers, whereas others may need fewer layers for the same problem.  


The goal is to allow anyone to chunk a problem into layers and to address more interesting problems.  


In the hands of those who require fewer layers, further chunking and layering should make it possible to address more-and-more interesting problems.



[1] https://ohmlang.github.io/editor/

[2] Same code found in stringsAndBinary.grasem and words.grasem.

[3] A standard UNIX® command-line tool.

[4] Failure-Driven Design