Tokenizing

I’m struggling with this. I think in terms of tokenizing, but Ohm-JS fights me. I was very unamused to discover that ESRAP works only with characters (instead of tokens and Common Lisp forms).

You can’t skip whitespace unless the whitespace has been tokenized. Ohm-JS handles languages, like JS, that separate items with commas (,), but it creates bugs (accidental complexity) when parsing comma-less languages.

My experiments with ASON parsing[1] using Ohm-JS might show how to tokenize using Ohm-JS.

Isolated Components

I think in terms of isolated software components.

I want to build-and-forget any component.

Ideally, there are no dependencies between components. Adding new components does not affect the way that old ones work.

[This is possible, but unlikely with current programming languages.]

Parsing and tokenizing are like that. The first pass should break the input into two kinds of tokens:

  1. whitespace
  2. non-whitespace.

Then, if we want to delete whitespace, we simply drop tokens of type (1). The rest of the tokens remain the same.
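Here is a minimal sketch of that two-kind first pass, written in plain Python rather than Ohm-JS. The names `Token`, `tokenize` and `drop_whitespace` are made up for illustration, and "whitespace" is assumed to mean whatever the regex class `\s` matches.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    kind: str  # "ws" for whitespace, "text" for non-whitespace
    text: str  # the exact characters covered by this token


# One alternative per token kind: a run of whitespace, or a run of anything else.
_TOKEN_RE = re.compile(r"(?P<ws>\s+)|(?P<text>\S+)")


def tokenize(source: str) -> List[Token]:
    """First pass: split the input into whitespace and non-whitespace tokens."""
    return [Token(m.lastgroup, m.group()) for m in _TOKEN_RE.finditer(source)]


def drop_whitespace(tokens: List[Token]) -> List[Token]:
    """Separate pass: delete tokens of kind "ws" and leave the rest untouched."""
    return [t for t in tokens if t.kind != "ws"]


if __name__ == "__main__":
    tokens = tokenize("a  b\n c")
    print(tokens)
    print(drop_whitespace(tokens))
```

Nothing is thrown away in the first pass, so joining the token texts gives back the original input. Deleting whitespace is a separate, later decision that the tokenizer knows nothing about.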

[For the record, I can’t bring myself to do something this simple using current languages. When I address (1), I immediately worry about counting newlines. Counting lines should, ideally, be done in another pass.]
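As a rough illustration of line counting as its own pass, the Python sketch below assumes the tokens arrive as plain `(kind, text)` pairs, with kind `"ws"` for whitespace and `"text"` for everything else. The function names `number_lines` and `drop_whitespace` are hypothetical, and only `"\n"` is treated as a line break.

```python
from typing import Iterable, Iterator, List, Tuple

RawToken = Tuple[str, str]            # (kind, text)
NumberedToken = Tuple[int, str, str]  # (line, kind, text)


def number_lines(tokens: Iterable[RawToken]) -> Iterator[NumberedToken]:
    """Separate pass: tag each token with the line on which it starts."""
    line = 1
    for kind, text in tokens:
        yield (line, kind, text)
        # Newlines inside this token push later tokens onto later lines.
        line += text.count("\n")


def drop_whitespace(tokens: Iterable[NumberedToken]) -> List[NumberedToken]:
    """Later pass: whitespace can go once line numbers have been recorded."""
    return [t for t in tokens if t[1] != "ws"]


if __name__ == "__main__":
    raw = [("text", "a"), ("ws", "\n\n"), ("text", "b")]
    print(drop_whitespace(number_lines(raw)))
```

Whichever pass produced the tokens never has to think about newlines. The numbering pass has to run before whitespace is dropped, because the whitespace tokens are where the newlines live.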


  1. https://guitarvydas.github.io/2021/04/10/ASON–Notation–Pipeline.html