PEG vs. Other Pattern Matchers

Introduction

In this essay, I discuss why you should use PEG (instead of, say, REGEXP).

PEG

PEG is a breakthrough technology.

It lets you write pattern-matchers as easily as you would write REGEXPs.

The difference between PEG and REGEXP is that PEG lets you write nested matches, which are much harder (if not impossible) to write as REGEXPs.

PEG can match programming language constructs, like "{ … }".

PEG can match such patterns to any depth, e.g. "{ … { … { … } … } … }".

PEG can rearrange the matches any way you would like, then spit out new code.

PEG can be used to write transpilers.

Parser technology used to be reserved only for compiler-writers.

PEG changes that.

Now, "normal" programmers can use nested pattern-matching as easily as they use REGEXP.

Ohm-JS

I happen to use Ohm-JS.

It is a PEG grammar tool that works with JavaScript.

There are PEG libraries for many different languages, for example ESRAP for Common Lisp, "#lang peg" for Racket.

In some cases there are multiple PEG libraries for a single PL. I had to choose between Ohm-JS and peg-js, for JavaScript.

What to Look For

A grammar is a DSL for pattern-matching.

REGEXP is a DSL for pattern-matching, but it doesn't (easily) allow nesting, whereas PEG does allow nesting.

My opinion is that you want to have as little noise in the grammar as possible.

PEG can match patterns, but it leaves the "what to do with the matches" up to you and your language.

Most PEG libraries include extra syntax to assign matches to variables.

Ohm-JS ^[1] doesn't do this. Ohm-JS leaves the grammar alone, to stand on its own.

When you want to rearrange the patterns,

in Ohm-JS you have to look at another part of the code/script.
in other PEGs, you have to soil the grammar by adding such details ^[2] to the grammar. Details kill readability.^[3]

Scanning and Parsing

[Skip this section if you just want to use PEG]

In previous technologies, one had to write a scanner and a parser separately.

For example, the YACC tool generates parsers, but it needs a scanner tool, usually LEX.

LEX uses a different notation (REGEXP) than YACC.

PEG subsumes all of that into one notation - you can write the scanner and the parser in one go. This seems like a small difference but it makes it possible to write SCLs (little languages) at will.

In fact, I argue that one should write many SCLs for any one project. Any one project — if big and interesting enough — involves more than one paradigm. Each paradigm should get its own SCL. At present, we choose one 3^rd-generation language (like Python, JavaScript, Rust, etc.) and implement all paradigms in that one language. The result is an unreadable mess. Details kill readability.

The YACC/LEX dichotomy kept me from building many little languages for each project.

PEG is opening my eyes to new ways of doing things.

Scanning

REGEXP is a DSL for building scanners.

REGEXP is built into many languages, including Perl and JavaScript.

REGEXP is "flat". It cannot (easily) match nested patterns.

One of the original tools that used the REGEXP DSL syntax was LEX.

The theory for converting REGEXPs to code can be found in The Dragon Book. If you just want to use REGEXPs, don't read The Dragon Book.

Parsing

Parsing is pattern matching.

Most parser-generator tools require some sort of regular input that is more-than just characters, e.g. tokens.

The theory for building parsers is documented in The Dragon Book. If you just want to use pattern matching, don't bother reading The Dragon Book.

Pattern matching is becoming ubiquitous in FP (functional programming) languages.

All that you really need is a way to match patterns and then emit code as lambdas. JavaScript, for example, has a lot of this kind of stuff ^[4]. Most PLs have 1^st-class functions (aka lambdas) and pattern matching libraries (REGEXP and PEG and LR(k) and …).

We want theorists to tell us what works and what doesn't, but we don't need to wait for theories before starting to use this stuff.

[1] and ESRAP, in a sense

[2] In ESRAP, the grammar is left alone and lambdas are tacked onto the end. The problem with ESRAP is that it uses a Lisp syntax instead of a PEG syntax. (This can be cured by writing a PEG in ESRAP that transpiles to ESRAP, but that's another essay).

[3] See my essay "Details Kill"

[4] JS has REGEXPs. You can write TDPL parsers using REGEXPs in functions. JS has lambdas, but calls them anonymous functions.