Tasks about PEG-parsers

Tasks about PEG-parsers

I once wanted to do a parsing contest for Codeforces. I came up with two types of tasks:

  1. An informal description of the language in which the grammar is to be created is given (for example, “a language with correct parenthetical sequences”);

  2. Examples of lines in the language that you need to restore the grammar are given.

The problem with the first type of task: informal description will be understood differently by different people, so the test is not the ability to compose grammars, but the ability to understand task descriptions. And a formal description will not do, because it resembles correct grammar.

The problem of the second type of problem of limitation of examples. You can “not catch” what the structure of the language looks like. And you can also create a grammar that simply contains all the given examples. It is not clear how to appeal.

As a result, I did the game the CrateGram program, in which you can solve tasks of the second type, while checking strings for belonging to the guessed language.

Let’s introduce three definitions that will come in handy later:

Language L – Many lines.
Grammar of the L language – A function of type (input: string) -> bool, which returns true if the input string is in the L language, false otherwise.
AST (Abstract syntax tree) is a hierarchical structure into which the string is translated after parsing. The nodes of the tree correspond to the rules of the grammar, and the leaves correspond to the parts of the steamed string.

Caterpillar logic

Caterpillar logic is a game where you have to guess the rule that snakes correspond to (it was this game that inspired me). “Correct” snakes are placed in the valid column, and incorrect ones – invalid. Snakes consist of segments of 4 colors. The game is that we make new snakes from segments, and the program tells us whether they fit the rule or not. So, you can understand the regularity. To pass the level, you need to distribute 15 snakes on the columns (these 15 snakes are generated by the game when we click “I know the rule and ready for the test!”).

Caterpillar logic

In essence, caterpillar logic is the same as a black box. What I like about this approach is that the player can check combinations of segments against the rule.

However, there is a small chance of guessing the rule (). I don’t like it, so to pass a level in CrateGram you need not to specify the correctness of 15 sequences, but to formally describe the rule through PEG. This rule will be tested on several tests.

What is Parsing Expression Grammar?

Parsing Expression Grammar (or PEG) is a “format” of grammar that uses regular expression operators. List of operators:

"abc" - точное совпадение строки abc
[abc] - совпадение одного из символов в скобках (или промежутка символов, если записаны через `-`)
e* - от 0 до бесконечности совпадений e
e+ - от 1 до бесконечности совпадений e
&e - e должно находиться далее
!e - e не должно находиться далее

e1 e2 - выражения e1 и e2 должны идти друг за другом
e1 / e2 - если выражение e1 не парсится, тогда парсим e2
r = ... - создание нового правила с названием r

A simple example:

// в каждой грамматике должно быть стартовое правило, с которого начинается парсинг
// здесь оно названо `root`
root = "2" digit digit digit
digit = [0-9]

This is an example of the grammar of the language of all numbers from 2000 to 2999. To match the root rule, the string must start with a binary. This is followed by three repetitions of the rule digitwhich describes one digit.

Correction

In general, for the grammar to be correct, you need to add !. at the end of the rule root. This addition means that there should be no characters after the fourth digit (. – means any symbol).

There is a more well-known way of assigning grammars: EBNF or Extended Backus-Naur form (also simply BNF). Its difference from PEG is that one string can be processed differently (giving multiple ASTs). There is a known dangling-else problem. The point is that for nested if-then-else, the next else can refer to both the outer if and the inner if.

Case analysis for EBNF

Let the expression `if A then if B then C else D’ be given. Two AST options are possible for him:

"root": [
  "if ",
  "A",
  " then ",
  [
	  "if ",
	  "B",
	  " then ",
	  "C",
	  " else ",
	  "D"
  ]
]

or

"root": [
  "if ",
  "A",
  " then ",
  [
	  "if ",
	  "B",
	  " then ",
	  "C",
  ],
  " else ",
  "D"
]

There is no such thing in PEG, parsing of grammar and expression is unambiguous. And the PEG parser is easier to write.

Parsing for PEG

Let

root = expr !.
expr = "if " ([A-Z] / expr) " then " ([A-Z] / expr) (" else " ([A-Z] / expr))?
// AST для `if A then if B then C else D`
{
  "root": {
    "expr": [
      "if ",
      "A",
      " then ",
      {
        "expr": [
          "if ",
          "B",
          " then ",
          "C",
          " else ",
          "D"
        ]
      },
      ""
    ]
  }
}

Tasks in the image and likeness of snakes

This is what the CrateGram interface looks like

In CrateGram, at each level you need to create a grammar that will only accept valid strings. Like Caterpillar logic, the player can check the validity of lines (add test button, you can write a test below).

It’s made simple: there’s a hidden grammar that checks the user’s input. To check the response (the player’s grammar), there are several test lines that check the user’s grammar. The test lines are made as follows: a line is randomly generated, checked with a hidden grammar, after which it is set to valid or invalid depending on the check.

At first, I wanted to generate valid strings, but for this you need to get rid of operators ! and &. There is an algorithm how to do it (chapter 4), but I couldn’t While the tests are a fairly reliable way of checking.

Other CrateGram features

There is also a Playground where you can easily create and test grammars, just like in PEG.js. Advantage over PEG.js in detailed AST tree. It will show the names of the rules.

The downside is primitive error messages (and most likely performance). If you create a grammar root = rootthen PEG.js will write Line 1, column 8: Possible infinite loop when parsing (left recursion: root -> root). CrateGram will crash already while writing the test.

In the future, ASTs will be able to add unwrap/ignore rules with an underscore as in Lark, or a unary minus to remove from the AST as in better-parse.

Conclusions

When I used PEG parsers, I thought it was a complicated program that I wouldn’t be able to figure out. When I wrote my PEG parser, that feeling disappeared, and so did the blind spot in understanding parsing algorithms.

Related posts