What is it like to create a programming language today?

“This book is a classic. Treat it with care.”

An architect on our team said this as he handed me the Dragon Book. I became interested in compiler development about 15 years ago, at the dawn of my career. Once, while reading the book late at night, I fell asleep and carelessly dropped it on the floor. I hope the owner didn’t notice the small dent in the cover when I returned it.

This book was published in 1986. In those days, building a compiler was an extremely difficult task that required a wide range of skills across computer science in general and programming in particular. Now, almost four decades later, I am doing this work myself. How difficult is it today? I invite you to walk through the process of creating a language with me and see how much modern tools have simplified it.

Target language

To begin with, we need to pick a specific language so that the discussion stays concrete. I have always believed that real examples are much more effective than made-up ones, so I will use the ZModel language that we are building at ZenStack. It is a domain-specific language (DSL) used to model database tables and access control rules. To keep the article short, I will demonstrate only a small subset of its features. Our goal is to compile the following code snippet:

model User {
  id Int
  name String
  posts Post[]
}

model Post {
  id Int
  title String
  author User
  published Boolean

  @@allow('read', published == true)
}

A few quick notes:

  • the model syntax defines a database table, and its fields map to the columns of that table;
  • models can reference each other to form relationships. In the example above, the User and Post models form a “one-to-many” relationship;
  • the @@allow expression is an access control rule that takes two arguments: the first describes the type of access (create, read, update, delete or all), and the second is a boolean condition that determines whether this type of access is granted.

That’s all. It’s time to roll up your sleeves and start compiling!

ZModel is a superset of Prisma Schema Language.

Creating a language in six steps

▍ Step 1: From text to syntax tree

In general, compiler construction has not changed much over the years. First we need a lexer to break the text into lexemes (tokens), and then a parser to build the token stream into a syntax tree. High-level language authoring tools usually combine these two steps and let you go from text to tree directly.

We used the open source toolkit Langium to create the language. It is a great toolkit built on TypeScript that simplifies the process of creating a language. Langium provides an intuitive DSL for defining lexer and parser rules.

In fact, the Langium DSL is itself built with Langium. In compiler jargon, this kind of recursion is called bootstrapping. The first version of the compiler must be written in a different language or tool.

The grammar of ZModel can be defined as follows:

grammar ZModel

entry Schema:
    (models+=Model)*;

Model:
    'model' name=ID '{'
        (fields+=Field)+
        (rules+=Rule)*
    '}';

Field:
    name=ID type=(Type | ModelReference) (isArray?='[' ']')?;

ModelReference:
    target=[Model];

Type returns string:
    'Int' | 'String' | 'Boolean';

Rule:
    '@@allow' '('
        accessType=STRING ',' condition=Condition
    ')';

Condition:
    field=SimpleExpression '==' value=SimpleExpression;

SimpleExpression:
    FieldReference | Boolean;

FieldReference:
    target=[Field];

Boolean returns boolean:
    'true' | 'false';

hidden terminal WS: /\s+/;
terminal ID: /[_a-zA-Z][\w_]*/;
terminal STRING: /"(\\.|[^"\\])*"|'(\\.|[^'\\])*'/;

I hope this grammar is intuitive enough to understand. It consists of two parts:

  • Lexing rules.
    The terminal rules at the bottom are lexing rules that determine how the source text is broken down into tokens. Our simple language has only identifier tokens (ID) and string tokens (STRING). Whitespace is ignored.
  • Parsing rules.
    The rest are parsing rules. They determine how the token stream is organized into a tree, and they may contain keywords (for example, Int, @@allow), which also take part in lexing. In a complex language, you will probably have recursive parsing rules (such as nested expressions) that require special care to design. Our example does not need them.
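
To make the lexing step more tangible, here is a minimal hand-written sketch of what the terminal rules do. It is not how Langium works internally; the names LexToken, SYMBOLS, WORDS, matchAt and tokenize are hypothetical, and the only parts taken from the grammar are the three regular expressions and the keyword lists.

```typescript
// Hypothetical token shape for this sketch; real Langium tokens carry more data.
interface LexToken {
  kind: "KEYWORD" | "ID" | "STRING";
  text: string;
  offset: number;
}

// Punctuation and word keywords that appear in the parser rules of the grammar.
const SYMBOLS = ["@@allow", "==", "{", "}", "(", ")", "[", "]", ","];
const WORDS = new Set(["model", "Int", "String", "Boolean", "true", "false"]);

// The terminal rules from the grammar, anchored to the start of the remaining input.
const WS = /^\s+/;
const ID = /^[_a-zA-Z][\w_]*/;
const STRING = /^(?:"(\\.|[^"\\])*"|'(\\.|[^'\\])*')/;

function matchAt(re: RegExp, input: string, offset: number): string | undefined {
  const m = re.exec(input.slice(offset));
  return m ? m[0] : undefined;
}

function tokenize(input: string): LexToken[] {
  const tokens: LexToken[] = [];
  let offset = 0;
  while (offset < input.length) {
    const ws = matchAt(WS, input, offset);
    if (ws !== undefined) {
      offset += ws.length; // hidden terminal: consumed but never emitted
      continue;
    }
    const symbol = SYMBOLS.find((s) => input.startsWith(s, offset));
    const str = matchAt(STRING, input, offset);
    const id = matchAt(ID, input, offset);
    let token: LexToken;
    if (symbol !== undefined) {
      token = { kind: "KEYWORD", text: symbol, offset };
    } else if (str !== undefined) {
      token = { kind: "STRING", text: str, offset };
    } else if (id !== undefined) {
      // In this sketch, a word that is also a keyword wins over the ID terminal.
      token = { kind: WORDS.has(id) ? "KEYWORD" : "ID", text: id, offset };
    } else {
      throw new Error(`Unexpected character '${input[offset]}' at offset ${offset}`);
    }
    tokens.push(token);
    offset += token.text.length;
  }
  return tokens;
}
```

Calling tokenize on the rule line from our snippet yields eight tokens, with 'read' classified as STRING and published as ID. A generated lexer does the same job, with better error recovery and position tracking.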

After preparing the rules, we can use the Langium API to parse our initial code snippet into a syntax tree.
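
As a rough stand-in for the tree, here is a hand-written approximation of the node for the Post model, expressed as a plain TypeScript object. The interfaces are simplified assumptions derived from the grammar rules; the types Langium actually generates are richer. Cross-references are still unresolved at this point, so they carry only the referenced name ($refText), and accessType is assumed to be stored with its quotes stripped.

```typescript
// Simplified, hypothetical node shapes derived from the grammar rules.
type BuiltinTypeSketch = "Int" | "String" | "Boolean";
interface ModelRefSketch { $type: "ModelReference"; target: { $refText: string } }
interface FieldRefSketch { $type: "FieldReference"; target: { $refText: string } }
interface FieldSketch {
  $type: "Field";
  name: string;
  type: BuiltinTypeSketch | ModelRefSketch;
  isArray: boolean;
}
interface ConditionSketch {
  $type: "Condition";
  field: FieldRefSketch | boolean;
  value: FieldRefSketch | boolean;
}
interface RuleSketch { $type: "Rule"; accessType: string; condition: ConditionSketch }
interface ModelSketch { $type: "Model"; name: string; fields: FieldSketch[]; rules: RuleSketch[] }

// Approximate tree for: model Post { ... @@allow('read', published == true) }
const postModel: ModelSketch = {
  $type: "Model",
  name: "Post",
  fields: [
    { $type: "Field", name: "id", type: "Int", isArray: false },
    { $type: "Field", name: "title", type: "String", isArray: false },
    {
      $type: "Field",
      name: "author",
      type: { $type: "ModelReference", target: { $refText: "User" } },
      isArray: false,
    },
    { $type: "Field", name: "published", type: "Boolean", isArray: false },
  ],
  rules: [
    {
      $type: "Rule",
      accessType: "read",
      condition: {
        $type: "Condition",
        field: { $type: "FieldReference", target: { $refText: "published" } },
        value: true,
      },
    },
  ],
};

// Render the tree as an indented outline, one line per node.
function outline(model: ModelSketch): string[] {
  const lines = [`Model ${model.name}`];
  for (const field of model.fields) {
    const fieldType = field.type;
    const shown = typeof fieldType === "string" ? fieldType : `-> ${fieldType.target.$refText}`;
    lines.push(`  Field ${field.name}: ${shown}${field.isArray ? "[]" : ""}`);
  }
  for (const rule of model.rules) lines.push(`  Rule allow ${rule.accessType}`);
  return lines;
}
```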

▍ Step 2: From syntax tree to linked tree

The syntax tree goes a long way toward understanding the semantics of the source file. However, there is often one more step that needs to be taken.

Our ZModel language allows so-called cross-references. For example, the posts field of the User model refers to the Post model, which refers back to it through the author field. When we traverse the tree and reach a ModelReference node, we see that it refers to the name Post, but we cannot tell directly what that name means. We could search ad hoc for a model with a matching name, but a more systematic approach is to perform a “linking” traversal that resolves all such references and binds them to their target nodes. After linking is complete, our syntax tree will look like this (only part shown):


Linked syntax tree (partial)

Technically, it is now more of a graph than a tree, but by convention we will continue to call it a syntax tree.

The good thing about Langium is that, in most cases, it performs the linking traversal automatically. It follows the nesting hierarchy of the parsed nodes and uses it to build “scopes”. When it encounters a name, it can then resolve that name within the appropriate scope and link it to the right target node. In complex languages, there will be cases that need special name resolution behavior. Langium makes this easier by letting you hook into the linking process with your own service implementations.
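
The idea of scope-based linking can be sketched in a few lines. This is a simplified illustration, not Langium’s implementation: the shapes NameRef, LinkField and LinkModel and the function linkSchema are hypothetical. Model names live in one global scope, while field names are visible only inside their own model.

```typescript
// Hypothetical, simplified node shapes; Langium's generated types are richer.
interface NameRef<T> {
  $refText: string; // the name as written in the source
  ref?: T;          // filled in by the linking pass
}
interface LinkField {
  name: string;
  type: string | NameRef<LinkModel>; // built-in type name or model reference
}
interface LinkModel {
  name: string;
  fields: LinkField[];
  ruleFields: NameRef<LinkField>[]; // field references used inside @@allow rules
}

// Resolve every reference in place and return messages for unresolved names.
function linkSchema(models: LinkModel[]): string[] {
  const errors: string[] = [];
  // Global scope: all model names are visible everywhere in the schema.
  const modelScope = new Map<string, LinkModel>();
  for (const model of models) modelScope.set(model.name, model);

  for (const model of models) {
    // Local scope: field names are visible only inside their own model.
    const fieldScope = new Map<string, LinkField>();
    for (const field of model.fields) fieldScope.set(field.name, field);

    for (const field of model.fields) {
      const fieldType = field.type;
      if (typeof fieldType === "string") continue; // Int, String, Boolean
      const target = modelScope.get(fieldType.$refText);
      if (target) fieldType.ref = target;
      else errors.push(`Unknown model '${fieldType.$refText}' in ${model.name}.${field.name}`);
    }
    for (const fieldRef of model.ruleFields) {
      const target = fieldScope.get(fieldRef.$refText);
      if (target) fieldRef.ref = target;
      else errors.push(`Unknown field '${fieldRef.$refText}' in ${model.name}`);
    }
  }
  return errors;
}
```

Because the scopes are built before any reference is resolved, the order of declarations does not matter: User can refer to Post and Post can refer back to User. After the pass, following ref from one model leads to the other and back, which is exactly why the result is a graph rather than a tree.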

▍ Step 3: From linked tree to semantic correctness

If the source file contains parser/lexer errors, the compiler will report this and stop execution.

model {
  id
  title String
}
Expecting token of type 'ID' but found `{`. [Ln1, Col7]

But the absence of such errors still does not guarantee that the code is semantically correct. For example, the fragment below is syntactically valid but contains a semantic problem, because comparing title with true makes no sense:

model Post {
  id Int
  title String
  author User
  published Boolean

  @@allow('read', title == true) // <- this comparison is invalid.
}

Each language usually has its own semantic rules, and tools rarely handle them automatically. Langium provides hooks for this, allowing you to evaluate the validity of different types of nodes.

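// Import paths assume the default Langium project layout and may differ in your project.
import type { ValidationAcceptor, ValidationChecks } from 'langium';
import { isFieldReference } from './generated/ast.js';
import type { SimpleExpression, ZModelAstType } from './generated/ast.js';
import type { ZModelServices } from './zmodel-module.js';

// Register the checks so that they run for every node of the matching type.
export function registerValidationChecks(services: ZModelServices) {
  const registry = services.validation.ValidationRegistry;
  const validator = services.validation.ZModelValidator;
  const checks: ValidationChecks<ZModelAstType> = {
    SimpleExpression: validator.checkExpression,
  };
  registry.register(checks, validator);
}

export class ZModelValidator {
  // Report a field reference used in a condition unless the field is Boolean.
  checkExpression(expr: SimpleExpression, accept: ValidationAcceptor) {
    if (isFieldReference(expr) && expr.target.ref?.type !== 'Boolean') {
      accept('error', 'Only boolean fields are allowed in conditions', {
        node: expr,
      });
    }
  }
}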

Now we get a meaningful semantic error:

Only boolean fields are allowed in conditions [Ln 7, Col 19]

Unlike lexing, parsing, and linking, semantic checking is not a particularly declarative or systematic process. In complex languages, you have to write many rules in imperative code.

▍ Step 4: Improving the developer experience

Today, the bar for building good developer tools is very high. To succeed, a new tool must not only work well but also be pleasant to use. For languages and compilers, the developer experience comes down to three aspects:

▍ 1. IDE support

Good IDE support (syntax highlighting, formatting, autocompletion, and so on) significantly flattens the learning curve and makes developers’ lives easier. What I like about Langium in this regard is its built-in support for the Language Server Protocol (LSP). Your parsing and validation rules automatically become a decent default LSP implementation that works directly with VSCode and recent JetBrains IDEs (with limitations). However, to deliver a high-quality IDE experience, you will need to go the extra mile and override Langium’s default implementations of the LSP-related services.

▍ 2. Error reporting

Your validation logic will generate error messages in many cases. How precise and informative these messages are largely determines how quickly developers can understand the problem and take the right action.
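
One common way to make a message more informative is to show the offending source line and point at the exact span. The sketch below is one possible approach, not something Langium or ZenStack prescribes; SourceDiagnostic and renderDiagnostic are hypothetical names, and line and column are assumed to be 1-based, as in the messages shown earlier.

```typescript
// Hypothetical diagnostic shape; line and column are 1-based.
interface SourceDiagnostic {
  message: string;
  line: number;
  column: number;
  length: number; // number of characters to underline
}

// Render the message, the offending source line, and a caret underline beneath it.
function renderDiagnostic(source: string, d: SourceDiagnostic): string {
  const lineText = source.split("\n")[d.line - 1] ?? "";
  const gutter = `${d.line} | `;
  const underline =
    " ".repeat(gutter.length + d.column - 1) + "^".repeat(Math.max(1, d.length));
  return [
    `error: ${d.message} [Ln ${d.line}, Col ${d.column}]`,
    gutter + lineText,
    underline,
  ].join("\n");
}
```

For the invalid Post model from Step 3, passing line 7, column 19 and length 5 places the carets directly under title, so the reader sees both what is wrong and where.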

▍ 3. Debugging

If your language is “executable” (more on that in the next section), it needs debugging support. What debugging means depends on the nature of the language. If it is an imperative language with statements and control flow, it should support stepping through code and inspecting state. If the language is declarative, debugging will most likely mean visualization that helps clarify complex parts (rules, expressions, and so on).

▍ Step 5: Delivering value

Getting a fully resolved and error-free syntax tree is cool, of course, but it is not much use by itself. From this point, there are several possible paths to giving the language actual value:

  1. Stop at this stage.
    You can stop here and expose the syntax tree as the final output, so that users can do with it whatever they see fit.
  2. Convert it to other languages.
    Most often, a language has a “backend” that converts the syntax tree to a lower-level language. For example, the Java compiler backend generates JVM (Java Virtual Machine) bytecode. At ZenStack, we transform ZModel into Prisma Schema Language, which the target language’s tools or runtime can then take as input.
  3. Implement the transformation mechanism as a plug-in system.
    You could also build a plug-in mechanism that lets users write their own transformations, which is a more structured variation of option 1.
  4. Build a runtime for the syntax tree.
    This is the most “full-fledged” way of building a language. You can implement an interpreter to execute the parsed code. What “execution” means is up to you. At ZenStack, we also have a runtime that interprets access control rules and enforces them when data is accessed.
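
To give a feel for the last option, here is a minimal sketch of interpreting an @@allow rule against a data row. It is not ZenStack’s runtime: the names RuleOperand, AccessRule, DataRow, evalOperand and isAllowed are hypothetical, and the deny-by-default behavior is an assumption made for this illustration.

```typescript
// Hypothetical runtime shapes for this sketch; not ZenStack's actual types.
type RuleOperand =
  | { kind: "field"; name: string }
  | { kind: "literal"; value: boolean };

interface AccessRule {
  accessType: string; // 'create', 'read', 'update', 'delete' or 'all'
  field: RuleOperand; // left-hand side of the == comparison
  value: RuleOperand; // right-hand side of the == comparison
}

type DataRow = { [field: string]: unknown };

// A field operand reads from the row; a literal evaluates to itself.
function evalOperand(operand: RuleOperand, row: DataRow): unknown {
  return operand.kind === "field" ? row[operand.name] : operand.value;
}

// Assumption: access is denied unless at least one matching rule evaluates to true.
function isAllowed(rules: AccessRule[], access: string, row: DataRow): boolean {
  return rules.some(
    (rule) =>
      (rule.accessType === access || rule.accessType === "all") &&
      evalOperand(rule.field, row) === evalOperand(rule.value, row),
  );
}

// Corresponds to: @@allow('read', published == true)
const postRules: AccessRule[] = [
  {
    accessType: "read",
    field: { kind: "field", name: "published" },
    value: { kind: "literal", value: true },
  },
];
```

With these rules, isAllowed(postRules, "read", { published: true }) grants access, while an unpublished post or any other access type is rejected. A real runtime would also have to handle relations, richer expressions, and pushing conditions down into database queries.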

▍ Step 6: Finding users

Congratulations! Now you can rest, because you have completed 20% of the work of creating a new language. As with almost any new product, the hardest part is selling it to people, even when it is free. This may not be a concern if the language is intended solely for you or your team. But if it was created for an external audience, getting it adopted will not be easy. That is the remaining 80% of the work.

Conclusion

Given how rapidly software development has evolved in recent decades, creating a compiler feels like an ancient art. But I still believe that every serious developer should try such a project, if only for the unique experience. It brings out the dual nature of programming, aesthetics and pragmatism, very clearly. Great software systems usually have an elegant conceptual model, but under the hood you will also find plenty of improvisations that are not so pretty.

You should try writing a programming language. Because why not?
