Reference

Complete guide to the langlang grammar syntax.

Overview

langlang uses a grammar format based on Parsing Expression Grammars (PEG). If you're familiar with regular expressions or BNF notation, you'll find the syntax intuitive.

A grammar consists of productions (also called rules) that define how to parse input. The parser starts from the first production and works its way through the grammar recursively.

Productions

A production has a name on the left, an arrow, and an expression on the right:

ProductionName <- Expression

The first production in a grammar is the start rule — parsing begins there.

Terminals

Terminals match actual characters in the input.

Literals

Match exact character sequences using single or double quotes:

Keyword  <- 'function'
Greeting <- "hello"

Both quote styles are equivalent. Use whichever makes the grammar more readable, for example double quotes when the literal contains single quotes.

Escape Sequences

Standard escape sequences work inside literals:

Sequence    Meaning
\n          Newline
\r          Carriage return
\t          Tab
\\          Backslash
\'          Single quote
\"          Double quote
\u{XXXX}    Unicode

Newline <- '\n'
Tab     <- '\t'

Character Classes

Match any single character from a set using square brackets:

Digit  <- [0-9]
Letter <- [a-zA-Z]
Hex    <- [0-9a-fA-F]

[a-z] matches any character from 'a' to 'z' (inclusive). You can combine multiple ranges and individual characters:

// Matches a single digit, letter, or underscore
Identifier <- [a-zA-Z_0-9]

Any Character

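The dot . matches any single character and fails only at end of input:

// Match an escaped character, or any character that is not a closing quote.
// The escape alternative comes first: !'"' . also accepts a backslash,
// so in the reverse order the escape alternative could never match.
StringChar <- '\\' . / !'"' .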

Unicode Support

Unicode Literals

langlang has full Unicode support. You can match characters from any script, including CJK, Arabic, Cyrillic, and even emoji, directly in your grammars.

Konnichiwa <- 'こんにちは'      // Japanese greeting
Annyeong   <- '안녕하세요'       // Korean greeting
Marhaba    <- 'مرحبا'           // Arabic greeting
Privet     <- 'привет'        // Russian greeting

Unicode in Character Classes

Character classes work with Unicode characters, including ranges:

Hiragana  <- [ぁ-ん]             // Match a single Hiragana character
Katakana  <- [ァ-ン]             // Match a single Katakana character
Hangul    <- [가-힣]             // Match a single Hangul syllable
Cyrillic  <- [а-я]              // Match Cyrillic lowercase letters
MixedWord <- [a-zA-Zあ-ん]+      // Mix Unicode with ASCII

Emoji and Non-BMP Characters

langlang supports characters outside the Basic Multilingual Plane (BMP), including most emoji. These characters have code points above U+FFFF and take 4 bytes in UTF-8.

// Match specific emoji
Brain <- '🧠'
Heart <- '❤'

// Match a range of emoji from brain (U+1F9E0) to DNA (U+1F9EC); test tube and petri dish fall inside it
ScienceEmoji <- [🧠-🧬]

// Choose among several emoji
Reaction <- '👍' / '👎' / '❤' / '😂'

For example, [🧠-🧬] matches any emoji from U+1F9E0 (brain) to U+1F9EC (DNA double helix).
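
The code point and byte-length statements above can be checked outside langlang. The following Python sketch does not use langlang at all; the helper names describe and in_class_range are hypothetical, and in_class_range only models a class range as an inclusive code point comparison.

```python
def describe(ch: str) -> tuple[str, int]:
    """Return (code point as U+XXXX, UTF-8 byte length) for one character."""
    return f"U+{ord(ch):04X}", len(ch.encode("utf-8"))


def in_class_range(ch: str, lo: str, hi: str) -> bool:
    """Model a class range [lo-hi] as an inclusive code point comparison."""
    return ord(lo) <= ord(ch) <= ord(hi)


BRAIN, DNA = "\U0001F9E0", "\U0001F9EC"  # the two ends of the emoji range
TEST_TUBE = "\U0001F9EA"                 # lies between them
HEART = "\u2764"                         # heavy black heart, inside the BMP

for name, ch in [("brain", BRAIN), ("DNA", DNA), ("heart", HEART)]:
    print(name, *describe(ch))

print(in_class_range(TEST_TUBE, BRAIN, DNA))  # True
```

Note that the heart is an emoji with a code point below U+FFFF, so it takes 3 bytes in UTF-8 rather than 4.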

Composition Operators

Build complex expressions from simpler ones.

Sequence

Expressions separated by whitespace must match in order:

// Matches "let x = 5"
LetStatement <- 'let' Identifier '=' Expression

Ordered Choice

The / operator tries alternatives left-to-right, stopping at the first match:

Value <- Number / String / Boolean

Unlike alternation in a context-free grammar (CFG), PEG choice is ordered and unambiguous: the first alternative that matches wins, and the remaining alternatives are not tried.
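
A minimal Python sketch of ordered choice may make the "first match wins" rule concrete. It models parsers as functions from (text, position) to a new position or None; the names lit and choice are hypothetical and are not langlang's API.

```python
from typing import Callable, Optional

# A parser returns the new position on success, or None on failure.
Parser = Callable[[str, int], Optional[int]]


def lit(s: str) -> Parser:
    """Match the exact string s."""
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def choice(*alts: Parser) -> Parser:
    """Ordered choice: try alternatives left to right, commit to the first success."""
    def parse(text: str, pos: int) -> Optional[int]:
        for alt in alts:
            end = alt(text, pos)
            if end is not None:
                return end  # later alternatives are never consulted
        return None
    return parse


short_first = choice(lit("a"), lit("ab"))  # models 'a' / 'ab'
long_first = choice(lit("ab"), lit("a"))   # models 'ab' / 'a'

print(short_first("ab", 0))  # 1: 'a' wins, 'ab' is never tried
print(long_first("ab", 0))   # 2: the longer literal is tried first
```

The same input is consumed differently depending only on the order of the alternatives, which is why specific alternatives belong before general ones.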

Grouping

Use parentheses to group sub-expressions:

Ambiguous <- 'a' / 'b' 'c'    // Without grouping: matches 'a' OR 'bc'

Clear <- 'a' ('b' / 'c')      // With grouping: matches 'ab' OR 'ac'

Repetition Operators

Control how many times an expression matches.

Zero or More (*)

Match zero or more occurrences:

As <- 'a'*        // Matches "", "a", "aaa", ...

List <- Value (',' Value)* // Matches comma-separated values

Zero-or-more never fails — it succeeds even without matching anything.

One or More (+)

Match one or more occurrences:

As <- 'a'+        // Matches "a", "aa", "aaa", ... (but not "")

Digits <- [0-9]+  // Must have at least one digit

This is syntactic sugar for e e*.
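
The relationship between the two repetition operators can be sketched in Python as follows. This is a model under the assumption that parsers are functions returning a new position or None; lit, seq, star, and plus are hypothetical names, not langlang functions.

```python
from typing import Callable, Optional

Parser = Callable[[str, int], Optional[int]]


def lit(s: str) -> Parser:
    """Match the exact string s."""
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def seq(*items: Parser) -> Parser:
    """Sequence: every item must match, one after the other."""
    def parse(text: str, pos: int) -> Optional[int]:
        for item in items:
            end = item(text, pos)
            if end is None:
                return None
            pos = end
        return pos
    return parse


def star(e: Parser) -> Parser:
    """e*: repeat e until it fails; the repetition itself always succeeds."""
    def parse(text: str, pos: int) -> Optional[int]:
        while True:
            end = e(text, pos)
            if end is None or end == pos:  # stop on failure or on an empty match
                return pos
            pos = end
    return parse


def plus(e: Parser) -> Parser:
    """e+ expressed as the sequence e e*."""
    return seq(e, star(e))


print(star(lit("a"))("", 0))     # 0: succeeds without consuming anything
print(plus(lit("a"))("", 0))     # None: at least one 'a' is required
print(plus(lit("a"))("aab", 0))  # 2
```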

Optional (?)

Match zero or one occurrence:

Color    <- 'colo' 'u'? 'r'  // Matches "color" or "colour"

SignedInt <- [+-]? Digits    // Optional sign before number

Predicates

Predicates look ahead without consuming input.

Not Predicate (!)

Succeeds if the expression fails to match (without consuming input):

NotQuote   <- !'"' .                          // Match any char that's not a quote
Identifier <- !Keyword [a-zA-Z_][a-zA-Z0-9_]* // Match word that's not a keyword

This enables unlimited lookahead — you can check arbitrarily complex conditions before committing to a match.

// Classic PEG example: match content between brackets
BracketString <- '[' (!']' .)* ']'
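
As a rough illustration of how the predicate and the dot cooperate in BracketString, here is a hand-written Python equivalent. It is a sketch of the matching behavior only, and the function name bracket_string is hypothetical.

```python
from typing import Optional


def bracket_string(text: str, pos: int = 0) -> Optional[int]:
    """Model of: BracketString <- '[' (!']' .)* ']'

    Returns the position after the match, or None on failure.
    """
    if not text.startswith("[", pos):
        return None
    pos += 1
    # (!']' .)* : the predicate peeks at the next character without consuming,
    # and the dot consumes it only when the peek did not see ']'.
    while pos < len(text) and text[pos] != "]":
        pos += 1
    if not text.startswith("]", pos):
        return None  # reached end of input without a closing bracket
    return pos + 1


print(bracket_string("[abc]"))  # 5
print(bracket_string("[abc"))   # None
print(bracket_string("[a]b]"))  # 3: stops at the first closing bracket
```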

And Predicate (&)

Succeeds if the expression matches (without consuming input):

// Only match if followed by specific char
SpecialA <- 'a' &'('

Use & when you want to assert a positive condition before committing to a branch. It reads more clearly than a double negation (!!e) in those cases:

// Assert the identifier starts with a valid character before consuming
Identifier <- &[a-zA-Z_] [a-zA-Z_0-9]+

// Gate on a delimiter that signals the format to expect
TypedValue <- &'{' Object / &'[' Array / Scalar

Whitespace Control

Automatic Whitespace Handling

langlang automatically inserts whitespace handling between the elements of non-syntactic productions. A production is syntactic if its expression is built only from terminals; a production that calls another production is non-syntactic.

// Non-syntactic: calls other production, gets auto-spacing
Expr <- Int '+' Int

// Syntactic: only terminals, no auto-spacing
Int <- [0-9]+

For Expr, the parser automatically allows whitespace before Int, '+', and the second Int. So "1+2" and "1 + 2" both parse successfully.

Disabling Auto-Spacing (#)

Use # to disable automatic whitespace handling for an expression:

// Without #: "3 rd" would match (spaces between number and suffix)
// With #: space between number and suffix causes failure
Ordinal <- Decimal #('st' / 'nd' / 'rd' / 'th')
Decimal <- [0-9]+

Input     Without #     With #
"3rd"     ✓ matches     ✓ matches
"3 rd"    ✓ matches     ✗ fails
" 3rd"    ✓ matches     ✓ matches

A common use case is string literals where internal spaces matter:

// Non-syntactic (calls DQ production)
String <- DQ #((!DQ .)* DQ)
DQ     <- '"'
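
The Ordinal table above can be reproduced with a small Python model. This is a sketch under stated assumptions, not langlang's implementation: whitespace is assumed to mean spaces, tabs, and newlines, auto-spacing is modeled as skipping whitespace before each element of the non-syntactic production, and the names skip_ws, decimal, suffix, and ordinal are hypothetical.

```python
from typing import Optional


def skip_ws(text: str, pos: int) -> int:
    """Advance past whitespace (assumed here to be space, tab, CR, LF)."""
    while pos < len(text) and text[pos] in " \t\r\n":
        pos += 1
    return pos


def decimal(text: str, pos: int) -> Optional[int]:
    """Decimal <- [0-9]+ (syntactic, so no spacing inside)."""
    end = pos
    while end < len(text) and text[end] in "0123456789":
        end += 1
    return end if end > pos else None


def suffix(text: str, pos: int) -> Optional[int]:
    """'st' / 'nd' / 'rd' / 'th'"""
    for s in ("st", "nd", "rd", "th"):
        if text.startswith(s, pos):
            return pos + len(s)
    return None


def ordinal(text: str, lexified: bool) -> bool:
    """Ordinal with (lexified=True) or without (lexified=False) the # operator."""
    pos = skip_ws(text, 0)  # spacing before Decimal is allowed in both variants
    end = decimal(text, pos)
    if end is None:
        return False
    if not lexified:
        end = skip_ws(text, end)  # auto-spacing before the suffix group
    return suffix(text, end) is not None


for sample in ("3rd", "3 rd", " 3rd"):
    print(repr(sample), ordinal(sample, lexified=False), ordinal(sample, lexified=True))
```

Only the "3 rd" row differs between the two variants, because # removes the one whitespace skip that sits between the number and its suffix.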

Error Handling

Labels

Attach error labels to expressions using ^:

Array <- '[' (Value (',' Value^itemExpected)*)? ']'^closeBracket

When parsing fails at a labeled expression, the label appears in the error message instead of a generic "expected X" message. A label is syntactic sugar for throwing a failure when the labeled expression fails: Exp^label is equivalent to Exp / throw(label).
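
The desugaring of a label into a choice with a throw can be modeled in Python with an exception. This is an illustrative sketch; LabeledFailure, labeled, and lit are hypothetical names, and langlang's actual error objects may look different.

```python
from typing import Callable, Optional

Parser = Callable[[str, int], Optional[int]]


class LabeledFailure(Exception):
    """Raised when a labeled expression fails; carries the label and position."""

    def __init__(self, label: str, pos: int) -> None:
        super().__init__(f"{label} at position {pos}")
        self.label = label
        self.pos = pos


def lit(s: str) -> Parser:
    """Match the exact string s."""
    def parse(text: str, pos: int) -> Optional[int]:
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def labeled(e: Parser, label: str) -> Parser:
    """Model e^label as e / throw(label)."""
    def parse(text: str, pos: int) -> Optional[int]:
        end = e(text, pos)
        if end is None:
            raise LabeledFailure(label, pos)  # the throw(label) branch
        return end
    return parse


close_bracket = labeled(lit("]"), "closeBracket")  # models ']'^closeBracket

print(close_bracket("]", 0))  # 1
try:
    close_bracket("x", 0)
except LabeledFailure as err:
    print(err)  # closeBracket at position 0
```

Unlike an ordinary failure, the raised label is not caught by an enclosing ordered choice in this model, so the error surfaces with the label instead of silently falling through to another alternative.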

Recovery Expressions

Recovery rules let the parser continue after errors, producing a partial parse tree with error nodes:

// Main grammar
JSON  <- Value^jsonValue EOF^eof
Array <- '[' (Value (',' Value^item)*)? ']'^close

// Recovery expressions (lowercase by convention)
jsonValue <- .*           // Consume rest of input on failure
item      <- (![,\]] .)*  // Skip until comma or bracket
close     <-              // Empty: just mark the error
eof       <- .*

When a labeled expression fails and a recovery rule with that label exists, the parser:

  1. Executes the recovery expression
  2. Records an error node in the tree
  3. Continues parsing

This enables "keep parsing" behavior useful for IDEs and linters.
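
The three steps above can be sketched in Python for a simplified version of the Array rule. This is a model, not langlang's implementation: Value is reduced to [0-9]+, whitespace handling is ignored, the result is a pair of lists rather than a parse tree, and the names value, recover_item, and parse_array are hypothetical.

```python
from typing import Optional


def value(text: str, pos: int) -> Optional[int]:
    """Value <- [0-9]+ (a simplified stand-in for the real Value production)."""
    end = pos
    while end < len(text) and text[end] in "0123456789":
        end += 1
    return end if end > pos else None


def recover_item(text: str, pos: int) -> int:
    # Models the recovery rule `item`: skip until a comma, a closing bracket,
    # or the end of input.
    while pos < len(text) and text[pos] not in ",]":
        pos += 1
    return pos


def parse_array(text: str):
    """Return (values, errors), or None if the input does not start with '['."""
    values, errors = [], []
    if not text.startswith("["):
        return None  # no recovery rule covers a missing '[' in this sketch
    pos = 1
    end = value(text, pos)
    if end is not None:
        values.append(text[pos:end])
        pos = end
        while text.startswith(",", pos):
            pos += 1
            end = value(text, pos)
            if end is None:
                # Value^item failed and a rule named `item` exists:
                end = recover_item(text, pos)  # 1. run the recovery expression
                errors.append(("item", pos))   # 2. record an error node
            else:
                values.append(text[pos:end])
            pos = end                          # 3. continue parsing
    if text.startswith("]", pos):
        pos += 1
    else:
        errors.append(("close", pos))  # `close` is empty: consume nothing
    return values, errors


print(parse_array("[1,x,3]"))  # (['1', '3'], [('item', 3)])
print(parse_array("[1,2"))     # (['1', '2'], [('close', 4)])
```

In both cases the valid values are still collected, which is the partial result an editor or linter can work with.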

Import System

Split large grammars across files using imports:

// file: main.peg
@import Value from "./value.peg"
@import String from "./string.peg"

Document <- Value+

// file: value.peg
Value  <- Number / String
Number <- [0-9]+

Imported productions and all their dependencies are merged into the importing grammar.

Import Syntax

@import ProductionName from "path/to/file.peg"

  • Paths are relative to the importing file
  • Only the named production (and its dependencies) is imported
  • Multiple imports can reference the same file
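
"The named production and its dependencies" amounts to a transitive closure over production references. A minimal Python sketch follows, assuming a grammar file is represented as a mapping from production name to the names it references; that representation, the function import_production, and the sample file contents are all hypothetical.

```python
def import_production(name: str, source: dict[str, list[str]]) -> set[str]:
    """Collect `name` and every production it references, directly or indirectly."""
    imported: set[str] = set()
    pending = [name]
    while pending:
        current = pending.pop()
        if current in imported:
            continue  # already collected; this also makes recursive rules safe
        if current not in source:
            raise KeyError(f"production {current!r} not found")
        imported.add(current)
        pending.extend(source[current])
    return imported


# Hypothetical contents of a string.peg file: name -> referenced productions.
string_peg = {
    "String": ["DQ", "Char"],
    "Char": [],
    "DQ": [],
    "Comment": [],  # not reachable from String, so it is not imported
}

print(sorted(import_production("String", string_peg)))  # ['Char', 'DQ', 'String']
```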

Operator Reference

Operator    Name             Description
'...'       Literal          Match exact string (supports Unicode)
[...]       Class            Match character from set (supports Unicode ranges)
.           Any              Match any single character (including Unicode)
e1 e2       Sequence         Match e1 then e2
e1 / e2     Ordered Choice   Try e1, then e2 if e1 fails
e*          Zero or More     Match e zero or more times
e+          One or More      Match e one or more times
e?          Optional         Match e zero or one time
!e          Not Predicate    Succeed if e fails (no consume)
&e          And Predicate    Succeed if e matches (no consume)
#e          Lexification     Disable auto-whitespace for e
e^label     Label            Attach error label to e
(e)         Grouping         Group expressions

Tips and Best Practices

Avoid Common Pitfalls

Greedy Matching

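Repetitions are greedy and never backtrack: they consume as much as possible and do not give characters back to the expressions that follow.

// WRONG: .* consumes everything, including the closing quote, so the final '"' can never match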
BadString <- '"' .* '"'

// CORRECT: stop at the quote
GoodString <- '"' (!'"' .)* '"'
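
The difference from regular expressions is easy to see side by side. The Python sketch below hand-codes both productions and compares them with Python's backtracking re module; bad_string and good_string are hypothetical names that model only the matching behavior.

```python
import re
from typing import Optional


def bad_string(text: str) -> Optional[int]:
    # Model of: BadString <- '"' .* '"'
    if not text.startswith('"'):
        return None
    pos = len(text)  # .* runs to the end of input and keeps everything it consumed
    return pos + 1 if text.startswith('"', pos) else None  # nothing is left to match


def good_string(text: str) -> Optional[int]:
    # Model of: GoodString <- '"' (!'"' .)* '"'
    if not text.startswith('"'):
        return None
    pos = 1
    while pos < len(text) and text[pos] != '"':
        pos += 1  # the predicate stops the loop right before the closing quote
    return pos + 1 if text.startswith('"', pos) else None


print(bad_string('"hi"'))   # None
print(good_string('"hi"'))  # 4
# A backtracking regex engine gives the last quote back, so the same pattern matches:
print(re.fullmatch(r'".*"', '"hi"') is not None)  # True
```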

Order Matters

In ordered choice, put specific matches before general ones:

// WRONG: 'if' never matches because Identifier matches first
Keyword <- Identifier / 'if' / 'else'

// CORRECT: keywords before identifier
Keyword <- 'if' / 'else' / Identifier