Reference
Complete guide to the langlang grammar syntax.
Overview
langlang uses a grammar format based on Parsing Expression Grammars (PEG). If you're familiar with regular expressions or BNF notation, you'll find the syntax intuitive.
A grammar consists of productions (also called rules) that define how to parse input. The parser starts from the first production and works its way through the grammar recursively.
Productions
A production has a name on the left, an arrow, and an expression on the right:
ProductionName <- Expression
The first production in a grammar is the start rule: parsing begins there.
Terminals
Terminals match actual characters in the input.
Literals
Match exact character sequences using single or double quotes:
Keyword <- 'function'
Greeting <- "hello"
Both quote styles are equivalent. Use whichever makes your grammar more readable (e.g., double quotes if the literal contains single quotes).
Escape Sequences
Standard escape sequences work inside literals:
| Sequence | Meaning |
|---|---|
| \n | Newline |
| \r | Carriage return |
| \t | Tab |
| \\ | Backslash |
| \' | Single quote |
| \" | Double quote |
| \u{XXXX} | Unicode code point |
Newline <- '\n'
Tab <- '\t'
Character Classes
Match any single character from a set using square brackets:
Digit <- [0-9]
Letter <- [a-zA-Z]
Hex <- [0-9a-fA-F]
[a-z] matches any character from 'a' to 'z' (inclusive). You can combine multiple ranges and individual characters:
// Matches digits, letters, and underscore
Identifier <- [a-zA-Z_0-9]
Any Character
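The dot . matches any single character; it fails only at end of input:
// Match an escaped character, or anything that is not a closing quote (using the not predicate).
// The escape alternative comes first; otherwise the backslash would be consumed as a plain character.
StringChar <- '\\' . / !'"' .
Unicode Support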
Unicode Literals
langlang has full Unicode support. You can match characters from any script, including CJK, Arabic, Cyrillic, and even emoji, directly in your grammars.
Konnichiwa <- 'こんにちは' // Japanese greeting
Annyeong <- '안녕하세요' // Korean greeting
Marhaba <- 'مرحبا' // Arabic greeting
Privet <- 'привет' // Russian greeting
Unicode in Character Classes
Character classes work with Unicode characters, including ranges:
Hiragana <- [ぁ-ん] // Match a single Hiragana character
Katakana <- [ァ-ン] // Match a single Katakana character
Hangul <- [가-힣] // Match a single Hangul syllable
Cyrillic <- [а-я] // Match Cyrillic lowercase letters
MixedWord <- [a-zA-Zあ-ん]+ // Mix Unicode with ASCII
Emoji and Non-BMP Characters
langlang supports characters outside the Basic Multilingual Plane (BMP), including most emoji. These characters have code points above U+FFFF and take 4 bytes in UTF-8.
// Match specific emoji
Brain <- '🧠'
Heart <- '❤'
// Match a range of emoji from brain (U+1F9E0) to DNA (U+1F9EC); the range includes test tube and petri dish
ScienceEmoji <- [🧠-🧬]
// Combine emoji with text
Reaction <- '👍' / '👎' / '❤' / '😂'
For example, [🧠-🧬] matches any character from U+1F9E0 (brain) to U+1F9EC (DNA double helix).
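The code point and byte-size claims above can be checked outside langlang. The following is a minimal Python sketch, not langlang code; the helper names describe and in_range are hypothetical, and in_range assumes that a class range is a plain comparison of code points.

```python
# Characters are written as escapes so the code points are explicit.
BRAIN = "\U0001F9E0"      # brain
DNA = "\U0001F9EC"        # DNA double helix
TEST_TUBE = "\U0001F9EA"  # test tube
MICROBE = "\U0001F9A0"    # microbe
HEART = "\u2764"          # heavy black heart, which lies inside the BMP


def describe(ch):
    """Return (code point as 'U+XXXX', UTF-8 length in bytes) for one character."""
    return f"U+{ord(ch):04X}", len(ch.encode("utf-8"))


def in_range(ch, lo, hi):
    """Assumed model of a class range [lo-hi]: compare code points."""
    return ord(lo) <= ord(ch) <= ord(hi)


print(describe(BRAIN))                  # ('U+1F9E0', 4)
print(describe(HEART))                  # ('U+2764', 3)
print(in_range(TEST_TUBE, BRAIN, DNA))  # True
print(in_range(MICROBE, BRAIN, DNA))    # False
```

Under that assumption, the range covers every code point between its two endpoints, not only the thematically related emoji.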
Composition Operators
Build complex expressions from simpler ones.
Sequence
Expressions separated by whitespace must match in order:
// Matches "let x = 5"
LetStatement <- 'let' Identifier '=' Expression
Ordered Choice
The / operator tries alternatives left-to-right, stopping at the first match:
Value <- Number / String / Boolean
Unlike regular expressions or CFGs, PEG choices are ordered and unambiguous: the first matching alternative wins.
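To make "the first matching alternative wins" concrete, here is a minimal Python sketch of ordered choice. It models the general PEG rule only; it is not how langlang is implemented, and the helper names lit and choice are hypothetical. A parser here is a function that takes the text and a position and returns the new position, or None on failure.

```python
def lit(s):
    """Parser for an exact string; returns the new position or None."""
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None


def choice(*alts):
    """Ordered choice: try alternatives left to right, keep the first success."""
    def parse(text, pos):
        for alt in alts:
            end = alt(text, pos)
            if end is not None:
                return end  # later alternatives are never tried
        return None
    return parse


short_first = choice(lit("in"), lit("int"))
long_first = choice(lit("int"), lit("in"))

print(short_first("int", 0))  # 2
print(long_first("int", 0))   # 3
```

With the shorter literal first, the choice stops after "in" and never considers "int", which is why alternative order matters.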
Grouping
Use parentheses to group sub-expressions:
Ambiguous <- 'a' / 'b' 'c' // Without grouping: matches 'a' OR 'bc'
Clear <- 'a' ('b' / 'c') // With grouping: matches 'ab' OR 'ac'
Repetition Operators
Control how many times an expression matches.
Zero or More (*)
Match zero or more occurrences:
As <- 'a'* // Matches "", "a", "aaa", ...
List <- Value (',' Value)* // Matches comma-separated values
Zero-or-more never fails: it succeeds even without matching anything.
One or More (+)
Match one or more occurrences:
As <- 'a'+ // Matches "a", "aa", "aaa", ... (but not "")
Digits <- [0-9]+ // Must have at least one digit
This is syntactic sugar for e e*.
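The relationship between the two repetition operators can be sketched in a few lines of Python. This is an illustrative model of the PEG semantics described above, not langlang's implementation; lit, seq, star and plus are hypothetical helper names.

```python
def lit(s):
    """Parser for an exact string; returns the new position or None."""
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None


def seq(*parts):
    """Sequence: every part must match, each starting where the last ended."""
    def parse(text, pos):
        for part in parts:
            pos = part(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def star(p):
    """Zero or more: repeat p until it fails; never fails itself."""
    def parse(text, pos):
        while True:
            end = p(text, pos)
            if end is None or end == pos:  # stop on failure or no progress
                return pos
            pos = end
    return parse


def plus(p):
    """One or more, written as the sugar it stands for: p p*."""
    return seq(p, star(p))


a = lit("a")
print(star(a)("bbb", 0))   # 0
print(plus(a)("bbb", 0))   # None
print(plus(a)("aaab", 0))  # 3
```

On input with no leading 'a', star succeeds at the starting position while plus fails, matching the description above.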
Optional (?)
Match zero or one occurrence:
Color <- 'colo' 'u'? 'r' // Matches "color" or "colour"
SignedInt <- [+-]? Digits // Optional sign before number
Predicates
Predicates look ahead without consuming input.
Not Predicate (!)
Succeeds if the expression fails to match (without consuming input):
NotQuote <- !'"' . // Match any char that's not a quote
Identifier <- !Keyword [a-zA-Z_][a-zA-Z0-9_]* // Match word that's not a keyword
This enables unlimited lookahead: you can check arbitrarily complex conditions before committing to a match.
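As a rough illustration of "succeeds if the expression fails, without consuming input", the not predicate can be modelled in Python as follows. This is a sketch of the general PEG behaviour, not langlang internals; the helpers lit, any_char, seq and not_ are hypothetical names.

```python
def lit(s):
    """Parser for an exact string; returns the new position or None."""
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None


def any_char(text, pos):
    """The dot: consume one character, fail at end of input."""
    return pos + 1 if pos < len(text) else None


def seq(*parts):
    """Sequence: every part must match, each starting where the last ended."""
    def parse(text, pos):
        for part in parts:
            pos = part(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def not_(p):
    """Not predicate: succeed at the same position only if p fails there."""
    return lambda text, pos: pos if p(text, pos) is None else None


# Models NotQuote <- !'"' .
not_quote = seq(not_(lit('"')), any_char)

print(not_quote('a"', 0))          # 1
print(not_quote('"a', 0))          # None
print(not_(lit("if"))("else", 0))  # 0
```

The last call shows the lookahead succeeding while the position stays at 0, so nothing was consumed.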
// Classic PEG example: match content between brackets
BracketString <- '[' (!']' .)* ']'
And Predicate (&)
Succeeds if the expression matches (without consuming input):
// Only match if followed by specific char
SpecialA <- 'a' &'('
Use & when you want to assert a positive condition before committing to a branch. It reads more clearly than a double negation (!!e) in those cases:
// Assert the identifier starts with a valid character before consuming
Identifier <- &[a-zA-Z_] [a-zA-Z_0-9]+
// Gate on a delimiter that signals the format to expect
TypedValue <- &'{' Object / &'[' Array / Scalar
Whitespace Control
Automatic Whitespace Handling
langlang automatically inserts whitespace handling between elements of non-syntactic productions. A production is syntactic if all its expressions lead only to terminal matches.
// Non-syntactic: calls other production, gets auto-spacing
Expr <- Int '+' Int
// Syntactic: only terminals, no auto-spacing
Int <- [0-9]+
For Expr, the parser automatically allows whitespace before the first Int, '+', and the second Int. So "1+2" and "1 + 2" both parse successfully.
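One way to picture the automatic spacing described above is a hand-written Python parser that skips whitespace before each element of Expr but never inside Int. This is only a sketch under that assumption; the function names are hypothetical and langlang's actual whitespace rule may differ in detail.

```python
def skip_ws(text, pos):
    """Advance past spaces, tabs and newlines."""
    while pos < len(text) and text[pos] in " \t\r\n":
        pos += 1
    return pos


def parse_int(text, pos):
    """Int <- [0-9]+ (syntactic: no whitespace is skipped between digits)."""
    end = pos
    while end < len(text) and text[end] in "0123456789":
        end += 1
    return end if end > pos else None


def parse_plus(text, pos):
    """The '+' literal."""
    return pos + 1 if text.startswith("+", pos) else None


def parse_expr(text, pos=0):
    """Expr <- Int '+' Int, with whitespace skipped before every element."""
    for element in (parse_int, parse_plus, parse_int):
        pos = element(text, skip_ws(text, pos))
        if pos is None:
            return None
    return pos


print(parse_expr("1+2"))      # 3
print(parse_expr("1 + 2"))    # 5
print(parse_expr("1 2 + 3"))  # None
```

The third input fails because the space splits what would have to be a single Int, and Int itself does not skip whitespace.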
Disabling Auto-Spacing (#)
Use # to disable automatic whitespace handling for an expression:
// Without #: "3 rd" would match (spaces between number and suffix)
// With #: space between number and suffix causes failure
Ordinal <- Decimal #('st' / 'nd' / 'rd' / 'th')
Decimal <- [0-9]+
| Input | Without # | With # |
|---|---|---|
| "3rd" | ✓ matches | ✓ matches |
| "3 rd" | ✓ matches | ✗ fails |
| " 3rd" | ✓ matches | ✓ matches |
A common use case is string literals where internal spaces matter:
// Non-syntactic (calls DQ production)
String <- DQ #((!DQ .)* DQ)
DQ <- '"'
Error Handling
Labels
Attach error labels to expressions using ^:
Array <- '[' (Value (',' Value^itemExpected)*)? ']'^closeBracket
When parsing fails at a labeled expression, the label appears in the error message instead of a generic "expected X" message. Labels are syntactic sugar for throwing a failure if the labeled expression fails. For example, Exp^label is equivalent to Exp / throw(label).
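The desugaring of a label can be illustrated with a small Python sketch in which throw(label) is modelled as raising an exception. The LabeledFailure class and the labeled helper are hypothetical names chosen for this example; they are not part of langlang.

```python
class LabeledFailure(Exception):
    """Stands in for throw(label): carries the label and failure position."""

    def __init__(self, label, pos):
        super().__init__(f"{label} at position {pos}")
        self.label = label
        self.pos = pos


def lit(s):
    """Parser for an exact string; returns the new position or None."""
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None


def labeled(p, label):
    """Model e^label as e / throw(label)."""
    def parse(text, pos):
        end = p(text, pos)
        if end is None:
            raise LabeledFailure(label, pos)  # second alternative: throw
        return end
    return parse


close = labeled(lit("]"), "closeBracket")
print(close("]", 0))  # 1

try:
    close("}", 0)
except LabeledFailure as err:
    print(err.label, err.pos)  # closeBracket 0
```

An ordinary failure returns None and lets an enclosing choice try another alternative, whereas the labeled failure propagates with its label attached.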
Recovery Expressions
Recovery rules let the parser continue after errors, producing a partial parse tree with error nodes:
// Main grammar
JSON <- Value^jsonValue EOF^eof
Array <- '[' (Value (',' Value^item)*)? ']'^close
// Recovery expressions (lowercase by convention)
jsonValue <- .* // Consume rest of input on failure
item <- (![,\]] .)* // Skip until comma or bracket
close <- // Empty: just mark the error
eof <- .*
When a labeled expression fails and a recovery rule with that label exists, the parser:
- Executes the recovery expression
- Records an error node in the tree
- Continues parsing
This enables "keep parsing" behavior useful for IDEs and linters.
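The three steps above can be sketched in Python for a simplified Array rule whose values are plain digits. This is an assumption-laden model, not langlang's implementation: the RECOVERY table, the errors list and all function names are hypothetical, and error nodes are reduced to (label, position) pairs.

```python
def lit(s):
    """Parser for an exact string; returns the new position or None."""
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None


def digits(text, pos):
    """Simplified Value: one or more ASCII digits."""
    end = pos
    while end < len(text) and text[end] in "0123456789":
        end += 1
    return end if end > pos else None


def skip_until(stops):
    # Models a recovery expression such as (![,\]] .)*
    def parse(text, pos):
        while pos < len(text) and text[pos] not in stops:
            pos += 1
        return pos
    return parse


# Recovery rules keyed by label; "close" is empty and consumes nothing.
RECOVERY = {"item": skip_until(",]"), "close": lambda text, pos: pos}


def labeled(p, label, errors):
    """On failure: run the recovery rule, record the error, keep going."""
    def parse(text, pos):
        end = p(text, pos)
        if end is not None:
            return end
        recover = RECOVERY.get(label)
        if recover is None:
            raise SyntaxError(f"{label} at position {pos}")
        end = recover(text, pos)      # 1. execute the recovery expression
        errors.append((label, pos))   # 2. record an error entry
        return end                    # 3. continue parsing from here
    return parse


def parse_array(text):
    """Array <- '[' (Value (',' Value^item)*)? ']'^close"""
    errors = []
    item = labeled(digits, "item", errors)
    close = labeled(lit("]"), "close", errors)
    pos = lit("[")(text, 0)
    if pos is None:
        return None, errors
    first = digits(text, pos)
    if first is not None:
        pos = first
        while text.startswith(",", pos):
            pos = item(text, pos + 1)
    return close(text, pos), errors


print(parse_array("[1,2]"))    # (5, [])
print(parse_array("[1,x,3]"))  # (7, [('item', 3)])
print(parse_array("[1,2"))     # (4, [('close', 4)])
```

In the second call the bad item is skipped up to the next comma and parsing still reaches the closing bracket, which is the "keep parsing" behavior described above.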
Import System
Split large grammars across files using imports:
// file: main.peg
@import Value from "./value.peg"
@import String from "./string.peg"
Document <- Value+
// file: value.peg
Value <- Number / String
Number <- [0-9]+
Imported productions and all their dependencies are merged into the importing grammar.
Import Syntax
@import ProductionName from "path/to/file.peg"
- Paths are relative to the importing file
- Only the named production and its dependencies are imported
- Multiple imports can reference the same file
Operator Reference
| Operator | Name | Description |
|---|---|---|
| '...' | Literal | Match exact string (supports Unicode) |
| [...] | Class | Match character from set (supports Unicode ranges) |
| . | Any | Match any single character (including Unicode) |
| e1 e2 | Sequence | Match e1 then e2 |
| e1 / e2 | Ordered Choice | Try e1, then e2 if e1 fails |
| e* | Zero or More | Match e zero or more times |
| e+ | One or More | Match e one or more times |
| e? | Optional | Match e zero or one time |
| !e | Not Predicate | Succeed if e fails (no consume) |
| &e | And Predicate | Succeed if e matches (no consume) |
| #e | Lexification | Disable auto-whitespace for e |
| e^label | Label | Attach error label to e |
| (e) | Grouping | Group expressions |
Tips and Best Practices
Avoid Common Pitfalls
Greedy Matching
Repetitions are greedy: they consume as much as possible and never give input back:
// WRONG: .* consumes the rest of the input, including the closing quote, so the final '"' can never match
BadString <- '"' .* '"'
// CORRECT: stop at the quote
GoodString <- '"' (!'"' .)* '"'
Order Matters
In ordered choice, put specific matches before general ones:
// WRONG: 'if' never matches because Identifier matches first
Keyword <- Identifier / 'if' / 'else'
// CORRECT: keywords before identifier
Keyword <- 'if' / 'else' / Identifier
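Both pitfalls above can be reproduced with a small Python model of PEG matching. This is a sketch of the general semantics, not langlang itself; every helper name is hypothetical, and Identifier is simplified to lowercase ASCII letters.

```python
def lit(s):
    """Parser for an exact string; returns the new position or None."""
    return lambda text, pos: pos + len(s) if text.startswith(s, pos) else None


def any_char(text, pos):
    """The dot: consume one character, fail at end of input."""
    return pos + 1 if pos < len(text) else None


def letter(text, pos):
    """Simplified identifier character: a lowercase ASCII letter."""
    return pos + 1 if pos < len(text) and "a" <= text[pos] <= "z" else None


def seq(*parts):
    """Sequence: every part must match, each starting where the last ended."""
    def parse(text, pos):
        for part in parts:
            pos = part(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def star(p):
    """Greedy zero or more: consumes as much as it can, never backtracks."""
    def parse(text, pos):
        while True:
            end = p(text, pos)
            if end is None or end == pos:
                return pos
            pos = end
    return parse


def not_(p):
    """Not predicate: succeed at the same position only if p fails there."""
    return lambda text, pos: pos if p(text, pos) is None else None


def first_match(text, alternatives):
    """Ordered choice over (name, parser) pairs; report which name won."""
    for name, parser in alternatives:
        if parser(text, 0) is not None:
            return name
    return None


quote = lit('"')
bad_string = seq(quote, star(any_char), quote)
good_string = seq(quote, star(seq(not_(quote), any_char)), quote)
identifier = seq(letter, star(letter))

print(bad_string('"hi"', 0))   # None
print(good_string('"hi"', 0))  # 4
print(first_match("if", [("Identifier", identifier), ("if", lit("if"))]))  # Identifier
print(first_match("if", [("if", lit("if")), ("Identifier", identifier)]))  # if
```

The bad string rule fails outright because the repetition has already consumed the closing quote, and the misordered choice reports Identifier even though the input is a keyword.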