Introduction to syntax trees
unified uses abstract syntax trees (ASTs) that plugins can work on. This guide introduces what ASTs are and how to work with them.
What is an AST?
An abstract syntax tree (AST) is a tree representation of the syntax of programming languages. For us that’s typically markup languages.
As a JavaScript developer you may already know things that are like ASTs: The DOM and React’s virtual DOM. Or you may have heard of Babel, ESLint, PostCSS, Prettier, or TypeScript. They all use ASTs to inspect and transform code.
In unified, we support several ASTs. The reason for different ASTs is that each markup language has several aspects that do not translate 1-to-1 to other markup languages. Taking markdown and HTML as an example, in some cases markdown has more info than HTML: markdown has several ways to add a link (“autolinks”: <https://url>, resource links: [label](url), and reference links with definitions: [label][id] and [id]: url). In other cases, HTML has more info than markdown. It has many tags, which add new meaning (semantics), that aren’t available in markdown. If there were one AST, it would be quite hard to do the tasks that several remark and rehype plugins now do.
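To make the link example concrete, here is a rough sketch of hand-written nodes, assuming the node shapes described by the mdast (markdown) and hast (HTML) specifications. The URL, label, and identifier are placeholders chosen for illustration; nothing here is produced by a parser.

```javascript
// mdast: a resource link, as in `[Pluto](https://example.com)`.
const resourceLink = {
  type: 'link',
  url: 'https://example.com', // Placeholder URL (assumption).
  title: null,
  children: [{type: 'text', value: 'Pluto'}]
}

// mdast: a reference link, as in `[Pluto][pluto]`…
const referenceLink = {
  type: 'linkReference',
  identifier: 'pluto',
  label: 'pluto',
  referenceType: 'full',
  children: [{type: 'text', value: 'Pluto'}]
}

// …and its definition, as in `[pluto]: https://example.com`.
const definition = {
  type: 'definition',
  identifier: 'pluto',
  label: 'pluto',
  url: 'https://example.com',
  title: null
}

// hast: in HTML both forms end up as one `<a>` element, so the
// distinction between the two markdown forms is no longer there.
const anchor = {
  type: 'element',
  tagName: 'a',
  properties: {href: 'https://example.com'},
  children: [{type: 'text', value: 'Pluto'}]
}

console.log([resourceLink.type, referenceLink.type, definition.type, anchor.type])
```

The markdown tree keeps apart how the link was written, while the HTML tree only records the resulting element. That difference is the kind of detail a single shared AST would have to drop or awkwardly encode.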
See “How to build a syntax tree” for more info on how to make a tree. See “Syntax trees in TypeScript” on how to work with ASTs in TypeScript.
What is unist?
But all our ASTs have things in common. The bit in common is called unist. By having a shared interface, we can also share tools that work on all ASTs. In practice, that means you can use for example unist-util-visit to visit nodes in any supported AST.
See “Tree traversal” for more info on unist-util-visit.
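As a rough illustration of why a shared interface helps, the sketch below uses a hypothetical `walk` helper (a simplified stand-in, not unist-util-visit itself) on two hand-written trees: one shaped like a markdown tree and one shaped like an HTML tree. It relies only on the `type` and `children` fields that both have in common.

```javascript
// Hypothetical helper for illustration: a tiny depth-first, preorder walker.
// It only assumes that parents keep their child nodes in `children`.
function walk(node, visitor) {
  visitor(node)
  for (const child of node.children || []) {
    walk(child, visitor)
  }
}

// Hand-written tree shaped like a markdown AST.
const markdownTree = {
  type: 'root',
  children: [
    {type: 'heading', depth: 1, children: [{type: 'text', value: 'Pluto'}]},
    {type: 'paragraph', children: [{type: 'text', value: 'A dwarf planet.'}]}
  ]
}

// Hand-written tree shaped like an HTML AST.
const htmlTree = {
  type: 'root',
  children: [
    {type: 'element', tagName: 'h1', properties: {}, children: [{type: 'text', value: 'Pluto'}]},
    {type: 'element', tagName: 'p', properties: {}, children: [{type: 'text', value: 'A dwarf planet.'}]}
  ]
}

// The same function works on both trees.
function collectTypes(tree) {
  const types = []
  walk(tree, function (node) {
    types.push(node.type)
  })
  return types
}

console.log(collectTypes(markdownTree).join(', ')) //=> root, heading, text, paragraph, text
console.log(collectTypes(htmlTree).join(', ')) //=> root, element, text, element, text
```

In real projects unist-util-visit is the better choice, as it also supports tests, early exit, and changing the tree while walking.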
unist is different from the ASTs used in other tools. The most noticeable difference is that it uses a particular set of names for things: type, children, position. Perhaps harder to see is that it’s compatible with JSON: it’s all objects, arrays, strings, and numbers. Where other tools use instances with methods, we use plain data. Years ago in retext we started out with instances and methods too. But we found that we preferred to be able to read and write a tree from/to a JSON file, to treat ASTs as data, and to use more functional utilities.
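A minimal sketch of what “plain data” means in practice: the tree below is written by hand (it is not parser output) for the markdown `# Pluto`, and it survives a round trip through JSON unchanged because it contains only objects, arrays, strings, and numbers.

```javascript
// Hand-written tree for the markdown `# Pluto`.
const tree = {
  type: 'root',
  children: [
    {
      type: 'heading',
      depth: 1,
      children: [{type: 'text', value: 'Pluto'}],
      // Where the node was found in the source text.
      position: {
        start: {line: 1, column: 1, offset: 0},
        end: {line: 1, column: 8, offset: 7}
      }
    }
  ]
}

// Serialize to a string (which could be written to a file) and read it back.
const serialized = JSON.stringify(tree)
const restored = JSON.parse(serialized)

console.log(restored.children[0].type) //=> 'heading'
console.log(JSON.stringify(restored) === serialized) //=> true
```

A tree made of class instances with methods would lose those methods in the same round trip, which is one reason to keep behavior in separate utilities.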
When to use an AST?
You can use an AST when you want to inspect or transform content.
Say you wanted to count the number of headings in a markdown file. You could also do that with a regex:
const value = `# Pluto

Pluto is a dwarf planet in the Kuiper belt.

## History

### Discovery

In the 1840s, Urbain Le Verrier used Newtonian mechanics to predict the
position of…`

const expression = /^#+[^\r\n]+/gm
const headings = [...value.matchAll(expression)].length
console.log(headings) //=> 3
But what if the headings were in a code block? Or if Setext headings were used instead of ATX headings? The grammar of markdown is more complex than a regex can handle. That’s where an AST can help.
import {fromMarkdown} from 'mdast-util-from-markdown'
import {visit} from 'unist-util-visit'

const value = `# Pluto

Pluto is a dwarf planet in the Kuiper belt.

## History

### Discovery

In the 1840s, Urbain Le Verrier used Newtonian mechanics to predict the
position of…`

const tree = fromMarkdown(value)
let headings = 0

visit(tree, 'heading', function () {
  headings++
})

console.log(headings) //=> 3
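The two problem cases mentioned earlier (a `#` line inside a code block, and Setext headings) can be reproduced with the regex alone. This is a dependency-free sketch; the sample documents are made up for illustration, and the “actual” counts in the comments follow from how CommonMark defines fenced code and Setext headings.

```javascript
const expression = /^#+[^\r\n]+/gm

function countWithRegex(value) {
  return [...value.matchAll(expression)].length
}

// One real heading, plus a shell comment inside a fenced code block.
const withCodeBlock = [
  '# Pluto',
  '',
  '~~~sh',
  '# install dependencies',
  'npm install',
  '~~~'
].join('\n')

// Two real headings, written in Setext style (underlined with `=` or `-`).
const withSetext = [
  'Pluto',
  '=====',
  '',
  'History',
  '-------'
].join('\n')

console.log(countWithRegex(withCodeBlock)) //=> 2, but there is only 1 heading
console.log(countWithRegex(withSetext)) //=> 0, but there are 2 headings
```

The regex overcounts in the first case and undercounts in the second, because it looks at lines in isolation. A parser knows which construct each line belongs to, which is why counting `heading` nodes in the tree is more reliable.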
See “Tree traversal” for more info on unist-util-visit.