Introduction to syntax trees

unified uses abstract syntax trees (ASTs), which plugins work on. This guide introduces what ASTs are and how to work with them.

What is an AST?
What is unist?
When to use an AST?

What is an AST?

An abstract syntax tree (AST) is a tree representation of the syntax of source text, most often of a programming language. For us, the source is typically a markup language.

As a JavaScript developer, you may already know things that are like ASTs: the DOM and React’s virtual DOM. Or you may have heard of Babel, ESLint, PostCSS, Prettier, or TypeScript. They all use ASTs to inspect and transform code.
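To make this concrete, here is a rough sketch of what a tree for the markdown `# Hello *world*` could look like. The node names follow mdast (the markdown AST used by remark), but the tree is typed out by hand rather than produced by a parser, and `outline` is a hypothetical helper that exists only for this illustration.

```javascript
// Hand-written tree for the markdown `# Hello *world*`.
// Node names follow mdast; positional info is left out for brevity.
const tree = {
  type: 'root',
  children: [
    {
      type: 'heading',
      depth: 1,
      children: [
        {type: 'text', value: 'Hello '},
        {type: 'emphasis', children: [{type: 'text', value: 'world'}]}
      ]
    }
  ]
}

// Hypothetical helper: list the type of every node, indented by nesting level.
function outline(node, level = 0) {
  const line = '  '.repeat(level) + node.type
  const nested = (node.children || []).flatMap((child) =>
    outline(child, level + 1)
  )
  return [line, ...nested]
}

console.log(outline(tree).join('\n'))
// root
//   heading
//     text
//     emphasis
//       text
```

The nesting of the output mirrors the nesting of the markup: the emphasis sits inside the heading, and the word it emphasizes sits inside the emphasis.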

In unified, we support several ASTs. There are different ASTs because each markup language has aspects that do not translate 1-to-1 to other markup languages. Take markdown and HTML as an example. In some cases, markdown has more info than HTML: it has several ways to add a link (“autolinks”: <https://url>, resource links: [label](url), and reference links with definitions: [label][id] and [id]: url). In other cases, HTML has more info than markdown: it has many tags, which add meaning (semantics), that aren’t available in markdown. If there were only one AST, it would be quite hard to do the tasks that several remark and rehype plugins now do.
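The link example can be sketched with hand-written nodes. The markdown nodes below use mdast names (`link`, `linkReference`, `definition`) and the HTML node uses the hast shape for elements, but all of them are simplified (fields such as `title`, `label`, and `position` are left out). The URL is a placeholder and `toAnchor` is a hypothetical helper, not how any remark or rehype package does the conversion.

```javascript
// Markdown side: a resource link, `[label](https://example.com)`.
const resourceLink = {
  type: 'link',
  url: 'https://example.com',
  children: [{type: 'text', value: 'label'}]
}

// Markdown side: a reference link, `[label][id]`, plus its definition,
// `[id]: https://example.com`.
const referenceLink = {
  type: 'linkReference',
  identifier: 'id',
  referenceType: 'full',
  children: [{type: 'text', value: 'label'}]
}
const definition = {
  type: 'definition',
  identifier: 'id',
  url: 'https://example.com'
}

// Hypothetical helper: turn either kind of markdown link into an HTML `a`
// element, looking up the URL in the definitions when needed.
function toAnchor(node, definitions) {
  const url =
    node.type === 'link'
      ? node.url
      : definitions.find((d) => d.identifier === node.identifier).url
  return {
    type: 'element',
    tagName: 'a',
    properties: {href: url},
    children: node.children
  }
}

const fromResource = toAnchor(resourceLink, [definition])
const fromReference = toAnchor(referenceLink, [definition])

// Both markdown constructs end up as the same HTML node.
console.log(JSON.stringify(fromResource) === JSON.stringify(fromReference)) //=> true
```

Once both links are HTML, nothing records which markdown syntax was used. A plugin that needs that distinction has to work on the markdown tree.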

See “How to build a syntax tree” for more info on how to make a tree. See “Syntax trees in TypeScript” on how to work with ASTs in TypeScript.

What is unist?

But all our ASTs have things in common. The bit in common is called unist. Because the ASTs share an interface, we can also share tools that work on all of them. In practice, that means you can use, for example, unist-util-visit to visit nodes in any supported AST.

See “Tree traversal” for more info on unist-util-visit.
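The idea of a shared interface can be sketched without any dependencies. The `walk` function below is a hypothetical, much simplified stand-in for a utility such as unist-util-visit (it is not that package’s implementation or API). It relies only on the `type` and `children` fields, so the same function works on a hand-written markdown tree and on a hand-written HTML tree.

```javascript
// Hypothetical walker: call `visitor` on a node and on all its descendants.
function walk(node, visitor) {
  visitor(node)
  for (const child of node.children || []) {
    walk(child, visitor)
  }
}

// Hand-written markdown tree (mdast-like) for `Hi`.
const markdownTree = {
  type: 'root',
  children: [{type: 'paragraph', children: [{type: 'text', value: 'Hi'}]}]
}

// Hand-written HTML tree (hast-like) for `<p>Hi</p>`.
const htmlTree = {
  type: 'root',
  children: [
    {
      type: 'element',
      tagName: 'p',
      properties: {},
      children: [{type: 'text', value: 'Hi'}]
    }
  ]
}

// One utility, written once, that collects text from either kind of tree.
function collectText(tree) {
  const values = []
  walk(tree, function (node) {
    if (node.type === 'text') values.push(node.value)
  })
  return values.join('')
}

console.log(collectText(markdownTree)) //=> Hi
console.log(collectText(htmlTree)) //=> Hi
```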

unist is different from the ASTs used in other tools. Most noticeably, it uses a particular set of names for things: type, children, position. Perhaps harder to see is that it’s compatible with JSON: it’s all objects, arrays, strings, and numbers. Where other tools use instances with methods, we use plain data. Years ago, retext started out with instances and methods too. But we found that we preferred to be able to read and write a tree from/to a JSON file, to treat ASTs as data, and to use more functional utilities.
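A minimal sketch of what “plain data” buys you, assuming a small hand-written tree: it survives a round trip through JSON unchanged, and it can be transformed with ordinary functions that return new objects. `mapTree` is a hypothetical helper for this example, not a unified utility.

```javascript
// A small hand-written tree: only objects, arrays, strings, and numbers.
const before = {
  type: 'root',
  children: [
    {type: 'heading', depth: 1, children: [{type: 'text', value: 'Pluto'}]}
  ]
}

// Because there are no class instances or methods, the tree can be written
// to a JSON string (or file) and read back without losing anything.
const json = JSON.stringify(before)
const after = JSON.parse(json)

console.log(after.children[0].children[0].value) //=> Pluto

// Hypothetical functional helper: build a new tree by applying `fn` to every
// node, without changing the tree that was passed in.
function mapTree(node, fn) {
  const next = fn(node)
  return next.children
    ? {...next, children: next.children.map((child) => mapTree(child, fn))}
    : next
}

const shouted = mapTree(after, function (node) {
  return node.type === 'text'
    ? {...node, value: node.value.toUpperCase()}
    : node
})

console.log(shouted.children[0].children[0].value) //=> PLUTO
console.log(after.children[0].children[0].value) //=> Pluto
```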

When to use an AST?

You can use an AST when you want to inspect or transform content.

Say you wanted to count the number of headings in a markdown file. You could also do that with a regex:

const value = `# Pluto

Pluto is a dwarf planet in the Kuiper belt.

## History

### Discovery

In the 1840s, Urbain Le Verrier used Newtonian mechanics to predict the
position of…`

const expression = /^#+[^\r\n]+/gm
const headings = [...value.matchAll(expression)].length

console.log(headings) //=> 3

But what if the headings were in a code block? Or if Setext headings were used instead of ATX headings? The grammar of markdown is more complex than a regex can handle. That’s where an AST can help.

import {fromMarkdown} from 'mdast-util-from-markdown'
import {visit} from 'unist-util-visit'

const value = `# Pluto

Pluto is a dwarf planet in the Kuiper belt.

## History

### Discovery

In the 1840s, Urbain Le Verrier used Newtonian mechanics to predict the
position of…`

const tree = fromMarkdown(value)

let headings = 0

visit(tree, 'heading', function () {
  headings++
})

console.log(headings) //=> 3
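The two problem cases mentioned above (a heading-like line inside code, and Setext headings) can be checked against the regex approach directly. This is a small sketch with made-up sample documents and a hypothetical `count` helper. The fenced block uses `~~~`, which markdown accepts as a code fence.

```javascript
// The same expression as in the regex example.
const expression = /^#+[^\r\n]+/gm

// One real heading, plus a shell comment inside a fenced code block.
const fenced = `# Pluto

~~~sh
# a shell comment, not a heading
~~~
`

// Two real headings, written in Setext style (underlined with `=` or `-`).
const setext = `Pluto
=====

History
-------
`

// Hypothetical helper: count regex matches in a document.
function count(value) {
  return [...value.matchAll(expression)].length
}

console.log(count(fenced)) //=> 2, but the document has 1 heading
console.log(count(setext)) //=> 0, but the document has 2 headings
```

A markdown parser knows about code blocks and Setext underlines, so counting `heading` nodes in the tree avoids both miscounts.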
