February 20, 2022

Rebuilding Babel: The Tokenizer

How does a tokenizer convert a code string into a list of tokens?

A few weekends ago, I spent some time rebuilding the Babel compiler from scratch to learn a bit more about how it works internally. You see, I knew enough about compilers to know how to use them (like in my debugger post), but I didn't know enough to know how to make one from scratch.

In this post, we'll go over how to build a tokenizer, the very first component of a compiler. Specifically, we'll build a tokenizer that can understand the following code snippet and absolutely nothing more:

function hello() {
  console.log('hello, world!');
}

To do so, we're going to look into:

  1. What a token is and why a tokenizer is necessary;
  2. How to implement a tokenizer in three concrete steps:
    1. Tokenizing single character tokens;
    2. Tokenizing identifiers and keywords; and
    3. Tokenizing string literals.

Let's get started!

A Compiler Primer

Let's deviate for a second to go over how a compiler works at a high level. I think of a compiler as a pipeline with four concrete steps:

  • Tokenize
  • Parse
  • Transform
  • Generate

Each step takes in the output of the previous step and transforms it to something else. The first step, the tokenization step, takes in your original source code as its input, while the last step, the generation step, spits out the modified code.
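If it helps, you can think of the whole pipeline as function composition. Here's a rough sketch with hypothetical signatures (none of these names come from Babel itself):

type Token = { type: string; value: string };
type AST = unknown; // placeholder for the tree the parser produces

declare function tokenize(code: string): Token[];
declare function parse(tokens: Token[]): AST;
declare function transform(ast: AST): AST;
declare function generate(ast: AST): string;

// Each step feeds its output into the next one
const compile = (code: string): string =>
  generate(transform(parse(tokenize(code))));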

Back to the tokenizer!

Tokens Are the Language's Words

Essentially, a tokenizer breaks up your source code into small objects called tokens (hence the name). I like to think of a token as a "word" in the programming language, i.e. the smallest sequence of characters that still carries meaning.

For example, if you tokenize the following JavaScript code:

function hello() {
  console.log('hello, world!');
}

You'll end up with the following tokens:

  • function
  • hello
  • (
  • )
  • {
  • console
  • .
  • log
  • (
  • hello, world!
  • )
  • ;
  • }

Just like how a word in English can be a noun, verb, adjective, etc., each token has a type that represents that token's meaning. In our previous example, those types might be something like:

  • function → keyword
  • hello → identifier
  • ( → left-paren
  • ) → right-paren
  • { → left-curly
  • console → identifier
  • . → dot
  • log → identifier
  • ( → left-paren
  • hello, world! → string
  • ) → right-paren
  • ; → semicolon
  • } → right-curly

Why Do We Do This?

Before we continue, I want to take a brief detour to talk about why a tokenizer is necessary in the first place. I couldn't find a super clear resource to point to here, but from what I can gather, there are two reasons:

  1. To organize the data into a format that's friendlier for machines;
  2. To separate the logic that handles a language's microsyntax from the logic that handles its regular syntax.

Machine-Friendly Data

Writing programs is a lot easier when the data you're working with is structured in a consistent manner. Grouping strings into tokens means the rest of the compiler pipeline doesn't need to individually parse the source code - it gets the advantage of working with a tidy array of objects instead.
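For example, a tokenizer given const hello = "world"; might hand the parser an array like this (the token shape and type names here are just one possible design, not Babel's):

// One possible token array for: const hello = "world";
const tokens = [
  { type: "const", value: "const" },
  { type: "identifier", value: "hello" },
  { type: "equals", value: "=" },
  { type: "string", value: "world" },
  { type: "semicolon", value: ";" },
];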

Microsyntax and Regular Syntax

The other reason is to separate the logic that handles the language's microsyntax from the logic that handles regular syntax. When I say "syntax" here, I mean the rules that govern the "correct" structure of the programming language.

For example, the correct way to define a constant variable in JavaScript is to use the const keyword like so:

const hello = "world";

This is an example of regular syntax because the rule is about the correct way to arrange different words together.

On the other hand, microsyntax is about the correct way to arrange different characters together to make up a word. One example is how strings are defined in JavaScript, i.e. they're words that are surrounded by apostrophes or double quotes:

// valid strings
'hello'
"world"

// invalid strings
<-hello->
&world&

It's important to separate the code that handles these two kinds of syntax because they're concerned with different things. Trying to jumble them together will only lead to more complex code that's difficult to read and follow.

In Short...

To reiterate, the tokenizer's job is to break up the source code (which it receives as a string) into a list of tokens. Breaking up the code like this makes the other phases' lives much easier by:

  1. Tidying up the input into a more structured format, and
  2. Making the code simpler by separating logic for syntax and microsyntax.

Implementation Plan

Anyways, back to the tokenizer!

Now that we know what tokens are and why a tokenizer is necessary, we're finally ready to start implementing it. Again, we want to focus on tokenizing the following code snippet and nothing more:

function hello() {
  console.log('hello, world!');
}

Here's a preview of the tokenizer we're implementing, making its way through the code snippet:

[Interactive demo: the tokenizer steps through all 64 characters of the snippet, one at a time, emitting tokens as it goes. Its known tokens and keywords are listed below.]

Known Tokens

  • (
  • )
  • {
  • }
  • .
  • ;
  • =

Known Keywords

  • function
  • const

We'll break down the implementation into three parts:

  1. Tokenizing single character tokens;
  2. Tokenizing identifiers and keywords; and
  3. Tokenizing string literals.

Let's get started!

Single Character Tokens

Let's begin by trying to parse out the simplest tokens first - the ones that are only one character long. In the code snippet we're parsing, that would be all of the following tokens:

  • ( → left-paren
  • ) → right-paren
  • { → left-curly
  • . → dot
  • ; → semicolon
  • } → right-curly

We'll start off by iterating through every character of our input:

[Interactive demo: iterating through each character of the snippet one at a time, including the invisible \n newline characters.]

What's that \n?

The \n character is a special character that represents a new line. It's typically invisible to us when we're editing code or text, but I chose to make it explicit here to show what the computer actually sees.

As we iterate through the code string, we check if the current character is one of these single character tokens; if it is, we add a token for it to the final list of tokens.

One way to check if a character is a single character token is to keep a list of all the single character tokens we support and check if the character is in that list:
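Here's a minimal sketch of that check in TypeScript (the names here are my own, not from Babel):

type Token = { type: string; value: string };

// Every single character token we support, mapped to its type
const singleCharacterTokens: Record<string, string> = {
  "(": "left-paren",
  ")": "right-paren",
  "{": "left-curly",
  "}": "right-curly",
  ".": "dot",
  ";": "semicolon",
  "=": "equals",
};

function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  for (const char of input) {
    // If the character is in our list, record it as a token
    if (char in singleCharacterTokens) {
      tokens.push({ type: singleCharacterTokens[char], value: char });
    }
  }
  return tokens;
}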

[Interactive demo: stepping through { console.log() } and collecting each character that appears in the known tokens list.]

Identifiers and Keywords

With the single character tokens out of the way, the next thing we have to do is to parse the identifier and keyword tokens.

  • function → keyword
  • hello → identifier
  • console → identifier
  • log → identifier

But hold on a second - what's an identifier anyway?

What Makes an Identifier?

In JavaScript, an identifier is a sequence of characters that is used to refer to some piece of data. For example, in our input code snippet, the words hello, console, and log are all identifiers because they refer to a function definition, an object, and a method respectively (all data available to the program).

According to MDN, a valid identifier in JavaScript is a sequence of letters, digits, underscores, and dollar signs, except the first character cannot be a digit. This means the following strings are valid identifiers:

hello
_abc
abc123

But the following are not:

2cool
8ball

I initially wanted to support the MDN identifier rules exactly, but I chose to limit identifiers to only alphabetical characters for now to keep it scoped to our input code snippet. This means that, out of all the examples above, my tokenizer will only recognize the word hello as an identifier.

Implementation

To recap, an identifier (for our purposes) is any sequence of alphabetical characters. To parse it, I went with the following approach:

  1. If the current character is alphabetical, start parsing an identifier;
  2. Keep adding characters to the current identifier token until the current character is not alphabetical.
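Here's a sketch of that approach, assuming an index-based scanner (the helper names are mine):

type Token = { type: string; value: string };

const isAlphabetical = (char: string) => /^[a-zA-Z]$/.test(char);

// Called when the current character is alphabetical;
// `start` is that character's index in the input
function scanIdentifier(input: string, start: number): Token {
  let current = start;
  // Keep adding characters until we hit a non-alphabetical one
  while (current < input.length && isAlphabetical(input[current])) {
    current++;
  }
  return { type: "identifier", value: input.slice(start, current) };
}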

[Interactive demo: scanning through console.log, grouping console and log into identifier tokens.]

Keywords

Some words, like function, while, and switch, have a special meaning in JavaScript and therefore cannot be used as regular identifiers. These identifiers are called keywords, and they typically get their own individual token types.

To handle them, one way is to do the same thing as we did for the single character tokens:

  1. Keep a set of known keywords;
  2. When we're parsing an identifier, check if the parsed name is in this set;
  3. If it is, change the token's type to the keyword's type.
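In sketch form, assuming a helper that runs once an identifier's name has been fully collected:

type Token = { type: string; value: string };

// The set of keywords our tokenizer knows about
const keywords = new Set(["function", "const"]);

function finishIdentifier(name: string): Token {
  // Keywords get their own token type, e.g. `function` -> type "function"
  if (keywords.has(name)) {
    return { type: name, value: name };
  }
  return { type: "identifier", value: name };
}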

[Interactive demo: tokenizing const a = function () {}; both const and function match the known keywords list, so they become keyword tokens instead of identifiers.]

String Literals

Next up is tokenizing the 'hello, world!' part of the code snippet, also known as a string literal.

  • hello, world! → string

In JavaScript, a string literal abides by the following rules:

  1. Starts and ends with an apostrophe (') or double quotes (") pair, and
  2. Cannot span multiple lines.

To keep things simple, I've opted to support only apostrophes, and to let strings span multiple lines (it turns out the implementation is simpler if we allow multiple lines).

To tokenize a string literal, we're going to do something similar to tokenizing identifiers, except we only stop when we reach another apostrophe. Specifically:

  1. If the current character is an apostrophe, start parsing a string literal;
  2. Keep collecting characters until either:
    • the current character is another apostrophe, marking the end of the string; or
    • we've reached the end of the file, meaning the string is unterminated - an error!
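Here's a sketch of that logic (again with my own names; note that because we never stop at \n, strings happily span multiple lines):

type Token = { type: string; value: string };

// Called when the current character is an apostrophe;
// `start` is the opening apostrophe's index in the input
function scanString(input: string, start: number): Token {
  let current = start + 1; // skip past the opening apostrophe
  while (current < input.length && input[current] !== "'") {
    current++;
  }
  if (current === input.length) {
    // We hit the end of the file without a closing apostrophe
    throw new Error("Unterminated string literal");
  }
  // The token's value excludes the surrounding apostrophes
  return { type: "string", value: input.slice(start + 1, current) };
}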

[Interactive demo: tokenizing msg = 'hello, world!'; everything between the apostrophes is collected into a single string token.]

Summary

And there we have our tokenizer! It can't do too much at the moment, but it's able to tokenize the code snippet that we started with:

function hello() {
  console.log('hello, world!');
}

I purposely kept the code to rough sketches because I wanted to solidify the concepts behind a tokenizer rather than tie it down to any direct implementation. After all, there are a bunch of different ways of implementing the same thing! But if you'd like to read some code anyway, check out my implementation of this tokenizer (written in TypeScript).

To finish off, I have a few exercises for you to try out if you'd like to learn more:

  1. Implement this tokenizer in your language of choice; use the visualization from the introduction as a reference.
  2. Once it's done, extend the tokenizer to support JS syntax of your choice - maybe something like async-await?

I'd love to see what you come up with! You can either use the feedback form below or reach out to me @nandafyi on Twitter to get in touch.

That's it for today - thanks for reading!