• Savidu Dias

Developing a Simple Programming Language- Part 1: Compilers and Interpreters

Hello! Welcome to Season 1, Episode 1 of Developing a Simple Programming Language (and other ways to hang yourself). If you’re here, there’s a slight chance that you don’t know how to create a programming language, and are interested in learning how programming languages work.


Remember compilers, and interpreters? That thing you learned in your first year of college, and then completely forgot about? You may have a vague understanding of how they work, but not really 100% sure. Worry not, my child. By the end of this chapter, you’ll be an expert on compilers, and interpreters.


Before we get into compiling, and interpreting, let’s talk about how programming languages work.

For newbies such as ourselves, we feel as if every time we run a bunch of code, a horrible goblin-fairy living inside our computer looks at the code makes the computer do what we told it to do. Sometimes, he decides to be a little jerk and adds a few bugs to our program.


Believe it or not, that’s not how programming languages actually work. Instead of a goblin-fairy, what we have is a translator. I know, not as exciting as we thought, but bear with me. As complicated as our computers are, they are not smart enough to understand people. The translator’s job is to take our code, and read it back to the computer in the language that it understands.

Both compilers, and interpreters are translators. So what’s the difference? In a compiler, the translator looks at our source code, translates it into machine language, and writes it in a post-it note for the machine to read. However, in an interpreter, the translator looks at our source code, and reads it out to the computer.

It may sound like an interpreter is the better deal, but there are many advantages to using a compiler. Click here to see a comparison between compilers, and interpreters.


In this series, we will be going over how to write our own programing language by developing an interpreter. Before we get into that, let me introduce myself. At the time of writing this chapter, I had graduated college 2 months ago. I developed a programming language, as my final year research project. The language is called “Cuneiform”, which is a programming language used to develop conversational applications. If you’re interested in checking it out, you can find the project on GitHub. You can also find the documentation for the project here.


Before we begin, what do we need to know before we start developing our own interpreter? Honestly? Not much. Developing an interpreter is a lot easier than you’d think. If you have an intermediate level of programming experience, object oriented programming, and programming concepts such as recursion, this is going to be a piece of cake!


So what are we going to do? We are going to create a simple interpreter for JavaScript. By the end of this series, you will have a working JavaScript interpreter, which is capable of understanding basic JavaScript code. Why JavaScript? For one thing, JS is a very popular programming language, that has many language constructs.


The JS interpreter will be developed using Java. This is mainly because Java is a language that lets us implement object oriented concepts very easily. We will see how important this is, as the series progresses. However, you can use any programming language you want, because the concepts that we will talk about do not depend on any particular programming language. Bonus points if you follow this series using any programming language of your choice :)


Alrighty, enough chit chat. Let’s just jump into it! We’ll start by creating an interpreter for a calculator, which can do basic arithmetic calculations. I know, exciting, isn’t it? What we’re going to start off with is going to be very simple. Our calculator is going to handle the addition of two single digit integers (for example: 2 + 3). In order for our simple calculator (interpreter) to work, let’s make some assumptions:


  • The input should only have single digit integers.

  • Addition is the only arithmetic operation which is supported.

  • The input must not have any white spaces (eg: “2 + 3” is incorrect, “2+3” is correct).


How is this going to work? What we need to understand is that you, as a human are much smarter than a computer (way to go!). Although we see the input “2+3” as an arithmetic addition, the interpreter will see this as a string. In order for the interpreter to understand what needs to be done with the string, it needs to break “2+3” into smaller components called tokens. So, “2+3” will be broken into tokens like this:


The process of breaking the input string into tokens is called lexical analysis. The first step of the interpreter is to read the input, and convert it into a stream of tokens. The component of the interpreter that does this is called lexical analyser, or lexer for short.


A token is an object that has a type, and value. In our example, the string “2” will have a type of INTEGER, and a corresponding integer value of 2. It will be in the 0th position in the token stream. In the next position (1st) of the token stream, we have “+”, which will have the type PLUS, and the value +. The next position in the stream will have “3”, which is of type INTEGER, and value 3. Additionally, a token called EOF is used to indicate the end of file.


The token will be represented in Code as follows:

enum Type {
    INTEGER,
    PLUS,
    EOF
}

class Token {
    // token type: INTEGER, PLUS, or EOF
    private Type type;
    // token value: 0, 1, 2. 3, 4, 5, 6, 7, 8, 9, '+', or null
    private Object value;


    Token(Type type, Object value) {
        this.type = type;
        this.value = value;
    }

    Type getType() {
        return type;
    }

    Object getValue() {
        return value;
    }
}

A Parser class represents the behaviour of the interpreter. It is represented as shown below:

class Parser {
    // String input (eg: "2+3")
    private String text;
    // Keeps an index of the current position in the input
    private int pos;
    // Current token instance
    private Token currentToken;

    Parser(String text) {
        this.text = text;
        this.pos = 0;
        this.currentToken = null;
}

A method called getNextToken of the Parser class acts as the lexer. Each time this is called, the next token created from the input of characters is obtained.


/* Lexical Analyzer (also known as lexer, scanner, or tokenizer)

    This method is responsible for breaking the input apart into a stream of tokens.
     */
    private Token getNextToken() {
        // Checks if the current position index has passed the number of characters in the input string. If so, return
        // an EOF token as there is no more input left to be converted into tokens.
        if (this.pos > this.text.length() - 1) {
            return new Token(Type.EOF, null);
        }

        // get a character at the current position
        char currentChar = this.text.charAt(this.pos);

        // If the character is a digit, then convert it to int, create an INTEGER token, increment pos index to point
        // to the next character after the digit, and return the INTEGER token
        if (Character.isDigit(currentChar)) {
            Token token = new Token(Type.INTEGER, Character.getNumericValue(currentChar));
            this.pos++;
            return token;
        }
        if (currentChar == '+') {
            Token token = new Token(Type.PLUS, currentChar);
            this.pos++;
            return token;
        }

        try {
            throw new Exception("Invalid token type");
        } catch (Exception e) {
            System.err.printf("Invalid token type %c found", currentChar);
            System.exit(0);
            return null;
        }
    }


Let’s take a look at the getNextToken method, The input is stored in the text String variable. Pos is an index into that String. The value of pos is initially set to 0, and points to the character ‘3’. This method checks if the character is a digit, and if so, it increments the value of pos, and returns a Token of type INTEGER, and value 3.


Now that the interpreter has the stream of tokens made from the input, thanks to the Lexer, the next task is to identify a structure in the stream. Since we are making an interpreter for an arithmetic expression, our interpreter expects to find a structure like this:


After the structure is confirmed, the result should be generated by adding the value of the token on the left side of PLUS with the token on the right side of PLUS. We will create two new functions called expr(), and eat() to do this.

/*
    Compares the current token type with the token type passed. If they match, then "eat" the current token and assign
    the next token using the getNextToken() method. If they do not match, raise an exception.
     */
    private void eat(Type tokenType) {
        if (this.currentToken.getType() == tokenType) {
            this.currentToken = getNextToken();
        } else {
            try {
                throw new Exception("Expected token type did not match");
            } catch (Exception e) {
                System.err.printf("Expected token type %s did not match", tokenType.toString());
                System.exit(0);
            }
        }
    }

    /*
    Tries to identify the structure "INTEGER PLUS INTEGER"
    expr -> INTEGER PLUS INTEGER
     */
    int expr() {
        // Set the current token to the first token taken from the input
        this.currentToken = getNextToken();

        // The current token is expected to be an integer
        Token left = this.currentToken;
        eat(Type.INTEGER);

        // The current token is expected to be a PLUS (+)
        Token operation = this.currentToken;
        eat(Type.PLUS);

        // The current token is expected to be a single digit integer
        Token right = this.currentToken;
        eat(Type.INTEGER);

        int result = (Integer)left.getValue() + (Integer)right.getValue();
        return result;
    }

The eat() method acts as a helper method for the expr() method in order to verify that the token type passed to the eat() method matches the current token type. If we get a successful match, eat() gets the next token and assigns it to the currentToken variable. Think of this as “eating” the token that was matched, and going over to the next token in the stream. If the structure in the stream of token does not match what we expect, the eat() method throws an exception.


Our Parser class now looks like this.

class Parser {
    // String input (eg: "2+3")
    private String text;
    // Keeps an index of the current position in the input
    private int pos;
    // Current token instance
    private Token currentToken;

    Parser(String text) {
        this.text = text;
        this.pos = 0;
        this.currentToken = null;
    }

    /* Lexical Analyzer (also known as lexer, scanner, or tokenizer)

    This method is responsible for breaking the input apart into a stream of tokens.
     */
    private Token getNextToken() {
        // Checks if the current position index has passed the number of characters in the input string. If so, return
        // an EOF token as there is no more input left to be converted into tokens.
        if (this.pos > this.text.length() - 1) {
            return new Token(Type.EOF, null);
        }

        // get a character at the current position
        char currentChar = this.text.charAt(this.pos);

        // If the character is a digit, then convert it to int, create an INTEGER token, increment pos index to point
        // to the next character after the digit, and return the INTEGER token
        if (Character.isDigit(currentChar)) {
            Token token = new Token(Type.INTEGER, Character.getNumericValue(currentChar));
            this.pos++;
            return token;
        }
        if (currentChar == '+') {
            Token token = new Token(Type.PLUS, currentChar);
            this.pos++;
            return token;
        }

        try {
            throw new Exception("Invalid token type");
        } catch (Exception e) {
            System.err.printf("Invalid token type %c found", currentChar);
            System.exit(0);
            return null;
        }
    }

    /*
    Compares the current token type with the token type passed. If they match, then "eat" the current token and assign
    the next token using the getNextToken() method. If they do not match, raise an exception.
     */
    private void eat(Type tokenType) {
        if (this.currentToken.getType() == tokenType) {
            this.currentToken = getNextToken();
        } else {
            try {
                throw new Exception("Expected token type did not match");
            } catch (Exception e) {
                System.err.printf("Expected token type %s did not match", tokenType.toString());
                System.exit(0);
            }
        }
    }

    /*
    Tries to identify the structure "INTEGER PLUS INTEGER"
    expr -> INTEGER PLUS INTEGER
     */
    int expr() {
        // Set the current token to the first token taken from the input
        this.currentToken = getNextToken();

        // The current token is expected to be an integer
        Token left = this.currentToken;
        eat(Type.INTEGER);

        // The current token is expected to be a PLUS (+)
        Token operation = this.currentToken;
        eat(Type.PLUS);

        // The current token is expected to be a single digit integer
        Token right = this.currentToken;
        eat(Type.INTEGER);

        int result = (Integer)left.getValue() + (Integer)right.getValue();
        return result;
    }
}

Create a main method which takes in the input and runs it through the Parser to give us a result.

import java.util.Scanner;

public class Main {
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);

        while (true) {
            System.out.print("calc> ");
            String input = sc.nextLine();

            Parser parser = new Parser(input);
            int result = parser.expr();
            System.out.println(result);
        }
    }
}

Save the code into three class files Token.java, Parser.java, and Main.Java. We can also download it directly from GitHub. Now we’re ready to compile and run our simple interpreter! We will compile all of our Java class files, and then run our calculator.

$ javac -d . Token.java
$ javac -d . Parser.java
$ javac -d . Main.java
$ java Main

calc> 2+3
5
calc> 5+7
12
calc> 9+9
18

Congratulations! You have made your very first interpreter! Let’s do a quick recap to see what our interpreter does to evaluate arithmetic expressions:


  • The interpreter takes the input as a String (eg: “2+3”).

  • The expr() method in the interpreter finds the structure in the stream of tokens, which is obtained by the getNextToken() method.

  • After the structure is validated, it interprets the input by taking the sum of the two INTEGER tokens.


In the next chapter, we will add more features to our interpreter. Our new interpreter will be able to:

  • Ignore whitespace characters (“2 + 3” will also work)

  • Handle subtractions

  • Take integers that will have more than one digit, like “23+4”


Feeling brave? Why don’t you try to add these features to our calculator right now, before reading the next chapter? Bonus points if you managed to do complete this!


If you're feeling ready, check out the next tutorial: Part 2: Components of an Interpreter


Huge thanks goes to Ruslan Spivak, who wrote a tutorial series on developing interpreters, which you can find here. A lot of what you’ll see in this series was adapted from this tutorial. Also, if you’re interested in learning more about developing compilers and interpreters, take some time to read these books.


10 views

© 2020 by savidude.com