Developing a Simple Programming Language - Part 2: Components of an Interpreter

Savidu Dias
Dec 24, 2018
8 min read

Hello, and welcome to part 2 of developing a simple Programming Language! If you have made it this far, give yourself a pat on the back! No, seriously, do it! You have made your first interpreter, but today, we’re going to make it better. Aww yeah! Let’s call our new version Calculator 2.0.

Once we’re done with today’s tutorial, our new and advanced calculator will be able to:

Ignore whitespace characters in the input string.
Subtract two integers.
Take integers with multiple digits from the input.

Sounds exciting, isn’t it? Enough chit chat. Let’s just jump into it!. Before we get into the good stuff, let’s make our code from the last chapter a little bit cleaner, shall we?

Instead of having our interpreter, and lexical analyzer in one Java class, let’s create a separate class for our Lexical Analyzer (Lexer). It should look like this:

class Lexer {
    // String input (eg: "2+3")
    private String text;
    // Keeps an index of the current position in the input
    private int pos;
    // Current token instance
    private Token currentToken;
    // Current character being pointed to
    private char currentChar;

    Lexer(String text) {
        this.text = text;
        this.pos = 0;
        this.currentToken = null;
    }

    Token getCurrentToken() {
        if (this.currentToken == null) {
            this.currentToken = getNextToken();
        }
        return this.currentToken;
    }

    /*
    Compares the current token type with the token type passed. If they match, then "eat" the current token and assign
    the next token using the getNextToken() method. If they do not match, raise an exception.
     */
    void eat(Type tokenType) {
        if (this.currentToken.getType() == tokenType) {
            this.currentToken = getNextToken();
        } else {
            try {
                throw new Exception("Expected token type did not match");
            } catch (Exception e) {
                System.err.printf("Expected token type %s did not match", tokenType.toString());
                System.exit(0);
            }
        }
    }

    /*
    This method is responsible for breaking the input apart into a stream of tokens.
    */
    private Token getNextToken() {
        // Checks if the current position index has passed the number of characters in the input string. If so, return
        // an EOF token as there is no more input left to be converted into tokens.
        if (this.pos > this.text.length() - 1) {
            return new Token(Type.EOF, null);
        }
        
        // get a character at the current position  
        char currentChar = this.text.charAt(this.pos);
        
        // If the character is a digit, then convert it to int, create an INTEGER token, increment pos index to point
        // to the next character after the digit, and return the INTEGER token
        if (Character.isDigit(currentChar)) {
            Token token = new Token(Type.INTEGER, Character.getNumericValue(currentChar));
            this.pos++;
            return token;
        }
        
        if (currentChar == '+') {
            Token token = new Token(Type.PLUS, currentChar);
            this.pos++;
            return token; 
        }
        
        try {
            throw new Exception("Invalid token type");
        } catch (Exception e) {
            System.err.printf("Invalid token type %c found", currentChar);
            System.exit(0);
            return null;
        }    
    }
}

Our Parser should now be updated to look like this:

class Parser {
    private Lexer lexer;

    Parser(String text) {
        this.lexer = new Lexer(text);
    }

    /*
    Tries to identify the structure "INTEGER PLUS INTEGER"
    expr -> INTEGER PLUS INTEGER
    expr -> INTEGER MINUS INTEGER
     */
    int expr() {
        // The current token is expected to be an integer
        Token left = lexer.getCurrentToken();
        this.lexer.eat(Type.INTEGER);
        
        // The current token is expected to be a PLUS (+)
        Token operation = lexer.getCurrentToken();
        lexer.eat(Type.PLUS);
        
        // The current token is expected to be a single digit integer
        Token right = lexer.getCurrentToken();
        lexer.eat(Type.INTEGER);
        
        int result = (Integer)left.getValue() + (Integer)right.getValue();
        return result;
    }
}

Let’s compile and run our program again to see if everything works just as it used to.

$ javac -d . Token.java
$ javac -d . Lexer.java
$ javac -d . Parser.java
$ javac -d . Main.java
$ java Main

calc> 2+3
5
calc> 5+7
12
calc> 9+9
18

If you run into any exceptions, carefully go through the code again and see where things could have gone wrong. As always, you can find the complete code for this chapter in our GitHub repository.

We just made a bunch of code changes, and ended up with exactly what we had in the last chapter. What happened? Well, an important thing to keep note of is that an interpreter is divided into three parts:

Token: The smallest unit of source code.
Lexer (lexical analyser): Breaks the input source code into a stream of tokens.
Parser: Identifies a structure in the stream of tokens.

Keep these three words in mind (better yet, make a note of it), as we will be using these terms a lot as we go on in this tutorial. If we go through our Lexer class, we can see that it breaks the input source code apart into a stream of tokens using the getNextToken() method. Our parser class tries to identify the structure INTEGER -> PLUS -> INTEGER using the expr() method

As we can see, our interpreter is now nicely divided into three classes: Token, Lexer, and Parser. Not only is this much clearer, it will save us a lot of headaches in the future. Let’s make our Lexer a little bit cleaner to save ourselves a bunch of headaches :)

We will introduce a new variable as a pointer for the current character in our input string, and a method called advance() to set the value for this character. Our Lexer should now look like this.

class Lexer {
   // String input (eg: "2+3")
   private String text;
   // Keeps an index of the current position in the input
   private int pos;
   // Current token instance
   private Token currentToken;
   // Current character being pointed to
   private char currentChar;

   Lexer(String text) {
       this.text = text;
       this.pos = 0;
       this.currentToken = null;
       this.currentChar = this.text.charAt(this.pos);
   }

   Token getCurrentToken() {
       if (this.currentToken == null) {
           this.currentToken = getNextToken();
       }
       return this.currentToken;
   }

   /*
   Compares the current token type with the token type passed. If they match, then "eat" the current token and assign
   the next token using the getNextToken() method. If they do not match, raise an exception.
    */
   void eat(Type tokenType) {
       if (this.currentToken.getType() == tokenType) {
           this.currentToken = getNextToken();
       } else {
           try {
               throw new Exception("Expected token type did not match");
           } catch (Exception e) {
               System.err.printf("Expected token type %s did not match", tokenType.toString());
               System.exit(0);
           }
       }
   }

   /*
   Advances the 'pos' pointer, and set a value to the currentChar variable
    */
   private void advance() {
       this.pos++;
       // Checks if the current position index has passed the number of characters in the input string. If so, set a
       // MIN_VALUE for current char, indicating the end of file.
       if (this.pos > this.text.length() -1) {
           this.currentChar = Character.MIN_VALUE;
       } else {
           this.currentChar = this.text.charAt(this.pos);
       }
   }

   /*
   This method is responsible for breaking the input apart into a stream of tokens.
   */
   private Token getNextToken() {
       // If the character is a digit, then convert it to int, create an INTEGER token, increment pos index to point
       // to the next character after the digit, and return the INTEGER token
       if (Character.isDigit(currentChar)) {
           Token token = new Token(Type.INTEGER, Character.getNumericValue(currentChar));
           advance();
           return token;
       }
       if (currentChar == '+') {
           Token token = new Token(Type.PLUS, currentChar);
           advance();
           return token;
       }

       try {
           throw new Exception("Invalid token type");
       } catch (Exception e) {
           System.err.printf("Invalid token type %c found", currentChar);
           System.exit(0);
           return null;
       }
   }
}

Now let’s get into the fun part!. We’ll have our interpreter ignore white spaces. To do this, we’ll create a new method skipWhitespace(). Also, we’ll change the getNextToken() method to run in a loop until the end of the file is reached.

private void skipWhitespace() {
   while (this.currentChar != Character.MIN_VALUE && Character.isWhitespace(this.currentChar)) {
       advance();
   }
}

/*
This method is responsible for breaking the input apart into a stream of tokens.
*/
private Token getNextToken() {
   while (this.currentChar != Character.MIN_VALUE) {
       if (Character.isWhitespace(this.currentChar)) {
           skipWhitespace();
       }
       if (Character.isDigit(this.currentChar)) {
           Token token = new Token(Type.INTEGER, Character.getNumericValue(currentChar));
           advance();
           return token;
       }
       if (this.currentChar == '+') {
           Token token = new Token(Type.PLUS, currentChar);
           advance();
           return token;
       }

       try {
           throw new Exception("Invalid token type");
       } catch (Exception e) {
           System.err.printf("Invalid token type %c found", currentChar);
           System.exit(0);
           return null;
       }
   }
   return new Token(Type.EOF, null);
}

There you go! We can try running the program and see if we can now have white spaces.

$ javac -d . Token.java
$ javac -d . Lexer.java
$ javac -d . Parser.java
$ javac -d . Main.java
$ java Main

calc> 2 + 3
5
calc> 6 +    2
8

What’s next? Now we’ll have our calculator be able to do subtractions. This is very simple. First, we should create a new value for the Type enum in the Token class which represents the MINUS token.

enum Type {
   INTEGER,
   PLUS,
   MINUS,
   EOF
}

Next, we should update our Lexer to accept MINUS tokens when creating a token stream.

private Token getNextToken() {
   while (this.currentChar != Character.MIN_VALUE) {
       if (Character.isWhitespace(this.currentChar)) {
           skipWhitespace();
       }
       if (Character.isDigit(this.currentChar)) {
           Token token = new Token(Type.INTEGER, Character.getNumericValue(currentChar));
           advance();
           return token;
       }
       if (this.currentChar == '+') {
           Token token = new Token(Type.PLUS, currentChar);
           advance();
           return token;
       }
       if (this.currentChar == '-') {
           Token token = new Token(Type.MINUS, currentChar);
           advance();
           return token;
       }

       try {
           throw new Exception("Invalid token type");
       } catch (Exception e) {
           System.err.printf("Invalid token type %c found", currentChar);
           System.exit(0);
           return null;
       }
   }
   return new Token(Type.EOF, null);
}

Finally, update the expr() method in the parser to identify a pattern which could be either INTEGER PLUS INTEGER, or INTEGER MINUS INTEGER.

/*
Tries to identify the structure "INTEGER PLUS INTEGER"
expr -> INTEGER PLUS INTEGER
expr -> INTEGER MINUS INTEGER
*/
int expr() {
   // The current token is expected to be an integer
   Token left = lexer.getCurrentToken();
   this.lexer.eat(Type.INTEGER);

   // The current token is expected to be either a PLUS (+) or a MINUS (-)
   Token operation = lexer.getCurrentToken();
   if (operation.getType() == Type.PLUS) {
       lexer.eat(Type.PLUS);
   } else {
       lexer.eat(Type.MINUS);
   }

   // The current token is expected to be a single digit integer
   Token right = lexer.getCurrentToken();
   lexer.eat(Type.INTEGER);

   // Either a INTEGER PLUS INTEGER, or a INTEGER MINUS INTEGER token stream should be found at this point. As a
   // result, the method can return the result by adding, or subtracting the two integers.
   int result;
   if (operation.getType() == Type.PLUS) {
       result = (Integer)left.getValue() + (Integer)right.getValue();
   } else {
       result = (Integer)left.getValue() - (Integer)right.getValue();
   }
   return result;
}

Let’s run our application and see if everything works.

$ javac -d . Token.java
$ javac -d . Lexer.java
$ javac -d . Parser.java
$ javac -d . Main.java
$ java Main

calc> 5   - 3
2
calc> 7 - 9
-2

For our final act of the day, we will develop our calculator to be able to work with multi-digit integers. When a digit is encountered by the Lexer, an INTEGER token will be created by appending each digit in the input integer to a string, and then converting it to an integer. This is handled by the integer() method. Additionally, we will update getNextToken() as well.

/*
Return a single, ot multi-digit integer taken from the input
*/
private int integer() {
   String result = "";
   while (this.currentChar != Character.MIN_VALUE && Character.isDigit(this.currentChar)) {
       result += this.currentChar;
       advance();
   }
   return Integer.parseInt(result);
}

/*
This method is responsible for breaking the input apart into a stream of tokens.
*/
private Token getNextToken() {
   while (this.currentChar != Character.MIN_VALUE) {
       if (Character.isWhitespace(this.currentChar)) {
           skipWhitespace();
           continue;
       }
       if (Character.isDigit(this.currentChar)) {
           return new Token(Type.INTEGER, integer());
       }
       if (this.currentChar == '+') {
           advance();
           return new Token(Type.PLUS, currentChar);
       }
       if (this.currentChar == '-') {
           advance();
           return new Token(Type.MINUS, currentChar);
       }

       try {
           throw new Exception("Invalid token type");
       } catch (Exception e) {
           System.err.printf("Invalid token type %c found", currentChar);
           System.exit(0);
           return null;
       }
   }
   return new Token(Type.EOF, null);
}

Let’s run our program with the new changes and see if we can work with multi-digit integers.

$ javac -d . Token.java
$ javac -d . Lexer.java
$ javac -d . Parser.java
$ javac -d . Main.java
$ java Main

calc> 15 - 12
3
calc> 8 + 12
20

That’s it! We have completed the development of Calculator 2.0! Our calculator can now do additions and subtractions. Not just that, it lets us have as many whitespaces between the integers, and arithmetic expressions as we want! How cool is that?

In the next chapter, we will develop our calculator even further. We will now be able to do multiple additions, and subtractions in our calculations. For example, our next calculator will be able to perform operations like “1 + 2 + 3 + 4”, and so much more! If you’re feeling up to it, try and develop this feature yourself before getting into the next chapter!

See you on the other side!

Previous Chapter: Part 1: Compilers and Interpreters

Next Chapter: Part 3: Syntax Diagrams

Savidude's Blog

Developing a Simple Programming Language - Part 2: Components of an Interpreter

Recent Posts

ความคิดเห็น