database internals: a brief look at h2's tokeniser

Nov 19, 2024

Introduction

I recently became fascinated by databases. It occurred to me that they are essential components across all areas of software development, existing in various forms and playing critical roles in countless applications. That got me on the path to digging into its internals.

I started out by re-building a mini version of SQLite using Kotlin (more on that in a future post). However, I encountered performance issues when querying indexed fields in a large SQLite file. While I could locate the indexes efficiently using the B+ tree structure, retrieving the actual data entries was very slow.

To get a better understanding of how data retrieval works, I decided it was time to look into how this bit was implemented. I tried to study the actual sqlite implementation, but I found it a bit overwhelming. It was a LOT of C code. I hadn’t written or read C in almost a decade, since uni. Also, setting it up took a while. I had to download CLion, figure out how to setup the amalgamated version of SQLite and get it to play nicely with the IDE, among other things

Thankfully, H2 came to the rescue. It’s a fairly popular, lightweight db that runs on the JVM. It is used more on the backend and to a lesser degree on Android. It is commonly used in embedded mode but can run in server mode too.

I pulled the code, and it was impressive: thousands of lines of clear, easy-to-follow Java. I was able to set it up in less than 5 minutes and started to debug with breakpoints and tests.

The Tokenizer

When a SQL statement is run on a db engine, the input string has to be parsed into a query tree. The input string is split into “chunks” called tokens. H2 supports 11 types of tokens, the main ones being;

IdentifierToken - An identifier is a token that forms a name e.g a table name or column name.
KeywordToken - Keyword tokens are the reserved words that have a special meaning to the SQL parser. They cannot be used as identifiers (such as table names or column names) unless they are quoted .e.g "TABLE", "PRIMARY", "SELECT". You can find a list of keywords as identified by h2 here

There are “literal” types which represent fixed values that appear directly in your SQL queries. Examples of these are IntegerToken, CharacterStringToken, BigintToken and BinaryStringToken which as their names imply, hold tokens whose values are integers or regular strings or binary strings.

There’s also the EndOfInputToken which marks the end of a stream of tokens.

The classification of tokens is straightforward. By default, if a token is in the list of keywords defined in ParserUtils , they’re categorised as KeywordToken. Otherwise, they’re set to be IdentifierToken, if they’re not one of the literal types.

The Process, Briefly

The tokenization process in H2 starts by initializing an array to store the identified tokens. It then iterates through the SQL string, character by character. For each character, it checks for the presence of delimiters like spaces or parentheses to identify the end of a token. Once a token is identified, it’s categorized as either a keyword (like “CREATE” or “TABLE”) or an identifier (like table or column names). The process also handles various literal types, such as integers and strings, using specialized functions for different formats. Finally, after processing all characters, the tokenizer appends an EndOfInputToken to the array and returns the complete list of tokens.

    ArrayList<Token> tokenize(String sql, boolean stopOnCloseParen, BitSet parameters) {
        ArrayList<Token> tokens = new ArrayList<>();
        int end = sql.length() - 1;
        boolean foundUnicode = false;
        int lastParameter = 0;
        loop: for (int i = 0; i <= end;) {
            char c = sql.charAt(i);
            Token token;
            switch (c) { ...over 100 lines of cases... }
            tokens.add(token);
            i++;
        }
        if (foundUnicode) {
            processUescape(sql, tokens);
        }
        tokens.add(new Token.EndOfInputToken(end + 1));
        return tokens;
    }

I encourage you to check out the full Tokenizer.tokenize() function to see how it all works end-to-end.

Examples

Using 3 very simple but common sql statements, here’s how the tokens are resolved.

create table test(id int primary key, name varchar(255))

Token	Type
`create`	`org.h2.command.Token$IdentifierToken`
`table`	`org.h2.command.Token$KeywordToken`
`test`	`org.h2.command.Token$IdentifierToken`
`(`	`org.h2.command.Token$KeywordToken`
`id`	`org.h2.command.Token$IdentifierToken`
`int`	`org.h2.command.Token$IdentifierToken`
`primary`	`org.h2.command.Token$KeywordToken`
`key`	`org.h2.command.Token$KeywordToken`
`,`	`org.h2.command.Token$KeywordToken`
`name`	`org.h2.command.Token$IdentifierToken`
`varchar`	`org.h2.command.Token$IdentifierToken`
`(`	`org.h2.command.Token$KeywordToken`
`255`	`org.h2.command.Token$IntegerToken`
`)`	`org.h2.command.Token$KeywordToken`
`)`	`org.h2.command.Token$KeywordToken`
end-of-string	`org.h2.command.EndOfInputToken`

insert into test values(1, 'Hello')

Token	Type
`insert`	`org.h2.command.Token$IdentifierToken`
`into`	`org.h2.command.Token$IdentifierToken`
`test`	`org.h2.command.Token$IdentifierToken`
`values`	`org.h2.command.Token$KeywordToken`
`(`	`org.h2.command.Token$KeywordToken`
`1`	`org.h2.command.Token$IntegerToken`
`,`	`org.h2.command.Token$KeywordToken`
`Hello`	`org.h2.command.Token$CharacterStringToken`
`)`	`org.h2.command.Token$KeywordToken`
end-of-string	`org.h2.command.EndOfInputToken`

select * from test

Token	Type
`select`	`org.h2.command.Token$KeywordToken`
`*`	`org.h2.command.Token$KeywordToken`
`from`	`org.h2.command.Token$KeywordToken`
`test`	`org.h2.command.Token$IdentifierToken`
end-of-string	`org.h2.command.EndOfInputToken`

Easter Eggs? 😉

I found a couple of interesting tricks while studying the code

Using a bitwise AND operation to convert a character to uppercase, using 0xffdf as a bitmask

 private Expression readTermWithIdentifier(String name, boolean quoted) {
    /*
     * Convert a-z to A-Z. This method is safe, because only A-Z
     * characters are considered below.
     *
     * Unquoted identifier is never empty.
     */
    switch (name.charAt(0) & 0xffdf) {
    case 'C':
    ...

Labelled loops, which allows for easy mixing of switch and for statements. You know break can break from both switch and for-loops. To break from the loop, you have to do break loop instead of just break which marks the end of a condition in the switch statement.

 case ')':
    token = new Token.KeywordToken(i, CLOSE_PAREN);
    if (stopOnCloseParen) {
        tokens.add(token);
        end = skipWhitespace(sql, end, i + 1) - 1;
        break loop;
    }
    break;
...

For chars '0' - '9', you can get its equivalent integer representation by subtracting '0' from it

 }
number = number * 10 + (c - '0');
if (number > Integer.MAX_VALUE) {

The characters '0', '1', '2' have the ASCII values of 48, 49 and 50 respectively. A neat yet efficient way to convert numeric chars into their integer equivalents is to subtract '0' from them. This avoids the overhead of string-to-integer operations

'1' - '0' => 49 - 48 = 1
'2' - '0' => 50 - 48 = 2

References

Database Design and Implementation - Edward Sciore
H2 Database, the Java SQL database - https://github.com/h2database/h2database

In my next post, I’ll explain how these tokens are interpreted and converted by the engine into executable commands.