Introduction
I recently became fascinated by databases. It occurred to me that they are essential components across all areas of software development, existing in various forms and playing critical roles in countless applications. That got me on the path to digging into its internals.
I started out by re-building a mini version of SQLite using Kotlin (more on that in a future post). However, I encountered performance issues when querying indexed fields in a large SQLite file. While I could locate the indexes efficiently using the B+ tree structure, retrieving the actual data entries was very slow.
To get a better understanding of how data retrieval works, I decided it was time to look into how this bit was implemented. I tried to study the actual sqlite implementation, but I found it a bit overwhelming. It was a LOT of C code. I hadn’t written or read C in almost a decade, since uni. Also, setting it up took a while. I had to download CLion, figure out how to setup the amalgamated version of SQLite and get it to play nicely with the IDE, among other things
Thankfully, H2 came to the rescue. It’s a fairly popular, lightweight db that runs on the JVM. It is used more on the backend and to a lesser degree on Android. It is commonly used in embedded mode but can run in server mode too.
I pulled the code, and it was impressive: thousands of lines of clear, easy-to-follow Java. I was able to set it up in less than 5 minutes and started to debug with breakpoints and tests.
The Tokenizer
When a SQL statement is run on a db engine, the input string has to be parsed into a query tree. The input string is split into “chunks” called tokens. H2 supports 11 types of tokens, the main ones being;
IdentifierToken
- An identifier is a token that forms a name e.g a table name or column name.KeywordToken
- Keyword tokens are the reserved words that have a special meaning to the SQL parser. They cannot be used as identifiers (such as table names or column names) unless they are quoted .e.g"TABLE", "PRIMARY", "SELECT"
. You can find a list of keywords as identified by h2 here
There are “literal” types which represent fixed values that appear directly in your SQL queries. Examples of these are
IntegerToken
, CharacterStringToken
, BigintToken
and BinaryStringToken
which as their names imply, hold tokens whose values are integers or regular strings or binary strings.
There’s also the EndOfInputToken
which marks the end of a stream of tokens.
The classification of tokens is straightforward. By default, if a token is in the list of keywords defined in ParserUtils
, they’re categorised as KeywordToken
. Otherwise, they’re set to be IdentifierToken
, if they’re not one of the literal types.
The Process, Briefly
The tokenization process in H2 starts by initializing an array to store the identified tokens. It then iterates through the SQL string, character by
character. For each character, it checks for the presence of delimiters like spaces or parentheses to identify the end of a token. Once a token is identified,
it’s categorized as either a keyword (like “CREATE” or “TABLE”) or an identifier (like table or column names). The process also handles various literal types,
such as integers and strings, using specialized functions for different formats. Finally, after processing all characters, the tokenizer appends an EndOfInputToken
to the array and returns the complete list of tokens.
ArrayList<Token> tokenize(String sql, boolean stopOnCloseParen, BitSet parameters) {
ArrayList<Token> tokens = new ArrayList<>();
int end = sql.length() - 1;
boolean foundUnicode = false;
int lastParameter = 0;
loop: for (int i = 0; i <= end;) {
char c = sql.charAt(i);
Token token;
switch (c) { ...over 100 lines of cases... }
tokens.add(token);
i++;
}
if (foundUnicode) {
processUescape(sql, tokens);
}
tokens.add(new Token.EndOfInputToken(end + 1));
return tokens;
}
I encourage you to check out the full Tokenizer.tokenize()
function
to see how it all works end-to-end.
Examples
Using 3 very simple but common sql statements, here’s how the tokens are resolved.
create table test(id int primary key, name varchar(255))
Token | Type |
---|---|
create |
org.h2.command.Token$IdentifierToken |
table |
org.h2.command.Token$KeywordToken |
test |
org.h2.command.Token$IdentifierToken |
( |
org.h2.command.Token$KeywordToken |
id |
org.h2.command.Token$IdentifierToken |
int |
org.h2.command.Token$IdentifierToken |
primary |
org.h2.command.Token$KeywordToken |
key |
org.h2.command.Token$KeywordToken |
, |
org.h2.command.Token$KeywordToken |
name |
org.h2.command.Token$IdentifierToken |
varchar |
org.h2.command.Token$IdentifierToken |
( |
org.h2.command.Token$KeywordToken |
255 |
org.h2.command.Token$IntegerToken |
) |
org.h2.command.Token$KeywordToken |
) |
org.h2.command.Token$KeywordToken |
end-of-string | org.h2.command.EndOfInputToken |
insert into test values(1, 'Hello')
Token | Type |
---|---|
insert |
org.h2.command.Token$IdentifierToken |
into |
org.h2.command.Token$IdentifierToken |
test |
org.h2.command.Token$IdentifierToken |
values |
org.h2.command.Token$KeywordToken |
( |
org.h2.command.Token$KeywordToken |
1 |
org.h2.command.Token$IntegerToken |
, |
org.h2.command.Token$KeywordToken |
Hello |
org.h2.command.Token$CharacterStringToken |
) |
org.h2.command.Token$KeywordToken |
end-of-string | org.h2.command.EndOfInputToken |
select * from test
Token | Type |
---|---|
select |
org.h2.command.Token$KeywordToken |
* |
org.h2.command.Token$KeywordToken |
from |
org.h2.command.Token$KeywordToken |
test |
org.h2.command.Token$IdentifierToken |
end-of-string | org.h2.command.EndOfInputToken |
Easter Eggs? 😉
I found a couple of interesting tricks while studying the code
- Using a bitwise
AND
operation to convert a character to uppercase, using0xffdf
as a bitmask
private Expression readTermWithIdentifier(String name, boolean quoted) {
/*
* Convert a-z to A-Z. This method is safe, because only A-Z
* characters are considered below.
*
* Unquoted identifier is never empty.
*/
switch (name.charAt(0) & 0xffdf) {
case 'C':
...
- Labelled loops, which allows for easy mixing of switch and for statements. You know
break
can break from bothswitch
andfor-loops
. To break from the loop, you have to dobreak loop
instead of justbreak
which marks the end of a condition in theswitch
statement.
case ')':
token = new Token.KeywordToken(i, CLOSE_PAREN);
if (stopOnCloseParen) {
tokens.add(token);
end = skipWhitespace(sql, end, i + 1) - 1;
break loop;
}
break;
...
- For chars
'0' - '9'
, you can get its equivalent integer representation by subtracting'0'
from it
}
number = number * 10 + (c - '0');
if (number > Integer.MAX_VALUE) {
The characters '0', '1', '2'
have the ASCII values of 48, 49 and 50
respectively. A neat yet efficient way to convert
numeric chars into their integer equivalents is to subtract '0'
from them. This avoids the overhead of string-to-integer operations
'1' - '0' => 49 - 48 = 1
'2' - '0' => 50 - 48 = 2
References
- Database Design and Implementation - Edward Sciore
- H2 Database, the Java SQL database - https://github.com/h2database/h2database
In my next post, I’ll explain how these tokens are interpreted and converted by the engine into executable commands.