Activity

Johann

Deployed the Demo to Cloudflare Pages

You can now play around with the prototype demo at https://tempest.jofh.me! I also had some issues with Cloudflare and deployment via their CLI and even the web panel, but it turned out to be an issue on their end, which has been resolved now.

It was some bug where it falsely flagged my files as blacklisted and also gave me practically no proper error, so I had to crawl through the logs of their Wrangler CLI, which was quite a hassle.

I also rewrote the old README.md that was mostly AI generated, so the repository is now finally free of AI generated content again. I also really like the new version, because it puts a proper focus on the design decisions and my personal reasoning - the old one was sort of missing a proper identity.


The attachment shows the deployed version on the web, as you can see in the URL at the top.

Full diff: https://codeberg.org/Vimthusiast/tempest/compare/a81ee13...38be372

Attachment
Johann

REPL that utilizes WebAssembly to run in the Browser!

This was very hard, because I had to use many completely new technologies: WebAssembly, Web Workers, wasm-bindgen, Vue + Nuxt, xterm.js, and certain other crazy browser APIs like SharedArrayBuffer and Atomics.

stdio - stdin, stdout, stderr

I pulled the basic terminal operations out into an interface so that I can supply the implementation based on whether I’m running in the browser or in a native terminal. What came out was WasmStdio and TerminalStdio. It basically gives me the stdin reader and the stdout/stderr writers by reference, allowing me to use the normal std::io operations on them. WasmStdio also does buffering, and TerminalStdio handles raw terminal stdin operations, i.e. key presses for history, deletion, etc. I used termion for the key parsing on native, but rolled my own for Wasm, which is still a work in progress right now.
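A rough sketch of what that interface boils down to (trait and function names here are made up for illustration, not the actual tempest code):

use std::io::{Read, Write};

// One trait, two implementations: WasmStdio for the browser, TerminalStdio for native.
trait Stdio {
    fn stdin(&mut self) -> &mut dyn Read;
    fn stdout(&mut self) -> &mut dyn Write;
    fn stderr(&mut self) -> &mut dyn Write;
}

// Callers only ever see std::io traits, so REPL code works unchanged on both targets.
fn print_prompt(io: &mut dyn Stdio) -> std::io::Result<()> {
    write!(io.stdout(), "tempest> ")?;
    io.stdout().flush()
}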

WasmIo

The new implementation is kind of just like VirtualIo, but Io now also exposes now() -> time::Duration, since getting the time requires syscalls / JS APIs. On native targets it goes through the OS, while in the browser it uses the Date API.
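A minimal sketch of how the browser side can get the time through wasm-bindgen’s js-sys crate (the trait shape here is simplified; only now() is from the actual code):

use std::time::Duration;

trait Io {
    fn now(&self) -> Duration;
    // ... file/terminal operations elided ...
}

struct WasmIo;

impl Io for WasmIo {
    fn now(&self) -> Duration {
        // js_sys::Date::now() returns milliseconds since the Unix epoch as an f64.
        Duration::from_millis(js_sys::Date::now() as u64)
    }
}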

Crossterm Issues

I could not compile for Wasm at first, because I was originally using comfy-table to render the REPL tables, which depends on crossterm - and crossterm only compiles for native targets. So I had to swap comfy-table out for tabled and also remove rustyline to get rid of the crossterm transitive dependency. This also meant that I had to replace the implementation for reading in user commands with my own.

Wasm + Worker Architecture

Now that it compiles, I still have to ‘drive’ it. To not block my main thread with the runtime, I put it on a web worker and made it communicate via postMessage() for stdout/stderr and a SharedArrayBuffer for stdin. The buffer, in combination with Atomics.wait, allows us to put the web worker to sleep until it receives a new input line from the terminal. When originally implementing this, I kind of forgot that, and it crashed everything! Windows sure does not isolate those processes in the right way, even though it sucks up all of my RAM :’(
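The core of the blocking stdin read looks roughly like this - a sketch using the js-sys Atomics bindings, where the buffer layout (a single flag at slot 0) and all the names are made up:

use js_sys::{Atomics, Int32Array};

// Slot 0 of a shared Int32Array acts as a "line ready" flag the main thread sets to 1.
fn block_until_input(ctrl: &Int32Array) {
    while Atomics::load(ctrl, 0).unwrap() == 0 {
        // Atomics.wait suspends the worker's OS thread until the main thread
        // calls Atomics.notify - no polling, no burned CPU.
        let _ = Atomics::wait(ctrl, 0, 0);
    }
    Atomics::store(ctrl, 0, 0).unwrap(); // reset the flag for the next line
}

Atomics.wait is only allowed on worker threads - the main thread may never block - which is exactly why the runtime has to live in the worker.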

xterm.js

This is a cool npm package that gives me raw terminal-like interaction in the browser. I still have to hook up input processing, buffering, and output all by myself, though. But that was actually good, because it allowed me to integrate it with the database really well.

Vue + Nuxt Frontend

I was unsure which frontend framework I wanted to use, but I finally decided on Vue, because… it does not really matter. I do not like using React, because it mixes up scripting with markup/styling. I’ve used Svelte before, which was really nice, but it’s also quite unstable and less popular. So when looking for alternatives, Vue seemed the most stable and the best fit.

Going Forward

Now that it runs as one big thing, I’m going to optimize the implementation and extend it with new features after!


The attachment shows the REPL working in the Browser through xterm.js.

Full diff: https://codeberg.org/Vimthusiast/tempest/compare/89d9950...a81ee13

Attachment
Johann

Delete Statements (also with Where-Clauses)

You can now delete data using the delete from {table} [where {condition}]; syntax, which lets you optionally have the where part, so you can either delete the WHOLE table or just a part of it.

It was fairly easy, since it’s just combining what happens in a select and using the rows’ primary keys for the delete KEY operation in the storage. This is also a single WriteBatch btw, which is more efficient than N different disk operations for N rows, while also giving us atomicity: either the whole delete operation is committed completely, or nothing is.
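In rough pseudo-Rust, the flow looks like this (a sketch; predicate.eval() and the method names on storage are stand-ins, not the real APIs):

// Evaluate the predicate like a select, collect primary keys,
// then commit all tombstones in one atomic WriteBatch.
let mut batch = WriteBatch::new();
let mut rows = storage.scan(table_range).await?;
while let Some(row) = rows.next().await {
    if predicate.eval(&row)? {
        batch.delete(row.primary_key()); // one tombstone per matching row
    }
}
storage.write(batch).await?; // single commit point: all or nothing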

What still has to happen in the future is optimization:

When we delete based on the primary key, or want to drop the whole table, we can implement a range-deletion in the KV store, which will 1) let us avoid scanning and filtering and 2) result in not just one batch, but one single operation, so it’s more compact in storage.

One more thing: I added .clear (or the .c shorthand) in the REPL that clears the screen. Was literally just two lines.


The attachment shows how executing delete works and looks in the REPL.

Full diff: https://codeberg.org/Vimthusiast/tempest/compare/b8f10ea...89d9950

Attachment
Johann

Where-Clause in Select Queries

Now that I’m back at the higher layers of implementation, I am able to add new features, starting with the where-clause in select-statements!
If you’ve used SQL before, you already know how it works, but to give a quick explanation: where {predicate} lets you specify conditions, so you can filter rows based on their column values. This will come in handy soon, when I’ll add delete, so you can, for example, delete from main.users where id = 5.

Expression Evaluation

The implementation is not as simple as one might think. For a normal calculation, we could just evaluate immediately. But here, we usually cannot, since the predicate/condition usually references columns in the row that it is evaluated against. This means that we have two phases:

Compilation & Type-Checking

In the first phase, we traverse the parsed syntax tree that we got from the raw statement. This recursive traversal not only transforms syntax nodes into the CompiledExpr representation, but also gives us the TempestType of each node.

For example, we map IntegerLiteral(5) into (CompiledExpr::Literal(TempestValue::Int64(5)), TempestType::Int64). CompiledExpr also exposes the ColumnRef(idx) variant, where idx is the column’s index in the schema. When we encounter a column reference in the expression, we get its type from the schema and return its index inside of it.

Along the way, we also check that every operation between types is supported, or return an error if it is not.
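Condensed down, the compile phase looks roughly like this (a sketch with the types simplified to the bare minimum; only the names mentioned above are from the real code):

#[derive(Clone, Debug, PartialEq)]
enum TempestValue { Int64(i64), Bool(bool) }

#[derive(Clone, Copy, Debug, PartialEq)]
enum TempestType { Int64, Bool }

enum CompiledExpr {
    Literal(TempestValue),                    // known at compile time
    ColumnRef(usize),                         // index into the row
    Eq(Box<CompiledExpr>, Box<CompiledExpr>), // one example operator
}

// Phase 1 building block: combine two compiled operands into an `=` node,
// type-checking along the way.
fn compile_eq(
    lhs: (CompiledExpr, TempestType),
    rhs: (CompiledExpr, TempestType),
) -> Result<(CompiledExpr, TempestType), String> {
    if lhs.1 != rhs.1 {
        return Err(format!("cannot compare {:?} with {:?}", lhs.1, rhs.1));
    }
    Ok((CompiledExpr::Eq(Box::new(lhs.0), Box::new(rhs.0)), TempestType::Bool))
}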

Evaluation

First, we decode the row we just got, giving us a Vec<..> that is passed by reference to the evaluator.

We evaluate recursively, just like when compiling. Literals can be computed immediately, and ColumnRefs are looked up by their index in the passed-in row while filtering.
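Continuing the sketch from above, evaluation is just a small recursive match per row:

// Phase 2: evaluate the compiled expression against one decoded row.
fn eval(expr: &CompiledExpr, row: &[TempestValue]) -> TempestValue {
    match expr {
        CompiledExpr::Literal(v) => v.clone(),
        CompiledExpr::ColumnRef(idx) => row[*idx].clone(),
        CompiledExpr::Eq(lhs, rhs) => TempestValue::Bool(eval(lhs, row) == eval(rhs, row)),
    }
}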


The attachments - specifically the second one - show how selecting with a filter looks.

Full diff: https://codeberg.org/Vimthusiast/tempest/compare/2423cfb...b8f10ea

Attachment
Attachment
Johann

REPL finally works again - but WAY BETTER than originally

“This little maneuver is gonna cost us 51 years!” - someone I did not listen to, apparently . . .

While the foundation - the I/O layer, runtime, and KV store - took quite a while, the rest went… a lot quicker than I thought. I just had to change a few function signatures and imports, move some stuff around in the workspace, and BAM: it’s working again, (almost) like it was before. What is different: we’re still not persisting the data in the key-value store, but we’re also just using VirtualIo anyway, so it doesn’t really matter right now.

This means I can now move on to adding actual new features, now that the foundation is there! Starting with…

The Query Planner

First, I’ll split my query planner into multiple phases, so it can turn the Abstract Syntax Tree into a LogicalPlan, which is just a sequence of operations, and then analyze that plan to generate a PhysicalPlan, which will contain more explicit instructions - like which range to scan - based on the query analysis.
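A rough sketch of what that split could look like (the variants are illustrative, not the real ones):

type TableId = u32;
type Predicate = (); // stand-in for a compiled filter expression

// "What to do", independent of how the storage layer executes it.
enum LogicalPlan {
    Scan { table: TableId },
    Filter { input: Box<LogicalPlan>, predicate: Predicate },
    Project { input: Box<LogicalPlan>, columns: Vec<usize> },
}

// "How to do it", after analysis: concrete key ranges, point lookups, etc.
enum PhysicalPlan {
    RangeScan { start: Vec<u8>, end: Vec<u8> },
    PointLookup { key: Vec<u8> }, // chosen when analysis finds e.g. `=` on a primary key
    Filter { input: Box<PhysicalPlan>, predicate: Predicate },
    Project { input: Box<PhysicalPlan>, columns: Vec<usize> },
}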

The where-clause

This is kind of the core of what a database has to offer: filtering out rows based on condition-expressions. It’s also going to happen somewhat in sync with the query planner refactor, because the two depend on each other. We don’t really have any difference between the LogicalPlan and the PhysicalPlan until we add in filtering, which can be optimized: e.g. a condition like = on a primary key can be turned from a range scan into a point lookup, which is just O(log N) instead of O(N) scanning + filtering/compares.

The delete statement

Yes, it’s a bit embarrassing, but this one’s still missing after what… according to Flavortown it’s been ~260h, but Hackatime says it’s at least 290h and WakaTime puts me at ~330h?! Deleting will kinda require the where-clause, unless you want to delete EVERYTHING 🤨, but I doubt that’s the case.

Range-deletes in KV store?

This would be useful to optimize certain delete statements, reducing them from N tombstone-inserts in the KV store to just one range insert. But it will require another layer alongside everything else, for range data in both the Memtable and SST implementations, which will have to be queried/zipped together with the point inserts. Quite the difficult optimization, so I’ll save it for a bit later.


The first attachment shows our FULL test-suite passing, verifying that everything TempestDB could do before I started the refactor/partial rewrite is now back and even more stable than I could have ever imagined. The second one shows a quick example of using the new old REPL.

Full diff: https://codeberg.org/Vimthusiast/tempest/compare/51f9c3f...2423cfb

Attachment
Attachment
Johann

Implement Storage in tempest_kv

Since we now have the runtime for background tasks, the memtable, etc., it was finally possible to create our top-level Storage struct, which exposes the key-value store API through the StorageHandle.

When creating a Storage instance, we spawn a background runtime task that loops and listens on three channels - write_rx, get_rx, and scan_rx - for writing batches, getting a single key entry, and scanning across a range, respectively.

Concurrency

For each request on those channels, we don’t just execute the handler in place; we spawn another background task that gets the required references (Rc<T>) - currently just to the Memtable - and responds from there, so that we don’t block any other requests coming in. This is especially important once we do file I/O, since that will take a while, and we want as much concurrency as possible, of course.

So we can actually read and write “at the same time” - concurrently, not in parallel 😉.

Writing

We send over the WriteBatch with the data, as well as a oneshot::Sender that will tell us once the write has either succeeded or failed.

Reading

We send over the lookup key as a Bytes, which is just a smart reference-counting pointer to our data. We then send back Some(value)/None through a oneshot::Sender, or an error if the read failed.

Scanning

We send over the start and end bounds. Contrary to some expectations, we don’t just send back something like a Vec with all the data inside; we actually send back an mpsc::Receiver, which is a Stream, so we can stream the data back up, avoiding allocating everything at once - which would make RAM a bottleneck on the biggest table scans we do. Once the channel closes, we know that we have reached the end.
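Put together, the three request shapes flowing into the Storage task look roughly like this (a sketch; the field names and the WriteBatch/Bytes/Entry/StorageError types are my own stand-ins):

use tempest_rt::sync::{mpsc, oneshot};

struct WriteReq {
    batch: WriteBatch,                              // the data to persist
    ack: oneshot::Sender<Result<(), StorageError>>, // fires on success or failure
}

struct GetReq {
    key: Bytes, // cheap to clone: reference-counted bytes
    resp: oneshot::Sender<Result<Option<Bytes>, StorageError>>,
}

struct ScanReq {
    start: Bytes,
    end: Bytes,
    rows: mpsc::BoundedSender<Entry>, // rows stream back; channel close = end of scan
}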


The first attachment shows all 17 tests for Storage passing, verifying that it can do everything we just implemented correctly. The second attachment shows what one of those tests looks like, so you can imagine how we will be able to use Storage in our database engine, which is coming next!

Yes, I will not get hung up on implementing file-based persistence, since that won’t help on WASM anyway, so we’ll just stick to the Memtable we just hooked up.

Full diff: https://codeberg.org/Vimthusiast/tempest/compare/7e0ea0d...51f9c3f

Attachment
Attachment
Johann

BIG Improvements to the Runtime

Aside from some smaller refactors on the side, the big improvement is finally implementing and using the Waker that async Rust passes through for every Future, i.e. state machine.

Waker Implementation

We identify tasks by their TaskId, which is either Main or Task(id). We store the runtime context during every tick() inside of a thread_local!{} where we have a wake-set. The Waker just accesses the thread-local context object and inserts the TaskId it was created from into the wake-set. Now, whenever we’re waiting on some task, we can just give it that Waker and tell it to .wake() us up once whatever it’s doing is done.

The Runtime just goes through the whole wake-set on tick() to determine what has to be polled and what doesn’t. If nothing has to be polled, it suspends the current thread by calling io.wait(), which - for the real LinuxIo - will put the actual OS thread to sleep, until we get at least one I/O completion. This can range from simple file I/O, to actual incoming network requests! That’s what makes a TCP listener possible!

Note that waking has a tick-delay of one i.e. it registers for execution on the next tick, or else we would repoll without any completed I/O.
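Boiled down, the wake-set mechanism looks something like this (a simplified sketch of the idea, not the actual tempest_rt code):

use std::cell::RefCell;
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum TaskId { Main, Task(u64) }

thread_local! {
    // Installed by the runtime for the duration of every tick().
    static WAKE_SET: RefCell<HashSet<TaskId>> = RefCell::new(HashSet::new());
}

// What a Waker effectively does when .wake() is called: mark its task runnable.
// The runtime drains this set on the NEXT tick - hence the one-tick delay.
fn wake(task: TaskId) {
    WAKE_SET.with(|set| set.borrow_mut().insert(task));
}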

Waker Usage across Channels, Task Handles, and I/O

sync::oneshot

single-use channel to transfer a value from A to B:

  • receiver checks -> nothing there? -> store waker
  • sender inserts -> receiver waiting? -> wake it

sync::mpsc

multi-producer single-consumer channel with a fixed capacity:

  • consumer checks -> nothing there? -> store the consumer’s waker
  • consumer checks -> something there? -> producers waiting for capacity? -> wake them
  • producer inserts -> consumer waiting? -> wake it
  • producer inserts -> queue full? -> store the producer’s waker

JoinHandle

waiting for a task to complete:

  • internally uses the sync::oneshot as a signaling primitive, nothing new here

fs::*

asynchronous file-system I/O helpers:

  • try submitting their request to I/O -> queue full? -> wake itself -> return Pending

The interesting thing here is that the Runtime can identify what TaskId an I/O completion comes from, because it is stored in the upper 20 bits of the tempest_io::OpHandle(u64). This allows us to easily correlate an I/O completion with the task that has to be polled, without any extra storage type, like a HashMap.
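The bit fiddling for that is tiny. A sketch of the layout (the 20-bit split is from above; the helper names are mine):

const TASK_BITS: u32 = 20;
const TASK_SHIFT: u32 = 64 - TASK_BITS;
const SEQ_MASK: u64 = (1 << TASK_SHIFT) - 1;

// Pack the task id into the upper 20 bits, a per-op sequence number below it.
fn op_handle(task_id: u64, seq: u64) -> u64 {
    debug_assert!(task_id < (1 << TASK_BITS));
    (task_id << TASK_SHIFT) | (seq & SEQ_MASK)
}

// A completion comes in -> we immediately know which task to wake. No HashMap.
fn task_of(handle: u64) -> u64 {
    handle >> TASK_SHIFT
}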


Attachments:

  1. old + new tests for tempest_rt succeed, so we’re running correctly!
  2. graphic to show how task joining executes in the tick-based runtime with the new wakers. we no longer poll everything on every tick!

Full diff: https://codeberg.org/Vimthusiast/tempest/compare/31d4220...7e0ea0d

Attachment
Attachment
Johann

Bounded Multi-Producer Single-Consumer Channels for the Tempest async runtime

The new tempest_rt::sync::mpsc module provides the BoundedSender<T> and BoundedReceiver<T> primitives. You create them through the mpsc::bounded(capacity) function and they expose both synchronous try_send() and try_recv() methods, as well as asynchronous send() and recv() methods. It’s multi-producer, because the BoundedSender<T> is clonable, but single-consumer, because the BoundedReceiver<T> is NOT clonable.
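A quick usage sketch (assuming recv() resolves to None once every sender is dropped, Tokio-style):

let (tx, mut rx) = mpsc::bounded::<u64>(4);

let tx2 = tx.clone(); // multi-producer: senders are clonable
spawn(async move {
    for i in 0..8 {
        tx2.send(i).await.unwrap(); // suspends once the 4-slot buffer is full
    }
});
drop(tx); // only tx2 is left; the channel closes when the task finishes

while let Some(v) = rx.recv().await { // single consumer: NOT clonable
    println!("got {v}");
}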

Usage

Useful when we have long-running processes that require communicating back and forth. I think the first real application will be the main tempest_kv::Storage implementation that has to communicate with the Wal that runs in the background. This is because of the inversion-of-control style implementation of the WAL, which lets the caller (an async task) hold control over when the WAL gets to work, not the callee (for the Wal, we want concurrency, so I wrote it as poll_fn(callback)-based per-tick compute instead of having .await everywhere, which would force pending operations to wait for the other operations to complete).

To actually prevent our CPU from spinning on a Future that is Poll::Pending - usually because of I/O - we’ll require proper Waker infrastructure and implement wake and wake_by_ref on our RawWakerVTable implementation. But this would be a lot more complex than what we have now, since waking will be driven by the Io completions in the Runtime, which adds a lot of state and deadlock potential.

It’s fine for now to just “busy-wait” in our loop { .. }, because the tests are short-lived anyway. Once we have a long-running process, we will require this mechanism to suspend execution until we get I/O completions, i.e. CQEs, so as to not suck up CPU resources beyond reason.


The attachment shows the 16 unit tests for every edge case of the new mpsc channels.

Git commit: https://codeberg.org/Vimthusiast/tempest/commit/31d4220

Attachment
Johann

Implement async Sorted-String Tables

write()

  • takes in a source: StorageIterator, which means it could come from memory, e.g. when flushing a Memtable, or from disk, e.g. when running a compaction job that creates a MergeIterator over multiple SST readers
  • reads from source until it’s exhausted, writing every block
  • writes the bloom filter, which can check in O(1) whether a key might exist - with a small chance of false positives, but generally reducing the need for actual lookups
  • writes the block index that stores the offsets of blocks
  • writes the footer with metadata
  • gives us metadata about things like the key ranges, so we can see whether a key could even be inside of the SST or not, as well as the entry count, which is required for compaction, so we know how to size our bloom filter for the right false-positive rate

load()

  • opens a file
  • reads footer, bloom filter, index
  • returns it all to us as an SstHandle

SstHandle::get()

  • checks the bloom filter - may return early if key is definitely not present
  • gets the block metadata for the key from the block index
  • reads in the block from Io
  • checks the block for the key

SstIterator

  • implements every operation that the StorageIterator trait requires:
    • next() returns the next key, or None, if exhausted
    • seek(seek_key) causes the iterator to jump forward to the first key where key >= seek_key
    • close() does… well nothing, since closing the handle happens higher up on the handle itself

=> It shares the handle via an Rc so that we can have multiple readers at the same time. handle.get() works on the Rc too, since Rc<T> implements Deref to &T.
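For reference, the trait surface described above boils down to roughly this (a sketch; async directly on the trait, with Bytes as the entry type being my assumption):

trait StorageIterator {
    // Next entry in key order, or None once exhausted.
    async fn next(&mut self) -> Option<(Bytes, Bytes)>;
    // Jump forward to the first key where key >= seek_key.
    async fn seek(&mut self, seek_key: &[u8]);
    // For SstIterator this is a no-op; the file is closed via the shared handle.
    fn close(self);
}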


The attachment shows all 14 tests for the sst logic passing, that validate:

  • write an SST to the file system
  • load an SST, giving us a handle
  • get from a handle (point lookup)
  • iter over the SST entries sequentially

Full diff: https://codeberg.org/Vimthusiast/tempest/compare/be29b1d...fcaa5b2

Attachment

Comments

Yoid 9 days ago

nicee!

“reads from source until it’s exhausted” looks very cruel /silly

Johann

Impl async Write-Ahead Log, async Memtable Iterator, and oneshot channels

Oneshot Channels

To communicate between different async tasks - like when signaling that a task has completed - we have to send over a value. As a new reusable primitive for that, I’ve made the oneshot channel, which can be used to send a single value from one task to another.
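Usage is exactly what you’d expect (a sketch, assuming the API mirrors Tokio’s oneshot):

let (tx, rx) = oneshot::channel::<u32>();

spawn(async move {
    // ... do some work in the background ...
    tx.send(42).unwrap(); // fires exactly once
});

let answer = rx.await.unwrap(); // suspends until the sender fires
assert_eq!(answer, 42);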

Async Memtable Iterator

Since I changed the model back from tick-based to async, I had to rethink the StorageIterator trait. Of course, that also meant changing the MemtableIterator to implement the new async trait.

Async Write-Ahead Log

The new async-impl is even better than the v1-async and the v2-tick-based impls. It supports submitting multiple write batches, allowing us to utilize io_uring to the fullest. You can submit writes and will receive confirmation/acknowledgement via a oneshot channel.

I handle writing and fsyncing, as well as recovery on init. I rotate the file after a certain size threshold, to eventually delete old data that was flushed, and so that loads don’t get too slow.

Recovery record pass-through is handled through an FnMut(record) callback that you pass into Wal::init(). The WAL does not have to accept writes during that phase, nor be very fast, since recovery is only a startup thing that happens after crashes - but I’ll likely still optimize it later.


The attachments show WAL tests executing successfully:

  • basic single ack that a submitted write completes within a certain time
  • multi ack that shows writes happen in batches (all the same tick)
  • wal file rotation; forced through a small threshold in the config
  • wal recovery that first writes three records and then reads them back in order

Full diff: https://codeberg.org/Vimthusiast/tempest/compare/5ac6757...be29b1d

Attachment
Attachment
Attachment
Attachment
Johann

Implemented Background Tasks

The Runtime now supports spawning tasks in the background and joining them back later! This is a huge and necessary step, so that our flush jobs, compaction jobs, reads, and writes do not block each other. It is what finally allows us the desired concurrency. We should always limit things like the number of concurrent compaction jobs, but this allows us to scale!

You can spawn a background task through the spawn(fut: impl Future) function like so:

let join_handle = spawn(async {
    // do I/O or something, then return the result when finished:
    42
});

// this waits for the background task to complete
let result = join_handle.await;

// the result will be `42`
assert_eq!(result, 42);

This should read naturally to anyone who has used something like Tokio before.
We don’t have to “join the task” at all: the runtime will drive all the tasks, just like it drives the main future in block_on, sharing compute time - everyone gets polled evenly, until the main future resolves.


The attachments show the tests succeeding that verify usage of the spawn() function.
It also has code for how spawning two tasks and then joining them would look, showing that we can indeed have single-threaded cooperative multitasking.

Relevant commit: https://codeberg.org/Vimthusiast/tempest/commit/5ac6757

Attachment
Attachment
Johann

Lots of smaller stuff, like refactoring Journal and parts of the KV layer to async

Journal

It’s now async, and collapsed from a state enum with ~19 variants down to just a struct with a few async methods for initialization, modification, and closing.

Manifest

The Manifest now correctly lives in the root of tempest-kv and has been made async-based, like the Journal.

StorageIterator

The new version is better, since we already know which iterators we have. We don’t expose the StorageIterator API publicly, so I will just use an enum when polling from both the Memtables and SSTs - but that comes later. This is why I can just use async on the trait itself, instead of dyn dispatch.

I also implemented the MockIterator, which serves as a kind of reference to test my other iterators against.
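The enum-dispatch idea from above would look something like this (a sketch; the variant and type names are illustrative):

enum AnyIterator {
    Memtable(MemtableIterator),
    Sst(SstIterator),
    Mock(MockIterator),
}

impl AnyIterator {
    async fn next(&mut self) -> Option<Entry> {
        // Static dispatch: each arm is a plain async call, no boxed futures -
        // which is why async fn on the trait itself works without dyn.
        match self {
            AnyIterator::Memtable(it) => it.next().await,
            AnyIterator::Sst(it) => it.next().await,
            AnyIterator::Mock(it) => it.next().await,
        }
    }
}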

Chores: Updating the Cargo.toml files

even the boring stuff always has to happen at some point

  • patched the repo URLs (from the old GitHub to the new Codeberg repo)
  • bumped the versions for the next publish/release
  • updated the license field to license-file with a path
  • updated authors, license-file, repository, and Rust edition to be workspace-wide

Also note that I transitioned from the MIT license to the more restrictive Functional Source License (by Sentry). It restricts making a competing SaaS (relevant for when I’ll be making a platform-as-a-service for managed TempestDB).

It’s mainly meant to protect against the free-rider problem that occurs when cloud providers just yoink your software and sell it on their infrastructure as a new product. I think FOSS is flawed in that way, but source-available with an expiration date of two years - like the FSL, which converts to a FOSS license after that - seems really awesome. Best of both worlds.


The attachment shows the tests passing for the new journal, manifest, and the mock iterator.

Full diff: https://codeberg.org/Vimthusiast/tempest/compare/71c6bc2...4820daf

Attachment
Johann

I wrote an async runtime?!

Yes, you’re not dyslexic: I really wrote a custom async executor - and it was quite simple!

Even though I just began to leave async behind, I’ve been met with a harsh reality: manually writing all of these state machines does not scale.

The reality is that I did not have an issue with async itself, but with Tokio. It is a heavy runtime and brings in a lot more than just the concurrency I need. The other issue is that, with all of its dependencies, it increases compile times, makes the binary bigger, and makes eventually compiling to WASM a bit of a headache.

By handrolling my own little runtime, I’ve removed the need for manually writing the tick(io: &mut Io) methods. I no longer access I/O through a reference, but I don’t use a thread_local! { .. } for that either. Instead, I just keep the Io instance on the Runtime itself. When it executes a future, it internally runs the same loop as before: 1) polling the Io’s completion queue and 2) advancing the state machine.

Example:

let fd = open_file::<I>(path, opts).await?;
let ... = write_exact::<_, I>(fd, b"hello world", 0).await;
close_file::<VirtualIo>(fd).await?;

And where before I’d have to write three different states in a big enum and handle transitions in tick, it all just gets generated by the compiler for me.

The implementation was really easy, because my tick-based model is so simple. I did not need to write any complex task-waking stuff - the waker is literally just a no-op function - and it is supposed to run on one thread anyways, so it should mostly run exactly like it did before.


The attachments show the tests for the new Runtime and the Future-based file operations passing and also show how partial writes/reads are easily handled through the new async helpers!

Full diff: https://codeberg.org/Vimthusiast/tempest/compare/3df31f6...71c6bc2

Attachment
Attachment
Johann

Bringing back Storage

Writes and gets are working again - even better than ever before!

Using the existing state machines, I’ve hooked up the Memtable and the Wal (write-ahead log) into the single Storage interface again, with major improvements over the old implementation that relied on async Rust: It allows for concurrent reads and writes without blocking each other!

Similar to how you get an OpHandle from my Io that you can use to see if a submitted I/O operation has completed, you receive WriteTokens from Storage::write() and ReadTokens from Storage::read().

When advancing the Storage state machine, you pass in an output parameter &mut StorageOutput that contains the completions (currently just reads and writes; later also scans, which are a bit more complex, since a scan has to yield multiple values). An output param removes the need for a complicated StorageResult enum and also reduces memory usage, because we can reuse the same Vec on every interaction.
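Driving it then looks roughly like this (a sketch; the tick/poll method names and the fields on StorageOutput are my own illustration):

let write_token = storage.write(batch);
let mut out = StorageOutput::default();

loop {
    io.poll();                       // reap I/O completions first
    out.clear();                     // reuse the same Vecs, no reallocation
    storage.tick(&mut io, &mut out); // fills `out` with finished reads/writes

    if out.writes.iter().any(|(token, _)| *token == write_token) {
        break;                       // our write is durable
    }
}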

Next is probably scanning across ranges and implementing support for MVCC filtering. What’s also missing is recovery, since the current Storage implementation has a todo!() on the Wal::RecoveryRecord(r) case during initialization, so it would just crash when reopening the directory. Proper closing of Wal, Manifest, etc. is also still missing.


The attachments show the same test - writing a few values and reading one of them back multiple times - run twice: once on the INFO logging level and once on the more verbose DEBUG logging level for the more interested people.

Full diff: https://codeberg.org/Vimthusiast/tempest/compare/bdb9108...3df31f6

Attachment
Attachment
Johann

Implemented SstWriter and SstReader and optimized Journal

Quickly, about the Journal: before, I was repeatedly moving data around even when sub-state-machines did not complete; now I only copy data when really necessary, which also simplified the implementation. It required inlining every tick_* helper method, but that’s worth it.

Since the SST Manifest was ready to track the sorted string table files I had, creating SSTs and searching them was the next obvious step.

SST File Format

  • blocks (about 4KiB each) that contain a few entries
  • bloom filter to allow quick O(1) check if a key could be in the file
  • block index that stores the key ranges of the blocks together with the block’s filepos
  • footer that mainly contains offsets and sizes of the index and bloom filter, along with some other metadata
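Visually, the layout is roughly this (a sketch of the format described above):

// +---------+---------+- ... -+--------------+-------------+--------+
// | block 0 | block 1 |       | bloom filter | block index | footer |
// | ~4 KiB  | ~4 KiB  |       |              |             |        |
// +---------+---------+- ... -+--------------+-------------+--------+
//
// The footer sits at the very end of the file and stores the offsets/sizes of
// the bloom filter and the index, so a reader starts at the back and jumps
// from there.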

SstWriter

Firstly, it creates a file to write the data into. When that finishes, it initializes the builders for the next block, the block index and the bloom filter, before transitioning into the Active phase.

In the Active phase, it operates as a sub state-machine to the main writer state machine:

  • Write Blocks:
    • poll data from the source iterator, and if data is ready this tick, write it into the block builder until it’s full
    • when the block builder is full OR the source is exhausted (no remaining entries), first finalize the block builder, replacing it with a new one, then write the encoded block into the file at the current offset
  • Write Bloom
  • Write Index
  • Write Footer

[…]

This devlog got slightly too long, so you can read the full devlog in this gist: https://gist.github.com/Vimthusiast/24e217a21b806304a241d753e34da7a3


The attachment shows the tests for the SST writing, loading, and reading all succeeding.

Full diff: https://codeberg.org/Vimthusiast/tempest/compare/8d2d289...bdb9108

Attachment
Johann

SST Manifest and IO Registered Buffers

Firstly, I implemented the new SST Manifest over my generic Journal<T>, which will track actions that happen with the sorted string tables (SSTs). It has to track their file numbers, which allows us to reconstruct their paths for reading, compaction, and cleanup. Edits happen atomically, so when compaction runs, it either writes “new sst 3, removed 1, 2” or nothing. All or nothing. This prevents something like “remove 1”, “remove 2”, “add 3”, which could cause us to lose data when crashing between “remove 2” and “add 3”. We could of course have solved this by reordering them, but firstly, that is more write-intensive, since we do many smaller writes, and secondly, it requires me to be reliable - and I frankly don’t trust myself that much, so I let myself be helped through the type system :)
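Conceptually, one atomic edit record looks like this (a sketch; the names are illustrative):

type FileNum = u64;

enum ManifestEdit {
    // One record covers the whole compaction: either the journal write lands
    // and all changes apply together, or none of them do.
    Compaction { added: Vec<FileNum>, removed: Vec<FileNum> },
}

// The "new sst 3, removed 1, 2" example from above:
// ManifestEdit::Compaction { added: vec![3], removed: vec![1, 2] }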

I also implemented registered buffers for I/O, which is a bit complex to explain. Simply put: when you write to a file through LinuxIo and pass a byte buffer to I/O, the kernel normally has to pin the buffer so it stays stable - preventing things like that memory being written out to the swap file - during the direct memory access (DMA) between the buffer and the device, and unpin it when done. With registered buffers, we tell the kernel to pin some buffers once at startup, saving on the frequent pin/unpin operations. We also ensure that they’re block-aligned - just a fixed alignment of 4096 bytes right now - so they can all safely be used with files opened with the O_DIRECT flag. VirtualIo doesn’t require any pin/unpin at all.


The attachments show the tests for the manifest as well as the new registered buffer I/O on VirtualIo and LinuxIo succeeding.

Full diff: https://codeberg.org/Vimthusiast/tempest/compare/76e1eb0...8d2d289

Attachment
Attachment
Johann

Implemented WAL Recovery sub-SM and refactored code

I implemented the remaining logic for the WAL sub-state-machine that is responsible for recovery.
The WAL first has to finish recovery before being usable, so I no longer have a strange split between a WalRecoveryReader and a Wal like before - there is just Wal, and I always call tick() after polling io, getting back one of the following WalResult::* variants:

  • Pending when it’s currently waiting for I/O, meaning we’re waiting for it to get ready for us
  • RecoveryRecord(bytes) when it has recovered a new record; it does not care about the encoding scheme, it just takes and reads back raw bytes
  • Acked(write_tokens) when a list of append actions/writes has been acknowledged as persisted, identified by the WriteTokens that were uniquely assigned when calling append(record)
  • Ready when it is initialized and has nothing else to report.

It should never return a RecoveryRecord result once it has returned Ready for the first time - though I did not assert that invariant on the type-system level, as it would require a more complex typestate pattern, likely forcing me to consume and restore self on every tick, providing little advantage over just assert!ing it normally, since that branch only hits during startup.
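The drive loop then reads like this (a sketch; replay() and mark_durable() are placeholder helpers, not real functions):

loop {
    io.poll();
    match wal.tick(&mut io) {
        WalResult::Pending => continue,                    // still waiting on I/O
        WalResult::RecoveryRecord(bytes) => replay(bytes), // only before the first Ready
        WalResult::Acked(tokens) => mark_durable(tokens),  // these writes are persisted
        WalResult::Ready => { /* initialized; appends may proceed */ }
    }
}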

I also did a larger mechanical refactor over the codebase that removed the need to keep passing completions separately from io, since I can now read the CQEs directly from an Io implementation through Io::completions() and - replacing find_cqe() - get specific results through Io::get_cqe(handle).


The attachment shows the recovery test executing correctly, verifying we get back the records A, B, and C in the order we inserted them.

Full diff: https://codeberg.org/Vimthusiast/tempest/compare/3acf187...76e1eb0

Attachment
Johann

Began implementing improved WAL-state-machine over new Io interface

Note: I may be abbreviating “state-machine” as SM from now on.

I also refactored the Journal further, using the new state machines I created, as planned in the last devlog: OpenFile, CloseFile, RemoveFile, Fstat, and Fsync. These also came in handy for the new Write-Ahead Log:

The WAL, as you might already know, ensures that all the writes we do are actually persistent. We don’t want new data that may vanish after a crash to become visible to queries - which is exactly what would happen with the data in our Memtable, if it weren’t for the WAL.

The old async WAL implementation was severely limited, since it only allowed submitting one write at a time. My new custom state machine allows you to submit multiple record writes at once! How does that work?

When calling wal.append(record), you get back a WriteToken, which is basically just a u64 internally. When calling wal.tick(), you may now get a special WalResult::Acked(tokens) back, where tokens is a list of records that were submitted for appending earlier and have now been persisted in a way that is power-outage-resilient, a.k.a. satisfying ACID compliance - the greatest feat in database engineering!! And how does that work?

We keep a queue of all the pending writes, ticking them every tick. Once, say, the first 4 of 7 writes complete, we start a new Fsync sub-SM in the background, specifying that it covers 4 records. After it completes, we remove the first four writes from the pending-writes queue and return the Acked(tokens) result from our WAL.

Next up is implementing the recovery sub-SM for the WAL.


The attachment shows the append test succeeding, showing that submitted writes are acknowledged in the order of submission, preventing recovery issues on torn writes / crashes.

Full diff: https://codeberg.org/Vimthusiast/tempest/compare/833375c...3acf187

Attachment
Johann

Began KV Store Refactor And Improved Journal Implementation Further

I was able to remove about 400 lines from the journal/mod.rs file, because I improved the implementation of the ReadExact and WriteExact helpers. They no longer do any I/O in the initializer begin(), but only in tick(), so I don’t have to do separate checks for a full io_uring and retries directly in the Journal implementation anymore - it all happens inside ReadExact::tick() / WriteExact::tick(). Of course, I should create similar retry helpers for OpenFile, CloseFile, RemoveFile, Fstat, and Fsync, which would get rid of about 10 other Await* states on my Journal.

I also began TheBigRefactor™ of the Key-Value Storage Layer. First up, I commented out most of the modules that did I/O using the old FioFS trait, which will be replaced by the new Io trait.
I began with overhauling the StorageIterator, as that is what drives all of the reads and compaction. Then I began adapting the Memtable - specifically the MemtableIterator - to implement the new StorageIterator methods, and tested it.


The attachment shows the tests for the memtable (and one other test) passing, confirming that point lookups through get() and iteration through begin_next() (+ also begin_seek) work as intended / specified on the StorageIterator trait. Namely, when we’re at key “orange” and seek to “apple”, we should just stop immediately, since we’ve already moved past that point.

Relevant commits/full diff: https://codeberg.org/Vimthusiast/tempest/compare/71a56b2...833375c

Attachment
Johann

Implemented Journal<T, I> using the new Io Interface!

I reimplemented the Journal I use all throughout Tempest as the first move towards using the new Io interface over the old FioFS, which relied on async. This is the first step towards eventually dropping Tokio and async as a whole!

The advantage of this new implementation is that it is more performant (of course), more predictable, and can be run through the SimDriver, which is sort of a deterministic runtime for executing all of my state machines. Everything will be a state machine. Everything is a state machine, in a way. The database - and the multi-node cluster in the future - will just be its own state machine, driven through the tick() method.

All of this should allow me to do Deterministic Simulation Testing in the future, just like FoundationDB and TigerBeetle do it.


The attachments show the output of the journal test completing successfully. It shows basic modifications of a ReplayableCounter example. You can find it in tempest_core/src/journal/tests.rs, if you wanna take a look at it.

Relevant commits/full diff: https://codeberg.org/Vimthusiast/tempest/compare/3505082...71a56b2

Attachment
Attachment
Attachment
Attachment
Johann

Implemented New And Improved Io Interface!

The new Io trait is no longer using async and is now sync instead, using a poll-based mechanism to check for completed operations. It lives inside of tempest_io and will be the root of the dependency graph for my project workspace. Anything that does I/O, including getting the system time or doing network stuff - not just file I/O - has to be done through this interface. I just inject it where needed.

The LinuxIo implementation uses the io_uring bindings internally, supporting true asynchronous file I/O on the kernel level.

The VirtualIo implementation just completes directly, without any steps in between. I may simulate delay between submission and completion so it has to be polled a few times, but that’s not required for now.
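The shape of the interface is roughly this (a sketch; the method and type names here are my illustration, not the actual tempest_io API):

struct Op;            // an I/O request (open, read, write, ...)
struct OpHandle(u64); // identifies one submitted operation
struct Cqe;           // a completion entry for a finished operation

trait Io {
    fn submit(&mut self, op: Op) -> OpHandle;             // enqueue, get a handle back
    fn poll(&mut self);                                   // reap whatever has completed
    fn completion(&mut self, h: OpHandle) -> Option<Cqe>; // check one specific op
}

// LinuxIo: submit -> io_uring SQE; poll -> drain the kernel's completion queue.
// VirtualIo: submit completes the operation immediately, entirely in memory.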

All of this allows me to refactor tempest_core and eventually even tempest_kv and tempest_engine to operate as a deterministic state machine, just like the TigerBeetle database developers have done it, allowing me to try deterministic simulation testing in the future, where I am able to simulate all sorts of faulty behavior from the I/O side and ensure my database system still operates correctly.


The attachment shows the tests passing for the new I/O implementations. There is no fancy output since it gets hidden on success. It just shows both of my implementations are working as intended!
Relevant commits/full diff: https://codeberg.org/Vimthusiast/tempest/compare/8822b9d...3505082

Attachment
Johann

Improve KV Storage Layer for MVCC in the Future

This time I worked more on the iterator layer of the storage engine.

I implemented the FilterIterator:

  • it is generic over a Filter implementation which may allow for filtering based on MVCC timestamps or TTL based expiry of individual entries. The disk-deletion of TTL marked entries can happen during compaction, i.e. lazily, which is the most performant way here.
  • refactored Storage to be generic over a StorageStrategy instead of just the Comparer, which wraps both the Comparer and the Filter.
  • no longer require passing in the snapshot: SeqNum parameter for the read path (i.e. get() and scan()), since that is an implementation detail of the key-value store
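The Filter surface composes nicely; a sketch of the idea (trait name from above, the tuple impl mirrors what the tests exercise):

trait Filter {
    fn keep(&self, key: &[u8]) -> bool;
}

// A tuple of filters is itself a filter: an entry survives only if all agree.
impl<A: Filter, B: Filter, C: Filter> Filter for (A, B, C) {
    fn keep(&self, k: &[u8]) -> bool {
        self.0.keep(k) && self.1.keep(k) && self.2.keep(k)
    }
}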

I also refactored the general iterator module into multiple submodules inside of tempest_kv/src/iterator/, putting every iterator implementation into its own file.


The attachment shows the tests for the FilterIterator, based on simple Filter implementations, passing as expected, verifying correctness for further use. You can see that a tuple (a, b, c) of multiple filters is also a valid filter, combining the inner filters into a bigger one, basically doing a.keep(k) && b.keep(k) && c.keep(k). That tuple case is visible in the test_chained_filters() screenshot on line 10.

Attachment
Attachment
Attachment
Attachment
Johann

Implemented Interactive REPL-Shell

You can now type tempest repl <DATA_DIR> to create/open a database at the specified path.
After doing so, you will be put into an interactive shell. The shell has different built-in commands, all starting with a .:

  • .help | .h show help menu
  • .quit | .q terminate the REPL session
  • .databases | .dbs list all databases
  • .tables <database> list all tables, optionally scoped to a database
  • .types <database> list all types, optionally scoped to a database

Aside from the built-in commands, you can execute valid TQL statements. If you happen to make a syntax error, it will show nice error messages using the span data I collect inside of the parser. The formatted error report is then rendered using the ariadne crate, so I don’t have to do all that ugly string interpolation myself.

Here is a list of queries you can try out:

  • create database main;: create a new database called main
  • create type main.User { id: Int64, username: String };: create a type called User inside of the database main. The fields are id, which is a 64-bit integer, and username, which is a String.
  • create table main.users : main.User { primary key (id) };: create a table called users inside of the database main with the type User from the same database and with the field id as the primary key.
  • insert into main.users { id: 42, username: "John" };: insert an entry into the table users inside of the database main. The entry has the value 42 for the field id, and the value "John" for the field username.
  • select * from main.users;: select every column of all the entries from the table users inside of the database main. Alternatively, you can supply specific fields, instead of *, to only get out those fields.

Not sure what I’ll be doing next, but it will be fun!

Johann

Select Queries Now Working

I improved the Storage implementation to support seek(search_key) on the StorageIterator trait, which continuously advances until it reaches a key that is logically greater than or equal to the search_key. Logically here means that the comparison does not include the suffix data, which is internal to the engine - specifically an HlcTimestamp, a special 64-bit timestamp that will come in handy later, when implementing migrations and going distributed!

Next up is implementing the REPL, so y’all can finally try out writing some TQL for yourself!


The attachment shows you a test that verifies that the engine is correctly getting “John” from the setup of the test in the previous devlog. It also shows projections, i.e. selecting just the username column.

Was finally able to merge in the feature branch for executing simple queries: https://github.com/Vimthusiast/tempest/commit/9389ad4

Attachment
Johann

Query Planning and Execution in the Engine

Implemented the first execution of queries! What’s working right now is creating databases, types, and tables, and also inserting into the tables. The next thing I’m working on will be selects, to have the first vertical slice!

The attachment shows the (currently sparse) debug outputs of the following commands:

create database main;

create type main.User {
    id: Int64,
    username: String,
};


create table main.users : main.User {
    primary key (id),
};

insert into main.users {
    id: 4,
    username: "John",
};

Note that you cannot see the insert output, since the logging is filtered to just the tempest_engine module, while the storage layer sits inside of the tempest_kv crate.

Also notice how there are IDs for everything? FieldIds, DatabaseIds, TypeIds? That’s so I can later support renaming things without breaking old code. I keep the IDs stable, so renaming becomes a simple metadata change. I still have to think about how I will implement schema migrations without having to rewrite EVERYTHING, which would be bad. I guess I have to track the old schema states for every migration so that, when decoding a row, I can look up the right strategy and apply the migrations lazily.
But that’s way too complex for the duration of Flavortown and will be implemented much, much later.

Next up is of course selecting, before moving on to the REPL for the fully interactive experience!

Attachment
Johann

Completed The Core TQL Parser

I managed to finish implementing the core of the TQL parser, allowing us to get a typed representation - called an Abstract Syntax Tree, or AST for short - of the following syntax examples, which can be passed to our engine for execution in the future:

create database main;

create type User {
    id       : Int64,
    username : String,
};

create table mydb.users : User {
    primary key (id),
};

insert into mydb.users { id: 1, username: "john" };
insert into mydb.users { id: 2, username: "bob" };

select * from mydb.users;

select id, username from mydb.users;

It actually gracefully recovers from errors as much as possible, to show us everything that is wrong in our code, and not just the first thing at the top (see the example attachment with a garbage statement).

Next up is implementing the actual database engine itself! It will take in a produced AST, create a query plan for it, run it, and return to us what happened: AST -> Plan -> Execution -> Result!


The attachments show our tests that verify the parser is working correctly. The actual usage of all of these ~1.1K LoC will become really visible once I get to wiring up the Engine to the REPL shell - neither of which really exists yet, hehe ;)

Relevant commits/diff for this Devlog: https://github.com/Vimthusiast/tempest/compare/b1c19ea..61fd23d

Attachment
Attachment
Attachment
Attachment
Attachment
Johann

The Time Sync Issue

So uhm… considering this devlog covers 123 HOURS, it’s kind of tough to explain, but I’ll try.
The other two devlogs were from this week, but now, thanks to @mahad, I was able to import the data all the way back to the end of January, when I started this project. The thing is, I did not connect my WakaTime client with HackaTime… so I did not get anything tracked 😭. Now I’ve got the remaining hours here as well :D

…this devlog got a bit long, so if you wanna read a project summary for up until now, you can do so in this gist: https://gist.github.com/Vimthusiast/f8ee05a3df1a98abc55bab584a41c72a

The code has been growing steadily, and now I’m quite sure it has a mind of its own.

Attachment
Johann

Row Encoder/Decoder

Implemented the RowEncoder and the RowDecoder! They take in the specified TableSchema and are able to encode the values efficiently into the key and value parts for our storage layer.

Next up is TQL language parsing for our REPL!


The attachment contains the example output of encoding and decoding a few rows using a simple schema definition (currently the schema is defined in code; soon it will be runtime-editable through TQL!).

Relevant commit: https://github.com/Vimthusiast/tempest/commit/8a59a92

Attachment
Attachment
Johann

Catalog (+ Value Encoding)

Implemented the catalog today — Tempest’s registry of every database and table, with full journal-backed persistence.

The catalog tracks two things: DatabaseSchema (name, set of table IDs) and TableSchema (name, columns, primary key, parent database). Both get stable numeric IDs (DatabaseId(u32), TableId(u32)) allocated at creation time and never reused - even after a drop. Renaming a table is a single metadata write with zero row rewrites, because names never appear in data keys.

Edits are versioned from day one: CatalogEdit::V1(CatalogEditV1). If the format needs to evolve later, old journals stay readable. The edit set for now is CreateDatabase, CreateTable, and Snapshot - the snapshot deliberately omits dropped entries (drops will come later) rather than recording a drop, so the journal never needs to reason about create/drop pairs cancelling each other out. All of this results in a more compact catalog file after our internal journal file rotation.

I also implemented raw and lexical encoding/decoding for TempestValue, the typed representation of data in the engine. Lexical encoding is order-preserving (important for key sorting), raw encoding is compact (used for values). It is important that - for example with strings - the lexical encoding preserves ordering, so that “banana” comes before “car”. With normal length prefixes, the byte-based ordering would mix it up and put “car” first, because “car” has length 3 and “banana” has length 6. Therefore we use null terminators for byte sequences, which doesn’t mess up the lexicographic ordering.
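A tiny sketch of the difference (the encoder below assumes no interior NUL bytes; a production encoder has to escape them):

// length-prefixed: [3, 'c','a','r'] < [6, 'b','a','n','a','n','a']  -> "car" sorts first. Wrong!
// null-terminated: ['b','a','n','a','n','a', 0] < ['c','a','r', 0]  -> "banana" first. Right.
fn lexical_encode_str(s: &str, out: &mut Vec<u8>) {
    out.extend_from_slice(s.as_bytes());
    out.push(0); // the terminator sorts below every real byte, so prefixes order correctly
}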

Next up is key encoding! The bridge that turns a TableId + primary key values into the actual bytes that hit the KV layer.


The video attachment shows the catalog usage. There is currently no use of the value encoding/decoding, but I will need it soon.

Relevant commits/diff for this devlog: https://github.com/Vimthusiast/tempest/compare/1f3b0b0..6e45cef

Johann

SST Compaction + File GC

Built size-tiered compaction and file garbage collection for Tempest’s storage engine today.

Without compaction, SST files accumulate on disk indefinitely. Every read had to scan all of them, so read amplification grew without bound. The fix: once L0 exceeds a file-count threshold (default 4), merge all SSTs into one using the existing MergingIterator + DeduplicatingIterator infrastructure I already had. The deduplicator handles MVCC automatically: the newest version of each key wins, and tombstones are preserved. (A naive first size-tiered compaction strategy for the MVP.)

The initially tricky part was crash safety. The manifest update (new SST in, old SSTs out) is the atomic commit point. Crash before it - old SSTs still valid, new one is just an orphan. Crash after - new SST is the source of truth. No partial state is possible. It was easy thanks to my Journal<T> abstraction.

That orphan case is what file GC handles. On startup, Tempest scans the SST and WAL directories, diffs against the manifest, and spawns a background task to delete anything not accounted for. During normal compaction, the exact list of removed filenums is passed directly to the GC task - no scan needed, no race possible. Note that I still have to implement GC of old manifest/journal files, but they’re so small and few, that it does not really matter yet.
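The startup diff itself is simple set arithmetic (a sketch; the names are illustrative):

use std::collections::HashSet;

fn find_orphans(on_disk: &HashSet<u64>, manifest_live: &HashSet<u64>) -> Vec<u64> {
    // Anything on disk that the manifest does not account for is an orphan -
    // e.g. a compaction output written just before a crash but never committed.
    on_disk.difference(manifest_live).copied().collect()
}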

All of this is tested: basic functionality/correctness, deduplication, tombstone preservation, SST count reduction, and recovery after compaction.

The KV layer is now complete enough to build on. Next up: key encoding and the Value type, so basically the bridge between raw bytes and the typed world of TQL and the schema layer.


In the attachment you can see the log/tracing output of some tests I wrote that verify that our compaction and file GC are working (I marked the relevant lines).

Relevant Commit: https://github.com/Vimthusiast/tempest/commit/1f3b0b08
