SHOOOOOOOOOOOOT
I’m so good at forgetting projects existed sighhh.
So this is a lot fewer hours logged than I actually spent (most of it was googling HOW EPUBs work, and why they’re so… hard to parse), but I guess two and a half hours is fine, whatever…
Okay so you all get the “everything I did up to now” devlog. Warning, this will be insanely rambly:
So. EPUBs.
Right.
You’d think “oh it’s just a zip file with some HTML in it” and you would be CORRECT but also WRONG because the way they organize everything is kind of a nightmare and I spent way too long just figuring out the file structure before I wrote a single line of Rust.
Okay so the gist: an EPUB is a zip file, inside that zip file is a META-INF/container.xml which points you to an OPF file (like OEBPS/content.opf or wherever), and THAT file has all the metadata AND a manifest (list of all files) AND a spine (the reading order). So you can’t just iterate the zip entries in order, you have to parse the OPF spine to figure out what order chapters actually go in.
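To make that lookup chain a bit more concrete, here’s a minimal std-only sketch. It is not the actual ContainerParser / OpfParser code: `find_opf_path` and `reading_order` are hypothetical names, the container.xml handling is plain string searching rather than real XML parsing, and the manifest is assumed to already be parsed into an id → href map.

```rust
use std::collections::HashMap;

/// Naively pulls the `full-path` attribute out of container.xml.
/// Assumption: string searching only; a real parser should use an XML library.
fn find_opf_path(container_xml: &str) -> Option<&str> {
    let marker = "full-path=\"";
    let start = container_xml.find(marker)? + marker.len();
    let len = container_xml[start..].find('"')?;
    Some(&container_xml[start..start + len])
}

/// Resolves the reading order. Each spine entry is an id (`idref`) that has
/// to be looked up in the manifest to get the actual file path (`href`).
/// Spine entries with no matching manifest item are skipped here.
fn reading_order(manifest: &HashMap<String, String>, spine: &[String]) -> Vec<String> {
    spine
        .iter()
        .filter_map(|idref| manifest.get(idref).cloned())
        .collect()
}
```

Note that the hrefs in the manifest are relative to the OPF file’s own directory, so they still need to be joined with that directory before you can look them up in the zip.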
Which is fun.
Very fun.
Super fun.
So I built the whole thing in layers basically. There’s an EpubExtractor at the bottom that just knows how to open a zip and read files out of it, and it can do this from a file path, from raw bytes, OR from a streaming async reader. That last one was kind of annoying to get right because async_zip has opinions about what traits your reader needs to implement and I had to do some fun (/sarc) stuff to get it to work.
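Roughly what “one bottom layer, several ways in” can look like, as a std-only sketch. This is not the real EpubExtractor: the `Extractor` type and `looks_like_zip` are made up for illustration, it skips the async reader entirely, and instead of reading zip entries it only checks for the zip local-file-header signature (`PK\x03\x04`).

```rust
use std::fs::File;
use std::io::{self, Cursor, Read, Seek, SeekFrom};
use std::path::Path;

/// Hypothetical bottom layer: it only knows about "some seekable reader".
struct Extractor<R> {
    reader: R,
}

impl Extractor<File> {
    /// Entry point 1: open from a file path.
    fn from_path(path: impl AsRef<Path>) -> io::Result<Self> {
        Ok(Self { reader: File::open(path)? })
    }
}

impl Extractor<Cursor<Vec<u8>>> {
    /// Entry point 2: wrap raw bytes in a Cursor so they behave like a file.
    fn from_bytes(bytes: Vec<u8>) -> Self {
        Self { reader: Cursor::new(bytes) }
    }
}

impl<R: Read + Seek> Extractor<R> {
    /// Shared logic works for any reader. Here it just checks the first
    /// four bytes against the zip local file header signature.
    fn looks_like_zip(&mut self) -> io::Result<bool> {
        let mut magic = [0u8; 4];
        self.reader.seek(SeekFrom::Start(0))?;
        match self.reader.read_exact(&mut magic) {
            Ok(()) => Ok(magic == *b"PK\x03\x04"),
            // Fewer than four bytes available: definitely not a zip.
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(false),
            Err(e) => Err(e),
        }
    }
}
```

The point of the shape is that everything above the constructors is written once against `Read + Seek`; only the entry points differ.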
Then on top of that there’s the actual parsing layer: ContainerParser for container.xml, OpfParser for the OPF file (metadata + spine + manifest), and ChapterParser / extract_text_content for turning the XHTML chapters into actual readable text using the scraper crate.
The main LexEpub struct is what you actually use, and it caches chapters and metadata so you’re not re-parsing the whole thing when you call get_metadata() a second time.
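The caching idea, sketched with `std::cell::OnceCell`. To be clear, I’m not claiming this is how LexEpub does it internally: `CachedBook`, `Metadata`, and the fake “parse” (it just trims a string) are all stand-ins, and `parse_count` only exists so you can see that the work happens once.

```rust
use std::cell::{Cell, OnceCell};

#[derive(Debug, Clone, PartialEq)]
struct Metadata {
    title: String,
}

/// Hypothetical stand-in for a struct that caches parsed metadata.
struct CachedBook {
    raw_opf: String,
    metadata: OnceCell<Metadata>,
    parse_count: Cell<u32>, // instrumentation only, to show the cache working
}

impl CachedBook {
    fn new(raw_opf: &str) -> Self {
        Self {
            raw_opf: raw_opf.to_string(),
            metadata: OnceCell::new(),
            parse_count: Cell::new(0),
        }
    }

    /// Parses on the first call, then returns the cached value afterwards.
    fn get_metadata(&self) -> &Metadata {
        self.metadata.get_or_init(|| {
            self.parse_count.set(self.parse_count.get() + 1);
            // Placeholder "parse": the real thing would read the OPF XML.
            Metadata { title: self.raw_opf.trim().to_string() }
        })
    }
}
```

`OnceCell` is single-threaded; if the struct needs to be shared across threads, `std::sync::OnceLock` has the same `get_or_init` shape.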
Oh also there’s a lowmem feature flag that swaps out the scraper-based HTML parser for a dumb little hand-rolled state machine that just strips tags manually. It’s not as good at handling block elements and whitespace but it doesn’t build a full DOM tree which is the point. Useful for embedded targets theoretically.
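For a feel of what a “dumb little state machine” tag stripper looks like, here’s a minimal sketch. It is not the lowmem code itself; `strip_tags` is a hypothetical name, and this version assumes tags never contain a literal `>` inside attribute values and does not decode entities like `&amp;`.

```rust
#[derive(Clone, Copy, PartialEq)]
enum State {
    Text, // outside any tag: characters are kept
    Tag,  // between '<' and '>': characters are dropped
}

/// Strips tags and collapses runs of whitespace, without building a DOM.
fn strip_tags(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut state = State::Text;
    let mut last_was_space = true; // starts true so leading whitespace is dropped

    for ch in input.chars() {
        match (state, ch) {
            (State::Text, '<') => state = State::Tag,
            (State::Tag, '>') => state = State::Text,
            (State::Tag, _) => {} // ignore everything inside a tag
            (State::Text, c) if c.is_whitespace() => {
                // Collapse any whitespace run into a single space.
                if !last_was_space {
                    out.push(' ');
                    last_was_space = true;
                }
            }
            (State::Text, c) => {
                out.push(c);
                last_was_space = false;
            }
        }
    }

    // Drop a trailing space left over from whitespace at the end.
    let trimmed_len = out.trim_end().len();
    out.truncate(trimmed_len);
    out
}
```

This also shows the block-element weakness: `<p>One</p><p>Two</p>` comes out as `OneTwo`, because nothing here knows that `</p>` should act as a paragraph break.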
The streaming story is… partially done. ChapterStream implements futures::Stream so you can consume chapters one at a time without loading all of them into memory at once. The benchmarks use jemalloc to actually measure heap allocation delta per operation which I’m pretty happy about as a setup.
For some reason I thought making benchmarks would be fun, so there are Criterion benches for from_bytes, from_reader, and extract_text_only. The CI runs cargo bloat to check binary size, which is something I actually care about for once because the end goal is for this to be usable in WASM and potentially on embedded stuff (ESP32, ahem).
WASM bindings exist in theory (src/wasm.rs) but they’re kind of broken right now: extract_with_ast() doesn’t exist, has_cover() doesn’t exist, cover_image() doesn’t exist. Those are all in the TODO. The C FFI via Diplomat is in better shape and actually generates a real header file.
The test suite is… extensive? Like maybe embarrassingly extensive for something this early. There are unit tests, integration tests, API tests, edge case tests, streaming tests, performance tests, and a memory threshold test that reads /proc/self/status to check RSS delta. I went a little overboard. The edge case tests especially are kind of a placeholder graveyard right now. Most of them are just “open the test epub and hope nothing crashes”, because actually testing edge cases properly requires mock EPUBs and I have not built those yet.
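The /proc/self/status trick, as a small Linux-only sketch. These helper names (`parse_vm_rss_kb`, `current_rss_kb`, `rss_delta_kb`) are hypothetical and not taken from the actual test suite; the only real thing assumed here is the `VmRSS:` line that Linux reports in kB.

```rust
use std::fs;

/// Finds the `VmRSS:` line in /proc/self/status text and returns the value in kB.
fn parse_vm_rss_kb(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmRSS:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|kb| kb.parse().ok())
}

/// Reads the current process's resident set size. Returns None off Linux
/// (or anywhere /proc/self/status is missing).
fn current_rss_kb() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    parse_vm_rss_kb(&status)
}

/// Runs `work` and reports how much RSS changed across it, in kB.
/// The delta can be negative if the allocator hands memory back.
fn rss_delta_kb<T>(work: impl FnOnce() -> T) -> Option<(T, i64)> {
    let before = current_rss_kb()?;
    let result = work();
    let after = current_rss_kb()?;
    Some((result, after as i64 - before as i64))
}
```

RSS is a noisy number (page granularity, allocator caching), so a threshold test built on it needs generous slack.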
The big things left:
- A lot (just kidding, there’s a TODO.md)
Anyway, that’s the summary and stuff. Thanks for coming to my TED Talk, hope you enjoy.