A Rust library for parsing ePubs FAST and memory efficiently.
It provides asynchronous streaming, metadata validation, and asset extraction across Rust, C/C++, and WebAssembly from a single core implementation.
No AI used AT ALL :3
I PUBLISHED ITTTTTT!!!
This was the final cleanup sprint before shipping lexepub v0.1.0, and it ended up being a surprisingly big set of changes across docs, CI, workflows, and the README. Nothing glamorous, but all the stuff that makes a project feel real instead of “a folder on my computer.”
I went through the entire docs site and updated everything:
get_resource() and the chapter‑relative resource helpers.

Basically: the docs now actually reflect what lexepub can do.
The README got the cleanup it always deserved (and needed)
I rewrote (ahem, copied from Tamaru) the entire Pages deployment workflow:
I added proper release workflows:
It even handles platform‑specific artifacts cleanly and doesn’t hardcode paths anymore (I fixed that… again).
I added:
get_resource(path)

The WASM adapter is now fully capable of powering the demo without hacks.
Yep!
lexepub is now on crates.io and npm!
lexepub now has a Proper browser demo that actually renders EPUBs, images, CSS, TOC, everything!!!!
using the WASM build. I didn’t plan on building a renderer for v0.1.0, but here we are.
I put together a small demo that was yanked from HTMLReader (hey, again why remake the wheel if I already have a good UI/modular thing) that loads EPUBs through the WASM adapter and renders them directly in the browser. It works well!!!
(and shows off what lexepub can do without needing any external libraries)
It now supports:
<link> tags

Honestly, it’s starting to feel like a real reader, which feels insane…
I showed it to a friend and they actually thought it was like ePubJS or something haha, not my own thing!!!
I added get_chapter_resource and resolve_chapter_resource_path to the WASM API so the demo can fetch images and other linked assets.
This required adding:
Images now display correctly inside chapters.
Chapters now have a title field inferred from:
<h1>, <h2>, or <title> in the AST

The WASM adapter exposes get_toc() and get_toc_json(), and the demo uses this to build a real table of contents.
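The title-inference priority described above can be sketched as a plain string scan. This is illustrative only: lexepub actually walks its parsed AST, and `infer_title` is a hypothetical name, but the priority order (first non-empty <h1>, then <h2>, then <title>) is the same idea.

```rust
/// Hypothetical sketch: infer a chapter title by checking <h1>, then <h2>,
/// then <title>, returning the first non-empty text found.
fn infer_title(xhtml: &str) -> Option<String> {
    for tag in ["h1", "h2", "title"] {
        let open = format!("<{tag}");
        let Some(start) = xhtml.find(open.as_str()) else { continue };
        // Skip past the end of the opening tag (handles attributes too).
        let Some(gt) = xhtml[start..].find('>') else { continue };
        let body_start = start + gt + 1;
        let close = format!("</{tag}>");
        let Some(rel_end) = xhtml[body_start..].find(close.as_str()) else { continue };
        let text = xhtml[body_start..body_start + rel_end].trim();
        if !text.is_empty() {
            return Some(text.to_string());
        }
    }
    None
}
```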
<link> Tags
Linked CSS files now load and apply correctly.
This required:
It’s still a simple CSS engine, but it’s enough for EPUBs.
EPUBs love weird paths (../, ./, backslashes, nested folders), so I added a normalize_internal_path() helper and integrated it into:
This fixed a bunch of rendering/random other issues
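A minimal sketch of what a helper like that can look like (illustrative only; lexepub’s actual normalize_internal_path may differ in edge cases): unify backslashes, then resolve `.` and `..` segments without touching the filesystem.

```rust
/// Sketch of internal-path normalization for zip-entry lookups:
/// "OEBPS/images/../styles/./main.css" -> "OEBPS/styles/main.css".
fn normalize_internal_path(path: &str) -> String {
    let unified = path.replace('\\', "/"); // EPUBs sometimes use backslashes
    let mut parts: Vec<&str> = Vec::new();
    for seg in unified.split('/') {
        match seg {
            "" | "." => {}           // skip empty and current-dir segments
            ".." => { parts.pop(); } // back out one directory level
            other => parts.push(other),
        }
    }
    parts.join("/")
}
```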
Added then checked off:
<link>
All that’s left really is cleanup, publishing, docs, and so on. Nothing major, honestly; everything still on the TODO list is beyond v0.1.0.
I added a basic demo!
The demo is based on HTMLReader, which I modified: I stripped out epubjs and replaced it with the WASM build of LexePub. It works cleanly, but there’s no formatting (sob) since LexePub doesn’t have a proper renderer yet haha.
Today (well yesterday, I was tired, okay?) ended up being a pretty productive set of commits, nothing too dramatic, but a lot of important groundwork and cleanup that makes lexepub feel more complete and consistent across all adapters.
I finally checked off the “1‑1 API functionality” item, then unchecked it when I added CSS, then rechecked it in the same commit lol.
This mostly meant adding the missing sync wrappers (get_metadata_sync, has_cover_sync, cover_image_sync) and wiring them into the C‑FFI layer.
(Dealing with Diplomat’s restrictions is a very annoying thing. Saikuro, the thing I made, is so much betterrrr and much cleaner, and more languages are automatically supported.)
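The sync-wrapper pattern from above is basically “drive the existing async method to completion.” A real crate would presumably delegate to an executor like futures::executor::block_on; to keep this sketch self-contained I hand-roll a minimal busy-polling block_on, and `get_metadata` here is a stand-in for an async accessor, not lexepub’s actual code.

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Minimal busy-polling block_on, just to make the sketch self-contained.
// A real implementation would use a proper executor instead of spinning.
fn block_on<F: Future>(fut: F) -> F::Output {
    fn clone(_: *const ()) -> RawWaker { RawWaker::new(std::ptr::null(), &VTABLE) }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

// Stand-in for an async accessor like a get_metadata() method.
async fn get_metadata() -> String {
    "Title: Example".to_string()
}

// The sync wrapper is just the async call driven to completion.
fn get_metadata_sync() -> String {
    block_on(get_metadata())
}
```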
I added a small CSS parser, hand-rolled because Servo’s cssparser is difficult to deal with (future project: a better API for it 👀), just enough to handle basic selectors, declarations, and at‑rules.
It’s simple, sadly, but it works well for EPUB‑level CSS.
There’s a tiny AST (Stylesheet, CssRule, StyleRule), comment removal, declaration parsing, and some tests to make sure it doesn’t defy my expectations for CSS (copied from an ebook btw).
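A minimal sketch of a hand-rolled CSS parser in that spirit (types and behavior here are illustrative; lexepub’s real Stylesheet/CssRule/StyleRule will differ in detail): strip comments, split rule blocks, then split declarations.

```rust
/// Illustrative style-rule type: a selector plus (property, value) pairs.
#[derive(Debug, PartialEq)]
struct StyleRule {
    selector: String,
    declarations: Vec<(String, String)>,
}

/// Remove /* ... */ comments before parsing.
fn strip_comments(css: &str) -> String {
    let mut out = String::new();
    let mut rest = css;
    while let Some(start) = rest.find("/*") {
        out.push_str(&rest[..start]);
        match rest[start..].find("*/") {
            Some(end) => rest = &rest[start + end + 2..],
            None => return out, // unterminated comment: drop the tail
        }
    }
    out.push_str(rest);
    out
}

/// Very simplified rule parsing: no nested blocks, at-rules skipped.
fn parse_stylesheet(css: &str) -> Vec<StyleRule> {
    let css = strip_comments(css);
    let mut rules = Vec::new();
    for block in css.split('}') {
        let Some((selector, body)) = block.split_once('{') else { continue };
        let selector = selector.trim();
        if selector.is_empty() || selector.starts_with('@') {
            continue; // this sketch ignores at-rules entirely
        }
        let declarations = body
            .split(';')
            .filter_map(|d| d.split_once(':'))
            .map(|(p, v)| (p.trim().to_string(), v.trim().to_string()))
            .collect();
        rules.push(StyleRule { selector: selector.to_string(), declarations });
    }
    rules
}
```

This level of parsing covers the flat `selector { prop: value; }` CSS that most EPUBs ship, which is the point of a “simple but good enough” engine.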
I added docs!
The README got a big trim.
Most of the detailed examples and API references moved into the docs site, which is where they belong.
The README is now much cleaner and points people to the proper documentation. (Though I don’t know what late-night me was thinking with the links being code lines, like what?!?)
Checked off:
The majority of the TODOs left are probably for the future haha, but I do need to make a small HTMLReader-but-using-LexePub demo.
Just a tiny devlog this time, I’m learning :O
But I did add my default Docs with a capital D setup and the accompanying CI, and added streaming cover image support.
I added docs.
And yes, I did borrow the folder structure from other projects.
I’m not reinventing the wheel when I already invented it twice (Tamaru, Saikuro, S-eco, need I name more?).
Please respect my efficiency.
And THEN, ONE MORE TODO: I implemented streaming cover image support.
As in:
cover_image_to_writer
Zero allocations.
Direct streaming.
AsyncWrite.
The whole thing.
It works, streams, and is like actually really nice!!!
The TODO list shrank again, -2 more things…
I’m starting to worry I’m going to run out of TODOs and have to invent new ones.
Chaos Devlog time!!!
I added EPUB version detection.
Like, real version detection.
lexepub now looks at <package version="3.0"> and goes “oh okay cool” instead of staring blankly like a goldfish.
Then I added cover image format detection!
The manifest used to be a cute little HashMap<String, String> and now it’s a full (href, media-type) tuple because I decided lexepub should know MIME types like a sommelier knows wine.
This broke EVERYTHING.
Every. Thing.
Every .join(href) became .join(&href.0) aaaagh.
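To make the manifest change concrete, here is a hedged sketch of before-vs-after (type alias and helper names are hypothetical, not lexepub’s actual API): the map value grows from a bare href String to an (href, media-type) pair, which is exactly why every `.join(href)` turned into `.join(&href.0)`.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

// Before: HashMap<String, String>  (id -> href)
// After:  id -> (href, media-type), so lexepub knows MIME types too.
type Manifest = HashMap<String, (String, String)>;

/// Look up the cover's media type straight from the manifest entry.
fn cover_format(manifest: &Manifest, cover_id: &str) -> Option<String> {
    manifest.get(cover_id).map(|(_, media_type)| media_type.clone())
}

/// Resolve an entry's path relative to the OPF directory.
/// This is the spot where .join(href) became .join(&entry.0).
fn resolve(opf_dir: &Path, manifest: &Manifest, id: &str) -> Option<PathBuf> {
    manifest.get(id).map(|entry| opf_dir.join(&entry.0))
}
```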
And then I fixed WASM.
Not “fixed WASM” like “haha a typo,”
I mean FIXED WASM like “rewrote half the bindings because Past Me was clearly having a moment” (second time I’ve said that today haha).
Everything returns proper Result<T, JsValue> now.
Metadata serializes.
Chapters serialize.
Cover extraction works.
Oh and AST parsing?
Yeah that’s real now.
extract_ast() actually does AST things instead of returning ast: None like a liar.
WASM uses it too.
ParsedChapter is serializable.
Chapter is serializable (but I skipped the raw bytes because I’m not a monster).
This was supposed to be a “later” thing.
It is no longer a “later” thing.
Now we got:
EPUB Version: 3.0
Has Cover: true
Cover Format: image/jpeg
And the TODO list?
Oh my god the TODO list.
I chugged through TODOs fast as fluff.
WASM support? Done.
AST parsing? Done.
Version detection? Done.
Cover format detection? Done.
I swear the TODO list is shrinking faster than my sanity.
I also updated integration tests because apparently I’m responsible now.
They actually check MIME types and cover presence and error cases and everything.
Who am I.
Anyway.
I love how lexepub is turning out.
It started as a tiny little “haha unzip EPUB” thing and now it’s a full parsing engine with metadata, ASTs, WASM bindings, cover extraction, version detection, and a manifest that actually knows what it’s doing.
SHOOOOOOOOOOOOT
I’m so good at forgetting projects existed sighhh.
So this shows a lot fewer hours logged than I actually spent (a majority of it was googling HOW ePubs work (and why they’re so… hard to parse)), but I guess two and a half hours is fine, whatever…
Okay, so you all get the “everything I did up to now” devlog. Warning: this will be insanely rambly.
So. EPUBs.
Right.
You’d think “oh it’s just a zip file with some HTML in it” and you would be CORRECT but also WRONG because the way they organize everything is kind of a nightmare and I spent way too long just figuring out the file structure before I wrote a single line of Rust.
Okay so the gist: an EPUB is a zip file, inside that zip file is a META-INF/container.xml which points you to an OPF file (like OEBPS/content.opf or wherever), and THAT file has all the metadata AND a manifest (list of all files) AND a spine (the reading order). So you can’t just iterate the zip entries in order, you have to parse the OPF spine to figure out what order chapters actually go in.
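For reference, the container.xml that kicks off the whole chain is tiny. A typical one looks like this (the `OEBPS/content.opf` path is just an example; the OPF can live anywhere the rootfile points):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <!-- full-path tells the reader where the OPF (metadata + manifest + spine) lives -->
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
```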
Which is fun.
Very fun.
Super fun.
So I built the whole thing in layers basically. There’s an EpubExtractor at the bottom that just knows how to open a zip and read files out of it, and it can do this from a file path, from raw bytes, OR from a streaming async reader. That last one was kind of annoying to get right because async_zip has opinions about what traits your reader needs to implement and I had to do some fun (/sarc) stuff to get it to work.
Then on top of that there’s the actual parsing layer, ContainerParser for container.xml, OpfParser for the OPF file (metadata + spine + manifest), and ChapterParser / extract_text_content for turning the XHTML chapters into actual readable text using the scraper crate.
The main LexEpub struct is what you actually use and it caches chapters and metadata so you’re not re-parsing the whole thing every time you call get_metadata() twice.
Oh also there’s a lowmem feature flag that swaps out the scraper-based HTML parser for a dumb little hand-rolled state machine that just strips tags manually. It’s not as good at handling block elements and whitespace but it doesn’t build a full DOM tree which is the point. Useful for embedded targets theoretically.
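The core of a tag-stripping state machine like that is genuinely tiny. This is a sketch of the idea, not lexepub’s lowmem parser (which also has to deal with entities and block-element whitespace, as noted above): walk the characters once, toggling an in-tag flag, with no DOM and only the output String allocated.

```rust
/// Sketch of a lowmem-style tag stripper: drop everything between '<' and '>'.
fn strip_tags(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut in_tag = false;
    for ch in html.chars() {
        match ch {
            '<' => in_tag = true,   // entering a tag: stop emitting
            '>' => in_tag = false,  // leaving a tag: resume emitting
            c if !in_tag => out.push(c),
            _ => {}                 // characters inside a tag are discarded
        }
    }
    out
}
```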
The streaming story is… partially done. ChapterStream implements futures::Stream so you can consume chapters one at a time without loading all of them into memory at once. The benchmarks use jemalloc to actually measure heap allocation delta per operation which I’m pretty happy about as a setup.
For some reason I thought making benchmarks would be fun so there’s Criterion benches for from_bytes, from_reader, and extract_text_only. The CI runs cargo bloat to check binary size which is something I actually care about for once because the end goal is for this to be usable in WASM and potentially on embedded stuff (esp32 ahem)
WASM bindings exist in theory (src/wasm.rs) but they’re kind of broken right now, extract_with_ast() doesn’t exist, has_cover() doesn’t exist, cover_image() doesn’t exist. Those are all in the TODO. The C FFI via Diplomat is in better shape and actually generates a real header file.
The test suite is… extensive? Like maybe embarrassingly extensive for something this early. There’s unit tests, integration tests, API tests, edge case tests, streaming tests, performance tests, and a memory threshold test that reads /proc/self/status to check RSS delta. I went a little overboard. The edge case tests especially are kind of a placeholder graveyard right now, most of them are just “open the test epub and hope nothing crashes” because actually testing edge cases properly requires mock EPUBs and I have not built those yet.
The big things left:
Anyway that’s the summary and stuff thanks for coming to my TED Talk hope you enjoy.