A Rust library for parsing ePubs FAST and memory efficiently.
It provides asynchronous streaming, metadata validation, and asset extraction across Rust, C/C++, and WebAssembly from a single core implementation.
No AI used AT ALL :3
I PUBLISHED ITTTTTT!!!
This was the final cleanup sprint before shipping lexepub v0.1.0, and it ended up being a surprisingly big set of changes across docs, CI, workflows, and the README. Nothing glamorous, but all the stuff that makes a project feel real instead of “a folder on my computer.”
I went through the entire docs site and updated everything:
get_resource() and the chapter‑relative resource helpers.

Basically: the docs now actually reflect what lexepub can do.
The README got the cleanup it always deserved (and needed)
I rewrote (ahem, copied from Tamaru) the entire Pages deployment workflow:
I added proper release workflows:
It even handles platform‑specific artifacts cleanly and doesn’t hardcode paths anymore (I fixed that… again).
I added:
get_resource(path)

The WASM adapter is now fully capable of powering the demo without hacks.
Yep!
lexepub is now on crates.io and npm!
lexepub now has a Proper browser demo that actually renders EPUBs, images, CSS, TOC, everything!!!!
using the WASM build. I didn’t plan on building a renderer for v0.1.0, but here we are.
I put together a small demo that was yanked from HTMLReader (hey, again why remake the wheel if I already have a good UI/modular thing) that loads EPUBs through the WASM adapter and renders them directly in the browser. It works well!!!
(and shows off what lexepub can do without needing any external libraries)
It now supports:
<link> tags

Honestly, it’s starting to feel like a real reader, which feels insane…
I showed it to a friend and they actually thought it was like ePubJS or something haha, not my own thing!!!
I added get_chapter_resource and resolve_chapter_resource_path to the WASM API so the demo can fetch images and other linked assets.
This required adding:
Images now display correctly inside chapters.
Chapters now have a title field inferred from:
<h1>, <h2>, or <title> in the AST

The WASM adapter exposes get_toc() and get_toc_json(), and the demo uses this to build a real table of contents.
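The title-inference priority described above can be sketched as a plain string scan. This is illustrative only: lexepub actually walks its parsed AST, and `infer_title` is a hypothetical name, but the priority order (first non-empty <h1>, then <h2>, then <title>) is the same idea.

```rust
/// Hypothetical sketch: infer a chapter title by checking <h1>, then <h2>,
/// then <title>, returning the first non-empty text found.
fn infer_title(xhtml: &str) -> Option<String> {
    for tag in ["h1", "h2", "title"] {
        let open = format!("<{tag}");
        let Some(start) = xhtml.find(open.as_str()) else { continue };
        // Skip past the end of the opening tag (handles attributes too).
        let Some(gt) = xhtml[start..].find('>') else { continue };
        let body_start = start + gt + 1;
        let close = format!("</{tag}>");
        let Some(rel_end) = xhtml[body_start..].find(close.as_str()) else { continue };
        let text = xhtml[body_start..body_start + rel_end].trim();
        if !text.is_empty() {
            return Some(text.to_string());
        }
    }
    None
}
```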
<link> Tags
Linked CSS files now load and apply correctly.
This required:
It’s still a simple CSS engine, but it’s enough for EPUBs.
EPUBs love weird paths (../, ./, backslashes, nested folders), so I added a normalize_internal_path() helper and integrated it into:
This fixed a bunch of rendering/random other issues
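A minimal sketch of what a helper like that can look like (illustrative only; lexepub’s actual normalize_internal_path may differ in edge cases): unify backslashes, then resolve `.` and `..` segments without touching the filesystem.

```rust
/// Sketch of internal-path normalization for zip-entry lookups:
/// "OEBPS/images/../styles/./main.css" -> "OEBPS/styles/main.css".
fn normalize_internal_path(path: &str) -> String {
    let unified = path.replace('\\', "/"); // EPUBs sometimes use backslashes
    let mut parts: Vec<&str> = Vec::new();
    for seg in unified.split('/') {
        match seg {
            "" | "." => {}           // skip empty and current-dir segments
            ".." => { parts.pop(); } // back out one directory level
            other => parts.push(other),
        }
    }
    parts.join("/")
}
```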
Added then checked off:
<link>
All that’s left really is cleanup, publishing, docs, and so on. Nothing major, honestly; everything still on the TODO list is beyond v0.1.0.
I added a basic demo!
The demo is based on HTMLReader, which I modified: I stripped out epubjs and replaced it with the WASM build of LexePub. It works cleanly, but there’s no formatting (sob) since LexePub doesn’t have a proper renderer yet haha.
Today (well yesterday, I was tired, okay?) ended up being a pretty productive set of commits, nothing too dramatic, but a lot of important groundwork and cleanup that makes lexepub feel more complete and consistent across all adapters.
I finally checked off the “1‑1 API functionality” item, then unchecked it when I added CSS, then rechecked it in the same commit lol.
This mostly meant adding the missing sync wrappers (get_metadata_sync, has_cover_sync, cover_image_sync) and wiring them into the C‑FFI layer.
(Dealing with Diplomat’s restrictions is a very annoying thing. Saikuro, the thing I made, is so much betterrrr and much cleaner, and more languages are automatically supported.)
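The sync-wrapper pattern from above is basically “drive the existing async method to completion.” A real crate would presumably delegate to an executor like futures::executor::block_on; to keep this sketch self-contained I hand-roll a minimal busy-polling block_on, and `get_metadata` here is a stand-in for an async accessor, not lexepub’s actual code.

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Minimal busy-polling block_on, just to make the sketch self-contained.
// A real implementation would use a proper executor instead of spinning.
fn block_on<F: Future>(fut: F) -> F::Output {
    fn clone(_: *const ()) -> RawWaker { RawWaker::new(std::ptr::null(), &VTABLE) }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

// Stand-in for an async accessor like a get_metadata() method.
async fn get_metadata() -> String {
    "Title: Example".to_string()
}

// The sync wrapper is just the async call driven to completion.
fn get_metadata_sync() -> String {
    block_on(get_metadata())
}
```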
I added a small CSS parser, hand-rolled because Servo’s cssparser is difficult to deal with (future project: a better API for it 👀), just enough to handle basic selectors, declarations, and at‑rules.
It’s simple, sadly, but it works well for EPUB‑level CSS.
There’s a tiny AST (Stylesheet, CssRule, StyleRule), comment removal, declaration parsing, and some tests to make sure it doesn’t defy my expectations for CSS (copied from an ebook btw).
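A minimal sketch of a hand-rolled CSS parser in that spirit (types and behavior here are illustrative; lexepub’s real Stylesheet/CssRule/StyleRule will differ in detail): strip comments, split rule blocks, then split declarations.

```rust
/// Illustrative style-rule type: a selector plus (property, value) pairs.
#[derive(Debug, PartialEq)]
struct StyleRule {
    selector: String,
    declarations: Vec<(String, String)>,
}

/// Remove /* ... */ comments before parsing.
fn strip_comments(css: &str) -> String {
    let mut out = String::new();
    let mut rest = css;
    while let Some(start) = rest.find("/*") {
        out.push_str(&rest[..start]);
        match rest[start..].find("*/") {
            Some(end) => rest = &rest[start + end + 2..],
            None => return out, // unterminated comment: drop the tail
        }
    }
    out.push_str(rest);
    out
}

/// Very simplified rule parsing: no nested blocks, at-rules skipped.
fn parse_stylesheet(css: &str) -> Vec<StyleRule> {
    let css = strip_comments(css);
    let mut rules = Vec::new();
    for block in css.split('}') {
        let Some((selector, body)) = block.split_once('{') else { continue };
        let selector = selector.trim();
        if selector.is_empty() || selector.starts_with('@') {
            continue; // this sketch ignores at-rules entirely
        }
        let declarations = body
            .split(';')
            .filter_map(|d| d.split_once(':'))
            .map(|(p, v)| (p.trim().to_string(), v.trim().to_string()))
            .collect();
        rules.push(StyleRule { selector: selector.to_string(), declarations });
    }
    rules
}
```

This level of parsing covers the flat `selector { prop: value; }` CSS that most EPUBs ship, which is the point of a “simple but good enough” engine.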
I added docs!
The README got a big trim.
Most of the detailed examples and API references moved into the docs site, which is where they belong.
The README is now much cleaner and points people to the proper documentation. (Though I don’t know what late-night me was thinking with the links being code lines, like what?!?)
Checked off:
The majority of the TODOs left are probably for the future haha, but I do need to make a small HTMLReader-but-using-LexePub demo.
Just a tiny devlog this time, I’m learning :O
But I did add my default Docs with a capital D setup and the accompanying CI, and added streaming cover image support.
I added docs.
And yes, I did borrow the folder structure from other projects.
I’m not reinventing the wheel when I already invented it twice (Tamaru, Saikuro, S-eco, need I name more?).
Please respect my efficiency.
And THEN, ONE MORE TODO: I implemented streaming cover image support.
As in:
cover_image_to_writer
Zero allocations.
Direct streaming.
AsyncWrite.
The whole thing.
It works, streams, and is like actually really nice!!!
The TODO list shrank again, -2 more things…
I’m starting to worry I’m going to run out of TODOs and have to invent new ones.
Chaos Devlog time!!!
I added EPUB version detection.
Like, real version detection.
lexepub now looks at <package version="3.0"> and goes “oh okay cool” instead of staring blankly like a goldfish.
Then I added cover image format detection!
The manifest used to be a cute little HashMap<String, String> and now it’s a full (href, media-type) tuple because I decided lexepub should know MIME types like a sommelier knows wine.
This broke EVERYTHING.
Every. Thing.
Every .join(href) became .join(&href.0) aaaagh.
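To make the manifest change concrete, here is a hedged sketch of before-vs-after (type alias and helper names are hypothetical, not lexepub’s actual API): the map value grows from a bare href String to an (href, media-type) pair, which is exactly why every `.join(href)` turned into `.join(&href.0)`.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

// Before: HashMap<String, String>  (id -> href)
// After:  id -> (href, media-type), so lexepub knows MIME types too.
type Manifest = HashMap<String, (String, String)>;

/// Look up the cover's media type straight from the manifest entry.
fn cover_format(manifest: &Manifest, cover_id: &str) -> Option<String> {
    manifest.get(cover_id).map(|(_, media_type)| media_type.clone())
}

/// Resolve an entry's path relative to the OPF directory.
/// This is the spot where .join(href) became .join(&entry.0).
fn resolve(opf_dir: &Path, manifest: &Manifest, id: &str) -> Option<PathBuf> {
    manifest.get(id).map(|entry| opf_dir.join(&entry.0))
}
```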
And then I fixed WASM.
Not “fixed WASM” like “haha a typo,”
I mean FIXED WASM like “rewrote half the bindings because Past Me was clearly having a moment” (second time I’ve said that today haha).
Everything returns proper Result<T, JsValue> now.
Metadata serializes.
Chapters serialize.
Cover extraction works.
Oh and AST parsing?
Yeah that’s real now.
extract_ast() actually does AST things instead of returning ast: None like a liar.
WASM uses it too.
ParsedChapter is serializable.
Chapter is serializable (but I skipped the raw bytes because I’m not a monster).
This was supposed to be a “later” thing.
It is no longer a “later” thing.
Now we got:
EPUB Version: 3.0
Has Cover: true
Cover Format: image/jpeg
And the TODO list?
Oh my god the TODO list.
I chugged through TODOs fast as fluff.
WASM support? Done.
AST parsing? Done.
Version detection? Done.
Cover format detection? Done.
I swear the TODO list is shrinking faster than my sanity.
I also updated integration tests because apparently I’m responsible now.
They actually check MIME types and cover presence and error cases and everything.
Who am I.
Anyway.
I love how lexepub is turning out.
It started as a tiny little “haha unzip EPUB” thing and now it’s a full parsing engine with metadata, ASTs, WASM bindings, cover extraction, version detection, and a manifest that actually knows what it’s doing.
SHOOOOOOOOOOOOT
I’m so good at forgetting projects existed sighhh.
So this shows a lot fewer hours logged than I actually spent (a majority of it was googling HOW ePubs work (and why they’re so… hard to parse)), but I guess two and a half hours is fine, whatever…
Okay, so you all get the “everything I did up to now” devlog. Warning: this will be insanely rambly.
So. EPUBs.
Right.
You’d think “oh it’s just a zip file with some HTML in it” and you would be CORRECT but also WRONG because the way they organize everything is kind of a nightmare and I spent way too long just figuring out the file structure before I wrote a single line of Rust.
Okay so the gist: an EPUB is a zip file, inside that zip file is a META-INF/container.xml which points you to an OPF file (like OEBPS/content.opf or wherever), and THAT file has all the metadata AND a manifest (list of all files) AND a spine (the reading order). So you can’t just iterate the zip entries in order, you have to parse the OPF spine to figure out what order chapters actually go in.
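For reference, the container.xml that kicks off the whole chain is tiny. A typical one looks like this (the `OEBPS/content.opf` path is just an example; the OPF can live anywhere the rootfile points):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <!-- full-path tells the reader where the OPF (metadata + manifest + spine) lives -->
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
```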
Which is fun.
Very fun.
Super fun.
So I built the whole thing in layers basically. There’s an EpubExtractor at the bottom that just knows how to open a zip and read files out of it, and it can do this from a file path, from raw bytes, OR from a streaming async reader. That last one was kind of annoying to get right because async_zip has opinions about what traits your reader needs to implement and I had to do some fun (/sarc) stuff to get it to work.
Then on top of that there’s the actual parsing layer, ContainerParser for container.xml, OpfParser for the OPF file (metadata + spine + manifest), and ChapterParser / extract_text_content for turning the XHTML chapters into actual readable text using the scraper crate.
The main LexEpub struct is what you actually use and it caches chapters and metadata so you’re not re-parsing the whole thing every time you call get_metadata() twice.
Oh also there’s a lowmem feature flag that swaps out the scraper-based HTML parser for a dumb little hand-rolled state machine that just strips tags manually. It’s not as good at handling block elements and whitespace but it doesn’t build a full DOM tree which is the point. Useful for embedded targets theoretically.
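The core of a tag-stripping state machine like that is genuinely tiny. This is a sketch of the idea, not lexepub’s lowmem parser (which also has to deal with entities and block-element whitespace, as noted above): walk the characters once, toggling an in-tag flag, with no DOM and only the output String allocated.

```rust
/// Sketch of a lowmem-style tag stripper: drop everything between '<' and '>'.
fn strip_tags(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut in_tag = false;
    for ch in html.chars() {
        match ch {
            '<' => in_tag = true,   // entering a tag: stop emitting
            '>' => in_tag = false,  // leaving a tag: resume emitting
            c if !in_tag => out.push(c),
            _ => {}                 // characters inside a tag are discarded
        }
    }
    out
}
```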
The streaming story is… partially done. ChapterStream implements futures::Stream so you can consume chapters one at a time without loading all of them into memory at once. The benchmarks use jemalloc to actually measure heap allocation delta per operation which I’m pretty happy about as a setup.
For some reason I thought making benchmarks would be fun so there’s Criterion benches for from_bytes, from_reader, and extract_text_only. The CI runs cargo bloat to check binary size which is something I actually care about for once because the end goal is for this to be usable in WASM and potentially on embedded stuff (esp32 ahem)
WASM bindings exist in theory (src/wasm.rs) but they’re kind of broken right now, extract_with_ast() doesn’t exist, has_cover() doesn’t exist, cover_image() doesn’t exist. Those are all in the TODO. The C FFI via Diplomat is in better shape and actually generates a real header file.
The test suite is… extensive? Like maybe embarrassingly extensive for something this early. There’s unit tests, integration tests, API tests, edge case tests, streaming tests, performance tests, and a memory threshold test that reads /proc/self/status to check RSS delta. I went a little overboard. The edge case tests especially are kind of a placeholder graveyard right now, most of them are just “open the test epub and hope nothing crashes” because actually testing edge cases properly requires mock EPUBs and I have not built those yet.
The big things left:
Anyway that’s the summary and stuff thanks for coming to my TED Talk hope you enjoy.