Adit on Flavortown

Adit shipped FibrumPDF

about 2 months ago

Shipped this project!

Hours: 57.27

Cookies: 🍪 708

Multiplier: 12.36 cookies/hr

I built a fast PDF extractor, getting 200 pages a second on benchmarks.

It was actually MUCH harder than I thought to extract tables… surprisingly, everything else wasn’t too bad. Like geez, such a TERRIBLE format that we use every day! PDFs. You gotta guess everything.

I’m proud of the speed.. honestly when I started it was more of a small port of pymupdf(4llm), just writing it in Cython cause I wanted more speed. I didn’t even think I’d make my own library, I was just tryna make that like 2-3x faster. But, here we are.

Adit worked on FibrumPDF

about 2 months ago

43h 19m logged

I have once again utterly failed to log anything for far too long.

I keep forgetting. I get too engrossed in fixing something and then I iterate and iterate but don’t log my changes here.

In this amount of time, the entire codebase has undergone a major ‘refactoring’ (though this includes feature changes too).

The latest thing I’ve done is add benchmarks. I used them a lot internally at first for the improvements below, but I’ve decided to publish them to help show what the tool can do. They use the Marker dataset and compare Fibrum’s performance and quality to pymupdf4llm and Docling, which I found were two of the more popular PDF tools.

Go codebase changes:

Added borderless table extraction. This was a pain to implement: you have to avoid detecting plain text as tables while still detecting the actual tables.
Made the edge-based table extraction much more reliable.
The table improvements also help text quality, as misdetecting text as tables results in a lower text score.
Overall, a lot of stuff was made more ‘adaptive’: instead of assuming one average for everything, we now adapt page by page.

Refactoring was done a lot here, and honestly, I still think the code is a bit sloppy in parts. It was more of a ‘reorganization’ of complexity. But a lot of unnecessary logic was shaved off. Always a win; it kept the net additions down despite all the new features. Less is more.

C code:

There was a bug where we weren’t traversing FZ_STEXT_BLOCK_STRUCT blocks; for some reason, text is sometimes nested in there, so that hurt scores on certain PDFs. This has been fixed.
Minor refactoring here as well, mostly just simple line-by-line optimization.
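
To make the nested-block bug concrete, here is a minimal sketch of the traversal idea. It is a Go model, not MuPDF's C API and not Fibrum's actual fix (which lives in the C code): the `Block`, `BlockKind`, and `collectText` names are all made up for illustration. The point is only that a walker which handles text blocks but skips structure blocks silently drops any text nested inside them.

```go
package main

import "fmt"

// BlockKind is a hypothetical stand-in for a structured-text block type tag.
type BlockKind int

const (
	TextBlock BlockKind = iota
	ImageBlock
	StructBlock // a container that can hold further blocks
)

// Block is an illustrative model, not a MuPDF type.
type Block struct {
	Kind     BlockKind
	Text     string   // used by TextBlock
	Children []*Block // used by StructBlock
}

// collectText appends the text of every text block to out, recursing into
// structure blocks so that nested text is not lost.
func collectText(blocks []*Block, out []string) []string {
	for _, b := range blocks {
		switch b.Kind {
		case TextBlock:
			out = append(out, b.Text)
		case StructBlock:
			// Without this case, "nested" and "deep" below would be skipped.
			out = collectText(b.Children, out)
		}
	}
	return out
}

func main() {
	page := []*Block{
		{Kind: TextBlock, Text: "top"},
		{Kind: StructBlock, Children: []*Block{
			{Kind: TextBlock, Text: "nested"},
			{Kind: StructBlock, Children: []*Block{{Kind: TextBlock, Text: "deep"}}},
		}},
		{Kind: ImageBlock},
	}
	fmt.Println(collectText(page, nil)) // [top nested deep]
}
```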

Performance (everywhere):

On the Go side, we were accumulating everything in RAM, a big oversight for a ‘performant’ tool. But we can’t flush to disk for every character, as that hammers the kernel, so we buffer the output and flush periodically.
Optimized the worker counts and goroutine setup. We do oversubscribe (2x for the pipeline, 3x for I/O), which depending on your outlook can be considered dumb.. but this accounts for GC, RAM, and I/O bottlenecks and pauses. Basically, for a short period all the regular workers might be waiting on something (that is much slower, relatively), so the oversubscribed workers can come into play. And if the regular workers are busy, the oversubscribed workers are at rest.
Overall, much more aggressive with CPU usage, and, despite fixing the RAM issues, a little more lenient on RAM for the speed benefit. Note that RAM usage stays stable relative to the N cores working at a time, not the N pages in the document.

This is a very high level summary, it took me lots of trial and error, and this is mostly just what I remember as the end results. I know these devlogs are supposed to be detailed.
I’ll probably get a terrible score story-telling wise.


Adit worked on FibrumPDF

4 months ago

13h 56m logged

I think I forgot about this and took way too long to write it; it’s been 22 hours and 30 commits, though many are minimal, so I’ll try to summarize everything into one. Forgive me if there’s not enough detail.

First, I tried overhauling a lot of the table detection code on the C side. I also made many bug fixes to the formatting, just typical stuff. I honestly forgot the details by now.

I also did a lot of refactoring on the Python side, and fixed many bugs in CI/CD and shared library issues.

Added Markdown conversion support on the Python side.

Eventually, I was trying to refactor the C code, and realized, this is shit.

So.. I rewrote it in Go. I know, so dumb to change languages, but it wasn’t for fun. It reduced the entire codebase by like 40%. And while I ported it, I didn’t just do it line-for-line; I refactored, used some external libraries, and optimized for performance along the way.

So, summary:

refactored Python code
rewrote in Go
overhauled table detection logic specifically
overhauled/fixed formatting bugs
optimized and bug-fixed CI/CD
optimized performance a bit
updated docs to be a bit more useful with the performance info, API usage and clarity

Sorry for the lengthy read, I tried to summarize as best as I could. To be honest, I forgot about this thing and was just having fun.


Adit worked on rich-soup

6 months ago

3h 33m logged

tldr: links, lists, inline styling

support links & lists as their own blocks now
before links & lists were just.. being ignored/being treated in the paragraph blocks
so now they’re their own block

started using arrays of spans instead of raw text to support inline styling.

like

spans: [<text, bold>(“This”), <text,italic>(“is”)]

before it was just one chunk of text, so nothing inside it would really get flagged, if that makes sense. that’s because of the way it was calculated, with the mean and median over the whole text & stuff.
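
a minimal sketch of the spans idea, written in Go for illustration only (this is not rich-soup's code, and the `Span`, `Block`, and `Markdown` names are made up). each block holds a list of spans, and each span carries its own styling flags, so inline bold/italic survives instead of being averaged away across one chunk of text.

```go
package main

import (
	"fmt"
	"strings"
)

// Span is a run of text with its own inline styling (hypothetical type).
type Span struct {
	Text   string
	Bold   bool
	Italic bool
}

// Block is a paragraph-like unit made of spans instead of one raw string.
type Block struct {
	Kind  string // e.g. "paragraph"
	Spans []Span
}

// Markdown renders each span with its own emphasis markers and joins the
// spans with single spaces.
func (b Block) Markdown() string {
	parts := make([]string, 0, len(b.Spans))
	for _, s := range b.Spans {
		t := s.Text
		switch {
		case s.Bold && s.Italic:
			t = "***" + t + "***"
		case s.Bold:
			t = "**" + t + "**"
		case s.Italic:
			t = "*" + t + "*"
		}
		parts = append(parts, t)
	}
	return strings.Join(parts, " ")
}

func main() {
	b := Block{Kind: "paragraph", Spans: []Span{
		{Text: "This", Bold: true},
		{Text: "is", Italic: true},
		{Text: "plain"},
	}}
	fmt.Println(b.Markdown()) // **This** *is* plain
}
```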

post-processing; just made a separate module that cleans up after the parsed output. more of a refactor kinda thing. but also added some filters to start it off. like removing duplicate links, cleaning up those language selectors.
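
one of those filters, removing duplicate links, could look roughly like this. it's a Go sketch under assumptions: the `Link` type and `dedupeLinks` name are hypothetical, and treating two links as duplicates when their URLs match exactly is a guess at the rule, not something stated above.

```go
package main

import "fmt"

// Link is a hypothetical extracted link.
type Link struct {
	Text string
	URL  string
}

// dedupeLinks keeps the first link seen for each URL and preserves order.
func dedupeLinks(links []Link) []Link {
	seen := make(map[string]struct{}, len(links))
	out := make([]Link, 0, len(links))
	for _, l := range links {
		if _, ok := seen[l.URL]; ok {
			continue // already kept a link with this URL
		}
		seen[l.URL] = struct{}{}
		out = append(out, l)
	}
	return out
}

func main() {
	links := []Link{
		{Text: "Home", URL: "https://example.com/"},
		{Text: "Docs", URL: "https://example.com/docs"},
		{Text: "Homepage", URL: "https://example.com/"},
	}
	for _, l := range dedupeLinks(links) {
		fmt.Println(l.Text, l.URL)
	}
}
```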

added a few tests; performance ones. i’ll do a smoke test (just examining it myself) for the rest.

and a little bit of shuffling and refactoring, i always do that.


Adit worked on rich-soup

6 months ago

2h 44m logged

refactored the extraction and parsing stuff, and renamed functions for clarity. cleaned up text handling with headings, duplicates, and small-text filtering. reorganized reading order and header/footer logic. revamped the cli output handling, and tightened up models and type hints. basically just a big internal cleanup and rework to make things clearer and more consistent without changing the core functionality. markdown is now better; shows everything nicely. & more formats supported.


Adit worked on rich-soup

6 months ago

1h 22m logged

enhance block extraction, image and table support

add image, table support
add a bit of logging
improve performance MASSIVELY
remove stroke extraction
change mindset: not gonna rasterize and make stuff from the strokes
improve algorithms & stuff

had to do some weird hacks to get it to work on sites with strict CSP
& performance hacks as well, but it really really 10x’d the perf.

and even though the output quality would be slightly worse in edge cases, i took the lazy route. handle most cases, give good perf, keep codebase simple.


Adit worked on rich-soup

6 months ago

2h 26m logged

Just made my first commit for rich-soup!
(^^ BeautifulSoup is static and preserves junk, and playwright is too low-level. this uses playwright to actually render the page, then uses semantics and heuristics instead of raw HTML tags, but gives you BeautifulSoup-level DX!)

benefits are that it can operate on React/Svelte/Vue/* because it actually renders the page, and it ignores the div hell and the tags cause they’re useless nowadays. it sees what the human sees basically.

I got basic text extraction and formatting working:

headings
bold (true/false)
paragraphs
clean filtering
demo Markdown output

but right now tables and stuff don’t work, and it’s an MVP so the logic is brittle. but it’s actually looking pretty clean atm for what i’ve tested (wikipedia, microsoft learn, some whitepapers).


Adit worked on FibrumPDF

6 months ago

I liked the quality of pymupdf4llm, but it was too slow for my needs. I couldn’t find any alternatives of similar quality, and so, I decided to rewrite it in C (and later Go), and bind it back to Python.

I also couldn’t find anything that runs locally and gives ‘structured’ output: JSON instead of just Markdown (though you can convert it to Markdown). This is better for RAG and smarter chunking.
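
To show why structured output helps with chunking, here is a minimal Go sketch. The JSON shape (`type`, `text`, `page` fields) is invented for the example and is not Fibrum's actual schema; `Block`, `Chunk`, and `chunkByHeading` are hypothetical names too. The idea is that typed blocks let you start a new chunk at each heading, instead of guessing boundaries from Markdown text.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Block mirrors one entry of an assumed JSON output format.
type Block struct {
	Type string `json:"type"`
	Text string `json:"text"`
	Page int    `json:"page"`
}

// Chunk is a heading plus the body text that follows it.
type Chunk struct {
	Heading string
	Body    []string
}

// chunkByHeading starts a new chunk at every heading block and attaches
// the following non-heading blocks to it.
func chunkByHeading(blocks []Block) []Chunk {
	var chunks []Chunk
	for _, b := range blocks {
		if b.Type == "heading" {
			chunks = append(chunks, Chunk{Heading: b.Text})
			continue
		}
		if len(chunks) == 0 {
			chunks = append(chunks, Chunk{}) // text before any heading
		}
		last := &chunks[len(chunks)-1]
		last.Body = append(last.Body, b.Text)
	}
	return chunks
}

func main() {
	const sample = `[
		{"type": "heading", "text": "Intro", "page": 1},
		{"type": "paragraph", "text": "First.", "page": 1},
		{"type": "paragraph", "text": "Second.", "page": 1},
		{"type": "heading", "text": "Methods", "page": 2},
		{"type": "paragraph", "text": "Third.", "page": 2}
	]`
	var blocks []Block
	if err := json.Unmarshal([]byte(sample), &blocks); err != nil {
		panic(err)
	}
	for _, c := range chunkByHeading(blocks) {
		fmt.Printf("%s: %s\n", c.Heading, strings.Join(c.Body, " "))
	}
}
```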

Also, surprisingly, in my project, it seemed to detect MORE tables and formatting than pymupdf4llm. So that’s a bonus.

It uses MuPDF under the hood in C, then processes it in Go, and Python calls it as a shared library.

To get around MuPDF’s multithreading locks and get faster performance, we use fork on the C side and goroutines, set to the number of cores, on the Go side (so as not to oversaturate).

It’s just geometry and heuristics under the hood, a bit opinionated, but it’s the only option.
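
One way to feed N workers is to split the document into N contiguous page ranges, one per core. The Go sketch below shows only that splitting step. It is an assumption about how work could be divided, not a description of Fibrum's internals, and `pageRange` and `splitPages` are hypothetical names.

```go
package main

import (
	"fmt"
	"runtime"
)

// pageRange is a half-open interval [Start, End) of page indices.
type pageRange struct{ Start, End int }

// splitPages divides n pages into at most `workers` contiguous ranges
// whose sizes differ by at most one page.
func splitPages(n, workers int) []pageRange {
	if n <= 0 || workers <= 0 {
		return nil
	}
	if workers > n {
		workers = n // never create empty ranges
	}
	base, extra := n/workers, n%workers
	ranges := make([]pageRange, 0, workers)
	start := 0
	for i := 0; i < workers; i++ {
		size := base
		if i < extra {
			size++ // spread the remainder over the first ranges
		}
		ranges = append(ranges, pageRange{Start: start, End: start + size})
		start += size
	}
	return ranges
}

func main() {
	fmt.Println("cores on this machine:", runtime.NumCPU())
	for _, r := range splitPages(10, 4) {
		fmt.Printf("pages %d-%d\n", r.Start, r.End-1)
	}
}
```

With 10 pages and 4 workers this gives ranges of 3, 3, 2, and 2 pages.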


Adit

Joined 2025-12-23

8

Posts

2

Projects

1

Ships

27

Votes

All time:

67.4 hrs

Today:

0.0 hrs

Achievements

Projects

rich-soup

FibrumPDF

Shipped

Orders

Ordered Hyperpixel 4.0 1 time.

Ordered This Is Water by David Foster-Wallace 1 time.

Ordered Raspberry Pi Zero 2 W 1 time.

Ordered Free Domain 1 time.

Ordered ESP32 Kit 1 time.

Ordered Slow Cook Broth 1 time.
