
FibrumPDF

3 devlogs
57h 15m 55s

Updated Project: The project has changed a lot since then. I can’t summarize everything here, but some important changes are:

  • the core is now in Go, not C
  • I renamed it to Fibrum; it used to be called pymupdf4llm-c
  • added benchmarks
  • quality and performance improved across the board

I’ve been working on a faster alternative to pymupdf4llm and Docling.

I just want to say somewhere that Hack Club is really great, so thank you :)

It processes 200+ pages/sec on CPU.
It extracts tables, text, formatting, bounding boxes, and font sizes.

Output is JSON, with optional Markdown.
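To make the structured-output idea concrete, here is a hypothetical sketch of consuming such JSON and rendering it to Markdown. The field names (`blocks`, `type`, `bbox`, etc.) are assumptions for illustration, not Fibrum’s real schema:

```python
import json

# Hypothetical structured page output; real Fibrum field names may differ.
page_json = """
{
  "blocks": [
    {"type": "heading", "text": "Results", "font_size": 18, "bbox": [72, 90, 300, 110]},
    {"type": "paragraph", "text": "Throughput exceeds 200 pages/sec.", "font_size": 11, "bbox": [72, 120, 520, 150]}
  ]
}
"""

def blocks_to_markdown(page: dict) -> str:
    """Render a structured page to Markdown: headings become '#' lines."""
    lines = []
    for block in page["blocks"]:
        if block["type"] == "heading":
            lines.append("# " + block["text"])
        else:
            lines.append(block["text"])
    return "\n\n".join(lines)

print(blocks_to_markdown(json.loads(page_json)))
```

Keeping JSON as the primary format and deriving Markdown from it (rather than the reverse) preserves the geometry and font data that Markdown would throw away.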

It’s a Python library, with a Go core and a thin C layer interfacing with MuPDF.

Table extraction precision/recall is currently lower than that of existing tools; I want to improve this.
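For reference, the precision/recall numbers here come down to counting matched detections. A generic sketch; the rule for matching a detected table to a ground-truth table (e.g. by bounding-box overlap) is an assumption, not the benchmark’s actual one:

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Standard precision/recall: a detected table counts as a true positive
    when it matches a ground-truth table (matching rule assumed, e.g. IoU)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# e.g. 8 tables found correctly, 2 spurious detections, 4 tables missed:
p, r = precision_recall(8, 2, 4)
print(round(p, 3), round(r, 3))  # 0.8 precision, 0.667 recall
```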

This project uses AI

I used OpenCode with GitHub Copilot to plan, write repetitive code, find issues, and implement certain parts of features. I used it moderately.

Demo Repository


AVD

Tagged your project as well cooked!

🔥 AVD marked your project as well cooked! As a prize for your nicely cooked project, look out for a bonus prize in the mail :)

Adit

Shipped this project!

Hours: 57.27
Cookies: 🍪 708
Multiplier: 12.36 cookies/hr

I built a fast PDF extractor, getting 200 pages a second on benchmarks.

It was actually MUCH harder than I thought to extract tables… surprisingly, everything else wasn’t too bad. Like geez, PDFs are such a TERRIBLE format that we use every day! You gotta guess everything.

I’m proud of the speed… honestly, when I started, it was more of a small port of pymupdf(4llm), just writing it in Cython because I wanted more speed. I didn’t even think I’d make my own library; I was just tryna make it like 2-3x faster. But here we are.

Adit

I have once again utterly failed to log anything over so long.

I keep forgetting. I get too engrossed in fixing something and then I iterate and iterate but don’t log my changes here.

In this amount of time, the entire codebase has undergone a major refactoring (though this includes feature changes).

The latest thing I’ve done is add benchmarks. I used them a lot internally for the improvements below, but decided to publish them to help my tool. They use the Marker dataset and compare Fibrum’s performance and quality to pymupdf4llm and Docling, which I found to be two of the more popular PDF tools.
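A pages-per-second figure like the ones these benchmarks report boils down to a timing loop. This is a generic sketch, not the actual harness; `process_page` is a stand-in for real extraction work:

```python
import time

def pages_per_second(process_page, pages) -> float:
    """Time a page-processing function over a set of pages and report throughput."""
    start = time.perf_counter()
    for page in pages:
        process_page(page)
    elapsed = time.perf_counter() - start
    return len(pages) / elapsed

# Toy stand-in for real per-page extraction work:
rate = pages_per_second(lambda p: sum(range(1000)), range(50))
print(f"{rate:.0f} pages/sec")
```

A real harness would also warm up first and report a median over several runs, since a single pass is noisy.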

Go codebase changes:

  • Added borderless table extraction. This was a pain to implement; you have to avoid detecting plain text as tables while still actually detecting the real tables.
  • Made the edge-based table extraction much more reliable.
  • The table improvements also help text quality, as misdetecting text as tables lowers the text score.
  • Overall, a lot of logic was made more ‘adaptive’: instead of assuming an average, we actually adapt page by page.
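The ‘adaptive’ idea in the last bullet can be sketched like this: derive, say, a paragraph-break threshold from each page’s own line-gap distribution instead of one corpus-wide average. The `1.5` factor and the gap values are made-up illustrations, not Fibrum’s actual numbers:

```python
from statistics import median

def adaptive_gap_threshold(line_gaps: list[float], factor: float = 1.5) -> float:
    """Derive a paragraph-break threshold from this page's own gap
    distribution (median * factor) rather than a global average."""
    return median(line_gaps) * factor

dense_page = [10, 11, 10, 12, 30]    # tight leading, one large break
sparse_page = [20, 22, 21, 23, 60]   # looser leading throughout

print(adaptive_gap_threshold(dense_page))   # 16.5 -- a 30pt gap is a break here
print(adaptive_gap_threshold(sparse_page))  # 33.0 -- but would not be here
```

The same 30pt gap means a paragraph break on one page and ordinary leading on another, which is exactly what a fixed global threshold gets wrong.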

A lot of refactoring happened here, and honestly, I still think the code is a bit sloppy in parts; it was more a ‘reorganization’ of complexity. But a lot of unnecessary logic was shaved off. Always a win; it offset the net additions from all the new features. Less is more.

C code:

  • There was a bug where we weren’t traversing FZ_STEXT_BLOCK_STRUCT blocks; for some reason, text is sometimes nested in there, which hurt scores on certain PDFs. This has been fixed.
  • Minor refactoring here as well, mostly simple line-by-line optimization.
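The nested-block fix boils down to recursing into struct blocks instead of skipping them. A language-agnostic sketch in Python; MuPDF’s actual C API and block representation differ:

```python
def collect_text(blocks: list[dict]) -> list[str]:
    """Walk a block tree, recursing into 'struct' blocks so text nested
    inside them is not silently skipped (the bug described above)."""
    out = []
    for block in blocks:
        if block["type"] == "text":
            out.append(block["text"])
        elif block["type"] == "struct":
            out.extend(collect_text(block["children"]))  # the missing recursion
    return out

page = [
    {"type": "text", "text": "Top-level paragraph"},
    {"type": "struct", "children": [
        {"type": "text", "text": "Paragraph nested in a struct block"},
    ]},
]
print(collect_text(page))
```

Without the recursive branch, only the top-level paragraph survives, which is how whole passages silently vanished from certain PDFs.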

Performance (everywhere):

  • Go side, we were accumulating everything in RAM, a big oversight for a ‘performant’ tool. But we can’t flush to disk for every character, as that hammers the kernel, so we buffer and flush periodically.
  • Optimized the worker counts and goroutine setup. We do oversubscribe (2x for the pipeline, 3x for I/O), which depending on your outlook can be considered dumb, but it accounts for GC, RAM, and I/O bottlenecks and pauses. Basically, for short periods all the workers might be waiting on something much slower (relatively), and then the oversubscribed workers come into play; when the main workers are busy, the extra workers are at rest.
  • Overall, much more aggressive with CPU usage, and, despite fixing the RAM issues, a little more lenient on RAM for the speed benefit. Note that memory is stable with respect to N cores at a time, not N pages.
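The buffer-and-flush idea in the first bullet can be sketched like this. The 64 KiB default is an assumed size, not Fibrum’s actual buffer limit:

```python
import io

class PeriodicFlusher:
    """Buffer output in memory and flush to the sink once the buffer passes
    a size limit: avoids both unbounded RAM use and a syscall per write."""
    def __init__(self, sink, limit: int = 64 * 1024):  # 64 KiB is an assumed size
        self.sink = sink
        self.limit = limit
        self.buf: list[str] = []
        self.size = 0
        self.flushes = 0

    def write(self, chunk: str) -> None:
        self.buf.append(chunk)
        self.size += len(chunk)
        if self.size >= self.limit:
            self.flush()

    def flush(self) -> None:
        if self.buf:
            self.sink.write("".join(self.buf))
            self.buf, self.size = [], 0
            self.flushes += 1

sink = io.StringIO()
w = PeriodicFlusher(sink, limit=10)
for _ in range(7):
    w.write("abcd")  # 28 chars total; the buffer fills every ~3 writes
w.flush()            # final flush for the tail
print(w.flushes, len(sink.getvalue()))
```

The limit is the tradeoff knob: larger buffers mean fewer syscalls but more resident memory, which matches the “a little more lenient on RAM for the speed benefit” point above.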

This is a very high-level summary; it took lots of trial and error, and this is mostly just what I remember of the end results. I know these devlogs are supposed to be detailed, so I’ll probably get a terrible score story-telling-wise.

Adit

I think I forgot about this and took way too long to write it; it’s been 22 hours and 30 commits, though many are minimal, so I’ll try to summarize everything in one. Forgive me if there’s not enough detail.

First, I tried overhauling a lot of the table detection code, C side. I also made many bug fixes to the formatting. Just typical stuff; I honestly forget by now.

I also did a lot of refactoring Python side, and fixed many bugs in CI/CD along with shared-library issues.

Added Markdown conversion support, Python side.

Eventually, I was trying to refactor the C code and realized: this is shit.

So… I rewrote it in Go. I know, so dumb to change languages, but it wasn’t for fun: it reduced the entire codebase by like 40%. And while I ported it, I didn’t just do it line-for-line; I refactored, used some external libraries, and optimized for performance in the meantime.

So, summary:

  • refactored Python code
  • rewrote in Go
  • overhauled table detection logic specifically
  • overhauled/fixed formatting bugs
  • optimized and bug-fixed CI/CD
  • optimized performance a bit
  • updated docs to be a bit more useful with the performance info, API usage and clarity

Sorry for the lengthy read, I tried to summarize as best as I could. To be honest, I forgot about this thing and was just having fun.

Adit

I liked the quality of pymupdf4llm, but it was too slow for my needs. I couldn’t find any alternatives of similar quality, and so, I decided to rewrite it in C (and later Go), and bind it back to Python.

I also couldn’t find anything local that gave ‘structured’ output: JSON instead of just Markdown (though you can convert it to Markdown). This is better for RAG and smarter chunking.

Also, surprisingly, in my project, it seemed to detect MORE tables and formatting than pymupdf4llm. So that’s a bonus.

It uses MuPDF under the hood in C, then processes it in Go, and Python calls it as a shared library.

To bypass MuPDF’s multithreading lock and get faster performance, we fork C-side, and Go-side we set goroutines to the number of cores (to not oversaturate).
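A minimal sketch of the fork-per-core idea, using Python’s multiprocessing rather than the project’s actual C/Go code; `render_page` is a placeholder for the real per-page MuPDF work:

```python
import os
from multiprocessing import Pool

def render_page(page_no: int) -> int:
    """Stand-in for per-page work; each forked worker gets its own copy of
    the (single-threaded) library state, so no shared lock is contended."""
    return page_no * page_no  # placeholder computation

if __name__ == "__main__":
    workers = os.cpu_count() or 1   # one worker per core, to not oversaturate
    with Pool(processes=workers) as pool:
        results = pool.map(render_page, range(8))
    print(results)
```

Forked processes sidestep a library-wide lock entirely, at the cost of per-process memory and the overhead of shipping results back to the parent.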

It’s just geometry and heuristics under the hood, a bit opinionated, but that’s the only option.
