Updated Project: The project has changed a lot since then. I can't summarize everything here, but some important changes are:
- The core is now in Go, not C
- I renamed it to Fibrum (it used to be called pymupdf4llm-c)
- Added benchmarks
- Quality and performance improved on all measures
I’ve been working on a faster alternative to pymupdf4llm and Docling.
I also want to say somewhere that Hack Club is really great, so thank you :)
It processes 200+ pages/sec on CPU.
It extracts tables, text, formatting, bounding boxes, and font sizes.
Output is JSON, with optional Markdown.
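To give a rough idea of what working with JSON output like this could look like, here is a minimal sketch that turns a small JSON document into Markdown. The schema (`pages`, `blocks`, `type`, `text`, `font_size`, `bbox`, `rows`) and the `heading_size` threshold are made up for illustration and are not Fibrum's actual format, so check the real output before relying on any field name.

```python
import json

# Hypothetical sample only: the field names below are assumptions,
# not Fibrum's real schema.
SAMPLE = """
{
  "pages": [
    {
      "number": 1,
      "blocks": [
        {"type": "text", "text": "Introduction", "font_size": 18.0,
         "bbox": [72, 72, 300, 96]},
        {"type": "text", "text": "PDF extraction on CPU.", "font_size": 11.0,
         "bbox": [72, 110, 400, 124]},
        {"type": "table", "rows": [["Tool", "Pages/sec"], ["A", "10"]],
         "bbox": [72, 140, 400, 200]}
      ]
    }
  ]
}
"""


def block_to_markdown(block, heading_size=14.0):
    """Render one block; large text is treated as a heading (assumed rule)."""
    if block["type"] == "table":
        header, *body = block["rows"]
        lines = [
            "| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return "\n".join(lines)
    prefix = "## " if block["font_size"] >= heading_size else ""
    return prefix + block["text"]


def document_to_markdown(doc):
    """Join all blocks of all pages, separated by blank lines."""
    parts = []
    for page in doc["pages"]:
        for block in page["blocks"]:
            parts.append(block_to_markdown(block))
    return "\n\n".join(parts)


markdown = document_to_markdown(json.loads(SAMPLE))
print(markdown)
```

The bounding boxes are ignored here; they would matter if you wanted to sort blocks into reading order or crop regions out of the page.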
It’s written for Python, with a Go core and a thin C layer interfacing with MuPDF.
Table extraction precision/recall is currently lower than that of existing tools; I want to improve this.
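For anyone unfamiliar with the terms, here is a small sketch of how precision and recall can be computed for table cells. Representing a cell as a `(row, column, text)` tuple and requiring an exact match are assumptions made for this example; this is not necessarily how the benchmarks in the project score tables.

```python
def precision_recall(predicted, expected):
    """Return (precision, recall) for two collections of table cells.

    precision = correct predicted cells / all predicted cells
    recall    = correct predicted cells / all expected cells
    """
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical ground truth and extractor output for one 2x2 table.
expected = {(0, 0, "Tool"), (0, 1, "Pages/sec"), (1, 0, "A"), (1, 1, "10")}
# The extractor misread "10" as "1O" in the last cell.
predicted = {(0, 0, "Tool"), (0, 1, "Pages/sec"), (1, 0, "A"), (1, 1, "1O")}

p, r = precision_recall(predicted, expected)
print(f"precision={p:.2f} recall={r:.2f}")
```

With exact matching, one misread cell counts against both numbers: it is a wrong prediction and a missed expected cell.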
I used OpenCode with GitHub Copilot to plan, write repetitive code, find issues, and implement parts of certain features. I used it moderately.