Claude for planning, use it like a dump. And OpenCode for when I got stuck with problems. But, even when I got it to write for me, I checked it.
Claude for planning, use it like a dump. And OpenCode for when I got stuck with problems. But, even when I got it to write for me, I checked it.
I think I forgot about this and took way too long to write it; it’s been 22 hours and 30 commits, though, many are minimal, so I’ll try to summarize everything into one. Forgive me if there’s not enough detail,
First, I tried overhauling a lot of the table detection code, C side. I also made many bug fixes to the formatting. just typical stuff. I honestly forgot by now.
I also did much refactoring Python side, fixed many bugs in CI/CD and shared library issues.
Added Markdown conversion support Python side,
Eventually, I was trying to refactor the C code, and realized, this is shit.
So.. i rewrote it in Go. I know, so dumb to change languages. but it wasn’t for fun. It reduced the entire codebase by like 40%. And while I ported it, I didn’t just directly do it line-for-line; I refactored and used some external libraries and optimized for performance in the mean time.
So, summary:
Sorry for the lengthy read, I tried to summarize as best as I could. To be honest, I forgot about this thing and was just having fun.
Log in to leave a comment
I liked the quality of pymupdf4llm, but it was too slow for my needs. I couldn’t find any alternatives of similar quality, and so, I decided to rewrite it in C (and later Go), and bind it back to Python.
I also couldn’t find anything locally that gave ‘structured’ output; JSON instead of just Markdown (though, you can convert it to Markdown), (this is better for RAG and smarter chunking.
Also, surprisingly, in my project, it seemed to detect MORE tables and formatting than pymupdf4llm. So that’s a bonus.
It uses MuPDF under the hood in C, then processes it in Go, and Python calls it as a shared library.
To bypass multithreading lock and get faster performance (for MuPDF), we use fork C-side and go rountines set to the number of cores Go side. (to not oversaturate).
It’s just geometry and huerisitcs under the hood, a bit opionionated, but it’s the only option.
Log in to leave a comment