Activity

Adit

I forgot about this and took way too long to write it. It’s been 22 hours and 30 commits (though many are minimal), so I’ll try to summarize everything in one post. Forgive me if there’s not enough detail.

First, I overhauled a lot of the table detection code on the C side. I also fixed many formatting bugs. Just typical stuff; I honestly forget the details by now.

I also did a lot of refactoring on the Python side, and fixed many CI/CD bugs and shared library issues.

I added Markdown conversion support on the Python side.

Eventually, while trying to refactor the C code, I realized: this is shit.

So... I rewrote it in Go. I know, it seems dumb to change languages, but it wasn’t for fun: it reduced the entire codebase by about 40%. And I didn’t just port it line-for-line; I refactored, used some external libraries, and optimized for performance in the meantime.

So, summary:

  • refactored Python code
  • rewrote in Go
  • overhauled table detection logic specifically
  • overhauled/fixed formatting bugs
  • optimized and bug-fixed CI/CD
  • optimized performance a bit
  • updated docs to be a bit more useful with the performance info, API usage and clarity

Sorry for the lengthy read; I tried to summarize as best I could. To be honest, I forgot about this thing and was just having fun.

Adit

tldr: links, lists, inline styling

links & lists are now supported as their own blocks.
before, they were either ignored or lumped into the paragraph blocks.

  • started using arrays of spans instead of raw text to support inline styling.

like

spans: [<text, bold>(“This”), <text,italic>(“is”)]

before, each block was just one chunk of text, so nothing inside it would really get flagged, if that makes sense, because styling was calculated with the mean and median over the whole chunk.
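
A rough sketch of the span idea, not the project’s actual model: the `Span` class, its fields, and `to_markdown` are made up here for illustration. Each block carries a list of styled runs, so styling can be decided per run instead of per block:

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical styled run of text inside a block."""
    text: str
    bold: bool = False
    italic: bool = False


def to_markdown(spans: list[Span]) -> str:
    """Render spans as inline Markdown, one styled run at a time."""
    parts = []
    for span in spans:
        text = span.text
        if span.italic:
            text = f"*{text}*"
        if span.bold:
            text = f"**{text}**"
        parts.append(text)
    # Joining with spaces is a simplification; real spans would keep their own spacing.
    return " ".join(parts)


spans = [Span("This", bold=True), Span("is", italic=True), Span("a test")]
print(to_markdown(spans))  # **This** *is* a test
```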

post-processing: made a separate module that cleans up the parsed output. more of a refactor kinda thing, but i also added some filters to start it off, like removing duplicate links and cleaning up those language selectors.
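
A duplicate-link filter of that kind could look roughly like this. The block shape (`type`, `href`) and the trailing-slash normalization are assumptions for the sketch, not the real module:

```python
def dedupe_links(blocks: list[dict]) -> list[dict]:
    """Drop link blocks whose URL was already seen; keep first occurrence and order."""
    seen = set()
    cleaned = []
    for block in blocks:
        if block.get("type") == "link":
            # Assumption: treat "https://a.com" and "https://a.com/" as the same link.
            url = block.get("href", "").rstrip("/")
            if url in seen:
                continue
            seen.add(url)
        cleaned.append(block)
    return cleaned


blocks = [
    {"type": "link", "href": "https://example.com/"},
    {"type": "paragraph", "text": "hello"},
    {"type": "link", "href": "https://example.com"},
]
print(len(dedupe_links(blocks)))  # 2
```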

  • added a few tests (performance ones); i’ll smoke-test the rest (just examining the output myself).

and a little bit of shuffling and refactoring, i always do that.

Adit

basically just a big internal cleanup and rework to make things clearer and more consistent without changing the core functionality:

  • refactored the extraction and parsing stuff
  • renamed functions for clarity
  • cleaned up text handling (headings, duplicates, small-text filtering)
  • reorganized reading order and header/footer logic
  • revamped the CLI output handling
  • tightened up models and type hints

markdown is now better; it shows everything nicely. & more formats are supported.

Adit

enhance block extraction, image and table support

  • add image, table support
  • add a bit of logging
  • improve performance MASSIVELY
  • remove stroke extraction
  • change mindset: not gonna rasterize and make stuff from the strokes
  • improve algorithms & stuff

had to do some weird hacks to get it working on sites with strict CSP,
plus some performance hacks, but they really did 10x the perf.

and even though the output quality is slightly worse in edge cases, i took the lazy route: handle most cases, give good perf, keep the codebase simple.

Adit

Just made my first commit for rich-soup!
(^^ BeautifulSoup is static and preserves junk, and Playwright is too low-level. rich-soup uses Playwright to actually render the page, then uses semantics and heuristics instead of raw HTML tags, but gives you BeautifulSoup-level DX!)

the benefits are that it works on React/Svelte/Vue/* sites because it actually renders them, and it ignores the div hell and the tags because they’re useless nowadays. basically, it sees what a human sees.

I got basic text extraction and formatting working:

  • headings
  • bold (true/false)
  • paragraphs
  • clean filtering
  • demo Markdown output

right now tables and stuff don’t work, and it’s an MVP so the logic is brittle, but the output is actually looking pretty clean atm on what i’ve tested (Wikipedia, Microsoft Learn, some whitepapers).
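
To give a feel for what "heuristics instead of tags" can mean for headings, here is a minimal sketch that compares each block’s rendered font size against the page median. The `font_size` field, the 1.2 ratio, and `tag_headings` are all assumptions for illustration, not rich-soup’s actual logic:

```python
from statistics import median


def tag_headings(blocks: list[dict], ratio: float = 1.2) -> list[dict]:
    """Mark a block as a heading when its font size is well above the page median."""
    if not blocks:
        return []
    # The median approximates the body text size, since most blocks are body text.
    body_size = median(b["font_size"] for b in blocks)
    return [
        {**b, "type": "heading" if b["font_size"] >= body_size * ratio else "paragraph"}
        for b in blocks
    ]


blocks = [
    {"text": "Getting started", "font_size": 32},
    {"text": "Install the package.", "font_size": 16},
    {"text": "Then run the demo.", "font_size": 16},
    {"text": "That is all.", "font_size": 16},
]
for block in tag_headings(blocks):
    print(block["type"], "-", block["text"])
```

A rule this simple is exactly the kind of thing that is brittle on unusual layouts, which is why it is only a starting point.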

Adit

I liked the quality of pymupdf4llm, but it was too slow for my needs. I couldn’t find any alternatives of similar quality, so I decided to rewrite it in C (and later Go) and bind it back to Python.

I also couldn’t find anything that runs locally and gives ‘structured’ output: JSON instead of just Markdown (though you can convert it to Markdown). This is better for RAG and smarter chunking.

Also, surprisingly, my project seems to detect MORE tables and formatting than pymupdf4llm, so that’s a bonus.
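
As an example of why structured output helps with chunking, here is a small sketch that groups blocks into one chunk per heading. The JSON shape (`type`, `text`) is a made-up stand-in, not the library’s real schema:

```python
import json


def chunk_by_heading(blocks: list[dict]) -> list[dict]:
    """Group blocks into chunks, starting a new chunk at every heading."""
    chunks = []
    for block in blocks:
        is_heading = block["type"] == "heading"
        if is_heading or not chunks:
            # Text before the first heading goes into an untitled chunk.
            chunks.append({"title": block["text"] if is_heading else None, "text": []})
        if not is_heading:
            chunks[-1]["text"].append(block["text"])
    return chunks


doc = json.loads("""
[
  {"type": "heading", "text": "Intro"},
  {"type": "paragraph", "text": "Hello."},
  {"type": "heading", "text": "Usage"},
  {"type": "paragraph", "text": "Call it."},
  {"type": "paragraph", "text": "Done."}
]
""")
for chunk in chunk_by_heading(doc):
    print(chunk["title"], chunk["text"])
```

With plain Markdown you would have to re-parse the text to find these boundaries; with typed blocks they are already there.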

It uses MuPDF under the hood in C, then processes it in Go, and Python calls it as a shared library.

To get around MuPDF’s multithreading lock and get faster performance, we use fork on the C side and goroutines capped at the number of cores on the Go side (so as not to oversaturate the CPU).
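
The general shape of that idea (split the pages into ranges, run no more workers than there are cores) can be sketched in Python. This is only an analogy: it uses a thread pool rather than fork or goroutines, and `split_pages`, `process_range`, and `process_document` are hypothetical names, not part of the project:

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_pages(n_pages: int, n_workers: int) -> list[tuple[int, int]]:
    """Split range(n_pages) into at most n_workers contiguous (start, end) ranges."""
    if n_pages <= 0:
        return []
    n_workers = max(1, min(n_workers, n_pages))
    base, extra = divmod(n_pages, n_workers)
    ranges, start = [], 0
    for i in range(n_workers):
        # The first `extra` workers take one additional page each.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def process_range(page_range: tuple[int, int]) -> int:
    """Placeholder for real per-page work; here it just counts the pages."""
    start, end = page_range
    return end - start


def process_document(n_pages: int) -> int:
    workers = os.cpu_count() or 1  # cap concurrency at the core count
    ranges = split_pages(n_pages, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_range, ranges))


print(split_pages(10, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```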

It’s just geometry and heuristics under the hood, so it’s a bit opinionated, but it’s the only option.
