rich-soup

4 devlogs
10h 6m 59s

Rich Soup

Inspired by BeautifulSoup. Instead of parsing static HTML and relying on tags, it fully renders the page and the entire DOM (including JS/CSS & slop) using Playwright. Then it uses layout semantics (e.g. average font size versus larger font sizes, lines, gaps, spacing, hierarchy/reading order) to reconstruct the page into a clean JSON/Markdown format.
Currently, the options are either:

  • BeautifulSoup: static only, messy.

  • Playwright: lower level, manual.

Rich Soup builds on Playwright to give the DX of BeautifulSoup while still rendering pages properly like Playwright.

Primarily intended for document-like pages, e.g. Microsoft Learn, whitepapers (PDF-like), and wiki-like sites. Best part is it uses the layout, not tags, and it’s not static! It can extract from garbled DOMs with hundreds of divs, hydration from React and Astro islands, Tailwind, etc. perfectly fine.
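
To make the "layout, not tags" idea concrete, here is a minimal sketch of one such heuristic: calling a text box a heading when its rendered font size is clearly larger than the typical body size. The names `TextBox` and `classify` and the `ratio` threshold are assumptions for illustration, not Rich Soup's actual API.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TextBox:
    text: str
    font_size: float  # rendered size in CSS pixels (hypothetical field)


def classify(boxes: list[TextBox], ratio: float = 1.2) -> list[tuple[str, str]]:
    """Label each box 'heading' or 'paragraph' relative to the body font size."""
    if not boxes:
        return []
    # The median font size stands in for "normal body text" on the page.
    body_size = median(b.font_size for b in boxes)
    return [
        ("heading" if b.font_size >= body_size * ratio else "paragraph", b.text)
        for b in boxes
    ]


boxes = [
    TextBox("Install", 32.0),
    TextBox("Run pip.", 16.0),
    TextBox("Then import it.", 16.0),
]
for kind, text in classify(boxes):
    print(kind, "|", text)
```

A real page would need more signals (gaps, spacing, weight), but the comparison against a page-wide baseline is the core of the idea.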

Adit

tldr: links, lists, inline styling

links & lists are now supported as their own blocks.
before, they were just being ignored or lumped into the paragraph blocks.

  • started using arrays of spans instead of raw text to support inline styling, like:

spans: [<text, bold>("This"), <text, italic>("is")]

before, a block was just one chunk of text, so inline styling would never really get flagged, if that makes sense, because the style was calculated with the mean and median over the whole text.
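
A rough sketch of what a span-based block could look like, and how it might render to Markdown. `Span`, `Block`, and both helper functions are hypothetical names, and joining spans with a single space assumes word-level spans as in the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    text: str
    bold: bool = False
    italic: bool = False


@dataclass
class Block:
    kind: str  # e.g. "paragraph", "heading", "link", "list"
    spans: list[Span] = field(default_factory=list)


def span_to_markdown(span: Span) -> str:
    """Wrap the span text in Markdown emphasis markers based on its flags."""
    text = span.text
    if span.italic:
        text = f"*{text}*"
    if span.bold:
        text = f"**{text}**"
    return text


def block_to_markdown(block: Block) -> str:
    # Assumption: spans are separated by single spaces in the output.
    return " ".join(span_to_markdown(s) for s in block.spans)


block = Block("paragraph", [Span("This", bold=True), Span("is", italic=True), Span("text")])
print(block_to_markdown(block))
```

Because each span carries its own flags, one bold word no longer gets averaged away by the rest of the paragraph.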

post-processing: made a separate module that cleans up the parsed output. more of a refactor kinda thing, but i also added some filters to start it off, like removing duplicate links and cleaning up those language selectors.
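
As an illustration of the post-processing idea, here is a small sketch of a filter pipeline with a duplicate-link filter. The dict-shaped blocks, the `type`/`href` keys, and the function names are assumptions, not the module's real interface.

```python
from typing import Callable

Blocks = list[dict]


def dedupe_links(blocks: Blocks) -> Blocks:
    """Keep only the first link block for each href; leave other blocks alone."""
    seen: set[str] = set()
    kept: Blocks = []
    for block in blocks:
        if block.get("type") == "link":
            href = block.get("href", "")
            if href in seen:
                continue  # the same target was already emitted earlier
            seen.add(href)
        kept.append(block)
    return kept


# Filters run in order; new cleanup steps can be appended to this list.
FILTERS: list[Callable[[Blocks], Blocks]] = [dedupe_links]


def post_process(blocks: Blocks) -> Blocks:
    for apply_filter in FILTERS:
        blocks = apply_filter(blocks)
    return blocks


parsed = [
    {"type": "link", "href": "/a", "text": "A"},
    {"type": "paragraph", "text": "p"},
    {"type": "link", "href": "/a", "text": "A again"},
    {"type": "link", "href": "/b", "text": "B"},
]
print(post_process(parsed))
```

Keeping each filter as a plain function from blocks to blocks keeps the parser itself untouched when cleanup rules change.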

  • added a few tests, performance ones. i’ll smoke test the rest (just examining it myself).

and a little bit of shuffling and refactoring, i always do that.

Adit

refactored the extraction and parsing stuff: renamed functions for clarity, cleaned up text handling (headings, duplicates, small-text filtering), reorganized reading order and header/footer logic, revamped the cli output handling, and tightened up models and type hints. basically just a big internal cleanup and rework to make things clearer and more consistent without changing the core functionality. markdown is now better and shows everything nicely, & more formats are supported.
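
For a sense of what reading-order logic can involve, here is a minimal sketch that groups boxes into visual lines by their vertical position and then reads each line left to right. It assumes a single-column layout, and the `x`/`y` keys, `reading_order`, and `line_tolerance` are illustrative names rather than the project's own.

```python
def reading_order(boxes: list[dict], line_tolerance: float = 5.0) -> list[dict]:
    """Order boxes top-to-bottom, then left-to-right within each visual line."""
    ordered = sorted(boxes, key=lambda b: (b["y"], b["x"]))
    lines: list[list[dict]] = []
    for box in ordered:
        # A box joins the current line if its top edge is close to the
        # top edge of the first box on that line.
        if lines and abs(box["y"] - lines[-1][0]["y"]) <= line_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    return [b for line in lines for b in sorted(line, key=lambda b: b["x"])]


page = [
    {"text": "world", "x": 60, "y": 11},
    {"text": "Hello", "x": 0, "y": 12},
    {"text": "Next", "x": 0, "y": 40},
]
print([b["text"] for b in reading_order(page)])
```

Multi-column pages would need column detection first; this only covers the simple case.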

Adit

enhance block extraction, image and table support

  • add image, table support
  • add a bit of logging
  • improve performance MASSIVELY
  • remove stroke extraction
  • change mindset: not gonna rasterize and make stuff from the strokes
  • improve algorithms & stuff

had to do some weird hacks to get it to work on sites with strict CSP
& performance hacks as well, but it really really 10x’d the perf.

and even though the output quality would be slightly worse in edge cases, i took the lazy route. handle most cases, give good perf, keep codebase simple.

Adit

Just made my first commit for rich-soup!
(^^ BeautifulSoup is static and preserves junk, and Playwright is too low-level. this uses Playwright to actually render the page, then uses semantics and heuristics instead of raw HTML tags, but gives you BeautifulSoup-level DX!)

benefits are that it can operate on React/Svelte/Vue/* because it actually renders the page, and it ignores the div hell and the tags cause they’re useless nowadays. it sees what the human sees basically.
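
A rough sketch of the "render first, then look at what was painted" step, using Playwright's sync Python API and standard DOM calls. This is not the project's actual extraction code: `COLLECT_JS`, `collect_text_boxes`, `normalize`, and the bold threshold of 600 are assumptions for illustration.

```python
# JavaScript run inside the rendered page: walk text nodes and record
# what the browser actually laid out (position, font size, weight).
COLLECT_JS = """
() => {
  const out = [];
  const walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
  const range = document.createRange();
  for (let node = walker.nextNode(); node; node = walker.nextNode()) {
    const text = node.textContent.trim();
    if (!text) continue;
    range.selectNodeContents(node);
    const rect = range.getBoundingClientRect();
    if (rect.width === 0 || rect.height === 0) continue;  // nothing painted
    const style = getComputedStyle(node.parentElement);
    out.push({
      text,
      x: rect.x + window.scrollX,
      y: rect.y + window.scrollY,
      fontSize: style.fontSize,
      fontWeight: style.fontWeight,
    });
  }
  return out;
}
"""


def parse_px(value: str) -> float:
    """Convert a computed CSS length such as '16px' to a float."""
    return float(value.removesuffix("px"))


def is_bold(font_weight: str) -> bool:
    """Treat a computed font-weight of 600 or more as bold (assumed cutoff)."""
    keywords = {"normal": 400.0, "bold": 700.0}
    weight = keywords[font_weight] if font_weight in keywords else float(font_weight)
    return weight >= 600


def normalize(item: dict) -> dict:
    """Turn one raw browser record into plain Python values."""
    return {
        "text": item["text"],
        "x": item["x"],
        "y": item["y"],
        "font_size": parse_px(item["fontSize"]),
        "bold": is_bold(item["fontWeight"]),
    }


def collect_text_boxes(url: str) -> list[dict]:
    """Render the page in Chromium and return one record per painted text node."""
    # Imported lazily so the pure helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # let client-side JS finish
        raw = page.evaluate(COLLECT_JS)
        browser.close()
    return [normalize(item) for item in raw]
```

Since everything is read from computed styles and bounding boxes, it does not matter whether the markup came from React, Svelte, Vue, or hand-written HTML.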

I got basic text extraction and formatting working:

  • headings

  • bold (true/false)

  • paragraphs

  • clean filtering

  • demo Markdown output

but right now tables and stuff don’t work, and it’s an mvp so the logic is brittle. it’s actually looking pretty clean atm for what i’ve tested though (wikipedia, microsoft learn, some whitepapers).
