Rich Soup

Inspired by BeautifulSoup. Instead of parsing static HTML and relying on tags, it fully renders the page and the entire DOM (including JS/CSS & slop) using Playwright. Then it uses visual semantics, e.g. average font size versus larger font sizes, lines, gaps, spacing, and hierarchy/reading order, to reconstruct the page into a clean JSON/Markdown format.
Currently, the options are either:

BeautifulSoup: static only, messy.
Playwright: lower level, manual.
Rich Soup builds on Playwright to give the DX of BeautifulSoup while still rendering pages properly like Playwright.

Primarily intended for document-like pages, e.g. Microsoft Learn, whitepapers (PDF-like), and wiki-like sites. The best part is that it uses the layout, not the tags, and it’s not static! It can extract from garbled DOMs with hundreds of divs, hydration from React and Astro islands, Tailwind, etc. perfectly fine.
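To make the "font size versus average font size" idea concrete, here is a minimal sketch, not Rich Soup's actual code. The `TextBox` record, the `classify` function, and the `heading_ratio` threshold of 1.3 are all assumptions made for illustration; the real project reads these values from the page rendered by Playwright.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import median


@dataclass
class TextBox:
    """Hypothetical record for one rendered text run (not Rich Soup's real model)."""
    text: str
    font_size: float  # computed font size in CSS pixels
    top: float        # y coordinate of the box on the rendered page


def classify(boxes: list[TextBox], heading_ratio: float = 1.3) -> list[tuple[str, str]]:
    """Label each box 'heading' or 'paragraph' by comparing its font size
    to the median font size on the page, which is treated as the body size."""
    if not boxes:
        return []
    body_size = median(b.font_size for b in boxes)
    labeled = []
    # Sort top-to-bottom as a very rough stand-in for reading order.
    for box in sorted(boxes, key=lambda b: b.top):
        is_heading = box.font_size >= body_size * heading_ratio
        labeled.append(("heading" if is_heading else "paragraph", box.text))
    return labeled


boxes = [
    TextBox("Some body text.", 16.0, 80.0),
    TextBox("Install", 32.0, 20.0),
    TextBox("More body text.", 16.0, 120.0),
]
result = classify(boxes)
```

The median is used here rather than the mean so that a few very large titles do not drag the body-size estimate upward; that is a design choice of this sketch, not something the project description specifies.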

Demo Repository

Adit worked on rich-soup

6 months ago

3h 33m logged

tldr: links, lists, inline styling

links & lists are now supported as their own blocks.
before, links & lists were either ignored or folded into the paragraph blocks.

started using arrays of spans instead of raw text to support inline styling.

like

spans: [<text, bold>(“This”), <text,italic>(“is”)]

before, each block was just one chunk of text, so inline styling inside it never really got flagged, if that makes sense, because the style was calculated from the mean and median over the whole chunk of text.
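A rough sketch of what a span-based block could look like. The `Span` and `Block` classes, their field names, and the choice to join spans with a single space are assumptions for illustration, not the project's actual models.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical inline run of text with its own style flags."""
    text: str
    bold: bool = False
    italic: bool = False


@dataclass
class Block:
    """Hypothetical block that holds a list of spans instead of one raw string."""
    kind: str
    spans: list[Span] = field(default_factory=list)

    def plain_text(self) -> str:
        # Assumption: spans are whole words or phrases, joined by one space.
        return " ".join(s.text for s in self.spans)

    def to_markdown(self) -> str:
        parts = []
        for s in self.spans:
            text = s.text
            if s.bold:
                text = f"**{text}**"
            if s.italic:
                text = f"*{text}*"
            parts.append(text)
        return " ".join(parts)


block = Block(
    kind="paragraph",
    spans=[Span("This", bold=True), Span("is", italic=True), Span("a paragraph.")],
)
markdown = block.to_markdown()  # "**This** *is* a paragraph."
```

Because each span carries its own flags, one bold word no longer gets averaged away by the unstyled text around it.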

post-processing: made a separate module that cleans up the parsed output afterwards. more of a refactor kinda thing, but i also added some filters to start it off, like removing duplicate links and cleaning up those language selectors.
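As an illustration of one such filter, here is a minimal sketch of duplicate-link removal. The `Link` record and `drop_duplicate_links` function are hypothetical names, and treating two hrefs as equal when they differ only by a trailing slash is an assumption of this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical link block: visible text plus target URL."""
    text: str
    href: str


def drop_duplicate_links(links: list[Link]) -> list[Link]:
    """Keep the first link for each target URL, preserving page order."""
    seen: set[str] = set()
    unique: list[Link] = []
    for link in links:
        key = link.href.rstrip("/")  # assumption: a trailing slash is not significant
        if key in seen:
            continue
        seen.add(key)
        unique.append(link)
    return unique


links = [
    Link("Docs", "https://example.com/docs"),
    Link("Documentation", "https://example.com/docs/"),
    Link("Home", "https://example.com"),
]
unique_links = drop_duplicate_links(links)
```

Keeping the first occurrence means the link closest to the top of the page wins, which tends to be the one inside the main content rather than a repeated footer link.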

added a few tests, performance ones; for the rest i’ll do smoke tests (just examining the output myself).

and a little bit of shuffling and refactoring, i always do that.

Adit worked on rich-soup

6 months ago

2h 44m logged

refactored the extraction and parsing stuff: renamed functions for clarity, cleaned up text handling (headings, duplicates, small-text filtering), reorganized the reading order and header/footer logic, revamped the cli output handling, and tightened up the models and type hints. basically just a big internal cleanup and rework to make things clearer and more consistent without changing the core functionality. markdown output is better now and shows everything nicely, & more formats are supported.
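For a sense of what reading-order logic can involve, here is a minimal sketch under stated assumptions: boxes are `(text, left, top)` tuples, boxes whose `top` values are within `line_tolerance` pixels of the first box in a row belong to that row, and the page is single-column. None of these names or thresholds come from the project.

```python
from __future__ import annotations

Box = tuple[str, float, float]  # (text, left, top), a hypothetical shape


def reading_order(boxes: list[Box], line_tolerance: float = 5.0) -> list[str]:
    """Group boxes into visual rows by their top edge, then read each row left to right."""
    rows: list[list[Box]] = []
    for box in sorted(boxes, key=lambda b: b[2]):  # top to bottom
        # Compare against the first box of the current row to avoid drift.
        if rows and abs(box[2] - rows[-1][0][2]) <= line_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [b[0] for row in rows for b in sorted(row, key=lambda b: b[1])]


boxes = [
    ("world", 60.0, 101.0),
    ("Hello", 10.0, 100.0),
    ("Next line", 10.0, 130.0),
]
ordered = reading_order(boxes)
```

A multi-column layout would need an extra step that splits the page into columns first; this sketch deliberately leaves that out.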

Adit worked on rich-soup

6 months ago

1h 22m logged

enhance block extraction, image and table support

add image, table support
add a bit of logging
improve performance MASSIVELY
remove stroke extraction
change mindset: not gonna rasterize and make stuff from the strokes
improve algorithms & stuff

had to do some weird hacks to get it to work on sites with strict CSP
& performance hacks as well, but it really really 10x’d the perf.

and even though the output quality would be slightly worse in edge cases, i took the lazy route. handle most cases, give good perf, keep codebase simple.

Adit worked on rich-soup

6 months ago

2h 26m logged

Just made my first commit for rich-soup!
(^^ BeautifulSoup is static and preserves junk, and Playwright is too low-level. this uses Playwright to actually render the page, then uses semantics and heuristics instead of raw HTML tags, but gives you BeautifulSoup-level DX!)

benefits are that it can operate on React/Svelte/Vue/* because it actually renders the page, and it ignores the div hell and the tags cause they’re useless nowadays. it sees what the human sees, basically.

I got basic text extraction and formatting working:

headings
bold (true/false)
paragraphs
clean filtering
demo Markdown output

but right now tables and stuff don’t work, and it’s an mvp so the logic is brittle. but it’s actually looking pretty clean atm on what i’ve tested (wikipedia, microsoft learn, some whitepapers).
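One way the heading detection and demo Markdown output could fit together, as a sketch only: every distinct font size above the body size becomes a heading level, largest first. The `(text, font_size)` tuple shape, the function names, and passing `body_size` in explicitly are assumptions, not the project's API.

```python
from __future__ import annotations


def heading_levels(heading_sizes: list[float]) -> dict[float, int]:
    """Map each distinct heading font size to a Markdown level, largest size first."""
    distinct = sorted(set(heading_sizes), reverse=True)
    return {size: min(i + 1, 6) for i, size in enumerate(distinct)}  # Markdown stops at h6


def render_markdown(blocks: list[tuple[str, float]], body_size: float) -> str:
    """Render (text, font_size) blocks; anything larger than body_size is a heading."""
    levels = heading_levels([size for _, size in blocks if size > body_size])
    lines = []
    for text, size in blocks:
        if size in levels:
            lines.append("#" * levels[size] + " " + text)
        else:
            lines.append(text)
    return "\n\n".join(lines)


blocks = [
    ("Rich Soup", 32.0),
    ("Renders the page first.", 16.0),
    ("Usage", 24.0),
    ("Call the parser.", 16.0),
]
markdown = render_markdown(blocks, body_size=16.0)
```

With the sample input, the 32px text becomes a level-1 heading and the 24px text a level-2 heading, regardless of which HTML tags produced them.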
