SpecScraper

3 devlogs
2h 5m 54s

SpecScraper is a simple Python CLI tool that scrapes the entire AQA A Level Physics specification from the AQA website and converts it to JSON.

This project uses AI

I used AI to create the GitHub Actions file that builds the executables.

Vulcan

I realised that the text didn’t look very good on the website, so I decided to scrape the HTML directly instead of converting it to text first.

This worked really well, and the content now looks much better on the website.

Vulcan

Shipped this project!

Hours: 1.67
Cookies: 🍪 9
Multiplier: 5.48 cookies/hr

This is the first step towards creating a to-do list from my A Level spec, and I’m really happy I’ve finished this part.

Next, I’m going to use the JSON I’ve created to build a to-do list app tailored to the subjects I’m doing.

Vulcan

I got the scraper fully working for AQA A Level Physics.

  • I added code to collect all the topic links from a subject content page, using the a tags and their href attributes
  • Then I scraped each topic page individually using the topic-scraping code I already had
  • Then I realised some topics have sub-subtopics as well as subtopics, so I added code to handle those cases (looking for h4 tags as well as h3 tags)
  • I made it save the JSON to a file
  • I tested the code on AQA A Level Physics and it fully worked!
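
The link-collecting step above could look roughly like the sketch below. It is not the project's actual code: it assumes only the Python standard library (html.parser and urllib.parse), and the names TopicLinkParser, topic_links and SAMPLE_PAGE, along with the example.com URLs, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TopicLinkParser(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            url = urljoin(self.base_url, href)
            # Skip duplicates so each topic page is only scraped once
            if url not in self.links:
                self.links.append(url)


def topic_links(html, base_url):
    """Return the absolute URLs of all links found in the given HTML."""
    parser = TopicLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Made-up markup and URLs, only for illustration
SAMPLE_PAGE = """
<ul>
  <li><a href="/spec/topic-1">Topic 1</a></li>
  <li><a href="/spec/topic-2">Topic 2</a></li>
  <li><a href="/spec/topic-1">Topic 1 again</a></li>
</ul>
"""

links = topic_links(SAMPLE_PAGE, "https://example.com/subject-content")
```

A real subject content page will also contain navigation and footer links, so some filtering on the URL would probably be needed before scraping each result.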

Testing on another subject

Although this parser was written specifically for AQA Physics, I tried it on Psychology. It didn’t work, because the Psychology specification doesn’t use tables.
Then I tried it on CS, which does use tables, but its tables have multiple rows, so the code only picked up the first row of each table.

Next steps

I need it to work for any AQA specification that uses tables, even when the tables have multiple rows.
Instead of

subtopic: {"content": "", "opp": ""}

I should do

subtopic: [{"content": "", "opp": ""}, {"content": "", "opp": ""}, {"content": "", "opp": ""}]

(so an array with one entry for each content and opportunity row)
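
Here is a rough sketch of how the multi-row version could work. It is only an idea under some assumptions, not the current scraper: it uses just the standard library, it assumes each data row has the content in its first td cell and the opportunities in its second, and the class name TableRowParser and the sample table are made up.

```python
import json
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Turns each <tr> with two or more <td> cells into a content/opp dict."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = None  # cells of the current row, or None outside a row
        self._text = None   # text pieces of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._text is not None and self._cells is not None:
            # Collapse runs of whitespace inside the cell
            self._cells.append(" ".join("".join(self._text).split()))
            self._text = None
        elif tag == "tr" and self._cells is not None:
            # Header rows use <th>, so they have no <td> cells and are skipped
            if len(self._cells) >= 2:
                self.rows.append({"content": self._cells[0], "opp": self._cells[1]})
            self._cells = None
            self._text = None


# Made-up table, only for illustration
SAMPLE_TABLE = """
<table>
  <tr><th>Content</th><th>Opportunities</th></tr>
  <tr><td>First row of content</td><td>First opportunity</td></tr>
  <tr><td>Second row of content</td><td></td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_TABLE)
rows = parser.rows
print(json.dumps({"subtopic": rows}, indent=2))
```

Because every row becomes its own entry, a single-row Physics table just turns into an array of length one, so the same shape would cover both cases.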

Summary

This was a really good hour, and I got a lot done.

Vulcan

I’ve started by trying to scrape AQA A Level Physics, hoping the same code will work for all AQA specs once it fully works.

So far, I’ve implemented scraping the subtopics from a topic page: given the URL, it gets the content and opportunities of each subtopic, turns them into JSON, and outputs the result.
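
As a rough illustration of splitting a topic page into subtopics, the sketch below groups headings into an outline, treating h3 tags as subtopics and h4 tags as sub-subtopics. It is an assumption about how this could be done with the standard library, not the code the project uses, and the names HeadingParser, subtopic_outline and SAMPLE_TOPIC are made up.

```python
from html.parser import HTMLParser


class HeadingParser(HTMLParser):
    """Records (level, title) for every <h3> and <h4> on a page."""

    LEVELS = {"h3": 3, "h4": 4}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading being read, or None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.LEVELS:
            self._level = self.LEVELS[tag]
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.LEVELS and self._level is not None:
            title = " ".join("".join(self._text).split())
            self.headings.append((self._level, title))
            self._level = None


def subtopic_outline(html):
    """Map each h3 title to the list of h4 titles that follow it."""
    parser = HeadingParser()
    parser.feed(html)
    outline = {}
    current = None
    for level, title in parser.headings:
        if level == 3:
            current = title
            outline[current] = []
        elif current is not None:
            outline[current].append(title)
    return outline


# Made-up markup, only for illustration
SAMPLE_TOPIC = """
<h3>Subtopic A</h3>
<p>Some text</p>
<h3>Subtopic B</h3>
<h4>Sub-subtopic B1</h4>
<h4>Sub-subtopic B2</h4>
"""

outline = subtopic_outline(SAMPLE_TOPIC)
```

The content and opportunities for each heading would then come from whatever table follows it on the page.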

The next step is to get all the topic URLs from the page and scrape all of them.
