To make the game, I needed organised data on people. I decided to use Wikidata, as it is extensive and has a lot of information on well-classified entities that can be queried programmatically (via SPARQL).
Initially I thought it'd be simple: just send a SPARQL query to Wikidata for every question in the game. I later realised that would be highly inefficient (and unethical towards a shared public endpoint), as each question would sift through ~120M entities, multiple times per single game. To tackle this, I decided to query once up front and curate a small list of the people I want.
I decided on only using humans (not fictional characters), but even humans have ~13M entries. An Akinator-clone game wouldn't require very niche people, so I added a few "requirements": each person must have a citizenship, an occupation, 25+ sitelinks, and 1.5M+ social media followers. With this, I got a list of ~10k people (which is actually a very small number).
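For anyone curious, those requirements translate roughly into a query like this. This is a sketch rather than my exact query; the property IDs are the standard Wikidata ones (P31 instance-of, Q5 human, P27 citizenship, P106 occupation, P8687 social media followers), and the exact follower threshold handling may have differed in my real version:

```sparql
# Sketch of the one-off curation query (not the exact final version).
SELECT DISTINCT ?person WHERE {
  ?person wdt:P31 wd:Q5 ;           # humans only, no fictional characters
          wdt:P27 ?citizenship ;    # must have a citizenship
          wdt:P106 ?occupation ;    # must have an occupation
          wdt:P8687 ?followers ;    # social media followers count
          wikibase:sitelinks ?sitelinks .
  FILTER(?sitelinks >= 25)
  FILTER(?followers >= 1500000)
}
```

Each required triple pattern doubles as a filter here: a person without, say, a P106 statement simply never matches, which is what shrinks 13M humans down to a game-sized list.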
Also, while querying in the earlier stages, even a limit of 5 people hit a timeout (Wikidata enforces a 60s limit). So I had to optimise by removing labels: occupation wouldn't come back as "singer" but as Q177220, since converting all the IDs to labels separately would have taken a lot of extra time. Even then, it needed 20-30s for 10 people and would time out on the full list. However, when I tried running it in the morning (3 AM UTC, an off-peak time for the Wikidata endpoint), it returned all 10k people in 9s! That meant I could add back the labels plus an optional "field of work", saving a lot of time and increasing accuracy.
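The expensive part I dropped and later re-added was the label service. This is the standard WDQS mechanism for turning Q-IDs into readable names (my actual query differed, but the shape is this):

```sparql
# Sketch: fetching labels alongside raw Q-IDs.
SELECT ?person ?personLabel ?occupation ?occupationLabel WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P106 ?occupation .
  # This one line is what turns wd:Q177220 into "singer".
  # Convenient, but it adds real cost on large result sets,
  # which is why dropping it helped avoid the 60s timeout.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 10
```

Binding a variable like `?occupation` automatically makes `?occupationLabel` available when the label service is present, so re-adding labels once the endpoint was fast enough was a one-line change.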
PS: I know 6h seems like a lot just for curating a database from a SPARQL query, but this was the first time I'd even heard of it; the syntax and logic were completely new to me, and adding filters and optimising meant edits back and forth. Also, the experimental runs were done in a Jupyter notebook, which was tracked by Hackatime but not uploaded to GitHub. I promise to post devlogs more regularly from now on :)