Activity

aryan
  • created function getQuestion(), which judges which question would filter out the most people (averaged over the yes/no outcomes) and returns a property worth asking about (rough sketch after this list)

  • created function special(), which handles asking whether a person has a political party/employer. this needed separate logic because it has to decide whether political party or employer is even worth asking about.

  • created function generate(), which ties everything together to run the game. it displays the questions, handles the yes/no button logic, and adds answers to “obj” (the filtering list).

  • if fewer than 7 people remain (averaged over yes/no), it switches from properties to descriptions for its questions, using compromise.js to pull out nouns (excluding ones already covered by occupations) and organisations (if any).

  • HTML and small fixes
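here's roughly the getQuestion() idea in code. this is not the actual game code: the data shape (people as objects with array-valued properties), the property names and the helper arguments are placeholders, and the heuristic here is "pick the value that splits the remaining pool closest to 50/50", which is one way to read the averaging above (a near-even split removes the most people on average whichever way the player answers).

```js
// Sketch only, not the real implementation.
// Assumes `people` is the current filtered pool, each person an object with
// array-valued properties, e.g. { name, occupationLabel: [...], citizenshipLabel: [...] }.
function getQuestion(people, askedAlready) {
  const askable = ["occupationLabel", "citizenshipLabel", "genderLabel"]; // assumed names
  let best = null;

  for (const prop of askable) {
    // how many people would answer "yes" for each value of this property
    const yesCounts = {};
    for (const person of people) {
      for (const value of person[prop] || []) {
        yesCounts[value] = (yesCounts[value] || 0) + 1;
      }
    }
    for (const [value, yes] of Object.entries(yesCounts)) {
      if (askedAlready.has(`${prop}:${value}`)) continue;
      // a value splitting the pool close to 50/50 filters the most on average
      const imbalance = Math.abs(yes - people.length / 2);
      if (!best || imbalance < best.imbalance) {
        best = { prop, value, imbalance };
      }
    }
  }
  return best; // e.g. { prop: "occupationLabel", value: "actor", imbalance: 2 }
}

// Description fallback (when few people remain): compromise.js can pull nouns
// out of the free-text description, roughly nlp(description).nouns().out("array"),
// with organisations via .organizations() if the build supports it.
```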

that's pretty much the entire game. gonna test/edit some stuff, add some styling, and ship tomorrow or the day after

aryan

finally did something apart from the database, yay!
(very very sorry about the 11h devlog, but I promise it's worth it. some of the complicated stuff needed multiple revisions, which added up in the time)

changes:

  • created a function getAll() which takes an input like this: {citizenshipLabel : [“India”, “NOT United States”], occupationLabel : [“actor”, “NOT scientist”] …}, which makes it much easier for the game logic to add the yes/no answers. putting “NOT ” before a value excludes it. the function converts this into an SQL query and returns all matching people; this is crucial for the logic! It returns an array of arrays, where each inner array is one person with their properties in order (I initially expected the SQL to return objects, but it didn’t, so that took a lot of time to debug). there's a sketch of the conversion at the end of this post.

  • added a property called “special” which tells whether someone belongs to a political party / has an employer; the employer and political party properties themselves were also added. this helps filter much better. also added dead/alive as a property.

  • added property P, a popularity score for each person. it uses followers plus a multiplied sitelink count, where the multiplier depends on which “tier” of followers the person falls into. this fixes the problem where popular people with no follower count were being treated as unpopular (sketched along with getPopular below).

  • P was required for a function called getPopular(), which gives the most “popular” values of a property. kinda complicated; here’s how it works if you’re interested (sketch after the steps):

  1. Creates an array of objects called raw, pairing each person’s value for the property with that person’s P
  2. Creates an array “unique” with the distinct values found in raw
  3. Creates an array “aggregate”: for each value in unique, it adds up the P of all its occurrences in raw and computes a Bayesian average (a way to normalise values that have lots of unpopular occurrences against ones with only a few ultra-popular occurrences)
  4. Returns the most popular value of the specified property among the people returned by getAll(), checking against the filter object (the input for getAll) to make sure a value isn’t repeated
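a rough sketch of the P score and steps 1-4, just to make the Bayesian-average part concrete. the tier multipliers, the prior constants (m and C) and the field names here are placeholders, not the real numbers from the actual code:

```js
// Illustration only: tier multipliers and prior constants are placeholders.
function popularityScore(followers, sitelinks) {
  // sitelinks count for more when followers are low/missing, so famous people
  // without a recorded follower count don't look unpopular
  const multiplier = followers >= 5e6 ? 1000 : followers >= 1e6 ? 5000 : 20000;
  return (followers || 0) + sitelinks * multiplier;
}

// most "popular" value of `prop` among the people returned by getAll()
function getPopular(people, prop, filterObj) {
  // 1. raw: one { value, P } entry per occurrence of the property
  const raw = people.flatMap(p => (p[prop] || []).map(value => ({ value, P: p.P })));

  // 2. distinct values
  const unique = [...new Set(raw.map(r => r.value))];

  // 3. Bayesian average per value, so a value backed by a few ultra-popular
  //    people and one backed by many obscure people are scored comparably
  const C = raw.reduce((s, r) => s + r.P, 0) / (raw.length || 1); // prior: mean P
  const m = 3;                                                    // prior weight (assumed)
  const aggregate = unique.map(value => {
    const hits = raw.filter(r => r.value === value);
    const sum = hits.reduce((s, r) => s + r.P, 0);
    return { value, score: (sum + m * C) / (hits.length + m) };
  });

  // 4. best value not already used in the filter object
  const used = new Set((filterObj[prop] || []).map(v => v.replace(/^NOT /, "")));
  return aggregate
    .filter(a => !used.has(a.value))
    .sort((a, b) => b.score - a.score)[0];
}
```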

Also edited the flatten function so that every property comes through as an array, which made getPopular simpler to write. Previously, multi-value properties (citizenship & occupation) were plain strings separated by commas; making everything an array is more versatile.
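and a rough sketch of the getAll() filter-object-to-SQL conversion mentioned at the top of this post. the table name and the LIKE-based matching are placeholders rather than the real query builder; the point is just that plain values turn into positive conditions and “NOT ” values into negative ones:

```js
// Sketch of the filter object -> SQL conversion (not the actual getAll()).
// Assumes a table called "people" whose multi-value columns can be matched with LIKE.
function buildQuery(filterObj) {
  const clauses = [];
  const params = [];
  for (const [column, values] of Object.entries(filterObj)) {
    for (const value of values) {
      if (value.startsWith("NOT ")) {
        clauses.push(`${column} NOT LIKE ?`);
        params.push(`%${value.slice(4)}%`);
      } else {
        clauses.push(`${column} LIKE ?`);
        params.push(`%${value}%`);
      }
    }
  }
  const where = clauses.length ? ` WHERE ${clauses.join(" AND ")}` : "";
  return { sql: `SELECT * FROM people${where}`, params };
}

// buildQuery({ citizenshipLabel: ["India"], occupationLabel: ["NOT scientist"] })
// -> "SELECT * FROM people WHERE citizenshipLabel LIKE ? AND occupationLabel NOT LIKE ?"
```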

aryan

Edited cleaner.py to use “batching”, so it processes ~30 rows at a time instead of one by one, significantly increasing its speed

aryan

2 major things:

edited the table: the old table (~2k people) missed a lot of popular people, for a few reasons:
Timothée Chalamet: no “occupation” listed, despite 2M followers and 84 sitelinks
Vivek Oberoi: no “followers” listed, despite 38 sitelinks
…and more for similar reasons.

to fix this, I added the following filters:
has 1M+ followers and 15+ sitelinks, OR has 2M+ followers (no sitelink requirement), OR has 35+ sitelinks (no follower requirement). occupation is now optional.
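written as a single condition (thresholds copied from above; just an illustration, the real filtering happens in the SPARQL query):

```js
// Illustration of the new inclusion rule; the actual filter lives in SPARQL.
function keepPerson({ followers = 0, sitelinks = 0 }) {
  return (followers >= 1_000_000 && sitelinks >= 15)
      || followers >= 2_000_000   // no sitelink requirement
      || sitelinks >= 35;         // no follower requirement
}
```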

cleaner.py: a script that uses an on-device LLM to return occupations and fields of work for people.
although wikidata provides occupations, they’re often very niche/irrelevant, and a lot of them can break the game (examples: Fernando Alonso’s occupations include “vegetarian”, Narendra Modi is listed as a bibliographer & writer; technically correct, but a player may answer them “incorrectly”, and a large number of occupations per person makes the filtering inaccurate).

also, running the LLM required a LOT of debugging. one of the biggest problems was that it started spitting out nonsense after one line of JSON (attaching images of the LLM’s results and of the bad text). the problem was fixed by giving it a very structured prompt, since it’s an instruct model.

aryan

the table formed had 10k+ rows for about 1.5k people, because each extra occupation/citizenship created a new row. wrote a script called “flatten.py” to reorganise it so that all of a person’s occupations sit in one row, separated by commas.
this makes it easier to clean up

aryan

to make the game, I needed organised data on people. I decided to use Wikidata, since it's extensive and has a lot of information on well-classified entities that can be queried with code (SPARQL).
initially I thought it'd be simple: just send a SPARQL query to Wikidata for every question in the game. I later realised that would be highly inefficient and unethical, as each question would sift through 120M entities, multiple times per game. to tackle this, I decided to query once and get a small list of the people I want.

I decided on only using humans (not fictional characters), but even humans have 13M entries. An akinator-clone game wouldn’t require very niche people, so I decided to add a few “requirements”: must have a citizenship, an occupation, 25+ sitelinks, and 1.5M+ social media followers. With this, I got a list of 10k people (which is actually a very small number).

Also, in the earlier stages, the query was hitting the timeout even with a limit of 5 people (Wikidata has a 60s limit). I had to optimise it by removing labels (so occupation wouldn’t come back as “singer” but as Q177220, which would have meant converting all the IDs to labels separately later). Even then it took 20-30s for 10 people, and the full query still timed out. However, when I tried running it in the morning (3AM in UTC, the timezone where most Wikidata researchers are), it returned all 10k people in 9s! That meant I could add the labels back, plus an optional “field of work”, saving a lot of time and increasing accuracy.

PS. ik 6h seems like a lot for just curating a database from a SPARQL query, but this was the first time I’d even heard of it, the syntax and logic were completely new to me, and adding filters and optimisations meant editing back and forth. also, the experimental runs were done in a Jupyter notebook, which was tracked by hackatime but not uploaded to GitHub. I promise to post devlogs more regularly from now on :)
