Wiktionary Lexeme Bot

5 devlogs
5h 52m 56s

This is a MediaWiki bot that reads Wikidata's JSON lexeme dumps, collects the lexemes that have senses in a specific language (configured by the bot operator), checks whether a specific Wiktionary already has an entry for each lemma and, if not, creates a new entry that simply renders the lexeme's data via a template.
Operating the bot requires credentials such as a Wikimedia bot password, unless you comment out lines 8 to 11 of bot.py, which is safe to do if you do not use a Wikimedia account. If you would rather not do that, you can still extract the lexeme IDs (along with the other information the bot needs) by running the extract_lemmas.py script after downloading https://dumps.wikimedia.org/other/wikibase/wikidatawiki/latest-lexemes.json.gz (~570 MB).
The demo link shows entries made by the bot on Bengali Wiktionary.
There is no GUI because the bot is meant to be run from the command line, preferably on Wikimedia's Toolforge cloud environment, where the dumps are readily available, although it can run anywhere Python runs.

This project uses AI

I used Gemini and Claude Sonnet (Thinking) to generate the script that extracts lexeme IDs from Wikidata's gzip-compressed JSON dumps. As the project neared completion, I also had Claude Sonnet review my code for bugs (it found three, which I then fixed on my own).

radman.siddiki

Shipped this project!

Hours: 5.85
Cookies: 🍪 78
Multiplier: 13.27 cookies/hr

Members of the Wikimedia community have been working on a project called Abstract Wikipedia (the name will soon be replaced by a better one, but we do not know what it will be yet; see https://meta.wikimedia.org/wiki/Abstract_Wikipedia for details of the project). As part of that effort, they have been adding lexicographical data to Wikidata lexemes: senses, examples and other data about different lexicographical entities. The Bengali Wiktionary project currently has only a little more than 107k entries and could benefit greatly from the work being done on Wikidata. Last year, the Bengali Wiktionary community started using a MediaWiki template and a Lua module to automatically render information about words from the data stored in Wikidata lexemes. This project aims to take that a step further by automating entry creation, so editors can focus on editing Wikidata without duplicating their efforts on Wiktionary. The bot scans the Wikidata lexemes for those that have senses in the bot operator's target language and compares their lemmas against the full list of entries on the target Wiktionary to find the ones that are missing. Then it uses the template and module that editors have already been using to create the new entries automatically. This turned out to be a little harder than I expected: I quickly realized that many lemmas appear across different lexemes and languages, so handling them took some extra effort. That said, I hope this helps people looking for a truly free Bengali dictionary, and that in the future the bot can be deployed to other Wiktionaries as well.
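
The comparison step described above can be sketched as a set lookup. This is only an illustration, not the bot's actual code: the function name `find_missing_lemmas` and both input lists are hypothetical, and it assumes the lemmas and the existing page titles are already available as plain strings.

```python
def find_missing_lemmas(lexeme_lemmas, existing_titles):
    """Return lemmas that have no page yet, in first-seen order, without duplicates."""
    existing = set(existing_titles)  # set gives O(1) membership checks
    seen = set()
    missing = []
    for lemma in lexeme_lemmas:
        # A lemma can occur in several lexemes, so skip repeats.
        if lemma not in existing and lemma not in seen:
            seen.add(lemma)
            missing.append(lemma)
    return missing


if __name__ == "__main__":
    # Hypothetical sample data: three distinct lemmas, one already on the wiki.
    print(find_missing_lemmas(["ক", "খ", "ক", "গ"], ["খ"]))
```

Deduplicating here matters because, as noted above, the same lemma often appears in more than one lexeme, and each missing title should only be created once.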

radman.siddiki

It’s finally done! I just made some new entries on Bengali Wiktionary using the bot! Check out https://bn.wiktionary.org/wiki/হুড়ুম্বি, https://bn.wiktionary.org/wiki/হিমধামা etc. The full list of edits the bot has made so far (so the editors of Bengali Wiktionary can judge whether we want it to edit regularly) can be found here: https://bn.wiktionary.org/wiki/বিশেষ:অবদান/LexemeBot
The source code of the bot: https://gitlab.wikimedia.org/toolforge-repos/wiktlexbot (licensed under GPLv3)

radman.siddiki

I have tweaked the schema again to make it easier to handle cases where a lemma appears in lexemes of several different languages. This makes it much easier to show section headings with the language's name on Wiktionary, because I have updated the Lua module used on the wiki so that it adds the language's name automatically. It also means that placing the templates in the wrong order could produce more than one section heading with the same language's name, or leave lemmas in the same language un-nested.
The new schema for lemmas appearing in more than one language:
{
    "language": "lemma",
    "ids": [
        "Lx",
        "Ly"
    ]
}
The schema for lemmas appearing in one language only remains unchanged.
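
To illustrate the ordering concern, here is a minimal sketch that keeps lexeme IDs of the same language next to each other. It is not taken from the bot: the function name `order_by_language` is hypothetical, and the input is assumed to be (language code, lexeme ID) pairs for a single lemma rather than the exact schema shown above.

```python
def order_by_language(entries):
    """Return lexeme IDs ordered so that IDs sharing a language are adjacent.

    `entries` is a list of (language_code, lexeme_id) pairs for one lemma
    (an assumed input shape). Languages keep the order in which they first appear.
    """
    first_seen = {}
    for language, _ in entries:
        first_seen.setdefault(language, len(first_seen))
    # sorted() is stable, so IDs within one language keep their original order.
    ordered = sorted(entries, key=lambda entry: first_seen[entry[0]])
    return [lexeme_id for _, lexeme_id in ordered]


if __name__ == "__main__":
    # Hypothetical IDs: two Bengali lexemes separated by an Assamese one.
    print(order_by_language([("bn", "L1"), ("as", "L2"), ("bn", "L3")]))
```

Emitting the templates in this order would give each language a single run of entries, which is what avoids repeated section headings for the same language.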

radman.siddiki

The lexeme ID-extracting script is now working! It extracts IDs from Wikidata’s GZIP-compressed JSON lexeme dumps and outputs them in a file with this schema:
[
    {
        "lemma": "lemma",
        "ids": [
            "Lx",
            "Ly"
        ]
    }
]
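
A small sketch of how records in this shape could be built from (lemma, lexeme ID) pairs. The function name `group_ids_by_lemma` and the sample pairs are made up for illustration; the real extract_lemmas.py may do this differently.

```python
import json
from collections import defaultdict


def group_ids_by_lemma(pairs):
    """Group (lemma, lexeme_id) pairs into [{"lemma": ..., "ids": [...]}, ...]."""
    grouped = defaultdict(list)
    for lemma, lexeme_id in pairs:
        if lexeme_id not in grouped[lemma]:  # avoid listing an ID twice
            grouped[lemma].append(lexeme_id)
    return [{"lemma": lemma, "ids": ids} for lemma, ids in grouped.items()]


if __name__ == "__main__":
    # Hypothetical pairs; "Lx"/"Ly" stand in for real lexeme IDs.
    records = group_ids_by_lemma([("lemma", "Lx"), ("lemma", "Ly")])
    # ensure_ascii=False keeps non-Latin lemmas readable in the output file.
    print(json.dumps(records, ensure_ascii=False, indent=4))
```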

radman.siddiki

I realized that using the Wikidata API to check which lexemes have senses in a specific language would be really inefficient, so I have decided to use Wikidata's JSON dumps for this instead. I have now added a Python script that streams those dumps without decompressing them to disk first (because they can get very large), checks which lexemes have senses in the target language and logs their IDs.
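
The streaming approach can be sketched roughly as follows. This is not the project's script: the function names are hypothetical, and the sketch assumes the dump is a JSON array with one lexeme per line, where each lexeme has a "senses" list whose "glosses" are keyed by language code.

```python
import gzip
import json


def iter_lexemes(path):
    """Yield one lexeme dict at a time from a gzip-compressed JSON dump."""
    # gzip.open decompresses on the fly, so the full dump never sits on disk or in memory.
    with gzip.open(path, "rt", encoding="utf-8") as dump:
        for line in dump:
            line = line.strip().rstrip(",")  # entity lines end with a comma
            if line in ("", "[", "]"):       # skip the array brackets
                continue
            yield json.loads(line)


def has_sense_in(lexeme, language):
    """True if any sense of the lexeme has a gloss in the given language code."""
    return any(language in sense.get("glosses", {}) for sense in lexeme.get("senses", []))


def lexeme_ids_with_senses(path, language):
    """Collect the IDs of lexemes that have at least one sense in `language`."""
    return [lex["id"] for lex in iter_lexemes(path) if has_sense_in(lex, language)]
```

Called as `lexeme_ids_with_senses("latest-lexemes.json.gz", "bn")`, this would read the file line by line and keep only the matching IDs.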
