For the past few months I’ve been working heavily on an English dictionary database to help people learn English. It’s called English Fox; more on this next time.
Generate Definitions via LLM
Anyway, I needed a way to pull definitions for a lot of words. I tried using an LLM to generate them, but it missed some parts of speech or gave bad results. It was totally unreliable unless I used a massive model (70B+).
For example, take this prompt:
List all the words of speech for the word "back",
include the part of speech and a short definition.
In yaml format.
From ChatGPT:
back:
  - part_of_speech: noun
    definition: The rear part of the body or something that is opposite the front.
  - part_of_speech: adjective
    definition: Related to the rear or past.
  - part_of_speech: adverb
    definition: Toward the rear or in the past.
  - part_of_speech: verb
    definition: To support or to move backwards.
  - part_of_speech: preposition
    definition: Behind or at the rear of something.
Hey, this looks pretty good. Why not run this for all 200,000 words?
A few issues:
- That’s expensive
- It will 100% miss some definitions
Let’s take the definitions from the Free Dictionary API.
It looks pretty good, but…
...
{
  "definition": "A support or resource in reserve.",
  "synonyms": [],
  "antonyms": []
},
Yeah, ChatGPT missed this important meaning, and about 10 more.
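To make that kind of comparison easier, here is a rough sketch of flattening a Free Dictionary API response into one row per definition. The payload is a hand-trimmed sample in the shape I understand the API to return (entries, then meanings, then definitions), not a real response, and `flatten_definitions` is just a name I picked for illustration. In a real run the JSON would come from the API itself (as far as I know, a URL like `https://api.dictionaryapi.dev/api/v2/entries/en/back`).

```ruby
require "json"

# Hand-written sample in the shape the Free Dictionary API appears to use.
# It is trimmed for illustration and is NOT a complete, real response.
SAMPLE_RESPONSE = <<~JSON
  [
    {
      "word": "back",
      "meanings": [
        {
          "partOfSpeech": "noun",
          "definitions": [
            { "definition": "The rear of the body.", "synonyms": [], "antonyms": [] },
            { "definition": "A support or resource in reserve.", "synonyms": [], "antonyms": [] }
          ]
        },
        {
          "partOfSpeech": "verb",
          "definitions": [
            { "definition": "To support.", "synonyms": [], "antonyms": [] }
          ]
        }
      ]
    }
  ]
JSON

# Flatten entries -> meanings -> definitions into one hash per definition.
# (flatten_definitions is a hypothetical helper, not part of any gem.)
def flatten_definitions(json)
  JSON.parse(json).flat_map do |entry|
    entry.fetch("meanings", []).flat_map do |meaning|
      meaning.fetch("definitions", []).map do |definition|
        {
          part_of_speech: meaning["partOfSpeech"],
          meaning: definition["definition"],
          synonyms: definition.fetch("synonyms", []),
          antonyms: definition.fetch("antonyms", [])
        }
      end
    end
  end
end

rows = flatten_definitions(SAMPLE_RESPONSE)

# Count definitions per part of speech, handy for spotting gaps
# when comparing against another source.
counts = rows.group_by { |row| row[:part_of_speech] }.transform_values(&:count)
puts counts.inspect
```

With the rows flattened like this, checking which meanings another source skipped is a simple comparison on part of speech and meaning.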
Ruby Gems
So I arrived at the expected solution: get a large word list, slowly scrape the words, and sort the data from there. I ended up using this list and a few others. GitHub has a few lists, and this one by GNU works as well.
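A minimal sketch of the "word list, then scrape slowly" idea, under a few assumptions of mine: the list is one word per line, only plain alphabetic words are kept (so hyphenated words and contractions get dropped), and a fixed pause between requests is enough. The helper names and the one second default delay are made up for illustration.

```ruby
# Clean up raw word-list lines: trim whitespace, lowercase, keep only
# purely alphabetic entries, remove duplicates, and sort.
def normalize_words(lines)
  lines.map { |line| line.strip.downcase }
       .select { |word| word.match?(/\A[a-z]+\z/) }
       .uniq
       .sort
end

# Yield each word with a pause between iterations so the source is not
# hit too quickly. `delay` is in seconds; 1.0 is an arbitrary default.
def each_slowly(words, delay: 1.0)
  words.each_with_index do |word, index|
    sleep(delay) if index.positive?
    yield word
  end
end

# In practice the lines would come from a file, e.g. File.readlines(path).
raw_lines = ["Back\n", "back", "", "zebra ", "it's", "apple"]
words = normalize_words(raw_lines)

# delay: 0 here only so the example finishes instantly.
each_slowly(words, delay: 0) { |word| puts word }
```

The block passed to `each_slowly` is where the actual lookup for each word would go.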
The next task is to pull the definitions. Each word has many definitions, and each definition has a part of speech, a meaning, and other important data. Most of this can be scraped, but some of it is generated by an LLM – for example, the synonyms.
Here is the staging model for definitions to help you visualize:
create_table "staged_definitions", force: :cascade do |t|
  t.bigint "staged_word_id"
  t.string "part_of_speech", default: "0", null: false
  t.string "meaning", null: false
  t.integer "source", default: 0, null: false
  t.string "group"
  t.string "synonyms", default: [], array: true
  t.datetime "created_at", null: false
  t.datetime "updated_at", null: false
  t.string "antonyms", default: [], array: true
  t.integer "position"
  t.integer "state", default: 0, null: false
  t.string "full_form"
  t.index ["staged_word_id"], name: "index_staged_definitions_on_staged_word_id"
  t.index ["state"], name: "index_staged_definitions_on_state"
end
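To show how a scraped definition might map onto that table, here is a plain Ruby sketch that builds the attributes for one row. The integer codes for `source` and `state` are guesses for illustration only (the schema just says they are integers defaulting to 0), and `staged_definition_attrs` is a hypothetical helper.

```ruby
# Hypothetical integer codes for the `source` and `state` columns.
# The real values are not given in the schema; these are assumptions.
SOURCES = { wiktionary: 0, free_dictionary: 1, llm: 2 }.freeze
STATES  = { pending: 0, approved: 1, rejected: 2 }.freeze

# Build a hash of column values for one staged definition.
# New rows start in the (assumed) :pending state.
def staged_definition_attrs(staged_word_id:, part_of_speech:, meaning:,
                            source:, position:, synonyms: [], antonyms: [])
  {
    staged_word_id: staged_word_id,
    part_of_speech: part_of_speech,
    meaning: meaning,
    source: SOURCES.fetch(source),
    synonyms: synonyms,
    antonyms: antonyms,
    position: position,
    state: STATES.fetch(:pending)
  }
end

attrs = staged_definition_attrs(
  staged_word_id: 1,
  part_of_speech: "noun",
  meaning: "A support or resource in reserve.",
  source: :free_dictionary,
  position: 2
)
puts attrs.inspect

# In a Rails app this hash could then be passed to something like
# StagedDefinition.create!(attrs), assuming a model with that name exists.
```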
I ended up creating some Ruby gems to share with the community for pulling word definitions from free APIs.
Note: One of them is for Cambridge, which isn’t free and which I no longer use. I wouldn’t recommend scraping from them; use at your own risk.
In the end I realized almost all the original definitions came from Wiktionary, so I ended up only using that. All the APIs give slightly different data, so pick the one that best suits your project.
Enjoy. More on English Fox in my next post.