Ruby Gems For Pulling Dictionary Words


For the past few months I’ve been working heavily on an English dictionary database to help people learn English. It’s called English Fox; more on this next time.

Generate Definitions via LLM

Anyway, I needed a way to pull definitions for a lot of words. I tried using an LLM to generate definitions, but it missed some parts of speech or gave bad results. It was totally unreliable unless I used a massive model (70B+ parameters).

For example, take this prompt:

List all the words of speech for the word "back",
include the part of speech and a short definition.
In yaml format.

From ChatGPT:

back:
  - part_of_speech: noun
    definition: The rear part of the body or something that is opposite the front.
  - part_of_speech: adjective
    definition: Related to the rear or past.
  - part_of_speech: adverb
    definition: Toward the rear or in the past.
  - part_of_speech: verb
    definition: To support or to move backwards.
  - part_of_speech: preposition
    definition: Behind or at the rear of something.

Hey, this looks pretty good. Why not run this for all 200,000 words?

A few issues:

  1. That’s expensive
  2. It will 100% miss some definitions

Let’s compare with the definitions from the Free Dictionary API.

It looks pretty good, but…

...
{
    "definition": "A support or resource in reserve.",
    "synonyms": [
      
    ],
    "antonyms": [
      
    ]
},

Yeah, the LLM missed this important meaning, and about 10 more.
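
To make this kind of comparison easier, it helps to flatten a response into one row per definition. Here is a minimal sketch, assuming the payload is an array of entries with `meanings`, `partOfSpeech` and `definitions` keys. The sample JSON is hand-written for illustration (it is not real API output), and `flatten_entries` is a hypothetical helper name.

```ruby
require "json"

# Hand-written sample shaped like the snippet above; NOT real API output.
SAMPLE_JSON = <<~JSON
  [
    {
      "word": "back",
      "meanings": [
        {
          "partOfSpeech": "noun",
          "definitions": [
            { "definition": "The rear of the body.", "synonyms": [], "antonyms": [] },
            { "definition": "A support or resource in reserve.", "synonyms": [], "antonyms": [] }
          ]
        },
        {
          "partOfSpeech": "verb",
          "definitions": [
            { "definition": "To support.", "synonyms": [], "antonyms": [] }
          ]
        }
      ]
    }
  ]
JSON

# Hypothetical helper: turn the nested payload into one hash per definition.
def flatten_entries(json)
  JSON.parse(json).flat_map do |entry|
    entry.fetch("meanings", []).flat_map do |meaning|
      meaning.fetch("definitions", []).map do |defn|
        {
          word: entry["word"],
          part_of_speech: meaning["partOfSpeech"],
          meaning: defn["definition"],
          synonyms: defn.fetch("synonyms", []),
          antonyms: defn.fetch("antonyms", [])
        }
      end
    end
  end
end

flatten_entries(SAMPLE_JSON).each do |row|
  puts "#{row[:part_of_speech]}: #{row[:meaning]}"
end
```

With flat rows like these, checking which meanings one source has and another lacks is a simple set difference on the `meaning` strings (or on something fuzzier, since wording differs between sources).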

Ruby Gems

So I arrived at the expected solution: get a large word list, slowly scrape the words, and sort the data from there. I ended up using this list and a few others. GitHub has a few lists, and this one by GNU works as well.
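
Word lists from different sources tend to overlap and differ in casing, so a cleanup pass before scraping is useful. Here is a minimal sketch of one. `clean_word_list` is a hypothetical helper, and the rules it applies (lowercasing everything, skipping `#` comment lines, keeping only letters, apostrophes and hyphens) are my assumptions, not necessarily what a given list needs.

```ruby
require "set"

# Hypothetical helper: normalize raw word-list lines into a unique, sorted list.
def clean_word_list(lines)
  seen = Set.new
  words = lines.each_with_object([]) do |line, acc|
    word = line.strip.downcase
    # Skip blank lines and comment lines (assumed to start with "#").
    next if word.empty? || word.start_with?("#")
    # Keep only words made of letters, apostrophes and hyphens.
    next unless word.match?(/\A[a-z][a-z'-]*\z/)
    # Set#add? returns nil when the word was already seen.
    acc << word if seen.add?(word)
  end
  words.sort
end

raw = ["Back\n", "back\n", "# comment\n", "", "well-being\n", "3rd\n", "apple\n"]
puts clean_word_list(raw).inspect
```

To run it on a real file, pass something like `File.foreach("words.txt")` (the path is a placeholder). Note that lowercasing drops the distinction between proper nouns and common words, so skip that step if it matters for the project.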

The next task is to pull the definitions. Each word has many definitions, and each definition has a part of speech, a meaning and other important data. Most of this can be scraped, but some of it is generated by an LLM, for example the synonyms.

Here is the staging model for definitions to help you visualize:

  create_table "staged_definitions", force: :cascade do |t|
    t.bigint "staged_word_id"
    t.string "part_of_speech", default: "0", null: false
    t.string "meaning", null: false
    t.integer "source", default: 0, null: false
    t.string "group"
    t.string "synonyms", default: [], array: true
    t.datetime "created_at", null: false
    t.datetime "updated_at", null: false
    t.string "antonyms", default: [], array: true
    t.integer "position"
    t.integer "state", default: 0, null: false
    t.string "full_form"
    t.index ["staged_word_id"], name: "index_staged_definitions_on_staged_word_id"
    t.index ["state"], name: "index_staged_definitions_on_state"
  end
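
To show how scraped data might map onto these columns, here is a minimal sketch in plain Ruby (no Rails required). `build_staged_rows` is a hypothetical helper, and the integer codes in `SOURCES` are assumptions: the post does not show what the `source` values actually mean.

```ruby
# Assumed integer codes for the `source` column; the real mapping is not shown.
SOURCES = { wiktionary: 0, free_dictionary: 1 }.freeze

# Hypothetical helper: build attribute hashes matching the staged_definitions columns.
def build_staged_rows(staged_word_id, definitions, source:)
  definitions.each_with_index.map do |defn, index|
    {
      staged_word_id: staged_word_id,
      part_of_speech: defn.fetch(:part_of_speech), # null: false, so fail loudly if missing
      meaning: defn.fetch(:meaning),               # null: false as well
      source: SOURCES.fetch(source),
      synonyms: defn.fetch(:synonyms, []),
      antonyms: defn.fetch(:antonyms, []),
      position: index + 1,                         # keep the scraped order, 1-based
      state: 0                                     # same as the schema default
    }
  end
end

scraped = [
  { part_of_speech: "noun", meaning: "The rear of the body." },
  { part_of_speech: "verb", meaning: "To support.", synonyms: ["support"] }
]

build_staged_rows(42, scraped, source: :wiktionary).each { |row| puts row.inspect }
```

In a Rails app, each hash could then be handed to the model for this table (presumably something like `StagedDefinition`, though the model name is a guess on my part).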

I ended up creating some Ruby gems to share with the community for pulling word definitions from free APIs.

Note: One of them is Cambridge, which isn’t free and which I don’t use anymore. I wouldn’t recommend scraping from them; use at your own risk.

In the end I realized almost all the original definitions came from Wiktionary, so I ended up using only that. All the APIs give slightly different data, so pick the one that best suits your project.

Enjoy. More on English Fox in my next post.
