Taalmaster

    Counting Words

    When learning a language, you often want to know how many 'words' you know or how many 'words' a text contains. But what is a word, actually?

    This seemingly simple question becomes quite complex when we look at different forms of the same word. Should we count cat and cats as one word or two? What about run, runs, running, and ran? Different languages handle word variations in different ways, which makes this question even more challenging. To address this, linguists use the concept of a lemma.

    A lemma is the canonical form of a set of word forms. For example, the words went, going and goes all share the same underlying lemma, go. The lemma is chosen by convention, but it is most often the least marked form. For verbs, the lemma is often the infinitive, as with go here.

    Our philosophy

    At Taalmaster, we think it makes the most sense to count lemmas.

    This has several advantages. First, it makes comparisons across languages easier. Languages such as Mandarin have no inflection, which means that every word is its own lemma, so the word for 'to eat' is counted only once. In a highly inflected language such as French, counting word forms would count the same word (manger) around 30 times. Counting forms therefore inflates the word count of a text, and the more inflected the language, the larger the inflation. What we want to measure is the number of ways in which you can express yourself. Knowing 30 versions of a word doesn't translate to knowing 30 concepts.
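    The difference between counting word forms and counting lemmas can be sketched in a few lines of Python. This is only an illustration, not Taalmaster's implementation: the LEMMAS table and the function names count_forms and count_lemmas are hypothetical, and a real system would rely on a proper lemmatizer instead of a hand-written lookup table.

```python
# Hypothetical, hand-written table mapping word forms to their lemma.
# It covers only a few present-tense forms of the French verb "manger".
LEMMAS = {
    "mange": "manger",
    "manges": "manger",
    "mangeons": "manger",
    "mangez": "manger",
    "mangent": "manger",
}


def count_forms(tokens):
    """Count distinct word forms, treating every inflection as its own word."""
    return len(set(tokens))


def count_lemmas(tokens, lemmas=LEMMAS):
    """Count distinct lemmas; forms missing from the table count as themselves."""
    return len({lemmas.get(token, token) for token in tokens})


tokens = ["mange", "manges", "mangeons", "mangez", "mangent"]
print(count_forms(tokens))   # 5 distinct forms
print(count_lemmas(tokens))  # 1 lemma: manger
```

    Five forms collapse into a single lemma here, which is the inflation effect described above on a small scale.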

    Whenever you see or use a word, it counts towards your progress with the underlying lemma. So seeing forms like going, went or goes all counts towards your progress with the word go. In this context, we can also think of the lemma as the general concept that underlies all of its inflections. For the word go, the concept is something like 'moving from one place to another', and what we want to measure is whether you can express this concept.

    At the same time, we can look at the grammatical concepts that underlie the inflection. So when you see the word went it tracks progress towards the word go but also towards the grammatical concept of the simple past.
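    The two kinds of progress described here, one per lemma and one per grammatical concept, could be tracked along the following lines. This is a minimal sketch under stated assumptions: the ANALYSES table, the Progress class and the concept labels are all hypothetical names chosen for illustration, and unknown forms are simply ignored.

```python
from collections import Counter

# Hypothetical analysis table: word form -> (lemma, grammatical concept).
ANALYSES = {
    "go": ("go", "base form"),
    "goes": ("go", "third person singular present"),
    "going": ("go", "present participle"),
    "went": ("go", "simple past"),
}


class Progress:
    """Tracks how often a learner has met each lemma and each grammatical concept."""

    def __init__(self):
        self.lemma_counts = Counter()
        self.grammar_counts = Counter()

    def record(self, form):
        """Credit one encounter of a word form to its lemma and its concept."""
        analysis = ANALYSES.get(form.lower())
        if analysis is None:
            return  # forms missing from the table are ignored in this sketch
        lemma, concept = analysis
        self.lemma_counts[lemma] += 1
        self.grammar_counts[concept] += 1


progress = Progress()
for form in ["going", "went", "goes", "went"]:
    progress.record(form)

print(progress.lemma_counts["go"])             # 4 encounters of the lemma go
print(progress.grammar_counts["simple past"])  # 2 encounters of the simple past
```

    Every encounter updates both counters, so seeing went twice adds two to go and two to the simple past.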