Counting Words
When learning a language, you often want to know how many 'words' you know or how many 'words' a text contains. But what is a word, actually?
This seemingly simple question becomes quite complex when we look at different forms of the same word. Should we count cat and cats as one word or two? What about run, runs, running, and ran? Different languages handle word variations in different ways, which makes this question even more challenging. To address this, linguists use the concept of a lemma.
A lemma is the canonical form of a set of word forms. So, for example, the words went, going and goes all have the same underlying lemma, go.
The lemma is chosen by convention, but it is most often the least marked form. For example, the lemma for a verb is often the infinitive:
- nous allons -> aller
- he goes -> go
Our philosophy
At Taalmaster, we think that it makes the most sense to count lemmas.
This has several advantages. Firstly, it makes comparisons across languages easier. Languages such as Mandarin have no inflection, which means that a word is also its own lemma, so we count the word 吃 (to eat) only once. In highly inflected languages such as French, counting every form separately would count the same word (manger) around 30 times. The more inflected a language is, the more this inflates the estimated word count of a text. What we want to measure is the number of ways in which you can express yourself. Knowing 30 versions of a word doesn't translate to knowing 30 concepts.
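The difference between counting forms and counting lemmas can be sketched in a few lines of Python. This is only an illustration, not Taalmaster's actual implementation: the `LEMMAS` table and the function names are hypothetical, and a real system would use a proper lemmatizer instead of a hand-written dictionary.

```python
# Hypothetical, hand-written form -> lemma table, for illustration only.
LEMMAS = {
    "cat": "cat", "cats": "cat",
    "run": "run", "runs": "run", "running": "run", "ran": "run",
    "go": "go", "goes": "go", "going": "go", "went": "go",
}


def lemma_of(form: str) -> str:
    """Return the lemma of a word form; unknown forms are their own lemma."""
    key = form.lower()
    return LEMMAS.get(key, key)


def count_forms_and_lemmas(text: str) -> tuple[int, int]:
    """Count distinct word forms and distinct lemmas in a whitespace-split text."""
    tokens = text.split()
    forms = {token.lower() for token in tokens}
    lemmas = {lemma_of(token) for token in tokens}
    return len(forms), len(lemmas)


sample = "cat cats run runs running ran go goes going went"
print(count_forms_and_lemmas(sample))  # prints (10, 3)
```

Ten distinct forms collapse into three lemmas here, which is the inflation effect described above.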
Whenever you see or use a word, it counts towards your progress with the underlying lemma. So forms like going, went or goes all count towards your progress with the word go. In this context, we can also think of the lemma as the general concept that underlies all the inflections. For the word go, the concept is something like 'moving from one place to another'. So we want to measure whether you can express this concept.
At the same time, we can look at the grammatical concepts that underlie the inflection. So when you see the word went, it counts towards your progress with the word go but also with the grammatical concept of the simple past.
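As a rough sketch of this double bookkeeping, each encounter with a form could update two counters: one for the lemma and one for the grammatical concept. Everything here is an assumption made for illustration: the `ANALYSES` table, the concept labels, and `record_encounter` are hypothetical names, not part of any real Taalmaster API.

```python
from collections import Counter

# Hypothetical analysis table: form -> (lemma, grammatical concept or None).
ANALYSES = {
    "go": ("go", None),
    "goes": ("go", "third person singular present"),
    "going": ("go", "present participle"),
    "went": ("go", "simple past"),
}


def record_encounter(form: str, lemma_progress: Counter, grammar_progress: Counter) -> None:
    """Credit one encounter to the form's lemma and, if any, its grammatical concept."""
    key = form.lower()
    # Unknown forms are treated as their own lemma with no grammatical concept.
    lemma, concept = ANALYSES.get(key, (key, None))
    lemma_progress[lemma] += 1
    if concept is not None:
        grammar_progress[concept] += 1


lemma_progress = Counter()
grammar_progress = Counter()
for seen_form in ["went", "goes", "went"]:
    record_encounter(seen_form, lemma_progress, grammar_progress)

print(lemma_progress["go"])             # prints 3
print(grammar_progress["simple past"])  # prints 2
```

All three encounters count towards go, while only the two occurrences of went count towards the simple past.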