If you are a language enthusiast, you may often be thinking about how Swedish is similar to Danish or what similarities exist between Norwegian and Icelandic. Nordic languages obviously have a wide range of very similar grammatical constructs, not to mention very similarly used tenses, which I am sure is a subject of very intense discussion among comparative linguists at various language faculties, but I have the impression that it has never been quantified and only talked about. So let’s try to answer this question with some data.

But how to measure it?

One of the methods to estimate the similarity between two text strings is the method called Levenstein distance. It is a measure of the similarity between two strings by calculating the minimum number of single character edits (insertions, deletions, or substitutions) required to change one string into the other.

For the sake of this small kind of research, I collected a total of about 110 sentences of texts available in Swedish, Norwegian, Danish, and in Icelandic.

Of course, the results won’t tell you much about grammar or phonetics, but they will tell you about lexical similarities.

Are you ready?

If we measure the average of the distances between the examined text strings, we get the following results:

If we scale it to look at the language that is closest to the one in the header, it will look something like this:

What does this tell us?

The closest language to Norwegian is Danish and vice versa, the closest language to Swedish is Norwegian, the closest language to Icelandic is Norwegian.

What about Finnish? The closest language to Finnish is Swedish, but it’s not a Germanic language, it has different grammar rules and so on, but geographical proximity seems to have played a role here, in a way that most likely a lot of vocabulary is intertwined in both languages.

Does this analysis has any limitation?

Yes, of course it does. It may differ slightly from reality for various reasons, as the selected texts may not be representative of all language usage, but it gives us a data-driven answer.

But how to measure it?

Are you ready?

What does this tell us?

Does this analysis has any limitation?

Related Posts