The Terminator flaunted Spanish at every goodbye. With his iconic “Hasta la vista, baby” (rendered in Spain’s dub as “Sayonara, baby”), cinema’s most famous android showed that, although an English speaker, he took an interest in other languages. Something similar is happening, distances and time travel aside, with real-world language technologies. The internet speaks English, and that is why artificial intelligences learned Shakespeare’s language before any other. But things are beginning to change. Alexa gets by in Spanish, Catalan, Italian, Portuguese, German, Chinese and Japanese. Siri and OK Google are also polyglots. These are the gifted students in the class, the smartest of all the artificial intelligences. The rest of the algorithms lag behind on the complicated path to multilingualism. And that has consequences.
Today’s technology is eminently English-speaking, which amplifies the language gap for millions of people. It makes disinformation campaigns, for example, harder to detect in Spanish than in English. The group Enough, Facebook, led by an association of American jurists and activists, denounced at the beginning of the year that the social network has been less effective at combating disinformation about the US electoral campaign in Spanish than in English. Five out of 10 fake posts in English are quickly flagged by Facebook. Only one in 10 false posts in Spanish suffers the same fate, according to an analysis by Avaaz.
A website’s translation guarantees (or hinders) access to basic services. This is especially relevant in an environment in which the most routine procedures, from booking a medical appointment to filing a tax return, are carried out online. A study from Brigham and Women’s Hospital in Boston (USA) found that non-English-speaking patients were 35% more likely to die from COVID-19 than those who spoke the language. That is why it is important that every language spoken in the offline world be represented in the online world. And that can only be achieved by feeding words to the voracious artificial intelligences.
“Since we are talking about language, it is important to call things by their names,” says Nuria Bel, professor of Language Technologies at Pompeu Fabra University, in a telephone conversation. “Artificial intelligence is not a result, it is a method. One by which algorithms are trained with a large amount of data.” This applies to three large fields: robotics, with results ranging from smart vacuum cleaners to the autonomous car; computer vision, detecting and classifying images; and, finally, language technologies. Bel believes that in recent years we have chosen to speak, in general, of artificial intelligence “because it sounds less prosaic” than language technologies. More epic. But it should not be forgotten that these have a problem of their own. The most obvious is the atomization of their results.
Robotics and computer vision can be trained internationally without major problems. The advances made by American companies can benefit the rest of the world equally. Not so with language technologies. Here, each country needs to make its own commitment to promoting its languages. And that is beginning to happen. “We are at a moment similar to what happened in the nineteenth century with cartography,” the expert explains. “In previous centuries, maps reflected the Mediterranean very well and the rest of the world very poorly. But in the 19th century, countries found it necessary to map their territory in detail. This is the same, but with languages.”
In 2015, Nuria Bel led (together with Germán Rigau) the Spanish initiative to map its linguistic territory: the Plan for the Promotion of Language Technologies. It began with a report that laid out the problem of applying artificial intelligence in this field. “To guarantee the availability of applications in Spanish and Spain’s co-official languages, the number, quality and availability of the resources that support them must be increased,” it stated at the time. The algorithms have to be fed more words.
Wikipedia, the only public language school for algorithms
Artificial intelligence alone does not make an algorithm intelligent. For that to happen, algorithms are like people: they must have read a lot. “Neural networks perform terribly until they have analyzed a significant critical mass of text,” confirms Bel. For a text generator to work, for example, some 3 billion words are needed. “That’s all of Wikipedia, all of Google Books and a few other corpora besides,” she says. This poses several problems. The first is that not all languages have that many digitized words. If you want to know how well a language technology works in a particular language, the best thing is to look at how that language is represented on the internet.
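To give a sense of the scale of that 3-billion-word figure, one can check whether a collection of text files reaches a given word count. This is a minimal sketch, not a real training pipeline; the directory path and the naive whitespace tokenization are illustrative assumptions.

```python
from pathlib import Path

# Threshold quoted in the text: roughly 3 billion words for a text generator.
THRESHOLD = 3_000_000_000


def corpus_word_count(corpus_dir: str) -> int:
    """Naively count whitespace-separated tokens in every .txt file
    under corpus_dir (recursively)."""
    total = 0
    for path in Path(corpus_dir).glob("**/*.txt"):
        total += len(path.read_text(encoding="utf-8", errors="ignore").split())
    return total


# Usage (hypothetical path):
# corpus_word_count("my_corpus/") >= THRESHOLD
```

Real corpora would need deduplication and proper tokenization, but even this naive count makes the gap visible: few languages have 3 billion digitized words to offer.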
The linguistic reality of the online world is not always the same as that of the offline world. Wikipedia has 53 million pages in English, compared with seven million in Spanish. The Italian edition of the online encyclopedia has the same number of entries, for a language much less widely spoken. This blurs the weight of Spanish in the online environment.
There are more extreme cases. The Bengali Wikipedia has only 52,000 articles, for a language spoken by 237 million people. The Swedish version has about four million entries for a language spoken by barely 10 million. On the world map of digital languages there are still uncharted areas, while others are faithfully represented.
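The disparity becomes concrete if the figures above are expressed as articles per million speakers. This back-of-the-envelope sketch uses only the numbers quoted in the text:

```python
# Wikipedia article counts and speaker populations as quoted above.
languages = {
    "Bengali": (52_000, 237_000_000),
    "Swedish": (4_000_000, 10_000_000),
}

for name, (articles, speakers) in languages.items():
    per_million = articles / (speakers / 1_000_000)
    print(f"{name}: {per_million:,.0f} articles per million speakers")
```

By this measure Bengali has roughly 219 articles per million speakers against Swedish’s 400,000: a gap of three orders of magnitude between two points on the map of digital languages.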
The online encyclopedia is one of the best databases for training language technologies. “Wikipedia is widely used in this environment because it is open and public,” says Germán Rigau, deputy director of the Basque Research Center for Language Technologies. But it is an exception. Most large databases are in the hands of private US companies: Google, Facebook, Amazon, Microsoft… They are the ones winning this race. In this sense, Europe is at a clear disadvantage compared with the United States: its companies are smaller and its population speaks dozens of languages that start from very different positions. But some initiatives, public and private, are striving to change this situation.
Rigau is part of one of them, the European Language Equality program. “We are designing an agenda and a roadmap, fulfilling a mandate from the European Parliament, to achieve equality among Europe’s languages in the online environment by 2030,” says the expert. More than 21 European languages are in danger of digital extinction, as this initiative explains on its website. Rigau denounces how unofficial languages with an oral tradition but little use online are doomed to this end. “Aragonese is going to lose out; Asturian is going to lose out. There are hardly any digital texts written in these languages,” he laments. To avoid this, the first thing needed is “very basic tools to create templates for those languages in Wikipedia. That way you can start building a database.”
Wikipedia is the only training ground for many algorithms, since its data is public in an environment, the online one, dominated by large private companies. It is, so to speak, the only great public language school for algorithms. And this is the second big problem. “Language technologies are in the hands of whoever has the data, and that is the large private companies,” Rigau denounces.
Bel agrees with this analysis and explains her concern about it. “It is a bit like banking, which told us it could regulate itself. The same thing is happening here: a handful of companies have all the data, and they tell us we have to talk about ethical artificial intelligence, that we have to trust them to do things right.” That is why Bel urges that other lines of research in language technologies not be abandoned. Artificial intelligence works well, but the processes behind it are complicated and opaque, she says.
These results are becoming more and more evident. The algorithms speak English, but they are learning other languages, entering new territories. For the trip to be fruitful, they need an accurate map: the linguistic realities of every corner of the planet must be charted. In this context, Spanish, the world’s second mother tongue by number of speakers, starts from an advantageous position. Taking advantage of it, experts agree, is not just a matter of economic momentum and socio-political strategy. It is key to ensuring that Spanish speakers are as protected and well served as English speakers. The language that machines speak and understand will have a direct impact on the rights of the humans who speak it.