Language Preservation Efforts Get an AI Boost

Computer scientists and linguists build AI tech to strengthen endangered languages.

Ivory Yang, a graduate student in computer science, learned a few words of Nüshu from her grandmother as a child. The graphic shows the Nüshu and Chinese characters for the word “ocean.” (Graphic by Richard Clark)
Four centuries ago, Yao women in the southern Chinese province of Hunan created a script called Nüshu, literally meaning “women’s writing” in Chinese, that was used for centuries by women to communicate with one another in secret.

After women gained greater access to formal education in the 1900s, the use of the script declined and many Nüshu texts were lost or destroyed over time. Since the turn of this century, however, there has been a sustained effort in China to save the script from becoming extinct.

Now, computer science graduate student Ivory Yang, Guarini, who remembers learning a few words of Nüshu from her grandmother as a child, is exploring how artificial intelligence models offer new ways to preserve and help revitalize the rare script.

Yang and her collaborators, including Weicheng Ma, Guarini ’24, built an AI-driven framework called NüshuRescue that can potentially be adapted to other “low-resource” languages, which have fewer written or translated materials available for training AI systems.

The tool used minimal data, just 35 pairs of matching sentences in Chinese and Nüshu, to train a large language model that had no prior knowledge of Nüshu to expand the database of text in the script through translations from Chinese.

The researchers began with A Compendium of Chinese Nüshu, the most comprehensive, expert-validated collection of scanned Nüshu scripts and corresponding Chinese translations. They worked with expert annotators trained in computational linguistics to create a dataset of 500 digitized Chinese-Nüshu sentence pairs, which included newly mapped words in the two languages.

The researchers used samples from the manually translated dataset to train the GPT-4 Turbo large language model. They found that with just 35 samples, the model began to get a grasp of the script and was able to translate test phrases from Chinese to Nüshu that were not part of the training data. Their work in creating an expert-validated Nüshu-Chinese digital dataset is the first of its kind.
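
One common way to teach a large language model from so few samples is to place the example pairs directly in the prompt, followed by the new sentence to translate. The sketch below illustrates that idea only. It is an assumption about the general approach, not the researchers' actual code, and the function name, the prompt wording, and the placeholder strings are all hypothetical. No real Nüshu text or model call is included.

```python
from typing import List, Tuple


def build_few_shot_prompt(pairs: List[Tuple[str, str]], query: str, k: int = 35) -> str:
    """Assemble a translation prompt from up to k example pairs plus one new sentence."""
    lines = ["Translate each Chinese sentence into Nüshu."]
    for chinese, nushu in pairs[:k]:
        lines.append(f"Chinese: {chinese}")
        lines.append(f"Nüshu: {nushu}")
    # The final entry leaves the Nüshu side blank for the model to complete.
    lines.append(f"Chinese: {query}")
    lines.append("Nüshu:")
    return "\n".join(lines)


# Placeholder strings stand in for real sentence pairs from a 500-pair dataset.
example_pairs = [(f"<chinese sentence {i}>", f"<nushu sentence {i}>") for i in range(1, 501)]
prompt = build_few_shot_prompt(example_pairs, "<new chinese sentence>")
print(prompt.count("Chinese:"))  # 36: the 35 examples plus the query
```

The resulting text would then be sent to a language model, which is expected to fill in the final blank line with its translation.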

Among other things, Yang is keen to extend her model to different media. “There are handkerchiefs and floating fans that have Nüshu writings on them,” she says. “So the next step would be to build multimodal models that can use computer vision to capture these images and train a model to recognize and translate the characters for us.”

Their work, published recently in conference proceedings, also demonstrates how the framework, which minimizes reliance on extensive human annotations, can be applied to other low-resource languages such as Cherokee.

“Our work demonstrates that generative AI and large language models significantly lower barriers to revitalizing endangered languages, rapidly producing valuable linguistic resources even from minimal data,” says Vosoughi. But, he says, despite their transformative potential, these models inherently carry the risk of introducing biases from dominant cultures, potentially distorting or oversimplifying nuanced cultural identities.

“Active participation from native speakers and linguists is essential to ensure linguistic authenticity and cultural fidelity. AI and community expertise are both fundamental for meaningful preservation efforts,” says Vosoughi.

Evaluating existing technologies

Besides creating new tools, researchers are also examining whether existing language technologies, which are built for and center around mainstream languages, support endangered languages.

A notable case is Google Translate’s LangID, which does not support most Native American languages, including Navajo, one of the most widely spoken Indigenous languages in North America, says Yang. This means that these languages cannot even be detected online.

In a paper that Yang will present at the Nations of the Americas Chapter of the Association for Computational Linguistics conference, Yang and her collaborators found that Google LangID misidentified Navajo sentences as other, unrelated languages.

To address this, the researchers built a simple yet highly accurate language-identification model for Navajo and related Athabaskan languages that reliably distinguishes these languages from those erroneously suggested by LangID. Their work highlights the need for machine-learning technologies that better support underrepresented languages and cultural diversity.
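
To give a sense of how a simple language identifier can work, the sketch below compares the three-character sequences in a sentence against profiles built from sample text in each language. This is a generic technique chosen for illustration and is not the researchers' model. Because the article includes no Navajo text, English and Spanish sentences serve as stand-ins, and all function names are hypothetical.

```python
import math
from collections import Counter


def trigrams(text: str) -> Counter:
    """Count overlapping three-character sequences in a padded, lowercased string."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(samples: dict) -> dict:
    """Build one trigram profile per language from its sample sentences."""
    profiles = {}
    for label, sentences in samples.items():
        profile = Counter()
        for sentence in sentences:
            profile.update(trigrams(sentence))
        profiles[label] = profile
    return profiles


def identify(text: str, profiles: dict) -> str:
    """Return the language whose profile is most similar to the text."""
    grams = trigrams(text)
    return max(profiles, key=lambda label: cosine(grams, profiles[label]))


# Stand-in data: English and Spanish replace the languages studied in the paper.
samples = {
    "english": ["the water is cold today", "she walks to the river every morning",
                "they are singing in the house"],
    "spanish": ["el agua está fría hoy", "ella camina al río cada mañana",
                "ellos están cantando en la casa"],
}
profiles = train(samples)
print(identify("the house is near the river", profiles))
print(identify("la casa está cerca del río", profiles))
```

A practical system would need far more text per language and a held-out test set, but the same principle applies: languages tend to leave distinctive character patterns that even small models can pick up.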

Tech tools to aid language preservation

Computer tools and AI models are valuable aids that speakers of endangered languages (members of these communities as well as researchers studying them) can use to document and revive these languages, says Coto Solano, an assistant professor of linguistics whose work focuses on the creation of computer models that can understand Indigenous languages.

“A lot of the work that we linguists do requires a lot of expertise and attention, but is also very repetitive and tedious, something that would be good for a computer to take care of,” says Coto Solano, who is also an adjunct assistant professor of computer science.

Ryan Dudak ’24, left, documents food culture during a foreign study program in the Cook Islands with Jean Tekura Mason, who was making “ika mata,” fish marinated in lime and coconut milk.

It was a meeting with Sally Nicholas, a linguist at the University of Auckland who works in the South Pacific nation of the Cook Islands, that motivated Coto Solano to blend his skills in linguistics and computer science and create new technologies.

“For her dissertation, Sally said that she had recorded dozens, maybe hundreds, of hours of recordings, and joked that she was going to die before she finished transcribing everything,” he said.

Working with Nicholas and other collaborators, Coto Solano built automatic speech-recognition models for Cook Islands Māori that use machine learning to identify speech patterns from audio recordings and transcribe them into text.

“Transcription is a very specialized and difficult task, especially in a language that very few people write,” says Coto Solano, who has also made speech-recognition models for the Costa Rican languages Bribri and Cabécar.

By accelerating transcription, speech recognition makes it possible to transcribe and document stories and cultures of communities that have a dwindling number of native speakers.

Coto Solano also uses techniques from natural language processing, a field of artificial intelligence that enables computers to understand a language by analyzing text and speech data, to develop text-to-speech and machine translation tools.

These can open the doors to future applications that can potentially engage young people and motivate them to learn and use the language. “We can create learning tools that can have a voice or make it possible for children in the diaspora, for example, to have access to native language content through machine translation,” says Coto Solano.

Collaborating with communities and ultimately empowering them to drive language initiatives will pave the way for developing the most useful and impactful applications, says Coto Solano. Among other efforts in this direction, “We conducted a workshop last July to train Cook Islanders in linguistics and a little bit in natural language processing,” he says.

Coto Solano and his collaborators are also working to create an easy-to-use interface for the speech recognition tool so that the community at large, and not just researchers, can use it to transcribe and document video and audio content.

The Department of Linguistics also conducts a foreign studies program in Auckland, New Zealand, and Rarotonga in the Cook Islands, during which students take classes on Māori language and culture before conducting linguistics field research.

“In the Americas, and around the world, there are a lot of languages that are in danger of going dormant: they will no longer be spoken by anyone alive in the near future,” says Coto Solano. “Whichever tools we can use to help turn this tide are urgent, are necessary.”

Harini Barath