Using Artificial Intelligence to Preserve Indigenous Languages

Published: January 22, 2025
Category: Essays | News
Dr. Jacqueline (Lina) Brixey [Photo/Aaron Balana]

By Dr. Jacqueline (Lina) Brixey

Dr. Jacqueline (Lina) Brixey is a computer scientist focusing on dialogue systems, artificial intelligence, and language revitalization, particularly for Indigenous and endangered languages. She holds a PhD in Computer Science from the University of Southern California (USC) and two master’s degrees, an MSc (Computer Science) and an MA (Linguistics), from the University of Texas at El Paso. As a computer scientist and citizen of the Choctaw Nation of Oklahoma, Dr. Brixey is committed to supporting and empowering Native American communities through technology and education. In this essay, she talks about her research at ICT, including developing “ChoCo,” a Choctaw language corpus, and a conversational AI called “Masheli.” Next up? A machine translation system to convert English text into Choctaw and vice versa.

There are roughly 7,000 languages spoken in the world today. However, according to the Internet Society Foundation, “English content dominates over half of all written content online, despite only around 16% of the world’s population speaking this language.”

The dominance of the English language in technology is a problem on many levels. Without support for other languages, particularly those that are endangered, centuries of knowledge will be lost. Within research fields, languages other than English are categorized as “high resource” (Spanish, Mandarin, and about 18 other languages) and “low resource” (all the others). Online representation and publicly accessible resources make it easy for researchers to work with “high resource” languages, while ALL of the other languages get left behind.

As a computer scientist, I am interested in addressing the challenge of representation for “low resource” languages using artificial intelligence and language technology. Also, as a citizen of the Choctaw Nation of Oklahoma, the fourth-largest Indigenous group by population in the United States with 220,000 enrolled members, I have a personal and professional responsibility to ensure that future generations can learn and preserve our language. It is estimated that fewer than 7,000 of us speak Choctaw today, although the Nation is working hard to revitalize the language. Like many people in the Choctaw Nation of Oklahoma, I did not grow up speaking Choctaw. Thankfully, when I started my PhD in Los Angeles, I was finally able to study Choctaw in one of the in-person community courses offered by my Nation.

Many Indigenous languages in the US are endangered and losing speakers. Communities have lost speakers for numerous reasons, among them US government policies of forced assimilation and suppression of Indigenous languages over hundreds of years. Examples of these policies include forced attendance at residential schools and forced removal from ancestral lands.

In the last twenty years, there has been a growing movement to reverse this traumatic forced assimilation, backed by significant research showing the positive effects for Indigenous people of speaking their languages, including lower rates of addiction and suicide and a decreased prevalence of type 2 diabetes. I am proud to contribute to this research area by asking how technology can aid the revitalization of Indigenous languages, thereby leading to positive social, economic, and health outcomes.

To this end, my doctoral research at the University of Southern California (USC) focused on developing two specific tools: “ChoCo,” a Choctaw language corpus (a corpus is a collection of written and spoken texts essential for the study of a language), and “Masheli,” a conversational dialogue system.

Here’s how both those projects came about and the research results that followed – plus a preview of my upcoming work. 

CHOCO: A Choctaw Language Corpus

Before building any language technology, we needed a proper corpus at its core. Prior to my research, no digital corpus of the Choctaw language existed, although work had been undertaken to document the language and conduct linguistic studies. In the published paper ChoCo: a multimodal corpus of the Choctaw Language (with co-author Dr. Ron Artstein), we describe how we created a dataset containing audio, video, and text resources, with many texts also translated into English. 

The Choctaw language has 15 consonants and 9 vowels in three series: short, long, and nasalized. Its orthography, which is not fully standardized, uses the Latin script, so the writing system is similar to that of English. The language has subject-object-verb order (as opposed to English’s subject-verb-object order). Choctaw is also highly inflectional, with prefixes, suffixes, and infixes possible on a single verb base.

Both the Oklahoma Choctaw and the Mississippi Choctaw variants of the language were represented in this corpus, and we provided documentation support for this threatened language, allowing researchers and language teachers access to a diverse collection of resources.

To create the repository, I gathered a variety of text materials, such as Marcia Haag and Henry Willis’s two books of teaching material, poetry, short stories, and correspondence, and teaching materials from the Mississippi Band of Choctaw Indians. All texts were manually entered into a database. We aim to archive the materials in the future at the Choctaw Nation’s Cultural Center.
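For readers who think in code, here is one way an entry in a multimodal corpus like this could be pictured. It is a minimal sketch under stated assumptions, not ChoCo's actual schema: the class names, field names, and the helper translated_texts are all hypothetical, and the sample values are placeholders rather than real corpus content.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Modality(Enum):
    AUDIO = "audio"
    VIDEO = "video"
    TEXT = "text"


class Variant(Enum):
    OKLAHOMA = "Oklahoma Choctaw"
    MISSISSIPPI = "Mississippi Choctaw"


@dataclass
class CorpusItem:
    """One hypothetical corpus entry; all field names are assumptions."""
    item_id: str
    modality: Modality
    variant: Variant
    source: str                                 # e.g. a book, letter, or recording
    choctaw_text: Optional[str] = None          # text or transcript, if available
    english_translation: Optional[str] = None   # only some items are translated


def translated_texts(items: List[CorpusItem]) -> List[CorpusItem]:
    """Return the text items that also carry an English translation."""
    return [
        item for item in items
        if item.modality is Modality.TEXT and item.english_translation is not None
    ]


# Placeholder entries for illustration only.
sample = [
    CorpusItem("t1", Modality.TEXT, Variant.OKLAHOMA, "hypothetical short story",
               "<Choctaw text>", "<English translation>"),
    CorpusItem("a1", Modality.AUDIO, Variant.MISSISSIPPI, "hypothetical recording"),
]
print([item.item_id for item in translated_texts(sample)])
```

Keeping the translation optional reflects the point that many, but not all, of the texts were translated into English.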

MASHELI

Dr. Jacqueline (Lina) Brixey Presenting Masheli: A Choctaw-English Bilingual Conversational AI

Once we had developed ChoCo, we moved on to developing a conversational AI (aka a chatbot) called “Masheli” (a Choctaw word describing how the sky looks after a heavy rainstorm). We selected 17 stories to form Masheli’s knowledge base and enabled it to respond in both English and Choctaw.

In the published paper Masheli: A Choctaw-English Bilingual Chatbot (co-authored with Dr. David Traum, my PhD advisor), we describe how the chatbot supports learners in gaining conversational fluency in Choctaw. Using NPCEditor (which was developed at ICT) for response selection, Masheli responds in either Choctaw or English depending on the user’s input, shares cultural animal stories, and repeats responses in both languages when necessary.
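To give a feel for what “response selection” means, here is a toy sketch. It is not NPCEditor and not Masheli’s actual code: it swaps in a simple word-overlap (cosine) score, and the Story class, the trigger keywords, the animal titles, the marker-word language check, and the bracketed reply strings are all hypothetical placeholders.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Story:
    title: str
    triggers: str   # hypothetical keywords a user might type to ask for this story
    reply_en: str   # placeholder for the English version
    reply_cho: str  # placeholder for the Choctaw version (not real Choctaw text)


def tokens(text: str) -> Counter:
    """Lowercase bag of words; \\w+ also keeps non-ASCII letters."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def select_reply(user_input: str, stories: List[Story],
                 choctaw_markers: Set[str]) -> str:
    """Pick the best-matching story and answer in the language the user used."""
    query = tokens(user_input)
    best = max(stories, key=lambda s: cosine(query, tokens(s.triggers)))
    # Crude language guess: any known Choctaw marker word means "reply in Choctaw".
    use_choctaw = any(word in choctaw_markers for word in query)
    return best.reply_cho if use_choctaw else best.reply_en


stories = [
    Story("possum", "possum tail story", "[English: possum story]", "[Choctaw: possum story]"),
    Story("rabbit", "rabbit story", "[English: rabbit story]", "[Choctaw: rabbit story]"),
]
markers = {"halito"}  # illustrative marker list; a real system would need far more

print(select_reply("Tell me the rabbit story", stories, markers))
print(select_reply("Halito, possum", stories, markers))
```

A real system would need a much richer matching model and a better way to tell the two languages apart, but the overall shape is the same: score every candidate response against what the user said, then return the best one in the appropriate language.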

In our pilot study, we explored Masheli’s functionality, highlighting its role in encouraging conversational practice without corrective feedback, thereby aiding language revitalization by providing accessible, low-pressure conversational opportunities. Simply put, rather than building a tutoring system, we wanted to give people a place where they feel comfortable practicing their language skills.

In 2019, I was honored to present this research at the 1st International Conference for Language Technologies for All (LT4ALL), a UNESCO event organized as part of the International Year of Indigenous Languages. My work also won the best video award at the International Workshop for Spoken Dialogue System Technology (IWSDS) in 2020.

NEXT STEPS: Community Work

Following graduation, I am delighted to focus my full attention on two community-facing projects. In the first, I am the website and mobile app developer for a language revitalization project with the Mississippi Band of Choctaw Indians as they write their very first dictionary for the Mississippi variant of the Choctaw language. I am incorporating a Choctaw speech-to-text system developed during my dissertation into the website, which will help fluent speakers who may not write the language use the dictionary by saying aloud the words they would like to search for. The website and apps are expected to be finalized and publicly released in 2025.

In my second community project, I am developing a website and interactive learning games, as well as retooling the Masheli chatbot, for a NASA Science Activation Grant called Native Earth Native Sky (NENS) at Oklahoma State University. I am delighted to support the efforts to connect Oklahoma middle school students with the Choctaw language and Indigenous scientific knowledge.

Finally, I hope my work in Indigenous and low resource language technology inspires other Indigenous people to consider a future in computer science. Technology, and especially AI for Social Good, is an important tool and can be used to benefit Indigenous communities. I hope my research demonstrates that we can build these tools for ourselves.
