Transforming Research in Arabic-Based Languages
Handwritten notes, diaries, letters …
So often, these low-tech documents hold the keys to unlocking whole histories. But how can researchers — and families — search through millions of such pages that span the sixth to the 20th centuries? Digitizing them is the first step. However, when they’re written in Arabic, whose letters take various, position-dependent forms and often carry dots, lines, or diacritic markers, the challenge becomes extreme.
With support from a $476,483 grant from the National Endowment for the Humanities — the largest NEH grant to NC State University to date — researchers at the Moise A. Khayrallah Center for Lebanese Diaspora Studies are taking on that challenge.
“We intend to revolutionize research by making historical data easily and readily accessible to scholars and the general public,” says Akram Khater, a University Faculty Scholar and professor of history. He also holds the Khayrallah Chair in Diaspora Studies and serves as director of the Khayrallah Center. “Our efforts will transform research in Arabic-based languages.”
This work builds on the Khayrallah Center’s Arabic Optical Character Recognition (OCR) Project, which aims to improve access to Arabic texts by developing a fully searchable database for researchers and the public.
In the first phase of the project, an interdisciplinary team of historians and computer science graduate students created software to convert images of Arabic newspapers and books into computer-readable text files. The database, released in 2020, allows researchers to search through thousands of digitized pages in minutes, rather than hours or days.
Now, the second phase of the project will focus on handwritten documents, which Khater says comprise most of the historical data (pre-1950s) about Arabic-speaking people in North and South America, as well as in the Middle East and North Africa. These pieces, he says, pose a different problem due to their lack of consistency and uniformity in style of writing, in clarity, and in type of document.
When the project is complete, documents in Arabic as well as in Ottoman Turkish and Farsi will be accessible. This will make a treasure trove of digitized pages, archived at the Khayrallah Center and other research centers around the world, text-searchable.
The Khayrallah Center’s archive now contains 250,000 pages, which will increase to 1 million in the next three years. To date, 8% of the center’s material has been uploaded to the searchable OCR database.
The team includes Chau-Wai Wong, assistant professor in NC State’s electrical and computer engineering department, a doctoral student and a postdoctoral fellow in computer vision, and history graduate students.
In addition, paid volunteers will transcribe a number of handwritten documents. This step will create 100,000 data points that will enable the computer to read the documents and make them searchable. Students in the Arabic studies program at NC State will test the software.
More than 1,000 users from the Middle East and the U.S. have already used the project’s initial database. They range from scholars conducting research to people looking for lost relatives. That number will grow with forthcoming software features, such as the ability to translate texts from Arabic to English, visualize the data, and perform linguistic analysis.
Meanwhile, the project has attracted interest from research centers hoping to collaborate and from organizations willing to provide additional funding, Khater says.
“From the start, the goal was to build the largest searchable database of Arabic documents,” he adds. “We are well on our way to achieving that goal and to putting it in the service of people’s imaginations.”