The author, alumnus David Sasu ’19, is an Archer-Cornfield Fellow in Ashesi’s Computer Science and Information Systems Department. He is also a PhD candidate in the NLPnorth research unit within the Computer Science Department at the IT University of Copenhagen, and he has worked as a Visiting Researcher in the Spoken Language Processing Group at Columbia University. His research focuses on speech and language processing for low-resource languages, with a particular interest in African languages.
Conversational AI is everywhere, from our phones to hospital check-in desks, yet most systems overlook the most human aspect of speech. They operate through a rigid, text-centered pipeline: transcribing words, reasoning in text, and producing a scripted reply. This approach is fragile and incomplete because it erases prosody: the rhythm, tone, and emotion that give language its meaning. By ignoring how we speak, these systems reduce conversation to transactions. They can tell us the weather, but they cannot register our frustration or share in our joy. They talk, but they do not connect.
Over the past five years, my work has focused on bridging the gap between talking and connecting. The future of speech technology lies in integrated, end-to-end models that listen, interpret, and respond in one fluid motion while preserving the richness of human expression. My aim is to steer these technologies toward cultural and humanitarian impact. The world is profoundly multilingual, yet the digital landscape remains narrow. Prosody offers a path forward: by modeling cues such as pitch and emphasis, we can create systems that better resolve ambiguous commands and improve recognition in challenging conditions, especially for communities whose languages are underrepresented in AI.
This philosophy of embedding culture at the core of technology is embodied in our creation of the Akan Cinematic Emotions dataset. As the first multimodal, prosodically annotated resource for an African language, it lays the groundwork for building systems that truly understand speakers of Akan, spoken by millions across West Africa. It moves us beyond translation toward systems that can interpret intent and emotion within a cultural frame. Developing emotionally intelligent and culturally aware technology is about more than convenience; it is about care. In healthcare, an empathetic conversational agent could sense a patient’s anxiety and adjust its tone to provide reassurance. In humanitarian crises, such agents could serve as first responders, collecting vital information while respecting local languages and norms. To succeed in these contexts, conversational AI must embrace the messiness of human dialogue rather than reduce it to neat fragments.
My ongoing work is dedicated to this vision. By placing culture and prosody at the centre of speech technology, we can create conversational agents that do more than process commands: agents that listen with empathy, adapt with respect, and offer support when it matters most.




