Are voice-user interfaces the future?

2 min

Recently, the internet went crazy over the release of GPT-4o. The announcement showcased jaw-dropping features: it can now understand video, audio, and text inputs and interact via a voice interface, like an improved Siri or Alexa that perceives its environment beyond mere audio or text. What truly impressed users was the way it interacted.

The model delivered human-like voice responses, natural intonation, strong contextual awareness, and even the occasional joke. The “AI will replace us” narrative resurfaced, with some claiming we’d officially reached AGI.

Society will implode! AI-driven robots are going to take our jobs and the war against the machines WILL FINALLY BEGIN!


What many people don’t realize is that Natural Language Processing (NLP) has been around for quite some time, with some studies dating back to the 1940s, just after WWII. As technology evolves, we can produce extremely fine-tuned probabilistic and mathematical models, enhanced by deep learning and many other machine learning techniques, resulting in more and more complex dialogues with deeper contextual depth and more natural responses. Combine that with a really good voice-generation model and you get quite the feeling of “talking to a person”, but that says more about the human ability to recognize patterns than about a machine’s actual ability to communicate. AI models in general are still very bad at complex problems, and even the brand-new GPT-4o is not even close to scratching the surface of human-level problem solving.

In short: no, you don’t need to worry about machines taking over… at least not yet.

After reading a couple dozen tweets, I realized that the big AI-is-gonna-take-over debate will probably be the main focus of discussions around LLMs in the near future, but there is one thing I haven’t seen many people talking about: an opportunity to discuss the current state of voice-user interfaces (VUIs). I’m sure there are a lot of folks who think it would be awesome to have a Jarvis-like assistant to help with household chores, manage schedules, order groceries, and so on. But developing VUIs brings lots of challenges you don’t normally face in conventional user interfaces.

While I have personally explored voice-commanded websites in the past, I’ve never actually developed a robust VUI. So for this article, I did a little research into the main problems in building one. This is what I found:

The issues developers face when building VUIs are:

  1. Contextual Understanding
  2. Security and Privacy

1. Contextual Understanding

One of the main problems in VUIs is understanding the user’s context. Speech recognition is already a challenging problem given the wide range of languages and accents; add the lack of context, sprinkle in a little background noise, and it becomes a nightmare for developers to properly identify what the user wants.

2. Security and Privacy

Most VUIs rely on some type of wake word, meaning the device needs to be constantly listening for it to work. Data protection is a huge issue in the tech world, and especially in this field: many users distrust these types of interfaces, fearing unauthorized data collection. On top of that, a VUI needs you to speak your instructions loud and clear, which means sharing with everyone around you what you’re trying to do.
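One common mitigation is to gate everything behind the wake word on-device, so only post-wake audio ever leaves the user’s hardware. Here’s a minimal, text-simulated sketch of that idea; the wake phrase, the string “frames” standing in for audio chunks, and the buffer size are all assumptions for illustration:

```python
# Minimal sketch of wake-word gating: nothing is forwarded for processing
# until the wake word is heard, and the small local buffer means audio
# captured before the wake word is discarded as it ages out.

from collections import deque

WAKE_WORD = "hey assistant"  # hypothetical wake phrase
PRE_ROLL = 3                 # keep only the last few frames locally

def gate_stream(frames):
    """Yield only the frames that follow the wake word."""
    buffer = deque(maxlen=PRE_ROLL)  # rolling local buffer, never uploaded
    awake = False
    for frame in frames:
        if awake:
            yield frame              # only post-wake audio leaves the device
        elif WAKE_WORD in frame:
            awake = True
        else:
            buffer.append(frame)     # pre-wake audio stays (and ages) locally

# Simulated transcript frames standing in for audio chunks:
frames = ["private chat", "more chatter",
          "hey assistant", "set a timer", "for ten minutes"]
print(list(gate_stream(frames)))     # ['set a timer', 'for ten minutes']
```

Real implementations run a small on-device keyword-spotting model instead of string matching, but the privacy property is the same: audio from before the wake word never leaves the device.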

Users are also a bit skeptical when you bring up voice-driven interfaces. Most everyday applications are not voice-powered, and there are few real use cases for voice in complex systems.


So, what now? Should we just screw the concept of VUIs?

No. I believe that VUIs have their own niche inside software development. Privacy is a major concern for day-to-day applications like banking apps, social media, and other systems that handle sensitive information, but steps toward refined contextual understanding, such as GPT-4o, can be a huge improvement in the experience, and the rapid pace of change in the AI field brings us closer every day to better understanding and communicating with users.

VUIs can be incredibly powerful for lots of people. Voice-powered interfaces that truly understand context and intention can remove barriers for people with physical or cognitive impairments by allowing a natural conversational flow, rather than relying on voice commands with complex “navigation”. This gives users direct, immediate control over their environment without having to be taught the specific way of using your application. Combined with personalized responses, adaptive feedback, and seamless integration with modern apps, this could be a game changer for VUIs in the next few years.

Older adults, too, can be overwhelmed by the sheer amount of information in the modern world. Voice interfaces are essential to bridge the gap between technological advancement and human capability. For older adults facing mobility challenges, declining vision, or memory lapses, a well-designed VUI becomes more than a convenience: it’s a lifeline. It can read messages aloud, set medication reminders, adjust lighting or temperature, and place emergency calls without requiring them to learn new interfaces. This not only preserves their independence but also alleviates caregiver burden and builds confidence in using digital services.

Conclusion

Voice-user interfaces aren't perfect for everyday use and probably never will be. But I think it is important for developers to keep working on ways to better include and support all kinds of people, closing the accessibility gap so tech actually works for everyone.