Meta and Google Are Betting on AI Voice Assistants. Will They Take Off?

A pair of glasses from Meta shoots a picture when you say, “Hey, Meta, take a photo.” A miniature computer that clips to your shirt, the Ai Pin, translates foreign languages into your native tongue. An artificially intelligent screen features a virtual assistant that you talk to through a microphone.

Last year, OpenAI updated its ChatGPT chatbot to respond with spoken words, and recently, Google introduced Gemini, a replacement for its voice assistant on Android phones.

Tech companies are betting on a renaissance for voice assistants, many years after most people decided that talking to computers was uncool.

Will it work this time? Maybe, but it could take a while.

Large swaths of people have still never used voice assistants like Amazon’s Alexa, Apple’s Siri and Google’s Assistant, and the overwhelming majority of those who have used them said they never wanted to be seen talking to them in public, according to studies done in the last decade.

I, too, seldom use voice assistants, and in my recent experiment with Meta’s glasses, which include a camera and speakers to provide information about your surroundings, I concluded that talking to a computer in front of parents and their children at a zoo was still staggeringly awkward.

It made me wonder if this would ever feel normal. Not long ago, talking on the phone with Bluetooth headsets made people look batty, but now everyone does it. Will we ever see lots of people walking around and talking to their computers as in sci-fi movies?

I posed this question to design experts and researchers, and the consensus was clear: Because new A.I. systems improve the ability of voice assistants to understand what we are saying and actually help us, we’re likely to speak to devices more often in the near future, but we’re still many years away from doing this in public.

Here’s what to know.

Why voice assistants are getting smarter

New voice assistants are powered by generative artificial intelligence, which uses statistics and complex algorithms to guess what words belong together, similar to the autocomplete feature on your phone. That makes them more capable of using context to understand requests and follow-up questions than virtual assistants like Siri and Alexa, which could respond only to a finite list of questions.
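
To make the autocomplete comparison concrete, here is a minimal sketch in Python of guessing the next word from statistics. It is only a toy: it counts which word most often follows another in a tiny, made-up set of sentences, which is far simpler than the systems behind ChatGPT or Gemini. The function names and the sample sentences are assumptions invented for illustration, not anything used by the companies mentioned here.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def suggest_next(following, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = following.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical toy corpus, made up for illustration only.
corpus = (
    "what is the weather in new york "
    "what is the weather in san francisco "
    "what should i pack for new york"
)
model = train_bigrams(corpus)

print(suggest_next(model, "new"))      # the word seen most often after "new"
print(suggest_next(model, "weather"))  # the word seen most often after "weather"
print(suggest_next(model, "zoo"))      # a word the toy model has never seen
```

A real assistant considers far more than the single previous word, which is part of what lets it keep track of a conversation, but the underlying idea of predicting likely words from observed patterns is the same.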

For example, if you say to ChatGPT, “What are some flights from San Francisco to New York next week?” — and follow up with “What’s the weather there?” and “What should I pack?” — the chatbot can answer those questions because it is making connections between words to understand the context of the conversation. (The New York Times sued OpenAI and its partner, Microsoft, last year for using copyrighted news articles without permission to train chatbots.)

An older voice assistant like Siri, which reacts to a database of commands and questions that it was programmed to understand, would fail unless you used specific words, such as “What’s the weather in New York?” and “What should I pack for a trip to New York?”

The former conversation sounds more fluid, like the way people talk to each other.

A major reason people gave up on voice assistants like Siri and Alexa was that the computers couldn’t understand so much of what they were asked — and it was difficult to learn what questions worked.

Dimitra Vergyri, the director of speech technology at SRI, the research lab behind the initial version of Siri before it was acquired by Apple, said generative A.I. addressed many of the problems that researchers had struggled with for years. The technology makes voice assistants capable of understanding spontaneous speech and responding with helpful answers, she said.

John Burkey, a former Apple engineer who worked on Siri in 2014 and has been an outspoken critic of the assistant, said he believed that because generative A.I. made it easier for people to get help from computers, more of us were likely to be talking to assistants soon — and that when enough of us started doing it, that could become the norm.

“Siri was limited in size — it knew only so many words,” he said. “You’ve got better tools now.”

But it could be years before the new wave of A.I. assistants becomes widely adopted because they introduce new problems. Chatbots including ChatGPT, Google’s Gemini and Meta AI are prone to “hallucinations,” which is when they make things up because they can’t figure out the correct answers. They have goofed up at basic tasks like counting and summarizing information from the web.

When voice assistants help — and when they don’t

Even as speech technology gets better, talking is unlikely to replace traditional computer interactions with a keyboard, experts say.

People currently have compelling reasons to talk to computers in some situations when they are alone, like setting a map destination while driving a car. In public, however, not only can talking to an assistant still make you look weird, but more often than not, it’s impractical. When I was wearing the Meta glasses at a grocery store and asked them to identify a piece of produce, an eavesdropping shopper responded cheekily, “That’s a turnip.”

You also wouldn’t want to dictate a confidential work email around others on a train. Likewise, it’d be inconsiderate to ask a voice assistant to read text messages out loud at a bar.

“Technology solves a problem,” said Ted Selker, a product design veteran who worked at IBM and Xerox PARC. “When are we solving problems, and when are we creating problems?”

Yet it’s simple to come up with times when talking to a computer helps you so much that you won’t care how weird it looks to others, said Carolina Milanesi, an analyst at Creative Strategies, a research firm.

While walking to your next office meeting, it’d be helpful to ask a voice assistant to brief you on the people you were about to meet. While hiking a trail, asking a voice assistant where to turn would be quicker than stopping to pull up a map. While visiting a museum, it’d be neat if a voice assistant could give a history lesson about the painting you were looking at. Some of these applications are already being developed with new A.I. technology.

When I was testing some of the latest voice-driven products, I got a glimpse into that future. While recording a video of myself making a loaf of bread and wearing the Meta glasses, for instance, it was helpful to be able to say, “Hey, Meta, shoot a video,” because my hands were full. And asking Humane’s Ai Pin to dictate my to-do list was more convenient than stopping to look at my phone screen.

“While you’re walking around — that’s the sweet spot,” said Chris Schmandt, who worked on speech interfaces for decades at the Massachusetts Institute of Technology Media Lab.

When he became an early adopter of one of the first mobile phones about 35 years ago, he recounted, people stared at him as he wandered around the M.I.T. campus talking on the phone. Now this is normal.

I’m convinced the day will come when people occasionally talk to computers when out and about — but it will come very slowly.

by NYTimes