An Inside Look at Amazon’s Ambitions to Make Alexa Sound Human

June 21, 2018 - 7 minutes read

Amazon’s Alexa is ahead of its competitors, but Amazon still has a long way to go in making it sound and “act” more human. Improving Alexa’s ability to quip a funny response and to hold an actual conversation is Amazon’s top priority right now for its best-selling artificial intelligence (AI)-enabled voice assistant.

The Alexa Prize

The Seattle-based tech giant holds a now-annual developer “hackathon” lasting nine months, named the Alexa Prize. Several universities from across the world will compete for first place, which will be awarded this November. Rohit Prasad is chief scientist for machine learning at Alexa. He hopes the Alexa Prize will accelerate machine learning and conversational AI development.

Amazon gives each team a toolkit of technology they should use: free storage and computing power on Amazon Web Services, stacks of training data from millions of Alexa units, and basic speech recognition tools. The bots are already released for users to test out; saying “Alexa, let’s chat” will randomly pick one bot for you to chat with. And you can even leave feedback for the chatbot for improvement.

The grand prize is a $1 million research grant. It is a difficult milestone to reach: the social bot must “converse coherently and engagingly with humans for 20 minutes with a 4.0 or higher rating.” Last year’s winner, the University of Washington’s bot, conversed for 10 minutes, which did not meet the $1 million award criteria.

Can a Candid Conversation Be Coded?

Prasad says, “Every technology built as part of the Alexa Prize is applicable to Alexa.” Naturally, this means that Amazon’s keeping a close eye on the university contestants to snag new employees, among other things. Most of the participants don’t mind, though; one researcher says, “It’s cheap for them, but also great for us.”

The Amazon executive wants Alexa to evolve to the point where she can hold fluid conversations, seamlessly going back and forth with you about LeBron’s play in the last NBA Finals. Alexa should become like “a friend, a companion,” says Prasad.

But since the best technology today only holds a seamless conversation for 10 minutes, we’ve got a long way to go before Alexa’s keeping the baby company for a few minutes while you hop in the shower.

Which Way Is Right?

There are two well-known methods for approaching the problem. One is using deep learning, a subset of machine learning. But even though this is the first approach to come to mind for most researchers, it’s not the most effective by a long shot. One participant elaborates, “Everyone starts with machine learning, and eventually, everyone realizes it doesn’t really work.”

The complexity of speech becomes glaringly obvious as researchers get deeper into their machine learning algorithms. Machine learning is better suited to simpler abstractions, like figuring out whether an image shows a dolphin or a whale.

The second approach is more manual: by writing specific response templates and rules for choosing which response to give, a team can “hardcode” its way to a basic level of conversation. One of the biggest hurdles for such a bot, however, is chatting about current events, recent sports developments, and local news.
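To make the template-and-rule idea concrete, here is a minimal Python sketch. It is not any team's actual system; the rules, patterns, and the `respond` function are hypothetical and chosen only for illustration. Each rule pairs a pattern with a response template, and the first matching rule wins.

```python
import re

# Hypothetical hand-written rules: each pairs a regex with a response template.
# Captured groups from the pattern fill the {0} placeholder in the template.
RULES = [
    (re.compile(r"\bmy name is (\w+)", re.IGNORECASE),
     "Nice to meet you, {0}!"),
    (re.compile(r"\bi (?:like|love) ([^.!?]+)", re.IGNORECASE),
     "What do you enjoy most about {0}?"),
    (re.compile(r"\b(?:hello|hi|hey)\b", re.IGNORECASE),
     "Hello! What would you like to talk about?"),
]

# Used when no rule matches, e.g. for topics nobody wrote a rule for.
FALLBACK = "Tell me more about that."


def respond(utterance: str) -> str:
    """Return the filled-in template of the first rule that matches."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(*match.groups())
    return FALLBACK


if __name__ == "__main__":
    for line in ["Hi, my name is Ada", "I like basketball.",
                 "Who won the game last night?"]:
        print(line, "->", respond(line))
```

The last example falls through to the generic fallback, which illustrates the hurdle described above: a hardcoded bot has nothing to say about a recent game unless someone has written a rule for it.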

But there is another way.

Why Not Both?

Combining the two basic approaches sets the AI up with foundational knowledge and pushes it to learn more on its own with machine learning. Sweden’s Royal Institute of Technology (KTH) used Amazon’s Mechanical Turk to outsource human labor in identifying responses for the chatbot. These responses train the chatbot to eventually become self-reliant in finding its own responses.

“Over time, we’ll develop more and more intelligent strategies to traverse the tree,” says Ulme Wennberg, one of the participants on the KTH team. “To be able to understand what we have been talking about, what you want to talk about, what we should talk about.”

The team’s even going as far as to create a persona from an amalgam of the most popular celebrity personality types. The researchers believe this will make the voice assistant more likable.

Context Is the Foundation of Conversation

Across the world, in Utah, researchers at Brigham Young University (BYU) turned to their classmates instead. The team started a “Chit-Chat Challenge” that called for conversation transcripts on any subject. Prizes were awarded for length and originality, and winners took home an iPad and a MacBook Pro.

The bot was originally trained on Reddit, but its tone quickly became confrontational. Researcher Nancy Fulda says, “The internet gives us loads of text, but none of it is a real analog for human conversation.”

The AI bot, dubbed “Eve”, will eventually create its own dialogue using a technique called word embedding, which calculates the context of one word in relation to others. With recent advances in computing, word embedding now works with whole sentences as well, creating the building blocks of conversation.
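As a rough illustration of how embeddings relate words to one another, the sketch below compares vectors with cosine similarity. Everything in it is an assumption made for the example: the three-dimensional vectors are hand-made toys (real embeddings are learned from large text collections and are far larger), and averaging word vectors is just one simple way to get a sentence vector, not necessarily what the BYU team does.

```python
import math

# Toy, hand-made vectors for illustration only. Words used in similar
# contexts are given similar vectors.
EMBEDDINGS = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.0],
    "cat": [0.7, 0.3, 0.1],
    "car": [0.0, 0.1, 0.9],
    "truck": [0.1, 0.0, 0.8],
}


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sentence_vector(sentence):
    """Average the vectors of the known words in a sentence."""
    vectors = [EMBEDDINGS[w] for w in sentence.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0, 0.0]  # no known words: return a zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]


if __name__ == "__main__":
    print(cosine(EMBEDDINGS["dog"], EMBEDDINGS["puppy"]))  # close to 1
    print(cosine(EMBEDDINGS["dog"], EMBEDDINGS["car"]))    # close to 0
    print(cosine(sentence_vector("my dog"), sentence_vector("a puppy")))
```

With vectors like these, a bot could score candidate replies by how close they sit to what the user just said, rather than relying on exact word matches.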

Tomorrow’s Technology Today

This approach by BYU and the attack plan of KTH, with all of their nuances, point to one strong message: there are infinite ways to program a voice assistant’s personality, tone of voice, and eloquence.

Amazon’s excited about the possibilities and what they mean for AI, Alexa, and Amazon’s lineup of devices. “I firmly believe that ambient computing is here to stay. And I wouldn’t have said that a year ago. We were still not there,” says David Limp, Amazon’s head of devices.

Amazon has already secured itself a healthy lead in becoming the voice assistant of the near future, and once these innovations are implemented, you’ll be able to have a nice chat as well.