It could be the next best thing to learning a new language. Microsoft
researchers have demonstrated software that translates spoken English
into spoken Chinese almost instantly, while preserving the unique
cadence of the speaker’s voice—a trick that could make conversation more
effective and personal.
The first public demonstration was made by Rick Rashid,
Microsoft’s chief research officer, on October 25 at an event in
Tianjin, China. “I’m speaking in English and you’ll hear my words in
Chinese in my own voice,” Rashid told the audience. The system works by
recognizing a person’s words, quickly converting the text into properly
ordered Chinese sentences, and then handing those over to speech
synthesis software that has been trained to replicate the speaker’s
voice.
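In outline, that is a three-stage pipeline: recognition, translation, and voice-matched synthesis, run back to back. The sketch below is purely illustrative; Microsoft has not published an interface for the system, so every function here is a hypothetical, stubbed-out placeholder that shows only how data would flow between the stages.

```python
# A minimal sketch of the three-stage pipeline described above. Every
# function is a hypothetical stub -- Microsoft has not published an API
# for the demonstration system -- so only the flow of data is real.

def recognize_speech(audio: bytes) -> str:
    """Stage 1 (stub): turn English audio into English text."""
    return "how are you"  # placeholder transcript

def translate_text(text: str, source: str, target: str) -> str:
    """Stage 2 (stub): translate, reordering words into a properly
    ordered Chinese sentence rather than substituting word for word."""
    return "你好吗"  # placeholder translation

def synthesize_speech(text: str, voice: str) -> bytes:
    """Stage 3 (stub): read Chinese text aloud in the given voice."""
    return b""  # placeholder audio

def speech_to_speech(audio: bytes, voice: str) -> bytes:
    # The stages run in sequence, which is why a recognition error in
    # stage 1 propagates through translation and synthesis.
    english = recognize_speech(audio)
    chinese = translate_text(english, source="en", target="zh")
    return synthesize_speech(chinese, voice=voice)

audio_out = speech_to_speech(b"", voice="speaker-adapted")
```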
Video recorded by audience members has been circulating on Chinese
social media sites since the demonstration. Today, Rashid presented it to an English-speaking audience in a blog post that includes a video.
Microsoft first demonstrated technology that modifies synthesized speech to match a person’s voice earlier this year (see “Software Translates Your Voice Into Another Language”), but that system could only speak typed text. The software
requires about an hour of training to be able to synthesize speech in a
person’s voice, which it does by tweaking a stock text-to-speech model
so it makes certain sounds in the same way the speaker does.
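To make that mechanism concrete, here is a deliberately simplified toy version of the idea: start from a stock voice’s per-sound parameters and blend the sounds captured in the training hour toward the speaker’s measured values. The parameter names, numbers, and blending scheme are all invented for illustration; Microsoft has not described its adaptation algorithm in this detail.

```python
# A toy illustration of adapting a stock text-to-speech voice to a
# speaker. The single-number "parameters" and the linear blend are
# invented for illustration only; this is not Microsoft's algorithm.

STOCK_VOICE = {"ah": 1.00, "ee": 1.00, "sh": 1.00}  # hypothetical per-sound settings

def adapt_voice(stock, speaker_samples, weight=0.8):
    adapted = dict(stock)
    for sound, measured in speaker_samples.items():
        # Pull each sound the speaker was recorded making toward the
        # measured value; sounds missing from the recordings keep the
        # stock setting.
        adapted[sound] = (1 - weight) * stock[sound] + weight * measured
    return adapted

# Roughly an hour of recordings yields measurements for the sounds the
# speaker actually produced (these values are made up).
personal_voice = adapt_voice(STOCK_VOICE, {"ah": 1.12, "sh": 0.91})
# -> "ah" and "sh" are nudged toward the speaker; "ee" stays stock.
```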
AT&T has previously shown a live translation system for Spanish and English (see “AT&T Wants to Put Your Voice in Charge of Apps”),
and Google is known to have built its own experimental live
translators. However, the prototypes developed by these companies do not
have the ability to make synthesized speech match the sound of a
person’s voice.
The Microsoft system is a demonstration of the
company’s latest speech-recognition technology, which is based on
learning software modeled on how networks of brain cells operate. In a
blog post about the demonstration system, Rashid says that switching to
that technology has allowed for the most significant jump in recognition
accuracy in decades. “Rather than having one word in four or five
incorrect, now the error rate is one word in seven or eight,” he wrote.
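In word-error-rate terms, one wrong word in four or five is an error rate of 20 to 25 percent, and one in seven or eight is roughly 13 to 14 percent, a relative reduction of well over a third. A quick check of the arithmetic (the pairings are for illustration only):

```python
# Converting Rashid's "one word in N" figures to word error rates (WER)
# and the implied relative reduction.
for old_n, new_n in [(4, 7), (5, 8)]:
    old_wer, new_wer = 1 / old_n, 1 / new_n
    drop = 1 - new_wer / old_wer
    print(f"1-in-{old_n} ({old_wer:.1%}) -> 1-in-{new_n} ({new_wer:.1%}): "
          f"{drop:.0%} relative reduction")
# 1-in-4 (25.0%) -> 1-in-7 (14.3%): 43% relative reduction
# 1-in-5 (20.0%) -> 1-in-8 (12.5%): 38% relative reduction
```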
Microsoft
is not alone in looking to neural networks to improve speech
recognition. Google recently began using its own neural network-based
technology in its voice recognition apps and services (see “Google Puts Its Virtual Brain Technology to Work”). Adopting this approach cut word error rates by 20 to 25 percent, Google’s engineers say.
Rashid
told MIT Technology Review by e-mail that he and the researchers at
Microsoft Research Asia, in Beijing, have not yet used the system to
have a conversation with anyone outside the company, but the public
demonstration has provoked strong interest.
“What I’ve seen is
some combination of excitement, astonishment, and optimism about the
future that the technology could bring,” he says.
Rashid says the
system is far from perfect, but notes that it is good enough to allow
communication where none would otherwise be possible. Engineers working
on the neural network-based approach at Microsoft and Google are
optimistic they can wring much more power out of the technique, since it
is only just being deployed.
“We don’t yet know the limits on
accuracy of this technology—it is really too new,” says Rashid. “As we
continue to ‘train’ the system with more data, it appears to do better
and better.”