A demonstration I gave in Tianjin, China at Microsoft Research Asia’s
21st Century Computing event has started to generate a bit of
attention, and so I wanted to share a little background on the history
of speech-to-speech technology and the advances we’re seeing today.
In the realm of natural user interfaces, the single most important one, yet also one of the most difficult for computers, is human speech.
For the last 60 years, computer scientists have been working to build
systems that can understand what a person says when they talk.
In the beginning, the approach used could best be described as simple
pattern matching. The computer would examine the waveforms produced by
human speech and try to match them to waveforms that were known to be
associated with particular words.
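To give a flavor of what that template matching looked like, here is a toy sketch in Python. The waveforms, the word templates, and the fixed-length resampling are hypothetical stand-ins; the systems of that era were far cruder, but the basic idea of scoring an utterance against stored examples is the same.

```python
# Toy illustration of early template matching: compare an incoming
# waveform against stored word templates and pick the closest match.
# (Hypothetical data layout; not how historical systems were built.)
import numpy as np

def normalize(wave: np.ndarray, length: int = 8000) -> np.ndarray:
    """Resample to a fixed length and scale to unit energy."""
    resampled = np.interp(
        np.linspace(0, len(wave) - 1, length), np.arange(len(wave)), wave
    )
    return resampled / (np.linalg.norm(resampled) + 1e-9)

def match_word(utterance: np.ndarray, templates: dict[str, np.ndarray]) -> str:
    """Return the template word whose waveform is most similar (cosine score)."""
    u = normalize(utterance)
    scores = {word: float(np.dot(u, normalize(t))) for word, t in templates.items()}
    return max(scores, key=scores.get)
```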
While this approach sometimes worked, it was extremely fragile.
Everyone’s voice is different, and even the same person can say the same word in different ways. As a result, these early systems were not really usable for practical applications.
In the late 1970s, a group of researchers at Carnegie Mellon University made a significant breakthrough in speech recognition using a technique called hidden Markov modeling, which allowed them to use training data from many speakers to build statistical speech models that were much more robust. As a result, speech systems have gotten better and better over the last 30 years. In the last 10 years, the combination of better methods, faster computers, and the ability to process dramatically more data has led to many practical uses.
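For readers who want a concrete picture, the dynamic-programming step at the heart of an HMM recognizer is Viterbi decoding: given per-frame acoustic likelihoods, it finds the most probable sequence of hidden states (for example, phonemes). The sketch below is a minimal, illustrative version; the statistical models it assumes are the part trained from many speakers’ data, which is what made the approach robust.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emis):
    """Most likely hidden-state path through an HMM.

    log_init:  (S,)   log prior over states
    log_trans: (S, S) log transition probabilities
    log_emis:  (T, S) log likelihood of each observed frame under each state
    """
    T, S = log_emis.shape
    score = log_init + log_emis[0]          # best log-prob of paths ending in each state
    back = np.zeros((T, S), dtype=int)      # backpointers
    for t in range(1, T):
        cand = score[:, None] + log_trans   # cand[i, j]: come from state i into state j
        back[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0) + log_emis[t]
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):           # walk the backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```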
Today, if you call a bank in the US, you are almost certainly talking to a computer that can answer simple questions about your account and connect you to a real person if necessary. Several products on the market today, including Xbox Kinect, use speech input to provide simple answers or navigate a user interface. In fact, our Microsoft Windows and Office products have included speech recognition since the late 1990s. This functionality has been invaluable to our customers with accessibility needs.
Until recently, though, even the best speech systems still had word error rates of 20-25% on arbitrary speech.
Just over two years ago, researchers at Microsoft Research and the
University of Toronto made another breakthrough. By using a technique called Deep Neural Networks, which is patterned after the behavior of the human brain, researchers were able to train speech recognizers that are more discriminative and more accurate than those built with previous methods.
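As a rough sketch of what such an acoustic model looks like, the toy PyTorch snippet below maps a window of acoustic feature frames to context-dependent HMM states (senones) and trains the network discriminatively with cross-entropy against frame-level labels. The dimensions, layer sizes, and helper names are illustrative placeholders, not the configuration of the system described in this post.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 11 stacked frames of 40 filterbank features in,
# a few thousand senone (tied HMM-state) classes out.
FEATURE_DIM, CONTEXT, NUM_SENONES = 40, 11, 3000

acoustic_model = nn.Sequential(
    nn.Linear(FEATURE_DIM * CONTEXT, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, NUM_SENONES),           # logits over senones
)

optimizer = torch.optim.SGD(acoustic_model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

def train_step(frames: torch.Tensor, senone_labels: torch.Tensor) -> float:
    """One discriminative update: frames (batch, FEATURE_DIM*CONTEXT), labels (batch,)."""
    optimizer.zero_grad()
    loss = loss_fn(acoustic_model(frames), senone_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```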
During my October 25 presentation in China, I had the opportunity to
showcase the latest results of this work. We have been able to reduce
the word error rate for speech by over 30% compared to previous methods.
This means that rather than one word in 4 or 5 being incorrect, the error rate is now one word in 7 or 8. While still far from perfect, this is the most dramatic change in accuracy since the introduction of hidden Markov modeling in 1979, and as we add more data to the training, we believe we will get even better results.
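For anyone who wants to check the arithmetic, word error rate is simply the word-level edit distance between the recognizer’s output and a reference transcript, divided by the length of the reference; a 20% rate cut to 14% is a 30% relative reduction. A minimal sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def relative_reduction(old_wer: float, new_wer: float) -> float:
    # e.g. relative_reduction(0.20, 0.14) is about 0.30, a 30% relative cut
    return (old_wer - new_wer) / old_wer
```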
Machine translation of text is similarly difficult. As with speech, the research community has been working on translation for the last 60 years, and the introduction of statistical techniques and Big Data has revolutionized machine translation over the last few years. Today, millions of people use products like Bing Translator to translate web pages from one language to another.
In my presentation, I showed how we take the text that represents my
speech and run it through translation, in this case turning my English
into Chinese in two steps. The first takes my words and finds the
Chinese equivalents, and while non-trivial, this is the easy part. The
second reorders the words to be appropriate for Chinese, an important
step for correct translation between languages.
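Here is a deliberately toy sketch of those two steps. The tiny dictionary and the single reordering rule are hypothetical stand-ins for the phrase tables and reordering models that a real statistical system learns from large amounts of parallel text, but they show the shape of the process: look up, then reorder.

```python
# Step 1: look up target-language equivalents.  Step 2: reorder them
# for the target language's word order.  Both the lexicon and the rule
# below are hypothetical; real systems learn them from parallel corpora.
LEXICON = {
    "i": "我", "love": "爱", "speech": "语音", "research": "研究",
    "very": "非常", "much": "",   # "very much" collapses to one adverb
}

def lookup(english_words):
    """Step 1: substitute each English word with a Chinese equivalent."""
    return [LEXICON.get(w.lower(), w) for w in english_words]

def reorder(words):
    """Step 2 (toy rule): Chinese places the adverb before the verb."""
    if "非常" in words:
        words = [w for w in words if w != "非常"]
        if "爱" in words:
            words.insert(words.index("爱"), "非常")
    return [w for w in words if w]   # drop empty placeholders

print("".join(reorder(lookup("I love speech research very much".split()))))
# -> 我非常爱语音研究 ("I love speech research very much")
```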
Of course, there are still likely to be errors in both the English
text and the translation into Chinese, and the results can sometimes be
humorous. Still, the technology has developed to be quite useful.
Most significantly, we have attained an important goal by enabling an English speaker like me to present in Chinese in his or her own voice, which is what I demonstrated in China. It required a text-to-speech system that Microsoft researchers built using a few hours of speech from a native Chinese speaker, along with properties of my own voice taken from about one hour of pre-recorded (English) data, in this case recordings of previous speeches I’d made.
Though it was a limited test, the effect was dramatic, and the
audience came alive in response. When I spoke in English, the system
automatically combined all the underlying technologies to deliver a
robust speech-to-speech experience: my voice speaking Chinese. You can
see the demo in the video above.
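Conceptually, that combination is just the three stages described above chained together: recognize the English speech, translate the text, then synthesize Chinese audio in the speaker’s voice. The schematic below uses placeholder function names (not an actual Microsoft API) purely to show how the pieces compose.

```python
# Schematic speech-to-speech pipeline: recognize -> translate -> synthesize.
# Each function is a placeholder for the components described above.

def recognize_speech(english_audio: bytes) -> str:
    """DNN-HMM recognizer: English audio in, English text out."""
    raise NotImplementedError

def translate(english_text: str) -> str:
    """Statistical MT: word/phrase lookup, then reordering for Chinese."""
    raise NotImplementedError

def synthesize(chinese_text: str) -> bytes:
    """TTS adapted to the speaker's voice: Chinese text in, audio out."""
    raise NotImplementedError

def speech_to_speech(english_audio: bytes) -> bytes:
    """Compose the three stages into one live speech-to-speech step."""
    return synthesize(translate(recognize_speech(english_audio)))
```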
The results are still not perfect, and there is still much work to be
done, but the technology is very promising, and we hope that in a few
years we will have systems that can completely break down language
barriers.
In other words, we may not have to wait until the 22nd century for a usable equivalent of Star Trek’s
universal translator, and we can also hope that as barriers to
understanding language are removed, barriers to understanding each other
might also be removed. The cheers from the crowd of 2,000 mostly Chinese students, and the commentary that has grown on China’s social media forums ever since, suggest a growing community of budding computer scientists who feel the same way.