PROBLEMS OF PERCEPTION FOR A UNIVERSAL TRANSLATOR
Captain James T. Kirk looks into a green, cloudy atmosphere and becomes alarmed as a large green reptilian beast with a crystal-looking knife advances toward him. The beast speaks, and Kirk's universal translator takes the sounds and changes them into English: "Hello, Captain, it is a pleasure to meet you."
The idea of a man-made universal translator has been around for many years, but with the technology available today it comes closer to reality every day. What will such a translator need in order to understand speakers across different languages and cultures? This paper addresses the perceptual problems associated with building a universal translator.
When you are speaking with another person, do you feel that speech and the auditory senses are the only things needed to understand what that person is saying? Research shows that while the sounds of speech are important, they are not sufficient on their own for understanding what is being said. Indeed, when communicating face to face, the words that are heard may be secondary to the other cues we use to understand the person we are communicating with.
A universal translator would be a device that takes input from all of the senses a human listener uses when attending to a speaker, and then combines that information to understand what the speaker is saying.
The first sense that a universal translator would have to emulate is the most obvious one: listening to auditory stimuli. It would need a microphone that picks up sounds across the acoustic frequency range of 20 to 20,000 Hz, the range of human hearing. This microphone would also have to register sound pressure levels from 0 to about 120 decibels, as humans do. Particular emphasis would need to be placed on differentiating sounds in roughly 10-decibel steps, because an increase of about 10 decibels is perceived by a listener as a doubling of loudness; this is important for judging the distances and locations of sound sources. Another aspect of human hearing is that we seem to divide the audible range of 20 to 20,000 Hz into roughly 24 frequency bands, known as critical bands (octave bands are a coarser grouping used in noise measurement). The strength of an auditory stimulus can be placed in these different bands: a given sound pressure corresponds to one sound, but the pattern of energy across all the bands helps the listener identify the sound being attended to. Human hearing breaks these band signals down and forms auditory objects, or streams, grouping sounds in a way that distinguishes the sound we are listening to from background noise.
(Environmental Noise: http://www.epd.gov.hk/epd/noise_education/web/ENG_EPD_HTML/m1/intro_3.html)
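To make the decibel relationships above concrete, the short sketch below (a minimal illustration, not taken from any cited system) converts a sound pressure into a decibel level and estimates the perceived-loudness ratio using the rule of thumb that every 10 dB sounds about twice as loud; the 20-micropascal reference is the standard threshold of hearing.

import math

P_REF = 20e-6  # reference sound pressure (20 micropascals), approximate threshold of hearing

def sound_pressure_level(pressure_pa: float) -> float:
    """Convert a sound pressure in pascals to a level in decibels (dB SPL)."""
    return 20.0 * math.log10(pressure_pa / P_REF)

def relative_loudness(level_db: float, reference_db: float) -> float:
    """Rough perceived-loudness ratio: each 10 dB increase is heard as about twice as loud."""
    return 2.0 ** ((level_db - reference_db) / 10.0)

if __name__ == "__main__":
    quiet = sound_pressure_level(2e-3)   # about 40 dB, a quiet room
    louder = sound_pressure_level(2e-2)  # about 60 dB, normal conversation
    print(f"{quiet:.0f} dB vs {louder:.0f} dB -> "
          f"about {relative_loudness(louder, quiet):.0f}x as loud")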
An example of how our translator could achieve this is already being developed by Microsoft. Microsoft is building a translator that converts spoken Chinese into English and vice versa. It takes the incoming audio and breaks it into smaller groups, differentiates the voice and frequency characteristics of the speaker, and produces a text transcript of what was said. This text is then segmented and translated. Finally, the translator uses the same speaker's grouping, pitch, and frequency to speak the translation aloud in a voice resembling the original speaker's. This breakdown of the speaker's voice, pitch, and frequency into auditory bands works much as human hearing does. (Microsoft demos instant English-Chinese translation: http://www.bbc.co.uk/news/technology-20266427)
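The stages described above can be sketched as a simple speech-to-speech pipeline. The sketch below is only an illustration of that flow; the recognizer, translator, and synthesizer are hypothetical stand-ins, not Microsoft's actual components.

from dataclasses import dataclass

@dataclass
class VoiceProfile:
    """Speaker characteristics preserved across the translation (pitch, speaking rate, etc.)."""
    pitch_hz: float
    speaking_rate: float

# The three stages below are hypothetical stand-ins for the real components.
def recognize_speech(audio: bytes) -> tuple:
    """Stage 1: turn audio into text and extract the speaker's voice profile."""
    return "ni hao", VoiceProfile(pitch_hz=180.0, speaking_rate=4.2)

def translate_text(text: str, source: str, target: str) -> str:
    """Stage 2: translate the transcript between languages (toy lookup table)."""
    lookup = {("zh", "en", "ni hao"): "hello"}
    return lookup.get((source, target, text), text)

def synthesize_speech(text: str, voice: VoiceProfile) -> bytes:
    """Stage 3: speak the translation using the original speaker's voice profile."""
    return f"<audio:{text}@{voice.pitch_hz}Hz>".encode()

def translate_utterance(audio: bytes) -> bytes:
    text, voice = recognize_speech(audio)
    translated = translate_text(text, source="zh", target="en")
    return synthesize_speech(translated, voice)

print(translate_utterance(b"..."))  # b'<audio:hello@180.0Hz>'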
The next thing our universal translator will need is a video camera with face-finding software. The reason is that humans, much like these new cameras, have a heightened ability to detect faces. The fusiform face area (FFA) in humans can pick faces out of sensory stimuli faster than any other kind of stimulus, and our survival as a species has hinged on this ability to recognize faces in an instant. Indeed, individual neurons have been found that respond selectively to face stimuli. (The Cognitive and Neural Development of Face Recognition in Humans: McKone E., Crookes K., Kanwisher N., 2009)
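As a rough analogue of this face-finding step, a translator prototype could start from an off-the-shelf detector. The sketch below uses OpenCV's bundled Haar-cascade face detector on a single camera frame; the camera index and frame handling are assumptions made for illustration.

import cv2  # OpenCV: pip install opencv-python

# Load the frontal-face Haar cascade that ships with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_detector = cv2.CascadeClassifier(cascade_path)

# Grab one frame from the default camera (index 0 is an assumption).
camera = cv2.VideoCapture(0)
ok, frame = camera.read()
camera.release()

if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Returns one (x, y, width, height) box per detected face.
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    print(f"Found {len(faces)} face(s):", list(faces))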
Our translator will also need to find faces and read lips as the speaker talks. It has been shown that the average listener attends to the lips and face as much as, or more than, the sounds when a person is speaking; this phenomenon is known as audiovisual speech perception. Our translator will have to break down visual cues the way a lip reader does in order to work out exactly what is being said. It will also need to visually attend to the person speaking, as humans do: vision in humans takes precedence over auditory stimuli in perceiving where sounds originate. Much like a movie theater with speakers on the side walls and a screen on the center wall, we as movie watchers attend to the actor on the screen as the source of the sound, even though the sound is actually coming from the speakers to the left and right of the screen. This is known as visual capture. (Sensation and Perception: Goldstein, 2010, p. 306)
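One simple way a translator could combine the audio and lip-reading channels is to fuse each channel's confidence in the candidate words. The sketch below is a minimal, hypothetical late-fusion scheme (weighted log-probability averaging); the weights and candidate scores are made-up illustration values, not measurements from any cited system.

import math

def fuse_scores(audio_probs: dict, visual_probs: dict, audio_weight: float = 0.6) -> str:
    """Pick the candidate whose weighted log-probability across both channels is highest."""
    visual_weight = 1.0 - audio_weight
    best, best_score = None, float("-inf")
    for candidate in audio_probs:
        score = (audio_weight * math.log(audio_probs[candidate]) +
                 visual_weight * math.log(visual_probs[candidate]))
        if score > best_score:
            best, best_score = candidate, score
    return best

# Audio alone slightly prefers "bat", but the lip shape strongly suggests "mat".
audio = {"bat": 0.55, "mat": 0.45}
visual = {"bat": 0.10, "mat": 0.90}
print(fuse_scores(audio, visual))  # mat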
One of the challenges humans have to deal with when a person is speaking is called variability. Variability in human voices, pitches, frequencies, and other factors poses a very large problem for understanding what is said: one sound stimulus can sound very much like another. To deal with this, humans have developed filters that narrow down what the stimulus can mean. The first of these tools is context. Context is a process in which sounds are grouped together as a person speaks. By judging the sounds being emitted and comparing them with the other sounds being spoken, the context of the speech, together with the lips being attended to, narrows the perception of the sounds actually being produced. Thus, from the context of what is being attended to, we can perceive what is actually being said even though, because of variability, the sound stimulus may differ considerably from the canonical form of the word. (Stimulus variability and spoken word recognition. I. Effects of variability in speaking rate and overall amplitude: http://www.ncbi.nlm.nih.gov/pubmed/7962998)
Our translator will also have to use its video system and microphones, when processing its text, to evaluate the context of the message and ensure that meanings are clear.
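A very small illustration of using context as a filter: given several acoustically plausible transcriptions, a translator could score each against a simple language model built from word-pair counts and keep the one that best fits the surrounding words. The corpus and candidates below are toy data invented for the example.

from collections import Counter

# Toy "language model": bigram counts from a tiny corpus of prior text.
corpus = "it is a pleasure to meet you it is nice to meet you".split()
bigram_counts = Counter(zip(corpus, corpus[1:]))

def context_score(sentence: str) -> int:
    """Score a candidate transcription by how often its word pairs appear in the corpus."""
    words = sentence.split()
    return sum(bigram_counts[pair] for pair in zip(words, words[1:]))

# Two acoustically similar candidates; context picks the one that fits prior speech.
candidates = ["a pleasure to meat you", "a pleasure to meet you"]
print(max(candidates, key=context_score))  # a pleasure to meet you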
Another form of variability is the pronunciation of words and the speed at which they are spoken. Individual voices are very different: some are high, others low. Frequency, pitch, loudness, and speed are all factors we must discriminate among to work out exactly what is being said. One tool used to study this problem is the spectrograph, or sonogram: a visual representation of an acoustic signal that separates its amplitudes and frequencies on a graph. This graph can be stored as a digital signal, which our electronic translator could compare against a large digital library of sounds and words in many languages. Microsoft's Chinese translation software already monitors the cadence and intonation of speakers so that the translation is spoken in the speaker's own voice, which is a great help with the problems posed by individual speakers. (Microsoft demos instant English-Chinese translation: http://www.bbc.co.uk/news/technology-20266427)
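As a concrete illustration of the spectrograph idea, the sketch below computes a spectrogram of a synthetic two-tone signal with SciPy and pulls out the dominant frequency over time; the sample rate and tone frequencies are arbitrary values chosen for the example, not speech data.

import numpy as np
from scipy import signal

fs = 16_000                        # sample rate in Hz (typical for speech)
t = np.arange(0, 1.0, 1 / fs)      # one second of samples

# Synthetic stand-in for speech: a 300 Hz tone followed by a 1200 Hz tone.
audio = np.concatenate([np.sin(2 * np.pi * 300 * t[: fs // 2]),
                        np.sin(2 * np.pi * 1200 * t[fs // 2:])])

# Spectrogram: frequency bins, time frames, and power in each (frequency, time) cell.
freqs, times, power = signal.spectrogram(audio, fs=fs, nperseg=512)

# The dominant frequency per frame is the kind of feature a translator could
# match against a stored library of sounds and words.
dominant = freqs[np.argmax(power, axis=0)]
print(dominant[:3], dominant[-3:])  # roughly 300 Hz early, roughly 1200 Hz late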
Speech therapists are already using spectrographic data to devise new ways of understanding the acoustic structure of particular words. These signals are being digitized and categorized to form a database of individual words. (Speech Disorder and Spectrograms: http://www.smilecastcommunications.com/spectrograms.html)
Summary:
The work of figuring out what a universal translator needs is currently being carried out by many different organizations, and we have looked at some of the perception problems such a translator will have to address. We have shown that our translator must cover the human range of hearing, and that within this range certain acoustic bands are part of a process that turns auditory cues into perceptual objects and streams, which help us understand what is being said. We have shown that vision can matter as much as the words themselves, in a process known as audiovisual speech perception, and that our translator will need a video camera and the computation to exploit this phenomenon, as well as to identify who is speaking, through the process of visual capture. Finally, we discussed the problem of variability, how context serves as a filter for understanding words, and the problem of individual speech patterns, and how current work on electronic translators and on speech therapy with spectrograms is leading toward a working translator.