[Introduction] After a stroke left her paralyzed, Ann lost the ability to speak for 18 years. A brain-computer interface and a digital avatar have now allowed her to “speak” again, facial expressions included.
On the same day, Nature published two brain-computer interface studies with the potential to change the entire human race!
At 30, a devastating stroke left Ann, a Canadian woman who is now 47, almost entirely paralyzed, and she has been unable to speak for 18 years.
Fortunately, a team from the University of California, San Francisco (UCSF) developed a new brain-computer interface (BCI) that lets Ann control a “digital avatar” and start “talking” again.
“I think you are wonderful.” For Ann, “speaking” these words took more than a decade.

It is worth mentioning that the avatar’s facial expressions are driven by the same technology used in “The Last of Us 2”.
Specifically, the researchers implanted a thin array of electrodes on the surface of Ann’s brain.

When Ann tries to speak, the BCI intercepts her brain signals and converts them into words and sounds. Rather than decoding whole words, the AI decodes phonemes.
With the new BCI, Ann can communicate at 78 words per minute, far faster than the 14 words per minute she managed with her previous assistive device.

As its title indicates, the study covers both speech decoding and digital-avatar control, which sets it apart from previous work.
The new BCI animates the digital avatar with facial expressions, reproducing the details of natural human communication.

Paper address: https://www.nature.com/articles/s41586-023-06443-4
The breakthrough research was published in Nature on August 23. For the first time, speech and facial movements have been synthesized directly from brain signals, marking a significant leap forward in brain-computer interfaces.
Another study published in Nature the same day focuses on a brain-computer interface that converts speech-related neural activity into text.
According to its findings, a paralyzed patient could communicate at 62 words per minute, 3.4 times faster than in previous research.

Paper address: https://www.nature.com/articles/s41586-023-06377-x
Both new studies dramatically improve the speed of converting brain signals into text, and the first also lets a virtual avatar speak and emote like a human.
Brain-computer interfaces like these bring human beings a step closer to “mechanical ascension.”

When the first sentence came out, she smiled happily.
At thirty, life still held many surprises waiting to be revealed.
She was a high school math teacher in Canada, standing in front of a classroom every day, and the world seemed full of opportunities.
However, a sudden stroke instantly caused her to lose control of all her muscles, and she couldn’t breathe.
Since then, she has never said a word.

The most direct consequence of the stroke was that she lost control of her facial muscles, leaving her face paralyzed and unable to produce speech.
For the next five years, Ann often tossed and turned at night, fearing that she would die in her sleep.
After years of physical therapy, some initial results appeared.
She regained limited facial expressions and some head and neck movement, but she still could not activate the facial muscles needed for speech.
For this reason, she also underwent brain-computer interface surgery.
But the BCI technology of the time was not advanced enough for Ann to communicate quickly and easily; it could not translate her brain signals into fluent speech.
Moving her head slightly, Ann slowly typed on a computer screen through an assistive device: “Overnight, everything was taken away from me.”

In 2022, Ann decided to try again and volunteered to be a subject of the University of California research team.
Adding a face and a voice
The researchers analyzed Ann’s brain signals as she attempted to speak words. They used this data to train AI algorithms to recognize different speech signals.
Notably, the AI was trained to decode phonemes, the basic building blocks of speech, rather than entire words, making it three times faster and more versatile.
To do this, the team implanted a paper-thin rectangle of 253 electrodes on the surface of Ann’s brain.
A cable then plugs into a port fixed on Ann’s head, connecting the electrodes to a set of computers.
Ann’s speech can now be transcribed at nearly 80 words per minute, far faster than with her previous device.
Using video footage of Ann’s wedding in 2005, the research team used artificial intelligence to reconstruct her unique intonation and accent.
They then used software developed by Speech Graphics to create a digital avatar that mirrors Ann’s facial expressions in real time.
The software matches the signals from Ann’s brain as she tries to speak and translates them into the avatar’s facial movements.
These include the jaw opening and closing, the lips pursing and stretching, the tongue moving up and down, and happy, sad, and surprised expressions.
When Ann tries to speak, the digital avatar seamlessly animates and says what she wants.

“The Last of Us 2” and “Halo: Infinite” also use Speech Graphics’ facial animation technology to deliver realistic and varied facial expressions for their characters.

Michael Berger, CTO and co-founder of Speech Graphics, said:
Creating a digital avatar that can speak, display emotion, and connect to a person’s brain in real time goes far beyond the use of this technology in video games.
Restoring speech alone is impressive, but facial communication is an inherent part of being human, and giving the patient that remarkable ability back matters just as much.
This UCSF research is a breakthrough in BCI technology and a source of hope for countless people with severe disabilities.
It gives Ann, and many others who cannot speak because of paralysis, a path back to independence and self-expression.
Ann’s daughter was only 13 months old when the stroke struck; the BCI breakthrough means she may finally hear something like the mother’s voice she has never known.

According to reports, the next BCI version they developed is wireless, eliminating the need to connect to a physical system.
Edward Chang, who led the experiment at the University of California, has spent more than a decade advancing brain-computer interface technology.

In 2021, he and his research team developed a “speech neuroprosthesis” that allowed a severely paralyzed man to communicate in complete sentences.
That technology translated his brain signals into text, the first time scientists had decoded the attempted speech of a paralyzed person into whole words.
So how was the technology that lets Ann “speak” actually realized?
Technical realization
At UCSF, Dr. Edward Chang and his research team performed surgery on Ann, implanting a 253-channel electrode array over the area of her brain responsible for speech.

The electrodes monitor neural signals and transmit them through a cable port in the skull to a stack of processors running machine-learning models.
Ann spent several weeks working with the team to train an AI algorithm to recognize the patterns in her brain’s neural signals associated with speech.
She repeated different phrases from a 1,024-word conversational vocabulary over and over until the computer learned the patterns of brain activity behind each of its sounds.
Instead of training the AI to recognize whole words, the researchers created a system that decodes words from smaller phoneme components. Phonemes form spoken language in the same way that letters form written words. For example, “Hello” contains four phonemes: “HH”, “AH”, “L”, and “OW”.
Using this method, a computer only needs to learn 39 phonemes to decipher any word in the English language. This both increases the accuracy of the system and triples its speed.
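To make this concrete, here is a minimal sketch in Python (our illustration, not the team’s code) of how a small phoneme inventory plus a pronunciation lexicon can cover an arbitrary vocabulary; the lexicon entries are only examples.

```python
# Minimal sketch: composing words from a small phoneme inventory.
# The lexicon below is illustrative, not data from the study.

# Pronunciation lexicon mapping phoneme sequences to words
# (ARPAbet-style symbols; English needs only ~39 of them).
LEXICON = {
    ("HH", "AH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
}

def decode_words(phoneme_stream):
    """Greedily match decoded phonemes against the lexicon."""
    words, buffer = [], []
    for ph in phoneme_stream:
        buffer.append(ph)
        word = LEXICON.get(tuple(buffer))
        if word is not None:
            words.append(word)
            buffer = []
    return words

print(decode_words(["HH", "AH", "L", "OW", "W", "ER", "L", "D"]))
# ['hello', 'world']
```

In the real system the mapping from phoneme probabilities to words is handled by a neural decoder (typically with a language model) rather than a lookup table, but the principle is the same: a small, fixed set of phonemes composes an unbounded vocabulary.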
But this is just a minor prelude to the research. The most noteworthy aspect is the AI’s ability to decode and comprehend Ann’s intentions.

Electrodes were placed in brain regions the team found critical for language.
The research team uses a deep learning model to map neural signals to speech units and features. This allows them to generate text, synthesize speech, and control virtual characters.
As mentioned, the researchers collaborated with Speech Graphics to create avatars for the patients.
Speech Graphics’ technology works backwards from an audio input to infer the facial muscle movements that would produce it, then uses that data to drive the avatar’s face in a game engine in real time, without perceptible delay.
Since the patient’s brain signals can be mapped directly onto the avatar, she can also express emotion and communicate non-verbally.
Overview of Multimodal Speech Decoding System
Researchers have designed a speech decoding system to help Ann, who is severely paralyzed and unable to speak, regain communication with others.

Ann trains an AI algorithm to recognize brain signals associated with phonemes, which are the subunits of speech.
The researchers implanted a high-density, 253-channel ECoG array over Ann’s cortex, covering regions involved in speech, including the sensorimotor cortex (SMC) and the superior temporal gyrus.

Briefly, these regions are associated with movements of the face, lips, tongue, and jaw (Fig. 1a-c).
The array let the researchers detect electrical signals from these regions whenever Ann attempted to speak.
They noticed that the array captured distinct activation patterns when Ann tried to move her lips, tongue, and jaw (Fig. 1d).
To study how language could be decoded from brain signals, the researchers asked Ann to silently attempt to speak each sentence after it appeared on the screen.
From the signals captured by the 253 ECoG electrodes, the researchers extracted two main feature streams: high-gamma activity (70-150 Hz) and low-frequency signals (0.3-17 Hz).
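As a rough illustration of this preprocessing step, the two feature streams could be extracted with standard band-pass filters; this is a sketch under assumed parameters (sampling rate, filter order), not the study’s actual pipeline.

```python
# Sketch of extracting the two ECoG feature streams described above.
# Sampling rate and filter design are illustrative assumptions.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

FS = 1000  # assumed ECoG sampling rate in Hz

def bandpass(x, low, high, fs=FS, order=4):
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, x, axis=-1)

ecog = np.random.randn(253, 10 * FS)  # stand-in for (electrodes, samples) raw voltages

# High-gamma activity (70-150 Hz): band-pass, then analytic amplitude envelope.
hga = np.abs(hilbert(bandpass(ecog, 70, 150), axis=-1))

# Low-frequency signal (0.3-17 Hz): band-pass only.
lfs = bandpass(ecog, 0.3, 17)
```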

A deep learning model predicts pronunciation, speech, and mouth movements based on brain signals. These predictions are translated into text, synthesized speech, and avatar movements.
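The description above suggests a recurrent encoder over the neural features with separate output heads for phonemes (text), discrete speech units (synthesis), and articulatory features (avatar). The PyTorch sketch below is our own illustration of that idea; the study may well use separate models per modality, and every size here (506 input features, i.e. 253 electrodes times 2 streams, hidden width, head dimensions) is an assumption.

```python
# Hypothetical multimodal decoder sketch: one shared encoder over ECoG
# features, three task-specific heads. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalSpeechDecoder(nn.Module):
    def __init__(self, n_features=506, hidden=256,
                 n_phonemes=41, n_speech_units=101, n_articulators=32):
        super().__init__()
        # Shared bidirectional recurrent encoder over neural feature windows.
        self.encoder = nn.GRU(n_features, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.text_head = nn.Linear(2 * hidden, n_phonemes)        # phoneme logits
        self.speech_head = nn.Linear(2 * hidden, n_speech_units)  # discrete-unit logits
        self.avatar_head = nn.Linear(2 * hidden, n_articulators)  # articulatory gestures

    def forward(self, ecog_features):
        # ecog_features: (batch, time, n_features)
        h, _ = self.encoder(ecog_features)
        return self.text_head(h), self.speech_head(h), self.avatar_head(h)

model = MultimodalSpeechDecoder()
x = torch.randn(1, 200, 506)  # 200 time steps of stacked HGA + low-frequency features
text_logits, unit_logits, articulators = model(x)
```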
Text decoding
The research team had long hoped to decode text directly from the brain as people with dysarthria attempt to speak.
But earlier efforts suffered from slow decoding speeds and small vocabularies.
This study used phoneme-based decoding, a method that can decode arbitrary phrases from a large vocabulary at speeds approaching natural speech.

To evaluate real-time performance, the research team decoded the text as Ann attempted to read 249 sentences silently. These sentences were randomly selected from a sentence set of 1024 words and were not used during model training. For decoding, they extracted features from the ECoG signal and processed them using a bidirectional recurrent neural network (RNN).
They assessed how well the decoding worked using different metrics such as word error rate (WER), phone error rate (PER), character error rate (CER), and words per minute (WPM).
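Word error rate and its phone/character counterparts are edit-distance metrics. A minimal reference implementation (ours, not the study’s evaluation code) makes the definition explicit:

```python
# Word error rate: Levenshtein edit distance over word tokens, normalized by
# the reference length. CER and PER are the same computation over characters
# or phonemes.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("i think you are wonderful", "i think you were wonderful"))  # 0.2
```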
The research team observed that, at a decoding rate of 78.3 words per minute (WPM), Ann could communicate far faster than with her usual assistive device, approaching the pace of natural conversation.
They tested the signal stability by asking Ann to either silently read 26 code words or perform four hand gestures as a separate task. The results showed that the neural network classifier performed very well, with an average accuracy rate of 96.8%.
They conducted simulated decoding on two sets of sentences to assess the model’s performance on specific sentence sets without pauses. The results showed that the model decoded these limited sentence sets rapidly and accurately.
Speech synthesis
Another way to decode text is by converting neural activity into speech, allowing people unable to speak to communicate more naturally and expressively.
Previous research in people with intact speech has shown that it is possible to synthesize intelligible speech from neural activity recorded during vocalization or mimed speech. However, this method had not yet been tested in paralyzed individuals.

The researchers performed real-time speech synthesis, converting neural activity directly into audible speech while Ann silently attempted to speak during the audiovisual task (Fig. 3a).
To synthesize speech, the researchers passed temporal windows of neural activity into a bidirectional recurrent neural network (RNN).
Before testing, the researchers trained the RNN to predict the probability of 100 discrete speech units at each time step.
The reference speech-unit sequences were generated with HuBERT, a self-supervised speech representation learning model that encodes continuous speech waveforms into temporal sequences of discrete speech units capturing latent phonemic and articulatory information.
During training, the researchers used a connectionist temporal classification (CTC) loss, which lets the RNN learn the mapping between the speech units derived from ECoG features and these reference sequences without requiring the participant’s silent speech attempts to be time-aligned with the reference waveforms.
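This training setup (a bidirectional RNN over ECoG feature windows, trained with a CTC loss against HuBERT-style discrete-unit targets) can be sketched in PyTorch as follows; the shapes, layer sizes, and blank-index convention are our assumptions, not the authors’ configuration.

```python
# Illustrative CTC training step for the speech-synthesis decoder.
import torch
import torch.nn as nn

N_UNITS = 100  # discrete speech units; index 0 reserved here for the CTC blank

rnn = nn.GRU(input_size=506, hidden_size=256, num_layers=2,
             batch_first=True, bidirectional=True)
proj = nn.Linear(512, N_UNITS + 1)          # per-step logits over units + blank
ctc = nn.CTCLoss(blank=0)

ecog = torch.randn(4, 300, 506)             # (batch, time, neural features)
targets = torch.randint(1, N_UNITS + 1, (4, 120))  # reference unit sequences (from HuBERT)
input_lens = torch.full((4,), 300, dtype=torch.long)
target_lens = torch.full((4,), 120, dtype=torch.long)

h, _ = rnn(ecog)
log_probs = proj(h).log_softmax(dim=-1)     # (batch, time, N_UNITS + 1)
# CTCLoss expects (time, batch, classes); no frame-level alignment between the
# neural data and the reference unit sequence is needed.
loss = ctc(log_probs.transpose(0, 1), targets, input_lens, target_lens)
loss.backward()
```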
At inference time, the speech model uses the predicted unit probabilities to determine the most likely unit at each time step. The resulting unit sequence is fed into a pre-trained unit-to-speech model that generates a mel-spectrogram, which is then synthesized into an audible voice waveform.
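A simple way to realize the “most likely unit at each time step” selection is a per-step argmax over the predicted probabilities; whether blanks and repeats are then collapsed before the unit-to-speech model is an implementation detail we do not assume. A minimal sketch:

```python
# Per-step unit selection from the decoder's predicted log-probabilities.
import torch

def select_units(log_probs: torch.Tensor, blank: int = 0, collapse: bool = True):
    # log_probs: (time, N_UNITS + 1) for one utterance
    best = log_probs.argmax(dim=-1).tolist()
    if not collapse:
        return [u for u in best if u != blank]
    units, prev = [], None
    for u in best:
        if u != blank and u != prev:   # drop blanks, merge consecutive repeats
            units.append(u)
        prev = u
    return units
```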
Offline, the researchers further processed the decoded speech with a voice-conversion model trained on recordings of the participant’s voice from before her injury, creating a personalized synthetic voice for her.
Facial Avatar Decoding
The researchers also built a facial-avatar BCI that translates brain activity into orofacial movements, displaying an animated virtual face during the audiovisual task.

To give the avatar dynamic facial animation, the researchers used technology from Speech Graphics that converts speech signals into animated facial movements.
The researchers used two approaches to animate the avatar: a direct method and an acoustic method. The direct method infers articulatory gestures from neural activity without any intermediate speech signal.
The acoustic method drives real-time audio-visual synthesis, ensuring low-latency synchronization between the decoded speech audio and the avatar’s movements.
In addition to the articulation actions accompanying the synthesized speech, a complete avatar BCI should also be able to display speech-independent oral and emotional actions.
The researchers collected neural data while the participant performed two additional tasks: a vocal-motor task and an emotional-expression task.
The results showed that she could drive the avatar BCI to display vocal-tract movements and strong emotional expressions, revealing the potential of multimodal communication BCIs to restore meaningful orofacial movement.
Articulatory representation-driven decoding
In healthy speakers, neural representations in the SMC (including the precentral and postcentral gyri) encode the articulatory movements of the orofacial muscles.
The researchers hypothesized that, even in a paralyzed individual, these speech-related neural representations would remain active after the electrode array was implanted over the participant’s SMC and would contribute to decoding speech.
They used a linear model to predict the high-gamma activity (HGA) at each electrode from the phoneme probabilities computed by the text decoder during the 1,024-word general text task.
The researchers then computed the maximum encoding weight of each activated electrode for each phoneme, yielding a phoneme tuning space in which every electrode has an associated vector of phoneme-encoding weights.
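As an illustration of this encoding analysis, the sketch below fits the linear model with ordinary least squares (the authors’ exact regression and any temporal lags are not specified here, so treat this purely as a schematic):

```python
# Schematic linear encoding analysis: predict each electrode's high-gamma
# activity (HGA) from the text decoder's phoneme probabilities, then read off
# each electrode's phoneme-encoding weight vector ("phoneme tuning").
import numpy as np

n_time, n_phonemes, n_electrodes = 5000, 39, 253
phoneme_probs = np.random.rand(n_time, n_phonemes)   # stand-in decoder outputs
hga = np.random.randn(n_time, n_electrodes)          # stand-in measured HGA

# Fit HGA ~ phoneme_probs @ W jointly for all electrodes (one column per electrode).
W, *_ = np.linalg.lstsq(phoneme_probs, hga, rcond=None)   # (n_phonemes, n_electrodes)

tuning = W.T                                # (n_electrodes, n_phonemes) weight vectors
preferred_phoneme = tuning.argmax(axis=1)   # strongest-weighted phoneme per electrode
```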

References:
https://www.ucsf.edu/news/2023/08/425986/how-artificial-intelligence-gave-paralyzed-woman-her-voice-back
This article is from the WeChat public account “Xinzhiyuan” (ID: AI_era), author: Xinzhiyuan, published by 36Kr with authorization.