Smart home voice control solutions: devices that listen, users who speak freely

Far-field speech recognition, cloud-based semantic recognition, artificial intelligence applications, and related technologies have made new breakthroughs, giving smart homes a new control entry point. This article presents the application prospects of voice control in smart home products and integration projects from several perspectives: technology trends, solutions, product applications, and project implementation.

As smart homes bring convenience, people's control habits are gradually changing, and advances in voice technology have added a new control entry point. Users can set aside the traditional remote control and mobile app and simply speak a command to make the home comfortable and daily life more convenient and intelligent. Will voice control become the next universally adopted application in the smart home industry?

The intelligent voice industry mainly refers to businesses that provide services to users through speech synthesis and speech recognition technology. Typically, the user only needs to speak a command to a service terminal to obtain the corresponding service. The industry has existed since the 1960s but remained little known to ordinary consumers. In recent years, as Apple, Google, Microsoft, and other companies launched voice services such as Siri, these services and the related industries have begun to attract attention from consumers and investors.

Voice control technology

Talking to a machine and having it understand what you say is something people have long dreamed of. Speech recognition is the technology that lets a machine convert a speech signal into the corresponding text or command through a process of recognition and understanding.

Speech recognition is an interdisciplinary field. Combined with speech synthesis, it frees people from the keyboard and lets them operate devices through voice commands. Speech technology applications have become a competitive, emerging high-tech industry.

Current problems with voice control technology

At present, voice-controlled smart hardware is widely criticized in many scenarios for its unsatisfactory voice interaction experience. The main causes are spatial distance, background noise, interference from other voices, echo, and reverberation. Together these complex factors produce clear pain points such as short recognition distance and a low recognition rate.

In addition, Chinese has many dialects and accents, and many words carry multiple meanings, so recognition rates vary greatly among users from different regions. Semantic recognition faces a further problem: dependence on context makes it difficult to learn from data, to locate the intended meaning, and to build a model.

Several control techniques for speech recognition

Speech recognition technology is equivalent to fitting a computer system with an "ear" so that it can hear. The technology involves complex steps such as speech signal processing, speech feature extraction, model training, and decoding, so that the machine can finally identify the content, speaker, language, and other information in the speech. How voice control is implemented depends heavily on the user's usage habits. Current implementations fall into two major categories: near-field speech recognition and far-field speech recognition.

Near field/far field speech recognition technology

Near-field speech recognition requires the user to tap to start, and the user is relatively close to the terminal device, such as a mobile phone. Control functions are carried out directly through that device.

Far-field speech recognition takes voice data picked up at a distance by a microphone array as input and converts the voice signal into text with a speech recognition algorithm. Although the principle is the same as in near-field speech recognition, the greater distance between the sound source and the microphone means that the signal is attenuated and picks up various kinds of noise as the sound wave propagates, so special speech pickup and pre-processing techniques are required. Different pickup devices and pre-processing techniques often change the characteristics of the acoustic signal used for recognition. Therefore, the speech recognition engine needs to be customized and optimized for each far-field pickup technology.
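To give a rough sense of how much weaker a far-field signal is, the sketch below applies the ideal free-field spreading law (sound pressure falls in proportion to 1/distance, about 6 dB per doubling of distance). The distances and the function name `free_field_drop_db` are assumptions for illustration. Real rooms add reflections, so the figure is only a rough guide.

```python
import math

def free_field_drop_db(near_m, far_m):
    """Level drop in dB when moving from near_m to far_m from a point source.

    Assumes ideal free-field spreading (pressure proportional to 1/distance).
    Real rooms add reflections, so treat the result as a rough guide only.
    """
    return 20 * math.log10(far_m / near_m)

# Hypothetical distances: a phone held at 0.3 m versus a device 5 m away.
drop = free_field_drop_db(0.3, 5.0)
print(f"Direct sound is about {drop:.1f} dB weaker at 5 m than at 0.3 m")
```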

Because the speech signal is attenuated as it propagates, which reduces the strength and clarity of the captured signal, highly sensitive directional microphones are used, with their parameters tuned for far-field speech, to capture the far-field signal as clearly as possible. The voice command is also contaminated by ambient noise in transit, which lowers its signal-to-noise ratio. Directional beamforming suppresses noise arriving from outside the target direction and so reduces interference with the voice signal. In a room, the microphone picks up not only sound arriving directly from the source but also delayed sound reflected from the walls, and these residual reflections cause reverberation. Data collected by multiple microphones is processed with a multi-channel echo cancellation algorithm to separate sound arriving at different times, which removes the effect of reverberation on the sound data.
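The directional noise suppression described above is often introduced through delay-and-sum beamforming. The following is a minimal sketch, not any vendor's implementation. It assumes the per-microphone delays toward the target are already known as whole samples and that each microphone adds independent noise. All names and values (`simulate_mic`, `delay_and_sum`, the delays, the noise level) are hypothetical.

```python
import math
import random

def simulate_mic(source, delay, noise_std, rng):
    """One microphone: the source delayed by `delay` samples plus its own noise."""
    return [
        (source[t - delay] if t >= delay else 0.0) + rng.gauss(0.0, noise_std)
        for t in range(len(source))
    ]

def delay_and_sum(channels, delays):
    """Undo each channel's known delay, then average across microphones."""
    n = len(channels[0])
    out = [0.0] * n
    for samples, d in zip(channels, delays):
        for i in range(n - d):
            out[i] += samples[i + d]   # advance the late channel by d samples
    return [v / len(channels) for v in out]

def snr_db(clean, observed):
    """SNR of `observed` against the known clean reference, in dB."""
    signal = sum(c * c for c in clean)
    noise = sum((o - c) ** 2 for c, o in zip(clean, observed))
    return 10 * math.log10(signal / noise)

rng = random.Random(0)
source = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(2000)]
delays = [0, 3, 6, 9]   # hypothetical per-microphone delays, in samples
mics = [simulate_mic(source, d, 0.5, rng) for d in delays]
beam = delay_and_sum(mics, delays)

valid = len(source) - max(delays)   # the tail lacks data from delayed channels
single_snr = snr_db(source[:valid], mics[0][:valid])
beam_snr = snr_db(source[:valid], beam[:valid])
print(f"single microphone: {single_snr:.1f} dB, beamformed: {beam_snr:.1f} dB")
```

Aligning the channels makes the target add coherently while the independent noise partly averages out, which is why the beamformed signal-to-noise ratio comes out higher than that of a single microphone.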

Wake-up target detection technology

When devices are controlled by voice at a distance, sound may come from different people in different directions, so the first task is to determine which sounds are commands and which are not. A microphone array beamforming algorithm divides the 360-degree space into several regions, and each microphone is responsible for monitoring a specified region. When a wake-up word is detected in a region, pickup by the microphone for that region is enhanced and pickup from the other regions is suppressed. Sound is thus picked up directionally, which prevents voices from a nearby television or other people's conversations from interfering with voice commands.
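The region-selection logic above can be sketched as follows. This is a simplified illustration that assumes the direction of the wake-up word has already been estimated and that one signal per region is available. The sector count, gain values, and function names are hypothetical.

```python
def sector_of(angle_deg, num_sectors):
    """Map a direction of arrival in degrees to a sector index."""
    width = 360.0 / num_sectors
    index = int((angle_deg % 360.0) // width)
    return min(index, num_sectors - 1)   # guard against float rounding at 360

def sector_gains(wake_angle_deg, num_sectors, boost=1.0, suppress=0.1):
    """Boost the sector that heard the wake-up word and suppress the rest."""
    active = sector_of(wake_angle_deg, num_sectors)
    return [boost if s == active else suppress for s in range(num_sectors)]

def mix_sectors(sector_signals, gains):
    """Weighted sum of the per-sector signals."""
    n = len(sector_signals[0])
    return [sum(g * sig[i] for g, sig in zip(gains, sector_signals)) for i in range(n)]

# Hypothetical scene with 6 sectors: the user says the wake-up word at
# 95 degrees (sector 1) while a television plays at 250 degrees (sector 4).
silence = [0.0, 0.0, 0.0]
user = [1.0, -1.0, 1.0]
tv = [0.5, 0.5, 0.5]
signals = [silence, user, silence, silence, tv, silence]
gains = sector_gains(95.0, 6)
output = mix_sectors(signals, gains)
print(gains, output)
```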

Play state interrupt technique

When voice control is used on a device such as a smart speaker, the device is often playing music at the time. Because the microphone is mounted on the speaker itself, the distance between the microphone and the user is much larger than the distance between the microphone and the loudspeaker driver, so the playback can swamp the voice command. The problem is tackled both internally and externally. Internally, a dedicated echo cancellation algorithm reduces the effect of the playback on the microphone. Because traditional linear echo cancellation fails against the nonlinear interference caused by vibration, a nonlinear echo cancellation algorithm can be used to improve internal noise removal. Externally, a carefully designed damping structure for the microphone array minimizes vibration transmitted between the microphones and the circuit board they are attached to, limiting as far as possible the interference with pickup caused by the speaker body vibrating at high sound levels.
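As an illustration of the linear echo cancellation mentioned above, here is a minimal normalized LMS (NLMS) adaptive filter. It is a sketch under simplifying assumptions: the echo path is a short, fixed linear filter, no one is speaking during adaptation, and it does not address the nonlinear, vibration-induced interference the text describes. The echo path coefficients and all names are hypothetical.

```python
import random

def nlms_echo_cancel(mic, reference, taps=8, mu=0.5, eps=1e-6):
    """Subtract an adaptive estimate of the playback echo from the mic signal.

    mic:       samples captured by the microphone.
    reference: samples sent to the loudspeaker.
    Returns the residual signal after the estimated echo is removed.
    """
    weights = [0.0] * taps
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for m, r in zip(mic, reference):
        history = [r] + history[:-1]
        estimate = sum(w * x for w, x in zip(weights, history))
        error = m - estimate
        norm = sum(x * x for x in history) + eps
        # Normalized update: step size scaled by the reference signal energy.
        weights = [w + mu * error * x / norm for w, x in zip(weights, history)]
        residual.append(error)
    return residual

def power(samples):
    """Mean squared value of a list of samples."""
    return sum(v * v for v in samples) / len(samples)

rng = random.Random(1)
playback = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
echo_path = [0.6, 0.3, 0.1]   # hypothetical loudspeaker-to-microphone response
mic = [
    sum(h * playback[t - k] for k, h in enumerate(echo_path) if t - k >= 0)
    for t in range(len(playback))
]
residual = nlms_echo_cancel(mic, playback)
print(power(mic[-500:]), power(residual[-500:]))
```

Once the filter has adapted, the residual echo power is far below the echo power at the microphone. Practical systems also need to handle the case where the user speaks during playback.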

Mature voice control application solution

As the largest provider of intelligent voice technology in China, iFlytek (Keda Xunfei) has long-term research experience in the field and internationally leading results in technologies such as Chinese speech synthesis, speech recognition, and spoken language evaluation. Smart home voice control must solve problems of distance, efficiency, personalization, dialect, and wake-up, and iFlytek has mature solutions in each of these areas.

Distance: Far-field recognition technology breaks distance bottleneck

Near-field recognition is relatively mature. The voice input function on mobile phones is a near-field technology: the user must speak within a short distance of the phone. In a smart home, however, the distance between the user and the smart terminal is much greater. For users to control the home by voice at will, the device must recognize speech accurately wherever the user is in the living room, so speech recognition has to overcome the obstacle of distance.

At present, indoor voice interaction is affected by background noise, interference from other voices, echo, reverberation, and other complex factors. The result is a low recognition rate or outright unusability, so voice control works only in relatively quiet, close-range settings. Far-field recognition technology can solve these problems well.

In March 2015, iFlytek released its far-field recognition technology, described as the only speech recognition technology supporting distances of more than 5 meters. It breaks through the distance bottleneck and greatly increases the freedom of voice interaction. The technology exploits the spatial filtering of a microphone array: it forms a beam in the direction of the target speaker (beamforming), suppresses noise outside the beam, and applies a dedicated de-reverberation algorithm to cancel reflected sound as far as possible, thereby removing reverberation. This makes it practical for users to control smart home appliances by voice from any corner of the living room.

Efficiency: fast response for instant understanding

In recent years, deep learning has developed explosively and delivered remarkable results in speech recognition. Open source toolkits such as Kaldi have become popular in industry and academia, and the barrier to entry for speech recognition keeps falling. Many companies now have speech recognition capabilities and related products, but some products respond slowly, often making the user wait a long time for results. Accuracy is also limited, and sometimes speech is not recognized at all.

To address recognition accuracy and response speed, iFlytek continues to innovate in core technologies and product features. Having introduced deep neural network technology into speech recognition, it applies speech enhancement algorithms based on deep neural networks to remove noise from the input speech without losing the information that matters for recognition. High-precision acoustic and language models trained on massive corpora, combined with a highly optimized decoding engine, keep latency low. Removing noise from the original speech in this way greatly improves the response speed and user experience of voice input. The continuous speech recognition rate for large vocabularies can reach 95% or more, and the command-word recognition rate 99% or more. The decoding engine can return a result within 40 milliseconds after the user finishes speaking, truly achieving "instant understanding".

Personalization: automatic learning to adapt to user habits

Everyone speaks with a different accent, pace, and set of pet phrases. It is clearly not enough for smart devices at home to recognize only a basic vocabulary. They must understand the individual user better, including that user's accent, dialect, pet phrases, and specialist vocabulary.

Can a smart device adapt to each person's usage habits? The answer is yes. This relies on another key speech recognition technology: personalized recognition, meaning the ability of the system to learn and adapt to the user's habits automatically. The more you use it, the better it understands you. In general, personalized recognition covers both pronunciation and language. Pronunciation personalization means the system learns the user's speaking rate, accent, and other pronunciation habits. Language personalization means the system recognizes the user's own vocabulary better, such as personal names, place names, pet phrases, and specialist terms.

At present, iFlytek's personalized recognition technology can build an individual language model from each person's interests and knowledge background, so that personalized vocabulary is recognized accurately. The more the system is used, the better it will understand the user.
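Language personalization can be illustrated with a toy re-ranking step. The sketch below is only a stand-in for a real personalized language model: it assumes the recognizer returns several candidate transcripts with scores, and it adds a bonus for words the user has confirmed before. The class `PersonalVocabulary`, the scores, and the weight are all hypothetical.

```python
from collections import Counter

class PersonalVocabulary:
    """Words learned from transcripts the user has previously confirmed."""

    def __init__(self):
        self.counts = Counter()

    def learn(self, confirmed_text):
        self.counts.update(confirmed_text.lower().split())

    def bonus(self, text, weight=0.5):
        """Fixed bonus for each word in `text` that the user has used before."""
        return weight * sum(1 for w in text.lower().split() if self.counts[w] > 0)

def rerank(hypotheses, vocab):
    """Pick the best transcript from (text, base_score) pairs; higher is better."""
    return max(hypotheses, key=lambda h: h[1] + vocab.bonus(h[0]))[0]

# Hypothetical candidates from a recognizer for the same utterance.
hypotheses = [("call john sun", -1.0), ("call zhang san", -1.4)]

vocab = PersonalVocabulary()
before = rerank(hypotheses, vocab)   # no personal data yet
vocab.learn("call zhang san")        # the user confirms this contact once
after = rerank(hypotheses, vocab)
print(before, "->", after)
```

With no history the generic score wins. After the user's own contact name has been learned, the same audio resolves to the personalized transcript.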

Dialect: Unconstrained control

China's spoken languages are famously rich and varied. Although the country has spared no effort in promoting standard Mandarin, the proportion of people who truly master it is still relatively low. Accents are complicated, and it is very common to find different accents within the same city. When people with light or heavy accents use voice input, a model trained in the usual way on standard Mandarin data suffers a serious mismatch, which degrades recognition.

Dialect is no longer an obstacle to speech recognition, thanks to specialist resources such as audio data, special vocabulary, and pronunciation patterns for each dialect, and to the self-learning capacity of deep neural networks. The Xunfei input method already supports 15 dialects, including Sichuan, Henan, Dongbei, and Tianjin dialects. The same dialect recognition capability applies to smart home environments. In the future, whether you speak Mandarin or a dialect, quickly or slowly, and whether or not your Mandarin is standard, you will be able to control the smart devices at home by voice.

Voice wake-up: truly hands-free

Because of power consumption limits, it is difficult for smart devices to remain fully active 24 hours a day. To control smart home devices freely, we therefore need to be able to wake them instantly, which means adding voice wake-up technology to smart devices.

Voice wake-up means triggering the speech recognition system with a spoken input that contains a specific wake-up word, so that voice interaction can follow. With this technology, anyone, in any environment and at any time, near-field or far-field, can activate the product's recognition engine simply by speaking the preset wake-up word to the device, making the whole voice interaction touch-free.

Beyond waking the device with a stand-alone wake-up word, a more natural and technically more challenging interaction is to embed the wake-up word in continuous speech, so that a single utterance both wakes the product and issues a command. Voice wake-up solutions have already been applied successfully in some products. In the Lingxi voice assistant, for example, saying the wake-up word followed by "call Zhang San" wakes the device and automatically completes name recognition and dialing.
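The idea of a wake-up word carried inside a continuous utterance can be sketched at the text level. This is a deliberate simplification: real wake-up detection runs on audio, whereas the sketch assumes a transcript is already available. The wake-up word "hello home" and the function name are hypothetical.

```python
def split_wake_command(transcript, wake_word):
    """Find the wake-up word in a transcript and return the command after it.

    Returns the command text, "" if only the wake-up word was spoken,
    or None if the wake-up word is absent.
    """
    words = [w.strip(",.!?") for w in transcript.lower().split()]
    wake = wake_word.lower().split()
    for i in range(len(words) - len(wake) + 1):
        if words[i:i + len(wake)] == wake:
            return " ".join(words[i + len(wake):])
    return None

# One utterance both wakes the device and carries the command.
print(split_wake_command("Hello home, call Zhang San.", "hello home"))
# Wake-up word alone: the device wakes and waits for a command.
print(repr(split_wake_command("Hello home", "hello home")))
# No wake-up word: the utterance is ignored.
print(split_wake_command("Call Zhang San", "hello home"))
```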

Market status of voice control at home and abroad

As the smart home market develops, foreign IT giants have entered the field by combining smart home products with voice. Google acquired Nest to establish its smart home presence and keeps strengthening Google Now as a voice entry point. Apple continues to integrate its HomeKit smart home platform more tightly with Siri. The popular Echo smart speaker uses Amazon's Alexa voice technology. Microsoft has released the voice assistant Cortana as an interaction portal for the smart home field. The attention and investment these companies devote to voice show that integrating intelligent voice with the smart home is the trend of the times. The industry generally believes that voice, as the most natural and convenient way for people to exchange information, will become an important part of future smart home devices.

In China, the voice giant Keda Xunfei (iFlytek) announced its entry into the smart home market in August of last year. In March of this year, it formed a joint venture with Jingdong (JD.com), Beijing Linglong Technology, to launch its first product, the DingDong smart speaker. Beyond the basic functions of a speaker, it can serve as a voice assistant and as the control center for smart hardware. Baidu, Tencent, and others are also building their own voice teams.

As giants at home and abroad increase their investment in voice interaction, the core voice technology is gradually maturing. The smart home is the inevitable result of the IT and manufacturing industries extending into the home, and as voice technology continues to enter this field, the market prospects are broad.

Tips

Siri is the voice assistant on iOS. Users talk to Siri much as they would to a friend, and it can handle many tasks, such as sending text messages, making phone calls, booking restaurants, and asking for directions. Siri also supports playful interactions, such as flipping a coin or reading a horoscope. It works in hands-free mode and can be used for voice-operated navigation to find the best driving route. With HomeKit, Siri can control smart products at home by voice. Siri also connects with many third-party online services, such as Wikipedia, Yelp, Rotten Tomatoes, and Shazam, to help users learn more about the world around them.

The significance of voice interaction for the smart home industry

Technology makes life smarter, and voice makes interaction easier. The most direct significance of voice interaction for the smart home is that it makes the "smart home" truly intelligent. However advanced a brand's technology and however friendly its human-computer interface, nothing is as simple and direct as voice control. Every brand and product that brings a new smart control concept into everyday life requires a period of learning and adaptation. Voice does not, because everyday communication, behavioral habits, language, and speech are already part of how people think. If the smart home can be integrated with voice, the smart home industry may therefore see an epoch-making breakthrough.

