Speech recognition

Speech recognition is generally built as a system like the one shown in Fig. 1. Conventional speech recognition technology uses an acoustic model called GMM-HMM, a hybrid of a Gaussian mixture model (GMM) and a hidden Markov model (HMM) (Fig. 2)*1. When deep learning later became popular, it was applied to speech recognition from a relatively early stage.

Since around 2011, two deep-learning approaches initially competed: the Tandem type and the DNN-HMM hybrid type. The DNN-HMM hybrid, which replaces the GMM in the acoustic model with a DNN, has since become mainstream (Fig. 2). As Fig. 3 shows, speech recognition accuracy (measured by Word Error Rate, WER) remained almost flat from 2000 to 2010 but has improved rapidly since deep learning took hold*2. This method is still often adopted in current product development.
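WER, the accuracy indicator mentioned above, is the word-level edit distance (substitutions, insertions, deletions) between a reference transcript and the recognizer's hypothesis, divided by the number of reference words. A minimal sketch in Python (the example sentences are illustrative, not from the source):

```python
# Word Error Rate (WER): word-level edit distance between a reference
# transcript and a hypothesis, divided by the reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One word ("the") deleted from a six-word reference: WER = 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why a flat curve in Fig. 3 does not mean recognition was unusable, only that it was not improving.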

In addition, since around 2016 a new approach called end-to-end speech recognition has been emerging rapidly, and it has become mainstream at the research level.

In addition, emotion recognition from audio signals is becoming common, and it has been shown that estimating the emotions of elderly people may contribute to improving the quality of care*3.

*1. Yu et al., Articulatory and Spectrum Information Fusion Based on Deep Recurrent Neural Networks, 2019


*3. Hirooka et al., Construction of speech and emotion database for care recipients, Proceedings of the Acoustical Society of Japan, 2-Q-9, pp.1059-1060, 2018

Fig. 1 Overview of a speech recognition system

Fig. 2 Evolution of speech recognition methods

Fig. 3 Changes in speech recognition accuracy