Keyword Detection Literature Review

19 Nov 2017

Below are summaries of a few of the papers I’ve found on existing keyword spotting (KWS) systems:

Chen, Guoguo, Carolina Parada, and Georg Heigold. “Small-footprint keyword spotting using deep neural networks.” Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014.
- They use a deep neural network to spot keywords instead of the traditional Hidden Markov Model (HMM). They found that it gives a smaller implementation with reduced computation time and space, and performs at least as well.
- Feature Extraction:
  - They use voice activity detection (VAD) so that the neural network only runs on audio regions containing speech.
  - The audio features are computed in frames every 10 ms, and each frame covers a 25 ms window.
  - At each step, the neural net is given the 30 previous frames and the 10 future frames.
- Training:
  - To train the network they had 2.3K instances of each keyword and 133K negative examples.
  - Keywords examined: answer call, ok google, take a picture, etc.
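
The framing and context stacking described under Feature Extraction can be sketched roughly as follows. This is only a minimal illustration, not the paper's implementation: the 16 kHz sample rate, the function names, the log-energy feature, and the edge padding are all my own assumptions, and I am reading "30 previous and 10 future frames" as including the current frame (41 frames total).

```python
import numpy as np

SAMPLE_RATE = 16000   # assumed; the summary above does not state a sample rate
HOP_MS = 10           # a new frame every 10 ms
WIN_MS = 25           # each frame covers 25 ms of audio
LEFT_CONTEXT = 30     # previous frames given to the network
RIGHT_CONTEXT = 10    # future frames given to the network


def frame_signal(signal, sample_rate=SAMPLE_RATE):
    """Split a 1-D signal into overlapping frames, shape (n_frames, win)."""
    hop = sample_rate * HOP_MS // 1000
    win = sample_rate * WIN_MS // 1000
    if len(signal) < win:
        return np.empty((0, win))
    n_frames = 1 + (len(signal) - win) // hop
    # Row t holds samples [t * hop, t * hop + win).
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]


def stack_context(features, left=LEFT_CONTEXT, right=RIGHT_CONTEXT):
    """Concatenate each frame with its left/right neighbours.

    features: (n_frames, dim), with n_frames >= 1.
    Returns (n_frames, (left + 1 + right) * dim).
    Edges are padded by repeating the first/last frame (an assumption).
    """
    n_frames, _ = features.shape
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    # Offset i == left corresponds to the current frame.
    windows = [padded[i:i + n_frames] for i in range(left + 1 + right)]
    return np.concatenate(windows, axis=1)


# One second of noise as stand-in audio.
signal = np.random.default_rng(0).standard_normal(SAMPLE_RATE)
frames = frame_signal(signal)                                   # shape (98, 400)
log_energy = np.log(np.sum(frames ** 2, axis=1, keepdims=True)) # toy 1-D feature
network_input = stack_context(log_energy)                       # shape (98, 41)
```

The log-energy feature is just a placeholder so the shapes are easy to follow; any per-frame feature vector could be stacked the same way.
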

Hout, Julien van, et al. “Recent Improvements in SRI’s Keyword Detection System for Noisy Audio.” Fifteenth Annual Conference of the International Speech Communication Association. 2014.
- They use a series of noise-robust features to improve the accuracy of their KWS in noisy conditions.
- Noise-Robust Features:
  - Damped Oscillator Coefficients
  - Normalized Modulation Coefficients
  - Modulation of Medium Duration Speech Amplitudes
  - Gammatone Filter Coefficients
  - Log-spectrally Enhanced Power Normalized Cepstral Coefficients
  - Gabor-MFCC
- Results:
  - They found that the phoneme-level search misses fewer keywords than the word-level search, but often produces more false positives as a result.
  - They determined that a fusion of the features was most accurate for their tests.
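
To make the miss versus false-positive trade-off concrete, here is a small generic sketch of how the two rates are commonly counted for a detector with a confidence threshold. It is not SRI's scoring procedure; the function name and the toy scores and labels are invented purely for illustration.

```python
def miss_and_false_alarm(scores, labels, threshold):
    """Return (miss_rate, false_alarm_rate) at a given threshold.

    scores: detector confidence for each candidate.
    labels: True if the keyword is really present in that candidate.
    """
    misses = false_alarms = positives = negatives = 0
    for score, is_keyword in zip(scores, labels):
        detected = score >= threshold
        if is_keyword:
            positives += 1
            if not detected:
                misses += 1
        else:
            negatives += 1
            if detected:
                false_alarms += 1
    return misses / positives, false_alarms / negatives


# Made-up data: three true keywords, three non-keywords.
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
labels = [True, True, True, False, False, False]

strict = miss_and_false_alarm(scores, labels, threshold=0.8)    # (2/3, 0.0)
lenient = miss_and_false_alarm(scores, labels, threshold=0.35)  # (0.0, 1/3)
```

A more permissive search behaves like the lenient threshold here: it misses fewer keywords but lets through more false positives.
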

Gales, Mark JF, et al. “Speech recognition and keyword spotting for low-resource languages: Babel project research at CUED.” SLTU. 2014.
- They examine a zero-resource acoustic model for languages with little to no transcribed audio.
- They aim to build a language-independent acoustic model, using features such as pitch and PLP together with context-dependent HMMs.
- Their results were inconclusive. They concluded that the idea is likely still feasible, but that more work is needed in the area.

CREU Blog 2017 - 2018 Research of JN Matthews
