10 Dec 2017
I’ve been continuing to add noun recognition / filtering to my transcript
analysis code. I’ve been reading through
Natural Language Processing with Python to
familiarize myself with the toolkit and as a reference.
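As a rough illustration of the noun filtering step, the sketch below keeps only the tokens whose part-of-speech tag marks them as nouns. It assumes the transcript has already been tokenized and tagged with Penn Treebank tags (the tag set `nltk.pos_tag` returns by default). The function name `filter_nouns` and the sample sentence are made up for this example and are not taken from my analysis code.

```python
def filter_nouns(tagged_tokens):
    """Return the words whose Penn Treebank tag is a noun tag (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]


# Hand-tagged sample (an assumption for illustration). In practice the pairs
# could come from nltk.pos_tag(nltk.word_tokenize(text)), which requires the
# NLTK tokenizer and tagger data to be downloaded first.
sample = [
    ("The", "DT"), ("participant", "NN"), ("described", "VBD"),
    ("two", "CD"), ("faces", "NNS"), ("in", "IN"), ("Boston", "NNP"),
]

nouns = filter_nouns(sample)
print(nouns)
```

Matching on the `NN` prefix is a deliberate shortcut: it covers singular, plural, and proper nouns in one check, at the cost of trusting whatever the tagger assigned.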
In addition, I met with the team briefly to discuss our plans for next
semester. It seems like my next project will be to implement a GUI to
facilitate use of the existing code. I intend to start researching libraries
to build such a thing, specifically in Python, as there are plans to port much of
the current research into Python. I will also continue to see if there’s a way
to port the existing MATLAB code, although my initial research and attempts
have not been fruitful.
03 Dec 2017
Some more literature review of keyword detection:
- Chen, Guoguo, Carolina Parada, and Tara N. Sainath. “Query-by-example keyword spotting using long short-term memory networks.” Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015.
- They use a long short-term memory (LSTM) neural network for KWS.
- This allows for training new keywords, beyond just those where we have prior knowledge.
- Feature Extraction:
- Voice activity detection (VAD) is used to reduce computation, so that KWS only runs on voiced regions.
- 13-dimensional PLP features
- They found that this approach reduces the false rejection rate by 86%.
- Chen, et al. “Low-Resource Keyword Search Strategies for Tamil.” 2015 IEEE International Conference on. IEEE, 2015.
- They propose three strategies for low-resource KWS.
- Submodular Optimization to Select Audio to Transcribe
- For each utterance s in a set S, they measure the degree to which s contains some feature u. This can be used to determine the probability distribution of that feature.
- Keyword Aware Language Modeling
- Word Morph Interpolated Language Model
- Three language models are constructed:
- Word-based LM, which is trained on all word entries
- Morph-based LM (morphs are automatically parsed morphemes), which is trained by parsing word entries into morphs
- Hybrid Word-Morph LM, where words with more than one occurrence are retained and words with one occurrence are parsed into morphs.
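To make the hybrid word-morph idea concrete for myself, here is a minimal sketch of how the training tokens for such a model might be prepared: words seen more than once stay whole, and words seen only once are split into morphs. The segmenter `toy_segment` is a hypothetical stand-in that only strips a final "s"; it is not the morphological parser used in the paper, and the function names and toy corpus are my own assumptions.

```python
from collections import Counter


def toy_segment(word):
    """Hypothetical stand-in for a real morph segmenter: split off a final 's'."""
    if len(word) > 2 and word.endswith("s"):
        return [word[:-1], "+s"]
    return [word]


def hybrid_tokens(words, segment=toy_segment):
    """Keep words that occur more than once; split single-occurrence words into morphs."""
    counts = Counter(words)
    tokens = []
    for word in words:
        if counts[word] > 1:
            tokens.append(word)           # frequent word: retained as a word unit
        else:
            tokens.extend(segment(word))  # singleton: replaced by its morphs
    return tokens


corpus = "the cat saw the cats and the cat ran".split()
print(hybrid_tokens(corpus))
```

The resulting token stream could then be handed to any n-gram trainer. The appeal of the hybrid scheme, as I understand it, is that rare words are the ones most likely to benefit from sharing statistics at the morph level.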
26 Nov 2017
This week was Thanksgiving Break.
19 Nov 2017
Summaries of a few of the papers I’ve found on existing keyword spotting (KWS) are below:
- Chen, Guoguo, Carolina Parada, and Georg Heigold. “Small-footprint keyword spotting using deep neural networks.” Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014.
- They use a deep neural network to spot keywords instead of the traditional Hidden Markov Model. They found that it has a smaller footprint, with reduced computation time and memory, and performs at least as well.
- Feature Extraction:
- They use VAD so that the neural network only runs on audio regions containing speech.
- The audio features are computed for a frame every 10 ms, and each frame covers a 25 ms window.
- The neural net is given the 30 previous frames and the 10 future frames at each instance.
- Training
- To train the network they had 2.3K instances of each keyword and 133K negative examples.
- Keywords examined: answer call, ok google, take a picture, etc.
- Hout, Julien van, et al. “Recent Improvements in SRI’s Keyword Detection System for Noisy Audio.” Fifteenth Annual Conference of the International Speech Communication Association. 2014.
- They use a series of noise-robust features to improve the accuracy of their KWS in noisy conditions.
- Noise Robust Features:
- Damped Oscillator Coefficients
- Normalized Modulation Coefficients
- Modulation of Medium Duration Speech Amplitudes
- Gammatone Filter Coefficients
- Log-spectrally Enhanced Power Normalized Cepstral Coefficients
- Gabor-MFCC
- Results:
- They found that the phoneme-level search misses fewer keywords than the word-level search, but often has more false positives as a result.
- They determined that a fusion of the features was most accurate for their tests.
- Gales, Mark JF, et al. “Speech recognition and keyword spotting for low-resource languages: Babel project research at CUED.” SLTU. 2014.
- They examine a zero-resource acoustic model for languages with little to no transcribed audio.
- They aim to build a language-independent acoustic model, using features such as pitch and PLP along with context-dependent HMMs.
- Their results were inconclusive, and they found that although this idea
is likely still possible, there is more work to be done in the area.
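The framing numbers from the small-footprint DNN paper (a 25 ms window every 10 ms, with 30 past and 10 future frames of context) are easier to picture with a small sketch. The code below only computes frame boundaries and context indices. The 16 kHz sample rate and the choice to clamp indices at the edges of the recording are my own assumptions, not details from the paper, and the function names are made up.

```python
def frame_bounds(num_samples, sample_rate, window_ms=25, hop_ms=10):
    """Return (start, end) sample indices for each full analysis frame."""
    window = sample_rate * window_ms // 1000
    hop = sample_rate * hop_ms // 1000
    return [(start, start + window)
            for start in range(0, num_samples - window + 1, hop)]


def context_indices(t, num_frames, past=30, future=10):
    """Indices of the frames stacked as input at frame t.

    Indices that fall before the first or after the last frame are clamped
    to the nearest valid frame (an assumption; padding would also work).
    """
    return [min(max(i, 0), num_frames - 1)
            for i in range(t - past, t + future + 1)]


# One second of audio at an assumed 16 kHz sample rate.
bounds = frame_bounds(num_samples=16000, sample_rate=16000)
ctx = context_indices(50, len(bounds))
print(len(bounds), bounds[1], len(ctx), ctx[0], ctx[-1])
```

Each input to the network therefore spans 41 frames (30 past, the current one, and 10 future), so the model looks further back than forward at every step.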
12 Nov 2017
I’ve been continuing to verify participant data with the team and help organize volunteer participants. One interesting thing I’ve noticed about this process is that members of the team tend to notice different facial features from one another. For example, I tend to pick up distinctions in the eyes and nose, whereas other team members might recognize those differences in head shape or the ears.
Additionally, I have been reviewing the literature on keyword detection in audio. Reading academic papers is a process I hadn’t had much experience with prior to this research, and I am still very much learning how to do it in a way that both teaches me the current understanding of the field and is efficient in terms of absorbing content. I tend to start with the abstract, introduction, and conclusion, and then work my way through the methods and results. One of the difficulties I’ve found is that because I’m still new to these topics and ideas, a lot of the names for methods are unfamiliar to me, and I find myself spending a great deal of time reading background on those terms to build a place to put the findings of the paper. This is not a problem on its own, but I struggle with determining which of the terms the paper truly hinges on and which are tangential. One of the strategies I’ve adopted is, for each thing I read, to mark down everything I want to review and then essentially perform a breadth-first search through those items, so that I don’t end up 20 papers down a path that isn’t particularly relevant to the subject matter as often.
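For what it’s worth, the reading strategy maps fairly directly onto a breadth-first search. The sketch below is only an illustration of that idea: the `references` dictionary, its entries, and the `max_depth` cutoff are all made up for the example (the depth limit is my own addition to keep the search from wandering too far from the starting paper).

```python
from collections import deque


def reading_order(start, references, max_depth=2):
    """Breadth-first order over things to read, at most max_depth hops from start."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        item, depth = queue.popleft()
        order.append(item)
        if depth == max_depth:
            continue  # do not follow references any deeper than this
        for ref in references.get(item, []):
            if ref not in seen:
                seen.add(ref)
                queue.append((ref, depth + 1))
    return order


# Hypothetical notes: each key lists the terms or papers it made me want to review.
references = {
    "KWS paper": ["LSTM", "PLP features"],
    "LSTM": ["RNN"],
    "PLP features": ["MFCC"],
    "RNN": ["backpropagation"],
}
print(reading_order("KWS paper", references, max_depth=2))
```

Because everything one hop away is read before anything two hops away, the background closest to the original paper gets covered first, and the depth cutoff is what stops the 20-papers-deep problem.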