Sunday, August 18, 2019

Google AI Blog: Joint Speech Recognition and Speaker Diarization via Sequence Transduction

http://ai.googleblog.com/2019/08/joint-speech-recognition-and-speaker.html

The key insight in our work was to recognize that the RNN-T architecture is well suited to integrating acoustic and linguistic cues. The RNN-T model consists of three different networks: (1) a transcription network (or encoder) that maps the acoustic frames to a latent representation, (2) a prediction network that predicts the next target label given the previous target labels, and (3) a joint network that combines the outputs of the previous two networks and generates a probability distribution over the set of output labels at that time step. Note that there is a feedback loop in the architecture (diagram below) where previously recognized words are fed back as input, and this allows the RNN-T model to incorporate linguistic cues, such as the end of a question.
Figure: An integrated speech recognition and speaker diarization system where the system jointly infers who spoke when and what.
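To make the three networks described above more concrete, here is a minimal pure-Python sketch of how their outputs could be wired together. It is not the model from the Google post: the class name `ToyRNNT`, the layer sizes, the random untrained weights, the toy vocabulary, and the choice of index 0 as a "blank" label are all assumptions made for illustration, and the encoder here works frame by frame, whereas a real transcription network would be recurrent.

```python
import math
import random


def linear(weights, vec):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class ToyRNNT:
    """Hypothetical, untrained stand-in for the three RNN-T networks."""

    def __init__(self, feat_dim, vocab_size, hidden=4, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        self.w_enc = random_matrix(rng, hidden, feat_dim)      # encoder weights
        self.embed = random_matrix(rng, vocab_size, hidden)    # label embeddings
        self.w_rec = random_matrix(rng, hidden, hidden)        # recurrent weights
        self.w_joint = random_matrix(rng, vocab_size, hidden)  # joint output weights

    def encode(self, frames):
        # (1) Transcription network: acoustic frame -> latent vector.
        return [[math.tanh(v) for v in linear(self.w_enc, f)] for f in frames]

    def predict(self, prev_labels):
        # (2) Prediction network: one state per label-history prefix,
        # starting from the empty history. This is the feedback loop:
        # previously emitted labels are fed back in as input.
        state = [0.0] * self.hidden
        states = [state]
        for label in prev_labels:
            rec = linear(self.w_rec, state)
            state = [math.tanh(e + r) for e, r in zip(self.embed[label], rec)]
            states.append(state)
        return states

    def joint(self, enc_t, pred_u):
        # (3) Joint network: combine both vectors, then produce a
        # distribution over output labels for this (frame, history) pair.
        combined = [math.tanh(a + b) for a, b in zip(enc_t, pred_u)]
        return softmax(linear(self.w_joint, combined))

    def lattice(self, frames, prev_labels):
        enc = self.encode(frames)
        pred = self.predict(prev_labels)
        return [[self.joint(e, p) for p in pred] for e in enc]


# Toy usage: index 0 is assumed to be the blank label.
VOCAB = ["<blank>", "how", "are", "you"]
model = ToyRNNT(feat_dim=3, vocab_size=len(VOCAB))
frames = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]  # two made-up acoustic frames
probs = model.lattice(frames, prev_labels=[1, 2])  # history: "how", "are"
```

Here `probs[t][u]` is the distribution over labels at frame `t` given the first `u` previous labels. Because the prediction state depends on the label history, the same acoustic frame yields a different distribution once earlier words are fed back in, which is the mechanism the post credits for picking up linguistic cues.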
Training the RNN-T model on accelerators like graphics processing units (GPUs) or tensor processing units (TPUs) is non