http://ai.googleblog.com/2019/08/joint-speech-recognition-and-speaker.html
The key insight in our work was to recognize that the RNN-T architecture is well-suited to integrate acoustic and linguistic cues. The RNN-T model consists of three networks: (1) a transcription network (or encoder) that maps the acoustic frames to a latent representation, (2) a prediction network that predicts the next target label given the previous target labels, and (3) a joint network that combines the outputs of the previous two networks and generates a probability distribution over the set of output labels at that time step. Note that the architecture contains a feedback loop (diagram below): previously recognized words are fed back as input, which allows the RNN-T model to incorporate linguistic cues, such as the end of a question.
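To make the three-network structure concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the model described in this post: the layer sizes, the random weights, the per-frame feed-forward encoder, the single-layer recurrent prediction network, and the greedy decoding loop are all hypothetical choices made for brevity. The names `encoder`, `prediction_step`, `joint`, and `greedy_decode` are ours.

```python
import math
import random

random.seed(0)

# Toy sizes (assumptions). Label 0 is reserved for the "blank" symbol.
FEAT_DIM, HIDDEN, VOCAB, BLANK = 4, 8, 5, 0

def make_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def tanh_vec(v):
    return [math.tanh(x) for x in v]

def softmax(v):
    top = max(v)
    exps = [math.exp(x - top) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

W_enc = make_matrix(HIDDEN, FEAT_DIM)
W_emb = make_matrix(VOCAB, HIDDEN)      # one embedding row per label
W_pred = make_matrix(HIDDEN, HIDDEN)
W_joint_enc = make_matrix(HIDDEN, HIDDEN)
W_joint_pred = make_matrix(HIDDEN, HIDDEN)
W_out = make_matrix(VOCAB, HIDDEN)

def encoder(frames):
    """(1) Transcription network: acoustic frames -> latent vectors.
    Simplified to a per-frame layer; a real encoder would be recurrent."""
    return [tanh_vec(matvec(W_enc, f)) for f in frames]

def prediction_step(state, prev_label):
    """(2) Prediction network: update its state from the previous label."""
    emb = W_emb[prev_label]
    rec = matvec(W_pred, state)
    return tanh_vec([e + r for e, r in zip(emb, rec)])

def joint(enc_t, pred_u):
    """(3) Joint network: combine both outputs into a label distribution."""
    a = matvec(W_joint_enc, enc_t)
    b = matvec(W_joint_pred, pred_u)
    hidden = tanh_vec([x + y for x, y in zip(a, b)])
    return softmax(matvec(W_out, hidden))

def greedy_decode(frames, max_symbols_per_frame=3):
    """Greedy decoding (an assumed strategy): emit the most likely label
    until blank is predicted, then advance to the next acoustic frame."""
    state = prediction_step([0.0] * HIDDEN, BLANK)
    labels = []
    for enc_t in encoder(frames):
        for _ in range(max_symbols_per_frame):
            probs = joint(enc_t, state)
            best = max(range(VOCAB), key=probs.__getitem__)
            if best == BLANK:
                break
            labels.append(best)
            # Feedback loop: the emitted label conditions the next prediction.
            state = prediction_step(state, best)
    return labels
```

The call to `prediction_step(state, best)` inside the decoding loop is the feedback path: each emitted label changes the prediction network's state, so later outputs depend on what has already been recognized as well as on the acoustics.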
Training the RNN-T model on accelerators like graphics processing units (GPUs) or tensor processing units (TPUs) is non
Figure: An integrated speech recognition and speaker diarization system that jointly infers who spoke when and what.
