Speech-Driven Conversational Agents using Conditional Flow-VAEs

Sarah Taylor, Jonathan Windle, David Greenwood, Iain Matthews

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Automatic control of conversational agents has applications from animation, through human-computer interaction, to robotics. In interactive communication, an agent must move to express its own discourse and also react naturally to incoming speech. In this paper, we propose a Flow Variational Autoencoder (Flow-VAE) deep learning architecture for transforming conversational speech to body gesture during both speaking and listening. The model uses a normalising flow to perform variational inference in an autoencoder framework, giving a more expressive distribution than the Gaussian approximation of conventional variational autoencoders. Our model is non-deterministic, so it can produce variations of plausible gestures for the same speech. Our evaluation demonstrates that our approach produces expressive body motion that is close to the ground truth using a fraction of the trainable parameters compared with the previous state of the art.
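
The idea of replacing a Gaussian posterior with a normalising flow can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which flow is used, so a planar flow is substituted here purely for illustration, and the names `planar_flow` and `sample_flow_posterior`, the fixed flow parameters, and the two-dimensional latent are all assumptions. In a conditional model, `mu` and `log_sigma` would presumably come from an encoder conditioned on speech; here they are fixed constants.

```python
import math
import random


def planar_flow(z, u, w, b):
    """Apply one planar flow step f(z) = z + u * tanh(w.z + b).

    Returns the transformed vector and log|det df/dz|.
    (Hypothetical helper; the paper's flow type is not given in the abstract.)
    """
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    h = math.tanh(a)
    z_new = [zi + ui * h for zi, ui in zip(z, u)]
    # det J = 1 + tanh'(a) * (u . w), with tanh'(a) = 1 - tanh(a)^2
    dh = 1.0 - h * h
    det = 1.0 + dh * sum(ui * wi for ui, wi in zip(u, w))
    return z_new, math.log(abs(det))


def sample_flow_posterior(mu, log_sigma, flows, rng):
    """Draw z0 ~ N(mu, diag(sigma^2)), push it through the flows,
    and return the final sample with its log density log q(zK)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + math.exp(ls) * e for m, ls, e in zip(mu, log_sigma, eps)]
    # Log density of the diagonal Gaussian base sample.
    log_q = sum(-0.5 * math.log(2.0 * math.pi) - ls - 0.5 * e * e
                for ls, e in zip(log_sigma, eps))
    for u, w, b in flows:
        z, log_det = planar_flow(z, u, w, b)
        log_q -= log_det  # change of variables
    return z, log_q


# Assumed, hand-picked parameters; both satisfy w.u > -1, so each step is invertible.
rng = random.Random(0)
flows = [([0.5, -0.3], [1.0, 0.4], 0.1), ([-0.2, 0.6], [0.3, -0.8], 0.0)]
z, log_q = sample_flow_posterior([0.0, 0.0], [0.0, 0.0], flows, rng)
```

Because each call draws fresh noise, repeated sampling with the same `mu` and `log_sigma` yields different latent vectors, which is the mechanism by which a model of this kind can produce varied outputs for the same input.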
Original language: English
Title of host publication: CVMP '21: European Conference on Visual Media Production
Number of pages: 9
Publication status: Published - 6 Dec 2021


  • Speech animation
  • normalising flows
  • conversational agents
  • variational autoencoders
