Abstract
This paper describes our entry to the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. This year's challenge focuses on generating gestures in a dyadic setting: predicting a main-agent's motion from the speech of both the main-agent and an interlocutor. We adapt a Transformer-XL architecture for this task by adding a cross-attention module that integrates the interlocutor's speech with that of the main-agent. Our model is conditioned on speech audio (encoded using PASE+), text (encoded using FastText) and a speaker identity label, and is able to generate smooth and speech-appropriate gestures for a given identity. We consider the GENEA Challenge user study results and discuss our model's strengths and where improvements can be made.
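The cross-attention idea mentioned in the abstract can be illustrated with a minimal single-head sketch in NumPy. This is not the authors' implementation. It assumes, purely for illustration, that queries come from the main-agent's speech features and keys and values come from the interlocutor's. The function name `cross_attention`, the feature dimensions, and the random projection matrices are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(main_feats, interloc_feats, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention (illustrative only).

    Assumption: the main-agent frames act as queries and the interlocutor
    frames supply keys and values. The abstract does not specify this.
    """
    q = main_feats @ w_q          # (T_main, d_model)
    k = interloc_feats @ w_k      # (T_interloc, d_model)
    v = interloc_feats @ w_v      # (T_interloc, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T_main, T_interloc)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights


# Hypothetical sizes: 5 main-agent frames, 7 interlocutor frames.
rng = np.random.default_rng(0)
d_in, d_model = 16, 8
main_feats = rng.normal(size=(5, d_in))
interloc_feats = rng.normal(size=(7, d_in))
w_q, w_k, w_v = (rng.normal(size=(d_in, d_model)) for _ in range(3))

fused, weights = cross_attention(main_feats, interloc_feats, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each output row is a weighted mixture of interlocutor features, with weights determined by how strongly that main-agent frame attends to each interlocutor frame. A full model would typically use multiple heads, learned projections, and residual connections.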
| Original language | English |
|---|---|
| Pages | 802-810 |
| Number of pages | 9 |
| Publication status | Published - 9 Oct 2023 |
| Event | INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION - Duration: 9 Oct 2023 → 13 Oct 2023 |
Conference
| Conference | INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION |
|---|---|
| Period | 9/10/23 → 13/10/23 |
Keywords
- 3D pose prediction
- Cross-Attention
- Self-Attention
- Speech-to-gesture
- Transformer-XL
- Gesture generation