End-to-End Active Speaker Detection

Juan León Alcázar; Moritz Cordes; Chen Zhao; Bernard Ghanem

doi:10.48550/arXiv.2203.14250

Publication: Contributions to collected editions › Conference papers › Research › peer-reviewed

Authors

Juan León Alcázar
Moritz Cordes
Chen Zhao
Bernard Ghanem

Faculty of Management and Technology

Recent advances in the Active Speaker Detection (ASD) problem build upon a two-stage process: feature extraction and spatio-temporal context aggregation. In this paper, we propose an end-to-end ASD workflow where feature learning and contextual predictions are jointly learned. Our end-to-end trainable network simultaneously learns multi-modal embeddings and aggregates spatio-temporal context. This results in more suitable feature representations and improved performance in the ASD task. We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem. Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance. Finally, we design a weakly-supervised strategy, which demonstrates that the ASD problem can also be approached using audiovisual data while relying exclusively on audio annotations. We achieve this by modelling the direct relationship between the audio signal and the possible sound sources (speakers), as well as introducing a contrastive loss.
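
As an illustration only, the idea of splitting message passing by source of context can be sketched as below. This is not the authors' iGNN implementation. It assumes that the two sources of context are spatial (different speakers in the same frame) and temporal (the same speaker in neighbouring frames), and it replaces learned layers with plain mean aggregation. The names `mean_aggregate` and `interleaved_block`, the edge lists, and the toy features are all hypothetical.

```python
def mean_aggregate(features, edges):
    """One round of mean message passing over an undirected edge list.

    Each node averages its own feature vector with those of its neighbours.
    """
    dim = len(features[0])
    sums = [list(f) for f in features]   # start from the node's own features
    counts = [1] * len(features)
    for a, b in edges:
        for d in range(dim):
            sums[a][d] += features[b][d]
            sums[b][d] += features[a][d]
        counts[a] += 1
        counts[b] += 1
    return [[s / c for s in row] for row, c in zip(sums, counts)]


def interleaved_block(features, spatial_edges, temporal_edges):
    """Assumed interleaving: aggregate over spatial edges, then temporal edges."""
    features = mean_aggregate(features, spatial_edges)
    return mean_aggregate(features, temporal_edges)


# Toy graph (hypothetical): two speakers (A, B) observed at two timestamps.
# Node ids: 0 = (A, t0), 1 = (B, t0), 2 = (A, t1), 3 = (B, t1)
spatial_edges = [(0, 1), (2, 3)]    # speakers sharing a frame
temporal_edges = [(0, 2), (1, 3)]   # same speaker across frames
features = [[1.0], [0.0], [0.0], [0.0]]

out = interleaved_block(features, spatial_edges, temporal_edges)
print(out)
```

In this toy graph the signal on node 0 reaches node 3 after a single block, even though the two nodes share no edge: the spatial step moves it across speakers and the temporal step moves it across frames.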

Original language	English
Title of host publication	Computer Vision – ECCV 2022 - 17th European Conference, Proceedings
Editors	Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, Tal Hassner
Number of pages	18
Publisher	Springer Science and Business Media Deutschland
Publication date	2022
Pages	126-143
ISBN (print)	978-3-031-19835-9
ISBN (electronic)	978-3-031-19836-6
DOIs	https://doi.org/10.48550/arXiv.2203.14250 https://doi.org/10.1007/978-3-031-19836-6_8
Publication status	Published - 2022
Event	Conference - 17th European Conference on Computer Vision - ECCV 2022 - Expo Tel Aviv / David Intercontinental Hotel, Tel Aviv, Israel Duration: 23.10.2022 → 27.10.2022 Conference number: 17 https://eccv2022.ecva.net/

Bibliographical note

Funding Information:
Acknowledgements. This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research through the Visual Computing Center (VCC) funding.

Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.

Research areas

Computer science
Business informatics

DOI

https://doi.org/10.48550/arXiv.2203.14250
Accepted author manuscript
https://doi.org/10.1007/978-3-031-19836-6_8
Final published version
