End-to-End Active Speaker Detection
Publication: Contributions to edited volumes › Articles in conference proceedings › Research › peer-reviewed
Standard
Computer Vision – ECCV 2022 - 17th European Conference, Proceedings. Ed. / Shai Avidan; Gabriel Brostow; Moustapha Cissé; Giovanni Maria Farinella; Tal Hassner. Springer Science and Business Media Deutschland GmbH, 2022. pp. 126-143 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13697 LNCS).
RIS
TY - CHAP
T1 - End-to-End Active Speaker Detection
AU - Alcázar, Juan León
AU - Cordes, Moritz
AU - Zhao, Chen
AU - Ghanem, Bernard
N1 - Conference code: 17
PY - 2022
Y1 - 2022
N2 - Recent advances in the Active Speaker Detection (ASD) problem build upon a two-stage process: feature extraction and spatio-temporal context aggregation. In this paper, we propose an end-to-end ASD workflow where feature learning and contextual predictions are jointly learned. Our end-to-end trainable network simultaneously learns multi-modal embeddings and aggregates spatio-temporal context. This results in more suitable feature representations and improved performance in the ASD task. We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem. Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance. Finally, we design a weakly-supervised strategy, which demonstrates that the ASD problem can also be approached by utilizing audiovisual data but relying exclusively on audio annotations. We achieve this by modelling the direct relationship between the audio signal and the possible sound sources (speakers), as well as introducing a contrastive loss.
AB - Recent advances in the Active Speaker Detection (ASD) problem build upon a two-stage process: feature extraction and spatio-temporal context aggregation. In this paper, we propose an end-to-end ASD workflow where feature learning and contextual predictions are jointly learned. Our end-to-end trainable network simultaneously learns multi-modal embeddings and aggregates spatio-temporal context. This results in more suitable feature representations and improved performance in the ASD task. We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem. Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance. Finally, we design a weakly-supervised strategy, which demonstrates that the ASD problem can also be approached by utilizing audiovisual data but relying exclusively on audio annotations. We achieve this by modelling the direct relationship between the audio signal and the possible sound sources (speakers), as well as introducing a contrastive loss.
KW - Informatics
KW - Business informatics
UR - http://www.scopus.com/inward/record.url?scp=85142706504&partnerID=8YFLogxK
UR - https://www.mendeley.com/catalogue/e6496c8f-b57c-3961-9c1b-0128427ddd58/
U2 - 10.48550/arXiv.2203.14250
DO - 10.48550/arXiv.2203.14250
M3 - Article in conference proceedings
AN - SCOPUS:85142706504
SN - 978-3-031-19835-9
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 126
EP - 143
BT - Computer Vision – ECCV 2022 - 17th European Conference, Proceedings
A2 - Avidan, Shai
A2 - Brostow, Gabriel
A2 - Cissé, Moustapha
A2 - Farinella, Giovanni Maria
A2 - Hassner, Tal
PB - Springer Science and Business Media Deutschland GmbH
T2 - Conference - 17th European Conference on Computer Vision - ECCV 2022
Y2 - 23 October 2022 through 27 October 2022
ER -
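The abstract's iGNN blocks split message passing according to the two main sources of context in ASD: the same speaker's track across frames (temporal context) and the other candidate speakers within a frame (inter-speaker context). The minimal NumPy sketch below illustrates that interleaving idea only; it is not the authors' implementation, and the function names, mean-based aggregation rule, and 0.5 mixing weight are all illustrative assumptions.

```python
# Illustrative sketch (not the paper's code) of "interleaved" context
# aggregation: instead of mixing all graph edges in one step, alternate
# between temporal message passing and inter-speaker message passing.
import numpy as np


def temporal_step(feats):
    # feats: (num_frames, num_speakers, dim)
    # Mix each node with its own track's mean over time (temporal context).
    temporal_ctx = feats.mean(axis=0, keepdims=True)
    return 0.5 * (feats + temporal_ctx)


def speaker_step(feats):
    # Mix each node with the per-frame mean over all candidate speakers
    # (inter-speaker context).
    speaker_ctx = feats.mean(axis=1, keepdims=True)
    return 0.5 * (feats + speaker_ctx)


def ignn_like_block(feats, rounds=2):
    # Interleave the two message-passing directions, as the abstract's
    # "split" message passing suggests.
    for _ in range(rounds):
        feats = temporal_step(feats)
        feats = speaker_step(feats)
    return feats


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3, 16))  # 8 frames, 3 candidate speakers, 16-dim embeddings
y = ignn_like_block(x)
print(y.shape)  # (8, 3, 16)
```

Each pass keeps the node tensor's shape, so blocks of this kind can be stacked; a trained model would replace the fixed means with learned edge functions.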
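The weakly-supervised strategy in the abstract relies on a contrastive loss tying the audio signal to its possible sound sources. As a rough intuition pump, the sketch below scores candidate speaker embeddings against an audio embedding with an InfoNCE-style softmax cross-entropy; the paper's actual loss, embeddings, and supervision signal are not reproduced here, and every name and the temperature value are assumptions for illustration.

```python
# Illustrative InfoNCE-style loss (an assumption, not the paper's exact
# formulation): the audio embedding should be most similar to the
# embedding of the annotated active sound source among all candidates.
import numpy as np


def audio_speaker_contrastive_loss(audio_emb, speaker_embs, active_idx,
                                   temperature=0.1):
    # L2-normalize so the dot products are cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb)
    s = speaker_embs / np.linalg.norm(speaker_embs, axis=1, keepdims=True)
    logits = s @ a / temperature          # similarity of audio to each candidate
    logits = logits - logits.max()        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[active_idx]         # cross-entropy against annotated source


speakers = np.eye(3, 8)                   # three toy candidate embeddings
audio = speakers[0].copy()                # audio best matches speaker 0
loss_match = audio_speaker_contrastive_loss(audio, speakers, active_idx=0)
loss_mismatch = audio_speaker_contrastive_loss(audio, speakers, active_idx=1)
print(loss_match < loss_mismatch)  # True
```

Minimizing such a loss pulls the audio embedding toward the correct source and pushes it away from the other candidates, which is why audio-only annotations can still supervise an audiovisual model.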