End-to-End Active Speaker Detection

Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review

Standard

End-to-End Active Speaker Detection. / Alcázar, Juan León; Cordes, Moritz; Zhao, Chen et al.
Computer Vision – ECCV 2022 - 17th European Conference, Proceedings. ed. / Shai Avidan; Gabriel Brostow; Moustapha Cissé; Giovanni Maria Farinella; Tal Hassner. Springer Science and Business Media Deutschland, 2022. p. 126-143 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13697 LNCS).

Harvard

Alcázar, JL, Cordes, M, Zhao, C & Ghanem, B 2022, End-to-End Active Speaker Detection. in S Avidan, G Brostow, M Cissé, GM Farinella & T Hassner (eds), Computer Vision – ECCV 2022 - 17th European Conference, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 13697 LNCS, Springer Science and Business Media Deutschland, pp. 126-143, Conference - 17th European Conference on Computer Vision - ECCV 2022, Tel Aviv, Israel, 23.10.22. https://doi.org/10.48550/arXiv.2203.14250, https://doi.org/10.1007/978-3-031-19836-6_8

APA

Alcázar, J. L., Cordes, M., Zhao, C., & Ghanem, B. (2022). End-to-End Active Speaker Detection. In S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, & T. Hassner (Eds.), Computer Vision – ECCV 2022 - 17th European Conference, Proceedings (pp. 126-143). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 13697 LNCS). Springer Science and Business Media Deutschland. https://doi.org/10.48550/arXiv.2203.14250, https://doi.org/10.1007/978-3-031-19836-6_8

Vancouver

Alcázar JL, Cordes M, Zhao C, Ghanem B. End-to-End Active Speaker Detection. In Avidan S, Brostow G, Cissé M, Farinella GM, Hassner T, editors, Computer Vision – ECCV 2022 - 17th European Conference, Proceedings. Springer Science and Business Media Deutschland. 2022. p. 126-143. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.48550/arXiv.2203.14250, 10.1007/978-3-031-19836-6_8

Bibtex

@inbook{f53b33da89fc48f8bf13f3676febd593,
title = "End-to-End Active Speaker Detection",
abstract = "Recent advances in the Active Speaker Detection (ASD) problem build upon a two-stage process: feature extraction and spatio-temporal context aggregation. In this paper, we propose an end-to-end ASD workflow where feature learning and contextual predictions are jointly learned. Our end-to-end trainable network simultaneously learns multi-modal embeddings and aggregates spatio-temporal context. This results in more suitable feature representations and improved performance in the ASD task. We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem. Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance. Finally, we design a weakly-supervised strategy, which demonstrates that the ASD problem can also be approached by utilizing audiovisual data but relying exclusively on audio annotations. We achieve this by modelling the direct relationship between the audio signal and the possible sound sources (speakers), as well as introducing a contrastive loss.",
keywords = "Informatics, Business informatics",
author = "Alc{\'a}zar, {Juan Le{\'o}n} and Moritz Cordes and Chen Zhao and Bernard Ghanem",
note = "Publisher Copyright: {\textcopyright} 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.; Conference - 17th European Conference on Computer Vision - ECCV 2022, ECCV 2022 ; Conference date: 23-10-2022 Through 27-10-2022",
year = "2022",
doi = "10.48550/arXiv.2203.14250",
language = "English",
isbn = "978-3-031-19835-9",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Science and Business Media Deutschland",
pages = "126--143",
editor = "Shai Avidan and Gabriel Brostow and Moustapha Ciss{\'e} and Farinella, {Giovanni Maria} and Tal Hassner",
booktitle = "Computer Vision – ECCV 2022 - 17th European Conference, Proceedings",
address = "Germany",
url = "https://eccv2022.ecva.net/",

}

RIS

TY - CHAP

T1 - End-to-End Active Speaker Detection

AU - Alcázar, Juan León

AU - Cordes, Moritz

AU - Zhao, Chen

AU - Ghanem, Bernard

N1 - Conference code: 17

PY - 2022

Y1 - 2022

N2 - Recent advances in the Active Speaker Detection (ASD) problem build upon a two-stage process: feature extraction and spatio-temporal context aggregation. In this paper, we propose an end-to-end ASD workflow where feature learning and contextual predictions are jointly learned. Our end-to-end trainable network simultaneously learns multi-modal embeddings and aggregates spatio-temporal context. This results in more suitable feature representations and improved performance in the ASD task. We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem. Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance. Finally, we design a weakly-supervised strategy, which demonstrates that the ASD problem can also be approached by utilizing audiovisual data but relying exclusively on audio annotations. We achieve this by modelling the direct relationship between the audio signal and the possible sound sources (speakers), as well as introducing a contrastive loss.

AB - Recent advances in the Active Speaker Detection (ASD) problem build upon a two-stage process: feature extraction and spatio-temporal context aggregation. In this paper, we propose an end-to-end ASD workflow where feature learning and contextual predictions are jointly learned. Our end-to-end trainable network simultaneously learns multi-modal embeddings and aggregates spatio-temporal context. This results in more suitable feature representations and improved performance in the ASD task. We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem. Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance. Finally, we design a weakly-supervised strategy, which demonstrates that the ASD problem can also be approached by utilizing audiovisual data but relying exclusively on audio annotations. We achieve this by modelling the direct relationship between the audio signal and the possible sound sources (speakers), as well as introducing a contrastive loss.

KW - Informatics

KW - Business informatics

UR - http://www.scopus.com/inward/record.url?scp=85142706504&partnerID=8YFLogxK

UR - https://www.mendeley.com/catalogue/e6496c8f-b57c-3961-9c1b-0128427ddd58/

U2 - 10.48550/arXiv.2203.14250

DO - 10.48550/arXiv.2203.14250

M3 - Article in conference proceedings

AN - SCOPUS:85142706504

SN - 978-3-031-19835-9

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 126

EP - 143

BT - Computer Vision – ECCV 2022 - 17th European Conference, Proceedings

A2 - Avidan, Shai

A2 - Brostow, Gabriel

A2 - Cissé, Moustapha

A2 - Farinella, Giovanni Maria

A2 - Hassner, Tal

PB - Springer Science and Business Media Deutschland

T2 - Conference - 17th European Conference on Computer Vision - ECCV 2022

Y2 - 23 October 2022 through 27 October 2022

ER -
