Embodied Machine Listening in Audio-Visual Navigation @MARL Lab

Jul 21, 2022 04:44 PM
Last updated December 9, 2022
Music/Audio Technology
Nov 23, 2022: Work in progress, supervised and mentored by Aurora Cramer, Professor Magdalena Fuentes, and Professor Juan Bello.

Proofs of Concept

  • To push an audio encoder to its limits, we examined the representations learned by convolutional neural network architectures across two stages of transfer learning: the first covers audio-visual scene correspondence and audio-visual navigation; the second covers a range of downstream tasks that holistically evaluate how well the learned audio representations generalize.
  • In audio-visual correspondence pre-training, we adopted a contrastive learning method on egocentric videos with stereo audio.
  • In semantic audio-visual navigation fine-tuning, the acoustic, directional, and semantic features of the binaural sound are learned through a reinforcement learning approach with actions, rewards, and memory.
    • Semantic audio-visual navigation in SoundSpaces
Up-upstream: contrastive audio-visual correspondence
Upstream: embodied semantic audio-visual navigation
Downstream: sound source localization in HEAR benchmarks
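The contrastive audio-visual correspondence stage can be illustrated with a minimal sketch of a symmetric InfoNCE-style loss. This is not necessarily the loss used in this project: the function names, the temperature value, and the toy embeddings are all assumptions made for illustration. The idea is that the audio and video embeddings from the same clip are pulled together, while embeddings from different clips in the batch act as negatives.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_at(logits, index):
    """Return log-softmax of `logits` evaluated at position `index` (numerically stable)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return logits[index] - log_sum


def contrastive_avc_loss(audio_emb, video_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Hypothetical helper: row i of `audio_emb` and row i of `video_emb` are
    assumed to come from the same clip (the positive pair); every other
    row in the batch serves as a negative.
    """
    audio = [l2_normalize(a) for a in audio_emb]
    video = [l2_normalize(v) for v in video_emb]
    n = len(audio)

    # Temperature-scaled cosine similarity between every audio/video pair.
    sim = [
        [sum(x * y for x, y in zip(a, v)) / temperature for v in video]
        for a in audio
    ]

    # Audio-to-video direction: each audio clip should pick its own video.
    a2v = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Video-to-audio direction: each video should pick its own audio clip.
    v2a = -sum(
        log_softmax_at([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (a2v + v2a)


# Toy batch: two clips with 2-D embeddings (made-up values).
audio_batch = [[1.0, 0.0], [0.0, 1.0]]
matched_video = [[1.0, 0.0], [0.0, 1.0]]
shuffled_video = [[0.0, 1.0], [1.0, 0.0]]

matched_loss = contrastive_avc_loss(audio_batch, matched_video)
shuffled_loss = contrastive_avc_loss(audio_batch, shuffled_video)
```

In this toy batch the correctly paired embeddings give a much lower loss than the shuffled ones, which is the signal an encoder would be trained on. In practice the embeddings would come from the audio and visual encoders rather than being written by hand.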

Code in Progress

marl, updated Nov 4, 2022
marl, updated Sep 27, 2022
auroracramer, updated Jul 27, 2022


Cover video from https://soundspaces.org/.

@inproceedings{chen2021semantic,
  title={Semantic audio-visual navigation},
  author={Chen, Changan and Al-Halah, Ziad and Grauman, Kristen},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={15516--15525},
  year={2021}
}

@article{wu2022listen,
  title={How to Listen? Rethinking Visual Sound Localization},
  author={Wu, Ho-Hsiang and Fuentes, Magdalena and Seetharaman, Prem and Bello, Juan Pablo},
  journal={arXiv preprint arXiv:2204.05156},
  year={2022}
}

@article{turian2022hear,
  title={Hear 2021: Holistic evaluation of audio representations},
  author={Turian, Joseph and Shier, Jordie and Khan, Humair Raj and Raj, Bhiksha and Schuller, Bj{\"o}rn W and Steinmetz, Christian J and Malloy, Colin and Tzanetakis, George and Velarde, Gissel and McNally, Kirk and others},
  journal={arXiv preprint arXiv:2203.03022},
  year={2022}
}