Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning

Park, Subin; Kim, Jung Uk

Subin Park, Jung Uk Kim^*

Kyung Hee University, Visual AI Lab.
The IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2026
^*Corresponding author

Overview of the proposed GAR-SSL framework.

Abstract

Sound source localization (SSL) aims to identify the locations of sound-emitting objects by leveraging correlations between the audio and visual modalities. Most existing SSL methods rely on contrastive learning-based feature matching but lack explicit reasoning and verification, which limits their effectiveness in complex acoustic scenes. Inspired by human meta-cognitive processes, we propose a training-free SSL framework that exploits the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs). Our Generation-Analysis-Refinement (GAR) pipeline consists of three stages: Generation produces initial bounding boxes and audio classifications; Analysis quantifies Audio-Visual Consistency via open-set role tagging and anchor voting; and Refinement applies adaptive gating to prevent unnecessary adjustments. Extensive experiments on single-source and multi-source benchmarks demonstrate competitive performance. The source code is available at this GitHub repository.

Method

The proposed GAR pipeline consists of three stages.

Generation: The MLLM produces initial bounding boxes for sound-emitting objects and predicts audio classifications.
Analysis: Audio-Visual Consistency is quantified through open-set role tagging and anchor voting, enabling explicit verification of audio-visual correspondence.
Refinement: Adaptive gating selectively adjusts unreliable predictions while preventing unnecessary modifications to already reliable localization results.
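
The control flow of the three stages can be illustrated with a minimal sketch. This is not the authors' implementation: the names Prediction, vote_consistency, and gar_step, the use of a plain vote fraction as the consistency score, and the 0.5 threshold are all assumptions made for illustration. The MLLM calls are left out, so the initial prediction, the votes, and the refine callback are supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


@dataclass
class Prediction:
    """Output of the Generation stage (hypothetical structure)."""
    box: Box
    audio_label: str


def vote_consistency(votes: List[bool]) -> float:
    """Assumed stand-in for anchor voting: the share of positive votes.

    Each vote says whether one check judged the box and the audio label
    to be consistent. An empty vote list is treated as zero consistency.
    """
    return sum(votes) / len(votes) if votes else 0.0


def gar_step(
    pred: Prediction,
    votes: List[bool],
    refine: Callable[[Prediction], Prediction],
    threshold: float = 0.5,  # assumed gate value, not taken from the paper
) -> Tuple[Prediction, float, bool]:
    """Run Analysis and gated Refinement on one generated prediction.

    Returns (final prediction, consistency score, whether it was refined).
    """
    score = vote_consistency(votes)
    if score >= threshold:
        # Gate closed: the prediction is judged reliable and left untouched.
        return pred, score, False
    # Gate open: the prediction is judged unreliable and gets refined.
    return refine(pred), score, True


if __name__ == "__main__":
    initial = Prediction(box=(0.0, 0.0, 10.0, 10.0), audio_label="dog bark")
    tighten = lambda p: Prediction((1.0, 1.0, 9.0, 9.0), p.audio_label)
    print(gar_step(initial, [True, True, False], tighten))
    print(gar_step(initial, [False, False, True], tighten))
```

The gate is the point of the sketch: a prediction whose consistency score clears the threshold is returned unchanged, which mirrors the stated goal of avoiding unnecessary modifications to already reliable results.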

Generation-Analysis-Refinement pipeline for training-free sound source localization.

Results

Extensive experiments on single-source and multi-source sound source localization benchmarks demonstrate that GAR-SSL achieves competitive performance without task-specific training.

Qualitative Results

Qualitative localization results on single-source and multi-source sound source localization benchmarks.

Additional qualitative localization results of GAR-SSL.

BibTeX

@article{park2026generate,
  title={Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning},
  author={Park, Subin and Kim, Jung Uk},
  journal={arXiv preprint arXiv:2604.06824},
  year={2026},
  url={https://arxiv.org/abs/2604.06824}
}