Recent advances in multimodal learning have demonstrated the promise of Vision-Language-Action (VLA) systems for robotic reasoning and real-world interaction. However, most existing VLA models rely on Transformer-based backbones, which suffer from quadratic attention complexity and scale poorly to long sequences. In this work, we present MambaVLA, a novel VLA framework that leverages the structured state space model architecture of Mamba for efficient sequence modeling, combined with established vision and language encoders to unify perception, language understanding, and action generation.
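As a rough illustration of the linear-time sequence modeling that MambaVLA builds on, the sketch below implements a minimal selective state space (Mamba-style) block in PyTorch. The layer names, dimensions, and the naive Python-loop scan are illustrative assumptions, not the actual MambaVLA implementation (which would use a fused parallel scan).

```python
# Minimal sketch of a Mamba-style selective state space block (illustrative only;
# layer names and sizes are assumptions, not the MambaVLA implementation).
import torch
import torch.nn as nn


class SelectiveSSMBlock(nn.Module):
    """Linear-time sequence mixing via an input-dependent (selective) SSM scan."""

    def __init__(self, d_model: int = 256, d_state: int = 16):
        super().__init__()
        self.d_model, self.d_state = d_model, d_state
        # Continuous-time state matrix A (log-parameterized for stability).
        self.A_log = nn.Parameter(torch.log(torch.rand(d_model, d_state) + 1e-3))
        # Input-dependent projections for step size and B/C matrices (the "selective" part).
        self.dt_proj = nn.Linear(d_model, d_model)
        self.B_proj = nn.Linear(d_model, d_state)
        self.C_proj = nn.Linear(d_model, d_state)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape
        A = -torch.exp(self.A_log)                           # (d_model, d_state), stable dynamics
        dt = torch.nn.functional.softplus(self.dt_proj(x))   # (B, L, d_model) step sizes
        B = self.B_proj(x)                                   # (B, L, d_state)
        C = self.C_proj(x)                                   # (B, L, d_state)

        # Discretize and scan: h_t = exp(dt*A) * h_{t-1} + dt * B_t * x_t,  y_t = <C_t, h_t>
        h = torch.zeros(batch, self.d_model, self.d_state, device=x.device, dtype=x.dtype)
        ys = []
        for t in range(seq_len):                             # O(L) recurrence, no L x L attention
            dA = torch.exp(dt[:, t].unsqueeze(-1) * A)       # (B, d_model, d_state)
            dBx = dt[:, t].unsqueeze(-1) * B[:, t].unsqueeze(1) * x[:, t].unsqueeze(-1)
            h = dA * h + dBx
            ys.append((h * C[:, t].unsqueeze(1)).sum(-1))    # (B, d_model)
        y = torch.stack(ys, dim=1)
        return self.out_proj(y)


# Usage: tokens from vision, language, and action streams share one linear-time sequence model.
tokens = torch.randn(2, 512, 256)
print(SelectiveSSMBlock()(tokens).shape)  # torch.Size([2, 512, 256])
```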
MambaVLA integrates the efficient Eagle backbone as its visual encoder with Qwen-7B-Chat-Int4 as its text encoder, enabling fine-grained fusion of spatial visual cues and language representations with linear-time complexity. To bridge the gap between multimodal understanding and actionable control, we incorporate a flow matching module that maps the fused visual and language embeddings into action representations, enabling smooth and accurate translation from perception to robotic control commands.
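To make the perception-to-action mapping concrete, the following is a hedged sketch of a flow matching action head conditioned on fused vision-language features. The network architecture, dimensions, and Euler sampler are assumptions chosen for illustration, not the released MambaVLA module.

```python
# Hedged sketch of a flow matching action head conditioned on fused vision-language
# features; module names, dimensions, and the Euler sampler are illustrative
# assumptions, not the exact MambaVLA design.
import torch
import torch.nn as nn


class FlowMatchingActionHead(nn.Module):
    """Predicts a velocity field that transports Gaussian noise to robot action vectors."""

    def __init__(self, cond_dim: int = 256, action_dim: int = 7, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def velocity(self, a_t, t, cond):
        # a_t: (B, action_dim) noisy action, t: (B, 1) time in [0, 1], cond: (B, cond_dim)
        return self.net(torch.cat([a_t, cond, t], dim=-1))

    def loss(self, actions, cond):
        # Conditional flow matching objective with a straight-line (rectified) path:
        # a_t = (1 - t) * noise + t * action, target velocity = action - noise.
        noise = torch.randn_like(actions)
        t = torch.rand(actions.size(0), 1, device=actions.device)
        a_t = (1 - t) * noise + t * actions
        v_target = actions - noise
        return ((self.velocity(a_t, t, cond) - v_target) ** 2).mean()

    @torch.no_grad()
    def sample(self, cond, steps: int = 10):
        # Euler integration of the learned ODE from noise (t=0) to an action (t=1).
        a = torch.randn(cond.size(0), self.net[-1].out_features, device=cond.device)
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((cond.size(0), 1), i * dt, device=cond.device)
            a = a + dt * self.velocity(a, t, cond)
        return a


# Usage: fused Eagle + Qwen embeddings (random placeholders here) condition the action head.
cond = torch.randn(4, 256)
head = FlowMatchingActionHead()
print(head.loss(torch.randn(4, 7), cond).item(), head.sample(cond).shape)
```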
Extensive evaluations on existing vision-language-action benchmarks confirm that MambaVLA achieves performance competitive with or superior to Transformer-based models while maintaining significantly lower computational overhead and faster inference. Our results highlight the potential of integrating state space models with flow-based generative learning to create compact, scalable, and deployable Vision-Language-Action systems for real-world embodied intelligence.
Watch MambaVLA in action on various robotic manipulation tasks across different datasets.