MambaVLA: A Scalable and Efficient Vision-Language-Action Model with State Space Architecture#
📚 Publication#
Published at: Consumer Communications & Networking Conference (CCNC), 2026
Authors: Sai Navaneet, Manisha Lingala, Sangmoon Lee, Ju H. Park
Date: January 2026
📝 Abstract#
Recent advances in multimodal learning have enabled powerful Vision–Language–Action (VLA) systems for robotic reasoning and control. However, most existing approaches rely on Transformer backbones, which face scalability and efficiency bottlenecks for long sequences. This work introduces MambaVLA, a scalable VLA framework built on the Mamba state space architecture for efficient sequence modeling. The framework integrates the Eagle visual encoder and Qwen-7B-Chat-Int4 language model to achieve fine-grained multimodal fusion with linear-time complexity. A diffusion flow matching module further aligns visual–language embeddings with continuous action trajectories, enabling smooth and precise control. Extensive evaluations on standard VLA benchmarks demonstrate that MambaVLA matches or surpasses Transformer-based models while offering substantially lower computational cost and faster inference. These results highlight the potential of state space modeling and flow-based action generation for compact, scalable, and deployable embodied intelligence systems.
Project Website
🎯 Key Features#
State Space Architecture: Efficient sequence processing using Mamba-based state space models (see the sketch after this list)
Vision-Language Integration: Seamless fusion of natural language commands with visual perception
Scalable Design: Efficient architecture suitable for real-time robot control
Action Generation: Unified framework for generating robot actions from multimodal inputs
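To make the first feature concrete, the snippet below sketches the selective state-space recurrence that gives Mamba-style backbones their linear-time scaling over long multimodal token sequences. It is an illustrative PyTorch reimplementation under simplifying assumptions (per-channel diagonal state, arbitrary layer sizes), not the code used in MambaVLA.

```python
# Minimal selective SSM sketch; dimensions and parameterization are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelectiveSSM(nn.Module):
    """Per-channel selective state space layer scanned in linear time over the sequence."""

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.A_log = nn.Parameter(torch.randn(d_model, d_state))  # continuous-time decay (log space)
        self.to_B = nn.Linear(d_model, d_state)   # input-dependent input matrix
        self.to_C = nn.Linear(d_model, d_state)   # input-dependent output matrix
        self.to_dt = nn.Linear(d_model, d_model)  # input-dependent step size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> y: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        A = -torch.exp(self.A_log)         # negative real part keeps the discretized state stable
        dt = F.softplus(self.to_dt(x))     # positive step sizes, (batch, seq_len, d_model)
        B, C = self.to_B(x), self.to_C(x)  # (batch, seq_len, d_state) each

        h = x.new_zeros(batch, d_model, A.shape[-1])
        outputs = []
        for t in range(seq_len):  # one constant-cost state update per token: O(seq_len) overall
            A_bar = torch.exp(dt[:, t, :, None] * A)                       # discretized transition
            B_x = dt[:, t, :, None] * B[:, t, None, :] * x[:, t, :, None]  # discretized input
            h = A_bar * h + B_x
            outputs.append((h * C[:, t, None, :]).sum(-1))                 # per-channel readout
        return torch.stack(outputs, dim=1)


# Example: a short multimodal token sequence processed with a fixed-size hidden state.
tokens = torch.randn(2, 12, 64)
print(SelectiveSSM(d_model=64)(tokens).shape)  # torch.Size([2, 12, 64])
```

Unlike self-attention, the per-token cost of this recurrence does not grow with sequence length, which is the property referred to above as linear-time complexity.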
💡 Primary Contributions#
My primary contributions are fourfold:
First, I introduce MambaVLA, a novel Vision–Language–Action framework that leverages structured state space models, specifically the Mamba architecture, to overcome scalability and long-sequence modeling limitations inherent in Transformer-based approaches.
Second, I integrate the Eagle2 visual backbone and the Qwen-7B-Chat-Int4 language encoder with a unified multimodal fusion mechanism, enabling efficient alignment between spatial and linguistic features.
Third, I incorporate a diffusion flow-matching module that generates continuous, smooth, and physically consistent action trajectories, effectively bridging the gap between multimodal perception and low-level control (a minimal sketch of this idea follows the list of contributions).
Fourth, I conduct extensive evaluations across multiple established VLA benchmarks, demonstrating that MambaVLA achieves competitive or superior performance compared to state-of-the-art Transformer-based models while maintaining substantially lower computational overhead and faster inference.
Overall, these contributions highlight the potential of combining state space architectures with flow-based generative modeling to build compact, efficient, and scalable Vision–Language–Action systems for real-world embodied intelligence.
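To illustrate the flow-matching idea from the third contribution, the sketch below trains a small network to regress the velocity field that transports Gaussian noise to demonstrated action chunks, conditioned on a fused vision-language embedding, and then integrates that field at inference time to sample a trajectory. The `FlowMatchingActionHead` class, its layer sizes, the linear interpolation path, and the `fused_embedding` placeholder are hypothetical choices for exposition, not the paper's exact formulation.

```python
# Hedged sketch of a conditional flow-matching action head; all sizes are assumptions.
import torch
import torch.nn as nn


class FlowMatchingActionHead(nn.Module):
    def __init__(self, cond_dim: int, action_dim: int, horizon: int, hidden: int = 256):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(cond_dim + horizon * action_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def velocity(self, cond, a_t, t):
        # Predict the velocity field v(a_t, t | cond) for the current noisy action chunk.
        inp = torch.cat([cond, a_t.flatten(1), t[:, None]], dim=-1)
        return self.net(inp).view(-1, self.horizon, self.action_dim)

    def loss(self, cond, actions):
        # Linear path a_t = (1 - t) * noise + t * actions has target velocity actions - noise.
        noise = torch.randn_like(actions)
        t = torch.rand(actions.shape[0], device=actions.device)
        a_t = (1 - t[:, None, None]) * noise + t[:, None, None] * actions
        target_v = actions - noise
        return ((self.velocity(cond, a_t, t) - target_v) ** 2).mean()

    @torch.no_grad()
    def sample(self, cond, steps: int = 10):
        # Integrate the learned ODE from noise to an action trajectory with Euler steps.
        a = torch.randn(cond.shape[0], self.horizon, self.action_dim, device=cond.device)
        for k in range(steps):
            t = torch.full((cond.shape[0],), k / steps, device=cond.device)
            a = a + self.velocity(cond, a, t) / steps
        return a


# Example: fused_embedding is a hypothetical (batch, 512) vision-language feature.
head = FlowMatchingActionHead(cond_dim=512, action_dim=7, horizon=8)
fused_embedding = torch.randn(4, 512)
demo_actions = torch.randn(4, 8, 7)
print(head.loss(fused_embedding, demo_actions).item())
print(head.sample(fused_embedding).shape)  # torch.Size([4, 8, 7])
```

Because sampling only integrates a learned ODE for a handful of Euler steps, this style of action head tends to yield smooth, continuous trajectories with few network evaluations.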
🔬 Methodology#
Architecture Overview#
The MambaVLA architecture pairs a Mamba state space backbone with the Eagle visual encoder and the Qwen-7B-Chat-Int4 language model: visual and linguistic features are fused into a single token sequence, processed in linear time by the state space layers, and decoded into continuous action trajectories by the flow matching module. This yields an efficient and scalable VLA framework.
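The sketch below traces this pipeline end to end under stated assumptions: the Eagle visual encoder and Qwen-7B-Chat-Int4 are replaced by random-feature stubs with assumed feature widths (1024 and 4096), the Mamba stack is stood in for by a GRU purely so the example runs without extra dependencies, and the flow matching head is reduced to a linear readout. It shows the shape of the data flow only, not the authors' implementation.

```python
# Hypothetical end-to-end data flow; all module sizes and stand-ins are assumptions.
import torch
import torch.nn as nn


class MambaVLASketch(nn.Module):
    def __init__(self, d_model: int = 256, action_dim: int = 7, horizon: int = 8):
        super().__init__()
        self.visual_proj = nn.Linear(1024, d_model)  # stand-in for Eagle features -> shared space
        self.text_proj = nn.Linear(4096, d_model)    # stand-in for Qwen hidden states -> shared space
        # A real implementation would place Mamba blocks here (e.g. from the mamba-ssm package);
        # a GRU is used only so this sketch runs without extra dependencies.
        self.sequence_model = nn.GRU(d_model, d_model, batch_first=True)
        # Stand-in for the flow matching action head: a direct linear readout of an action chunk.
        self.action_head = nn.Linear(d_model, horizon * action_dim)
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, visual_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # Project both modalities into a shared space and concatenate them into one token sequence.
        tokens = torch.cat([self.visual_proj(visual_feats), self.text_proj(text_feats)], dim=1)
        fused, _ = self.sequence_model(tokens)  # linear-time pass over the fused sequence
        last = fused[:, -1]                     # summary state after reading all tokens
        return self.action_head(last).view(-1, self.horizon, self.action_dim)


# Example with dummy features: 49 visual tokens and 16 text tokens per sample.
model = MambaVLASketch()
actions = model(torch.randn(2, 49, 1024), torch.randn(2, 16, 4096))
print(actions.shape)  # torch.Size([2, 8, 7])
```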
📊 Results#
This work demonstrates the effectiveness of state space architectures in vision-language-action tasks, providing a scalable alternative to Transformer-based approaches.
In terms of data efficiency, MambaVLA requires substantially fewer demonstrations to reach the performance of existing baselines: it matches SmolVLA with 40 demonstrations instead of 50, and reaches OpenVLA-level performance with just 35 demonstrations.