In embodied intelligence, visual perception carries substantially higher information entropy than language, providing the dense, continuous, and multi-level cues needed to understand the physical world and guide robotic action. While language offers symbolic abstraction, vision conveys rich spatial and functional priors that dominate perception-driven decision-making. To fully exploit this high-entropy modality, we propose BioVLA, a bio-inspired vision-language-action model for robotic manipulation. Drawing on the organization of the human visual cortex, BioVLA introduces a visual encoder that simulates region-specific neural responses to different visual cues, yielding multi-functional feature representations that are adaptively re-weighted through a function-aware re-response mechanism. This mechanism refines visual representations by emphasizing function-specific responses while maintaining balanced multi-cue integration. BioVLA further incorporates a vision-guided action refinement module that dynamically modulates the hidden states of the action decoder based on visual feedback, preserving rich visual information throughout the perception–action transformation. Experiments on the RoboTwin 2.0 platform demonstrate that BioVLA achieves significant performance improvements over existing state-of-the-art VLA models across diverse manipulation tasks.
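As a rough illustration of this mechanism, the PyTorch sketch below splits backbone patch tokens into function-specific responses and re-weights them with a learned, function-aware gate. The class names (`FunctionBranch`, `BioInspiredEncoder`), the number of branches, and the gating design are assumptions made for illustration, not the actual BioVLA architecture.

```python
# Minimal PyTorch sketch of the idea above. The branch/gate design, module
# names, and the number of function branches are illustrative assumptions,
# not the actual BioVLA implementation.
import torch
import torch.nn as nn


class FunctionBranch(nn.Module):
    """One region-specific response: a small head specialised for one visual cue."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, N, D)
        return self.proj(tokens)


class BioInspiredEncoder(nn.Module):
    """Splits backbone features into function-specific responses, then applies a
    function-aware gate (the "re-response") to re-weight and fuse them."""

    def __init__(self, dim: int = 768, num_functions: int = 3):
        super().__init__()
        self.branches = nn.ModuleList([FunctionBranch(dim) for _ in range(num_functions)])
        # The gate predicts one weight per function from the pooled scene token.
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, num_functions)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, N, D)
        responses = torch.stack([b(tokens) for b in self.branches], dim=1)  # (B, F, N, D)
        weights = torch.softmax(self.gate(tokens.mean(dim=1)), dim=-1)      # (B, F)
        # Emphasise function-specific responses while keeping a balanced mixture.
        fused = (weights[:, :, None, None] * responses).sum(dim=1)          # (B, N, D)
        return fused + tokens  # residual connection keeps the original backbone features


# Example: re-weight 196 patch tokens from a ViT-style backbone.
encoder = BioInspiredEncoder(dim=768, num_functions=3)
refined = encoder(torch.randn(2, 196, 768))  # -> (2, 196, 768)
```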
The pipeline begins with a bio-inspired visual encoder that models region-specific responses analogous to those of the human visual cortex, producing multi-functional feature representations. These responses are then refined through a function-aware re-response and adaptive re-weighting mechanism. Finally, a vision-guided action refinement module dynamically modulates the hidden states of the action decoder, ensuring stable perception–action alignment and mitigating information loss in long-horizon reasoning.
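One plausible way to realise this modulation is FiLM-style conditioning, where a per-channel scale and shift for the decoder states are predicted from pooled visual features. The sketch below assumes that reading; the class name `VisionGuidedRefinement` and all shapes are hypothetical, not taken from the BioVLA implementation.

```python
# Hypothetical sketch of vision-guided action refinement as FiLM-style modulation:
# the action decoder's hidden states are re-scaled and shifted by parameters
# predicted from visual features. Names and shapes are assumptions.
import torch
import torch.nn as nn


class VisionGuidedRefinement(nn.Module):
    def __init__(self, hidden_dim: int, visual_dim: int):
        super().__init__()
        # Predict a per-channel scale (gamma) and shift (beta) from visual feedback.
        self.to_film = nn.Linear(visual_dim, 2 * hidden_dim)

    def forward(self, hidden: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, H) action-decoder states; visual_feat: (B, N, Dv) visual tokens.
        gamma, beta = self.to_film(visual_feat.mean(dim=1)).chunk(2, dim=-1)  # (B, H) each
        # Using (1 + gamma) biases the modulation toward an identity mapping.
        return (1 + gamma).unsqueeze(1) * hidden + beta.unsqueeze(1)


# Example: modulate 16 decoder steps with pooled visual tokens.
refine = VisionGuidedRefinement(hidden_dim=512, visual_dim=768)
out = refine(torch.randn(2, 16, 512), torch.randn(2, 196, 768))  # -> (2, 16, 512)
```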
All models are trained and evaluated on the RoboTwin 2.0 platform using the Aloha-AgileX embodiment, with 50 demonstrations per task. The best result in each row is shown in **bold**, and the second-best in *italics*.
| Task | Mode | BioVLA | RDT | Pi0 | ACT | DP | DP3 | OpenVLA-OFT |
|---|---|---|---|---|---|---|---|---|
| Click Alarmclock | Clean | **79%** | 61% | 63% | 32% | 61% | *77%* | 75% |
| | Randomized | *18%* | 12% | 11% | 4% | 5% | 14% | **19%** |
| Dump Bin Bigbin | Clean | **93%** | 64% | 83% | 68% | 49% | *85%* | 22% |
| | Randomized | 25% | *32%* | 24% | 1% | 0% | **53%** | 15% |
| Hanging Mug | Clean | **31%** | *23%* | 11% | 7% | 8% | 17% | 10% |
| | Randomized | *10%* | **16%** | 3% | 0% | 0% | 1% | 4% |
| Open Laptop | Clean | **86%** | 59% | *85%* | 56% | 49% | 82% | 40% |
| | Randomized | 13% | *32%* | **46%** | 0% | 0% | 7% | 26% |
| Place Cans Plasticbox | Clean | **69%** | 6% | 34% | 16% | 40% | *48%* | 4% |
| | Randomized | **7%** | *5%* | 2% | 0% | 0% | 3% | 2% |
| Press Stapler | Clean | **88%** | 41% | 62% | 17% | 6% | *69%* | 14% |
| | Randomized | **40%** | 24% | *29%* | 6% | 0% | 3% | 12% |
| Put Bottles Dustbin | Clean | **61%** | 21% | 54% | 27% | 22% | *60%* | 33% |
| | Randomized | **25%** | 4% | 13% | 1% | 0% | *21%* | 13% |
| Shake Bottle | Clean | **98%** | 74% | *97%* | 74% | 65% | **98%** | 22% |
| | Randomized | *47%* | 45% | **60%** | 10% | 8% | 19% | 21% |
| Turn Switch | Clean | **49%** | 35% | 27% | 5% | 36% | *46%* | 39% |
| | Randomized | **32%** | 15% | 23% | 2% | 1% | 8% | *26%* |
| Average | Clean | **70.7%** | 43.2% | 57.0% | 33.1% | 36.0% | *63.8%* | 25.8% |
| | Randomized | **23.2%** | 19.0% | *22.9%* | 6.2% | 3.3% | 13.8% | 13.9% |
Results for additional tasks will be added on an ongoing basis.
Qualitative visualization of BioVLA performing manipulation tasks.