In embodied intelligence, visual perception carries substantially higher information entropy than language, providing the dense, continuous, and multi-level cues needed to understand the physical world and guide robotic action. While language offers symbolic abstraction, vision conveys rich spatial and functional priors that dominate perception-driven decision-making. To fully exploit this high-entropy modality, we propose BioVLA, a bio-inspired vision-language-action model for robotic manipulation. Drawing on the organization of the human visual cortex, BioVLA introduces a visual encoder that simulates region-specific neural responses to different visual cues, yielding multi-functional feature representations that are adaptively re-weighted through a function-aware re-response mechanism. This mechanism refines visual representations by emphasizing function-specific responses while maintaining balanced multi-cue integration. BioVLA further incorporates a vision-guided action refinement module that dynamically modulates the hidden states of the action decoder based on visual feedback, preserving rich visual information throughout the perception–action transformation. Experiments on the RoboTwin 2.0 platform demonstrate that BioVLA achieves significant performance improvements over existing state-of-the-art VLA models across diverse manipulation tasks.
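As a rough illustration of this mechanism, the PyTorch sketch below splits backbone patch tokens into function-specific responses and re-weights them with a learned, function-aware gate. The class names (`FunctionBranch`, `BioInspiredEncoder`), the number of branches, and the gating design are assumptions made for illustration, not the actual BioVLA architecture.

```python
# Minimal PyTorch sketch of the idea above. The branch/gate design, module
# names, and the number of function branches are illustrative assumptions,
# not the actual BioVLA implementation.
import torch
import torch.nn as nn


class FunctionBranch(nn.Module):
    """One region-specific response: a small head specialised for one visual cue."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, N, D)
        return self.proj(tokens)


class BioInspiredEncoder(nn.Module):
    """Splits backbone features into function-specific responses, then applies a
    function-aware gate (the "re-response") to re-weight and fuse them."""

    def __init__(self, dim: int = 768, num_functions: int = 3):
        super().__init__()
        self.branches = nn.ModuleList([FunctionBranch(dim) for _ in range(num_functions)])
        # The gate predicts one weight per function from the pooled scene token.
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, num_functions)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, N, D)
        responses = torch.stack([b(tokens) for b in self.branches], dim=1)  # (B, F, N, D)
        weights = torch.softmax(self.gate(tokens.mean(dim=1)), dim=-1)      # (B, F)
        # Emphasise function-specific responses while keeping a balanced mixture.
        fused = (weights[:, :, None, None] * responses).sum(dim=1)          # (B, N, D)
        return fused + tokens  # residual connection keeps the original backbone features


# Example: re-weight 196 patch tokens from a ViT-style backbone.
encoder = BioInspiredEncoder(dim=768, num_functions=3)
refined = encoder(torch.randn(2, 196, 768))  # -> (2, 196, 768)
```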
The pipeline begins with a bio-inspired visual encoder that models region-specific responses analogous to those of the human visual cortex, producing multi-functional feature representations. These responses are then refined through a function-aware re-response and adaptive re-weighting mechanism. Finally, a vision-guided action refinement module dynamically modulates the hidden states of the action decoder, ensuring stable perception–action alignment and mitigating information loss in long-horizon reasoning.
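One plausible way to realise this modulation is FiLM-style conditioning, where a per-channel scale and shift for the decoder states are predicted from pooled visual features. The sketch below assumes that reading; the class name `VisionGuidedRefinement` and all shapes are hypothetical, not taken from the BioVLA implementation.

```python
# Hypothetical sketch of vision-guided action refinement as FiLM-style modulation:
# the action decoder's hidden states are re-scaled and shifted by parameters
# predicted from visual features. Names and shapes are assumptions.
import torch
import torch.nn as nn


class VisionGuidedRefinement(nn.Module):
    def __init__(self, hidden_dim: int, visual_dim: int):
        super().__init__()
        # Predict a per-channel scale (gamma) and shift (beta) from visual feedback.
        self.to_film = nn.Linear(visual_dim, 2 * hidden_dim)

    def forward(self, hidden: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, H) action-decoder states; visual_feat: (B, N, Dv) visual tokens.
        gamma, beta = self.to_film(visual_feat.mean(dim=1)).chunk(2, dim=-1)  # (B, H) each
        # Using (1 + gamma) biases the modulation toward an identity mapping.
        return (1 + gamma).unsqueeze(1) * hidden + beta.unsqueeze(1)


# Example: modulate 16 decoder steps with pooled visual tokens.
refine = VisionGuidedRefinement(hidden_dim=512, visual_dim=768)
out = refine(torch.randn(2, 16, 512), torch.randn(2, 196, 768))  # -> (2, 16, 512)
```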
All models are trained and evaluated on the RoboTwin 2.0 platform using the Aloha-AgileX embodiment, with 50 demonstrations per task. The best result in each row is shown in **bold**, and the second-best in *italics*.
| Task | Mode | BioVLA | RDT | Pi0 | ACT | DP | DP3 | OpenVLA-OFT |
|---|---|---|---|---|---|---|---|---|
| Click Alarmclock | Clean | **79%** | 61% | 63% | 32% | 61% | *77%* | 75% |
| | Randomized | *18%* | 12% | 11% | 4% | 5% | 14% | **19%** |
| Dump Bin Bigbin | Clean | **93%** | 64% | 83% | 68% | 49% | *85%* | 22% |
| | Randomized | 25% | *32%* | 24% | 1% | 0% | **53%** | 15% |
| Hanging Mug | Clean | **31%** | *23%* | 11% | 7% | 8% | 17% | 10% |
| | Randomized | *10%* | **16%** | 3% | 0% | 0% | 1% | 4% |
| Open Laptop | Clean | **86%** | 59% | *85%* | 56% | 49% | 82% | 40% |
| | Randomized | 13% | *32%* | **46%** | 0% | 0% | 7% | 26% |
| Place Cans Plasticbox | Clean | **69%** | 6% | 34% | 16% | 40% | *48%* | 4% |
| | Randomized | **7%** | *5%* | 2% | 0% | 0% | 3% | 2% |
| Press Stapler | Clean | **88%** | 41% | 62% | 17% | 6% | *69%* | 14% |
| | Randomized | **40%** | 24% | *29%* | 6% | 0% | 3% | 12% |
| Put Bottles Dustbin | Clean | **61%** | 21% | 54% | 27% | 22% | *60%* | 33% |
| | Randomized | **25%** | 4% | 13% | 1% | 0% | *21%* | 13% |
| Shake Bottle | Clean | **98%** | 74% | *97%* | 74% | 65% | **98%** | 22% |
| | Randomized | *47%* | 45% | **60%** | 10% | 8% | 19% | 21% |
| Turn Switch | Clean | **49%** | 35% | 27% | 5% | 36% | *46%* | 39% |
| | Randomized | **32%** | 15% | 23% | 2% | 1% | 8% | *26%* |
| Average | Clean | **70.7%** | 43.2% | 57.0% | 33.1% | 36.0% | *63.8%* | 25.8% |
| | Randomized | **23.2%** | 19.0% | *22.9%* | 6.2% | 3.3% | 13.8% | 13.9% |
Results for additional tasks will be added on an ongoing basis.
Qualitative visualization of BioVLA performing manipulation tasks.