GIVE: Grounding Human Gestures in Vision-Language-Action Models

Pengfei Liu*, Gen Li*, Junqiao Fan, Boyu Ma, Jindou Jia, Yang Xiao, Jianfei Yang†

MARS Lab, Nanyang Technological University, Singapore

*Equal contribution. †Corresponding author.

Highlights

Current Vision-Language-Action (VLA) models treat robotic manipulation as a purely language-driven task and struggle when human intentions are conveyed through non-verbal gestures. To address this, we propose GIVE. Our core contributions are:

Effective Method without Architectural Modifications: GIVE injects human gesture cues into pre-trained VLA models without modifying their architecture.
Complementary Dual-path Gesture Guidance Strategy: GIVE incorporates gesture information through a visual pathway for object grounding and a semantic pathway for interaction state grounding.
Extensive Real-World Validation: GIVE enables VLA policies to accurately associate gestures with manipulation behaviors, improving target object recognition accuracy by 40 percentage points and overall task success rate by 80 percentage points over the π₀.₅ baseline (Identify SR from 46.7% to 86.7% and Handover SR from 0.0% to 80.0% in the ablation table).

Target Grounding Accuracy: +40%

Overall Task Success Rate: +80%

Real-World Demos

Experimental comparison with the π₀.₅ baseline

Method

GIVE enhances pre-trained VLA models by transforming human gestures into two complementary guidance pathways. Specifically, the visual pathway overlays geometric gesture prompts onto robot observations to explicitly resolve spatial ambiguities for target grounding, while the semantic pathway translates continuous human hand states into high-level textual descriptions for robust intent parsing. By jointly leveraging these visual and semantic cues, GIVE enables foundation policies to accurately generate continuous manipulation actions without requiring any architectural modifications.
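As a rough illustration of the two pathways described above, the following Python sketch draws hand keypoints and a pointing ray onto a camera frame, and turns a scalar hand state into a text suffix for the instruction. It is not the authors' implementation: the function names, the "openness" score, the 0.6 threshold, the colors, and the prompt wording are all assumptions made for this example, and the keypoints are assumed to come from some external hand-pose estimator.

```python
import numpy as np


def draw_disk(img, center, radius, color):
    """Paint a filled disk onto an HxWx3 uint8 image, in place."""
    h, w = img.shape[:2]
    ys, xs = np.ogrid[:h, :w]
    mask = (xs - center[0]) ** 2 + (ys - center[1]) ** 2 <= radius ** 2
    img[mask] = color


def overlay_gesture(image, keypoints, ray_origin, ray_tip, ray_len=200.0):
    """Visual pathway (sketch): return a copy of `image` with gesture prompts.

    keypoints: iterable of (x, y) pixel coordinates (hypothetical detector output).
    ray_origin, ray_tip: two pixels defining the pointing direction,
    for example the base and the tip of the index finger (an assumption).
    """
    out = image.copy()
    for x, y in np.asarray(keypoints, dtype=float):
        draw_disk(out, (x, y), 3, (0, 255, 0))  # green keypoint markers

    tip = np.asarray(ray_tip, dtype=float)
    direction = tip - np.asarray(ray_origin, dtype=float)
    norm = np.linalg.norm(direction)
    if norm > 0:
        direction /= norm
        # Sample points along the ray, starting at the fingertip.
        ts = np.arange(0.0, ray_len, 0.5)
        pts = np.rint(tip + ts[:, None] * direction).astype(int)
        h, w = out.shape[:2]
        inside = (pts[:, 0] >= 0) & (pts[:, 0] < w) & (pts[:, 1] >= 0) & (pts[:, 1] < h)
        out[pts[inside, 1], pts[inside, 0]] = (255, 0, 0)  # red pointing ray
    return out


def describe_hand_state(openness, open_thresh=0.6):
    """Semantic pathway (sketch): map a continuous hand state to text.

    `openness` is a hypothetical score in [0, 1]; the two phrases below are
    illustrative, not the descriptions used in the paper.
    """
    if openness >= open_thresh:
        return "the human's hand is open"
    return "the human's hand is closed"


def build_prompt(instruction, openness):
    """Append the textual hand-state description to the language instruction."""
    return f"{instruction}; {describe_hand_state(openness)}"
```

Because both cues are expressed in the policy's existing input modalities (an edited image and an extended text prompt), a sketch like this leaves the VLA model itself untouched, which is the property the method description emphasizes.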

Experiments & Analysis

We evaluate GIVE on a sequential grasp-then-handover human-robot interaction (HRI) task. Our analysis highlights three core findings:

Integration Strategy: Our parameter-free Visual Overlay outperforms token-based injection, preserving pre-trained representations while explicitly encoding gesture-object spatial relations.
Complementary Guidance: Ablations show that the visual and semantic pathways complement each other. Combining them yields an 80.0% handover success rate, versus 20.0% for the best visual-only variant.
Robustness & Generalization: GIVE maintains stable performance across unseen spatial layouts, objects, and diverse human participants, effectively decoupling intent reasoning from visual appearance.

Impact of gesture integration mechanisms

Figure: Ablation results on gesture keypoint integration strategies, comparing our visual overlay with token-based injection approaches on policy performance.

Ablation study of guidance components

METHOD	IDENTIFY SR	GRASP SR	REACT SR	HANDOVER SR
π₀.₅ (Baseline)	46.7%	6.7%	3.3%	0.0%
+ VGP (keypoints)	56.7%	30.0%	13.3%	3.3%
+ VGP (keypoints + ray)	70.0%	60.0%	23.3%	20.0%
+ VGP & SIP (Ours)	86.7%	80.0%	80.0%	80.0%

VGP (Visual Geometric Prompting) and SIP (Semantic Intent Parsing) show complementary contributions to task success. SR denotes success rate.
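To make the headline gains easy to verify, the short Python sketch below recomputes them from the ablation table. The numbers are copied from the table; the dictionary layout and the `gains` helper are introduced here only for illustration.

```python
STAGES = ["Identify", "Grasp", "React", "Handover"]

# Success rates (%) copied from the ablation table, in STAGES order.
ABLATION = {
    "pi0.5 (Baseline)": [46.7, 6.7, 3.3, 0.0],
    "+ VGP (keypoints)": [56.7, 30.0, 13.3, 3.3],
    "+ VGP (keypoints + ray)": [70.0, 60.0, 23.3, 20.0],
    "+ VGP & SIP (Ours)": [86.7, 80.0, 80.0, 80.0],
}


def gains(table, base, variant):
    """Percentage-point difference between two rows, per stage."""
    return {
        stage: round(v - b, 1)
        for stage, b, v in zip(STAGES, table[base], table[variant])
    }


if __name__ == "__main__":
    rows = list(ABLATION)
    # Total gain of the full method over the baseline.
    print(gains(ABLATION, rows[0], rows[-1]))
    # Marginal gain of each added component over the previous row.
    for prev, cur in zip(rows, rows[1:]):
        print(cur, gains(ABLATION, prev, cur))
```

The full method gains 40.0 points on Identify and 80.0 points on Handover over the baseline. Adding SIP on top of the best visual-only variant accounts for 56.7 points on React and 60.0 points on Handover, while the Identify and Grasp stages benefit mostly from the visual prompts.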

Spatial robustness evaluation

LOC	PEACH (π₀.₅)	PEACH (OURS)	MANGO (π₀.₅)	MANGO (OURS)	CARROT (π₀.₅)	CARROT (OURS)
A	5/5	4/5	3/5	5/5	1/5	4/5
B	3/5	5/5	3/5	5/5	0/5	5/5
C	1/5	3/5	2/5	4/5	1/5	5/5

Successful trials (out of 5) for the initial Identify & Grasp stage across varying object locations (A, B, C).
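The per-cell counts in the spatial robustness table can be pooled into overall rates. The Python sketch below does this with the counts copied from the table; the data layout and the `aggregate` helper are assumptions made for this example.

```python
TRIALS_PER_CELL = 5

# (baseline successes, ours successes) out of 5, copied from the table.
SPATIAL = {
    "A": {"Peach": (5, 4), "Mango": (3, 5), "Carrot": (1, 4)},
    "B": {"Peach": (3, 5), "Mango": (3, 5), "Carrot": (0, 5)},
    "C": {"Peach": (1, 3), "Mango": (2, 4), "Carrot": (1, 5)},
}


def aggregate(table):
    """Return (baseline successes, ours successes, total trials per method)."""
    base = sum(b for row in table.values() for b, _ in row.values())
    ours = sum(o for row in table.values() for _, o in row.values())
    total = TRIALS_PER_CELL * sum(len(row) for row in table.values())
    return base, ours, total


if __name__ == "__main__":
    base, ours, total = aggregate(SPATIAL)
    print(f"baseline: {base}/{total} = {100 * base / total:.1f}%")
    print(f"ours:     {ours}/{total} = {100 * ours / total:.1f}%")
```

Pooled over all locations and objects, the baseline succeeds in 19 of 45 trials (42.2%) and GIVE in 40 of 45 (88.9%). The only cell where the baseline scores higher is the peach at location A (5/5 versus 4/5).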

Generalization to unseen human participants

Robustness demonstration against variations in body shape, clothing, and hand size.

BibTeX

@misc{liu2026givegroundinghumangestures,
  title={GIVE: Grounding Human Gestures in Vision-Language-Action Models},
  author={Pengfei Liu and Gen Li and Junqiao Fan and Boyu Ma and Jindou Jia and Yang Xiao and Jianfei Yang},
  year={2026},
  eprint={2606.13435},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2606.13435},
}