NeuGrasp: Generalizable Neural Surface Reconstruction with Background Priors for Material-Agnostic Object Grasp Detection

Qingyu Fan1,2,3, Yinghao Cai1,2†, Chao Li3, Wenzhe He3, Xudong Zheng3, Tao Lu1, Bin Liang3, Shuo Wang1,2
1State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
2School of Artificial Intelligence, University of Chinese Academy of Sciences
3Qiyuan Lab

† corresponding author


Fig. 1. Overview of NeuGrasp. We introduce a generalizable method that utilizes background priors within a neural implicit surface framework to achieve real-time scene reconstruction and material-agnostic grasping from observations within a narrow field of view.

NeuGrasp

NeuGrasp-RA

Abstract

Robotic grasping in scenes with transparent and specular objects presents great challenges for methods relying on accurate depth information. In this paper, we introduce NeuGrasp, a neural surface reconstruction method that leverages background priors for material-agnostic grasp detection. NeuGrasp integrates transformers and global prior volumes to aggregate multi-view features with spatial encoding, enabling robust surface reconstruction in narrow and sparse viewing conditions. By focusing on foreground objects through residual feature enhancement and refining spatial perception with an occupancy-prior volume, NeuGrasp excels in handling objects with transparent and specular surfaces. Extensive experiments in both simulated and real-world scenarios show that NeuGrasp outperforms state-of-the-art methods in grasping while maintaining comparable reconstruction quality.
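As a rough illustration of the background-prior idea described above, the sketch below differences features of scene and background images captured from the same viewpoints to emphasize foreground objects. This is a minimal, hypothetical example (the module name, encoder, and feature dimensions are our own assumptions), not the released NeuGrasp code.

```python
# Minimal sketch of residual feature enhancement via background priors.
# All names and shapes are assumptions for illustration only.
import torch
import torch.nn as nn


class ResidualFeatureEnhancement(nn.Module):
    """Highlights foreground objects by differencing scene and background features."""

    def __init__(self, feat_dim: int = 32):
        super().__init__()
        # Shared per-view image encoder (placeholder for the real feature extractor).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )
        # Small fusion head that re-weights scene features with the residual cue.
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, 1)

    def forward(self, scene_imgs: torch.Tensor, bg_imgs: torch.Tensor):
        # scene_imgs, bg_imgs: (num_views, 3, H, W), captured from the same poses.
        f_scene = self.encoder(scene_imgs)
        f_bg = self.encoder(bg_imgs)
        residual = f_scene - f_bg                       # foreground-focused residual features
        enhanced = self.fuse(torch.cat([f_scene, residual], dim=1))
        return enhanced, residual


if __name__ == "__main__":
    views = torch.rand(4, 3, 128, 128)                  # four narrow-baseline scene views
    background = torch.rand(4, 3, 128, 128)              # background-only captures
    enhanced, residual = ResidualFeatureEnhancement()(views, background)
    print(enhanced.shape, residual.shape)                 # (4, 32, 128, 128) each
```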

Video

Method


Fig. 2. Framework of NeuGrasp. NeuGrasp leverages background priors for neural surface reconstruction and material-agnostic grasp detection. A Residual Feature Enhancement module is proposed to focus the model's attention on foreground objects rather than on irrelevant background information. We build an occupancy-prior volume from residual features and a shape-prior volume from scene features. These volumes are then combined with multi-view features using Residual and Source View Transformers, and further refined by a Ray Transformer to capture geometric details. The resulting unified view feature and attention-weighted features are decoded into a signed distance function and converted into a radiance field. Finally, the grasping module maps the reconstructed geometry to 6-DoF grasp poses, enabling end-to-end training.
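The following is a simplified, hypothetical sketch of the later stages of the pipeline in Fig. 2: per-point multi-view features are fused across source views with attention (standing in for the view and ray transformers), decoded into an SDF volume, and mapped by a VGN-style head to per-voxel grasp quality, rotation, and width. Feature projection, the prior volumes, and the radiance-field branch are omitted; all names and shapes are assumptions, not the authors' implementation.

```python
# Simplified sketch: multi-view feature aggregation -> SDF volume -> grasp head.
# Hypothetical modules for illustration; not the released NeuGrasp code.
import torch
import torch.nn as nn


class ViewAggregator(nn.Module):
    """Fuses per-point features from multiple source views and decodes an SDF value."""

    def __init__(self, feat_dim: int = 32, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.sdf_mlp = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, point_view_feats: torch.Tensor) -> torch.Tensor:
        # point_view_feats: (num_points, num_views, feat_dim)
        fused, _ = self.attn(point_view_feats, point_view_feats, point_view_feats)
        fused = fused.mean(dim=1)                        # one unified feature per query point
        return self.sdf_mlp(fused).squeeze(-1)           # signed distance per point


class GraspHead(nn.Module):
    """Maps an SDF grid to dense 6-DoF grasps (quality, rotation quaternion, width), VGN-style."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv3d(1, 16, 3, padding=1), nn.ReLU())
        self.quality = nn.Conv3d(16, 1, 1)
        self.rotation = nn.Conv3d(16, 4, 1)
        self.width = nn.Conv3d(16, 1, 1)

    def forward(self, sdf_grid: torch.Tensor):
        # sdf_grid: (batch, 1, D, D, D)
        x = self.backbone(sdf_grid)
        quat = nn.functional.normalize(self.rotation(x), dim=1)
        return torch.sigmoid(self.quality(x)), quat, self.width(x)


if __name__ == "__main__":
    D, V, C = 20, 4, 32                                  # small grid for the demo
    # Placeholder for per-point multi-view features produced by the encoder stage.
    feats = torch.rand(D * D * D, V, C)
    sdf = ViewAggregator(C)(feats).reshape(1, 1, D, D, D)
    quality, rotation, width = GraspHead()(sdf)
    print(quality.shape, rotation.shape, width.shape)
```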

Acknowledgements

We would like to thank the authors of VGN and GraspNeRF for making their work publicly available.

Contact

If you have any questions, please feel free to contact us: