Task-oriented Sequential Grounding and Navigation in 3D Scenes

1Beijing Institute for General Artificial Intelligence (BIGAI) 2Tsinghua University 3Beijing Institute of Technology

TL;DR We propose Task-oriented Sequential Grounding and Navigation in 3D Scenes; build SG3D, a large-scale dataset with 22K tasks (112K steps) across 4,895 real-world 3D scenes; and introduce SG-LLM, a state-of-the-art approach that uses a stepwise grounding paradigm for the sequential grounding task.




Abstract

Grounding natural language in 3D environments is a critical step toward achieving robust 3D vision-language alignment. Current datasets and models for 3D visual grounding predominantly focus on identifying and localizing objects from static, object-centric descriptions. These approaches do not adequately address the dynamic and sequential nature of task-oriented scenarios. In this work, we introduce a novel task: Task-oriented Sequential Grounding and Navigation in 3D Scenes, where models must interpret step-by-step instructions for daily activities by either localizing a sequence of target objects in indoor scenes or navigating toward them within a 3D simulator. To facilitate this task, we present SG3D, a large-scale dataset comprising 22,346 tasks with 112,236 steps across 4,895 real-world 3D scenes. The dataset is constructed by combining RGB-D scans from various 3D scene datasets with an automated task generation pipeline, followed by human verification for quality assurance. We benchmark contemporary methods on SG3D, revealing the significant challenges in understanding task-oriented context across multiple steps. Furthermore, we propose SG-LLM, a state-of-the-art approach leveraging a stepwise grounding paradigm to tackle the sequential grounding task. Our findings underscore the need for further research to advance the development of more capable and context-aware embodied agents.

Dataset

SG3D contains 3D scenes curated from diverse existing datasets of real environments. We combine 3D scene graphs with GPT-4 in an automated pipeline that generates task-oriented, step-by-step instructions for each scene. After generation, we manually verify the test set to ensure data quality.
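To make the pipeline concrete, below is a minimal Python sketch of how scene-graph-driven task generation could be wired up, assuming a serializer over scene-graph nodes and a generic text-in/text-out LLM interface. The helper names and prompt wording are illustrative, not the exact ones used to build SG3D.

# Hypothetical sketch of an automated task-generation pipeline: serialize a 3D
# scene graph into text, prompt an LLM (e.g. GPT-4) for a step-by-step task,
# and hand back its answer. Names and prompt wording are illustrative only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class SceneObject:
    object_id: int
    label: str                # e.g. "refrigerator"
    attributes: list[str]     # e.g. ["white", "tall"]
    relations: list[str]      # e.g. ["next to <counter-2>"]

def serialize_scene_graph(objects: list[SceneObject]) -> str:
    """Turn scene-graph nodes into a compact textual description for the prompt."""
    lines = []
    for obj in objects:
        attrs = ", ".join(obj.attributes) or "no attributes"
        rels = "; ".join(obj.relations) or "no relations"
        lines.append(f"<{obj.label}-{obj.object_id}> ({attrs}) [{rels}]")
    return "\n".join(lines)

def build_prompt(scene_text: str) -> str:
    return (
        "You are given a 3D indoor scene described as a list of objects.\n"
        f"{scene_text}\n\n"
        "Propose one realistic daily activity that can be completed in this scene, "
        "decomposed into 3-8 steps. Each step must mention exactly one target object, "
        "referenced by its <label-id> token, and describe it by its role in the task "
        "rather than by appearance alone."
    )

def generate_task(objects: list[SceneObject], call_llm: Callable[[str], str]) -> str:
    """`call_llm` is any text-in/text-out LLM interface (e.g. a GPT-4 wrapper)."""
    return call_llm(build_prompt(serialize_scene_graph(objects)))

if __name__ == "__main__":
    demo = [SceneObject(1, "refrigerator", ["white"], ["next to <counter-2>"]),
            SceneObject(2, "counter", ["wooden"], [])]
    print(build_prompt(serialize_scene_graph(demo)))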

Examples from the Sequential Grounding Benchmark

Here we present a few examples from the SG3D dataset via a data explorer. Each task example consists of a sequence of steps, where each step requires grounding a target object in the scene.

To use the data explorer, first select one of the available scenes in the selection bar. The tasks and their corresponding steps will be displayed in the right column. Click on a step to visualize its target object with a red bounding box in the scene. All annotated objects can be inspected in the segmentation visualization. Best viewed on a monitor.
Controls: Click + Drag = Rotate; Ctrl + Drag = Translate; Scroll Up/Down = Zoom In/Out
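Concretely, each task in the benchmark can be pictured as a record like the one below. This is a hedged sketch for illustration; the field names and values are assumptions rather than the exact schema of the released files.

# Illustrative structure of one SG3D task (field names are assumed, not the
# released schema): a scene id, a high-level task description, and an ordered
# list of steps, each grounded to one target object instance in the scene.
example_task = {
    "scene_id": "scene0000_00",
    "task_description": "Prepare a cup of coffee in the kitchen.",
    "steps": [
        {"step": 1, "instruction": "Go to the coffee machine on the counter.",
         "target_object_id": 12},
        {"step": 2, "instruction": "Grab a mug from the cabinet above it.",
         "target_object_id": 37},
        {"step": 3, "instruction": "Place the mug under the dispenser.",
         "target_object_id": 12},
    ],
}

# Sequential grounding = predicting `target_object_id` for every step, in order,
# with access to the task description and all preceding steps.
for step in example_task["steps"]:
    print(step["step"], "->", step["target_object_id"])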



Examples from the Sequential Navigation Benchmark

For task-oriented sequential navigation, each task represents a navigation episode. The agent is required to sequentially navigate to the target objects in the scene. The following videos show the navigation episodes in the SG3D-Nav dataset.
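As a rough picture of how such an episode could be evaluated, the sketch below issues one step instruction at a time and checks whether the agent stops near the step's target object. The Agent/Simulator interfaces and the 1 m success radius are assumptions for illustration, not the benchmark's actual protocol.

# Minimal sketch of evaluating a sequential-navigation episode: the agent gets
# one step instruction at a time and must stop near that step's target object
# before the next step is issued. Interfaces and threshold are assumptions.
import math
from typing import Protocol

class Simulator(Protocol):
    def agent_position(self) -> tuple[float, float]: ...
    def object_position(self, object_id: int) -> tuple[float, float]: ...

class Agent(Protocol):
    def navigate(self, sim: Simulator, instruction: str) -> None: ...

def run_episode(agent: Agent, sim: Simulator, steps: list[dict],
                success_radius: float = 1.0) -> dict:
    """Return per-step success flags and whether the whole episode succeeded."""
    step_success = []
    for step in steps:
        agent.navigate(sim, step["instruction"])  # agent acts until it decides to stop
        ax, ay = sim.agent_position()
        tx, ty = sim.object_position(step["target_object_id"])
        step_success.append(math.hypot(ax - tx, ay - ty) <= success_radius)
    return {"step_success": step_success, "episode_success": all(step_success)}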

Model

We propose SG-LLM for the sequential grounding task. Benefiting from a stepwise grounding paradigm and a sequential adapter mechanism, SG-LLM outperforms the other baselines by a large margin.
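The core idea of stepwise grounding can be sketched as follows: targets are predicted one step at a time, with each query conditioned on the task description, the earlier steps, and the objects already grounded. The `grounder` callable below stands in for SG-LLM; its interface is an assumption for illustration, not the actual model code.

# Hedged sketch of the stepwise grounding paradigm: one grounding call per step,
# each conditioned on the task, previous steps, and previously grounded objects.
from typing import Callable

Grounder = Callable[[str, list[int]], int]  # (prompt, candidate object ids) -> chosen id

def ground_task_stepwise(task_description: str, steps: list[str],
                         candidate_ids: list[int], grounder: Grounder) -> list[int]:
    predictions: list[int] = []
    for i, instruction in enumerate(steps, start=1):
        history = "".join(
            f"Step {j}: {s} -> object {o}\n"
            for j, (s, o) in enumerate(zip(steps, predictions), start=1)
        )
        prompt = (f"Task: {task_description}\n{history}"
                  f"Step {i}: {instruction} -> object ?")
        predictions.append(grounder(prompt, candidate_ids))  # one grounding call per step
    return predictions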



Results on Sequential Grounding Benchmark

We evaluate the following grounding approaches on the SG3D benchmark: 3D visual grounding baselines, LLM-based methods, a 3D LLM baseline, a large vision-language model baseline, and our proposed sequential grounding model, SG-LLM.
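For reference, two natural metrics for this setting are per-step accuracy and whole-task accuracy, where a task counts as correct only if every one of its steps is grounded correctly. The snippet below sketches this idea under that assumption; it is not the benchmark's evaluation code.

# Sketch of step-level and task-level accuracy for sequential grounding
# (assumed metrics for illustration): step accuracy counts correctly grounded
# steps; task accuracy credits a task only when all of its steps are correct.
def step_accuracy(preds: list[list[int]], golds: list[list[int]]) -> float:
    correct = sum(p == g for ps, gs in zip(preds, golds) for p, g in zip(ps, gs))
    total = sum(len(gs) for gs in golds)
    return correct / total

def task_accuracy(preds: list[list[int]], golds: list[list[int]]) -> float:
    return sum(ps == gs for ps, gs in zip(preds, golds)) / len(golds)

if __name__ == "__main__":
    gold = [[3, 7, 7], [1, 2]]
    pred = [[3, 7, 5], [1, 2]]
    print(step_accuracy(pred, gold))  # 4/5 = 0.8
    print(task_accuracy(pred, gold))  # 1/2 = 0.5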



Results on Sequential Navigation Benchmark

We benchmark two approaches on the SG3D-Nav benchmark: a modular agent and an end-to-end policy.

Effect of Removing Sequential Context

Both grounding and navigation methods degrade significantly when the sequential context (the task description and the preceding steps) is removed, confirming that context across steps is essential for resolving each step's target.
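A small sketch of how the two conditions can be constructed from a task record: with context, each step query carries the task description and all preceding steps; without it, only the current instruction is shown. The prompt layout here is illustrative, not the exact format used in the ablation.

# Hedged sketch of the ablation inputs. `steps` is the ordered list of step
# instructions and `i` indexes the current step (0-based).
def query_with_context(task_description: str, steps: list[str], i: int) -> str:
    """Full query for step i: task description plus all preceding steps."""
    previous = "".join(f"Step {j + 1}: {s}\n" for j, s in enumerate(steps[:i]))
    return f"Task: {task_description}\n{previous}Current step: {steps[i]}"

def query_without_context(steps: list[str], i: int) -> str:
    """Ablated query for step i: the current instruction only."""
    return f"Current step: {steps[i]}"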

BibTeX

@article{sg3d,
  title={Task-oriented Sequential Grounding and Navigation in 3D Scenes},
  author={Zhang, Zhuofan and Zhu, Ziyu and Li, Junhao and Li, Pengxiang and Wang, Tianxu and Liu, Tengyu and Ma, Xiaojian and Chen, Yixin and Jia, Baoxiong and Huang, Siyuan and Li, Qing},
  journal={arXiv preprint arXiv:2408.04034},
  year={2024}
}