Task-oriented Sequential Grounding in 3D Scenes

TL;DR We propose a new task, Task-oriented Sequential Grounding in 3D scenes, and introduce SG3D, a large-scale dataset with 22,346 tasks and 112,236 steps in 4,895 real-world 3D scenes.



Abstract

Grounding natural language in physical 3D environments is essential for the advancement of embodied artificial intelligence. Current datasets and models for 3D visual grounding predominantly focus on identifying and localizing objects from static, object-centric descriptions. These approaches do not adequately address the dynamic and sequential nature of task-oriented grounding necessary for practical applications. In this work, we propose a new task: Task-oriented Sequential Grounding in 3D scenes, wherein an agent must follow detailed step-by-step instructions to complete daily activities by locating a sequence of target objects in indoor scenes. To facilitate this task, we introduce SG3D, a large-scale dataset containing 22,346 tasks with 112,236 steps across 4,895 real-world 3D scenes. The dataset is constructed using a combination of RGB-D scans from various 3D scene datasets and an automated task generation pipeline, followed by human verification for quality assurance. We adapted three state-of-the-art 3D visual grounding models to the sequential grounding task and evaluated their performance on SG3D. Our results reveal that while these models perform well on traditional benchmarks, they face significant challenges with task-oriented sequential grounding, underscoring the need for further research in this area.

Data

SG3D contains 3D scenes curated from diverse existing datasets of real-world environments. Leveraging 3D scene graphs and GPT-4, we introduce an automated pipeline to generate task annotations. After generation, we manually verify the test set to ensure data quality.
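Below is a minimal sketch of what the scene-graph-to-task generation step could look like, assuming the scene graph is serialized as JSON and GPT-4 is queried through the OpenAI chat API; the prompt wording and the function name `generate_tasks` are illustrative assumptions, not the released pipeline.

```python
# Sketch of automated task generation from a 3D scene graph (assumed setup).
import json
from openai import OpenAI

client = OpenAI()

def generate_tasks(scene_graph: dict, n_tasks: int = 5) -> str:
    """Ask GPT-4 to propose tasks with step-by-step, object-grounded instructions."""
    prompt = (
        "You are given a 3D scene graph listing objects, their attributes, "
        "and spatial relations:\n"
        f"{json.dumps(scene_graph, indent=2)}\n\n"
        f"Propose {n_tasks} daily activities that can be performed in this scene. "
        "For each activity, write step-by-step instructions, and for each step "
        "name exactly one target object by its object id."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```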

Model

We adapted three state-of-the-art 3D visual grounding models (3D-VisTA, PQ3D, LEO) to the sequential grounding task and evaluated their performance on SG3D. The results show they face significant challenges with task-oriented sequential grounding.
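As a rough illustration of how sequential grounding predictions can be scored, the sketch below computes a per-step accuracy and a stricter per-task accuracy (all steps of a task must be grounded correctly). The metric definitions follow the sequential setting described above, but this scoring code is an assumption, not the official evaluator.

```python
# Sketch of sequential grounding metrics: predictions and ground truth are
# lists of per-task target object ids (one id per step).
from typing import List, Dict

def sequential_grounding_metrics(pred: List[List[int]], gt: List[List[int]]) -> Dict[str, float]:
    total_steps = correct_steps = correct_tasks = 0
    for pred_task, gt_task in zip(pred, gt):
        hits = [p == g for p, g in zip(pred_task, gt_task)]
        total_steps += len(gt_task)
        correct_steps += sum(hits)
        # A task counts as correct only if every step is grounded correctly.
        correct_tasks += int(len(pred_task) == len(gt_task) and all(hits))
    return {
        "step_accuracy": correct_steps / total_steps,
        "task_accuracy": correct_tasks / len(gt),
    }
```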

Data Explorer

To use the data explorer, first select a scene from the selection bar. The tasks and their corresponding steps will be displayed in the right column. Click on a step to visualize its target object with a red bounding box in the scene. All available objects can be found in the segmentation visualization. Best viewed on monitors.
Controls: Click + Drag = Rotate; Ctrl + Drag = Translate; Scroll Up/Down = Zoom In/Out.

Impact on Embodied AI Tasks

We demonstrate the relevance of our annotations by integrating the LEO model with a navigation module in an embodied setting. Specifically, we use the GreedyGeodesicFollower class from Habitat-Sim to guide task-oriented navigation within HM3D scenes based on the grounding results (the centers of the target objects).
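A minimal sketch of this navigation loop is given below, assuming a Habitat-Sim simulator `sim` already configured for an HM3D scene and a list `grounding_targets` of predicted target-object centers (one per step). The variable names and the 0.5 m goal radius are illustrative assumptions, not the released evaluation code.

```python
# Sketch: follow the grounded target centers step by step with Habitat-Sim's
# GreedyGeodesicFollower (error handling omitted for brevity).
import habitat_sim

GOAL_RADIUS = 0.5  # assumed success threshold in meters

def navigate_to_targets(sim: habitat_sim.Simulator, grounding_targets):
    agent = sim.get_agent(0)
    follower = habitat_sim.nav.GreedyGeodesicFollower(
        sim.pathfinder, agent, goal_radius=GOAL_RADIUS
    )
    for goal in grounding_targets:
        # Snap the predicted object center onto the navigation mesh.
        goal = sim.pathfinder.snap_point(goal)
        while True:
            action = follower.next_action_along(goal)
            if action is None:  # goal reached, move on to the next step
                break
            sim.step(action)
```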