DECO: Dense Estimation of 3D Human-Scene COntact in the Wild

ICCV 2023 (Oral)

*equal technical contribution  project lead
1Max Planck Institute for Intelligent Systems, 2University of Amsterdam

Estimating vertex-level 3D contact from a single RGB image. Given an RGB image, DECO infers dense vertex-level 3D contact on the full human body. To this end, it reasons about the contacting body parts, human-object proximity, and the surrounding scene context to infer 3D contact for diverse human-object and human-scene interactions. Blue areas show the inferred contact on the body, hands, and feet for each image.

DECO Results

DECO on images from the HOT dataset
DECO on Internet images

Method Overview

DECO reasons about body parts, human-object proximity, and the surrounding scene context. To this end, it uses three branches: a scene-context branch, a part-context branch, and a per-vertex contact-classification branch. Cross-attention guides the features to attend to (and around) the body parts and scene elements that are relevant for contact.
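The interplay of the three branches can be illustrated with a toy sketch. This is not the DECO implementation: the token counts, feature size, random weights, mean pooling, and the choice of part features as queries over scene features are all assumptions made for illustration. It only shows the general pattern of scaled dot-product cross-attention followed by a per-vertex sigmoid classifier.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention on lists of feature vectors.

    queries: n_q vectors of size d; keys: n_k vectors of size d;
    values: n_k vectors. Returns (attended features, attention weights).
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    attended, weights = [], []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        w = softmax(scores)
        weights.append(w)
        attended.append([sum(wj * v[c] for wj, v in zip(w, values))
                         for c in range(len(values[0]))])
    return attended, weights

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def classify_vertices(tokens, weight_rows, biases):
    """Mean-pool the tokens, then apply one linear layer plus sigmoid,
    giving one contact probability per mesh vertex."""
    dim = len(tokens[0])
    pooled = [sum(t[c] for t in tokens) / len(tokens) for c in range(dim)]
    return [sigmoid(sum(w * p for w, p in zip(row, pooled)) + b)
            for row, b in zip(weight_rows, biases)]

def random_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
# Toy sizes (hypothetical); the SMPL mesh itself has 6890 vertices.
DIM, N_SCENE, N_PART, N_VERTS = 8, 5, 4, 12
scene_tokens = random_matrix(N_SCENE, DIM)  # stand-in for scene-context features
part_tokens = random_matrix(N_PART, DIM)    # stand-in for part-context features

# Assumption: part features query the scene features.
fused, attn = cross_attention(part_tokens, scene_tokens, scene_tokens)
probs = classify_vertices(fused, random_matrix(N_VERTS, DIM), [0.0] * N_VERTS)
contact_mask = [p > 0.5 for p in probs]  # binary per-vertex contact
```

Each row of `attn` is a distribution over scene tokens, so it can be inspected to see which scene elements a given part token weights most heavily.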

Abstract

Understanding how humans use physical contact to interact with the world is a key step toward human-centric artificial intelligence. While inferring 3D contact is crucial for modeling realistic and physically-plausible human-object interactions, existing methods either focus on 2D, consider body joints rather than the surface, use coarse 3D body regions, or do not generalize to in-the-wild images.

In contrast, we focus on inferring dense, 3D contact between the full body surface and objects in arbitrary images. To achieve this, we first collect DAMON, a new dataset containing dense vertex-level contact annotations paired with RGB images containing complex human-object and human-scene contact. Second, we train DECO, a novel 3D contact detector which uses both body-part-driven and scene-context-driven attention to estimate vertex-level contact on the SMPL body. DECO builds on the insight that human observers recognize contact by reasoning about the contacting body parts, their proximity to scene objects, and the surrounding scene context.

We perform extensive evaluations for our detector on DAMON as well as on the RICH and BEHAVE datasets. We significantly outperform existing SOTA methods across all benchmarks. We also show qualitatively that DECO generalizes well to diverse and challenging real-world human interactions in natural images.
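This page does not name the evaluation metrics. As an illustration only, one common way to score binary per-vertex predictions against ground-truth labels is precision, recall, and F1 over vertices; the function name and the toy vectors below are hypothetical.

```python
def contact_prf(pred, gt):
    """Precision, recall, and F1 for binary per-vertex contact labels.

    pred, gt: equal-length sequences of 0/1 values, one entry per vertex.
    Returns 0.0 for any metric whose denominator is zero.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of vertices")
    tp = sum(1 for p, g in zip(pred, gt) if p and g)      # predicted and true contact
    fp = sum(1 for p, g in zip(pred, gt) if p and not g)  # predicted contact, none in gt
    fn = sum(1 for p, g in zip(pred, gt) if g and not p)  # missed contact
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example with five vertices: tp = 2, fp = 1, fn = 1.
precision, recall, f1 = contact_prf([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

In this toy case all three values equal 2/3, since there are two true positives, one false positive, and one false negative.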

DAMON Dataset

DAMON (Dense Annotation of 3D HuMAn Object Contact in Natural Images) is a collection of vertex-level 3D contact labels on SMPL paired with color images of people in unconstrained environments with a wide diversity of human-scene and human-object interactions. The images are sourced from existing HOI datasets and filtered to keep only valid human-contact images, removing those with indirect human-object interactions, heavily cropped humans, motion blur, distortion, or extreme lighting conditions.

A total of 5522 images are annotated: contact vertices are painted and assigned an appropriate label from 84 object labels and 24 body-part labels. The contact region for each object class is color-coded for visualization. The annotations focus on two types of contact: (1) scene-supported contact, where humans are supported by scene objects, and (2) human-supported contact, where objects are supported by a human.
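To make the annotation format concrete, here is a minimal sketch of how per-object vertex labels for one image could be held in memory and collapsed into a binary contact vector. The dictionary layout, the function names, and the vertex indices are hypothetical and are not the DAMON file format; the only fixed fact used is that the SMPL mesh has 6890 vertices.

```python
NUM_SMPL_VERTICES = 6890  # vertex count of the SMPL body mesh

def to_binary_contact(annotation, num_vertices=NUM_SMPL_VERTICES):
    """Collapse {object label: set of vertex indices} into a 0/1 list
    with one entry per mesh vertex."""
    mask = [0] * num_vertices
    for vertex_ids in annotation.values():
        for v in vertex_ids:
            if not 0 <= v < num_vertices:
                raise ValueError(f"vertex index {v} is out of range")
            mask[v] = 1
    return mask

def objects_touching(annotation, vertex_id):
    """Return the sorted object labels whose contact region includes a vertex."""
    return sorted(name for name, ids in annotation.items() if vertex_id in ids)

# Hypothetical annotation for one image; the indices are made up.
example = {
    "chair": {10, 11, 12},
    "cell phone": {11, 2000},
}
mask = to_binary_contact(example)
labels_at_11 = objects_touching(example, 11)
```

Keeping one vertex set per object label preserves which object each region touches, while the collapsed vector has the shape a per-vertex binary classifier would be trained against.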

Images from the DAMON Dataset (object-contact and supporting-contact respectively).

Objects contacted in each example image are listed below. On the original page, font colors match the color of the contact annotation on the displayed mesh, and supporting contact is indicated in purple in the second mesh.

Image 1: Backpack
Image 2: Bowl, Book, Cell Phone, Chair, Dining Table
Image 3: Laptop, Cell Phone, Bottle, Bed
Image 4: Backpack, Book, Cup, Cell Phone, Chair, Dining Table

Acknowledgments & Disclosure

We sincerely thank Alpar Cseke for his contributions to DAMON data collection and PHOSA evaluations, Sai K. Dwivedi for facilitating PROX downstream experiments, Xianghui Xie for his generous help with CHORE evaluations, Lea Muller for her help in initiating the contact annotation tool, Chun-Hao P. Huang for RICH discussions and Yixin Chen for details about the HOT paper. We are grateful to Mengqin Xue and Zhenyu Lou for their collaboration in BEHAVE evaluations, Joachim Tesch and Nikos Athanasiou for insightful visualization advice, and Tsvetelina Alexiadis for valuable data collection guidance. Their invaluable contributions enriched this research significantly. We also thank Benjamin Pellkofer for help with the website and IT support. This work was funded by the International Max Planck Research School for Intelligent Systems (IMPRS-IS).

MJB has received research gift funds from Adobe, Intel, Nvidia, Meta/Facebook, and Amazon. MJB has financial interests in Amazon, Datagen Technologies, and Meshcapade GmbH. While MJB is a consultant for Meshcapade, his research in this project was performed solely at, and funded solely by, the Max Planck Society.

Contact

For technical questions, please contact deco@tue.mpg.de
For commercial licensing, please contact ps-licensing@tue.mpg.de

BibTeX

@InProceedings{tripathi2023deco,
    author    = {Tripathi, Shashank and Chatterjee, Agniv and Passy, Jean-Claude and Yi, Hongwei and Tzionas, Dimitrios and Black, Michael J.},
    title     = {{DECO}: Dense Estimation of {3D} Human-Scene Contact In The Wild},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {8001-8013}
}