URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images

RSS 2024

¹University of Washington, ²NVIDIA, *Equal Advising

We propose URDFormer, a scalable pipeline for generating large-scale simulation environments from real-world images. Given an image (from the internet or captured with a phone), URDFormer predicts its corresponding interactive "digital twin" in the form of a URDF. This URDF can be loaded into a simulator to train a robot for different tasks.

URDFormer Applications


(a) For the safe deployment of robots in real-world settings, we initially create a 'digital' replica of the real-world scene. Subsequently, we gather demonstrations in simulation and train a robot policy. This policy can then be safely deployed in the real world or used for further real-world fine-tuning. (b) Given that URDFormer relies solely on RGB images, it holds the potential to create extensive simulation environments from internet data, closely mirroring the real-world distribution.

Abstract

Constructing accurate and targeted simulation scenes that are both visually and physically realistic is a problem of significant practical interest in domains ranging from robotics to computer vision, since such scenes provide a realistic, targeted simulation playground for training generalizable decision-making systems. This problem has become even more relevant as researchers wielding large data-hungry learning methods seek new sources of training data for physical decision-making systems. However, building simulation models is often still done by hand: a graphic designer and a simulation engineer work with predefined assets to construct rich scenes with realistic dynamic and kinematic properties. While this may scale to small numbers of scenes, achieving the generalization required for data-driven robotic control demands a pipeline that can synthesize large numbers of realistic scenes, complete with "natural" kinematic and dynamic structure. To attack this problem, we develop models that infer structure and generate simulation scenes from natural images, allowing for scalable scene generation from web-scale datasets. To train these image-to-simulation models, we show how controllable text-to-image generative models can be used to generate paired training data, allowing us to model the inverse problem of mapping from realistic images back to complete scene models. We show how this paradigm allows us to build large datasets of scenes in simulation with semantic and physical realism. We present an integrated end-to-end pipeline that generates simulation scenes, complete with articulated kinematic and dynamic structures, from real-world images, and we use these scenes to train robotic control policies. We then robustly deploy these policies in the real world for tasks such as articulated object manipulation. In doing so, our work provides both a pipeline for large-scale generation of simulation environments and an integrated system for training robust robotic control policies in the resulting environments.



Video


URDFormer: Overview


To train a network that takes real-world images and predicts their corresponding URDFs, we propose a forward-inverse pipeline. During the forward phase, we first randomly sample a URDF (either from a dataset or procedurally generated) and load the corresponding articulated assets. However, a network trained on the initial synthetic textures is unlikely to generalize to real-world images due to the distribution shift. We therefore leverage image-conditioned Stable Diffusion (Rombach et al., 2022), pretrained on a web-scale dataset, to convert the initial synthetic images into photorealistic ones. This automatically yields realistic images paired with their corresponding URDFs for supervision. Training on these generated images, which emulate real-world complexity such as lighting and textures, enables the network to predict correct URDFs from real-world images during the inverse phase.
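Conceptually, the forward phase reduces to a simple data-generation loop, sketched below in Python. The helper callables (`sample_random_urdf`, `render_scene`, `texture_augment`, `save_pair`) are hypothetical placeholders for the procedural URDF sampler, simulator renderer, diffusion-based texture augmentation, and dataset writer, not the released implementation.

```python
from typing import Any, Callable, List, Tuple

def generate_paired_dataset(
    num_samples: int,
    sample_random_urdf: Callable[[], Any],             # procedural URDF sampler (hypothetical)
    render_scene: Callable[[Any], Tuple[Any, List]],   # simulator render -> (synthetic RGB, part boxes)
    texture_augment: Callable[[Any, List], Any],       # diffusion-based texture replacement
    save_pair: Callable[[int, Any, Any, List], None],  # dataset writer
) -> None:
    """Forward phase: build (realistic image, URDF) pairs for supervising URDFormer."""
    for i in range(num_samples):
        urdf = sample_random_urdf()                    # sample kinematic structure and parts
        synthetic_rgb, part_boxes = render_scene(urdf) # poorly textured render plus part bounding boxes
        realistic_rgb = texture_augment(synthetic_rgb, part_boxes)  # make the render photorealistic
        save_pair(i, realistic_rgb, urdf, part_boxes)  # image, URDF, and boxes supervise training
```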


Results

Convert Images into Simulation Content

URDFormer enables fast simulation content generation from RGB images alone, whether they are sourced from the internet or captured with a phone.


URDFormer can be applied to different scenes and objects. Here we visualize examples of URDFormer predictions generated from the internet images shown in the bounding boxes.

Predicted URDFs of cabinets with different configurations from internet images

(a) Predicted URDFs for cabinets in various configurations directly from internet images.

Predicted URDFs of different kitchen appliances from internet images

(b) Predicted URDFs for kitchen appliances in various configurations directly from internet images.

Predicted URDFs of kitchens with different layouts from internet images

(c) Predicted URDFs for kitchens with different layouts from internet images. Here we manually collected bounding boxes to enable better URDF prediction; a comparison between manual and detected bounding boxes is shown in Figure 9 of the paper.

Predicted URDFs of real-world kitchens captured by a phone

(d) We also visualize several examples of predicted URDFs from real-world kitchens captured using a phone.

Application

1. Real-to-Sim-to-Real for safer real-world deployment

URDFormer enables fast real-to-sim-to-real transfer without any manual effort.

Given a single image of an object, URDFormer predicts its URDF, automatically collects data, and trains a robot within a simulator. The learned policy can then be transferred to the real world for safer deployment.
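As an illustration of the real-to-sim step, the sketch below loads a predicted URDF into a physics simulator and actuates one of its joints. PyBullet is used here purely as an example simulator, and the file name `predicted_cabinet.urdf` is a hypothetical output of URDFormer.

```python
import pybullet as pb

# Load a URDF predicted by URDFormer into PyBullet (example simulator);
# "predicted_cabinet.urdf" is a hypothetical output file.
client = pb.connect(pb.DIRECT)                 # headless physics server
pb.setGravity(0, 0, -9.81)
cabinet = pb.loadURDF("predicted_cabinet.urdf", useFixedBase=True)

# Enumerate the articulated joints recovered by URDFormer.
for j in range(pb.getNumJoints(cabinet)):
    name, joint_type = pb.getJointInfo(cabinet, j)[1:3]
    print(j, name.decode(), "type:", joint_type)

# Actuate the first joint (e.g., a drawer or door) with position control.
pb.setJointMotorControl2(cabinet, 0, pb.POSITION_CONTROL, targetPosition=0.4)
for _ in range(240):
    pb.stepSimulation()
pb.disconnect(client)
```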

We can also train a visual policy on a mobile manipulator to perform multi-step tasks such as putting an object in a drawer. Here we train a U-Net-based network to predict affordances and apply a planner to execute the task.

All robot tasks

We train a multi-task, language-conditioned policy for each object and evaluate it on both tabletop manipulation with a UR5 and a mobile manipulation task with a Stretch. All videos are played at 8x speed.





2. Train and Benchmark in Simulation with real-world scenes

We also created a Gym environment suite named Reality Gym, featuring assets derived from internet images.

(a) We define four main tasks: (1) open [any articulated part], (2) close [any articulated part], (3) fetch the object, and (4) collect the object. Successful demonstrations, together with corresponding language descriptions, are automatically generated using cuRobo.

(b) We also provide augmentations that swap (1) meshes for handles, frames, drawers, etc., using the PartNet dataset (Mo et al., 2019), and (2) textures using Stable Diffusion (Rombach et al., 2022).
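The snippet below sketches what interacting with a Reality Gym task could look like through a standard Gym interface; the environment id, task name, and reset/step behavior shown here are illustrative assumptions rather than the released API.

```python
import gym  # assumes a Gym-compatible interface

# Hypothetical environment id; the released task names may differ.
env = gym.make("RealityGym/OpenDrawer-v0")

obs = env.reset()
for _ in range(200):
    action = env.action_space.sample()          # random actions as a placeholder policy
    obs, reward, done, info = env.step(action)  # classic 4-tuple Gym step
    if done:
        obs = env.reset()
env.close()
```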


Method Details

URDFormer: Forward (Generating Photorealistic Image-URDF pairs)


(a) Motivation: Given initial poorly rendered simulation images, we want to convert them into photorealistic images. However, directly using off-the-shelf Stable Diffusion models (Rombach et al., 2022) usually ignores the local configuration and produces images that do not match the original structures.


(b) Step-by-step Approach: Instead, we first collect a small set of texture images and use Stable Diffusion to generate a much larger dataset of textures. We then apply perspective warping to place a randomly selected texture onto the corresponding region of the image, and repeat this step for every part of the object. This simple approach works surprisingly well at preserving local structure while adding diverse, realistic appearance to the original image.
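A minimal sketch of the per-part texture warp is shown below, using OpenCV. It assumes `scene` is the rendered synthetic image, `texture` is a diffusion-generated texture, and `corners` holds the four image-space corners of one part (e.g., a drawer front); the actual implementation may differ.

```python
import cv2
import numpy as np

def paste_texture(scene: np.ndarray, texture: np.ndarray, corners: np.ndarray) -> np.ndarray:
    """Warp a texture onto one part region of the rendered scene (applied per part)."""
    h, w = scene.shape[:2]
    th, tw = texture.shape[:2]
    # Map the full texture onto the quadrilateral covering the part.
    src = np.float32([[0, 0], [tw, 0], [tw, th], [0, th]])
    dst = np.float32(corners)                     # 4 x 2, clockwise from top-left
    M = cv2.getPerspectiveTransform(src, dst)
    warped = cv2.warpPerspective(texture, M, (w, h))
    # Composite only inside the part region, leaving the rest of the scene untouched.
    mask = np.zeros((h, w), dtype=np.uint8)
    cv2.fillConvexPoly(mask, dst.astype(np.int32), 255)
    out = scene.copy()
    out[mask > 0] = warped[mask > 0]
    return out
```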


(c) Part-aware Generation: Using this approach, we can convert the original synthetic images of objects such as cabinets into photorealistic images that preserve the correct kinematic parts and configurations.




(d) Image-URDF Dataset: The photorealistic training set converted from the original synthetic images is used to train URDFormer.

URDFormer: Inverse (Internet Images to URDFs)


(a) Object and Part Detection: Our network takes bounding boxes of objects or parts and predicts their kinematic structures. The bounding-box labels are easily obtained in simulation during the forward step, but what about real-world images? Interestingly, the photorealistic dataset generated during the forward phase (see (c) and (d) above) can also be used to fine-tune an object detection model. We fine-tune GroundingDINO (Liu et al., 2023), an open-vocabulary object detector, on our generated dataset to improve detection of parts such as "oven door," "knob," and "drawer."
(b) Model Soup Approach for Fine-tuning: Although the fine-tuned model shows a visible improvement, with a 13% gain in F1 score on cabinet images, we also observe a decreased ability to detect unusual cases such as oddly shaped handles. This is likely because the fine-tuned model "gives up" some of the knowledge learned from large-scale real-world pretraining in order to fit the generated dataset. Inspired by the recent work on Model Soups (Wortsman et al., 2022), we combine the knowledge of the pretrained and fine-tuned models by simply averaging their weights. This leads to a surprising result: the F1 score improves by over 10% more.
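A minimal sketch of this uniform weight averaging is shown below. It assumes both checkpoints are flat state dicts with matching keys; the file names are hypothetical placeholders.

```python
import torch

# Uniform "model soup" (Wortsman et al., 2022): average the pretrained and
# fine-tuned detector weights. Checkpoint paths are hypothetical placeholders.
pretrained = torch.load("groundingdino_pretrained.pth", map_location="cpu")
finetuned = torch.load("groundingdino_finetuned.pth", map_location="cpu")

soup = {}
for key, weight in pretrained.items():
    if key in finetuned and torch.is_floating_point(weight):
        soup[key] = 0.5 * weight + 0.5 * finetuned[key]   # equal-weight average
    else:
        soup[key] = weight                                # keep non-float buffers as-is

torch.save(soup, "groundingdino_soup.pth")
```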

URDFormer: Network


Network Architecture: Using the generated dataset, we can now train a transformer-based network, URDFormer. Note that URDFormer takes both an RGB image and bounding boxes of its parts, such as handles and drawers. During training, the bounding boxes are obtained automatically from the simulator; at inference time, we use GroundingDINO fine-tuned with the model soup approach (see details above). For each image crop, URDFormer classifies the mesh type, position (discretized), scale (discretized), and parent. These URDF primitives are fed into a URDF template to create the final URDF that can be loaded into a simulator. For global scene prediction, we train two separate networks: URDFormer (Global) predicts each object's parent and the spatial information needed to place it, while URDFormer (Part) takes the cropped image of each object and predicts its detailed structure. The two predictions are combined to create the full scene prediction.
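To make the template-filling step concrete, the sketch below shows one way the predicted per-part primitives could be assembled into a URDF file. The field names (`mesh_type`, `parent`, `joint_type`, `position`, `scale`) and the fixed joint axis and limits are illustrative assumptions, not the exact template used in the paper.

```python
import xml.etree.ElementTree as ET

def build_urdf(object_name: str, parts: list) -> str:
    """Assemble a URDF string from per-part predictions (illustrative schema)."""
    robot = ET.Element("robot", name=object_name)
    ET.SubElement(robot, "link", name="base")                       # root link of the object
    for i, p in enumerate(parts):
        link_name = f"{p['mesh_type']}_{i}"                         # e.g., "drawer_0", "left_door_1"
        link = ET.SubElement(robot, "link", name=link_name)
        visual = ET.SubElement(link, "visual")
        geometry = ET.SubElement(visual, "geometry")
        ET.SubElement(geometry, "mesh",
                      filename=f"meshes/{p['mesh_type']}.obj",      # predefined part mesh
                      scale="{:.3f} {:.3f} {:.3f}".format(*p["scale"]))
        joint = ET.SubElement(robot, "joint",
                              name=f"joint_{i}", type=p["joint_type"])  # "prismatic" or "revolute"
        ET.SubElement(joint, "parent", link=p["parent"])            # predicted parent link
        ET.SubElement(joint, "child", link=link_name)
        ET.SubElement(joint, "origin",
                      xyz="{:.3f} {:.3f} {:.3f}".format(*p["position"]), rpy="0 0 0")
        ET.SubElement(joint, "axis", xyz="1 0 0")                   # fixed axis for illustration
        ET.SubElement(joint, "limit", lower="0", upper="1.57", effort="10", velocity="1")
    return ET.tostring(robot, encoding="unicode")
```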

Limitations

Please see the full list of limitations and future work in our paper.

(a) URDFormer's performance relies heavily on the quality of bounding-box detection.
(b) URDFormer uses predefined meshes that might not match the real-world scenes.
(c) URDFormer does not predict physics parameters such as mass or friction.

BibTeX

@article{chen2024urdformer,
  title={URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images},
  author={Zoey Chen and Aaron Walsman and Marius Memmel and Kaichun Mo and Alex Fang and Karthikeya Vemuri and Alan Wu and Dieter Fox and Abhishek Gupta},
  journal={arXiv preprint arXiv:2405.11656},
  year={2024}
}