We address the task of generating the 2D flow of a target object and the corresponding end-effector trajectory for robotic manipulation, guided by a pre-manipulation RGB image and a language instruction.
Flow-based methods, which predict both 2D flow and trajectories from language and initial images, offer significant advantages: they can adapt to a variety of daily tasks using only a small amount of real-robot data and can be trained on readily available web videos of object manipulation. However, multimodal flow-based methods trained on large datasets remain scarce. Furthermore, many existing approaches employ closed-loop trajectory generation, which requires more demonstration data and suffers from error accumulation. To overcome these limitations, we propose LILAC, a flow-based Language Instruction-guided open-Loop ACtion generator, a novel multimodal model capable of offline trajectory generation.
Our contributions include introducing
Experimental results demonstrate that our method generates higher-quality flow than existing approaches and achieves superior task success rates in real-robot object manipulation tasks.
We tackle the task of predicting whether a tabletop object manipulation task was performed successfully, given the instruction sentence and ego-centric images taken before and after the manipulation. We define this task as Success Prediction for Object Manipulation (SPOM).
Fig. 1: Typical samples of the SPOM task. The sentence above each sample is the given instruction. The top and bottom images depict the scene before and after the manipulation, respectively.
To tackle this task, we propose a novel approach that creates multi-level aligned representations of the images and builds a success prediction method on top of them: Contrastive \(\boldsymbol \lambda\)-Repformer.
Fig. 2: Overview of Contrastive \(\lambda\)-Repformer. Given an instruction sentence and images before and after manipulation, our model outputs the predicted probability that the robot successfully performed the manipulation.
Fig. 3: \(\boldsymbol \lambda\)-Representation—the multi-level aligned visual representation composed of three types of latent representations: features that capture visual characteristics such as colors and shapes (Scene Representation), features aligned with natural language (Aligned Representation), and features structured through natural language (Narrative Representation).
The Scene Representation \(\boldsymbol{h}_s\) is obtained by concatenating the outputs of several unimodal image encoders (e.g., ViT, Swin Transformer, DINOv2).
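As a concrete illustration, the following is a minimal sketch of how such a Scene Representation could be assembled, assuming timm-provided ViT and Swin backbones stand in for the unimodal image encoders; the specific model names and feature dimensions are illustrative, not the paper's exact configuration.

```python
# A minimal sketch of the Scene Representation h_s. The backbones and
# dimensions are assumptions; any unimodal encoders (ViT, Swin, DINOv2, ...)
# can be plugged in the same way at their native input resolutions.
import timm
import torch

encoders = [
    timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0),        # 768-d features
    timm.create_model("swin_base_patch4_window7_224", pretrained=True, num_classes=0),  # 1024-d features
]

image = torch.randn(1, 3, 224, 224)  # placeholder for a preprocessed RGB image

with torch.no_grad():
    features = [encoder.eval()(image) for encoder in encoders]

# Scene Representation: concatenation of the unimodal image features.
h_s = torch.cat(features, dim=-1)  # shape: (1, 1792)
```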
We obtain the Aligned Representation \(\boldsymbol{h}_a\) using the Aligned Representation Module, which is composed of multimodal foundation models such as CLIP, SigLIP, and BLIP.
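The sketch below shows one way to extract such aligned image features, assuming the Hugging Face transformers implementation of CLIP; features from SigLIP or BLIP would be extracted and concatenated analogously.

```python
# A minimal sketch of the Aligned Representation h_a using CLIP from
# Hugging Face transformers; the checkpoint choice is an assumption.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder for a pre- or post-manipulation image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    # Image features from CLIP's vision tower, aligned with its text embedding space.
    h_a = model.get_image_features(**inputs)  # shape: (1, 512)
```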
The Narrative Representation \(\boldsymbol{h}_n\) is obtained using the Narrative Representation Module, which contains MLLMs (e.g., InstructBLIP, Gemini, GPT-4) and text embedders (e.g., BERT, text-embedding-ada-002). We designed the text prompt to focus on the colors, sizes, and shapes of the target objects, how they are placed, and their positions within the image and relative to other objects. We extract features from the MLLM outputs using the text embedders and concatenate them to obtain \(\boldsymbol{h}_n\).
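The sketch below illustrates this pipeline with a hypothetical stub for the MLLM call and BERT as the text embedder; the prompt wording, the stub, and the model choices are illustrative, not the exact ones used in the paper.

```python
# A minimal sketch of the Narrative Representation h_n. query_mllm is a
# hypothetical stub standing in for an MLLM such as InstructBLIP or GPT-4;
# the prompt wording is illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

PROMPT = (
    "Describe the colors, sizes, and shapes of the target objects, how they are "
    "placed, their positions in the image, and their positions relative to other objects."
)

def query_mllm(image, prompt: str) -> str:
    # Placeholder: a real implementation would pass the image and prompt to an MLLM.
    return "A red block lies on the left edge of the table, next to a white bowl."

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
embedder = AutoModel.from_pretrained("bert-base-uncased")

description = query_mllm(image=None, prompt=PROMPT)
tokens = tokenizer(description, return_tensors="pt")

with torch.no_grad():
    # Use the [CLS] token embedding as a sentence-level feature of the description.
    h_n = embedder(**tokens).last_hidden_state[:, 0]  # shape: (1, 768)

# With multiple MLLM / text-embedder pairs, the resulting features are concatenated.
```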
Finally, we obtain the \(\boldsymbol \lambda\)-Representation by concatenating the three representations: \[\boldsymbol{h}_{\lambda}=\left[\boldsymbol{h}_s^\mathsf{T}, \boldsymbol{h}_a^\mathsf{T}, \boldsymbol{h}_n^\mathsf{T}\right]^\mathsf{T}.\]
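In code, this amounts to a single concatenation of the three feature vectors; the dimensions below are illustrative, carried over from the sketches above.

```python
# Assembling the λ-Representation; the feature dimensions are illustrative.
import torch

h_s = torch.randn(1, 1792)  # Scene Representation
h_a = torch.randn(1, 512)   # Aligned Representation
h_n = torch.randn(1, 768)   # Narrative Representation

# h_lambda = [h_s^T, h_a^T, h_n^T]^T, i.e., one concatenated feature vector.
h_lambda = torch.cat([h_s, h_a, h_n], dim=-1)  # shape: (1, 3072)
```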
The differences between the images do not by themselves necessarily indicate the success of the task specified by the instruction. To address this issue, we propose the Contrastive \(\lambda\)-Representation Decoder, which uses a cross-attention-based architecture to obtain the predicted probability \(P(\hat{y}=1)\) that the manipulator has successfully executed the task.
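As a rough illustration of such a decoder, the sketch below uses a standard multi-head cross-attention layer in which instruction features attend to the \(\lambda\)-Representations of the before and after images; the layer sizes, the instruction feature dimension, and the exact way the representations are combined are assumptions, not the paper's design.

```python
# A minimal sketch in the spirit of a cross-attention-based decoder that
# outputs P(y_hat = 1); sizes and wiring are assumptions.
import torch
import torch.nn as nn

class CrossAttentionDecoder(nn.Module):
    def __init__(self, dim: int = 3072, text_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.proj_text = nn.Linear(text_dim, dim)
        # Cross attention: instruction features (query) attend to the λ-Representations.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(dim, dim // 4), nn.ReLU(), nn.Linear(dim // 4, 1))

    def forward(self, h_before, h_after, h_text):
        # Treat the before/after λ-Representations as a two-token sequence.
        kv = torch.stack([h_before, h_after], dim=1)  # (B, 2, dim)
        q = self.proj_text(h_text).unsqueeze(1)       # (B, 1, dim)
        attended, _ = self.cross_attn(q, kv, kv)      # (B, 1, dim)
        logit = self.head(attended.squeeze(1))        # (B, 1)
        return torch.sigmoid(logit)                   # predicted probability of success

decoder = CrossAttentionDecoder()
p_success = decoder(torch.randn(2, 3072), torch.randn(2, 3072), torch.randn(2, 768))
```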
Fig. 4: Successful cases of the proposed method from SP-RT-1: The left and right images show the scene before and after the manipulation, respectively.
Fig. 5: Qualitative results of the proposed method in zero-shot transfer experiments. The left image depicts the scene before the manipulation, while the right image shows it afterward. Examples (i) and (ii) are true positive and true negative cases, respectively.
Table 1: Quantitative results. The best results are marked in bold.
To appear.