A Modern System Recipe for Situated Embodied Human–Robot Conversation with Real-Time Multimodal LLMs and Tool-Calling

1Massachusetts Institute of Technology
2Carnegie Mellon University
Figure 1: Overview of our real-time situated conversation system. Streaming egocentric vision and audio are processed by a real-time multimodal LM, which generates spoken dialogue and issues low-latency function calls to external tools for attention and active perception (e.g., look_at_person, look_at_object, look_around, look_for, use_vision).

Abstract

Situated embodied conversation requires robots to interleave real-time dialogue with active perception: deciding what to look at, when to look, and what to say under tight latency constraints. We present a simple, minimal system recipe that pairs a real-time multimodal language model with a small set of tool interfaces for attention and active perception. We study six home-style scenarios that require frequent attention shifts and increasing perceptual scope. Across four system variants, we evaluate turn-level tool-decision correctness against human annotations and collect subjective ratings of interaction quality. Results indicate that pairing real-time multimodal large language models with tool use for active perception is a promising direction for practical situated embodied conversation.

Example: Plant Doctor Scenario

A user asks an assistant to visually identify a houseplant brought into view and then provide practical care guidance. The same situated conversation recipe is deployed across two robot platforms (Jibo, Reachy Mini) and two real-time backends (OpenAI Realtime, Gemini Live).

Jibo — OpenAI Realtime

Jibo — Gemini Live

Reachy Mini — OpenAI Realtime

Robot Platforms


Jibo


Reachy Mini

Figure 2: The two robot platforms used in our study. The same system recipe is deployed on both by swapping only the embodiment adapter.

Tool Interface for Attention & Active Perception

We organize the system around two main components: (1) a real-time multimodal language model (e.g., OpenAI Realtime / Gemini Live) that handles streaming audio/video and conversational mechanics, and (2) a function-calling layer that mediates robot actions for attentional control and active perception. The LM decides what to say, what to look at, and when, by interleaving conversation generation with structured tool calls:

  • look_at_person — Closed-loop attentional primitive that continuously steers the robot's gaze toward the speaker using person tracking and spatial audio.
  • look_at_object — Mirrors look_at_person, replacing the person tracker with object detection + temporal tracking (SAM3).
  • look_around — Exploratory sensing primitive that performs a structured sweep of the environment and records anchored viewpoints for later retrieval.
  • look_for — Targeted retrieval-and-attend primitive: selects the most relevant previously observed viewpoint for a natural-language query using a VLM scorer, then directs the robot's gaze to that viewpoint.
  • use_vision — On-demand perception query: sends the latest camera frame to the real-time LM when visual evidence is needed for the next response.
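The five primitives above can be exposed to the real-time LM as function-calling declarations. The sketch below shows one plausible shape for those declarations; the tool names come from the paper, but the parameter fields, descriptions, and the `dispatch` helper are illustrative assumptions, not the authors' exact schemas.

```python
# Illustrative tool schemas for the five attention / active-perception
# primitives. Tool names match the paper; parameter fields and the
# dispatch helper are assumptions for illustration.
TOOLS = [
    {
        "name": "look_at_person",
        "description": "Continuously steer gaze toward the current speaker "
                       "using person tracking and spatial audio.",
        "parameters": {"type": "object", "properties": {}},
    },
    {
        "name": "look_at_object",
        "description": "Track and gaze at a named object in view.",
        "parameters": {
            "type": "object",
            "properties": {"object": {"type": "string",
                                      "description": "Object to attend to"}},
            "required": ["object"],
        },
    },
    {
        "name": "look_around",
        "description": "Sweep the environment and record anchored viewpoints.",
        "parameters": {"type": "object", "properties": {}},
    },
    {
        "name": "look_for",
        "description": "Retrieve the stored viewpoint best matching a query "
                       "and turn toward it.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "use_vision",
        "description": "Send the latest camera frame to the LM when visual "
                       "evidence is needed for the next response.",
        "parameters": {"type": "object", "properties": {}},
    },
]

def dispatch(call_name: str, args: dict, robot) -> dict:
    """Route a model-issued tool call to the embodiment adapter
    (hypothetical adapter API: one method per tool name)."""
    handler = getattr(robot, call_name)
    return handler(**args)
```

Keeping the schemas backend-neutral is what allows the same tool set to be registered with either OpenAI Realtime or Gemini Live, with only the embodiment adapter behind `dispatch` changing per robot.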
Figure 3: Lightweight "memory map" for off-camera grounding. look_around captures anchored viewpoints during a sweep; look_for retrieves the most relevant viewpoint for a text query using a VLM scorer.
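The memory map in Figure 3 can be sketched as a small data structure: `look_around` appends anchored viewpoints, and `look_for` scores them against a text query. This is a minimal sketch under assumptions; the `score` callable stands in for the paper's VLM relevance scorer, and the field layout is illustrative.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Viewpoint:
    """One anchored view recorded during a look_around sweep."""
    pan: float            # head pan angle (degrees) at capture time
    tilt: float           # head tilt angle (degrees)
    frame: bytes          # camera frame captured at this viewpoint
    caption: str = ""     # optional description of the view

@dataclass
class MemoryMap:
    """Lightweight viewpoint store: look_around records, look_for retrieves."""
    views: list[Viewpoint] = field(default_factory=list)

    def record(self, vp: Viewpoint) -> None:
        # Called once per anchored viewpoint during a sweep.
        self.views.append(vp)

    def look_for(self, query: str, score) -> Viewpoint | None:
        # `score(query, frame) -> float` stands in for the VLM scorer;
        # higher means more relevant to the natural-language query.
        if not self.views:
            return None
        return max(self.views, key=lambda v: score(query, v.frame))
```

The retrieved viewpoint's stored pan/tilt gives the gaze target, so the robot can re-attend to an off-camera object without a fresh sweep.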

Interaction Scenarios

We designed six home-style interaction scenarios that progressively increase the demands on visual grounding and perceptual scope, from person-only understanding to room-scale multi-object reference with scene changes.

Task | Goal | Duration | Vision | Person | Object(s) | Map
Posture Coach | Identify posture issue; coach and verify improvement. | 01:39 | Y | Y (1) | N | N
Whiteboard | Solve a trig problem using a whiteboard diagram. | 02:12 | Y | Y (1) | Y (1) | N
Lamp Placement | Choose a lamp location using a room sweep. | 02:05 | Y | Y (1) | Y (1) | Y (1)
Plant Doctor | Find a plant and diagnose its health. | 02:12 | Y | Y (1) | Y (1) | Y (1)
Outfit Check | Find outfits in the room; compare options on-body. | 02:21 | Y | Y (1) | Y (2) | Y (1)
Pack / Find Objects | Locate items via a scan; re-scan after changes. | 02:03 | Y | Y (1) | Y (3) | Y (2)

Table 1: Task descriptions and visual requirements. Person, Object(s), and Map specify which visual entities and environment sweeps are required. Counts in parentheses give the number of distinct targets or map updates.

System Variants

We evaluate four system variants to isolate the contribution of key system elements:

  • OpenAI Realtime — Full system with person, object, and environment sweep tools enabled.
  • OpenAI Realtime w/o Object — Disables object tracking and search (look_for, look_at_object).
  • OpenAI Realtime w/o Person — Disables person tracking (look_at_person).
  • Gemini Live — Alternative real-time multimodal backend with all tools enabled (tests backend swapability).
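The ablated variants differ only in which tools are registered with the backend, so each condition can be expressed by filtering the full tool set. A minimal sketch, assuming tool registration takes a list of names (the `VARIANTS` mapping mirrors the conditions above; the helper itself is hypothetical):

```python
# Hypothetical helper: build the tool list for each ablation condition
# by filtering the full tool set. Tool names match the paper.
ALL_TOOLS = ["look_at_person", "look_at_object", "look_around",
             "look_for", "use_vision"]

VARIANTS = {
    "full":      set(),                           # all tools enabled
    "wo_object": {"look_at_object", "look_for"},  # object tracking/search off
    "wo_person": {"look_at_person"},              # person tracking off
}

def tools_for(variant: str) -> list[str]:
    """Return the tool names enabled under a given ablation condition."""
    disabled = VARIANTS[variant]
    return [t for t in ALL_TOOLS if t not in disabled]
```

Because the Gemini Live condition uses the same full tool set, swapping backends requires no change to this configuration, only to the client that registers it.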

Human Annotation Protocol

Figure 5: Human annotation protocol. For each turn exchange, annotators label what the robot should do next: look at person, look at object, look for an object, or use vision. These annotations serve as ground truth for evaluating system correctness.

Results: System Correctness

We compare the model's tool-call decisions against human "should-do-next" annotations. Both full systems achieve similar turn-level tool-decision correctness: OpenAI Realtime attains 0.72 macro-averaged accuracy with 0.88 precision and 0.62 recall, while Gemini Live reaches 0.77 accuracy with 0.84 precision and 0.60 recall. Ablated variants show expected degradation.

Condition | Overall (Acc. / P / R) | look_at_person (Acc. / P / R) | look_at_object (Acc. / P / R) | look_for (Acc. / P / R) | use_vision (Acc. / P / R)
OpenAI Realtime | 0.72 / 0.88 / 0.62 | 0.63 / 0.52 / 1.00 | 0.70 / 1.00 / 0.19 | 0.93 / 1.00 / 0.70 | 0.64 / 1.00 / 0.60
OpenAI w/o Object | 0.28 / 0.50 / 0.26 | 0.42 / 1.00 / 0.42 | — | — | 0.70 / 1.00 / 0.64
OpenAI w/o Person | 0.48 / 0.54 / 0.35 | — | 0.69 / 0.62 / 0.40 | 0.71 / 0.53 / 0.53 | 0.51 / 1.00 / 0.48
Gemini Live | 0.77 / 0.84 / 0.60 | 0.76 / 0.68 / 1.00 | 0.76 / 0.67 / 0.17 | 0.91 / 1.00 / 0.60 | 0.65 / 1.00 / 0.62

Table 2: System Correctness Measures: LM tool-call predictions compared to human "should-do-next" annotations. Macro-averaged accuracy (Acc.), precision (P), and recall (R) for each system condition, overall and per tool category.
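The macro-averaged metrics in Table 2 can be computed by treating each turn as a binary should-call vs. should-not-call decision per tool, then averaging across tool categories. A sketch under that assumption (the exact evaluation code is the authors'; this illustrates the standard computation):

```python
def per_tool_metrics(gold: list[str], pred: list[str], tool: str) -> dict:
    """Binary accuracy/precision/recall for one tool category, comparing
    predicted tool calls against human 'should-do-next' labels per turn."""
    tp = sum(1 for g, p in zip(gold, pred) if g == tool and p == tool)
    fp = sum(1 for g, p in zip(gold, pred) if g != tool and p == tool)
    fn = sum(1 for g, p in zip(gold, pred) if g == tool and p != tool)
    tn = len(gold) - tp - fp - fn
    return {
        "acc": (tp + tn) / len(gold),
        "p": tp / (tp + fp) if tp + fp else 0.0,
        "r": tp / (tp + fn) if tp + fn else 0.0,
    }

def macro_average(gold: list[str], pred: list[str], tools: list[str]) -> dict:
    """Unweighted mean of per-tool metrics (macro average)."""
    per = [per_tool_metrics(gold, pred, t) for t in tools]
    return {k: sum(m[k] for m in per) / len(per) for k in ("acc", "p", "r")}
```

The high-precision / moderate-recall pattern in Table 2 falls out directly under this framing: false positives (wrong tool invoked) are rare, while false negatives (a needed look never issued) dominate the errors.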

Results: Overall Interaction Quality

Subjective evaluations show that conversational quality remains consistently high across all system variants (avg. 4.65/5), while categories tied to embodiment—attention to person and reference to object—are sensitive to the ablations. Full systems (OpenAI and Gemini) achieve the highest fluency and smoothness scores.

Category | Gemini Live | OpenAI Realtime | w/o Person | w/o Objects
Fluent and Smooth | 4.42 ± 0.85 | 4.62 ± 0.43 | 3.96 ± 0.96 | 3.79 ± 0.92
Convo. Quality | 4.79 ± 0.33 | 4.79 ± 0.26 | 4.50 ± 0.60 | 4.50 ± 0.64
Attention to Person | 4.29 ± 1.18 | 4.75 ± 0.34 | 4.12 ± 1.03 | 4.67 ± 0.39
Reference to Object | 4.61 ± 0.55 | 4.83 ± 0.33 | 4.08 ± 1.09 | 3.47 ± 0.96
Response Timing & Turn Taking | 3.89 ± 1.35 | 4.83 ± 0.17 | 4.92 ± 0.15 | 4.72 ± 0.34
Interruption & Pauses | 4.94 ± 0.19 | 5.00 ± 0.00 | 4.97 ± 0.10 | 4.97 ± 0.10

Table 3: Overall Interaction Quality Results (mean ± std on a 5-point Likert scale).

Latency & Cost Comparison

Metric | OpenAI Realtime | Gemini Live
Cost ($ / interaction) | $0.047 ± 0.013 | $0.018 ± 0.002
Latency (ms) | 690 ± 389 | 739 ± 593

Table 4: Both backends exhibit similar end-to-end latency, but Gemini Live is substantially cheaper per interaction.

Key Findings

  • Tool-mediated attention drives situational awareness: Full systems achieve similar tool-decision correctness across backends. Subjective judgments remain near-ceiling for conversational coherence, while categories tied to embodiment (attention-to-person, reference-to-object) drop under ablations that remove the corresponding tools.
  • Precision is easy; recall is the bottleneck: Across backends, tools are invoked correctly when invoked (high precision), yet are under-invoked (moderate/low recall). Not looking is itself an error because it breaks common ground and forces users into repair behaviors.
  • Prompt and interface design matter: The look_for and look_at_object tools are linguistically close, causing confusion. Future systems should make tool boundaries mutually exclusive where possible.
  • Backend-agnostic recipe: Swapping the real-time backend preserves overall tool-decision quality with modest differences in variance, precision/recall trade-offs, and cost.
  • Cross-platform portability: The same recipe was deployed on both Jibo and Reachy Mini by reusing the identical dialogue manager and tool schemas and swapping only the embodiment adapter.

All Ablation Videos with Jibo

All 24 scenario × system variant combinations. Columns: OpenAI Realtime, Gemini Live, w/o Object, w/o Person.

Posture Coach

OpenAI Realtime

Gemini Live

w/o Object

w/o Person

Whiteboard Tutoring

OpenAI Realtime

Gemini Live

w/o Object

w/o Person

Lamp Placement

OpenAI Realtime

Gemini Live

w/o Object

w/o Person

Plant Doctor

OpenAI Realtime

Gemini Live

w/o Object

w/o Person

Outfit Check

OpenAI Realtime

Gemini Live

w/o Object

w/o Person

Pack / Find Objects

OpenAI Realtime

Gemini Live

w/o Object

w/o Person

Citation

@article{lee2026modern,
      title={A Modern System Recipe for Situated Embodied Human-Robot Conversation with Real-Time Multimodal LLMs and Tool-Calling},
      author={Lee, Dong Won and Gillet, Sarah and Morency, Louis-Philippe and Breazeal, Cynthia and Park, Hae Won},
      journal={arXiv preprint arXiv:2602.04157},
      year={2026}}