Our work aims to advance the socially intelligent behavior of embodied artificial intelligence (AI) agents in real-world social interactions. Recently, foundation models have been employed as automatic evaluators of social interactions. To enable further research in this direction, we introduce a large-scale, real-world Human Robot Social Interaction (HSRI) Dataset to benchmark the capabilities of language models (LMs) and multimodal foundation models to identify and reason about social errors and competencies in real-world human-robot interactions. Our dataset consists of 440 real-world human-social robot interaction videos, accompanied by over 10K annotations detailing the robot's social errors, competencies, rationales, and corrective actions. These annotations capture aspects of human-AI interaction that arise only in real-world settings, including error cases in which embodied AI agents lack human-like social behaviors such as conversational dynamics and emotion recognition, behaviors that are often overlooked and difficult to formulate as machine learning tasks without real-world data. To further assess whether recent AI models can identify and reason about such real-world social interactions, we propose eight new benchmark tasks evaluating a model's ability to reason about social interactions, including detecting errors and competencies, providing rationales, and inferring corrective actions. Human studies and experiments with modern large language models (LLMs) and vision-language models (VLMs) reveal that current models struggle with these tasks.
@misc{lee2025hsri,
title={Human Robot Social Interaction (HSRI) Dataset \& Benchmark},
author={Dong Won Lee and Yubin Kim and Parker Malachowsky and Sooyeon Jeong and Denison Guvenoz and Louis-Philippe Morency and Cynthia Breazeal and Hae Won Park},
year={2025},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.CL}
}