Our work aims to advance the socially intelligent behavior of embodied artificial intelligence (AI) agents in real-world social interactions. Recently, foundation models have been employed as automatic evaluators of social interactions. To enable further research in this direction, we introduce a large-scale, real-world Human Robot Social Interaction (HSRI) Dataset to benchmark the capabilities of language models (LMs) and multimodal foundation models to identify and reason about social errors and competencies in real-world human-robot interactions. Our dataset consists of 440 real-world human-social robot interaction videos, accompanied by over 10K annotations detailing the robot's social errors, competencies, rationales, and corrective actions. These annotations capture aspects of human-AI interaction that arise only in real-world settings, including error cases in which embodied AI agents lack human-like social behaviors such as conversational dynamics and emotion recognition, behaviors that are often overlooked and difficult to formulate as machine learning tasks without real-world data. To further assess whether recent AI models can identify and reason about such real-world social interactions, we propose eight new benchmark tasks evaluating a model's ability to reason about social interactions, including detecting errors and competencies, providing rationales, and inferring corrective actions. Human studies and experiments with modern large language models (LLMs) and vision-language models (VLMs) reveal that current models struggle with these tasks.
@misc{lee2025hsri,
title={Human Robot Social Interaction (HSRI) Dataset \& Benchmark},
author={Dong Won Lee and Yubin Kim and Parker Malachowsky and Sooyeon Jeong and Denison Guvenoz and Louis-Philippe Morency and Cynthia Breazeal and Hae Won Park},
year={2025},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.CL}
}