{"id":4341,"date":"2026-03-27T13:08:44","date_gmt":"2026-03-27T13:08:44","guid":{"rendered":"https:\/\/www.justinecassell.com\/articulab\/?page_id=4341"},"modified":"2026-06-23T08:34:32","modified_gmt":"2026-06-23T08:34:32","slug":"son-of-sara","status":"publish","type":"page","link":"https:\/\/www.justinecassell.com\/articulab\/projects\/son-of-sara\/","title":{"rendered":"Son of Sara"},"content":{"rendered":"\n<div class=\"wp-block-group has-background has-x-large-font-size is-vertical is-content-justification-center is-nowrap is-layout-flex wp-container-core-group-is-layout-73832be3 wp-block-group-is-layout-flex\" style=\"background:linear-gradient(135deg,rgb(201,25,30) 0%,rgb(166,15,121) 50%,rgb(39,52,139) 100%)\">\n<p><\/p>\n\n\n\n<p><\/p>\n\n\n\n<h1 class=\"wp-block-heading has-text-align-center has-white-color has-text-color has-link-color has-large-font-size wp-elements-88e38eca2d9e97e10340f07b4710ddfa\" style=\"font-style:normal;font-weight:600;letter-spacing:1px;text-transform:none\">Son of Sara : Developing a new LLM-based<\/h1>\n\n\n\n<p><\/p>\n\n\n\n<h1 class=\"wp-block-heading has-text-align-center has-white-color has-text-color has-link-color has-large-font-size wp-elements-d4e148416578a33b13273f77f47da890\" style=\"font-style:normal;font-weight:600;letter-spacing:1px;text-transform:none\">Embodied Conversational Agent.<\/h1>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n<\/div>\n\n\n\n<div style=\"height:100px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Overview<\/h2>\n\n\n\n<p>As part of our ongoing work on socially capable conversational agents, the team launched what we are calling Son of SARA (see the SARA project on <a href=\"https:\/\/www.justinecassell.com\/articulab\/projects\/sara\/\" data-type=\"link\" data-id=\"https:\/\/www.justinecassell.com\/articulab\/projects\/sara\/\">this page<\/a>), an embodied conversational agent designed to support natural and effective interaction with human users, relying 
on both rapport and task effectiveness to ensure good collaboration between human and agent. As in our earlier work, the project aims to equip dialogue agents with the verbal and non-verbal interactional skills that are essential for collaborative communication. Building on our lab\u2019s tradition of building socially aware ECAs, we aim to improve their naturalness and adaptability, among other key attributes, using the robust generalization capabilities of LLMs and recent deep learning models.<\/p>\n\n\n\n<p>Our current system follows the team&#8217;s traditional approach: a modular, cascaded pipeline with a specialized module for each conversational phenomenon. This processing and generation pipeline takes the user\u2019s voice as input and produces a suitable agent response. Each module of the pipeline can process or generate a different combination of modalities (speech, gesture, head and face movements, text, audio). To ensure that the complexity of the modular architecture and its deep learning models does not prevent the agent from interacting with the user in real time, we chose an incremental design, following the approach of several dialogue systems and conversational agents developed by colleagues, such as <a href=\"https:\/\/www.zotero.org\/google-docs\/?YgRYvW\">(Schlangen &amp; Skantze, 2011)<\/a>.&nbsp;<\/p>\n\n\n\n<div style=\"height:100px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">First version : Focusing on speech<\/h2>\n\n\n\n<p>We initially focused on building a conversational agent without its embodied part. We successfully created an agent that people can interact with in real time, with a fairly flexible approach to the interaction scenario (free discussion, Q&amp;A, collaborative task, etc.). 
That first version of the agent, which we might call the \u201cskeleton\u201d of the pipeline, contains all the models required for a spoken conversation: it captures and transcribes the user\u2019s voice with the Microphone and Automatic Speech Recognition (ASR) modules, then generates a suitable text response with the LLM and speaks it with Text-to-Speech (TTS). Dialogue dynamics are handled by the Voice Activity Detection (VAD) and Dialogue Manager modules. To test that first version qualitatively, we mimicked the collaborative task used in our Collab team\u2019s longitudinal study, which consists of discussing an image and coming up with hypotheses about its different elements (you&#8217;ll soon be able to find more information on the corresponding Collab study page).<\/p>\n\n\n\n<div style=\"height:100px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading has-text-align-left\">Non-Verbal Behavior<\/h2>\n\n\n\n<p>Continuing from this first version, we extended the agent beyond spoken interaction by adding a visual and embodied dimension to the conversation. Building on the existing conversational framework for spoken dialogue, the team developed the agent\u2019s virtual body and implemented the software infrastructure required to synchronize non-verbal behaviors with speech. 
This included the design and implementation of preliminary rule-based models for the generation of facial expressions and gestures, allowing the agent to produce visible communicative signals aligned with its speech turns.&nbsp;<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1600\" height=\"768\" src=\"https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/Articulab-Son-of-Sara-Unity-Articulab-Son-of-Sara-Player-Windows-Mac-Linux-Unity-2022.3.62f2-_DX11_-09_02_2026-16_20_38-1.png\" alt=\"\" class=\"wp-image-4821\" srcset=\"https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/Articulab-Son-of-Sara-Unity-Articulab-Son-of-Sara-Player-Windows-Mac-Linux-Unity-2022.3.62f2-_DX11_-09_02_2026-16_20_38-1.png 1600w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/Articulab-Son-of-Sara-Unity-Articulab-Son-of-Sara-Player-Windows-Mac-Linux-Unity-2022.3.62f2-_DX11_-09_02_2026-16_20_38-1-600x288.png 600w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/Articulab-Son-of-Sara-Unity-Articulab-Son-of-Sara-Player-Windows-Mac-Linux-Unity-2022.3.62f2-_DX11_-09_02_2026-16_20_38-1-900x432.png 900w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/Articulab-Son-of-Sara-Unity-Articulab-Son-of-Sara-Player-Windows-Mac-Linux-Unity-2022.3.62f2-_DX11_-09_02_2026-16_20_38-1-768x369.png 768w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/Articulab-Son-of-Sara-Unity-Articulab-Son-of-Sara-Player-Windows-Mac-Linux-Unity-2022.3.62f2-_DX11_-09_02_2026-16_20_38-1-1536x737.png 1536w\" sizes=\"auto, (max-width: 1600px) 100vw, 1600px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1600\" height=\"769\" 
src=\"https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/Articulab-Son-of-Sara-Unity-Articulab-Son-of-Sara-Player-Windows-Mac-Linux-Unity-2022.3.62f2-_DX11_-09_02_2026-16_20_01-23.png\" alt=\"\" class=\"wp-image-4812\" srcset=\"https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/Articulab-Son-of-Sara-Unity-Articulab-Son-of-Sara-Player-Windows-Mac-Linux-Unity-2022.3.62f2-_DX11_-09_02_2026-16_20_01-23.png 1600w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/Articulab-Son-of-Sara-Unity-Articulab-Son-of-Sara-Player-Windows-Mac-Linux-Unity-2022.3.62f2-_DX11_-09_02_2026-16_20_01-23-600x288.png 600w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/Articulab-Son-of-Sara-Unity-Articulab-Son-of-Sara-Player-Windows-Mac-Linux-Unity-2022.3.62f2-_DX11_-09_02_2026-16_20_01-23-900x433.png 900w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/Articulab-Son-of-Sara-Unity-Articulab-Son-of-Sara-Player-Windows-Mac-Linux-Unity-2022.3.62f2-_DX11_-09_02_2026-16_20_01-23-768x369.png 768w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/Articulab-Son-of-Sara-Unity-Articulab-Son-of-Sara-Player-Windows-Mac-Linux-Unity-2022.3.62f2-_DX11_-09_02_2026-16_20_01-23-1536x738.png 1536w\" sizes=\"auto, (max-width: 1600px) 100vw, 1600px\" \/><\/figure>\n\n\n\n<h5 class=\"wp-block-heading has-text-align-center\">Figure 1 : Son of Sara\u2019s 20-year-old virtual body during a conversation<\/h5>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>The gesture generation system aims to produce plausible, semantically relevant co-speech gestures in real time during agent-user interaction. Initially, the team explored a retrieval-based approach, selecting gestures from a pre-existing gesture library based on verbal and syntactic cues extracted from the agent&#8217;s speech. 
This system used transformer-based encoders (BGE\/MiniLM) to embed utterances, then ranked candidate gestures via consensus re-ranking across multiple semantic views. While functional and fast, this method lacks flexibility and often produces generic, scripted gestures that are only vaguely appropriate rather than movements finely tailored to the context.<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<div data-wp-interactive=\"core\/file\" class=\"wp-block-file aligncenter\"><object data-wp-bind--hidden=\"!state.hasPdfPreview\" hidden class=\"wp-block-file__embed\" data=\"https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/Articulabo-GENEA2025-PosterDraft-YMACHTA-ARTICULAB-1.pdf\" type=\"application\/pdf\" style=\"width:100%;height:600px\" aria-label=\"Embed of Articulabo- GENEA2025-PosterDraft-YMACHTA-ARTICULAB.\"><\/object><a id=\"wp-block-file--media-e1f08759-3487-447b-81f0-eb1f8feaf655\" href=\"https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/Articulabo-GENEA2025-PosterDraft-YMACHTA-ARTICULAB-1.pdf\">Articulabo- GENEA2025-PosterDraft-YMACHTA-ARTICULAB<\/a><\/div>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Current development focuses on transitioning toward a generative model capable of producing diverse gesture types: deictic (pointing), iconic (depicting concrete concepts), and metaphoric (representing abstract ideas), rather than relying predominantly on beat gestures (rhythmic movements aligned with speech prosody), a limitation of most current state-of-the-art models. This shift prioritizes both synchronization and semantic appropriateness of non-verbal behaviors.<\/p>\n\n\n\n<p>The generative approach can draw on two data sources, each with its own trade-offs. 
Motion capture (e.g., BEAT\/BEAT2 datasets in SMPL-X format <a href=\"https:\/\/www.zotero.org\/google-docs\/?Hg3yuj\">(Liu et al., 2024)<\/a>) provides high-fidelity ground truth but is actor-biased and limited in variety. 3D motion reconstructed from monocular video via pose estimators (HMR, HaMeR, haptic) <a href=\"https:\/\/www.zotero.org\/google-docs\/?8v4hUp\">(Agrawal et al., n.d.<\/a>, <a href=\"https:\/\/www.zotero.org\/google-docs\/?5u8iIR\">Pavlakos et al., 2023<\/a>, <a href=\"https:\/\/www.zotero.org\/google-docs\/?On0ojx\">Shen et al., 2024)<\/a> is cheaper and more scalable but introduces noise (temporal jitter, depth ambiguity, and occlusion artifacts) that requires post-processing to recover usable motion.<\/p>\n\n\n\n<p>With data in hand, the goal is to evaluate leading 3D motion generation frameworks: <strong>diffusion models<\/strong> (DiffSHEG <a href=\"https:\/\/www.zotero.org\/google-docs\/?qsHa2c\">(Chen et al., 2024)<\/a>, STARGATE <a href=\"https:\/\/www.zotero.org\/google-docs\/?aKCsWV\">(Abel et al., 2024)<\/a>) and <strong>token-based autoregressive models<\/strong> (VQ-VAE discretization followed by GPT-style generation <a href=\"https:\/\/www.zotero.org\/google-docs\/?7vsxO1\">(Zhang et al., 2024)<\/a>), among others. This exploration lays the groundwork for future extensions to co-listening behaviors.<\/p>\n\n\n\n<div style=\"height:100px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Turn Taking<\/h2>\n\n\n\n<p>In parallel, the project advanced the agent\u2019s conversational dynamics by improving its turn-taking capabilities. A predictive model based on voice activity was integrated to enable the agent to anticipate turn transitions during interaction. 
This contributes to more interactive and rhythmically natural exchanges, bringing agent\u2013human conversations closer to the temporal structure of human\u2013human dialogue.&nbsp;<\/p>\n\n\n\n<p>In human conversation, listeners naturally anticipate upcoming turn transitions while simultaneously planning their responses. This dual cognitive process, monitoring for turn-yielding cues while preparing a follow-up (see Fig. 2), underlies the fluidity of human-human interaction. To replicate this dynamic, we implemented a system combining turn-shift prediction with early response generation.<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1600\" height=\"1157\" src=\"https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-2.png\" alt=\"\" class=\"wp-image-4691\" style=\"width:720px\" srcset=\"https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-2.png 1600w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-2-553x400.png 553w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-2-830x600.png 830w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-2-768x555.png 768w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-2-1536x1111.png 1536w\" sizes=\"auto, (max-width: 1600px) 100vw, 1600px\" \/><\/figure>\n<\/div>\n\n\n<h5 class=\"wp-block-heading has-text-align-center\">Figure 2 : Projection of interlocutor\u2019s turn to predict end of turn <a href=\"https:\/\/www.zotero.org\/google-docs\/?prvdxK\">(Levinson, 2016)<\/a><\/h5>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Our approach builds primarily on the work of <a 
href=\"https:\/\/www.zotero.org\/google-docs\/?pVijIc\">(Ekstedt &amp; Skantze, 2022)<\/a> and their Voice Activity Projection (VAP) model for predictive turn-taking. VAP monitors ongoing conversations in real-time and predicts future voice activity patterns for both speakers within a 2-second window (see Fig. 3). By applying threshold-based decision rules to these predictions, we developed a set of behaviours that enable the agent to determine appropriate moments to initiate speech. This combination of predictive modeling and generation rules significantly reduced response latency and enhanced the conversational rhythm.&nbsp;<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1200\" height=\"400\" src=\"https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-4.png\" alt=\"\" class=\"wp-image-4717\" style=\"width:1000px\" srcset=\"https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-4.png 1200w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-4-600x200.png 600w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-4-900x300.png 900w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-4-768x256.png 768w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><\/figure>\n<\/div>\n\n\n<h5 class=\"wp-block-heading has-text-align-center\">Figure 3 : VAP turn taking prediction model real time execution <a href=\"https:\/\/www.zotero.org\/google-docs\/?cSo1FV\">(Ekstedt &amp; Skantze, 2022)<\/a><\/h5>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>Another addition to our pipeline involves initiating response generation before the user&#8217;s turn concludes <a 
href=\"https:\/\/www.zotero.org\/google-docs\/?rpkkR4\">(Skantze &amp; Irfan, 2025)<\/a>.&nbsp; When VAP predicts an imminent turn yield, the system begins generating a response while continuing to process incoming speech (see Fig 4). If the user introduces information that substantially alters the meaning of their turn, the initial generation is aborted and restarted using the complete transcription from the automatic speech recognition (ASR) system.&nbsp;<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1402\" height=\"262\" src=\"https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-3.png\" alt=\"\" class=\"wp-image-4716\" style=\"width:1000px\" title=\"Capture d\u2019\u00e9cran du 2026-01-19 17-40-49.png\" srcset=\"https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-3.png 1402w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-3-600x112.png 600w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-3-900x168.png 900w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-3-768x144.png 768w\" sizes=\"auto, (max-width: 1402px) 100vw, 1402px\" \/><\/figure>\n<\/div>\n\n\n<h5 class=\"wp-block-heading has-text-align-center\">Figure 4 : Example of a VAP-monitored SDS real time execution <a href=\"https:\/\/www.zotero.org\/google-docs\/?NKFxX2\">(Skantze &amp; Irfan, 2025)<\/a><\/h5>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p>While developing the turn-taking module, our team collaborated with Prof. 
Koji Inoue from Kyoto University, whose research focuses on enhancing naturalness in conversational AI <a href=\"https:\/\/www.zotero.org\/google-docs\/?FAuXar\">(Inoue et al., 2024a<\/a>, <a href=\"https:\/\/www.zotero.org\/google-docs\/?pfTciJ\">Inoue et al., 2024b)<\/a>. Along with his team, we are developing one of the first predictive turn-taking models for French dyadic interactions.&nbsp;<\/p>\n\n\n\n<div style=\"height:100px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Current Architecture<\/h2>\n\n\n\n<p>Together, these developments mark a significant step toward a fully embodied Son of SARA conversational agent. They provide a solid foundation for future work on data-driven, real-time gesture generation conditioned on vocal features, as well as on further refinements of turn-taking models, including adaptation to additional languages. The current version of the dialogue system is shown in the following diagram:&nbsp;<\/p>\n\n\n\n<div style=\"height:24px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1600\" height=\"900\" src=\"https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image.png\" alt=\"\" class=\"wp-image-4688\" style=\"width:1000px\" srcset=\"https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image.png 1600w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-600x338.png 600w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-900x506.png 900w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-768x432.png 768w, https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2026\/03\/image-1536x864.png 1536w\" sizes=\"auto, (max-width: 1600px) 100vw, 1600px\" 
\/><\/figure>\n<\/div>\n\n\n<h5 class=\"wp-block-heading has-text-align-center\">Figure 5 : Son of Sara\u2019s current dialogue system\u2019s modular architecture<\/h5>\n\n\n\n<div style=\"height:100px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">References<\/h2>\n\n\n\n<p>Abel, L., Colotte, V., &amp; Ouni, S. (2024). Towards interpretable co-speech gestures synthesis using STARGATE. <em>Companion Proceedings of the 26th International Conference on Multimodal Interaction<\/em>, 138\u2011146. https:\/\/doi.org\/10.1145\/3686215.3688819&nbsp;<\/p>\n\n\n\n<p>Agrawal, V., Akinyemi, A., Alvero, K., Behrooz, M., Buffalini, J., Carlucci, F. M., Chen, J., Chen, Z., Cheng, S., Chowdary, P., Chuang, J., D\u2019Avirro, A., Daly, J., Dong, N., Duppenthaler, M., Gao, C., Girard, J., Gleize, M., Gomez, S., \u2026 Zollhoefer, M. (n.d.). <em>Seamless Interaction\u202f: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset<\/em>.&nbsp;<\/p>\n\n\n\n<p>Chen, J., Liu, Y., Wang, J., Zeng, A., Li, Y., &amp; Chen, Q. (2024). <em>DiffSHEG\u202f: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation<\/em> (arXiv:2401.04747). arXiv. https:\/\/doi.org\/10.48550\/arXiv.2401.04747&nbsp;<\/p>\n\n\n\n<p>Ekstedt, E., &amp; Skantze, G. (2022). Voice Activity Projection\u202f: Self-supervised Learning of Turn-taking Events. <em>Interspeech 2022<\/em>, 5190\u20115194. https:\/\/doi.org\/10.21437\/Interspeech.2022-10955&nbsp;<\/p>\n\n\n\n<p>Inoue, K., Jiang, B., Ekstedt, E., Kawahara, T., &amp; Skantze, G. (2024a). Multilingual Turn-taking Prediction Using Voice Activity Projection. In N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, &amp; N. Xue (Eds.), <em>Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)<\/em> (pp. 11873\u201111883). ELRA and ICCL. 
https:\/\/aclanthology.org\/2024.lrec-main.1036\/&nbsp;<\/p>\n\n\n\n<p>Inoue, K., Jiang, B., Ekstedt, E., Kawahara, T., &amp; Skantze, G. (2024b). <em>Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection<\/em>. arXiv. https:\/\/doi.org\/10.48550\/ARXIV.2401.04868&nbsp;<\/p>\n\n\n\n<p>Levinson, S. C. (2016). Turn-taking in Human Communication \u2013 Origins and Implications for Language Processing. <em>Trends in Cognitive Sciences<\/em>, <em>20<\/em>(1), 6\u201114. https:\/\/doi.org\/10.1016\/j.tics.2015.10.010&nbsp;<\/p>\n\n\n\n<p>Liu, H., Zhu, Z., Becherini, G., Peng, Y., Su, M., Zhou, Y., Zhe, X., Iwamoto, N., Zheng, B., &amp; Black, M. J. (2024). <em>EMAGE\u202f: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling<\/em> (arXiv:2401.00374). arXiv. https:\/\/doi.org\/10.48550\/arXiv.2401.00374&nbsp;<\/p>\n\n\n\n<p>Pavlakos, G., Shan, D., Radosavovic, I., Kanazawa, A., Fouhey, D., &amp; Malik, J. (2023). <em>Reconstructing Hands in 3D with Transformers<\/em> (arXiv:2312.05251). arXiv. https:\/\/doi.org\/10.48550\/arXiv.2312.05251&nbsp;<\/p>\n\n\n\n<p>Schlangen, D., &amp; Skantze, G. (2011). A general, abstract model of incremental dialogue processing. <em>Dialogue &amp; Discourse<\/em>, <em>2<\/em>(1), 83\u2011111.&nbsp;<\/p>\n\n\n\n<p>Shen, Z., Pi, H., Xia, Y., Cen, Z., Peng, S., Hu, Z., Bao, H., Hu, R., &amp; Zhou, X. (2024). World-Grounded Human Motion Recovery via Gravity-View Coordinates. <em>SIGGRAPH Asia 2024 Conference Papers<\/em>, 1\u201111. https:\/\/doi.org\/10.1145\/3680528.3687565&nbsp;<\/p>\n\n\n\n<p>Skantze, G., &amp; Irfan, B. (2025). Applying General Turn-Taking Models to Conversational Human-Robot Interaction. <em>2025 20th ACM\/IEEE International Conference on Human-Robot Interaction (HRI)<\/em>, 859\u2011868. https:\/\/doi.org\/10.1109\/HRI61500.2025.10973958&nbsp;<\/p>\n\n\n\n<p>Zhang, Z., Ao, T., Zhang, Y., Gao, Q., Lin, C., Chen, B., &amp; Liu, L. (2024). 
Semantic Gesticulator\u202f: Semantics-Aware Co-Speech Gesture Synthesis. <em>ACM Trans. Graph.<\/em>, <em>43<\/em>(4), 136:1-136:17. https:\/\/doi.org\/10.1145\/3658134&nbsp;<\/p>\n\n\n\n<div style=\"height:100px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">See Other Projects<\/h2>\n\n\n\n<div class=\"wp-block-group is-content-justification-center is-nowrap is-layout-flex wp-container-core-group-is-layout-94bc23d7 wp-block-group-is-layout-flex\">\n<div class=\"wp-block-group is-vertical is-content-justification-center is-layout-flex wp-container-core-group-is-layout-4b2eccd6 wp-block-group-is-layout-flex\">\n<figure class=\"wp-block-image size-thumbnail is-resized\"><a href=\"https:\/\/www.justinecassell.com\/articulab\/projects\/sara\/\"><img loading=\"lazy\" decoding=\"async\" width=\"480\" height=\"330\" src=\"https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2018\/04\/project_sara-480x330.jpg\" alt=\"\" class=\"wp-image-2925\" style=\"aspect-ratio:4\/3;object-fit:cover;width:300px\"\/><\/a><\/figure>\n\n\n\n<p class=\"has-text-align-center\"><a href=\"https:\/\/www.justinecassell.com\/articulab\/projects\/sara\/\">SARA<\/a><\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-group is-vertical is-content-justification-center is-layout-flex wp-container-core-group-is-layout-4b2eccd6 wp-block-group-is-layout-flex\">\n<figure class=\"wp-block-image size-thumbnail is-resized\"><a href=\"https:\/\/www.justinecassell.com\/articulab\/projects\/alex\/\"><img loading=\"lazy\" decoding=\"async\" width=\"480\" height=\"330\" src=\"https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2018\/04\/project_alex-480x330.jpg\" alt=\"\" class=\"wp-image-2927\" style=\"aspect-ratio:4\/3;object-fit:cover;width:300px\"\/><\/a><\/figure>\n\n\n\n<p class=\"has-text-align-center\"><a href=\"https:\/\/www.justinecassell.com\/articulab\/projects\/alex\/\" data-type=\"page\" 
data-id=\"16\">ALEX<\/a><\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-group is-vertical is-content-justification-center is-layout-flex wp-container-core-group-is-layout-4b2eccd6 wp-block-group-is-layout-flex\">\n<figure class=\"wp-block-image size-thumbnail is-resized\"><a href=\"https:\/\/www.justinecassell.com\/articulab\/projects\/rapt\/\"><img loading=\"lazy\" decoding=\"async\" width=\"480\" height=\"349\" src=\"https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2018\/04\/project_rapt-480x349.jpg\" alt=\"\" class=\"wp-image-2926\" style=\"aspect-ratio:4\/3;object-fit:cover;width:300px\"\/><\/a><\/figure>\n\n\n\n<p class=\"has-text-align-center\"><a href=\"https:\/\/www.justinecassell.com\/articulab\/projects\/rapt\/\" data-type=\"page\" data-id=\"18\">RAPT<\/a><\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-group is-vertical is-content-justification-center is-layout-flex wp-container-core-group-is-layout-4b2eccd6 wp-block-group-is-layout-flex\">\n<figure class=\"wp-block-image size-thumbnail is-resized\"><a href=\"https:\/\/www.justinecassell.com\/articulab\/projects\/yahoo\/\"><img loading=\"lazy\" decoding=\"async\" width=\"480\" height=\"480\" src=\"https:\/\/www.justinecassell.com\/articulab\/wp-content\/uploads\/2015\/09\/smartphone_inmind_sara-480x480.jpg\" alt=\"\" class=\"wp-image-2619\" style=\"aspect-ratio:4\/3;object-fit:cover;width:300px\"\/><\/a><\/figure>\n\n\n\n<p class=\"has-text-align-center\"><a href=\"https:\/\/www.justinecassell.com\/articulab\/projects\/yahoo\/\" data-type=\"page\" data-id=\"20\">InMind<\/a><\/p>\n<\/div>\n<\/div>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Son of Sara : Developing a new LLM-based Embodied Conversational Agent. 
Overview As part of our ongoing work on socially capable conversational agents, the team launched what we are calling Son of SARA (see the SARA project on this page), an embodied conversational agent designed to support natural and effective interaction with human users, relying on both rapport and task [&hellip;]<\/p>\n","protected":false},"author":26,"featured_media":4855,"parent":13,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_acf_changed":false,"footnotes":""},"class_list":["post-4341","page","type-page","status-publish","has-post-thumbnail","hentry"],"acf":[],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.justinecassell.com\/articulab\/wp-json\/wp\/v2\/pages\/4341","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.justinecassell.com\/articulab\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.justinecassell.com\/articulab\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.justinecassell.com\/articulab\/wp-json\/wp\/v2\/users\/26"}],"replies":[{"embeddable":true,"href":"https:\/\/www.justinecassell.com\/articulab\/wp-json\/wp\/v2\/comments?post=4341"}],"version-history":[{"count":10,"href":"https:\/\/www.justinecassell.com\/articulab\/wp-json\/wp\/v2\/pages\/4341\/revisions"}],"predecessor-version":[{"id":5116,"href":"https:\/\/www.justinecassell.com\/articulab\/wp-json\/wp\/v2\/pages\/4341\/revisions\/5116"}],"up":[{"embeddable":true,"href":"https:\/\/www.justinecassell.com\/articulab\/wp-json\/wp\/v2\/pages\/13"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.justinecassell.com\/articulab\/wp-json\/wp\/v2\/media\/4855"}],"wp:attachment":[{"href":"https:\/\/www.justinecassell.com\/articulab\/wp-json\/wp\/v2\/media?parent=4341"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}