arxiv:2605.05765

X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction

Published on May 7 · Submitted by zhenru on May 12

Abstract

AI-generated summary: X-OmniClaw is a unified mobile agent architecture that integrates multimodal perception, memory, and action components to enable intelligent interaction within Android environments.

The development of OpenClaw has inspired a growing demand for mobile-based personal agents capable of handling complex and intuitive interactions. In this technical report, we introduce X-OmniClaw, a unified mobile agent designed for multimodal understanding and interaction in the Android ecosystem. This unified architecture of perception, memory, and action enables the agent to handle complex mobile tasks with high contextual awareness. Specifically, Omni Perception provides a unified multimodal ingress pipeline that integrates UI states, real-world visual contexts, and speech inputs, leveraging a temporal alignment module to decompose raw data into structured multimodal intent representations. Omni Memory leverages multimodal memory optimization to enhance personalized intelligence by integrating runtime working memory for task continuity with long-term personal memory distilled from local data, enabling highly context-aware and personalized interactions. Finally, Omni Action employs a hybrid grounding strategy that combines structural XML metadata with visual perception for robust interaction. Through Behavior Cloning and Trajectory Replay, the system captures user navigation as reusable skills, enabling precise direct-access execution. Demonstrations across diverse scenarios show that X-OmniClaw effectively enhances interaction efficiency and task reliability, providing a practical architectural blueprint for the next generation of mobile-native personal assistants.

Community

Paper author

Hi HF community! If you are interested in our work or have any questions, feel free to reach out or leave a comment. I'd love to hear your thoughts!

Paper author · Paper submitter

X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction

[Figure: X-OmniClaw overall architecture]

Omni Perception

[Figure 1: Omni Perception]
Multimodal entry and unified ingress. X-OmniClaw consolidates diverse inputs—direct UI triggers, floating widgets, microphone input, scheduled tasks, and external gateways—into one pipeline. For recurring on-device tasks, Android AlarmManager provides a system-level wake-up path so scheduled triggers merge back into the same entry semantics.
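
The report does not provide code for this ingress layer; as a rough Kotlin sketch under stated assumptions, the snippet below shows how a recurring task could be scheduled through Android's AlarmManager so that its wake-up lands in the same unified entry point as other triggers. UnifiedIngress, Trigger, and ScheduledTaskReceiver are hypothetical names; only the AlarmManager and PendingIntent calls are standard Android APIs.

```kotlin
import android.app.AlarmManager
import android.app.PendingIntent
import android.content.BroadcastReceiver
import android.content.Context
import android.content.Intent

// Hypothetical trigger model shared by all entry paths (UI taps, widgets, voice, schedules).
sealed class Trigger {
    data class Scheduled(val taskId: String) : Trigger()
}

// Hypothetical unified entry point; in the real system this would hand off to perception.
object UnifiedIngress {
    fun dispatch(trigger: Trigger) { /* forward into the agent pipeline */ }
}

// A scheduled wake-up is converted into the same trigger representation as any other input.
// (The receiver would also need to be declared in the app manifest.)
class ScheduledTaskReceiver : BroadcastReceiver() {
    override fun onReceive(context: Context, intent: Intent) {
        val taskId = intent.getStringExtra("task_id") ?: return
        UnifiedIngress.dispatch(Trigger.Scheduled(taskId))
    }
}

// Schedule an exact, Doze-tolerant wake-up for a recurring task.
// Note: on Android 12+ exact alarms additionally require the SCHEDULE_EXACT_ALARM permission.
fun scheduleTask(context: Context, taskId: String, triggerAtMillis: Long) {
    val alarmManager = context.getSystemService(Context.ALARM_SERVICE) as AlarmManager
    val intent = Intent(context, ScheduledTaskReceiver::class.java).putExtra("task_id", taskId)
    val pending = PendingIntent.getBroadcast(
        context, taskId.hashCode(), intent,
        PendingIntent.FLAG_UPDATE_CURRENT or PendingIntent.FLAG_IMMUTABLE
    )
    alarmManager.setExactAndAllowWhileIdle(AlarmManager.RTC_WAKEUP, triggerAtMillis, pending)
}
```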

Integrated multimodal perception. The phone is modeled as a first-person multimodal system over on-screen UI, real-world camera context, and speech. Camera and screen projection supply visual evidence; ASR transcribes speech in real time; on-device AEC mitigates playback echo. A decoupled streaming pipeline buffers visual history, and a temporal alignment module aligns speech and video via timestamps.
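
As a minimal illustration of the timestamp-based alignment described above, the sketch below buffers recent frames from the decoupled visual stream and attaches the ones that overlap a finalized speech segment. The class names and the buffering window are assumptions for illustration, not details from the report.

```kotlin
// Illustrative types: a buffered video frame and a finalized ASR segment.
data class Frame(val timestampMs: Long, val imagePath: String)
data class SpeechSegment(val startMs: Long, val endMs: Long, val text: String)
data class AlignedSegment(val segment: SpeechSegment, val frames: List<Frame>)

class TemporalAligner(private val bufferWindowMs: Long = 5_000) {
    private val frameBuffer = ArrayDeque<Frame>()

    // The decoupled streaming pipeline pushes frames as they arrive.
    fun onFrame(frame: Frame) {
        frameBuffer.addLast(frame)
        // Drop frames that fall outside the buffering window.
        while (frameBuffer.isNotEmpty() &&
            frame.timestampMs - frameBuffer.first().timestampMs > bufferWindowMs
        ) frameBuffer.removeFirst()
    }

    // When an ASR segment is finalized, attach the frames whose timestamps overlap it.
    fun align(segment: SpeechSegment): AlignedSegment {
        val frames = frameBuffer.filter { it.timestampMs in segment.startMs..segment.endMs }
        return AlignedSegment(segment, frames)
    }
}
```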

Scene-grounded intent understanding. A VLM interprets the scene with the user query, expanding raw input into intent. Answerable questions return immediately; otherwise the structured intent is handed to the downstream agent loop.
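
A plausible shape for the structured intent and the routing decision described above is sketched below; the field names and the routing rule are assumptions, not the paper's actual schema.

```kotlin
// Illustrative output of a VLM call that expands the raw query plus scene into intent.
data class MultimodalIntent(
    val userQuery: String,
    val sceneSummary: String,   // what was seen on screen / through the camera
    val answer: String?,        // filled when the question is directly answerable
    val actionGoal: String?     // filled when a downstream agent task is needed
)

sealed class Routing {
    data class Reply(val text: String) : Routing()     // answer immediately
    data class Delegate(val goal: String) : Routing()  // hand to the agent loop
}

fun route(intent: MultimodalIntent): Routing = when {
    intent.answer != null -> Routing.Reply(intent.answer)
    intent.actionGoal != null -> Routing.Delegate(intent.actionGoal)
    else -> Routing.Reply("Could you clarify what you'd like to do?")
}
```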

Omni Memory

[Figure 2: Omni Memory]

Working memory and long-term user memory. Working memory preserves multimodal runtime context across turns, foreground changes, and app switches—screenshots, distilled observations, and execution state—so tasks can resume without losing place. Long-term memory distills device-resident personal data into persistent artifacts and user-profile representations injected into reasoning.
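
The report does not specify the memory data structures, so the following is only an illustrative sketch of the two tiers: a runtime working memory that accumulates per-step context for task continuity, and a long-term profile of distilled facts that can be injected into reasoning.

```kotlin
// One step of runtime context kept across turns, foreground changes, and app switches.
data class WorkingMemoryEntry(
    val turnId: Int,
    val screenshotPath: String?,   // raw visual evidence for the step
    val observation: String,       // distilled description of what was seen
    val executionState: String     // e.g. "waiting for payment screen"
)

// A persistent fact distilled from device-resident personal data.
data class UserProfileFact(
    val key: String,               // e.g. "preferred_food_delivery_app"
    val value: String,
    val source: String             // which local data it was distilled from
)

class OmniMemory {
    private val working = mutableListOf<WorkingMemoryEntry>()
    private val profile = mutableMapOf<String, UserProfileFact>()

    // Working memory accumulates during a task so it can resume without losing place.
    fun record(entry: WorkingMemoryEntry) { working += entry }
    fun runtimeContext(lastN: Int = 5): List<WorkingMemoryEntry> = working.takeLast(lastN)

    // Long-term facts form the user profile injected into reasoning.
    fun remember(fact: UserProfileFact) { profile[fact.key] = fact }
    fun profileSnapshot(): Collection<UserProfileFact> = profile.values
}
```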

Gallery and semantic records. Gallery photos become compact semantic records (objects, scenes, events) to support grounded QA, retrieval, and automation.
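
A hypothetical example of what such a compact semantic record might look like, together with a trivial keyword lookup over the records; the actual representation used by X-OmniClaw is not described at this level of detail.

```kotlin
// Compact semantic record distilled from one gallery photo (illustrative fields only).
data class PhotoRecord(
    val uri: String,
    val takenAtMs: Long,
    val objects: List<String>,   // e.g. ["dog", "beach"]
    val scene: String,           // e.g. "outdoor, seaside"
    val event: String?           // e.g. "birthday dinner", if inferable
)

// Naive keyword retrieval over the records, enough to support simple grounded QA.
fun search(records: List<PhotoRecord>, query: String): List<PhotoRecord> {
    val terms = query.lowercase().split(" ").filter { it.isNotBlank() }
    return records.filter { r ->
        val haystack = (r.objects + r.scene + (r.event ?: "")).joinToString(" ").lowercase()
        terms.any { it in haystack }
    }
}
```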

How memory is built, used, and secured. Skills orchestrate maintenance vs. consumption; tools implement concrete steps. Image pipelines prefer multimodal summarization with metadata fallback. Production is separated from consumption; writes pass filtering/redaction; users control gallery memory and profile injection.
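
The sketch below illustrates the write-path guard this paragraph describes: memory production is kept separate from consumption, every write passes a filtering/redaction step, and user-facing switches gate gallery memory. The specific redaction patterns and settings fields are illustrative assumptions.

```kotlin
// User-controlled switches over memory behavior (assumed names).
data class MemorySettings(
    val galleryMemoryEnabled: Boolean = true,
    val profileInjectionEnabled: Boolean = true
)

object MemoryWriter {
    // Example redaction rules; a real system would use far more careful detectors.
    private val redactions = listOf(
        Regex("""\b\d{13,19}\b""") to "[REDACTED_CARD]",            // card-like digit runs
        Regex("""\b[\w.+-]+@[\w-]+\.[\w.]+\b""") to "[REDACTED_EMAIL]"
    )

    // Only the production side calls this; consumers never mutate memory directly.
    // Returns null when the write is blocked by a user setting.
    fun write(settings: MemorySettings, source: String, text: String): String? {
        if (source == "gallery" && !settings.galleryMemoryEnabled) return null
        var sanitized = text
        for ((pattern, replacement) in redactions) {
            sanitized = pattern.replace(sanitized, replacement)
        }
        return sanitized
    }
}
```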

Omni Action

[Figure 3: Omni Action]
Omni Action in the app ecosystem. Each step follows observation, reasoning, and execution. The observation stack fuses multimodal interface evidence; the loop selects skills, retrieves memory, and returns the next action or a direct reply. Execution spans Android atomic actions and higher-level tools (filesystem, RAG, etc.).
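
To make the per-step control flow concrete, here is a schematic observe-reason-execute loop under assumed interfaces; none of these types come from the report.

```kotlin
// Possible action space: atomic UI actions, higher-level tools, or a direct reply.
sealed class AgentAction {
    data class Tap(val x: Int, val y: Int) : AgentAction()
    data class TypeText(val text: String) : AgentAction()
    data class UseTool(val tool: String, val args: Map<String, String>) : AgentAction()
    data class Reply(val text: String) : AgentAction()
    object Done : AgentAction()
}

interface Observer { fun observe(): String }   // fused screen/XML/OCR evidence
interface Reasoner { fun decide(observation: String, memory: List<String>): AgentAction }
interface Executor { fun execute(action: AgentAction) }

// One episode: observe, reason over observation plus memory, execute, repeat.
fun runEpisode(observer: Observer, reasoner: Reasoner, executor: Executor,
               memory: MutableList<String>, maxSteps: Int = 20) {
    repeat(maxSteps) {
        val observation = observer.observe()
        memory += observation                               // feed working memory
        when (val action = reasoner.decide(observation, memory)) {
            is AgentAction.Done -> return
            is AgentAction.Reply -> { executor.execute(action); return }
            else -> executor.execute(action)                 // atomic action or tool call
        }
    }
}
```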

Hybrid UI understanding. XML, on-device grounding, and OCR localize targets: structure when reliable, vision and text when cues are weak or cluttered—especially under ads and dense layouts.
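
A minimal sketch of the fallback order implied above, assuming simple lookup interfaces over the XML/accessibility dump and on-device OCR results: structure is tried first, and visual text evidence is used when structure does not resolve.

```kotlin
// A grounded on-screen target with its tap point and the evidence source that found it.
data class ScreenTarget(val label: String, val centerX: Int, val centerY: Int, val source: String)

interface XmlIndex { fun findByText(text: String): ScreenTarget? }   // structural XML metadata
interface OcrIndex { fun findByText(text: String): ScreenTarget? }   // on-device OCR boxes

fun ground(text: String, xml: XmlIndex, ocr: OcrIndex): ScreenTarget? {
    // Prefer structure when it resolves cleanly...
    xml.findByText(text)?.let { return it }
    // ...otherwise fall back to visual/text cues (useful under ads and dense layouts).
    return ocr.findByText(text)
}
```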

Trajectory-cloned execution. Behavior cloning records UI-layer navigation into named skills; dumpsys-based introspection extracts deeplink/intent shortcuts. Trajectory replay recovers target “addresses” for fast re-entry with fallbacks when UI drifts.
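
The following sketch illustrates the idea of replaying a recorded trajectory as a named skill, with a deeplink "address" as the fast path and step-by-step UI replay as the fallback when the deeplink is unavailable or the UI has drifted. The types and methods are illustrative; the report does not publish this interface.

```kotlin
// One recorded navigation step and a named skill cloned from user behavior.
data class RecordedStep(val description: String, val targetText: String)
data class Skill(
    val name: String,
    val deeplink: String?,        // e.g. an intent/deeplink shortcut found via introspection
    val steps: List<RecordedStep>
)

// Assumed driver over the UI layer (tap-by-text, open deeplink).
interface UiDriver {
    fun openDeeplink(uri: String): Boolean   // true if the target screen was reached
    fun tapByText(text: String): Boolean
}

fun replay(skill: Skill, driver: UiDriver): Boolean {
    // Fast path: jump straight to the target "address" when a deeplink exists and works.
    if (skill.deeplink != null && driver.openDeeplink(skill.deeplink)) return true
    // Fallback: replay the recorded UI steps; fail fast if any step no longer resolves.
    return skill.steps.all { driver.tapByText(it.targetText) }
}
```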


