Running Agents LoPE Demo - Prompt Perturbation for Reasoning Exploration: Compare baseline and perturbed reasoning for tasks
Paused Agents Lost-in-Thought Benchmark: Run a benchmark to see how reasoning steps affect retrieval accuracy
Sleeping Agents Master Key Capability Demo: Show expected accuracy boost for a math problem via steering
Sleeping Agents Agentic World Model Explorer: Explore world model levels, laws, and rollouts interactively
Runtime error Agents COMPASS-Inspired Semantic Sampling for Sudanese Arabic Dialect Understanding
Sleeping Agents CoT Spatial Reasoning Degradation: Show how step-by-step prompts affect visual puzzle answers
Sleeping Agents CoT Spatial Reasoning Degradation: Generate spatial puzzles and compare direct vs CoT reasoning
Sleeping Agents Weak Supervision Reasoning Explorer: Explore reasoning performance under weak supervision