A multi-armed bandit algorithm balances accurate estimation of arm means against reward maximization, with regret guarantees that interpolate between the exploration and exploitation objectives.
In multi-armed bandits, the most-explored arms are the most informative, whereas reward maximization typically pulls only the best arm. We study the tradeoff between identifying arm means accurately and accumulating reward, and present an algorithm whose regret guarantees interpolate between the two objectives. We provide both upper and lower bounds and validate the results empirically.
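The tension described above can be illustrated with a small simulation. The sketch below is not the paper's algorithm: it is a hypothetical mixture that, at each round, either pulls the least-pulled arm (favoring accurate mean estimates for every arm) or follows the standard UCB1 rule (favoring reward). The function name `run_mixture_bandit`, the parameter `explore_frac`, and the Bernoulli reward model are all assumptions made for illustration.

```python
import math
import random


def run_mixture_bandit(means, horizon, explore_frac, seed=0):
    """Illustrative only (not the paper's method).

    With probability `explore_frac`, pull the least-pulled arm so that all
    arm means are estimated evenly; otherwise pull the arm chosen by UCB1.
    Rewards are Bernoulli with the given (assumed) means.
    Returns (cumulative pseudo-regret, max estimation error, pull counts).
    """
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    sums = [0.0] * k
    best = max(means)
    regret = 0.0

    for t in range(1, horizon + 1):
        if t <= k:
            # Initialization: pull every arm once.
            arm = t - 1
        elif rng.random() < explore_frac:
            # Estimation step: equalize the number of pulls across arms.
            arm = min(range(k), key=lambda i: pulls[i])
        else:
            # Reward step: standard UCB1 index.
            arm = max(
                range(k),
                key=lambda i: sums[i] / pulls[i]
                + math.sqrt(2.0 * math.log(t) / pulls[i]),
            )

        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
        # Pseudo-regret: gap between the best mean and the pulled arm's mean.
        regret += best - means[arm]

    estimates = [sums[i] / pulls[i] for i in range(k)]
    max_err = max(abs(estimates[i] - means[i]) for i in range(k))
    return regret, max_err, pulls
```

Setting `explore_frac=1.0` yields uniform allocation, where regret grows linearly in the horizon but every arm is sampled equally often. Setting `explore_frac=0.0` yields plain UCB1, which concentrates pulls on the best arm and leaves the other arms with few samples. Intermediate values trade one objective against the other, which is the kind of interpolation the abstract refers to.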
Get this paper in your agent:
hf papers read 2605.00488
curl -LsSf https://hf.co/cli/install.sh | bash