Reinforcement Learning: Markov Decision Processes and UCB Action Selection

Slides from TU Vienna about Reinforcement Learning. The Pdf introduces key concepts like Markov Decision Processes and Upper-Confidence-Bound action selection, with mathematical formulas and comparative graphs. This University level Computer science material is based on textbook chapters.

53 Pages

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Reinforcement Learning

Nysret Musliu

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Book

http://incompleteideas.net/book/RLbook2020.pdf

• This lecture: Chapter 1- 3 (pages 1- 71) and

some parts of chapters 4,5,6

• Slides are based on these chapters

Preview

Reinforcement Learning Introduction

TU VIENNA Reinforcement Learning Nysret Musliu

Reinforcement Learning An Introduction second edition Richard S. Sutton and Andrew G. Barto

FACULTY OF INFORMATICS

Reinforcement Learning Book

TU VIENNA Book

http://incompleteideas.net/book/RLbook2020.pdf · This lecture: Chapter 1- 3 (pages 1- 71) and some parts of chapters 4,5,6 · Slides are based on these chapters

Reinforcement Learning An Introduction second edition Richard S. Sutton and Andrew G. Barto

FACULTY OF INFORMATICS

Introduction to Reinforcement Learning

TU VIENNA Introduction

Learning by interacting with the environment . "Reinforcement learning is learning what to do-how to map situations to actions-so as to maximize a numerical reward signal." (Sutton and Barto, RL Book) . Discover which actions to take by trail-and-error - Immediate reward - Delayed rewards

· An agent must be able to - sense the state of its environment - take actions that affect the state

FACULTY OF INFORMATICS

Reinforcement Learning Paradigms

TU VIENNA Reinforcement Learning

Supervised Learning Comparison

Different from supervised learning · Learning form a set of training instances provided by supervisor . In RL an agent must be able to learn from its own experience

Unsupervised Learning Comparison

Different form unsupervised learning · Finding structure in unlabeled data · Reinforcement learning is trying to maximize a reward signal instead of trying to find hidden structure

Reinforcement learning can be considered as a third machine learning paradigm

FACULTY OF INFORMATICS

Reinforcement Learning Challenges

TU VIENNA Reinforcement Learning

Trade-off between exploration and exploitation
A reinforcement learning agent must prefer actions that give good rewards (exploitation)
Try also actions that were not selected before to discover good actions (exploration)
Uncertain environment
Taking into account delayed consequences of actions
The effects of actions cannot be fully predicted

FACULTY OF INFORMATICS

Elements of Reinforcement Learning

TU VIENNA Elements of Reinforcement Learning

Policy

Policy: - Agent's way of behaving at a given time - A mapping from perceived states of the environment to actions to be taken

Reward Signal

Reward signal: - Defines the goal of a reinforcement learning problem - The environment sends a reward to agent - The objective is to maximize the total reward over the long run

Value Function

Value function: Specifies what is good in the long run - - The total amount of reward an agent can expect to accumulate over the future

Environment Model

Model of the environment: - Mimics the behavior of the environment

FACULTY OF INFORMATICS

Tic-Tac-Toe Example

TU VIENNA Example: Tic-Tac-Toe

starting position a opponent's move b our move S opponent's move d our move e opponent's move f our move 00

X O O O X X X

FACULTY OF INFORMATICS

Tic-Tac-Toe Example Update

TU VIENNA Example: Tic-Tac-Toe

starting position a opponent's move ~ b our move c c* opponent's move d our move e opponent's move f our move * St: the state before the greedy move St+1: the state after the move The update to the estimated value of St: V (St) < V (St) + a V (St+1) - V (St) Alfa is the state-size parameter

FACULTY OF INFORMATICS

Tabular Solution Methods Overview

TU VIENNA Tabular Solution Methods

State and action spaces are small
Methods can often find optimal value function and the optimal policy
Different from approximate methods which only find approximate solutions, but can be applied effectively to larger problems

Topics in Tabular Methods

Topics: Bandit problems Finite Markov decision processes Dynamic programming Monte Carlo methods Temporal difference learning

FACULTY OF INFORMATICS

k-armed Bandit Problem

TU VIENNA

A k-armed Bandit Problem Wikipedia

FACULTY OF INFORMATICS

k-armed Bandit Problem Details

A k-armed Bandit Problem

Select among k different actions

• Each action receives a numerical reward - A stationary probability distribution that depends on the action you selected . The value of an arbitrary action a is the expected reward given that a is selected: q* (a) = E Rt | At=a] At -> action selected on time step t Rt - > corresponding reward · Maximize the expected total reward over some time period

FACULTY OF INFORMATICS

Action Value Estimation

TU VIENNA

We denote the estimated value of action a at time step t as Qt(a)
We would like Q.(a) to be close to q+(a)
At any time step there is at least one action whose estimated value is greatest
Selection of a greedy action - Exploitation
Selection of non greedy action - Exploration
Balance between exploitation and exploration is important

FACULTY OF INFORMATICS

Action-value Methods

TU VIENNA Action-value Methods

Estimating the values of actions Qt(a) = Ei=1 1 A ¡= a i=1 t-1 , number of times a taken prior to t

· If the denominator is zero, Q (a) takes a default value (e.g. 0) · Sample-average method · Greedy action selection At = arg max Qt (a) a

sum of rewards when a taken prior to t t-1 Ei Ri . 1Ai=a i=1

FACULTY OF INFORMATICS

Epsilon-Greedy Action Selection

TU VIENNA

Greedy action selection does not try inferior actions to see if they might really be better
With small probability & select randomly from among all the actions with equal probability, independently of the action-value estimates
With probability 1-& select one of the actions with the highest estimated value

FACULTY OF INFORMATICS E- greedy action selection!

10-armed Testbed Setup

TU VIENNA The 10-armed Testbed

A set of 2000 randomly generated k-armed bandit problems

• k = 10 · The q+(a) were selected according to a normal (Gaussian) distribution with mean 0 and variance 1 . The actual rewards were selected according to a mean q+(a) unit variance normal distribution · When a learning method applied to that problem selected action A, at time step t, the actual reward, R, was selected from a normal distribution with mean q+(At) and variance 1

FACULTY OF INFORMATICS

10-armed Testbed Distributions

TU VIENNA

3 2 q+ (3) q* (5) 1 q+ (9) q* (4) q+ (1) 0 - 1 q+ (7) q* (10) q* (2) -1 q* (8) q+(6) -2 -3 1 2 3 4 5 6 7 8 9 10 Action

Reward distribution The 10-armed Testbed: Distributions

FACULTY OF INFORMATICS

10-armed Testbed Results

TU VIENNA

1.5 €=0.1 1 == 0 (greedy) Average reward 0.5_ 0 T 1 250 T 500 750 1000 Steps 100% 80% E=0.1 60% €=0.01 Optimal action 40% - == 0 (greedy) 20% 0% 1 250 500 750 1000 Steps ICS €=0.01

Incremental Implementation of Action Values

TU VIENNA Incremental Implementation

Concentrate on a single action: R is the reward received after the ith selection of this action
Qn denote the estimate of its action value after it has been selected n - 1 times, Qn = R1 + R2+ ... + Rn-1 n -1 = 1 Rn + (n-1) 1 n -1 i=1 = Rn + (n - 1)Qn) = Rn + nQn - Qn n = L Qnt - Rn - Qn , L NewEstimate ‹ OldEstimate + StepSize

FACULTY OF INFORMATICS = n n Ri i=1 n-1 Rn + > Ri i=1 n-1 Ri

Qn+1 = 1 n n 1

Simple Bandit Algorithm

TU VIENNA A simple bandit algorithm Initialize, for a = 1 to k: Q(a) +- 0 N(a) + 0 Loop forever: { argmaxa Q(a) with probability 1 - 8 (breaking ties randomly) a random action with probability § R + bandit(A) N(A) <- N(A)+1 Q(A) <- Q(A) + N(A) 1 R-Q(A)]

FACULTY OF INFORMATICS

Can’t find what you’re looking for?

Explore more topics in the Algor library or create your own materials with AI.

Reinforcement Learning: Markov Decision Processes and UCB Action Selection

Generate with AI

Preview

Reinforcement Learning Introduction

Reinforcement Learning Book

Introduction to Reinforcement Learning

Reinforcement Learning Paradigms

Supervised Learning Comparison

Unsupervised Learning Comparison

Reinforcement Learning Challenges

Elements of Reinforcement Learning

Policy

Reward Signal

Value Function

Environment Model

Tic-Tac-Toe Example

Tic-Tac-Toe Example Update

Tabular Solution Methods Overview

Topics in Tabular Methods

k-armed Bandit Problem

k-armed Bandit Problem Details

Action Value Estimation

Action-value Methods

Epsilon-Greedy Action Selection

10-armed Testbed Setup

10-armed Testbed Distributions

10-armed Testbed Results

Incremental Implementation of Action Values

Simple Bandit Algorithm

Related Content