Slides from TU Vienna about Reinforcement Learning. The Pdf introduces key concepts like Markov Decision Processes and Upper-Confidence-Bound action selection, with mathematical formulas and comparative graphs. This University level Computer science material is based on textbook chapters.
See more53 Pages


Unlock the full PDF for free
Sign up to get full access to the document and start transforming it with AI.
TU VIENNA Reinforcement Learning Nysret Musliu
Reinforcement Learning An Introduction second edition Richard S. Sutton and Andrew G. Barto
FACULTY OF INFORMATICS
TU VIENNA Book
http://incompleteideas.net/book/RLbook2020.pdf · This lecture: Chapter 1- 3 (pages 1- 71) and some parts of chapters 4,5,6 · Slides are based on these chapters
Reinforcement Learning An Introduction second edition Richard S. Sutton and Andrew G. Barto
FACULTY OF INFORMATICS
TU VIENNA Introduction
· An agent must be able to - sense the state of its environment - take actions that affect the state
FACULTY OF INFORMATICS
TU VIENNA Reinforcement Learning
Different from supervised learning · Learning form a set of training instances provided by supervisor . In RL an agent must be able to learn from its own experience
Different form unsupervised learning · Finding structure in unlabeled data · Reinforcement learning is trying to maximize a reward signal instead of trying to find hidden structure
Reinforcement learning can be considered as a third machine learning paradigm
FACULTY OF INFORMATICS
TU VIENNA Reinforcement Learning
FACULTY OF INFORMATICS
TU VIENNA Elements of Reinforcement Learning
Policy: - Agent's way of behaving at a given time - A mapping from perceived states of the environment to actions to be taken
Reward signal: - Defines the goal of a reinforcement learning problem - The environment sends a reward to agent - The objective is to maximize the total reward over the long run
Value function: Specifies what is good in the long run - - The total amount of reward an agent can expect to accumulate over the future
Model of the environment: - Mimics the behavior of the environment
FACULTY OF INFORMATICS
TU VIENNA Example: Tic-Tac-Toe
starting position a opponent's move b our move S opponent's move d our move e opponent's move f our move 00
X O O O X X X
FACULTY OF INFORMATICS
TU VIENNA Example: Tic-Tac-Toe
starting position a opponent's move ~ b our move c c* opponent's move d our move e opponent's move f our move * St: the state before the greedy move St+1: the state after the move The update to the estimated value of St: V (St) < V (St) + a V (St+1) - V (St) Alfa is the state-size parameter
FACULTY OF INFORMATICS
TU VIENNA Tabular Solution Methods
Topics: Bandit problems Finite Markov decision processes Dynamic programming Monte Carlo methods Temporal difference learning
FACULTY OF INFORMATICS
TU VIENNA
A k-armed Bandit Problem Wikipedia
FACULTY OF INFORMATICS
A k-armed Bandit Problem
• Each action receives a numerical reward - A stationary probability distribution that depends on the action you selected . The value of an arbitrary action a is the expected reward given that a is selected: q* (a) = E Rt | At=a] At -> action selected on time step t Rt - > corresponding reward · Maximize the expected total reward over some time period
FACULTY OF INFORMATICS
TU VIENNA
FACULTY OF INFORMATICS
TU VIENNA Action-value Methods
· If the denominator is zero, Q (a) takes a default value (e.g. 0) · Sample-average method · Greedy action selection At = arg max Qt (a) a
sum of rewards when a taken prior to t t-1 Ei Ri . 1Ai=a i=1
FACULTY OF INFORMATICS
TU VIENNA
FACULTY OF INFORMATICS E- greedy action selection!
TU VIENNA The 10-armed Testbed
• k = 10 · The q+(a) were selected according to a normal (Gaussian) distribution with mean 0 and variance 1 . The actual rewards were selected according to a mean q+(a) unit variance normal distribution · When a learning method applied to that problem selected action A, at time step t, the actual reward, R, was selected from a normal distribution with mean q+(At) and variance 1
FACULTY OF INFORMATICS
TU VIENNA
3 2 q+ (3) q* (5) 1 q+ (9) q* (4) q+ (1) 0 - 1 q+ (7) q* (10) q* (2) -1 q* (8) q+(6) -2 -3 1 2 3 4 5 6 7 8 9 10 Action
Reward distribution The 10-armed Testbed: Distributions
FACULTY OF INFORMATICS
TU VIENNA
1.5 €=0.1 1 == 0 (greedy) Average reward 0.5_ 0 T 1 250 T 500 750 1000 Steps 100% 80% E=0.1 60% €=0.01 Optimal action 40% - == 0 (greedy) 20% 0% 1 250 500 750 1000 Steps ICS €=0.01
TU VIENNA Incremental Implementation
FACULTY OF INFORMATICS = n n Ri i=1 n-1 Rn + > Ri i=1 n-1 Ri
Qn+1 = 1 n n 1
TU VIENNA A simple bandit algorithm Initialize, for a = 1 to k: Q(a) +- 0 N(a) + 0 Loop forever: { argmaxa Q(a) with probability 1 - 8 (breaking ties randomly) a random action with probability § R + bandit(A) N(A) <- N(A)+1 Q(A) <- Q(A) + N(A) 1 R-Q(A)]
FACULTY OF INFORMATICS