# Basic Reinforcement Learning Example using Logistic Regression

## Introduction

Reinforcement learning has been shown to solve complex problems. Recently it was used with great success by Google DeepMind to play Atari games. The successes are impressive, but understanding the basics of these frameworks/algorithms can be daunting. In this post I hope to eliminate some of the mental entropy caused by trying to understand how and why the algorithms work. I will focus on logistic regression because it is a simple model, and if we can understand it, more complex models are far easier to comprehend.

Going into this I expect that the reader has some math background. You should understand what a gradient is, and a little about RL, in particular what a policy is and how Q-Learning works. I learned this topic by taking apart algorithms designed for classification, so some of that perspective may seep into the discussion.

## Framework

These examples are coded in Python. I am not saying it is the best language in the world, but in this case I think it helps the reader (including myself) better understand what is going on in the program.

### Theano

There is also heavy use of a Python library called Theano. This library is fantastic, in my opinion. It is a symbolic math computing library with support for using the GPU (the GPU is not used in this work). Using Theano is more like creating a data flow model in which variables are vectors/matrices.

Example: create the simple function \(f(x) = x + x^{10}\)

```python
import theano

a = theano.tensor.vector()     # declare a vector variable
out = a + a ** 10              # build the symbolic expression
f = theano.function([a], out)  # compile the function
print(f([0, 1, 2]))
# out >> [ 0. 2. 1026.]
```

This example shows some of those features. First, declare the type of variable that will be used, such as a vector or matrix. Next, build the symbolic mathematical expression you want to compute over that variable. Finally, compile the function; this step almost literally compiles a C function in the background that data will be fed through to produce output.

## The Environment

I wanted to use a very simple environment to demonstrate the workings of RL. I decided on a 2D navigation game because it lets you visualize the policy so nicely.

### Navigation Game

The game consists of an agent, a goal, and a square 2D world the agent navigates in. The state of the game is shown on the left and the current policy for the agent is shown on the right. The agent can move in any of 8 directions, as shown on the left. The target location of the agent is the red dot.

## Reinforcement Learning

One key difference when using logistic regression for RL instead of classification is the data. In classification the data is pre-labeled with the correct class that we want the model to predict. In RL there is no pre-labeled data. We have to generate the data ourselves, and this data carries a reward signal that should be maximized, not a class label.

### Bellman Error

<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">It is not just the immediate reward signal we want to maximize but the discounted future reward \(\gamma \max_{a'} Q(s',a')\).
This is done by minimizing the *Bellman Error*, which I will label \(\delta\) here:</div>

\[ \delta(r,s,s') = (r + \gamma \max_{a'} Q(s',a')) - Q(s,a) \]

<div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">
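To make the Bellman error concrete, here is a small sketch of \(\delta\) for a single transition. This is my own illustration in plain NumPy rather than this post's Theano code; \(Q\) can be any callable returning a vector of action values, and the variable names are mine:

```python
import numpy as np

def bellman_error(r, s, s_next, a, Q, gamma=0.9):
    """Bellman (temporal-difference) error for one transition (s, a, r, s')."""
    target = r + gamma * np.max(Q(s_next))  # r + gamma * max_a' Q(s', a')
    return target - Q(s)[a]                 # how far Q(s, a) is from the target

# Toy check: a Q-function that values every action at 0 in every state.
Q = lambda s: np.zeros(8)  # 8 actions, as in the navigation game
delta = bellman_error(r=1.0, s=0, s_next=1, a=3, Q=Q)
print(delta)  # -> 1.0 (the reward is entirely unexplained by Q)
```

A positive \(\delta\) means the transition turned out better than the current Q-values predicted, so training should push \(Q(s,a)\) up toward the target.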

### Logistic Regression

Logistic regression is built around the logistic sigmoid, one of the functions commonly used as an *activation function*. Others include the Rectified Linear Unit (ReLU) and the step function of the Perceptron. The conceptual target of using an activation function is to achieve the almighty goal of linear separability: the data can be transformed through the activation function so that a hyperplane can separate it nicely.

The sigmoid function is defined as

\[ sigmoid(x; W, b) = \frac{1}{1 + e^{-(W \cdot x + b)}} \]

Often in RL the parameters of the model are denoted \(\theta\), and the goal is to optimize these parameters. For logistic regression using the sigmoid activation function this translates to \(\theta = \{W, b\}\), where \(W\) contains the *weights* of the model and \(b\) is the *bias* of the model. This is somewhat similar to the more common \(y = mx + b\) for defining a line; the \(b\) effectively shifts the sigmoid curve to the right or left.
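As a quick illustration of \(\theta = \{W, b\}\), here is the sigmoid written in plain NumPy (a stand-in for the Theano version; the scalar inputs are just for demonstration):

```python
import numpy as np

def sigmoid(x, W, b):
    # theta = {W, b}: W scales the input, b shifts the curve left or right
    return 1.0 / (1.0 + np.exp(-(W * x + b)))

print(sigmoid(0.0, W=1.0, b=0.0))  # -> 0.5 (curve centered at x = 0)
print(sigmoid(0.0, W=1.0, b=2.0))  # -> ~0.88 (b has shifted the curve left)
```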

#### Multidimensional Regression

This model extends to multiple dimensions, or actions, rather easily: \(x\) and \(b\) become vectors and \(W\) is now a matrix with dimensions \(len(x) \times numActions\). The sigmoid function then returns a vector of Q-values, one for each action in state \(x\). The optimal action under Q-Learning can then be computed as</div><div dir="ltr" style="line-height: 1.38; margin-bottom: 0pt; margin-top: 0pt;">

\[ a^{*} = \arg\max_{a} Q(s,a) = \arg\max_{a} sigmoid(x; \theta) \]
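The shapes involved can be sketched in NumPy as follows (the zero initialization and the 2D state here are placeholder assumptions for illustration, not values from this post's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

num_actions = 8                          # the navigation game's 8 directions
state = np.array([0.3, 0.7])             # x: the agent's 2D position
W = np.zeros((len(state), num_actions))  # weights: len(x) x numActions
b = np.zeros(num_actions)                # one bias per action

q_values = sigmoid(np.dot(state, W) + b)  # vector of Q-values for state x
best_action = np.argmax(q_values)         # a* = argmax_a Q(s, a)
print(q_values.shape, best_action)        # -> (8,) 0
```

With all-zero parameters every action's Q-value is 0.5, so `argmax` simply returns the first action; training is what breaks that tie.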

#### Model Optimization

To optimize this sigmoidal model a *cost* function is needed. The idea of a cost function is to measure the difference between the current model and a perfect solution. For RL we use the Bellman Error to determine this.

\[ cost(R, S, S') = \sum 0.5 \, \delta(R, S, S')^{2} \]

This cost function gives a fair measure of the model's *error*. Note that in this formulation \(R, S, S'\) are vectors or matrices, and the squared cost is used so that positive and negative errors do not cancel each other out in the sum. It is also common to use the mean instead of the sum in this function, because the sum is affected more by the scale of the variables. Including our model parameters in the cost function gives

\[ cost(R, S, S', \theta) = \sum 0.5 \, \delta(R, S, S', \theta)^{2} \]
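A NumPy sketch of this batch cost, assuming \(R\) is a vector of rewards and the Q-values for each \(s\) and \(s'\) have already been computed by the model (the variable names here are mine, not from this post's code):

```python
import numpy as np

def cost(R, S_q, Sp_q, A, gamma=0.9):
    """Summed squared Bellman error over a batch.

    R: rewards; S_q / Sp_q: rows of Q-values for each s and s';
    A: the action taken in each s.
    """
    target = R + gamma * Sp_q.max(axis=1)       # r + gamma * max_a' Q(s', a')
    delta = target - S_q[np.arange(len(A)), A]  # Bellman error per sample
    return np.sum(0.5 * delta ** 2)             # squared so errors don't cancel

R = np.array([1.0, 0.0])                 # only the first transition is rewarded
S_q = np.zeros((2, 8)); Sp_q = np.zeros((2, 8))
A = np.array([0, 0])
print(cost(R, S_q, Sp_q, A))  # -> 0.5 (only the rewarded transition errs)
```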

To determine which direction along the cost function contour will result in less error, the *gradient* with respect to the model parameters is needed: \( \frac{\partial cost}{\partial W}, \frac{\partial cost}{\partial b} \). This can be calculated in Theano rather easily.

```python
g_W = T.grad(cost=cost, wrt=model.W)
g_b = T.grad(cost=cost, wrt=model.b)
```

</div>

With the gradient in hand we can form an update rule to step in the direction of *less cost*.

\[ W' = W + \left(-\frac{\partial cost}{\partial W} \cdot \alpha\right) \]

\[ b' = b + \left(-\frac{\partial cost}{\partial b} \cdot \alpha\right) \]

This can also be done in Theano rather simply as

```python
updates = [(model.W, model.W + (-g_W * learning_rate)),
           (model.b, model.b + (-g_b * learning_rate))]
```

Those are the steps needed to perform common Stochastic Gradient Descent (SGD). It should be noted that the learning rate \(\alpha\) for RL problems should be rather small; the effect of this will be discussed and shown later.
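Putting the update rule together, a single SGD step looks like this in plain NumPy (the gradient values here are placeholders standing in for Theano's compiled `T.grad` output):

```python
import numpy as np

learning_rate = 0.01                  # alpha: deliberately small for RL
W = np.zeros((2, 8)); b = np.zeros(8)
g_W = np.ones((2, 8)); g_b = np.ones(8)  # placeholder gradients

W = W + (-g_W * learning_rate)        # W' = W - alpha * dcost/dW
b = b + (-g_b * learning_rate)        # b' = b - alpha * dcost/db
print(W[0, 0], b[0])  # -> -0.01 -0.01
```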

### Model Training

Every time the agent selects an action \(a\) in some state \(s\), receives some reward \(r\), and lands in a new state \(s'\), *experience* is gained. Each single tuple of experience \((s, a, r, s')\) is kept in an experience history that is used to train the model. The experience history serves a similar purpose to the training data in classification problems.

Training the model (performing updates) is done over a *mini batch*, where a mini batch is a randomly selected sample from the experience history of the agent.
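A minimal sketch of such an experience history and mini-batch sampling, using only the standard library (the buffer size and batch size are arbitrary choices, not values from this post):

```python
import random

experience_history = []  # list of (s, a, r, s') tuples

def remember(s, a, r, s_next, max_size=10000):
    experience_history.append((s, a, r, s_next))
    if len(experience_history) > max_size:  # drop the oldest experience
        experience_history.pop(0)

def sample_minibatch(batch_size=32):
    n = min(batch_size, len(experience_history))
    return random.sample(experience_history, n)  # random sample, no repeats

for step in range(100):  # fake experience for demonstration
    remember(s=step, a=step % 8, r=0.0, s_next=step + 1)
print(len(sample_minibatch()))  # -> 32
```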

#### Action Selection and Exploration

A common method to select an action to execute is \(\epsilon\)-greedy action selection. This algorithm simply selects one of the available actions at random with probability \(\epsilon\). Otherwise, the selected action comes from the policy \(\pi(s)\), in this case the logistic regression model.
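A sketch of \(\epsilon\)-greedy selection over a vector of Q-values, using only the standard library (the probability and Q-values are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))               # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

q = [0.1, 0.9, 0.4]
print(epsilon_greedy(q, epsilon=0.0))  # -> 1 (pure exploitation)
```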

#### Learning Rate

One tough challenge that has only recently seen good solutions is dealing with highly erratic policy changes. The policy *swings* from one end of the spectrum to the other, making it almost impossible for the optimization to eventually settle on a good policy.

Earlier I said we would look at the effect the learning rate has on the learning process. The learning rate is one of the parameters that influences how erratic the policy is. The effect of a learning rate that is far too high is shown in the next two figures. Focus on the policy plot on the right, which gives the optimal action/direction to select when the agent is at the map location marked by each arrow.

*Figure: Policy before training update*

*Figure: Policy after a single training update*

After lowering the learning rate from 0.5 to 0.01, small manageable policy updates are made.

*Figure: Policy before update*

*Figure: Policy after a single update*

### The Code

The code can be found here.

References:

1. https://en.wikipedia.org/wiki/Activation_function
2. https://en.wikipedia.org/wiki/Q-learning
3. http://deeplearning.net/software/theano/tutorial/examples.html#logistic-function
4. http://deeplearning.net/tutorial/logreg.html#logreg
5. https://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html