We introduced a canonical of Reinforcement Learning (RL) problem in the introductory article: the frozen lake. We still need to introduce the specific vocable of Reinforcement Learning.

#### Reminder

Remember the frozen lake problem. The robots transits through cells, harvesting rewards each times he enters into one cell. Yet, laying on a frozen lake, when the robots wants to go toward one direction, he follows that direction 80% of the time. 10% he drifts to the left, and 10% to the right of the direction he chose. Thus, the robot acts under uncertainty. His goal is to maximize the sum of rewards he meets. The robot start in the bottom left corner and the episode ends when he enters a green or red cell.

#### Elements of the problem

The problem of the frozen lake can be represented as a graph. To simplify, we suppose he can only choose to go toward one direction, the North. When he start from the initial cell, he has 80% to go to the cell above. We know as well that he has 10% chance to go to the cell to the left of his initial position. In this case, he bumps into the wall, go back to the initial cell and gather again the -1 reward. Lastly, he has 10% to go to the cell at the right of the initial cell. We can repeat this way and draw a graph connecting cells with their given probabilities and rewards. We get the figure above.

Here, in the circles the position of the robots are represented with Cartesian coordinates . Arrows represent transition from one cell to another. For each transition, probability and rewards are indicated. The position of the robot is called **state of the system**. System state is the minimum piece of information that allows to describe the system at every moment. Thus, for our frozen lake problem we only need to know in which position lays or robot.

#### The agent in its environment

The **actions** (here only “go North”) are all available options at each decision time step. For instance, the robot could follow any cardinal direction (North, South, East, West). A synonym for robot is the **agent**. The agent is the entity who acts in the **environment**. Here, the environment is the frozen lake. In general, the environment is partially **stochastic**. The means that there is uncertainty about happening phenomenons. To illustrate, the robot lying on the frozen like is not certain about the direction he will follow after decision making.

Our agent, or robot, wants to learn. He learn a **policy**. A policy is a function that indicates, according to system’s current state, what is the best action to make for the robot to maximize his sum of rewards along the episode. For instance, in a gardening context, an irrigation policy could be: “if leave are withered, irrigate with 1 L ; else do nothing”. For our robot, a policy simply indicates which direction to follow to harvest as possible rewards, **in expectation**. Indeed, because the environment is uncertain, we try to learn a policy that would work best in average after many repetitions. In other word, we learn to act for the average situation. If that policy is the best that can be possibly learnt, is it called an **optimal policy**.

#### Formalisation and notations

We should complete our RL vocabulary. When the robot goes from one cell to another, we call it a **transition**. A transition happens when when the agent makes a given action from a given state. Here, when the agent lies on a given cell and chooses one direction to go. Transitions are characterized by a **transition function**. A transition function indicates, given to one action and one state, what are the probabilities of going toward given next states. We can characterize going from one state to another with a reward. A **reward** indicates how much the transition is profitable for the agent at time . The function that indicates the reward between an old and a new state is called the **reward function**.

Remember that the agent will perform many transitions and he wants to maximize the sum of rewards, in expectation. So maximizing rewards at time is not a guarantee to maximize the sum of rewards along one full episode. You could think of cells that too close to the red one in the frozen lake problem.

**Markov Decision Process**

We further assume that **Markov property** is verified. It simply means that our state is well enough define to allow to know from the current state where we may land. For example, the when robot lies on one cell of the frozen lake, we know that one can possibly go toward the direction he chose, or to its left or right. But, the position where in may land depend only on the cell he lies.

We define a **Markov Decision Process** (MDP) as:

- an ensemble of states, noted , for a given state.
- an ensemble of actions, noted , for a given action
- a transition function, noted or
- a reward function, noted
- verified Markov property

We often index and by , indicating the current time step in the espiode, i.e. the number of decisions that have been taken since the beginning of the episode. We call the horizon, the number of decision to be made in one episode. can be fixed or not, finite or not. Another common notation is , the number of available actions, and denote by the action number .

Reinforcement Learning address Markov Decision Processes (MDP). This formalism allows to rigorously define the class of problem to be addressed. A representation of an MDP is a Markov Chain. This what represented our graph with arrows at the beginning. This graph is made explicit below.

In RL, we suppose transition and reward function unknown. For our robot, he ignores everything about the lake! The agent has to learn an optimal policy while discovering its environment. If the environment is supposed to be know, then this a Control Theory problem.

#### Conclusion

We have seen the vocabulary of Reinforcement Learning (RL). RL addresses Markov Decision Processes (MDP). A MDP is made of an ensemble of states, an ensemble of actions, a transition function, a reward function. Finally for the MDP to be complete, we suppose the Markov Property is satisfied. Transition and reward function are supposed to be unknown. Agent’s goal is to learn an optimal policy. An optimal policy maximizes the sum of rewards along one episode in expectation. Thus, the agent learns to act optimally at the same time he discovers its environment.

___

For more information, please check:

*Reinforcement learning: An introduction*. MIT press.