Reinforcement learning
Reinforcement learning is a type of machine learning where an agent learns to take actions in an environment to maximize some cumulative reward. The agent observes the current state of the environment, takes an action, and receives a reward based on the state transition and the action taken. The goal is to learn a policy that maps states to actions in order to maximize the cumulative reward.
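This interaction loop is easy to see in code. The following minimal sketch, assuming the classic OpenAI Gym API (gym versions before 0.26) and a purely random policy, shows one episode of observing a state, acting, and receiving a reward:

import gym

env = gym.make('FrozenLake-v0')  # any environment with a discrete action space works here
state = env.reset()              # observe the initial state
done = False
while not done:
    action = env.action_space.sample()                  # random policy, just for illustration
    next_state, reward, done, info = env.step(action)   # environment transitions and returns a reward
    state = next_state                                   # the new state becomes the current state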
Q-learning is a popular, model-free algorithm for reinforcement learning. It uses a Q-function to estimate the expected cumulative reward for taking a particular action in a given state and acting greedily thereafter. The Q-function is learned iteratively by updating its estimate after each observed state transition and reward.
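Concretely, after taking action a in state s and observing reward r and next state s', Q-learning applies the update Q(s, a) ← (1 − α) · Q(s, a) + α · (r + γ · max over a' of Q(s', a')), where α is the learning rate and γ is the discount factor. This is exactly the update implemented in the code below.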
Here's an example of implementing Q-learning in Python using the OpenAI Gym environment:
import gym
import numpy as np

# Create the environment.
# Note: this snippet assumes the classic Gym API (gym < 0.26), where reset()
# returns the state and step() returns four values; newer Gym/Gymnasium
# releases rename the environment to 'FrozenLake-v1' and change both signatures.
env = gym.make('FrozenLake-v0')

# Define the Q-table: one row per state, one column per action
num_states = env.observation_space.n
num_actions = env.action_space.n
q_table = np.zeros((num_states, num_actions))

# Define the hyperparameters
num_episodes = 10000
max_steps_per_episode = 100
learning_rate = 0.1
discount_factor = 0.99
exploration_rate = 1.0
max_exploration_rate = 1.0
min_exploration_rate = 0.01
exploration_decay_rate = 0.001

# Run the Q-learning training loop
for episode in range(num_episodes):
    state = env.reset()
    done = False
    step = 0
    while not done and step < max_steps_per_episode:
        # Choose an action using the epsilon-greedy policy
        exploration_rate_threshold = np.random.rand()
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state, :])
        else:
            action = env.action_space.sample()
        # Take the chosen action and observe the next state and reward
        next_state, reward, done, info = env.step(action)
        # Update the Q-value for the current state-action pair
        old_q_value = q_table[state, action]
        next_max_q_value = np.max(q_table[next_state, :])
        new_q_value = (1 - learning_rate) * old_q_value + learning_rate * (reward + discount_factor * next_max_q_value)
        q_table[state, action] = new_q_value
        # Move to the next state
        state = next_state
        step += 1
    # Decay the exploration rate after each episode
    exploration_rate = min_exploration_rate + (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)

# Evaluate the trained policy greedily over a fresh set of episodes
num_episodes = 100
total_reward = 0
for episode in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        action = np.argmax(q_table[state, :])
        next_state, reward, done, info = env.step(action)
        total_reward += reward
        state = next_state

print(f"Average reward per episode: {total_reward / num_episodes}")
In this example, we first create the FrozenLake environment using the OpenAI Gym library. We then define the Q-table, which is a two-dimensional array that stores the Q-values for each state-action pair. We also define the hyperparameters, such as the learning rate, discount factor, and exploration rate.
During training, we run a loop over a fixed number of episodes. For each episode, we reset the environment to the initial state and run an inner loop for up to a fixed maximum number of steps. At each step, we choose an action using the epsilon-greedy policy, take that action, and observe the next state and reward. We then update the Q-value for the current state-action pair using the Q-learning update rule, move to the next state, and continue until the episode is done.
At the end of each episode, we decay the exploration rate according to an exponential schedule. This means the agent starts with a high exploration rate, encouraging it to explore, and gradually shifts towards exploiting the learned policy as training progresses.
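To get a feel for this schedule, the following short sketch (reusing the hyperparameter values from the example above) prints the exploration rate at a few points during training; with these values it falls from 1.0 to roughly 0.37 after 1,000 episodes and is close to the 0.01 floor well before the final episode:

import numpy as np

min_eps, max_eps, decay = 0.01, 1.0, 0.001
for episode in [0, 1000, 5000, 10000]:
    eps = min_eps + (max_eps - min_eps) * np.exp(-decay * episode)
    print(episode, round(eps, 3))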
After training, we evaluate the learned policy by running a fresh set of episodes and calculating the average reward per episode. During evaluation we act greedily, using argmax to pick the action with the highest Q-value in each state, since no further exploration is needed. Because FrozenLake only pays a reward of 1 for reaching the goal, this average is simply the fraction of episodes in which the agent succeeds.
Q-learning is a simple yet powerful reinforcement learning algorithm that can be used in a wide range of applications. However, it has limitations: it must store a Q-value for every state-action pair, which becomes infeasible for large or continuous state spaces. More advanced algorithms, such as Deep Q-Networks (DQN), address this limitation by using a deep neural network to approximate the Q-function.
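As a rough illustration of that idea (only a sketch, assuming PyTorch, which is not used in the example above), a Q-network replaces the table with a small neural network that maps a state observation to one Q-value per action; a complete DQN agent would additionally need experience replay and a target network:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Maps a state vector to one Q-value per action, replacing the Q-table.
    def __init__(self, state_dim, num_actions, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, state):
        return self.net(state)  # shape: (batch_size, num_actions)

# Greedy action selection then becomes, for example:
# q_values = q_net(state_tensor)
# action = q_values.argmax(dim=1)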