
Reinforcement learning

Reinforcement learning is a type of machine learning in which an agent learns to take actions in an environment so as to maximize a cumulative reward. The agent observes the current state of the environment, takes an action, and receives a reward based on the action taken and the resulting state transition. The goal is to learn a policy, a mapping from states to actions, that maximizes the expected cumulative reward.
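
To make this state-action-reward loop concrete, here is a minimal sketch of one episode of interaction, assuming the classic OpenAI Gym API (gym < 0.26, where reset() returns the state and step() returns four values) and an agent that simply picks random actions:

import gym

env = gym.make('FrozenLake-v0')   # any Gym environment works here
state = env.reset()               # observe the initial state
done = False
total_reward = 0

while not done:
    action = env.action_space.sample()                   # a real agent would consult its policy here
    next_state, reward, done, info = env.step(action)    # environment transitions and emits a reward
    total_reward += reward                               # accumulate the reward we want to maximize
    state = next_state

print(f"Cumulative reward for this episode: {total_reward}")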

Q-learning is a popular algorithm for reinforcement learning that uses a Q-function to estimate the expected cumulative reward for taking a particular action in a given state. The Q-function is learned iteratively by updating its estimate based on the observed state transitions and rewards.
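
The core of the algorithm is a one-line update. Here is a minimal sketch of that update, wrapped in a hypothetical helper function (q_learning_update is just an illustrative name) using the same variable names that appear in the full example below:

import numpy as np

def q_learning_update(q_table, state, action, reward, next_state,
                      learning_rate=0.1, discount_factor=0.99):
    # Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (r + gamma * max over a' of Q(s', a'))
    old_q_value = q_table[state, action]                # current estimate of Q(s, a)
    next_max_q_value = np.max(q_table[next_state, :])   # best value reachable from the next state
    q_table[state, action] = (1 - learning_rate) * old_q_value + \
        learning_rate * (reward + discount_factor * next_max_q_value)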

Here's an example of implementing Q-learning in Python using the OpenAI Gym environment:

# Note: this example targets the classic Gym API (gym < 0.26), where reset()
# returns the state and step() returns (next_state, reward, done, info).
import gym
import numpy as np

# Define the environment
env = gym.make('FrozenLake-v0')

# Define the Q-table
num_states = env.observation_space.n
num_actions = env.action_space.n
q_table = np.zeros((num_states, num_actions))

# Define the hyperparameters
num_episodes = 10000
max_steps_per_episode = 100
learning_rate = 0.1
discount_factor = 0.99
exploration_rate = 1.0
max_exploration_rate = 1.0
min_exploration_rate = 0.01
exploration_decay_rate = 0.001

# Define the Q-learning algorithm
for episode in range(num_episodes):
    state = env.reset()
    done = False
    step = 0
    while not done and step < max_steps_per_episode:
        # Choose an action using the epsilon-greedy policy
        exploration_rate_threshold = np.random.rand()
        if exploration_rate_threshold > exploration_rate:
            action = np.argmax(q_table[state, :])
        else:
            action = env.action_space.sample()

        # Take the chosen action and observe the next state and reward
        next_state, reward, done, info = env.step(action)

        # Update the Q-value for the current state-action pair
        old_q_value = q_table[state, action]
        next_max_q_value = np.max(q_table[next_state, :])
        new_q_value = (1 - learning_rate) * old_q_value + \
            learning_rate * (reward + discount_factor * next_max_q_value)
        q_table[state, action] = new_q_value

        # Move to the next state
        state = next_state
        step += 1

    # Decay the exploration rate
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)

# Evaluate the trained policy
num_episodes = 100
total_reward = 0
for episode in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        action = np.argmax(q_table[state, :])
        next_state, reward, done, info = env.step(action)
        total_reward += reward
        state = next_state

print(f"Average reward per episode: {total_reward/num_episodes}")

In this example, we first create the FrozenLake environment using the OpenAI Gym library. We then define the Q-table, which is a two-dimensional array that stores the Q-values for each state-action pair. We also define the hyperparameters, such as the learning rate, discount factor, and exploration rate.

During training, we run a loop over a fixed number of episodes. For each episode, we reset the environment to its initial state and step through it for at most a fixed number of steps. At each step, we choose an action using the epsilon-greedy policy, take it, and observe the next state and reward. We then update the Q-value for the current state-action pair using the Q-learning update rule, move to the next state, and continue until the episode ends or the step limit is reached.

At the end of each episode, we decay the exploration rate according to an exponential schedule. The agent therefore starts with a high exploration rate, which encourages it to try many actions, and gradually shifts toward exploiting the learned policy as training progresses.
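
For a feel of how quickly this schedule moves from exploration to exploitation, the short sketch below evaluates the decay expression at a few episode counts, using the hyperparameter values from the example (min 0.01, max 1.0, decay rate 0.001):

import numpy as np

min_rate, max_rate, decay_rate = 0.01, 1.0, 0.001
for episode in [0, 1000, 5000, 10000]:
    rate = min_rate + (max_rate - min_rate) * np.exp(-decay_rate * episode)
    print(f"episode {episode:5d}: exploration rate ~ {rate:.3f}")
# Prints roughly 1.000, 0.374, 0.017, and 0.010 respectively.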

After training, we evaluate the learned policy by running a fixed number of episodes and computing the average reward per episode. During evaluation we act purely greedily, using argmax to pick the action with the highest Q-value in each state, since at this point we want to exploit what has been learned rather than explore.

Q-learning is a simple yet powerful reinforcement learning algorithm that can be applied in a wide range of settings. However, it has limitations, chief among them the need to maintain a Q-table entry for every state-action pair, which becomes infeasible for large or continuous state spaces. More advanced algorithms, such as Deep Q-Networks (DQN), address this limitation by using deep neural networks to approximate the Q-function.
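
As a rough illustration of the function-approximation idea behind DQN, here is a minimal sketch of a Q-network in PyTorch (the framework choice is an assumption, not something prescribed above). Instead of a table row lookup, the network maps a state vector to one Q-value per action:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, num_actions, hidden_dim=64):
        super().__init__()
        # A small MLP that replaces the Q-table: state in, one Q-value per action out
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, state):
        return self.net(state)

# Greedy action selection, analogous to np.argmax(q_table[state, :])
q_net = QNetwork(state_dim=16, num_actions=4)   # 16 states and 4 actions, as in FrozenLake
state = torch.zeros(16)
state[3] = 1.0                                  # one-hot encoding of discrete state 3
action = torch.argmax(q_net(state)).item()
print(f"Greedy action for state 3: {action}")

A full DQN additionally uses experience replay and a target network to stabilize training, but the core idea is the same: learn a parameterized approximation of the Q-function instead of storing it explicitly.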

