Deep Q Networks (DQN) explained with examples and codes in Reinforcement Learning (2024)

Value-based methods in reinforcement learning explained with code

In my previous posts, I covered many topics on reinforcement learning.

Reinforcement Learning basics

Formulating Multi-Armed Bandits (MABs)

Monte Carlo with example

Temporal Difference learning with SARSA and Q Learning

Game dev using reinforcement learning and pygame

Contextual bandits with codes

Training OpenAI gym envs using REINFORCE algorithm

My debut book “LangChain in your Pocket” is out now

Moving ahead, my 110th post is dedicated to a very popular method that DeepMind used to train agents on Atari games: the Deep Q Network, aka DQN. DQN belongs to the family of value-based methods in reinforcement learning, i.e. given a state as input, it outputs a value function (a q-value) for each action available in that state.

Note: this is different from the REINFORCE algorithm I implemented in my last blog, which takes a state and outputs a probability for each action; REINFORCE belongs to the policy gradient family of algorithms.

So, let’s get going with DQN.

Deep Q Network:

The Q in DQN stands for ‘Q-Learning’, an off-policy temporal difference method that also considers future rewards while updating the value function for a given state-action pair. An advantage of value-based methods is that we don’t need to wait till the end of the episode to collect the final reward and calculate the discounted return, as was the case with the REINFORCE algorithm in my last post. Using the Bellman equation, we update the value function as we move ahead. You can find more about Q-Learning in my earlier post on Temporal Difference learning with SARSA and Q Learning.

The agent we would be training is MountainCar-v0 present in OpenAI Gym.


In MountainCar-v0, an underpowered car must climb a steep hill by building enough momentum. The car’s engine is not strong enough to drive directly up the hill (acceleration is limited), so it must learn to rock back and forth, building momentum with each swing until it can eventually reach the top of the mountain.

Talking about rewards, the agent, i.e. the car, gets a -1 for every step taken until it reaches the flag, where it gets a 0. The episode also ends if the agent isn’t able to reach the flag within 200 steps.

Action space:

The agent can take 3 actions: accelerate to the left, accelerate to the right, or do nothing.

State space:

A state is represented by a list of 2 elements at any point,

Position of the car: This variable represents the position of the car along the x-axis of the environment. It is a continuous value between -1.2 and 0.6.

The velocity of the car: This variable represents the velocity of the car along the x-axis of the environment. It is a continuous value between -0.07 and 0.07.
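To make these spaces concrete, here is a minimal sketch (assuming gym with the classic-control environments installed) that simply prints them; the output comments are indicative:

import gym

env = gym.make('MountainCar-v0')

# Observation: 2 continuous values -> [position, velocity]
print(env.observation_space)   # e.g. Box([-1.2 -0.07], [0.6 0.07], (2,), float32)

# Actions: 0 = accelerate left, 1 = do nothing, 2 = accelerate right
print(env.action_space)        # Discrete(3)

# The car starts near the bottom of the valley with zero velocity
print(env.reset())             # e.g. [-0.52  0.]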

I think we should get started

Note: A few code snippets are taken directly from my previous post, where I have already explained them. So, I will only be explaining the new additions in this post.

So, let’s get started

  1. Import required libraries
import tensorflow as tf
import numpy as np
import gym
import math
from PIL import Image
import pygame, sys
from pygame.locals import *
from tensorflow import keras
from collections import deque
import random

Why have we imported deque? Double Ended Queue is a useful data structure that allows insertion and deletion from both ends. This will help us to implement Experience Replay.

More jargon? I will explain this later in the post.

  • Next, let’s define our gym environment
env = gym.make('MountainCar-v0')

input_shape = env.observation_space.shape[0]
num_actions = env.action_space.n

As mentioned earlier, the state space is an array with 2 elements and num_actions=3.

  • It’s time we define our DQN
value_network = tf.keras.models.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(input_shape,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(num_actions)
])

# Set up the optimizer and loss function
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
loss_fn = tf.keras.losses.MeanSquaredError(reduction="auto", name="mean_squared_error")
#value_network = tf.keras.models.load_model('keras')

The value_network is a shallow neural network that takes a state (1D array with 2 elements) as input and outputs a q-value for every possible action (1D array with 3 values). The loss function used is mean squared error.
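As a quick sanity check (a small sketch assuming the env and value_network defined above), passing a single state through the network should give back one q-value per action:

# One state in -> one row of q-values out (one value per action)
sample_state = env.reset()                                    # shape: (2,)
q_values = value_network.predict(np.array([sample_state]), verbose=0)
print(q_values.shape)                                         # (1, 3)
print(np.argmax(q_values[0]))                                 # index of the greedy action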

  • Declaring a few constants
num_episodes = 1000
epsilon = 1
gamma = 0.9
state = env.reset()
batch = 200
replay = deque(maxlen=2000)
epoch = 0
alpha = 0.1
  • Training loop
for episode in range(num_episodes):
    state = env.reset()

    # Run the episode
    while True:

        # q-values for every action given the current state
        value_function = value_network.predict(np.array([state]), verbose=0)[0]

        # epsilon-greedy action selection
        if np.random.rand() > epsilon:
            action = np.argmax(value_function)
        else:
            action = np.random.choice(num_actions)

        next_state, reward, done, _ = env.step(action)
        done = 1 if done else 0

        # store the transition for experience replay
        replay.append((state, action, reward, next_state, done))

        state = next_state

        if done:
            break

    # Once the buffer holds more than one batch, train on a random sample of transitions
    if len(replay) > batch:
        with tf.GradientTape() as tape:
            batch_ = random.sample(replay, batch)
            q_value1 = value_network(tf.convert_to_tensor([x[0] for x in batch_]))  # q-values for current states
            q_value2 = value_network(tf.convert_to_tensor([x[3] for x in batch_]))  # q-values for next states

            reward = tf.convert_to_tensor([x[2] for x in batch_])
            action = tf.convert_to_tensor([x[1] for x in batch_])
            done = tf.convert_to_tensor([x[4] for x in batch_])

            # target: R + alpha * gamma * (1 - done) * max_a Q(s', a)
            actual_q_value1 = tf.cast(reward, tf.float64) + tf.cast(tf.constant(alpha), tf.float64) * (
                tf.cast(tf.constant(gamma), tf.float64)
                * tf.cast(tf.constant(1) - done, tf.float64)
                * tf.cast(tf.reduce_max(q_value2, axis=1), tf.float64))  # per-sample max over next-state q-values

            # mean-squared error between the q-value of the taken action and the target
            loss = tf.cast(tf.gather(q_value1, action, axis=1, batch_dims=1), tf.float64)
            loss = loss - actual_q_value1
            loss = tf.reduce_mean(tf.math.pow(loss, 2))

        grads = tape.gradient(loss, value_network.trainable_variables)
        optimizer.apply_gradients(zip(grads, value_network.trainable_variables))

        print('Epoch {} done with loss {} !!!!!!'.format(epoch, loss))

    value_network.save('keras/')
    if epoch % 100 == 0:
        epsilon *= 0.999
    epoch += 1

This requires a lengthy explanation

  • For every episode, we start by resetting the environment and initializing the state.
  • Then, at every step of the episode:

Get q-value predictions for all possible actions given the state as input.

Depending upon the e-greedy policy, choose an action

Take the chosen action and get the next state and reward alongside the done status (whether the episode ended or not)

Save (current_state, action, reward, next_state, done) in the replay deque

  • Once the episode ends, if the replay length is above the threshold (batch size):

Within a tf.GradientTape context:

Sample a random batch from the replay deque

Recalculate q-value for current_state using the DQN

Calculate q-value for next_state using DQN

Calculate the expected q-value (ground truth) for the taken action for current_state using the Bellman equation.

Calculate the mean-squared loss between the expected q-value and the q-value predicted by the DQN for the taken action in the current state

Apply gradients for backpropagation

Decay the epsilon in epsilon greedy policy to move towards exploitation

A couple of questions

What is the Bellman equation?

V(S) ← V(S) + α[R + γV(S′) − V(S)]

V(S) (or Q(S, A) for a state-action pair) = value function for the current state

α = step size (learning rate)

R = reward for the present action

γ = discount factor

V(S′) (or Q(S′, A′)) = value function for the next state S′ reached after taking action A in state S
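To make the update concrete, here is a tiny worked example with made-up numbers (purely illustrative, not part of the MountainCar code above):

# Tabular update: V(S) <- V(S) + alpha * [R + gamma * V(S') - V(S)]
alpha, gamma = 0.1, 0.9      # illustrative step size and discount factor

v_s      = 2.0               # current estimate V(S)
v_s_next = 3.0               # estimate for the next state, V(S')
reward   = -1.0              # MountainCar gives -1 per step

td_target = reward + gamma * v_s_next     # -1 + 0.9 * 3.0 = 1.7
v_s = v_s + alpha * (td_target - v_s)     # 2.0 + 0.1 * (1.7 - 2.0) = 1.97
print(v_s)                                # ~1.97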

The second question is why we need Experience Replay (and hence the deque). Before explaining that, we need to understand what catastrophic forgetting is.

Catastrophic forgetting

If we go by the default method of training reinforcement learning agents, i.e. updating the neural network after each action is taken (one sample at a time), then for complex environments (like the OpenAI Gym environments) this leads to catastrophic forgetting: the model may get confused and start taking the same action for similar-looking states. For example, if in a previous step action A yielded a high reward for state S, and we are now at state S1, very similar (but not identical) to S, where taking action A yields the worst reward, the model gets confused.

To counter this, Experience Replay comes into the picture

  • Implement batch updates rather than single updates.
  • Update the model with a mix of new and old memories i.e. retraining/replaying old samples alongside new samples while training the agent

That is why we are using a Deque. So if you observe the logic,

  • The deque length (2000) is large compared to the batch (200) we train the model on.
  • Hence, once the deque reaches length 2000, new memories are pushed in and the oldest ones are popped out (as queues work on the FIFO principle, i.e. First In First Out).
  • Hence, the deque maintains a mix of old and new memories, as the small sketch below illustrates.
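Here is a tiny sketch of that FIFO behaviour, using a much smaller maxlen than the 2000 used above so the effect is easy to see:

from collections import deque

replay = deque(maxlen=3)       # tiny buffer for illustration (the post uses maxlen=2000)
for transition in range(5):    # push 5 "memories"
    replay.append(transition)

print(replay)                  # deque([2, 3, 4], maxlen=3) -> the oldest entries were dropped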
#pygame essentials
pygame.init()
DISPLAYSURF = pygame.display.set_mode((625,400),0,32)
clock = pygame.time.Clock()
pygame.display.flip()

#openai gym env
env = gym.make('MountainCar-v0')
input_shape = env.observation_space.shape[0]
num_actions = env.action_space.n
state = env.reset()

done = False
count = 0
steps = 0

#loading trained model
value_network = tf.keras.models.load_model('keras')

def print_summary(text, cood, size):
    font = pygame.font.Font(pygame.font.get_default_font(), size)
    text_surface = font.render(text, True, (125,125,125))
    DISPLAYSURF.blit(text_surface, cood)

while count < 10000:
    pygame.event.get()
    steps += 1

    for event in pygame.event.get():
        if event.type == QUIT:
            pygame.quit()
            raise Exception('rendering ended')

    # Greedy action: pick the action with the highest predicted q-value
    action = np.argmax(value_network.predict(np.array([state]), verbose=0)[0])

    next_state, reward, done, info = env.step(action)  # take a step in the environment
    image = env.render(mode='rgb_array')                # render the environment as an RGB array

    # convert image to a pygame surface object
    image = Image.fromarray(image, 'RGB')
    mode, size, data = image.mode, image.size, image.tobytes()
    image = pygame.image.fromstring(data, size, mode)

    DISPLAYSURF.blit(image, (0,0))
    print_summary('Step {}'.format(steps), (10,10), 15)
    pygame.display.update()
    clock.tick(100)
    count += 1

    if done:
        print_summary('Episode ended !!!', (100,100), 30)
        next_state = env.reset()  # start a new episode
        done = False

    state = next_state

pygame.quit()

For an explanation of the pygame rendering code, refer to my previous post on game dev using reinforcement learning and pygame, where I covered it in detail.

How did our model perform? It almost reached the flag. I believe training for a longer duration would have got it all the way there.


My two cents on the whole training

While training, I observed a lot of instability in the training loss, i.e. the loss oscillating up and down, which, as mentioned in several places, is quite common in reinforcement learning.

The results could have been improved, and the training instability reduced, by using 2 DQNs: one a periodically updated copy of the DQN used to compute targets, and the other the actual DQN being trained. You can explore this idea further on your own; a rough sketch follows below.
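Here is a rough sketch of that idea, usually called a target network. This is not code from the training loop above, just an illustration: the cloned network computes the Bellman targets while the original keeps training, and the copy's weights are re-synced every few episodes.

import tensorflow as tf

# The DQN being trained (same architecture as earlier in the post)
value_network = tf.keras.models.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(2,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(3)
])

# A frozen copy used only to compute the Bellman targets
target_network = tf.keras.models.clone_model(value_network)
target_network.set_weights(value_network.get_weights())

# Inside the training step, the next-state q-values would come from the copy:
#   q_value2 = target_network(tf.convert_to_tensor([x[3] for x in batch_]))

# And every few episodes, the copy is re-synced with the training network:
#   target_network.set_weights(value_network.get_weights())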

On a final note, let’s quickly cover the differences between value-based (DQN) and policy-based (REINFORCE) methods in reinforcement learning.

DQNs don’t need to wait till the end of the episode for training the agent while REINFORCE gets the final reward after episode completion.

Using REINFORCE for non-episodic problems is difficult (as the episode never ends)

DQNs have the overhead of Experience Replay and separate Target and Training networks to avoid training instability and catastrophic forgetting. There is no such issue with REINFORCE.

Can these two ideas merge to form a stronger approach? We will discuss Actor-Critic methods in my next post.
