
DEEP REINFORCEMENT LEARNING · AUTONOMOUS CONTROL

GitHub ↗

End-to-End RL for Autonomous Racetrack Navigation

A progressive implementation of reinforcement learning — from classical dynamic programming to deep actor-critic methods — culminating in an agent that navigates a simulated racetrack with continuous steering and throttle control. Five algorithms across five exercises, implemented as coursework for the Reinforcement Learning course at the University of Edinburgh.

PyTorch · Gymnasium · highway-env · Python 3.10+ · DDPG · PER
Author: Saravut Lin
Institution: University of Edinburgh, School of Informatics
Year: 2025

5 Algorithms · 5 Environments · 374 Best Mean Return · 60+ Models in Sweep

Learning Progression

Each exercise builds on the previous, forming a complete journey from exact dynamic programming to deep actor-critic methods for continuous control.

01 · Dynamic Programming · Tabular / Exact

Value Iteration & Policy Iteration on a custom MDP. Exact solutions, full model required.
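The Bellman optimality backup at the heart of Value Iteration can be written in a few lines. This is a minimal NumPy sketch, not the coursework code; the `(S, A, S)` transition tensor and `(S, A)` reward matrix are assumed shapes for a generic tabular MDP.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Value Iteration on a tabular MDP.

    P: transition tensor, shape (S, A, S) with P[s, a, s'] = Pr(s' | s, a)
    R: reward matrix, shape (S, A)
    Returns the optimal value function V and the greedy policy.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') V(s')
        Q = R + gamma * (P @ V)        # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmax(axis=1)
```

Policy Iteration alternates the same backup restricted to a fixed policy (evaluation) with greedy improvement; both require the full model `P`, `R`.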

02 · Model-Free Tabular RL · Tabular / Model-Free

Q-Learning & Monte Carlo on FrozenLake-8x8. No model needed — learn from experience.
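The model-free step replaces the exact backup with a sampled one. A sketch of the tabular Q-learning update and an ε-greedy behaviour policy (illustrative helper names, not the coursework API):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One off-policy TD(0) step with a max backup: no model of P or R needed."""
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def epsilon_greedy(Q, s, epsilon, rng):
    """Behaviour policy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(Q[s].argmax())
```

Monte Carlo instead waits for the episode to end and regresses `Q[s, a]` toward the full observed return, which explains the much slower convergence on FrozenLake-8x8 (300K vs 10K episodes).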

03 · Deep Q-Networks · Deep RL / Discrete

DQN with experience replay and target networks. Scales RL to continuous state spaces.
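Experience replay is one of the two stabilising tricks DQN relies on. A minimal uniform replay buffer sketch (the coursework version may differ in interface):

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay: store transitions and sample i.i.d. minibatches,
    breaking the temporal correlation of consecutive environment steps."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest transitions evicted first

    def push(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))              # (obs_batch, action_batch, ...)

    def __len__(self):
        return len(self.buffer)
```

The second trick, a periodically synced target network, freezes the bootstrap target `r + γ max_a' Q_target(s', a')` so the regression target does not chase the network being trained.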

04 · DDPG for Continuous Control · Deep RL / Continuous

Actor-Critic on Racetrack-v0. The first algorithm in the progression with continuous steering and throttle control.
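Because the actor is deterministic, DDPG has to inject exploration noise into the action itself. A sketch of the standard recipe, with Gaussian noise and clipping to the environment's continuous action bounds (function and parameter names are illustrative):

```python
import numpy as np

def noisy_action(actor, obs, sigma, rng, low=-1.0, high=1.0):
    """DDPG-style exploration: deterministic policy output plus Gaussian noise,
    clipped to the continuous action range (e.g. steering/throttle in [-1, 1])."""
    a = np.asarray(actor(obs))                       # deterministic action, shape (act_dim,)
    a = a + rng.normal(0.0, sigma, size=a.shape)     # perturb for exploration
    return np.clip(a, low, high)
```

During evaluation the noise is dropped and the raw actor output is used.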

05 · Enhanced DDPG · Best Result

Prioritised Experience Replay, adaptive noise scheduling, orthogonal weight initialisation, and a hyperparameter sweep of 60+ models on Racetrack-v0.

Algorithm Deep Dives

Experimental Results

[Figure: bar chart of Mean Return (0–600), Baseline DDPG vs Enhanced DDPG]
| Algorithm | Type | Environment | Key Result |
| --- | --- | --- | --- |
| Value Iteration | Dynamic Programming | Custom MDP | Optimal policy, exact convergence |
| Q-Learning (gamma=0.99) | Tabular RL | FrozenLake-8x8 | Converges in 10K episodes |
| Monte Carlo (gamma=0.99) | Tabular RL | FrozenLake-8x8 | Converges in 300K episodes |
| DQN (linear, frac=0.01) | Deep RL | CartPole-v1 | Solves within 50K steps |
| DDPG | Actor-Critic | Racetrack-v0 | Target return: 500.0 |
| Enhanced DDPG | Actor-Critic + PER | Racetrack-v0 | Mean return: 374.86 ± 52.78 |

Key Contribution — Enhanced DDPG

Prioritised Replay with Adaptive Exploration

The Enhanced DDPG extends the baseline with three targeted improvements. Prioritised Experience Replay (PER) samples transitions proportional to their TD-error magnitude — the agent revisits the most informative experiences more often, accelerating learning on difficult transitions.

Adaptive noise scheduling starts exploration at std = 0.8 and decays over training, with periodic bursts to escape local optima. Orthogonal weight initialisation (critic gain 1.0, actor gain 0.1) improves gradient flow in early training. The best configuration from a 60+ model sweep achieved a mean return of 374.86 ± 52.78 on Racetrack-v0.
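Proportional PER can be sketched compactly: transitions are sampled with probability proportional to `|TD-error|^alpha`, and importance-sampling weights correct the bias this introduces. The sketch below uses a plain list (O(n) sampling) rather than the sum-tree a production implementation would use, and its class and method names are illustrative, not the coursework API:

```python
import numpy as np

class ProportionalPER:
    """Minimal proportional Prioritised Experience Replay.

    P(i) = p_i^alpha / sum_k p_k^alpha, with importance-sampling weights
    w_i = (N * P(i))^(-beta), normalised by the batch maximum.
    """

    def __init__(self, capacity, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def push(self, transition):
        # New transitions get the current max priority so they are replayed at least once.
        p = max(self.priorities, default=1.0)
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(p)

    def sample(self, batch_size, beta=0.4, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        p = np.asarray(self.priorities) ** self.alpha
        probs = p / p.sum()
        idx = rng.choice(len(self.data), size=batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-beta)
        weights /= weights.max()        # normalise so weights only scale losses down
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        # Called after the critic update: priority tracks |TD-error| + eps.
        for i, e in zip(idx, td_errors):
            self.priorities[i] = abs(e) + self.eps
```

After each critic update, the sampled transitions' priorities are refreshed from their new TD-errors, so the buffer keeps steering replay toward the transitions the critic currently predicts worst.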

| Component | Baseline | Enhanced |
| --- | --- | --- |
| Replay | Uniform | Prioritised (TD-error) |
| Noise | Fixed Gaussian | Adaptive + Burst Schedule |
| Init | Default | Orthogonal (gain 1.0 / 0.1) |
| Mean Return | ~300 | 374.86 ± 52.78 |

Technologies

Python 3.10+ · PyTorch · Gymnasium · highway-env · NumPy · Value Iteration · Policy Iteration · Q-Learning · Monte Carlo · DQN · DDPG · Prioritised Experience Replay · Actor-Critic · Racetrack-v0