Now, I’m inclined to believe the AI was performing poorly because of the reward shaping. Looking back at the logs, it seems that when the AI tries to explore the track (driving around, crashing into walls and stuff), it actually collects LESS reward than just doing nothing. So instead of exploring and absorbing some of the bad reward along the way, it just learned a crappy policy and avoided risk altogether.
Even though SAC encourages exploration via an entropy bonus, that bonus only helps if exploratory actions aren’t consistently worse (reward-wise) than doing nothing. In my case, they were. So mathematically, the optimal policy just collapsed to poop.
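A toy version of what I think was happening (numbers completely made up, just to illustrate): say idling earns 0 reward per step, while a typical exploratory step averages −0.5 from wall hits and off-track penalties. Over a 1000-step episode,

$$
G_{\text{idle}} = 1000 \times 0 = 0 \qquad \text{vs.} \qquad G_{\text{explore}} = 1000 \times (-0.5) = -500,
$$

so "do nothing" wins by a huge margin, and unless the entropy temperature is cranked way up, the entropy bonus can't close that gap.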
Maximum Entropy Reinforcement Learning (what SAC is based on):
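In its standard form (this is the objective from the SAC papers by Haarnoja et al.), it adds an entropy bonus to the usual expected return:

$$
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
$$

where the temperature α controls how much entropy matters relative to reward. The catch: if the reward punishes exploratory actions harder than the entropy bonus can compensate, the maximizer of J is still the do-nothing policy, just a slightly jittery one.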
As a result, performance actually DEGRADED over training.
I’ve also rescaled all the AI’s outputs to the range [-1, 1] (the natural bounds of Tanh, the final activation function in the model’s net).
- I’ve shrunk the outputs from 3 (gas, brake, steer) → 2
So now the outputs are: gas/brake (1 = full gas, -1 = full brake) and steer.
Reducing the action space should also make the Q-function less noisy, which hopefully prevents that kind of collapse again.
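For concreteness, here is roughly what the new mapping looks like, as a minimal sketch (the function name and the dict-style control interface are placeholders, not the actual game API):

```python
import numpy as np

def apply_action(action):
    """Map the policy's 2 tanh-bounded outputs to game controls.

    action[0]: combined gas/brake in [-1, 1] (+1 = full gas, -1 = full brake)
    action[1]: steering in [-1, 1]
    """
    throttle_brake, steer = np.clip(action, -1.0, 1.0)

    gas = max(throttle_brake, 0.0)     # positive half of the axis -> gas
    brake = max(-throttle_brake, 0.0)  # negative half of the axis -> brake

    # Placeholder control interface; the real game/env input will differ.
    return {"gas": gas, "brake": brake, "steer": steer}
```

One side benefit: with a single gas/brake axis, the agent can’t floor both pedals at once, which it could with 3 independent outputs.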
Attached, finally, is a video of the agent (barely trained ~10 minutes, 50k steps):