On this page, we show videos of our experimental results in two environments: Myopic Breakout and Traffic Control.
## Myopic Breakout
The InfluenceNet model (PPO-InfluenceNet) is able to learn the “tunnel” strategy, where it digs an opening on the left (or right) side of the brick wall and sends the ball through it to rack up points quickly:
The feedforward network with no internal memory performs considerably worse than the InfluenceNet model:
## Traffic Control
The Traffic Control task was modified as follows:
- The size of the observable region was slightly reduced, and the delay between the moment an action is taken and the moment the lights actually switch was increased to 6 seconds. During these 6 seconds, the green light turns yellow.
- The speed penalty was removed; the only reward signal is a penalty of -0.1 for every car stopped at a traffic light (both changes are illustrated in the sketch after this list).
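To make these rules concrete, here is a minimal Python sketch of the modified reward and light-switching logic. All names (`ACTION_DELAY`, `reward`, `light_state`) are hypothetical illustrations of the description above, not the actual environment code:

```python
# Hypothetical sketch of the modified Traffic Control rules described above;
# the names and structure are assumptions, not the real environment code.

ACTION_DELAY = 6  # seconds between the switch action and the lights changing


def reward(num_stopped_cars: int) -> float:
    """Speed penalty removed: the only signal is -0.1 per stopped car."""
    return -0.1 * num_stopped_cars


def light_state(state_before_action: str, seconds_since_action: float) -> str:
    """State of one traffic light while a switch action is pending.

    During the 6-second delay the green light shows yellow; once the delay
    has elapsed, the switch completes and the lights swap.
    """
    if seconds_since_action >= ACTION_DELAY:
        return "red" if state_before_action == "green" else "green"
    return "yellow" if state_before_action == "green" else state_before_action


# Example: a green light 3 s after the action shows yellow, and red after 6 s.
assert light_state("green", 3.0) == "yellow"
assert light_state("green", 6.0) == "red"
```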
As shown in the video below, a memoryless agent can only react and switch the lights once a car enters the local region. With the 6-second switching delay, this means the light turns green too late and the cars are forced to stop:
The InfluenceNet agent, on the other hand, is able to anticipate that a car will enter the local region and thus switches the lights just in time for the cars to pass through without stopping:
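As a rough illustration of how such anticipation can arise, below is a minimal PyTorch sketch in the spirit of an influence-aware memory architecture: a recurrent branch keeps a summary of the observation variables that carry information about approaching cars, while a feedforward branch processes the current local observation. The class name, the input split, and all layer sizes are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn


class InfluenceNetSketch(nn.Module):
    """Illustrative two-branch network: an FNN over the current observation
    plus a GRU over the subset of variables worth remembering, so the agent
    can act on cars it saw approaching before they enter the local region."""

    def __init__(self, obs_dim: int, mem_dim: int, hidden: int = 64, n_actions: int = 2):
        super().__init__()
        self.fnn = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # Only the memory-relevant part of the observation feeds the RNN.
        self.rnn = nn.GRU(mem_dim, hidden, batch_first=True)
        self.policy = nn.Linear(2 * hidden, n_actions)

    def forward(self, obs, mem_obs, h=None):
        # obs:     (batch, obs_dim)      current observation
        # mem_obs: (batch, seq, mem_dim) variables tracked over time
        x = self.fnn(obs)
        out, h = self.rnn(mem_obs, h)
        logits = self.policy(torch.cat([x, out[:, -1]], dim=-1))
        return logits, h  # carry h across timesteps to retain memory


# Example forward pass with made-up dimensions.
net = InfluenceNetSketch(obs_dim=30, mem_dim=4)
logits, h = net(torch.randn(1, 30), torch.randn(1, 1, 4))
```

A purely feedforward policy, by contrast, must map identical local observations to identical actions, which is why it can only react once a car is already visible.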