As new technologies emerge, market forces drive industries to find ways to implement them in an attempt to gain an advantage or just to stay competitive. Over time, new tools are developed that help make the technology easier to use and more widespread. With a focus on improving the supply chain decision-making process, Accenture took the AnyLogic Product Delivery example model (also available in AnyLogic Cloud) and used it to prove the power of new reinforcement learning (RL) opportunities.
To achieve its goal, Accenture partnered with San Francisco-based AI company Pathmind, which combines the newest RL algorithms with AnyLogic simulation modeling. This pairing is critical for policy training because learning algorithms need time to learn which actions work best in different situations – time that would be difficult to provide outside of a computing environment.
In this case, there can be no better training ground than a simulated environment, because the associated costs are minimal compared to real-life testing. Furthermore, a simulated environment can be run many times under different conditions, allowing RL algorithms to train on thousands of simulated years of possibilities.
The RL Model
There are three key elements to define when making a neural net. These elements are: the observation space, the action space, and the reward function.
The observation space is what the RL agent sees. The agent considers only these variables when deciding which action to take. It is important to provide information that will also be available in the real environment, since the final goal is for the policy to work there.
For our model, we chose to give the agent the following data:
- Stock Info: The current stock of each manufacturing center
- Starting Vehicles: The number of vehicles each manufacturing center has
- Free Vehicles: The number of available vehicles each manufacturing center has
- Order Amounts: The number of items ordered. 0 if no order was placed for a distribution center
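The four pieces of state above can be pictured as one flat observation vector. The sketch below is illustrative only – the class and function names are assumptions, not the actual Pathmind/AnyLogic API:

```python
# Hypothetical sketch of the observation vector described above.
from dataclasses import dataclass
from typing import List

@dataclass
class ManufacturingCenter:
    stock: int          # current stock level at this center
    vehicles: int       # total vehicles assigned to this center
    free_vehicles: int  # vehicles currently available for delivery

def build_observation(centers: List[ManufacturingCenter],
                      order_amounts: List[int]) -> List[float]:
    """Flatten the model state into what the RL agent sees:
    per-center stock, total vehicles, and free vehicles, followed by
    the order amount for each distribution center (0 = no order)."""
    obs: List[float] = []
    for c in centers:
        obs.extend([c.stock, c.vehicles, c.free_vehicles])
    obs.extend(order_amounts)
    return obs

# Example: 3 manufacturing centers and 15 distribution centers
centers = [ManufacturingCenter(120, 5, 3),
           ManufacturingCenter(80, 4, 4),
           ManufacturingCenter(200, 6, 1)]
orders = [0] * 15
orders[2] = 40  # only distribution center 2 has an open order
obs = build_observation(centers, orders)
# 3 centers x 3 fields + 15 order slots = 24 observations
```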
The action space is the range of actions our RL agent can make decisions for. In this case, it is a 15x3 space: for each of the 15 distribution centers that creates an order, the RL agent decides which of the 3 manufacturing centers should fulfill it. If a distribution center generates no order, the action for that center is ignored.
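Decoding that action space can be sketched as follows – a minimal illustration with assumed names, not the actual model code:

```python
# Illustrative decoding of the 15x3 action space: one choice per
# distribution center, each selecting one of 3 manufacturing centers.
from typing import Dict, List

NUM_DISTRIBUTION_CENTERS = 15

def apply_actions(actions: List[int], order_amounts: List[int]) -> Dict[int, int]:
    """Map each open order to the manufacturing center the agent chose.
    Actions for distribution centers without an order are ignored."""
    assert len(actions) == NUM_DISTRIBUTION_CENTERS
    assignments = {}
    for dc, (choice, amount) in enumerate(zip(actions, order_amounts)):
        if amount > 0:  # skip distribution centers with no open order
            assignments[dc] = choice  # choice is in 0..2
    return assignments

orders = [0] * 15
orders[4], orders[9] = 25, 10          # two open orders
actions = [0] * 15
actions[4], actions[9] = 2, 1          # agent's chosen manufacturing centers
print(apply_actions(actions, orders))  # {4: 2, 9: 1}
```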
The reward function is the way of telling the RL agent whether it is performing well. The RL agent is trained to maximize this function. Our reward function was as simple as:
reward = before.avgWaitingTime - after.avgWaitingTime
This means we only tried to minimize the waiting time. The more the waiting time increases, the more negative the reward becomes, so the RL agent learns that it is performing poorly.
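The reward above amounts to a one-line function of the average waiting time before and after a decision step – positive when waiting time falls, negative when it grows:

```python
# Minimal sketch of the reward: the change in average order waiting time
# between consecutive decision steps.
def reward(before_avg_wait: float, after_avg_wait: float) -> float:
    return before_avg_wait - after_avg_wait

print(reward(4.0, 3.2))  # positive: waiting time improved
print(reward(3.2, 5.0))  # negative: waiting time grew
```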
The results were impressive: the trained policy produced a waiting time more than four times shorter than the Nearest Agent heuristic. RL outperformed the other heuristics by such a margin because it could account for factories occasionally becoming overloaded by demand. The key difference is that the RL policy learned to assign orders dynamically: when the factory nearest to a distribution center is about to reach capacity, the RL agent places orders at factories further away, matching production capacity to demand. The other methods are static and cannot adapt to abrupt changes in demand.
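A toy comparison makes the difference concrete. This is not the Accenture model or the learned policy – just a hand-written capacity-aware rule standing in for the kind of behavior the RL agent learned, next to the static nearest-center rule:

```python
# Toy illustration: when the nearest factory is at capacity, a static
# nearest-center rule queues the order there, while a capacity-aware
# policy (a stand-in for the learned RL behavior) routes it further away.
from typing import List

def nearest_assignment(distances: List[float],
                       loads: List[int],
                       capacities: List[int]) -> int:
    """Static rule: always pick the closest manufacturing center."""
    return min(range(len(distances)), key=lambda i: distances[i])

def capacity_aware_assignment(distances: List[float],
                              loads: List[int],
                              capacities: List[int]) -> int:
    """Dynamic rule: prefer the closest center with spare capacity."""
    feasible = [i for i in range(len(distances)) if loads[i] < capacities[i]]
    pool = feasible if feasible else list(range(len(distances)))
    return min(pool, key=lambda i: distances[i])

distances = [10, 40, 55]   # distance from the ordering distribution center
loads = [8, 2, 1]          # orders currently queued at each factory
capacities = [8, 10, 10]   # factory 0 is already at capacity

print(nearest_assignment(distances, loads, capacities))         # 0: joins a queue
print(capacity_aware_assignment(distances, loads, capacities))  # 1: spare capacity
```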
Try it out and follow the tutorial
This simulation model is publicly available in AnyLogic Cloud. You can try it out for yourself.
Follow the tutorial from Pathmind.
Learn more about using Pathmind for reinforcement learning in AnyLogic on our dedicated Pathmind page.