At Engineering Ingegneria Informatica (EII), we see great potential in integrating simulation and machine learning. This short blog presents an industrial problem and its reinforcement learning solution, developed by EII using AnyLogic. The flexibility and customizability of AnyLogic allowed us to combine it with Pathmind and create a hybrid platform.
Read on to find out about the problem and see how to train agents by letting them interact with an AnyLogic environment. You will also learn the techniques used to formulate the industrial problem in a way fit for machine learning.
Why Simulation and Machine Learning?
While there are many differences between the three main paradigms of machine learning (supervised, unsupervised, and reinforcement), they share one important feature: a thirst for data. Among them, reinforcement learning not only needs the most data to reach an optimal solution, it also needs a link to a live environment to get this data.
Live environments and large sources of data are not easy to come by due to high costs, high risks, and the physical time limits linked to them. And it’s for this reason that so many of the recent breakthroughs in the world of reinforcement learning have emerged from interaction with games, which provide a safe, free, and easy environment to interact with.
Like games, simulation modeling removes the main limitations linked with real-world systems. It comes with a significantly lower price tag, no risk, and the possibility of virtual-time execution (e.g., simulating a system's evolution over an entire year in a matter of seconds). These features make simulation the most viable environment for training reinforcement learning agents in business contexts.
Problem Description
Imagine a factory that manufactures heavy industrial components. The production line consists of several automated workstations connected by heavy-duty conveyor systems, necessary because of the weight of the products. This is the case for ferromagnetic core production.
For the factory in question, these ferromagnetic cores are highly customized to meet the specific requirements of each transformer. As a result, production cycles and the movements of the cores on the production line are not easily automated.
At any given time, several cores can be in process, with a line manager needing to consider each individual production cycle and plan movements in advance. Despite best efforts, bottlenecks and line blockages happen and cost the factory money. The biggest problem is managing the movement of different objects around the production line.
Solution
The solution involves reinforcement learning agents that determine the movements of cores on the production line. Together, the agents aim to find the shortest paths between workstations to complete core production.
A robust simulation model of the real-world system is required for training the agents, and it is also necessary to break the problem down into smaller tasks. These smaller tasks are assigned to single learning agents, which then learn how to work with any initial layout. In other words, the main task of reaching the workstation targets is broken into smaller tasks that are managed by an ensemble of agents.
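To make the setup concrete, here is a minimal, gym-style sketch of the environment a single learning agent interacts with for its subtask. The interface is an illustrative assumption: in the project, the environment is the AnyLogic model of the production line, connected to the learner through Pathmind rather than through this exact API.

```python
# A gym-style view of what one learning agent sees for its subtask.
# Illustrative assumption: the real environment is the AnyLogic model
# exposed through Pathmind, not this exact interface.
from typing import Protocol, Tuple
import numpy as np

class CoreLineEnv(Protocol):
    def reset(self) -> np.ndarray:
        """Randomize the layout and return the encoded state
        (positions of the cores and the surrounding objects)."""
        ...

    def step(self, action: int) -> Tuple[np.ndarray, float, bool, dict]:
        """Apply one of the 64 discrete movement actions and return
        (next_state, reward, done, info); done marks that this agent's
        target workstation has been reached."""
        ...
```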
The reinforcement learning uses a double deep Q-network (DDQN) algorithm, because DDQN has shown reasonable sample efficiency and can be used effectively when the action space is discrete; for this problem, the number of possible actions at any given time is 64.
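For reference, the double Q-learning target at the heart of DDQN looks roughly like this in code. It is a minimal PyTorch sketch with an assumed state encoding and network size; only the 64-action output reflects the problem above, the rest is illustrative.

```python
# Minimal sketch of the double Q-learning target used by DDQN (PyTorch).
# STATE_DIM and the network shape are assumptions, not the project's
# actual architecture; only the 64-action output comes from the problem.
import torch
import torch.nn as nn

N_ACTIONS = 64      # discrete actions available at any given time
STATE_DIM = 128     # hypothetical size of the encoded layout state
GAMMA = 0.99        # discount factor

def make_q_net():
    return nn.Sequential(
        nn.Linear(STATE_DIM, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, N_ACTIONS),
    )

online_net = make_q_net()                 # updated every training step
target_net = make_q_net()                 # periodically synced copy
target_net.load_state_dict(online_net.state_dict())

def ddqn_targets(rewards, next_states, dones):
    """Double DQN: the online net picks the next action,
    the target net evaluates it."""
    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        return rewards + GAMMA * (1.0 - dones) * next_q
```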
Each movement receives a small negative reward (rn) until the final position is reached, at which point a large positive reward (rp) is given. However, such a reward scheme is quite sparse, given the large number of state-action pairs (approximately 36 million). To counter this, another reward mechanism creates a denser reward function: each time the agent moves the core, or an object in the immediate vicinity of the core, it gets a reward (ri) that is small with respect to rp. The reward scheme can be summarized as:
- rp: the reward if the taken action brings the core to its final position (e.g., rp = 1000)
- ri: the reward if the taken action moves the core or an object close to the core (e.g., ri = 1)
- rn: the reward in any other case (e.g., rn = -1)
The introduction of ri is backed by the understanding that moving the core, or the objects close to it, will increase the probability of reaching the final position. It is also worth noting that RL attempts to maximize the cumulative reward during an episode; therefore, ri should be small enough that the maximum accumulation of these small rewards stays much smaller than rp (i.e., max actions in an episode * ri < rp).
If ri is not small enough, the agent might learn a policy in which it just repeats actions that provide it with these small rewards. For instance, with the rewards given in the list above and episodes of 1000 actions, the agent does not need to learn how to reach the final position; as long as it learns how to move the core back and forth, it can do that for the entire episode and achieve a cumulative reward of 1000. This issue is observed and discussed further by OpenAI in their Faulty Reward Functions in the Wild example.
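Putting the scheme together, a reward function along these lines could be used, with a sanity check against the pitfall just described. The two helper flags and the episode length are illustrative assumptions; the values for rp, ri, and rn are the examples listed above.

```python
# A sketch of the reward scheme described above, using the example values
# rp = 1000, ri = 1, rn = -1. The flags (reached_target,
# moved_core_or_neighbour) are hypothetical names standing in for checks
# against the simulated layout, and the episode length is an assumption.
R_P = 1000   # core delivered to its final position
R_I = 1      # moved the core, or an object next to it
R_N = -1     # any other action

MAX_ACTIONS_PER_EPISODE = 500

# Guard against the "faulty reward" trap: accumulating the small shaping
# rewards over a whole episode must stay below rp.
assert MAX_ACTIONS_PER_EPISODE * R_I < R_P

def step_reward(reached_target: bool, moved_core_or_neighbour: bool) -> int:
    if reached_target:
        return R_P
    if moved_core_or_neighbour:
        return R_I
    return R_N
```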
In the beginning, an agent does not know the link between an object's position and the available actions, so it makes random decisions, which are sometimes physically invalid (red arrows in the video below) and therefore do not result in a state change.
As can be seen in the above video, most of the agent's random decisions cannot move the objects in the layout. Regardless, the agent stores all these interactions in its memory and, by exploring new actions, discovers better moves. Rich experience interacting with the environment eventually enables the agent to infer the best decision for any given situation. After training, the agent can perform its task efficiently and effectively, as shown in the video below.
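The explore-then-remember behaviour described here corresponds to a standard epsilon-greedy loop with an experience replay buffer. The sketch below continues the DDQN sketch above (reusing online_net and N_ACTIONS) and assumes env implements the interface sketched earlier; the hyperparameter values are illustrative, not the project's.

```python
# Sketch of the exploration-and-memory loop: epsilon-greedy action selection
# plus an experience replay buffer. Assumes `env` is an implementation of the
# CoreLineEnv interface sketched earlier; hyperparameters are illustrative.
import random
from collections import deque

import torch

replay_buffer = deque(maxlen=100_000)   # stores (s, a, r, s', done) tuples
EPS_START, EPS_END, EPS_DECAY = 1.0, 0.05, 0.999

def select_action(state, epsilon):
    """Early on epsilon is high, so most moves are random (and often
    physically invalid); as epsilon decays, the learned Q-values take over."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
    with torch.no_grad():
        return int(online_net(state_t).argmax(dim=1).item())

epsilon = EPS_START
state = env.reset()
for step in range(1_000_000):
    action = select_action(state, epsilon)
    next_state, reward, done, _ = env.step(action)
    replay_buffer.append((state, action, reward, next_state, done))
    # Target reached: randomize the layout and restart, as in the video.
    state = env.reset() if done else next_state
    epsilon = max(EPS_END, epsilon * EPS_DECAY)
    # ...then sample a minibatch from replay_buffer, fit online_net towards
    # the ddqn_targets shown earlier, and sync target_net periodically.
```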
In the following short video, the agent needs to take a core, from any given position, to the platform in the center, T16. Every time it reaches the target, the layout is randomized, and the simulation restarted. The agent then successfully performs the necessary movements again.
Finally, an ensemble of trained agents, like the one shown in the video above, can be put together to reach a series of targets in a layout.
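As a sketch of how such an ensemble could be chained, each trained policy can be asked, in turn, to bring the core to its own target workstation. The policy objects, their best_action method, and the route handling below are illustrative assumptions rather than the project's actual orchestration code.

```python
# Sketch: chain per-target trained policies to visit a series of workstations.
# `policies_by_target` and `best_action` are hypothetical names; the route is
# an ordered list of workstation names (e.g., ending at "T16").
def run_route(env, policies_by_target, route, max_steps_per_leg=500):
    """Drive the core along `route` by handing control to the policy
    trained for each target workstation in turn."""
    state = env.reset()
    for target in route:
        policy = policies_by_target[target]
        for _ in range(max_steps_per_leg):
            action = policy.best_action(state)  # greedy action from the trained net
            state, _, done, _ = env.step(action)
            if done:                            # this leg's target has been reached
                break
    return state
```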
Robust modeling and connectivity
By accurately capturing the production line in an AnyLogic simulation model, EII was able to successfully apply deep reinforcement learning. The result is a policy that can effectively manage production line movements. Key to the success of the project was AnyLogic's capacity for capturing a system appropriately and the possibilities it provides for connecting with machine learning technologies – find out more about AnyLogic AI.
This example was created by Engineering Ingegneria Informatica using AnyLogic and Pathmind for reinforcement learning.