Reinforcement Learning: A Catalyst for Next-Gen Mathematical Optimization

Mathematical optimization drives complex decision-making across a diverse range of problems – e.g., energy management, inventory planning, network design, pricing & revenue management, production planning & scheduling, supplier selection, and transportation planning. The field has evolved significantly over the decades, from its formative years around World War II to the modern age. While conventional optimization approaches have been very successful in solving real-world challenges, we increasingly encounter problems that these approaches cannot address efficiently. This necessitates applying newer techniques and technologies to the world of optimization.

The idea of using machine learning to augment the capabilities and scope of mathematical optimization is not new. However, recent advancements in the field, especially in Deep Learning, have made its adoption far more viable than ever before. In particular, Reinforcement Learning is a strong option for combinatorial optimization problems. Researchers are building neural solvers that model the process of finding the optimal solution as a learning task. The objective is to train better heuristics on a large number of problem instances drawn from a certain distribution, and then leverage this learning to address new instances from the same or similar distributions.

The Limitations of Conventional Optimization Methods

In most real-world applications, three approaches are generally used for solving optimization problems:

  • Exact methods, e.g., branch-and-bound, dynamic programming, simplex, etc.
  • Approximation methods, such as set cover or vertex cover approximation
  • Heuristics & Metaheuristics, such as Local Search (e.g., greedy search, hill climbing, and simulated annealing) and Population-based methods (e.g., ant colony optimization, particle swarm optimization, and evolutionary algorithms) – a minimal local-search sketch follows this list
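
To make the heuristics category concrete, here is a minimal, purely illustrative sketch: a greedy nearest-neighbour construction followed by 2-opt local search on a toy Traveling Salesman instance. The distance matrix and function names are assumptions made for the example, not a reference implementation.

```python
# Illustrative only: nearest-neighbour construction + 2-opt local search
# for a toy TSP instance (the distance matrix D is an assumption).
import itertools

D = [[0, 2, 9, 10], [2, 0, 6, 4], [9, 6, 0, 8], [10, 4, 8, 0]]

def tour_length(tour):
    # Total length of the closed tour (wraps back to the starting city).
    return sum(D[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))

def nearest_neighbour(start=0):
    # Greedy construction: always move to the closest unvisited city.
    tour, remaining = [start], set(range(len(D))) - {start}
    while remaining:
        nxt = min(remaining, key=lambda c: D[tour[-1]][c])
        tour.append(nxt)
        remaining.remove(nxt)
    return tour

def two_opt(tour):
    # Local search: keep reversing segments while it shortens the tour.
    improved = True
    while improved:
        improved = False
        for i, j in itertools.combinations(range(len(tour)), 2):
            candidate = tour[:i] + tour[i:j][::-1] + tour[j:]
            if tour_length(candidate) < tour_length(tour):
                tour, improved = candidate, True
    return tour

tour = two_opt(nearest_neighbour())
print(tour, tour_length(tour))
```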

Modern optimization problems are becoming increasingly complex, and these conventional approaches sometimes prove inefficient. For instance, exact methods struggle with large instances; approximation algorithms are not easy to construct and may suffer from weak empirical performance; and heuristics provide no theoretical guarantees and often require significant customization for each specific problem.

In general, conventional methods struggle to address the needs of large-scale problems, especially those involving high-dimensional solution spaces. Despite key innovations (e.g., decomposition or partitioning methods), the issue persists. Moreover, real-world problems often have multiple local minima (or maxima), and it is not uncommon to see conventional methods stuck in local optima instead of reaching the global optimum. Finally, real-world problems today often involve dynamic environments, highly uncertain problem spaces, or non-linear constraints – modeling these characteristics using traditional methods is difficult.

Why Use Reinforcement Learning in Mathematical Optimization?

While machine learning offers a great option for addressing some of the limitations of conventional optimization approaches, Reinforcement Learning has immense potential for complex problems, especially in combinatorial optimization. Here are three major reasons:

  • Adapting to the needs of dynamic environments: Reinforcement learning works well for problems that evolve over time (e.g., dynamic routing.) Additionally, it supports learning based on interactions with the problem space, thereby adapting as needed.
  • Balancing exploration and exploitation: The search space for real-life optimization problems is often vast, particularly for combinatorial tasks. Finding the global optimum for such complex problems requires an effective balance between exploration (trying new solutions) and exploitation (selecting the best-known solutions). Since reinforcement learning techniques are built around the concepts of exploration and exploitation, they are well suited to striking this balance.
  • Solving sequential decision-making problems: Many combinatorial optimization problems (e.g., Travelling Salesman or Vehicle Routing) can be formulated as a series of decisions. Agents in reinforcement learning can make decisions sequentially, with each decision leading to a new state (just like incremental choices in conventional optimization.)

Research has shown that reinforcement learning may be a good fit for certain types of optimization problems, such as:

  • assembly line balancing, i.e., efficiently assigning tasks to different workstations in an assembly line
  • dynamic delivery routing and vehicle routing, i.e., planning efficient routes for deliveries and vehicles as conditions change
  • dynamic network routing and load balancing, in which network resources need to be allocated optimally while adapting to changing network traffic
  • job shop scheduling, in which the goal is to optimize task sequencing on machines while minimizing makespan, or maximizing throughput
  • large-scale operations research problems in areas like cutting stock, knapsack, or resource allocation
  • smart grid optimization, in which energy distribution is managed dynamically to minimize costs and balance demand and supply

How to Apply Reinforcement Learning to Optimization Problems?

The optimization problem needs to be modeled as a sequential decision-making process, generally as a Markov Decision Process (MDP).

Let’s take the example of the Traveling Salesman Problem (TSP), a special case of the Quadratic Assignment Problem (QAP), and an NP-hard problem. This problem can be formulated in the following manner:

  • State Space (S) represents the sequence of cities already visited.
  • Action Space (A) corresponds to choosing the next city to visit (from the set of unvisited cities.)
  • Transition Function (T) reflects the transition to the next state (i.e., adding the chosen city to the sequence of visited cities.)
  • Reward Function (R) is the negative of the distance traveled in each step, so maximizing cumulative reward minimizes the total tour length.
  • Policy (π) maps each state to the next city to visit; the learned policy should yield the sequence of city selections with the shortest total travel.
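
As a concrete illustration of this formulation, here is a minimal sketch of the TSP MDP in Python. The class and method names (TSPEnv, reset, step, actions) are assumptions made for the example, not a standard API, and the rollout below simply follows a random policy.

```python
# Illustrative sketch of the TSP MDP described above (toy instance only).
import random

class TSPEnv:
    def __init__(self, dist):
        self.dist = dist                     # symmetric distance matrix
        self.n = len(dist)

    def reset(self):
        self.current = 0                     # start at city 0
        self.visited = [0]                   # state: sequence of visited cities
        return tuple(self.visited)

    def actions(self):
        # Action space: the cities not yet visited.
        return [c for c in range(self.n) if c not in self.visited]

    def step(self, city):
        # Transition: append the chosen city; reward: negative distance.
        reward = -self.dist[self.current][city]
        self.current = city
        self.visited.append(city)
        done = len(self.visited) == self.n
        if done:
            reward -= self.dist[self.current][self.visited[0]]  # close the tour
        return tuple(self.visited), reward, done

# Roll out a random policy on a toy 4-city instance.
env = TSPEnv([[0, 2, 9, 10], [2, 0, 6, 4], [9, 6, 0, 8], [10, 4, 8, 0]])
state, done, total_reward = env.reset(), False, 0.0
while not done:
    state, reward, done = env.step(random.choice(env.actions()))
    total_reward += reward
print("tour:", state, "total reward:", total_reward)
```

A learned policy would replace random.choice here, choosing the next city from an encoding of the current state.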

Broadly speaking, the agent’s goal is to find a policy function that maps states into actions. The environment is defined by a particular instance of the problem (e.g., the Max-Cut problem), while the states are encoded with a neural network model. The agent then uses reinforcement learning algorithms to make decisions that move the environment to the next state (e.g., removing a vertex from a solution set). See the diagram below.

Image Source: Reinforcement learning for combinatorial optimization: A survey (Mazyavkina et al., 2020)

Prominent Techniques

  • Policy Gradient methods (e.g., REINFORCE and Actor-Critic): These directly optimize the policy by estimating the gradient of the expected reward with respect to the policy parameters. Advantages include improved generalization, and the ability to train models without the need for hand-crafted features, or domain-specific heuristics. In particular, Attention-based architectures and Pointer Networks are useful for solving large-scale Traveling Salesman and Vehicle Routing problems.
  • Q-learning methods (e.g., Deep Q-Networks): In these methods, agents learn (or approximate) the Q-function, which is the value of an action in a given state. Additionally, the technique of ‘experience replay’ is often leveraged to stabilize the learning process. These methods are often general-purpose (i.e., they can be applied across a variety of optimization problems) and model-free (i.e., there is no need to model the environment). Knapsack and routing problems are suitable use cases for these methods.
  • Monte Carlo Tree Search (MCTS): This technique combines tree search with reinforcement learning, and is specifically relevant for solution spaces structured like trees (e.g., sequential decision-making.) Job shop optimization, production planning, and scheduling problems are suitable use cases.
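
To ground the Q-learning bullet above, here is a minimal, illustrative sketch of tabular Q-learning on a toy TSP instance, matching the MDP formulation from the previous section; a Deep Q-Network would replace the table with a neural approximator of the Q-function. The instance and hyperparameters are assumptions made for the example.

```python
# Illustrative only: tabular Q-learning on a tiny TSP instance.
# State = (current city, frozenset of visited cities); reward = negative distance.
import random
from collections import defaultdict

D = [  # toy symmetric distance matrix (an assumption for the example)
    [0, 2, 9, 10, 7],
    [2, 0, 6, 4, 3],
    [9, 6, 0, 8, 5],
    [10, 4, 8, 0, 6],
    [7, 3, 5, 6, 0],
]
N = len(D)
Q = defaultdict(float)                 # Q[(state, action)] -> value estimate
alpha, gamma, eps = 0.1, 1.0, 0.2      # learning rate, discount, exploration rate

def run_episode():
    city, visited, tour_cost = 0, frozenset([0]), 0.0
    while len(visited) < N:
        state = (city, visited)
        actions = [c for c in range(N) if c not in visited]
        if random.random() < eps:
            a = random.choice(actions)                      # explore
        else:
            a = max(actions, key=lambda c: Q[(state, c)])   # exploit
        reward = -D[city][a]
        next_visited = visited | {a}
        next_actions = [c for c in range(N) if c not in next_visited]
        # Bootstrap target: best next Q-value, or the cost of closing the tour.
        best_next = max((Q[((a, next_visited), c)] for c in next_actions),
                        default=-D[a][0])
        Q[(state, a)] += alpha * (reward + gamma * best_next - Q[(state, a)])
        city, visited, tour_cost = a, next_visited, tour_cost - reward
    return tour_cost + D[city][0]      # add the closing edge back to city 0

costs = [run_episode() for _ in range(5000)]
print("best tour cost found:", min(costs))
```

Policy Gradient methods would instead parameterize π directly (e.g., with a Pointer Network) and adjust its parameters along the estimated gradient of expected reward, while MCTS would build a search tree over the same state space.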

Known Limitations and the Direction of Research

Reinforcement learning methods also have limitations, which impact their ability to solve all types of optimization problems. For instance, Policy Gradient methods can be sample-inefficient in high-dimensional problem spaces and unstable during training if the reward signals are sparse or noisy. Similarly, Q-learning methods are sometimes inefficient for large state-action spaces, and MCTS is often computationally expensive, thus reducing its applicability to large-scale combinatorial problems.

Scalability to very large problem instances is perhaps the most important issue today. While the techniques mentioned earlier can handle most medium-to-large problems, addressing anything larger in scale remains a fundamental challenge. Generalization (i.e., the ability to transfer learned policies from one problem instance to another) is another major challenge, thereby limiting practical applicability for certain use cases. Finally, balancing exploration and exploitation in certain cases may also be challenging.

Most of today’s research is focused on addressing the above gaps. One major area is Graph-based Reinforcement Learning for combinatorial optimization; a recent survey (Darvariu et al., 2024) highlights its potential for graph structure optimization (e.g., in molecular optimization) and graph process optimization (e.g., in network games.) Another important area is Meta-Learning, which focuses on enabling reinforcement learning agents to generalize across a distribution of problem instances. This significantly reduces the need to retrain from scratch for new problem instances, thus improving overall generalization. A third area is Learning to Optimize, where reinforcement learning is leveraged to guide conventional optimization algorithms to perform better (e.g., guiding Branch-and-Bound decisions.)

Closing Comments

Applying reinforcement learning to mathematical programming creates a powerful paradigm for solving complex, large-scale optimization problems, and can help develop highly scalable, more robust, next-generation optimization solutions. While this hybridization is still in its early stages, both the pace and the direction of progress are encouraging. Most research in this area today has a significant ‘applied’ component – i.e., it focuses on solving real-world challenges, such as approximating complex functions, learning better heuristics, and improving real-time decision-making in dynamic environments. These efforts are likely to help solve problems considered intractable until now.

References:

  • Reinforcement learning for combinatorial optimization: A survey (Mazyavkina et al., 2020)
  • Graph Reinforcement Learning for Combinatorial Optimization: A Survey and Unifying Perspective (Darvariu et al., 2024)

PS: 10 – 15% of this paper was written with the help of generative AI.
