Duality
1 Motivation
Duality lets us associate to any constrained optimization problem a concave maximization problem, whose solutions lower bound the optimal value of the original problem. What is interesting is that there are cases, when one can solve the primal problem by first solving the dual one. Now, consider a general constrained optimization problem:
\text{ Primal: }f(x) \to \min\limits_{x \in S} \qquad \text{ Dual: } g(y) \to \max\limits_{y \in \Omega}
We’ll build g(y), that preserves the uniform bound:
g(y) \leq f(x) \qquad \forall x \in S, \forall y \in \Omega
As a consequence:
\max\limits_{y \in \Omega} g(y) \leq \min\limits_{x \in S} f(x)
We’ll consider one of many possible ways to construct g(y) in case, when we have a general mathematical programming problem with functional constraints:
\begin{split} & f_0(x) \to \min\limits_{x \in \mathbb{R}^n}\\ \text{s.t. } & f_i(x) \leq 0, \; i = 1,\ldots,m\\ & h_i(x) = 0, \; i = 1,\ldots, p \end{split}
And the Lagrangian, associated with this problem:
L(x, \lambda, \nu) = f_0(x) + \sum\limits_{i=1}^m \lambda_i f_i(x) + \sum\limits_{i=1}^p\nu_i h_i(x) = f_0(x) + \lambda^\top f(x) + \nu^\top h(x)
We assume \mathcal{D} = \bigcap\limits_{i=0}^m\textbf{dom } f_i \cap \bigcap\limits_{i=1}^p\textbf{dom } h_i is nonempty. We define the Lagrange dual function (or just dual function) g: \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R} as the minimum value of the Lagrangian over x: for \lambda \in \mathbb{R}^m, \nu \in \mathbb{R}^p
g(\lambda, \nu) = \inf_{x \in \mathcal{D}} L(x, \lambda, \nu) = \inf_{x \in \mathcal{D}} \left( f_0(x) +\sum\limits_{i=1}^m \lambda_i f_i(x) + \sum\limits_{i=1}^p\nu_i h_i(x) \right)
When the Lagrangian is unbounded below in x, the dual function takes on the value −\infty. Since the dual function is the pointwise infimum of a family of affine functions of (\lambda, \nu), it is concave, even when the original problem is not convex.
Let us show, that the dual function yields lower bounds on the optimal value p^* of the original problem for any \lambda \succeq 0, \nu. Suppose some \hat{x} is a feasible point for the original problem, i.e., f_i(\hat{x}) \leq 0 and h_i(\hat{x}) = 0, \; λ \succeq 0. Then we have:
L(\hat{x}, \lambda, \nu) = f_0(\hat{x}) + \underbrace{\lambda^\top f(\hat{x})}_{\leq 0} + \underbrace{\nu^\top h(\hat{x})}_{= 0} \leq f_0(\hat{x})
Hence
g(\lambda, \nu) = \inf_{x \in \mathcal{D}} L(x, \lambda, \nu) \leq L(\hat{x}, \lambda, \nu) \leq f_0(\hat{x})
g(\lambda, \nu) \leq p^*
A natural question is: what is the best lower bound that can be obtained from the Lagrange dual function? This leads to the following optimization problem:
\begin{split} & g(\lambda, \nu) \to \max\limits_{\lambda \in \mathbb{R}^m, \; \nu \in \mathbb{R}^p }\\ \text{s.t. } & \lambda \succeq 0 \end{split}
The term “dual feasible”, to describe a pair (\lambda, \nu) with \lambda \succeq 0 and g(\lambda, \nu) > −\infty, now makes sense. It means, as the name implies, that (\lambda, \nu) is feasible for the dual problem. We refer to (\lambda^*, \nu^*) as dual optimal or optimal Lagrange multipliers if they are optimal for the above problem.
1.1 Summary
2 Strong duality
It is common to name this relation between optimals of primal and dual problems as weak duality. For problem, we have:
p^* \geq d^*
While the difference between them is often called duality gap:
p^* - d^* \geq 0
Note, that we always have weak duality, if we’ve formulated primal and dual problem. It means, that if we have managed to solve the dual problem (which is always concave, no matter whether the initial problem was or not), then we have some lower bound. Surprisingly, there are some notable cases, when these solutions are equal.
Strong duality happens if duality gap is zero:
p^∗ = d^*
Notice: both p^* and d^* may be + \infty.
- Several sufficient conditions known!
- “Easy” necessary and sufficient conditions: unknown.
3 Useful features
Construction of lower bound on solution of the direct problem.
It could be very complicated to solve the initial problem. But if we have the dual problem, we can take an arbitrary y \in \Omega and substitute it in g(y) - we’ll immediately obtain some lower bound.
Checking for the problem’s solvability and attainability of the solution.
From the inequality \max\limits_{y \in \Omega} g(y) \leq \min\limits_{x \in S} f_0(x) follows: if \min\limits_{x \in S} f_0(x) = -\infty, then \Omega = \varnothing and vice versa.
Sometimes it is easier to solve a dual problem than a primal one.
In this case, if the strong duality holds: g(y^∗) = f_0(x^∗) we lose nothing.
Obtaining a lower bound on the function’s residual.
f_0(x) - f_0^∗ \leq f_0(x) - g(y) for an arbitrary y \in \Omega (suboptimality certificate). Moreover, p^* \in [g(y), f_0(x)], d^* \in [g(y), f_0(x)]
Dual function is always concave
As a pointwise minimum of affine functions.
4 Slater’s condition
4.1 Reminder of KKT statements:
Suppose we have a general optimization problem (from the chapter)
\begin{split} & f_0(x) \to \min\limits_{x \in \mathbb{R}^n}\\ \text{s.t. } & f_i(x) \leq 0, \; i = 1,\ldots,m\\ & h_i(x) = 0, \; i = 1,\ldots, p \end{split} \tag{1}
and convex optimization problem (see corresponding chapter), where all equality constraints are affine: h_i(x) = a_i^Tx - b_i, i \in 1, \ldots p
\begin{split} & f_0(x) \to \min\limits_{x \in \mathbb{R}^n}\\ \text{s.t. } & f_i(x) \leq 0, \; i = 1,\ldots,m\\ & Ax = b, \end{split} \tag{2}
The Lagrangian is
L(x, \lambda, \nu) = f_0(x) + \sum\limits_{i=1}^m \lambda_i f_i(x) + \sum\limits_{i=1}^p\nu_i h_i(x)
The KKT system is:
\begin{split} & \nabla_x L(x^*, \lambda^*, \nu^*) = 0 \\ & \nabla_\nu L(x^*, \lambda^*, \nu^*) = 0 \\ & \lambda^*_i \geq 0, i = 1,\ldots,m \\ & \lambda^*_i f_i(x^*) = 0, i = 1,\ldots,m \\ & f_i(x^*) \leq 0, i = 1,\ldots,m \\ \end{split} \tag{3}
5 Applications
5.1 Connection between Fenchel duality and Lagrange duality
Let f: E \to \mathbb{R} and g: G \to \mathbb{R} — function, defined on the sets E and G in Euclidian Spaces V and W respectively. Let f^*:E_* \to \mathbb{R}, g^*:G_* \to \mathbb{R} be the conjugate functions to the f and g respectively. Let A: V \to W — linear mapping. We call Fenchel - Rockafellar problem the following minimization task:
f(x) + g(Ax) \to \min\limits_{x \in E \cap A^{-1}(G)}
where A^{-1}(G) := \{x \in V : Ax \in G\} — preimage of G. We’ll build the dual problem using variable separation. Let’s introduce new variable y = Ax. The problem could be rewritten:
\begin{split} & f(x) + g(y) \to \min\limits_{x \in E, \; y \in G }\\ \text{s.t. } & Ax = y \end{split}
Lagrangian
L(x,y, \lambda) = f(x) + g(y) + \lambda^\top (Ax - y)
Dual function
\begin{split} g_d(\lambda) &= \min\limits_{x \in E, \; y \in G} L(x,y, \lambda) \\ &= \min\limits_{x \in E}\left[ f(x) + (A^*\lambda)^\top x \right] + \min\limits_{y \in G} \left[g(y) - \lambda^\top y\right] = \\ &= -\max\limits_{x \in E}\left[(-A^*\lambda)^\top x - f(x) \right] - \max\limits_{y \in G} \left[\lambda^\top y - g(y)\right] \end{split}
Now, we need to remember the definition of the conjugate function:
\sup_{y \in G}\left[\lambda^\top y - g(y)\right] = \begin{cases} g^*(\lambda), &\text{ if } \lambda \in G_*\\ +\infty, &\text{ otherwise} \end{cases}
\sup_{x \in E}\left[(-A^*\lambda)^\top x - f(x) \right] = \begin{cases} f^*(-A^*\lambda), &\text{ if } \lambda \in (-A^*)^{-1}(E_*)\\ +\infty, &\text{ otherwise} \end{cases}
So, we have:
\begin{split} g_d(\lambda) &= \min\limits_{x \in E, y \in G} L(x,y, \lambda) = \\ &= \begin{cases} -g^*(\lambda) - f^*(-A^*\lambda) &\text{ if } \lambda \in G_* \cap (-A^*)^{-1}(E_*)\\ -\infty, &\text{ otherwise} \end{cases} \end{split}
which allows us to formulate one of the most important theorems, that connects dual problems and conjugate functions:
Convex case is especially important since if we have Fenchel - Rockafellar problem with parameters (f, g, A), than the dual problem has the form (f^*, g^*, -A^*).
5.2 Sensitivity analysis
Let us switch from the original optimization problem
\begin{split} & f_0(x) \to \min\limits_{x \in \mathbb{R}^n}\\ \text{s.t. } & f_i(x) \leq 0, \; i = 1,\ldots,m\\ & h_i(x) = 0, \; i = 1,\ldots, p \end{split} \tag{P}
To the perturbed version of it:
\begin{split} & f_0(x) \to \min\limits_{x \in \mathbb{R}^n}\\ \text{s.t. } & f_i(x) \leq u_i, \; i = 1,\ldots,m\\ & h_i(x) = v_i, \; i = 1,\ldots, p \end{split} \tag{Per}
Note, that we still have the only variable x \in \mathbb{R}^n, while treating u \in \mathbb{R}^m, v \in \mathbb{R}^p as parameters. It is obvious, that \text{Per}(u,v) \to \text{P} if u = 0, v = 0. We will denote the optimal value of \text{Per} as p^*(u, v), while the optimal value of the original problem \text{P} is just p^*. One can immediately say, that p^*(u, v) = p^*.
Speaking of the value of some i-th constraint we can say, that
- u_i = 0 leaves the original problem
- u_i > 0 means that we have relaxed the inequality
- u_i < 0 means that we have tightened the constraint
One can even show, that when \text{P} is convex optimization problem, p^*(u,v) is a convex function.
Suppose, that strong duality holds for the orriginal problem and suppose, that x is any feasible point for the perturbed problem:
\begin{split} p^*(0,0) &= p^* = d^* = g(\lambda^*, \nu^*) \leq \\ & \leq L(x, \lambda^*, \nu^*) = \\ & = f_0(x) + \sum\limits_{i=1}^m \lambda_i^* f_i(x) + \sum\limits_{i=1}^p\nu_i^* h_i(x) \leq \\ & \leq f_0(x) + \sum\limits_{i=1}^m \lambda_i^* u_i + \sum\limits_{i=1}^p\nu_i^* v_i \end{split}
Which means
\begin{split} f_0(x) \geq p^*(0,0) - {\lambda^*}^T u - {\nu^*}^T v \end{split}
And taking the optimal x for the perturbed problem, we have:
p^*(u,v) \geq p^*(0,0) - {\lambda^*}^T u - {\nu^*}^T v \tag{4}
In scenarios where strong duality holds, we can draw several insights about the sensitivity of optimal solutions in relation to the Lagrange multipliers. These insights are derived from the inequality expressed in equation above:
Impact of Tightening a Constraint (Large \lambda_i^\star):
When the ith constraint’s Lagrange multiplier, \lambda_i^\star, holds a substantial value, and if this constraint is tightened (choosing u_i < 0), there is a guarantee that the optimal value, denoted by p^\star(u, v), will significantly increase.Effect of Adjusting Constraints with Large Positive or Negative \nu_i^\star:
- If \nu_i^\star is large and positive and v_i < 0 is chosen, or
- If \nu_i^\star is large and negative and v_i > 0 is selected,
then in either scenario, the optimal value p^\star(u, v) is expected to increase greatly.
- If \nu_i^\star is large and positive and v_i < 0 is chosen, or
Consequences of Loosening a Constraint (Small \lambda_i^\star):
If the Lagrange multiplier \lambda_i^\star for the ith constraint is relatively small, and the constraint is loosened (choosing u_i > 0), it is anticipated that the optimal value p^\star(u, v) will not significantly decrease.Outcomes of Tiny Adjustments in Constraints with Small \nu_i^\star:
- When \nu_i^\star is small and positive, and v_i > 0 is chosen, or
- When \nu_i^\star is small and negative, and v_i < 0 is opted for,
in both cases, the optimal value p^\star(u, v) will not significantly decrease.
- When \nu_i^\star is small and positive, and v_i > 0 is chosen, or
These interpretations provide a framework for understanding how changes in constraints, reflected through their corresponding Lagrange multipliers, impact the optimal solution in problems where strong duality holds.
5.3 Local sensitivity
Suppose now that p^*(u, v) is differentiable at u = 0, v = 0.
\lambda_i^* = -\dfrac{\partial p^*(0,0)}{\partial u_i} \quad \nu_i^* = -\dfrac{\partial p^*(0,0)}{\partial v_i} \tag{5}
To show this result we consider the directional derivative of p^*(u,v) along the direction of some i-th basis vector e_i:
\lim_{t \to 0} \dfrac{p^*(t e_i,0) - p^*(0,0)}{t} = \dfrac{\partial p^*(0,0)}{\partial u_i}
From the inequality Equation 4 and taking the limit t \to0 with t>0 we have
\dfrac{p^*(t e_i,0) - p^*}{t} \geq -\lambda_i^* \to \dfrac{\partial p^*(0,0)}{\partial u_i} \geq -\lambda_i^*
For the negative t < 0 we have:
\dfrac{p^*(t e_i,0) - p^*}{t} \leq -\lambda_i^* \to \dfrac{\partial p^*(0,0)}{\partial u_i} \leq -\lambda_i^*
The same idea can be used to establish the fact about v_i.
The local sensitivity result Equation 5 provides a way to understand the impact of constraints on the optimal solution x^* of an optimization problem. If a constraint f_i(x^*) is negative at x^*, it’s not affecting the optimal solution, meaning small changes to this constraint won’t alter the optimal value. In this case, the corresponding optimal Lagrange multiplier will be zero, as per the principle of complementary slackness.
However, if f_i(x^*) = 0, meaning the constraint is precisely met at the optimum, then the situation is different. The value of the i-th optimal Lagrange multiplier, \lambda^*_i, gives us insight into how ‘sensitive’ or ‘active’ this constraint is. A small \lambda^*_i indicates that slight adjustments to the constraint won’t significantly affect the optimal value. Conversely, a large \lambda^*_i implies that even minor changes to the constraint can have a significant impact on the optimal solution.
5.4 Shadow prices or tax interpretation
Consider an enterprise where x represents its operational strategy and f_0(x) is the operating cost. Therefore, -f_0(x) denotes the profit in dollars. Each constraint f_i(x) \leq 0 signifies a resource or regulatory limit. The goal is to maximize profit while adhering to these limits, which is equivalent to solving:
\begin{split} & f_0(x) \to \min\limits_{x \in \mathbb{R}^n}\\ \text{s.t. } & f_i(x) \leq 0, \; i = 1,\ldots,m \end{split}
The optimal profit here is -p^*.
Now, imagine a scenario where exceeding limits is allowed, but at a cost. This cost is linear to the extent of violation, quantified by f_i. The charge for breaching the i^{th} constraint is \lambda_i f_i(x). If f_i(x) < 0, meaning the constraint is not fully utilized, \lambda_i f_i(x) represents income for the firm. Here, \lambda_i is the cost (in dollars) per unit of violation for f_i(x).
For instance, if f_1(x) \leq 0 limits warehouse space, the firm can rent out extra space at \lambda_1 dollars per square meter or rent out unused space for the same rate.
The firm’s total cost, considering operational and constraint costs, is L(x, \lambda) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x). The firm aims to minimize L(x, \lambda), resulting in an optimal cost g(\lambda). The dual function g(\lambda) represents the best possible cost for the firm based on the prices of constraints \lambda, and the optimal dual value d^* is this cost under the most unfavorable price conditions.
Weak duality implies that the cost in this flexible scenario (where the firm can trade constraint violations) is always less than or equal to the cost in the strict original scenario. This is because any optimal operation x^* from the original scenario will cost less in the flexible scenario, as the firm can earn from underused constraints.
If strong duality holds and the dual optimum is reached, the optimal \lambda^* represents prices where the firm gains no extra advantage from trading constraint violations. These optimal \lambda^* values are often termed ‘shadow prices’ for the original problem, indicating the hypothetical cost of constraint flexibility.
5.5 Mixed strategies for matrix games
In zero-sum matrix games, players 1 and 2 choose actions from sets \{1,...,n\} and \{1,...,m\}, respectively. The outcome is a payment from player 1 to player 2, determined by a payoff matrix P \in \mathbb{R}^{n \times m}. Each player aims to use mixed strategies, choosing actions according to a probability distribution: player 1 uses probabilities u_k for each action i, and player 2 uses v_l.
The expected payoff from player 1 to player 2 is given by \sum_{k=1}^{n} \sum_{l=1}^{m} u_k v_l P_{kl} = u^T P v. Player 1 seeks to minimize this expected payoff, while player 2 aims to maximize it.
5.5.1 Player 1’s Perspective
Assuming player 2 knows player 1’s strategy u, player 2 will choose v to maximize u^T P v. The worst-case expected payoff is thus:
\max_{v \geq 0, 1^T v = 1} u^T P v = \max_{i=1,...,m} (P^T u)_i
Player 1’s optimal strategy minimizes this worst-case payoff, leading to the optimization problem:
\begin{split} & \min \max_{i=1,...,m} (P^T u)_i\\ & \text{s.t. } u \geq 0 \\ & 1^T u = 1 \end{split} \tag{6}
This forms a convex optimization problem with the optimal value denoted as p^*_1.
5.5.2 Player 2’s Perspective
Conversely, if player 1 knows player 2’s strategy v, the goal is to minimize u^T P v. This leads to:
\min_{u \geq 0, 1^T u = 1} u^T P v = \min_{i=1,...,n} (P v)_i
Player 2 then maximizes this to get the largest guaranteed payoff, solving the optimization problem:
\begin{split} & \max \min_{i=1,...,n} (P v)_i \\ & \text{s.t. } v \geq 0 \\ & 1^T v = 1 \end{split} \tag{7}
The optimal value here is p^*_2.
5.5.3 Duality and Equivalence
It’s generally advantageous to know the opponent’s strategy, but surprisingly, in mixed strategy matrix games, this advantage disappears. The key lies in duality: the problems above are Lagrange duals. By formulating player 1’s problem as a linear program and introducing Lagrange multipliers, we find that the dual problem matches player 2’s problem. Due to strong duality in feasible linear programs, p^*_1 = p^*_2, showing no advantage in knowing the opponent’s strategy.
5.5.4 Formulating and Solving the Lagrange Dual
We approach problem Equation 6 by setting it up as a linear programming (LP) problem. The goal is to minimize a variable t, subject to certain constraints:
- u \geq 0,
- The sum of elements in u equals 1 (1^T u = 1),
- P^T u is less than or equal to t times a vector of ones (P^T u \leq t \mathbf{1}).
Here, t is an additional variable in the real numbers (t \in \mathbb{R}).
5.5.5 Constructing the Lagrangian
We introduce multipliers for the constraints: \lambda for P^T u \leq t \mathbf{1}, \mu for u \geq 0, and \nu for 1^T u = 1. The Lagrangian is then formed as:
L = t + \lambda^T (P^T u - t \mathbf{1}) - \mu^T u + \nu (1 - 1^T u) = \nu + (1 - 1^T \lambda)t + (P\lambda - \nu \mathbf{1} - \mu)^T u
5.5.6 Defining the Dual Function
The dual function g(\lambda, \mu, \nu) is defined as:
g(\lambda, \mu, \nu) = \begin{cases} \nu & \text{if } 1^T\lambda=1 \text{ and } P\lambda - \nu \mathbf{1} = \mu \\ -\infty & \text{otherwise} \end{cases}
5.5.7 Solving the Dual Problem
The dual problem seeks to maximize \nu under the following conditions:
- \lambda \geq 0,
- The sum of elements in \lambda equals 1 (1^T \lambda = 1),
- \mu \geq 0,
- P\lambda - \nu \mathbf{1} = \mu.
Upon eliminating \mu, we obtain the Lagrange dual of Equation 6:
\begin{split} & \max \nu \\ & \text{s.t. } \lambda \geq 0 \\ & \lambda \geq 0 \\ & P\lambda \geq \nu \mathbf{1} \end{split}
5.5.8 Conclusion
This formulation shows that the Lagrange dual problem is equivalent to problem Equation 7. Given the feasibility of these linear programs, strong duality holds, meaning the optimal values of Equation 6 and Equation 7 are equal.