Complex optimization

Optimization problems

An integral part of Tensorlab is the optimization of real-valued functions of complex variables [7][8]. Tensorlab offers algorithms for complex optimization that solve unconstrained nonlinear optimization problems of the form

(1)\[\underset{\vec{z}\,\in\,\C^n}{\operatorname{minimize}}~~ f(\vec{z},\conj{\vec{z}})\mbox{,}\]

where \(f : \C^n \rightarrow \R\) is a smooth function (cf. minf_lbfgs, minf_lbfgsdl and minf_ncg) and nonlinear least squares problems of the form

(2)\[\underset{\vec{z}\,\in\,\C^n}{\operatorname{minimize}}~~\frac{1}{2}\|\tensor{F}(\vec{z},\conj{\vec{z}})\|_F^2\mbox{,}\]

where \(\|\cdot\|_F\) is the Frobenius norm and \(\tensor{F} : \C^n \rightarrow \C^{I_1 \times \cdots \times I_N}\) is a smooth function that maps \(n\) complex variables to \(\prod I_n\) complex residuals (cf. nls_gndl, nls_gncgs and nls_lm). For nonlinear least squares problems, simple bound constraints of the form

(3)\[\begin{split}\underset{\vec{z}\,\in\,\C^n}{\operatorname{minimize}}&~~\frac{1}{2}\|\tensor{F}(\vec{z},\conj{\vec{z}})\|_F^2\\ \mbox{subject to} &~~\realpart{\vec{l}} \leq \realpart{\vec{z}} \leq \realpart{\vec{u}}\\ &~~\impart{\vec{l}} \leq \impart{\vec{z}} \leq \impart{\vec{u}}\end{split}\]

are also supported (cf. nlsb_gndl). Furthermore, when a real solution \(\vec{z} \in \R^n\) is sought, complex optimization reduces to real optimization and the algorithms are computationally equivalent to their real counterparts.

Prototypical example

Throughout this section, we will use the Lyapunov equation

\[A\cdot X + X\cdot A^{\TH} + Q = 0\mbox{,}\]

which has important applications in control theory and model order reduction, as a prototypical example. In this matrix equation, the matrices \(A,Q \in \C^{n \times n}\) are given and the objective is to compute the matrix \(X\in \C^{n \times n}\). Since the equation is linear in \(X\), there exist direct methods to compute \(X\). However, these are relatively expensive, requiring \(O(n^3)\) floating point operations (flop) to compute the solution. Instead, we will focus on a nonlinear extension of this equation to low-rank solutions \(X\), which enables us to solve large-scale Lyapunov equations.

From here on, \(X\) is represented as the matrix product \(U\cdot V\), where \(U \in \C^{n\times k}\), \(V \in \C^{k \times n}\) (\(k < n\)). In the framework of (1) and (2), we define the objective function and residual function as

\[\begin{split}f_{\textnormal{lyap}}(U,V) &:= \frac{1}{2}\|\tensor{F}_{\textnormal{lyap}}(U,V)\|_F^2\end{split}\]

and

\[\begin{split}\tensor{F}_{\textnormal{lyap}}(U,V) &:= A\cdot (U \cdot V) + (U \cdot V)\cdot A^{\TH} + Q\mbox{,}\end{split}\]

respectively.
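
For concreteness, the residual and objective function can be written as MATLAB anonymous functions acting on a cell array z = {U,V}, the same format used by the solvers discussed below. The following is a minimal sketch in which the names F_lyap and f_lyap, the sizes n and k and the matrices A and Q are chosen purely for illustration (frob is Tensorlab's Frobenius norm):

% Sketch: the low-rank Lyapunov residual and objective function in MATLAB,
% with illustrative random data and z a cell array {U,V}.
n = 10; k = 2;
A = randn(n); Q = randn(n);
F_lyap = @(z)(A*z{1})*z{2}+z{1}*(z{2}*A')+Q;   % residual F_lyap(U,V)
f_lyap = @(z)0.5*frob(F_lyap(z))^2;            % objective 0.5*||F_lyap(U,V)||_F^2
z = {randn(n,k),randn(k,n)};
f_lyap(z)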

Note

This example serves mainly as an illustration; computing a good low-rank solution to a Lyapunov equation can be quite difficult in practice due to an increasing number of local minima as \(k\) increases.

Complex derivatives

Pen & paper differentiation

Scalar functions

To solve optimization problems of the form (1), many algorithms require first-order derivatives of the real-valued objective function \(f\). For a function of real variables \(f_R:\R^n \rightarrow \R\), these derivatives can be captured in the gradient \(\frac{\partial f_R}{\partial \vec{x}}\). For example, let

\[\begin{split}f_R(\vec{x}) &:= \sin(\vec{x}^{\T} \vec{x} + 2\vec{x}^{\T}\vec{a} )\end{split}\]

for \(\vec{a},\vec{x} \in \R^n\), then its gradient is given by

\[\begin{split}\frac{d f_R}{d \vec{x}} &= \cos(\vec{x}^{\T} \vec{x} + 2\vec{x}^{\T}\vec{a})\cdot(2\vec{x} + 2\vec{a}).\end{split}\]

Things get more interesting for real-valued functions of complex variables. Let

\[\begin{split}f(\vec{z}) &:= \sin(\conj{\vec{z}}^{\T} \vec{z} + (\conj{\vec{z}}+\vec{z})^{\T}\vec{a}),\end{split}\]

where \(\vec{z} \in \C^n\) and an overline denotes the complex conjugate of its argument. It is clear that \(f(\vec{x}) \equiv f_R(\vec{x})\) for \(\vec{x} \in \R^n\) and hence \(f\) is a generalization of \(f_R\) to the complex plane. Because \(f\) is now a function of both \(\vec{z}\) and \(\conj{\vec{z}}\), the limit \(\lim_{\vec{h} \rightarrow \vec{0}}\frac{f(\vec{z}+\vec{h}) - f(\vec{z})}{\vec{h}}\) no longer exists in general and so it would seem a complex gradient does not exist either. In fact, this only tells us the function \(f\) is not analytic in \(\vec{z}\), i.e., its Taylor series in \(\vec{z}\) alone does not exist. However, it can be shown that \(f\) is analytic in \(\vec{z}\) and \(\conj{\vec{z}}\) as a whole, meaning \(f\) has a Taylor series in the variables \(\vec{z}_C := \begin{bmatrix}\vec{z}^{\T} & \conj{\vec{z}}^{\T}\end{bmatrix}^{\T}\) with a complex gradient

\[\begin{split}\frac{d f}{d \vec{z}_C} = \begin{bmatrix}\dfrac{\partial f}{\partial \vec{z}} \\[10pt] \dfrac{\partial f}{\partial \conj{\vec{z}}}\end{bmatrix},\end{split}\]

where \(\frac{\partial f}{\partial \vec{z}}\) and \(\frac{\partial f}{\partial \conj{\vec{z}}}\) are the cogradient and conjugate cogradient, respectively. The (conjugate) cogradient is a Wirtinger derivative and is to be interpreted as a partial derivative of \(f\) with respect to the variables \(\vec{z}\) (\(\conj{\vec{z}}\)), while treating the variables \(\conj{\vec{z}}\) (\(\vec{z}\)) as constant. For the example above, we have

\[\begin{split}\frac{\partial f}{\partial \vec{z}} &= \cos(\conj{\vec{z}}^{\T} \vec{z} + (\conj{\vec{z}}+\vec{z})^{\T}\vec{a})\cdot(\conj{\vec{z}} + \vec{a})\\ \frac{\partial f}{\partial \conj{\vec{z}}} &= \cos(\conj{\vec{z}}^{\T} \vec{z} + (\conj{\vec{z}}+\vec{z})^{\T}\vec{a})\cdot(\vec{z} + \vec{a}).\end{split}\]

First, we notice that \(\conj{\frac{\partial f}{\partial \vec{z}}} = {\frac{\partial f}{\partial \conj{\vec{z}}}}\), which holds for any real-valued function \(f(\vec{z},\conj{\vec{z}})\). A consequence is that any algorithm that optimizes \(f\) will only need one of the two cogradients, since the other is just its complex conjugate. Second, we notice that the cogradients evaluated in real variables \(\vec{z} \in \R^n\) are equal to the real gradient \(\frac{d f_R}{d \vec{x}}\) up to a factor 2. Taking these two observations into account, the unconstrained nonlinear optimization algorithms in Tensorlab require only the scaled conjugate cogradient

\[\vec{g}(\vec{z}) := 2\frac{\partial f}{\partial \conj{\vec{z}}} \equiv 2\conj{\frac{\partial f}{\partial \vec{z}}}\]

and can optimize \(f\) over both \(\vec{z} \in \C^n\) and \(\vec{z} \in \R^n\).
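
As a small illustration, the scaled conjugate cogradient of the example function above can be written down directly and checked against a finite difference approximation computed with deriv (discussed in the subsection on numerical differentiation below). The following sketch assumes \(\vec{a}\) is chosen as ones(size(z)):

% Sketch: the scaled conjugate cogradient of the example f above, with a chosen
% as ones(size(z)), compared to a finite difference approximation.
f = @(z)sin(z'*z+(conj(z)+z).'*ones(size(z)));
g = @(z)2*cos(z'*z+(conj(z)+z).'*ones(size(z)))*(z+ones(size(z)));
z = randn(10,1)+randn(10,1)*1i;
relerr = norm(g(z)-deriv(f,z,[],'gradient'))/norm(g(z))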

Vector-valued functions

To solve optimization problems of the form (2), first-order derivatives of the vector-valued, or more generally tensor-valued, residual function \(\tensor{F}\) are often required. For a tensor-valued function \(\tensor{F}:\R^n \rightarrow \R^{I_1 \times \cdots \times I_N}\), these derivatives can be captured in the Jacobian \(\frac{\partial \vectorize(\tensor{F})}{\partial \vec{x}^{\T}}\). For example, let

\[\begin{split}\tensor{F}_R(\vec{x}) := \begin{bmatrix}\sin(\vec{x}^{\T}\vec{x}) & \vec{x}^{\T}\vec{b} \\ \vec{x}^{\T}\vec{a} & \cos(\vec{x}^{\T}\vec{x}) \end{bmatrix}\end{split}\]

for \(\vec{a},\vec{b},\vec{x} \in \R^n\), then its Jacobian is given by

\[\begin{split}\frac{d \vectorize(\tensor{F}_R)}{d \vec{x}^{\T}} = \begin{bmatrix}\cos(\vec{x}^{\T}\vec{x})\cdot (2\vec{x}^{\T})\\ \vec{a}^{\T}\\ \vec{b}^{\T} \\ -\sin(\vec{x}^{\T}\vec{x})\cdot(2\vec{x}^{\T})\end{bmatrix}\mbox{.}\end{split}\]

But what happens when we allow \(\tensor{F}:\C^n \rightarrow \C^{I_1 \times \cdots \times I_N}\)? For example,

\[\begin{split}\tensor{F}(\vec{z}) := \begin{bmatrix}\sin(\vec{z}^{\T}\vec{z}) & \vec{z}^{\T}\vec{b} \\ \conj{\vec{z}}^{\T}\vec{a} & \cos(\conj{\vec{z}}^{\T}\vec{z}) \end{bmatrix},\end{split}\]

where \(\vec{z} \in \C^n\), could be a generalization of \(\tensor{F}_R\) to the complex plane. Following a similar reasoning as for scalar functions \(f\), we can define a Jacobian and conjugate Jacobian as \(\frac{\partial \vectorize(\tensor{F})}{\partial \vec{z}^{\T}}\) and \(\frac{\partial \vectorize(\tensor{F})}{\partial \conj{\vec{z}}^{\T}}\), respectively. For the example above, we have

\[\begin{split}\frac{\partial \vectorize(\tensor{F})}{\partial \vec{z}^{\T}} = \begin{bmatrix}\cos(\vec{z}^{\T}\vec{z})\cdot (2\vec{z}^{\T})\\ \vec{0}^{\T}\\ \vec{b}^{\T} \\ -\sin(\conj{\vec{z}}^{\T}\vec{z})\cdot \conj{\vec{z}}^{\T}\end{bmatrix}\qquad\mbox{and}\qquad\frac{\partial \vectorize(\tensor{F})}{\partial \conj{\vec{z}}^{\T}} = \begin{bmatrix}\vec{0}^{\T} \\ \vec{a}^{\T}\\ \vec{0}^{\T} \\ -\sin(\conj{\vec{z}}^{\T}\vec{z})\cdot \vec{z}^{\T} \end{bmatrix}.\end{split}\]

Because \(\tensor{F}\) maps to the complex numbers, it is no longer true that the conjugate Jacobian is the complex conjugate of the Jacobian. In general, algorithms that solve (2) require both the Jacobian and conjugate Jacobian. In some cases only one of the two Jacobians is required, e.g., when \(\tensor{F}\) is analytic in \(\vec{z}\), which implies \(\frac{\partial \vectorize(\tensor{F})}{\partial \conj{\vec{z}}^{\T}} \equiv 0\). Tensorlab offers nonlinear least squares solvers for both the general nonanalytic case and the latter analytic case.
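
The example above can also be verified numerically. The following sketch, in which \(\vec{a}\) and \(\vec{b}\) are chosen as ones(n,1), compares the pen & paper Jacobian and conjugate Jacobian with the finite difference approximation computed by deriv and its 'Jacobian-C' option (both are introduced in the next subsection):

% Sketch: verify the Jacobian and conjugate Jacobian of the example F above,
% with a = b = ones(n,1), against a finite difference approximation.
n = 10; a = ones(n,1); b = ones(n,1);
F  = @(z)[sin(z.'*z) z.'*b; z'*a cos(z'*z)];
J  = @(z)[cos(z.'*z)*(2*z.'); zeros(1,n); b.'; -sin(z'*z)*z'];
Jc = @(z)[zeros(1,n); a.'; zeros(1,n); -sin(z'*z)*z.'];
z = randn(n,1)+randn(n,1)*1i;
relerr = frob([J(z) Jc(z)]-deriv(F,z,[],'Jacobian-C'))/frob([J(z) Jc(z)])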

Numerical differentiation

Real scalar functions (with the \(i\)-trick)

The real gradient can be numerically approximated with deriv using the so-called \(i\)-trick [9]. For example, define the scalar functions

\[f_1(x) := \frac{10^{-20}}{1-10^3 x}\qquad f_2(\vec{x}) := \sin(\vec{x}^{\T}\vec{a})^3 \qquad f_3(X,Y) := \arctan(\trace(X^{\T}\cdot Y)),\]

where \(x \in \R\), \(\vec{a},\vec{x} \in \R^n\) and \(X,Y \in \R^{n \times n}\). Their first-order derivatives are

\[\begin{split}\frac{d f_1}{d x} = \frac{10^{-17}}{(1-10^3 x)^2}\qquad \frac{d f_2}{d \vec{x}} = 3\sin(\vec{x}^{\T}\vec{a})^2\cos(\vec{x}^{\T}\vec{a})\cdot \vec{a} \qquad \left\{\begin{array}{ll}\dfrac{\partial f_3}{\partial X} &= \frac{1}{1+\trace(X^{\T}\cdot Y)^2}\cdot Y\\[10pt]\dfrac{\partial f_3}{\partial Y} &= \frac{1}{1+\trace(X^{\T}\cdot Y)^2}\cdot X\end{array}\right. \mbox{.}\end{split}\]

An advantage of using the \(i\)-trick is that it can compute first-order derivatives accurately up to the order of machine precision. The disadvantages are that this requires an equivalent of about 4 (real) function evaluations per variable (compared to 2 for finite differences) and that certain requirements must be met. First, only the real gradient can be computed, meaning the gradient can only be computed where the variables are real. Second, the function must be real-valued when evaluated in real variables. Third, the function must be analytic on the complex plane. In other words, the function may not be a function of the complex conjugate of its argument. For example, the \(i\)-trick can be used to compute the gradient of the function @(x)x.'*x, but not of the function @(x)x'*x because the latter depends on \(\conj{\vec{x}}\). As a last example, note that @(x)real(x) is not analytic in \(x \in \C\) because it can be written as @(x)(x+conj(x))/2.

Choosing \(\vec{a}\) as ones(size(x)) in the example functions above, the following example uses deriv to compute the real gradient of these functions with the \(i\)-trick:

% Three test functions.
f1 = @(x)1e-20/(1-1e3*x);
f2 = @(x)sin(x.'*ones(size(x)))^3;
f3 = @(XY)atan(trace(XY{1}.'*XY{2}));

% Their first-order derivatives.
g1 = @(x)1e-17/(1-1e3*x)^2;
g2 = @(x)3*sin(x.'*ones(size(x)))^2*cos(x.'*ones(size(x)))*ones(size(x));
g3 = @(XY){1/(1+trace(XY{1}.'*XY{2})^2)*XY{2}, ...
           1/(1+trace(XY{1}.'*XY{2})^2)*XY{1}};

% Approximate the real gradient with the i-trick and compute its relative error.
x = randn;
relerr1 = abs(g1(x)-deriv(f1,x))/abs(g1(x))
x = randn(10,1);
relerr2 = norm(g2(x)-deriv(f2,x))/norm(g2(x))
XY = {randn(10),randn(10)};
relerr3 = cellfun(@(a,b)frob(a-b)/frob(a),g3(XY),deriv(f3,XY))

In Tensorlab, derivatives of scalar functions are returned in the same format as the function’s argument. Notice that f3 is a function of a cell array XY, containing the matrix \(X\) in XY{1} and the matrix \(Y\) in XY{2}. In a similar vein, the output of deriv(f3,XY) is a cell array containing the matrices \(\frac{\partial f_3}{\partial X}\) and \(\frac{\partial f_3}{\partial Y}\). Often, this allows the user to conveniently write functions as a function of a cell array of variables (containing vectors, matrices or tensors), instead of coercing all variables into one long vector which must then be disassembled into the respective variables.

Scalar functions (with finite differences)

If the conditions for the \(i\)-trick are not satisfied, or if a scaled conjugate cogradient is required, an alternative is using finite differences to approximate first-order derivatives. In both cases, the finite difference approximation can be computed using deriv(f,x,[],'gradient'). As a first example, we compute the relative error of the finite difference approximation of the real gradient of \(f_1\), \(f_2\) and \(f_3\):

% Approximate the real gradient with finite differences and compute its relative error.
x = randn;
relerr1 = abs(g1(x)-deriv(f1,x,[],'gradient'))/abs(g1(x))
x = randn(10,1);
relerr2 = norm(g2(x)-deriv(f2,x,[],'gradient'))/norm(g2(x))
XY = {randn(10),randn(10)};
relerr3 = cellfun(@(a,b)frob(a-b)/frob(a),g3(XY),deriv(f3,XY,[],'gradient'))

If \(f(\vec{z})\) is a real-valued scalar function of complex variables, deriv can compute its scaled conjugate cogradient \(\vec{g}(\vec{z})\). For example, let

\[f(\vec{z}) := \vec{a}^{\T}(\vec{z}+\conj{\vec{z}})+\log(\vec{z}^{\TH}\vec{z})\qquad \vec{g}(\vec{z}) := 2 \frac{\partial f}{\partial \conj{\vec{z}}} \equiv 2 \conj{\frac{\partial f}{\partial {\vec{z}}}} = 2\cdot\vec{a}+\frac{2}{\vec{z}^{\TH}\vec{z}} \cdot \vec{z},\]

where \(\vec{a}\in\R^n\) and \(\vec{z} \in \C^n\). Since \(f\) is real-valued and \(\vec{z}\) is complex, calling deriv(f,z) is equivalent to deriv(f,z,[],'gradient') and uses finite differences to approximate the scaled conjugate cogradient. In the following example \(\vec{a}\) is chosen as ones(size(z)) and the relative error of the finite difference approximation of the scaled conjugate cogradient \(\vec{g}(\vec{z})\) is computed:

% Approximate the scaled conjugate cogradient with finite differences
% and compute its relative error.
f = @(z)ones(size(z)).'*(z+conj(z))+log(z'*z);
g = @(z)2*ones(size(z))+2/(z'*z)*z;
z = randn(10,1)+randn(10,1)*1i;
relerr = norm(g(z)-deriv(f,z))/norm(g(z))

Note

In case of doubt, use deriv(f,z,[],'gradient') to compute the scaled conjugate cogradient. Running deriv(f,z) will attempt to use the \(i\)-trick when z is real, which can be substantially more accurate, but should only be applied when f is analytic.
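
As an illustration of this note, consider the function @(x)x'*x from the previous subsection, which depends on \(\conj{\vec{x}}\) and is therefore not analytic. At a real point the \(i\)-trick result cannot be trusted, whereas the finite difference approximation recovers the real gradient \(2\vec{x}\); a minimal sketch:

% Sketch: for the nonanalytic function @(x)x'*x, the i-trick result is not
% reliable, while finite differences approximate the real gradient 2*x.
x = randn(10,1);
g_itrick  = deriv(@(x)x'*x,x);                % attempts the i-trick (x is real)
g_findiff = deriv(@(x)x'*x,x,[],'gradient');  % finite differences
relerr_itrick  = norm(2*x-g_itrick)/norm(2*x)
relerr_findiff = norm(2*x-g_findiff)/norm(2*x)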

Real vector-valued functions (with the \(i\)-trick)

Analogously to scalar functions, the real Jacobian of tensor-valued functions can also be numerically approximated with deriv using the \(i\)-trick. Take for example the following matrix-valued function

\[\begin{split}\tensor{F}_1(\vec{x}) := \begin{bmatrix}\log(\vec{x}^{\T}\vec{x}) & \vec{0}\\ 2\vec{a}^{\T}\vec{x} & \sin(\vec{x}^{\T}\vec{x})\end{bmatrix},\end{split}\]

where \(\vec{a},\vec{x} \in \R^n\), and its real Jacobian

\[\begin{split}J_1(\vec{x}) := \frac{d \vectorize(\tensor{F}_1)}{d \vec{x}^{\T}} &= 2\begin{bmatrix}\frac{1}{\vec{x}^{\T}\vec{x}}\cdot\vec{x}^{\T} \\ \vec{a}^{\T} \\ \vec{0}^{\T} \\ \cos(\vec{x}^{\T}\vec{x})\cdot\vec{x}^{\T}\end{bmatrix}.\end{split}\]

Set \(\vec{a}\) equal to ones(size(x)). Since each entry in \(\tensor{F}_1(\vec{x})\) is real-valued and not a function of \(\conj{\vec{x}}\), we can approximate the real Jacobian \(J_1(\vec{x})\) with the \(i\)-trick:

% Approximate the Jacobian of a tensor-valued function with the i-trick
% and compute its relative error.
F1 = @(x)[log(x.'*x) 0; 2*ones(size(x)).'*x sin(x.'*x)];
J1 = @(x)2*[1/(x.'*x)*x.'; ones(size(x)).'; zeros(size(x)).'; cos(x.'*x)*x.'];
x = randn(10,1);
relerr1 = frob(J1(x)-deriv(F1,x))/frob(J1(x))

Analytic vector-valued functions (with finite differences)

The Jacobian of an analytic tensor-valued function \(\tensor{F}(\vec{z})\) can be approximated with finite differences by calling deriv(F,z,[],'Jacobian'). Functions that are analytic when their argument is real may no longer be analytic when their argument is complex. For example, \(\tensor{F}(\vec{z}) := \begin{bmatrix}\vec{z}^{\TH}\vec{z} & \realpart{\vec{z}}^{\T}\vec{z}\end{bmatrix}\) is not analytic in \(\vec{z} \in \C^n\) because it depends on \(\conj{\vec{z}}\), but is analytic when \(\vec{z} \in \R^n\). An example of a function that is analytic for both real and complex \(\vec{z}\) is the function \(\tensor{F}_1(\vec{z})\). The following two examples compute the relative error of the finite differences approximation of the Jacobian \(J_1(\vec{x})\) in a real vector \(\vec{x}\):

% Approximate the Jacobian of an analytic tensor-valued function
% with finite differences and compute its relative error.
x = randn(10,1);
relerr1 = frob(J1(x)-deriv(F1,x,[],'Jacobian'))/frob(J1(x))

and the relative error of the finite differences approximation of \(J_1(\vec{z})\) in a complex vector \(\vec{z}\):

% Approximate the Jacobian of an analytic tensor-valued function
% with finite differences and compute its relative error.
z = randn(10,1)+randn(10,1)*1i;
relerr1 = frob(J1(z)-deriv(F1,z,[],'Jacobian'))/frob(J1(z))

Nonanalytic vector-valued functions (with finite differences)

In general, a tensor-valued function may be a function of both its argument and its complex conjugate. The matrix-valued function

\[\begin{split}\tensor{F}_2(X,Y) := \begin{bmatrix}\log(\trace(X^{\TH}\cdot Y)) & \vec{0}\\ \vec{a}^{\T}(X+\conj{X})\vec{a} & \vec{a}^{\T}(Y-\conj{Y})\vec{a}\end{bmatrix},\end{split}\]

where \(\vec{a}\in \R^n\) and \(X,Y \in \C^{n \times n}\), is an example of such a nonanalytic function because it depends on \(X\), \(Y\) and \(\conj{X}\), \(\conj{Y}\). Its Jacobian and conjugate Jacobian are given by

\[\begin{split}J_2(X,Y) := \frac{\partial \vectorize(\tensor{F}_2)}{\partial \begin{bmatrix}\vectorize(X)^{\T} & \vectorize(Y)^{\T}\end{bmatrix}} &= \begin{bmatrix}0 & \frac{1}{\trace(X^{\TH}\cdot Y)}\cdot \vectorize(\conj{X})^{\T} \\ (\vec{a} \otimes \vec{a})^{\T} & 0\\ 0 & 0\\ 0 & (\vec{a} \otimes \vec{a})^{\T}\end{bmatrix}\\ J_2^{c}(X,Y) := \frac{\partial \vectorize(\tensor{F}_2)}{\partial \begin{bmatrix}\vectorize(\conj{X})^{\T} & \vectorize(\conj{Y})^{\T}\end{bmatrix}} &= \begin{bmatrix}\frac{1}{\trace(X^{\TH}\cdot Y)}\cdot \vectorize(Y)^{\T} & 0 \\ (\vec{a} \otimes \vec{a})^{\T} & 0\\ 0 & 0\\ 0 & -(\vec{a} \otimes \vec{a})^{\T}\end{bmatrix},\end{split}\]

respectively. For a nonanalytic tensor-valued function \(\tensor{F}(\vec{z})\), deriv(F,z,[],'Jacobian-C') computes a finite differences approximation of the complex Jacobian

\[\begin{split}\begin{bmatrix}\dfrac{\partial \vectorize(\tensor{F})}{\partial \vec{z}^{\T}} & \dfrac{\partial \vectorize(\tensor{F})}{\partial \conj{\vec{z}}^{\T}}\end{bmatrix},\end{split}\]

comprising both the Jacobian and conjugate Jacobian. The complex Jacobian of \(\tensor{F}_2(X,Y)\) is the matrix \(\begin{bmatrix}J_2(X,Y) & J_2^c(X,Y)\end{bmatrix}\). In the following example \(\vec{a}\) is equal to ones(size(z{1},1),1) and the relative error of the complex Jacobian’s finite differences approximation is computed:

% Approximate the complex Jacobian of a nonanalytic tensor-valued function
% with finite differences and compute its relative error.
F2 = @(z)[log(trace(z{1}'*z{2})) 0; ...
          sum(sum(z{1}+conj(z{1}))) sum(sum(z{2}-conj(z{2})))];
J2 = @(z)[zeros(1,numel(z{1})) 1/trace(z{1}'*z{2})*reshape(conj(z{1}),1,[]) ...
          1/trace(z{1}'*z{2})*reshape(z{2},1,[]) zeros(1,numel(z{2})); ...
          ones(1,numel(z{1})) zeros(1,numel(z{1})) ...
          ones(1,numel(z{2})) zeros(1,numel(z{2})); ...
          zeros(1,2*numel(z{1})) zeros(1,2*numel(z{2})); ...
          zeros(1,numel(z{1})) ones(1,numel(z{1})) ...
          zeros(1,numel(z{1})) -ones(1,numel(z{1}))];
z = {randn(10)+randn(10)*1i,randn(10)+randn(10)*1i};
relerr2 = frob(J2(z)-deriv(F2,z,[],'Jacobian-C'))/frob(J2(z))

Nonlinear least squares

Algorithms

Tensorlab offers three algorithms for unconstrained nonlinear least squares: nls_gndl, nls_gncgs and nls_lm. The first is Gauss–Newton with a dogleg trust region strategy, the second is Gauss–Newton with CG-Steihaug for solving the trust region subproblem and the last is Levenberg–Marquardt. A bound-constrained method, nlsb_gndl, is also included; it is a projected Gauss–Newton method with a dogleg trust region. All algorithms are applicable to both analytic and nonanalytic residual functions and offer various ways of exploiting the structure available in the residual function’s (complex) Jacobian.

With numerical differentiation

The complex optimization algorithms that solve (2) require the Jacobian of the residual function \(\tensor{F}(\vec{z},\conj{\vec{z}})\), which will be \(\tensor{F}_{\textnormal{lyap}}(U,V)\) for the remainder of this section (cf. Section Complex optimization). The second argument dF of the nonlinear least squares optimization algorithms nls_gndl, nls_gncgs and nls_lm specifies how the Jacobian should be computed. To approximate the Jacobian with finite differences, set dF equal to 'Jacobian' or 'Jacobian-C'.

The first case, 'Jacobian', corresponds to approximating the Jacobian \(\frac{\partial \vectorize(\tensor{F})}{\partial \vec{z}^{\T}}\), assuming \(\tensor{F}\) is analytic in \(\vec{z}\). The second case, 'Jacobian-C', corresponds to approximating the complex Jacobian consisting of the Jacobian \(\frac{\partial \vectorize(\tensor{F})}{\partial \vec{z}^{\T}}\) and conjugate Jacobian \(\frac{\partial \vectorize(\tensor{F})}{\partial \conj{\vec{z}}^{\T}}\), where \(\vec{z} \in \C^n\). Since \(\tensor{F}_{\textnormal{lyap}}\) does not depend on \(\conj{U}\) or \(\conj{V}\), we may implement a nonlinear least squares solver for the low-rank solution of the Lyapunov equation as

function z = lyap_nls_deriv(A,Q,z0)

F = @(z)(A*z{1})*z{2}+z{1}*(z{2}*A')+Q;
z = nls_gndl(F,'Jacobian',z0);

end

The residual function F(z) is matrix-valued and its argument is a cell array z, consisting of the two matrices \(U\) and \(V\). The output of the optimization algorithm, in this case nls_gndl, is a cell array of the same format as the argument of the residual function F(z).
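
A hypothetical usage sketch of this solver is shown below; the problem size, the way A and Q are generated and the initial guess z0 are chosen purely for illustration:

% Hypothetical usage sketch: a small Lyapunov equation and a rank-k initial guess.
n = 20; k = 2;
A = randn(n)-n*eye(n);               % shifted so that A is (very likely) stable
C = randn(n,k); Q = C*C';            % a rank-k right-hand side
z0 = {randn(n,k),randn(k,n)};        % initial guess for {U,V}
z = lyap_nls_deriv(A,Q,z0);
relres = frob(A*(z{1}*z{2})+(z{1}*z{2})*A'+Q)/frob(Q)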

With the Jacobian

Using the property \(\vectorize(A\cdot X\cdot B) \equiv (B^{\T} \otimes A)\cdot \vectorize(X)\), it is easy to verify that the Jacobian of \(\tensor{F}_{\textnormal{lyap}}(U,V)\) is given by

(4)\[\begin{split}\frac{\partial \vectorize(\tensor{F}_{\textnormal{lyap}})}{\partial \begin{bmatrix}\vectorize(U)^{\T} & \vectorize(V)^{\T}\end{bmatrix}} = \begin{bmatrix}(V^{\T} \otimes A)+(\conj{A}\cdot V^{\T}\otimes\eye) & (\eye \otimes A\cdot U)+(\conj{A}\otimes U)\end{bmatrix}.\end{split}\]

To supply the Jacobian to the optimization algorithm, set the field dF.dz as the function handle of the function that computes the Jacobian at a given point \(\vec{z}\). For problems for which the residual function \(\tensor{F}\) depends on both \(\vec{z}\) and \(\conj{\vec{z}}\), the complex Jacobian can be supplied with the field dF.dzc. See Section Complex derivatives or the help page of the selected algorithm for more information. Applied to the Lyapunov equation, we have

function z = lyap_nls_J(A,Q,z0)

dF.dz = @J;
z = nls_gndl(@F,dF,z0);

    function F = F(z)
        U = z{1}; V = z{2};
        F = (A*U)*V+U*(V*A')+Q;
    end

    function J = J(z)
        U = z{1}; V = z{2}; I = eye(size(A));
        J = [kron(V.',A)+kron(conj(A)*V.',I) kron(I,A*U)+kron(conj(A),U)];
    end

end

With the Jacobian’s Gramian

When the residual function \(\tensor{F} : \C^n \rightarrow \C^{I_1 \times \cdots \times I_N}\) is analytic [1] in \(\vec{z}\) (i.e., it is not a function of \(\conj{\vec{z}}\)) and the number of residuals \(\prod I_n\) is large compared to the number of variables \(n\), it may be beneficial to compute the Jacobian’s Gramian \(J^{\TH} J\) instead of the Jacobian \(J := \frac{\partial \vectorize(\tensor{F})}{\partial \vec{z}^{\T}}\) itself. This way, each iteration of the nonlinear least squares algorithm no longer requires computing the (pseudo-)inverse \(J^{\dagger}\), but rather the less expensive (pseudo-)inverse \((J^{\TH} J)^{\dagger}\). In the case of the Lyapunov equation, this can lead to a significantly more efficient method if the rank \(k\) of the solution is small with respect to its order \(n\). Along with the Jacobian’s Gramian, the objective function \(f := \frac{1}{2}\|\tensor{F}\|_F^2\) and its scaled conjugate cogradient \(2\frac{\partial f}{\partial \conj{\vec{z}}} \equiv J^{\TH}\cdot \vectorize(\tensor{F})\) are also required. Skipping the derivation of the gradient and Jacobian’s Gramian, the implementation could look like

function z = lyap_nls_JHJ(A,Q,z0)

AHA = A'*A;
dF.JHJ = @JHJ;
dF.JHF = @grad;
z = nls_gndl(@f,dF,z0);

    function f = f(z)
        U = z{1}; V = z{2};
        F = (A*U)*V+U*(V*A')+Q;
        f = 0.5*(F(:)'*F(:));
    end

    function g = grad(z)
        U = z{1}; V = z{2};
        gU = AHA*(U*(V*V'))+A'*(U*(V*A'*V'))+A'*(Q*V')+ ...
             A*(U*(V*A*V'))+U*(V*AHA*V')+Q*(A*V');
        gV = (U'*AHA*U)*V+((U'*A'*U)*V)*A'+(U'*A')*Q+ ...
             ((U'*A*U)*V)*A+((U'*U)*V)*AHA+(U'*Q)*A;
        g = {gU,gV};
    end

    function JHJ = JHJ(z)
        U = z{1}; V = z{2}; I = eye(size(A));
        tmpa = kron(conj(V*A'*V'),A); tmpb = kron(conj(A),U'*A'*U);
        JHJ11 = kron(conj(V*V'),AHA)+kron(conj(V*AHA*V'),I)+tmpa+tmpa';
        JHJ22 = kron(I,U'*AHA*U)+kron(conj(AHA),U'*U)+tmpb+tmpb';
        JHJ12 = kron(conj(V),AHA*U)+kron(conj(V*A),A'*U)+ ...
                kron(conj(V*A'),A*U)+kron(conj(V*AHA),U);
        JHJ = [JHJ11 JHJ12; JHJ12' JHJ22];
    end

end

By default, the algorithm nls_gndl uses the Moore–Penrose pseudo-inverse of either \(J\) or \(J^{\TH} J\) to compute a new descent direction. However, if it is known that these matrices always have full rank, a more efficient least squares inverse can be computed. To do so, use

% Compute a more efficient least squares step instead of using the pseudoinverse.
options.JHasFullRank = true;
z = nls_gndl(@f,dF,z0,options);

The other nonlinear least squares algorithms nls_gncgs and nls_lm use a different approach for calculating the descent direction and do not have such an option.

With an inexact step

The most computationally intensive part of most nonlinear least squares problems is computing the next descent direction, which involves inverting either the Jacobian \(J := \frac{\partial \vectorize(\tensor{F})}{\partial \vec{z}^{\T}}\) or its Gramian \(J^{\TH} J\) in the case of an analytic residual function. With iterative solvers such as preconditioned conjugate gradient (PCG) and LSQR, the descent direction can be approximated using only matrix-vector products. The resulting descent directions are said to be inexact. Many problems exhibit some structure in the Jacobian which can be exploited in its matrix-vector product, allowing for an efficient computation of an inexact step. Concretely, the user has the choice of supplying the matrix-vector products \(J\cdot \vec{x}\) and \(J^{\TH} \cdot \vec{y}\), or the single matrix-vector product \((J^{\TH} J) \cdot \vec{x}\). An implementation of an inexact nonlinear least squares solver using the former method can be

function z = lyap_nls_Jx(A,Q,z0)

dF.dzx = @Jx;
z = nls_gndl(@F,dF,z0);

    function F = F(z)
        U = z{1}; V = z{2};
        F = (A*U)*V+U*(V*A')+Q;
    end

    function b = Jx(z,x,transp)
        U = z{1}; V = z{2};
        switch transp
            case 'notransp' % b = J*x
                Xu = reshape(x(1:numel(U)),size(U));
                Xv = reshape(x(numel(U)+1:end),size(V));
                b = (A*Xu)*V+Xu*(V*A')+(A*U)*Xv+U*(Xv*A');
                b = b(:);
            case 'transp'   % b = J'*x
                X = reshape(x,size(A));
                Bu = A'*(X*V')+X*(A*V');
                Bv = (U'*A')*X+(U'*X)*A;
                b = [Bu(:); Bv(:)];
        end
    end

end

where the Kronecker structure of the Jacobian (4) is exploited by reducing the computations to matrix-matrix products. Under suitable conditions on \(A\) and \(Q\), this implementation can achieve a complexity of \(O(nk^2)\), where \(n\) is the order of the solution \(X = U\cdot V\) and \(k\) is its rank.

In the case of a nonanalytic residual function \(\tensor{F}(\vec{z},\conj{\vec{z}})\), computing an inexact step requires matrix-vector products \(J\cdot \vec{x}\), \(J^{\TH}\cdot \vec{y}\), \(J_c \cdot \vec{x}\) and \(J_c^{\TH}\cdot \vec{y}\), where \(J := \frac{\partial \vectorize(\tensor{F})}{\partial \vec{z}^{\T}}\) and \(J_c := \frac{\partial \vectorize(\tensor{F})}{\partial \conj{\vec{z}}^{\T}}\) are the residual function’s Jacobian and conjugate Jacobian, respectively. For more information on how to implement these matrix-vector products, read the help information of the desired (2) solver.

Setting the options

The parameters of the selected optimization algorithm can be set by supplying the method with an options structure, e.g.,

% Set algorithm tolerances.
options.TolFun = 1e-12;
options.TolX = 1e-6;
options.MaxIter = 100;
dF.dz = @J;
z = nls_gndl(@F,dF,z0,options);

Note

It is important to note that since the objective function is the square of a residual norm, the objective function tolerance options.TolFun can be set as small as \(10^{-32}\) for a given machine epsilon of \(10^{-16}\). See the help information on the selected algorithm for more details.

Viewing the algorithm output

The optimization algorithms return additional information pertaining to the algorithm in their second output argument. For example, the algorithms keep track of the objective function value in output.fval and record the circumstances under which the algorithm terminated in output.info. As an example, the norm of the residual function of each iterate can be plotted with

% Plot each iterate's objective function value.
dF.dz = @J;
[z,output] = nls_gndl(@F,dF,z0);
semilogy(0:output.iterations,sqrt(2*output.fval));

Since the objective function is \(\frac{1}{2}\|\tensor{F}\|_F^2\), we plot sqrt(2*output.fval) to see the norm \(\|\tensor{F}\|_F\). See the help information on the selected algorithm for more details.

Unconstrained nonlinear optimization

Algorithms

Tensorlab offers three algorithms for unconstrained complex optimization: minf_lbfgs, minf_lbfgsdl and minf_ncg. The first two are limited-memory BFGS with Moré–Thuente line search and dogleg trust region, respectively, and the last is nonlinear conjugate gradient with Moré–Thuente line search. Instead of the supplied Moré–Thuente line search, the user may optionally supply a custom line search method. See the help information for details.

With numerical differentiation

The complex optimization algorithms that solve (1) require the (scaled conjugate co-)gradient of the objective function \(f(\vec{z},\conj{\vec{z}})\), which will be \(f_{\textnormal{lyap}}(\vec{z},\conj{\vec{z}})\) for the remainder of this section (cf. Section Complex optimization). The second argument of the unconstrained nonlinear minimization algorithms minf_lbfgs, minf_lbfgsdl and minf_ncg specifies how the gradient should be computed. To approximate the (scaled conjugate co-)gradient with finite differences, set the second argument equal to the empty matrix []. An implementation for the Lyapunov equation could look like

function z = lyap_minf_deriv(A,Q,z0)

f = @(z)frob((A*z{1})*z{2}+z{1}*(z{2}*A')+Q);
z = minf_lbfgs(f,[],z0);

end

As with the nonlinear least squares algorithms, the argument of the objective function is a cell array z, consisting of the two matrices \(U\) and \(V\). Likewise, the output of the optimization algorithm, in this case minf_lbfgs, is a cell array of the same format as the argument of the objective function f(z).

With the gradient

If an expression for the (scaled conjugate co-)gradient is available, it can be supplied to the optimization algorithm in the second argument. For the Lyapunov equation, the implementation could look like

function z = lyap_minf_grad(A,Q,z0)

AHA = A'*A;
z = minf_lbfgs(@f,@grad,z0);

    function f = f(z)
        U = z{1}; V = z{2};
        F = (A*U)*V+U*(V*A')+Q;
        f = 0.5*(F(:)'*F(:));
    end

    function g = grad(z)
        U = z{1}; V = z{2};
        gU = AHA*(U*(V*V'))+A'*(U*(V*A'*V'))+A'*(Q*V')+ ...
             A*(U*(V*A*V'))+U*(V*AHA*V')+Q*(A*V');
        gV = (U'*AHA*U)*V+((U'*A'*U)*V)*A'+(U'*A')*Q+ ...
             ((U'*A*U)*V)*A+((U'*U)*V)*AHA+(U'*Q)*A;
        g = {gU,gV};
    end

end

The function grad(z) computes the scaled conjugate cogradient \(2\frac{\partial f_{\textnormal{lyap}}}{\partial \conj{\vec{z}}}\), which coincides with the real gradient for \(\vec{z} \in \R^n\). See Section Complex derivatives for more information on complex derivatives.
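
As a sanity check, the analytic scaled conjugate cogradient can be compared with a finite difference approximation obtained with deriv; the following sketch reuses the expressions from grad(z) above with small random complex matrices chosen for illustration:

% Sketch: compare the analytic scaled conjugate cogradient of f_lyap with a
% finite difference approximation, using small random complex data.
n = 8; k = 2;
A = randn(n)+randn(n)*1i; Q = randn(n)+randn(n)*1i; AHA = A'*A;
f = @(z)0.5*frob((A*z{1})*z{2}+z{1}*(z{2}*A')+Q)^2;
grad = @(z){AHA*(z{1}*(z{2}*z{2}'))+A'*(z{1}*(z{2}*A'*z{2}'))+A'*(Q*z{2}')+ ...
            A*(z{1}*(z{2}*A*z{2}'))+z{1}*(z{2}*AHA*z{2}')+Q*(A*z{2}'), ...
            (z{1}'*AHA*z{1})*z{2}+((z{1}'*A'*z{1})*z{2})*A'+(z{1}'*A')*Q+ ...
            ((z{1}'*A*z{1})*z{2})*A+((z{1}'*z{1})*z{2})*AHA+(z{1}'*Q)*A};
z = {randn(n,k)+randn(n,k)*1i,randn(k,n)+randn(k,n)*1i};
relerr = cellfun(@(a,b)frob(a-b)/frob(a),grad(z),deriv(f,z,[],'gradient'))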

Note

Note that the gradient grad(z) is returned in the same format as the solution z, i.e., as a cell array containing matrices of the same size as \(U\) and \(V\). However, the gradient may also be returned as a vector if this is more convenient for the user. In that case, the scaled conjugate cogradient should be of the format \(2\frac{\partial f_{\textnormal{lyap}}}{\partial \conj{\vec{z}}}\) where \(\vec{z} := \scriptsize\begin{bmatrix}\vectorize(U)^{\T} & \vectorize(V)^{\T}\end{bmatrix}^{\T}\).

Setting the options

The parameters of the selected optimization algorithm can be set by supplying the method with an options structure, e.g.,

% Set algorithm tolerances.
options.TolFun = 1e-6;
options.TolX = 1e-6;
options.MaxIter = 100;
z = minf_lbfgs(@f,@grad,z0,options);

See the help information on the selected algorithm for more details.

Viewing the algorithm output

The optimization algorithms return additional information pertaining to the algorithm in their second output argument. For example, the algorithms keep track of the objective function value in output.fval and record the circumstances under which the algorithm terminated in output.info. As an example, the objective function value of each iterate can be plotted with

% Plot each iterate's objective function value.
[z,output] = minf_lbfgs(@f,@grad,z0);
semilogy(0:output.iterations,output.fval);

See the help information on the selected algorithm for more details.

[1] In the more general case of a nonanalytic residual function, the structure in its complex Jacobian can be exploited by computing an inexact step. See the paragraph With an inexact step for more details.