Spectral graph theory: deriving effective resistance

There are some themes that you hear about here and there for years. You know it’s cool, you know it’s probably profound, but you’ve never really taken it seriously. Then, one day, you decide to take a deeper look at it, and you find yourself sucked into a black hole of mathematical revelations. To me, this is spectral graph theory.

When I first heard about the whole notion of “effective resistance” I thought, cool, complicated algebra tricks and something cute in return. But actually, it turns out that viewing an undirected graph as a resistor network, and thinking about the effective resistance between pairs of its nodes, arises in a number of applications, such as sparsification or centrality measures. So, it’s worth buckling down, looking long and hard at the equations, and finally making them make sense.

The main paper I am using as a reference is

[1] Daniel A. Spielman, Nikhil Srivastava. Graph sparsification by effective resistances, 2009.

However, truthfully, a solid elementary circuits textbook will likely be more detailed.

What is effective resistance?

There are probably multiple answers to this question.

  1. Effective resistance is a way of summarizing a graph’s “inverse flow rate”. For example, consider a network of water pipes connecting various houses. Then, the flow of water coming from a source to my house is not only dependent on the size of a single pipe, but the size of all the pipes in all the paths from the source to my house. So, it doesn’t make sense to only look at the size of the pipe connecting the source directly to my house, but a “summary of pipe sizes”, over all the other pipes. This is one intuitive way to think of it.
  2. Effective resistance gives the sampling weights for spectrally-faithful graph sparsification. This is the main idea behind [1], where the effective resistances are used as sampling probabilities to produce a sparser matrix approximation of a graph Laplacian. Specifically, an approximation $L’$ computed this way satisfies
    $$
    (1-\epsilon) x^TL’x \leq x^TLx \leq (1+\epsilon)x^TL’x
    $$
    for all $x$, while keeping only a small fraction of the edges of the original graph.
  3. Finally, most formally, the effective resistance $R_{i,j}$ is the actual resistance, in ohms, that represents the action of a resistor network with two exposed terminals, at nodes $i$ and $j$. The rest of the nodes represent internal junctions, and between each pair of junctions $i,j$ where $A_{i,j} > 0$ lies a resistor of resistance $1/A_{i,j}$.
  4. Effective resistance has close connections with other graph concepts, like mean hitting times and Kemeny constants. However, whether we choose to interpret this close connection as some cool electrical phenomenon, or merely as the fact that they are all closely related to $L^\dagger$, is… as the saying goes, six of one, half a dozen of the other.

I’d guess there are even more answers to this question, but I’ll leave that to the comments section. Now let’s do some math.

Don’t drink and derive, but do derive some effective resistances

Problem setup

Suppose I have an undirected graph with $n$ nodes and an edge set $\mathcal E$ containing pairs of nodes $(i,j)$, where $i,j \in \{1,…,n\}$. The weights of the edges are stored in a symmetric adjacency matrix

$$ A_{i,j} \geq 0, \quad  i\neq j, \qquad A_{ii} = 0.$$

Moreover, $A_{i,j}\neq 0$ only if $(i,j) \in \mathcal E$. We also have the degree matrix $D = \mathrm{diag}(A\mathbf 1)$, and the Laplacian matrix $L = D-A$, which is symmetric and positive semidefinite.

Electrical networks setup

Now suppose that this graph actually models an electrical network, where nodes indicate junctions (e.g. where wires cross) and the weight $A_{i,j} = 1/r_{i,j}$ where $r_{i,j}$ is the resistance (in Ohms) of a physical resistor placed connecting junctions $i$ and $j$. Assume that at most one resistor appears for each pair $(i,j)$.

 

Now, pick any two nodes, say $k$ and $l$. We will ground the junction at $l$ and put a 1-volt power source on junction $k$. This will cause current to flow in the graph, from the source node $k$ to the terminus $l$. The effective resistance $R_{k,l}$ is defined as the resistance that summarizes this entire network into one resistor, with ends $k$ and $l$.

Specifically, from Kirchhoff’s current law, for every node other than $k$ or $l$, the sum of the currents (a signed quantity) going into the node should be 0; flow is conserved. For nodes $k$ and $l$, the cumulative current coming out of $k$ is the same as that going into $l$. Call this last quantity $\bar I_{k,l}$, in amps. Then, since the applied voltage is 1 V, Ohm’s law gives $R_{k,l} = \frac{1}{\bar I_{k,l}}$.

The main claim

A (now well known, and frequently used) fact:

$$
R_{i,j} = L^\dagger_{i,i} + L^\dagger_{j,j} \,-\, 2L^\dagger_{i,j}\tag{$\star$}
$$

where $L^\dagger$ is the Moore-Penrose pseudoinverse of the graph Laplacian $L$.

Example. Consider the following circuit.

 

Now, what is the effective resistance $R_{1,2}$? Using elementary circuit knowledge, we would see it involves two resistors ($r_{1,3}$, $r_{2,3}$) in series and then $r_{1,2}$ in parallel, so
\[
R_{1,2} = \frac{1}{\frac{1}{r_{13} + r_{23}} + \frac{1}{r_{12}}} = \frac{r_{12}(r_{13}+r_{23})}{r_{12} + r_{13} + r_{23}}
\]
Meanwhile, in matrix form,
\[
A = \begin{bmatrix} 0 & \frac{1}{r_{12}}  & \frac{1}{r_{13}}\\  \frac{1}{r_{12}}  & 0 & \frac{1}{r_{23}}\\  \frac{1}{r_{13}}  & \frac{1}{r_{23}} & 0 \end{bmatrix}, \qquad L =  \begin{bmatrix} \frac{1}{r_{12}}  + \frac{1}{r_{13}} & -\frac{1}{r_{12}}  & -\frac{1}{r_{13}}\\  -\frac{1}{r_{12}}  &\frac{1}{r_{12}}  + \frac{1}{r_{23}} & -\frac{1}{r_{23}}\\  -\frac{1}{r_{13}}  & -\frac{1}{r_{23}} & \frac{1}{r_{13}}  +\frac{1}{r_{23}} \end{bmatrix}.
\]
Using MATLAB’s handy dandy symbolic toolbox,
\[
L^\dagger =
\frac{1}{9(r_{12} + r_{13} + r_{23})}
\begin{bmatrix}
4r_{12}r_{13} + r_{12}r_{23} + r_{13}r_{23} & -2r_{12}(r_{13}+r_{23}) +r_{13}r_{23} & -2r_{13}(r_{12}+r_{23}) + r_{12}r_{23}\\
-2r_{12}(r_{13}+r_{23}) +r_{13}r_{23} & r_{12}r_{13} + 4r_{12}r_{23} + r_{13}r_{23}  & -2r_{23}(r_{12}+r_{13}) + r_{12}r_{13} \\
-2r_{13}(r_{12}+r_{23}) + r_{12}r_{23} & -2r_{23}(r_{12}+r_{13}) + r_{12}r_{13}  & r_{12}r_{13} + r_{12}r_{23} + 4r_{13}r_{23}
\end{bmatrix}
\]
and one can manually check that, for distinct $k$, $l$, $j$,
\[
(e_k-e_l)^T L^\dagger (e_k-e_l)=
\frac{ r_{k,l} ( r_{k,j} + r_{l,j} )}{r_{k,l} + r_{k,j} + r_{l,j}}.
\]
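
As a quick numerical sanity check of ($\star$) on this example, here is a minimal sketch assuming NumPy is available; the resistor values are arbitrary choices, not anything special.

```python
# Numerically compare the pseudoinverse formula (*) with the series/parallel
# circuit formula on the 3-node triangle above.
import numpy as np

r12, r13, r23 = 2.0, 3.0, 5.0
A = np.array([[0, 1/r12, 1/r13],
              [1/r12, 0, 1/r23],
              [1/r13, 1/r23, 0]])
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian
Lp = np.linalg.pinv(L)                  # Moore-Penrose pseudoinverse

# Effective resistance from (*): R_12 = Lp_11 + Lp_22 - 2 Lp_12
R12_star = Lp[0, 0] + Lp[1, 1] - 2 * Lp[0, 1]

# Series/parallel formula for the triangle
R12_circuit = 1 / (1 / (r13 + r23) + 1 / r12)

print(R12_star, R12_circuit)            # both equal 1.6 for these values
```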

Interpretation as a Euclidean distance matrix. Note that since $L$ is symmetric positive semidefinite, so is $L^\dagger$. Therefore, we can think of $L^\dagger$ as the Gram matrix of points $p_1,…,p_n$ where $L^\dagger_{i,j} = p_i^Tp_j$. Moreover, since by construction every row (and column) of $L$ sums to zero, $L\mathbf 1 = 0$ and therefore $L^\dagger \mathbf 1 = 0$. That is to say, the points are centered, i.e. $\sum_{i=1}^n p_i = 0$. Then, the Euclidean distance matrix corresponding to these points $p_1,…,p_n$ is exactly $R$, i.e. $R_{i,j} = \|p_i-p_j\|_2^2$.

 

The main proof

Now, let’s go back and prove ($\star$).

First, we should interpret $A_{i,j} = \frac{1}{r_{i,j}}$ as the edge conductance, and write Ohm’s law as
\[
\text{current from $i$ to $j$ } = I_{i,j} = \frac{v_i-v_j}{r_{i,j}} = v^T(e_i-e_j)A_{i,j}.
\]
Next, we apply 1V to node $k$ and ground node $l$. Then, by KCL, the cumulative current flowing out of node $j$ is
\[
\bar I_j = \sum_{i}I_{j,i} = \begin{cases}
0 & j \not\in \{k,l\}\\
\frac{1}{R_{k,l}} & j= k\\
-\frac{1}{R_{k,l}} &j =l\\
\end{cases}
\]
so, collecting the currents into a matrix $I$ with entries $I_{i,j}$, we may write this succinctly as
\[
I\mathbf 1 \,-\, \frac{1}{R_{k,l}} (e_k-e_l) = 0.
\]
Filling in the terms for $I$, this means
\begin{eqnarray*}
0 &=& \sum_j v^T(e_i-e_j)A_{i,j}\,-\, \frac{1}{R_{k,l}} (e_k-e_l)_i \\
\Rightarrow 0 &=& (D-A)v\, -\,\frac{1}{R_{k,l}} (e_k-e_l)\\
\iff R_{k,l}L v &=& e_k-e_l
\end{eqnarray*}
Applying $(e_k-e_l)^TL^\dagger$ to both sides gives
\[
R_{k,l}(e_k-e_l)^TL^\dagger L v = (e_k-e_l)^TL^\dagger(e_k-e_l)
\]
Next, I claim that $L^\dagger L = I - \frac{1}{n}\mathbf 1 \mathbf 1^T$ (assuming the graph is connected, so that $\mathbf 1$ spans the nullspace of $L$). To see this, note that $\mathbf 1$ is orthogonal to all the nontrivial eigenvectors of $L$. So, given the eigenvalue decomposition
\[
L = \sum_{i=2}^n \lambda_i u_iu_i^T,
\]
then
$$
L+\frac{1}{n}\mathbf 1\mathbf 1^T = \sum_{i=2}^n \lambda_i u_iu_i^T+\frac{1}{n}\mathbf 1\mathbf 1^T,\qquad (L+\frac{1}{n}\mathbf 1\mathbf 1^T)^{-1} = \sum_{i=2}^n \frac{1}{\lambda_i} u_iu_i^T + \frac{1}{n} \mathbf 1 \mathbf 1^T
$$
so
\[
(L+\frac{1}{n}\mathbf 1\mathbf 1^T)^{-1} = L^\dagger + \frac{1}{n} \mathbf 1 \mathbf 1^T
\]
and
\[
I = (L+\frac{1}{n}\mathbf 1\mathbf 1^T)(L+\frac{1}{n}\mathbf 1\mathbf 1^T)^{-1} = (L+\frac{1}{n}\mathbf 1\mathbf 1^T) (L^\dagger + \frac{1}{n}\mathbf 1\mathbf 1^T) = LL^\dagger + \frac{1}{n} \mathbf 1\mathbf 1^T.
\]
Therefore, recalling that $v_k = 1$, $v_l = 0$, and $(e_k-e_l)^T\mathbf 1 = 0$,
\[
(e_k-e_l)^TL^\dagger(e_k-e_l) = R_{k,l}(e_k-e_l)^T (I \,-\, \frac{1}{n}\mathbf 1 \mathbf 1^T) v = R_{k,l}(v_k-v_l) = R_{k,l}.
\]
Actually, there isn’t much needed in terms of minimizing energies or anything fancy, just good old fashioned Kirchhoff’s laws!

Finally, here is a summarizing image created by DALLE via ChatGPT4. The spelling could be improved, but otherwise, hopefully it gets the message across!

Tangoing with nonsymmetric contraction matrices: Understanding Gelfand’s formula and its implications on linear convergence rates

I have finally reached the point in my life when I have the misfortune of meeting not one, but two nonsymmetric “contraction” matrices, and have had to try to understand why, even after all the hard work of proving that a matrix’s largest eigenvalue satisfies $|\lambda_{\max}|\leq \rho <1$, this does not mean that we get a contraction rate of $\rho^k$. This journey includes not just a lot of bruises from banging one’s head against a wall, but also a (reunion) encounter with degenerate eigenvalues, algebraic vs geometric multiplicity, and the ever-present linear algebra embodiment of “speak softly and carry a big stick”, the Jordan canonical form.

(Apologies for rambling language. I haven’t had much sleep.)

Ok, so here is the problem. Let’s say I have a method, and through thick and thin I have finally proven that this method can be bounded in terms of something like

$$
x^{(k)}-x^* = A_k (x^{(k-1)}-x^*) = (\prod_{i\leq k} A_i)(x^{(0)}-x^*).\tag{Method}
$$

Then, someone way smarter than me has also shown that the largest eigenvalue of $A_k$, in modulus, is at most $\rho <1$, for all $k$. Great! we say. We can conclude that $\|x^{(k)}-x^*\|_*\leq O(\rho^k)$! And, adding to the confusion, all our simulations actually show this rate! So we’re done, right?

Wait. No. $A_k$ is not symmetric. So, while we would like to say that

$$
\|x^{(k)}-x^*\|_2 = \| (\prod_{i\leq k} A_i)(x^{(0)}-x^*)\|_2\leq \rho^k\|x^{(0)}-x^*\|_2,\tag{LinRate}
$$

we didn’t actually show that $\|A_i\|_2\leq \rho < 1$ for any $i$ (let alone that $\max_{i\leq k} \|A_i\|_2^k\leq \rho^k$), so (LinRate) does not follow from submultiplicativity of the norm.  So, now what do we do?

First approach: Gelfand’s formula.

Through some googling, we usually can find this formula:

$$
|\lambda_{\max}| = \lim_{k\to\infty}\|A^k\|_2^{1/k}
$$

or, more concretely,

$$
\|A^k\|_2^{1/k} \leq  |\lambda_{\max}| + \epsilon_k, \qquad \epsilon_k\to 0.
$$

So, we can always just stop there and say that the rate (LinRate) is asymptotically true.  This is somewhat unsatisfying, first because asymptotic rates are a bit wussy; but secondly, because we just kind of used some old guy’s rule without really understanding why it may be true.

Second approach: Complex diagonalization

It turns out that the issue we were facing above was not that $A_k$ is not symmetric, but that it is not necessarily diagonalizable. But, in cases where $A_k$ are diagonalizable, we actually don’t have that much to worry about.

Take, for example, the contraction matrix associated with the Polyak Heavy Ball method:

$$
A_k = \begin{bmatrix} (1+\beta) I -\eta H_k & -\beta I \\ I & 0 \end{bmatrix}
$$

and Polyak’s method can indeed be written like (Method) where $H_k = \nabla^2 f(x^{(k)})$. When the problem is quadratic, then $H_k$ is constant, but otherwise it may change with respect to $k$; but, under the right smoothness and strong convexity conditions, its eigenvalues will be bounded. Overall, with some smarts, we can show that the eigenvalues of $A_k$ are at most $\sqrt{\beta}$ in modulus, and we really would like to say that this means $\|x^{(k)}-x^*\|_2 = O(\sqrt{\beta}^k)$. (This part of the analysis is well-known, was first shown by Polyak in [1], and actually I had discussed it in an earlier blog post.)

Now, suppose that $H_k$ is not constant, but satisfies the spectral bounds in our assumptions. Now, we are following the analysis in [2], which is cute, simple, and pretty powerful. Basically, we first observe that $H_k$ is symmetric and hence diagonalizable, and thus we may write $A_k$ as:

$$
H_k = U_k\Sigma_kU_k^T, \qquad A_k = \begin{bmatrix} U_k & 0 \\ 0 & U_k \end{bmatrix}\underbrace{ \begin{bmatrix}(1+\beta)I-\eta \Sigma_k & -\beta I \\ I & 0 \end{bmatrix} }_{T_k}\begin{bmatrix} U^T_k & 0 \\ 0 & U^T_k \end{bmatrix}
$$

Furthermore, note that a permutation exists such that $\Pi T_k\Pi^T$ is block diagonal, with each block as the 2×2 matrix

$$
(\Pi T_k\Pi^T )_{2i-1:2i}= \begin{bmatrix}(1+\beta)-\eta (\Sigma_k)_{ii} & -\beta \\ 1 & 0 \end{bmatrix}
$$

and importantly, this matrix is diagonalizable! So, using the right combination of permutation matrices, we will be able to diagonalize $T_k = Q_kZ_kQ_k^{-1}$ where $Z_k$ is diagonal and $\max_j |(Z_k)_{jj}|\leq \sqrt{\beta}$. Note also that both $Q_k$ and $Z_k$ may have complex entries (but that in itself is not problematic).
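
As a quick sanity check of this step, here is a NumPy sketch with illustrative values of $\mu$, $L$, and $\sigma$ (not anything taken from [2]): under the standard Polyak tuning, the 2×2 block has a complex-conjugate pair of eigenvalues of modulus exactly $\sqrt\beta$, and its eigenvector matrix is invertible.

```python
# Check that the 2x2 heavy-ball block is diagonalizable with eigenvalues of
# modulus sqrt(beta), for the standard Polyak step size and momentum.
import numpy as np

mu, L = 1.0, 10.0
eta = 4.0 / (np.sqrt(L) + np.sqrt(mu))**2
beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu)))**2

sigma = 5.0                                     # an eigenvalue of H_k in [mu, L]
T = np.array([[(1 + beta) - eta * sigma, -beta],
              [1.0, 0.0]])

evals, evecs = np.linalg.eig(T)                 # complex conjugate pair
print(np.abs(evals), np.sqrt(beta))             # both moduli equal sqrt(beta)
print(np.linalg.cond(evecs))                    # finite => T is diagonalizable
```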

Then, the rest is not too tricky. In the paper [2], they define $P_k = \begin{bmatrix} U_k & 0 \\ 0 & U_k \end{bmatrix}Q_k$ so that $A_k = P_kZ_kP_k^{-1}$, and introduce a dummy variable $u_k = P_k^{-1}(x^{(k)}-x^*)$ so that

$$
u_k = P_k^{-1}A_k P_k u_{k-1}= Z_k u_{k-1} = \prod_{i\leq k}Z_iu_0
$$
and since $Z_i$ is diagonal with largest modulus value $\leq \sqrt{\beta}$, it is not hard to say $\|u_k\|_2\leq \sqrt{\beta}^k \|u_0\|_2$. In the paper they actually use an additional step to say $\|x^{(k)}-x^*\|_2\leq \kappa(P_k)\|u_k\|_2$, which adds a constant factor, but actually $P_k$ looks like it ought to be well conditioned in practice.

But, note that the hard part of this proof was to show that $T_k$ is diagonalizable (and therefore $A_k$ is diagonalizable) despite not being symmetric. Now, we have to ask ourselves, what happens if $A_k$ is not diagonalizable?

Third approach: try to remember some useful linear algebra facts.

Sometimes in the midst of some serious REM sleep, I’ll have recurring dreams of old math lectures. In one of them is a friendly graduate school professor chanting, in a comforting mantra: “not all matrices are diagonalizable… ” I then try to focus the memory a bit, to skip ahead a bit to the next part of the lecture, which says “…but the Jordan canonical form (JCF) always exists!”

To remember what this JCF thing is, let’s take as an example

$$
A = \begin{bmatrix} 1 & \alpha \\ 0  & 1 \end{bmatrix}.
$$

If we then look at its characteristic polynomial, then $\det(A-\lambda I) = 0$ only when $\lambda = 1$, so it has two identical roots. But, we can also directly compute that
$$
A^k = \begin{bmatrix} 1 & k\alpha \\ 0  & 1 \end{bmatrix}.
$$

so clearly, $\|A^k\|$ is not bounded in the limit $k\to \infty$, even though the spectral radius is exactly 1! Therefore, in this case, it would be absolutely incorrect to claim (LinRate) with $\rho = 1$. However, if we just scale down the diagonal of $A$ ever so slightly, then
$$
A = \begin{bmatrix} \beta & \alpha \\ 0  & \beta \end{bmatrix}, \qquad
A^k = \beta^k \begin{bmatrix} 1 & \frac{\alpha}{\beta} \\ 0  & 1 \end{bmatrix}^k
= \beta^k \begin{bmatrix} 1 & \frac{k\alpha}{\beta} \\ 0  & 1 \end{bmatrix}
$$

and since $\beta^k$ decays faster than $\frac{k\alpha}{\beta}$ grows, this quantity will go to 0, asymptotically like $\beta^k$ up to a polynomial factor, but it is not upper bounded by $\beta^k$ itself.
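
Here is a quick NumPy sketch of this transient behavior (with arbitrary $\alpha$ and $\beta$, purely for illustration): the spectral radius is $\beta < 1$, but $\|A^k\|_2$ first grows roughly like $k\alpha\beta^k$ before it eventually decays.

```python
# Spectral radius < 1, yet ||A^k|| grows before it shrinks.
import numpy as np

alpha, beta = 10.0, 0.9
A = np.array([[beta, alpha],
              [0.0, beta]])

Ak = np.eye(2)
for k in range(1, 101):
    Ak = Ak @ A
    if k in (1, 5, 10, 25, 50, 100):
        print(k, np.linalg.norm(Ak, 2), beta**k)   # transient growth, then decay
```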

Connection with Jordan canonical form

The JCF decomposition of a matrix $A$ is

$$
A = UJU^{-1}
$$

where $U$ is invertible and $J = \mathrm{diag}(J_1,J_2,…,J_D)$ contains diagonal blocks

$$
J_i = \begin{bmatrix} \lambda_i & 1 & 0 & … & 0 & 0 \\ 0 & \lambda_i & 1 & … &0 & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots
\\ 0 & 0 & 0 & …  & \lambda_i&1
\\ 0 & 0 & 0 & … & 0 & \lambda_i\end{bmatrix}
$$

The number of Jordan blocks associated with an eigenvalue is its geometric multiplicity, and the sum of their sizes is its algebraic multiplicity. A block of size 1 means it is the eigenvalue you already know and love, but a block of size > 1 means that there are “hidden” generalized-eigenvector directions, where the matrix acts on more of the space than an eigenbasis can reach. That is the case for the scary matrix we had found. Note that, taking any one of these diagonal blocks of size $b_i$, we can get
\begin{eqnarray*}
J_i^k
=  \lambda_i^k\begin{bmatrix} 1& \binom{k}{1}\lambda_i^{-1} &  \binom{k}{2}\lambda_i^{-2} & … & \binom{k}{b_i-2}\lambda_i^{-(b_i-2)} & \binom{k}{b_i-1}\lambda_i^{-(b_i-1)} \\ 0 & 1& \binom{k}{1}\lambda_i^{-1} & … & \binom{k}{b_i-3}\lambda_i^{-(b_i-3)} & \binom{k}{b_i-2}\lambda_i^{-(b_i-2)}\\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots
\\ 0 & 0 & 0 & …  & 1& \binom{k}{1}\lambda_i^{-1}
\\ 0 & 0 & 0 & … & 0 &1\end{bmatrix}
\end{eqnarray*}
More precisely, I can write $\binom{k}{b} \leq \frac{k^b}{b!}$, and each entry of $J_i^k$ has the form $\binom{k}{b}\lambda_i^{k-b}$ with $b\leq b_i-1$. So, for $|\lambda_i|\leq 1$, and suppressing constants (in $k$) such as $|\lambda_i|^{-(b_i-1)}$ and the block size,
$$
\|J_i^k\|_\infty \lesssim |\lambda_i|^k \cdot  \frac{ k^{b_i}}{b_i!}
$$
and therefore
$$
\|J_i^k\|_2 = \max_{u\neq 0} \frac{\|J_i^k u\|_2}{\|u\|_2} \lesssim |\lambda_i|^k \cdot  \frac{ k^{b_i}}{b_i!} \cdot \underbrace{\max_{u\neq 0} \frac{\|u\|_1 \|\mathbf{1}\|_2}{\|u\|_2}}_{=\,b_i}   =  |\lambda_i|^k \cdot  \frac{ k^{b_i }}{(b_i-1)!}
$$

which overall gives a nonasymptotic convergence rate of
$$
\|x^{(k)}-x^*\|_2\lesssim \rho^k \max_i     \frac{  k^{b_i }}{(b_i-1)!}
$$
where the $b_i$ are the sizes of the Jordan blocks of $A$, and the $\lesssim$ hides constants that do not depend on $k$ (such as $\kappa(U)=\|U\|_2\|U^{-1}\|_2$ and the suppressed $|\lambda_i|^{-b_i}$ factors). Since the polynomial factor is eventually dominated, this also recovers $\|J_i^k\|^{1/k}\to |\lambda_i|$, consistent with Gelfand’s formula.

Conclusions and future thoughts.

So at this point, we’ve concluded that indeed, it is not a good idea to bound the action of “contraction” matrices (in quotes because they may not necessarily contract) based on the behavior of the largest eigenvalue; however, we can also see qualitatively how eventually, the spectral radius does give a linear rate. Note that this analysis can be done by using the ever-forgotten always-powerful JCF, which very explicitly points out when eigenvalues are problematic. (Not from symmetry, but from degeneracy.)

Overall (edited),  we can actually come up with better nonasymptotic rates for nondiagonalizable contraction matrices, using the JCF. It feels a bit unsettling because it does not contain a “fully linear” overall trend, and depends critically on $b_i$, the sizes of the Jordan blocks, which may be hard to ascertain in practice. But, it is a nonasymptotic rate!

Well, hope y’all are enjoying your day!

 

Citations

[1] Polyak, Boris T. Introduction to Optimization. Optimization Software, Inc., Publications Division, New York, 1987.

[2] Wang, Jun-Kun, Chi-Heng Lin, and Jacob D. Abernethy. “A Modular Analysis of Provable Acceleration via Polyak’s Momentum: Training a Wide ReLU Network and a Deep Linear Network.” In International Conference on Machine Learning, pp. 10816-10827. PMLR, 2021.

 

Many thanks to Sharan, Reza, and Baojian, whose discussions helped me understand this issue.

 

 

High dimensional mean value theorem.

Here’s a new “overthinking it” issue for you. For a while now, I’ve been relying on the high-dimensional version of a Taylor-series-type result to help me flip variables and gradients. That is, for 1-D functions $g$ with continuous second derivatives everywhere, the mean value theorem says that there always exists a $z$ between $x$ and $y$ such that

$$
g”(z) (x-y)=g'(x)-g'(y).
$$

Without blinking, we use this to extend to $f:\mathbb R^n\to \mathbb R$, and say that there always exists some $z$ where

$$
\nabla^2 f(z) (x-y) = \nabla f(x) - \nabla f(y).\tag{MVT-hd}
$$

Never mind that the notion of “between $x$ and $y$” doesn’t really make sense. We just love the result, and use it to flip primal and dual solutions, based on the qualities of $\nabla^2 f(z)$, willy nilly.

But, one thing I am realizing as I go through my convex analysis adventure, is that everything that is true in high dimensions can be proven using only tools in low dimensions.  And by low dimensions, I mean 1-dimensional. So, I should be able to, using only 1-D function knowledge, prove (MVT-hd).

First, (MVT-hd) is true if, for a single $z$,

$$
u_k^T\nabla^2 f(z) (x-y) =u_k^T( \nabla f(x) - \nabla f(y))
$$

holds for every vector in some set $u_1,u_2,…,u_n$ spanning $\mathbb R^n$. To make life easy, let’s pick the standard basis. Then,

$$
e_k^T\nabla^2 f(z)(x-y) = \sum_{i=1}^n \frac{\partial^2 f(u)}{\partial u_k\partial u_i}\Bigg|_{u=z} \cdot (x_i-y_i)
$$

and

$$
e_k^T(\nabla f(x)-\nabla f(y)) = \frac{\partial f(u)}{\partial u_k}\Bigg|_{u=x}- \frac{\partial f(u)}{\partial u_k}\Bigg|_{u=y}
$$

Let us now define some dummy 1-D functions as
$$
g_k(\alpha) = \frac{\partial f(u)}{\partial u_k}\Bigg|_{u=x+\alpha(y-x)}
$$

Then,

$$
e_k^T(\nabla f(y)-\nabla f(x))=g_k(1)-g_k(0)
$$

and

$$
g'_k(\alpha) = \sum_{j=1}^n \frac{\partial^2 f(u)}{\partial u_k\partial u_j}\Bigg|_{u=x+\alpha(y-x)}\cdot (y_j-x_j)
$$

Now this is very promising, because now we can invoke the MVT on $g_k(\alpha)$ and say that there must exist some $0\leq \bar \alpha \leq 1$ where

$$
g_k(1)-g_k(0)=g'_k(\bar\alpha)\iff e_k^T(\nabla f(y)-\nabla f(x)) = \sum_{j=1}^n \frac{\partial^2 f(u)}{\partial u_k\partial u_j}\Bigg|_{u=x+\bar \alpha(y-x)}\cdot (y_j-x_j).
$$

However, there’s one catch: there’s no reason to assume that this $\bar\alpha$ is the same point for each $k$. So, while (MVT-hd) has one $z$ for its relation, here we’ve got $n$ different $\bar \alpha_k$s, corresponding to each $g_k$, as the quick numerical check below suggests.
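
To see this concretely, here is a small NumPy sketch with an arbitrary (separable) choice $f(x)=\sum_i x_i^4$ and arbitrary points $x,y$; it grid-searches for the MVT point of each coordinate function $g_k$ and finds a different $\bar\alpha_k$ for each coordinate.

```python
# Each coordinate k gets its own MVT point alpha_k in general.
import numpy as np

x = np.array([0.3, 1.0, -2.0])
y = np.array([1.5, -0.5, 0.7])

def grad(u):                 # gradient of f(u) = sum(u**4)
    return 4 * u**3

def g_prime(alpha):          # d/dalpha of grad_k( x + alpha*(y - x) ), all k at once
    u = x + alpha * (y - x)
    return 12 * u**2 * (y - x)

target = grad(y) - grad(x)   # g_k(1) - g_k(0), per coordinate

alphas = np.linspace(0.0, 1.0, 10001)
vals = np.array([g_prime(a) for a in alphas])            # shape (N, 3)
best = alphas[np.argmin(np.abs(vals - target), axis=0)]  # one alpha per coordinate
print(best)                  # three different alpha_k in [0, 1]
```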

Of course, you could always create a new function $g(\alpha) = \sum_k u_k g_k(\alpha)$, and following the same analysis, you could show that, for any fixed $u$, there exists some $z$ (depending on $u$) where

$$
u^T\nabla^2 f(z) (x-y) =u^T( \nabla f(x) - \nabla f(y)).\tag{MVT-hd-proj}
$$

It turns out that this alone is enough for me to prove most of what I need. However, right now I am hesitant  to say that one unique $z$ exists for all $u$.

I hope to be proven wrong, but in the meantime, I shall begin my search for a counterexample!

We need to talk about Adam

At some point in every optimist’s life, we are going to be asked the age-old question: “How does your method compare with Adam?” or, even more annoyingly, “Does this analysis help understand Adam?” I mean, Adam is a nice method that works very well for a very unique set of optimization problems, and not, like, the answer to all the problems in the universe (42). That being said, even if Adam (the method) isn’t exactly Adam (the first man), we should probably know a little bit about what it’s about.

In case you were living under a rock for the past 10 years, Adam refers to ADAptive Moment estimation, this magical method that came out in 2015 by Kingma and Ba, and is widely hailed as the one training method that actually seems to work well. It was the latest in adaptive step size methods, starting with AdaGrad (Duchi, Hazan, Singer 2011), but including AdaDelta (Zeiler 2012), RMSProp (Tieleman & Hinton, 2012), etc.

Adam (Kingma and Ba, 2015)

(typo above: should be $\alpha_t = \alpha/\sqrt{t}$, not constant $\alpha$)

Not Adam Or Adam

Anyway, so I finally decided it was time to bite the bullet, and learn about what this Adam thing is all about, follow the journey to a proper convergence proof, and give some comments as to how it may work in practice.  This post will largely be a literature survey about two parts of this story: the original paper, and the followup proof

[1] Kingma and Ba. Adam: A Method for Stochastic Optimization

[2] Reddi, Kale, Kumar. On the Convergence of Adam and Beyond.

Related prior works

As previously mentioned, Adam comes as a member of the “adaptive step size committee”, a string of methods that followed AdaGrad, the often hailed seminal work of Duchi et al who first proposed adaptively adjusting the step size to help deep network training deal with the vanishing gradient problem. Followup work, like RMSProp, and Adam itself, can be seen as “descendants” of this work.  While that paper in itself holds an important place in this journey, in the interest of time I will not mention it much more here.

What the papers do vs what people actually do

Before going on, I need to point out, indeed, highlight, three assumptions (1-3) that are critical in the proofs but clearly not used in practice. Additionally, we will mention a bug (4), which was later fixed in [2], and just for completeness a minor thing to remember (5).

  1. The proof assumes $f_t$, the online loss function, is convex in its parameters.
  2. The proof uses a decaying step size $\alpha_t = \alpha/\sqrt{t}$. In practice, we use a constant $\alpha$.
  3. The proof uses a decaying mixing parameter $\beta_{1,t} = \beta_{1,0}\lambda^{t-1}$ for some $0 < \lambda  < 1$.
  4. (Bug, fixed in later work) We must require that $\sqrt{\hat v_t}/\alpha_t$ be a decreasing sequence.
  5. (Minor) the proof assumes $\epsilon = 0$. (In practice, $\epsilon \approx 0$.)

Now, I should do a disclaimer and say that everyone and their sister, including me, often resorts to unreasonable assumptions the night before the deadline. It’s unsatisfying when theory doesn’t quite match practice, but the truth is doing convergence proofs is hard, especially for very general assumptions. However, what makes me feel that this is a bit less ok than the usual case is that Adam is often hailed as the method to use ([1] is cited over 100,000 times). Therefore, the fact that we as a community don’t seem as interested in closing this loop is a bit baffling to me. For this reason, at the end of this post, I will try to highlight a few papers that tried to “fix the proof” by justifying these assumptions. If I have missed out a work in this category that you know of, please do let me know and I will update it.

(By the way, I want to clarify, this is not a rant against the authors of [1]. I am not trying to take away the significance of their contribution. It’s more of a question of whether, as a community, we are satisfied with this story being “done”.)

Proof of Adam ([1], with help from [2])

For $g_t = \nabla_\theta(f_t(\theta_{t-1}))$, consider the method

\begin{eqnarray*}
m_t &=& \beta_{1,t}m_{t-1} + (1-\beta_{1,t})g_t, \quad \hat m_t = \frac{m_t}{1-\beta_1^t}\\
v_t &=& \beta_{2}v_{t-1} + (1-\beta_{2})g^2_t, \quad \hat v_t = \frac{v_t}{1-\beta_2^t}\\
\theta_t &=& \theta_{t-1}\, -\, \frac{\alpha_t }{\sqrt{\hat v_t}}\hat m_t
\end{eqnarray*}

where we initialize $m_0 = v_0 = 0$ and $\beta_{1,0} = \beta_1$. Here, $m_t,v_t,\theta_t,g_t$ are all vectors in $\mathbb R^d$ and, with an extreme abuse of notation, all multiplications and divisions between these vectors are taken element-wise. The scalars are $\alpha_t,\beta_{1,t},\beta_2$, and of course $t$.
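
For concreteness, here is a minimal NumPy sketch of one step of the update above, written in the constant-$\alpha$, $\epsilon>0$ form that people actually run (i.e., ignoring the decaying $\alpha_t$ and $\beta_{1,t}$ used in the proof); the default hyperparameters are the usual practical choices, not anything from the theorem.

```python
# One Adam step; g is the stochastic gradient grad f_t(theta), and all vector
# operations are elementwise.
import numpy as np

def adam_step(theta, m, v, g, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g          # first moment estimate
    v = beta2 * v + (1 - beta2) * g**2       # second moment estimate
    m_hat = m / (1 - beta1**t)               # bias correction (m_0 = v_0 = 0)
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```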

Theorem: Assume that

  • the functions $f_t(\theta) = \mathcal L_\theta(x_t,y_t)$ are convex in $\theta\in \mathbb R^d$
  • the gradients of $f_t$ are bounded everywhere: $\|\nabla f_t(\theta)\|_2 \leq G$, $\|\nabla f_t(\theta)\|_\infty \leq G_\infty$
  • the distance between any two $\theta_t$ generated by Adam is bounded $$\|\theta_n-\theta_m\|_2\leq D, \quad \|\theta_m-\theta_n\|_\infty \leq D_\infty, \text{  for any   } m,n \in \{1,…,T\}$$

Pick parameters such that $\beta_1,\beta_2 \in [0,1)$, $\beta_1^2/\sqrt{\beta_2} < 1$, $\alpha_t = \alpha/\sqrt{t}$, $\lambda\in (0,1)$, and define $\beta_{1,t}=\beta_1\lambda^{t-1}$. Then for all $T\geq 1$, the regret $$R(T) = \sum_{t=1}^Tf_t(\theta_t)-\min_\theta \sum_{t=1}^T f_t(\theta)$$ is upper bounded
\begin{multline*}
R(T)\leq
\frac{D^2}{2\alpha}\frac{ \sqrt{G}}{\sqrt{1-\beta_2}}
+ \frac{1}{(1-\beta_2)(1-\beta_1)}\sqrt{\frac{\beta_2}{\beta_2-\beta_1^{2} }} \frac{\alpha\sqrt{\log(T)+1} }{2(1-\beta_1\lambda)} \|g_{1:T}\|_1 \\
+ \frac{1}{1-\beta_1} \frac{dG_\infty\sqrt{ \beta_2}}{\sqrt{ \beta_2}-\beta_1^2} + \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}\beta_{0,1}(1-\lambda)^2}
\end{multline*}
In terms of big-O, the first, third, and fourth terms are constants, and overall we get an $R(T) = O(\sqrt{\log(T)}\,\|g_{1:T}\|_1)$ term. Here I should note that in the two papers [1] and [2], this is actually a $\|g_{1:T}\|_2$ term, but in recreating the proof I was not able to recover that. That being said, these norms are comparable (up to dimension factors), so we’ll just leave it as is. The $\|g_{1:T}\|_2$ also appears in AdaGrad, and the argument is that if your gradients are sparse enough, this guy is also $O(\sqrt{T})$. Seems like a rough assumption, but let’s go with the flow for now.

Additionally, it should be noted that there does not appear to be a significant speedup over SGD. In [1], the argument is that Adam is $O(\log(d)/\sqrt{t})$ (which got boosted in terms of $t$ by [2]), but in no way can I recreate the proof so that the dependency on $d$ is less than linear. (I’d be happy to correct it if people feel differently.)

Either way, the fact that Adam does not claim a faster convergence rate than SGD in terms of $t$ is somewhat significant.

 

Now let’s do the proofs

Using the update rule $\theta_{t+1}=\theta_t-\frac{\alpha_t\hat m_t}{\sqrt{\hat v_t}}$ (taking $\epsilon = 0$), we may write

\begin{eqnarray*}
(\theta_{t+1,i}-\theta_i^*)^2&=&(\theta_{t,i}-\theta_i^*)^2 – 2\alpha_t\frac{\hat m_{t,i}}{\sqrt{\hat v_{t,i}}}(\theta_{t,i}-\theta_i^*) + \alpha^2_t\frac{\hat m^2_{t,i}}{\hat v_{t,i}}\\
&=&(\theta_{t,i}-\theta_i^*)^2 – \frac{2(1-\beta_{1,t})\alpha_tg_{t,i}}{(1-\beta^t_1)\sqrt{\hat v_{t,i}}}(\theta_{t,i}-\theta_i^*)
– \frac{ 2\alpha_t\beta_{1,t} m_{t-1,i} }{\sqrt{\hat v_{t,i}}(1-\beta^t_1)}(\theta_{t,i}-\theta_i^*)
+ \alpha^2_t\frac{\hat m^2_{t,i}}{\hat v_{t,i}}.
\end{eqnarray*}

Summing over the coordinates $i$ and solving for the term $g_t^T(\theta_t-\theta^*)$,

\begin{eqnarray*}
g_t^T(\theta_t-\theta^*) &=& \sum_{i=1}^d\left(\left((\theta_{t,i}-\theta_i^*)^2-(\theta_{t+1,i}-\theta_i^*)^2\right)\frac{\sqrt{\hat v_{t,i}}}{2\alpha_t }\frac{1-\beta_1^t}{1-\beta_{1,t}} \,+\, \frac{\alpha_t \hat m_{t,i}^2}{2\sqrt{\hat v_{t,i}}}\frac{1-\beta_1^t}{1-\beta_{1,t}} \,-\, \frac{\beta_{1,t} m_{t-1,i}}{(1-\beta_{1,t})}(\theta_{t,i}-\theta_i^*)\right)
\end{eqnarray*}

The last term can be bounded using
$$
|m_{t-1}^T(\theta_t-\theta^*)|\leq \frac{\|\alpha_t^{1/2}\hat V^{-1/4}m_{t-1}\|_2^2}{2} + \frac{\|\alpha_t^{-1/2}\hat V^{1/4}(\theta_t-\theta^*)\|_2^2}{2}
$$
where $\hat V_t = \mathbf{diag}(\hat v_{t})$, giving
\begin{eqnarray*}
g_t^T(\theta_t-\theta^*) &\leq & \sum_{i=1}^d\left(\left((\theta_{t,i}-\theta_i^*)^2-(\theta_{t+1,i}-\theta_i^*)^2\right)\frac{\sqrt{\hat v_{t,i}}}{2\alpha_t }\frac{1-\beta_1^t}{1-\beta_{t,1}}\,+\, \frac{\alpha_t \hat m_{t,i}^2}{2\sqrt{\hat v_{t,i}}} \frac{1-\beta_{1}^t}{1-\beta_{t,1}}\right) \\
&&+ \frac{\beta_{t,1}}{1-\beta_{t,1}} \left(\frac{\alpha_t\|\hat V^{-1/4}m_{t-1}\|_2^2}{2} + \frac{\|\hat V^{1/4}(\theta_t-\theta^*)\|_2^2}{2\alpha_t}\right)
\end{eqnarray*}

 

From convexity, for $\theta^* = \arg\min_\theta \sum_{t=1}^T f_t(\theta)$, we have
$$ f_t(\theta_t)-f_t(\theta^*)\leq g_t^T(\theta_t-\theta^*), $$
and
\begin{eqnarray*}
R(T) &\leq &   \sum_{t=1}^T\Bigg[\frac{1}{2\alpha_t} \frac{1-\beta_1^t}{1-\beta_{t,1}} (\|\hat V^{1/4}_t(\theta_{t}-\theta^*)\|_2^2 -\|\hat V^{1/4}_t(\theta_{t+1}-\theta^*)\|_2^2  ) \,+\, \frac{\alpha_t}{2} \frac{1-\beta_1^t}{1-\beta_{t,1}}\|\hat V_t^{-1/4}\hat m_t\|_2^2  \\&&+ \frac{\beta_{t,1}}{1-\beta_{t,1}} \left(\frac{\alpha_t\|\hat V^{-1/4}m_{t-1}\|_2^2}{2} + \frac{\|\hat V^{1/4}(\theta_t-\theta^*)\|_2^2}{2\alpha_t}\right)\Bigg]\\
&=& R_1(T)+R_2(T)+R_3(T) + R_4(T)
\end{eqnarray*}

where

\begin{eqnarray*}
R_1(T) &=& \sum_{t=1}^T\frac{1}{2\alpha_t} (\|\hat V^{1/4}_t(\theta_{t}-\theta^*)\|_2^2-\|\hat V^{1/4}_t(\theta_{t+1}-\theta^*)\|_2^2 )\\
R_2(T) &=&\sum_{t=1}^T \frac{\alpha_t}{2}\frac{1-\beta_1^t}{1-\beta_{t,1}} \|\hat V_t^{-1/4}\hat m_t\|_2^2\\
R_3(T) &=&\sum_{t=1}^T \frac{\beta_{t,1}}{1-\beta_{t,1}} \frac{\alpha_t\|\hat V^{-1/4}m_{t-1}\|_2^2}{2} \\
R_4(T) &=&\sum_{t=1}^T \frac{\beta_{t,1}}{1-\beta_{t,1}}  \frac{\|\hat V^{1/4}(\theta_t-\theta^*)\|_2^2}{2\alpha_t}
\end{eqnarray*}

Simple bounds

  • First, note that, element-wise, $$|v_{t,i}| \leq \sum_{\tau = 1}^t \beta_2^{t-\tau} g_{\tau,i}^2\leq G_\infty^2 \sum_{\tau’ = 0}^{t-1} \beta_2^{\tau’}  \leq  \frac{G_\infty^2}{1-\beta_2}$$
    where we use a change of variables $\tau’ = t – \tau$ and upper bound with the infinite geometric series.  So, $\|v_t\|_2 \leq d\|v_t\|_\infty \leq \frac{d G_\infty^2}{1-\beta_2}$.
  • Another important inequality. Consider two sequences $a_k\geq 0$ and $b_k\geq 0$. By Holder $p = 1$,
    $$
    \sum_k a_k = \sum_k \frac{a_k}{\sqrt{b_k}}\cdot \sqrt{b_k} \leq \left(\sum_j \frac{a_j}{\sqrt{b_j}}\right)\left(\max_l \sqrt{b_l}\right)\overset{\|\cdot\|_\infty \leq \|\cdot\|_2}{\leq} \left(\sum_j \frac{a_j}{\sqrt{b_j}}\right)\left(\sqrt{\sum_l b_l}\right)
    $$
    Therefore,
    $$
    \frac{\sum_k a_k}{\sqrt{\sum_l b_l}}  \leq \sum_j \frac{a_j}{\sqrt{b_j}} \qquad\qquad(\oplus)
    $$
  • Lemma 2 in [1], fixed in Lemma 2 in [2]. We now need to put a bound on $\sum_t \frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}}$. In [1], they try to do a proof in which this term becomes $O(\sqrt{t})$; in [2], they tightened it to $O(\sqrt{\log(T)})$. In addition to being a better rate, [2] seemed to also have fewer bugs in the proof, so I think it makes sense to just present their version.
    \begin{eqnarray*}
    \frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} = \sum_{i=1}^d\frac{\sqrt{1-\beta_2^t}(1-\beta_{1,0})^2}{(1-\beta_1^t)^2(1-\beta_2)}\frac{1}{\sqrt{t}}\frac{(\sum_{k=1}^t\prod_{j=k}^t\beta_{1,j}g_{k,i})^2}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}}
    \end{eqnarray*}
    Using Cauchy-Schwarz,
    $$
    \sum_{k=1}^t(1-\beta_{1,0})\prod_{j=k}^t\beta_{1,j}g_{k,i}\leq \sqrt{\sum_{k=1}^t\prod_{j=k}^t\beta_{1,j} \sum_{l=1}^t \prod_{j=l}^t\beta_{1,j}  g^2_{l,i}}
    $$
    so we may write
    \begin{eqnarray*}
    \sum_{i=1}^d\frac{(\sum_{k=1}^t\prod_{j=k}^t\beta_{1,j}g_{k,i})^2}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}} &=& \sum_{i=1}^d\frac{\sum_{k=1}^t\prod_{j=k}^t\beta_{1,j} \sum_{l=1}^t \prod_{j=l}^t\beta_{1,j} g^2_{l,i}}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}}\\
    &=& \sum_{i=1}^d\frac{\sum_{k=1}^t\beta_1^{t-k} \sum_{l=1}^t \beta_1^{t-l} g^2_{l,i}}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}}\\
    &\overset{\sum_{k=1}^t\beta_1^{t-k}\leq \frac{1}{1-\beta_1}}{\leq}& \frac{1}{1-\beta_1}\sum_{i=1}^d\frac{\sum_{l=1}^t \beta_1^{t-l} g^2_{l,i}}{\sqrt{\sum_{j=1}^t \beta_2^{t-j}g_{j,i}^2}}\\
    &\overset{(\oplus)}{\leq}& \frac{1}{1-\beta_1}\sum_{i=1}^d\sum_{k=1}^t\frac{ \beta_1^{t-k} g^2_{k,i}}{\sqrt{ \beta_2^{t-k}g_{k,i}^2}}\\
    &=& \frac{1}{1-\beta_1}\sum_{k=1}^t\frac{ \beta_1^{t-k} }{\sqrt{ \beta_2^{t-k}}}\|g_k\|_1\\
    \end{eqnarray*}
    Now, let’s sum over $t$.
    \begin{eqnarray*}
    \sum_{t=1}^T\frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} &\leq&  \frac{\sqrt{1-\beta_2^t}(1-\beta_{1,0})^2}{(1-\beta_1^t)^2(1-\beta_2)(1-\beta_1)}\sum_{t=1}^T\frac{1}{\sqrt{t}}\sum_{k=1}^t\frac{ \beta_1^{t-k} }{\sqrt{ \beta_2^{t-k}}}\|g_k\|_1
    \end{eqnarray*}
    The next step is to notice that $(1-\beta_{1,0})^2 = (1-\beta_1)^2 \leq (1-\beta_1^t)^2$ (since $\beta_1^t\leq \beta_1 < 1$) and $\sqrt{1-\beta_2^t}\leq 1$, so
    $$
    \frac{\sqrt{1-\beta_2^t}}{(1-\beta_1^t)^2}(1-\beta_{1,0})^2 \leq 1.
    $$
    So, more simply,
    \begin{eqnarray*}
    \sum_{t=1}^T\frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} &\leq&  \frac{1}{(1-\beta_2)(1-\beta_1)}\sum_{t=1}^T\frac{1}{\sqrt{t}}\sum_{k=1}^t\frac{ \beta_1^{t-k} }{\sqrt{ \beta_2^{t-k}}}\|g_k\|_1
    \end{eqnarray*}We now switch the $t$ and the $k$ summations, and use $\tau = t-k$. Then
    \begin{eqnarray*}
    \sum_{t=1}^T\frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} &\leq&  \frac{1}{(1-\beta_2)(1-\beta_1)}\sum_{k=1}^T\sum_{t=k}^T\frac{1}{\sqrt{t}}\frac{ \beta_1^{t-k} }{\sqrt{ \beta_2^{t-k}}}\|g_k\|_1\\
    &\leq&  \frac{1}{(1-\beta_2)(1-\beta_1)}\sum_{k=1}^T\sum_{\tau=0}^{T-k}\frac{1}{\sqrt{\tau+k}}\frac{ \beta_1^{\tau} }{\sqrt{ \beta_2^{\tau}}}\|g_k\|_1\\
    &\overset{\tau+k\geq\tau}{\leq}&  \frac{1}{(1-\beta_2)(1-\beta_1)}\sum_{\tau=0}^{T-k}\frac{1}{\sqrt{\tau}}\frac{ \beta_1^{\tau} }{\sqrt{ \beta_2^{\tau}}}\sum_{k=1}^T\|g_k\|_1
    \end{eqnarray*}
    Now we can just work with this term
    $$
    \sum_{\tau=0}^{T-k}\frac{1}{\sqrt{\tau}}\frac{ \beta_1^{\tau} }{\sqrt{ \beta_2^{\tau}}}\overset{C-S}{\leq}\sqrt{\sum_{\tau=0}^{T}\frac{1}{\tau}}\sqrt{\sum_{\tau’=0}^T\frac{ \beta_1^{2\tau’} }{ \beta_2^{\tau’}}}\leq \sqrt{\log(T)+1}\sqrt{\frac{\beta_2}{\beta_2-\beta_1^{2} }}
    $$
    which gives an overall rate of
    \begin{eqnarray*}
    \sum_{t=1}^T\frac{\|\hat V_t^{-1/4} \hat m_{t}\|_2^2}{\sqrt{t}} &\leq&   \frac{1}{(1-\beta_2)(1-\beta_1)}\sqrt{\frac{\beta_2}{\beta_2-\beta_1^{2} }} \sqrt{\log(T)+1}\|g_{k,1:T}\|_1 \\&=& A \sqrt{\log(T)+1}\|g_{k,1:T}\|_1
    \end{eqnarray*}
    where we absorb the constant
    $$A= \frac{1}{(1-\beta_2)(1-\beta_1)}\sqrt{\frac{\beta_2}{\beta_2-\beta_1^{2} }} .$$

Now we are ready to tackle the terms one by one. First,

Term $R_1(T)$

Looking at each element,

\begin{eqnarray*}
\sum_{t=1}^T\frac{\hat v^{1/2}_{t,i}}{2\alpha_t} ((\theta_{t,i}-\theta_i^*)^2-(\theta_{t+1,i}-\theta^*_i)^2 )&=&
\frac{\hat v^{1/2}_{1,i}}{2\alpha_1} (\theta_{1,i}-\theta_i^*)^2
-\frac{\hat v^{1/2}_{T,i}}{2\alpha_{T}} (\theta_{T+1,i}-\theta_i^*)^2\\
&&
+
\sum_{t=2}^{T}\left(\frac{\hat v^{1/2}_{t,i}}{2\alpha_t} -\frac{\hat v^{1/2}_{t-1,i}}{2\alpha_{t-1}}\right) (\theta_{t,i}-\theta_i^*)^2\\
&\leq&
\frac{\hat v^{1/2}_{1,i}}{2\alpha_1} (\theta_{1,i}-\theta_i^*)^2
-\frac{\hat v^{1/2}_{T,i}}{2\alpha_{T}} (\theta_{T+1,i}-\theta_i^*)^2
\end{eqnarray*}

only if
$$
\frac{\hat v^{1/2}_{t,i}}{2\alpha_t} \leq\frac{\hat v^{1/2}_{t-1,i}}{2\alpha_{t-1}} \qquad\qquad (\star)
$$

for all $t$! ($(\star)$ was actually the biggest point of [2], which corrects the proof in [1] by forcing this equation to be true in AMSGrad.)

This simplifies

$$
R_1(T) \leq \frac{1}{2\alpha_1} \|\hat V^{1/4}_1(\theta_{1}-\theta^*)\|_2^2-\frac{1}{2\alpha_{T}}\|\hat V^{1/4}_{T}(\theta_{T+1}-\theta^*)\|_2^2 \leq \frac{1}{2\alpha} \|\hat V^{1/4}_1(\theta_{1}-\theta^*)\|_2^2 \leq \frac{D^2}{2\alpha}\frac{ \sqrt{G}}{\sqrt{1-\beta_2}}
$$

Term $R_2(T)$

We want to use one of our “small bounds” results here. Explicitly,

$$
R_2(T) = \frac{\alpha}{2} \sum_{t=1}^T \frac{1-\beta_1^t}{1-\beta_{1,0}\lambda^t}\frac{\|\hat V_t^{-1/4}\hat m_t\|_2^2}{\sqrt{t}} \leq \frac{\alpha}{2} \sum_{t=1}^T \frac{1}{1-\beta_1\lambda}\frac{\|\hat V_t^{-1/4}\hat m_t\|_2^2}{\sqrt{t}} \leq \frac{\alpha A}{2(1-\beta_1\lambda)} \sqrt{\log(T)+1}\|g_{k,1:T}\|_1
$$

Term $R_3(T)$

I’m actually a little confused as to why neither paper seemed to really address what was going on with $R_3(T)$. In particular, if we expand,

$$
R_3(T) =\sum_{t=1}^T \frac{\beta_{t,1}}{1-\beta_{t,1}} \frac{\alpha_t\|\hat V^{-1/4}m_{t-1}\|_2^2}{2}
$$

\begin{eqnarray*}
\|\hat V_t^{-1/4}m_{t-1}\|_2^2\, -\, \|\hat V_t^{-1/4}m_{t} \|_2^2 &=& \|\hat V_t^{-1/4}(m_{t-1}-m_t) \|_2^2  \,-\, 2(\hat V_t^{-1/4}m_{t})^T(\hat V_t^{-1/4}(m_{t-1}-m_{t}))  \\
&=& (1-\beta_{1,t})^2 \|\hat V_t^{-1/4}(m_{t-1}-g_t) \|_2^2 \, -\, 2(1-\beta_{1,t})(\hat V_t^{-1/4}m_{t})^T(\hat V_t^{-1/4}(m_{t-1}-g_{t}))
\end{eqnarray*}
and $(1-\beta_{1,t})^2\to 1$ as $t\to\infty$. So it’s not like we can bound $R_3$ using the same tricks from $R_2$. On the other hand, assuming that $\|\hat V^{-1/4}m_{t-1}\|_2^2$ is a bounded quantity, and $ \frac{\beta_{t,1}}{1-\beta_{t,1}} $ is a summable sequence, it’s more likely that they just saw $R_3(T) \leq \text{constant}$ and just didn’t bother with it. Well, for completeness, let’s bother with it. First,
\begin{eqnarray*}
\|\hat V_t^{-1/4}m_{t-1}\|_2^2 &\leq & \|V_t^{-1/4}m_{t-1}\|_2^2 \\
&\leq& \sum_{i=1}^d \frac{(\sum_{\tau=1}^t\beta_{1}^\tau g_{i,\tau})^2}{\sqrt{\sum_{\tau=1}^t \beta_2^\tau g^2_{i,\tau}}}  \\
&\overset{\oplus}{\leq}&  \sum_{\tau=1}^t \frac{(\beta_{1}^\tau )^2}{\sqrt{ \beta_2^\tau }} \|g_t\|_1\\
&\overset{\|g_t\|_1\leq dG_\infty}{\leq}& dG_\infty \sum_{\tau=1}^t \frac{(\beta_{1}^\tau )^2}{\sqrt{ \beta_2^\tau }}\\
&\leq& \frac{dG_\infty\sqrt{ \beta_2}}{\sqrt{ \beta_2}-\beta_1^2} =:B
\end{eqnarray*}

so we’re safe!

Term $R_4(T)$

Using $\alpha_t=\alpha/\sqrt{t}$.

\begin{eqnarray*}
R_4(T) &=&\sum_{t=1}^T   \|\hat V_t^{1/4}(\theta_t-\theta^*)\|_2^2\frac{\beta_{t,1}}{2\alpha_t(1-\beta_{t,1})}\\
&\leq&\frac{D^2}{2\alpha}\sum_{t=1}^T   \|v_t^{1/2}\|_2\frac{\beta_{t,1}\sqrt{t}}{(1-\beta_{t,1})(1-\beta_2^t)}
\end{eqnarray*}

Recall from the simple bounds above that, element-wise, $v_{t,i}\leq \frac{G_\infty^2}{1-\beta_2}$. This gets us to

$$
R_4(T) \leq \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}}\sum_{t=1}^T\frac{\lambda^t\beta_{0,1}\sqrt{t}}{(1-\lambda^t\beta_{0,1})(1-\beta_2^t)}\leq \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}\beta_{0,1}}\sum_{t=1}^T\lambda^t\sqrt{t}
$$

and using the geometric-series identity $\sum_{t=1}^\infty t\lambda^t = \frac{\lambda}{(1-\lambda)^2}$, for $\lambda \in (0,1)$:
$$
\sum_{t=1}^T\lambda^t\sqrt{t} \leq \sum_{t=1}^\infty\lambda^t t = \frac{\lambda}{(1-\lambda)^2} \leq \frac{1}{(1-\lambda)^2}
$$

we get
$$
R_4(T) \leq \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}}\sum_{t=1}^T\frac{\lambda^t\beta_{0,1}\sqrt{t}}{(1-\lambda^t\beta_{0,1})(1-\beta_2^t)}\leq \frac{D^2G_\infty}{2\alpha\sqrt{1-\beta_2}\beta_{0,1}(1-\lambda)^2}
$$

Sidebar: After 3 days of poring over [1], aided by [2], I really feel like [2] is an underrated work that makes the theoretical analysis of [1] clean and durable. If you think about it, at this point, Adam isn’t just one person’s genius idea; it’s a method that a ton of people are using every day, in a slightly different form than the one presented. So it’s not unreasonable to expect that the convergence analysis should be a group effort, not just the work of a single team. In that spirit…

AMSGrad [2]

In addition to applying band-aids all over the proof of [1], [2] also ended up proposing a new (slightly modified) method, which fixes the issue raised in ($\star$). The fix is pretty simple: using the same updates for $m_t$ and $v_t$, we apply elementwise
$$
\hat v_t = \max(\hat v_{t-1},v_t)
$$

and update

$$
\theta_{t+1} = \theta_t – \frac{\alpha_tm_t}{\sqrt{\hat v_t}}.
$$

This ensures that $(\star)$ is always true, and thus the proof holds well.
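
As a sketch of how small the change is (again assuming NumPy, and again written in the constant-$\alpha$ form rather than the decaying-$\alpha_t$ form used in the theory), the AMSGrad step only differs from the Adam sketch earlier in the running elementwise max:

```python
# One AMSGrad step: identical bookkeeping to Adam, except the denominator uses a
# running elementwise max of the second-moment estimate.
import numpy as np

def amsgrad_step(theta, m, v, v_max, g, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    v_max = np.maximum(v_max, v)             # the one-line fix
    theta = theta - alpha * m / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```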

Followup works

Clearly I am not the first one on this bandwagon of skepticism, powered by (1) the strong assumptions used in this proof and (2) the lack of any clear acceleration in terms of $t$. It turns out that this bandwagon is actually pretty chock-full of optimizers singing more or less the same tune. More followup works have been written to address the weaknesses of Adam, one of them being

  • [3] Adaptive Gradient Methods Converge Faster with Over-Parameterization (but you should do a line-search) by Sharan Vaswani, Issam Laradji, Frederik Kunstner, Si Yi Meng, Mark Schmidt, Simon Lacoste-Julien

which argues that adaptive methods can use constant step sizes in the interpolation regime (where $\theta = \theta^*$ implies $\nabla f_t(\theta) = 0$ for all $t$), which seems strong but is actually quite common in overparametrized neural networks.

Another notable work is

  • [4] A Simple Convergence Proof of Adam and Adagrad, by
    Alexandre Défossez, Léon Bottou, Francis Bach, Nicolas Usunier

which extends the proof of Adam’s convergence to nonconvex functions, and shows a rate of $O(\log(T)/\sqrt{T})$, matching Adagrad, and improving upon SGD by a factor of $1-\beta_1$. Curious as I am about the guts of this proof, I’m afraid that’s going to have to wait for another day.

Finally, a fun reference. Francesco Orabona also has a nice rant about this in his blog:

His points are a bit different from the ones I was trying to make here in literally pulling out the weaknesses of Adam; rather, he takes the opposite side of the coin, which is that the numerical success of Adam is still quite surprising. (I agree.)

Overall, I believe the subject of Adam remains interesting, important, and controversial. Probably better proofs can still be made.

Acknowledgements

Thanks Sharan Vaswani for helpful discussion in dissecting this.

Projected gradient descent, revisited

It turns out that some of my previous statements on PGD may not have been, well, correct. At first I thought it was just a few typos, but as I studied my notes further, I realized there may be some fundamental flaws. So this blog post revisits some ideas and hopefully irons out some past mistakes.

By the way, there aren’t any mistakes in past blog posts, but there are in my online SpeedOpto notes.

 

Projected gradient descent (PGD)

First, a review on how PGD works, and what is true about it. Consider the constrained optimization problem

$$ \min_{x\in \mathcal D} f(x)$$

where $f$ is $L$-smooth, and $\mathcal D$ is a closed convex compact set. (For now I will not assume $f$ is convex, since it doesn’t really matter as much. However, $\mathcal D$ convex is important.) Then PGD takes steps as

$$ x^{(k+1)} = \mathbf{proj}_{\mathcal D}(x^{(k)}-\alpha\nabla f(x^{(k)}))\tag{PGD}$$

where $\mathbf{proj}_{\mathcal D}$ is the Euclidean projection solution to the minimization problem

$$\mathbf{proj}_{\mathcal D}(x) = \arg\min_{y\in {\mathcal D}}\|y-x\|_2.$$

Previously, I made the statement that PGD cannot hurt GD, in the sense that

  • PGD approaches a stationary point at least as fast as GD, and
  • PGD has the same objective function decay rate as GD.

The first statement is still true. In particular, in the first case, we have that for any point in $\mathcal D$ (and in particular for $x^*\in \mathcal D$),

$$\|\mathbf{proj}_\mathcal D(x)-x^*\|_2\leq\|x-x^*\|_2$$

with equality only if $x\in \mathcal D$.

However, I am now realizing that we cannot so easily prove the second one. In fact, the usual statements for PGD and Mirror Descent are much weaker: namely, that for a constant step size, $f(x^{(k)})\to f(x^*)$ at a rate of $O(1/k) + O(\alpha \|\nabla f(x^*)\|_2^2)$, and taking $\alpha = 1/\sqrt{k}$ we may get an aggregate $O(1/\sqrt{k})$ rate (but it is not a just-in-time rate).

A second caveat: if instead we view the projection operation as the proximal operator of an indicator function, we should get back our rate of $O(1/k)$. This I will explore in a follow-up post.

What’s wrong with the descent lemma of PGD?

Let us simplify each step of PGD as

\begin{eqnarray*} y &=& x – \alpha \nabla f(x) \\ x^+ &=& \mathbf{proj}_{\mathcal D}(y).\end{eqnarray*}

Now, it’s tempting to try to write equations like this:

\begin{eqnarray*}
f(x^+)-f(x) &\leq& \nabla f(x)^T(x^+-x)  + \frac{L}{2}\|x^+-x\|_2^2\\
&\overset{+y-y}{=}& \nabla f(x)^T(x^+-y )+\nabla f(x)^T(y-x)  + \frac{L}{2}\|x^+-y\|_2^2 + L(x^+-y)^T(y-x) + \frac{L}{2}\|x-y\|_2^2\\
&\overset{y-x = -\alpha \nabla f(x)}{=}&-\alpha (1-\frac{L\alpha}{2})\|\nabla f(x)\|_2^2+ (1-\alpha L)\nabla f(x)^T(x^+-y ) + \frac{L}{2}\|x^+-y\|_2^2
\end{eqnarray*}

However, at this point we actually get stuck, because in general, it is very hard to bound this residual term

$$ r :=  (1-\alpha L)\nabla f(x)^T(x^+-y ) + \frac{L}{2}\|x^+-y\|_2^2.$$

Previously, I tried to argue that this thing is negative ($r<0$); however, after drawing some pictures, it is clear that $\nabla f(x)^T(x^+-y)$ will more often be positive (these two terms tend to be aligned), and so for $\alpha < 1/L$, $r$ is typically positive. In fact, the best bound I could come up with is, for $\alpha \leq 1/L$,

$$r \leq \alpha (1-\frac{L\alpha}{2})\|\nabla f(x)\|_2^2\tag{descent}$$

which shows descent, but you can’t get a rate from it. (See appendix.) This makes sense, since $f(x^+)$ is not in general less than $f(y)$, so you wouldn’t expect the same technique for GD to give as aggressive a rate (in terms of $f$) as PGD. In fact, usually you would expect the opposite; $y$ steps in the maximally descending direction, but $x^+$ is a fix that brings back feasibility and can only hamper progress in objective value decay.

As a counterexample, consider $f(x) = (x+1)^2$ over the constraint $x \geq 0$, and suppose that we are already at the constrained minimizer $x^{(k)} = 0$. Clearly, $\|\nabla f(x^{(k)})\|_2 = 2 > 0$, but hereafter, all future iterates will have the same objective value. So we really can only show descent, not a descent amount that’s related to the gradient of $f$.
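
Here is a tiny numerical version of this counterexample (a sketch assuming NumPy; the step size is an arbitrary choice below $1/L$):

```python
# PGD on f(x) = (x+1)^2 over x >= 0, started at the constrained optimum x = 0:
# the gradient norm stays at 2 while the iterates and objective never move.
import numpy as np

f = lambda x: (x + 1.0)**2
grad = lambda x: 2.0 * (x + 1.0)
proj = lambda x: np.maximum(x, 0.0)     # projection onto {x >= 0}

alpha, x = 0.1, 0.0                     # step size < 1/L = 1/2
for k in range(5):
    x = proj(x - alpha * grad(x))
    print(k, x, f(x), abs(grad(x)))     # x stays 0, f stays 1, |grad| stays 2
```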

 

 

the proof (corrected)

After knocking my head around a bit, I have come to the conclusion that the best way to go around this is to actually use the (normal, accepted) way of the mirror descent proofs. First, we need two important inequalities in our arsenal: the 3 point equality

$$2(x-y)^T(x-z) = \|x-y\|_2^2 + \|z-x\|_2^2 - \|z-y\|_2^2\tag{3 pt =}$$

and the “projections bring me closer to the set” inequality (see past posts)

$$\|y-x^*\|_2^2 \geq \|x^+-x^*\|_2^2\tag{closer}$$

\begin{eqnarray*}
f(x)-f(x^*) &\leq& \nabla f(x)^T(x-x^*)  \\
&=& \frac{1}{\alpha}(x-y)^T(x-x^*)\\
&\overset{\text{(3 pt =)}}{=}&\frac{1}{2\alpha}(\underbrace{\|x-y\|_2^2}_{\alpha^2\|\nabla f(x)\|_2^2} + \|x^*-x\|_2^2 – \|x^*-y\|_2^2)\\
&\overset{\text{(closer)}}{\leq}&\frac{1}{2\alpha}(\alpha^2\|\nabla f(x)\|_2^2 + \|x^*-x\|_2^2 – \|x^*-x^+\|_2^2)
\end{eqnarray*}

and after telescoping, we get

$$
\frac{1}{k}\sum_{i=1}^k(f(x^{(i)}) - f(x^*))\leq \frac{\alpha}{2}\underbrace{\frac{1}{k}\sum_{i=1}^k\|\nabla f(x^{(i)})\|_2^2 }_{=:\,\overline{G}_k} + \frac{\|x^*-x^{(0)}\|_2^2 - \|x^*-x^{(k)}\|_2^2}{2k\alpha}.
$$

This averaged gradient term satisfies $\frac{\alpha}{2}\overline{G}_k \to \frac{\alpha}{2}\|\nabla f(x^*)\|_2^2$, which is not 0 in general. Recall also that we showed PGD descends, so that $f(x^{(i+1)}) \leq f(x^{(i)})$, and hence the last iterate is no worse than the average. Thus we can conclude

$$
f(x^{(k)}) - f(x^*)\leq  \frac{\|x^*-x^{(0)}\|_2^2 - \|x^*-x^{(k)}\|_2^2}{2k\alpha} + \frac{\alpha}{2} \overline{G}_k\;\to\; O(1/k) + O(\alpha \|\nabla f(x^*)\|_2^2).
$$

 

Appendix

proof that

$$r=  (1-\alpha L)\nabla f(x)^T(x^+-y ) + \frac{L}{2}\|x^+-y\|_2^2 \leq \alpha (1-\frac{L\alpha}{2})\|\nabla f(x)\|_2^2$$

Actually, this is just a consequence of the definition of projection, which says that, since $x\in \mathcal D$, then

$$
\|x^+-y\|_2 \leq \|x-y\|_2 =\alpha \|\nabla f(x)\|_2.
$$

Therefore, if $\alpha \leq \frac{1}{L}$, then using Cauchy-Schwarz,

\begin{eqnarray*}
r&=&(1-\alpha L)\nabla f(x)^T(x^+-y ) + \frac{L}{2}\|x^+-y\|_2^2 \\
&\overset{\text{C-S}}{\leq} &(1-\alpha L)\|\nabla f(x)\|_2\|x^+-y\|_2 + \frac{L}{2}\|x^+-y\|_2^2\\
&\overset{\text{closer}}{\leq}&(1-\alpha L)\|\nabla f(x)\|_2\|x-y\|_2 + \frac{L}{2}\|x-y\|_2^2\\
&\overset{\|x-y\|_2=\alpha\|\nabla f(x)\|_2}{\leq}&\alpha (1-\frac{\alpha L}{2})\|\nabla f(x)\|_2^2
\end{eqnarray*}

 

Video time! Discretization methods on quadratic and logistic regression

Hi all, this is another “tiny project” that I’ve been wanting to do for a while. Lately, I’ve been somewhat obsessed with method flows and discretization methods. While we can sit here and write integrals until our faces turn blue, I think it’s time to just simulate some stuff, and just see what happens. In this post I explore 4 problems:

Problems

  • Quadratic optimization, somewhat ill conditioned (duh.)
  • Normalized quadratic optimization, to look at nonsmoothness and bad curvature.
  • Logistic regression, well separated
  • Logistic regression, poorly separated

The methods include

  • GD = Euler’s discretization (explicit)
  • MDPT = Midpoint method (explicit)
  • RK4 = 4th order Runge Kutta method (explicit)
  • BE = Backward Euler (implicit)
  • BDF2 = 2nd order Backward Differentiation Formula (implicit)
  • TZ = Trapezoidal rule (implicit).

Here, an approximate implicit method solves the implicit equation with 1 inner gradient step, and a full implicit method with 100 inner gradient steps.
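
As a minimal illustration of the explicit/implicit distinction (a NumPy sketch, not the actual code behind the videos), here are two of the methods above on the quadratic flow $\dot x = -Qx$, i.e. $f(x) = \frac12 x^TQx$; the matrix $Q$ and step size $h$ are illustrative choices, picked so that explicit Euler diverges while a fully solved backward Euler does not.

```python
# Explicit Euler (GD) vs. backward Euler on xdot = -Q x.
import numpy as np

Q = np.diag([10.0, 1.0])                 # an ill-conditioned quadratic
h = 0.25                                 # too big for explicit Euler (h > 2/10)
I = np.eye(2)

x_euler = np.array([1.0, 1.0])
x_be = np.array([1.0, 1.0])

for k in range(40):
    x_euler = x_euler - h * (Q @ x_euler)            # GD = explicit Euler
    x_be = np.linalg.solve(I + h * Q, x_be)          # backward Euler: (I + hQ) x+ = x

print(np.linalg.norm(x_euler), np.linalg.norm(x_be))  # explicit blows up, implicit decays
```

(For the quadratic, the "full implicit" solve is just a linear system; the approximate implicit methods in the videos replace this solve by one inner gradient step.)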

 

 

Quadratic

$$f(x) = \frac{1}{2}x^TQx$$

 

 

Normalized quadratic

$$f(x) = \sqrt{x^TQx}$$

 

Logistic regression

In the logistic regression models,

$$f(\theta) = \frac{1}{m}\sum_{i=1}^m\log(1+\exp(-y_ix_i^T\theta))$$

where we generate data using the 2-blob model

$$x_i = y_i c + z_i $$

and $\|c\|_2$ is tuned to create separation. Included is not just the trajectory over the landscape, but also the classification boundary, so you can see whether the suboptimality actually has a real effect on classification or not.

 

Well separated logistic regression

 

 

 

 

Poorly separated logistic regression

 

 

 

 

Discussion

I had a few questions going into this tiny experiment, some of which were answered, some of which are still ephemeral.

  • Do implicit methods work? Answer: Hard to say. They definitely do hug the flow better, and in the case of the quadratic or normalized quadratic, they seem to help a lot with conditioning and with being able to pick large step sizes. However, for logistic regression, they don’t really contribute much to margin orientation, which seems to be better helped by explicit, noisy methods.
  • Are their approximate versions sufficient? Note that an approximate implicit method has the same per-iteration complexity as an explicit method. Answer: yes, the approximate methods are actually almost as good as the full ones, with regards to conditioning robustness and flow hugging!
  • Does it make sense to consider quadratic problems to represent machine learning tasks? Answer: This I kind of knew, but the answer is not really. Well, actually, it depends. For poorly separated blobs, a quadratic approximation is not bad and seems to characterize what’s going on well, although there the margin orientation is somewhat arbitrary and overfits on training-data idiosyncrasies. On the other hand, when the blobs are well separated, the step size choices, the flow characterization, all of that doesn’t seem to align well with margin orientation!

More studies need to be made in order to capture embedding learning as well, but anyway, it’s called a tiny project for a reason!

Finally, I leave you an image of our two favorite cheerleaders.

 

Some proofs of continuous time things

In trying to understand how continuous time interpretations can help optimization research, I find myself doing a lot of “reinventing the wheel,” but in a nontrivial way; I am getting these proofs that I’m sure are standard knowledge, but it’s not like we learned them back when we were kids. So I’m going to try and go through some of them here.

Here are some super simple proofs to get us started.

Gradient descent

This one is actually super easy and classical, with one of the earliest citations [1]. Here, we’re looking at unconstrained methods

$$
\min_{x} \quad f(x)\tag{U}
$$

and the method is very simply

\begin{eqnarray*}
\mathbf x_{k+1} &=& \mathbf x_k   - \alpha \nabla f(\mathbf x_k)
\end{eqnarray*}

with flow

\begin{eqnarray*}
\dot x(t) &=& -\nabla f(x(t)).
\end{eqnarray*}

Super classical. Then the convergence proof of the flow is simply

$$
\frac{d}{dt} f(x(t))= \nabla f(x)^T\dot x = -\|\nabla f(x)\|_2^2.
$$

Right away, we get that the method is descending, and (after showing that $\|\nabla f(x)\|_2$ also decays along the flow)  a rate

$$
\|\nabla f(x(t))\|_2^2 \leq \frac{1}{t} \int_0^t\|\nabla f(x(\tau))\|_2^2d\tau \leq \frac{f(x(0))-f(x^*)}{t}
$$

I’ve already discussed this one a bit in a previous post, so I’ll just stop here. Note that I did not say anything about convexity, compactness, etc.

Projected gradient descent

Now let’s make the problem slightly more complicated by doing the projected version.  Here, we want to solve

$$
\min_{x\in\mathcal D} \quad f(x)\tag{C}
$$

where $\mathcal D\subset \mathbb R^n$ is a convex, closed set.

We use the repeated iteration

\begin{eqnarray*}
\mathbf x_{k+1} &=& \mathrm{proj}_{\mathcal D}(\mathbf x_k   - \alpha \nabla f(\mathbf x_k))\tag{PGD}
\end{eqnarray*}

where

$$
\mathrm{proj}_{\mathcal D}(z):= \arg\min_{x\in \mathcal D} \; \|x-z\|_2.
$$

One way to reframe the method, then, is iteratively solving the following problem at each step

\begin{eqnarray*}
\mathbf x_{k+1} &=& \arg\min_x \frac{1}{2}\|x – \mathbf x_k   + \alpha \nabla f(\mathbf x_k)\|_2^2 + I_{\mathcal D}(x)
\end{eqnarray*}

where $ I_{\mathcal D}(x)$ is the indicator for the constraint $x\in \mathcal D$.

From that, we can write the optimality condition

$$
\mathbf x_{k} - \mathbf x_{k+1}  \in    \alpha \nabla f(\mathbf x_k) + \mathcal N_{\mathcal D}(\mathbf x_{k+1}) \tag{PGD-opt}
$$

and $\mathcal N_{\mathcal D}(x)$ is the normal cone of $\mathcal D$ at $x$, described as

$$
\mathcal N_{\mathcal D}(x) := \partial I_{\mathcal D}(x) = \{z : z^T(x-x’) \geq 0, \; \forall x’\in \mathcal D\}. \tag{NC-def}
$$

And, again using the approximation $\dot x(t) \approx \frac{\mathbf x_{k+1} - \mathbf x_{k}}{\alpha}$, we get

$$
-\dot x - \nabla f(x) \in \mathcal N_{\mathcal D}(x). \tag{PGD-flow}
$$

(I’m a bit skeptical about doing this, because then it seems this method is the same as the subgradient method, which I had assumed would have a worse rate. But, then again… maybe not in the flow? After all, what is the harm of nonsmoothness in the continuous time regime? Still, I’d feel more comfortable if some $\frac{d}{dt} \nabla f(x)$ term appeared somewhere. For now we’ll go with it.)

Now here comes a crucial step. In the usual proof of the projected gradient method, we needed to show that projections cannot increase your distance to $x^*$. I wrote this as a “funny little proof” in a previous post. But in the continuous version, what we really want to say is that projections can only slow you down in your desired direction–that is, the $\dot x$ from PGD can’t be faster than the $\dot x$ from good ole GD. Concretely, I want to say that the $\dot x$ defined in (PGD-flow) satisfies

$$
\|\dot x\|_2 \leq \|\nabla f(x)\|_2. \tag{Proj-Contr-Cond}
$$

 

This is going to work as follows. We define $v(t) = -\dot x(t) - \nabla f(x(t))$. Then by (PGD-flow), $v\in \mathcal N_\mathcal D(x)$, or $v^T(x-x’) \geq 0$, for all $x’\in \mathcal D$. I’m going to pick $x’ = x - c \dot x$, for some imperceptibly small $c > 0$. Note that since the trajectory is always feasible, such a $c>0$ should exist, if we assume we initialize the method somewhere in $\mathcal D$. Plugging this choice in, $v^T(x-x’) = c\, v^T\dot x \geq 0$, and therefore $v^T\dot x \geq 0$.

Then, using the Pythagorean theorem,

$$
\|\nabla f(x)\|_2^2 = \|v+\dot x\|_2^2 = \|v\|_2^2 +\underbrace{ 2v^T\dot x}_{\geq 0}  + \|\dot x\|_2^2 \geq \|\dot x\|_2^2 \tag{Proj-Contr-Cont}
$$

And we’re done! See below for the super-high-tech figure.

Projected gradient descent. Note the almost Pythagorean relationship between $\dot x$, $v$, and $\nabla f(x)$.

Ok, now we can move on to the main event: the convergence proof!

$$
\frac{d}{dt} f(x(t)) = \nabla f(x)^T\dot x  \leq (\nabla f(x) + v)^T{\dot x} = -\|\dot x\|_2^2
$$

and therefore

$$
\frac{f(x(0))-f(x^*) }{t}\geq \frac{1}{t}\int_0^t \|\dot x(\tau)\|_2^2 d\tau
$$

i.e. we have an $O(1/t)$ bound on the average speed of the method. Note that it makes perfect sense that we can’t bound the average gradient of the method, since in a constrained problem, $\nabla f(x(t))\not\to 0$ in general.
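
(A small aside, not in the proof above: the discrete iteration satisfies an exact analogue of (Proj-Contr-Cond), because $\mathbf x_k\in\mathcal D$ and projections onto convex sets are nonexpansive, so $\|\mathbf x_{k+1}-\mathbf x_k\|_2 = \|\mathrm{proj}_{\mathcal D}(\mathbf x_k-\alpha\nabla f(\mathbf x_k)) - \mathrm{proj}_{\mathcal D}(\mathbf x_k)\|_2 \leq \alpha\|\nabla f(\mathbf x_k)\|_2$. Here is a minimal sketch checking this on a toy box-constrained least squares problem; the objective, the box $[0,1]^n$, and the step size are arbitrary choices of mine.)

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up smooth objective f(x) = 0.5 ||B x - c||^2 on the box D = [0, 1]^n.
n = 5
B = rng.standard_normal((n, n))
c = 3.0 * rng.standard_normal(n)
grad = lambda x: B.T @ (B @ x - c)
proj = lambda z: np.clip(z, 0.0, 1.0)        # Euclidean projection onto the box

L = np.linalg.norm(B.T @ B, 2)               # smoothness constant of f
alpha = 1.0 / L

x = proj(rng.standard_normal(n))             # start somewhere feasible
for k in range(200):
    g = grad(x)
    x_next = proj(x - alpha * g)
    speed = np.linalg.norm(x_next - x) / alpha   # discrete stand-in for ||xdot||
    # discrete analogue of (Proj-Contr-Cond): the projection can only slow you down
    assert speed <= np.linalg.norm(g) + 1e-12
    x = x_next

print("||x_{k+1} - x_k|| / alpha never exceeded ||grad f(x_k)||")
```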

The proof extends more-or-less trivially to the proximal case, i.e. we solve

$$
\min_{x} \quad f(x) + g(x)\tag{U2}
$$

using the repeated iteration

$$
\mathbf x_{k+1} = \mathrm{prox}_{ \alpha g}(\mathbf x_k - \alpha \nabla f(\mathbf x_k)), \qquad
\mathrm{prox}_{g}(z):= \arg\min_x \; g(x) + \frac{1}{2}\|x-z\|_2^2.
$$
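
(For concreteness, a minimal sketch of this proximal iteration on a toy lasso-style problem, where $g(x) = \lambda\|x\|_1$ and the prox is the usual coordinatewise soft-thresholding. The data, $\lambda$, and the step size $\alpha = 1/L$ are arbitrary choices.)

```python
import numpy as np

rng = np.random.default_rng(1)

# f(x) = 0.5 ||B x - c||^2,  g(x) = lam * ||x||_1   (toy lasso problem)
m, n, lam = 20, 10, 0.5
B = rng.standard_normal((m, n))
c = rng.standard_normal(m)

grad_f = lambda x: B.T @ (B @ x - c)
# prox of t * ||.||_1 is coordinatewise soft-thresholding
prox_l1 = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

alpha = 1.0 / np.linalg.norm(B.T @ B, 2)     # 1/L for the smooth part
x = np.zeros(n)
for k in range(500):
    x = prox_l1(x - alpha * grad_f(x), alpha * lam)

obj = 0.5 * np.sum((B @ x - c) ** 2) + lam * np.sum(np.abs(x))
print(f"objective = {obj:.4f}, nonzeros = {np.count_nonzero(x)}/{n}")
```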

Edit: It occurred to me that a very strange thing happens when $x(t)$ is on a corner of $\mathcal D$, which brings out a problem with nonsmoothness. In this case, $\dot x(t)$ is not really well defined: we have

$$
\dot x_+(t) = \lim_{\Delta>0,\Delta \to 0} \frac{x(t+\Delta)-x(t)}{\Delta}, \qquad \dot x_-(t) = \lim_{\Delta>0,\Delta \to 0} \frac{x(t)-x(t-\Delta)}{\Delta}
$$

and they’re both different. Note that our relationship (Proj-Contr-Cont) only applies to $\dot x_+$, not $\dot x_-$, in this case. See the image below.

A literal “corner” case, showing that it matters whether the velocity is coming from the future or the past.

 

[1] Polyak, Boris T. Introduction to Optimization. Optimization Software, Inc., New York, 1987.

 

A fun convex set fact

This is actually not a new thing at all, but a thing I played with a couple of years ago and ended up using in a numerical optimization lecture to try to impress some undergrads. They were not impressed. Since then I’ve tuned it a bit and just kind of take it for granted now, but now that I think about it, it is kind of nontrivial, and has some interesting significance. It’s come up again in the short course, so I figured I’d write it up here as well.

Statement: Suppose $\mathcal C$ is a convex set in $\mathbb R^n$, and $y = \mathbf{proj}_{\mathcal C}(z)$, for any point $z\in \mathbb R^n$. (Euclidean projection.) Then for any point $x\in \mathcal C$, $$\|y – x\|_2 \leq \|z-x\|_2.$$

Significance: This is part of a larger story that says “projecting onto convex sets cannot hurt convex optimization”. Basically, you should think of $x = x^*$ as an optimal point of some constrained optimization problem. The statement says that this contraction is a given: every time I do a descent-step-plus-projection, the projection step can never push me farther from $x^*\in \mathcal C$. What I found interesting and nontrivial, though, is that this proof does not depend on any objective function $f$, so while we really want to use this statement for $x = x^*$, it actually applies for any $x\in \mathcal C$.

Proof by picture:

Note that the assumption that $\mathcal C$ is a convex set is essential; counterexamples easily exist when the set is not convex.

Proof: First, we can assume that $z\not\in \mathcal C$, since if $z\in \mathcal C$, then $z = y$ and the statement is trivially true.

Second, we can assume that $x\neq y$ (otherwise the statement is trivial). Since the projection onto a convex set is always unique, $y$ is the strict minimizer of the distance to $z$ over $\mathcal C$, so there exists some $\epsilon > 0$ where $\|y-z\|_2 + \epsilon \leq \|x-z\|_2$.

Now, the fact that $y$ is the projected point means that it’s the solution to the convex optimization problem
$$ y = \arg\min_{y’\in \mathcal C} \; \|y’-z\|_2^2,$$ and thus, writing $f(y’) = \frac{1}{2}\|y’-z\|_2^2$ (so that $\nabla f(y) = y-z$), we can apply the normal cone condition $-\nabla f(y)\in \mathcal N_{\mathcal C}(y)$:

$$
(z-y)^T(x-y)\leq 0 \qquad\qquad \text{(normal cone condition, since $x\in \mathcal C$)}.
$$

Then,

\begin{eqnarray*}
\|z-x\|_2^2 &=& \|(z-y) + (y-x)\|_2^2\\
&=& \underbrace{\|z-y\|_2^2}_{\geq 0} + \|y-x\|_2^2 + 2\overbrace{\underbrace{(z-y)^T}_{-\nabla f(y)^T}(y-x)}^{\geq 0 \text{ by N.C.}}\\
&\geq &
\|y-x\|_2^2.
\end{eqnarray*}

You may now be asking yourself: ok, but where did convexity fit in? Isn’t this only supposed to work when $\mathcal C$ is a convex set? (You can easily draw counterexamples that are nonconvex; see the star figure above.) Well, the normal cone of a nonconvex set $\mathcal S$, as defined above, is the same as the normal cone of its convex hull $\mathbf{conv}(\mathcal S)$. When projecting onto a nonconvex $\mathcal S$, there is no way of ensuring that the negative gradient will lie in the normal cone, which is the key property needed to make this proof work. But, if you are lucky and it does, then it probably means that you are taking steps that work equally well in the convex relaxation, and all will be fine! (Don’t count on it though!)
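
(A minimal numerical companion, with toy sets of my own choosing: projecting onto the unit disk, which is convex, never increases the distance to any point of the disk, while projecting onto the unit circle, which is nonconvex and can stand in for the star figure, gives an easy counterexample.)

```python
import numpy as np

rng = np.random.default_rng(2)

proj_disk = lambda z: z if np.linalg.norm(z) <= 1 else z / np.linalg.norm(z)
proj_circle = lambda z: z / np.linalg.norm(z)   # "projection" onto the nonconvex unit circle

# Convex case: the projected point y is never farther from any x in the disk than z is.
for _ in range(1000):
    z = 3.0 * rng.standard_normal(2)
    x = rng.standard_normal(2)
    x = x / max(np.linalg.norm(x), 1.0)         # force x into the unit disk
    y = proj_disk(z)
    assert np.linalg.norm(y - x) <= np.linalg.norm(z - x) + 1e-12

# Nonconvex case: an explicit counterexample on the unit circle.
z, x = np.array([0.1, 0.0]), np.array([-1.0, 0.0])   # x lies on the circle
y = proj_circle(z)                                    # projects to (1, 0)
print(f"nonconvex: ||y - x|| = {np.linalg.norm(y - x):.2f} > ||z - x|| = {np.linalg.norm(z - x):.2f}")
```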

 

 

After posting findings

The extension of this statement to general Bregman divergences can be found in Nick’s course notes: https://www.cs.ubc.ca/~nickhar/F18-531/Notes20.pdf

Edit (July 2022): Thanks to Mathurin Massias, who showed me how to drastically shorten the proof. (The previous version was like 5x longer.)

Small proof: $L$-smoothness definitions and equivalences

tl;dr

In a former version of this post, I conjectured that while $L$-smoothness and convexity were not equivalent in definition, they were tied in some of the “equivalent” (not equivalent) versions of their statements.

To clarify, my former statement:

“Convexity and smoothness are totally independent concepts and should not be confused with each other “

I still believe is true, and (about 80%) it carries through to the equivalent forms as well. Thus no groundbreaking optimization proofs were harmed in the making of this blog post.

But, there are still minor details to iron out.

“Equivalences”

The following are all considered “equivalent” definitions of $L$-smoothness. (tl;dr: they are not equivalent).

  • [D] $\|\nabla f(x)-\nabla f(y)\|_2\leq L\|x-y\|_2$
  • [Q] $f(x)-f(y)-\nabla f(y)^T(x-y) \leq \frac{L}{2}\|x-y\|_2^2$
  • [Q2] $ g(x) = \frac{L}{2}x^Tx-f(x)$ is convex
  • [M] $(\nabla f(x)-\nabla f(y))^T(x-y) \leq L\|x-y\|_2^2$
  • [H] Largest eigenvalue of $\nabla^2 f(x)$ is less than $L$

Now, some details:

  • [D] is in fact the definition of $L$-smoothness. It does not and cannot depend on convexity, since if $f$ is $L$-smooth, then so is $-f$.
  • [Q], [Q2], [M] are all equivalent. See also citation [1] below.
  • Under the assumption that the Hessian exists everywhere, [Q], [Q2], [M] $\iff$ [H].

New conjecture (it probably exists somewhere, but I’m too lazy to find it and would rather prove it myself):

  1. [D] $\iff$ [Q], [Q2], [M] apply for both $f$ and $-f$
  2. [D] + Hessian exists everywhere $\iff$ largest magnitude eigenvalue of $\nabla^2 f(x)$ is less than $L$
  3. [Q], [Q2], [M] alone are trivially true for concave functions.
  4. Relationship with convexity: [D] $\Rightarrow$ [Q], [Q2], [M].  If additionally the function is convex, then [Q], [Q2], [M] $\Rightarrow$ [D]. But convexity here is a sufficient, not necessary, condition (as evidenced by $f(x) = \sin(x)$, which is $L$-smooth but not convex).

Proof of statement 3: This is pretty simple, and we can just go after [Q2]: if $f$ is concave then $-f$ is convex, and thus $g(x) = \frac{L}{2}x^Tx - f(x)$ is convex for any $L \geq 0$.

Proof of statement 2: This is more of a logic statement than anything else. It is still true that $L$-smoothness forces the Hessian magnitude to be less than $L$, and one simple way to see this is to use the Taylor approximation trick: i.e. given that $\nabla^2 f(x)$ is continuous everywhere, for every $x$ and $y$, there exists some $w$ where $$\nabla f(x)-\nabla f(y) = \nabla^2 f(w)(x-y).$$ Therefore, $$\|\nabla f(x)-\nabla f(y)\|_2 = \|\nabla^2 f(w)(x-y)\|_2,$$ and picking $x-y$ along the leading eigenvector of $\nabla^2 f(w)$ makes the right hand side equal to $\|\nabla^2 f(w)\|_2\|x-y\|_2$, which imposes that $\|\nabla^2 f(w)\|_2 \leq L$ if [D] is to hold.

Proof of statement 4: First, without extra bells or whistles, Nesterov already showed us that [D] $\Rightarrow$ [Q], [Q2], [M]. To see this, we just prove [Q]. First, the function difference can be written as an integral over inner products:
$$f(x)-f(y) = \int_0^1 \frac{\partial f (u(s))}{\partial s} ds = \int_0^1 \sum_{k=1}^n \frac{\partial f(u)}{\partial u_k} \frac{\partial u_k(s)}{\partial s} ds= \int_0^1 \nabla f(u(s))^T(x-y) ds$$
where $u(s) = (1-s) y + s x\in \mathbb R^n$. Therefore, the Bregman divergence is
$$f(x)-f(y)  - \nabla f(y)^T(x-y)= \int_0^1 (\nabla f(u(s)) - \nabla f(y))^T(x-y) ds$$
and using Cauchy-Schwarz and imposing $L$-smoothness,

$$f(x)-f(y)  - \nabla f(y)^T(x-y) \overset{\mathrm{C-S}}{\leq}  \int_0^1 \|\nabla f(u(s)) - \nabla f(y)\|_2\|x-y\|_2 ds \overset{\mathrm{D}}{\leq} L \int_0^1 \|u-y\|_2\|x-y\|_2 ds.$$

Since $\|u-y\|_2 = s\|x-y\|_2$, the integral reduces to
$$L \int_0^1 \|u-y\|_2\|x-y\|_2 ds = L \|x-y\|_2^2\int_0^1 s ds = \frac{L}{2}\|x-y\|_2^2.$$

Now suppose [Q2] is true, and also $f$ is convex. And, for fun, let’s just assume that the Hessian exists everywhere. Then, since the eigenvalues of $\nabla^2 g(x) = LI - \nabla^2 f(x)$ are $\geq 0$ and $f$ is convex, we get $0 \preceq \nabla^2 f(x) \preceq LI$. Note that [D] only requires $-LI \preceq \nabla^2 f(x) \preceq LI$; hence the sufficient but not necessary bit.

Finally, Proof of statement 1: Suppose that [Q2] holds for $f$ and $-f$. Then the eigenvalues of $\nabla^2 f(x)$ must be bounded between $-L$ and $L$, i.e. $-LI \preceq \nabla^2 f(x) \preceq LI$. This exactly gives the Hessian condition necessary for [D].

A could-be-fun-addendum: proving all of this without using the Hessian, aka for cases where the Hessian may not exist.
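
(And a quick numerical sanity check of the $\sin$ example in item 4: $f(x) = \sin(x)$ satisfies [D] and [Q] with $L = 1$, but its second derivative takes both signs, so it is not convex. The sampling range and the number of random points are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(3)
L = 1.0
f, df, d2f = np.sin, np.cos, lambda x: -np.sin(x)

xs, ys = rng.uniform(-10, 10, 100000), rng.uniform(-10, 10, 100000)

# [D]: |f'(x) - f'(y)| <= L |x - y|    (cos is 1-Lipschitz)
assert np.all(np.abs(df(xs) - df(ys)) <= L * np.abs(xs - ys) + 1e-12)

# [Q]: f(x) - f(y) - f'(y)(x - y) <= (L/2)(x - y)^2
assert np.all(f(xs) - f(ys) - df(ys) * (xs - ys) <= 0.5 * L * (xs - ys) ** 2 + 1e-12)

# ...but f is not convex: the second derivative takes both signs.
print("min f'' =", d2f(xs).min(), " max f'' =", d2f(xs).max())
```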

And, since it’s been a while, a cozy photo of a little friend

Citations

[1]: Blog of Xingyu Zhou, 2 pages in particular: https://xingyuzhou.org/blog/notes/Lipschitz-gradient and https://xingyuzhou.org/blog/notes/strong-convexity. By far the best set of reference notes I have seen on this topic.

Small proof: The gradient norm in gradient descent descends.

I’ve been racking my brain lately about whether there is an appropriate place to put “small proofs”, i.e. stuff that isn’t trivial, that I can’t find anywhere, but that isn’t cool enough to inspire a whole new paper. For now, I think I’ll put it here.

Background

We investigate $\underset{x}{\min}\, f(x)$ using gradient descent

$$ x^{(k+1)} = x^{(k)} – \alpha \nabla f(x^{(k)}) \qquad\qquad\text{(GD)}$$

As I’m working to pull together notes for a short course, I realized all my proofs are “average case proofs”, i.e. averaged function suboptimality or averaged gradient norm squared converges like $O(1/k)$. But because GD is a “descent method”, the averaged-suboptimality bound is a last-iterate bound in disguise, since

$$f(x^{(k+1)})\leq f(x^{(k)}) \Rightarrow f(x^{(k)}) - f(x^*) \leq \frac{1}{k} \sum_{i=1}^k \left(f(x^{(i)}) - f(x^*)\right) = O(1/k).$$

I wanted to do the same thing for the gradient norm case, but I could not find this proof around, even though intuitively it should be true. So I concocted one. Please let me know if this already exists, otherwise please cite this thing (url buried somewhere is probably sufficient) if you would like to steal it.

Statement: If $f$ is convex and $L$-smooth, and $0<\alpha < 2/L$, then $\|\nabla f(x^{(k+1)})\|_2\leq \|\nabla f(x^{(k)})\|_2$ for all $k$.

If this is true, then we can generalize average gradient results to last iterate results.

$$\|\nabla f(x^{(k)})\|_2^2 \leq \frac{1}{k} \sum_{i=1}^k \|\nabla f(x^{(i)})\|_2^2 = O(1/k).$$

Proof:

First,

$$ \| \nabla f(x^{(k+1)})\|_2^2 - \|\nabla f(x^{(k)})\|_2^2 = ( \nabla f(x^{(k+1)})-\nabla f(x^{(k)}))^T( \nabla f(x^{(k+1)})+\nabla f(x^{(k)})).$$

Using Taylor’s approximation theorem, there should exist some $z$ “between” $x^{(k)}$ and $x^{(k+1)}$  where $H = \nabla^2 f(z)$ satisfies

$$H(x^{(k+1)}-x^{(k)} ) = \nabla f(x^{(k+1)})-\nabla f(x^{(k)}).$$

Using this substitution,

\begin{eqnarray*}
\| \nabla f(x^{(k+1)})\|_2^2 - \|\nabla f(x^{(k)})\|_2^2 &=& (H(x^{(k+1)}-x^{(k)} ) )^T(H(x^{(k+1)}-x^{(k)} ) +2\nabla f(x^{(k)}))\\
&=& (H(x^{(k+1)}-x^{(k)} ) )^T(H(x^{(k+1)}-x^{(k)} ) -\frac{2}{\alpha}(x^{(k+1)}-x^{(k)})).
\end{eqnarray*}

Now we make life easier and just write $s = x^{(k+1)}-x^{(k)}$ to get

$$
\| \nabla f(x^{(k+1)})\|_2^2 - \|\nabla f(x^{(k)})\|_2^2 = s^TH^THs -\frac{2}{\alpha} s^THs.
$$

Now we invoke convexity. Since $f$ is convex, $H \succeq 0$ and a Cholesky factorization $H = CC^T$ exists (writing $C$ rather than the usual $L$ to avoid a clash with the smoothness constant). Setting $u = C^Ts$, we get

$$ \| \nabla f(x^{(k+1)})\|_2^2 - \|\nabla f(x^{(k)})\|_2^2 = u^THu -\frac{2}{\alpha} \|u\|_2^2 \leq \left(L - \frac{2}{\alpha}\right)\|u\|_2^2,$$

which is $\leq 0$ if $0<\alpha < 2/L$. (The last inequality uses $u^THu \leq L\|u\|_2^2$, which is $L$-smoothness again.)
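
(A minimal numerical check of the statement, on a random convex quadratic of my own choosing: with $0 < \alpha < 2/L$, the gradient norm never increases along the iterates.)

```python
import numpy as np

rng = np.random.default_rng(4)

# Convex quadratic f(x) = 0.5 x^T Q x + b^T x with Q positive semidefinite.
n = 8
M = rng.standard_normal((n, n))
Q = M @ M.T                        # PSD, so f is convex
b = rng.standard_normal(n)
grad = lambda x: Q @ x + b

L = np.linalg.norm(Q, 2)           # smoothness constant = largest eigenvalue of Q
alpha = 1.9 / L                    # any step size in (0, 2/L)

x = rng.standard_normal(n)
gnorm_prev = np.linalg.norm(grad(x))
for k in range(300):
    x = x - alpha * grad(x)
    gnorm = np.linalg.norm(grad(x))
    assert gnorm <= gnorm_prev + 1e-10   # the claimed monotone decrease
    gnorm_prev = gnorm

print("gradient norms decreased monotonically; final ||grad f|| =", gnorm_prev)
```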

Notes:

It’s unclear to me if convexity is really needed, or if smoothness alone should be sufficient. It seems like there should be a way around requiring convexity, since the biggest property used was $Lu^Tu \geq u^THu$, which is true even if $f$ is not convex, but other little parts of the proof seem nontrivial. Maybe to be continued…

Edit: quick counterexample for the nonconvex case: Suppose $f(x) = -x^2$ and start with $x^{(0)}\neq 0$. Then at every step, the gradient norm necessarily increases. We could make the minimum attainable by substituting, say, a sine function, but the violation still happens. Sad.
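
(The counterexample in a few lines; the step size here is an arbitrary choice in $(0, 2/L)$ with $L = 2$.)

```python
# f(x) = -x^2 is 2-smooth but not convex; gradient descent with any
# alpha in (0, 1) makes |grad f(x)| = |2x| grow geometrically.
alpha, x = 0.25, 1.0
for k in range(5):
    print(f"k={k}  |grad f| = {abs(-2.0 * x):.3f}")
    x = x - alpha * (-2.0 * x)     # GD step: x <- x + 2*alpha*x
```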

Edit: related works, generated through discussion: