The SVD as a Classification Theorem
Prerequisites: Linear algebra.
Introduction
The Singular Value Decomposition (SVD) is an extremely important concept in linear algebra, with practical applications ranging from image processing and matrix inversion to machine learning and statistics. However, despite its importance and (I would argue) how fundamental it is, the SVD is often presented in an exceptionally confusing manner. It took me a long time to appreciate the SVD for what it is because of this. As a prime example, Wikipedia defines the SVD as:
[T]he singular value decomposition of an $m \times n$ complex matrix $\mathbf{M}$ is a factorization of the form $\mathbf{U \Sigma V ^{∗}}$, where $\mathbf{U}$ is an $m\times m$ complex unitary matrix, $\mathbf {\Sigma }$ is an $m\times n$ rectangular diagonal matrix with nonnegative real numbers on the diagonal, and $\mathbf{V}$ is an $n\times n$ complex unitary matrix.
What a mouthful! Wikipedia defines the SVD in a computationally useful way, but I think that it misses the point entirely. In this post, I try to explain how I think of the SVD: as a classification of maps between Hilbert spaces.
Conventions
In this post, I will use "vector space" to mean "finite dimensional vector space over $\mathbb{R}$ or $\mathbb{C}$." $\mathbb{R}$ or $\mathbb{C}$ is called the base field of the vector space, and when the base field is not important, I will use $\mathbb{K}$ to represent the base field.
Some terminology requires explanation. When I use map, I mean "linear function". An isomorphism between two vector spaces is an invertible map. An isomorphism from a vector space to itself is called an automorphism, and is equivalent to a "change of basis". In each section, I talk about classifying maps "up to isomorphism". What it means for two maps to be isomorphic differs from section to section, and will be explained in each section.
Classifying Maps Between Vector Spaces
Linear algebra is ostensibly the study of vector spaces, but really, it's all about maps between vector spaces. For one thing, vector spaces are easy to classify: every $n$-dimensional vector space is isomorphic to $\mathbb{K}^n$. As such, you can use a single integer to completely classify a vector space. On the other hand, maps between vector spaces are spicy. You can add maps together and compose them to get new maps. And, unlike vector spaces, you can't use a single number to classify all vector space maps.
... Right?
Before we can answer that question, I need to discuss what I actually mean when I talk about "classifying" a map. Two maps $P, Q: V \rightarrow W$ may behave the same even if they are not equal. For example, consider the following two maps from $V = \mathbb{R}^3$ to $W = \mathbb{R}^2$:

$$P(x, y, z) = (x, y), \qquad Q(x, y, z) = (y, x)$$
These two maps are not equal, but they are pretty similar. In fact, if you swap the two basis vectors of $W$, then $P$ becomes $Q$ and $Q$ becomes $P$. That is, $P$ and $Q$ are the same up to a change of basis of $W$. Informally, if you look at $W$ from a "different perspective", then $P$ "looks like" $Q$. For the purposes of this section, that's good enough for us. Similarly, two maps related by a change of basis of $V$ also behave "pretty much the same".
Definition: For this section only, two maps $P, Q: V \rightarrow W$ are considered isomorphic if they are the same up to an automorphism (i.e. change of basis) of $V$ and $W$.
In other words, $P$ and $Q$ are isomorphic if there exist automorphisms $A: V \rightarrow V$ and $B: W \rightarrow W$ such that $Q = B P A^{-1}$.
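To make this concrete, here's a minimal numerical sketch (using numpy; the matrices $P$ and $Q$ are hypothetical stand-ins) checking that two maps which differ only by a swap of the codomain's basis vectors are isomorphic in this sense:

```python
import numpy as np

# Hypothetical example: P projects R^3 onto its first two coordinates,
# and Q does the same but with the output coordinates swapped.
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
Q = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0]])

A = np.eye(3)                 # automorphism of V = R^3 (identity: no change needed)
B = np.array([[0.0, 1.0],     # automorphism of W = R^2 that swaps
              [1.0, 0.0]])    # the two basis vectors

# The W-automorphism is applied after P, the V-automorphism before it,
# so the two maps are related by a change of basis of W alone.
assert np.allclose(Q, B @ P @ np.linalg.inv(A))
```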
In future sections, we will look at maps with more structure and adopt stricter definitions of "isomorphism".
So, what properties of $T$ are invariant under isomorphism? One important property of $T$ is its image, but its image is a subspace of $W$, and usually isn't invariant under an automorphism of $W$. However, the dimension of a subspace doesn't change under automorphism (exercise: why not?). The dimension of the image is called the rank of $T$, and is an important invariant of $T$.
Another important subspace associated with $T$ is its nullspace in $V$. Once again, the nullspace of $T$ is not invariant under isomorphism, but its dimension is. This quantity is called the nullity of $T$. The well-known rank-nullity theorem states that the rank of $T$ plus its nullity is equal to $\dim{V}$, so either one can be computed from the other.
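The rank-nullity relationship is easy to check numerically. A quick sketch with numpy (the matrix $T$ below is a hypothetical example):

```python
import numpy as np

# A hypothetical map T: R^4 -> R^3. Its third row is the sum of the first
# two, so the image is only 2-dimensional.
T = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 3.0, 1.0, 1.0]])

rank = np.linalg.matrix_rank(T)    # dimension of the image
dim_V = T.shape[1]                 # dimension of the domain

# Compute the nullspace independently: rows of Vh whose singular value
# is (numerically) zero span the nullspace of T.
U, s, Vh = np.linalg.svd(T)
nullspace = Vh[rank:]
nullity = nullspace.shape[0]

assert np.allclose(T @ nullspace.T, 0.0, atol=1e-10)  # really nullified
assert rank + nullity == dim_V                         # rank-nullity theorem
```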
Surprisingly, the rank (or nullity) completely determines a map up to isomorphism.
Theorem: A map $T: V \rightarrow W$ is completely determined up to isomorphism by its rank.
Proof:
Let $r$ denote the rank of $T$. Find a basis $v_1, v_2, \dots v_n$ for $V$ where $v_{r+1}$ through $v_n$ span the nullspace of $T$. Let $w_i = T v_i$ for $i \leq r$. Since no nontrivial combination of the $v_i$'s for $i \leq r$ is in the nullspace of $T$, these $w_i$'s are linearly independent. Thus, they can be completed to a basis $w_1, w_2, \dots w_m$ of $W$. $T$ sends the first $r$ basis vectors of $V$ to the first $r$ basis vectors of $W$, and nullifies the other basis vectors of $V$. In other words, $T$ has the following form with respect to our chosen bases:

$$T = \begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix}$$

where $I_r$ is the $r \times r$ identity matrix.
Any other map from $V$ to $W$ with the same rank as $T$ can be brought into the same form. Thus, $T$ is completely specified up to isomorphism by its rank.
$\square$
In the above proof, we showed that $T$ is unique up to isomorphism by exhibiting a canonical form for $T$: writing $T$ in a simple form with respect to a particularly nice basis. This strategy is very important, and will show up again when we come to the SVD.
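Here's a small numpy sketch of that canonical form in action: starting from a hypothetical rank-2 matrix, it builds invertible matrices $A$ and $B$ (via the SVD, purely for convenience) that bring the map into the identity-block form:

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical rank-2 map T: R^4 -> R^3, built as a product of a
# random 3x2 and a random 2x4 matrix so its rank is (almost surely) 2.
T = rng.standard_normal((3, 2)) @ rng.standard_normal((2, 4))
m, n = T.shape
r = np.linalg.matrix_rank(T)

# From the factorization T = U S Vh, absorb the nonzero singular values
# into the codomain automorphism B; A is the domain automorphism.
U, s, Vh = np.linalg.svd(T)
d = np.ones(m)
d[:r] = s[:r]
B = U @ np.diag(d)        # invertible m x m
A = Vh.T                  # invertible n x n (real case)

canonical = np.linalg.inv(B) @ T @ A

# The result is an identity block of size r, padded with zeros.
expected = np.zeros((m, n))
expected[:r, :r] = np.eye(r)
assert np.allclose(canonical, expected)
```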
Extensions: This section holds true for any base field, and even for modules over division rings. Closely related ideas such as the structure theorem and Smith normal form extend these concepts to modules over PIDs as well.
Classifying Maps Within a Vector Space
In the last section, we talked about maps between two distinct vector spaces and when those maps should be considered isomorphic. But we can also talk about maps from a vector space to itself. While not directly related to our study of the SVD, I think it is illuminating (and interesting!) to consider this case, too. For example, consider two maps $P, Q: V \rightarrow V$. In the previous section, we considered $P$ and $Q$ to be equivalent if they were related by automorphisms of their domain and codomain. However, since $P$ and $Q$ have the same domain and codomain, it often makes more sense to ask whether $P$ and $Q$ are the same under a single automorphism of $V$.
Definition: For this section only, two maps $P, Q: V \rightarrow V$ are isomorphic if they are related by an automorphism of $V$.
In other words, $P$ and $Q$ are isomorphic if there exists an automorphism $A: V \rightarrow V$ such that $Q = A P A^{-1}$.
Maps from a vector space to itself have more structure than maps between unrelated vector spaces. This is because you can compare vectors in the domain (i.e. input vectors) with vectors in the codomain (i.e. output vectors). For example, consider the following two maps $V \rightarrow V$ where $V = \mathbb{R}^2$:

$$P(v) = v, \qquad Q(v) = 2v$$
$P$ and $Q$ both have rank 2, but $P$ scales every vector by 1, while $Q$ scales every vector by 2. Thus, $P$ and $Q$ are not isomorphic, even though they would have been considered isomorphic in the last section.
Astute readers will recognize that the preceding observation can actually be phrased in terms of eigenvalues. An eigenvalue of $T: V \rightarrow V$ is a number $\lambda \in \mathbb{K}$ such that $T v = \lambda v$ for some nonzero $v \in V$ called an eigenvector of $T$. The eigenvalues of $T$ are clearly invariant under isomorphism. In fact, if $T$ is diagonalizable, then $T$ is uniquely determined up to isomorphism by its eigenvalues (exercise). However, not every map is diagonalizable, and in general, maps within a vector space are classified by their Jordan normal forms. What exactly the Jordan normal form entails is not particularly important to us, but interested readers should check the Wikipedia article.
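As a quick numerical illustration (with numpy and hypothetical random matrices): conjugating a map by an automorphism leaves its eigenvalues unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)

# A hypothetical map P: R^3 -> R^3 and a random automorphism A of R^3.
P = rng.standard_normal((3, 3))
A = rng.standard_normal((3, 3))          # invertible with probability 1

Q = A @ P @ np.linalg.inv(A)             # Q is isomorphic to P

# Eigenvalues are invariant under this notion of isomorphism.
eig_P = np.sort_complex(np.linalg.eigvals(P))
eig_Q = np.sort_complex(np.linalg.eigvals(Q))
assert np.allclose(eig_P, eig_Q, atol=1e-6)
```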
Extensions: the Jordan normal form exists for maps over any base field, not just the real or complex numbers. In general, the eigenvalues actually lie in an algebraic closure of the base field, which can complicate things. The rational canonical form is an alternative to the Jordan normal form that does not need an algebraic closure of the base field.
Classifying Maps Between Hilbert Spaces
A Hilbert space is, loosely speaking, a vector space with notions of length and angle. A good example of a Hilbert space is the standard Euclidean space $\mathbb{R}^n$. The length of a vector here is given by its Euclidean norm:

$$\|v\| = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}$$
Vector quantities from physics like velocity and displacement always lie in Hilbert spaces, and the lengths of these vectors (speed and distance, respectively) play an important role. Complex Hilbert spaces are a bit different and may seem odd at first glance, but they are a useful generalization of real Hilbert spaces and show up all the time in quantum mechanics.
Hilbert spaces are vector spaces with extra structure, and this extra structure leads to a different classification of maps between Hilbert spaces. This classification theorem is most commonly known as the Singular Value Decomposition, or SVD.
Rigorously, a (finite-dimensional) Hilbert space is a vector space along with an inner product. An inner product on $V$ is a map $\langle\cdot, \cdot\rangle: V \times V \rightarrow \mathbb{K}$ that satisfies the following properties:
- Linearity/Antilinearity:
  - $\langle a, c + d\rangle = \langle a, c \rangle + \langle a, d\rangle$
  - $\langle a + b, c\rangle = \langle a, c \rangle + \langle b, c\rangle$
  - $\langle a, \lambda c\rangle = \lambda \langle a, c \rangle$
  - $\langle \lambda a, c\rangle = \overline{\lambda} \langle a, c \rangle$
- Conjugate symmetry: $\langle a, b \rangle = \overline{\langle b, a\rangle}$
- Positive-definiteness: $\langle a, a\rangle$ is real, $\langle a, a\rangle \geq 0$, and $\langle a, a\rangle > 0$ if $a \neq 0$
It can be easily checked that the dot product on $\mathbb{R}^n$ is an inner product. The length of a vector $v \in \mathbb{R}^n$ is $\sqrt{v \cdot v}$. Similarly, we define the length of a vector $v$ in a Hilbert space to be $\sqrt{\langle v, v\rangle}$, and we write it as $\|v\|$. In linear algebra, the length of a vector is often called its norm. Two vectors in $\mathbb{R}^n$ are orthogonal if their dot product is zero. Similarly, we call any two vectors in a Hilbert space orthogonal if their inner product is zero.
A Hilbert space basis $e_1 \dots e_n$ is called orthonormal if $\|e_i\| = 1$ and $\langle e_i, e_j \rangle = 0$ when $i \neq j$. For example, the standard basis on $\mathbb{R}^n$ is an orthonormal basis. Orthonormal bases are the basis of choice when working in a Hilbert space.
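If you want to produce an orthonormal basis in practice, the Gram-Schmidt process (or a QR factorization, which amounts to the same thing) does the job. A short numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

# Build an orthonormal basis of R^4 by orthogonalizing a random basis.
# (np.linalg.qr performs, in effect, a Gram-Schmidt process.)
M = rng.standard_normal((4, 4))
Q, _ = np.linalg.qr(M)
basis = Q.T   # rows are the basis vectors e_1 ... e_4

# Check orthonormality: <e_i, e_j> = 1 if i == j, else 0.
gram = basis @ basis.T
assert np.allclose(gram, np.eye(4))
```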
So, how do our notions of equivalence change when we consider Hilbert spaces instead? To start with, a map between Hilbert spaces should only be considered an equivalence of Hilbert spaces if it preserves the inner product structure. Call a map of Hilbert spaces $T: V \rightarrow W$ a Hilbert space isomorphism if it is a vector space isomorphism and $\langle T v_1, T v_2 \rangle = \langle v_1, v_2 \rangle$ for all $v_1, v_2 \in V$. Such maps are often called orthogonal (if the base field is real) or unitary (if the base field is complex), but for simplicity's sake, I'll just call all of them "Hilbert space isomorphisms", or if the domain and codomain are the same, "Hilbert space automorphisms". Up to Hilbert space isomorphism, there is only one Hilbert space of each dimension, namely $\mathbb{K}^n$ with the standard inner product.
This more restrictive notion of an isomorphism of Hilbert spaces naturally leads to this section's definition of an isomorphism of maps between Hilbert spaces.
Definition: For this section only, two maps $P, Q: V \rightarrow W$ are isomorphic if they are related by Hilbert space automorphisms of $V$ and $W$.
In other words, $P$ and $Q$ are isomorphic if and only if there exist Hilbert space automorphisms $A: V \rightarrow V$ and $B: W \rightarrow W$ such that $Q = B P A^{-1}$.
What new properties do Hilbert space maps have that maps between vanilla vector spaces don't? For one thing, it is possible to talk about how "large" a map is. Consider the two maps below from $V = \mathbb{R}^2$ to $W = \mathbb{R}^3$:

$$P(x, y) = (x, y, 0), \qquad Q(x, y) = (5x, y, 0)$$
$P$ never increases the length of a vector, so that $\|Pv\| \leq \|v\|$. On the other hand, if $e_1$ is the first basis vector of $V$, then $\|Q e_1\| = 5$, so $Q$ can increase the length of a vector. In order to capture this property, we define the norm of a map as follows:

$$\|T\| = \max_{\|v\| \leq 1} \|T v\|$$
With the above examples, $\|P\| = 1$ and $\|Q\| = 5$. The norm is an extremely important property of a map between Hilbert spaces (exercise: show that the norm of a map is always finite). Note that the norm doesn't make sense for maps between vanilla vector spaces, because vanilla vector spaces don't carry a notion of "length".
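Numerically, the norm of a map equals its largest singular value, which numpy exposes as the matrix 2-norm. A sketch with a hypothetical random map:

```python
import numpy as np

rng = np.random.default_rng(3)

# A hypothetical map T: R^2 -> R^3.
T = rng.standard_normal((3, 2))

# The operator norm: the largest factor by which T can stretch a vector.
# numpy computes it as the matrix 2-norm (the largest singular value).
op_norm = np.linalg.norm(T, 2)

# Sanity check against random unit vectors: none should be stretched more.
for _ in range(1000):
    v = rng.standard_normal(2)
    v /= np.linalg.norm(v)
    assert np.linalg.norm(T @ v) <= op_norm + 1e-9
```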
There are other invariants too, but let's transition to the star players: the singular values of a map.
Theorem (The SVD): Let $T: V \rightarrow W$ where $V$ and $W$ are Hilbert spaces. Let $n = \dim V$. There exists an orthonormal basis $e_1, e_2, \dots e_n$ for $V$ and nonnegative real numbers $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_n$ such that $\|T e_i\| = \sigma_i$ and $\langle T e_i, T e_j \rangle = 0$ when $i \neq j$. The values $\sigma_i$ are called singular values, and the vectors $e_i$ are called right singular vectors. The singular values of $T$ are unique and determine $T$ up to isomorphism.
In other words, the SVD of $T$ gives a special basis for $V$ which plays nicely with the inner product structures on $V$ and $W$, and classifies $T$ by its action on this special basis.
The SVD Theorem can also be interpreted as finding a canonical form for $T$. The vectors $T e_i$ are pairwise orthogonal, and hence (when nonzero) linearly independent. In fact, when $\sigma_1 \dots \sigma_k$ are the nonzero singular values of $T$, the vectors $\frac{T e_1}{\|T e_1\|} \dots \frac{T e_k}{\|T e_k\|}$ are orthonormal. Thus, they can be extended to an orthonormal basis of $W$. Under this basis, $T$ has the exact form:

$$T = \begin{pmatrix} \sigma_1 & & & \\ & \ddots & & \\ & & \sigma_k & \\ & & & 0 \end{pmatrix}$$
where the first $k$ diagonal entries are the first $k$ singular values of $T$, and everything else is zero. In a sense, all $T$ does is "scale" its right singular vectors by their singular values. It's actually kind of crazy that maps between Hilbert spaces, although a very rich subject, can all be brought into such a simple canonical form! And that, I think, is the beauty of the SVD.
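You can watch the theorem happen numerically. In numpy's SVD, the rows of `Vh` are exactly the right singular vectors, so the statement of the theorem can be checked directly (with a hypothetical random map):

```python
import numpy as np

rng = np.random.default_rng(4)

# A hypothetical map T: R^4 -> R^3.
T = rng.standard_normal((3, 4))

# numpy's SVD: T = U @ diag(s) @ Vh. The rows of Vh are the right
# singular vectors e_i, and s holds the (decreasing) singular values.
U, s, Vh = np.linalg.svd(T)
e = Vh          # e[i] is the i-th right singular vector

for i in range(4):
    sigma_i = s[i] if i < len(s) else 0.0   # nullspace direction has sigma = 0
    # ||T e_i|| = sigma_i
    assert np.isclose(np.linalg.norm(T @ e[i]), sigma_i, atol=1e-10)
    for j in range(i + 1, 4):
        # <T e_i, T e_j> = 0
        assert np.isclose(np.dot(T @ e[i], T @ e[j]), 0.0, atol=1e-10)
```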
The proof of the SVD theorem is a bit technical, and is not necessary for understanding the SVD in practice. However, I am a big fan of backing up my claims with a proof, so here it is.
Proof:
Let $T: V \rightarrow W$. If $\dim V = 0$, then the SVD exists trivially, so we assume that $\dim V > 0$. The first step in finding the SVD is determining the first right singular vector: a unit vector whose length is stretched the most by $T$.
Let $X \subset V$ be the set of all vectors with norm less than or equal to 1. It turns out that there exists a vector $v \in X$ which maximizes $\|T v\|$. This is an intuitive fact, but not one with a simple proof. Because $X$ is closed and bounded, the multivariable extreme value theorem guarantees that any continuous function $f: X \rightarrow \mathbb{R}$ attains a maximum at some $v \in X$. However, the multivariable extreme value theorem can be tricky to prove, and involves far more analysis than I want to include in this blog post. I hope you can forgive me if I present this one fact on faith.
Let $v \in X$ maximize $\|T v\|$; in other words, $\|T v\| = \|T\|$. Note that unless $T = 0$ (in which case the theorem is trivial), $v$ has unit norm. We will now prove that for any vector $v' \in V$ orthogonal to $v$, $T v'$ and $T v$ are also orthogonal. Let $u = \frac{v + t v'}{\|v + tv'\|}$ where $t$ is a yet to be determined small real number. $T u$ has the following norm:

$$\|T u\|^2 = \frac{\|T v\|^2 + t\left(\langle T v, T v' \rangle + \langle T v', T v \rangle\right) + t^2 \|T v'\|^2}{\|v + t v'\|^2}$$
Since $v$ and $v'$ are orthogonal, $\|v + t v'\|^2 = \|v\|^2 + \mathcal{O}(t^2)$. For those who are unfamiliar with big-O notation, this means that $\|v + tv'\|^2$ and $\|v\|^2$ are very close, and their difference is at most a multiple of $t^2$.
Using big-O notation and the fact that $\|v\| = 1$, we can also simplify our expression for $\|T u\|^2$:

$$\|T u\|^2 = \|T v\|^2 + t\left(\langle T v, T v' \rangle + \langle T v', T v \rangle\right) + \mathcal{O}(t^2)$$
So, unless $\langle T v, T v'\rangle + \langle T v', T v \rangle = 0$, $Tu$ would have larger norm than $T v$ for some (small) values of $t$. But since $u$ has unit norm, $u \in X$, so $\|T u \| \leq \|T v\|$. This would be a contradiction. Thus,

$$\langle T v, T v' \rangle + \langle T v', T v \rangle = 0$$
Over the real numbers, this is enough to prove that $Tv$ and $Tv'$ are orthogonal, since an inner product over the reals is symmetric. Over the complex numbers, the inner product is conjugate-symmetric, so we find that:

$$\langle T v, T v' \rangle + \overline{\langle T v, T v' \rangle} = 0$$
In other words, $\langle T v, T v' \rangle$ is purely imaginary. Replacing $v'$ by $i v'$ in this formula (which is fine because $v$ is also orthogonal to $i v'$), and then expanding, we obtain the following:

$$\langle T v, T(i v') \rangle + \langle T(i v'), T v \rangle = i \langle T v, T v' \rangle - i \langle T v', T v \rangle = 0$$
Dividing by $i$, this says $\langle T v, T v' \rangle = \langle T v', T v \rangle = \overline{\langle T v, T v' \rangle}$, which shows that $\langle T v, T v' \rangle$ is purely real. The only number that is both purely real and purely imaginary is 0, so $\langle T v, T v' \rangle = 0$.
Let $e_1 = v$, and let $\sigma_1 = \|T\|$. Also rename $V$ to $V_1$, and rename $T$ to $T_1$. Define $V_2$ to be the orthogonal complement of $e_1$ in $V_1$, that is, the subspace of all vectors which are orthogonal to $e_1$. Since $V_2$ is a subspace of a Hilbert space, $V_2$ is itself a Hilbert space. Let $T_2$ be the restriction of $T_1$ to $V_2$.
As long as $\dim V_2 \neq 0$, this process can be repeated with $T_2: V_2 \rightarrow W$ to obtain $e_2$ and $\sigma_2$, as well as $V_3$ and $T_3$. At each step, the dimension of $V_i$ decreases by 1 (exercise). Thus, if $n = \dim V$, the process can be iterated $n$ times, yielding $n$ vectors $e_1 \dots e_n$ and $n$ nonnegative real numbers $\sigma_1 \dots \sigma_n$.
Each $e_i$ has unit norm. When $i < j$, $e_j$ lies in the orthogonal complement of $e_i$, so $\langle e_i, e_j \rangle = 0$ and, by the orthogonality result proved above, $\langle Te_i, Te_j \rangle = 0$. Since $T_{i+1}$ is a restriction of $T_i$, $\|T_{i+1}\| \leq \|T_i\|$. As $\sigma_i = \|T_i\|$, this shows that $\sigma_{i+1} \leq \sigma_i$. Thus, $\sigma_1 \dots \sigma_n$ are singular values of $T$, and $e_1 \dots e_n$ are right singular vectors of $T$.
Any two maps with the same singular values can be brought to this common canonical form by using their right singular vectors as a basis for $V$. Thus, any two maps with the same singular values are isomorphic. It remains only to prove that the singular values of a map are unique. To do this, we need the following technical lemma.
Lemma: Let $e_1 \dots e_n$ be one SVD basis for $T$ with singular values $\sigma_1 \dots \sigma_n$. Let $e_1' \dots e_n'$ be another SVD basis for $T$ with singular values $\sigma_1' \dots \sigma_n'$. If $\sigma_i \neq \sigma_j'$, then $\langle e_i, e_j' \rangle = 0$.
You can expand $e_j'$ in the $e_1 \dots e_n$ basis as:

$$e_j' = \sum_{i=1}^n \langle e_i, e_j' \rangle e_i$$
Thus,

$$T e_j' = \sum_{i=1}^n \langle e_i, e_j' \rangle \, T e_i$$
and, since the vectors $T e_1 \dots T e_n$ are pairwise orthogonal with $\|T e_i\| = \sigma_i$,

$$\langle T e_i, T e_j' \rangle = \sigma_i^2 \langle e_i, e_j' \rangle$$
Similarly, by swapping the roles of the $e_1 \dots e_n$ basis and the $e_1' \dots e_n'$ basis,

$$\langle T e_i, T e_j' \rangle = \sigma_j'^2 \langle e_i, e_j' \rangle$$
Comparing the two expressions, if $\sigma_i^2 \neq \sigma_j'^2$, then $\langle e_i, e_j' \rangle = 0$. Since singular values are nonnegative, this holds exactly when $\sigma_i \neq \sigma_j'$. This proves the lemma.
Now assume for contradiction that $T$ has two singular value decompositions with different singular values. In one SVD, $T$ has right singular vectors $e_1 \dots e_n$ with singular values $\sigma_1 \dots \sigma_n$; in the other, $T$ has right singular vectors $e_1' \dots e_n'$ with singular values $\sigma_1' \dots \sigma_n'$. Suppose that $\sigma_i = \sigma'_i$ for $i = 1 \dots k-1$, but $\sigma_k > \sigma'_k$. Let $D$ be the set of right singular vectors in the first SVD whose singular values equal $\sigma_k$, and let $D'$ be the corresponding set in the second SVD. Since both lists of singular values are decreasing and agree in their first $k-1$ entries, we must have $|D| > |D'|$. By the lemma, every vector in $D$ is orthogonal to every $e_j'$ with $\sigma_j' \neq \sigma_k$, so every vector in $D$ lies in the span of $D'$. But $D$ consists of $|D| > |D'|$ orthonormal vectors, which cannot fit inside the span of the $|D'|$ vectors of $D'$. This is a contradiction.
Thus, the singular values of $T$ are unique.
$\square$
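The proof above is constructive, and you can actually run it. The sketch below mirrors the proof's greedy strategy: find a unit vector maximizing $\|Tv\|$ (here via power iteration on $T^T T$, a computable stand-in for the proof's appeal to the extreme value theorem), then restrict to the orthogonal complement and repeat:

```python
import numpy as np

def greedy_svd(T, iters=1000):
    """Mirror the proof: repeatedly find a unit vector maximizing ||T v||,
    then restrict to the orthogonal complement and repeat."""
    m, n = T.shape
    rng = np.random.default_rng(0)
    P = np.eye(n)                    # orthogonal projector onto the subspace V_i
    sigmas, basis = [], []
    for _ in range(n):
        M = P @ T.T @ T @ P          # the quadratic form v -> ||T_i v||^2 on V_i
        v = P @ rng.standard_normal(n)
        for _ in range(iters):       # power iteration for the top direction
            w = M @ v
            norm = np.linalg.norm(w)
            if norm < 1e-14:         # T vanishes on V_i; any unit vector works
                break
            v = w / norm
        v = v / np.linalg.norm(v)
        sigmas.append(np.linalg.norm(T @ v))
        basis.append(v)
        P = P - np.outer(v, v)       # pass to the orthogonal complement of e_i
    return np.array(sigmas), np.array(basis)

T = np.random.default_rng(5).standard_normal((3, 4))
sigmas, basis = greedy_svd(T)

# The greedy construction recovers numpy's singular values (plus a zero
# for the direction in the nullspace of T).
ref = np.append(np.linalg.svd(T, compute_uv=False), 0.0)
assert np.allclose(sigmas, ref, atol=1e-5)
```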
Extensions: a lot of this section also holds for Hilbert spaces over the quaternions, although no one appears to use them. There is also a variant of the SVD for compact maps between infinite-dimensional Hilbert spaces. It seems that general maps between infinite-dimensional Hilbert spaces are much harder to classify.
Classifying Maps Within a Hilbert Space
Just as with vanilla vector spaces, maps within a Hilbert space have more structure than maps between Hilbert spaces. You know the drill by now: new structure means a new notion of "isomorphism".
Definition: For this section only, two maps $P, Q: V \rightarrow V$ are isomorphic if they are related by a Hilbert space automorphism of $V$.
In other words, $P$ and $Q$ are isomorphic if there is some Hilbert space automorphism $A: V \rightarrow V$ such that $Q = A P A^{-1}$.
Unfortunately, I don't know a good classification theorem for this category of maps. In any case, such a classification would be at least as complicated as the Jordan normal form, which would already take us on too long a diversion. However, there are specific cases when a map within a Hilbert space can be put into a canonical form, and this canonical form can be used to classify the map. The most important case is that of self-adjoint maps.
The adjoint of a map $T: V \rightarrow W$ is the unique map $T^{\dagger}: W \rightarrow V$ satisfying the following condition for all $v \in V$ and $w \in W$:

$$\langle w, T v \rangle = \langle T^{\dagger} w, v \rangle$$
A map $T: V \rightarrow V$ is called self-adjoint if $T = T^{\dagger}$. Equivalently, $\langle v_1, Tv_2\rangle = \langle Tv_1, v_2\rangle$ for all $v_1, v_2 \in V$. In the real case, these maps are often called symmetric, because any matrix representation of a symmetric map is symmetric along the main diagonal. In the complex case, they are sometimes called Hermitian. Self-adjoint maps are very common, and fortunately, they have a nice classification theorem based on eigenvalues.
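Concretely, for matrices over $\mathbb{C}$ with the standard inner product, the adjoint is the conjugate transpose, and the defining identity can be checked directly (with numpy and hypothetical random matrices):

```python
import numpy as np

rng = np.random.default_rng(6)

# For matrices over C with the standard inner product, the adjoint is the
# conjugate transpose.
T = rng.standard_normal((3, 2)) + 1j * rng.standard_normal((3, 2))
T_dag = T.conj().T

# Check <w, T v> = <T_dag w, v> for random vectors. np.vdot conjugates its
# first argument, matching the conjugate-linear-in-the-first-slot convention.
v = rng.standard_normal(2) + 1j * rng.standard_normal(2)
w = rng.standard_normal(3) + 1j * rng.standard_normal(3)
lhs = np.vdot(w, T @ v)
rhs = np.vdot(T_dag @ w, v)
assert np.isclose(lhs, rhs)

# A self-adjoint (Hermitian) map: S equals its own conjugate transpose.
S = T.conj().T @ T             # products of this form are always self-adjoint
assert np.allclose(S, S.conj().T)
```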
Theorem (Spectral Theorem): Suppose that $T: V \rightarrow V$ is self-adjoint. There exists an orthonormal basis $e_1, e_2, \dots e_n$ for $V$ and decreasing real numbers $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_n$ such that $T e_i = \lambda_i e_i$. These eigenvalues are unique, and they determine $T$ up to isomorphism.
There are a lot of really cool connections between the spectral theorem and the SVD. For example, the eigenvectors of a self-adjoint matrix are all singular vectors, and its singular values are the absolute values of its eigenvalues. It's common to prove the singular value decomposition using the spectral theorem, but the spectral theorem can also be proved directly from the SVD (harder exercise). I avoided using eigenvalues in my discussion of the SVD because I wanted to emphasize that the SVD is about maps between different vector spaces, while eigenvalues only apply to maps within a vector space. Besides, I see the SVD as more "fundamental" than the spectral theorem, and I think it deserves its own proof.
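The eigenvalue/singular-value connection is easy to check numerically (with a hypothetical random symmetric matrix):

```python
import numpy as np

rng = np.random.default_rng(7)

# A hypothetical self-adjoint (symmetric) map on R^3.
A = rng.standard_normal((3, 3))
S = (A + A.T) / 2

eigvals = np.linalg.eigvalsh(S)               # real, by the spectral theorem
sing_vals = np.linalg.svd(S, compute_uv=False)

# The singular values of a self-adjoint map are the absolute values
# of its eigenvalues, sorted in decreasing order.
assert np.allclose(np.sort(np.abs(eigvals))[::-1], sing_vals)
```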
Extensions: Much of this section applies to maps within a quaternion Hilbert space, too. The spectral theorem has a simple extension to normal maps within a Hilbert space. The spectral theorem becomes incredibly important in the theory of infinite-dimensional Hilbert spaces, and can even be extended to some unbounded maps on Hilbert spaces.