By convention, vectors are column vectors.
Set V a vector space (almost always real or complex).
Def: \| \cdot \| : V \to \mathbb R is a norm if it satisfies positive definiteness (\| v \| \geq 0, with equality if and only if v = 0), absolute homogeneity (\| \alpha v \| = |\alpha| \| v \|), and the triangle inequality (\| u + v \| \leq \| u \| + \| v \|).
Now, since this is a computational class, we only care about specific norms, almost all of which we can quickly write down.
Set V = \mathbb R^n or \mathbb C^n equivalently.
Def: The Minkowski or p-norm (for p \geq 1) is given by \| x \|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}, and we call the 2-norm the Euclidean norm and the 1-norm the Manhattan norm.
Def: The \infty-norm is the limit of p-norms as p \to \infty, and is given by \| x \|_\infty = \max_{i = 1, \dots, n} |x_i| = \lim_{p \to \infty} \| x \|_p.
Def: For a weight vector \underline w = \begin{bmatrix}w_1, \dots, w_n\end{bmatrix}^T \in \mathbb R^n, with each w_i > 0, the weighted p-norm is \| x \|_{\underline w, p} = \left(\sum_{i=1}^n w_i |x_i|^p\right)^{1/p}.
Def: In general, for any positive definite matrix A (that is, x^TAx > 0 for all x \neq 0), we may consider the Mahalanobis norm \| x \|_{A} = \left(x^T A x\right)^{1/2}.
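To make these concrete, here is a small NumPy sketch (the helper names are ours, not from the notes) computing the p-, weighted p-, and Mahalanobis norms and checking them against np.linalg.norm where possible.

```python
import numpy as np

def p_norm(x, p):
    # (sum_i |x_i|^p)^(1/p); matches np.linalg.norm(x, p)
    return (np.abs(x) ** p).sum() ** (1.0 / p)

def weighted_p_norm(x, w, p):
    # (sum_i w_i |x_i|^p)^(1/p), with all w_i > 0
    return (w * np.abs(x) ** p).sum() ** (1.0 / p)

def mahalanobis_norm(x, A):
    # (x^T A x)^(1/2) for positive definite A
    return float(np.sqrt(x @ A @ x))

x = np.array([3.0, -4.0, 12.0])
print(p_norm(x, 2), np.linalg.norm(x, 2))            # 13.0 13.0
print(p_norm(x, 1), np.linalg.norm(x, 1))            # 19.0 19.0
print(np.max(np.abs(x)), np.linalg.norm(x, np.inf))  # the infinity norm: 12.0 12.0

w = np.array([1.0, 2.0, 0.5])
A = np.diag(w)  # a positive definite matrix; diag(w) makes the two norms below agree
print(weighted_p_norm(x, w, 2), mahalanobis_norm(x, A))
```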
As convention, the “default” norm when a subscript is omitted is the Euclidean norm.
Set V = \mathbb R^{m \times n} or \mathbb C^{m \times n} equivalently.
Def: The Hölder p-norms are given by \| X \|_{H, p} = \left(\sum_{i=1}^m\sum_{j=1}^n |x_{ij}|^p \right)^{1/p}, and the Hölder 2-norm is called the Frobenius norm, which is also defined on infinite dimensional vector spaces as \| X \|_F = \left(\mathop{\mathrm{Tr}}(XX^*)\right)^{1/2} where ^* is the conjugate transpose.
Def: As before, we can take p \to \infty to get the Hölder \infty-norm given by \| X\|_{H, \infty} = \max_{\substack{i = 1, \dots, m \\ j = 1, \dots, n}} |x_{ij}|.
Def: We can also define norms on matrices by viewing them as linear maps A: \mathbb R^n \to \mathbb R^m; in particular, if we have some norm \| \cdot \|_a on \mathbb R^n and some norm \| \cdot \|_b on \mathbb R^m, we may define the operator norm (or induced norm) \| A \|_{a, b} = \max_{x \neq 0} \frac{\| A x \|_b}{\| x \|_a}. If the norms on the domain and codomain are both p-norms, we write \| A \|_{p} = \max_{x \neq 0}\frac{\| A x\|_p}{\| x \|_p} and call it the p-norm of A; the 2-norm is called the spectral norm. Further, the 1-norm and \infty-norm are just \| A \|_1 = \max_{j = 1, \dots, n} \left(\sum_{i=1}^m |a_{ij}| \right), which is the max column sum, and \| A \|_\infty = \max_{i = 1, \dots, m} \left(\sum_{j=1}^n |a_{ij}| \right), which is the max row sum; both facts are easy to check.
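As a quick numerical sanity check of the max column sum / max row sum formulas (a sketch; np.linalg.norm with ord=1, np.inf, 2 computes the induced norms):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))

col_sum = np.abs(A).sum(axis=0).max()   # max absolute column sum
row_sum = np.abs(A).sum(axis=1).max()   # max absolute row sum
sigma_max = np.linalg.svd(A, compute_uv=False)[0]  # largest singular value

print(np.isclose(col_sum, np.linalg.norm(A, 1)))        # induced 1-norm
print(np.isclose(row_sum, np.linalg.norm(A, np.inf)))   # induced infinity-norm
print(np.isclose(sigma_max, np.linalg.norm(A, 2)))      # spectral norm
# The Frobenius (Hoelder 2-) norm is a different quantity:
print(np.isclose(np.sqrt((np.abs(A) ** 2).sum()), np.linalg.norm(A, 'fro')))
```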
In general, for p \notin \{1, 2, \infty\}, computing \| A \|_p is NP-hard, and if we consider \| A \|_{p,q} then \|A\|_{\infty, 1} is hard and \|A\|_{1, \infty} is easy.
We may also want to consider some other desirable properties of our norms.
Some properties always hold.
Prop: Every norm is Lipschitz (with constant 1, with respect to the metric it induces).
Proof: Let our norm be \| \cdot \| : V \to \mathbb{R}. The triangle inequality immediately implies | \| u \| - \| v \| | \leq \| u - v \|.
Theorem (Equivalence of Norms): Set V a finite-dimensional vector space. Then every pair of norms \| \cdot \|_a, \| \cdot \|_b are equivalent to each other, i.e. there are constants c_1, c_2 > 0 such that for any v \in V, c_1\| v \|_b \leq \| v \|_a \leq c_2 \| v \|_b.
Proof: Induct on the dimension of V and see that every norm is equivalent to the infinity norm.
Def: We say that a sequence \{ x_k \}_{k=1}^\infty of vectors converges to x if \lim_{k \to \infty} \| x_k - x \| = 0.
Then, the above clearly shows that convergence in one norm implies convergence in every norm.
Def: Set V a K-vector space (where K = \mathbb{R}, \mathbb{C}). An inner product is a binary operation \langle \cdot, \cdot \rangle: V \times V \to K which satisfies conjugate symmetry (\langle u, v \rangle = \overline{\langle v, u \rangle}), linearity in the second argument, and positive definiteness (\langle v, v \rangle > 0 for v \neq 0).
Prop: For an inner product \left\langle \cdot, \cdot \right\rangle, \| v \| = \sqrt{ \left\langle v, v \right\rangle } is a norm. Furthermore, an arbitrary norm \| \cdot \| is induced by an inner product if and only if it satisfies the parallelogram law \|u + v\|^2 + \|u - v\|^2 = 2 \|u\| ^ 2 + 2 \|v\| ^2.
Theorem (Cauchy-Schwarz): Let \|\cdot\| be induced by \left\langle \cdot, \cdot \right\rangle. Then \left|\left\langle u, v \right\rangle\right| \leq \|u\| \|v\|.
Def: The standard Euclidean inner product on \mathbb{C}^n is \left\langle x, y\right\rangle = x^*y.
Def: The Frobenius inner product on \mathbb{C}^{m \times n} is \left\langle X, Y\right\rangle = \sum_{i=1}^m \sum_{j=1}^n \overline{x_{ij}} y_{ij} = \mathop{\mathrm{Tr}}(X^*Y); it induces the Frobenius norm.
Theorem (Hölder Inequality): For x, y \in \mathbb{C}^n and p, q \geq 1 with p^{-1} + q^{-1} = 1, we have |x^*y| \leq \|x\|_p \|y\|_q.
Theorem (Bessel Inequality): For x \in \mathbb{C}^n and an orthonormal basis e_1, \dots, e_n, we have \sum_{k=1}^n \left| \left\langle x, e_k \right\rangle\right|^2 \leq \|x\|_2^2.
Def: The outer product is a binary operator \mathbb{C}^m \times \mathbb{C}^n \to \mathbb{C}^{m \times n} taking (x, y) \mapsto xy^*.
Prop: A \in \mathbb{R}^{m \times n} is a (nonzero) outer product if and only if it has rank 1.
Def: The matrix product is a binary operator \mathbb{C}^{m \times n} \times \mathbb{C}^{n \times p} \to \mathbb{C}^{m \times p}. Set A = \left[\begin{matrix}\alpha_1 & \alpha_2 & \cdots & \alpha_m\end{matrix}\right]^T and B = \left[\begin{matrix}\beta_1 & \beta_2 & \cdots & \beta_p\end{matrix}\right]. Then, AB = \left[\begin{matrix} \alpha_1^T \beta_1 & \cdots & \alpha_1^T \beta_p \\ \alpha_2^T \beta_1 & \cdots & \alpha_2^T \beta_p \\ \vdots & \ddots & \vdots \\ \alpha_m^T \beta_1 & \cdots & \alpha_m^T \beta_p \end{matrix}\right] = \left[\begin{matrix}A\beta_1 & A\beta_2 & \cdots & A\beta_p\end{matrix}\right] = \left[\begin{matrix}\alpha_1^TB \\ \alpha_2^TB \\ \vdots \\ \alpha_m^TB\end{matrix}\right]. Alternatively, it is uniquely characterized as the matrix representing the composition of A and B as operators.
Prop: Let D = \mathop{\mathrm{diag}}(d_1, \dots, d_n), the diagonal matrix with entries d_1, \dots, d_n, and write A = \left[\begin{matrix}a_1 & \cdots & a_n\end{matrix}\right] in terms of its columns; then AD = \left[\begin{matrix}d_1 a_1 & \cdots & d_n a_n\end{matrix}\right], i.e. right-multiplication by D scales the columns of A. Similarly, left-multiplication by a diagonal matrix scales the rows.
Def: For a complex matrix A, an eigenvalue \lambda \in \mathbb{C} and eigenvector v \neq 0 satisfy Av = \lambda v.
Def: Furthermore, an eigenspace is the span of all eigenvectors corresponding to a single eigenvalue, the spectrum of A, \mathop{\mathrm{Spec}}(A), is the set of all eigenvalues of A, and the spectral radius is \rho(A) = \max_{\lambda \in \mathop{\mathrm{Spec}}(A)} |\lambda| = |\lambda_{\text{max}}|. Sometimes we will call this the top eigenvector/eigenvalue.
As a convention, usually we implicitly order eigenvalues, e.g. |\lambda_1| \geq |\lambda_2| \geq \cdots \geq |\lambda_n|.
Def: More generally, if v is an eigenvector of A^T, we say that v is a left eigenvector of A (and thus usual eigenvectors are right eigenvectors).
There are a few warnings about these things:
Def: A square matrix A \in \mathbb{C}^{n \times n} is diagonalizable if it is similar to a diagonal matrix.
Prop: A is diagonalizable if and only if it has n linearly independent eigenvectors.
Proof: The n linearly independent eigenvectors form a basis, so the matrix whose columns they are is invertible; change basis to the eigenvectors, scale by eigenvalues, and change basis back. In particular, let A \in \mathbb{C}^{n \times n} have eigenvalues \lambda_1, \dots, \lambda_n, with corresponding eigenvectors v_1, \dots, v_n; set \Lambda = \mathop{\mathrm{diag}}(\lambda_1, \dots, \lambda_n) and V = \left[\begin{matrix}v_1, \dots, v_n\end{matrix}\right]. Then A = V \Lambda V^{-1}.
Def: The above A = V \Lambda V^{-1} is called the eigenvalue decomposition, or EVD.
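A minimal NumPy sketch of the EVD; the randomly drawn matrix is assumed diagonalizable (which holds generically):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))

# Columns of V are eigenvectors; lam holds the eigenvalues.
lam, V = np.linalg.eig(A)
Lambda = np.diag(lam)

# A = V Lambda V^{-1}; for a generic real matrix the factors may be complex.
print(np.allclose(A, V @ Lambda @ np.linalg.inv(V)))  # True up to roundoff
print(np.allclose(A @ V, V @ Lambda))                 # A v_i = lambda_i v_i, columnwise
```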
Prop: The diagonal entries in a triangular matrix are just the eigenvalues.
Def: A matrix A \in \mathbb{C}^{n \times n} is normal if it commutes with its adjoint A^* (the conjugate transpose), and unitary if AA^* = A^*A = I.
Prop: A is unitary if and only if its columns (or rows) are orthonormal.
Theorem: A \in \mathbb{C}^{n \times n} is normal if and only if it is unitarily diagonalizable, i.e. if A = V\Lambda V^* with V unitary.
Def: A \in \mathbb{C}^{n \times n} is Hermitian if A^* = A.
Theorem (Spectral Theorem): A \in \mathbb{C}^{n \times n} is Hermitian if and only if A is unitarily diagonalizable with all real eigenvalues.
Corollary: A \in \mathbb{R}^{n \times n} is symmetric if and only if it is orthogonally diagonalizable with all eigenvalues real.
Def: Any matrix can be written in Jordan canonical form, i.e. A = XJX^{-1} where J is nonzero only on the diagonal and the superdiagonal, which has values only in \{0, 1\}, such as J = \begin{bmatrix} \lambda_1 & 1 & & & & \\ & \lambda_1 & 1 & & & \\ & & \lambda_1 & 0 & & \\ & & & \lambda_2 & 0 & \\ & & & & \lambda_3 & 1 \\ & & & & & \lambda_3 \end{bmatrix} for example.
The way it’s written above is not by coincidence: you can permute everything so that J is composed of Jordan blocks, which have the same entry all the way down the diagonal and have a superdiagonal of only ones, e.g. J = \begin{bmatrix} J_1 & & & \\ & J_2 & & \\ & & \ddots & \\ & & & J_k \\ \end{bmatrix} where each J_i = \lambda_i I + N, where N has all zero entries except on the superdiagonal, on which it is always one.
Unfortunately, the JCF is pretty useless in application.
Theorem (Golub-Wilkinson): The Jordan canonical form cannot be computed in finite precision.
Let’s return to the spectral radius, \rho(A) = \max_{\lambda \in \mathop{\mathrm{Spec}}(A)} |\lambda|.
Any nonzero nilpotent matrix shows that \rho is not a norm, but the following is true.
Prop: If \| \cdot \|: \mathbb{C}^{m \times m} \to \mathbb{R} is any consistent matrix norm, then \rho(A) \leq \| A \|.
Proof: Look at the norm of the image of a top eigenvector.
Theorem: Given any A \in \mathbb{C}^{n \times n} and any positive \epsilon, there is a consistent norm (in fact, an operator norm) \| \cdot \|: \mathbb{C}^{n \times n} \to \mathbb{R} such that \| A \| \leq \rho(A) + \epsilon.
Prop: Given any matrix norm \| \cdot \|: \mathbb{C}^{n \times n} \to \mathbb{R}, we have \rho(A) = \lim_{k \to \infty} \| A^k \|^{1/k}.
Lemma: \lim_{k \to \infty} A^k = 0 if and only if \rho(A) < 1.
Proof: (\implies) Set \lambda to be a top eigenvalue of A, and x a corresponding eigenvector. Then, A^k x = \lambda^k x. Send k \to \infty and conclude.
(\impliedby) By the above theorems, there is some operator norm \| \cdot \| such that \|A\| \leq \rho(A) + \epsilon < 1. Send k \to \infty and use the fact that operator norms imply \| A^k \| \leq \|A\|^k and win.
Theorem (Gershgorin Circle Theorem): Let A \in \mathbb{C}^{n \times n}, with entries a_{ij}, and for 1 \leq i \leq n set r_i = \sum_{j\neq i} |a_{ij}|. Then, every eigenvalue of A lies within the union of the Gershgorin discs G_i = \left\{ z \in \mathbb{C} \mid |z - a_{ii}| \leq r_i \right\}, and the number of eigenvalues in each connected component of this union is equal to the number of Gershgorin discs that constitute that component.
Proof: First, a lemma: if A \in \mathbb{C}^{n \times n} is strictly diagonally dominant, i.e. |a_{ii}| > \sum_{j \neq i}|a_{ij}| for all 1 \leq i \leq n, then A is invertible. To see this, take any x \in \ker(A), so that \sum_{j=1}^n a_{ij}x_j = 0 for every i, and look at an index k that witnesses \| x \|_\infty, i.e. |x_k| = \|x\|_\infty. Then -a_{kk}x_k = \sum_{j \neq k} a_{kj}x_j \implies |a_{kk}||x_k| \leq \sum_{j \neq k} |a_{kj}||x_j| \leq \left(\sum_{j \neq k} |a_{kj}|\right)|x_k|; if x_k \neq 0 we may divide by |x_k| and contradict strict diagonal dominance, so x = 0 and \ker(A) is trivial.
We proceed to prove that any z \notin \bigcup_{i=1}^n G_i cannot be an eigenvalue by showing that A - zI is invertible. But this is clear, since A - zI in that case is strictly diagonally dominant.
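A small sketch checking the Gershgorin statement numerically (the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5)) + 5 * np.diag(rng.standard_normal(5))

centers = np.diag(A)
radii = np.abs(A).sum(axis=1) - np.abs(centers)   # r_i = sum_{j != i} |a_ij|

eigs = np.linalg.eigvals(A)
# Every eigenvalue lies in at least one disc |z - a_ii| <= r_i.
print(all((np.abs(lam - centers) <= radii).any() for lam in eigs))  # True
```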
Def: The Schur decomposition of a matrix A is A = QUQ^* such that Q is unitary and U, the Schur form of A, is upper triangular.
Note that you can read off the eigenvalues of A from the diagonal of the Schur form, which is numerically stable to compute.
As an aside, the way one numerically finds the roots of a monic polynomial p(x) = x^m + \sum_{k=0}^{m-1} c_kx^k is by forming the companion matrix A = \begin{bmatrix} 0 & \cdots & 0 & -c_0 \\ 1 & \ddots &0 & -c_1 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & 1 & -c_{m-1} \end{bmatrix}, whose eigenvalues are exactly the roots of p.
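A sketch of this root-finding idea on a small monic cubic; note that np.roots builds a companion matrix internally as well:

```python
import numpy as np

# Roots of the monic cubic p(x) = x^3 - 6x^2 + 11x - 6 = (x - 1)(x - 2)(x - 3).
c = np.array([-6.0, 11.0, -6.0])   # c_0, c_1, c_2
m = len(c)

C = np.zeros((m, m))
C[1:, :-1] = np.eye(m - 1)   # ones on the subdiagonal
C[:, -1] = -c                # last column is -c_0, ..., -c_{m-1}

print(np.sort(np.linalg.eigvals(C)))                # approximately [1, 2, 3]
print(np.sort(np.roots([1.0, -6.0, 11.0, -6.0])))   # np.roots does the same thing
```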
Def: The singular value decomposition (SVD) of a real (resp. complex) matrix A \in \mathbb{C}^{m \times n} is A = U \Sigma V^* where U, V are orthogonal (resp. unitary) and \Sigma is diagonal (on the shorter diagonal) with nonnegative entries; that is, if m > n, then \Sigma = \left[\begin{matrix}\mathop{\mathrm{diag}}(\sigma_1, \dots, \sigma_n) \\ \mathbf 0\end{matrix}\right] \in \mathbb{R}^{m \times n}_{\geq 0} and if m < n, \Sigma = \left[\begin{matrix}\mathop{\mathrm{diag}}(\sigma_1, \dots, \sigma_m) & \mathbf 0\end{matrix}\right] \in \mathbb{R}^{m \times n}_{\geq 0}.
Def: We may arrange \sigma_{1} \geq \sigma_2 \geq \dots \geq \sigma_{\min(m, n)} \geq 0; we then call the top few singular values the principal singular values. The columns of U are left singular vectors, and the columns of V are right singular vectors.
Prop: The largest r for which \sigma_r > 0 is the rank of A.
Def: We sometimes use the reduced SVD, which is the same as the normal SVD but we drop the extra rows (or columns) of \Sigma, together with the corresponding columns of U (or V), to force \Sigma to be a square diagonal matrix in \mathbb{R}_{\geq 0}^{\min\{m, n\} \times \min\{m, n\}}. Then, A = \sum_{k=1}^r \sigma_k u_k v_k^* where u_k, v_k are the columns of U, V in the reduced SVD and r = \operatorname{rank}(A), the number of nonzero singular values.
Def: In a similar vein, the condensed SVD is obtained by removing all of the zero rows/columns of \Sigma, so \Sigma = \mathop{\mathrm{diag}}(\sigma_1, \dots, \sigma_r) \in \mathbb{R}^{r \times r}.
Theorem: Given any A \in \mathbb{C}^{m \times n}, there are U \in \mathbb{C}^{m \times m}, V \in \mathbb{C}^{n \times n}, and \Sigma = \mathop{\mathrm{diag}}(\sigma_1, \dots, \sigma_r, 0, \dots, 0) \in \mathbb{R}_{\geq 0}^{m \times n} such that U, V are unitary and A = U \Sigma V^*.
Proof: Form the matrix W = \left[\begin{matrix}0 & A \\ A^* & 0\end{matrix}\right], which is Hermitian. The spectral theorem gives an EVD, and if you look carefully you get the condensed SVD: W = \frac{1}{\sqrt 2}\left[\begin{matrix} U_r & U_r \\ V_r & -V_r \end{matrix}\right] \left[\begin{matrix} \sigma_1 & & & & & &\\ & \sigma_2 & & & & &\\ & & \ddots & & & &\\ & & & \sigma_r & & &\\ & & & & -\sigma_1 & & \\ & & & & & \ddots & \\ & & & & & & -\sigma_r\\ \end{matrix}\right] \frac{1}{\sqrt 2}\left[\begin{matrix} U_r & U_r \\ V_r & -V_r \end{matrix}\right]^*.
Singular values have a nice property: for any singular value \sigma with left singular vector u and right singular vector v, we have Av = \sigma u and A^*u = \sigma v; thus u, v are eigenvectors of AA^* and A^*A respectively, with eigenvalue \sigma^2.
We can read off many important quantities/spaces from the SVD. Let A = U \Sigma V^* be a SVD, with only r nonzero singular values; for instance, the rank is r, the first r columns of U span the column space, the last n - r columns of V span the null space, \|A\|_2 = \sigma_1, and \|A\|_F = \left(\sum_{i=1}^r \sigma_i^2\right)^{1/2}.
You can also solve fundamental problems. For example, if you consider the linear system Ax = b, it might not have exactly one solution, so we can translate this into the least squares problem \min_{x \in \mathbb{R}^n} \| Ax - b \|_2^2; this still might not have a unique solution when A is rank-deficient, so we consider the minimum norm least squares problem: \min \{ \| x \|_2 \mid x \in \mathop{\mathrm{argmin}}\| Ax - b \|_2 \}.
SVD solves all of these simultaneously; set A = U \Sigma V^T; this means that we want \Sigma V^T x = U^T b; equivalently, we just need to solve \Sigma y = c, where b = Uc, x = Vy. But this is trivial since \Sigma is diagonal.
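A sketch of this minimum norm least squares recipe, checked against np.linalg.lstsq (the rank tolerance is our choice):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))
A[:, 3] = A[:, 0] + A[:, 1]          # make A rank-deficient (rank 3)
b = rng.standard_normal(6)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
tol = max(A.shape) * np.finfo(float).eps * s[0]
r = int((s > tol).sum())             # numerical rank

c = U.T @ b                          # b = Uc on the range of A
y = np.zeros_like(c)
y[:r] = c[:r] / s[:r]                # solve Sigma y = c on the nonzero part, zero elsewhere
x = Vt.T @ y                         # x = Vy: the minimum norm least squares solution

x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x, x_ref))         # True
```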
We could define the pseudoinverse A^+ of A to be the matrix that sends any b to the minimum norm least squares solution of Ax = b.
Prop: A^+ is just V \Sigma^+ U^*, where \Sigma^+ is obtained by transposing \Sigma and replacing each nonzero diagonal entry \sigma_i with 1/\sigma_i.
Proof: Clear.
Def: In general, the pseudoinverse or Moore-Penrose inverse of A \in K^{m \times n} is a matrix X \in K^{n \times m} that satisfies AXA = A, XAX = X, (AX)^* = AX, and (XA)^* = XA.
Prop: The above gives a unique matrix and is in fact the same as A^+ from earlier.
Prop: The following statements are true.
Suppose that we are looking for \min \|Ax - b\|^2_2 \text{ such that } \|x\|^2_2 \leq \alpha^2, which is a ridge regression. Remember that the pseudoinverse A^+ gives that A^+b is the minimal norm solution; so if \|A^+b\| \leq \alpha, this works. On the other hand, if we have \|A^+b\| > \alpha, we first use the fact that the above optimization reduces to one on the boundary \|x\|_2 = \alpha. Then take the SVD A = U\Sigma V^T, and look at the Lagrange multiplier condition A^TAx - A^Tb + \lambda x = 0 \implies x = (A^TA + \lambda I)^{-1}A^Tb and thus, setting c = U^Tb, \begin{align*} \alpha^2 & = x^Tx \\ & = b^TA(A^TA + \lambda I)^{-2}A^Tb \\ & = c^T \Sigma(\Sigma^T\Sigma + \lambda I)^{-2}\Sigma^Tc \\ & = \sum_{i=1}^r \frac{c_i^2\sigma_i^2}{(\sigma_i^2 + \lambda)^2} = f(\lambda). \end{align*} Then, solve f(\lambda) = \alpha^2 to get a solution (or really, f(\lambda)^{-1} = \alpha^{-2}, since most univariate root finders behave badly when things go to infinity). In fact, there are many solutions, but the one we are looking for is actually the biggest solution.
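Here is a sketch of this approach under the assumptions above; the bisection, tolerance, and function names are our own choices, and a production code would use a proper root finder:

```python
import numpy as np

def constrained_ls(A, b, alpha):
    """min ||Ax - b||_2 subject to ||x||_2 <= alpha, via the SVD (sketch)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    c = U.T @ b
    r = int((s > 1e-12).sum())

    x_mn = Vt.T[:, :r] @ (c[:r] / s[:r])           # pseudoinverse solution A^+ b
    if np.linalg.norm(x_mn) <= alpha:
        return x_mn                                 # constraint inactive

    # Solve f(lam) = sum_i c_i^2 s_i^2 / (s_i^2 + lam)^2 = alpha^2 for lam >= 0
    # by bisection; f is decreasing on [0, inf) and f(0) = ||A^+ b||^2 > alpha^2.
    f = lambda lam: np.sum((c[:r] * s[:r]) ** 2 / (s[:r] ** 2 + lam) ** 2)
    lo, hi = 0.0, 1.0
    while f(hi) > alpha ** 2:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > alpha ** 2 else (lo, mid)
    lam = 0.5 * (lo + hi)
    return Vt.T[:, :r] @ (s[:r] * c[:r] / (s[:r] ** 2 + lam))  # x = (A^T A + lam I)^{-1} A^T b

rng = np.random.default_rng(4)
A, b = rng.standard_normal((8, 5)), rng.standard_normal(8)
x = constrained_ls(A, b, alpha=0.1)
print(np.linalg.norm(x))   # <= 0.1, with equality when the constraint binds
```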
Suppose that A \in \mathbb{R}^{n \times n}; then, we want to solve \min_{X^T = X} \|A - X\|_F or \min_{X^TX = I} \|A - X\|_F, or, if A \in \mathbb{R}^{m \times n}, \min_{\mathop{\mathrm{rank}}(X) \leq r} \|A - X\|_F. The last problem is very important since rounding error turns every matrix into one of full rank, so solving it is how we recover approximate low-rank structure in practice.
Set A = \frac{A + A^T}{2} + \frac{A - A^T}{2} = X + Y; then, X = \frac{A + A^T}{2} is the solution to the first problem. Further, we can compute \mathop{\mathrm{tr}}(X^TY) = 0 and so \|A\|_F^2 = \|X + Y\|_F^2 = \mathop{\mathrm{tr}}(X^TX) + 2\mathop{\mathrm{tr}}(X^TY) + \mathop{\mathrm{tr}}(Y^TY) = \|X\|_F^2 + \|Y\|_F^2. In general, we see \mathbb{R}^{n \times n} = S^2(\mathbb{R}^n) \oplus \Lambda^2(\mathbb{R}^n), i.e. the space of all matrices is the direct sum of the symmetric and skew-symmetric matrices.
In the second problem, take the SVD A = U\Sigma V^T and set Z = U^T X V, so that \begin{align*} \min_{X^TX = I}\|U \Sigma V^T - X\|_F^2 & = \min_{X^TX = I}\|\Sigma - U^TXV\|^2_F \\ & = \min_{Z^TZ = I}\| \Sigma - Z \|^2_F \\ & = \min_{Z^TZ = I} \left( \| \Sigma \|_F^2 - 2 \mathop{\mathrm{tr}}(\Sigma^T Z) + \|Z\|^2_F \right) \\ & = \min_{Z^TZ = I} \left( \| \Sigma \|_F^2 - 2 \mathop{\mathrm{tr}}(\Sigma^T Z) + n \right), \end{align*} so we just need to maximize \mathop{\mathrm{tr}}(\Sigma^T Z) = \sum_{i=1}^n \sigma_i z_{ii} \leq \sum_{i=1}^n \sigma_i; but this is attained when Z = I, so we need X = UV^T.
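A sketch verifying both closed forms, the symmetric part for the first problem and UV^T for the second (the random competitors are just a sanity check):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))

# Nearest symmetric matrix in the Frobenius norm: the symmetric part of A.
X_sym = 0.5 * (A + A.T)
S = rng.standard_normal((4, 4)); S = 0.5 * (S + S.T)   # an arbitrary symmetric competitor
print(np.linalg.norm(A - X_sym, 'fro') <= np.linalg.norm(A - S, 'fro'))  # True

# Nearest orthogonal matrix: X = U V^T from the SVD of A.
U, s, Vt = np.linalg.svd(A)
X_orth = U @ Vt
print(np.allclose(X_orth.T @ X_orth, np.eye(4)))       # it really is orthogonal

def rand_orth(n):
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q

best = min(np.linalg.norm(A - rand_orth(4), 'fro') for _ in range(1000))
print(np.linalg.norm(A - X_orth, 'fro') <= best)       # True: no competitor does better
```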
The last problem is solved by the Eckart-Young theorem.
Theorem (Eckart-Young): Let A \in \mathbb{R}^{m \times n} and let A = U \Sigma V^T be the SVD. Then a solution to \min_{\mathop{\mathrm{rank}}(X) \leq r} \|A - X\|_2 is given by A_r = U \Sigma_r V^T, where \Sigma_r drops all but the first r entries of \Sigma.
Proof: Suppose not, and let B be a matrix with \mathop{\mathrm{rank}}(B) \leq r and \|A - B\|_2 < \|A - A_r\|_2. Note that \|A - A_r\|_2 = \|U \mathop{\mathrm{diag}}(0, \dots, 0, \sigma_{r+1}, \dots, \sigma_{\min(m, n)}) V^T\|_2 = \sigma_{r + 1}. Now, the nullity of B is at least n - r. On the other hand, take W to be the span of the first r+1 right singular vectors; any w \in W is w = V_{r+1}\alpha, where V_{r+1} is the first r+1 columns of V and \alpha is some vector in \mathbb{R}^{r+1}, so \|Aw\|_2^2 = \| U\Sigma V^T V_{r+1}\alpha \|_2^2 = \sum_{i=1}^{r+1}\sigma_i^2 \alpha_i^2 \geq \sigma_{r+1}^2 \|w\|_2^2. But since \dim \ker(B) + \dim W \geq n + 1, there is a nonzero w \in \ker(B) \cap W; for it, Aw = (A - B)w, so \|Aw\|_2 \leq \|A - B\|_2\|w\|_2 < \sigma_{r+1}\|w\|_2, a contradiction.
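A sketch of Eckart-Young in action via the truncated SVD (the competitor B is an arbitrary random rank-r matrix):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((8, 6))
r = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]   # truncated SVD: best rank-r approximation

print(np.isclose(np.linalg.norm(A - A_r, 2), s[r]))   # the 2-norm error is sigma_{r+1}

B = rng.standard_normal((8, r)) @ rng.standard_normal((r, 6))   # a random rank-r matrix
print(np.linalg.norm(A - B, 2) >= s[r])               # True: it cannot do better
```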
The Eckart-Young-Mirsky theorem extends this to all unitarily invariant norms.
Consider a linear system Ax = b with A having full column rank. If this system is not consistent, we can “solve it” by taking Ax = b + r for some minimal r, which is just least squares; alternatively, we can take (A + E)x = b, which is data least squares; if we allow both, we get total least squares: (A + E)x = b + r.
In this last problem, we are minimizing \|E\|_F^2 + \|r\|_2^2; take C = \left[\begin{matrix}A & b\end{matrix}\right] and F = \left[\begin{matrix}E & r\end{matrix}\right]. Then we can restate the constraint as (C + F)\left[\begin{matrix}x \\ -1\end{matrix}\right] = 0. Then either \mathop{\mathrm{rank}}(C) = n, in which case b is in the column span of A and we may take E = 0, r = 0, or \mathop{\mathrm{rank}}(C) = n + 1. In the latter case, the kernel of C + F must be nontrivial, and so \mathop{\mathrm{rank}}(C + F) \leq n.
Taking the SVD C = U \Sigma V^T, Eckart-Young says the minimal correction is F = -U \mathop{\mathrm{diag}}(0, \dots, 0, \sigma_{n+1}) V^T, so that C + F is the best rank-n approximation of C. Then solve for z in (C + F)z = 0, scale the solution so that its last entry is -1, and read off x from the first n entries.
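A sketch of total least squares following this recipe (the noise level and tolerance are our choices):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 3
A = rng.standard_normal((20, n))
x_true = rng.standard_normal(n)
b = A @ x_true + 0.01 * rng.standard_normal(20)   # noisy right-hand side

C = np.column_stack([A, b])
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# The optimal correction removes sigma_{n+1}; the kernel of C + F is spanned by
# the corresponding right singular vector.
v = Vt[-1]
z = v / (-v[-1])          # scale so the last entry is -1
x_tls = z[:n]

print(np.allclose(x_tls, x_true, atol=0.1))   # close to the true coefficients
```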
Def: We have a few different types of errors; let x be the real solution and \hat x the computed solution. The absolute error is \| x - \hat x \| and the relative error is \frac{\| x - \hat x \|}{\| x \|}.
Error is inherent in floating point arithmetic. If there are n bits, there can be at most 2^n representable real numbers; the way this works is by taking each number to be of the form \begin{matrix} \pm & e_1 & e_2 & \dots & e_l & a_1 & a_2 & \dots & a_k\end{matrix} = \pm a_1.a_2a_3\dots a_k \cdot 2^{e_1e_2\dots e_l} where the e_i are called the exponent and the a_i are called the mantissa.
The exponent is stored in its own l-bit field; in two’s complement it would be e = -e_1\cdot 2^{l - 1} + \sum_{k=2}^{l} e_{k}\cdot 2^{l - k} (IEEE 754 actually stores the exponent with a fixed bias, but the idea is the same).
If a_1 = 1, it is called normal, and if a_1 = 0, it is called subnormal (and since they are in some ways pathological, we ignore them for now).
Double precision corresponds to n = 64, l = 11, k = 52. Normal doubles range in magnitude from 2^{-1022} up to just under 2^{1024}.
Let F be the set of floating point numbers. We have some rounding scheme \mathop{\mathrm{fl}}: \mathbb{R}\to F and some machine \epsilon, \epsilon_{m} such that \epsilon_m = \inf \{ x \in \mathbb{R}\mid x > 0, \mathop{\mathrm{fl}}(1 + x) \neq 1 \}. In double precision, \epsilon_m = 2^{-52}.
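A sketch poking at these constants in NumPy (np.finfo reports them directly):

```python
import numpy as np

print(np.finfo(np.float64).eps)    # 2**-52, about 2.22e-16
print(1.0 + 2.0**-52 == 1.0)       # False: 2**-52 is the gap from 1 to the next double
print(1.0 + 2.0**-53 == 1.0)       # True under round-to-nearest-even
print(np.finfo(np.float64).max)    # about 1.8e308, just under 2**1024
print(np.finfo(np.float64).tiny)   # 2**-1022, the smallest normal double
```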
Theorem (Kahan): For any x \in [-2^{M}, -2^{m}] \cup [2^m, 2^M] (the representable range), there is x' \in F such that |x - x'| \leq \epsilon_m |x|. Furthermore, for any x, y \in F and any basic operation \circ \in \{+, -, \times, \div\}, \mathop{\mathrm{fl}}(x \circ y) = (x \circ y)(1 + \delta) for some |\delta| \leq \epsilon_m.
Unfortunately, F is terrible otherwise: addition and multiplication are commutative but not associative, and multiplication does not distribute over addition. We have all types of errors here: for example, round off error is stuff like \mathop{\mathrm{fl}}(1.1112 \times 10^5) = 1.111 \times 10^5.
There is also overflow and underflow: when your computations exceed the defined limits of the floating point standard; also cancellation error: double precision claims that 844487^5 + 1288439^5 - 1318202^5 = 0 which is about 200 billion off the real answer, but this fits within the non-significant bits of the floating point representation.
To compute \|x\|_2, you compute \|x\|_\infty, compute y = \frac{x}{\|x\|_\infty}, and then finally \|x\|_2 = \|x\|_\infty\|y\|_2. The reason to do this is to avoid underflow/overflow when squaring the entries.
As another example, consider the sample variance; s^2 = \frac{1}{n-1} \left( \sum_{i=1}^n x_i^2 - n^{-1} \left( \sum_{i=1}^n x_i \right)^2 \right) is a terrible formula, since it suffers from cancellation error; but s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2 is fine. Similarly, x^2 - y^2 is bad when x \approx y, but (x + y)(x - y) is good. As another example, for the roots of ax^2 + bx + c = 0 you would rather compute x_1 = \frac{-b - \operatorname{sign}(b) \sqrt{b^2 - 4ac}}{2a} and x_2 = \frac{c}{ax_1}.
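A sketch contrasting the naive and rearranged quadratic formulas on an ill-conditioned example of our choosing:

```python
import numpy as np

# x^2 - 1e8 x + 1 = 0 has roots near 1e8 and 1e-8.
a, b, c = 1.0, -1e8, 1.0
d = np.sqrt(b * b - 4 * a * c)

x_small_naive = (-b - d) / (2 * a)            # computes 1e8 - d: catastrophic cancellation
x_big = (-b - np.sign(b) * d) / (2 * a)       # adds quantities of the same sign
x_small = c / (a * x_big)                     # Vieta: x_1 x_2 = c / a

print(x_small_naive)   # about 7.45e-09: most digits are wrong
print(x_small)         # about 1e-08: accurate
```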
Conditioning will be a property of the problem.
Suppose you have some problem, such as finding a solution to a linear system Ax = b for some invertible A. One thing that we can do is perturb the data a little, e.g. (A + \Delta A)(x + \Delta x) = b + \Delta b; then we want to bound \|\Delta x\| = \|\hat x - x\|. A rough answer in this case is possible; in particular suppose \Delta A = 0, so that \|\Delta x\| = \|A^{-1}\Delta b\| \leq \|A^{-1}\|\|\Delta b\|. Further, \|b\| \leq \|A\|\|x\|, so \frac{1}{\|x\|} \leq \frac{\|A\|}{\|b\|} and \frac{\|\Delta x\|}{\|x\|} \leq \|A\|\|A^{-1}\|\frac{\|\Delta b\|}{\|b\|}. The quantity \kappa(A) = \|A\|\|A^{-1}\| is so important that it is called the condition number of A. In the case of the two norm, it is called the spectral condition number.
Now if the error is instead in A, we have that \frac{\|\Delta x\|}{\|x\|} \leq \frac{\kappa(A)\frac{\|\Delta A\|}{\|A\|}}{1 - \kappa(A)\frac{\|\Delta A\|}{\|A\|}}. Most generally, we have \frac{\|\Delta x\|}{\|x\|} \leq \frac{\kappa(A)\left(\frac{\|\Delta A\|}{\|A\|} + \frac{\|\Delta b\|}{\|b\|}\right)}{1 - \kappa(A)\frac{\|\Delta A\|}{\|A\|}}. Thus, if \frac{\|\Delta A\|}{\|A\|}, \frac{\|\Delta b\|}{\|b\|} \leq \epsilon then \frac{\|\Delta x\|}{\|x\|} \leq \frac{2\epsilon}{1 - \rho}\kappa(A) where \rho = \kappa(A)\frac{\|\Delta A\|}{\|A\|}.
Def: Given any norm \|\cdot \|: \mathbb{C}^{m \times n} \to \mathbb{R}, the condition number of a matrix A \in \mathbb{C}^{m \times n} is defined by \kappa_{\|\cdot\|}(A) = \|A\|\|A^+\|.
In the case of the spectral norm, this is just the ratio of the largest singular value to the smallest (nonzero) singular value.
We have that in general, 1 \leq \kappa_2(A) < \infty, and \kappa_2(A) = 1 (perfectly conditioned) if and only if A is a scalar multiple of a unitary matrix.
Since you can get invertible matrices that are terribly conditioned, we never care about the determinant or invertibility: we only ever compute \mathop{\mathrm{rcond}}(A) to check for near-singularity, never the determinant.
Def: A problem is well posed if a solution exists and is unique, and ill-posed otherwise. An instance of a problem is a selection of parameters that describes the problem further.
Def: The condition number of a problem is the normalized reciprocal of the distance to the nearest ill-posed instance.
For example, in the case of solving Ax = b, this will be \frac{\|A\|}{d(A, M)}, where M = \{X \in \mathbb{C}^{n \times n} \mid \det(X) = 0 \} is the ill-posed manifold.
(Quiz problem: show that d(A, M) = \|A^{-1}\|^{-1}).
Luckily, if A = U\Sigma V^T, then a perturbation \Delta A moves each singular value by at most \|\Delta A\|_2, so the singular values are perfectly conditioned and we will be able to compute them to machine precision.
In some sense, every problem gives rise to a map f: X \to Y from input data to an output solution; when the problem is well posed this is a function. When X, Y have norms, then \kappa_f(x) = \lim_{\delta \to 0} \sup_{\operatorname{RelErr}(x) < \delta}\frac{\operatorname{RelErr}(f(x))}{\operatorname{RelErr}(x)} is the (relative) condition number. Recall that the relative error is \operatorname{RelErr}(x) = \frac{\|\Delta x\|}{\|x\|}. Then, we immediately see that \operatorname{RelErr}(f(x)) \leq \kappa_f(x) \operatorname{RelErr}(x) + o(\operatorname{RelErr}(x)) or, with names, \text{forward error} \lessapprox \text{condition number } \cdot \text{ backward error}
Stability, on the other hand, will be a property of an algorithm. As above, each problem is some f:X \to Y sending inputs to solutions; in practice we only have some algorithm \hat f:X \to Y.
Suppose that the computed solution satisfies \hat f(x) = y + \Delta y = f(x + \Delta x) for some perturbations \Delta x, \Delta y; we call \Delta x the backward error and \Delta y the forward error.
For example, if A \in \operatorname{GL}(n), then the solution to Ax = b is f(A, b) = A^{-1}b; for any algorithm \hat f(A, b) = \hat x, if we know A exactly, then the forward error is \|f(A, b) - \hat f(A, b)\| = \|A^{-1} b - \hat x\| and the backward error is \|A\hat{x} - b\|.
If the problem is SVD, then we have f(A) = (U, \Sigma, V); the forward errors are \|U - \hat U\|, \|V - \hat V\|, \|\Sigma - \hat \Sigma\| but the backward error is just \|A - \hat U \hat \Sigma \hat V^T\|.
Def: We say an algorithm \hat f is backwards stable if for any x \in X, the computed \hat y = \hat f(x) satisfies \hat y = f(x + \Delta x), \ \ \|\Delta x\| \leq \delta \|x\| for \delta small.
Def: We say that \hat f is numerically stable if for any x \in X, the computed \hat y = \hat f(x) satisfies \hat y + \Delta y = f(x + \Delta x), \ \ \|\Delta x\| \leq \delta \|x\|, \|\Delta y\| \leq \epsilon \|y\| for \delta, \epsilon small.
Def: Let A \in \mathbb{R}^{m \times n} with n \leq m; then we can find a decomposition A = QR, where Q \in O(m) is orthogonal and R \in \mathbb{R}^{m \times n} is upper triangular. In particular, we have R = \left[\begin{matrix}R_1 \\ 0\end{matrix}\right] where R_1 is upper triangular in \mathbb{R}^{n \times n}. If A is of full column rank, then R_1 is invertible. This is called the full QR decomposition.
Def: In the same setup as above, partition Q = \left[\begin{matrix}Q_1 & Q_2\end{matrix}\right] so that A = \left[\begin{matrix}Q_1 & Q_2\end{matrix}\right]\left[\begin{matrix}R_1 \\ 0\end{matrix}\right] = Q_1R_1. This is the condensed QR decomposition.
Def: In the same setup as above, if we take \mathop{\mathrm{rank}}(A) = r, then there is a permutation matrix \Pi such that A\Pi = Q \left[\begin{matrix}R_1 & S \\ 0 & 0\end{matrix}\right] and so A = Q \left[\begin{matrix}R_2^T & 0 \\ 0 & 0\end{matrix}\right]Z^T \Pi^T where R_1 \in \mathbb{C}^{r \times r} and we get R_2 by doing the full QR decomposition on \left[\begin{matrix}R_1^T \\ S^T\end{matrix}\right] = Z \left[\begin{matrix}R_2 \\ 0\end{matrix}\right]. This is the rank-retaining QR decomposition.
Def: For any A \in \mathbb{R}^{m \times n}, we may find A = Q\left[\begin{matrix}L & 0 \\ 0 & 0\end{matrix}\right]U^T. This is the complete orthogonal decomposition.
To solve Ax = b, we can take A = QR so Rx = Q^Tb and solve using back substitution. Alternatively, let A\Pi = LU; then we solve Ly = b with forward substitution, Uz = y with back substitution, and set x = \Pi z.
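A sketch of both recipes with NumPy/SciPy; note scipy.linalg.lu pivots rows (A = PLU) rather than columns as written above:

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

rng = np.random.default_rng(8)
A = rng.standard_normal((5, 5))
b = rng.standard_normal(5)

# Via QR: A = QR, then back-substitute R x = Q^T b.
Q, R = np.linalg.qr(A)
x_qr = solve_triangular(R, Q.T @ b, lower=False)

# Via LU with row pivoting: scipy returns A = P L U.
P, L, U = lu(A)
y = solve_triangular(L, P.T @ b, lower=True)    # forward substitution
x_lu = solve_triangular(U, y, lower=False)      # back substitution

print(np.allclose(A @ x_qr, b), np.allclose(A @ x_lu, b))  # True True
```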
Linearly constrained least squares problems are problems of the form \min\| Ax - b \|, \ \ \text{s.t. } C^Tx = d, where A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m, C \in \mathbb{R}^{n \times p}, and d \in \mathbb{R}^p.
We could form the Lagrangian to get \begin{cases} A^TAx - A^Tb + C\lambda = 0\\ C^Tx - d = 0 \end{cases} which is the KKT system. Sometimes we care about \lambda, and this is fine (though A^TA can often be ill-conditioned).
Instead, we may use the QR decomposition: note that A^TAx = A^Tb - C\lambda and thus x = \hat x - (A^TA)^{-1}C\lambda, where \hat x = \mathop{\mathrm{argmin}}\|Ax - b\|_2. Then, since C^Tx = d, we must have that C^T(A^TA)^{-1}C\lambda = C^T\hat x - d. To avoid forming A^TA, set A = Q\left[\begin{matrix}R \\ 0\end{matrix}\right], compute W = R^{-T}C via forward substitution, take the QR decomposition W = Q_1R_1 (so that C^T(A^TA)^{-1}C = W^TW = R_1^TR_1), set \eta = C^T \hat x - d, and finally solve R_1^TR_1 \lambda = \eta via forward and back substitution.
If we do not want to compute \lambda and p \leq n, let C = Q_2 \left[\begin{matrix}R_2 \\ 0\end{matrix}\right], so that \left[\begin{matrix}R_2^T & 0\end{matrix}\right]Q_2^Tx = d. Thus, if we write Q_2^Tx = \left[\begin{matrix}u \\ v\end{matrix}\right] and \left[\begin{matrix}A_1 & A_2\end{matrix}\right] = AQ_2, we forward-solve R_2^Tu = d for u, and then \|b - Ax\|_2^2 = \|b - AQ_2Q_2^Tx\|_2^2 = \left\|b - \left[\begin{matrix}A_1 & A_2\end{matrix}\right] \left[\begin{matrix}u \\ v\end{matrix}\right]\right\|_2^2 = \|(b - A_1u) - A_2v\|_2^2, which is an unconstrained least squares problem in v.
Either use Householder reflections or Givens rotations to compute QR. Compute the complete orthogonal decomposition via two QR decompositions. Compute the LU decomposition via Gauss/elimination matrices.
Suppose that we have a sequence of systems Ax_i = b_i, i = 1, \dots, k, all with the same A. In this case, we form B = \left[\begin{matrix}b_1 & b_2 & \cdots & b_k\end{matrix}\right] and solve AX = B. Everything from before carries through to X = A^+B, where A^+ is the pseudoinverse.
If you have a low-rank error, e.g. you solved Ax = b but in fact needed \hat A = A + uv^T instead, then you can solve the new system efficiently by using the Sherman-Morrison formula, or in general the Woodbury formula for higher rank corrections.
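The Sherman-Morrison formula is (A + uv^T)^{-1} = A^{-1} - \frac{A^{-1}uv^TA^{-1}}{1 + v^TA^{-1}u}; here is a sketch that reuses a single LU factorization of A for the corrected system (the setup is our own):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(9)
n = 6
A = rng.standard_normal((n, n))
b, u, v = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(n)

factors = lu_factor(A)        # factor A once, reuse it for every solve against A
x = lu_solve(factors, b)      # A x = b
w = lu_solve(factors, u)      # A w = u

# Sherman-Morrison: (A + u v^T)^{-1} b = x - w (v^T x) / (1 + v^T w).
x_new = x - w * (v @ x) / (1.0 + v @ w)

print(np.allclose((A + np.outer(u, v)) @ x_new, b))  # True
```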