differential geometry blog

Manifold Exploration

What is a straight line on a curved surface? An interactive journey from smooth manifolds to latent space geometry.

01

Smooth Manifolds

What does it mean for a space to be curved? We begin with the historical shift from Euclidean to Riemannian geometry, then formalize the notion of a smooth manifold using the sphere as our guiding example.

From Euclid to Riemann

For over two thousand years, geometry meant Euclidean geometry. Euclid's five postulates, formulated for the flat plane, provided the foundation. The fifth (the parallel postulate) states: given a line in the plane and a point not on it, there exists exactly one line through that point which never meets the first.

In the 19th century, mathematicians began questioning this assumption. Gauss, Bolyai, and Lobachevsky showed that consistent geometries exist where the parallel postulate fails. Bernhard Riemann took this further in his 1854 Habilitationsschrift, proposing a framework where geometry is not a fixed background but a property of the space itself.

On a sphere, for instance, "lines" (geodesics) are great circles. Any two great circles intersect, so there are no parallel lines at all. The geometry of the sphere is intrinsically different from that of the plane.

The Sphere as a Surface

Notation

We denote by S^2 the unit 2-sphere in \mathbb{R}^3:

S^2 = \left\{ (x, y, z) \in \mathbb{R}^3 \;\middle|\; x^2 + y^2 + z^2 = 1 \right\}

S^2 is a surface: two-dimensional, smooth, and curved. The most natural way to describe it is via spherical coordinates (\theta, \varphi):

(\theta, \varphi) \mapsto (\sin\theta\cos\varphi,\; \cos\theta,\; \sin\theta\sin\varphi), \quad \theta \in [0, \pi],\; \varphi \in [0, 2\pi)

Two parameters produce a point in \mathbb{R}^3. Near most points, this is a smooth bijection: the surface locally looks like a piece of \mathbb{R}^2. These coordinates will be our working coordinates for concrete computations throughout the blog (metric, Christoffel symbols, geodesics).

However, this parametrization fails at the poles (\theta = 0 and \theta = \pi): all values of \varphi map to the same point, and the Jacobian drops rank. No single coordinate system can cover all of S^2 without such degeneracies. We need a more general framework.
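The rank drop at the poles is easy to see numerically. A minimal sketch (numpy; the helper names `F` and `jacobian` are ours, not part of the blog):

```python
import numpy as np

def F(theta, phi):
    """Spherical parametrization used in the text: (theta, phi) -> R^3."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.cos(theta),
                     np.sin(theta) * np.sin(phi)])

def jacobian(theta, phi, h=1e-6):
    """Numerical 3x2 Jacobian of F via central differences."""
    dF_dtheta = (F(theta + h, phi) - F(theta - h, phi)) / (2 * h)
    dF_dphi   = (F(theta, phi + h) - F(theta, phi - h)) / (2 * h)
    return np.column_stack([dF_dtheta, dF_dphi])

# Away from the poles the Jacobian has rank 2: the map is a local diffeomorphism.
assert np.linalg.matrix_rank(jacobian(1.0, 0.5)) == 2
# At the pole theta = 0, the phi-column vanishes identically and the rank drops.
assert np.linalg.matrix_rank(jacobian(0.0, 0.5)) == 1
```

The vanishing φ-column at θ = 0 is exactly the degeneracy described above: all values of φ map to the same point.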

Surfaces as Parametrizations

Before introducing the abstract machinery, let us make precise what we just did. We described S^2 by giving a smooth map from a region of \mathbb{R}^2 into \mathbb{R}^3. This is the concrete notion of a parametric surface.

Definition

A parametric surface is a smooth map F : U \subset \mathbb{R}^2 \to \mathbb{R}^3 whose Jacobian has rank 2 everywhere. The image F(U) is the surface sitting in \mathbb{R}^3.

For the sphere, the parametrization is exactly the map we wrote above:

F(\theta, \varphi) = (\sin\theta\cos\varphi,\; \cos\theta,\; \sin\theta\sin\varphi)

The key observation is that a chart is the local inverse of a parametrization. Where F goes from coordinates to the surface, the chart \psi goes the other way:

\psi = F^{-1}_{\text{local}} : F(U) \subset \mathbb{R}^3 \longrightarrow U \subset \mathbb{R}^2, \quad (x,y,z) \mapsto (\theta, \varphi)

In practice, we often think in terms of the parametrization F (going "up" from coordinates to the surface) rather than the chart \psi (going "down" from the surface to coordinates). They carry the same information, just in opposite directions.

Topological Manifolds

Definition

A topological manifold M of dimension n is a topological space that is:

  1. Hausdorff: distinct points have disjoint neighborhoods,
  2. Second-countable: the topology has a countable basis,
  3. Locally Euclidean: every point p \in M has a neighborhood homeomorphic to an open subset of \mathbb{R}^n.

S^2 satisfies all three conditions: it is a 2-dimensional topological manifold.

Charts and Atlases

Since no single coordinate system covers S^2 without singularities, we need several overlapping ones. The following construction provides a singularity-free atlas (we will not use it for calculations, only to prove that S^2 is a smooth manifold).

Definition

A chart on M is a pair (U_\alpha, \psi_\alpha) where U_\alpha \subset M is open and \psi_\alpha : U_\alpha \to \psi_\alpha(U_\alpha) \subset \mathbb{R}^n is a homeomorphism. An atlas is a collection of charts \{(U_\alpha, \psi_\alpha)\} that covers M.

Example. The stereographic projection from the north pole N = (0, 0, 1) maps S^2 \setminus \{N\} to \mathbb{R}^2:

\sigma_N(x, y, z) = \left( \frac{x}{1 - z},\; \frac{y}{1 - z} \right)

From the south pole S = (0, 0, -1):

\sigma_S(x, y, z) = \left( \frac{x}{1 + z},\; \frac{y}{1 + z} \right)

Each chart misses one point, but together (S^2 \setminus \{N\},\, \sigma_N) and (S^2 \setminus \{S\},\, \sigma_S) cover all of S^2.

Smooth Structure

On the overlap S^2 \setminus \{N, S\}, the transition map between our two charts is:

\sigma_S \circ \sigma_N^{-1} : \mathbb{R}^2 \setminus \{0\} \to \mathbb{R}^2 \setminus \{0\}, \quad (u, v) \mapsto \frac{(u, v)}{u^2 + v^2}

This inversion map is C^\infty on its domain. An atlas whose transition maps are all smooth is a smooth atlas.
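Both charts and the inversion transition map can be checked numerically. A small sketch (numpy; `sigma_N_inv` writes out the standard inverse of the north-pole projection, an assumption not spelled out in the text above):

```python
import numpy as np

def sigma_N(p):
    x, y, z = p
    return np.array([x / (1 - z), y / (1 - z)])

def sigma_S(p):
    x, y, z = p
    return np.array([x / (1 + z), y / (1 + z)])

def sigma_N_inv(u, v):
    """Standard inverse of the north-pole projection, back onto S^2."""
    s = u**2 + v**2
    return np.array([2*u, 2*v, s - 1]) / (s + 1)

u, v = 0.7, -1.2
p = sigma_N_inv(u, v)
assert np.isclose(np.dot(p, p), 1.0)            # the preimage lands on the sphere
assert np.allclose(sigma_N(p), [u, v])          # sigma_N really inverts it
# The transition map sigma_S o sigma_N^{-1} is inversion in the unit circle:
assert np.allclose(sigma_S(p), np.array([u, v]) / (u**2 + v**2))
```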

Definition

A smooth manifold is a topological manifold M equipped with a maximal smooth atlas (a smooth structure). In practice, any smooth atlas determines a unique smooth structure.

With S^2 established as a smooth manifold, we now have the foundation to build richer structure on top of it. In the next chapter, we introduce tangent spaces, which capture the notion of "directions" at a point.

02

Tangent Spaces

At each point of a manifold, the tangent space captures all possible directions of movement.

The Tangent Plane on S²

At a point p \in S^2, imagine all curves passing through p. Their velocity vectors at p span a plane tangent to the sphere. This plane is the tangent space T_pS^2.

In our spherical coordinates (\theta, \varphi), the two basis vectors are obtained by differentiating the parametrization:

\frac{\partial}{\partial \theta} = (\cos\theta\cos\varphi,\; -\sin\theta,\; \cos\theta\sin\varphi), \quad \frac{\partial}{\partial \varphi} = (-\sin\theta\sin\varphi,\; 0,\; \sin\theta\cos\varphi)

These two vectors are tangent to S^2 at p and form a basis of T_pS^2.
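Tangency and linear independence are direct computations. A sketch (numpy; variable names ours), checking that both basis vectors are orthogonal to the radial direction:

```python
import numpy as np

theta, phi = 1.1, 0.4

# The point p and the two coordinate basis vectors from the formulas above.
p       = np.array([np.sin(theta)*np.cos(phi), np.cos(theta), np.sin(theta)*np.sin(phi)])
e_theta = np.array([np.cos(theta)*np.cos(phi), -np.sin(theta), np.cos(theta)*np.sin(phi)])
e_phi   = np.array([-np.sin(theta)*np.sin(phi), 0.0, np.sin(theta)*np.cos(phi)])

# Tangency: both basis vectors are orthogonal to the radial direction p.
assert np.isclose(np.dot(p, e_theta), 0.0)
assert np.isclose(np.dot(p, e_phi), 0.0)
# They are linearly independent away from the poles, so they span T_p S^2.
assert np.linalg.matrix_rank(np.column_stack([e_theta, e_phi])) == 2
```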

Tangent Vectors as Derivations

On \mathbb{R}^n, a vector v is an "arrow". On a curved manifold, there is no ambient space to draw arrows in. Instead, we define a tangent vector by what it does: it computes how a function changes in a given direction.

Concrete example on S^2. Let f: S^2 \to \mathbb{R} be the temperature at each point. The chart \psi assigns coordinates (\theta, \varphi) to each point, so f \circ \psi^{-1}(\theta, \varphi) is the temperature expressed as a function of \theta and \varphi. A tangent vector v = 3\,\partial/\partial\theta + 2\,\partial/\partial\varphi gives:

v(f) = 3\,\frac{\partial f}{\partial\theta} + 2\,\frac{\partial f}{\partial\varphi}

This is the rate of change of temperature in the direction v. The output v(f) is a real number, not a vector. The vector is the operator v itself.

Intuition

Why define a vector as an operator? On \mathbb{R}^n, a tangent vector is an "arrow" and v(f) = \langle v, \nabla f \rangle is just the dot product with the gradient. On an abstract manifold, there is no ambient space to draw arrows in, and no global coordinate system to write components. But smooth functions f: M \to \mathbb{R} always exist, and the directional derivative v(f) is a coordinate-free real number. So we characterize a vector by its action on all functions: this is its dual representation (analogous to the Riesz representation theorem in functional analysis).

This characterization is faithful: a linear map v: C^\infty(M) \to \mathbb{R} is a tangent vector if and only if it satisfies the Leibniz rule: v(fg) = v(f)\,g(p) + f(p)\,v(g). This condition forces v to be local and finite-dimensional: the space of all such operators at p is isomorphic to \mathbb{R}^n. Two different vectors always disagree on at least one function, and every Leibniz-compatible operator corresponds to exactly one geometric direction. The correspondence is a bijection, not merely an encoding.

Example. On S^2, take the coordinate functions f_1 = \theta and f_2 = \varphi. Then v(f_1) = 3 and v(f_2) = 2 recover the components of v = 3\,\partial/\partial\theta + 2\,\partial/\partial\varphi. In a different chart, the components change, but the operator v and all its outputs v(f) remain the same.
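The derivation picture is concrete enough to compute. A sketch (numpy; the sample "temperature" f(x, y, z) = z is our choice, which in the coordinates of Chapter 1 reads f ∘ ψ⁻¹(θ, φ) = sin θ sin φ):

```python
import numpy as np

def f_coords(theta, phi):
    """Sample temperature f(x,y,z) = z written in coordinates: f o psi^{-1}."""
    return np.sin(theta) * np.sin(phi)

def v_of_f(theta, phi, h=1e-6):
    """Action of v = 3 d/dtheta + 2 d/dphi on f, via central differences."""
    df_dtheta = (f_coords(theta + h, phi) - f_coords(theta - h, phi)) / (2*h)
    df_dphi   = (f_coords(theta, phi + h) - f_coords(theta, phi - h)) / (2*h)
    return 3*df_dtheta + 2*df_dphi

theta, phi = 1.0, 0.3
exact = 3*np.cos(theta)*np.sin(phi) + 2*np.sin(theta)*np.cos(phi)
assert np.isclose(v_of_f(theta, phi), exact)   # a real number, not a vector
```

The output is a single real number, matching the point made above: the vector is the operator, not its output.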

General formula. On a manifold M of dimension n with chart (U, \psi) and coordinates (x^1, \ldots, x^n):

Definition

The tangent space T_pM at p \in M is the vector space of derivations at p: linear maps v: C^\infty(M) \to \mathbb{R} satisfying the Leibniz rule. A tangent vector v \in T_pM acts on any smooth function f: M \to \mathbb{R} by:

v(f) = \sum_{i=1}^n v^i \frac{\partial (f \circ \psi^{-1})}{\partial x^i}\bigg|_{\psi(p)}

Here f \circ \psi^{-1}: \mathbb{R}^n \to \mathbb{R} is f rewritten in local coordinates (so that we can take ordinary partial derivatives), and (v^1, \ldots, v^n) are the components of v in the coordinate basis \{\partial/\partial x^i\}. This action of a vector on functions will be the building block for the covariant derivative in Chapter 3, where we will need to differentiate vector fields along directions on M.

The Tangent Bundle

Definition

The tangent bundle is the disjoint union of all tangent spaces: TM = \bigsqcup_{p \in M} T_pM. It is itself a smooth manifold of dimension 2n.

Each point of TM is a pair (p, v) with p \in M and v \in T_pM. For S^2, the tangent bundle TS^2 is a 4-dimensional manifold. In the next chapter, we will see how to differentiate vector fields on M, which requires additional structure called a connection.

03

Connections & Christoffel Symbols

How do we differentiate a vector field on a curved surface?

Notation (Einstein summation convention)

From this chapter onward, we use the Einstein summation convention: when an index appears once as a superscript and once as a subscript in the same expression, a sum over that index is implied. For example, on S^2 with basis vectors e_1 = e_\theta, e_2 = e_\varphi:

v^i\, e_i = v^1\, e_1 + v^2\, e_2 = v^\theta\, e_\theta + v^\varphi\, e_\varphi

The repeated index i is a dummy index (summed over); non-repeated indices are free (they label the components of the result). This compact notation avoids writing explicit \sum signs in the tensor formulas that follow.
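numpy's `einsum` uses the same dummy-index convention, which makes it a convenient sanity check for the tensor formulas that follow. A sketch (the basis vectors below are numerical stand-ins, not the geometric e_θ, e_φ):

```python
import numpy as np

v = np.array([3.0, 2.0])                 # components (v^1, v^2)
e = np.array([[1.0, 0.0, 0.0],           # stand-in basis vectors e_1, e_2
              [0.0, 1.0, 0.0]])

# v^i e_i with the sum over i written explicitly ...
explicit = v[0]*e[0] + v[1]*e[1]
# ... and with the sum over the repeated index i implied, einsum-style:
implied = np.einsum('i,ij->j', v, e)
assert np.allclose(explicit, implied)
```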

Differentiating e_θ and e_φ

In Chapter 2 we built the basis vectors e_\theta = \partial/\partial\theta and e_\varphi = \partial/\partial\varphi at each point of S^2. These vectors change from point to point. How do they change?

Consider e_\varphi (pointing "east" along latitude circles). As we move along the sphere in the e_\varphi direction, the ordinary derivative \partial e_\varphi / \partial\varphi in \mathbb{R}^3 points inward, toward the axis of the sphere. It leaves the tangent plane.

On \mathbb{R}^n this does not happen because the tangent space is the same everywhere. On a curved surface, naive differentiation produces a vector that is no longer tangent. The solution: project the result back onto T_pS^2.

The Covariant Derivative

This projection defines the covariant derivative \nabla. On S^2, we can compute all four covariant derivatives of basis vectors along basis vectors:

\nabla_{e_\theta} e_\theta, \quad \nabla_{e_\theta} e_\varphi, \quad \nabla_{e_\varphi} e_\theta, \quad \nabla_{e_\varphi} e_\varphi

Each result is itself a tangent vector, so it can be expressed in the basis \{e_\theta, e_\varphi\}. The coefficients of this expansion are the Christoffel symbols.

The Round Metric on S²

To compute the covariant derivative from a formula (rather than by projecting in \mathbb{R}^3), we need the metric: the inner product on each tangent space. For S^2 with spherical coordinates (\theta, \varphi), the metric components g_{ij} = \langle e_i, e_j \rangle are:

Definition

A Riemannian metric g on a manifold M is a smooth assignment of an inner product g_p: T_pM \times T_pM \to \mathbb{R} to each tangent space. It allows us to measure lengths, angles, and volumes on M.

g_{\theta\theta} = 1, \quad g_{\theta\varphi} = g_{\varphi\theta} = 0, \quad g_{\varphi\varphi} = \sin^2\theta

This says: e_\theta is a unit vector, e_\varphi has length \sin\theta (it shrinks near the poles), and they are orthogonal. The line element is ds^2 = d\theta^2 + \sin^2\theta\, d\varphi^2. These metric components will determine the Christoffel symbols below and the geodesic equation in Chapter 4.
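All three metric statements can be verified directly from the embedded basis vectors of Chapter 2. A sketch (numpy; helper names ours):

```python
import numpy as np

def basis(theta, phi):
    """Coordinate basis vectors of S^2 embedded in R^3 (from Chapter 2)."""
    e_theta = np.array([np.cos(theta)*np.cos(phi), -np.sin(theta), np.cos(theta)*np.sin(phi)])
    e_phi   = np.array([-np.sin(theta)*np.sin(phi), 0.0, np.sin(theta)*np.cos(phi)])
    return e_theta, e_phi

theta, phi = 0.9, 2.0
e_t, e_p = basis(theta, phi)
g = np.array([[np.dot(e_t, e_t), np.dot(e_t, e_p)],
              [np.dot(e_p, e_t), np.dot(e_p, e_p)]])
# Round metric: g_theta_theta = 1, off-diagonal 0, g_phi_phi = sin^2(theta).
assert np.allclose(g, np.diag([1.0, np.sin(theta)**2]))
```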

Notation

The inverse metric g^{ij} is defined by g^{ik}g_{kj} = \delta^i_j. It allows us to "raise indices": convert a covector into a vector. On S^2: g^{\theta\theta} = 1, g^{\varphi\varphi} = 1/\sin^2\theta.

Christoffel Symbols

Definition

The Christoffel symbols \Gamma^k_{ij} are the components of the covariant derivative of basis vectors: \nabla_{e_i} e_j = \Gamma^k_{ij} \, e_k. Here i is the direction of differentiation, j is the basis vector being differentiated, and k indexes the components of the result.

Example on S^2. With two basis vectors, there are 2 \times 2 \times 2 = 8 possible symbols. Most are zero. The non-zero ones are:

\Gamma^\theta_{\varphi\varphi} = -\sin\theta\cos\theta, \qquad \Gamma^\varphi_{\theta\varphi} = \Gamma^\varphi_{\varphi\theta} = \frac{\cos\theta}{\sin\theta}

For example, \nabla_{e_\varphi} e_\varphi = -\sin\theta\cos\theta \, e_\theta: as we move east, the "east" vector tilts toward the pole. At the equator (\theta = \pi/2) this effect vanishes. Near the poles it grows, reflecting the convergence of meridians.
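The projection definition can be run numerically: differentiate e_φ along φ in ℝ³, project the result back onto the tangent plane, and read off the coefficients. A sketch (numpy; helper names ours; the least-squares solve is exactly the tangential projection, since the residual is normal to the surface):

```python
import numpy as np

def F(theta, phi):
    return np.array([np.sin(theta)*np.cos(phi), np.cos(theta), np.sin(theta)*np.sin(phi)])

def basis(theta, phi, h=1e-6):
    """Coordinate basis vectors via central differences of the parametrization."""
    e_t = (F(theta + h, phi) - F(theta - h, phi)) / (2*h)
    e_p = (F(theta, phi + h) - F(theta, phi - h)) / (2*h)
    return e_t, e_p

def christoffel_phiphi(theta, phi, h=1e-4):
    """Gamma^k_{phi phi}: differentiate e_phi along phi, project onto the basis."""
    e_t, e_p = basis(theta, phi)
    d_ephi = (basis(theta, phi + h)[1] - basis(theta, phi - h)[1]) / (2*h)
    E = np.column_stack([e_t, e_p])
    # Least-squares coefficients = tangential projection coefficients.
    coeffs, *_ = np.linalg.lstsq(E, d_ephi, rcond=None)
    return coeffs   # (Gamma^theta_{phi phi}, Gamma^phi_{phi phi})

theta = 0.8
g_t, g_p = christoffel_phiphi(theta, 0.3)
assert np.isclose(g_t, -np.sin(theta)*np.cos(theta), atol=1e-4)
assert np.isclose(g_p, 0.0, atol=1e-4)
```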

General Vector Fields

Consider two vector fields on S^2, written X = X^\theta e_\theta + X^\varphi e_\varphi and Y = Y^\theta e_\theta + Y^\varphi e_\varphi, where the components X^\theta, X^\varphi, Y^\theta, Y^\varphi are smooth functions on S^2. By linearity of \nabla, the covariant derivative of Y along X is:

\nabla_X Y = X^i \left( \frac{\partial Y^k}{\partial x^i} + \Gamma^k_{ij}\, Y^j \right) e_k

The first term \partial Y^k / \partial x^i is how the components of Y change. The second term \Gamma^k_{ij} Y^j corrects for the fact that e_\theta and e_\varphi themselves rotate from point to point.

Proof: Covariant Derivative Formula

Setup

We want to compute \nabla_X Y where Y = Y^j\, e_j is a vector field expressed in the coordinate basis. The key tools are:

  • The Leibniz rule: \nabla_X(f\, V) = X(f)\, V + f\, \nabla_X V for any function f and vector field V.
  • The linearity of \nabla in X: \nabla_{fX} V = f\, \nabla_X V.
  • The definition of Christoffel symbols: \nabla_{e_i} e_j = \Gamma^k_{ij}\, e_k.

Step 1: Apply the Leibniz rule

Since Y = Y^j\, e_j (sum over j), treating each term Y^j\, e_j as a function times a vector field:

\nabla_X Y = \nabla_X (Y^j\, e_j) = X(Y^j)\, e_j + Y^j\, \nabla_X e_j

The first term is the derivative of the components of Y (the basis is held fixed). The second term accounts for the rotation of the basis itself.

Step 2: Expand X as a derivation

The vector field X = X^i\, e_i acts on the function Y^j via the definition from Chapter 2:

X(Y^j) = X^i\, \frac{\partial Y^j}{\partial x^i}

Step 3: Use linearity in X for the second term

Since \nabla is linear in the subscript: \nabla_X e_j = \nabla_{X^i e_i} e_j = X^i\, \nabla_{e_i} e_j. Now apply the definition of Christoffel symbols:

Y^j\, \nabla_X e_j = Y^j\, X^i\, \nabla_{e_i} e_j = X^i\, Y^j\, \Gamma^k_{ij}\, e_k

Step 4: Combine

Putting Steps 1-3 together:

\nabla_X Y = X^i\, \frac{\partial Y^j}{\partial x^i}\, e_j + X^i\, Y^j\, \Gamma^k_{ij}\, e_k

In the first term, j is a dummy index. Rename it to k:

\nabla_X Y = X^i\, \frac{\partial Y^k}{\partial x^i}\, e_k + X^i\, Y^j\, \Gamma^k_{ij}\, e_k = X^i \left( \frac{\partial Y^k}{\partial x^i} + \Gamma^k_{ij}\, Y^j \right) e_k \qquad \square

The expression in parentheses, \frac{\partial Y^k}{\partial x^i} + \Gamma^k_{ij}\, Y^j, is often written (\nabla_i Y)^k or Y^k_{\ ;i} and called the covariant components of the derivative of Y.

More abstractly, \nabla is an affine connection: an operator on any smooth manifold M that differentiates vector fields while keeping the result in T_pM. In the next chapter, we use it to define geodesics and parallel transport.

04

Geodesics & Parallel Transport

What is the shortest path between two points on a curved surface? This question, older than calculus itself, leads to the calculus of variations, the Euler-Lagrange equation, and the notion of geodesic.

The Shortest Path Problem

On a flat plane, the shortest path between two points is a straight line. This fact is so basic that Euclid took it as an axiom. But what happens on a curved surface?

Navigators have known for centuries that the shortest route between two cities on Earth follows a great circle, not a straight line on the map. This practical observation hides a deep mathematical question: among all curves connecting two points on a surface, which one has the smallest length?

In 1696, Johann Bernoulli posed the brachistochrone problem: find the curve along which a bead slides fastest under gravity. This was not about shortest distance but shortest time, yet the mathematical structure is the same. Solutions came from Leibniz, Newton, l'Hôpital, and Jakob Bernoulli. Each had to optimize over an infinite-dimensional space of curves.

Leonhard Euler systematized these ideas in 1744, developing the calculus of variations: a framework for finding curves (or functions) that minimize (or make stationary) a given integral quantity. Joseph-Louis Lagrange refined the method in 1755, giving it the elegant form we use today.

From Length to Energy

Given a curve \gamma: [a, b] \to M on a Riemannian manifold (a smooth manifold M equipped with a metric g, as introduced in Chapter 3), its length is:

L(\gamma) = \int_a^b \sqrt{g_{\gamma(t)}(\dot\gamma(t), \dot\gamma(t))} \, dt

The square root makes this functional difficult to work with (it is invariant under reparametrization, which introduces degeneracies). Instead, we minimize the energy functional:

Definition

The energy of a curve \gamma: [a, b] \to M is:

E(\gamma) = \frac{1}{2} \int_a^b g_{\gamma(t)}(\dot\gamma(t), \dot\gamma(t)) \, dt

A curve that minimizes E among all curves with the same endpoints also minimizes L (by the Cauchy-Schwarz inequality), and it is automatically parametrized proportionally to arc length. This is why we work with energy rather than length.

Calculus of Variations

The key idea: to find the curve that makes E stationary, we perturb it. Consider a smooth one-parameter family of curves \gamma_\varepsilon(t) = \gamma(t) + \varepsilon \, \eta(t) (written in local coordinates), where \eta is a smooth vector field along \gamma that vanishes at the endpoints: \eta(a) = \eta(b) = 0.

The curve \gamma is a critical point of E if and only if:

\frac{d}{d\varepsilon}\bigg|_{\varepsilon=0} E(\gamma_\varepsilon) = 0 \quad \text{for all such } \eta

This is the infinite-dimensional analogue of setting the gradient to zero. The computation that turns this condition into a differential equation is the heart of the calculus of variations.

Proof: Euler-Lagrange Equation

Setup

We work in local coordinates x^1, \dots, x^n. The energy of a curve \gamma(t) = (x^1(t), \dots, x^n(t)) is:

E[\gamma] = \frac{1}{2} \int_a^b g_{ij}(x(t))\, \dot x^i(t)\, \dot x^j(t) \, dt

This has the form \int_a^b \mathcal{L}(x, \dot x) \, dt with Lagrangian \mathcal{L}(x, \dot x) = \tfrac{1}{2}\, g_{ij}(x)\, \dot x^i \dot x^j.

Variation

Let x^k_\varepsilon(t) = x^k(t) + \varepsilon\, \eta^k(t) with \eta^k(a) = \eta^k(b) = 0. Then:

\frac{d}{d\varepsilon}\bigg|_{\varepsilon=0} E[\gamma_\varepsilon] = \frac{1}{2} \int_a^b \left( \frac{\partial g_{ij}}{\partial x^k}\, \eta^k\, \dot x^i \dot x^j + 2\, g_{ij}\, \dot\eta^i\, \dot x^j \right) dt

Integration by parts

The second term contains \dot\eta^i. We integrate by parts:

\int_a^b g_{ij}\, \dot\eta^i\, \dot x^j \, dt = \Big[\, g_{ij}\, \eta^i\, \dot x^j \Big]_a^b - \int_a^b \eta^i \frac{d}{dt}\!\left(g_{ij}\, \dot x^j\right) dt

The boundary term vanishes because \eta(a) = \eta(b) = 0. Expanding the time derivative:

\frac{d}{dt}\!\left(g_{ij}\, \dot x^j\right) = g_{ij}\, \ddot x^j + \frac{\partial g_{ij}}{\partial x^k}\, \dot x^k\, \dot x^j

Collecting terms

Substituting back and relabeling indices so that \eta^k factors out:

\frac{d}{d\varepsilon}\bigg|_{\varepsilon=0} E = \int_a^b \eta^k \left[ -g_{kj}\, \ddot x^j + \frac{1}{2}\left( \frac{\partial g_{ij}}{\partial x^k} - \frac{\partial g_{kj}}{\partial x^i} - \frac{\partial g_{ki}}{\partial x^j} \right) \dot x^i \dot x^j \right] dt

The Euler-Lagrange equation

Since this must vanish for all \eta^k (the fundamental lemma of calculus of variations), the bracketed expression is zero. Multiplying through by g^{km}:

\ddot x^m + \underbrace{\frac{1}{2}\, g^{mk}\!\left(\frac{\partial g_{kj}}{\partial x^i} + \frac{\partial g_{ki}}{\partial x^j} - \frac{\partial g_{ij}}{\partial x^k}\right)}_{\Gamma^m_{ij}}\, \dot x^i\, \dot x^j = 0

We recognize the Christoffel symbols from Chapter 3. The Euler-Lagrange equation for the energy functional is the geodesic equation.

The Geodesic Equation

Definition

A curve \gamma(t) on a Riemannian manifold is a geodesic if it satisfies:

\frac{d^2 x^k}{dt^2} + \Gamma^k_{ij} \frac{dx^i}{dt} \frac{dx^j}{dt} = 0

This equation has two readings. From the calculus of variations: geodesics are critical points of the energy functional. From the connection of Chapter 3: geodesics are curves that parallel-transport their own tangent vector, i.e. \nabla_{\dot\gamma} \dot\gamma = 0. These two characterizations are equivalent.

On S^2, using the Christoffel symbols computed in Chapter 3, the geodesic equation becomes a coupled system for \theta(t) and \varphi(t):

\ddot\theta - \sin\theta\cos\theta\;\dot\varphi^2 = 0, \qquad \ddot\varphi + 2\frac{\cos\theta}{\sin\theta}\;\dot\theta\,\dot\varphi = 0

The solutions are great circles.
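These ODEs integrate readily. A minimal sketch (numpy; the RK4 stepper and helper names are ours), checking two invariants of a geodesic on S²: the speed g(γ', γ') is conserved, and the curve stays in a fixed plane through the origin (a great circle):

```python
import numpy as np

def rhs(s):
    """s = (theta, phi, theta_dot, phi_dot); right-hand side of the geodesic ODE."""
    th, ph, dth, dph = s
    return np.array([dth, dph,
                     np.sin(th)*np.cos(th)*dph**2,
                     -2*(np.cos(th)/np.sin(th))*dth*dph])

def rk4(s, dt, n):
    """Classical fourth-order Runge-Kutta integration."""
    for _ in range(n):
        k1 = rhs(s); k2 = rhs(s + dt/2*k1); k3 = rhs(s + dt/2*k2); k4 = rhs(s + dt*k3)
        s = s + dt/6*(k1 + 2*k2 + 2*k3 + k4)
    return s

def embed(s):
    """Map the state into R^3: the point p and the velocity dth*e_theta + dph*e_phi."""
    th, ph, dth, dph = s
    p   = np.array([np.sin(th)*np.cos(ph), np.cos(th), np.sin(th)*np.sin(ph)])
    e_t = np.array([np.cos(th)*np.cos(ph), -np.sin(th), np.cos(th)*np.sin(ph)])
    e_p = np.array([-np.sin(th)*np.sin(ph), 0.0, np.sin(th)*np.cos(ph)])
    return p, dth*e_t + dph*e_p

s0 = np.array([1.2, 0.0, 0.3, 0.7])      # initial point and direction, away from the poles
s1 = rk4(s0, 1e-3, 2000)

speed = lambda s: s[2]**2 + np.sin(s[0])**2 * s[3]**2    # g(gamma', gamma')
assert np.isclose(speed(s0), speed(s1), rtol=1e-6)

# A great circle lies in a fixed plane: p x p' is a constant of the motion.
p0, v0 = embed(s0); p1, v1 = embed(s1)
assert np.allclose(np.cross(p0, v0), np.cross(p1, v1), atol=1e-5)
```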

Parallel Transport

The geodesic equation says \nabla_{\dot\gamma} \dot\gamma = 0: the tangent vector is "constant" along the curve. We can generalize this to any vector carried along any curve.

Definition

A vector field V(t) along a curve \gamma(t) is parallel-transported if \nabla_{\dot\gamma} V = 0. In coordinates:

\frac{dV^k}{dt} + \Gamma^k_{ij}\, \dot\gamma^i\, V^j = 0

Given an initial vector V(0) \in T_{\gamma(0)}M, this first-order linear ODE has a unique solution. Parallel transport defines a linear isomorphism between tangent spaces at different points along the curve.

Holonomy

On a flat surface, parallel-transporting a vector around a closed loop returns it unchanged. On a curved surface, it comes back rotated. The rotation angle is called the holonomy of the loop.

For a geodesic triangle on S^2 with interior angles A, B, C, the holonomy equals the angular excess:

\Omega = A + B + C - \pi = \int\!\!\int_\Delta K \, dA

where K = 1 is the Gaussian curvature of the unit sphere and \Delta is the region enclosed. This is the Gauss-Bonnet theorem in action, and it connects parallel transport directly to the curvature of Chapter 5.
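The area formula can be checked by integrating the parallel-transport ODE around a latitude circle θ = θ₀ (not a geodesic, but transport is defined along any curve): the vector should come back rotated by the enclosed spherical-cap area 2π(1 − cos θ₀). A sketch (numpy; the RK4 loop and names are ours):

```python
import numpy as np

theta0 = 1.0                              # transport around the latitude theta = theta0
a = np.sin(theta0) * np.cos(theta0)       # -Gamma^theta_{phi phi}
b = np.cos(theta0) / np.sin(theta0)       #  Gamma^phi_{theta phi}

def rhs(V):
    """dV/dt for parallel transport along the curve phi(t) = t at fixed theta0."""
    return np.array([a * V[1], -b * V[0]])

V = np.array([1.0, 0.0])                  # start as the unit vector e_theta
n = 4000
dt = 2*np.pi / n                          # once around the loop, t from 0 to 2*pi
for _ in range(n):
    k1 = rhs(V); k2 = rhs(V + dt/2*k1); k3 = rhs(V + dt/2*k2); k4 = rhs(V + dt*k3)
    V = V + dt/6*(k1 + 2*k2 + 2*k3 + k4)

# Rotation angle measured in the orthonormal frame (e_theta, e_phi / sin(theta)):
angle = np.arctan2(np.sin(theta0) * V[1], V[0]) % (2*np.pi)
area  = 2*np.pi*(1 - np.cos(theta0))      # area of the spherical cap enclosed
assert np.isclose(angle, area % (2*np.pi), atol=1e-4)
```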

We have seen that parallel transport reveals something fundamental about a surface: it rotates vectors, and the rotation is proportional to the enclosed area. This "rotation per unit area" is the curvature. In the next chapter, we make this precise and extend the notion beyond surfaces to manifolds of any dimension.

05

Curvature

In Chapter 4, we saw that parallel transport around a loop rotates a vector. Curvature is what makes this happen: it measures, at each point, how much the manifold deviates from being flat.

From Holonomy to Curvature

Recall from Chapter 4: parallel-transporting a vector around a geodesic triangle on S^2 produces a rotation \Omega equal to the area of the triangle (since K = 1 on the unit sphere). Now consider what happens when we shrink the triangle toward a single point p.

As the triangle contracts, both the holonomy \Omega and the enclosed area A tend to zero. But their ratio converges to a definite limit:

K(p) = \lim_{A \to 0} \frac{\Omega}{A}

This limit is the Gaussian curvature at p. It is a local quantity: the amount of "rotation per unit area" that the surface generates at each point. On the unit sphere, K = 1 everywhere. On a flat plane, K = 0.

Gaussian Curvature

For a surface in \mathbb{R}^3, Gaussian curvature can be understood through principal curvatures.

Definition

At each point of a surface in \mathbb{R}^3, the principal curvatures \kappa_1 and \kappa_2 are the maximum and minimum curvatures of normal cross-sections. The Gaussian curvature is their product.

At each point, the surface bends most in one direction (\kappa_1) and least in another (\kappa_2):

K = \kappa_1 \, \kappa_2

This gives a geometric classification: K > 0 where the surface bends the same way in both principal directions (sphere-like), K < 0 where it bends in opposite ways (saddle-like), and K = 0 where at least one principal curvature vanishes (plane- or cylinder-like).

Theorema Egregium (Gauss, 1827)

The Gaussian curvature K depends only on the metric g_{ij} and its derivatives, not on how the surface is embedded in \mathbb{R}^3. In other words, K is an intrinsic invariant: a creature living on the surface could measure it without any knowledge of the ambient space.

This is remarkable. A cylinder has K = 0 everywhere: you can unroll it flat without stretching. A sphere has K > 0: no map of the Earth can be distance-preserving (this is why all flat maps distort). These are intrinsic facts, detectable from within the surface.

On S^2, using the round metric from Chapter 3 (g_{\theta\theta} = 1, g_{\varphi\varphi} = \sin^2\theta), the Gaussian curvature formula gives K = 1 everywhere, confirming our holonomy computation.

On a torus with major radius a and minor radius b, parametrized by (u, v):

K(u, v) = \frac{\cos v}{b(a + b\cos v)}

The outer rim (v = 0) has K > 0 (sphere-like), the inner rim (v = \pi) has K < 0 (saddle-like), and the top and bottom circles (v = \pm\pi/2) have K = 0. The total curvature integrates to zero, consistent with the Gauss-Bonnet theorem for a torus (Euler characteristic \chi = 0).
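The zero total curvature is easy to confirm numerically: the torus area element is dA = b(a + b\cos v)\,du\,dv, so K\,dA = \cos v\,du\,dv integrates to zero over a full period. A sketch (numpy; the sample radii a = 2, b = 0.5 are our choice):

```python
import numpy as np

a, b = 2.0, 0.5                              # major and minor radii (sample values)
n = 400
u = np.linspace(0, 2*np.pi, n, endpoint=False)
v = np.linspace(0, 2*np.pi, n, endpoint=False)
U, V = np.meshgrid(u, v)

K  = np.cos(V) / (b * (a + b*np.cos(V)))     # Gaussian curvature from the formula above
dA = b * (a + b*np.cos(V))                   # torus area element sqrt(det g)
du = dv = 2*np.pi / n

total = np.sum(K * dA) * du * dv             # total curvature integral
assert np.isclose(total, 0.0, atol=1e-8)     # Gauss-Bonnet: chi(torus) = 0
```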

The Riemann Curvature Tensor

On a 2-dimensional surface, a single number K(p) captures all curvature information at p. In higher dimensions, curvature is richer: the manifold can curve differently in different 2-planes through p. We need a more powerful object.

The Riemann curvature tensor measures the failure of covariant derivatives to commute. On \mathbb{R}^n, the order of differentiation does not matter: \nabla_X \nabla_Y Z = \nabla_Y \nabla_X Z. On a curved manifold, it does.

Notation

The Lie bracket [X, Y] of two vector fields is the vector field defined by [X,Y](f) = X(Y(f)) - Y(X(f)) for any smooth function f. It measures whether the flows of X and Y commute. For coordinate basis vectors, [e_i, e_j] = 0.

Definition

The Riemann curvature tensor \text{Riem} is defined by:

\text{Riem}(X, Y)Z = \nabla_X \nabla_Y Z - \nabla_Y \nabla_X Z - \nabla_{[X,Y]} Z

The term \nabla_{[X,Y]}Z corrects for the fact that X and Y might not commute as vector fields. For coordinate basis vectors this term vanishes, simplifying computations.

In coordinates, the Riemann tensor has components (often written R^l_{\ kij} for brevity):

\text{Riem}^l_{\ kij} = \frac{\partial \Gamma^l_{kj}}{\partial x^i} - \frac{\partial \Gamma^l_{ki}}{\partial x^j} + \Gamma^l_{im}\,\Gamma^m_{kj} - \Gamma^l_{jm}\,\Gamma^m_{ki}

This is a direct consequence of the Christoffel symbols from Chapter 3: curvature is built entirely from the connection and its derivatives.
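The coordinate formula can be evaluated for S² using the Christoffel symbols of Chapter 3; in two dimensions there is one independent component, and it should come out to \text{Riem}^\theta_{\ \varphi\theta\varphi} = \sin^2\theta. A sketch (numpy; index 0 = θ, 1 = φ, helper names ours):

```python
import numpy as np

def Gamma(theta):
    """Christoffel symbols on S^2, stored as G[k, i, j] = Gamma^k_{ij}."""
    G = np.zeros((2, 2, 2))
    G[0, 1, 1] = -np.sin(theta) * np.cos(theta)
    G[1, 0, 1] = G[1, 1, 0] = np.cos(theta) / np.sin(theta)
    return G

def riemann(theta, l, k, i, j, h=1e-6):
    """Riem^l_{kij} from the coordinate formula; only d/dtheta derivatives
    survive since the symbols depend on theta alone."""
    dG = (Gamma(theta + h) - Gamma(theta - h)) / (2*h)   # partial_theta Gamma
    G = Gamma(theta)
    term_i = dG[l, k, j] if i == 0 else 0.0              # partial_i Gamma^l_{kj}
    term_j = dG[l, k, i] if j == 0 else 0.0              # partial_j Gamma^l_{ki}
    quad = sum(G[l, i, m]*G[m, k, j] - G[l, j, m]*G[m, k, i] for m in range(2))
    return term_i - term_j + quad

theta = 0.9
assert np.isclose(riemann(theta, 0, 1, 0, 1), np.sin(theta)**2, atol=1e-5)
# Antisymmetry in the last two indices:
assert np.isclose(riemann(theta, 0, 1, 1, 0), -np.sin(theta)**2, atol=1e-5)
```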

Sectional Curvature

The Riemann tensor is a 4-index object, which can be hard to interpret geometrically. The sectional curvature extracts a single number for each 2-plane:

Definition

The sectional curvature of a 2-plane \sigma = \text{span}(X, Y) in T_pM is:

K(\sigma) = \frac{\langle \text{Riem}(X, Y)Y, X \rangle}{\langle X, X\rangle\langle Y, Y\rangle - \langle X, Y\rangle^2}

Intuitively, K(\sigma) is the Gaussian curvature of the 2-dimensional surface obtained by "slicing" the manifold along \sigma (via geodesics starting in the directions of \sigma). On a 2-manifold, there is only one 2-plane at each point, and K(\sigma) reduces to the Gaussian curvature.

Ricci and Scalar Curvature

The Riemann tensor can be progressively simplified by contraction (summing over indices):

Definition

The Ricci tensor is the trace of the Riemann tensor: \text{Ric}_{ij} = \text{Riem}^k_{\ ikj}. It averages sectional curvatures over all 2-planes containing a given direction.

Definition

The scalar curvature is the trace of the Ricci tensor: S = g^{ij}\,\text{Ric}_{ij}. It is a single number at each point, summarizing the total curvature.

The hierarchy is:

\underbrace{\text{Riem}^l_{\ kij}}_{\text{Riemann (full)}} \;\xrightarrow{\text{trace}}\; \underbrace{\text{Ric}_{ij}}_{\text{Ricci (directional avg.)}} \;\xrightarrow{\text{trace}}\; \underbrace{S}_{\text{scalar (total avg.)}}

Each contraction loses information but gains interpretability. For a 2-manifold, all three are equivalent (determined by K). In general relativity, the Einstein field equations relate the Ricci tensor to the energy-momentum content of spacetime. In machine learning, the scalar curvature of a latent space measures how much the learned representation distorts local volumes (Chapter 9).

All of these curvature quantities are built from the metric g and its derivatives. In Chapter 3, we introduced the metric on S^2 to compute Christoffel symbols. In the next chapter, we study the Riemannian metric in full generality: how it defines lengths, areas, and the unique Levi-Civita connection.

06

Metric Geometry

In Chapter 3, we introduced the Riemannian metric g on S^2 to compute Christoffel symbols. Here we explore the full power of the metric: measuring distances, computing areas, and the remarkable fact that the metric alone determines the connection.

Geodesic Distance

In Chapter 4, we defined the length of a curve \gamma: [a,b] \to M using the metric. The geodesic distance between two points is the infimum of lengths over all curves connecting them:

Definition

The geodesic distance between p, q \in M is d(p,q) = \inf_\gamma L(\gamma), where the infimum is taken over all piecewise smooth curves from p to q.

This distance function turns (M, d) into a metric space in the topological sense. On S^2, the geodesic distance between two points is the angle between them (for the unit sphere): d(p,q) = \arccos(\langle p, q \rangle). The geodesics (great circles) are precisely the curves that achieve this minimum.
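A two-line check of the arccos formula (numpy; `sphere_distance` is our name):

```python
import numpy as np

def sphere_distance(p, q):
    """Geodesic distance on the unit sphere: the angle between p and q."""
    return np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))   # clip guards roundoff

north   = np.array([0.0, 0.0, 1.0])
equator = np.array([1.0, 0.0, 0.0])
assert np.isclose(sphere_distance(north, equator), np.pi/2)   # a quarter circle
assert np.isclose(sphere_distance(north, -north), np.pi)      # antipodal points
```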

Riemannian Volume

The metric also lets us measure areas and volumes. In coordinates (x^1, \ldots, x^n), the volume element is:

d\text{Vol} = \sqrt{\det(g_{ij})} \; dx^1 \wedge \cdots \wedge dx^n

The factor \sqrt{\det(g_{ij})} accounts for the distortion introduced by the coordinate system. On S^2:

\det(g_{ij}) = \det \begin{pmatrix} 1 & 0 \\ 0 & \sin^2\theta \end{pmatrix} = \sin^2\theta \quad \Rightarrow \quad d\text{Vol} = \sin\theta \; d\theta \, d\varphi

This is the familiar area element on the sphere. It shrinks near the poles (\theta \to 0, \pi) where the coordinate grid compresses. Integrating over the full sphere gives \int_0^\pi \int_0^{2\pi} \sin\theta \, d\varphi \, d\theta = 4\pi.
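The 4π total is a one-line numerical integral. A sketch (numpy; midpoint rule in θ, with the φ integral contributing a factor 2π since the integrand is φ-free):

```python
import numpy as np

n = 2000
theta  = (np.arange(n) + 0.5) * np.pi / n     # midpoint rule nodes in [0, pi]
dtheta = np.pi / n
area = 2*np.pi * np.sum(np.sin(theta)) * dtheta   # integral of sin(theta) dtheta dphi
assert np.isclose(area, 4*np.pi, rtol=1e-5)
```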

Isometries

Definition

An isometry between Riemannian manifolds (M, g) and (N, h) is a diffeomorphism \phi: M \to N that preserves the metric: \phi^* h = g. In other words, h_{\phi(p)}(d\phi(v), d\phi(w)) = g_p(v, w) for all tangent vectors v, w \in T_pM.

Isometries preserve everything that the metric defines: distances, angles, areas, curvature, geodesics. They are the "rigid motions" of Riemannian geometry.

A cylinder and a flat plane are locally isometric (you can unroll a cylinder without stretching). Both have K = 0. A sphere and a plane are not locally isometric (no distance-preserving map exists), which is why every flat world map necessarily distorts distances. This is a consequence of the Theorema Egregium (Chapter 5): isometries preserve Gaussian curvature, and K = 1 \neq 0.

The Levi-Civita Connection

In Chapter 3, we defined the covariant derivative \nabla on S^2 by projecting ordinary derivatives onto the tangent plane. On a general Riemannian manifold, there are many possible connections. A remarkable theorem says that the metric singles out a unique one.

Theorem (Fundamental Theorem of Riemannian Geometry)

On any Riemannian manifold (M, g), there exists a unique connection \nabla (the Levi-Civita connection) satisfying:

  1. Metric compatibility: \nabla g = 0, i.e. X(g(Y,Z)) = g(\nabla_X Y, Z) + g(Y, \nabla_X Z).
  2. Torsion-free: \nabla_X Y - \nabla_Y X = [X, Y].

Metric compatibility means parallel transport preserves inner products: lengths and angles are unchanged. Torsion-free means the connection is symmetric (\Gamma^k_{ij} = \Gamma^k_{ji}). Together, these two conditions determine the Christoffel symbols entirely from the metric:

\Gamma^k_{ij} = \frac{1}{2}\, g^{kl} \left( \frac{\partial g_{jl}}{\partial x^i} + \frac{\partial g_{il}}{\partial x^j} - \frac{\partial g_{ij}}{\partial x^l} \right)
Proof: Levi-Civita Formula

Strategy

We derive the Christoffel symbol formula by writing out metric compatibility three times with permuted indices, then combining to isolate \Gamma.

Step 1: Metric compatibility in coordinates

The condition \nabla g = 0 applied to coordinate basis vectors gives:

\frac{\partial g_{jl}}{\partial x^i} = \Gamma^m_{ij}\, g_{ml} + \Gamma^m_{il}\, g_{jm}

This says: the change in the metric along x^i is accounted for entirely by the Christoffel symbols.

Step 2: Cyclic permutation

Write the same equation with indices permuted cyclically:

(A):\; \frac{\partial g_{jl}}{\partial x^i} = \Gamma^m_{ij}\, g_{ml} + \Gamma^m_{il}\, g_{jm}
(B):\; \frac{\partial g_{il}}{\partial x^j} = \Gamma^m_{ji}\, g_{ml} + \Gamma^m_{jl}\, g_{im}
(C):\; \frac{\partial g_{ij}}{\partial x^l} = \Gamma^m_{li}\, g_{mj} + \Gamma^m_{lj}\, g_{im}

Step 3: Combine using torsion-free

Compute (A) + (B) - (C). Using \Gamma^m_{ij} = \Gamma^m_{ji} (torsion-free), many terms cancel and we get:

\frac{\partial g_{jl}}{\partial x^i} + \frac{\partial g_{il}}{\partial x^j} - \frac{\partial g_{ij}}{\partial x^l} = 2\, \Gamma^m_{ij}\, g_{ml}

Step 4: Solve for Γ

Multiply both sides by \frac{1}{2}\, g^{kl} (the inverse metric from Chapter 3) and sum over l:

\Gamma^k_{ij} = \frac{1}{2}\, g^{kl} \left( \frac{\partial g_{jl}}{\partial x^i} + \frac{\partial g_{il}}{\partial x^j} - \frac{\partial g_{ij}}{\partial x^l} \right) \qquad \square

Since the right-hand side involves only the metric and its first derivatives, the connection is completely determined by g. This is why we could compute the Christoffel symbols in Chapter 3 from the round metric alone.

This formula closes the circle: in Chapter 3, we computed the Christoffel symbols of S^2 by projecting derivatives in \mathbb{R}^3. Here we see that the same symbols emerge purely from the metric g_{\theta\theta} = 1, g_{\varphi\varphi} = \sin^2\theta, without any reference to an ambient space. The metric is all you need.
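The Levi-Civita formula can be checked numerically: the sketch below (our function names) differentiates a metric by central finite differences and contracts with the inverse metric, recovering the sphere's Christoffel symbols \Gamma^\theta_{\varphi\varphi} = -\sin\theta\cos\theta and \Gamma^\varphi_{\theta\varphi} = \cot\theta from the round metric alone:

```python
import numpy as np

def christoffel(metric, x, eps=1e-5):
    """Christoffel symbols Gamma[k, i, j] = Gamma^k_{ij} from the
    Levi-Civita formula, with metric derivatives taken by central
    finite differences. `metric` maps coordinates to the matrix g_{ij}."""
    n = len(x)
    g_inv = np.linalg.inv(metric(x))
    dg = np.empty((n, n, n))                  # dg[l, i, j] = d g_{ij} / d x^l
    for l in range(n):
        e = np.zeros(n)
        e[l] = eps
        dg[l] = (metric(x + e) - metric(x - e)) / (2 * eps)
    # bracket[i, j, l] = d_i g_{jl} + d_j g_{il} - d_l g_{ij}
    bracket = dg + np.einsum('jil->ijl', dg) - np.einsum('lij->ijl', dg)
    return 0.5 * np.einsum('kl,ijl->kij', g_inv, bracket)

def round_metric(x):
    """Round metric on S^2 in (theta, phi) coordinates."""
    theta, _ = x
    return np.diag([1.0, np.sin(theta) ** 2])
```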

With the full toolkit of metric geometry in hand (distances, areas, curvature, the Levi-Civita connection), we can now move toward applications. In the next chapter, we study how to project points onto manifolds and how the exponential map links tangent spaces to the manifold itself.

07

Projection onto Manifolds

Given a point in ambient space, finding the closest point on a manifold is fundamental for optimization, statistics, and ML. The exponential map reverses the question: starting from the tangent space, it shoots geodesics onto the manifold, giving the definitive answer to our central question.

The Normal Bundle

In Chapters 1 through 6, we lived entirely "on" the manifold. Now we step back and ask: given a manifold M embedded in ambient space \mathbb{R}^n, what happens to nearby points that are not on M? This requires understanding directions perpendicular to the surface.

Definition — Normal Space

For M embedded in \mathbb{R}^n, the normal space at p is:

N_pM = \{ \mathbf{n} \in \mathbb{R}^n : \langle \mathbf{n}, v \rangle = 0 \;\;\forall\, v \in T_pM \}

On S^2 \subset \mathbb{R}^3, the tangent plane at p is the plane perpendicular to p (viewed as a radius vector). Therefore N_pS^2 = \text{span}(p): the normal direction at any point on the sphere is simply the radial direction.

Definition — Normal Bundle

The normal bundle collects all normal spaces into a single smooth manifold:

NM = \{ (p, \mathbf{n}) : p \in M,\; \mathbf{n} \in N_pM \}

If M has dimension k inside \mathbb{R}^n, then NM is itself a smooth manifold of dimension n (the k dimensions of M plus the n - k normal directions at each point).

Projection and Tubular Neighborhoods

Given a point q near M but not on it, the nearest-point projection sends q to the closest point on M.

Definition — Nearest-Point Projection

The projection map \pi is defined by:

\pi(q) = \arg\min_{p \in M} \| q - p \|

This is well-defined only if q is "close enough" to M. Points too far away may have multiple closest points. Consider the center of a sphere: every surface point is equidistant, so \pi(0) is undefined. The region where projection is well-behaved is called a tubular neighborhood.

Definition — Tubular Neighborhood

A tubular neighborhood of M in \mathbb{R}^n is an open set U \supset M on which the nearest-point projection \pi: U \to M is well-defined and smooth. Concretely:

U = \{ p + t\,\mathbf{n}(p) : p \in M,\; \mathbf{n}(p) \in N_pM,\; \|\mathbf{n}(p)\| = 1,\; |t| < \varepsilon \}
Theorem — Tubular Neighborhood Theorem

Every compact embedded submanifold M \subset \mathbb{R}^n admits a tubular neighborhood.

On S^2, the tubular neighborhood is the open shell \{ q \in \mathbb{R}^3 : 1 - \varepsilon < \|q\| < 1 + \varepsilon \} for any \varepsilon \in (0, 1). The projection takes a particularly simple form:

\pi(q) = \frac{q}{\|q\|}, \qquad q \neq 0

The distance from q to its projection is \|q - \pi(q)\| = |\,\|q\| - 1\,|: simply how far q is from the unit sphere.
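In code, the projection and its distance check are one line each (a NumPy sketch; the function name is ours):

```python
import numpy as np

def project_to_sphere(q):
    """Nearest-point projection onto S^2: pi(q) = q / ||q|| (undefined at 0)."""
    r = np.linalg.norm(q)
    if r == 0.0:
        raise ValueError("projection is undefined at the center of the sphere")
    return q / r
```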

Proof: Projection Formula on S²

We minimize \|q - p\|^2 subject to \|p\| = 1.

Step 1. By Lagrange multipliers, at a minimum we need q - p = \lambda\, p for some \lambda \in \mathbb{R}.

Step 2. This gives q = (1 + \lambda)\, p, so p = q / (1 + \lambda). Since \|p\| = 1, we get |1 + \lambda| = \|q\|.

Step 3. Taking the sign that minimizes distance yields:

p = \frac{q}{\|q\|}, \qquad \|q - \pi(q)\| = |\,\|q\| - 1\,| \qquad \square

The Exponential Map

Projection maps from ambient space to the manifold. The exponential map goes the other way: from the tangent space to the manifold. It is arguably the most important construction in Riemannian geometry.

Definition — Exponential Map

The exponential map at p sends a tangent vector to the point reached by following the geodesic it defines:

\exp_p : T_pM \to M, \qquad \exp_p(v) = \gamma_v(1)

where \gamma_v is the unique geodesic with \gamma_v(0) = p and \dot\gamma_v(0) = v. In words: "walk from p in direction v for unit time along the geodesic with initial speed \|v\|_g." The distance traveled is \|v\|_g.

This is the precise answer to the blog's central question. The exponential map converts "straight lines in the tangent plane" into "straight lines on the curved surface" (geodesics).

On S^2, geodesics are great circles. For p \in S^2 and v \in T_pS^2 with \|v\| = r, the closed-form expression is:

\exp_p(v) = \cos(r)\, p + \sin(r)\, \frac{v}{r}

This follows directly from the parametrization of great circles: t \mapsto \cos(t)\, p + \sin(t)\, \hat{v} evaluated at t = \|v\|.

Key properties of exp:

  1. \exp_p(0) = p
  2. d(\exp_p)_0 = \mathrm{id}_{T_pM} (the differential at zero is the identity)
  3. t \mapsto \exp_p(tv) is the geodesic with initial velocity v
  4. \exp_p is a local diffeomorphism near 0 (by the inverse function theorem, since d(\exp_p)_0 = \mathrm{id})

The Logarithmic Map

Since \exp_p is a local diffeomorphism, it has a local inverse.

Definition — Logarithmic Map

The logarithmic map is the local inverse of the exponential map:

\log_p : U \subset M \to T_pM, \qquad \log_p = (\exp_p)^{-1}

It returns the initial velocity of the geodesic from p to q. On S^2, with d = \arccos\langle p, q \rangle:

\log_p(q) = d \cdot \frac{q - \langle p, q \rangle\, p}{\|q - \langle p, q \rangle\, p\|}

The numerator q - \langle p, q \rangle\, p is the component of q tangent to S^2 at p, and d is the geodesic distance (from Chapter 6).
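Both maps translate directly into NumPy (a sketch with our names; the guards handle the r \to 0 limit and the antipodal case, where the log is genuinely undefined):

```python
import numpy as np

def exp_map(p, v):
    """Exponential map on S^2: exp_p(v) = cos(r) p + sin(r) v/r, r = ||v||."""
    r = np.linalg.norm(v)
    if r < 1e-12:                  # exp_p(0) = p
        return p
    return np.cos(r) * p + np.sin(r) * v / r

def log_map(p, q):
    """Logarithmic map on S^2: initial velocity of the geodesic from p to q."""
    c = np.clip(np.dot(p, q), -1.0, 1.0)
    d = np.arccos(c)               # geodesic distance
    u = q - c * p                  # component of q tangent to S^2 at p
    n = np.linalg.norm(u)
    if n < 1e-12:                  # q = p (or q = -p, where log is undefined)
        return np.zeros(3)
    return d * u / n
```

The round trip \exp_p(\log_p(q)) = q holds for any q short of the antipode, and \|\log_p(q)\| equals the geodesic distance.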

How far can we push \exp_p before it stops being injective? This is captured by the injectivity radius.

Definition — Injectivity Radius

The injectivity radius at p is:

\mathrm{inj}(p) = \sup \{ r > 0 : \exp_p \text{ is a diffeomorphism on } B_r(0) \subset T_pM \}

On S^2, \mathrm{inj}(p) = \pi for every p. The exponential map is a diffeomorphism on the open disk of radius \pi in T_pM. It fails at the antipodal point -p (distance \pi), where all geodesics from p reconverge. This point is called the cut point.

Normal Coordinates

Since \exp_p is a diffeomorphism near p, we can use it as a chart. This gives a coordinate system centered at p where geodesics through p are straight lines and the metric is Euclidean to first order.

Definition — Normal Coordinates

Choose an orthonormal basis \{e_1, \ldots, e_n\} of T_pM. The normal coordinate map is:

(x^1, \ldots, x^n) \mapsto \exp_p(x^i e_i)

The key property is that in normal coordinates, the metric looks Euclidean at the center and the Christoffel symbols vanish: the first holds because d(\exp_p)_0 = \mathrm{id} carries the orthonormal basis to itself, the second because the radial lines t \mapsto tv are geodesics. (The Gauss lemma sharpens this picture: geodesic spheres around p are orthogonal to the radial geodesics.)

g_{ij}(p) = \delta_{ij}, \qquad \Gamma^k_{ij}(p) = 0

The curvature appears only in the second-order correction:

g_{ij}(x) = \delta_{ij} - \tfrac{1}{3}\, \mathrm{Riem}_{ikjl}(p)\, x^k\, x^l + O(\|x\|^3)

This is conceptually powerful: at any single point, you can always pretend you are in flat space. Curvature is what makes this approximation break down as you move away from p. The visualization below shows this in action: radial geodesics from p form the "coordinate lines," and on S^2 (positive curvature) they converge at the antipodal point, revealing how the metric deviates from flatness.

Proof: expp is a Local Diffeomorphism

We show that d(\exp_p)_0 = \mathrm{id}_{T_pM}, then apply the inverse function theorem.

Step 1. For any v \in T_pM, consider the curve c(t) = tv in T_pM. Then \exp_p(c(t)) = \exp_p(tv) = \gamma_v(t), the geodesic with initial velocity v.

Step 2. Differentiating at t = 0:

d(\exp_p)_0(v) = \frac{d}{dt}\bigg|_{t=0} \exp_p(tv) = \frac{d}{dt}\bigg|_{t=0} \gamma_v(t) = \dot\gamma_v(0) = v

Step 3. Since d(\exp_p)_0 is the identity map (hence invertible), the inverse function theorem guarantees that \exp_p is a diffeomorphism from a neighborhood of 0 \in T_pM to a neighborhood of p \in M. \square

The exponential and logarithmic maps give us a principled way to move between the tangent space (a linear space where we can do standard linear algebra) and the manifold itself. This "linearize, compute, map back" paradigm is exactly what manifold learning algorithms exploit. In the next chapter, we see how algorithms like ISOMAP, LLE, and UMAP use these geometric ideas to discover low-dimensional manifold structure hidden in high-dimensional data.

08

Manifold Learning

In Chapters 1 through 7, the manifold was given: we knew S^2, we knew its metric, and we computed geodesics explicitly. In practice, we rarely have such luxury. Data arrives as a cloud of points in high-dimensional space, and the manifold is hidden. Manifold learning algorithms recover that hidden structure using the same geometric concepts we have built: geodesic distances, tangent spaces, and the Laplace-Beltrami operator.

The Manifold Hypothesis

Notation

We write \mathcal{X} = \mathbb{R}^D for the ambient data space of dimension D, and \mathcal{Z} = \mathbb{R}^d for the low-dimensional embedding space with d \ll D. A dataset is a finite sample \{x_1, \dots, x_N\} \subset \mathcal{X}.

Definition — Manifold Hypothesis

A dataset \{x_1, \dots, x_N\} \subset \mathbb{R}^D satisfies the manifold hypothesis if there exists a smooth manifold M of dimension d \ll D and a smooth embedding \iota: M \hookrightarrow \mathbb{R}^D such that the data concentrates near \iota(M).

Consider a sheet of paper rolled into a spiral in \mathbb{R}^3: the Swiss roll. Points on this surface live in 3D, but the sheet itself is 2-dimensional. Two points close in Euclidean distance may be far apart along the surface, because the straight line between them cuts through the roll. The true distance is the geodesic distance d(p,q) from Chapter 6, measured along the manifold.

Definition — Intrinsic Dimensionality

The intrinsic dimensionality of a dataset is the dimension d of the underlying manifold M. A point cloud in \mathbb{R}^D with intrinsic dimension d has d local degrees of freedom.

The goal of manifold learning is to recover M (or at least its intrinsic geometry) from the samples \{x_i\}. Each algorithm below attacks this problem by approximating a different geometric quantity from the preceding chapters:

From Geodesic Distance to ISOMAP

ISOMAP (Tenenbaum, de Silva, Langford, 2000) directly targets the geodesic distance d(p,q) from Chapter 6. If we knew the manifold, we would compute d(p,q) = \inf_\gamma L(\gamma) as the infimum over all paths. Without the manifold, we approximate this using a graph.

Notation

The graph shortest-path distance d_G(x_i, x_j) is the length of the shortest path between x_i and x_j in the k-nearest-neighbor graph, where edge weights are Euclidean distances \|x_i - x_j\|.

The algorithm proceeds in three steps: (1) build a k-NN graph on the data, (2) compute all-pairs shortest-path distances d_G(x_i, x_j) via Dijkstra's algorithm, (3) apply classical Multidimensional Scaling to embed points in \mathcal{Z} = \mathbb{R}^d.

Definition — ISOMAP Embedding

The ISOMAP embedding finds coordinates z_1, \dots, z_N \in \mathbb{R}^d minimizing the stress:

\text{Stress} = \sum_{i < j} \bigl( d_G(x_i, x_j) - \| z_i - z_j \| \bigr)^2
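The three steps fit in a short NumPy/SciPy sketch (our function and parameter names, not a reference implementation; it assumes the k-NN graph is connected):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

def isomap(X, n_neighbors=8, d=2):
    """Minimal ISOMAP sketch: k-NN graph -> shortest paths -> classical MDS."""
    N = X.shape[0]
    D = squareform(pdist(X))                       # Euclidean distances
    # keep only each point's k nearest neighbors (inf marks non-edges)
    W = np.full_like(D, np.inf)
    idx = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    rows = np.repeat(np.arange(N), n_neighbors)
    W[rows, idx.ravel()] = D[rows, idx.ravel()]
    W = np.minimum(W, W.T)                         # symmetrize the graph
    DG = shortest_path(W, method="D")              # all-pairs Dijkstra
    # classical MDS: double-center the squared distances, top-d eigenvectors
    H = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * H @ (DG ** 2) @ H
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:d]
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```

On points sampled along a half circle, the leading ISOMAP coordinate recovers the arc-length parameter (up to sign), which is exactly the "unrolling" behavior the Swiss roll example calls for.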

In practice, classical MDS computes this embedding in closed form: eigendecompose the doubly centered matrix B = -\tfrac{1}{2} H D^{(2)}_G H (where D^{(2)}_G holds the squared graph distances and H = I - \tfrac{1}{N}\mathbf{1}\mathbf{1}^T is the centering matrix), then take the top d eigenvectors scaled by the square roots of their eigenvalues. The key theoretical guarantee is that graph distances converge to geodesic distances under sufficient sampling.

Proof sketch: Graph Distances Converge to Geodesic Distances

Setup

Let M be a compact Riemannian manifold. Sample N points uniformly on M and build the \varepsilon-neighborhood graph (connecting points within distance \varepsilon).

Step 1: Upper bound

Any geodesic \gamma from p to q can be approximated by a chain of graph edges. If the sampling density is high enough, consecutive sample points along \gamma are within \varepsilon of each other, giving:

d_G(x_i, x_j) \leq d(p_i, p_j) + O(\varepsilon)

Step 2: Lower bound

Graph paths cannot "shortcut" through the interior of the manifold. Under sufficient sampling density, every graph path stays close to a manifold path of comparable length:

d_G(x_i, x_j) \geq d(p_i, p_j) - O(\varepsilon)

Step 3: Convergence

If \varepsilon \to 0 and N \varepsilon^d \to \infty (enough points per neighborhood), then:

\sup_{i,j} \bigl| d_G(x_i, x_j) - d(p_i, p_j) \bigr| \xrightarrow{N \to \infty} 0 \qquad \square

ISOMAP recovers global geometry (the full distance matrix) from a graph. But global information is expensive and fragile. Can we work with local geometry instead?

Local Tangent Spaces and LLE

In Chapter 2, we saw that the tangent space T_pM provides a local linear approximation to the manifold around p. Within a small neighborhood, the manifold looks flat, and points can be expressed as affine combinations of their neighbors. Locally Linear Embedding (Roweis and Saul, 2000) turns this geometric insight into an algorithm: instead of approximating distances globally, it captures the tangent-plane structure locally.

Notation

For each data point x_i, we write \mathcal{N}(i) for its set of k nearest neighbors in \mathcal{X}.

Definition — Locally Linear Reconstruction

The reconstruction weights for x_i are the coefficients w_{ij} minimizing:

\mathcal{E}_{\text{rec}}(W) = \sum_{i=1}^{N} \left\| x_i - \sum_{j \in \mathcal{N}(i)} w_{ij}\, x_j \right\|^2, \qquad \sum_{j} w_{ij} = 1

The constraint \sum_j w_{ij} = 1 forces affine (not just linear) combinations, encoding the local tangent-plane geometry. Note that w_{ij} = 0 when x_j is not a neighbor of x_i.

LLE then finds the low-dimensional coordinates z_1, \dots, z_N \in \mathbb{R}^d that best preserve these weights:

\mathcal{E}_{\text{embed}}(Z) = \sum_{i=1}^{N} \left\| z_i - \sum_{j \in \mathcal{N}(i)} w_{ij}\, z_j \right\|^2

The weights w_{ij} capture how x_i sits in the local tangent plane spanned by its neighbors. This is the discrete analogue of the "linearize, compute, map back" paradigm from Chapter 7: the reconstruction weights encode local geometry via \exp_p and \log_p, and the embedding step transports this structure to \mathcal{Z}.
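The reconstruction-weight step for a single point reduces to a small linear solve (a sketch following Roweis and Saul's Gram-matrix trick; function name and the ridge value are ours):

```python
import numpy as np

def lle_weights(X, i, neighbors, reg=1e-3):
    """LLE reconstruction weights for point i: solving the local Gram system
    G w = 1 and normalizing minimizes ||x_i - sum_j w_j x_j||^2 subject to
    sum_j w_j = 1. `reg` is the usual ridge for a (near-)singular G."""
    Z = X[neighbors] - X[i]                       # center the neighborhood on x_i
    G = Z @ Z.T                                   # local Gram matrix (k x k)
    G = G + reg * np.trace(G) * np.eye(len(neighbors))
    w = np.linalg.solve(G, np.ones(len(neighbors)))
    return w / w.sum()                            # enforce the affine constraint
```

A sanity check: a point at the centroid of three neighbors is reconstructed exactly, with equal weights.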

Both ISOMAP and LLE discretize a specific geometric object (distances or tangent planes). A third approach asks: is there a single intrinsic operator that encodes the full geometry of (M, g), and can we approximate it from data?

Graph Laplacians and Spectral Methods

The metric tensor g_{ij} (Chapter 3) and the volume element \sqrt{\det g} (Chapter 6) combine into a single differential operator, the Laplace-Beltrami operator \Delta_M, which encodes the full intrinsic geometry of the manifold. Laplacian Eigenmaps (Belkin and Niyogi, 2003) approximate this operator from a point cloud using a graph construction.

Definition — Graph Laplacian

Given a weighted adjacency matrix W with heat kernel weights (here w_{ij} denotes adjacency weights, not the LLE reconstruction weights above):

w_{ij} = \exp\!\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right) \quad \text{for neighbors}, \qquad w_{ij} = 0 \quad \text{otherwise}

where \sigma > 0 is a global bandwidth parameter. The graph Laplacian is L_G = D_W - W, where D_W is the diagonal degree matrix with (D_W)_{ii} = \sum_j w_{ij}.
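The construction is a few lines of NumPy (a dense sketch for small N; names are ours). The resulting L_G is symmetric, positive semidefinite, and annihilates constant functions, mirroring \Delta_M applied to a constant:

```python
import numpy as np

def graph_laplacian(X, sigma=1.0, k=3):
    """Heat-kernel graph Laplacian L = D_W - W on a k-NN graph."""
    N = X.shape[0]
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-D2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # keep only each point's k nearest neighbors, then symmetrize
    keep = np.argsort(D2, axis=1)[:, 1:k + 1]
    mask = np.zeros_like(W, dtype=bool)
    mask[np.repeat(np.arange(N), k), keep.ravel()] = True
    W = np.where(mask | mask.T, W, 0.0)
    return np.diag(W.sum(axis=1)) - W
```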

This discrete operator approximates a fundamental object from Riemannian geometry:

Definition — Laplace-Beltrami Operator

On a Riemannian manifold (M, g), the Laplace-Beltrami operator generalizes the Euclidean Laplacian:

\Delta_M f = \frac{1}{\sqrt{\det g}}\, \partial_i \!\left( \sqrt{\det g}\; g^{ij}\, \partial_j f \right)

This operator depends only on the metric g_{ij} (Chapter 3) and the volume element \sqrt{\det g} (Chapter 6). It is an intrinsic invariant: isometric manifolds have the same Laplace-Beltrami spectrum. The Laplacian Eigenmap embedding uses the first d nontrivial eigenvectors of L_G as coordinates in \mathcal{Z}.

Proof sketch: Graph Laplacian Converges to Laplace-Beltrami

Step 1: Pointwise action

For a smooth function f on M, the graph Laplacian acts on sampled values as:

(L_G f)_i = \sum_{j} w_{ij}\bigl(f(x_i) - f(x_j)\bigr)

Step 2: Integral limit

As N \to \infty, the sum converges to an integral weighted by the heat kernel:

\frac{1}{N} \sum_{j} w_{ij}\bigl(f(x_i) - f(x_j)\bigr) \;\longrightarrow\; \int_M H_\sigma(x, y)\bigl(f(x) - f(y)\bigr)\, d\text{Vol}(y)

Step 3: Taylor expansion

Expanding f(y) around x in normal coordinates (Chapter 7) and integrating the heat kernel yields:

\int_M H_\sigma(x, y)\bigl(f(x) - f(y)\bigr)\, d\text{Vol}(y) = c_d\, \sigma^2\, \Delta_M f(x) + O(\sigma^4)

where c_d is a dimension-dependent constant.

Step 4: Spectral convergence

Pointwise convergence of the operator implies convergence of eigenvalues and eigenvectors. As N \to \infty and \sigma \to 0 at an appropriate rate, the spectrum of L_G converges to the spectrum of \Delta_M. \square

ISOMAP, LLE, and Laplacian Eigenmaps each approximate one geometric object (distances, tangent planes, or the Laplacian). The most recent methods take a different path: they encode the entire neighborhood structure as a probability distribution and optimize an embedding to preserve it. In doing so, UMAP recovers an approximation of the metric tensor g_{ij} itself.

Modern Methods: t-SNE and UMAP

On a Riemannian manifold, the metric g_{ij} determines how "close" two nearby points are. When the manifold is unknown, we only have Euclidean distances in ambient space, which can be misleading (as the Swiss roll illustrates). Both t-SNE and UMAP address this by converting neighborhood structure into probability distributions that capture intrinsic proximity, then optimizing an embedding to match.

t-SNE: Stochastic Neighbor Embedding

For each pair (x_i, x_j), t-SNE (van der Maaten and Hinton, 2008) defines a conditional similarity in the high-dimensional space using a Gaussian kernel with per-point bandwidth \sigma_i:

p_{j|i} = \frac{\exp\!\bigl(-\|x_i - x_j\|^2 / 2\sigma_i^2\bigr)}{\sum_{k \neq i} \exp\!\bigl(-\|x_i - x_k\|^2 / 2\sigma_i^2\bigr)}

The bandwidth \sigma_i is chosen so that the entropy of the conditional distribution matches a target perplexity. The similarities are symmetrized as p_{ij} = (p_{j|i} + p_{i|j}) / 2N.

In the low-dimensional space \mathcal{Z}, similarities use a Student-t kernel with one degree of freedom (the Cauchy distribution):

q_{ij} = \frac{\bigl(1 + \|z_i - z_j\|^2\bigr)^{-1}}{\sum_{k \neq l} \bigl(1 + \|z_k - z_l\|^2\bigr)^{-1}}

The heavy tails of the Student-t kernel address the crowding problem: a low-dimensional embedding has too little room to place all moderately distant neighbors at faithful distances, and the heavy tails let such points sit farther apart at little cost.

The embedding is found by minimizing the KL divergence between the two distributions:

\text{KL}(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
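The affinity construction, including the per-point bandwidth search, can be sketched as follows (our names; a readable version rather than an optimized one):

```python
import numpy as np

def tsne_affinities(X, perplexity=10.0, tol=1e-4, iters=100):
    """t-SNE high-dimensional affinities p_ij. Each beta_i = 1/(2 sigma_i^2)
    is found by bisection so that the conditional distribution p_{.|i} has
    entropy log(perplexity); the result is then symmetrized."""
    N = X.shape[0]
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    P = np.zeros((N, N))
    target = np.log(perplexity)
    for i in range(N):
        d = np.delete(D2[i], i)
        d = d - d.min()                      # shift for numerical stability
        lo, hi = 1e-12, 1e12
        for _ in range(iters):
            beta = np.sqrt(lo * hi)          # bisect on a log scale
            p = np.exp(-d * beta)
            p /= p.sum()
            H = -(p * np.log(p + 1e-300)).sum()
            if abs(H - target) < tol:
                break
            lo, hi = (beta, hi) if H > target else (lo, beta)
        P[i, np.arange(N) != i] = p
    return (P + P.T) / (2 * N)               # symmetrized p_ij
```

By construction the symmetrized affinities form a probability distribution over pairs: they are nonnegative and sum to one.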

UMAP: Uniform Manifold Approximation

UMAP (McInnes, Healy, Melville, 2018) has a more principled geometric foundation. It constructs a fuzzy simplicial set representing the topological structure of the data, then optimizes an embedding to match.

Notation

For each point x_i, let \rho_i = \min_{j \in \mathcal{N}(i)} \|x_i - x_j\| be the distance to its nearest neighbor.

The high-dimensional similarities are defined by normalizing distances relative to \rho_i:

p_{j|i} = \exp\!\left( -\frac{\max(0,\; \|x_i - x_j\| - \rho_i)}{\sigma_i} \right)

where \sigma_i is calibrated so that the similarities to the k neighbors sum to \log_2 k. This construction has a Riemannian interpretation: at each point x_i, UMAP builds a local metric by rescaling distances so that every neighborhood has the same effective size. This is equivalent to constructing a local approximation of the metric tensor g_{ij} from Chapter 3.

In the low-dimensional space, UMAP uses a smooth approximation to the Student-t family:

q_{ij} = \bigl(1 + a\,\|z_i - z_j\|^{2b}\bigr)^{-1}

where a, b are fitted to match a target distribution. The embedding minimizes the binary cross-entropy:

\mathcal{L}_{\text{UMAP}} = \sum_{i \neq j} \left[ p_{ij} \log \frac{p_{ij}}{q_{ij}} + (1 - p_{ij}) \log \frac{1 - p_{ij}}{1 - q_{ij}} \right]

Compared to t-SNE's KL divergence, the cross-entropy includes a repulsive term (1-p_{ij})\log\frac{1-p_{ij}}{1-q_{ij}} that explicitly pushes apart points that are not neighbors. This makes the optimization more stable and the embedding more faithful to global structure.
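The two kernels are tiny functions (a sketch with our names; the a, b defaults below are roughly the values UMAP fits for its default min_dist, and should be treated as illustrative):

```python
import numpy as np

def umap_similarity(dists, rho, sigma):
    """UMAP high-dimensional similarity p_{j|i} for one point, given the
    distances to its neighbors, the nearest-neighbor distance rho, and a
    bandwidth sigma (normally calibrated so the similarities sum to log2 k)."""
    return np.exp(-np.maximum(0.0, dists - rho) / sigma)

def umap_low_dim_kernel(dz, a=1.577, b=0.895):
    """Low-dimensional kernel q = (1 + a ||dz||^{2b})^{-1} for distances dz."""
    return 1.0 / (1.0 + a * dz ** (2 * b))
```

Note the effect of \rho_i: the nearest neighbor always gets similarity exactly 1, which is what makes the construction locally adaptive.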

All four families of algorithms (ISOMAP, LLE, spectral methods, UMAP/t-SNE) share a fundamental limitation: they produce a finite set of coordinates z_1, \dots, z_N \in \mathcal{Z}, but no continuous map between spaces. There is no function that takes a new point in \mathcal{Z} and produces a point in \mathcal{X}. Without such a map, we cannot generate new data, compute a metric tensor, or find geodesics in the embedding space. What if a neural network could learn this map explicitly? A decoder f: \mathcal{Z} \to \mathcal{X} provides exactly what unsupervised manifold learning cannot: a smooth parametric mapping whose Jacobian induces a Riemannian metric g_{ij}(z) = J(z)^T J(z) on the latent space. In the next chapter, we study this pullback metric and show that geodesics in latent space produce geometrically correct interpolations.

09

Latent Space Geometry

Chapter 8 showed that manifold learning algorithms produce embeddings, but not continuous maps. A neural network decoder provides exactly this missing piece, and with it comes a full Riemannian geometry on latent space.

The Pullback Metric

In Chapter 3, the Riemannian metric g_{ij} on the sphere S^2 was inherited from its embedding in \mathbb{R}^3. The metric measured how infinitesimal displacements in coordinates (\theta, \varphi) translated into distances in the ambient space. Here, the situation is reversed: a decoder f: \mathcal{Z} \to \mathcal{X} maps from latent coordinates to the ambient data space, and its Jacobian induces a metric on \mathcal{Z}.

Notation

A decoder is a smooth map f: \mathcal{Z} \to \mathcal{X} where \mathcal{Z} = \mathbb{R}^d is the latent space and \mathcal{X} = \mathbb{R}^D the ambient data space (with d \leq D). Its Jacobian at z is the D \times d matrix:

J(z) = \frac{\partial f}{\partial z}, \quad J(z)^\alpha{}_i = \frac{\partial f^\alpha}{\partial z^i}

When we move by \mathrm{d}z in latent space, the decoded point moves by \mathrm{d}x = J(z)\,\mathrm{d}z in data space. The squared length of this displacement is:

\|\mathrm{d}x\|^2 = \mathrm{d}z^T \, J(z)^T J(z) \, \mathrm{d}z
Definition — Pullback Metric

The pullback metric on \mathcal{Z} induced by the decoder f is:

g_{ij}(z) = \bigl(J(z)^T J(z)\bigr)_{ij} = \sum_{\alpha=1}^{D} \frac{\partial f^\alpha}{\partial z^i} \frac{\partial f^\alpha}{\partial z^j}

This is the same Riemannian metric g_{ij} from Chapter 3, now computed from the decoder rather than given by a formula.

To build intuition, consider a concrete decoder. Let \mathcal{Z} = \mathbb{R}^2 and \mathcal{X} = \mathbb{R}^3, with:

f(z_1, z_2) = \bigl(z_1,\; z_2,\; A \exp(-\tfrac{z_1^2 + z_2^2}{2\sigma^2})\bigr)

This maps a flat 2D plane to a surface with a Gaussian bump of amplitude A. Writing h(z) = A \exp(-\|z\|^2 / 2\sigma^2) for the height function, the Jacobian and metric are:

J = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ \partial_1 h & \partial_2 h \end{pmatrix}, \qquad g_{ij} = \delta_{ij} + (\partial_i h)(\partial_j h)

Far from the origin, \partial_i h \approx 0 and g_{ij} \approx \delta_{ij}: the metric is flat. Near the bump, the correction terms (\partial_i h)(\partial_j h) grow and the metric deviates from Euclidean.
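The pullback metric can be computed for any decoder by finite-differencing the Jacobian; the sketch below (our names, with illustrative amplitude and width) checks the result against the analytic \delta_{ij} + (\partial_i h)(\partial_j h) for the bump:

```python
import numpy as np

A_AMP, SIG = 1.0, 0.5      # bump amplitude and width (illustrative values)

def decoder(z):
    """The Gaussian-bump decoder f(z1, z2) = (z1, z2, h(z)) from the text."""
    h = A_AMP * np.exp(-(z @ z) / (2 * SIG ** 2))
    return np.array([z[0], z[1], h])

def pullback_metric(f, z, eps=1e-6):
    """g(z) = J(z)^T J(z), with the Jacobian from central finite differences."""
    d = len(z)
    J = np.empty((len(f(z)), d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        J[:, i] = (f(z + e) - f(z - e)) / (2 * eps)
    return J.T @ J
```

In a real model, the same g = J^T J is obtained from the decoder's autodiff Jacobian rather than finite differences.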

Proof: The Pullback Metric is Riemannian

Claim

If f: \mathcal{Z} \to \mathcal{X} is a smooth immersion (i.e., J(z) has full column rank for all z), then g_{ij}(z) = (J^T J)_{ij} is a Riemannian metric on \mathcal{Z}.

Step 1: Symmetry

g_{ij} = \sum_\alpha (\partial_i f^\alpha)(\partial_j f^\alpha) = \sum_\alpha (\partial_j f^\alpha)(\partial_i f^\alpha) = g_{ji}.

Step 2: Positive-definiteness

For any nonzero tangent vector v \in T_z\mathcal{Z}:

g_{ij}\, v^i v^j = \sum_\alpha \Bigl(\sum_i \frac{\partial f^\alpha}{\partial z^i} v^i\Bigr)^2 = \|J(z)\, v\|^2 > 0

The strict inequality holds because J has full column rank, so Jv = 0 implies v = 0. This is the same full-rank Jacobian condition that ensured parametric surfaces in Chapter 1 were well-defined.

Step 3: Smoothness

Since f is smooth, all partial derivatives \partial_i f^\alpha are smooth, and products of smooth functions are smooth. \square

Geometry of the Latent Metric

In Chapter 5, Gaussian curvature K measured how a surface deviates from flatness. In Chapter 6, the volume element \sqrt{\det g} measured how areas distort under the metric. Both quantities are fully determined by g_{ij}, and we can compute them explicitly for our decoder.

For the Gaussian bump surface f(z) = (z_1, z_2, h(z)), the metric determinant and volume element are:

\det g = 1 + (\partial_1 h)^2 + (\partial_2 h)^2, \qquad \sqrt{\det g} = \sqrt{1 + \|\nabla h\|^2}

Where \sqrt{\det g} > 1, the decoder stretches latent-space areas. For our bump, this occurs on the flanks near \|z\| = \sigma, where the slope \|\nabla h\| is largest; at the very top the gradient vanishes and the metric is momentarily Euclidean.

The Gaussian curvature follows from the classical Monge-patch formula (Chapter 5):

K = \frac{\partial_{11}h \cdot \partial_{22}h - (\partial_{12}h)^2}{(1 + (\partial_1 h)^2 + (\partial_2 h)^2)^2}

At the top of the bump, K > 0 (positive curvature, like a sphere). On the flanks, K < 0 (negative curvature, like a saddle). Far from the origin, K \to 0 (flat).
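The sign pattern is easy to confirm with the Monge-patch formula and the analytic derivatives of the Gaussian height function (a sketch; constants are the same illustrative values as above):

```python
import numpy as np

A_AMP, SIG = 1.0, 0.5      # bump amplitude and width (illustrative values)

def bump_curvature(z):
    """Gaussian curvature of the bump surface via the Monge-patch formula
    K = (h11 h22 - h12^2) / (1 + h1^2 + h2^2)^2, with analytic derivatives."""
    z1, z2 = z
    h = A_AMP * np.exp(-(z1 ** 2 + z2 ** 2) / (2 * SIG ** 2))
    h1, h2 = -z1 * h / SIG ** 2, -z2 * h / SIG ** 2
    h11 = (z1 ** 2 / SIG ** 2 - 1) * h / SIG ** 2
    h22 = (z2 ** 2 / SIG ** 2 - 1) * h / SIG ** 2
    h12 = z1 * z2 * h / SIG ** 4
    return (h11 * h22 - h12 ** 2) / (1 + h1 ** 2 + h2 ** 2) ** 2
```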

To find geodesics (next section), we need the Christoffel symbols \Gamma^k_{ij} from Chapter 3. For a graph surface, the Levi-Civita formula from Chapter 6 simplifies beautifully:

Definition — Pullback Christoffel Symbols

For a graph decoder f(z) = (z, h(z)), the Christoffel symbols of the pullback metric reduce to:

\Gamma^k_{ij}(z) = \frac{(\partial_k h)(\partial_{ij}^2 h)}{\det g}

where \partial_k h are first derivatives and \partial_{ij}^2 h are second derivatives of the height function.

This formula is remarkably compact: the Christoffel symbols vanish wherever the surface is flat (\nabla h = 0), and they are largest where both the slope and curvature of h are large.

The left panel shows the decoded surface colored by Gaussian curvature. The right panel shows the latent space \mathcal{Z}, where the heatmap encodes \sqrt{\det g} (volume distortion) and the ellipses show the local metric tensor: circles indicate flat regions, elongated ellipses indicate strong stretching. Use the slider to vary the bump amplitude and watch the geometry change.

Geodesics in Latent Space

In Chapter 4, geodesics on S^2 were great circles, satisfying the geodesic equation \ddot{\gamma}^k + \Gamma^k_{ij}\,\dot{\gamma}^i\dot{\gamma}^j = 0. The same equation applies here, with the Christoffel symbols computed from the pullback metric. In latent space \mathcal{Z} = \mathbb{R}^2, the system becomes:

\ddot{z}^k + \frac{(\partial_k h)(\partial_{ij}^2 h)}{\det g}\,\dot{z}^i\dot{z}^j = 0, \qquad k = 1, 2

This is the payoff of the entire blog. A "straight line" in latent space (linear interpolation z(t) = (1-t)\,z_A + t\,z_B) ignores the geometry: its decoded image may climb over the bump, taking a longer path on the surface. The geodesic curves in latent space to avoid high-metric regions, producing a decoded path that is shorter on the surface.

Definition — Latent Geodesic

A latent geodesic is a curve \gamma: [0,1] \to \mathcal{Z} satisfying the geodesic equation with Christoffel symbols \Gamma^k_{ij}(z) computed from the pullback metric g_{ij}(z) = (J^T J)_{ij}. Its image f \circ \gamma traces a locally shortest path on the decoded surface between f(z_A) and f(z_B).

Drag the endpoints A and B in the latent space panel. When the straight line (orange, dashed) passes through the bump region, the geodesic (cyan, solid) curves around it. The surface lengths confirm that the geodesic finds a shorter path on the decoded surface, even though it is longer in Euclidean latent coordinates.
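The same comparison can be reproduced numerically. The sketch below (our names; amplitude and width are illustrative) finds an approximate geodesic by directly minimizing the decoded path length over a discretized latent path, a variational shortcut rather than the geodesic-ODE shooting method:

```python
import numpy as np
from scipy.optimize import minimize

A_AMP, SIG = 2.0, 0.4      # bump amplitude and width (illustrative values)

def decode(Z):
    """Bump decoder f(z) = (z1, z2, h(z)) applied to an (n, 2) path."""
    h = A_AMP * np.exp(-(Z ** 2).sum(axis=1) / (2 * SIG ** 2))
    return np.column_stack([Z, h])

def surface_length(Z):
    """Length of the decoded polyline f(z_0), ..., f(z_{n-1})."""
    return np.linalg.norm(np.diff(decode(Z), axis=0), axis=1).sum()

def latent_geodesic(zA, zB, n=32):
    """Approximate latent geodesic: minimize decoded path length over the
    interior points of a discrete latent path, starting from a perturbed
    straight line (the perturbation breaks the bump's symmetry)."""
    t = np.linspace(0.0, 1.0, n)[1:-1, None]
    init = (1 - t) * zA + t * zB
    init[:, 1] += 0.5 * np.sin(np.pi * t[:, 0])
    def cost(flat):
        return surface_length(np.vstack([zA, flat.reshape(-1, 2), zB]))
    res = minimize(cost, init.ravel(), method="L-BFGS-B")
    return np.vstack([zA, res.x.reshape(-1, 2), zB])
```

For endpoints on opposite sides of the bump, the optimized path detours around it and its decoded length is clearly below that of the straight latent segment, matching the behavior in the interactive panel.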

Connections to Representation Learning

In Chapter 6, an isometry was a map that preserves the metric: \phi^*g = g. An ideal representation would make the pullback metric as close to flat as possible, so that Euclidean operations in \mathcal{Z} (linear interpolation, nearest neighbors, averaging) would be geometrically correct.

A disentangled representation corresponds to coordinates where the metric tensor is diagonal: g_{12}(z) = 0 everywhere. In such coordinates, the latent dimensions are geometrically independent, and moving along one axis does not affect distances measured along another. The off-diagonal terms g_{ij} with i \neq j quantify the degree of geometric entanglement.

A perfectly flat representation (g_{ij} = \delta_{ij} everywhere) would make the decoder an isometry. But the Theorema Egregium from Chapter 5 tells us this is impossible whenever the data manifold has nonzero Gaussian curvature: you cannot flatten a curved surface without distortion. This is a fundamental obstruction, not a limitation of the architecture.

The Fisher information metric provides another instance of the pullback construction. For a parametric family of distributions p_\omega(x), the Fisher metric F_{ij}(\omega) = \mathbb{E}[\partial_i \log p_\omega \cdot \partial_j \log p_\omega] is the pullback of the L^2 metric on the space of square-root densities. In a variational autoencoder (VAE), the KL regularization term penalizes deviations of the posterior from a standard Gaussian, which has the geometric effect of encouraging a flat latent metric. This provides a probabilistic motivation for geometric regularity.

Whether we seek disentanglement, isometry, or information-theoretic regularity, the pullback metric g_{ij}(z) = J(z)^T J(z) is the unifying language. It measures exactly what a decoder does to geometry, and any notion of "representation quality" can be phrased as a constraint on this tensor.

Closing the Loop

We opened this blog with a question: what is a straight line on a curved surface?

The answer required nine layers of mathematical structure: a manifold to define "surface" (Chapter 1), tangent spaces to define "direction" (Chapter 2), a connection and metric to define "straight" (Chapter 3), the geodesic equation to define "line" (Chapter 4), curvature to measure how far from flat (Chapter 5), metric geometry to measure distance and volume (Chapter 6), the exponential map to go from direction to destination (Chapter 7), manifold learning to discover the surface from data (Chapter 8), and the pullback metric to compute all of the above when the surface is learned by a neural network (Chapter 9).

A straight line on a curved surface is a geodesic: a curve \gamma satisfying \nabla_{\dot{\gamma}} \dot{\gamma} = 0. It parallel-transports its own velocity, has zero intrinsic acceleration, and is a critical point of the energy functional E[\gamma]. In a latent space equipped with the pullback metric g_{ij}(z) = J(z)^T J(z), the geodesic is the curve whose decoded image traces the shortest path on the learned manifold. Linear interpolation in latent space ignores the geometry; the geodesic respects it.

Every curved space has its own notion of straightness. Learning that notion from data is the meeting point of differential geometry and deep learning.