What does it mean for a space to be curved? We begin with the historical shift from Euclidean to Riemannian geometry, then formalize the notion of a smooth manifold using the sphere as our guiding example.
For over two thousand years, geometry meant Euclidean geometry. Euclid's five postulates, formulated for the flat plane, provided the foundation. The fifth (the parallel postulate) states: given a line in the plane and a point not on it, there exists exactly one line through that point which never meets the first.
In the 19th century, mathematicians began questioning this assumption. Gauss, Bolyai, and Lobachevsky showed that consistent geometries exist where the parallel postulate fails. Bernhard Riemann took this further in his 1854 Habilitationsschrift, proposing a framework where geometry is not a fixed background but a property of the space itself.
On a sphere, for instance, "lines" (geodesics) are great circles. Any two great circles intersect, so there are no parallel lines at all. The geometry of the sphere is intrinsically different from that of the plane.
We denote by S^2 the unit 2-sphere in \mathbb{R}^3:

S^2 = \{ p \in \mathbb{R}^3 : \|p\| = 1 \}
S^2 is a surface: two-dimensional, smooth, and curved. The most natural way to describe it is via spherical coordinates (\theta, \varphi):

F(\theta, \varphi) = (\sin\theta \cos\varphi,\ \sin\theta \sin\varphi,\ \cos\theta)
Two parameters produce a point in \mathbb{R}^3. Near most points, this is a smooth bijection: the surface locally looks like a piece of \mathbb{R}^2. These coordinates will be our working coordinates for concrete computations throughout the blog (metric, Christoffel symbols, geodesics).
However, this parametrization fails at the poles (\theta = 0 and \theta = \pi): all values of \varphi map to the same point, and the Jacobian drops rank. No single coordinate system can cover all of S^2 without such degeneracies. We need a more general framework.
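This failure is easy to see numerically. Below is a minimal sketch (numpy; the helper names F and jacobian are ours) that evaluates the analytic Jacobian of the spherical parametrization and watches its rank drop at the pole:

```python
import numpy as np

def F(theta, phi):
    """Spherical-coordinate parametrization of S^2."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

def jacobian(theta, phi):
    """3x2 Jacobian [dF/dtheta | dF/dphi] of the parametrization."""
    dF_dtheta = np.array([np.cos(theta) * np.cos(phi),
                          np.cos(theta) * np.sin(phi),
                          -np.sin(theta)])
    dF_dphi = np.array([-np.sin(theta) * np.sin(phi),
                        np.sin(theta) * np.cos(phi),
                        0.0])
    return np.column_stack([dF_dtheta, dF_dphi])

print(np.linalg.norm(F(1.0, 2.0)))                      # 1.0: the image lies on S^2
print(np.linalg.matrix_rank(jacobian(np.pi / 2, 0.3)))  # 2: full rank away from the poles
print(np.linalg.matrix_rank(jacobian(0.0, 0.3)))        # 1: the rank drops at theta = 0
```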
Before introducing the abstract machinery, let us make precise what we just did. We described S^2 by giving a smooth map from a region of \mathbb{R}^2 into \mathbb{R}^3. This is the concrete notion of a parametric surface.
A parametric surface is a smooth map F : U \subset \mathbb{R}^2 \to \mathbb{R}^3 whose Jacobian has rank 2 everywhere. The image F(U) is the surface sitting in \mathbb{R}^3.
For the sphere, the parametrization is exactly the map F(\theta, \varphi) = (\sin\theta \cos\varphi, \sin\theta \sin\varphi, \cos\theta) written above, restricted to an open set U where its Jacobian has rank 2 (away from the poles).
The key observation is that a chart is the local inverse of a parametrization. Where F goes from coordinates to the surface, the chart \psi goes the other way:

\psi = F^{-1} : F(U) \to U \subset \mathbb{R}^2
In practice, we often think in terms of the parametrization F (going "up" from coordinates to the surface) rather than the chart \psi (going "down" from the surface to coordinates). They carry the same information, just in opposite directions.
A topological manifold M of dimension n is a topological space that is:
1. Hausdorff: distinct points have disjoint neighborhoods;
2. second countable: the topology has a countable base;
3. locally Euclidean: every point has a neighborhood homeomorphic to an open subset of \mathbb{R}^n.
S^2 satisfies all three conditions: it is a 2-dimensional topological manifold.
Since no single coordinate system covers S^2 without singularities, we need several overlapping ones. The following construction provides a singularity-free atlas (we will not use it for calculations, only to prove that S^2 is a smooth manifold).
A chart on M is a pair (U_\alpha, \psi_\alpha) where U_\alpha \subset M is open and \psi_\alpha : U_\alpha \to \psi_\alpha(U_\alpha) \subset \mathbb{R}^n is a homeomorphism. An atlas is a collection of charts \{(U_\alpha, \psi_\alpha)\} that covers M.
Example. The stereographic projection from the north pole N = (0, 0, 1) maps S^2 \setminus \{N\} to \mathbb{R}^2:

\sigma_N(x, y, z) = \left( \frac{x}{1 - z},\ \frac{y}{1 - z} \right)
From the south pole S = (0, 0, -1):

\sigma_S(x, y, z) = \left( \frac{x}{1 + z},\ \frac{y}{1 + z} \right)
Each chart misses one point, but together (S^2 \setminus \{N\},\, \sigma_N) and (S^2 \setminus \{S\},\, \sigma_S) cover all of S^2.
On the overlap S^2 \setminus \{N, S\}, the transition map between our two charts is:

(\sigma_S \circ \sigma_N^{-1})(u, v) = \frac{(u, v)}{u^2 + v^2}
This inversion map is C^\infty on its domain. An atlas whose transition maps are all smooth is a smooth atlas.
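As a sanity check, here is a small numpy sketch (the helper names are ours) that composes \sigma_S with the inverse of \sigma_N and confirms the inversion formula numerically:

```python
import numpy as np

def sigma_N(p):
    """Stereographic projection from the north pole N = (0, 0, 1)."""
    x, y, z = p
    return np.array([x / (1 - z), y / (1 - z)])

def sigma_N_inv(u, v):
    """Inverse chart: the plane back to S^2 minus the north pole."""
    s = u**2 + v**2
    return np.array([2 * u, 2 * v, s - 1]) / (s + 1)

def sigma_S(p):
    """Stereographic projection from the south pole S = (0, 0, -1)."""
    x, y, z = p
    return np.array([x / (1 + z), y / (1 + z)])

u, v = 0.7, -1.2
p = sigma_N_inv(u, v)
print(sigma_N(p))                        # recovers (u, v): the charts invert correctly
print(sigma_S(p))                        # the transition map applied to (u, v)
print(np.array([u, v]) / (u**2 + v**2))  # the inversion formula gives the same point
```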
A smooth manifold is a topological manifold M equipped with a maximal smooth atlas (a smooth structure). In practice, any smooth atlas determines a unique smooth structure.
With S^2 established as a smooth manifold, we now have the foundation to build richer structure on top of it. In the next chapter, we introduce tangent spaces, which capture the notion of "directions" at a point.
At each point of a manifold, the tangent space captures all possible directions of movement.
At a point p \in S^2, imagine all curves passing through p. Their velocity vectors at p span a plane tangent to the sphere. This plane is the tangent space T_pS^2.
In our spherical coordinates (\theta, \varphi), the two basis vectors are obtained by differentiating the parametrization:

e_\theta = \frac{\partial F}{\partial \theta} = (\cos\theta \cos\varphi,\ \cos\theta \sin\varphi,\ -\sin\theta), \qquad e_\varphi = \frac{\partial F}{\partial \varphi} = (-\sin\theta \sin\varphi,\ \sin\theta \cos\varphi,\ 0)
These two vectors are tangent to S^2 at p and form a basis of T_pS^2.
On \mathbb{R}^n, a vector v is an "arrow". On a curved manifold, there is no ambient space to draw arrows in. Instead, we define a tangent vector by what it does: it computes how a function changes in a given direction.
Concrete example on S^2. Let f: S^2 \to \mathbb{R} be the temperature at each point. The chart \psi assigns coordinates (\theta, \varphi) to each point, so f \circ \psi^{-1}(\theta, \varphi) is the temperature expressed as a function of \theta and \varphi. A tangent vector v = 3\,\partial/\partial\theta + 2\,\partial/\partial\varphi gives:

v(f) = 3\, \frac{\partial (f \circ \psi^{-1})}{\partial \theta} + 2\, \frac{\partial (f \circ \psi^{-1})}{\partial \varphi}, \quad \text{evaluated at } \psi(p)
This is the rate of change of temperature in the direction v. The output v(f) is a real number, not a vector. The vector is the operator v itself.
Why define a vector as an operator? On \mathbb{R}^n, a tangent vector is an "arrow" and v(f) = \langle v, \nabla f \rangle is just the dot product with the gradient. On an abstract manifold, there is no ambient space to draw arrows in, and no global coordinate system to write components. But smooth functions f: M \to \mathbb{R} always exist, and the directional derivative v(f) is a coordinate-free real number. So we characterize a vector by its action on all functions: this is its dual representation (analogous to the Riesz representation theorem in functional analysis).
This characterization is faithful: a linear map v: C^\infty(M) \to \mathbb{R} is a tangent vector if and only if it satisfies the Leibniz rule: v(fg) = v(f)\,g(p) + f(p)\,v(g). This condition forces v to be local and finite-dimensional: the space of all such operators at p is isomorphic to \mathbb{R}^n. Two different vectors always disagree on at least one function, and every Leibniz-compatible operator corresponds to exactly one geometric direction. The correspondence is a bijection, not merely an encoding.
Example. On S^2, take the coordinate functions f_1 = \theta and f_2 = \varphi. Then v(f_1) = 3 and v(f_2) = 2 recover the components of v = 3\,\partial/\partial\theta + 2\,\partial/\partial\varphi. In a different chart, the components change, but the operator v and all its outputs v(f) remain the same.
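The example is easy to reproduce symbolically. A minimal sympy sketch (the particular "temperature" function chosen here is our own illustration, not from the text):

```python
import sympy as sp

theta, phi = sp.symbols('theta phi')

# An arbitrary smooth "temperature" on S^2, written in chart coordinates.
f = sp.sin(theta) * sp.cos(phi)

# The derivation v = 3 d/dtheta + 2 d/dphi acting on a function:
v = lambda func: 3 * sp.diff(func, theta) + 2 * sp.diff(func, phi)
print(v(f).subs({theta: sp.pi / 3, phi: sp.pi / 4}))  # v(f): a single real number

# Acting on the coordinate functions recovers the components:
print(v(theta), v(phi))  # 3, 2

# The Leibniz rule v(fg) = v(f) g + f v(g) holds identically:
g = sp.cos(theta) ** 2
print(sp.simplify(v(f * g) - (v(f) * g + f * v(g))))  # 0
```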
General formula. On a manifold M of dimension n with chart (U, \psi) and coordinates (x^1, \ldots, x^n):
The tangent space T_pM at p \in M is the vector space of derivations at p: linear maps v: C^\infty(M) \to \mathbb{R} satisfying the Leibniz rule. A tangent vector v \in T_pM acts on any smooth function f: M \to \mathbb{R} by:

v(f) = \sum_{i=1}^n v^i \, \frac{\partial (f \circ \psi^{-1})}{\partial x^i} \bigg|_{\psi(p)}
Here f \circ \psi^{-1}: \mathbb{R}^n \to \mathbb{R} is f rewritten in local coordinates (so that we can take ordinary partial derivatives), and (v^1, \ldots, v^n) are the components of v in the coordinate basis \{\partial/\partial x^i\}. This action of a vector on functions will be the building block for the covariant derivative in Chapter 3, where we will need to differentiate vector fields along directions on M.
The tangent bundle is the disjoint union of all tangent spaces: TM = \bigsqcup_{p \in M} T_pM. It is itself a smooth manifold of dimension 2n.
Each point of TM is a pair (p, v) with p \in M and v \in T_pM. For S^2, the tangent bundle TS^2 is a 4-dimensional manifold. In the next chapter, we will see how to differentiate vector fields on M, which requires additional structure called a connection.
How do we differentiate a vector field on a curved surface?
From this chapter onward, we use the Einstein summation convention: when an index appears once as a superscript and once as a subscript in the same expression, a sum over that index is implied. For example, on S^2 with basis vectors e_1 = e_\theta, e_2 = e_\varphi:
v^i\, e_i = v^1\, e_1 + v^2\, e_2 = v^\theta\, e_\theta + v^\varphi\, e_\varphi
The repeated index i is a dummy index (summed over); non-repeated indices are free (they label the components of the result). This compact notation avoids writing explicit \sum signs in the tensor formulas that follow.
In Chapter 2 we built the basis vectors e_\theta = \partial/\partial\theta and e_\varphi = \partial/\partial\varphi at each point of S^2. These vectors change from point to point. How do they change?
Consider e_\varphi (pointing "east" along latitude circles). As we move along the sphere in the e_\varphi direction, the ordinary derivative \partial e_\varphi / \partial\varphi in \mathbb{R}^3 points horizontally inward, toward the sphere's axis (at the equator, straight toward the center). It leaves the tangent plane.
On \mathbb{R}^n this does not happen because the tangent space is the same everywhere. On a curved surface, naive differentiation produces a vector that is no longer tangent. The solution: project the result back onto T_pS^2.
This projection defines the covariant derivative \nabla. On S^2, we can compute all four covariant derivatives of basis vectors along basis vectors:

\nabla_{e_\theta} e_\theta = 0, \qquad \nabla_{e_\theta} e_\varphi = \nabla_{e_\varphi} e_\theta = \cot\theta \, e_\varphi, \qquad \nabla_{e_\varphi} e_\varphi = -\sin\theta \cos\theta \, e_\theta
Each result is itself a tangent vector, so it can be expressed in the basis \{e_\theta, e_\varphi\}. The coefficients of this expansion are the Christoffel symbols.
To compute the covariant derivative from a formula (rather than by projecting in \mathbb{R}^3), we need the metric: the inner product on each tangent space. For S^2 with spherical coordinates (\theta, \varphi), the metric components g_{ij} = \langle e_i, e_j \rangle are:

g_{\theta\theta} = 1, \qquad g_{\theta\varphi} = g_{\varphi\theta} = 0, \qquad g_{\varphi\varphi} = \sin^2\theta
A Riemannian metric g on a manifold M is a smooth assignment of an inner product g_p: T_pM \times T_pM \to \mathbb{R} to each tangent space. It allows us to measure lengths, angles, and volumes on M.
This says: e_\theta is a unit vector, e_\varphi has length \sin\theta (it shrinks near the poles), and they are orthogonal. The line element is ds^2 = d\theta^2 + \sin^2\theta\, d\varphi^2. These metric components will determine the Christoffel symbols below and the geodesic equation in Chapter 4.
The inverse metric g^{ij} is defined by g^{ik}g_{kj} = \delta^i_j. It allows us to "raise indices": convert a covector into a vector. On S^2: g^{\theta\theta} = 1, g^{\varphi\varphi} = 1/\sin^2\theta.
The Christoffel symbols \Gamma^k_{ij} are the components of the covariant derivative of basis vectors: \nabla_{e_i} e_j = \Gamma^k_{ij} \, e_k. Here i is the direction of differentiation, j is the basis vector being differentiated, and k indexes the components of the result.
Example on S^2. With two basis vectors, there are 2 \times 2 \times 2 = 8 possible symbols. Most are zero. The non-zero ones are:

\Gamma^\theta_{\varphi\varphi} = -\sin\theta \cos\theta, \qquad \Gamma^\varphi_{\theta\varphi} = \Gamma^\varphi_{\varphi\theta} = \cot\theta
For example, \nabla_{e_\varphi} e_\varphi = -\sin\theta\cos\theta \, e_\theta: as we move east, the "east" vector tilts toward the pole. At the equator (\theta = \pi/2) this effect vanishes. Near the poles it grows, reflecting the convergence of meridians.
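We can verify one of these symbols exactly the way this chapter defines the covariant derivative: differentiate in \mathbb{R}^3, then project back onto the tangent plane. A sympy sketch:

```python
import sympy as sp

theta, phi = sp.symbols('theta phi')
F = sp.Matrix([sp.sin(theta) * sp.cos(phi),
               sp.sin(theta) * sp.sin(phi),
               sp.cos(theta)])
e_theta, e_phi = F.diff(theta), F.diff(phi)   # coordinate basis vectors in R^3

de = e_phi.diff(phi)                          # naive derivative: leaves the tangent plane

# Orthogonal projection onto span{e_theta, e_phi}:
proj = (de.dot(e_theta) / e_theta.dot(e_theta)) * e_theta \
     + (de.dot(e_phi) / e_phi.dot(e_phi)) * e_phi

# The projection is exactly Gamma^theta_{phi phi} e_theta = -sin(theta)cos(theta) e_theta:
print(sp.simplify(proj + sp.sin(theta) * sp.cos(theta) * e_theta))  # zero vector
```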
Any vector fields X and Y on S^2 can be written X = X^\theta e_\theta + X^\varphi e_\varphi and Y = Y^\theta e_\theta + Y^\varphi e_\varphi, where the components X^\theta, X^\varphi, Y^\theta, Y^\varphi are smooth functions on S^2. By linearity of \nabla, the covariant derivative of Y along X is:

\nabla_X Y = X^i \left( \frac{\partial Y^k}{\partial x^i} + \Gamma^k_{ij} Y^j \right) e_k
The first term \partial Y^k / \partial x^i is how the components of Y change. The second term \Gamma^k_{ij} Y^j corrects for the fact that e_\theta and e_\varphi themselves rotate from point to point.
More abstractly, \nabla is an affine connection: an operator on any smooth manifold M that differentiates vector fields while keeping the result in T_pM. In the next chapter, we use it to define geodesics and parallel transport.
What is the shortest path between two points on a curved surface? This question, older than calculus itself, leads to the calculus of variations, the Euler-Lagrange equation, and the notion of geodesic.
On a flat plane, the shortest path between two points is a straight line. This fact is so basic that Euclid took it as an axiom. But what happens on a curved surface?
Navigators have known for centuries that the shortest route between two cities on Earth follows a great circle, not a straight line on the map. This practical observation hides a deep mathematical question: among all curves connecting two points on a surface, which one has the smallest length?
In 1696, Johann Bernoulli posed the brachistochrone problem: find the curve along which a bead slides fastest under gravity. This was not about shortest distance but shortest time, yet the mathematical structure is the same. Solutions came from Leibniz, Newton, l'Hôpital, and Jakob Bernoulli. Each had to optimize over an infinite-dimensional space of curves.
Leonhard Euler systematized these ideas in 1744, developing the calculus of variations: a framework for finding curves (or functions) that minimize (or make stationary) a given integral quantity. Joseph-Louis Lagrange refined the method in 1755, giving it the elegant form we use today.
Given a curve \gamma: [a, b] \to M on a Riemannian manifold (a smooth manifold M equipped with a metric g, as introduced in Chapter 3), its length is:

L(\gamma) = \int_a^b \sqrt{ g_{\gamma(t)}\big( \dot\gamma(t), \dot\gamma(t) \big) } \, dt
The square root makes this functional difficult to work with (it is invariant under reparametrization, which introduces degeneracies). Instead, we minimize the energy functional:
The energy of a curve \gamma: [a, b] \to M is:

E(\gamma) = \frac{1}{2} \int_a^b g_{\gamma(t)}\big( \dot\gamma(t), \dot\gamma(t) \big) \, dt
A curve that minimizes E among all curves with the same endpoints also minimizes L (by the Cauchy-Schwarz inequality), and it is automatically parametrized proportionally to arc length. This is why we work with energy rather than length.
The key idea: to find the curve that makes E stationary, we perturb it. Consider a smooth one-parameter family of curves \gamma_\varepsilon(t) = \gamma(t) + \varepsilon \, \eta(t) (written in local coordinates), where \eta is a smooth vector field along \gamma that vanishes at the endpoints: \eta(a) = \eta(b) = 0.
The curve \gamma is a critical point of E if and only if:

\frac{d}{d\varepsilon} E(\gamma_\varepsilon) \bigg|_{\varepsilon = 0} = 0 \quad \text{for every such variation } \eta
This is the infinite-dimensional analogue of setting the gradient to zero. The computation that turns this condition into a differential equation is the heart of the calculus of variations.
A curve \gamma(t) on a Riemannian manifold is a geodesic if it satisfies:

\ddot\gamma^k + \Gamma^k_{ij} \, \dot\gamma^i \dot\gamma^j = 0
This equation has two readings. From the calculus of variations: geodesics are critical points of the energy functional. From the connection of Chapter 3: geodesics are curves that parallel-transport their own tangent vector, i.e. \nabla_{\dot\gamma} \dot\gamma = 0. These two characterizations are equivalent.
On S^2, using the Christoffel symbols computed in Chapter 3, the geodesic equation becomes a coupled system for \theta(t) and \varphi(t):

\ddot\theta - \sin\theta \cos\theta \, \dot\varphi^2 = 0, \qquad \ddot\varphi + 2 \cot\theta \, \dot\theta \dot\varphi = 0
The solutions are great circles. In the interactive figure below, choose a point and an initial direction to trace one.
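The system can also be integrated numerically. A sketch with scipy (the initial conditions are our own choice) that checks the solution stays on a great circle, i.e. in a plane through the origin:

```python
import numpy as np
from scipy.integrate import solve_ivp

def geodesic_rhs(t, y):
    """First-order form of the geodesic equations, y = (theta, phi, dtheta, dphi)."""
    th, ph, dth, dph = y
    return [dth, dph,
            np.sin(th) * np.cos(th) * dph**2,   # theta'' = sin(th)cos(th) phi'^2
            -2.0 * dth * dph / np.tan(th)]      # phi''   = -2 cot(th) theta' phi'

y0 = [np.pi / 2, 0.0, 0.8, 0.6]                 # start on the equator
sol = solve_ivp(geodesic_rhs, [0, 2 * np.pi], y0, dense_output=True, rtol=1e-9)

ts = np.linspace(0, 2 * np.pi, 400)
th, ph = sol.sol(ts)[:2]
pts = np.stack([np.sin(th) * np.cos(ph), np.sin(th) * np.sin(ph), np.cos(th)], axis=1)

# Initial position and velocity in R^3 determine the plane of the great circle;
# e_theta = (0, 0, -1) and e_phi = (0, 1, 0) at the starting point (pi/2, 0).
p0 = pts[0]
v0 = y0[2] * np.array([0.0, 0.0, -1.0]) + y0[3] * np.array([0.0, 1.0, 0.0])
n = np.cross(p0, v0)
print(np.max(np.abs(pts @ (n / np.linalg.norm(n)))))  # ~0: planar, hence a great circle
```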
The geodesic equation says \nabla_{\dot\gamma} \dot\gamma = 0: the tangent vector is "constant" along the curve. We can generalize this to any vector carried along any curve.
A vector field V(t) along a curve \gamma(t) is parallel-transported if \nabla_{\dot\gamma} V = 0. In coordinates:

\dot V^k + \Gamma^k_{ij} \, \dot\gamma^i V^j = 0
Given an initial vector V(0) \in T_{\gamma(0)}M, this first-order linear ODE has a unique solution. Parallel transport defines a linear isomorphism between tangent spaces at different points along the curve.
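As a concrete instance, the sketch below (our own setup) transports a vector once around the latitude circle \theta = \theta_0. The curve is not a geodesic, and the vector comes back rotated, anticipating the holonomy discussed next:

```python
import numpy as np
from scipy.integrate import solve_ivp

theta0 = np.pi / 3   # latitude circle gamma(t) = (theta0, t), so (dtheta, dphi) = (0, 1)

def transport_rhs(t, V):
    """dV^k/dt = -Gamma^k_{ij} dgamma^i/dt V^j along the latitude circle."""
    Vth, Vph = V
    return [np.sin(theta0) * np.cos(theta0) * Vph,   # -Gamma^theta_{phi phi} V^phi
            -Vth / np.tan(theta0)]                   # -Gamma^phi_{phi theta} V^theta

sol = solve_ivp(transport_rhs, [0, 2 * np.pi], [1.0, 0.0], rtol=1e-10)
Vth, Vph = sol.y[:, -1]

# Components in an orthonormal frame (e_theta has length 1, e_phi has length sin):
angle = np.arctan2(np.sin(theta0) * Vph, Vth)
print(angle)                        # rotation after one loop, mod 2*pi
print(-2 * np.pi * np.cos(theta0))  # analytic value: -pi for theta0 = pi/3
```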
On a flat surface, parallel-transporting a vector around a closed loop returns it unchanged. On a curved surface, it comes back rotated. The rotation angle is called the holonomy of the loop.
For a geodesic triangle on S^2 with interior angles A, B, C, the holonomy equals the angular excess:

\Omega = (A + B + C) - \pi = \int_\Delta K \, dA
where K = 1 is the Gaussian curvature of the unit sphere and \Delta is the region enclosed. This is the Gauss-Bonnet theorem in action, and it connects parallel transport directly to the curvature of Chapter 5.
We have seen that parallel transport reveals something fundamental about a surface: it rotates vectors, and the rotation is proportional to the enclosed area. This "rotation per unit area" is the curvature. In the next chapter, we make this precise and extend the notion beyond surfaces to manifolds of any dimension.
In Chapter 4, we saw that parallel transport around a loop rotates a vector. Curvature is what makes this happen: it measures, at each point, how much the manifold deviates from being flat.
Recall from Chapter 4: parallel-transporting a vector around a geodesic triangle on S^2 produces a rotation \Omega equal to the area of the triangle (since K = 1 on the unit sphere). Now consider what happens when we shrink the triangle toward a single point p.
As the triangle contracts, both the holonomy \Omega and the enclosed area A tend to zero. But their ratio converges to a definite limit:

K(p) = \lim_{A \to 0} \frac{\Omega}{A}
This limit is the Gaussian curvature at p. It is a local quantity: the amount of "rotation per unit area" that the surface generates at each point. On the unit sphere, K = 1 everywhere. On a flat plane, K = 0.
For a surface in \mathbb{R}^3, Gaussian curvature can be understood through principal curvatures.
At each point of a surface in \mathbb{R}^3, the principal curvatures \kappa_1 and \kappa_2 are the maximum and minimum curvatures of normal cross-sections. The Gaussian curvature is their product.
At each point, the surface bends most in one direction (\kappa_1) and least in another (\kappa_2):

K = \kappa_1 \, \kappa_2
This gives a geometric classification:
- K > 0: the principal curvatures have the same sign; the surface is dome-like (sphere);
- K < 0: the principal curvatures have opposite signs; the surface is saddle-like;
- K = 0: at least one principal curvature vanishes; the surface is flat in that direction (plane, cylinder).
The Gaussian curvature K depends only on the metric g_{ij} and its derivatives, not on how the surface is embedded in \mathbb{R}^3. This is Gauss's Theorema Egregium ("remarkable theorem"). In other words, K is an intrinsic invariant: a creature living on the surface could measure it without any knowledge of the ambient space.
This is remarkable. A cylinder has K = 0 everywhere: you can unroll it flat without stretching. A sphere has K > 0: no map of the Earth can be distance-preserving (this is why all flat maps distort). These are intrinsic facts, detectable from within the surface.
On S^2, using the round metric from Chapter 3 (g_{\theta\theta} = 1, g_{\varphi\varphi} = \sin^2\theta), the Gaussian curvature formula gives K = 1 everywhere, confirming our holonomy computation.
On a torus with major radius a and minor radius b, parametrized by (u, v):

F(u, v) = \big( (a + b\cos v)\cos u,\ (a + b\cos v)\sin u,\ b\sin v \big), \qquad K(u, v) = \frac{\cos v}{b \, (a + b\cos v)}
The outer rim (v = 0) has K > 0 (sphere-like), the inner rim (v = \pi) has K < 0 (saddle-like), and the top and bottom circles (v = \pm\pi/2) have K = 0. The total curvature integrates to zero, consistent with the Gauss-Bonnet theorem for a torus (Euler characteristic \chi = 0).
On a 2-dimensional surface, a single number K(p) captures all curvature information at p. In higher dimensions, curvature is richer: the manifold can curve differently in different 2-planes through p. We need a more powerful object.
The Riemann curvature tensor measures the failure of covariant derivatives to commute. On \mathbb{R}^n, the order of differentiation does not matter: for coordinate vector fields, \nabla_X \nabla_Y Z = \nabla_Y \nabla_X Z. On a curved manifold, the order matters.
The Lie bracket [X, Y] of two vector fields is the vector field defined by [X,Y](f) = X(Y(f)) - Y(X(f)) for any smooth function f. It measures whether the flows of X and Y commute. For coordinate basis vectors, [e_i, e_j] = 0.
The Riemann curvature tensor \text{Riem} is defined by:

\text{Riem}(X, Y)Z = \nabla_X \nabla_Y Z - \nabla_Y \nabla_X Z - \nabla_{[X, Y]} Z
The term \nabla_{[X,Y]}Z corrects for the fact that X and Y might not commute as vector fields. For coordinate basis vectors this term vanishes, simplifying computations.
In coordinates, the Riemann tensor has components (often written R^l_{\ kij} for brevity):

R^l_{\ kij} = \partial_i \Gamma^l_{jk} - \partial_j \Gamma^l_{ik} + \Gamma^l_{im} \Gamma^m_{jk} - \Gamma^l_{jm} \Gamma^m_{ik}

These components are built entirely from the Christoffel symbols of Chapter 3 and their first derivatives: curvature comes from the connection alone.
The Riemann tensor is a 4-index object, which can be hard to interpret geometrically. The sectional curvature extracts a single number for each 2-plane:

The sectional curvature of a 2-plane \sigma = \text{span}(X, Y) in T_pM is:

K(\sigma) = \frac{ g\big( \text{Riem}(X, Y)Y,\ X \big) }{ g(X, X)\, g(Y, Y) - g(X, Y)^2 }
Intuitively, K(\sigma) is the Gaussian curvature of the 2-dimensional surface obtained by "slicing" the manifold along \sigma (via geodesics starting in the directions of \sigma). On a 2-manifold, there is only one 2-plane at each point, and K(\sigma) reduces to the Gaussian curvature.
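Everything above can be checked symbolically from the Christoffel symbols of Chapter 3. A sympy sketch (our own helper riem implements the coordinate formula for R^l_{kij}):

```python
import sympy as sp

theta, phi = sp.symbols('theta phi')
coords, n = [theta, phi], 2

# Christoffel symbols of S^2 (Chapter 3): Gamma[k][i][j] = Gamma^k_{ij}
Gamma = [[[sp.S(0)] * n for _ in range(n)] for _ in range(n)]
Gamma[0][1][1] = -sp.sin(theta) * sp.cos(theta)
Gamma[1][0][1] = Gamma[1][1][0] = sp.cos(theta) / sp.sin(theta)

def riem(l, k, i, j):
    """R^l_{kij} = d_i G^l_{jk} - d_j G^l_{ik} + G^l_{im} G^m_{jk} - G^l_{jm} G^m_{ik}."""
    expr = sp.diff(Gamma[l][j][k], coords[i]) - sp.diff(Gamma[l][i][k], coords[j])
    expr += sum(Gamma[l][i][m] * Gamma[m][j][k] - Gamma[l][j][m] * Gamma[m][i][k]
                for m in range(n))
    return sp.simplify(expr)

g = sp.Matrix([[1, 0], [0, sp.sin(theta)**2]])              # the round metric

# Sectional curvature of the only 2-plane at each point, span(e_theta, e_phi):
R_low = sum(g[0, l] * riem(l, 1, 0, 1) for l in range(n))   # lower the first index
K = sp.simplify(R_low / (g[0, 0] * g[1, 1] - g[0, 1]**2))
print(K)  # 1: constant curvature, matching the holonomy computation
```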
The Riemann tensor can be progressively simplified by contraction (summing over indices):
The Ricci tensor is the trace of the Riemann tensor: \text{Ric}_{ij} = \text{Riem}^k_{\ ikj}. It averages sectional curvatures over all 2-planes containing a given direction.
The scalar curvature is the trace of the Ricci tensor: S = g^{ij}\,\text{Ric}_{ij}. It is a single number at each point, summarizing the total curvature.
The hierarchy is:

\text{Riem}^l_{\ kij} \ (4 \text{ indices}) \ \xrightarrow{\ \text{trace}\ } \ \text{Ric}_{ij} \ (2 \text{ indices}) \ \xrightarrow{\ \text{trace}\ } \ S \ (\text{scalar})
Each contraction loses information but gains interpretability. For a 2-manifold, all three are equivalent (determined by K). In general relativity, the Einstein field equations relate the Ricci tensor to the energy-momentum content of spacetime. In machine learning, the scalar curvature of a latent space measures how much the learned representation distorts local volumes (Chapter 9).
All of these curvature quantities are built from the metric g and its derivatives. In Chapter 3, we introduced the metric on S^2 to compute Christoffel symbols. In the next chapter, we study the Riemannian metric in full generality: how it defines lengths, areas, and the unique Levi-Civita connection.
In Chapter 3, we introduced the Riemannian metric g on S^2 to compute Christoffel symbols. Here we explore the full power of the metric: measuring distances, computing areas, and the remarkable fact that the metric alone determines the connection.
In Chapter 4, we defined the length of a curve \gamma: [a,b] \to M using the metric. The geodesic distance between two points is the infimum of lengths over all curves connecting them:
The geodesic distance between p, q \in M is d(p,q) = \inf_\gamma L(\gamma), where the infimum is taken over all piecewise smooth curves from p to q.
This distance function turns (M, d) into a metric space in the topological sense. On S^2, the geodesic distance between two points is the angle between them (for the unit sphere): d(p,q) = \arccos(\langle p, q \rangle). The geodesics (great circles) are precisely the curves that achieve this minimum.
The metric also lets us measure areas and volumes. In coordinates (x^1, \ldots, x^n), the volume element is:

dV = \sqrt{\det(g_{ij})} \; dx^1 \cdots dx^n
The factor \sqrt{\det(g_{ij})} accounts for the distortion introduced by the coordinate system. On S^2:

dA = \sqrt{\det(g_{ij})} \; d\theta \, d\varphi = \sin\theta \; d\theta \, d\varphi
This is the familiar area element on the sphere. It shrinks near the poles (\theta \to 0, \pi) where the coordinate grid compresses. Integrating over the full sphere gives \int_0^\pi \int_0^{2\pi} \sin\theta \, d\varphi \, d\theta = 4\pi.
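A two-line sympy check (a sketch) that the volume element integrates to the familiar total area:

```python
import sympy as sp

theta, phi = sp.symbols('theta phi', real=True)
g = sp.Matrix([[1, 0], [0, sp.sin(theta)**2]])
print(sp.sqrt(g.det()))   # |sin(theta)|, i.e. sin(theta) on 0 < theta < pi

area = sp.integrate(sp.sin(theta), (phi, 0, 2 * sp.pi), (theta, 0, sp.pi))
print(area)               # 4*pi
```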
An isometry between Riemannian manifolds (M, g) and (N, h) is a diffeomorphism \phi: M \to N that preserves the metric: \phi^* h = g. In other words, h_{\phi(p)}(d\phi(v), d\phi(w)) = g_p(v, w) for all tangent vectors v, w \in T_pM.
Isometries preserve everything that the metric defines: distances, angles, areas, curvature, geodesics. They are the "rigid motions" of Riemannian geometry.
A cylinder and a flat plane are locally isometric (you can unroll a cylinder without stretching). Both have K = 0. A sphere and a plane are not locally isometric (no distance-preserving map exists), which is why every flat world map necessarily distorts distances. This is a consequence of the Theorema Egregium (Chapter 5): isometries preserve Gaussian curvature, and K = 1 \neq 0.
In Chapter 3, we defined the covariant derivative \nabla on S^2 by projecting ordinary derivatives onto the tangent plane. On a general Riemannian manifold, there are many possible connections. A remarkable theorem says that the metric singles out a unique one.
On any Riemannian manifold (M, g), there exists a unique connection \nabla (the Levi-Civita connection) satisfying:
1. metric compatibility: X\, g(Y, Z) = g(\nabla_X Y, Z) + g(Y, \nabla_X Z);
2. torsion-freeness: \nabla_X Y - \nabla_Y X = [X, Y].
Metric compatibility means parallel transport preserves inner products: lengths and angles are unchanged. Torsion-free means the connection is symmetric (\Gamma^k_{ij} = \Gamma^k_{ji}). Together, these two conditions determine the Christoffel symbols entirely from the metric:

\Gamma^k_{ij} = \frac{1}{2} g^{kl} \left( \partial_i g_{jl} + \partial_j g_{il} - \partial_l g_{ij} \right)
This formula closes the circle: in Chapter 3, we computed the Christoffel symbols of S^2 by projecting derivatives in \mathbb{R}^3. Here we see that the same symbols emerge purely from the metric g_{\theta\theta} = 1, g_{\varphi\varphi} = \sin^2\theta, without any reference to an ambient space. The metric is all you need.
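The formula translates directly into a short sympy routine (our helper christoffel is a sketch of this formula, not a library function); applied to the round metric it reproduces the Chapter 3 symbols:

```python
import sympy as sp

def christoffel(g, coords):
    """Gamma^k_{ij} = (1/2) g^{kl} (d_i g_{jl} + d_j g_{il} - d_l g_{ij})."""
    n, ginv = len(coords), g.inv()
    return [[[sp.simplify(sum(ginv[k, l] * (sp.diff(g[j, l], coords[i])
                                            + sp.diff(g[i, l], coords[j])
                                            - sp.diff(g[i, j], coords[l]))
                              for l in range(n)) / 2)
              for j in range(n)] for i in range(n)] for k in range(n)]

theta, phi = sp.symbols('theta phi')
G = christoffel(sp.Matrix([[1, 0], [0, sp.sin(theta)**2]]), [theta, phi])
print(G[0][1][1])   # -sin(theta)*cos(theta): Gamma^theta_{phi phi}
print(G[1][0][1])   # cos(theta)/sin(theta) = cot(theta): Gamma^phi_{theta phi}
```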
With the full toolkit of metric geometry in hand (distances, areas, curvature, the Levi-Civita connection), we can now move toward applications. In the next chapter, we study how to project points onto manifolds and how the exponential map links tangent spaces to the manifold itself.
Given a point in ambient space, finding the closest point on a manifold is fundamental for optimization, statistics, and ML. The exponential map reverses the question: starting from the tangent space, it shoots geodesics onto the manifold, giving the definitive answer to our central question.
In Chapters 1 through 6, we lived entirely "on" the manifold. Now we step back and ask: given a manifold M embedded in ambient space \mathbb{R}^n, what happens to nearby points that are not on M? This requires understanding directions perpendicular to the surface.
For M embedded in \mathbb{R}^n, the normal space at p is:

N_pM = \{ v \in \mathbb{R}^n : \langle v, w \rangle = 0 \ \text{for all } w \in T_pM \}
On S^2 \subset \mathbb{R}^3, the tangent plane at p is the plane perpendicular to p (viewed as a radius vector). Therefore N_pS^2 = \text{span}(p): the normal direction at any point on the sphere is simply the radial direction.
The normal bundle collects all normal spaces into a single smooth manifold:

NM = \bigsqcup_{p \in M} N_pM = \{ (p, v) : p \in M,\ v \in N_pM \}
If M has dimension k inside \mathbb{R}^n, then NM is itself a smooth manifold of dimension n (the k dimensions of M plus the n - k normal directions at each point).
Given a point q near M but not on it, the nearest-point projection sends q to the closest point on M.
The projection map \pi is defined by:

\pi(q) = \operatorname*{arg\,min}_{p \in M} \|q - p\|
This is well-defined only if q is "close enough" to M. Points too far away may have multiple closest points. Consider the center of a sphere: every surface point is equidistant, so \pi(0) is undefined. The region where projection is well-behaved is called a tubular neighborhood.
A tubular neighborhood of M in \mathbb{R}^n is an open set U \supset M such that \pi: U \to M is a smooth submersion. Concretely:

U_\varepsilon = \{ q \in \mathbb{R}^n : \operatorname{dist}(q, M) < \varepsilon \}, \quad \text{where every } q \in U_\varepsilon \text{ has a unique closest point on } M
Every compact embedded submanifold M \subset \mathbb{R}^n admits a tubular neighborhood.
On S^2, the tubular neighborhood is the open shell \{ q \in \mathbb{R}^3 : 1 - \varepsilon < \|q\| < 1 + \varepsilon \} for any \varepsilon \in (0, 1). The projection takes a particularly simple form:

\pi(q) = \frac{q}{\|q\|}
The distance from q to its projection is \|q - \pi(q)\| = |\,\|q\| - 1\,|: simply how far q is from the unit sphere.
Projection maps from ambient space to the manifold. The exponential map goes the other way: from the tangent space to the manifold. It is arguably the most important construction in Riemannian geometry.
The exponential map at p sends a tangent vector to the point reached by following the geodesic it defines:

\exp_p : T_pM \to M, \qquad \exp_p(v) = \gamma_v(1)
where \gamma_v is the unique geodesic with \gamma_v(0) = p and \dot\gamma_v(0) = v. In words: "walk from p in direction v for unit time along the geodesic with initial speed \|v\|_g." The distance traveled is \|v\|_g.
This is the precise answer to the blog's central question. The exponential map converts "straight lines in the tangent plane" into "straight lines on the curved surface" (geodesics).
On S^2, geodesics are great circles. For p \in S^2 and v \in T_pS^2 with \|v\| = r, the closed-form expression is:

\exp_p(v) = \cos(r)\, p + \sin(r)\, \frac{v}{r}
This follows directly from the parametrization of great circles: t \mapsto \cos(t)\, p + \sin(t)\, \hat{v} evaluated at t = \|v\|.
Key properties of exp:
- \exp_p(0) = p, and the geodesic through p with initial velocity v is t \mapsto \exp_p(tv);
- the differential of \exp_p at 0 is the identity on T_pM;
- consequently, \exp_p is a local diffeomorphism from a neighborhood of 0 \in T_pM onto a neighborhood of p in M;
- for small v, the geodesic distance satisfies d(p, \exp_p(v)) = \|v\|_g.
Since \exp_p is a local diffeomorphism, it has a local inverse.
The logarithmic map is the local inverse of the exponential map:

\log_p = \exp_p^{-1} : V \subset M \to T_pM, \qquad \exp_p(\log_p(q)) = q
It returns the initial velocity of the geodesic from p to q. On S^2, with d = \arccos\langle p, q \rangle:

\log_p(q) = d \; \frac{q - \langle p, q \rangle\, p}{\| q - \langle p, q \rangle\, p \|}
The numerator q - \langle p, q \rangle\, p is the component of q tangent to S^2 at p, and d is the geodesic distance (from Chapter 6).
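Both maps fit in a few lines of numpy. A sketch for the unit sphere (the clip call guards arccos against rounding):

```python
import numpy as np

def exp_map(p, v):
    """exp_p(v) on the unit sphere: cos(r) p + sin(r) v/r with r = |v|."""
    r = np.linalg.norm(v)
    return p if r < 1e-12 else np.cos(r) * p + np.sin(r) * v / r

def log_map(p, q):
    """log_p(q): initial velocity of the geodesic from p to q."""
    w = q - np.dot(p, q) * p                          # tangent component of q at p
    d = np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))   # geodesic distance
    return d * w / np.linalg.norm(w)

p = np.array([0.0, 0.0, 1.0])          # north pole
v = np.array([0.4, -0.3, 0.0])         # a tangent vector at p
q = exp_map(p, v)
print(np.linalg.norm(q))               # 1.0: exp lands on the sphere
print(log_map(p, q))                   # [0.4, -0.3, 0.0]: log inverts exp
```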
How far can we push \exp_p before it stops being injective? This is captured by the injectivity radius.
The injectivity radius at p is:

\operatorname{inj}(p) = \sup \{ r > 0 : \exp_p \text{ is a diffeomorphism on the ball } B_r(0) \subset T_pM \}
On S^2, \mathrm{inj}(p) = \pi for every p. The exponential map is a diffeomorphism on the open disk of radius \pi in T_pM. It fails at the antipodal point -p (distance \pi), where all geodesics from p reconverge. This point is called the cut point.
Since \exp_p is a diffeomorphism near p, we can use it as a chart. This gives a coordinate system centered at p where geodesics through p are straight lines and the metric is Euclidean to first order.
Choose an orthonormal basis \{e_1, \ldots, e_n\} of T_pM. The normal coordinate map is:

(x^1, \ldots, x^n) \mapsto \exp_p\big( x^1 e_1 + \cdots + x^n e_n \big)
The key property (a consequence of the Gauss lemma) is that in normal coordinates, the metric looks Euclidean at the center and the Christoffel symbols vanish:

g_{ij}(p) = \delta_{ij}, \qquad \Gamma^k_{ij}(p) = 0
The curvature appears only in the second-order correction:

g_{ij}(x) = \delta_{ij} - \frac{1}{3} R_{ikjl} \, x^k x^l + O(\|x\|^3)
This is conceptually powerful: at any single point, you can always pretend you are in flat space. Curvature is what makes this approximation break down as you move away from p. The visualization below shows this in action: radial geodesics from p form the "coordinate lines," and on S^2 (positive curvature) they converge at the antipodal point, revealing how the metric deviates from flatness.
The exponential and logarithmic maps give us a principled way to move between the tangent space (a linear space where we can do standard linear algebra) and the manifold itself. This "linearize, compute, map back" paradigm is exactly what manifold learning algorithms exploit. In the next chapter, we see how algorithms like ISOMAP, LLE, and UMAP use these geometric ideas to discover low-dimensional manifold structure hidden in high-dimensional data.
In Chapters 1 through 7, the manifold was given: we knew S^2, we knew its metric, and we computed geodesics explicitly. In practice, we rarely have such luxury. Data arrives as a cloud of points in high-dimensional space, and the manifold is hidden. Manifold learning algorithms recover that hidden structure using the same geometric concepts we have built: geodesic distances, tangent spaces, and the Laplace-Beltrami operator.
We write \mathcal{X} = \mathbb{R}^D for the ambient data space of dimension D, and \mathcal{Z} = \mathbb{R}^d for the low-dimensional embedding space with d \ll D. A dataset is a finite sample \{x_1, \dots, x_N\} \subset \mathcal{X}.
A dataset \{x_1, \dots, x_N\} \subset \mathbb{R}^D satisfies the manifold hypothesis if there exists a smooth manifold M of dimension d \ll D and a smooth embedding \iota: M \hookrightarrow \mathbb{R}^D such that the data concentrates near \iota(M).
Consider a sheet of paper rolled into a spiral in \mathbb{R}^3: the Swiss roll. Points on this surface live in 3D, but the sheet itself is 2-dimensional. Two points close in Euclidean distance may be far apart along the surface, because the straight line between them cuts through the roll. The true distance is the geodesic distance d(p,q) from Chapter 6, measured along the manifold.
The intrinsic dimensionality of a dataset is the dimension d of the underlying manifold M. A point cloud in \mathbb{R}^D with intrinsic dimension d has d local degrees of freedom.
The goal of manifold learning is to recover M (or at least its intrinsic geometry) from the samples \{x_i\}. Each algorithm below attacks this problem by approximating a different geometric quantity from the preceding chapters:
- ISOMAP approximates geodesic distances (Chapter 6);
- LLE approximates tangent spaces (Chapter 2);
- Laplacian Eigenmaps approximate the Laplace-Beltrami operator (Chapters 3 and 6);
- t-SNE and UMAP encode neighborhood structure as probability distributions, with UMAP recovering a local approximation of the metric tensor (Chapter 3).
ISOMAP (Tenenbaum, de Silva, Langford, 2000) directly targets the geodesic distance d(p,q) from Chapter 6. If we knew the manifold, we would compute d(p,q) = \inf_\gamma L(\gamma) as the infimum over all paths. Without the manifold, we approximate this using a graph.
The graph shortest-path distance d_G(x_i, x_j) is the length of the shortest path between x_i and x_j in the k-nearest-neighbor graph, where edge weights are Euclidean distances \|x_i - x_j\|.
The algorithm proceeds in three steps: (1) build a k-NN graph on the data, (2) compute all-pairs shortest-path distances d_G(x_i, x_j) via Dijkstra's algorithm, (3) apply classical Multidimensional Scaling to embed points in \mathcal{Z} = \mathbb{R}^d.
The ISOMAP embedding finds coordinates z_1, \dots, z_N \in \mathbb{R}^d minimizing the stress:

\sum_{i < j} \big( d_G(x_i, x_j) - \|z_i - z_j\| \big)^2
The key theoretical guarantee is that graph distances converge to geodesic distances under sufficient sampling.
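The three steps map one-to-one onto standard library calls. A sketch using scikit-learn and scipy, assuming the k-NN graph is connected (note that sklearn's MDS uses the SMACOF stress algorithm rather than the classical eigendecomposition, which is close enough for illustration):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import MDS
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

X, _ = make_swiss_roll(n_samples=800, random_state=0)

# (1) k-NN graph with Euclidean edge weights
G = kneighbors_graph(X, n_neighbors=10, mode='distance')

# (2) graph shortest paths approximate geodesic distances (Dijkstra)
D = shortest_path(G, method='D', directed=False)

# (3) metric MDS on the (approximate) geodesic distance matrix
Z = MDS(n_components=2, dissimilarity='precomputed', random_state=0).fit_transform(D)
print(Z.shape)   # (800, 2): the Swiss roll, unrolled
```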
ISOMAP recovers global geometry (the full distance matrix) from a graph. But global information is expensive and fragile. Can we work with local geometry instead?
In Chapter 2, we saw that the tangent space T_pM provides a local linear approximation to the manifold around p. Within a small neighborhood, the manifold looks flat, and points can be expressed as affine combinations of their neighbors. Locally Linear Embedding (Roweis and Saul, 2000) turns this geometric insight into an algorithm: instead of approximating distances globally, it captures the tangent-plane structure locally.
For each data point x_i, we write \mathcal{N}(i) for its set of k nearest neighbors in \mathcal{X}.
The reconstruction weights for x_i are the coefficients w_{ij} minimizing:

\Big\| x_i - \sum_{j \in \mathcal{N}(i)} w_{ij} \, x_j \Big\|^2 \quad \text{subject to} \quad \sum_j w_{ij} = 1
The constraint \sum_j w_{ij} = 1 forces affine (not just linear) combinations, encoding the local tangent-plane geometry. Note that w_{ij} = 0 when x_j is not a neighbor of x_i.
LLE then finds the low-dimensional coordinates z_1, \dots, z_N \in \mathbb{R}^d that best preserve these weights:

\min_{z_1, \dots, z_N} \; \sum_{i=1}^N \Big\| z_i - \sum_j w_{ij} \, z_j \Big\|^2
The weights w_{ij} capture how x_i sits in the local tangent plane spanned by its neighbors. This is the discrete analogue of the "linearize, compute, map back" paradigm from Chapter 7: the reconstruction weights encode local geometry via \exp_p and \log_p, and the embedding step transports this structure to \mathcal{Z}.
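The weight computation has a closed form via each point's local Gram matrix. A numpy sketch (with the standard regularization used when the neighborhood is degenerate):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lle_weights(X, k=10, reg=1e-3):
    """Affine reconstruction weights: x_i ~ sum_j w_ij x_j over its k neighbors."""
    n = len(X)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = idx[i, 1:]                     # drop the point itself
        d = X[nbrs] - X[i]                    # neighbors, centered at x_i
        C = d @ d.T                           # local Gram matrix
        C += reg * np.trace(C) * np.eye(k)    # regularize (C may be singular)
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbrs] = w / w.sum()              # enforce the affine constraint
    return W

X = np.random.default_rng(0).normal(size=(200, 3))
W = lle_weights(X)
print(np.allclose(W.sum(axis=1), 1.0))        # True: sum_j w_ij = 1 for every i
```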
Both ISOMAP and LLE discretize a specific geometric object (distances or tangent planes). A third approach asks: is there a single intrinsic operator that encodes the full geometry of (M, g), and can we approximate it from data?
The metric tensor g_{ij} (Chapter 3) and the volume element \sqrt{\det g} (Chapter 6) combine into a single differential operator, the Laplace-Beltrami operator \Delta_M, which encodes the full intrinsic geometry of the manifold. Laplacian Eigenmaps (Belkin and Niyogi, 2003) approximate this operator from a point cloud using a graph construction.
Given a weighted adjacency matrix W with heat kernel weights (here w_{ij} denotes adjacency weights, not the LLE reconstruction weights above):

w_{ij} = \exp\big( -\|x_i - x_j\|^2 / \sigma^2 \big) \ \text{if } x_i \sim x_j, \qquad w_{ij} = 0 \ \text{otherwise}
where \sigma > 0 is a global bandwidth parameter. The graph Laplacian is L_G = D_W - W, where D_W is the diagonal degree matrix with (D_W)_{ii} = \sum_j w_{ij}.
This discrete operator approximates a fundamental object from Riemannian geometry:
On a Riemannian manifold (M, g), the Laplace-Beltrami operator generalizes the Euclidean Laplacian:

\Delta_M f = \frac{1}{\sqrt{\det g}} \, \partial_i \big( \sqrt{\det g} \; g^{ij} \, \partial_j f \big)
This operator depends only on the metric g_{ij} (Chapter 3) and the volume element \sqrt{\det g} (Chapter 6). It is an intrinsic invariant: isometric manifolds have the same Laplace-Beltrami spectrum. The Laplacian Eigenmap embedding uses the first d nontrivial eigenvectors of L_G as coordinates in \mathcal{Z}.
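A compact numpy/scipy sketch of the whole pipeline (this is the unnormalized-Laplacian variant; Belkin and Niyogi also use a generalized eigenproblem with the degree matrix):

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmaps(X, d=2, k=10, sigma=1.0):
    """Embed X using the first d nontrivial eigenvectors of the graph Laplacian."""
    A = kneighbors_graph(X, n_neighbors=k, mode='distance').toarray()
    A = np.maximum(A, A.T)                                # symmetrize the k-NN graph
    W = np.where(A > 0, np.exp(-A**2 / sigma**2), 0.0)    # heat-kernel weights
    L = np.diag(W.sum(axis=1)) - W                        # L_G = D_W - W
    _, vecs = eigh(L)                                     # eigenvalues in ascending order
    return vecs[:, 1:d + 1]                               # skip the constant eigenvector

Z = laplacian_eigenmaps(np.random.default_rng(0).normal(size=(300, 3)))
print(Z.shape)   # (300, 2)
```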
ISOMAP, LLE, and Laplacian Eigenmaps each approximate one geometric object (distances, tangent planes, or the Laplacian). The most recent methods take a different path: they encode the entire neighborhood structure as a probability distribution and optimize an embedding to preserve it. In doing so, UMAP recovers an approximation of the metric tensor g_{ij} itself.
On a Riemannian manifold, the metric g_{ij} determines how "close" two nearby points are. When the manifold is unknown, we only have Euclidean distances in ambient space, which can be misleading (as the Swiss roll illustrates). Both t-SNE and UMAP address this by converting neighborhood structure into probability distributions that capture intrinsic proximity, then optimizing an embedding to match.
For each pair (x_i, x_j), t-SNE (van der Maaten and Hinton, 2008) defines a conditional similarity in the high-dimensional space using a Gaussian kernel with per-point bandwidth \sigma_i:

p_{j|i} = \frac{ \exp\big( -\|x_i - x_j\|^2 / 2\sigma_i^2 \big) }{ \sum_{k \neq i} \exp\big( -\|x_i - x_k\|^2 / 2\sigma_i^2 \big) }
The bandwidth \sigma_i is chosen so that the entropy of the conditional distribution matches a target perplexity. The similarities are symmetrized as p_{ij} = (p_{j|i} + p_{i|j}) / (2N).
In the low-dimensional space \mathcal{Z}, similarities use a Student-t kernel with one degree of freedom (the Cauchy distribution):

q_{ij} = \frac{ \big( 1 + \|z_i - z_j\|^2 \big)^{-1} }{ \sum_{k \neq l} \big( 1 + \|z_k - z_l\|^2 \big)^{-1} }
The heavy tails of the Student-t distribution solve the crowding problem: in low dimensions, there is less room to accommodate neighbors, and the heavy tails give distant points more space.
The embedding is found by minimizing the KL divergence between the two distributions:

\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
UMAP (McInnes, Healy, Melville, 2018) has a more principled geometric foundation. It constructs a fuzzy simplicial set representing the topological structure of the data, then optimizes an embedding to match.
For each point x_i, let \rho_i = \min_{j \in \mathcal{N}(i)} \|x_i - x_j\| be the distance to its nearest neighbor.
The high-dimensional similarities are defined by normalizing distances relative to \rho_i:

p_{j|i} = \exp\left( - \frac{ \|x_i - x_j\| - \rho_i }{ \sigma_i } \right) \quad \text{for } j \in \mathcal{N}(i)
This construction has a Riemannian interpretation: at each point x_i, UMAP builds a local metric by rescaling distances so that the k-th neighbor is at unit distance. This is equivalent to constructing a local approximation of the metric tensor g_{ij} from Chapter 3.
In the low-dimensional space, UMAP uses a smooth approximation to the Student-t family:

q_{ij} = \big( 1 + a \, \|z_i - z_j\|^{2b} \big)^{-1}
where a, b are fitted to match a target distribution. The embedding minimizes the binary cross-entropy:

\sum_{i \neq j} \left[ p_{ij} \log \frac{p_{ij}}{q_{ij}} + (1 - p_{ij}) \log \frac{1 - p_{ij}}{1 - q_{ij}} \right]
Compared to t-SNE's KL divergence, the cross-entropy includes a repulsive term (1-p_{ij})\log\frac{1-p_{ij}}{1-q_{ij}} that explicitly pushes apart points that are not neighbors. This makes the optimization more stable and the embedding more faithful to global structure.
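In practice both methods are one call each. The sketch below assumes scikit-learn and the third-party package umap-learn are installed:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import TSNE
import umap   # pip install umap-learn

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

Z_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
Z_umap = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1).fit_transform(X)
print(Z_tsne.shape, Z_umap.shape)   # (1000, 2) (1000, 2)
```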
All four families of algorithms (ISOMAP, LLE, spectral methods, UMAP/t-SNE) share a fundamental limitation: they produce a finite set of coordinates z_1, \dots, z_N \in \mathcal{Z}, but no continuous map between spaces. There is no function that takes a new point in \mathcal{Z} and produces a point in \mathcal{X}. Without such a map, we cannot generate new data, compute a metric tensor, or find geodesics in the embedding space. What if a neural network could learn this map explicitly? A decoder f: \mathcal{Z} \to \mathcal{X} provides exactly what unsupervised manifold learning cannot: a smooth parametric mapping whose Jacobian induces a Riemannian metric g_{ij}(z) = J(z)^T J(z) on the latent space. In the next chapter, we study this pullback metric and show that geodesics in latent space produce geometrically correct interpolations.
Chapter 8 showed that manifold learning algorithms produce embeddings, but not continuous maps. A neural network decoder provides exactly this missing piece, and with it comes a full Riemannian geometry on latent space.
In Chapter 3, the Riemannian metric g_{ij} on the sphere S^2 was inherited from its embedding in \mathbb{R}^3. The metric measured how infinitesimal displacements in coordinates (\theta, \varphi) translated into distances in the ambient space. Here, the situation is reversed: a decoder f: \mathcal{Z} \to \mathcal{X} maps from latent coordinates to the ambient data space, and its Jacobian induces a metric on \mathcal{Z}.
A decoder is a smooth map f: \mathcal{Z} \to \mathcal{X} where \mathcal{Z} = \mathbb{R}^d is the latent space and \mathcal{X} = \mathbb{R}^D the ambient data space (with d \leq D). Its Jacobian at z is the D \times d matrix:

J(z)_{ai} = \frac{\partial f^a}{\partial z^i}(z), \qquad a = 1, \dots, D, \quad i = 1, \dots, d
When we move by \mathrm{d}z in latent space, the decoded point moves by \mathrm{d}x = J(z)\,\mathrm{d}z in data space. The squared length of this displacement is:

\|\mathrm{d}x\|^2 = \mathrm{d}z^\top J(z)^\top J(z) \, \mathrm{d}z
The pullback metric on \mathcal{Z} induced by the decoder f is:

g(z) = J(z)^\top J(z), \qquad g_{ij}(z) = \sum_{a=1}^D \frac{\partial f^a}{\partial z^i} \frac{\partial f^a}{\partial z^j}
This is the same Riemannian metric g_{ij} from Chapter 3, now computed from the decoder rather than given by a formula.
To build intuition, consider a concrete decoder. Let \mathcal{Z} = \mathbb{R}^2 and \mathcal{X} = \mathbb{R}^3, with:

f(z_1, z_2) = \big( z_1,\ z_2,\ A \exp(-\|z\|^2 / 2\sigma^2) \big)
This maps a flat 2D plane to a surface with a Gaussian bump of amplitude A. Writing h(z) = A \exp(-\|z\|^2 / 2\sigma^2) for the height function, the Jacobian and metric are:

J(z) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ \partial_1 h & \partial_2 h \end{pmatrix}, \qquad g(z) = I_2 + \nabla h \, \nabla h^\top, \quad \text{i.e. } g_{ij} = \delta_{ij} + \partial_i h \, \partial_j h
Far from the origin, \partial_i h \approx 0 and g_{ij} \approx \delta_{ij}: the metric is flat. Near the bump, the off-diagonal terms grow and the metric becomes curved.
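The closed form is easy to cross-check against a numerical Jacobian. A numpy sketch (for a real neural decoder one would use automatic differentiation instead of finite differences):

```python
import numpy as np

A, sigma = 1.0, 0.5

def decoder(z):
    """The Gaussian-bump decoder f(z) = (z1, z2, h(z))."""
    return np.array([z[0], z[1], A * np.exp(-np.dot(z, z) / (2 * sigma**2))])

def jacobian(z, eps=1e-6):
    """D x d Jacobian of the decoder by central differences."""
    cols = []
    for i in range(len(z)):
        dz = np.zeros_like(z); dz[i] = eps
        cols.append((decoder(z + dz) - decoder(z - dz)) / (2 * eps))
    return np.column_stack(cols)

def pullback_metric(z):
    J = jacobian(z)
    return J.T @ J                               # g(z) = J^T J

z = np.array([0.3, -0.2])
h = A * np.exp(-np.dot(z, z) / (2 * sigma**2))
grad_h = -z / sigma**2 * h                       # closed-form gradient of the height
print(pullback_metric(z))                        # numerical g(z)
print(np.eye(2) + np.outer(grad_h, grad_h))      # closed form I + grad(h) grad(h)^T
```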
In Chapter 5, Gaussian curvature K measured how a surface deviates from flatness. In Chapter 6, the volume element \sqrt{\det g} measured how areas distort under the metric. Both quantities are fully determined by g_{ij}, and we can compute them explicitly for our decoder.
For the Gaussian bump surface f(z) = (z_1, z_2, h(z)), the metric determinant and volume element are:

\det g = 1 + \|\nabla h\|^2, \qquad \sqrt{\det g} \; dz^1 dz^2 = \sqrt{1 + \|\nabla h\|^2} \; dz^1 dz^2
Where \sqrt{\det g} > 1, the decoder stretches latent-space areas. For our bump, this occurs near the origin where the surface is steep.
The Gaussian curvature follows from the classical Monge-patch formula (Chapter 5):

K = \frac{ \partial^2_{11} h \, \partial^2_{22} h - (\partial^2_{12} h)^2 }{ \big( 1 + \|\nabla h\|^2 \big)^2 }
At the top of the bump, K > 0 (positive curvature, like a sphere). On the flanks, K < 0 (negative curvature, like a saddle). Far from the origin, K \to 0 (flat).
To find geodesics (next section), we need the Christoffel symbols \Gamma^k_{ij} from Chapter 3. For a graph surface, the Levi-Civita formula from Chapter 6 simplifies beautifully:
For a graph decoder f(z) = (z, h(z)), the Christoffel symbols of the pullback metric reduce to:

\Gamma^k_{ij} = \frac{ \partial_k h \; \partial^2_{ij} h }{ 1 + \|\nabla h\|^2 }
where \partial_k h are first derivatives and \partial_{ij}^2 h are second derivatives of the height function.
This formula is remarkably compact: the Christoffel symbols vanish wherever the surface is flat (\nabla h = 0), and they are largest where both the slope and curvature of h are large.
The left panel shows the decoded surface colored by Gaussian curvature. The right panel shows the latent space \mathcal{Z}, where the heatmap encodes \sqrt{\det g} (volume distortion) and the ellipses show the local metric tensor: circles indicate flat regions, elongated ellipses indicate strong stretching. Use the slider to vary the bump amplitude and watch the geometry change.
In Chapter 4, geodesics on S^2 were great circles, satisfying the geodesic equation \ddot{\gamma}^k + \Gamma^k_{ij}\,\dot{\gamma}^i\dot{\gamma}^j = 0. The same equation applies here, with the Christoffel symbols computed from the pullback metric. In latent space \mathcal{Z} = \mathbb{R}^2, the system becomes:

\ddot{z}^k = - \frac{ \partial_k h \, \big( \dot{z}^i \, \partial^2_{ij} h \, \dot{z}^j \big) }{ 1 + \|\nabla h\|^2 }, \qquad k = 1, 2
This is the payoff of the entire blog. A "straight line" in latent space (linear interpolation z(t) = (1-t)\,z_A + t\,z_B) ignores the geometry: its decoded image may climb over the bump, taking a longer path on the surface. The geodesic curves in latent space to avoid high-metric regions, producing a decoded path that is shorter on the surface.
A latent geodesic is a curve \gamma: [0,1] \to \mathcal{Z} satisfying the geodesic equation with Christoffel symbols \Gamma^k_{ij}(z) computed from the pullback metric g_{ij}(z) = (J^T J)_{ij}. Its image f \circ \gamma traces the shortest path on the decoded surface between f(z_A) and f(z_B).
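A self-contained sketch of this computation (the endpoints and bump parameters are our own choices; scipy's solve_bvp solves the two-point boundary problem, initialized with the straight line):

```python
import numpy as np
from scipy.integrate import solve_bvp

A, sigma = 1.0, 0.5

def h(z):       return A * np.exp(-(z[0]**2 + z[1]**2) / (2 * sigma**2))
def grad_h(z):  return -z / sigma**2 * h(z)
def hess_h(z):  return np.outer(z, z) * h(z) / sigma**4 - np.eye(2) * h(z) / sigma**2

def rhs(t, y):
    """Geodesic equation z''^k = -d_k h (z'^T H z') / (1 + |grad h|^2), vectorized."""
    out = np.zeros_like(y)
    for m in range(y.shape[1]):
        z, dz = y[:2, m], y[2:, m]
        G = grad_h(z)
        out[:2, m] = dz
        out[2:, m] = -G * (dz @ hess_h(z) @ dz) / (1.0 + G @ G)
    return out

zA, zB = np.array([-1.5, 0.15]), np.array([1.5, 0.15])
def bc(ya, yb): return np.concatenate([ya[:2] - zA, yb[:2] - zB])

t = np.linspace(0, 1, 60)
y0 = np.zeros((4, t.size))
y0[0] = np.linspace(zA[0], zB[0], t.size)   # initialize with the straight line
y0[1] = np.linspace(zA[1], zB[1], t.size)
y0[2] = zB[0] - zA[0]                       # constant initial guess for the velocity
sol = solve_bvp(rhs, bc, t, y0)

def surface_length(zs):
    """Decoded length: ds^2 = |dz|^2 + (grad h . dz)^2 for the graph metric."""
    L = 0.0
    for a_, b_ in zip(zs[:-1], zs[1:]):
        dz, G = b_ - a_, grad_h((a_ + b_) / 2)
        L += np.sqrt(dz @ dz + (G @ dz)**2)
    return L

line = np.linspace(zA, zB, 300)
geo = sol.sol(np.linspace(0, 1, 300))[:2].T
print(surface_length(line), surface_length(geo))  # the geodesic decodes to a shorter path
```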
Drag the endpoints A and B in the latent space panel. When the straight line (orange, dashed) passes through the bump region, the geodesic (cyan, solid) curves around it. The surface lengths confirm that the geodesic finds a shorter path on the decoded surface, even though it is longer in Euclidean latent coordinates.
In Chapter 6, an isometry was a map that preserves the metric: \phi^*g = g. An ideal representation would make the pullback metric as close to flat as possible, so that Euclidean operations in \mathcal{Z} (linear interpolation, nearest neighbors, averaging) would be geometrically correct.
A disentangled representation corresponds to coordinates where the metric tensor is diagonal: g_{12}(z) = 0 everywhere. In such coordinates, the latent dimensions are geometrically independent, and moving along one axis does not affect distances measured along another. The off-diagonal terms g_{ij} with i \neq j quantify the degree of geometric entanglement.
A perfectly flat representation (g_{ij} = \delta_{ij} everywhere) would make the decoder an isometry. But the Theorema Egregium from Chapter 5 tells us this is impossible whenever the data manifold has nonzero Gaussian curvature: you cannot flatten a curved surface without distortion. This is a fundamental obstruction, not a limitation of the architecture.
The Fisher information metric provides another instance of the pullback construction. For a parametric family of distributions p_\omega(x), the Fisher metric F_{ij}(\omega) = \mathbb{E}[\partial_i \log p_\omega \cdot \partial_j \log p_\omega] is the pullback of the L^2 metric on the space of square-root densities. In a variational autoencoder (VAE), the KL regularization term penalizes deviations of the posterior from a standard Gaussian, which has the geometric effect of encouraging a flat latent metric. This provides a probabilistic motivation for geometric regularity.
Whether we seek disentanglement, isometry, or information-theoretic regularity, the pullback metric g_{ij}(z) = J(z)^T J(z) is the unifying language. It measures exactly what a decoder does to geometry, and any notion of "representation quality" can be phrased as a constraint on this tensor.
We opened this blog with a question: what is a straight line on a curved surface?
The answer required nine layers of mathematical structure: a manifold to define "surface" (Chapter 1), tangent spaces to define "direction" (Chapter 2), a connection and metric to define "straight" (Chapter 3), the geodesic equation to define "line" (Chapter 4), curvature to measure how far from flat (Chapter 5), metric geometry to measure distance and volume (Chapter 6), the exponential map to go from direction to destination (Chapter 7), manifold learning to discover the surface from data (Chapter 8), and the pullback metric to compute all of the above when the surface is learned by a neural network (Chapter 9).
A straight line on a curved surface is a geodesic: a curve \gamma satisfying \nabla_{\dot{\gamma}} \dot{\gamma} = 0. It parallel-transports its own velocity, has zero intrinsic acceleration, and is a critical point of the energy functional E[\gamma]. In a latent space equipped with the pullback metric g_{ij}(z) = J(z)^T J(z), the geodesic is the curve whose decoded image traces the shortest path on the learned manifold. Linear interpolation in latent space ignores the geometry; the geodesic respects it.
Every curved space has its own notion of straightness. Learning that notion from data is the meeting point of differential geometry and deep learning.