Week 1: Basics of Optimization
Taylor Series, Gradient, and Hessian
Let us consider a local extremum (minimum or maximum) of a function \(f: \mathbb{R}^2 \to \mathbb{R}\), \(\bm{w} = (w_1, w_2) \mapsto f(\bm{w}) = f(w_1, w_2)\). We perform a power series (Taylor) expansion around \(\bm{w} = \bm{w}^{(0)}\) and just keep the first three terms to obtain
\[ f(\bm{w}) \approx f\left(\bm{w}^{(0)}\right) + \nabla_w f\left(\bm{w}^{(0)}\right)^\top \delta \bm{w} + \frac{1}{2} \delta \bm{w}^\top \bm{H}_f\left(\bm{w}^{(0)}\right) \delta \bm{w}, \tag{1}\]
where the gradient of \(f\) is defined by \(\nabla_w f(\bm{w}) = \bmat{\pd{f}{w_1}(\bm{w}) & \pd{f}{w_2}(\bm{w})}^\top\), \(\delta \bm{w} = \bm{w} - \bm{w}^{(0)}\), and \(\bm{H}_f(\bm{w})\) is the Hessian matrix of second derivatives:
\[ \bm{H}_f(\bm{w}) = \bmat{ \pdd{f}{w_1}(\bm{w}) & \frac{\partial^2 f}{\partial w_1 \partial w_2}(\bm{w}) \\ \frac{\partial^2 f}{\partial w_2 \partial w_1}(\bm{w}) & \pdd{f}{w_2}(\bm{w}) }. \]
If \(\bm{w}^{(0)}\) is an extremum, then \(\nabla_w f\left(\bm{w}^{(0)}\right) = \bm{0}\), so at such a point the linear term in Equation 1 vanishes. Conversely, if the gradient vanishes at \(\bm{w}^{(0)}\) and the Hessian \(\bm{H}_f\left(\bm{w}^{(0)}\right)\) is positive definite, then \(\bm{w}^{(0)}\) is a local minimum; if the Hessian is negative definite, it is a local maximum. (If the Hessian is indefinite, \(\bm{w}^{(0)}\) is a saddle point; if it is only semidefinite, this second-order test is inconclusive.)
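As a quick numerical sanity check (a minimal sketch, not part of the derivation above; it reuses the cost \(C\) from the interactive example below and approximates derivatives with my own finite-difference helpers), we can verify that the gradient vanishes at a candidate point and classify it by the eigenvalues of the Hessian:

```python
import numpy as np

def C(w):
    # Cost from the interactive example below: C(w) = 0.5*(cos^2(w_1) + sin^2(w_2))
    return 0.5 * (np.cos(w[0])**2 + np.sin(w[1])**2)

def fd_gradient(f, w, h=1e-5):
    # Central-difference approximation of the gradient
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

def fd_hessian(f, w, h=1e-4):
    # Central-difference approximation of the Hessian
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4 * h**2)
    return H

w0 = np.array([np.pi / 2, 0.0])               # candidate extremum
print(fd_gradient(C, w0))                     # ~ [0, 0]: stationary point
print(np.linalg.eigvalsh(fd_hessian(C, w0)))  # ~ [1, 1]: positive definite => local minimum
```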
Gradient Descent/Ascent
For complicated functions, such as those defining neural networks, it is difficult to solve \(\nabla_w f\left(\bm{w}^{(0)}\right) = \bm{0}\) for \(\bm{w}^{(0)}\). Instead, we start at an initial guess \(\bm{w}^{(1)}\) and iteratively search for a point where \(\nabla_w f = \bm{0}\). To do this, we use the first-order approximation
\[ f\left( \bm{w}^{(n+1)} \right) \approx f\left( \bm{w}^{(n)} \right) + \nabla_w f\left(\bm{w}^{(n)}\right)^\top \delta \bm{w}^{(n)}, \]
where \(\delta \bm{w}^{(n)} = \bm{w}^{(n+1)} - \bm{w}^{(n)}\). Then with \(\eta > 0\), we set \(\bm{w}^{(n+1)}\) according to
\[ \bm{w}^{(n+1)} := \bm{w}^{(n)} \mp \eta \nabla_w f\left( \bm{w}^{(n)}\right). \]
Taking the minus sign (gradient descent) forces \(f\) to decrease, while the plus sign (gradient ascent) forces it to increase. Indeed, computing the change in \(f\) after one iteration, we obtain
\[ \begin{split} \Delta f \triangleq f\left(\bm{w}^{(n+1)}\right) - f\left(\bm{w}^{(n)}\right) &\approx \nabla_w f\left( \bm{w}^{(n)} \right)^\top \delta \bm{w}^{(n)} \\ &= \mp \eta \nabla_w f\left( \bm{w}^{(n)} \right)^\top \nabla_w f\left( \bm{w}^{(n)} \right) \\ &= \mp \eta \norm{\nabla_w f\left( \bm{w}^{(n)} \right)}{}^2 \lessgtr 0. \end{split} \]
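The sign argument is easy to verify numerically. Below is a minimal sketch (not from the notes) that takes a few descent steps on the cost \(C\) used in the interactive example, with an arbitrarily chosen starting point and step size; \(\Delta f\) stays non-positive for a small enough \(\eta\):

```python
import numpy as np

def C(w):
    return 0.5 * (np.cos(w[0])**2 + np.sin(w[1])**2)

def grad_C(w):
    # Analytical gradient of C
    return np.array([-np.cos(w[0]) * np.sin(w[0]),
                      np.sin(w[1]) * np.cos(w[1])])

eta = 0.2                                # step size (assumed small enough)
w = np.array([-5.5, 2.0])                # arbitrary starting point
for n in range(5):
    w_next = w - eta * grad_C(w)         # minus sign: gradient descent
    print(f"step {n}: Delta f = {C(w_next) - C(w):+.6f}")  # non-positive
    w = w_next
```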
Symmetric and Positive Definite Matrices
Definition 1 (Symmetric Matrix) A matrix \(\bm{Q} \in \R^{n\times n}\) is symmetric if
\[ \bm{Q}^\top = \bm{Q}. \]
The set of symmetric \(n \times n\) matrices is denoted \(\S_n\).
Definition 2 (Positive Semidefinite Matrix) A matrix \(\bm{Q} \in \S_n\) is positive semidefinite if for all \(\bm{x} \in \R^n\),
\[ \bm{x}^\top \bm{Q x} \geq 0. \]
Definition 3 (Positive Definite Matrix) A matrix \(\bm{Q} \in \S_n\) is positive definite if for all \(\bm{x} \in \R^n\) such that \(\bm{x} \neq \bm{0}\),
\[ \bm{x}^\top \bm{Q x} > 0. \]
Definition 4 (Negative (Semi)Definite Matrix) A matrix \(\bm{Q} \in \S_n\) is negative (semi)definite if its additive inverse, \(-\bm{Q}\), is positive (semi)definite.
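As an illustration (a minimal sketch, not part of the notes; the `classify` helper is my own), the definiteness of a small symmetric matrix can be checked by inspecting the signs of its eigenvalues:

```python
import numpy as np

def classify(Q, tol=1e-10):
    # Classify a symmetric matrix by the signs of its eigenvalues
    eig = np.linalg.eigvalsh(Q)
    if np.all(eig > tol):
        return "positive definite"
    if np.all(eig >= -tol):
        return "positive semidefinite"
    if np.all(eig < -tol):
        return "negative definite"
    if np.all(eig <= tol):
        return "negative semidefinite"
    return "indefinite"

print(classify(np.array([[2.0, 0.0], [0.0, 3.0]])))   # positive definite
print(classify(np.array([[1.0, 1.0], [1.0, 1.0]])))   # positive semidefinite (eigenvalues 0, 2)
print(classify(np.array([[1.0, 0.0], [0.0, -1.0]])))  # indefinite
```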
Least Squares Identification
Given \(y(x) = \sin{x}\), \(-\pi \leq x \leq \pi\), find the parameters \(a\), \(b\), \(c\), \(d\), such that
\[ y_P(x) = a + bx + cx^2 + dx^3 \]
is a “good” approximation to \(y(x) = \sin{x}\). Let
\[ \bm{x} = \bmat{x_1 = -\pi & x_2 & \cdots & x_{n-1} & x_n = \pi}^\top \]
and
\[ \bm{y} = \sin{\bm{x}} = \bmat{\sin{x_1} = 0 & \sin{x_2} & \cdots & \sin{x_{n-1}} & \sin{x_n} = 0}^\top. \]
Then, we want to find \(a\), \(b\), \(c\), \(d\) such that
\[ \underbrace{ \bmat{ y_1 \\ y_2 \\ \vdots \\ y_n }}_{\bm{Y}\in \R^n} = \bmat{ \sin{x_1} \\ \sin{x_2} \\ \vdots \\ \sin{x_n} } = \underbrace{\bmat{ 1 & x_1 & x_1^2 & x_1^3 \\ 1 & x_2 & x_2^2 & x_2^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 & x_n^3 }}_{\bm{W} \in \R^{n \times 4}} \underbrace{\bmat{ a \\ b \\ c \\ d }}_{\bm{P} \in \R^4} \]
Let us define the error vector \(\bm{E} := \bm{Y} - \bm{WP}\) and the total error \(e: \R^4 \to \R\),
\[ e(\bm{P}) := \norm{\bm{E}}{}^2 = \bm{E}^\top \bm{E} = \bm{Y}^\top \bm{Y} - 2 \bm{P}^\top \bm{W}^\top \bm{Y} + \bm{P}^\top \bm{W}^\top\bm{W} \bm{P}. \]
The gradient of the error function with respect to \(\bm{P}\) is given by
\[ \nabla e(\bm{P}) = -2 \bm{W}^\top \bm{Y} + 2 \bm{W}^\top\bm{WP} = -2\bm{W}^\top (\bm{Y} - \bm{WP}) = -2\bm{W}^\top\bm{E}. \]
Setting this equal to zero, we obtain a linear equation in \(\bm{P}\), whose solution is given by
\[ \bm{P}^* = \left( \bm{W}^\top \bm{W} \right)^{-1} \bm{W}^\top \bm{Y}. \]
Note that \(\bm{W}^\top \bm{W} \in \S_4\) is invertible if and only if \(\bm{W}\) has full column rank, which for this polynomial design holds if and only if the data set \(\bm{x}\) contains at least \(4\) distinct points. The iterative solution to this problem is found by picking an initial guess, say \(\bm{P}^{(0)} = \bm{0}\), and then updating it until convergence according to the first-order difference equation
\[ \begin{split} \bm{P}^{(n+1)} = \bm{P}^{(n)} - \eta \nabla e\left(\bm{P}^{(n)}\right) &= \bm{P}^{(n)} + 2\eta\bm{W}^\top\left(\bm{Y} - \bm{W}\bm{P}^{(n)}\right) \\ &= \left(\bm{I} - 2\eta\bm{W}^\top \bm{W}\right)\bm{P}^{(n)} + 2\eta \bm{W}^\top\bm{Y}. \end{split} \tag{2}\]
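A minimal sketch of recursion (2) follows (assuming, as in the interactive app below, 100 evenly spaced sample points, and with a hand-picked step size and iteration count); the iterates approach the closed-form solution \(\bm{P}^*\):

```python
import numpy as np

# Sample y = sin(x) on [-pi, pi] and build the design matrix W = [1, x, x^2, x^3]
x = np.linspace(-np.pi, np.pi, 100)
Y = np.sin(x)
W = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Closed form: P* = (W^T W)^{-1} W^T Y
P_star = np.linalg.solve(W.T @ W, W.T @ Y)

# Recursion (2): P_{n+1} = (I - 2*eta*W^T W) P_n + 2*eta*W^T Y
eta = 1e-5                           # assumed step size; needs eta < 1/lambda_max(W^T W) to converge
A = np.eye(4) - 2 * eta * (W.T @ W)
b = 2 * eta * (W.T @ Y)
P = np.zeros(4)                      # initial guess P^(0) = 0
for _ in range(20000):
    P = A @ P + b

print(P_star)                        # closed-form coefficients [a, b, c, d]
print(P)                             # the iterates approach P_star
```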
In fact, \(\bm{W}^\top \bm{W}\) is positive definite whenever \(\bm{W}\) has full column rank. It is clearly symmetric. Taking an arbitrary \(\bm{x} \in \R^4\) and letting \(\bm{v} = \bm{Wx}\), we see that \[ \bm{x}^\top \left(\bm{W}^\top \bm{W}\right) \bm{x} = \bm{v}^\top \bm{v} = \norm{\bm{v}}{}^2 \geq 0. \] Finally, observe that \(\bm{v} = \bm{0}\) if and only if \(\bm{x} = \bm{0}\), since \(\bm{W}\) has full column rank, so the quadratic form is strictly positive for \(\bm{x} \neq \bm{0}\). \(\square\)
Difference Equation 2 has an associated differential equation, namely
\[ \dot{\bm{P}} = -\alpha\bm{W}^\top\bm{WP} + \alpha\bm{W}^\top \bm{Y} \] with \(\alpha > 0\). Since \(\bm{W}^\top \bm{W} \succ 0\), this is a stable linear time-invariant ordinary differential equation. Hence, its final value can be found by invoking the Final Value Theorem (FVT) from control theory. To apply it, take the Laplace transform of the ODE (with zero initial conditions) to get
\[ s \bm{P}(s) = -\alpha \bm{W}^\top \bm{W} \bm{P}(s) + \frac{\alpha \bm{W}^\top\bm{Y}}{s}. \]
After some rearrangement, the FVT yields \[ \lim_{t \to \infty} \bm{P}(t) = \lim_{s \to 0} s \bm{P}(s) = \lim_{s \to 0} s \left(s\bm{I} + \alpha \bm{W}^\top \bm{W}\right)^{-1}\frac{\alpha \bm{W}^\top \bm{Y}}{s} = \left(\bm{W}^\top\bm{W}\right)^{-1}\bm{W}^\top\bm{Y}. \]
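This limit can also be checked by integrating the ODE directly. Below is a minimal sketch (not from the notes) that uses forward Euler with assumed values for \(\alpha\), the time step, and the horizon; the trajectory settles at the closed-form solution, in agreement with the FVT:

```python
import numpy as np

# Data and design matrix, as in the least-squares example above
x = np.linspace(-np.pi, np.pi, 100)
Y = np.sin(x)
W = np.column_stack([np.ones_like(x), x, x**2, x**3])
A = W.T @ W                      # 4x4, positive definite
b = W.T @ Y

# Forward-Euler integration of Pdot = -alpha*A*P + alpha*b
alpha, dt, T = 1.0, 1e-5, 0.5    # assumed values; dt must satisfy dt < 2/(alpha*lambda_max(A))
P = np.zeros(4)
for _ in range(int(T / dt)):
    P = P + dt * alpha * (b - A @ P)

P_star = np.linalg.solve(A, b)   # closed-form least-squares solution
print(np.allclose(P, P_star, atol=1e-6))   # True: the trajectory converges to P*
```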
Infinitely-Many Extrema
Because JAX is not yet compatible with WebAssembly (browser-based Python), this example uses standard NumPy and the analytical gradient formula.
#| standalone: true
#| viewerHeight: 675
from shiny import App, render, ui
import numpy as np
import matplotlib.pyplot as plt

# 1. Define Math Functions (Using NumPy instead of JAX)
def C(x):
    return 0.5 * (np.cos(x[0])**2 + np.sin(x[1])**2)

def grad_C(x):
    # Analytical gradient derived from C(x)
    return np.array([
        -0.5 * 2 * np.cos(x[0]) * np.sin(x[0]),  # dx0: -cos*sin
        0.5 * 2 * np.sin(x[1]) * np.cos(x[1])    # dx1: sin*cos
    ])

# 2. Gradient Descent Logic
def run_gradient_descent(x0, eta, n_iter=50):
    x = np.array(x0, dtype=np.float64)
    path = [x.copy()]
    for _ in range(n_iter):
        g = grad_C(x)
        x = x - eta * g
        path.append(x.copy())
    return np.array(path)

# 3. Shiny UI
app_ui = ui.page_fluid(
    ui.layout_sidebar(
        ui.sidebar(
            ui.h4("Parameters"),
            ui.input_slider("lr", "Learning Rate (η)", 0.0, 2.0, 0.2, step=0.05),
            ui.input_slider("n_iter", "Iterations", 1, 100, 20),
            ui.hr(),
            ui.h5("Start Position"),
            ui.input_slider("x0", "X0 (Start)", -6.0, 6.0, -5.5),
            ui.input_slider("x1", "X1 (Start)", -6.0, 6.0, 2.0),
            open="always",
        ),
        ui.card(
            ui.output_plot("descent_plot")
        )
    )
)

# 4. Server Logic
def server(input, output, session):
    @output
    @render.plot
    def descent_plot():
        # A. Setup Background Contour
        pi = np.pi
        xrange = np.linspace(-2.5*pi, 2.5*pi, 100)
        yrange = np.linspace(-2.5*pi, 2.5*pi, 100)
        xx, yy = np.meshgrid(xrange, yrange)
        # Calculate Z for contour using vectorized numpy
        # Note: We construct the input manually for broadcasting
        Z = 0.5 * (np.cos(xx)**2 + np.sin(yy)**2)

        # B. Run Algorithm based on inputs
        start_pos = [input.x0(), input.x1()]
        path = run_gradient_descent(start_pos, input.lr(), input.n_iter())

        # C. Plotting
        fig, ax = plt.subplots(figsize=(6, 6))
        # Plot Contour
        c = ax.contourf(xx, yy, Z, levels=25, cmap='viridis', alpha=0.6)
        plt.colorbar(c, ax=ax, label='Cost C(x)')
        # Plot Trajectory
        ax.plot(path[:, 0], path[:, 1], 'r.-', linewidth=2, markersize=6, label='Gradient Path')
        ax.plot(path[0, 0], path[0, 1], 'ko', markersize=9, label='Start')
        ax.plot(path[-1, 0], path[-1, 1], 'w*', markersize=12, markeredgecolor='k', label='End')
        # Place text at 5% x, 95% y (Top Left), relative to axes dimensions
        formula_text = r"$C(x) = \frac{1}{2} (\cos^2(x_0) + \sin^2(x_1))$"
        ax.text(0.05, 0.95, formula_text, transform=ax.transAxes,
                fontsize=14, verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='white', alpha=0.9))
        ax.set_title(f"Gradient Descent (η = {input.lr()})")
        ax.set_xlabel('$x_0$')
        ax.set_ylabel('$x_1$')
        ax.legend(loc='lower right')
        return fig

app = App(app_ui, server)
Least Squares: Closed-Form vs. Gradient Descent
This interactive application demonstrates Least Squares Identification as described in the provided notes. We aim to approximate \(y = \sin(x)\) using a 3rd-degree polynomial:
\[ y_P(x) = a + bx + cx^2 + dx^3 \]
We compare two methods:
- Closed Form: The analytical solution \(\bm{P}^* = \left(\bm{W}^\top \bm{W}\right)^{-1} \bm{W}^\top \bm{Y}\).
- Gradient Descent: Iteratively minimizing the squared error \(e\).
The Taylor series for \(\sin(x)\) is \(x - \frac{x^3}{6} + \dots\), so we expect coefficients roughly of the form \(a=0, b=1, c=0, d=-\nicefrac{1}{6}\); the least-squares fit over the whole interval \([-\pi, \pi]\) shifts \(b\) and \(d\) somewhat, since it minimizes the error globally rather than locally around \(x = 0\).
#| standalone: true
#| viewerHeight: 925
from shiny import App, render, ui, reactive
import numpy as np
import matplotlib.pyplot as plt

# 1. Setup Data (fixed range as in the notes)
# Generate n points between -pi and pi
n = 100
x_data = np.linspace(-np.pi, np.pi, n)
y_data = np.sin(x_data)

# Construct W matrix (design matrix) [1, x, x^2, x^3], as defined in the notes
W = np.column_stack([
    np.ones_like(x_data),
    x_data,
    x_data**2,
    x_data**3
])

# 2. Closed Form Solution (the "best" fit)
# P* = (W.T @ W)^-1 @ W.T @ Y
WtW = W.T @ W
WtY = W.T @ y_data
P_closed = np.linalg.solve(WtW, WtY)  # More stable than inv()

# 3. Gradient Descent Implementation
def run_gd_linear_regression(W, Y, learning_rate, epochs):
    # Initialize coefficients [a, b, c, d] to zeros
    P = np.zeros(4)
    # Track error history
    errors = []
    for _ in range(epochs):
        # Calculate prediction: Y_pred = W @ P
        Y_pred = W @ P
        # Calculate error vector: (Y - Y_pred)
        error = Y - Y_pred
        # Calculate gradient (derived above): dE/dP = -2 * W.T @ (Y - W @ P)
        gradient = -2 * W.T @ error
        # Update weights: P_new = P_old - eta * gradient
        P = P - learning_rate * gradient
        # Store sum of squared errors
        sse = np.sum(error**2)
        errors.append(sse)
    return P, errors

# 4. Shiny UI
app_ui = ui.page_fluid(
    ui.layout_sidebar(
        ui.sidebar(
            ui.h4("Hyperparameters"),
            # Note: polynomial features grow large (x^3 ≈ 31 at x = π),
            # so gradients are large and the learning rate must be tiny.
            ui.input_slider("lr", "Learning Rate", 0.000001, 0.0001, 0.00001, step=0.000001, sep=""),
            ui.input_slider("epochs", "Iterations", 0, 5000, 100, step=100),
            ui.hr(),
            ui.h5("Coefficients (a,b,c,d)"),
            ui.output_text_verbatim("coeff_display"),
            ui.hr(),
            ui.input_action_button("reset", "Reset / Re-run"),
            open="always",
        ),
        ui.card(
            ui.output_plot("main_plot")
        )
    )
)

# 5. Server Logic
def server(input, output, session):
    @reactive.Calc
    def perform_gd():
        # Re-run when the button is clicked or inputs change
        input.reset()
        lr = input.lr()
        epochs = input.epochs()
        P_gd, error_history = run_gd_linear_regression(W, y_data, lr, epochs)
        return P_gd, error_history

    @output
    @render.text
    def coeff_display():
        P_gd, _ = perform_gd()
        # Format for display
        return (
            f"--- Closed Form ---\n"
            f"a: {P_closed[0]:.4f}\n"
            f"b: {P_closed[1]:.4f}\n"
            f"c: {P_closed[2]:.4f}\n"
            f"d: {P_closed[3]:.4f}\n\n"
            f"--- Gradient Descent ---\n"
            f"a: {P_gd[0]:.4f}\n"
            f"b: {P_gd[1]:.4f}\n"
            f"c: {P_gd[2]:.4f}\n"
            f"d: {P_gd[3]:.4f}"
        )

    @output
    @render.plot
    def main_plot():
        P_gd, error_history = perform_gd()
        # Generate predictions
        y_pred_closed = W @ P_closed
        y_pred_gd = W @ P_gd
        # Error metrics
        mse_closed = np.mean((y_data - y_pred_closed)**2)
        mse_gd = np.mean((y_data - y_pred_gd)**2)
        # Plotting
        fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(5, 10))
        # Plot 1: Curve Fitting
        ax1.plot(x_data, y_data, 'k-', alpha=0.3, linewidth=4, label='True sin(x)')
        ax1.plot(x_data, y_pred_closed, 'b--', linewidth=2, label=f'Closed Form (MSE: {mse_closed:.4f})')
        ax1.plot(x_data, y_pred_gd, 'r-', label=f'Gradient Descent (MSE: {mse_gd:.4f})')
        ax1.set_title(f"Polynomial Fit (Iter: {input.epochs()})")
        ax1.set_xlabel("x")
        ax1.set_ylabel("y")
        ax1.legend()
        ax1.grid(True, alpha=0.3)
        # Plot 2: Error Curve
        if len(error_history) > 0:
            ax2.plot(error_history)
            ax2.set_title("Sum Squared Error over Iterations")
            ax2.set_xlabel("Epoch")
            ax2.set_ylabel("Error (SSE)")
            ax2.set_yscale('log')  # Log scale helps visualize convergence
            ax2.grid(True, alpha=0.3)
        return fig

app = App(app_ui, server)
Embedding these Jupyter notebooks makes the website sluggish, so please download them and open them locally.
