Week 1: Basics of Optimization
Taylor Series, Gradient, and Hessian
Let us consider a local extremum (minimum or maximum) of a function \(f: \mathbb{R}^2 \to \mathbb{R}\), \(\bm{w} = (w_1, w_2) \mapsto f(\bm{w}) = f(w_1, w_2)\). We perform a power series (Taylor) expansion around \(\bm{w} = \bm{w}^{(0)}\) and just keep the first three terms to obtain
\[ f(\bm{w}) \approx f\left(\bm{w}^{(0)}\right) + \nabla_w f\left(\bm{w}^{(0)}\right)^\top \delta \bm{w} + \frac{1}{2} \delta \bm{w}^\top \bm{H}_f\left(\bm{w}^{(0)}\right) \delta \bm{w}, \tag{1}\]
where the gradient of \(f\) is defined by \(\nabla_w f(\bm{w}) = \bmat{\pd{f}{w_1}(\bm{w}) & \pd{f}{w_2}(\bm{w})}^\top\), \(\delta \bm{w} = \bm{w} - \bm{w}^{(0)}\), and \(\bm{H}_f(\bm{w})\) is the Hessian matrix of second derivatives:
\[ \bm{H}_f(\bm{w}) = \bmat{ \pdd{f}{w_1}(\bm{w}) & \frac{\partial^2 f}{\partial w_1 \partial w_2}(\bm{w}) \\ \frac{\partial^2 f}{\partial w_2 \partial w_1}(\bm{w}) & \pdd{f}{w_2}(\bm{w}) }. \]
If \(\bm{w}^{(0)}\) is an extremum, then \(\nabla_w f\left(\bm{w}^{(0)}\right) = \bm{0}\), so at such a point the linear term in Equation 1 vanishes. Conversely, if the gradient vanishes at \(\bm{w}^{(0)}\) and the Hessian \(\bm{H}_f\left(\bm{w}^{(0)}\right)\) is positive definite, then \(\bm{w}^{(0)}\) is a local minimum; if the Hessian is negative definite, it is a local maximum. (If the Hessian is indefinite, \(\bm{w}^{(0)}\) is a saddle point; if it is only semidefinite, this second-order test is inconclusive.)
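As a quick numerical sanity check (a minimal sketch, not part of the derivation above; it reuses the cost \(C\) from the interactive example below and approximates derivatives with my own finite-difference helpers), we can verify that the gradient vanishes at a candidate point and classify it by the eigenvalues of the Hessian:

```python
import numpy as np

def C(w):
    # Cost from the interactive example below: C(w) = 0.5*(cos^2(w_1) + sin^2(w_2))
    return 0.5 * (np.cos(w[0])**2 + np.sin(w[1])**2)

def fd_gradient(f, w, h=1e-5):
    # Central-difference approximation of the gradient
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

def fd_hessian(f, w, h=1e-4):
    # Central-difference approximation of the Hessian
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(w + ei + ej) - f(w + ei - ej)
                       - f(w - ei + ej) + f(w - ei - ej)) / (4 * h**2)
    return H

w0 = np.array([np.pi / 2, 0.0])               # candidate extremum
print(fd_gradient(C, w0))                     # ~ [0, 0]: stationary point
print(np.linalg.eigvalsh(fd_hessian(C, w0)))  # ~ [1, 1]: positive definite => local minimum
```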
Gradient Descent/Ascent
For complicated functions, such as those defining neural networks, it is difficult to solve \(\nabla_w f\left(\bm{w}^{(0)}\right) = \bm{0}\) for \(\bm{w}^{(0)}\). Instead, we start at an initial guess \(\bm{w}^{(1)}\) and iteratively search for a point where \(\nabla_w f = \bm{0}\). To do this, we use the first-order approximation
\[ f\left( \bm{w}^{(n+1)} \right) \approx f\left( \bm{w}^{(n)} \right) + \nabla_w f\left(\bm{w}^{(n)}\right)^\top \delta \bm{w}^{(n)}, \]
where \(\delta \bm{w}^{(n)} = \bm{w}^{(n+1)} - \bm{w}^{(n)}\). Then with \(\eta > 0\), we set \(\bm{w}^{(n+1)}\) according to
\[ \bm{w}^{(n+1)} := \bm{w}^{(n)} \mp \eta \nabla_w f\left( \bm{w}^{(n)}\right). \]
Taking the minus sign (gradient descent) forces \(f\) to decrease, while the plus sign (gradient ascent) forces it to increase. Indeed, computing the change in \(f\) after one iteration, we obtain
\[ \begin{split} \Delta f \triangleq f\left(\bm{w}^{(n+1)}\right) - f\left(\bm{w}^{(n)}\right) &\approx \nabla_w f\left( \bm{w}^{(n)} \right)^\top \delta \bm{w}^{(n)} \\ &= \mp \eta \nabla_w f\left( \bm{w}^{(n)} \right)^\top \nabla_w f\left( \bm{w}^{(n)} \right) \\ &= \mp \eta \norm{\nabla_w f\left( \bm{w}^{(n)} \right)}{}^2 \lessgtr 0. \end{split} \]
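The sign argument is easy to verify numerically. Below is a minimal sketch (not from the notes) that takes a few descent steps on the cost \(C\) used in the interactive example, with an arbitrarily chosen starting point and step size; \(\Delta f\) stays non-positive for a small enough \(\eta\):

```python
import numpy as np

def C(w):
    return 0.5 * (np.cos(w[0])**2 + np.sin(w[1])**2)

def grad_C(w):
    # Analytical gradient of C
    return np.array([-np.cos(w[0]) * np.sin(w[0]),
                      np.sin(w[1]) * np.cos(w[1])])

eta = 0.2                                # step size (assumed small enough)
w = np.array([-5.5, 2.0])                # arbitrary starting point
for n in range(5):
    w_next = w - eta * grad_C(w)         # minus sign: gradient descent
    print(f"step {n}: Delta f = {C(w_next) - C(w):+.6f}")  # non-positive
    w = w_next
```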
Symmetric and Positive Definite Matrices
Definition 1 (Symmetric Matrix) A matrix \(\bm{Q} \in \R^{n\times n}\) is symmetric if
\[ \bm{Q}^\top = \bm{Q}. \]
The set of symmetric \(n \times n\) matrices is denoted \(\S_n\).
Definition 2 (Positive Semidefinite Matrix) A matrix \(\bm{Q} \in \S_n\) is positive semidefinite if for all \(\bm{x} \in \R^n\),
\[ \bm{x}^\top \bm{Q x} \geq 0. \]
Definition 3 (Positive Definite Matrix) A matrix \(\bm{Q} \in \S_n\) is positive definite if for all \(\bm{x} \in \R^n\) such that \(\bm{x} \neq \bm{0}\),
\[ \bm{x}^\top \bm{Q x} > 0. \]
Definition 4 (Negative (Semi)Definite Matrix) A matrix \(\bm{Q} \in \S_n\) is negative (semi)definite if its additive inverse, \(-\bm{Q}\), is positive (semi)definite.
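As an illustration (a minimal sketch, not part of the notes; the `classify` helper is my own), the definiteness of a small symmetric matrix can be checked by inspecting the signs of its eigenvalues:

```python
import numpy as np

def classify(Q, tol=1e-10):
    # Classify a symmetric matrix by the signs of its eigenvalues
    eig = np.linalg.eigvalsh(Q)
    if np.all(eig > tol):
        return "positive definite"
    if np.all(eig >= -tol):
        return "positive semidefinite"
    if np.all(eig < -tol):
        return "negative definite"
    if np.all(eig <= tol):
        return "negative semidefinite"
    return "indefinite"

print(classify(np.array([[2.0, 0.0], [0.0, 3.0]])))   # positive definite
print(classify(np.array([[1.0, 1.0], [1.0, 1.0]])))   # positive semidefinite (eigenvalues 0, 2)
print(classify(np.array([[1.0, 0.0], [0.0, -1.0]])))  # indefinite
```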
Least Squares Identification
Given \(y(x) = \sin{x}\), \(-\pi \leq x \leq \pi\), find the parameters \(a\), \(b\), \(c\), \(d\), such that
\[ y_P(x) = a + bx + cx^2 + dx^3 \]
is a “good” approximation to \(y(x) = \sin{x}\). Let
\[ \bm{x} = \bmat{x_1 = -\pi & x_2 & \cdots & x_{n-1} & x_n = \pi}^\top \]
and
\[ \bm{y} = \sin{\bm{x}} = \bmat{\sin{x_1} = 0 & \sin{x_2} & \cdots & \sin{x_{n-1}} & \sin{x_n} = 0}^\top. \]
Then, we want to find \(a\), \(b\), \(c\), \(d\) such that
\[ \underbrace{ \bmat{ y_1 \\ y_2 \\ \vdots \\ y_n }}_{\bm{Y}\in \R^n} = \bmat{ \sin{x_1} \\ \sin{x_2} \\ \vdots \\ \sin{x_n} } = \underbrace{\bmat{ 1 & x_1 & x_1^2 & x_1^3 \\ 1 & x_2 & x_2^2 & x_2^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 & x_n^3 }}_{\bm{W} \in \R^{n \times 4}} \underbrace{\bmat{ a \\ b \\ c \\ d }}_{\bm{P} \in \R^4} \]
Let us define the error vector \(\bm{E} := \bm{Y} - \bm{WP}\) and the total error \(e: \R^4 \to \R\),
\[ e(\bm{P}) := \norm{\bm{E}}{}^2 = \bm{E}^\top \bm{E} = \bm{Y}^\top \bm{Y} - 2 \bm{P}^\top \bm{W}^\top \bm{Y} + \bm{P}^\top \bm{W}^\top\bm{W} \bm{P}. \]
The gradient of the error function with respect to \(\bm{P}\) is given by
\[ \nabla e(\bm{P}) = -2 \bm{W}^\top \bm{Y} + 2 \bm{W}^\top\bm{WP} = -2\bm{W}^\top (\bm{Y} - \bm{WP}) = -2\bm{W}^\top\bm{E}. \]
Setting this equal to zero, we obtain a linear equation in \(\bm{P}\), whose solution is given by
\[ \bm{P}^* = \left( \bm{W}^\top \bm{W} \right)^{-1} \bm{W}^\top \bm{Y}. \]
Note that \(\bm{W}^\top \bm{W} \in \S_4\) is invertible if and only if \(\bm{W}\) has full column rank, which for this polynomial design holds if and only if the data set \(\bm{x}\) contains at least \(4\) distinct points. The iterative solution to this problem is found by picking an initial guess, say \(\bm{P}^{(0)} = \bm{0}\), and then updating it until convergence according to the first-order difference equation
\[ \begin{split} \bm{P}^{(n+1)} = \bm{P}^{(n)} - \eta \nabla e\left(\bm{P}^{(n)}\right) &= \bm{P}^{(n)} + 2\eta\bm{W}^\top\left(\bm{Y} - \bm{W}\bm{P}^{(n)}\right) \\ &= \left(\bm{I} - 2\eta\bm{W}^\top \bm{W}\right)\bm{P}^{(n)} + 2\eta \bm{W}^\top\bm{Y}. \end{split} \tag{2}\]
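A minimal sketch of recursion (2) follows (assuming, as in the interactive app below, 100 evenly spaced sample points, and with a hand-picked step size and iteration count); the iterates approach the closed-form solution \(\bm{P}^*\):

```python
import numpy as np

# Sample y = sin(x) on [-pi, pi] and build the design matrix W = [1, x, x^2, x^3]
x = np.linspace(-np.pi, np.pi, 100)
Y = np.sin(x)
W = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Closed form: P* = (W^T W)^{-1} W^T Y
P_star = np.linalg.solve(W.T @ W, W.T @ Y)

# Recursion (2): P_{n+1} = (I - 2*eta*W^T W) P_n + 2*eta*W^T Y
eta = 1e-5                           # assumed step size; needs eta < 1/lambda_max(W^T W) to converge
A = np.eye(4) - 2 * eta * (W.T @ W)
b = 2 * eta * (W.T @ Y)
P = np.zeros(4)                      # initial guess P^(0) = 0
for _ in range(20000):
    P = A @ P + b

print(P_star)                        # closed-form coefficients [a, b, c, d]
print(P)                             # the iterates approach P_star
```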
In fact, \(\bm{W}^\top \bm{W}\) is positive definite whenever \(\bm{W}\) has full column rank. It is clearly symmetric. Taking an arbitrary \(\bm{x} \in \R^4\) and letting \(\bm{v} = \bm{Wx}\), we see that \[ \bm{x}^\top \left(\bm{W}^\top \bm{W}\right) \bm{x} = \bm{v}^\top \bm{v} = \norm{\bm{v}}{}^2 \geq 0. \] Finally, observe that \(\bm{v} = \bm{0}\) if and only if \(\bm{x} = \bm{0}\), since \(\bm{W}\) has full column rank, so the quadratic form is strictly positive for \(\bm{x} \neq \bm{0}\). \(\square\)
Difference Equation 2 has an associated differential equation, namely
\[ \dot{\bm{P}} = -\alpha\bm{W}^\top\bm{WP} + \alpha\bm{W}^\top \bm{Y} \] with \(\alpha > 0\). Since \(\bm{W}^\top \bm{W} \succ 0\), this is a stable linear time-invariant ordinary differential equation. Hence, its final value can be found by invoking the Final Value Theorem (FVT) from control theory. To apply it, take the Laplace transform of the ODE (with zero initial conditions) to get
\[ s \bm{P}(s) = -\alpha \bm{W}^\top \bm{W} \bm{P}(s) + \frac{\alpha \bm{W}^\top\bm{Y}}{s}. \]
After some rearrangement, the FVT yields \[ \lim_{t \to \infty} \bm{P}(t) = \lim_{s \to 0} s \bm{P}(s) = \lim_{s \to 0} s \left(s\bm{I} + \alpha \bm{W}^\top \bm{W}\right)^{-1}\frac{\alpha \bm{W}^\top \bm{Y}}{s} = \left(\bm{W}^\top\bm{W}\right)^{-1}\bm{W}^\top\bm{Y}. \]
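This limit can also be checked by integrating the ODE directly. Below is a minimal sketch (not from the notes) that uses forward Euler with assumed values for \(\alpha\), the time step, and the horizon; the trajectory settles at the closed-form solution, in agreement with the FVT:

```python
import numpy as np

# Data and design matrix, as in the least-squares example above
x = np.linspace(-np.pi, np.pi, 100)
Y = np.sin(x)
W = np.column_stack([np.ones_like(x), x, x**2, x**3])
A = W.T @ W                      # 4x4, positive definite
b = W.T @ Y

# Forward-Euler integration of Pdot = -alpha*A*P + alpha*b
alpha, dt, T = 1.0, 1e-5, 0.5    # assumed values; dt must satisfy dt < 2/(alpha*lambda_max(A))
P = np.zeros(4)
for _ in range(int(T / dt)):
    P = P + dt * alpha * (b - A @ P)

P_star = np.linalg.solve(A, b)   # closed-form least-squares solution
print(np.allclose(P, P_star, atol=1e-6))   # True: the trajectory converges to P*
```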
Infinitely-Many Extrema
Because JAX is not yet compatible with WebAssembly (browser-based Python), this example uses standard NumPy and the analytical gradient formula.
#| standalone: true
#| viewerHeight: 675
from shiny import App, render, ui
import numpy as np
import matplotlib.pyplot as plt

# 1. Define Math Functions (Using NumPy instead of JAX)
def C(x):
    return 0.5 * (np.cos(x[0])**2 + np.sin(x[1])**2)

def grad_C(x):
    # Analytical gradient derived from C(x)
    return np.array([
        -0.5 * 2 * np.cos(x[0]) * np.sin(x[0]),  # dx0: -cos*sin
        0.5 * 2 * np.sin(x[1]) * np.cos(x[1])    # dx1: sin*cos
    ])

# 2. Gradient Descent Logic
def run_gradient_descent(x0, eta, n_iter=50):
    x = np.array(x0, dtype=np.float64)
    path = [x.copy()]
    for _ in range(n_iter):
        g = grad_C(x)
        x = x - eta * g
        path.append(x.copy())
    return np.array(path)

# 3. Shiny UI
app_ui = ui.page_fluid(
    ui.layout_sidebar(
        ui.sidebar(
            ui.h4("Parameters"),
            ui.input_slider("lr", "Learning Rate (η)", 0.0, 2.0, 0.2, step=0.05),
            ui.input_slider("n_iter", "Iterations", 1, 100, 20),
            ui.hr(),
            ui.h5("Start Position"),
            ui.input_slider("x0", "X0 (Start)", -6.0, 6.0, -5.5),
            ui.input_slider("x1", "X1 (Start)", -6.0, 6.0, 2.0),
            open="always",
        ),
        ui.card(
            ui.output_plot("descent_plot")
        )
    )
)

# 4. Server Logic
def server(input, output, session):
    @output
    @render.plot
    def descent_plot():
        # A. Setup Background Contour
        pi = np.pi
        xrange = np.linspace(-2.5*pi, 2.5*pi, 100)
        yrange = np.linspace(-2.5*pi, 2.5*pi, 100)
        xx, yy = np.meshgrid(xrange, yrange)
        # Calculate Z for contour using vectorized numpy
        # Note: We construct the input manually for broadcasting
        Z = 0.5 * (np.cos(xx)**2 + np.sin(yy)**2)

        # B. Run Algorithm based on inputs
        start_pos = [input.x0(), input.x1()]
        path = run_gradient_descent(start_pos, input.lr(), input.n_iter())

        # C. Plotting
        fig, ax = plt.subplots(figsize=(6, 6))
        # Plot Contour
        c = ax.contourf(xx, yy, Z, levels=25, cmap='viridis', alpha=0.6)
        plt.colorbar(c, ax=ax, label='Cost C(x)')
        # Plot Trajectory
        ax.plot(path[:, 0], path[:, 1], 'r.-', linewidth=2, markersize=6, label='Gradient Path')
        ax.plot(path[0, 0], path[0, 1], 'ko', markersize=9, label='Start')
        ax.plot(path[-1, 0], path[-1, 1], 'w*', markersize=12, markeredgecolor='k', label='End')
        # Place text at 5% x, 95% y (Top Left), relative to axes dimensions
        formula_text = r"$C(x) = \frac{1}{2} (\cos^2(x_0) + \sin^2(x_1))$"
        ax.text(0.05, 0.95, formula_text, transform=ax.transAxes,
                fontsize=14, verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='white', alpha=0.9))
        ax.set_title(f"Gradient Descent (η = {input.lr()})")
        ax.set_xlabel('$x_0$')
        ax.set_ylabel('$x_1$')
        ax.legend(loc='lower right')
        return fig

app = App(app_ui, server)
Least Squares: Closed-Form vs. Gradient Descent
This interactive application demonstrates Least Squares Identification as described in the provided notes. We aim to approximate \(y = \sin(x)\) using a 3rd-degree polynomial:
\[ y_P(x) = a + bx + cx^2 + dx^3 \]
We compare two methods:
- Closed Form: The analytical solution \(\bm{P}^* = \left(\bm{W}^\top \bm{W}\right)^{-1} \bm{W}^\top \bm{Y}\).
- Gradient Descent: Iteratively minimizing the squared error \(e\).
The Taylor series for \(\sin(x)\) is \(x - \frac{x^3}{6} + \dots\), so we expect coefficients roughly of the form \(a=0, b=1, c=0, d=-\nicefrac{1}{6}\); the least-squares fit over the whole interval \([-\pi, \pi]\) shifts \(b\) and \(d\) somewhat, since it minimizes the error globally rather than locally around \(x = 0\).
#| standalone: true
#| viewerHeight: 925
from shiny import App, render, ui, reactive
import numpy as np
import matplotlib.pyplot as plt

# 1. Setup Data (fixed range as in the notes)
# Generate n points between -pi and pi
n = 100
x_data = np.linspace(-np.pi, np.pi, n)
y_data = np.sin(x_data)

# Construct W matrix (design matrix) [1, x, x^2, x^3], as defined in the notes
W = np.column_stack([
    np.ones_like(x_data),
    x_data,
    x_data**2,
    x_data**3
])

# 2. Closed Form Solution (the "best" fit)
# P* = (W.T @ W)^-1 @ W.T @ Y
WtW = W.T @ W
WtY = W.T @ y_data
P_closed = np.linalg.solve(WtW, WtY)  # More stable than inv()

# 3. Gradient Descent Implementation
def run_gd_linear_regression(W, Y, learning_rate, epochs):
    # Initialize coefficients [a, b, c, d] to zeros
    P = np.zeros(4)
    # Track error history
    errors = []
    for _ in range(epochs):
        # Calculate prediction: Y_pred = W @ P
        Y_pred = W @ P
        # Calculate error vector: (Y - Y_pred)
        error = Y - Y_pred
        # Calculate gradient (derived above): dE/dP = -2 * W.T @ (Y - W @ P)
        gradient = -2 * W.T @ error
        # Update weights: P_new = P_old - eta * gradient
        P = P - learning_rate * gradient
        # Store sum of squared errors
        sse = np.sum(error**2)
        errors.append(sse)
    return P, errors

# 4. Shiny UI
app_ui = ui.page_fluid(
    ui.layout_sidebar(
        ui.sidebar(
            ui.h4("Hyperparameters"),
            # Note: polynomial features grow large (x^3 ≈ 31 at x = π),
            # so gradients are large and the learning rate must be tiny.
            ui.input_slider("lr", "Learning Rate", 0.000001, 0.0001, 0.00001, step=0.000001, sep=""),
            ui.input_slider("epochs", "Iterations", 0, 5000, 100, step=100),
            ui.hr(),
            ui.h5("Coefficients (a,b,c,d)"),
            ui.output_text_verbatim("coeff_display"),
            ui.hr(),
            ui.input_action_button("reset", "Reset / Re-run"),
            open="always",
        ),
        ui.card(
            ui.output_plot("main_plot")
        )
    )
)

# 5. Server Logic
def server(input, output, session):
    @reactive.Calc
    def perform_gd():
        # Re-run when the button is clicked or inputs change
        input.reset()
        lr = input.lr()
        epochs = input.epochs()
        P_gd, error_history = run_gd_linear_regression(W, y_data, lr, epochs)
        return P_gd, error_history

    @output
    @render.text
    def coeff_display():
        P_gd, _ = perform_gd()
        # Format for display
        return (
            f"--- Closed Form ---\n"
            f"a: {P_closed[0]:.4f}\n"
            f"b: {P_closed[1]:.4f}\n"
            f"c: {P_closed[2]:.4f}\n"
            f"d: {P_closed[3]:.4f}\n\n"
            f"--- Gradient Descent ---\n"
            f"a: {P_gd[0]:.4f}\n"
            f"b: {P_gd[1]:.4f}\n"
            f"c: {P_gd[2]:.4f}\n"
            f"d: {P_gd[3]:.4f}"
        )

    @output
    @render.plot
    def main_plot():
        P_gd, error_history = perform_gd()
        # Generate predictions
        y_pred_closed = W @ P_closed
        y_pred_gd = W @ P_gd
        # Error metrics
        mse_closed = np.mean((y_data - y_pred_closed)**2)
        mse_gd = np.mean((y_data - y_pred_gd)**2)
        # Plotting
        fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(5, 10))
        # Plot 1: Curve Fitting
        ax1.plot(x_data, y_data, 'k-', alpha=0.3, linewidth=4, label='True sin(x)')
        ax1.plot(x_data, y_pred_closed, 'b--', linewidth=2, label=f'Closed Form (MSE: {mse_closed:.4f})')
        ax1.plot(x_data, y_pred_gd, 'r-', label=f'Gradient Descent (MSE: {mse_gd:.4f})')
        ax1.set_title(f"Polynomial Fit (Iter: {input.epochs()})")
        ax1.set_xlabel("x")
        ax1.set_ylabel("y")
        ax1.legend()
        ax1.grid(True, alpha=0.3)
        # Plot 2: Error Curve
        if len(error_history) > 0:
            ax2.plot(error_history)
            ax2.set_title("Sum Squared Error over Iterations")
            ax2.set_xlabel("Epoch")
            ax2.set_ylabel("Error (SSE)")
            ax2.set_yscale('log')  # Log scale helps visualize convergence
            ax2.grid(True, alpha=0.3)
        return fig

app = App(app_ui, server)
Embedding these Jupyter notebooks makes the website sluggish, so please download them and open them locally.
