Mathematical Foundation of AI

Author: Hadi Daneshmand and Yangfeng Ji

We introduce a set of interactive classroom puzzles that move beyond traditional lecture-based instruction. These puzzles are designed to bridge theory and practice, deepen intuition, and strengthen conceptual understanding. All course materials—including the puzzles—can be readily adopted by instructors, students, and AI-related programs.

This page is developed as part of the prospective interdisciplinary AI minor at the University of Virginia.

Acknowledgement. This educational effort was supported in part by NSF grant 2343611 (PI David Evans) whose support is gratefully acknowledged.

Citations. You’re welcome to use these puzzles in your classroom! We’d appreciate it if you could cite our work when reusing them.

@book{daneshmand2025machine,
  title     = {Machine Learning Foundations with Puzzles},
  author    = {Daneshmand, Hadi and Ji, Yangfeng},
  year      = {2025},
  publisher = {University of Virginia}
}

Overview

Topics

The graph above illustrates the core topics planned for the Mathematical Foundations of AI course, a central component of the AI minor program at the University of Virginia. Topics on the left represent foundational background material, while those on the right correspond to key machine learning concepts. We currently provide interactive puzzles for three essential areas: supervised learning, linear algebra, and optimization.

Core I: Supervised Learning

Topics

| Title | Author | Booktitle | Chapter | Subtopics |
| --- | --- | --- | --- | --- |
| Motivation & In-class group activities | Daneshmand and Ji | Puzzles for Math of AI | Core I | (1) logistic regression, (2) SVM, (3) perceptron, (4) linear regression for classification |
| Linear regression | Friedman, Tibshirani, and Hastie | Elements of Statistical Learning | Ch. 3 | (1) maximum likelihood estimation and regression, (2) lasso, (3) bias–variance decomposition |
| Linear classification | Friedman, Tibshirani, and Hastie | Elements of Statistical Learning | Ch. 4 | (1) logistic regression, (2) perceptron, (3) SVM |
| Non-linear supervised learning | Friedman, Tibshirani, and Hastie | Elements of Statistical Learning | Ch. 6 | (1) maximum margin notion, (2) hard-margin SVM, (3) dual formulation of SVM, (4) soft-margin SVM |

We designed several interactive Colab visualizations to compare different classification methods and to demonstrate their respective strengths and weaknesses.

Problem 1: Breaking linear regression for classification

In the plot below, you see a set of blue and red points denoted by $(x^{(i)}, y^{(i)})$. Each feature vector $x^{(i)} \in \mathbb{R}^2$ represents the location of a point, and each label $y^{(i)} \in \{0, 1\}$ determines its color.

The dashed line corresponds to a separating plane obtained via regression.
Specifically, we solve
$$\hat{w} = \arg\min_{w} \sum_i \left( \langle x^{(i)}, w \rangle - y^{(i)} \right)^2,$$
and then plot the linear predictor used to separate the blue points from the red points, $y = \langle \hat{w}, x \rangle + 0.5$.
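
For reference, here is a minimal numpy sketch of this least-squares classifier (the toy data below is illustrative, not the Colab's draggable points):

import numpy as np

# Illustrative toy data: labels are 0 (blue) and 1 (red)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (20, 2)), rng.normal(1, 0.5, (20, 2))])
y = np.array([0.0] * 20 + [1.0] * 20)

# Least-squares fit: w_hat = argmin_w sum_i (<x_i, w> - y_i)^2
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classify by thresholding <w_hat, x> at 0.5
pred = (X @ w_hat >= 0.5).astype(float)
print("training accuracy:", (pred == y).mean())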

Question: Move the points in this colab so that the regression solution can no longer separate the two classes while the points are still linearly separable.

bokeh_plot (4)

Solution

Linear regression is sensitive to outliers, so by moving a point far away one can break the regression-based classifier, as shown below.

plot

Problem 2: Breaking logistic regression

In this Colab, you see a set of draggable points, each labeled as either blue or red. The dashed line corresponds to the separating boundary computed by logistic regression. Recall that logistic regression models the probability of class $y \in \{0, 1\}$ given a feature vector $x \in \mathbb{R}^2$ as
$$P(y = 1 \mid x) = \sigma(w^\top x), \quad \text{where } \sigma(z) = \frac{1}{1 + e^{-z}}.$$

The parameters $w$ are obtained by minimizing the logistic loss
$$\hat{w} = \arg\min_{w} \sum_i \left( -y^{(i)} \log \sigma(w^\top x^{(i)}) - (1 - y^{(i)}) \log\left(1 - \sigma(w^\top x^{(i)})\right) \right).$$
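
For reference, a minimal sketch of fitting logistic regression with scikit-learn (not the Colab's exact code; note that scikit-learn also fits a bias term and applies mild regularization unless C is large):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative toy data: two 2D clusters labeled 0 and 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (20, 2)), rng.normal(1, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Fit logistic regression; a large C approximates the unregularized loss above
clf = LogisticRegression(C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, ", b =", b)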

Question
Change the location of a single point so that the logistic regression solution misclassifies many of the remaining points.

Solution.

This exercise demonstrates how sensitive the separating boundary of logistic regression is to misclassified points. If we move one of the blue points far to the right toward the red points, the logistic regression model shifts the classification boundary to reduce misclassification. This behavior is due to the exponential penalty for misclassified points in the likelihood function.

Problem 3: Rosenblatt’s algorithm

One way to fix the issue with logistic regression from the last problem is to change the loss function to
$$\arg\min_{w} \sum_i \mathbf{1}\left(\langle x^{(i)}, w \rangle \times y^{(i)} < 0.5\right), \qquad \mathbf{1}(a < 0.5) = \begin{cases} 1 & a < 0.5 \\ 0 & a \ge 0.5. \end{cases}$$

The problem is equivalent to learning a perceptron classifier. Since the objective function above is not differentiable, it cannot be directly optimized using standard gradient descent. Rosenblatt proposed an alternative algorithm capable of optimizing this objective despite its non-differentiability. However, the problem remains non-convex, which introduces challenges that we aim to investigate: specifically, under what conditions the algorithm successfully classifies the data points.
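
For concreteness, here is a minimal sketch of Rosenblatt's perceptron update in its standard form with labels in {-1, +1} (the toy data is illustrative):

import numpy as np

def perceptron(X, y, epochs=100):
    # Rosenblatt's update: w += y_i * x_i whenever point i is misclassified
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        updated = False
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:   # point i is misclassified
                w += yi * xi
                updated = True
        if not updated:              # all points classified correctly: stop
            break
    return w

# Illustrative linearly separable toy data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)
print("learned w:", perceptron(X, y))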

Question: In this Colab demo, you can drag the points, and the separating line computed by Rosenblatt’s algorithm will update interactively. Under what conditions can Rosenblatt’s algorithm successfully separate the blue and red points?

Solution

We observe that the algorithm successfully finds a separating line as long as the data remains linearly separable; however, its convergence slows significantly as the margin decreases.

Problem 4: Hard margin SVM

In this colab demo, you can drag the points, and the decision boundary of the hard-margin SVM (shown as the dashed line) updates interactively.

Recall: Hard-Margin SVM in PyTorch via Dual Optimization (no upper box constraint on $\alpha$).
Dual:

$$\max_{\alpha} \; f(\alpha) = \mathbf{1}^\top \alpha - \tfrac{1}{2}\, \alpha^\top (Y K Y)\, \alpha \quad \text{s.t.} \quad \alpha \ge 0, \; y^\top \alpha = 0,$$

where $Y = \mathrm{diag}(y)$ and $K_{ij} = \langle x_i, x_j \rangle$ is the Gram matrix.

Separating line after solving the above maximization:

$$w = \sum_i \alpha_i y_i x_i \quad \text{and} \quad b = \operatorname{mean}_{i : \alpha_i > 0} \left[ y_i - w^\top x_i \right].$$
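
A hedged sketch of how such a dual solver might look in PyTorch (projected gradient ascent on the dual with an approximate projection; this is an illustration, not the Colab's exact implementation):

import torch

def hard_margin_svm_dual(X, y, steps=2000, lr=0.01):
    # Maximize 1'a - 0.5 a'(YKY)a subject to a >= 0 and y'a = 0
    n = X.shape[0]
    K = X @ X.T                                  # linear-kernel Gram matrix
    YKY = (y[:, None] * y[None, :]) * K
    a = torch.zeros(n, requires_grad=True)
    for _ in range(steps):
        f = a.sum() - 0.5 * a @ YKY @ a
        f.backward()
        with torch.no_grad():
            a += lr * a.grad                     # gradient ascent step
            a -= (a @ y) / (y @ y) * y           # project (approximately) onto y'a = 0
            a.clamp_(min=0)                      # project onto a >= 0
            a.grad.zero_()
    w = (a.detach() * y) @ X
    sv = a.detach() > 1e-5                       # support vectors have positive alpha
    b = (y[sv] - X[sv] @ w).mean()
    return w, b

# Illustrative usage with toy data
torch.manual_seed(0)
X = torch.cat([torch.randn(20, 2) - 2.0, torch.randn(20, 2) + 2.0])
y = torch.cat([-torch.ones(20), torch.ones(20)])
w, b = hard_margin_svm_dual(X, y)
print(w, b)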
Question: Identify a point whose movement can significantly alter the decision boundary of the hard-margin SVM.

Solution.

The task requires students to determine the support vectors and adjust their positions to reduce the margin separating the blue and red point sets. In the dual perspective, this corresponds to minimizing the distance between the convex hulls of the two classes, with the boundary points of these hulls serving as the support vectors (see the plot below).

bokeh_plot (7)

Problem 5: Hard margin vs soft margin SVM

Compare the performance of hard-margin and soft-margin SVM in this Colab.

Question. By dragging points, generate datasets on which hard-margin and soft-margin SVM lead to completely different decision boundaries.

Solution

Soft-margin SVM is very robust to misclassification and keeps a large margin even when outliers would reduce the hard margin to a small one. You can compare the hard-margin decision boundary in Problem 4 with the soft-margin decision boundary in the plot below to see the main difference between these two fundamental algorithms.

bokeh_plot (8)
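
A quick way to reproduce this comparison outside the Colab is to vary the C parameter of scikit-learn's SVC; a very large C approximates the hard margin (a sketch with illustrative data):

import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters plus one outlier of class 0 placed near class 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2)), [[1.5, 1.5]]])
y = np.array([0] * 20 + [1] * 20 + [0])

hard = SVC(kernel="linear", C=1e6).fit(X, y)   # ~hard margin: the boundary chases the outlier
soft = SVC(kernel="linear", C=1.0).fit(X, y)   # soft margin: tolerates the outlier
print("hard-margin w:", hard.coef_[0], " soft-margin w:", soft.coef_[0])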

Problem 6: Non-linear Classification

The question is how to use linear models when the classes are not linearly separable. Consider the following example data points.

image
In the plot above, the SVM decision boundary misclassifies nearly half of the points. How can we address this issue?

Open this interactive Colab demo. You can modify the function phi in the demo to apply different feature transformations. The demo will then update the decision boundary based on your nonlinear transformation (by fitting a linear model on the transformed features).
Question: Find a transformation that perfectly separates the points.

Solution

When we include features that capture the (squared) norm of the points, the classification problem becomes linear in the transformed space. The mapping below illustrates one valid choice, though many alternatives exist.

def phi(x, z):
    # Note: x and z are 1D arrays of size #samples containing the x and z coordinates.
    return np.stack([x**2, z**2], axis=-1)

image

Problem 7: Kernel Methods and Overfitting

In this interactive Colab, you can adjust the kernel bandwidth and observe how changing the bandwidth affects the decision boundary.

image

Question: Observe how the decision boundary and the test error (reported in the plot title) change as you adjust the RBF kernel bandwidth parameter $\gamma$, defined by
$$k(x, y) = e^{-\gamma \|x - y\|^2}.$$
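
Outside the Colab, the same effect can be reproduced with scikit-learn's RBF-kernel SVC by sweeping gamma (a sketch; the dataset here is illustrative):

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for gamma in [0.1, 1, 10, 100, 1000]:
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    print(f"gamma={gamma:>6}: train error={1 - clf.score(X_tr, y_tr):.2f}, "
          f"test error={1 - clf.score(X_te, y_te):.2f}")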

Part a) Explain why the decision boundary forms isolated “islands” around individual points when $\gamma$ is very large, as shown below.
image

Solution.

Recall that the representer theorem states that the decision function has the following form:
$$y(x) = \sum_{i=1}^{n} \alpha_i k(x, x_i) = \sum_{i=1}^{n} \alpha_i e^{-\gamma \|x - x_i\|^2}.$$
When $\gamma$ is very large, $k(x, x_i)$ is nearly zero whenever $x$ is even slightly far from $x_i$. That is why we observe isolated “islands” around individual points for a large $\gamma$.

Part b) Why does the test error (error on unseen data drawn from the same distribution) increase for large $\gamma$?

Solution.

If the decision boundary encloses individual training points in small, isolated regions, the classifier becomes highly sensitive to minor variations in the input. This behavior reflects overfitting: the model performs well on the training data but fails to generalize to new samples that are even slightly different.

Background I: Linear Algebra

Topics

| Title | Author | Booktitle | Chapter | Subtopics |
| --- | --- | --- | --- | --- |
| Motivation & In-class group activities | Hadi Daneshmand and Yangfeng Ji | Puzzles for Math of AI | Section I | (1) word embeddings, (2) denoising, (3) linear transformation of the circle |
| Orthogonality | Gilbert Strang | Introduction to Linear Algebra | Ch. 3 | (1) orthogonality, (2) least squares, (3) change of basis |
| Spectral Decomposition | Gilbert Strang | Introduction to Linear Algebra | Ch. 5 | Eigenvalues, eigenvectors, eigenvalue decomposition, matrix powers |
| Convolution | Gilbert Strang | Computational Science and Engineering | Ch. 4 | Fourier transform, convolution, low-pass filtering |
| Computational Linear Algebra | Gilbert Strang | Introduction to Linear Algebra | Ch. 7 | Condition number, power method, matrix norms |

Problem 8: Word embedding

One effective way to motivate linear algebra is word embeddings. Word embeddings not only demonstrate the power of representing objects as vectors, but also provide a foundation for understanding the mechanisms of language models and the advanced techniques used in natural language processing.

Required background: SVD, word embeddings in NLP, and GloVe (including how to load it in Python).

Part a)
Why is pattern recognition challenging when words are represented only at the character level?

Solution

Two words may share very similar character patterns yet convey completely different meanings—for example, desert and dessert.

Part b)
Word embeddings provide a vector representation for each word. Such a representation allows computers to do inference through computation. To demonstrate the power of word embeddings, we propose a simple in-class coding exercise.

Let $v_{\text{word}}$ denote the vector representation of a word. The following Python code computes the vector $v_{\text{madrid}} - v_{\text{spain}} + v_{\text{france}}$, where word representations are computed using the GloVe method, which provides a vector representation of each word. Return the 5 words with the highest cosine similarity to this vector.

!pip install --upgrade "scipy<1.14"
!pip install --upgrade gensim

import gensim.downloader as api

# load a pretrained embedding model (small enough for demo)
model = api.load("glove-wiki-gigaword-300")

# words
spain, madrid, france = "spain", "madrid", "france"

# vector arithmetic: v = v_madrid - v_spain + v_france
v = model[madrid] - model[spain] + model[france]

# find top 5 most similar words (excluding the originals)

Solution and Remarks. Loading the word embeddings takes about 3–5 minutes. The implementation below identifies the top five most similar words, with Paris emerging as the closest match. This demonstrates that word embeddings effectively capture semantic meaning, enabling computers to perform computations (such as linear algebra) for text processing.

# solution 

results = model.similar_by_vector(v, topn=10)
results = [(w, s) for w, s in results if w not in [spain, madrid, france]][:5]

results

Part c) The following code constructs a $6 \times 300$ matrix $X$ containing the word embeddings for "cat", "dog", "frog", "chair", "table", and "desk". By applying singular value decomposition (SVD), please group the rows of $X$ into two categories. Which words fall into the same category?

#!pip install --upgrade "scipy<1.14"
#!pip install --upgrade gensim

import gensim.downloader as api
import numpy as np
# load a pretrained embedding model (small enough for demo)
model = api.load("glove-wiki-gigaword-300")

 
words = ["cat", "dog", "frog", "table", "chair", "desk"]

# build embedding matrix (rows = words, cols = embedding dimensions)
X = np.vstack([model[w] for w in words])
Solution.

The following code effectively generates two groups: group 1 = {cat, dog, frog} and group 2 = {table, chair, desk}. Think of laying marbles (cat, dog, frog, table, chair, desk) on a flat floor. SVD looks at the “longest spread” direction. That spread happens to separate the animals from the furniture, so cutting along that direction naturally forms two groups.

Xc = X - X.mean(0, keepdims=True)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
# project on first singular vector and split by sign
proj = Xc @ Vt[0]
g1 = [w for w, z in zip(words, proj) if z >= 0]
g2 = [w for w, z in zip(words, proj) if z < 0]
print("Cluster 1:", g1)
print("Cluster 2:", g2)

Problem 9: Denoising of a song

Denoising is traditionally introduced in signal processing courses, but it also serves as an excellent example for motivating and revisiting key concepts in linear algebra—such as orthogonal bases, change of basis, and the comparison between the standard basis and the Fourier basis. We therefore recommend the following in-class exercise to help students deepen their understanding of these algebraic notions.

Part a) Fourier transform of a song

We load a song in Python using the code below. Then, we apply the np.fft.fft and np.fft.fftfreq functions to compute and plot the Fourier transform of the signal, showing frequency versus complex magnitude.

    # Here, we load a song
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
from scipy.ndimage import gaussian_filter1d
from IPython.display import Audio
# Load audio (longer example)
y, sr = librosa.load(librosa.example("nutcracker"))

# Apply  Gaussian smoothing
y_smooth = gaussian_filter1d(y, sigma=10)

# Play the song
Audio(y_smooth, rate=sr)
Solution.

The output is the following plot:

image

Remarkably, the Fourier transform is an example of a change of orthogonal basis. Recall that a basis $v_1, \ldots, v_n$ is orthonormal if
$$\langle v_j, v_i \rangle = \begin{cases} 0 & i \neq j \\ 1 & i = j. \end{cases}$$
A change of basis can be expressed as a simple matrix multiplication.

    # Plot smoothed waveform
plt.figure(figsize=(12, 4))
librosa.display.waveshow(y_smooth, sr=sr)
plt.title("Smoothed Waveform")
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()

# Compute FFT of smoothed signal
n = len(y_smooth)
Y = np.fft.fft(y_smooth)
Y = np.abs(Y[:n//2])
freqs = np.fft.fftfreq(n, 1/sr)[:n//2]

plt.figure(figsize=(12, 4))
plt.plot(freqs, Y)
plt.title("Frequency Spectrum (Gaussian Smoothed Signal)")
plt.xlabel("Frequency (Hz)")
plt.ylabel("Magnitude")
plt.xlim(0, 8000)
plt.show()
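
As a quick sanity check of the change-of-basis remark above (a sketch, not part of the Colab), the FFT of a short vector equals multiplication by the DFT matrix, whose rows form an orthogonal basis:

import numpy as np

n = 8
signal = np.random.randn(n)

# DFT matrix: F[k, t] = exp(-2*pi*i*k*t/n)
F = np.exp(-2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n)

print(np.allclose(F @ signal, np.fft.fft(signal)))   # change of basis = matrix multiplication
print(np.allclose(F @ F.conj().T, n * np.eye(n)))    # rows are orthogonal (each with norm sqrt(n))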

Part b) Why is the Fourier transform sparse?
In part (a), we observed that the Fourier transform of the signal has vanishing magnitude at high frequencies. Explain the reasoning behind this observation.

Solution.

Intuition. Low frequencies capture the coarse (rough) structure of a signal, while high frequencies encode its fine details. The following figure from Wikipedia illustrates the incremental process of signal reconstruction using Fourier basis functions.

  • First row: Reconstruction of the signal (blue) using only the first Fourier basis function.
  • Second row: Reconstruction using the first two basis functions.
  • (Additional rows continue by adding more frequencies, progressively refining the approximation.)

image

Theory. It is known that the Fourier transform of a $k$-times continuously differentiable function whose first $k-1$ derivatives are periodic decays at rate $1/|\omega|^k$, where $\omega$ is the frequency.

Theorem. Let $k$ be an integer. Suppose $f$ is a function of one variable whose $k$-th derivative is denoted by $f^{(k)}(x)$. Let $\hat{f}(\xi)$ denote the Fourier transform of $f$, where $\xi$ is the frequency variable. Then for all $\xi \neq 0$,
$$|\hat{f}(\xi)| \le \frac{1}{(2\pi|\xi|)^k} \, \big\| f^{(k)} \big\|_{L^1(\mathbb{R})}.$$
Equivalently, $|\hat{f}(\xi)| = O(|\xi|^{-k})$ as $|\xi| \to \infty$.


Proof Idea. The proof relies on the following property of the Fourier transform: if $\widehat{f^{(k)}}(\xi)$ denotes the Fourier transform of $f^{(k)}(x)$, then
$$\widehat{f^{(k)}}(\xi) = (2\pi i \xi)^k \hat{f}(\xi).$$

Part c) Fourier transform of noisy song
We add noise to the signal:

# Play the noisy song
y_smooth = y_smooth + 0.02*np.random.randn(len(y_smooth))
Audio(y_smooth, rate=sr)

You can hear background noise in the song. Plot the Fourier transform of the noisy song and compare it with the Fourier transform of the original song from part (a). What differences do you observe?

Solution. It is easy to adapt the code to plot the Fourier transform of the noisy signal. We observe a significant change at high frequencies. Since the noise is i.i.d. in time, it oscillates rapidly, which indicates that high-frequency sine waves are needed to capture it.

image

Part d) Denoising
Using the observation in part c, propose a method to denoise the song.

Solution. The idea is to set the high frequencies to zero and compute the inverse Fourier transform.

n = len(y_smooth)
Y = np.fft.fft(y_smooth)
Y[100000:n-100000] = 0            # zero out the high-frequency band in the middle of the spectrum
y_recon = np.fft.irfft(Y, n=n)    # inverse transform; irfft uses the first n//2 + 1 coefficients
Audio(y_recon, rate=sr)

Problem 10: Linear transform of the unit circle

In this problem, we review the concept of eigenvectors and eigendecomposition of square symmetric matrices. Consider the following linear function

$$f(x) = Ax, \qquad A := \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}, \qquad x \in \mathbb{R}^2.$$

Part a)
We compute the image of the unit circle under $f$ and obtain the following plot.
Based on this plot, what is the leading eigenvector of $A$?

image

Solution.

The leading eigenvector of $A$ is the solution of
$$\max_{x \in \text{unit circle}} \|Ax\|.$$
As we observe, the vector $[1, 1]/\sqrt{2}$ is the solution of the above optimization.
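
A quick numerical check of this answer (a sketch using numpy, not part of the Colab):

import numpy as np

A = np.array([[1.0, 0.5], [0.5, 1.0]])
eigvals, eigvecs = np.linalg.eigh(A)           # A is symmetric, so use eigh
print("eigenvalues:", eigvals)                 # 0.5 and 1.5
print("leading eigenvector:", eigvecs[:, -1])  # proportional to [1, 1]/sqrt(2)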

Part b) In the figure below, the image of the unit circle under $f \circ f$, i.e. $f(f(\text{circle}))$, is shown in green, and the image under $f(\text{circle})$ is shown in red. Explain why both the red and green ellipses share the same diagonal directions, and how these directions relate to the eigenvectors of $A$.
image

Background II: Optimization

Topics

| Title | Author | Booktitle | Chapter | Subtopics |
| --- | --- | --- | --- | --- |
| Smooth Unconstrained Optimization | Yurii Nesterov | Introductory Lectures on Convex Optimization | Ch. 2 | (1) limitations of search for optimization, (2) gradient descent, (3) algorithmic definition of convexity |
| Convergence analysis and 2nd-order optimization | Lieven Vandenberghe and Stephen P. Boyd | Convex Optimization | Ch. 9 | (1) condition number, (2) convergence rate of GD on strongly convex functions, (3) 2nd-order optimization methods (Newton's method) |
| Constrained optimization | Lieven Vandenberghe and Stephen P. Boyd | Convex Optimization | Ch. 10 and 11 | Lagrange multipliers, barrier functions, and interior-point methods |

Problem 11: Recovering horse from shadows

Motivation: A key skill in machine learning is the ability to formulate real-world problems as optimization problems. In this exercise, we practice that skill through a creative challenge: given two distinct shadows of a horse, reconstruct a three-dimensional point cloud representation of the horse.

horse

Problem Statement
We are given a 3D point cloud of a horse represented by a matrix $X \in \mathbb{R}^{48485 \times 3}$, where each row corresponds to a single point in the point-cloud representation. From this data, we generate two 2D projections (or shadows):
$$Y_1 = X W_1 \in \mathbb{R}^{48485 \times 2} \quad \text{and} \quad Y_2 = X W_2 \in \mathbb{R}^{48485 \times 2},$$
where $W_1$ and $W_2$ are random matrices.

The following figures show scatter plots of $Y_1$ and $Y_2$:

image
image

The goal is to reconstruct $X$ (the horse) from $Y_1$ and $Y_2$.

Part a) Formulate the reconstruction as an optimization problem.

Solution:

$$\min_{W_1, W_2, X} \; \|Y_1 - X W_1\|_F^2 + \|Y_2 - X W_2\|_F^2.$$

Part b) This Colab provides a code skeleton for the recovery and for plotting the 3D reconstruction of the horse. Complete the missing line marked with TODO to recover the original horse.

Solution: The TODO line should be completed based on the solution of part a):

loss = torch.norm(X @ A1 - S1)**2 + torch.norm(X @ A2 - S2)**2

After completing the code and running the optimization, students can observe the following 3D reconstruction of the horse.

image
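
For readers without access to the Colab, here is a minimal self-contained sketch of the same idea; the names A1, A2, S1, S2 mirror the loss line above, and the data is synthetic rather than the horse point cloud:

import torch

# Synthetic stand-in for the point cloud (the real X has 48485 rows)
torch.manual_seed(0)
X_true = torch.randn(1000, 3)
W1, W2 = torch.randn(3, 2), torch.randn(3, 2)
S1, S2 = X_true @ W1, X_true @ W2              # the two "shadows"

# Jointly optimize X, A1, A2 by gradient descent on the least-squares loss
X = torch.randn(1000, 3, requires_grad=True)
A1 = torch.randn(3, 2, requires_grad=True)
A2 = torch.randn(3, 2, requires_grad=True)
opt = torch.optim.Adam([X, A1, A2], lr=0.01)

for step in range(2000):
    opt.zero_grad()
    loss = torch.norm(X @ A1 - S1)**2 + torch.norm(X @ A2 - S2)**2
    loss.backward()
    opt.step()

print("final loss:", loss.item())
# Note: X is recovered only up to an invertible linear transform of the true point cloud.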

Problem 12: Geometric interpretation of the gradient

Goal: Introduce the geometric interpretation of partial derivatives. This helps students understand the gradient descent method.

Part A: Partial Derivative.
Given a function $f: \mathbb{R} \to \mathbb{R}$, the derivative of $f$ at $x$ is defined as
$$f'(x) = \lim_{t \to 0} \frac{f(x + t) - f(x)}{t}.$$

Is $f'(2)$ positive or negative for the following function? Hint: the orange line is the tangent line to $f$ at $x = 2$.

square_10

Solution.

When $t$ is small, we can estimate $f(2 + t)$ by the value $y_\Delta < y$ marked in the following plot.
square_10

Thus $\left(f(2 + t) - f(2)\right)/t < 0$ for a small $t$, which implies $f'(2) < 0$. Notably, $f'(x)$ is equal to the slope of the tangent line at $x$, i.e., $\tan(\alpha)$, where $\alpha$ is the angle of the tangent line.

Part 2: Gradient

Given a function $f$ of multiple variables $x_1, \ldots, x_d$, the gradient of $f$ is defined as the vector of partial derivatives:

$$\nabla f(x) = \left[ \lim_{t \to 0} \frac{f(x_1 + t, x_2, \ldots, x_d) - f(x)}{t}, \; \lim_{t \to 0} \frac{f(x_1, x_2 + t, \ldots, x_d) - f(x)}{t}, \; \ldots, \; \lim_{t \to 0} \frac{f(x_1, x_2, \ldots, x_d + t) - f(x)}{t} \right].$$

The figure below shows the contour plot of a function over

R2, where each blue diamond represents a level set with constant function value. At the purple point, what is the direction of the gradient?
square_10

Solution.

Recall the first-order Taylor expansion of $f$ around $x$:
$$f(x + y) \approx f(x) + \langle \nabla f(x), y \rangle.$$
Since the function value is constant along the blue curve, we must have $f(x + y) = f(x)$ for displacements $y$ that keep $x + y$ on the curve near $x$. This condition implies that the gradient $\nabla f(x)$ is orthogonal to the blue curve. Thus, at any point, the gradient is perpendicular to the level curve.

Problem 13: Getting Trapped in a Local Optimum and Non-convexity

Goal: Simulating gradient descent, and understanding the concept of local optimality

Part a: Tracking gradient descent

Recall that gradient descent optimizes a function $f$ through the recurrence
$$x_{k+1} = x_k - \gamma_k \nabla f(x_k),$$
where $\gamma_k \in \mathbb{R}_+$ is called the stepsize or learning rate. Consider the function $f(x) = x^3 + (x - 1)^2$ in the following plot. Mark the iterates $x_1$ and $x_2$ on the plot, starting from $x_0 = 0$ (marked with the orange point), with stepsize $\gamma_k = 0.02$.
square_10

Solution
It is easy to check the sign of the gradient using the slope of the tangent line. This allows us to check whether the gradient descent iterates move to the left or to the right. The following figure marks the first two iterates obtained by the following
demo

square_10

Part b: Convergence of gradient descent

Consider the example in part a. What is the limit point of the sequence of iterates, $\lim_{k \to \infty} x_k$, starting from $x_0 = 0$ with stepsize $\gamma = 0.02$? Does gradient descent converge to $\arg\min_x f(x)$? Hint: use the following demo for observation.

Solution.

Gradient descent iterates do not change once they reach a point where
$$\nabla f(x_k) = 0 \iff 3x^2 + 2(x - 1) = 0 \implies x \approx 0.54 \text{ or } x \approx -1.2.$$
Since gradient descent starts from $x_0 = 0$, it gets stuck at $x \approx 0.54$. Note that $x \approx 0.54$ is not the global minimum of $f$; in fact, the function is unbounded below. However, $x \approx 0.54$ is optimal within a local neighbourhood, i.e., it is a local minimum.
square_10
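
A short numerical sketch of these iterates, assuming $f(x) = x^3 + (x - 1)^2$ as above:

# Gradient descent on f(x) = x**3 + (x - 1)**2 starting from x0 = 0
def grad_f(x):
    return 3 * x**2 + 2 * (x - 1)

x, gamma = 0.0, 0.02
for k in range(2000):
    x = x - gamma * grad_f(x)
print(x)   # approaches 0.548..., the local (not global) minimum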

Problem 14: GD Convergence rate on quadratic functions

Goal: Introduce automatic differentiation with PyTorch and observe the role of conditioning in the convergence of gradient descent.
Part a: Implementing GD in PyTorch
Complete this Colab to implement GD on $f(x, y) = x^2 + (y - 1)^2$ using automatic differentiation.

Solution. Adding the following lines completes the code skeleton:

with torch.no_grad():
        x -= lr * x.grad   # gradient step on x (updates are done outside autograd tracking)
        y -= lr * y.grad   # gradient step on y

After completing the code, we observe the following convergence result

download (15)

Part b: Comparing the convergence speed on two quadratic functions
Run the experiment in part a for $f(x, y) = 0.1x^2 + (1 - y)^2$ and compare the convergence speed of gradient descent.

Solution.

After changing the coefficient to $\alpha = 0.1$ in the code, we get the following result:

image1

The convergence is significantly slower than for the function in part a. The instructor can use this example to introduce the convergence rate of GD on general quadratic functions of the form
$$f(x) = \sum_{i=1}^{d} \sum_{j=1}^{d} x_i x_j Q_{ij}, \qquad Q \in \mathbb{R}^{d \times d}.$$
Let $x_* = \arg\min_{x \in \mathbb{R}^d} f(x)$, and let $\mu$ and $L$ be the minimum and maximum eigenvalues of the symmetric positive semi-definite matrix $Q$, respectively. The ratio $\kappa = L/\mu$ is called the condition number, which governs the convergence of GD:
$$\|x_k - x_*\| = O\!\left( \left( \frac{\kappa - 1}{\kappa + 1} \right)^{k} \right).$$
For the function in part a, $\kappa = 1$, which indicates that gradient descent with a proper stepsize converges to the minimizer $x_*$ in one step. In part b, $\kappa = 10$, which leads to a slower convergence rate. It is easy to check that to reach $\|x_k - x_*\| \le \epsilon$ we need $k = O(\kappa \log(1/\epsilon))$ iterations, which grows linearly with the condition number.