Mathematical Foundation of AI
Authors: Hadi Daneshmand and Yangfeng Ji
We introduce a set of interactive classroom puzzles that move beyond traditional lecture-based instruction. These puzzles are designed to bridge theory and practice, deepen intuition, and strengthen conceptual understanding. All course materials—including the puzzles—can be readily adopted by instructors, students, and AI-related programs.
This page is developed as part of the prospective interdisciplinary AI minor at the University of Virginia.
Acknowledgement. This educational effort was supported in part by NSF grant 2343611 (PI: David Evans); this support is gratefully acknowledged.
Citations. You're welcome to use these puzzles in your classroom! We'd appreciate it if you could cite our work when reusing them.
Overview

The graph above illustrates the core topics planned for the Mathematical Foundations of AI course, a central component of the AI minor program at the University of Virginia. Topics on the left represent foundational background material, while those on the right correspond to key machine learning concepts. We currently provide interactive puzzles for three essential areas: supervised learning, linear algebra, and optimization.
Core I: Supervised Learning
Topics
| Title | Author | Booktitle | Chapter | Subtopics |
|---|---|---|---|---|
| Motivation & In-class group activities | Daneshmand and Ji | Puzzles for Math of AI | Core I | (1) logistic regression, (2) SVM, (3) perceptron, (4) linear regression for classification |
| Linear regression | Hastie, Tibshirani, and Friedman | Elements of Statistical Learning | Ch. 3 | (1) maximum likelihood estimation and regression, (2) lasso, (3) bias–variance decomposition |
| Linear classification | Hastie, Tibshirani, and Friedman | Elements of Statistical Learning | Ch. 4 | (1) logistic regression, (2) perceptron, (3) SVM |
| Non-linear supervised learning | Hastie, Tibshirani, and Friedman | Elements of Statistical Learning | Ch. 6 | (1) maximum-margin notion, (2) hard-margin SVM, (3) dual formulation of SVM, (4) soft-margin SVM |
We designed several interactive Colab visualizations to compare different classification methods and to demonstrate their respective strengths and weaknesses.
Problem 1: Breaking linear regression for classification
In the plot below, you see a set of blue and red points denoted by $\{(x_i, y_i)\}_{i=1}^{n}$. Each feature vector $x_i \in \mathbb{R}^2$ represents the location of a point, and each label $y_i \in \{-1, +1\}$ determines its color.
The dashed line corresponds to a separating hyperplane obtained via linear regression. Specifically, we solve
$$\min_{w, b} \; \sum_{i=1}^{n} \left( w^\top x_i + b - y_i \right)^2,$$
and then plot the decision boundary $\{x : w^\top x + b = 0\}$ of the linear predictor used to separate the blue points from the red points.
Question: Move the points in this colab so that the regression solution can no longer separate the two classes while the points are still linearly separable.

Solution
Linear regression is sensitive to outliers: the squared loss heavily penalizes points with large residuals, so moving a single point far away breaks the regression-based classifier, as shown below.

Problem 2: Breaking logistic regression
In this colab, you see a set of draggable points, each labeled as either blue or red. The dashed line corresponds to the separating boundary computed by logistic regression. Recall that logistic regression models the probability of a class given a feature vector $x$ as
$$P(y = 1 \mid x) = \frac{1}{1 + \exp\!\left(-(w^\top x + b)\right)}.$$
The parameters are obtained by minimizing the logistic loss
$$\min_{w, b} \; \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i (w^\top x_i + b)\right)\right).$$
Question
Change the location of a single point so that the logistic regression solution misclassifies many of the remaining points.
Solution.
This exercise demonstrates how sensitive the separating boundary of logistic regression is to misclassified points. If we move one of the blue points far to the right toward the red points, the logistic regression model shifts the classification boundary to reduce misclassification. This behavior is due to the exponential penalty for misclassified points in the likelihood function.
Problem 3: Rosenblatt’s algorithm
One way to fix the issue of logistic regression from the last problem is to change the loss function to the number of misclassified points:
$$\min_{w, b} \; \sum_{i=1}^{n} \mathbf{1}\!\left[\, y_i (w^\top x_i + b) \le 0 \,\right].$$
The problem is equivalent to learning a perceptron classifier. Since the objective function above is not differentiable, it cannot be directly optimized using standard gradient descent. Rosenblatt proposed an alternative algorithm capable of optimizing this objective despite its non-differentiability. However, the problem remains non-convex, which introduces challenges that we aim to investigate—specifically, under what conditions the algorithm successfully classifies the data points.
Question: In this Colab demo, you can drag the points, and the separating line computed by Rosenblatt’s algorithm will update interactively. Under what conditions can Rosenblatt’s algorithm successfully separate the blue and red points?
Solution
We observe that the algorithm successfully finds a separating line as long as the data remains linearly separable; however, its convergence slows significantly as the margin decreases.
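As a rough sketch of the algorithm behind the demo: Rosenblatt's update rule below is the standard perceptron update, while the data, the labels in $\{-1, +1\}$, and the stopping criterion are illustrative assumptions rather than the course's exact code.

```python
import numpy as np

# Illustrative, linearly separable toy data (not the course demo's points)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, 0], 0.5, size=(20, 2)),   # blue cluster, label -1
               rng.normal([+2, 0], 0.5, size=(20, 2))])  # red cluster, label +1
y = np.concatenate([-np.ones(20), np.ones(20)])

w, b = np.zeros(2), 0.0
for epoch in range(1000):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:   # point is misclassified (or on the boundary)
            w += yi * xi             # Rosenblatt's update
            b += yi
            mistakes += 1
    if mistakes == 0:                # all points classified correctly: stop
        break

print(f"separated after {epoch + 1} epochs: w = {w}, b = {b}")
```

On separable data the loop terminates; as the margin shrinks, the number of epochs needed grows, matching the observation above.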
Problem 4: Hard margin SVM
In this colab demo, you can drag the points, and the decision boundary of the hard-margin SVM (shown as the dashed line) updates interactively.
Recall: Hard-Margin SVM in PyTorch via Dual Optimization (no upper box constraint on the dual variables)
Dual:
$$\max_{\alpha \ge 0} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$
Separating line after finding the solution of the above maximization: $w = \sum_{i=1}^{n} \alpha_i y_i x_i$ and $b = y_k - w^\top x_k$ for any support vector $x_k$ (i.e., any $k$ with $\alpha_k > 0$).
Question: Identify a point whose movement can significantly alter the decision boundary of the hard-margin SVM.
Solution.
The task requires students to determine the support vectors and adjust their positions to reduce the margin separating the blue and red point sets. In the dual perspective, this corresponds to minimizing the distance between the convex hulls of the two classes, with the boundary points of these hulls serving as the support vectors (see the plot below).

Problem 5: Hard margin vs soft margin SVM
Compare the performance of hard-margin and soft-margin SVM in this colab.
Question. By dragging points, generate datasets on which hard-margin and soft-margin SVM lead to completely different classification decision boundaries.
Solution
Soft-margin SVM is far more robust to misclassified points: it keeps a large margin even when outliers would force the hard-margin solution into a very small margin. Compare the hard-margin decision boundary from Problem 4 with the soft-margin decision boundary in the plot below to see the main difference between these two fundamental algorithms.

Problem 6: Non-linear Classification
The question is how to use linear models when the classes are not linearly separable. Consider the following example data points.

In the plot above, the SVM decision boundary misclassifies nearly half of the points. How can we address this issue?
Open this interactive colab demo. You can modify the function phi in the demo to apply different feature transformations. The demo will then update the decision boundary based on your nonlinear transformation (by fitting a linear model on the transformed features).
Question: Find a feature transformation that makes the points perfectly separable.
Solution
When we include a feature that encodes the norm of each point, the classes become linearly separable in the transformed space. The mapping below illustrates one valid choice, though many alternatives exist; a small code sketch follows the figure.

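A minimal sketch of one such feature map, under the assumption (consistent with the demo above) that the classes are separated by their distance from the origin:

```python
import numpy as np

def phi(x):
    """Map a 2D point to 3D by appending its squared norm (one valid choice)."""
    x1, x2 = x
    return np.array([x1, x2, x1**2 + x2**2])

# Fitting a linear classifier on phi(x) corresponds to a circular decision
# boundary in the original space, separating an inner class from an outer class.
```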
Problem 7: Kernel Methods and Overfitting
In this interactive colab, you can adjust the kernel bandwidth and observe how changing it adjusts the decision boundary.

Question: Observe how the decision boundary and the test error (reported in the plot title) change as you adjust the RBF kernel bandwidth parameter $\gamma$, defined through $k(x, x') = \exp\!\left(-\gamma \|x - x'\|^2\right)$.
Part a) Explain why the decision boundary forms isolated “islands” around individual points when $\gamma$ is very large, as shown below.

Solution.
Recall that the representer theorem states that the decision function has the following form:
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \, k(x, x_i) + b,$$
and the decision boundary is its zero level set. When $\gamma$ is very large, $k(x, x_i) = \exp\!\left(-\gamma \|x - x_i\|^2\right)$ is essentially zero whenever $x$ is even slightly far from $x_i$. That is why we observe isolated “islands” around the individual training points for a large $\gamma$.
Part b) Why does the test error (the error on unseen data drawn from the same distribution) increase for large $\gamma$?
Solution.
If the decision boundary encloses individual training points in small, isolated regions, the classifier becomes highly sensitive to minor variations in the input. This behavior reflects overfitting: the model performs well on the training data but fails to generalize to new samples that are even slightly different.
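A hedged scikit-learn sketch of this effect (not the course Colab; the make_moons dataset and the listed gamma values are illustrative assumptions, with gamma playing the role of the bandwidth parameter $\gamma$ above):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for gamma in [0.5, 5, 5000]:
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    print(f"gamma={gamma:>6}: train acc={clf.score(X_tr, y_tr):.2f}, "
          f"test acc={clf.score(X_te, y_te):.2f}")

# With a very large gamma the classifier essentially memorizes the training
# points (train accuracy near 1) while the test accuracy collapses toward chance.
```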
Background I: Linear Algebra
Topics
| Title | Author | Booktitle | Chapter | Subtopics |
|---|---|---|---|---|
| Motivation & In-class group activities | Hadi Daneshmand and Yangfeng Ji | Puzzles for Math of AI | Section I | (1) word embeddings, (2) denoising, (3) linear transformation of the circle |
| Orthogonality | Gilbert Strang | Introduction to Linear Algebra | Ch. 3 | (1) orthogonality, (2) least squares, (3) change of basis |
| Spectral Decomposition | Gilbert Strang | Introduction to Linear Algebra | Ch. 5 | eigenvalues, eigenvectors, eigenvalue decomposition, matrix powers |
| Convolution | Gilbert Strang | Computational Science and Engineering | Ch. 4 | Fourier transform, convolution, low-pass filtering |
| Computational Linear Algebra | Gilbert Strang | Introduction to Linear Algebra | Ch. 7 | condition number, power method, matrix norms |
Problem 8: Word embedding
One effective way to motivate linear algebra is word embeddings. Word embeddings not only demonstrate the power of representing objects as vectors, but also provide a foundation for understanding the mechanisms of language models and the advanced techniques used in natural language processing.
Required background: singular value decomposition (SVD), word embeddings in NLP, and GloVe and how to load it in Python.
Part a)
Why is pattern recognition challenging when words are represented only at the character level?
Solution
Two words may share very similar character patterns yet convey completely different meanings—for example, desert and dessert.
Part b)
Word embeddings provide a vector representation for each word. Such representations allow computers to perform inference through computation. To demonstrate the power of word embeddings, we propose a simple in-class coding exercise.
Let $v_w$ denote the vector representation of the word $w$. The following Python code computes the vector $v = v_{\text{madrid}} - v_{\text{spain}} + v_{\text{france}}$, where the word representations are computed using the GloVe method, which provides a vector representation of each word. Return the words with the highest cosine similarity to this vector.
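```python
!pip install --upgrade "scipy<1.14"
!pip install --upgrade gensim

import gensim.downloader as api

# load a pretrained embedding model (small enough for a demo)
model = api.load("glove-wiki-gigaword-300")

# words
spain, madrid, france = "spain", "madrid", "france"

# vector arithmetic: v = v_madrid - v_spain + v_france
v = model[madrid] - model[spain] + model[france]

# find the top 5 most similar words (excluding the originals)
```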
Solution and Remarks. Loading the word embeddings takes about 3–5 minutes. The implementation below identifies the top five most similar words, with Paris emerging as the closest match. This demonstrates that word embeddings effectively capture semantic meaning, enabling computers to perform computations—such as linear algebra—for text processing.
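```python
# solution: words closest to v by cosine similarity
results = model.similar_by_vector(v, topn=10)
results = [(w, s) for w, s in results if w not in [spain, madrid, france]][:5]
results
```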
Part c) The following code constructs a matrix $X$ containing the word embeddings for "cat", "dog", "frog", "chair", "table", and "desk". By applying singular value decomposition (SVD), please group the rows of $X$ into two categories. Which words fall into the same category?
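```python
#!pip install --upgrade "scipy<1.14"
#!pip install --upgrade gensim
import gensim.downloader as api
import numpy as np

# load a pretrained embedding model (small enough for a demo)
model = api.load("glove-wiki-gigaword-300")

words = ["cat", "dog", "frog", "table", "chair", "desk"]

# build embedding matrix (rows = words, cols = embedding dimensions)
X = np.vstack([model[w] for w in words])
```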
Solution.
An SVD-based grouping (one possible implementation is sketched below) effectively generates two groups: group 1 = {cat, dog, frog} and group 2 = {table, chair, desk}. Think of laying marbles (cat, dog, frog, table, chair, desk) on a flat floor. SVD looks at the “longest spread” direction. That spread happens to separate the animals from the furniture, so cutting along that direction naturally forms two groups.
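Continuing from the code above, here is a minimal sketch of one way to do the grouping; centering the rows and splitting by the sign of the first left singular vector is our assumption about the implementation, not necessarily the course's exact code.

```python
# center the rows so the top singular direction captures the largest spread
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# split words by the sign of their projection onto the leading direction
group1 = [w for w, u in zip(words, U[:, 0]) if u >= 0]
group2 = [w for w, u in zip(words, U[:, 0]) if u < 0]
print(group1, group2)
```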
Problem 9: Denoising of a song
Denoising is traditionally introduced in signal processing courses, but it also serves as an excellent example for motivating and revisiting key concepts in linear algebra—such as orthogonal bases, change of basis, and the comparison between the standard basis and the Fourier basis. We therefore recommend the following in-class exercise to help students deepen their understanding of these algebraic notions.
Part a) Fourier transform of a song
We load a song in Python using the code below. Then, we apply the np.fft.fft and np.fft.fftfreq functions to compute and plot the Fourier transform of the signal, showing frequency versus complex magnitude.
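```python
# Here, we load a song
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
from scipy.ndimage import gaussian_filter1d
from IPython.display import Audio

# Load audio (longer example)
y, sr = librosa.load(librosa.example("nutcracker"))

# Apply Gaussian smoothing
y_smooth = gaussian_filter1d(y, sigma=10)

# Play the song
Audio(y_smooth, rate=sr)
```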
Solution.
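```python
# Plot smoothed waveform
plt.figure(figsize=(12, 4))
librosa.display.waveshow(y_smooth, sr=sr)
plt.title("Smoothed Waveform")
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()

# Compute FFT of smoothed signal
n = len(y_smooth)
Y = np.fft.fft(y_smooth)
Y = np.abs(Y[:n//2])
freqs = np.fft.fftfreq(n, 1/sr)[:n//2]

plt.figure(figsize=(12, 4))
plt.plot(freqs, Y)
plt.title("Frequency Spectrum (Gaussian Smoothed Signal)")
plt.xlabel("Frequency (Hz)")
plt.ylabel("Magnitude")
plt.xlim(0, 8000)
plt.show()
```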
Output will be the following plot

Remarkably, the Fourier transform is an example of a change of orthogonal basis. Recall that a basis $\{v_1, \dots, v_n\}$ is orthogonal if
$$\langle v_i, v_j \rangle = 0 \quad \text{for all } i \neq j.$$
A change of basis can be expressed as a simple matrix multiplication, as the small check below illustrates.
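The following sketch is our own illustration (not from the course code): multiplying a signal by the DFT matrix reproduces np.fft.fft, which makes the "matrix multiplication" remark concrete.

```python
import numpy as np

n = 8
x = np.random.randn(n)
k = np.arange(n)
F = np.exp(-2j * np.pi * np.outer(k, k) / n)   # DFT matrix: F[j, k] = e^{-2*pi*i*j*k/n}
print(np.allclose(F @ x, np.fft.fft(x)))        # True: the FFT is a change of basis
```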
Part b) Why is the Fourier transform sparse?
In part (a), we observed that the Fourier transform of the signal has vanishing magnitude at high frequencies. Explain the reasoning behind this observation.
Solution.
Intuition. Low frequencies capture the coarse (rough) structure of a signal, while high frequencies encode its fine details. The following figure from Wikipedia illustrates the incremental process of signal reconstruction using Fourier basis functions.
- First row: Reconstruction of the signal (blue) using only the first Fourier basis function.
- Second row: Reconstruction using the first two basis functions.
- (Additional rows continue by adding more frequencies, progressively refining the approximation.)

Theory. It is known that the Fourier transform of a $k$-times continuously differentiable function whose derivatives are periodic decays at rate $O(1/|\omega|^{k})$, where $\omega$ is the frequency.
Theorem. Let $k \ge 1$ be an integer. Suppose $f$ is a $k$-times continuously differentiable periodic function with $k$-th derivative $f^{(k)}$. Let $\hat{f}(\omega)$ denote the Fourier transform of $f$, which is a complex number for each frequency $\omega$. Then there is a constant $C$ such that for all $\omega \neq 0$, $|\hat{f}(\omega)| \le C / |\omega|^{k}$. Equivalently, $\hat{f}(\omega) = O(1/|\omega|^{k})$ as $|\omega| \to \infty$.
Proof Idea. The proof relies on the following property of the Fourier transform: if $\hat{f}(\omega)$ denotes the Fourier transform of $f$, then the Fourier transform of $f'$ is $i\omega \hat{f}(\omega)$. Applying this identity $k$ times and bounding the Fourier transform of $f^{(k)}$ by a constant yields the stated decay.
Part c) Fourier transform of noisy song
We add noise to the signal:
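```python
# Add i.i.d. Gaussian noise and play the noisy song
y_smooth = y_smooth + 0.02 * np.random.randn(len(y_smooth))
Audio(y_smooth, rate=sr)
```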
You can hear background noise in the song. Plot the Fourier transform of the noisy song and compare it with the Fourier transform of the original song from part (a). What differences do you observe?
Solution. It is easy to adapt the code from part (a) to plot the Fourier transform of the noisy signal. We observe a significant change in the high frequencies: since the noise is i.i.d. in time, it oscillates rapidly, so it is captured mainly by high-frequency sine waves.

Part d) Denoising
Using the observation in part c, propose a method to denoise the song.
Solution. The solution is to set the high-frequency Fourier coefficients to zero and compute the inverse Fourier transform, as in the snippet below.
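```python
n = len(y_smooth)
Y = np.fft.fft(y_smooth)
Y[100000:n-100000] = 0           # zero out the high-frequency coefficients
y_recon = np.fft.irfft(Y, n=n)   # inverse transform back to the time domain
Audio(y_recon, rate=sr)
```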
Problem 10: Linear transform of the unit circle
In this problem, we review the concept of eigenvectors and the eigendecomposition of square symmetric matrices. Consider a linear function $f(x) = A x$ defined by a symmetric matrix $A \in \mathbb{R}^{2 \times 2}$.
Part a)
We compute the image of the unit circle under $f$ and obtain the following plot.
Based on this plot, what is the leading eigenvector of $A$?

Solution.
The leading eigenvector of $A$ is the solution of
$$\max_{\|v\| = 1} \|A v\|.$$
As we observe in the plot, the direction of the longest axis of the ellipse is the solution of the above optimization.
Part b) In the figure below, the image of the unit circle under $f(f(\text{circle}))$ is shown in green, and the image under $f(\text{circle})$ is shown in red. Explain why both the red and green ellipses share the same diagonal directions, and how these directions relate to the eigenvectors of $A$ (see the sketch after the figure).

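The sketch below uses a hypothetical symmetric matrix (the matrix behind the course plot is not reproduced here) to illustrate the key fact: $A$ and $A^2$ share the same eigenvectors, so $f(\text{circle})$ and $f(f(\text{circle}))$ are ellipses with the same axis directions.

```python
import numpy as np
import matplotlib.pyplot as plt

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                            # hypothetical symmetric matrix

theta = np.linspace(0, 2 * np.pi, 200)
circle = np.vstack([np.cos(theta), np.sin(theta)])    # unit circle
f_circle = A @ circle                                 # red ellipse: f(circle)
ff_circle = A @ (A @ circle)                          # green ellipse: f(f(circle))

print("eigenvectors of A (columns):\n", np.linalg.eigh(A)[1])
print("eigenvectors of A @ A (columns):\n", np.linalg.eigh(A @ A)[1])

plt.plot(f_circle[0], f_circle[1], "r", label="f(circle)")
plt.plot(ff_circle[0], ff_circle[1], "g", label="f(f(circle))")
plt.axis("equal"); plt.legend(); plt.show()
```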
Background II: Optimization
Topics
| Title | Author | Booktitle | Chapter | Subtopics |
|---|---|---|---|---|
| Smooth Unconstrained Optimization | Yurii Nesterov | Introductory Lectures on Convex Optimization | Ch. 2 | (1) limitations of search for optimization, (2) gradient descent, (3) algorithmic definition of convexity |
| Convergence analysis and 2nd-order optimization | Stephen P. Boyd and Lieven Vandenberghe | Convex Optimization | Ch. 9 | (1) condition number, (2) convergence rate of GD on strongly convex functions, (3) 2nd-order optimization methods (Newton's method) |
| Constrained optimization | Stephen P. Boyd and Lieven Vandenberghe | Convex Optimization | Ch. 10 and 11 | Lagrange multipliers, barrier functions, and interior-point methods |
Problem 11: Recovering a horse from its shadows
Motivation: A key skill in machine learning is the ability to formulate real-world problems as optimization problems. In this exercise, we practice that skill through a creative challenge: given two distinct shadows of a horse, reconstruct a three-dimensional point cloud representation of the horse.

Problem Statement
We are given a 3D point cloud of a horse represented by a matrix $X \in \mathbb{R}^{n \times 3}$, where each row corresponds to a single point in the point-cloud representation. From this data, we generate two 2D projections (or shadows):
$$S_1 = X A_1 \quad \text{and} \quad S_2 = X A_2,$$
where $A_1, A_2 \in \mathbb{R}^{3 \times 2}$ are random matrices.
The following figures show scatter plots of $S_1$ and $S_2$:


The goal is to reconstruct $X$ (the horse) from $S_1$ and $S_2$.
Part a) Formulate the reconstruction as an optimization problem.
Part b) This colab provides a code skeleton for the recovery and for plotting the 3D reconstruction of the horse. Complete the missing line marked with TODO to recover the original horse.
Solution: The reconstruction in part a) can be formulated as $\min_{X} \|X A_1 - S_1\|_F^2 + \|X A_2 - S_2\|_F^2$, so the TODO line has to be completed as follows.
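```python
# sum of squared residuals of the two shadow projections
loss = torch.norm(X @ A1 - S1)**2 + torch.norm(X @ A2 - S2)**2
```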
After completing the code and running the optimization, students can observe the following 3D reconstruction of the horse.

Problem 12: Geometric interpretation of the gradient
Goal: Introduce the geometric interpretation of partial derivatives. This helps students understand the gradient descent method.
Part A: Partial Derivative.
Given a function $f(x, y)$, the partial derivative of $f$ with respect to $x$ at a point $(x_0, y_0)$ is defined as
$$\frac{\partial f}{\partial x}(x_0, y_0) = \lim_{\delta \to 0} \frac{f(x_0 + \delta, y_0) - f(x_0, y_0)}{\delta}.$$
Is $\frac{\partial f}{\partial x}(x_0, y_0)$ positive or negative for the following function? Hint: the orange line plots a line tangent to $f$ at $x_0$.

Solution.
When $\delta$ is small, we can estimate $f(x_0 + \delta, y_0)$ with the value of $f$ marked in the following plot.

Thus, for a small $\delta$, the partial derivative is approximated by $\frac{f(x_0 + \delta, y_0) - f(x_0, y_0)}{\delta}$, whose sign can be read off the plot. Notably, $\frac{\partial f}{\partial x}(x_0, y_0)$ is equal to $m$, where $m$ is the slope of the tangent line at $x_0$.
Part B: Gradient
Given a function of multiple variables $f(x_1, \dots, x_d)$, the gradient of $f$ is defined as the vector of partial derivatives:
$$\nabla f = \left[ \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_d} \right]^{\top}.$$
The figure below shows the contour plot of a function over $\mathbb{R}^2$, where each blue diamond represents a level set with constant function value. At the purple point, what is the direction of the gradient?

Solution.
Recall the first-order Taylor expansion of $f$ around $x$:
$$f(x + \delta) \approx f(x) + \nabla f(x)^\top \delta.$$
Since the function value is constant along the blue curve, we must have $\nabla f(x)^\top \delta \approx 0$ for small displacements $\delta$ along the curve in a neighborhood of $x$. This condition implies that the gradient is orthogonal to the blue curve. Thus, at any point, the gradient is perpendicular to the level set passing through it.
Problem 13: Getting Trapped in a Local Optimum and Non-convexity
Goal: Simulating gradient descent, and understanding the concept of local optimality
Part a: Tracking gradient descent
Recall that gradient descent optimizes a function $f$ through the following recurrence:
$$x_{k+1} = x_k - \gamma \, f'(x_k),$$
where $\gamma > 0$ is called the stepsize or learning rate. Consider the function $f$ in the following plot. Mark the iterates $x_1$ and $x_2$ on the plot, starting from the point $x_0$ marked in orange, using the stepsize given in the demo.

Solution
It is easy to check the sign of the gradient using the slope of the tangent line. This tells us whether the gradient descent iterates move to the left or to the right. The following figure marks the first two iterates obtained with the following demo.

Part b: Convergence of gradient descent
Consider the example in part a. What is the limit point of the sequence of iterates starting from $x_0$ with the same stepsize? Does gradient descent converge to the global minimum of $f$? Hint: Use the following demo for observation.
Solution.
Gradient descent iterates do not change once they reach a point where the gradient is zero, since the update $x_{k+1} = x_k - \gamma f'(x_k)$ then leaves $x_k$ unchanged. Starting from $x_0$, the iterates therefore get stuck at such a stationary point. Note that this point is not the global minimum of $f$; indeed, the minimum of this function is unbounded. However, the point is optimal within its local neighbourhood, i.e., it is a local optimum.

Problem 14: GD Convergence rate on quadratic functions
Goal: Introduce automatic differentiation with PyTorch and observe the role of conditioning in the convergence of gradient descent.
Part a: Implementing GD in PyTorch
Complete this colab to implement GD on a quadratic function using automatic differentiation.
Solution. Adding the following lines completes the code skeleton:
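```python
with torch.no_grad():      # update the parameters without tracking gradients
    x -= lr * x.grad
    y -= lr * y.grad
```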
After completing the code, we observe the following convergence result

Part b: Comparing the convergence speed on two quadratic functions
Run the experiment in part a for a second, differently conditioned quadratic function and compare the convergence speed of gradient descent.
Solution.
After changing the objective function in the code, we get the following result:

The convergence is significantly slower than for the function in part a. The instructor can use this example to introduce the convergence rate of GD on general quadratic functions of the form
$$f(x) = \tfrac{1}{2} x^\top A x.$$
Let $\mu$ and $L$ be the minimum and maximum eigenvalues of the symmetric positive definite matrix $A$, respectively. The ratio $\kappa = L/\mu$ is called the condition number, and it governs the convergence of GD: with stepsize $\gamma = 1/L$,
$$\|x_k - x_*\| \le \left(1 - \tfrac{1}{\kappa}\right)^{k} \|x_0 - x_*\|$$
holds, where $x_*$ is the minimizer. For the function in part a, $\kappa = 1$, which indicates that gradient descent with a proper stepsize converges to the minimizer in one step. For the function in part b, $\kappa$ is large, which leads to a slow convergence rate. It is easy to check that to guarantee $\|x_k - x_*\| \le \epsilon \|x_0 - x_*\|$, we need $k = O(\kappa \log(1/\epsilon))$ iterations, which grows linearly with the condition number.
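A hedged PyTorch sketch of this effect (our own illustration, not the course Colab; the two quadratics and the stepsizes below are assumptions chosen to give condition numbers 1 and 100):

```python
import torch

def run_gd(A, lr, steps=50):
    """Run gradient descent on f(x) = 0.5 * x^T A x and return the final loss."""
    x = torch.tensor([1.0, 1.0], requires_grad=True)
    for _ in range(steps):
        loss = 0.5 * x @ A @ x
        loss.backward()
        with torch.no_grad():
            x -= lr * x.grad          # gradient descent update
        x.grad.zero_()
    return (0.5 * x @ A @ x).item()

A_good = torch.eye(2)                              # eigenvalues 1, 1   -> kappa = 1
A_bad  = torch.diag(torch.tensor([1.0, 100.0]))    # eigenvalues 1, 100 -> kappa = 100

print("kappa=1  , final loss:", run_gd(A_good, lr=1.0))     # converges in one step
print("kappa=100, final loss:", run_gd(A_bad,  lr=1/100))   # much slower with lr = 1/L
```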