Understanding the mathematical foundations behind machine learning algorithms
Welcome to "Math for Machine Learning" – a comprehensive resource designed to help you understand the critical mathematical concepts that underpin modern machine learning algorithms. Whether you're a student, researcher, or professional, a solid grasp of these mathematical foundations will empower you to go beyond simply using machine learning tools to truly understanding how and why they work.
Machine learning is fundamentally a mathematical discipline. The algorithms that power everything from recommendation systems to autonomous vehicles are built on mathematical principles. Understanding these principles allows you to diagnose problems when models misbehave, adapt algorithms to your specific needs, and even design new approaches of your own.
Think of machine learning like driving a car. You can operate a car by learning which pedals to push and when to turn the steering wheel (using ML libraries), but if you understand how the engine works (the math), you can diagnose problems, make repairs, and even build custom vehicles suited to specific needs.
In this resource, we'll explore the three main branches of mathematics that form the foundation of machine learning: linear algebra, calculus, and probability and statistics.
We'll then connect these mathematical foundations to key machine learning concepts, showing how they all work together to create powerful learning algorithms.
Linear algebra is the language of machine learning. It provides the fundamental representation of data and the operations we can perform on that data. Most machine learning models, especially deep learning, rely heavily on linear algebra operations.
Vectors are ordered arrays of numbers that can represent points in space, features of data samples, or parameters in a model.
A vector \(\mathbf{x}\) with \(n\) components can be written as:
\[\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}\]
Think of a vector like a set of coordinates. Just as you might describe a location using latitude, longitude, and altitude, in machine learning, we describe data points using features. A house might be represented by a vector containing square footage, number of bedrooms, and age in years.
Use the sliders below to change the components of vectors A and B, and see how vector addition works.
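If you'd rather experiment in code, here is a minimal NumPy sketch of the same idea (the specific numbers are just illustrative):

```python
import numpy as np

# Two feature vectors, e.g. [square footage, bedrooms, age in years]
a = np.array([1500.0, 3.0, 10.0])
b = np.array([200.0, 1.0, -2.0])

# Vector addition is element-wise
print(a + b)        # [1700.    4.    8.]

# Scaling a vector multiplies every component
print(2 * a)        # [3000.    6.   20.]
```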
Matrices are rectangular arrays of numbers. In machine learning, they can represent collections of data samples, transformations, or the parameters of a model.
A matrix \(\mathbf{A}\) with \(m\) rows and \(n\) columns can be written as:
\[\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}\]
A matrix is like a spreadsheet or table. Each row might represent a different data sample (like a house), and each column might represent a different feature (like square footage or number of bedrooms).
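As a rough sketch, a small "houses" data matrix might look like this in NumPy (the values are made up for illustration):

```python
import numpy as np

# Each row is one house; columns are square footage, bedrooms, age in years
X = np.array([
    [1500, 3, 10],
    [2100, 4,  2],
    [ 850, 2, 35],
])

print(X.shape)      # (3, 3): 3 samples, 3 features
print(X[:, 0])      # the "square footage" column: [1500 2100  850]
```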
Matrix multiplication is the foundation of linear transformations and neural network operations.
For matrices \(\mathbf{A}\) (m×n) and \(\mathbf{B}\) (n×p), their product \(\mathbf{C} = \mathbf{A} \times \mathbf{B}\) (m×p) is given by:
\[c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}\]
Adjust the values to see how matrix multiplication works:
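The sum in the formula above is exactly what a nested loop computes. This sketch (using arbitrary small matrices) checks a hand-rolled version against NumPy's built-in product:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])         # shape (3, 2)

# c_ij = sum_k a_ik * b_kj
C = np.zeros((A.shape[0], B.shape[1]))
for i in range(A.shape[0]):
    for j in range(B.shape[1]):
        for k in range(A.shape[1]):
            C[i, j] += A[i, k] * B[k, j]

print(C)            # [[ 58.  64.] [139. 154.]]
print(A @ B)        # same result, computed by NumPy
```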
The dot product measures the similarity between vectors and is central to many ML algorithms.
For vectors \(\mathbf{a}\) and \(\mathbf{b}\), their dot product is:
\[\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n\]
The dot product is like measuring how similar two data points are. If you compute the dot product between a query and documents (after normalization), the highest values indicate the most relevant documents – this is the basic principle behind search engines!
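A small sketch of that idea, using made-up "query" and "document" vectors and normalizing by vector length (cosine similarity):

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of length-normalized vectors: 1 = same direction, 0 = orthogonal."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query = np.array([1.0, 0.0, 1.0])
doc1  = np.array([0.9, 0.1, 1.1])   # similar to the query
doc2  = np.array([0.0, 1.0, 0.0])   # unrelated

print(cosine_similarity(query, doc1))   # close to 1: a relevant document
print(cosine_similarity(query, doc2))   # close to 0: not relevant
```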
Eigenvalues and eigenvectors reveal the most important directions of variation in your data. Principal Component Analysis (PCA) uses them to perform dimensionality reduction.
For a matrix \(\mathbf{A}\), an eigenvector \(\mathbf{v}\) and corresponding eigenvalue \(\lambda\) satisfy:
\[\mathbf{A} \mathbf{v} = \lambda \mathbf{v}\]
See how PCA finds the principal components of a dataset:
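A bare-bones PCA sketch, assuming a small synthetic 2-D dataset generated only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: the second feature roughly follows the first
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=200)])

# Center the data, then eigendecompose the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# The eigenvector with the largest eigenvalue is the first principal component
order = np.argsort(eigenvalues)[::-1]
print("variance along each component:", eigenvalues[order])
print("first principal component:", eigenvectors[:, order[0]])

# Project onto the first component for a 1-D representation of the data
Z = X_centered @ eigenvectors[:, order[:1]]
print(Z.shape)      # (200, 1)
```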
Calculus provides the tools for optimization, which is at the heart of machine learning. It allows us to find the best parameters for models by minimizing error or maximizing likelihood.
Derivatives tell us how a function changes as its inputs change, which is crucial for finding optima.
The derivative of a function \(f(x)\) at a point \(x\) is defined as:
\[f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}\]
Imagine you're standing on a mountain and want to reach the highest peak. The derivative is like looking around you to see which way is uphill (positive derivative) or downhill (negative derivative). By continually moving uphill, you can find a peak (though it might only be a local maximum).
Move the slider to see how the derivative relates to the original function:
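The limit definition can be approximated directly by plugging in a small step size h; this sketch uses f(x) = x² purely as an example:

```python
def numerical_derivative(f, x, h=1e-6):
    """Finite-difference approximation of f'(x) from the limit definition."""
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 2
print(numerical_derivative(f, 3.0))   # approximately 6, since f'(x) = 2x
```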
The gradient extends the concept of derivatives to functions of multiple variables, giving us the direction of steepest ascent.
For a function \(f(x_1, x_2, \ldots, x_n)\), the gradient is:
\[\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}\]
See how gradient descent finds the minimum of a function:
Gradient descent is like placing a ball on a hilly landscape and letting it roll downhill until it reaches a valley. The learning rate is like the ball's weight – too light (small learning rate) and it takes forever to reach the bottom; too heavy (large learning rate) and it might bounce out of the valley or oscillate wildly.
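A minimal gradient descent sketch, assuming a simple bowl-shaped function chosen for illustration:

```python
import numpy as np

def grad_f(p):
    """Gradient of f(x, y) = x**2 + y**2, a simple bowl-shaped function."""
    return 2 * p

p = np.array([3.0, -4.0])       # starting point
learning_rate = 0.1

for step in range(100):
    p = p - learning_rate * grad_f(p)   # move against the gradient

print(p)    # very close to the minimum at (0, 0)
```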
Backpropagation uses the chain rule from calculus to efficiently compute the gradients needed to train neural networks.
For composite functions, the chain rule states:
\[\frac{d}{dx}(f(g(x))) = f'(g(x)) \cdot g'(x)\]
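To make the rule concrete, here is a small check with f(u) = sin(u) and g(x) = x² (an arbitrary choice), comparing the chain-rule derivative against a finite-difference estimate:

```python
import numpy as np

# f(g(x)) with f(u) = sin(u) and g(x) = x**2
g  = lambda x: x ** 2
fg = lambda x: np.sin(g(x))

# Chain rule: d/dx sin(x**2) = cos(x**2) * 2x
def chain_rule_derivative(x):
    return np.cos(g(x)) * 2 * x

x, h = 1.3, 1e-6
numeric = (fg(x + h) - fg(x)) / h
print(numeric, chain_rule_derivative(x))   # the two values agree closely
```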
Various optimization algorithms build on the basic concept of gradient descent, adding momentum, adaptive learning rates, and other refinements to improve convergence.
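As one example of such a refinement, gradient descent with momentum keeps a running "velocity" that accumulates past gradients. A minimal sketch (the hyperparameter values are placeholders, not tuned recommendations):

```python
import numpy as np

def grad_f(p):
    return 2 * p            # gradient of f(p) = sum(p**2)

p = np.array([3.0, -4.0])
velocity = np.zeros_like(p)
learning_rate, momentum = 0.1, 0.9

for step in range(300):
    velocity = momentum * velocity - learning_rate * grad_f(p)
    p = p + velocity        # the velocity smooths and accelerates the updates

print(p)    # ends up very close to the minimum at (0, 0)
```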
Probability and statistics provide the framework for reasoning about uncertainty, making inferences from data, and evaluating model performance.
Probability distributions describe the likelihood of different outcomes and are fundamental to many ML algorithms.
The probability density function of a normal distribution with mean \(\mu\) and standard deviation \(\sigma\) is:
\[f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\]
The normal distribution is like the distribution of performance metrics in a large population. Most values cluster around the average (mean), with fewer and fewer examples as you move toward exceptional performance (the tails of the distribution).
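The density formula translates directly into a few lines of code; this sketch evaluates it for the standard normal (mean 0, standard deviation 1):

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of a normal distribution with mean mu and std sigma."""
    coeff = 1.0 / (sigma * np.sqrt(2 * np.pi))
    return coeff * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

print(normal_pdf(0.0))          # about 0.3989, the peak of the standard normal
print(normal_pdf(2.0))          # much smaller: values in the tails are rare
```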
Bayes' theorem provides a way to update our beliefs as we gather new evidence, forming the foundation of many ML approaches.
Bayes' theorem states:
\[P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}\]
Where \(P(A|B)\) is the posterior probability of \(A\) given the evidence \(B\), \(P(B|A)\) is the likelihood of the evidence given \(A\), \(P(A)\) is the prior probability of \(A\), and \(P(B)\) is the overall probability of the evidence.
Consider a medical test for a rare disease. If you test positive, should you be worried? Bayes' theorem helps us understand that the answer depends not just on the accuracy of the test but also on how rare the disease is. A 99% accurate test for a 1-in-10,000 disease will still yield mostly false positives!
Adjust the parameters to see how Bayes' theorem works:
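You can also work through the medical-test example numerically. The sketch below assumes "99% accurate" means 99% sensitivity and 99% specificity (an interpretation made for illustration):

```python
# Medical-test example: a rare disease and a "99% accurate" test
p_disease = 1 / 10_000          # prior: P(A)
p_pos_given_disease = 0.99      # likelihood: P(B|A)
p_pos_given_healthy = 0.01      # false-positive rate

# Evidence: P(B), the overall probability of testing positive
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: P(A|B) via Bayes' theorem
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"{p_disease_given_pos:.3%}")   # roughly 1%: most positives are false
```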
Maximum likelihood estimation (MLE) is a method for estimating the parameters of a model by finding the values that maximize the likelihood of the observed data.
Given a data set \(X = \{x_1, x_2, \ldots, x_n\}\) and a model with parameters \(\theta\), MLE finds:
\[\hat{\theta} = \underset{\theta}{\operatorname{argmax}} \, P(X|\theta)\]
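A tiny sketch of the argmax idea, using invented coin-flip data under a Bernoulli model (where the closed-form MLE is simply the sample mean):

```python
import numpy as np

# Observed coin flips (1 = heads); the data are invented for illustration
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

# Log-likelihood of the data under a Bernoulli model with parameter theta
def log_likelihood(theta):
    return np.sum(flips * np.log(theta) + (1 - flips) * np.log(1 - theta))

# Grid search for the argmax (the closed-form answer is the sample mean)
thetas = np.linspace(0.01, 0.99, 99)
best = thetas[np.argmax([log_likelihood(t) for t in thetas])]
print(best, flips.mean())   # both are about 0.7
```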
Cross-validation helps assess how well a model will generalize to unseen data by partitioning the data and using different subsets for training and validation.
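A minimal k-fold splitting sketch in NumPy (real pipelines would typically use a library helper, but the mechanics are just shuffling and partitioning indices):

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and split them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    return np.array_split(indices, k)

folds = k_fold_indices(20, k=5)
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # Train on train_idx, evaluate on val_idx, then average the k scores
    print(f"fold {i}: {len(train_idx)} train, {len(val_idx)} validation samples")
```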
Now that we've covered the mathematical foundations, let's see how they come together in key machine learning concepts and algorithms.
Linear regression models the relationship between a dependent variable and one or more independent variables using a linear equation.
For a simple linear regression:
\[y = \beta_0 + \beta_1 x + \epsilon\]
Where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon\) is the error term.
Click to add data points, then see the regression line adjust:
Linear regression is like finding the "line of best fit" on a scatter plot. Imagine you're trying to summarize the relationship between study hours and test scores with a single line – you want the line that gets as close as possible to all the data points.
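Here is a least-squares sketch of that study-hours example, with invented numbers, using a design matrix that includes a column of ones for the intercept:

```python
import numpy as np

# Study hours vs. test scores (numbers invented for illustration)
hours  = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([52.0, 58.0, 65.0, 70.0, 78.0])

# Design matrix with a column of ones so the model learns an intercept
X = np.column_stack([np.ones_like(hours), hours])
beta, *_ = np.linalg.lstsq(X, scores, rcond=None)

intercept, slope = beta
print(intercept, slope)                 # beta_0 and beta_1
print(intercept + slope * 6.0)          # predicted score for 6 hours of study
```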
Logistic regression models the probability of an event occurring based on one or more predictor variables, using the logistic function to constrain the output to [0,1].
The logistic function is:
\[\sigma(z) = \frac{1}{1 + e^{-z}}\]
For logistic regression with parameters \(\mathbf{w}\) and input \(\mathbf{x}\):
\[P(y=1|\mathbf{x}) = \sigma(\mathbf{w} \cdot \mathbf{x})\]
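A short sketch of making a prediction with logistic regression, assuming the weights have already been learned (the values here are placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weights are assumed to be already trained; the numbers are placeholders
w = np.array([0.8, -1.2, 0.3])
x = np.array([1.0, 0.5, 2.0])

p = sigmoid(np.dot(w, x))
print(p)                        # P(y = 1 | x), a value between 0 and 1
print("class:", int(p >= 0.5))  # threshold at 0.5 for a hard prediction
```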
Neural networks are composed of layers of interconnected nodes or "neurons" that process information using weighted connections and activation functions.
For a single neuron with inputs \(\mathbf{x}\), weights \(\mathbf{w}\), bias \(b\), and activation function \(f\):
\[y = f(\mathbf{w} \cdot \mathbf{x} + b)\]
A neural network is like an assembly line in a factory. Raw materials (input data) enter the system, and each worker (neuron) performs a simple operation before passing the partially processed goods to the next station. By the end of the line, the raw materials have been transformed into a finished product (prediction).
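The forward pass of a single neuron is just the formula above. A minimal sketch, using ReLU as one common choice of activation function and illustrative values:

```python
import numpy as np

def relu(z):
    """A common activation function: max(0, z)."""
    return np.maximum(0.0, z)

# One neuron: inputs, weights, bias, and activation (values are illustrative)
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.4,  0.3, 0.1])
b = 0.2

y = relu(np.dot(w, x) + b)
print(y)    # the neuron's output, passed on to the next layer
```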
Decision trees are hierarchical models that make sequential decisions based on feature values, creating a tree-like structure of decisions and their consequences.
Decision trees often use entropy to measure impurity:
\[H(S) = -\sum_{i} p_i \log_2(p_i)\]
Where \(p_i\) is the proportion of class \(i\) in set \(S\).
A decision tree is like a flow chart of yes/no questions. Starting at the top, you ask a question about the data ("Is feature X greater than value Y?"), and based on the answer, you follow a branch to the next question or to a final decision.
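The entropy formula is easy to compute directly; this sketch evaluates it for a few hand-picked sets of class labels:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of the class labels in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([0, 0, 1, 1]))   # 1.0: maximally mixed for two classes
print(entropy([0, 0, 0, 1]))   # about 0.81: mostly one class
print(entropy([0, 0, 0, 0]))   # -0.0, i.e. zero: a pure node, nothing left to split
```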
Clustering groups similar data points together based on their features, allowing for unsupervised learning and pattern discovery.
K-means, one of the most widely used clustering algorithms, is like sorting students into study groups based on their learning style and performance. You want students with similar characteristics to be in the same group so they can work effectively together.
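A bare-bones k-means sketch on a tiny made-up dataset. For simplicity it initializes the centroids from the first k points; real implementations typically use random or k-means++ initialization:

```python
import numpy as np

def k_means(X, k=2, n_iters=10):
    """Assign each point to its nearest centroid, then move each centroid
    to the mean of its assigned points, and repeat."""
    centroids = X[:k].copy()
    for _ in range(n_iters):
        # Distance from every point to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Two obvious groups of 2-D points, listed alternately (made up for illustration)
X = np.array([[1.0, 1.0], [5.0, 5.0], [1.2, 0.8],
              [5.2, 4.8], [0.9, 1.1], [4.9, 5.1]])
labels, centroids = k_means(X, k=2)
print(labels)       # [0 1 0 1 0 1]: each group ends up in its own cluster
print(centroids)    # one centroid near (1, 1), the other near (5, 5)
```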