General Linear Regression

This is a work in progress. It is meant to capture the mathematical derivation of how general linear regression works. It is math-heavy.

Introduction

Assume you have a data set with $$N$$ independent values $$x_k$$ and corresponding dependent values $$y_k$$, stored in column vectors $$\boldsymbol{x}$$ and $$\boldsymbol{y}$$, respectively. You also have some reasonable scientific model that relates the dependent variable to the independent variable. If that model can be written as a general linear model, that means you can represent the fit function $$\hat{y}(x)$$ as:

$$ \begin{align*} \hat{y}(x)&=\sum_{m=0}^{M-1}a_m\phi_m(x) \end{align*} $$

where $$\phi_m(x)$$ is the $$m$$th basis function in your model and $$a_m$$ is its constant coefficient. For instance, if you end up with the model:

$$ \begin{align*} \hat{y}(x)&=3e^{-2x}+5 \end{align*} $$

then you could map this to the summation with $$M=2$$ basis functions total and:

$$ \begin{align*} a_0 &= 3 & \phi_0(x) &= e^{-2x} \\ a_1 &= 5 & \phi_1(x) &= x^0 \end{align*} $$

Note for the second term that each $$\phi_m(x)$$ must be a function of $$x$$; constants are thus the coefficients on an implied $$x^0$$.
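To make the mapping concrete in code, here is a minimal sketch (using NumPy; the variable names are just placeholders) that stores the example model as a list of coefficients and a list of basis functions:

```python
import numpy as np

# Coefficients and basis functions for yhat(x) = 3*exp(-2x) + 5
a = [3.0, 5.0]                        # a_0 and a_1
phi = [lambda x: np.exp(-2 * x),      # phi_0(x) = e^(-2x)
       lambda x: x ** 0]              # phi_1(x) = x^0 (the constant term)

def yhat(x):
    """Evaluate the general linear model sum over m of a_m * phi_m(x)."""
    return sum(a_m * phi_m(x) for a_m, phi_m in zip(a, phi))

print(yhat(np.array([0.0, 0.5, 1.0])))   # model values at a few sample points
```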

The goal, once we have established a scientifically valid model, is to determine the "best" set of coefficients for that model. We are going to define the "best" set of coefficients as the values of $$a_m$$ that minimize the sum of the squares of the estimate residuals, $$S_r$$, for that particular model. Recall that:

$$ \begin{align*} S_r&=\sum_k\left(y_k-\hat{y}_k\right)^2=\sum_k\left(\hat{y}_k-y_k\right)^2 \end{align*} $$
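In code, $$S_r$$ is just the sum of the squared differences between the measured values and the model estimates. A small sketch with made-up data:

```python
import numpy as np

# Hypothetical data set and the example model yhat(x) = 3*exp(-2x) + 5
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([8.1, 6.0, 5.5, 5.2, 5.1])
yhat = 3 * np.exp(-2 * x) + 5          # model estimates at each x_k

# Sum of the squares of the estimate residuals
S_r = np.sum((y - yhat) ** 2)
print(S_r)
```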

Finding the coefficients for the "constant" model

The simplest model you might come up with is a single constant, $$\hat{y}(x)=a_0x^0$$. This means that the $$S_r$$ value, using the second version above, will be:

$$\begin{align*} S_r&=\sum_k\left(\hat{y}(x_k)-y_k\right)^2=\sum_k\left(a_0-y_k\right)^2 \end{align*}$$

Keep in mind that the only variable right now is $$a_0$$; all the $$x$$ and $$y$$ values are fixed numbers from your data set, so the only parameter you can adjust is $$a_0$$. This means that to minimize the $$S_r$$ value, you need to solve:

$$ \begin{align*} \frac{dS_r}{da_0}&=0 \end{align*}$$

Here goes!

$$ \begin{align*} \frac{dS_r}{da_0}=\frac{d}{da_0}\left(\sum_k\left(a_0-y_k\right)^2 \right)&=0 \end{align*}$$

The derivative of a sum is the same as the sum of derivatives, so put the derivative operator inside:

$$ \begin{align*}\sum_k\frac{d}{da_0}\left(a_0-y_k\right)^2&=0 \end{align*}$$

Use the power rule to get that $$d(u^2)=2u~du$$ and note that $$u=(a_0-y_k)$$ so $$\frac{du}{da_0}=1$$ here:

$$ \begin{align*} \sum_k2\left(a_0-y_k\right)&=0\end{align*}$$

Since we are setting the left side to 0, the 2 is irrelevant. Also, the summand can be split into two parts...

$$ \begin{align*} \sum_k\left(a_0\right)-\sum_k\left(y_k\right)&=0 \end{align*}$$

...and then the parts can be separated.

$$ \begin{align*} \sum_k\left(a_0\right)&=\sum_k\left(y_k\right) \end{align*}$$

Recognize that $$a_0$$ is a constant; since you are adding that constant to itself for each of the $$N$$ data points, you can replace the summation with:

$$ \begin{align*} Na_0&=\sum_k\left(y_k\right)\end{align*}$$

Dividing by $$N$$ reveals the answer:

$$ \begin{align*} a_0&=\frac{1}{N}\sum_k\left(y_k\right)=\bar{y} \end{align*}$$

The best constant with which to model a data set is its own average! Admittedly, this will lead to an $$r^2$$ value of 0, which is not great, but it is as good as you can get with a model containing nothing more than a constant.
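A quick numeric check of this result (a sketch with made-up data): fitting a degree-zero polynomial with np.polyfit gives the same value as np.mean.

```python
import numpy as np

y = np.array([2.0, 3.5, 1.0, 4.5, 3.0])           # hypothetical dependent data

a0_mean = np.mean(y)                               # the derivation above: a_0 = y-bar
a0_fit = np.polyfit(np.arange(len(y)), y, 0)[0]    # best-fit constant (degree-0 polynomial)

print(a0_mean, a0_fit)                             # both are 2.8, up to round-off
```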

Finding the coefficients for a "straight line" model

So, that was relatively painless (except for the typesetting). Next up, let's look at a slightly more complex model: a straight line. That is to say, $$\hat{y}(x)=a_0x^1+a_1x^0$$. Yes, the indexing is a little unfortunate but that's the way it goes. This means that the $$S_r$$ value, using the second version above, will be:

$$\begin{align*} S_r&=\sum_k\left(\hat{y}(x_k)-y_k\right)^2=\sum_k\left(a_0x_k+a_1-y_k\right)^2 \end{align*}$$

There are now two variables: $$a_0$$ and $$a_1$$. This means that to minimize the $$S_r$$ value, you need to solve two equations simultaneously:

$$ \begin{align*} \frac{\partial S_r}{\partial a_0}&=0\\ \frac{\partial S_r}{\partial a_1}&=0 \end{align*}$$

where the $$\partial$$ symbol indicates a partial derivative. A partial derivative simply means that you are looking at how something changes with respect to changes in only one of its variables - all the other variables are assumed constant. For example, the volume of a cylinder can be given by $$V=\pi r^2h$$ where $$r$$ is the radius of the base and $$h$$ is the height. Using partial derivatives, you can calculate how the volume changes either as a function of changing the radius of the base or as a function of changing the height:

$$ \begin{align*} \frac{\partial V}{\partial r}&=2\pi r h & \frac{\partial V}{\partial h}&=\pi r^2 \end{align*}$$

On the left, the $$h$$ is taken as a constant; on the right, the $$r$$ is taken as a constant.
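If you want to check partial derivatives like these symbolically, SymPy will do it (a small aside, not part of the derivation itself):

```python
import sympy as sp

r, h = sp.symbols('r h', positive=True)
V = sp.pi * r**2 * h            # volume of a cylinder

print(sp.diff(V, r))            # 2*pi*r*h, with h held constant
print(sp.diff(V, h))            # pi*r**2, with r held constant
```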

Here goes round two! First, let's look at $$a_0$$:

$$ \begin{align*} \frac{\partial S_r}{\partial a_0}=\frac{\partial}{\partial a_0}\left(\sum_k\left(a_0x_k+a_1-y_k\right)^2 \right)&=0 \end{align*}$$

The derivative of a sum is the same as the sum of derivatives, so put the derivative operator inside:

$$ \begin{align*}\sum_k\frac{\partial}{\partial a_0}\left(a_0x_k+a_1-y_k\right)^2&=0 \end{align*}$$

Use the power rule to get that $$d(u^2)=2u~du$$ and note that $$u=(a_0x_k+a_1-y_k)$$ so $$\frac{du}{da_0}=x_k$$ here (this is different from what it was above):

$$ \begin{align*} \sum_k2\left(a_0x_k+a_1-y_k\right)x_k=\sum_k2\left(a_0x_{k}^{2}+a_1x_k-y_kx_k\right)&=0\end{align*}$$

Since we are setting the left side to 0, the 2 is irrelevant. Also, the summand can be split into three parts...

$$ \begin{align*} \sum_k\left(a_0x_{k}^{2}\right)+\sum_k\left(a_1x_{k}\right)-\sum_k\left(y_kx_k\right)&=0 \end{align*}$$

...and then the parts can be separated; I also changed the order of the product on the right for reasons that will become clear later.

$$ \begin{align*} \sum_k\left(a_0x_{k}^{2}\right)+\sum_k\left(a_1x_{k}\right)&=\sum_k\left(x_ky_k\right) \end{align*}$$

None of these terms is simple enough to do anything with other than to recognize that $$a_0$$ and $$a_1$$ are not functions of $$k$$ and can thus be brought out of the summations:

$$ \begin{align} \tag{1} a_0\sum_k\left(x_{k}^{2}\right)+a_1\sum_k\left(x_{k}\right)&=\sum_k\left(x_ky_k\right) \end{align}$$

Now let's look at $$a_1$$:

$$ \begin{align*} \frac{\partial S_r}{\partial a_1}=\frac{\partial}{\partial a_1}\left(\sum_k\left(a_0x_k+a_1-y_k\right)^2 \right)&=0 \end{align*}$$

The derivative of a sum is the same as the sum of derivatives, so put the derivative operator inside:

$$ \begin{align*}\sum_k\frac{\partial}{\partial a_1}\left(a_0x_k+a_1-y_k\right)^2&=0 \end{align*}$$

Use the power rule to get that $$d(u^2)=2u~du$$ and note that $$u=(a_0x_k+a_1-y_k)$$ so $$\frac{du}{da_1}=1$$ here:

$$ \begin{align*} \sum_k2\left(a_0x_k+a_1-y_k\right)(1)=\sum_k2\left(a_0x_{k}+a_1-y_k\right)&=0\end{align*}$$

Since we are setting the left side to 0, the 2 is irrelevant. Also, the summand can be split into three parts...

$$ \begin{align*} \sum_k\left(a_0x_{k}\right)+\sum_k\left(a_1\right)-\sum_k\left(y_k\right)&=0 \end{align*}$$

...and then the parts can be separated.

$$ \begin{align*} \sum_k\left(a_0x_{k}\right)+\sum_k\left(a_1\right)&=\sum_k\left(y_k\right) \end{align*}$$

While the second term is actually simple enough to do something with (it is just adding up $$a_1$$ a total of $$N$$ times and thus could be written as $$Na_1$$), we are simply going to recognize that $$a_0$$ and $$a_1$$ are not functions of $$k$$ and can thus be brought out of the summations:

$$ \begin{align} \tag{2} a_0\sum_k\left(x_{k}\right)+a_1\sum_k\left(1\right)&=\sum_k\left(y_k\right) \end{align}$$

If we wanted to be explicit about it, we could note that $$\phi_1(x)$$ here is $$x^0$$ and write equation (2) as:

$$ \begin{align} \tag{2e} a_0\sum_k\left(x^{0}_{k}x^{1}_{k}\right)+a_1\sum_k\left(x^{0}_{k}x^{0}_{k}\right)&=\sum_k\left(x^{0}_{k}y_k\right) \end{align}$$

In fact, we could do the same with equation (1) above and write it as:

$$ \begin{align} \tag{1e} a_0\sum_k\left(x^{1}_{k}x^{1}_{k}\right)+a_1\sum_k\left(x^{1}_{k}x^{0}_{k}\right)&=\sum_k\left(x^{1}_{k}y_k\right) \end{align}$$

Equations (1e) and (2e) give two equations with two unknowns; putting them in matrix form yields:

$$ \begin{align} \begin{bmatrix} \sum_k\left(x^{1}_{k}x^{1}_{k}\right) & \sum_k\left(x^{1}_{k}x^{0}_{k}\right) \\ \sum_k\left(x^{0}_{k}x^{1}_{k}\right) & \sum_k\left(x^{0}_{k}x^{0}_{k}\right) \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix}&= \begin{bmatrix} \sum_k\left(x^{1}_{k}y_k\right) \\ \sum_k\left(x^{0}_{k}y_k\right) \end{bmatrix} \end{align}$$
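As a sanity check on this 2x2 system, here is a sketch with hypothetical data that builds the four sums directly, solves the system with NumPy, and compares against np.polyfit for a degree-1 fit:

```python
import numpy as np

# Hypothetical data
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Build the 2x2 system of normal equations from the sums above
A = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    len(x)   ]])
b = np.array([np.sum(x * y), np.sum(y)])

a0, a1 = np.linalg.solve(A, b)     # slope and intercept
print(a0, a1)
print(np.polyfit(x, y, 1))         # same slope and intercept from np.polyfit
```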

Side bar: Dot Products

Now is a good time to take a moment to remember what a dot product is. If you have column vectors $$p$$ and $$q$$, each with the same number of elements, the dot product of $$p$$ and $$q$$ is given by $$\sum_kp_kq_k$$. The vector-math version is that the same dot product is the matrix product $$p^Tq$$, where $$p^T$$ is the transpose of $$p$$; in this case, that means turning $$p$$ (the column vector) into $$p^T$$ (a row vector) and doing matrix multiplication with $$q$$.
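In NumPy terms, all three views of the dot product give the same number (a quick sketch):

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 5.0, 6.0])

print(np.sum(p * q))   # sum_k p_k q_k  -> 32.0
print(np.dot(p, q))    # dot product    -> 32.0
print(p.T @ q)         # p^T q          -> 32.0 (transpose is a no-op on 1-D arrays)
```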

Re-writing sums as dot products

If we have $$N$$ data points and a model of $$\hat{y}(x)=a_0x^1+a_1x^0$$, what we are really trying to solve is:

$$ \begin{align*} \begin{bmatrix} x_0^1 & x_0^0\\ x_1^1 & x_1^0\\ x_2^1 & x_2^0\\ \vdots & \vdots\\ x_{N-1}^1 & x_{N-1}^0\\ \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix}&= \begin{bmatrix} y_0\\ y_1\\ y_2\\ \vdots\\ y_{N-1}\end{bmatrix} \end{align*}$$

but if $$N>2$$ we may not be able to find $$a_0$$ and $$a_1$$ that actually solve all those equations perfectly. The geometric way of saying that is that if I have more than two points, they may not all be on the same line. But if I pre-multiply both sides by the transpose of the matrix with all the functions of $$x$$ in it, I can find the slope and intercept of the line that comes as close as possible to the data points (where "as close as possible" means minimizing the sum of the squares of the vertical distances between the line and the points):

$$ \begin{align*} {\color{Purple}{ \begin{bmatrix} x_0^1 & x_1^1 & x_2^1 & \ldots & x_{N-1}^1 \\ x_0^0 & x_1^0 & x_2^0 & \ldots & x_{N-1}^0 \end{bmatrix}}} \begin{bmatrix} x_0^1 & x_0^0\\ x_1^1 & x_1^0\\ x_2^1 & x_2^0\\ \vdots & \vdots\\ x_{N-1}^1 & x_{N-1}^0\\ \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix}&= {\color{Purple}{ \begin{bmatrix} x_0^1 & x_1^1 & x_2^1 & \ldots & x_{N-1}^1 \\ x_0^0 & x_1^0 & x_2^0 & \ldots & x_{N-1}^0 \end{bmatrix}}} \begin{bmatrix} y_0\\ y_1\\ y_2\\ \vdots\\ y_{N-1}\end{bmatrix} \end{align*}$$

which is to say:

$$ \begin{align*} {\color{Purple}{ \begin{bmatrix} (\boldsymbol{x}^T)^1 \\ (\boldsymbol{x}^T)^0 \end{bmatrix}}} \begin{bmatrix} \boldsymbol{x}^1 & \boldsymbol{x}^0 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix}&= {\color{Purple}{ \begin{bmatrix} (\boldsymbol{x}^T)^1 \\ (\boldsymbol{x}^T)^0 \end{bmatrix}}} \begin{bmatrix} \boldsymbol{y} \end{bmatrix} \end{align*}$$

which, given the relationship between matrix multiplication and the dot product, yields the same result as:

$$ \begin{align*} \begin{bmatrix} \sum_k\left(x^{1}_{k}x^{1}_{k}\right) & \sum_k\left(x^{1}_{k}x^{0}_{k}\right) \\ \sum_k\left(x^{0}_{k}x^{1}_{k}\right) & \sum_k\left(x^{0}_{k}x^{0}_{k}\right) \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix}&= \begin{bmatrix} \sum_k\left(x^{1}_{k}y_k\right) \\ \sum_k\left(x^{0}_{k}y_k\right) \end{bmatrix} \end{align*}$$
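The same relationship in code (a sketch reusing the hypothetical data from before, restated so it runs on its own): stacking $$x^1$$ and $$x^0$$ as columns and pre-multiplying by the transpose reproduces the matrix of sums on the left and the vector of sums on the right.

```python
import numpy as np

# Hypothetical data
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# N x 2 matrix whose columns are x^1 and x^0
Phi = np.column_stack([x, np.ones_like(x)])

print(Phi.T @ Phi)   # [[sum(x*x), sum(x)], [sum(x), N]]
print(Phi.T @ y)     # [sum(x*y), sum(y)]

a = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(a)             # same slope and intercept as the summation version
```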

Finding the coefficients for a general linear model

It turns out the process is very similar for any general linear model! Assume that $$\hat{y}(x)=\sum_ma_m\phi_m(x)$$ with $$M$$ total basis functions indexed from $$m=0$$ through $$m=M-1$$. This means that the $$S_r$$ value, using the second version above, will be:

$$\begin{align*} S_r&=\sum_k\left(\hat{y}(x_k)-y_k\right)^2=\sum_k\left(\sum_ma_m\phi_m(x_k)-y_k\right)^2 \end{align*}$$

There are now $$M$$ variables: all of the $$a_m$$. This means that to minimize the $$S_r$$ value, you need to solve all $$M$$ equations of the form:

$$ \begin{align*} \frac{\partial S_r}{\partial a_m}&=0 \end{align*}$$

Here goes round...$$M$$! Let's look at $$a_p$$, the coefficient of just $$\phi_{p}(x)$$ where $$p$$ is some index between 0 and $$M-1$$:

$$ \begin{align*} \frac{\partial S_r}{\partial a_p}=\frac{\partial}{\partial a_p}\left(\sum_k\left(\sum_ma_m\phi_m(x_k)-y_k\right)^2 \right)&=0 \end{align*}$$

The derivative of a sum is the same as the sum of derivatives, so put the derivative operator inside:

$$ \begin{align*}\sum_k\frac{\partial}{\partial a_p}\left(\sum_ma_m\phi_m(x_k)-y_k\right)^2&=0 \end{align*}$$

Use the power rule to get that $$d(u^2)=2u~du$$ and note that $$u=(\sum_ma_m\phi_m(x_k)-y_k)$$ so $$\frac{du}{da_p}=\phi_p(x_k)$$ here, since the only variable of interest to the partial derivative is $$a_p$$ and it multiplies $$\phi_p(x_k)$$; all the other $$a_m$$ for $$m\neq p$$ are constants with respect to $$a_p$$ and thus their derivatives are 0. This leads to:

$$ \begin{align*} \sum_k2\left(\sum_ma_m\phi_m(x_k)-y_k\right)\phi_p(x_k)=\sum_k2\left(\sum_ma_m\phi_m(x_k)\phi_p(x_k)-y_k\phi_p(x_k)\right)&=0\end{align*}$$

Since we are setting the left side to 0, the 2 is irrelevant. Also, the summand can be split into two parts...

$$ \begin{align*} \sum_k\left(\sum_ma_m\phi_m(x_k)\phi_p(x_k)\right) -\sum_k\left(y_k\phi_p(x_k)\right)&=0 \end{align*}$$

...and then the parts can be separated; I also changed the order of the products for reasons that will become clear later.

$$ \begin{align*} \sum_k\left(\phi_p(x_k)\sum_m\phi_m(x_k)a_m\right) &= \sum_k\left(\phi_p(x_k)y_k\right) \end{align*}$$

The inner sum is a dot product (and a function of $$k$$)! It is a little different from the previous ones we have encountered: it is the dot product of (1) a row made of each of the basis function values at the $$k$$th data point and (2) the column of coefficient values $$\boldsymbol{a}$$. In other words:

$$ \begin{align*} F(k)=\sum_m\phi_m(x_k)a_m &= \begin{bmatrix} \phi_0(x_0) & \phi_1(x_0) & \ldots & \phi_{M-1}(x_0)\\ \phi_0(x_1) & \phi_1(x_1) & \ldots & \phi_{M-1}(x_1)\\ \vdots & \vdots & \vdots & \vdots \\ \phi_0(x_{N-1}) & \phi_1(x_{N-1}) & \ldots & \phi_{M-1}(x_{N-1}) \end{bmatrix} \begin{bmatrix} a_0\\a_1\\ \vdots \\a_{M-1} \end{bmatrix} \end{align*}$$

Or, assuming $$\boldsymbol{x}$$ is a column vector,

$$ \begin{align*} F(k)=\sum_m\phi_m(x_k)a_m &= \begin{bmatrix} \phi_0(\boldsymbol{x}) & \phi_1(\boldsymbol{x}) & \ldots & \phi_{M-1}(\boldsymbol{x}) \end{bmatrix} \begin{bmatrix} a_0\\a_1\\ \vdots \\a_{M-1} \end{bmatrix} \end{align*}$$

This matrix multiplication takes the $$N\times M$$ matrix of function values, multiplies it by the $$M\times 1$$ matrix of coefficient values, and returns an $$N\times 1$$ matrix. This leaves us with:

$$ \begin{align*} \sum_k\left(\phi_p(x_k)F(k)\right) &= \sum_k\left(\phi_p(x_k)y_k\right) \end{align*}$$

Each of these summations represents a dot product too! The left one represents the dot product of any given basis function $$\phi_p(\boldsymbol{x}^T)$$ with my just-calculated $$F(k)$$ matrix; the right one represents the dot product of that same basis function $$\phi_p(\boldsymbol{x}^T)$$ with my original dependent data $$y$$. And since I have to solve for all the $$p$$ values between 0 and $$M-1$$, it would make sense to take a matrix of rows of $$\phi_p(\boldsymbol{x}^T)$$ and pre-multiply the left and right sides by it.
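In NumPy terms (a sketch with placeholder shapes and random values), the left and right sides for a single $$p$$ are dot products of column $$p$$ of the basis-function matrix with $$F$$ and with $$\boldsymbol{y}$$, and doing all $$p$$ at once is exactly a pre-multiplication by the transpose:

```python
import numpy as np

# Placeholder sizes and values: N data points, M basis functions
N, M = 6, 3
rng = np.random.default_rng(1)
Phi = rng.standard_normal((N, M))    # Phi[k, m] = phi_m(x_k)
a = rng.standard_normal(M)
y = rng.standard_normal(N)

F = Phi @ a                          # the N values F(k) = sum_m phi_m(x_k) a_m

p = 1                                # any single basis-function index
print(np.sum(Phi[:, p] * F), (Phi.T @ F)[p])   # left-hand side, two equivalent ways
print(np.sum(Phi[:, p] * y), (Phi.T @ y)[p])   # right-hand side, two equivalent ways
```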

The Big Reveal

If we have $$N$$ data points and a model with $$M$$ basis functions ($$\hat{y}(x)=\sum_ma_m\phi_m(x)$$), what we are really trying to solve is:

$$ \begin{align*} \begin{bmatrix} \phi_0(x_0) & \phi_1(x_0) & \ldots & \phi_{M-1}(x_0)\\ \phi_0(x_1) & \phi_1(x_1) & \ldots & \phi_{M-1}(x_1)\\ \vdots & \vdots & \vdots & \vdots \\ \phi_0(x_{N-1}) & \phi_1(x_{N-1}) & \ldots & \phi_{M-1}(x_{N-1}) \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\a_{M-1}\end{bmatrix}&= \begin{bmatrix} y_0\\ y_1\\ \vdots\\ y_{N-1}\end{bmatrix} \end{align*}$$

but if $$N>M$$ we may not be able to find a set of $$a_m$$ that actually solve all those equations perfectly. However, I can find the $$a_m$$ that minimize the sum of the squares of the estimate residuals if I pre-multiply both sides by the transpose of the matrix with all the functions of $$x$$ in it:

$$ \begin{align*} {\color{Purple}{ \begin{bmatrix} \phi_0(x_0) & \phi_0(x_1) & \ldots & \phi_0(x_{N-1}) \\ \phi_1(x_0) & \phi_1(x_1) & \ldots & \phi_1(x_{N-1}) \\ \vdots & \vdots & \vdots & \vdots \\ \phi_{M-1}(x_0) & \phi_{M-1}(x_1) & \ldots & \phi_{M-1}(x_{N-1}) \end{bmatrix}}} \begin{bmatrix} \phi_0(x_0) & \phi_1(x_0) & \ldots & \phi_{M-1}(x_0)\\ \phi_0(x_1) & \phi_1(x_1) & \ldots & \phi_{M-1}(x_1)\\ \vdots & \vdots & \vdots & \vdots \\ \phi_0(x_{N-1}) & \phi_1(x_{N-1}) & \ldots & \phi_{M-1}(x_{N-1}) \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_{M-1} \end{bmatrix}&= {\color{Purple}{ \begin{bmatrix} \phi_0(x_0) & \phi_0(x_1) & \ldots & \phi_0(x_{N-1}) \\ \phi_1(x_0) & \phi_1(x_1) & \ldots & \phi_1(x_{N-1}) \\ \vdots & \vdots & \vdots & \vdots \\ \phi_{M-1}(x_0) & \phi_{M-1}(x_1) & \ldots & \phi_{M-1}(x_{N-1}) \end{bmatrix}}} \begin{bmatrix} y_0\\ y_1\\ y_2\\ \vdots\\ y_{N-1}\end{bmatrix} \end{align*}$$

which is to say:

$$ \begin{align*} {\color{Purple}{ \begin{bmatrix} \phi_0(\boldsymbol{x}^T)\\ \phi_1(\boldsymbol{x}^T)\\ \vdots \\ \phi_{M-1}(\boldsymbol{x}^T) \end{bmatrix}}} \begin{bmatrix} \phi_0(\boldsymbol{x}) & \phi_1(\boldsymbol{x}) & \ldots & \phi_{M-1}(\boldsymbol{x}) \end{bmatrix} \begin{bmatrix} \boldsymbol{a} \end{bmatrix}&= {\color{Purple}{ \begin{bmatrix} \phi_0(\boldsymbol{x}^T)\\ \phi_1(\boldsymbol{x}^T)\\ \vdots \\ \phi_{M-1}(\boldsymbol{x}^T) \end{bmatrix}}} \begin{bmatrix} \boldsymbol{y} \end{bmatrix} \end{align*}$$

If you define $$\Phi$$ as the matrix whose columns are the basis functions $$\phi_m(\boldsymbol{x})$$ evaluated at the data points, $$\boldsymbol{a}$$ as the column vector of your coefficients, and $$\boldsymbol{y}$$ as the column vector of dependent data points, your system is originally:

$$ \begin{align*} \Phi \boldsymbol{a}&=\boldsymbol{y} \end{align*}$$

what you are really going to solve is

$$ \begin{align*} \Phi^T \Phi \boldsymbol{a}&=\Phi^T \boldsymbol{y} \end{align*}$$

which is to say

$$ \begin{align} \boldsymbol{a}&=\left(\Phi^T \Phi \right)^{-1}\Phi^T \boldsymbol{y} \end{align}$$
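Putting it all together as a minimal sketch (the function name, basis functions, and data below are just placeholders): build $$\Phi$$ column by column from the basis functions, then solve the normal equations. In practice it is numerically preferable to call np.linalg.solve (or np.linalg.lstsq) rather than explicitly inverting $$\Phi^T\Phi$$.

```python
import numpy as np

def general_linear_fit(x, y, basis):
    """Least-squares coefficients a for yhat(x) = sum_m a_m * basis[m](x)."""
    # Phi has one column per basis function, evaluated at every x_k (N x M)
    Phi = np.column_stack([phi(x) for phi in basis])
    # Solve Phi^T Phi a = Phi^T y instead of forming the inverse explicitly
    return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Hypothetical example: recover the coefficients of 3*exp(-2x) + 5 from noisy data
rng = np.random.default_rng(0)
x = np.linspace(0, 3, 50)
y = 3 * np.exp(-2 * x) + 5 + 0.05 * rng.standard_normal(x.size)

basis = [lambda x: np.exp(-2 * x), lambda x: np.ones_like(x)]
print(general_linear_fit(x, y, basis))   # approximately [3, 5]
```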

Pythonic Punchline

See Python:Fitting#General_Linear_Regression