General Linear Regression

This is a work in progress. It is meant to capture the mathematical proof of how general linear regression works. It is math-heavy. It has not been edited sufficiently.

Introduction

Assume you have some data set where you have $$N$$ independent values $$x_k$$ and dependent values $$y_k$$. You also have some reasonable scientific model that relates the dependent variable to the independent variable. If that model can be written as a general linear model, that means you can represent the fit function $$\hat{y}(x)$$ as:

$$ \begin{align*} \hat{y}(x)&=\sum_{m=0}^{M-1}a_m\phi_m(x) \end{align*} $$

where $$\phi_m(x)$$ is the $$m$$th basis function in your model and $$a_m$$ is its constant coefficient. For instance, if you end up with the model:

$$ \begin{align*} \hat{y}(x)&=3e^{-2x}+5 \end{align*} $$

then you could map these to the summation with $$M=2$$ basis functions total and:

$$ \begin{align*} a_0 &= 3 & \phi_0(x) &= e^{-2x} \\ a_1 &= 5 & \phi_1(x) &= x^0 \end{align*} $$

Note for the second term that $$\phi_1(x)$$ must still be a function of $$x$$; constants are thus the coefficients on an implied $$x^0$$.
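As a quick illustration (not part of the proof), here is a minimal Python sketch of that mapping; the names phi, a, and y_hat are made up for this example:

```python
import numpy as np

# Illustration of the example model y_hat(x) = 3*exp(-2x) + 5
# written as a sum of coefficients times basis functions.
phi = [lambda x: np.exp(-2 * x),   # phi_0(x)
       lambda x: x**0]             # phi_1(x): the constant term is a coefficient on x^0
a = [3, 5]                         # a_0, a_1

def y_hat(x):
    """Evaluate the general linear model at x."""
    return sum(a_m * phi_m(x) for a_m, phi_m in zip(a, phi))

print(y_hat(0.0))   # 3*1 + 5*1 = 8.0
```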

The goal, once we have established a scientifically valid model, is to determine the "best" set of coefficients for that model. We are going to define the "best" set of coefficients as the values of $$a_m$$ that minimize the sum of the squares of the estimate residuals, $$S_r$$, for that particular model. Recall that:

$$ \begin{align*} S_r&=\sum_k\left(y_k-\hat{y}_k\right)^2=\sum_k\left(\hat{y}_k-y_k\right)^2 \end{align*} $$
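If you want to compute $$S_r$$ numerically, a minimal sketch (the data values here are made up) might look like:

```python
import numpy as np

# Sum of the squares of the estimate residuals for a candidate fit.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([8.2, 5.3, 5.1, 5.0])
y_hat = 3 * np.exp(-2 * x) + 5          # the example model from above

S_r = np.sum((y - y_hat)**2)            # same value as sum((y_hat - y)**2)
print(S_r)
```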

Finding the coefficients for the "constant" model

The simplest model you might come up with is a constant, $$\hat{y}(x)=a_0x^0$$. This means that the $$S_r$$ value, using the second version above, will be:

$$\begin{align*} S_r&=\sum_k\left(\hat{y}_k-y_k\right)^2=\sum_k\left(a_0-y_k\right)^2 \end{align*}$$

Keep in mind that the only variable right now is $$a_0$$; all the $$x$$ and $$y$$ values are constant independent or dependent values from your data set. The only parameter you can adjust is $$a_0$$. This means that to minimize the $$S_r$$ value, you need to solve:

$$ \begin{align*} \frac{dS_r}{da_0}&=0 \end{align*}$$

Here goes!

$$ \begin{align*} \frac{dS_r}{da_0}=\frac{d}{da_0}\left(\sum_k\left(a_0-y_k\right)^2 \right)&=0 \end{align*}$$

The derivative of a sum is the same as the sum of derivatives, so put the derivative operator inside:

$$ \begin{align*}\sum_k\frac{d}{da_0}\left(a_0-y_k\right)^2&=0 \end{align*}$$

Use the power rule to get that $$d(u^2)=2u~du$$ and note that $$u=(a_0-y_k)$$ so $$\frac{du}{da_0}=1$$ here:

$$ \begin{align*} \sum_k2\left(a_0-y_k\right)&=0\end{align*}$$

Since we are setting the left side to 0, the 2 is irrelevant. Also, the summand can be split into two parts...

$$ \begin{align*} \sum_k\left(a_0\right)-\sum_k\left(y_k\right)&=0 \end{align*}$$

...and then the parts can be separated.

$$ \begin{align*} \sum_k\left(a_0\right)&=\sum_k\left(y_k\right) \end{align*}$$

Recognize that $$a_0$$ is a constant; since you are adding that constant to itself for each of the $$N$$ data points, you can replace the summation with:

$$ \begin{align*} Na_0&=\sum_k\left(y_k\right)\end{align*}$$

Dividing by $$N$$ reveals the answer:

$$ \begin{align*} a_0&=\frac{1}{N}\sum_k\left(y_k\right)=\bar{y} \end{align*}$$

The best constant with which to model a data set is its own average! Admittedly, this will lead to an $$r^2$$ value of 0, which is not great, but it is as good as you can get with a model containing nothing more than a constant.
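A quick numerical sanity check of this result, assuming NumPy is available; the data values here are made up:

```python
import numpy as np

# The least-squares "constant" model is just the mean of y.
y = np.array([2.0, 4.0, 9.0, 1.0])

a0_from_mean = y.mean()

# Cross-check with a degree-0 polynomial fit (the x values are arbitrary here).
x = np.arange(len(y))
a0_from_polyfit = np.polyfit(x, y, 0)[0]

print(a0_from_mean, a0_from_polyfit)    # both 4.0
```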

Finding the coefficients for a "straight line" model

So, that was relatively painless (except for the typesetting). Next up, let's look at a slightly more complex model: a straight line. That is to say, $$\hat{y}(x)=a_0x^1+a_1x^0$$. Yes, the indexing is a little unfortunate but that's the way it goes. This means that the $$S_r$$ value, using the second version above, will be:

$$\begin{align*} S_r&=\sum_k\left(\hat{y}_k-y_k\right)^2=\sum_k\left(a_0x_k+a_1-y_k\right)^2 \end{align*}$$

There are now two variables: $$a_0$$ and $$a_1$$. This means that to minimize the $$S_r$$ value, you need to solve:

$$ \begin{align*} \frac{\partial S_r}{\partial a_0}&=0\\ \frac{\partial S_r}{\partial a_1}&=0 \end{align*}$$

where the $$\partial$$ symbol indicates a partial derivative. A partial derivative simply means that you are looking at how something changes with respect to changes in only one of its variables - all the other variables are assumed constant. For example, the volume of a cylinder can be given by $$V=\pi r^2h$$ where $$r$$ is the radius of the base and $$h$$ is the height. Using partial derivatives, you can calculate how the volume changes either as a function of changing the radius of the base or as a function of changing the height:

$$ \begin{align*} \frac{\partial V}{\partial r}&=2\pi r h & \frac{\partial V}{\partial h}&=\pi r^2 \end{align*}$$

On the left, the $$h$$ is taken as a constant; on the right, the $$r$$ is taken as a constant.
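If you want to verify those partial derivatives symbolically, a short SymPy sketch (assuming SymPy is available) would be:

```python
import sympy as sp

# Confirm the cylinder-volume partial derivatives.
r, h = sp.symbols('r h', positive=True)
V = sp.pi * r**2 * h

print(sp.diff(V, r))   # 2*pi*h*r   (h held constant)
print(sp.diff(V, h))   # pi*r**2    (r held constant)
```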

Here goes round two! First, let's look at $$a_0$$:

$$ \begin{align*} \frac{\partial S_r}{\partial a_0}=\frac{\partial}{\partial a_0}\left(\sum_k\left(a_0x_k+a_1-y_k\right)^2 \right)&=0 \end{align*}$$

The derivative of a sum is the same as the sum of derivatives, so put the derivative operator inside:

$$ \begin{align*}\sum_k\frac{\partial}{\partial a_0}\left(a_0x_k+a_1-y_k\right)^2&=0 \end{align*}$$

Use the power rule to get that $$d(u^2)=2u~du$$ and note that $$u=(a_0x_k+a_1-y_k)$$ so $$\frac{du}{da_0}=x_k$$ here (this is different from what it was above):

$$ \begin{align*} \sum_k2\left(a_0x_k+a_1-y_k\right)x_k=\sum_k2\left(a_0x_{k}^{2}+a_1x_k-y_kx_k\right)&=0\end{align*}$$

Since we are setting the left side to 0, the 2 is irrelevant. Also, the summand can be split into three parts...

$$ \begin{align*} \sum_k\left(a_0x_{k}^{2}\right)+\sum_k\left(a_1x_{k}\right)-\sum_k\left(y_kx_k\right)&=0 \end{align*}$$

...and then the parts can be separated; I also changed the order of the product on the right for reasons that will become clear later.

$$ \begin{align*} \sum_k\left(a_0x_{k}^{2}\right)+\sum_k\left(a_1x_{k}\right)&=\sum_k\left(x_ky_k\right) \end{align*}$$

None of these terms is simple enough to do anything with other than to recognize that $$a_0$$ and $$a_1$$ are not functions of $$k$$ and can thus be brought out of the summations:

$$ \begin{align} \tag{1} a_0\sum_k\left(x_{k}^{2}\right)+a_1\sum_k\left(x_{k}\right)&=\sum_k\left(x_ky_k\right) \end{align}$$

Now let's look at $$a_1$$:

$$ \begin{align*} \frac{\partial S_r}{\partial a_1}=\frac{\partial}{\partial a_1}\left(\sum_k\left(a_0x_k+a_1-y_k\right)^2 \right)&=0 \end{align*}$$

The derivative of a sum is the same as the sum of derivatives, so put the derivative operator inside:

$$ \begin{align*}\sum_k\frac{\partial}{\partial a_1}\left(a_0x_k+a_1-y_k\right)^2&=0 \end{align*}$$

Use the power rule to get that $$d(u^2)=2u~du$$ and note that $$u=(a_0x_k+a_1-y_k)$$ so $$\frac{du}{da_1}=1$$ here:

$$ \begin{align*} \sum_k2\left(a_0x_k+a_1-y_k\right)(1)=\sum_k2\left(a_0x_{k}+a_1-y_k\right)&=0\end{align*}$$

Since we are setting the left side to 0, the 2 is irrelevant. Also, the summand can be split into three parts...

$$ \begin{align*} \sum_k\left(a_0x_{k}\right)+\sum_k\left(a_1\right)-\sum_k\left(y_k\right)&=0 \end{align*}$$

...and then the parts can be separated.

$$ \begin{align*} \sum_k\left(a_0x_{k}\right)+\sum_k\left(a_1\right)&=\sum_k\left(y_k\right) \end{align*}$$

While the second term is actually simple enough to do something with (it is just adding up $$a_1$$ a total of $$N$$ times and thus could be written as $$Na_1$$), we are simply going to recognize that $$a_0$$ and $$a_1$$ are not functions of $$k$$ and can thus be brought out of the summations:

$$ \begin{align} \tag{2} a_0\sum_k\left(x_{k}\right)+a_1\sum_k\left(1\right)&=\sum_k\left(y_k\right) \end{align}$$

If we wanted to be explicit about it, we could note that $$\phi_1(x)$$ here is $$x^0$$ and write:

$$ \begin{align} \tag{2e} a_0\sum_k\left(x^{0}_{k}x^{1}_{k}\right)+a_1\sum_k\left(x^{0}_{k}x^{0}_{k}\right)&=\sum_k\left(x^{0}_{k}y_k\right) \end{align}$$

In fact, we could do the same with equation (1) above and write it as:

$$ \begin{align} \tag{1e} a_0\sum_k\left(x^{1}_{k}x^{1}_{k}\right)+a_1\sum_k\left(x^{1}_{k}x^{0}_{k}\right)&=\sum_k\left(x^{1}_{k}y_k\right) \end{align}$$

Equations (1e) and (2e) give two equations with two unknowns; putting them in matrix form yields:

$$ \begin{align} \begin{bmatrix} \sum_k\left(x^{1}_{k}x^{1}_{k}\right) & \sum_k\left(x^{1}_{k}x^{0}_{k}\right) \\ \sum_k\left(x^{0}_{k}x^{1}_{k}\right) & \sum_k\left(x^{0}_{k}x^{0}_{k}\right) \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix}&= \begin{bmatrix} \sum_k\left(x^{1}_{k}y_k\right) \\ \sum_k\left(x^{0}_{k}y_k\right) \end{bmatrix} \end{align}$$
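A minimal sketch of building and solving that 2x2 system numerically, assuming NumPy and using made-up data; the result should match NumPy's own first-order polynomial fit:

```python
import numpy as np

# Build the 2x2 system from the summations above and solve for a_0 (slope) and a_1 (intercept).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 7.1])

A = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    len(x)   ]])
b = np.array([np.sum(x * y), np.sum(y)])

a0, a1 = np.linalg.solve(A, b)
print(a0, a1)                      # should match np.polyfit(x, y, 1)
print(np.polyfit(x, y, 1))
```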

Side bar: Dot Products

Now is a good time to take a moment to remember what a dot product is. If you have column vectors $$p$$ and $$q$$, each with the same number of elements, the dot product of $$p$$ and $$q$$ is given as $$\sum_kp_kq_k$$. The vector-math version of that dot product is the matrix product $$p^Tq$$, where $$p^T$$ is the transpose of $$p$$. In this case, that means turning $$p$$ (the column vector) into $$p^T$$ (a row vector) and doing matrix multiplication with $$q$$.

Re-writing sums as dot products

If we have $$N$$ data points and a model of $$\hat{y}(x)=a_0x^1+a_1x^0$$, what we are really trying to solve is:

$$ \begin{align} \begin{bmatrix} x_0^1 & x_0^0\\ x_1^1 & x_1^0\\ x_2^1 & x_2^0\\ \vdots & \vdots\\ x_{N-1}^1 & x_{N-1}^0\\ \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix}&= \begin{bmatrix} y_0\\ y_1\\ y_2\\ \vdots\\ y_{N-1}\end{bmatrix} \end{align}$$

but if $$N>2$$ we may not be able to find $$a_1$$ and $$a_0$$ that actually solve all those equations perfectly. The geometric way of saying that is that if I have more than two points, they may not all be on the same line. But, if I pre-multiply both sides by the transpose of the matrix with all the functions of $$x$$ in it:

$$ \begin{align} {\color{Purple}{ \begin{bmatrix} x_0^1 & x_1^1 & x_2^1 & \ldots & x_{N-1}^1 \\ x_0^0 & x_1^0 & x_2^0 & \ldots & x_{N-1}^0 \end{bmatrix}}} \begin{bmatrix} x_0^1 & x_0^0\\ x_1^1 & x_1^0\\ x_2^1 & x_2^0\\ \vdots & \vdots\\ x_{N-1}^1 & x_{N-1}^0\\ \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix}&= {\color{Purple}{ \begin{bmatrix} x_0^1 & x_1^1 & x_2^1 & \ldots & x_{N-1}^1 \\ x_0^0 & x_1^0 & x_2^0 & \ldots & x_{N-1}^0 \end{bmatrix}}} \begin{bmatrix} y_0\\ y_1\\ y_2\\ \vdots\\ y_{N-1}\end{bmatrix} \end{align}$$

which is to say:

$$ \begin{align} {\color{Purple}{ \begin{bmatrix} (x^T)^1 \\ (x^T)^0 \end{bmatrix}}} \begin{bmatrix} (x)^1 & (x)^0 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix}&= {\color{Purple}{ \begin{bmatrix} (x^T)^1 \\ (x^T)^0 \end{bmatrix}}} \begin{bmatrix} y \end{bmatrix} \end{align}$$

which, given the relationship between matrix multiplication and the dot product, yields the same result as:

$$ \begin{align} \begin{bmatrix} \sum_k\left(x^{1}_{k}x^{1}_{k}\right) & \sum_k\left(x^{1}_{k}x^{0}_{k}\right) \\ \sum_k\left(x^{0}_{k}x^{1}_{k}\right) & \sum_k\left(x^{0}_{k}x^{0}_{k}\right) \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \end{bmatrix}&= \begin{bmatrix} \sum_k\left(x^{1}_{k}y_k\right) \\ \sum_k\left(x^{0}_{k}y_k\right) \end{bmatrix} \end{align}$$
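Here is a short sketch of that pre-multiplication step, again with made-up data; Phi is just an illustrative name for the tall matrix whose columns are $$x^1$$ and $$x^0$$:

```python
import numpy as np

# The same 2x2 system, built by pre-multiplying the tall N x 2 matrix by its transpose.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 7.1])

Phi = np.column_stack((x**1, x**0))   # columns are x^1 and x^0
lhs = Phi.T @ Phi                     # entries are the sums of x^1*x^1, x^1*x^0, ...
rhs = Phi.T @ y                       # entries are the sums of x^1*y and x^0*y

print(np.linalg.solve(lhs, rhs))      # same slope and intercept as before
```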

Finding the coefficients for a general linear model

It turns out, the process is very similar for any linear model! Assume that $$\hat{y}(x)=\sum_ma_m\phi_m(x)$$ for $$M$$ total basis functions indexed from $$m=0$$ through $$m=M-1$$. This means that the $$S_r$$ value, using the second version above, will be:

$$\begin{align*} S_r&=\sum_k\left(\hat{y}_k-y_k\right)^2=\sum_k\left(\sum_ma_m\phi_m(x_k)-y_k\right)^2 \end{align*}$$

There are now $$M$$ variables: all the $$a_m$$. This means that to minimize the $$S_r$$ value, you need to solve all $$M$$ equations of the form:

$$ \begin{align*} \frac{\partial S_r}{\partial a_m}&=0 \end{align*}$$

Here goes round...$$M$$! Let's look at $$a_p$$, the coefficient of just $$\phi_{p}(x)$$ where $$p$$ is some index between 0 and $$M-1$$:

$$ \begin{align*} \frac{\partial S_r}{\partial a_p}=\frac{\partial}{\partial a_p}\left(\sum_k\left(\sum_ma_m\phi_m(x_k)-y_k\right)^2 \right)&=0 \end{align*}$$

The derivative of a sum is the same as the sum of derivatives, so put the derivative operator inside:

$$ \begin{align*}\sum_k\frac{\partial}{\partial a_p}\left(\sum_ma_m\phi_m(x_k)-y_k\right)^2&=0 \end{align*}$$

Use the power rule to get that $$d(u^2)=2u~du$$ and note that $$u=(\sum_ma_m\phi_m(x_k)-y_k)$$ so $$\frac{du}{da_p}=\phi_p(x_k)$$ here, since the only variable of interest to the partial derivative is $$a_p$$ and it is multiplying $$\phi_p(x_k)$$; all the other $$a_m$$ for $$m\neq p$$ are constants with respect to $$a_p$$ and thus their derivatives would be 0. This leads to:

$$ \begin{align*} \sum_k2\left(\sum_ma_m\phi_m(x_k)-y_k\right)\phi_p(x_k)=\sum_k2\left(\sum_ma_m\phi_m(x_k)\phi_p(x_k)-y_k\phi_p(x_k)\right)&=0\end{align*}$$

Since we are setting the left side to 0, the 2 is irrelevant. Also, the summand can be split into two parts...

$$ \begin{align*} \sum_k\left(\sum_ma_m\phi_m(x_k)\phi_p(x_k)\right) -\sum_k\left(y_k\phi_p(x_k)\right)&=0 \end{align*}$$

...and then the parts can be separated; I also changed the order of the products for reasons that will become clear later.

$$ \begin{align*} \sum_k\left(\phi_p(x_k)\sum_m\phi_m(x_k)a_m\right) &= \sum_k\left(\phi_p(x_k)y_k\right) \end{align*}$$

The inner sum is a dot product (and a function of $$k$$)! It is a little different from the previous ones we have encountered: it is the dot product of a row made of each of the basis function values at the $$k$$th data point with the column of coefficient values. In other words:

$$ \begin{align*} F(k)=\sum_m\phi_m(x_k)a_m &= \begin{bmatrix} \phi_0(x_0) & \phi_1(x_0) & \ldots & \phi_{M-1}(x_0)\\ \phi_0(x_1) & \phi_1(x_1) & \ldots & \phi_{M-1}(x_1)\\ \vdots & \vdots & \vdots & \vdots \\ \phi_0(x_{N-1}) & \phi_1(x_{N-1}) & \ldots & \phi_{M-1}(x_{N-1}) \end{bmatrix} \begin{bmatrix} a_0\\a_1\\ \vdots \\a_{M-1} \end{bmatrix} \end{align*}$$

Or, assuming $$x$$ is a column vector,

$$ \begin{align*} F(k)=\sum_m\phi_m(x_k)a_m &= \begin{bmatrix} \phi_0(x) & \phi_1(x) & \ldots & \phi_{M-1}(x) \end{bmatrix} \begin{bmatrix} a_0\\a_1\\ \vdots \\a_{M-1} \end{bmatrix} \end{align*}$$

This matrix multiplication takes the $$N\times M$$ matrix of function values and multiplies it by the $$M\times 1$$ matrix of coefficient values and returns an $$N\times 1$$ matrix. This leaves us with:

$$ \begin{align*} \sum_k\left(\phi_p(x_k)F(k)\right) &= \sum_k\left(\phi_p(x_k)y_k\right) \end{align*}$$

Each of these summations represents a dot product too! The left one represents the dot product of any given basis function $$\phi_p(x^T)$$ with my just-calculated $$F(k)$$ matrix; the right one represents the dot product of that same basis function $$\phi_p(x^T)$$ with my original dependent data $$y$$. And since I have to solve for all the $$p$$ values between 0 and $$M-1$$, it would make sense to form a matrix of rows of $$\phi_p(x^T)$$ and pre-multiply the left and right sides by it.

The Big Reveal

If we have $$N$$ data points and a model with $$M$$ basis functions ($$\hat{y}(x)=\sum_ma_m\phi_m(x)$$), what we are really trying to solve is:

$$ \begin{align} \begin{bmatrix} \phi_0(x_0) & \phi_1(x_0) & \ldots & \phi_{M-1}(x_0)\\ \phi_0(x_1) & \phi_1(x_1) & \ldots & \phi_{M-1}(x_1)\\ \vdots & \vdots & \vdots & \vdots \\ \phi_0(x_{N-1}) & \phi_1(x_{N-1}) & \ldots & \phi_{M-1}(x_{N-1}) \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\a_{M-1}\end{bmatrix}&= \begin{bmatrix} y_0\\ y_1\\ \vdots\\ y_{N-1}\end{bmatrix} \end{align}$$

but if $$N>M$$ we may not be able to find a set of $$a_m$$ that actually solve all those equations perfectly. However, I can find the $$a_m$$ that minimize the sum of the squares of the estimate residuals if I pre-multiply both sides by the transpose of the matrix with all the functions of $$x$$ in it:

$$ \begin{align} {\color{Purple}{ \begin{bmatrix} \phi_0(x_0) & \phi_0(x_1) & \ldots & \phi_0(x_{N-1}) \\ \phi_1(x_0) & \phi_1(x_1) & \ldots & \phi_1(x_{N-1}) \\ \vdots & \vdots & \vdots & \vdots \\ \phi_{M-1}(x_0) & \phi_{M-1}(x_1) & \ldots & \phi_{M-1}(x_{N-1}) \\ \end{bmatrix}}} \begin{bmatrix} \phi_0(x_0) & \phi_1(x_0) & \ldots & \phi_{M-1}(x_0)\\ \phi_0(x_1) & \phi_1(x_1) & \ldots & \phi_{M-1}(x_1)\\ \vdots & \vdots & \vdots & \vdots \\ \phi_0(x_{N-1}) & \phi_1(x_{N-1}) & \ldots & \phi_{M-1}(x_{N-1}) \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_{M-1} \end{bmatrix}&= {\color{Purple}{ \begin{bmatrix} \phi_0(x_0) & \phi_0(x_1) & \ldots & \phi_0(x_{N-1}) \\ \phi_1(x_0) & \phi_1(x_1) & \ldots & \phi_1(x_{N-1}) \\ \vdots & \vdots & \vdots & \vdots \\ \phi_{M-1}(x_0) & \phi_{M-1}(x_1) & \ldots & \phi_{M-1}(x_{N-1}) \\ \end{bmatrix}}} \begin{bmatrix} y_0\\ y_1\\ y_2\\ \vdots\\ y_{N-1}\end{bmatrix} \end{align}$$

which is to say:

$$ \begin{align} {\color{Purple}{ \begin{bmatrix} \phi_0(x^T)\\ \phi_1(x^T)\\ \vdots \\ \phi_{M-1}(x^T) \end{bmatrix}}} \begin{bmatrix} \phi_0(x) & \phi_1(x) & \ldots & \phi_{M-1}(x) \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_{M-1} \end{bmatrix}&= {\color{Purple}{ \begin{bmatrix} \phi_0(x^T)\\ \phi_1(x^T)\\ \vdots \\ \phi_{M-1}(x^T) \end{bmatrix}}} \begin{bmatrix} y \end{bmatrix} \end{align}$$

Or, define $$\Phi$$ as the matrix containing columns of the basis functions $$\phi_m$$ evaluated at the data points. If your system is originally:

$$ \begin{align} \Phi a&=y \end{align}$$

what you are really going to solve is

$$ \begin{align} \Phi^T \Phi a&=\Phi^T y \end{align}$$

which is to say

$$ \begin{align} a&=\mbox{inv}(\Phi^T \Phi )\Phi^T y \end{align}$$
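Putting it all together, here is a minimal Python sketch of the general recipe; the data set and basis functions are made up, and np.linalg.lstsq is shown only as an alternate way to solve the same least-squares problem:

```python
import numpy as np

# General recipe: build Phi from the basis functions, then solve Phi^T Phi a = Phi^T y.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([7.9, 6.2, 5.5, 5.1, 5.0])

basis = [lambda x: np.exp(-2 * x), lambda x: x**0]   # phi_0, phi_1
Phi = np.column_stack([phi(x) for phi in basis])     # N x M matrix of basis-function values

a = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(a)                                             # coefficients a_0, ..., a_{M-1}

# lstsq solves the same least-squares problem without forming Phi^T Phi explicitly.
print(np.linalg.lstsq(Phi, y, rcond=None)[0])
```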