General Linear Regression

This is a work in progress. It is meant to capture the mathematical proof of how general linear regression works. It is math-heavy.

Introduction

Assume you have a data set with $$N$$ independent values $$x_k$$ and corresponding dependent values $$y_k$$. You also have some reasonable scientific model that relates the dependent variable to the independent variable. If that model can be written as a general linear fit, that means you can represent the fit function $$\hat{y}(x)$$ as:

$$ \begin{align*} \hat{y}(x)&=\sum_{m=0}^{M-1}a_m\phi_m(x) \end{align*} $$

where $$\phi_m(x)$$ is the $$m$$th basis function in your model and $$a_m$$ is its constant coefficient. For instance, if you end up with a model:

$$ \begin{align*} \hat{y}(x)&=3e^{-2x}+5 \end{align*} $$

then you could map this to the summation with $$M=2$$ basis functions total and:

$$ \begin{align*} a_0 &= 3 & \phi_0(x) &= e^{-2x} \\ a_1 &= 5 & \phi_1(x) &= x^0 \end{align*} $$

Note for the second term that each $$\phi_m(x)$$ must be a function of $$x$$; constants are thus the coefficients on an implied $$x^0$$.
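As a purely illustrative sketch, the same model can be written in Python as a list of coefficients paired with a list of basis functions; the names and values below are invented for this example and are not part of any particular library.

<syntaxhighlight lang="python">
import math

# Represent y_hat(x) = 3*e^(-2x) + 5 as a general linear fit with
# M = 2 basis functions and their constant coefficients.
coefficients = [3.0, 5.0]                 # a_0, a_1
basis = [lambda x: math.exp(-2 * x),      # phi_0(x) = e^(-2x)
         lambda x: x ** 0]                # phi_1(x) = x^0 (the implied constant term)

def y_hat(x):
    """Evaluate the fit as the sum of a_m * phi_m(x)."""
    return sum(a * phi(x) for a, phi in zip(coefficients, basis))

print(y_hat(0.0))   # 3*1 + 5*1 = 8.0
</syntaxhighlight>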

The goal, once we have established a scientifically valid model, is to determine the "best" set of coefficients for that model. We are going to define the "best" set of coefficients as the values of $$a_m$$ that minimize the sum of the squares of the estimate residuals, $$S_r$$, for that particular model. Recall that:

$$ \begin{align*} S_r&=\sum_k\left(y_k-\hat{y}_k\right)^2=\sum_k\left(\hat{y}_k-y_k\right)^2 \end{align*} $$
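As a concrete example of that definition, here is a short Python sketch that computes $$S_r$$ for some made-up data and the example fit from above; the data values are hypothetical and chosen only to make the code runnable.

<syntaxhighlight lang="python">
import math

# Hypothetical data set (x_k, y_k) and the example fit y_hat(x) = 3*e^(-2x) + 5.
x_data = [0.0, 0.5, 1.0, 1.5]
y_data = [7.9, 6.2, 5.4, 5.2]

def y_hat(x):
    return 3.0 * math.exp(-2.0 * x) + 5.0

def sum_squared_residuals(fit, x_data, y_data):
    """S_r = sum over k of (y_k - fit(x_k))^2."""
    return sum((yk - fit(xk)) ** 2 for xk, yk in zip(x_data, y_data))

print(sum_squared_residuals(y_hat, x_data, y_data))
</syntaxhighlight>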

Finding the coefficients for the "constant" model

The simplest model you might come up with is a constant, $$\hat{y}(x)=a_0x^0$$. This means that the $$S_r$$ value, using the second version above, will be:

$$\begin{align*} S_r&=\sum_k\left(\hat{y}_k-y_k\right)^2=\sum_k\left(a_0-y_k\right)^2 \end{align*}$$

Keep in mind that the only adjustable parameter right now is $$a_0$$; all the $$x_k$$ and $$y_k$$ values are fixed numbers from your data set. This means that to minimize the $$S_r$$ value, you need to solve:

$$ \begin{align*} \frac{dS_r}{da_0}&=0 \end{align*}$$
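If you just want the number, a computer algebra system can carry out this minimization directly; the sketch below uses SymPy (assuming it is installed) on a small made-up data set. The rest of this section does the same minimization by hand for the general case.

<syntaxhighlight lang="python">
import sympy as sp

# Symbolic version of the minimization for a small made-up data set.
a0 = sp.symbols('a0')
y_data = [2, 4, 9, 5]                          # y_k values
S_r = sum((a0 - yk) ** 2 for yk in y_data)     # S_r depends only on a0

# Solve dS_r/da0 = 0 for a0.
print(sp.solve(sp.diff(S_r, a0), a0))          # [5]
</syntaxhighlight>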

Here goes!

$$ \begin{align*} \frac{dS_r}{da_0}=\frac{d}{da_0}\left(\sum_k\left(a_0-y_k\right)^2 \right)&=0 \end{align*}$$

The derivative of a sum is the same as the sum of derivatives, so put the derivative operator inside:

$$ \begin{align*}\sum_k\frac{d}{da_0}\left(a_0-y_k\right)^2&=0 \end{align*}$$

Use the power rule to get that $$d(u^2)=2u~du$$ and note that $$u=(a_0-y_k)$$ so $$\frac{du}{da_0}=1$$ here:

$$ \begin{align*} 2\sum_k\left(a_0-y_k\right)&=0\end{align*}$$

Since we are setting the left side to 0, the 2 is irrelevant. Also, the summand can be split into two parts...

$$ \begin{align*} \sum_k\left(a_0\right)-\sum_k\left(y_k\right)&=0 \end{align*}$$

...and then the parts can be separated.

$$ \begin{align*} \sum_k\left(a_0\right)&=\sum_k\left(y_k\right) \end{align*}$$

Recognize that $$a_0$$ is a constant; since you are adding that constant to itself once for each of the $$N$$ data points, you can replace the summation with:

$$ \begin{align*} Na_0&=\sum_k\left(y_k\right)\end{align*}$$

Dividing by $$N$$ reveals the answer:

$$ \begin{align*} a_0&=\frac{1}{N}\sum_k\left(y_k\right)=\bar{y} \end{align*}$$

The best constant with which to model a data set is its own average! Admittedly, this will lead to an $$r^2$$ value of 0, which is not great, but it is as good as you can get with a model containing nothing more than a constant.
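As a quick numerical sanity check of this result, the sketch below (using made-up data) scans a range of candidate constants and confirms that the one giving the smallest $$S_r$$ matches the mean of the $$y$$ values.

<syntaxhighlight lang="python">
# Brute-force check (made-up data): the constant a0 that minimizes S_r
# should match the mean of the y values.
y_data = [2.0, 4.0, 9.0, 5.0]
N = len(y_data)

def s_r(a0):
    """S_r for the constant model y_hat = a0."""
    return sum((a0 - yk) ** 2 for yk in y_data)

y_bar = sum(y_data) / N
candidates = [i / 100 for i in range(0, 1001)]   # 0.00, 0.01, ..., 10.00
best_a0 = min(candidates, key=s_r)

print(y_bar, best_a0)                            # both 5.0
</syntaxhighlight>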

Finding the coefficients for a "straight line" model