# A Deep Dive Into How R Fits a Linear Model

R is a high-level language for statistical computing. One of my most used R functions is the humble `lm`, which fits a linear regression model. The mathematics behind fitting a linear regression is relatively simple: some standard linear algebra with a touch of calculus. It may therefore surprise the reader to learn that behind even the simplest of calls to R’s `lm` function lies a journey through three different programming languages, arriving in the end at some of the oldest open source software still in common use.

So, in the spirit of the famous thought experiment “what happens when you type www.google.com into your address bar and press enter”, I’d like to discuss what happens when you call `lm` in R.

This essay is inspired by a question of Antoni Parellada’s on CrossValidated. It is essentially a much expanded version of my answer there.

We will make heavy use of the R source code, which you can find here.

## The R Layer

Our point of origin is `lm`, the interface exposed to the R programmer. It offers a friendly way to specify models using the core R `formula` and `data.frame` datatypes. A prototypical call to `lm` looks something like this
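A representative example (the data and column names here are invented for illustration, chosen to match the discussion below):

```r
# a small data frame with predictors x1, x2 and a response y (invented data)
df <- data.frame(x1 = runif(100), x2 = runif(100))
df$y <- 2 * df$x1 - df$x2 + rnorm(100, sd = 0.1)

# fit y as a linear function of x1 and x2
fit <- lm(y ~ x1 + x2, data = df)
coef(fit)  # intercept and two slope coefficients
```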

The first argument is a model formula, and the second is a dataframe. The dataframe must contain columns `x1`, `x2`, and `y`, which are transformed into the design matrix and response vector of the model.

The source code for any R function (except those implemented directly in C, which are called `.Primitive`s) can be viewed by typing the function name into the R interpreter. Typing `lm` reveals both the full function signature and the source code.

It is worth a moment to point out that the majority of the source code of `lm` (as is true of most quality production-level code) is boring but necessary busywork and defensive programming: checking inputs, throwing errors, and setting object attributes. The same is true for much of the other code we will investigate, but hereafter we will ignore defensive code and focus only on the interesting bits.

Now, if we think at a high level, there are two fundamental tasks that `lm` must accomplish:

- It must consume the formula and dataframe it receives and produce a design matrix and response vector.
- It must use this design matrix and response, along with some linear algebra, to compute the linear regression coefficients.

Both of these tasks turn out to be, in fact, interesting bits.

## Constructing the Design Matrix

Obviously we need to construct a design matrix first. This task starts in this dense and somewhat obscure block of code
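That block is not reproduced here, but from memory of the R sources it amounts to the following (the exact form may differ between R versions). Here it is transplanted into a toy function so it can be run standalone:

```r
# a stripped-down imitation of the opening block of lm; the argument names
# follow the real source, but this is a sketch, not the actual implementation
mini_lm_frame <- function(formula, data, subset, weights, na.action, offset) {
  mf <- match.call(expand.dots = FALSE)   # freeze the call to this function
  m <- match(c("formula", "data", "subset", "weights", "na.action", "offset"),
             names(mf), 0L)
  mf <- mf[c(1L, m)]                      # keep only arguments model.frame wants
  mf$drop.unused.levels <- TRUE
  mf[[1L]] <- quote(stats::model.frame)   # swap the function being called
  eval(mf, parent.frame())                # evaluate the rewritten call
}

df <- data.frame(x1 = 1:3, x2 = 4:6, y = 7:9)
mini_lm_frame(y ~ x1 + x2, data = df)
```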

This small block highlights a strange idiom of R programming: the direct manipulation of function calls, frozen in time. The function `match.call`, when called in the local scope of another function, returns an object of type `call`, which captures the call to the enclosing function along with the values bound to its formal parameters. That’s pretty difficult to take in, so here is an example
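A small stand-in example (the function `f` is made up for illustration):

```r
# f does nothing but capture and return its own frozen call
f <- function(x, y) match.call()

cl <- f(1, 2)
print(cl)      # f(x = 1, y = 2)
class(cl)      # "call"
```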

A `call` object is a small wonder: it can be indexed into and manipulated much like other R objects. The first index always gives the name of the function that was called (not as a string, but as a symbol object); the other indices return the arguments passed into the call
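For instance, with a made-up function `f` that just captures its call:

```r
f <- function(x, y) match.call()
cl <- f(1, 2)

cl[[1]]              # the symbol f (not the string "f")
is.symbol(cl[[1]])   # TRUE
cl[[2]]              # 1, the first argument
cl[[3]]              # 2, the second argument
```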

In our current situation the function we called is `lm`, so the line that assigns into the first index of the call object replaces the function name `lm` with `model.frame` (the `quote` creates a symbol out of a string or expression). Similarly, the lines that subset the call object discard any of the various arguments to `lm` that are not needed to construct the design matrix.

Altogether, we have gone from a call to `lm` to an equivalent call to `model.frame`, which is pretty neat.
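Concretely (re-using the invented columns `x1`, `x2`, and `y` from earlier), a call like `lm(y ~ x1 + x2, data = df)` is rewritten into a call object equivalent to:

```r
df <- data.frame(x1 = 1:3, x2 = 4:6, y = 7:9)

# what the rewritten call object evaluates to
stats::model.frame(formula = y ~ x1 + x2, data = df, drop.unused.levels = TRUE)
```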

We can now unstop time and evaluate the function call we have so meticulously constructed

We get, courtesy of `model.frame`, a `data.frame` that contains all the terms in our formula fully evaluated. For example
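A sketch of this with invented data; note how the transformed term `log(x1)` shows up as a fully evaluated column:

```r
df <- data.frame(x1 = c(1, 2, 3), x2 = c(4, 5, 6), y = c(1, 0, 1))

mf <- model.frame(y ~ log(x1) + x2, data = df)
names(mf)         # "y"  "log(x1)"  "x2"
mf[["log(x1)"]]   # the logarithms, already computed
```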

The way R gets to this new data frame is quite interesting. First, the model formula is turned into a terms object, which contains the metadata needed to identify which column represents the response, and which the predictors (the encoding of this information is pretty obscure, so I won’t bother to unwind it here).

The `variables` attribute of the terms object is our old friend, a `call`. So we should be able to evaluate the call
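We can see this for ourselves:

```r
tt <- terms(y ~ x1 + x2)

v <- attr(tt, "variables")
v            # list(y, x1, x2)
is.call(v)   # TRUE: an unevaluated call to list()
```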

The problem here is one of scoping: we want the function call to get its arguments from the data frame `df` instead of the calling environment. The solution is to use the data frame itself as the `envir` argument to `eval`
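For example, with an invented data frame:

```r
df <- data.frame(x1 = 1:3, x2 = 4:6, y = 7:9)
vars <- attr(terms(y ~ x1 + x2), "variables")

# the data frame doubles as the evaluation environment
eval(vars, envir = df)   # a list holding the columns y, x1, x2 (names dropped)
```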

We’ve lost the variable names in this demonstration, but those can easily be reattached to the result.

Our model frame has one deficiency, though this example does not show it: if our model specification includes any factor variables, they are simply passed through to the model frame untouched. The work of deconstructing a factor variable into multiple binary predictors is done in a call to `model.matrix`.
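A small illustration with an invented factor column:

```r
df <- data.frame(y = c(1.5, 0.3, 2.7, 1.1),
                 g = factor(c("a", "b", "c", "a")))

# the three-level factor g becomes two binary indicator columns,
# plus an intercept column of ones
model.matrix(y ~ g, data = df)
```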

This deconstruction can be a complex task, so we will not go into the details lest they take us too far afield. It would be a nice topic for another essay one day.

## Calculating the Regression - R

Now that we have a design matrix, we can move on to fitting the regression.

`lm.fit` is another R function, which we can call ourselves if we are so inclined
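For example, handing it a design matrix and response directly (invented data):

```r
X <- cbind(1, c(1, 2, 3, 4))   # design matrix: intercept column plus one predictor
y <- c(1.1, 1.9, 3.2, 3.9)

fit <- lm.fit(X, y)
fit$coefficients   # intercept and slope
```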

While `lm` conveniently works with formulas and `data.frame`s, `lm.fit` wants matrices, so moving from `lm` to `lm.fit` removes one layer of abstraction.

The interesting action in `lm.fit` is the one line
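Quoting from memory of the R sources (the exact form may vary between versions), that line looks roughly like:

```r
z <- .Call(C_Cdqrls, x, y, tol, FALSE)
```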

Now we are getting somewhere intriguing. `.Call` is modern R’s way of calling into C code. The first argument to `.Call` is a string or symbol identifying a compiled C function that is either part of the R distribution, or linked in as a shared library. The remaining arguments to `.Call` are passed as R objects into the C function, as we will see shortly. In cases where the called function is part of the R source code, the function name is prepended with a `C_`. In our case, we are calling a C function named `Cdqrls`.

## Calculating the Regression - C

The `Cdqrls` function is found in the R source here. Note its peculiar signature

`SEXP` is the datatype that R’s source uses for a generic R object. Essentially everything that an R programmer manipulates when working day to day is internally a `SEXP`. There are various subtypes of `SEXP` used for more specific types of objects, for example

- An `INTSXP` is an integer vector.
- A `REALSXP` is a vector of floating point numbers.
- A `VECSXP` is an R list.

More advanced types, ones that you may not think of as data, are also `SEXP`s; for example, a `CLOSXP` is an R function.

Knowing these types, we can make some sense of the code in `Cdqrls`. Here, a list object is created to hold the output

Here, a vector of floating point numbers is created to hold the fit coefficients, which is then inserted into the appropriate slot in the output object

Notice that, depending on the shape of `y`, `coefficients` can be either a matrix or a regular vector.

The `PROTECT` macro issues instructions to R’s garbage collector: new objects must be `PROTECT`ed, lest they be collected.

Yet again, the majority of the work in `Cdqrls` is concerned with checking invariants of the inputs, and constructing and initializing new objects. And, once more, the real work of fitting the model is passed off to another function, implemented in another language

`F77_CALL` is a macro that wraps calls to FORTRAN functions. It doesn’t amount to much

The `##` operator in a C macro is simple token concatenation, so `F77_CALL` either leaves its argument alone, or appends an underscore. It seems that certain platforms append an underscore to function names when FORTRAN code is compiled, and C code calling into the compiled FORTRAN must be aware of this. For our purposes, `F77_CALL` clues us in that we are now calling into a function `dqrls` implemented in FORTRAN.

## Calculating the Regression - FORTRAN

So now we are on our third language: R has called C, which has called FORTRAN. Here’s the FORTRAN code for `dqrls`.

The first comment tells it all

(Interestingly, it looks like the name of this routine was changed at some point, from `dqrfit` to `dqrls`, but someone forgot to update the comment.) We’re finally at the point where we can do some linear algebra and actually solve the system of equations. This is the sort of thing that FORTRAN was designed for, and is really good at, which explains why we eventually ended up here.

FORTRAN can be a bit jarring to modern programmer sensibilities, starting with the function signature:

That is a *lot* of positional arguments. Note taking is almost required to keep the order straight, and getting it wrong is likely to result in a program that simply gives nonsense results (as opposed to crashing).

Worth noting is the lack of a return value. Instead, we have some documentation containing the phrases “on entry” and “on return”. Rather than explicitly returning a value to the caller, the common modern paradigm, FORTRAN “returns” data from a function call by overwriting some of the data passed in. This kind of thing can also be done in C by passing in a pointer to some data, but there it is not the default way to return.

It looks like FORTRAN is going to solve our system by finding the QR decomposition of the coefficient matrix `x`. The first thing that happens, and by far the most important, is a call to the FORTRAN function `dqrdc2` on our input matrix `x`.
We’ve finally made it to linpack. Linpack is a FORTRAN linear algebra library, created between 1977 and 1979 by a distributed team of four programmers. The development was funded by the NSF, and the library was eventually released to the general public. Of course, in 1979 the public could not simply download linpack from the internet; an interested party had to send $75 ($250 in 2016 dollars) to a distributor to receive a copy on tape. Sometimes it’s worth reflecting on how good a modern developer really has it.

Most serious linear algebra eventually finds its way to linpack or one of its more modern successors (lapack). In our case, we are using the function `dqrdc2`

This is where the actual work is done. We are going to decompose $X$ into its $QR$ factorization, where $Q$ has orthonormal columns and $R$ is upper triangular.

This is a smart thing to do, because once you have $Q$ and $R$ you can solve the linear equations for regression

$$X^T X \beta = X^T y$$

very easily. Indeed

$$X^T X = (QR)^T (QR) = R^T Q^T Q R = R^T R$$

so the whole system becomes

$$R^T R \beta = R^T Q^T y$$

$R$ is upper triangular, so it has the same rank as $X$, and if our problem is well posed then $X$ has full rank. So, as $R^T$ is a full rank matrix, we can ignore that factor in the equations above, and simply seek solutions to the equation

$$R \beta = Q^T y$$

But here’s the awesome thing. Again, $R$ is upper triangular, so the last linear equation here is just `constant * beta_n = constant`, and solving for $\beta_n$ is trivial. We can then go up the rows, one by one, substituting in the $\beta$s we already know, each time getting a simple one variable linear equation to solve. So, once we have $Q$ and $R$, the whole thing collapses to what is called *backwards substitution*, which is easy.
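We can mimic this whole scheme at the R level with the built-in `qr` machinery (which itself bottoms out in the same FORTRAN routines):

```r
set.seed(1)
X <- cbind(1, rnorm(10), rnorm(10))   # invented design matrix
y <- rnorm(10)                        # invented response

qx <- qr(X)                           # Householder QR decomposition
# back-substitution against the triangular factor solves R beta = Q^T y
beta <- backsolve(qr.R(qx), crossprod(qr.Q(qx), y))

# agrees with what lm.fit computes
all.equal(as.vector(beta), unname(lm.fit(X, y)$coefficients))
```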

The simplest and most intuitive way to compute the $QR$ factorization of a matrix is with the Gram-Schmidt procedure, which unfortunately is not suitable for serious numeric work due to its instability. Linpack instead uses Householder reflections, which have better computational properties.

## Householder Reflections

A Householder reflection is a linear transformation that performs a reflection about a hyperplane (a hyperplane is a linear subspace of dimension one less than the enclosing space). A hyperplane can be (up to sign) uniquely described by its unit normal vector $v$, in which case the corresponding Householder reflection is the mapping

$$x \mapsto x - 2 \langle x, v \rangle v$$

which is easily understood with a picture and the mental note that $\langle x, v \rangle v$ describes the projection of the vector $x$ onto $v$. Written as a matrix, a Householder transformation is

$$H_v = I - 2 v v^T$$

Given any vector $x$ there is a hyperplane in which we can reflect so that the resulting image of $x$ lands along the first coordinate axis^{1}. For a vector $x$, call this transformation $H_x$.

So what does this have to do with the $QR$ decomposition of a matrix $X$? Well, if we denote by $x_1$ the first column of $X$, then $H_{x_1} X$ is a matrix whose first column lies in the first coordinate direction. That is, it has the following shape

$$H_{x_1} X = \begin{pmatrix} * & * & \cdots & * \\ 0 & * & \cdots & * \\ \vdots & \vdots & & \vdots \\ 0 & * & \cdots & * \end{pmatrix}$$

Now we can continue inductively with the same procedure applied to the submatrix left over after deleting the first row and column: just construct a Householder transformation $H'$ that maps the first column of that submatrix into the first coordinate axis, and let

$$H_2 = \begin{pmatrix} 1 & 0 \\ 0 & H' \end{pmatrix}$$

Then

$$H_2 H_{x_1} X = \begin{pmatrix} * & * & \cdots & * \\ 0 & * & \cdots & * \\ 0 & 0 & \cdots & * \\ \vdots & \vdots & & \vdots \end{pmatrix}$$

If we continue on like this, we will create a sequence of reflections $H_1 = H_{x_1}, H_2, \ldots, H_p$ which reduces $X$ into an upper triangular matrix

$$H_p \cdots H_2 H_1 X = R$$

Each of the reflection matrices is orthogonal, so we can invert the $H$ part and move everything to the other side

$$X = H_1^T H_2^T \cdots H_p^T R$$

Setting $Q = H_1^T H_2^T \cdots H_p^T$, we have the $QR$ decomposition of $X$.
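To make the procedure concrete, here is a naive R sketch of Householder QR (illustration only; the real linpack routine uses a far more compact storage scheme, and more numerical care):

```r
# build Q and R by repeatedly reflecting columns onto the coordinate axes
householder_qr <- function(X) {
  n <- nrow(X); p <- ncol(X)
  Q <- diag(n)
  R <- X
  for (k in 1:p) {
    x <- R[k:n, k]
    v <- x
    v[1] <- v[1] + sign(x[1]) * sqrt(sum(x^2))  # normal of the reflecting hyperplane
    v <- v / sqrt(sum(v^2))                     # unit normal
    H <- diag(n)
    H[k:n, k:n] <- diag(n - k + 1) - 2 * tcrossprod(v)  # embedded reflection
    R <- H %*% R   # zero out column k below the diagonal
    Q <- Q %*% H   # H is symmetric and orthogonal, so H^T = H
  }
  list(Q = Q, R = R)
}

set.seed(42)
X <- matrix(rnorm(12), nrow = 4)
out <- householder_qr(X)
all.equal(out$Q %*% out$R, X)   # TRUE: the factors recover X
```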

Unfortunately, it’s hard to untangle the FORTRAN code to see these concepts, as FORTRAN uses an efficient encoding of the matrices involved at the expense of clarity. Hopefully we have given enough details here that the concepts are clear to the reader.

^{1} In the plane containing $x$ and the first coordinate vector $e_1$, let $b$ be their angle bisector. Take the vector $v$ orthogonal to $b$ in the plane of $x$ and $e_1$. It’s not too hard to find a formula for this vector.

## Solving the Least Squares Problem

After calling `dqrdc2` we have in hand the $QR$ decomposition of the matrix $X$, and so are well on our way to getting our hands on the linear regression solution. The least squares problem is solved here

That `20` hanging out on the left is a *statement label*, a concept that thankfully died with the first few generations of programming languages. The label tells the `do` loop where its body ends (the reader may wish to pause and reflect on how much cleaner and more expressive `{brackets}` are).

`dqrsl` is another `linpack` function, whose purpose is to consume the output of `dqrdc2`. We are interested in solutions to the least squares problem, so this seems to be the right place

Again, the FORTRAN code is convoluted enough that it’s not worth it to untangle its knots.

Let’s take stock of where we are. We’ve passed down through many layers: R, then C, and finally FORTRAN code from the early days of computing. At the very bottom, we finally receive the solution to the linear regression problem we constructed in R at the very top of the stack. All that’s left is to propagate this information back up to the user.

## Wrap Up

If you’ve read this far, I hope that you’ve found this journey through one of the most basic and fundamental R functions enjoyable. Many expositions of linear regression focus on the mathematics, the statistics, or at worst the mechanical application of canned routines in this or that programming language. There is a lot more to a full implementation of these concepts than the solution equation

$$\hat\beta = (X^T X)^{-1} X^T y$$

I think it can be very useful to see what technologies, both computational and mathematical, go into a practical and accessible solution to the problem.

There are some things we did not even touch, or swept under the rug:

- How is the model matrix defined, and how are contrasts applied to the factors?
- What happens when $X$ is not full rank, and how is that detected?
- What is the mysterious encoding `linpack` uses for the $QR$ decomposition?
- How are standard errors computed?

I don’t know the answers to all of these questions; the reader may find it fun to investigate them.