A Deep Dive Into How R Fits a Linear Model
R is a high-level language for statistical computation. One of my most used R functions is the humble `lm`, which fits a linear regression model. The mathematics behind fitting a linear regression is relatively simple: some standard linear algebra with a touch of calculus. It may therefore surprise the reader to learn that behind even the simplest of calls to R’s `lm` function lies a journey through three different programming languages, which in the end arrives at some of the oldest open source software still in common use.
So, in the spirit of the famous thought experiment “what happens when you type www.google.com into your address bar and press enter”, I’d like to discuss what happens when you call `lm` in R.
This essay is inspired by a question of Antoni Parellada’s on CrossValidated. It is essentially a much expanded version of my answer there.
We will make heavy use of the R source code, which you can find here.
The R Layer
Our point of origin is `lm`, the interface exposed to the R programmer. It offers a friendly way to specify models using the core R `formula` and `data.frame` datatypes. A prototypical call to `lm` looks something like this:
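```r
fit <- lm(y ~ x1 + x2, data = df)
```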
The first argument is a model formula, and the second is a data frame. The data frame must contain columns `x1`, `x2`, and `y`, which are transformed into the design matrix and response vector of the model.
The source code for any R function (except those implemented directly in C inside the R interpreter, the so-called primitives) can be viewed by typing the function name into the R interpreter. Typing `lm` reveals both the full function signature and the source code.
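For example, in a recent version of R the first few lines look something like this (the body is truncated, and the exact defaults may vary between versions):

```r
> lm
function (formula, data, subset, weights, na.action, method = "qr",
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
    contrasts = NULL, offset, ...)
{
    ## ... function body follows ...
```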
It is worth pausing to point out that the majority of the source code to `lm` (as is true of most quality production-level code) is boring but necessary busywork and defensive programming: checking inputs, throwing errors, and setting object attributes. The same is true of much of the other code we will investigate, but hereafter we will ignore defensive code and focus only on the interesting bits.
Now, if we think at a high level, there are two fundamental tasks that `lm` must accomplish:
- It must consume the formula and dataframe it receives and produce a design matrix and response vector.
- It must use this design matrix and response, along with some linear algebra, to compute the linear regression coefficients.
Both of these tasks turn out, in fact, to be interesting bits.
Constructing the Design Matrix
Obviously we need to construct a design matrix first. This task starts in this dense and somewhat obscure block of code, reproduced here (lightly reformatted) from the source of `lm`:
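```r
mf <- match.call(expand.dots = FALSE)
m <- match(c("formula", "data", "subset", "weights", "na.action",
             "offset"), names(mf), 0L)
mf <- mf[c(1L, m)]
mf$drop.unused.levels <- TRUE
mf[[1L]] <- quote(stats::model.frame)
mf <- eval(mf, parent.frame())
```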
This small block highlights a strange idiom of R programming: its ability to directly manipulate function calls, frozen in time. The function `match.call`, when called in the local scope of another function, returns an object of type `call`, which captures the call to the enclosing function along with the values bound to its formal parameters.
That’s pretty difficult to take in, so here is an example (the function `f` below is invented just for the demonstration):
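```r
f <- function(x, y) {
  match.call()
}

f(1, 2)
## f(x = 1, y = 2)
```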
A `call` object is a small wonder: it can be indexed into and manipulated much like other R objects. The first index always gives the name of the function that was called (not as a string, but as a symbol object):
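```r
cl <- f(1, 2)
cl[[1]]
## f
class(cl[[1]])
## [1] "name"
```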
The other indices return the arguments passed into the call:
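```r
cl[[2]]
## [1] 1
cl[[3]]
## [1] 2
```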
In our current situation the function we called is `lm`, so the line
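```r
mf[[1L]] <- quote(stats::model.frame)
```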
replaces the function name `lm` in the call object with `model.frame` (the `quote` creates a symbol out of a string or expression). Similarly, the lines
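```r
m <- match(c("formula", "data", "subset", "weights", "na.action",
             "offset"), names(mf), 0L)
mf <- mf[c(1L, m)]
```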
discard any of the various arguments to `lm` that are not needed to construct the design matrix.
Altogether, we have gone from a call to `lm` like
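```r
lm(y ~ x1 + x2, data = df)
```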
to a call to `model.frame` like
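```r
stats::model.frame(formula = y ~ x1 + x2, data = df,
                   drop.unused.levels = TRUE)
```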
Which is pretty neat.
We can now unstop time and evaluate the function call we have so meticulously constructed:
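```r
mf <- eval(mf, parent.frame())
```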
We get, courtesy of `model.frame`, a `data.frame` that contains all the terms in our formula, fully evaluated. For example:
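```r
df <- data.frame(x = c(1, 2, 3), y = c(1, 4, 9))
model.frame(y ~ x + log(x), data = df)
##   y x    log(x)
## 1 1 1 0.0000000
## 2 4 2 0.6931472
## 3 9 3 1.0986123
```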
The way R gets to this new data frame is quite interesting. First, the model formula is turned into a terms object, which contains the metadata needed to identify which column represents the response, and which the predictors (the encoding of this information is pretty obscure, so I won’t bother to unwind it here).
The `variables` attribute of the terms object is our old friend, a `call`:
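```r
tm <- terms(y ~ x + log(x))
attr(tm, "variables")
## list(y, x, log(x))
```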
So we should be able to evaluate this call and recover the model variables. The problem here is one of scoping: we want the function call to get its arguments from the data frame `df` instead of from the calling environment. The solution is to use the data frame itself as the `envir` argument to `eval`:
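```r
eval(attr(tm, "variables"), envir = df)
## [[1]]
## [1] 1 4 9
##
## [[2]]
## [1] 1 2 3
##
## [[3]]
## [1] 0.0000000 0.6931472 1.0986123
```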
We’ve lost the variable names in this demonstration, but those can be easily reattached to the result.
Our model frame has one deficiency, though this example does not show it. If our model specification includes any factor variables, they are simply passed through to the model frame untouched.
The work of deconstructing a factor variable into multiple binary predictors is done in a call to `model.matrix`. For example:
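```r
df <- data.frame(y = c(1, 2, 3), f = factor(c("a", "b", "c")))
model.matrix(y ~ f, data = df)
##   (Intercept) fb fc
## 1           1  0  0
## 2           1  1  0
## 3           1  0  1
## (attributes omitted)
```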
This deconstruction can be a complex task, so we will not go into the details lest they take us too far afield. It would be a nice topic for another essay one day.
Calculating the Regression - R
Now that we have a design matrix, we can move on to fitting the regression. Inside `lm`, this is dispatched with a line like the following (lightly simplified from the source):
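```r
z <- lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...)
```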
`lm.fit` is another R function, which we can call ourselves if we are so inclined:
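```r
# reusing the small df from the model.frame example above
X <- model.matrix(y ~ x + log(x), data = df)
lm.fit(X, df$y)$coefficients
```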
While `lm` conveniently works with formulas and data frames, `lm.fit` wants matrices, so moving from `lm` to `lm.fit` removes one layer of abstraction.
The interesting action in `lm.fit` is the one line:
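```r
z <- .Call(C_Cdqrls, x, y, tol, FALSE)
```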
Now we are getting somewhere intriguing. `.Call` is modern R’s way of calling into C code. The first argument to `.Call` is a string or symbol identifying a compiled C function that is either part of the R distribution or linked as a shared library. The remaining arguments to `.Call` are passed as R objects into the C function, as we will see shortly. In some cases where the called function is part of the R source code, the function name is prepended with a `C_`. In our case, we are calling a C function named `Cdqrls`.
Calculating the Regression - C
The source for `Cdqrls` is found in the R distribution here. Note its peculiar signature:
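```c
SEXP Cdqrls(SEXP x, SEXP y, SEXP tol, SEXP chk)
```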
`SEXP` is the datatype that R’s source uses for a generic R object. Essentially everything that an R programmer manipulates when working day to day is internally a `SEXP`. There are various subtypes of `SEXP` used for more specific types of objects, for example:
- An `INTSXP` is an integer vector.
- A `REALSXP` is a vector of floating point numbers.
- A `VECSXP` is an R list.
More advanced types, that you may not think of as data, are also `SEXP`s; for example, a `CLOSXP` is an R function.
Knowing these types, we can make some sense of the code in `Cdqrls`. Here, a list object is created to hold the output:
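```c
/* From lm.c, roughly: allocate an R list (a VECSXP) for the results.
   The slot count here is quoted from memory and may differ slightly. */
ans = PROTECT(allocVector(VECSXP, 9));
```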
Here, a vector of floating point numbers is created to hold the fit coefficients, which is then inserted into the appropriate slot in the output object:
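```c
/* Roughly as in lm.c: coefficients is a REALSXP, allocated as a
   matrix when y has several columns and a plain vector otherwise. */
SET_VECTOR_ELT(ans, 0, coefficients = PROTECT(
    ny > 1 ? allocMatrix(REALSXP, p, ny) : allocVector(REALSXP, p)));
```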
Notice that, depending on the shape of `y`, `coefficients` can be either a matrix or a regular vector.
The `PROTECT` macro issues instructions to R’s garbage collector: new objects must be `PROTECT`ed, lest they be collected.
Yet again, the majority of the work in `Cdqrls` is concerned with checking invariants of the inputs, and with constructing and initializing new objects. And, once more, the real work of fitting the model is passed to another function, implemented in another language:
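```c
/* from lm.c; the argument list is quoted from memory */
F77_CALL(dqrls)(REAL(qr), &n, &p, REAL(y), &ny, &rtol,
                REAL(coefficients), REAL(residuals), REAL(effects),
                &rank, INTEGER(pivot), REAL(qraux), work);
```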
`F77_CALL` is a macro that wraps calls to FORTRAN functions. It doesn’t amount to much:
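```c
#ifdef HAVE_F77_UNDERSCORE
# define F77_CALL(x)    x ## _
#else
# define F77_CALL(x)    x
#endif
```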
The `##` operator in a C macro is token concatenation, so `F77_CALL` either leaves its argument alone or appends an underscore. It seems that certain platforms append an underscore to function names when FORTRAN code is compiled, and C code calling into the compiled FORTRAN must be aware of this. For our purposes, `F77_CALL` clues us in that we are now calling into a function `dqrls` implemented in FORTRAN.
Calculating the Regression - FORTRAN
So now we are on our third language: R has called C, which has called FORTRAN. Here’s the FORTRAN code for `dqrls`. The first comment tells it all:
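```fortran
c     dqrfit is a subroutine to compute least squares solutions
c     to the system
c
c     (1)               x * b = y
```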
(Interestingly, it looks like the name of this routine was changed at some point, from `dqrfit` to `dqrls`, but someone forgot to update the comment.) We’re finally at the point where we can do some linear algebra and actually solve the system of equations. This is the sort of thing that FORTRAN was designed to do, and is really good at, which explains why we eventually ended up here.
FORTRAN can be a bit jarring to modern programmer sensibilities, starting with the function signature:
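```fortran
      subroutine dqrls(x,n,p,y,ny,tol,b,rsd,qty,k,jpvt,qraux,work)
```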
That is a lot of positional arguments. Note-taking is almost required to keep the order straight, and getting it wrong is likely to produce a program that quietly gives nonsense results (as opposed to crashing).
Worth noting is the lack of a return value. Instead, we have some documentation containing the phrases “on entry” and “on return”. Here is a paraphrased excerpt (see dqrls.f for the exact wording):
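```fortran
c     on entry
c
c        x      double precision(n,p).
c               the matrix to be decomposed.
c
c     on return
c
c        x      contains the output array from dqrdc2,
c               i.e. x has been overwritten.
```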
Instead of explicitly returning a value to the caller, as is the common modern paradigm, FORTRAN “returns” data from a function call by overwriting some of the data passed in. This kind of thing can also be done in C by passing in a pointer to some data, but there it is not the default way to return.
It looks like FORTRAN is going to solve our system by finding the QR decomposition of the coefficient matrix `x`. The first thing that happens, and by far the most important, is:
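```fortran
      call dqrdc2(x,n,n,p,tol,k,qraux,jpvt,work)
```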
which calls the FORTRAN function `dqrdc2` on our input matrix `x`.
We’ve finally made it to linpack. Linpack is a FORTRAN linear algebra library that was created between 1977 and 1979 by a distributed team of four programmers. The development was funded by the NSF, and the library was eventually released to the general public. Of course, in 1979 the public could not simply download linpack from the internet; an interested party had to send $75 ($250 in 2016 dollars) to a distributor to receive a copy on tape. Sometimes it’s worth reflecting on how good a modern developer really has it.
Most serious linear algebra eventually finds its way to linpack or one of its more modern successors (lapack). In our case, we are using the function `dqrdc2`. This is where the actual work is done: we are going to decompose $X$ into its $QR$ factorization

$$X = QR$$

where $Q$ is an orthogonal matrix and $R$ is upper triangular.
This is a smart thing to do, because once you have $Q$ and $R$ you can solve the linear equations for regression

$$X^t X \beta = X^t y$$

very easily. Indeed, since $Q$ is orthogonal ($Q^t Q = I$),

$$X^t X = R^t Q^t Q R = R^t R$$

so the whole system becomes

$$R^t R \beta = R^t Q^t y$$

$R$ is upper triangular, so it has the same rank as $X$, and if our problem is well posed then $X$ has full rank. So, as $R^t$ is a full rank matrix, we can ignore the $R^t$ factor in the equations above, and simply seek solutions to the equation

$$R \beta = Q^t y$$

But here’s the awesome thing. Again, $R$ is upper triangular, so the last linear equation here is just `constant * beta_n = constant`, so solving for $\beta_n$ is trivial. We can then go up the rows, one by one, and substitute in the $\beta$s we already know, each time getting a simple one-variable linear equation to solve. So, once we have $Q$ and $R$, the whole thing collapses to what is called backwards substitution, which is easy.
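We can watch this recipe work using base R’s own QR tools, which (in their default mode) call down into the same linpack routine. A minimal sketch, reusing the small `df` from earlier and ignoring column pivoting, which only matters for rank-deficient problems:

```r
X <- model.matrix(y ~ x + log(x), data = df)
qx <- qr(X)                          # QR decomposition, via linpack's dqrdc2
qty <- qr.qty(qx, df$y)              # computes Q^t y
backsolve(qr.R(qx), qty[1:ncol(X)])  # back substitution solves R beta = Q^t y
```

This gives the same coefficients as `lm(y ~ x + log(x), data = df)`.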
The simplest and most intuitive way to compute the $QR$ factorization of a matrix is with the Gram-Schmidt procedure, which unfortunately is not suitable for serious numeric work due to its instability. Linpack instead uses Householder reflections, which have better computational properties.
Householder Reflections
A Householder reflection is a linear transformation that performs a reflection about a hyperplane (a hyperplane is a linear subspace of dimension one less than the enclosing space). A hyperplane can be (up to sign) uniquely described by its unit normal vector $v$, in which case the corresponding Householder reflection is the mapping

$$x \mapsto x - 2 \langle x, v \rangle v$$

which is easily understood with a picture and the mental note that $\langle x, v \rangle v$ describes the projection of the vector $x$ onto $v$. Written as a matrix, a Householder transformation is

$$H_v = I - 2 v v^t$$
Given any vector $x$, there is a hyperplane in which we can reflect $x$ so that the resulting image lands along the first coordinate axis¹. For a vector $x$, call this transformation $H_x$, so that

$$H_x x = \pm \|x\| e_1$$
So what does this have to do with the $QR$ decomposition of a matrix $X$? Well, if we denote by $x_1$ the first column of $X$, then $H_{x_1} X$ is a matrix whose first column lies in the first coordinate direction. That is, it has the following shape

$$H_{x_1} X = \begin{pmatrix} * & * & \cdots & * \\ 0 & * & \cdots & * \\ \vdots & \vdots & \ddots & \vdots \\ 0 & * & \cdots & * \end{pmatrix}$$

Now we can continue inductively with the same procedure applied to the submatrix $X'$ of $H_{x_1} X$ obtained by deleting the first row and column: just construct a Householder transformation $H_{x_1'}$ that maps the first column of $X'$ into the first coordinate axis, and let

$$H_2 = \begin{pmatrix} 1 & 0 \\ 0 & H_{x_1'} \end{pmatrix}$$

Then

$$H_2 H_{x_1} X = \begin{pmatrix} * & * & * & \cdots & * \\ 0 & * & * & \cdots & * \\ 0 & 0 & * & \cdots & * \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & * & \cdots & * \end{pmatrix}$$

If we continue on like this, we will create a sequence of reflections $H_1, H_2, \ldots, H_p$ which reduces $X$ into an upper triangular matrix

$$H_p \cdots H_2 H_1 X = R$$

Each of the reflection matrices is orthogonal (in fact, each is its own inverse), so we can invert the $H_p \cdots H_2 H_1$ part and move everything to the other side

$$X = H_1 H_2 \cdots H_p R$$

Setting $Q = H_1 H_2 \cdots H_p$, we have the $QR$ decomposition of $X$.
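To make the procedure concrete, here is a minimal, unoptimized Householder QR written in R. This is a sketch for exposition only: linpack’s `dqrdc2` adds column pivoting, rank detection via a tolerance, and a compact in-place encoding of the reflections. The function name `householder_qr` is invented for this demonstration, and the sketch assumes a full-rank input.

```r
householder_qr <- function(X) {
  n <- nrow(X); p <- ncol(X)
  Q <- diag(n)
  R <- X
  for (j in seq_len(p)) {
    x <- R[j:n, j]
    # Build the unit normal v of the hyperplane that reflects x
    # onto the first coordinate axis (sign chosen for stability).
    s <- if (x[1] >= 0) 1 else -1
    v <- x
    v[1] <- v[1] + s * sqrt(sum(x^2))
    v <- v / sqrt(sum(v^2))
    # The Householder matrix I - 2 v v^t, acting on the trailing block.
    H <- diag(n - j + 1) - 2 * tcrossprod(v)
    R[j:n, ] <- H %*% R[j:n, , drop = FALSE]  # zero out below the diagonal
    Q[, j:n] <- Q[, j:n, drop = FALSE] %*% H  # accumulate Q = H_1 H_2 ... H_p
  }
  list(Q = Q, R = R)
}

# Sanity check: Q R should reproduce X up to rounding.
# X <- matrix(rnorm(12), nrow = 4); out <- householder_qr(X)
# max(abs(out$Q %*% out$R - X))  # ~ 1e-15
```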
Unfortunately, it’s hard to untangle the FORTRAN code and see these concepts at work, as FORTRAN uses an efficient in-place encoding of the matrices involved, at the expense of clarity. Hopefully we have given enough detail here that the concepts are clear to the reader.
¹ In the plane containing $x$ and $e_1$, let $b$ be their angle bisector. Take the vector $v$ orthogonal to $b$ in the plane of $x$ and $e_1$. It’s not too hard to find a formula for this vector.
Solving the Least Squares Problem
After calling `dqrdc2` we have in hand the $QR$ decomposition of the matrix $X$, and so are well on our way to getting our hands on the linear regression solution. The least squares problem is solved here (quoted from memory, lightly abridged):
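```fortran
      do 20 jj=1,ny
         call dqrsl(x,n,n,kk,qraux,y(1,jj),rsd(1,jj),qty(1,jj),
     1              b(1,jj),rsd(1,jj),rsd(1,jj),1110,info)
   20 continue
```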
That `20` hanging out on the left is a statement label (essentially a line number), a concept that thankfully died with the first few generations of programming languages. The label acts as a directive telling the `do` loop where its body ends (the reader may wish to pause and reflect on how much cleaner and more expressive `{brackets}` are).
`dqrsl` is another linpack function, whose purpose is to consume the output of `dqrdc2`.
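Its header comment announces its purpose (quoted from memory; see dqrsl.f for the exact text):

```fortran
c     dqrsl applies the output of dqrdc to compute coordinate
c     transformations, projections and least squares solutions.
```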
We are interested in solutions to the least squares problem, so it seems we are in the right place.
Again, the FORTRAN code is convoluted enough that it’s not worth untangling its knots.
Let’s take stock of where we are. We’ve passed down through many layers: R, then C, and finally FORTRAN code from the early days of computing. At the very bottom, we finally receive the solution to the linear regression problem we constructed in R at the very top of the stack. All that’s left is to propagate this information back up to the user.
Wrap Up
If you’ve read this far, I hope that you’ve found this journey through one of the most basic and fundamental R functions enjoyable. Many expositions of linear regression focus on the mathematics, the statistics, or at worst the mechanical application of canned routines in this or that programming language. There is a lot more to a full implementation of these concepts than the solution equation

$$\hat\beta = (X^t X)^{-1} X^t y$$
I think it can be very useful to see what technologies, both computational and mathematical, go into a practical and accessible solution to the problem.
There are some things we did not even touch, or swept under the rug:

- How is the model matrix defined, and how are contrasts applied to the factors?
- What happens when $X$ is not full rank, and how is that detected?
- What is the mysterious encoding `linpack` uses for the $QR$ decomposition?
- How are standard errors computed?
I don’t know the answer to all of these questions; the reader may find it fun to investigate them.