Writing a C Extension Module for Python
I like to solve project euler problems using python. Over time, I’ve built up a small library of helper routines that are useful for many different problems. Since these routines get used so often, it pays to make them as efficient as possible.
A C extension module is a python module, only written in C. This is possible because the main python runtime, the program that interprets and runs python programs, is written is C (a fact embedded into the programs name, cpython
). It is therefore possible to write a python module as a collection of C functions that is interacts directly with the python runtime. For tasks requiring heavy computation, this can result in considerable speedups to many algorithms (computationally heavy parts of numpy and scipy, for example, are written as C extension modules).
In this post I will be using python 3; I believe there are a few minor details that are different if you are writing extension modules for python 2.
The Problem
Many problems from project euler benefit from having an apriori list of prime numbers. For these, calling a function primes_less_than(n)
, which returns a list of all the prime numbers less than a positive integer , is a good first step to constructing a solution.
An efficient algorithm for producing such a list is the Sieve of Eratosthenes, which goes like this
- Initialize a boolean list of length , which we will use to record whether or not each integer is prime.
- Zero and one are not prime, mark them so.
- Two is prime, mark it as such. All other multiples of two are not prime.
- The next unmarked number is not divisible by any of the primes we have already marked, so it must be prime; mark it, then mark all its multiples as not prime. (the first time we get to this step, this unmarked number is three).
- Repeat the above procedure until the list is exhausted.
There is one minor optimization to the above algorithm that it is always worth employing. When we find a prime we have, in previous steps, marked as non-prime all multiples of all primes in the interval . This means that we must have found all non-primes less than , for any composite number must be divisible by some number . So, when marking multiples of a prime , we can start marking at .
Here’s a pure python implementation of this algorithm
Our goal will be to translate this algorithm into a python module written in C.
Creating a Module Object
We are going to create a python module primes
which contains the primes_less_than
function, so let’s begin with a file primes.c
. Before we can get to implementing the algorithm, we need to do the somewhat boring work of creating the module object, and then telling C what functions and objects we want to put inside it.
Since our goal is to interface directly with python, we need to import the python C api by including the appropriate header
including this header makes the C python api available to us.
First, lets record what functions we intend to include in our module by creating what python calls the method table
PyMethodDef
is a defined by the python header, it is a simple struct
with four entries
char* ml_name
: The name of the method.PyCFunction ml_func
: A pointer to the C function implementing the method. The naming convention for python methods ismodule_name + '_' + method_name
, which is how we ended up with the awkward C function nameprimes_primes_less_than
.int ml_flags
: A python header defined set of flags controlling method calling conventions. We will be using the simplest possible calling convention (positional arguments only), which corresponds to theMETH_VARARGS
flag.char* ml_doc
: A documentation string for the method.
The PyCFunction
type is defined in the same file as PyMethodDef
and is interesting if only for its fascinating C twistedness
this is a type definition of a pointer to a function that consumes two pointers to PyObjects
and returns a pointer to a PyObject
(not a typedef
I could pull off first try). We will say more about PyObjects
later on.
Now we have to define the module object and tell it about our method table
Here there are some internal python things, which we mostly don’t have to worry about very much. The first entry in this struct must always be set to PyModuleDef_HEAD_INIT
(the documentation is very explicit about this), and the -1
is a flag controlling how memory is allocated for module level objects. The remaining entries in the struct are
char* m_name
: The name of module.char* m_doc
: A documentation string for the module.PyMethodDef* m_methods
: The method table of the module being created.
Finally we need to initialize the module, i.e. tell python what to do when we import
it
The return type PyMODINiT_FUNC
is a compiler macro, created in a maze of #defines
here. This allows the declaration to adapt itself situationally. The simplest possible declaration is used when compiling modules while building the python runtime itself
In our case we have already have a working install of python, so we need to compile our module as a shared library (compiled code usable from other compiled code)
Luckily, we don’t have to manage the compiler flags ourselves to get this to work out correctly, python will do that for us, as we will see when building our module.
It’s worth mentioning that we return NULL
when the module creation fails, this is a general pattern in cpython
code. When failures occur, we signal a generic error to the caller by passing NULL
, otherwise a valid PyObject
is returned. To signal more granular error information the runtime provides an api that allows us to set and check exceptions in a static global variable.
This is the end of the necessary setup, so we can get on to writing code to solve our actual problem. We will leave the setup code we just completed at the bottom of the module (as it references the yet to be written method primes_primes_less_than
, which we will need to define above the module method table, or else the compiler will complain).
Getting Python Objects Into C
There are two fundamental tasks we must complete when writing a C function to be called from pyhon
- Receive the python objects passed as arguments into the python method, and deconstruct them into their native C datatypes.
- Construct a python object to return.
Here we will focus on the first task.
The function signature of a module level python method in C is, surprisingly, always the same
We won’t need the self
argument (luckily, the documentation is somewhat unclear about its purpose), but we will certainly need args
.
Notice that all the arguments to the function are passed from python as a single python object args
. The PyObject
type is completely generic; at its most fundamental level a PyObject
contains a reference count (the number of other objects that care that it still exists) and a pointer to a structure containing more granular type information. Depending on what flavor of PyObject
it is, it can contain an entire host of other functionality. Numbers, lists, tuples, and dictionaries are all PyObject
s at the C level.
In our case, the self
object will contain a single python integer, we need to extract the C integer out of this so we can use it in our algorithm (note the distinction, a python integer is a PyObject
, but a C integer is a native 64-bit binary integer, to highlight the distinction it is common to call a python integer “wrapped”).
Thankfully, the python developers have provided a wonderful way to unwrap the native integer data. The code
will parse the argument object, extract the long integer, and place it in the variable n
. The interface to PyArg_ParseTuple
is very clever, the number and types of arguments to the python method are communicated in a format string passed as the second argument. If we had two integer arguments we would write
or an integer and a string
More complicated objects are also supported, here’s a two-tuple of integers and a dictionary
since the dictionary is recieved as a generic PyObject*
, we could then use PyDict_Check
to validate its type.
The PyArg_ParseTuple
function returns a value, which can be used to check if the args
object was successusfully parsed (because, for example, we could easily pass an incorrect format string). As discussed, we want python methods to return NULL
on error conditions, so our code for parsing args
becomes
and our partially completed method is
An interesting engeneering question now arises, should we implement the algorithm inline or break out the computation into a seperate, private, function? Consider the consesquences of writing the code inline, what would happen if we wrote another function in this module that would like to call primes_primes_less_than
? That function could not simply pass along the long
, it would have to construct a python object to pass as args
, which seems like a lot of trouble. If instead, we write a private function _primes_less_than
which consumes a long
, then we have the algorithm available to other functions in our C code, a clear win. So our python method becomes
Implementing the Algorithm
Let’s turn to implementing the algorithm in our private function
Reviewing our pure python implementation, we need two data structures
- A fixed length boolean array (length
n
) to record which intergers we have determened to be composite so far, we called thiscould_be_prime
in our python implementation. - A varaible length array for accumulating our prime numbers.
We only need the boolean array for the life of the function call, so we can create it on the stack (with storage class automatic) and let C collect it automatically when it falls out of scope (though this decision will have consequences, keep reading). We only know the size of the array at runtime (as oposed to compile time), but modern C (C99) allows us to get away with that using varaible length arrays.
To accumulate our prime numbers, we will use a python list, since we need to return the result anyway, and it saves us from writing our own expandable array type.
Our could_be_prime
array is initially filled with garbage data, so we need to initialize it before we get to work
The implementation of the sieve is now quite straighforward. We only need to know how to create a python integer object, and how to add these to our (initially empty) python list. These tasks are accomplished with the python api functions PyLong_FromLong
and PyList_Append
, which are well documented in the C api references for python long integers and python lists. Making use of these functions, it’s easy to translate our sieving algorithm into C
And with that, we’ve written a complete extension module.
Building and Testing the Module
The final steps are to build our extension module, and then test to make sure it works.
Building is easy, we only need to create a very simple python script containing metadata needed for the build. The complete setup.py
script for our primes
module is
Setup scripts can be much more complicated, but our simple case is self explanatory and to the point.
Now we cross our fingers and build the module
If we are very lucky, the module will compile without error, otherwise we have some bugs to fix. In any case, once all the problems are resolved we have a build directory which contains our extension module
If we navigate to the directory containing the shared object file and drop into python, we can import the module and use our method
Success!
Comparison
Now that we have working python and C implementation of the algorithm, we can compare thier performance and see what we have won. Let’s wrap a call to the primes_less_than
function in two python scripts, one calling the python implementation, and one calling the C implementation. Here’s the c-test.py
script
The python implementation takes a good deal of time to generate such an extensive list of primes
Unfortuantely, running the test for the extension module causes an unexpected error
The dreaded segmentation fault! What’s going on here? Why did our program crash here when we verified that it worked for a smaller list of primes?
Recall from earlier that we created our working array could_be_prime
in the local scope of a function, i.e. we created the array on the stack. The stack is generally a small segment of memory, on a unix system the stack size can be determined with the ulimit
command
Clearly our list of booleans is overflowing this limit. We could solve this issue by using ulimit
to allow our program a (much) bigger stack, but this is brittle. Instead, we can just allocate space for could_be_prime
on the heap
After recompiling, let’s try again
Nice!
Conclusions
We got about an order of magnitude of speedup for our effort. For an algorithm that is used in a tight loop, this coud be a very signifigant boon. It is possible to go much further, creating classes and objects at the C level.
The cython project offers another approach to writing extension modules, using a python like dialect instead of interacting with the C api directly. Cython also integrates very nicely with numpy, which can be more difficult to do in pure C. On the other hand, writing cython does not offer the same insights and discovery as interacting with cython
directly, so both approaches have thier advantages.
Writing an extension module is often held up as one of the great possibilities available to the python programmer, but I often get the feeling that few prople actually do it. Hopefully this example will help more python programmers get over the fear and dive into deeper waters and adventures.