Using C/C++ for Python Extension

In general, C/C++ can be used to extend Python when you need close to native performance, and writing a Python extension in C/C++ is relatively easy.

I'll show a simplified version of an extension that is used in real life. It extracts records from .pcap files, a format used to store captured network packets so that the network activity can be analysed later.

Although there are many alternatives, the ones I tried could not achieve the goal in a reasonable time. One of them is scapy. Please don't get me wrong, scapy is a fabulous networking package. It can automatically parse all the records in a .pcap file, which is an amazing feature. However, the parsing also takes a significant amount of time, especially for a large .pcap file with hundreds of thousands of records inside.

At that time, my goal was quite straightforward. I needed three things per packet: the time when the packet was captured, the source IP it was sent from, and its destination IP. Given these requirements, there is no need to parse any record as deeply as scapy does. I can just check whether the record contains an IP layer and, if it does, extract the source and destination IP. Otherwise I skip to the next record. And that's all.
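
To make the idea concrete before moving to C, here is a minimal pure-Python sketch of that check. It assumes an Ethernet link layer and IPv4 only, and the function name extract_ips is made up for illustration; it is not part of the extension.

```python
import socket
import struct

ETHERTYPE_IP = 0x0800  # EtherType value for IPv4


def extract_ips(frame):
    """Return (src, dst) as dotted strings, or None if frame is not IPv4."""
    # 14-byte Ethernet header plus the 20-byte fixed part of an IPv4 header
    if len(frame) < 14 + 20:
        return None
    # Bytes 12..13 of the Ethernet header hold the EtherType (big-endian)
    (ether_type,) = struct.unpack("!H", frame[12:14])
    if ether_type != ETHERTYPE_IP:
        return None
    # Source and destination addresses sit at offsets 12 and 16 of the IP header
    src = socket.inet_ntoa(frame[14 + 12:14 + 16])
    dst = socket.inet_ntoa(frame[14 + 16:14 + 20])
    return src, dst


# Hand-built sample frame: zeroed MAC addresses, IPv4 EtherType, minimal IP header
sample = (bytes(12) + struct.pack("!H", ETHERTYPE_IP)
          + bytes([0x45]) + bytes(11)
          + socket.inet_aton("10.0.0.1") + socket.inet_aton("10.0.0.2"))
print(extract_ips(sample))
```

Everything past the two addresses is ignored, which is exactly why this approach can be so much cheaper than a full parse.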

I decided to name the extension streampcap, and the class StreamPcap, so that I can write my Python code as below.

from streampcap import StreamPcap

pcap = StreamPcap("sample.pcap")
packet = pcap.next()
while packet is not None:
    print("{} {} {}".format(packet["ts"], packet["ip_src"], packet["ip_dst"]))
    packet = pcap.next()
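
Since next() returns None at the end of the file, the same loop can probably be written more compactly with the two-argument form of iter(). The sketch below uses a made-up FakePcap stand-in so that it runs without the compiled extension; the dict keys follow the ones the C code builds later in this post (ts, ip_src, ip_dst).

```python
class FakePcap:
    """Hypothetical stand-in with the same next() contract as StreamPcap."""

    def __init__(self, records):
        self._records = list(records)

    def next(self):
        # Return the next record, or None once the records are exhausted
        return self._records.pop(0) if self._records else None


pcap = FakePcap([
    {"ts": 1.0, "ip_src": "10.0.0.1", "ip_dst": "10.0.0.2"},
    {"ts": 2.5, "ip_src": "10.0.0.2", "ip_dst": "10.0.0.1"},
])

# iter(callable, sentinel) keeps calling pcap.next until it returns None
lines = ["{} {} {}".format(p["ts"], p["ip_src"], p["ip_dst"])
         for p in iter(pcap.next, None)]
print("\n".join(lines))
```

With the real extension, only the construction of pcap would change.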

To build the extension, the Python development headers must be installed. On Linux they come from a package such as python3-dev (Ubuntu/Debian) or python3-devel (CentOS). The libpcap development headers are needed as well, since the extension includes pcap.h. As for macOS, I personally use miniconda to manage the Python environment, and miniconda ships the Python headers together with the interpreter. Miniconda is also available for Linux. Life is easier!

To begin with, you can create a virtual Python environment or use your current one. Just make sure that the major and minor versions of the interpreter you build with match those of the interpreter the extension will run on. For instance, if you build in a virtual environment with Python 3.6.3, you can use the package with any Python 3.6.x without concerns. (Of course you can consider some backwards compatibility, but that would be another topic.) Here I'll demonstrate how to create a new virtual environment with Python 3.6. (Miniconda will choose the latest available version of Python 3.6.x.)

conda create -n pyext python=3.6

After entering the command above, a prompt will be shown to let you confirm the actions. And if everything is right, miniconda will tell you that, to activate this environment, use

conda activate pyext

In the blink of an eye, a new virtual Python environment is created and activated.

Ensure the correct virtual environment is activated by typing the following command

which python3

If you see some output like /Users/YOUR_USERNAME/miniconda3/envs/pyext/bin/python3, it suggests that you are on the right track.
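
Because the build is tied to the major and minor version, it may also help to ask the interpreter itself what it is. A small sketch using only the standard library; the exact suffix printed depends on your platform, and on some platforms the config variable may be unset.

```python
import sys
import sysconfig

# Major and minor version of the running interpreter, e.g. (3, 6)
major_minor = sys.version_info[:2]

# Filename suffix that compiled extension modules get on this interpreter;
# it encodes the implementation, version and platform
ext_suffix = sysconfig.get_config_var("EXT_SUFFIX")

print("Python {}.{}".format(*major_minor))
print("extension suffix:", ext_suffix)
```

The compiled streampcap module should end up with that suffix in its filename.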

Besides that, check whether Python.h is present inside this virtual environment. Simply type ls (do not press return/enter yet), then paste the previous output and replace the /bin/python3 at its end with /include/python3.6m/. After pressing return/enter, Python.h should appear in the output.
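
If you would rather not edit the path by hand, the same check can be sketched in Python. This assumes that the include directory reported by sysconfig is the one the build will use, which is normally the case.

```python
import os
import sysconfig

# Directory where this interpreter expects its C headers
include_dir = sysconfig.get_paths()["include"]
header = os.path.join(include_dir, "Python.h")

print(header)
print("found" if os.path.isfile(header) else "missing")
```

If it prints "missing", the Python development headers are not installed for this interpreter.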

After all the preparations are done, we can proceed to the second step. In this step, a directory should be created for the extension.

mkdir -p tmp/pyextension
cd tmp/pyextension
touch streampcap.c
vim streampcap.c

And absolutely, you can use any editor you like to edit streampcap.c. Then with streampcap.c opened, the very first thing is to write

#include <Python.h>

This is a Python extension after all. Now we can go on and let's define the streampcap module.

static PyModuleDef streampcapmodule = {
    PyModuleDef_HEAD_INIT,
    // module name
    .m_name = "streampcap",
    // Documentation for the module
    .m_doc = "A module that reads the pcap file.",
    .m_size = -1,
};

That's the easiest and happiest part. Now we will proceed to the StreamPcap class. Because StreamPcap reads a .pcap file, a packet capture descriptor is associated with each instance.

typedef struct {
    PyObject_HEAD
    /// Packet capture descriptor
    pcap_t *p;
} StreamPcapObject;

As a class, it first needs an allocator, which allocates sufficient memory for an instance and does some basic initialisation work. Secondly, what would be the __init__(self, ...) function in Python code is written in C/C++. Thirdly, to free all associated resources when the instance is gone, a deallocator is needed as well. Finally, based on our proposal, StreamPcap has a .next() method, so we should define the methods. Let me walk you through these one by one.

Firstly, the allocator. Given that there is a pointer in our StreamPcapObject, it is better to explicitly initialise that pointer to NULL once the memory has been allocated successfully.

/**
 StreamPcap Allocator
 
 @param type         StreamPcapType
 @param ignored_args unused positional args
 @param ignored_kwds unused keyword args
 @return Pointer to a StreamPcapObject
 */
static PyObject *
StreamPcap_new(PyTypeObject *type, PyObject *Py_UNUSED(ignored_args), PyObject *Py_UNUSED(ignored_kwds))
{
    StreamPcapObject *self = (StreamPcapObject *)type->tp_alloc(type, 0);
    if (self) {
        // explicitly initialise the pointer to NULL
        self->p = NULL;
    }
    return (PyObject *)self;
}

Next comes the initialiser. If the StreamPcap class were written in Python, this would be the __init__(self, filepath) function. Once we get this function done, we will have achieved these lines in pseudo Python code.

class StreamPcap:
    def __init__(self, filepath):
        self.p = pcap_open_offline(filepath)
        if self.p is None:
            raise RuntimeError("No such file or directory")

As seen from above, there is one positional argument, which can also be passed as a keyword argument named filepath.

/**
 Initialiser
 
 @param self Pointer to a StreamPcapObject
 @param args Positional args, StreamPcap(filepath)
 @param kwds Keyword args, StreamPcap(filepath=...)
 @return 0 if the file exists and can be opened
 Otherwise -1 with a Python exception set
 */
static int
StreamPcap_init(StreamPcapObject *self, PyObject *args, PyObject *kwds)
{
    const char * kwargs[] = {"filepath", NULL};
    // local variable for filepath
    const char * filepath;
    
    // try to get a string
    // which would be the filepath
    if (!PyArg_ParseTupleAndKeywords(args, kwds, (char *)"s", (char **)kwargs, &filepath)) {
        return -1;
    }
    
    // Open a file containing packet capture data. This must be called
    // before processing any of the packet capture data. The file
    // should be a savefile in pcap format, e.g. one written by
    // tcpdump or pcap_dump()
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t *p = pcap_open_offline(filepath, errbuf);
    if (!p) {
        PyErr_SetString(PyExc_RuntimeError, errbuf);
        return -1;
    }
    
    // __init__ may be called more than once on the same instance,
    // so close any previously opened descriptor before replacing it
    if (self->p) {
        pcap_close(self->p);
    }
    self->p = p;
    
    return 0;
}
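
A missing file is not the only thing that can go wrong here: the file also has to look like a capture file. As a rough illustration of what such a check involves, here is a pure-Python sketch that validates the 24-byte global header of a classic pcap file. It only knows the microsecond magic number, whereas libpcap itself accepts more variants, and read_global_header is a name made up for this example.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4  # classic pcap magic, microsecond timestamps


def read_global_header(data):
    """Parse the 24-byte pcap global header; raise ValueError if invalid."""
    if len(data) < 24:
        raise ValueError("truncated pcap header")
    # The magic number tells us the byte order the file was written in
    for endian in ("<", ">"):
        (magic,) = struct.unpack(endian + "I", data[:4])
        if magic == PCAP_MAGIC:
            break
    else:
        raise ValueError("bad magic number, not a classic pcap file")
    major, minor, _zone, _sigfigs, snaplen, linktype = struct.unpack(
        endian + "HHiIII", data[4:24])
    return {"endian": endian, "version": (major, minor),
            "snaplen": snaplen, "linktype": linktype}


# Header of a little-endian pcap 2.4 file with Ethernet link type (1)
header = struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, 65535, 1)
print(read_global_header(header))
```

In the extension, all of this is delegated to pcap_open_offline, and its error message ends up in the RuntimeError.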

Here we come to the third part, the deallocator. It's necessary for the StreamPcap class because a packet capture descriptor is associated with each instance. To avoid a double free, the descriptor is first checked against NULL. If it is not NULL, it is passed to pcap_close and then set to NULL.

/**
 StreamPcap Destructor
 
 @param self Pointer to a StreamPcapObject
 */
static void
StreamPcap_dealloc(StreamPcapObject *self)
{
    // check whether the file is open or not
    if (self->p) {
        // close the packet capture device and free the memory used by the
        // packet capture descriptor
        pcap_close(self->p);
        self->p = NULL;
    }
    
    // free memory
    Py_TYPE(self)->tp_free((PyObject *) self);
}

Finally the really useful part, the .next() method. This is where each record gets extracted, checked and packed into a Python dict object. Though the code is slightly long, the majority of it is libpcap related.

/**
 Get next record in pcap file
 
 @param self Pointer to a StreamPcapObject
 @param ignored Unused, the method is registered as METH_NOARGS
 @return dict for the next IPv4 record if one exists
 Otherwise None
 */
static PyObject *
StreamPcap_next(StreamPcapObject *self, PyObject *Py_UNUSED(ignored))
{
    if (self->p == NULL) {
        PyErr_SetString(PyExc_RuntimeError, "No pcap file opened");
        return NULL;
    }
    
    struct pcap_pkthdr hdr;
    const u_char * packet;
    while (1) {
        packet = pcap_next(self->p, &hdr);
        
        if (packet) {
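            // Skip records too short to hold an Ethernet and an IPv4 header
            if (hdr.caplen < sizeof(struct ether_header) + sizeof(struct ip)) {
                continue;
            }
            
            // Get Ethernet header. This assumes an Ethernet link layer.
            struct ether_header * eh = (struct ether_header *)packet;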
            
            // Get upper protocol type.
            unsigned short ether_type = ntohs(eh->ether_type);
            
            // Only cares about IP
            if (ether_type == ETHERTYPE_IP) {
                // Get IP header
                struct ip * iph = (struct ip *)(packet + sizeof(struct ether_header));
                
                PyObject * ip_src, * ip_dst;
                ip_src = Py_BuildValue("s", inet_ntoa(iph->ip_src));
                ip_dst = Py_BuildValue("s", inet_ntoa(iph->ip_dst));
                
                // https://www.tutorialspoint.com/python/python_further_extensions.htm
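                // "I" matches the unsigned 32-bit lengths. The timestamp is
                // computed as a plain double: the literal 1.0l is a long
                // double, which does not match the "d" format unit.
                PyObject* ret = Py_BuildValue("{s:I,s:I,s:i,s:i,s:O,s:O,s:d}",
                                              "caplen", hdr.caplen,
                                              "len", hdr.len,
                                              "ip_v", iph->ip_v,
                                              "ip_hl", iph->ip_hl<<2,
                                              "ip_src", ip_src,
                                              "ip_dst", ip_dst,
                                              "ts", (double)hdr.ts.tv_sec + hdr.ts.tv_usec / 1e6
                                              );
                // XDECREF tolerates NULL if building a string failed
                Py_XDECREF(ip_src);
                Py_XDECREF(ip_dst);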
                return ret;
            } else {
                continue;
            }
        } else {
            // No more records
            Py_RETURN_NONE;
        }
    }
}
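
When testing the extension, it can be handy to have a slow but simple reference to compare against. The sketch below walks the records of a little-endian classic pcap held in memory and builds the same dict keys as the C code. It is an approximation written for this post, not a replacement for libpcap: it skips the global header without validating it, and make_record is a made-up helper.

```python
import socket
import struct


def iter_ipv4_records(data):
    """Yield dicts for IPv4-over-Ethernet records of a little-endian pcap."""
    offset = 24  # skip the global header (not validated in this sketch)
    while offset + 16 <= len(data):
        # Per-record header: seconds, microseconds, captured and original length
        ts_sec, ts_usec, caplen, length = struct.unpack_from("<IIII", data, offset)
        frame = data[offset + 16:offset + 16 + caplen]
        offset += 16 + caplen
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue  # too short or not IPv4, skip to the next record
        yield {
            "caplen": caplen,
            "len": length,
            "ip_v": frame[14] >> 4,
            "ip_hl": (frame[14] & 0x0F) << 2,
            "ip_src": socket.inet_ntoa(frame[26:30]),
            "ip_dst": socket.inet_ntoa(frame[30:34]),
            "ts": ts_sec + ts_usec / 1e6,
        }


def make_record(ts_sec, ts_usec, frame):
    """Hypothetical helper: prepend a pcap record header to a frame."""
    return struct.pack("<IIII", ts_sec, ts_usec, len(frame), len(frame)) + frame


ip_frame = (bytes(12) + b"\x08\x00" + bytes([0x45]) + bytes(11)
            + socket.inet_aton("192.168.0.1") + socket.inet_aton("192.168.0.2"))
arp_frame = bytes(12) + b"\x08\x06" + bytes(28)
capture = (bytes(24) + make_record(1, 500000, arp_frame)
           + make_record(2, 250000, ip_frame))
records = list(iter_ipv4_records(capture))
print(records)
```

The ARP record is skipped, just like the continue branch in the C loop, so only the IPv4 record comes out.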

The only method of StreamPcap is done, but Python requires a little more information: a structure that describes all the methods of StreamPcap.

static PyMethodDef StreamPcap_methods[] = {
    {"next", (PyCFunction)StreamPcap_next, METH_NOARGS,
        "Return the next record in dict if exists, otherwise None"
    },
    {NULL}  /* Sentinel */
};

Once the structure that describes all the methods in StreamPcap is done, the structure that defines the StreamPcap class can be written.

static PyTypeObject StreamPcapType = {
    PyVarObject_HEAD_INIT(NULL, 0)
    // Class name
    .tp_name = "streampcap.StreamPcap",
    // Documentation
    .tp_doc = "StreamPcap objects",
    // Instance size
    .tp_basicsize = sizeof(StreamPcapObject),
    .tp_itemsize = 0,
    .tp_flags = Py_TPFLAGS_DEFAULT,
    // Allocate new instance
    .tp_new = StreamPcap_new,
    // Initialize new instance
    .tp_init = (initproc)StreamPcap_init,
    // Destructor of StreamPcap
    .tp_dealloc = (destructor)StreamPcap_dealloc,
    // Methods in StreamPcap
    .tp_methods = StreamPcap_methods,
};

To wrap up all the things in our C/C++ code, a module initialiser is required.

/**
 Python Module Initialization
 
 @return module object
 */
PyMODINIT_FUNC
PyInit_streampcap(void)
{
    PyObject *m;
    if (PyType_Ready(&StreamPcapType) < 0) {
        return NULL;
    }
    
    m = PyModule_Create(&streampcapmodule);
    if (m == NULL) {
        return NULL;
    }
    
    Py_INCREF(&StreamPcapType);
    if (PyModule_AddObject(m, "StreamPcap", (PyObject *)&StreamPcapType) < 0) {
        Py_DECREF(&StreamPcapType);
        Py_DECREF(m);
        return NULL;
    }
    return m;
}

Lastly, Python uses setup.py as the build manifest. The setup.py file holds the name of the extension, its version, which libraries should be linked, which source files should be compiled, and so on.

# distutils was removed in Python 3.12; setuptools offers the same setup and Extension
from setuptools import setup, Extension

streampcap = Extension('streampcap',
                        define_macros = [('MAJOR_VERSION', '1'),
                                        ('MINOR_VERSION', '0')],
                        include_dirs = ['/usr/local/include'],
                        libraries = ['pcap'],
                        library_dirs = ['/usr/local/lib'],
                        sources = ['streampcap.c'])


setup (name = 'streampcap',
        version = '1.0',
        description = 'Read a pcap stream',
        author = 'Cocoa',
        author_email = '[data deleted]',
        url = '#/pyextension',
        long_description = '''
Read from pcap file, streamify each record inside it
''',
        ext_modules = [streampcap])
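
The hard-coded /usr/local paths above fit my machine and may not exist on yours. One possible way to make the lists a bit more portable is to keep only the directories that are actually there. The candidate prefixes below are assumptions, so adjust them to wherever libpcap lives on your system.

```python
import os


def existing_dirs(candidates):
    """Keep only the directories that exist on this machine."""
    return [d for d in candidates if os.path.isdir(d)]


# Candidate prefixes are assumptions, not an exhaustive or authoritative list
prefixes = ["/usr/local", "/opt/homebrew"]
include_dirs = existing_dirs(os.path.join(p, "include") for p in prefixes)
library_dirs = existing_dirs(os.path.join(p, "lib") for p in prefixes)

print(include_dirs)
print(library_dirs)
```

The two resulting lists could then be passed to Extension as include_dirs and library_dirs. When libpcap sits in the system default locations, both can simply be empty.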

Are we all set now? Actually, not yet. These header files should be included in streampcap.c as well, right after Python.h.

#include <fcntl.h> // open()
#include <pcap.h>  // pcap_open_offline(), pcap_next(), pcap_close()
#include <netinet/if_ether.h> // struct ether_header
#include <netinet/ip.h> // struct ip
#include <arpa/inet.h> // ntohs(), inet_ntoa()

Here we can finally compile the extension!

python3 setup.py build
python3 setup.py install

Check the result~
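
If you have no capture at hand, a tiny one can be generated for a smoke test. The sketch below writes a classic little-endian pcap with a single hand-built IPv4 frame into a temporary directory and, if the extension is installed, reads it back. The addresses and the timestamp are made up.

```python
import os
import socket
import struct
import tempfile

# Global header: magic, version 2.4, timezone, sigfigs, snaplen, Ethernet link type
global_header = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)

# One Ethernet frame carrying a bare 20-byte IPv4 header
frame = (bytes(12) + b"\x08\x00" + bytes([0x45]) + bytes(11)
         + socket.inet_aton("10.0.0.1") + socket.inet_aton("10.0.0.2"))
record = struct.pack("<IIII", 1500000000, 123456, len(frame), len(frame)) + frame

path = os.path.join(tempfile.mkdtemp(), "sample.pcap")
with open(path, "wb") as f:
    f.write(global_header + record)

try:
    from streampcap import StreamPcap
except ImportError:
    print("streampcap is not installed, wrote", path)
else:
    pcap = StreamPcap(path)
    print(pcap.next())  # a dict with ip_src, ip_dst, ts and the other fields
    print(pcap.next())  # no more records
```

With the extension installed, the first call should return a dict for the single record and the second should return None.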
