pickle efficiency when dealing with binary data

Sunday 27 November 2011, 15:39:00 | python

For my Pyro project I recently wrote in some detail about the pickle efficiency of various types when dealing with binary data. I've added it to the Pyro documentation but it is also useful when dealing with pickle in general.

So, here is a short overview of the pickle wire protocol overhead for the possible types you can use when transferring binary data:

Your best bet seems to be to use the bytes type (and possibly the array("B") type if you’re using Python 3.x) and steer clear of the rest. It’s strange that the bytearray type is encoded so inefficiently by pickle.
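To check this on your own interpreter, something like the following sketch could be used. It assumes Python 3; `payload`, `candidates` and `pickle_overhead` are names made up for this illustration, and the numbers it prints depend on the Python version and pickle protocol, so treat them as measurements rather than fixed facts.

```python
import pickle
from array import array

# 1000 bytes of 0xff: a worst case for any text-based encoding of binary data.
payload = bytes([255]) * 1000

# The same binary data wrapped in each candidate type.
candidates = {
    "bytes": payload,
    "bytearray": bytearray(payload),
    'array("B")': array("B", payload),
}

def pickle_overhead(obj, protocol):
    """Number of bytes the pickle adds on top of the raw payload size."""
    return len(pickle.dumps(obj, protocol)) - len(payload)

# Print the overhead per type for every binary protocol this interpreter has.
for protocol in range(2, pickle.HIGHEST_PROTOCOL + 1):
    for name, obj in candidates.items():
        overhead = pickle_overhead(obj, protocol)
        print(f"protocol {protocol}  {name:12} overhead: {overhead} bytes")
```

Comparing the overhead against the raw payload size makes it easy to see which types cost roughly a constant number of bytes and which ones scale with the amount of data.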

A bytearray is pickled as follows (using protocol 2, the highest protocol available on Python 2):

>>> import pickle, pickletools
>>> pickletools.dis(pickle.dumps(bytearray([255]*10),2))
    0: \x80 PROTO      2
    2: c    GLOBAL     '__builtin__ bytearray'
   25: q    BINPUT     0
   27: X    BINUNICODE u'\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff'
   52: q    BINPUT     1
   54: U    SHORT_BINSTRING 'latin-1'
   63: q    BINPUT     2
   65: \x86 TUPLE2
   66: q    BINPUT     3
   68: R    REDUCE
   69: q    BINPUT     4
   71: .    STOP

>>> bytearray("\xff"*10).__reduce__()
(<type 'bytearray'>, (u'\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff', 'latin-1'), None)

Most notably, the actual *bytes* in the bytearray are represented by a UTF-8 encoded unicode string. Every byte with a value of 0x80 or higher takes two bytes in UTF-8, which is why the ten 0xff bytes above occupy 20 bytes in the BINUNICODE argument. When unpickled, this data has to be decoded into a unicode string and then encoded back into bytes using latin-1. The thing being a bytearray, I would expect it to be pickled as such: a sequence of bytes, which is then converted back to a bytearray using the constructor that takes the bytes directly (BINSTRING/BINBYTES pickle opcodes).
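The detour through text can be reproduced by hand. This is only a sketch of the conversions involved, assuming Python 3; the variable names are mine and are not part of pickle itself.

```python
# Ten bytes of 0xff, the same data as in the disassembly above.
raw = bytes([255]) * 10

# Pickling side: the bytes are turned into text, one character per byte.
as_text = raw.decode("latin-1")

# BINUNICODE stores that text as UTF-8, where U+00FF needs two bytes.
on_the_wire = as_text.encode("utf-8")
print(len(raw), len(on_the_wire))

# Unpickling side: decode the UTF-8 text, then call bytearray(text, 'latin-1').
restored = bytearray(on_the_wire.decode("utf-8"), "latin-1")
print(restored == bytearray(raw))
```

For data that is mostly ASCII the size penalty is small, but the extra decode and encode steps are still performed.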

The above occurs both on Python 2.x and 3.x.

I have no idea yet why this is. Maybe I'll write a patch to improve it; it doesn't seem that hard to do.

Edit: Seems we can indeed optimize the pickle stream, using [SHORT_]BINBYTES for Python 3 and [SHORT_]BINSTRING for Python 2:

>>> p=b'\x80\x03cbuiltins\nbytearray\nC\x04ABCD\x85R.'
>>> pickletools.dis(p)
    0: \x80 PROTO      3
    2: c    GLOBAL     'builtins bytearray'
   22: C    SHORT_BINBYTES 'ABCD'
   28: \x85 TUPLE1
   29: R    REDUCE
   30: .    STOP
highest protocol among opcodes = 3
>>> pickle.loads(p)
bytearray(b'ABCD')

Bug filed, including patches: http://bugs.python.org/issue13503. It will be fixed in Python 3.3.