For my Pyro project I recently wrote in some detail about the pickle efficiency of various types when dealing with binary data. I've added it to the Pyro documentation but it is also useful when dealing with pickle in general.
So, here is a short overview of the pickle wire protocol overhead for the possible types you can use when transferring binary data:
- str
Python 2.x: efficient; directly encoded as a byte sequence, because that’s what it is. Python 3.x: inefficient; encoded in UTF-8 on the wire, because it is a unicode string.
- bytes
Python 2.x: same as str. Python 3.x: efficient; directly encoded as a byte sequence.
- bytearray
Inefficient; encoded as UTF-8 on the wire (pickle does this in both Python 2.x and 3.x).
- array("B") (array of unsigned ints of size 1)
Python 2.x: very inefficient; every element is encoded as a separate token+value. Python 3.x: efficient; uses machine type encoding on the wire (a byte sequence).
Your best bet seems to be to use the bytes type (and possibly the array("B") type if you’re using Python 3.x) and steer clear of the rest. It’s strange that the bytearray type is encoded so inefficiently by pickle.
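If you want to check the overview above on your own interpreter, something like the following rough sketch could help. It pickles the same 1024 byte payload as each of the four types and prints the resulting sizes. The names `payload`, `candidates` and `pickled_sizes` are made up for this illustration, and it assumes Python 3. The numbers depend on the Python version and pickle protocol, so interpreters newer than the ones discussed here may well give different results for bytearray.

```python
import pickle
from array import array

# 1024 bytes covering every possible byte value, including those above 0x7f.
payload = bytes(range(256)) * 4

# The same binary data, held in each of the types from the overview.
candidates = {
    "str": payload.decode("latin-1"),   # one character per byte
    "bytes": payload,
    "bytearray": bytearray(payload),
    'array("B")': array("B", payload),
}


def pickled_sizes(objects, protocol=pickle.HIGHEST_PROTOCOL):
    """Return a dict mapping each name to the length of its pickle."""
    return {name: len(pickle.dumps(obj, protocol)) for name, obj in objects.items()}


sizes = pickled_sizes(candidates)
for name, size in sizes.items():
    print("%-12s %5d bytes pickled for %d payload bytes" % (name, size, len(payload)))
```

Anything that ends up much larger than the payload itself is paying an encoding overhead on the wire.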
A bytearray is pickled (using max protocol) as follows:

    >>> pickletools.dis(pickle.dumps(bytearray("\xff"*10),2))
        0: \x80 PROTO      2
        2: c    GLOBAL     '__builtin__ bytearray'
       25: q    BINPUT     0
       27: X    BINUNICODE u'\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff'
       52: q    BINPUT     1
       54: U    SHORT_BINSTRING 'latin-1'
       63: q    BINPUT     2
       65: \x86 TUPLE2
       66: q    BINPUT     3
       68: R    REDUCE
       69: q    BINPUT     4
       71: .    STOP
    >>> bytearray("\xff"*10).__reduce__()
    (<type 'bytearray'>, (u'\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff', 'latin-1'), None)
Most notably, the actual *bytes* in the bytearray are represented by a UTF-8 encoded unicode string (the BINUNICODE opcode), so every byte above 0x7f takes two bytes on the wire. When unpickling, this has to be decoded into a unicode string and then encoded back into bytes using latin-1. The thing being a bytearray, I would expect it to be pickled as what it is: a sequence of bytes (using the BINSTRING/BINBYTES pickle opcodes), which can then be converted back to a bytearray using the constructor that takes the bytes directly.
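The detour through a unicode string can be reproduced by hand. This is only a small illustration of the encoding steps, not the actual pickle code, and the variable names are invented for the example. It shows why ten 0xff bytes take twenty bytes on the wire, which matches the offsets 27 to 52 in the disassembly (1 opcode byte + 4 length bytes + 20 payload bytes).

```python
raw = b"\xff" * 10

# What __reduce__ hands to pickle: the bytes viewed as latin-1 text.
as_text = raw.decode("latin-1")

# What BINUNICODE puts on the wire: that text encoded as UTF-8.
on_wire = as_text.encode("utf-8")

# What unpickling has to do: decode the UTF-8, then encode back with latin-1.
restored = bytearray(on_wire.decode("utf-8").encode("latin-1"))

print(len(raw), len(on_wire), restored == bytearray(raw))
```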
The above occurs on both Python 2.x and 3.x.
I have no idea yet why this is. Maybe I'll write a patch to improve it; it doesn't seem that hard to do.
Edit: Seems we can indeed optimize the pickle stream, using [SHORT_]BINBYTES for Python 3 and [SHORT_]BINSTRING for Python 2:
    >>> p=b'\x80\x03cbuiltins\nbytearray\nC\x04ABCD\x85R.'
    >>> pickletools.dis(p)
        0: \x80 PROTO      3
        2: c    GLOBAL     'builtins bytearray'
       22: C    SHORT_BINBYTES 'ABCD'
       28: \x85 TUPLE1
       29: R    REDUCE
       30: .    STOP
    highest protocol among opcodes = 3
    >>> pickle.loads(p)
    bytearray(b'ABCD')
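To see which opcodes your own Python 3 interpreter uses for a bytearray, you could compare it with the hand-written stream above. This is a sketch only. The helper `opcode_names` and the other variable names are made up for this example, while `pickletools.genops` is the standard library function that iterates over the opcodes of a pickle.

```python
import pickle
import pickletools


def opcode_names(data):
    """Return the names of all opcodes in a pickle, in stream order."""
    return [opcode.name for opcode, arg, pos in pickletools.genops(data)]


# The hand-written, optimized stream from the session above.
handmade = b'\x80\x03cbuiltins\nbytearray\nC\x04ABCD\x85R.'
handmade_names = opcode_names(handmade)
restored = pickle.loads(handmade)

# What this interpreter produces by itself for the same bytearray.
native_names = opcode_names(pickle.dumps(bytearray(b"ABCD"), 3))

# True if the payload still takes the detour through a unicode string.
uses_text_detour = any("UNICODE" in name for name in native_names)

print(handmade_names)
print(native_names)
print("unicode detour:", uses_text_detour)
```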
Bug filed, including patches: http://bugs.python.org/issue13503. It will be fixed in Python 3.3.