In the beginning, there was only binary data.
And that is the genesis of this post.
Any serious server-side language must play nice with binary data. Node’s strategy was to introduce Buffers. If you have dealt with streams or crypto in Node.js, chances are you have dealt with Buffers. This post is about one such encounter.
One fine afternoon we were trying to generate random codes. Each code is just a random string of ASCII characters (like P4AZHEDXWA). There was just one catch: we needed 200 million unique codes, and along the way we ran into a problem with Buffers.
When marketing comes to us with gigantic promotional campaigns (like this one that requires doling out 200 million unique promo codes), sometimes things get pushed to their limits. In the process, we discover a thing or two about the tools we use everyday. That’s the beauty of scale.
But first, let me describe how we decided to generate these codes.
Step 1: Get a good deal of entropy.
This one was easy: /dev/random was enough. We estimated we would need a couple of gigabytes of random data, hence this:
dd if=/dev/random of=random.data bs=1k count=2048000
This was quite fast on my Mac, barely taking a couple of minutes.
Step 2: Read n bytes at a time, check for uniqueness, and generate
In this step, we would read a chunk of bytes from the random seed, check if that chunk is unique (hasn’t been seen yet), and BASE32 encode it to get our promo code.
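The post doesn’t show the encoder itself, so here is a minimal sketch, assuming 5-byte chunks (5 bytes is 40 bits, which maps to exactly 8 base32 characters with no padding) and the RFC 4648 alphabet; the real chunk size and alphabet may well have differed:

```javascript
// Assumed alphabet (RFC 4648); the actual codes may use a different one.
var ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ234567';

function base32Encode5(bytes) {
  // Accumulate bits from each byte and peel off 5 bits per output character.
  var out = '';
  var bits = 0, value = 0;
  for (var i = 0; i < 5; i++) {
    value = (value << 8) | bytes[i];
    bits += 8;
    while (bits >= 5) {
      out += ALPHABET[(value >>> (bits - 5)) & 31];
      bits -= 5;
    }
  }
  return out;
}

console.log(base32Encode5([0, 0, 0, 0, 0])); // AAAAAAAA
```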
How do you check the uniqueness of such a large set? Our approach used Bloom filters, which are essentially a set of hash functions used to set bits in a buffer.
That is potentially the worst description of Bloom filters one can come up with, but they are such a nice data structure that we use them often, and I will probably cover them in a post of their own.
Back to the exercise: we fired up the program and started generating codes. The curious thing we noticed was that the Node.js script ran very fast in the beginning, then kept getting slower and slower. Eventually, at around 6 million codes, it would just seem to hang.
Now, one of the nice things about Bloom filters is that adding to one (or looking up from it) is a constant-time operation. And since we had already generated all the randomness we needed and dumped it to a file, the script couldn’t be waiting on entropy. strace offered the first clue: the script kept asking for more and more memory, and at around 6 million codes, having exhausted all the system memory, it would simply hang.
It all boiled down to this:
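The original snippet is not preserved here, but the offending logic looked roughly like this (BloomFilter and setBit are illustrative names, not necessarily the real ones):

```javascript
// Sketch of the buggy bit-setting code; names are illustrative.
function BloomFilter(sizeInBytes) {
  this.buffer = Buffer.alloc(sizeInBytes); // the filter's bit array
}

BloomFilter.prototype.setBit = function (hash) {
  var byteIndex = hash >> 3;                 // intended as a fast Math.floor(hash / 8)
  this.buffer[byteIndex] |= 1 << (hash & 7); // hash & 7 picks the bit within the byte
};
```

For a hash whose 32-bit representation has the sign bit set, byteIndex comes out negative, and that is where the trouble starts.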
I can imagine the Crockford fans jumping in joyous exhilaration with a look of ‘I told you so’. Yes, it’s a bad idea to use bitshifts to divide by 8, even if it is fast. But first, an explanation of what was causing the leak is in order.
Each input chunk (the one being checked for uniqueness) is converted into a set of hashes (using FNV). Each hash is then used to set a bit position in a buffer. To calculate the byte position, the hash is divided by 8.
It’s here that things fall apart. If the hash value is too big (so that the sign bit is set once it is coerced to a 32-bit integer), the following will not be equivalent:
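The comparison that followed is missing from the post; it likely looked something like this:

```javascript
var hash = 2415919103; // 0x8fffffff: the sign bit is set as a 32-bit integer

console.log(Math.floor(hash / 8)); //  301989887, the byte index we wanted
console.log(hash >> 3);            // -234881025, because >> coerces to *signed* 32-bit
console.log(hash >>> 3);           //  301989887, the unsigned shift gets it right
```

The unsigned shift operator >>> would have avoided the whole episode for 32-bit hashes.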
Want to see it for yourself? Try 2415919103 (which is 0x8fffffff). You will get a negative number when you do a bitshift using >>.
But that’s not the end of it. How does a negative array index lead to a memory leak of this sort? That’s where the Buffer class comes in. You see, in the sample code above, this.buffer is an instance of Buffer, which is not an instance of Array. Don’t take my word for it, try it out yourself.
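A quick session confirms it (this still holds on current Node, where Buffer is a Uint8Array subclass):

```javascript
var buf = Buffer.alloc(8); // on the Node of that era this would have been `new Buffer(8)`

console.log(Buffer.isBuffer(buf)); // true
console.log(Array.isArray(buf));   // false
console.log(buf instanceof Array); // false
```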
The for loop comes to our rescue
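The original snippet isn’t preserved either. Note that on today’s Node a Buffer is a Uint8Array and silently discards out-of-bounds writes, so here is the same idea with a plain object standing in for an old-style Buffer:

```javascript
// A plain object standing in for a pre-typed-array Buffer of length 4.
var fakeBuffer = { 0: 0, 1: 0, 2: 0, 3: 0 };

fakeBuffer[-234881025] = 1; // the out-of-bounds write becomes an ordinary property

for (var key in fakeBuffer) {
  console.log(key); // "0", "1", "2", "3" ... and "-234881025"
}
```

Every negative byteIndex produced by a large hash adds one more such property, and 6 million codes times a handful of hashes each is a lot of properties.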
So there we have it. If you assign to an out-of-bounds index on a Buffer, that index becomes an ordinary property of the Buffer object. You may end up with a gigantic memory leak if you aren’t careful about maintaining the bounds yourself.
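For completeness, a bounds-safe version of the bit setter might look like this. This is my sketch, not the original patch, and the modulo policy for oversized indices is illustrative:

```javascript
function BloomFilter(sizeInBytes) {
  this.buffer = Buffer.alloc(sizeInBytes); // zero-filled bit array
}

BloomFilter.prototype.setBit = function (hash) {
  // >>> is an unsigned shift, so the index can never go negative,
  // and the modulo keeps it inside the buffer.
  var byteIndex = (hash >>> 3) % this.buffer.length;
  this.buffer[byteIndex] |= 1 << (hash & 7);
};
```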
BTW, we are hiring. If this sort of debugging excites you, write to me at qasim at paytm dot com.