How to represent arbitrary-precision integers?

4    23 Jun 2015 21:34 by u/NotSurvivingLife

This may be better off in /v/compsci or something, and indeed I shall crosslink this post there.

I'm wondering: what are the different ways of encoding arbitrary-precision integers, and how do they compare with respect to efficiency?

The ways I know of:

  • Encode it in unary, with n ones followed by a zero. This takes n+1 bits to encode. So 10 (base 10) gets encoded as 1111 1111 11 0.
  • Encode the number of digits in the above scheme, followed by the digits of the number in base 2. This takes 2*ceil(log2(n+1)) + 1 digits to encode, or 1 bit for zero. So 10 (base 10) gets encoded as 1111 0 1010. (Or potentially with the number encoded LSB-first instead. Same difference)
  • Encode the number of digits with zeros, followed by the number MSB-first. (Effectively you reuse the implied 1 at the highest bit of the number - similar to how floats work.) Note that in this scheme 0 has to be represented specially, by 1, for instance. So 10 gets encoded as 0000 1010. This takes 2*ceil(log2(n+1)) digits to encode, or 1 bit for 0. Again, MSB-first versus LSB-first, same difference.
  • Encode the number of digits for the number of digits with zeros, followed by the number of digits MSB-first, followed by the number MSB-first. Again, with 0 represented specially, for instance with 1. This takes ceil(log2(n+1)) + 2*ceil(log2(ceil(log2(n+1))+1)) digits to encode, or 1 bit for 0.
  • Note that you can iterate the above as many times as you wish.
  • Encode the number as groups of i bits, for some value of i >= 1, where all ones means "the number is done" and any other group is a digit of the number in base 2^i - 1. Note that i=1 is equivalent to unary, above. So, for example, with i=2, 10 (base 10) gets encoded as 01 00 01 11 (101, base 3, then stop. Again, LSB-first or MSB-first, same difference.) This takes i*(ceil(log_(2^i-1)(n+1)) + 1) bits to encode, or i bits for zero.

I am sure there are more potential encodings, however. And it would not surprise me if I messed up counting the numbers of bits required. Note that encoding 2 is always beaten or tied by encoding 3.
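For concreteness, here is a quick Python sketch of the first three schemes as bit strings (the function names are mine, not from any library):

```python
def unary(n):
    # Scheme 1: n ones followed by a terminating zero -- n+1 bits.
    return "1" * n + "0"

def unary_prefixed(n):
    # Scheme 2: the binary length in unary, then the binary digits
    # MSB-first; 0 is just the 1-bit terminator.
    if n == 0:
        return "0"
    bits = bin(n)[2:]
    return unary(len(bits)) + bits

def zero_prefixed(n):
    # Scheme 3: as many zeros as there are binary digits, then the number
    # MSB-first (its leading 1 doubles as the terminator); 0 is
    # special-cased as a lone 1.
    if n == 0:
        return "1"
    bits = bin(n)[2:]
    return "0" * len(bits) + bits

print(unary(10))           # 11111111110
print(unary_prefixed(10))  # 111101010
print(zero_prefixed(10))   # 00001010
```

The outputs for 10 match the worked examples above: 11 bits, 9 bits, and 8 bits respectively.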

3 comments

1

I would recommend you look at Java's BigInteger. Internally it's an array of ints.

0

Pretty much this. There are a bunch of encodings that are theoretically possible, but in practice you can't beat an array of ints for performance on a COTS CPU. (Involve GPUs or FPGAs and it's a completely different story.)
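For illustration, the array-of-ints idea amounts to storing the magnitude as little-endian base-2^32 "limbs". A rough Python sketch (function names are my own):

```python
def to_limbs(n, limb_bits=32):
    # Split a non-negative int into little-endian base-2**limb_bits limbs,
    # roughly what BigInteger-style representations store internally.
    mask = (1 << limb_bits) - 1
    limbs = []
    while True:
        limbs.append(n & mask)
        n >>= limb_bits
        if n == 0:
            break
    return limbs

def from_limbs(limbs, limb_bits=32):
    # Reassemble the integer from its limbs, most significant first.
    n = 0
    for limb in reversed(limbs):
        n = (n << limb_bits) | limb
    return n

x = 2**100 + 5
assert to_limbs(x) == [5, 0, 0, 16]   # 16 * 2**96 + 5 == 2**100 + 5
assert from_limbs(to_limbs(x)) == x
```

Arithmetic then works limb-by-limb with carries, which maps directly onto the CPU's native add/multiply instructions -- that's why this beats the bit-level encodings in practice.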

1

One format I've seen in several places (the MIDI file format, for example) is base-128 with the high bit set on all bytes except the last, i.e. (in hex):

0 = 00
1 = 01
...
127 = 7F
128 = 81 00
129 = 81 01
...
255 = 81 7F
256 = 82 00
...
16383 = FF 7F
16384 = 81 80 00
...

The nice thing about this is that a number always encodes to a multiple of 8 bits - reading and writing data tends to be much easier when everything is byte-aligned.
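As a sketch (function names are my own), encoding and decoding that format might look like:

```python
def vlq_encode(n):
    # MIDI-style variable-length quantity: base 128, most significant
    # group first, high bit set on every byte except the last.
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    return bytes(reversed(out))

def vlq_decode(data):
    # Accumulate 7 bits per byte until a byte with the high bit clear.
    n = 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if not (b & 0x80):
            break
    return n

assert vlq_encode(127) == bytes([0x7F])
assert vlq_encode(128) == bytes([0x81, 0x00])
assert vlq_encode(16383) == bytes([0xFF, 0x7F])
assert vlq_encode(16384) == bytes([0x81, 0x80, 0x00])
```

These match the hex table above, and decoding just reverses the process byte by byte.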