How to represent arbitrary-precision integers?

4    23 Jun 2015 21:34 by u/NotSurvivingLife

This may be better off in /v/compsci or something, and indeed I shall crosslink this post there.

I'm wondering: what are the different ways of encoding arbitrary-precision integers, and how do they compare with respect to efficiency?

The ways I know of:

  • Encode it in unary, with n ones followed by a zero. This takes n+1 bits to encode. So 10 (base 10) gets encoded as 1111 1111 11 0.
  • Encode the number of digits in the above scheme, followed by the digits of the number in base 2. This takes 2*ceil(log2(n+1)) + 1 digits to encode, or 1 bit for zero. So 10 (base 10) gets encoded as 1111 0 1010. (Or potentially with the number encoded LSB-first instead. Same difference)
  • Encode the number of digits with zeros, followed by the number MSB-first. (Effectively you reuse the implied 1 at the highest bit of the number - similar to how floats work.) Note that in this scheme 0 has to be represented specially, by 1, for instance. So 10 gets encoded as 0000 1010. This takes 2*ceil(log2(n+1)) digits to encode, or 1 bit for 0. Again, MSB-first versus LSB-first, same difference.
  • Encode the number of digits for the number of digits with zeros, followed by the number of digits MSB-first, followed by the number MSB-first. Again, with 0 represented specially, for instance with 1. This takes ceil(log2(n+1)) + 2*ceil(log2(ceil(log2(n+1))+1)) digits to encode, or 1 bit for 0.
  • Note that you can iterate the above as many times as you wish.
  • Encode the number as groups of i bits, for some value of i >= 1, where all ones means "the number is done" and any other group is a digit of the number in base 2^i - 1. Note that i=1 is equivalent to unary, above. So, for example, with i=2, 10 (base 10) gets encoded as 01 00 01 11 (101, base 3, then stop. Again, LSB-first or MSB-first, same difference.) This takes i*(ceil(log_(2^i-1)(n+1)) + 1) bits to encode, or i bits for zero.

I am sure there are more potential encodings, however. And it would not surprise me if I messed up counting the numbers of bits required. Note that encoding 2 is always beaten or tied by encoding 3.
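For concreteness, here is a quick Python sketch of the first three schemes as bit strings (the function names are mine, not from any library):

```python
def unary(n):
    # Scheme 1: n ones followed by a terminating zero -- n+1 bits.
    return "1" * n + "0"

def unary_prefixed(n):
    # Scheme 2: the binary length in unary, then the binary digits
    # MSB-first; 0 is just the 1-bit terminator.
    if n == 0:
        return "0"
    bits = bin(n)[2:]
    return unary(len(bits)) + bits

def zero_prefixed(n):
    # Scheme 3: as many zeros as there are binary digits, then the number
    # MSB-first (its leading 1 doubles as the terminator); 0 is
    # special-cased as a lone 1.
    if n == 0:
        return "1"
    bits = bin(n)[2:]
    return "0" * len(bits) + bits

print(unary(10))           # 11111111110
print(unary_prefixed(10))  # 111101010
print(zero_prefixed(10))   # 00001010
```

The outputs for 10 match the worked examples above: 11 bits, 9 bits, and 8 bits respectively.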

3 comments

1

I would recommend you look at Java's BigInteger. Internally it's an array of ints.

0

Pretty much this. There are a bunch of encodings that are theoretically possible, but in practice you can't beat an array of ints for performance on a COTS CPU. (Involve GPUs or FPGAs and it's a completely different story.)
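For illustration, the array-of-ints idea amounts to storing the magnitude as little-endian base-2^32 "limbs". A rough Python sketch (function names are my own):

```python
def to_limbs(n, limb_bits=32):
    # Split a non-negative int into little-endian base-2**limb_bits limbs,
    # roughly what BigInteger-style representations store internally.
    mask = (1 << limb_bits) - 1
    limbs = []
    while True:
        limbs.append(n & mask)
        n >>= limb_bits
        if n == 0:
            break
    return limbs

def from_limbs(limbs, limb_bits=32):
    # Reassemble the integer from its limbs, most significant first.
    n = 0
    for limb in reversed(limbs):
        n = (n << limb_bits) | limb
    return n

x = 2**100 + 5
assert to_limbs(x) == [5, 0, 0, 16]   # 16 * 2**96 + 5 == 2**100 + 5
assert from_limbs(to_limbs(x)) == x
```

Arithmetic then works limb-by-limb with carries, which maps directly onto the CPU's native add/multiply instructions -- that's why this beats the bit-level encodings in practice.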

1

One format I've seen in several places (the MIDI file format, for example) is base-128 with the high bit set on all bytes except the last, i.e. (in hex):

0 = 00
1 = 01
...
127 = 7F
128 = 81 00
129 = 81 01
...
255 = 81 7F
256 = 82 00
...
16383 = FF 7F
16384 = 81 80 00
...

The nice thing about this is that a number always encodes to a multiple of 8 bits - reading and writing data tends to be much easier when everything is byte-aligned.
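As a sketch (function names are my own), encoding and decoding that format might look like:

```python
def vlq_encode(n):
    # MIDI-style variable-length quantity: base 128, most significant
    # group first, high bit set on every byte except the last.
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    return bytes(reversed(out))

def vlq_decode(data):
    # Accumulate 7 bits per byte until a byte with the high bit clear.
    n = 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if not (b & 0x80):
            break
    return n

assert vlq_encode(127) == bytes([0x7F])
assert vlq_encode(128) == bytes([0x81, 0x00])
assert vlq_encode(16383) == bytes([0xFF, 0x7F])
assert vlq_encode(16384) == bytes([0x81, 0x80, 0x00])
```

These match the hex table above, and decoding just reverses the process byte by byte.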