home shape

Sorting number strings numerically

Recently I gave a talk about ArangoDB in front of a community of mathematicians. I advertised that nearly arbitrary data can “easily” be stored in a JSON based document store. The moment I had uttered the word “easily”, one of them asked about long integers. And if a mathematician says “long integer” they do not mean 64bit but “properly long”. He actually wanted to store orders of finite groups. I said one should use a JSON UTF-8 string for this but I should have seen the next question coming because he then wanted that a sorted index would actually sort the documents by the numerical value stored in the string. But most databases – and ArangoDB is no exception here – will compare UTF-8 strings lexicographically (dictionary order).

And here is today’s problem:

How can one store long integral values as UTF-8 strings, such that the usual UTF-8 byte-wise lexicographical comparison actually sorts the values by their numerical value?

So the real title of this post should have been “An order-preserving embedding of the set of integers into the set of lexicographically sorted UTF-8 strings“. But who would have clicked on that? After all, this is a blog and no mathematical journal.

This problem has a surprisingly simple solution, at least for the set of positive numbers: To store the number N simply store N times the letter x. The larger N is, the longer the string gets and lexicographic ordering says that in this case longer strings with more letters are larger.

This answer is exactly as mathematically correct as it is totally useless, because who would like to waste nearly a megabyte of data for the number 1000000? We have to do better!

If we restrict ourselves to numbers with at most k digits, we can just pad with zeroes on the left and store 1000000 as (say)

    00000000000000000000000000000000000000000001000000

but this is also wasteful, and surely once we decide about the maximal number of digits, somebody comes along and would like to store a larger number. So what we really would like is a scheme with the following properties:

  • use not much more space than would anyway be needed in the usual notation
  • work for arbitrarily long numbers
  • can easily be converted to and from the usual decimal notation
  • is easy to explain and to comprehend

This post suggests such an encoding.

Let’s do the non-negative numbers first

For numbers with up to 92 digits, we simply put the number of digits of the number first (as 1 character, using Unicode code points 34 to 125, adding 33 to the number of digits
to get the code point), then follow by a space and then the actual number in the usual notation.

Here are a few examples:

    " 0             is the number 0
    " 1             is the number 1
    # 42            is the number 42
    . 1623463626463" is the number 1623463626463

Why does this work and preserves the numerical ordering? Clearly, a “shorter” positive number is smaller than a longer one (in the usual notation without leading zeroes). But in this case the lexicographical ordering of the strings compares the number of digits correctly in the first character (letters have higher Unicode code points than digits). For two numbers with the same number of digits the initial character is the same, the space as well, and then the lexicographical comparison works for the actual numbers. Note that transformation between formats is easy because the actual notation is contained verbatim after the space character, which makes it human readable at the same time.

This takes us up to 1092-1, which is clearly a large number but not enough for mathematicians.

For larger numbers Y >= 1092 we use the following trick

So far, we have encoded the number of digits of our number using the first character. Now we simply put a tilde character `~` to indicate something larger (the Unicode code point for tilde is 126 and thus larger than any one of the ones we have used above), then put the number of digits in the above notation (without the space), followed by a space and then the actual number, that is,

    "~cX Y"

where c is the character for the number of digits of X and 92 < X < 1092 in the above notation, and Y has exactly X digits not starting with a 0 digit. This works, because of a very similar argument as above: All these numbers are larger than 1092-1 and all the strings are lexicographically larger than the ones without a tilde. If two such numbers Y and Y’ have a different number of digits, then the one with fewer digits is smaller. But then the corresponding string “~cX” is lexicographically smaller than “~cX'” because of our previous arguments. If Y and Y’ have the same number of digits, then their “~cX” parts are the same and the lexicographical ordering sorts the strings correctly according to the numerical values.

This covers the range 1092 <= Y < 10(1092-1)-1, which is good for all practical cases, because no computer in this universe can store a string with 1092-1 characters anyway.

Therefore, our problem is practically solved. However, I told you in the beginning that this audience consisted of mathematicians. If they say “for all integers” they mean it.

So how would we do even larger numbers?

Actually, the case of exactly one tilde character is just a special case. In general, we store a string of this form:

    "~~~~~~ca[0]a[1]a[2]... a[n]"

which consists of n tilde characters (n some positive integer) followed by one character c and then n+1 decimal numbers a[0], a[1], a[2], …a[n], and a space before the last one. a[0] is encoded in the above notation with c and is less than 1092, a[1] has exactly a[0] digits, a[2] has exactly a[1] digits and so on until a[n] has exactly a[n-1] digits.

We claim that this delivers an embedding of an arbitrarily large positive integer into the set of all UTF-8 strings and it is order preserving. The details of the proof are left for the reader, but the main argument about the ordering works as above: To compare the two such representations of a and b respectively: Either a and b have the same number of digits, then the two strings will have equally many tilde characters and an equal amount of numbers and in both cases all but the last numbers are pairwise identical. Then the lexicographical comparison of a[n] and b[n] works correctly. If they have a different number of digits, then the shorter one is the smaller one and our previous argument shows that lexicographical comparison works correctly.

This is now a mathematically satisfying solution of the problem.

But stop! We forgot about the negative numbers? Fortunately, we left out the `!` character with Unicode code point 33, which is smaller than any initial character we have used above. So we can simply prepend a `!` sign and follow with the encoding of the absolute value, and have all negative numbers earlier than all positive numbers. However, there is a slight problem: We get that the string `!” 6` compares smaller than `!” 7` and this contradicts -7 being smaller than -6.

Therefore, we have to find a way to invert the ordering of the encodings we used for the positive numbers.
Fortunately, this is easy: Since we only used the characters from code points 33 to 126, we can simply translate everything by keeping the space and reversing the order. That
is, code points 33 and 126 are interchanged, 34 and 125, and so on, until 79 and 80. Then the lexicographic ordering will be the other way around. That is,

    `!} m` stands for -2 (since `" 2` is the encoding of +2)
    `!| jl` stands for -53 (since `# 53` is the encoding of +53)

This concludes the mathematical solution. For the sake of completeness, here are some JavaScript functions which do the encoding and decoding.

Note that we omitted the rare cases of strings with more than 10(1092-1)-1 digits:

  function encodeNonNegative(s, pad) {
      let l = s.length;
      if (l <= 92) { return String.fromCodePoint(33+l) + pad + s; }
      return '~' + encodeNonNegative(l.toString(10), '') + ' ' + s;
    }

    function translate(s) {
      let r = [];
      for (i = 0; i < s.length; ++i) {
        let c = s.charCodeAt(i);
        r.push(c === 32 ? ' ' : String.fromCodePoint(159 - c));
      }
      return r.join('');
    }

    function encodeLong(s) {
      if (s[0] !== '-') { return encodeNonNegative(s, ' '); }
      let p = encodeNonNegative(s.slice(1), ' ');
      return '-' + translate(p);
    }

    function decodeNonNegative(s) {
      return s.slice(s.indexOf(" ")+1);
    }

    function decodeLong(s) {
      if (s[0] !== '-') { return decodeNonNegatve(s); }
      let p = decodeNonNegative(translate(s.slice(1)));
      return '-' + p;
    }

The problem is solved, we can embed any integer into an UTF-8 string and preserve the proper order.

Authors note: This problem and its solution comes out of a collaboration between Christopher Jefferson (University of St Andrews), Max Horn (Universität Gießen) and Max Neunhöffer (ArangoDB).

Max Neunhöffer

Max is one of the C/C++ developers working on the ArangoDB core. In particular, he is responsible for the sharding extension and additionally converts the latest ideas from database science into C/C++ code. Furthermore, he enjoys to give public talks about the technical aspects of the ArangoDB development.

Leave a Comment





Get the latest tutorials, blog posts and news: