It is a common requirement to allow users to input short unique codes or identifiers into E-Commerce websites or applications. The code might be part of a URL, to identify a unique website, in the case of URL shorteners like http://goo.gl/ and http://tiny.cc/, a unique coupon code that takes the user to a special offer, or a security code to identify that the user is real. There are a myriad of other ways in which human friendly alphanumeric codes might be useful in web-based applications.

With this in mind, I decided to investigate creating user-friendly codes that can be used as unique identifiers. I set myself these goals:

- All codes should comprise a combination of letters and numbers.
- Codes should use as few characters as possible.
- All letters should be case insensitive.
- Any letters or numbers which are similar in appearance should be avoided (like “I” and “1”, “O” and “0”, etc).
- For security reasons, the codes should appear to be unpredictable.
- It should be possible to increase the number of codes available indefinitely.
- All codes should be unique – there cannot be clashes.
- The algorithm should be easy to implement in any language, including stored procedures

On the face of it, it may seem like an easy challenge. But judging by the number of articles on the subject, it is not as easy as you may at first think. The main challenges are ensuring uniqueness and unpredictability. Although the values may appear to be unpredictable, this is not an attempt to provide a secure encryption algorithm or random number generator, which is best achieved using the cryptographic classes in .Net.

The Uniqueness Problem

The easiest way to solve the uniqueness problem is to derive the code from an integer value, like an auto-incrementing database key. In my case, I chose a 32-bit signed integer as the key. This would allow up to 21 billion unique codes (or three codes for everyone on the planet!). But I don’t want users to be entering integer codes, because these can be long and fiddly, and so I needed a way to reliably convert the integer codes into letters and numbers. The solution I chose was to encode the integer using “base 32”. Base 32 values are alphanumeric strings comprising only the letters A-Z and 2-9. The Base 32 encoder routine is described in more detail later. All integers map directly onto unique base 32 values, but the results are sequential and predictable.

Predictability

In base 32, the value “AAAA” (zero) is followed by the value “AAAB” (one). To overcome the predictability of a base32 series, there needs be way of juggling the characters so that the encoded values for two consecutive numbers are sufficiently different from each other. One solution is to use a table that maps all allowable input characters to a corresponding set of characters, eg:

Input |
2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | … |

Mapped |
N | P | U | V | 5 | K | L | Q | 9 | A | H | 7 | … |

But this approach isn’t very helpful as the sequence would now look like this:

Input |
Result |

AAAA | 9999 |

AAAB | 999A |

AAAC | 999H |

AAAD | 9997 |

… |

A better solution is to change the mapped values by shifting all characters one place to the left (or right), with every input character.

Input |
2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | … |

Map1 |
N | P | U | V | 5 | K | L | Q | 9 | A | H | 7 | … |

Map2 |
P | U | V | 5 | K | L | Q | 9 | A | H | 7 | B | … |

Map3 |
U | V | 5 | K | L | Q | 9 | A | H | 7 | B | W | … |

… |

Conceptually, it’s like having a “code wheel” that turns one position with every input character. The new sequence would now look like this:

Input |
Result |

AAAA | 9AH7 |

AAAB | 9AHB |

AAAC | 9AHW |

AAAD | 9AHX |

AAAE | 9AH6 |

… |

The sequence “AAAA” is now replaced by varying characters “9AH7”. This approach will always result in distinct and unique results. Each result can only ever arise from one specific input value (and if it’s not clear why – think about it some more!). This is important to note, because it is easy to imagine more complex encoding scheme that suffer from clashes. The approach is an improvement, but any values beginning with “AAA” result in values prefixed by “9AH”, which isn’t ideal.

Varying the Prefixes in a Series

We need to avoid the same prefixes, given any input series with the same prefix. One possibility is to reverse the output strings (whereby “9AH7” becomes “7HA9”). But this just moves the problem to the suffix. A better solution is to rotate the initial code wheel based on the value of the rightmost input character, and use that as the new wheel for the remaining characters. So if the last character is “B”, rotate it one place, and if the last character is “D”, rotate it 4 places, etc.

If the algorithm employs too much wheel rotation, we may introduce clashes, so it needs to be done with care. We also need a scheme that can be easily decoded. Subsequently, the last character in the input will never be encoded, just directly mapped. This will allow us to determine the wheel’s starting position for decoding. It will also ensure that two values with different code wheels that would otherwise cause a clash are distinct (because they would have different rightmost characters).

And in case you are wondering – we cannot just randomize the wheel, because again, we wouldn’t ever be able to decode values. The new sequence we get by rotating the “initial wheel” is as follows:

Input |
Result |

AAAA | G8J9 |

AAAB | 8JYA |

AAAC | JYZH |

AAAD | YZ27 |

AAAE | Z2RB |

And so there we have it. A scheme that looks random, with no clashes. In the final algorithm, I pad the input value so the code is at least 5 characters, and use a different code wheel for the rightmost character, as you will see from the code itself.

Two “Conceal” Methods

The encoding method is called “Conceal” – to distinguish it from other types of encoding methods. There is a generic version and an integer version. The generic version:

- Works for any characters and isn’t confined to Base 32.
- Takes two parameters as follows:
- Value – a string to be encoded/concealed
- Wheel – a list of allowable characters as a string (e.g. a list of base 32 characters).

- Encodes all the supplied characters simply by rotating the supplied code wheel

Here is the code:

public static string Conceal(string value, string wheel) { var alphabet = sortedAlphabet(wheel); var distinctChars = wheel.Distinct().ToArray(); if (distinctChars.Length != wheel.Length) throw (new ArgumentException("Error: Wheel contains duplicate characters.")); string result = ""; for (int i = 0; i < value.Length; i++) { int letterPosition = Array.IndexOf(alphabet, value[i]); if (letterPosition == -1) throw (new ArgumentException("Error: supplied character '" + value[i] + "' does not appear in code wheel.", value)); char encodedLetter = wheel[(letterPosition + i) % wheel.Length]; result += encodedLetter; } return result; }

So, to conceal a hex value (base 16) you could pass it a wheel of “0123456789ABCDEF” (although the order of characters in the wheel should be randomly arranged).

Concealing Integer Values

The second version of the “Conceal” method only works for 32-bit integers and is designed for integer keys. It:

- Uses the default wheels (these are constants that can be altered as necessary).
- Conceals the rightmost character using a different wheel to the remaining characters.
- Results in less predictable sequences.
- Uses the generic “Conceal” method as part of its implementation.

The implementation is as follows:

public static string Conceal(int value) { string base32value = IntToBase32(value); base32value = base32value.PadLeft(5, 'A'); var alphabet = sortedAlphabet(MainWheel); int offset = Array.IndexOf(alphabet, base32value.Last()); string wheel = MainWheel.Substring(offset) + MainWheel.Substring(0, offset); return Conceal( base32value.Substring(0, base32value.Length - 1), wheel) + InitialWheel[offset]; }

Base 32 Encoding

If the resulting codes contained any letter A-Z and any digit 0-9 (eg. “ZD7FF3”) it would make a total of 36 distinct characters. But because zero and the letter “O” are easily confused, I have excluded these, as well as the “1” digit and the letter “I”. This leaves just 32 distinct characters. To convert an int32 value into a sequence using these desired characters, we simply convert the integer to base 32.

In base 32, the decimal number “2,147,483,647” would be represented as “B999999” (which is more compact). Here are some examples of decimals and their base 32 equivalents:

Decimal |
Base 32 |

32767 | 999 |

32768 | BYYY |

1048575 | 9999 |

1048576 | BYYYY |

33554431 | 99999 |

33554432 | BYYYYY |

In other words, the first million or so codes can all be represented using just 4-characters, which is simple for users. Another advantage of base 32 encoding is that the algorithm is fast and easy to implement – just repeatedly divide by the number by 32. Here is the code:

public static string IntToBase32(int input) { return ConvertToBase(input, Base32Alphabet); } public static int Base32ToInt(string input) { return ConvertFromBase(input, Base32Alphabet); } public static string ConvertToBase(int input, string baseAlphabet) { string result = ""; if (input == 0) { result += baseAlphabet[0]; } else { while (input > 0) { //Must make sure result is in the correct order result = baseAlphabet[input % baseAlphabet.Length] + result; input /= baseAlphabet.Length; } } return result; } public static int ConvertFromBase(string input, string baseAlphabet) { var inputString = input.ToUpper(); int result = 0; for (int i = 0; i < inputString.Length; i++) { result *= baseAlphabet.Length; var character = inputString[i]; result += baseAlphabet.IndexOf(character); } return result; }

Conclusion

The use of Base 32 ensures that the output is a compact combination of letters and numbers that avoids any data-entry ambiguity. Using the encoding method, the codes appear to be unpredictable, and the number of codes available can be easily increased indefinitely.

Although the “Conceal” algorithm is not suitable for encrypting sensitive data, it may also be possible to add additional layers of “concealment” by repeatedly encoding values using different code wheels or by other approaches.

Although I have no reason to doubt that the algorithm will always produce unique values, I have only empirically tested the first 16.7 million integers (2^24). Beyond that, my laptop runs out of memory during the test.

The source code is useful for generating pseudo-random data in a variety of contexts. I would be delighted to learn of any uses.

Here is a link to the full source code.

Very nice. Thanks for sharing!

jennifer

Interesting. Are you sure that <= doesn’t translate to a single machine instruction? In any case, I wouldn’t be surprised if there are optimisations that could be made, as it was mainly intended to illustrate the algorithm rather than be production-ready code.

Cheers,

Mike.

Hi, i have reading out and i will definitely bookmarrk your site, just wanted to say i liked this article.