tero.co.uk

DES Encoding

This page discusses the various encoding options, and is the result of some careful investigations by Alberto Biamino. The discussion below makes a number of recommendations for anybody confronted with character encoding problems. Many thanks to Alberto for the information provided. Most of it was written and programmed by him.

Input filter
The des function accepts a Javascript string as input, and uses the charCodeAt function to convert the characters in the string into integers. Problems can therefore arise if non ASCII characters are passed in. Here, non ASCII means characters with Unicode numbers greater than 255 (such as non Latin letters).

Further problems can arise with extended ASCII characters between 128 and 255, as their values may differ between browsers using different encodings. For example, the € symbol might come up as 128 or as 8364 (see the charCodeAt discussion below). If the string is encrypted on one browser and decrypted on another, the results might appear different. However, as this is only a display problem, it may not need to be dealt with in the input filter.

It is therefore recommended that the programmer filter the input string first and either convert unfriendly characters into HTML entities (such as &#8250;, discussed in the next section) or simply replace the characters with a string of their choice. This string may be a character like '?' (suggested default to match browsers' behavior) or '_', a string with more than one character, or an empty string (i.e. the character is ignored).

The purpose of the filter is to keep unsafe characters away from charCodeAt() and fromCharCode(). The question is whether the characters to leave untouched are those in the code range 0-127, including control codes, or only strict ASCII 32-126 (127 is the control code DEL). In other words, which characters should be considered safe?

For this reason (and since converting to entities gives a pure 32-126 ASCII string), I would conclude that the filter should pass only ASCII characters 32-126, plus \n (LF = 10), \r (CR = 13) and \t (HT = 9). Additionally, when copied from an HTML element into the input field, the non-breaking space (Unicode 160) becomes a normal space (Unicode-ASCII 32) and is then converted to an entity. This seems another argument for letting the filter pass only 32-126.

After some discussion, it was decided that an entity filter should not be included within the des function, but the character filter should be. This is so that the des function does not have an HTML bias, but can still handle messy situations. Managing the input filter therefore needs two parameters: a boolean specifying whether the filter is active, and the character, or group of characters, with which to replace invalid characters. Rather than function arguments, I'd introduce these parameters as pre-set "fine tune" configuration variables in the des() function, to avoid an excessive number of arguments, e.g.

var filter_active = 1;
var filter_replace = '?';

If the programmer converts the input to entities, or assumes that only ASCII characters will be dealt with (e.g. by using a function that does not let the user enter anything but allowed characters), they can turn the filter off.
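
The character filter described above might be sketched as follows. This is only an illustration under stated assumptions, not the code used inside the des function: the helper name filterInput and its arguments are hypothetical, and it passes codes 32-126 plus HT, LF and CR, replacing everything else with the configured string.

```javascript
// Hypothetical helper (not part of the des script): keeps characters 32-126
// plus HT (9), LF (10) and CR (13), and replaces everything else.
function filterInput(str, replacement) {
    var acc = '';
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        var safe = (code >= 32 && code <= 126) ||
                   code == 9 || code == 10 || code == 13;
        acc += safe ? str.charAt(i) : replacement;
    }
    return acc;
}

// The two configuration variables suggested in the text.
var filter_active = 1;
var filter_replace = '?';

var message = 'Price: \u20AC10\tok';   // contains the euro sign (8364) and a tab
if (filter_active) message = filterInput(message, filter_replace);
// message is now 'Price: ?10\tok'
```

Passing an empty string as the replacement simply drops the rejected characters.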

Converting to entities
Of course, preconverting to entities allows any character to be entered (provided it is supported by JavaScript or adequately handled by the C2E() and E2C() functions below). If non ASCII characters are preconverted to entities with the bundled functions before reaching the des function, the filter simply doesn't affect the string.

Unprintable control codes below ASCII 32 display in a form field as blocks that look the same, but they are detected and handled individually. The C2E() function below converts into entities all characters outside the [32, 126] interval. This converts both control codes and extended ASCII characters. E2C() converts entities back into characters.

function C2E (str) {
    // Escape '&' first so that the entities added below are not escaped again.
    str = str.replace(/&/g, '&#38;');
    str = str.replace(/'/g, '&#39;');
    str = str.replace(/"/g, '&#34;');
    str = str.replace(/\\/g, '&#92;');
    var acc = '';
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i);
        // Keep 32-126 as is; convert everything else to a numeric entity.
        if (code > 31 && code < 127) acc += str.charAt(i);
        else acc += '&#' + code + ';';
    }
    return acc;
}

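function E2C (str) {
    // Put each numeric entity on its own line, then split on line breaks.
    // (C2E() output contains no raw line breaks: they are already entities.)
    str = str.replace(/(&#[0-9]+;)/g, '\n$1\n');
    str = str.replace(/\n\n/g, '\n');
    var spl = str.split('\n');
    for (var i = 0; i < spl.length; i++) {
        if (spl[i].charAt(0) == '&') {
            // Extract the number and turn it back into a character.
            spl[i] = spl[i].replace(/&#([0-9]+);/g, '$1');
            spl[i] = String.fromCharCode(parseInt(spl[i], 10));
        }
    }
    str = spl.join('');
    return str;
}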

Testing charCodeAt and fromCharCode
For unsupported codes, fromCharCode outputs a "block" character (sometimes a dot) or a question mark. Which codes result in blocks, dots or question marks is browser dependent. For its output, typeof is generally "string" and isNaN is "true", except for HT, CR, LF, the blank space and some others depending on the browser. fromCharCode is used in the DES script and in the entity-to-character E2C() conversion function. If charCodeAt has previously done a good job in filtering or in the optional preconversion to entities, no unsupported codes should reach fromCharCode.
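
The typeof and isNaN observations above can be reproduced with a few lines like these. The helper name describeCode is hypothetical and serves only this illustration. Whitespace characters convert to the number 0, which is why isNaN is false for them, and the digits 48-57 are naturally numeric too.

```javascript
// Hypothetical helper: report what fromCharCode returns for a given code.
function describeCode(code) {
    var ch = String.fromCharCode(code);
    return { type: typeof ch, notANumber: isNaN(ch) };
}

var tab = describeCode(9);       // HT: whitespace, so isNaN is false
var space = describeCode(32);    // blank space: isNaN is false
var letterA = describeCode(65);  // 'A': isNaN is true
```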

Thus, if there is a problem, it may concern charCodeAt.

NB. Note that Windows makes codes from 128 to 159 correspond to higher Unicode characters so that &#n; entities give the same character as &#u; where n and u are respectively the extended ASCII and Unicode numbers for a character. For instance, &#8364; is the entity for the euro sign, but in Windows 1252 &#128; also gives the euro sign, although &#128; is not a defined entity. However, one should not rely on behaviour that does not adhere to a recognized standard.

Detecting characters unsupported by charCodeAt could be important in the input filter and in the optional character-to-entity preconversion. These must properly handle everything that is not supported by the browser and JavaScript version in use. The string passed to the input filter or to the preconversion may be entered in a form field or expressed in the source code. For the latter, I don't know what may happen, but only the former is significant for the final user of a site. When entered in a field, unsupported characters display as question marks, and are detected as such by charCodeAt (possibly dots? Tested with IE 5.0 and NN 6.2 and 4.7).

NB. Undisplayable ASCII control codes display as blocks but are properly recognized and given their codes by charCodeAt.

NB. Extended ASCII characters mapped by Windows to codes 128-159 don't give problems: they are converted into entities with their proper (higher) Unicode codes, then reverted to the same characters.

If converted to question marks (ASCII 63), unsupported characters cause no further problems. Whether unsupported characters are always converted to question marks, or sometimes to other neutral characters that might themselves be unsupported, remains an open question.

However, when entering unsupported characters, the user immediately sees that they are not accepted and can adjust accordingly.

In conclusion, conversion to question marks by the browser eliminates possible problems related to unsupported characters - either for the input filter or for preconversion to entities.

DES output/input encoding options
It would be nice to add an extra parameter to des to specify the encoding of the output (and of the input when decrypting), so that the user could request from the function hex, base64, plain text, or plain text with entities. If another parameter could specify the nature of the incoming text, it could be combined with the filtering parameter.

Alberto has provided the following useful chart to summarise how these output/input encoding options would work:

How Encryption-decryption affects content
=========================================

e0) INPUT as entered (any Unicode character)
      |
  (browser)
      |
e1) input as displayed (any supported Unicode character)
            |
      ----------------------------------
      |                                |
    (C2E)                           (as is)
      |                                |
    text with entities (ASCII)         |
      |                                |
   (filter does not affect)        (filter)
      |                                |
    whole content (**)   'censored' content
      |                                |
      ----------------------------------
            |
e2) filtered input (ASCII)
      |
  (ENCRYPT)
      |
e3) encrypted string (0-255 8 bit)
         |
  (output encoding)
         |
   ------------------------------------
   |           |              |       |
 plain   plain + entities    hex    base64
   |           |              |       |
 8 bit      (ASCII)        0-9 A-F   0-9
 0-255        (*)          (ASCII)   A-Z
  (?)          |              |      a-z
   |           |              |      + /
   |           |              |    (ASCII)
   |           |              |       |
   ------------------------------------
       |
e4) OUTPUT
      ||
d1) input to be decrypted
      |
  (DECRYPT)
      |
d2) the same as in e2) (ASCII)
                |
      ----------------------------------
      |                                |
    whole content        'censored' content
    as in e2)             as in e2)    |
      |                                |
    original text                      |
    with entities (ASCII)           (ASCII)
           |                           |
     --------------                    |
     |            |                    |
  (use as is)   (E2C)             (use as is)
     |            |                    |

(*) I don't know how to handle ASCII 0 (NUL): &#0; is not recognized as an entity. For instance, &#2; displays as a block in the HTML document or in a TEXTAREA, this block is recognized by charCodeAt() as Unicode 2 = ASCII 2, and fromCharCode(2) returns this block; &#0;, by contrast, remains as is. In IE 5.0 fromCharCode(0) results in nothing, and the following characters are eaten. In NN 6.2 they are not eaten, so we may presume a browser dependent behaviour.

C2E() might be modified to convert NUL, when it encounters that character, into a pseudo-entity '&#0;'. Then, working with regular expressions, E2C() can replace that expression with fromCharCode(0); but this seems to produce nothing (additionally eating the following characters in IE 5.0).

(**) If characters had been converted to entities before the string entered the encryption function, the filter should not have made any change, and the resulting decrypted string is identical to the original.

If e4) encryption output were plain, it would contain all 8-bit characters, most of them being intercepted, during decryption, by the filter (the filter must act even when the des() function is used to decrypt). So, the string passed to decryption would be altered, and the decrypted string would not match the original.

If the filter passed all 8-bit characters, or if it were disabled for decryption, it would pass characters that charCodeAt() does not handle.

If e4) encryption output were plain with entities for unsafe characters, i.e. applying conversion to entities to the output as well as to the input, a method would be needed to handle the character NUL = ASCII 0. Running the script in 1. you can see that Internet Explorer (at least, my IE 5.0) does strange things with it. Also, this option looks redundant and without practical application (unless one wants to display encrypted output in a Web page: but there's still the problem with NUL). I'd rather avoid it altogether.

If e4) output is hexadecimal, it contains only characters from 0 to 9 and from a (or A) to f (or F), which are safe ASCII characters. The same holds if it is base64.
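
As an illustration of the hexadecimal option, the following sketch converts an 8-bit string (codes 0-255) to hex digits and back. The helper names toHex and fromHex are hypothetical and are not taken from the des script. Note that NUL survives the round trip because it is never displayed as a character.

```javascript
// Hypothetical helpers: encode an 8-bit string as hex digits and decode it again.
function toHex(str) {
    var digits = '0123456789abcdef';
    var acc = '';
    for (var i = 0; i < str.length; i++) {
        var code = str.charCodeAt(i) & 255;   // keep only the low 8 bits
        acc += digits.charAt(code >> 4) + digits.charAt(code & 15);
    }
    return acc;
}

function fromHex(hex) {
    var acc = '';
    // Read two hex digits at a time; parseInt accepts upper or lower case.
    for (var i = 0; i + 1 < hex.length; i += 2) {
        acc += String.fromCharCode(parseInt(hex.substring(i, i + 2), 16));
    }
    return acc;
}

var encoded = toHex('\u0000A\u00FF');   // '0041ff'
var decoded = fromHex(encoded);         // three characters: codes 0, 65, 255
```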

In conclusion, perhaps des should allow only hex and base64 as output formats.

This looks somewhat inelegant, as the des() function outputs hex/base64 when encrypting and ASCII characters when decrypting. One would prefer something like PHP, and I presume Perl, which deal symmetrically with 8-bit strings. But they can do this because they rely on functions that convert characters to their extended ASCII values, and vice versa, which allows the whole 0-255 range to be represented with characters.

After all, DES calculations intrinsically involve numbers - in fact, charCodeAt() is used to translate the input string into numbers - so avoiding the plain (and text with entities) output formats doesn't seem to me a limitation.

Backward compatibility
The conversion functions above should probably be written so as not to use String.replace, as this function is only available from Javascript 1.2, while the rest of DES works in Javascript 1.0 (I believe) and so is somewhat backward compatible. However, all modern browsers do support this function.

In practice String.replace (working with regular expressions) seems the faster choice for recognizing the &*; entity pattern. As an alternative, one has to parse the string character by character: on encountering '&#', find the next ';' and take what lies between as the Unicode number (perhaps checking that it contains only digits). It's feasible.
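
The character-by-character alternative might look like the sketch below. It is a hypothetical variant, named E2CNoRegex here purely for illustration, that avoids String.replace and regular expressions. It assumes that anything which is not '&#' followed by digits and ';' should be copied through unchanged.

```javascript
// Hypothetical regex-free variant of E2C(): scans for '&#', digits, ';'.
function E2CNoRegex(str) {
    var acc = '';
    var i = 0;
    while (i < str.length) {
        var end = -1;
        if (str.charAt(i) == '&' && str.charAt(i + 1) == '#') {
            end = str.indexOf(';', i + 2);
        }
        // Candidate number: whatever lies between '&#' and the next ';'.
        var digits = end > i + 2 ? str.substring(i + 2, end) : '';
        var numeric = digits.length > 0;
        for (var j = 0; j < digits.length; j++) {
            var c = digits.charAt(j);
            if (c < '0' || c > '9') numeric = false;
        }
        if (numeric) {
            acc += String.fromCharCode(parseInt(digits, 10));
            i = end + 1;            // skip past the whole entity
        } else {
            acc += str.charAt(i);   // not an entity: copy the character as is
            i++;
        }
    }
    return acc;
}
```

Because the string is scanned once from left to right, an escaped ampersand such as '&#38;#65;' decodes to the text '&#65;' rather than being decoded a second time.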

Precautional object detection
Maybe object detection could be added - although the task might not be quick - to make the browser see if all functions and methods required by the script (like charCodeAt and fromCharCode) work as expected for the JavaScript version in use, degrading in some way if not.
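
A minimal sketch of such a check is given below. The function name charFunctionsWork and the variable des_available are hypothetical, and the handful of codes tested is an arbitrary assumption; a real check might need to cover more cases.

```javascript
// Hypothetical check: returns true only if both methods exist and round-trip.
function charFunctionsWork() {
    if (typeof String.fromCharCode != 'function') return false;
    if (typeof ''.charCodeAt != 'function') return false;
    // Round-trip a few safe codes and one 8-bit code through both functions.
    var codes = [32, 65, 126, 255];
    for (var i = 0; i < codes.length; i++) {
        if (String.fromCharCode(codes[i]).charCodeAt(0) != codes[i]) return false;
    }
    return true;
}

// A script could consult this flag and degrade in some way when it is false.
var des_available = charFunctionsWork();
```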

About the key
Since calculations are performed in an 8-bit space, all 256 values should be used for the key (for security). But charCodeAt(), as seen, maps some characters out of this space. So I'd let the user enter a key formed by any characters allowed by the encoding of their system, then apply a function that converts the key string entered by the user into an 8-bit (numeric) string, which is then used as the actual key to encrypt/decrypt. For this purpose, modding the characters' Unicode codes by 256 looks good.

In this case, values above 255 could be modded by 256 to bring them into 8-bit space. Modding in this case does not constitute a limitation, as the key has no particular meaning to be preserved.

ASCII 0 = NUL shouldn't give problems, as it apparently can't be entered in a field. It may, however, appear in the modded string (e.g. 256 => 0), so this must be used immediately in a numeric form such as hexadecimal or binary, depending on how the script calculates. In other words, the modding function converts the entered key into a string formed by (hexadecimal or binary) digits, immediately used as a number.
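
The modding step could be sketched like this. The helper name keyToHex is hypothetical, and returning hexadecimal digits is just one of the numeric forms mentioned above, chosen here so that a resulting 0 never has to exist as a character.

```javascript
// Hypothetical helper: reduce each character code modulo 256 and
// return the key as hex digits, two per character.
function keyToHex(key) {
    var digits = '0123456789abcdef';
    var acc = '';
    for (var i = 0; i < key.length; i++) {
        var code = key.charCodeAt(i) % 256;
        acc += digits.charAt(Math.floor(code / 16)) + digits.charAt(code % 16);
    }
    return acc;
}

// 'A' = 65, euro sign = 8364 (8364 % 256 = 172), code 256 (256 % 256 = 0)
var keyHex = keyToHex('A\u20AC\u0100');
// keyHex is '41ac00'
```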

However, the best option, if allowing the user to enter a key into a form field, is probably to ask the user to enter hex or base64 directly, thus encouraging them to use a wide range of characters. Even this would probably be woefully unrandom, so perhaps a function should be written to detect random mouse events and create a key for the user.

Conclusion
In conclusion, it is recommended to extend the des function to provide something like:

des (string key, string message, boolean encrypt, [integer mode], [string iv], [string encoding])

Here encoding is hex or base64, with the filter-related variables set in the script, along with backward compatible entity conversion functions. (This has yet to be implemented.)