~ This is part of the EuCrypt series. Start with Introducing EuCrypt. ~

The potential bit disorder trouble with Keccak highlighted at the end of the previous chapter calls for a decision, since a hash function won't be of much help if bits come out in different orders from different implementations. So now and here is your time to have your say on this matter: I'll first describe the changes made for this chapter, since they are directly linked to the issue anyway, and then I'll lay out the current options as I see them. You are warmly invited to tell me what's wrong with them, come up with better ones, compare them or shred them to bits or... shut up and then live with them.

Strictly speaking, this chapter simply brings in a few modifications and additions to the previous "word-level" version of the Keccak sponge, introduced in Chapter 7. The goal is to make this relatively fast^{1} Keccak (compared to the bit-level version from Chapter 8) run with the same results on Big Endian as well as Little Endian machines. As I don't actually have a Big Endian machine at the moment and the virtualisation option with Qemu proved a HUGE headache, I'll be grateful for any feedback from someone who can run the code and tests provided (or any other tests they choose) on a Big Endian platform.

Before digging into the code itself, it's quite useful to briefly review the endianness issue. In itself, big/little endian refers to the order in which octets (bytes) are stored in memory: either most significant octet first (big endian) or least significant octet first (little endian). So in principle, as long as one doesn't address different octets directly, there is no problem. However, the compounded trouble with Keccak is that the darned specification of the Keccak state and its transformations in fact has unstated assumptions about both **octet ordering** and **bit ordering**! So on top of octet (byte) order, one needs to make sure that bit order assumptions also hold at all times, and to say that this gives me a headache is an understatement. Essentially, Keccak's specification relies (implicitly, not explicitly...) on Big Endian octet (byte) ordering when describing the transformations, but it uses Little Endian octet ordering and LSB (least significant *bit*) ordering for the Keccak state, so that 0x80 for instance is represented as 0000 0001. Oh, the joys of consistent, crystal-clear and totally-not-confusing specifications!
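To see concretely what LSB bit ordering does to a familiar byte, here is a small illustrative sketch (in Python, not the Ada implementation; the helper name `lsb_bits` is mine):

```python
def lsb_bits(octet):
    """Return the 8 bits of an octet, least significant bit first,
    i.e. the bit ordering that the Keccak state assumes."""
    return [(octet >> i) & 1 for i in range(8)]

# 0x80 (1000 0000 in the usual MSB-first writing) reads as 0000 0001
# when written LSB-first, exactly as described above.
assert lsb_bits(0x80) == [0, 0, 0, 0, 0, 0, 0, 1]
assert lsb_bits(0x01) == [1, 0, 0, 0, 0, 0, 0, 0]
```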

Just about the only ~~bit~~ ray of light in this whole mess is that most of Keccak's transformations themselves are actually immune to the specific byte/bit order in use (as long as it's consistently used, of course). Nevertheless, there is potentially big trouble whenever the Z dimension of the Keccak state is directly involved and especially for the likes of the Iota transformation that uses 64-bit pre-defined values (the round constants). To address those issues, the current approach is as follows:

- For the octet (byte) order: the input is considered to be at all times given in *Little Endian* format. Consequently, the Keccak implementation will check the endianness when it is running and it will *flip the octets* when running on Big Endian iron.
- For the bit order: in the bit-level implementation, the Z dimension of the Keccak state is traversed from high index to low index. This simply means that the convention for the Z dimension is that "next position" decreases the index rather than increasing it.
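The octet-flip decision above rests on the fact that the two byte orders are mirror images of one another, so one flip converts between them. A quick Python sketch (illustrative only, separate from the Ada code) of that fact:

```python
# The same 64-bit lane is laid out with opposite octet order on the two
# kinds of iron; flipping the octets converts one layout into the other.
value = 0x0123456789ABCDEF
little = value.to_bytes(8, "little")  # layout on Little Endian iron
big    = value.to_bytes(8, "big")     # layout on Big Endian iron

assert little == bytes(reversed(big))
assert little[0] == 0xEF  # least significant octet comes first
```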

Reflecting the above, the .vpatch for this chapter adds a method for flipping octets of the input stream and checks for local bit order whenever changing from bits to words or the other way around. Moreover, the conversion methods from bits to words and the other way around are updated to match the LSB bit-order that Keccak supposes. The relevant parts are in eucrypt/smg_keccak/smg_keccak.adb:

```ada
-- convert from a bitstream of ZWord size to an actual ZWord number
function BitsToWord( BWord: in Bitword ) return ZWord is
  W    : ZWord;
  Bits : Bitword;
begin
  -- just copy octets if machine is little endian
  -- flip octets if machine is big endian
  if Default_Bit_Order = Low_Order_First then
    Bits := BWord;
  else
    Bits := FlipOctets( BWord );
  end if;

  -- actual bits to word conversion
  W := 0;
  -- LSB bit order (inside octet) as per Keccak spec
  for I in reverse Bitword'Range loop
    W := Shift_Left( W, 1 ) + ZWord( Bits( I ) );
  end loop;
  return W;
end BitsToWord;

-- convert from a ZWord (lane of state) to a bitstream of ZWord size
function WordToBits( Word: in ZWord ) return Bitword is
  Bits : Bitword := (others => 0);
  W    : ZWord;
begin
  W := Word;
  for I in Bitword'Range loop
    Bits( I ) := Bit( W mod 2 );
    W := Shift_Right( W, 1 );
  end loop;

  -- flip octets if machine is big endian
  if Default_Bit_Order = High_Order_First then
    Bits := FlipOctets( Bits );
  end if;
  return Bits;
end WordToBits;

-- flip given octets (i.e. groups of 8 bits)
function FlipOctets( BWord : in Bitword ) return Bitword is
  Bits : Bitword;
begin
  -- copy octets (i.e. groups of 8 bits) changing their order in the array
  -- i.e. 1st octet in BWord becomes last octet in Bits and so on
  for I in 0 .. ( Bitword'Length / 8 - 1 ) loop
    Bits( Bits'First + I * 8 .. Bits'First + I * 8 + 7 ) :=
      BWord( BWord'Last - I * 8 - 7 .. BWord'Last - I * 8 );
  end loop;
  return Bits;
end FlipOctets;
```
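To make the little-endian path of the two conversions easy to check by hand, here is a Python mirror of the same logic (a sketch only; the names `bits_to_word` and `word_to_bits` are mine, and the big-endian octet flip is omitted):

```python
def bits_to_word(bits):
    """LSB-first list of 64 bits -> integer, mirroring BitsToWord's
    reverse loop with Shift_Left."""
    w = 0
    for b in reversed(bits):
        w = (w << 1) | b
    return w

def word_to_bits(word):
    """Integer -> LSB-first list of 64 bits, mirroring WordToBits's
    mod-2 / Shift_Right loop."""
    bits = []
    for _ in range(64):
        bits.append(word & 1)
        word >>= 1
    return bits

# the two conversions are inverses of each other
assert bits_to_word(word_to_bits(0xDEADBEEF)) == 0xDEADBEEF
# and the first octet of 0x80 comes out LSB-first, as per the Keccak spec
assert word_to_bits(0x80)[:8] == [0, 0, 0, 0, 0, 0, 0, 1]
```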

As you might notice in the above, there are a few other small changes to the conversion functions, mainly replacing direct divisions/multiplications by powers of 2 with corresponding shifts - since this implementation uses anyway rotation of bits, there really is no reason **not** to take advantage of bit shifting too, where possible.
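The substitution relied upon is a plain arithmetic identity, quickly checked (Python, illustrative only):

```python
# multiplying or dividing by a power of two is the same as shifting
x = 1344  # e.g. the Keccak rate used in the tests below
assert x * 8 == x << 3    # multiplication by 2**3
assert x // 8 == x >> 3   # integer division by 2**3
```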

In addition to the code itself, there are of course new tests as well, mainly for the bit flipping:

```ada
procedure test_flip is
  B : constant Bitword := (1, 0, 1, 1, 1, 1, 0, 0,   1, 1, 1, 0, 0, 0, 0, 1,
                           0, 1, 1, 0, 0, 0, 1, 0,   1, 1, 1, 1, 1, 1, 1, 1,
                           1, 1, 0, 1, 1, 0, 0, 1,   0, 0, 0, 0, 0, 0, 0, 0,
                           0, 0, 1, 1, 0, 0, 0, 1,   0, 0, 0, 1, 1, 0, 0, 0);
  Expected : Bitword := (0, 0, 0, 1, 1, 0, 0, 0,   0, 0, 1, 1, 0, 0, 0, 1,
                         0, 0, 0, 0, 0, 0, 0, 0,   1, 1, 0, 1, 1, 0, 0, 1,
                         1, 1, 1, 1, 1, 1, 1, 1,   0, 1, 1, 0, 0, 0, 1, 0,
                         1, 1, 1, 0, 0, 0, 0, 1,   1, 0, 1, 1, 1, 1, 0, 0);
  Output : Bitword;
begin
  Output := FlipOctets( B );
  if Output /= Expected then
    Put_Line( "FAILED: flip octets" );
    Put_Line( "Expected: " );
    for I of Expected loop
      Put( Bit'Image( I ) );
    end loop;
    new_line(1);
    Put_Line( "Output: " );
    for I of Output loop
      Put( Bit'Image( I ) );
    end loop;
    new_line(1);
  else
    Put_Line( "PASSED: flip octets" );
  end if;
end test_flip;
```

In addition to the new test for flipping, the older tests for the sponge are updated to be directly comparable with those from the bit-level version of Keccak so that one can easily check that the output of the two implementations is indeed the same. So the current test_sponge function is the following:

```ada
procedure test_sponge is
  Bitrate : constant Keccak_Rate := 1344;
  Input   : Bitstream(1..5) := (1, 1, 0, 0, 1);
  Hex     : array(0..15) of Character := ("0123456789ABCDEF");
  C       : Natural;
  HexPos  : Natural;
  Error   : Natural;
  Pos     : Natural;
  ExpHex  : constant String :=
    "CB7FFB7CE7572A06C537858A0090FC2888C3C6BA9A3ADAB4"&
    "FE7C9AB4EFE7A1E619B834C843A5A79E23F3F7E314AA597D"&
    "9DAD376E8413A005984D00CF954F62F59EF30B050C99EA64"&
    "E958335DAE684195D439B6E6DFD0E402518B5E7A227C48CF"&
    "239CEA1C391241D7605733A9F4B8F3FFBE74EE45A40730ED"&
    "1E2FDEFCCA941F518708CBB5B6D5A69C30263267B97D7B29"&
    "AC87043880AE43033B1017EFB75C33248E2962892CE69DA8"&
    "BAF1DF4C0902B16C64A1ADD42FF458C94C4D3B0B32711BBA"&
    "22104989982543D1EF1661AFAF2573687D588C81113ED7FA"&
    "F7DDF912021FC03D0E98ACC0200A9F7A0E9629DBA33BA0A3"&
    "C03CCA5A7D3560A6DB589422AC64882EF14A62AD9807B353"&
    "8DEE1548194DBD456F92B568CE76827F41E0FB3C7F25F3A4"&
    "C707AD825B289730FEBDFD22A3E742C6FB7125DE0E38B130"&
    "F3059450CA6185156A7EEE2AB7C8E4709956DC6D5E9F99D5"&
    "0A19473EA7D737AC934815D68C0710235483DB8551FD8756"&
    "45692B4E5E16BB9B1142AE300F5F69F43F0091D534F372E1"&
    "FFC2E522E71003E4D27EF6ACCD36B2756FB5FF02DBF0C96B"&
    "CAE68E7D6427810582F87051590F6FB65D7B948A9C9D6C93"&
    "AF4562367A0AD79109D6F3087C775FE6D60D66B74F8D29FB"&
    "4BA80D0168693A748812EA0CD3CA23854CC84D4E716F4C1A"&
    "A3B340B1DED2F304DFDBACC1D792C8AC9A1426913E3F67DB"&
    "790FD5CFB77DAA29";
  Output    : Bitstream( 1 .. ExpHex'Length * 4 );
  HexString : String( 1 .. ExpHex'Length );
begin
  Put_Line("---sponge test---");
  Sponge(Input, Bitrate, Output);
  Put_Line("Input is:");
  for I of Input loop
    Put(Bit'Image(I));
  end loop;
  new_line(1);
  Put_Line("Output is:");
  for I of Output loop
    Put(Bit'Image(I));
  end loop;
  new_line(1);
  Error := 0;
  for I in 1..Output'Length/4 loop
    Pos := Output'First + (I-1)*4;
    C := Natural( Output( Pos ) ) +
         Natural( Output( Pos + 1 ) ) * 2 +
         Natural( Output( Pos + 2 ) ) * 4 +
         Natural( Output( Pos + 3 ) ) * 8;
    HexPos := I + 2 * ( I mod 2 ) - 1;
    Hexstring(HexPos) := Hex( C );
    if Hexstring(HexPos) /= ExpHex(HexPos) then
      Error := Error + 1;
    end if;
  end loop;
  Put_Line("Expected: ");
  Put_Line(ExpHex);
  Put_Line("Obtained: ");
  Put_Line(Hexstring);
  Put_Line("Errors found: " & Natural'Image(Error));
end test_sponge;
```

The .vpatch for all the above and its signature can be found as usual on my Reference Code Shelf as well as directly from the links below:

To make it perfectly clear before I even get into any options regarding the bit & byte mess: the whole trouble arises essentially because Keccak's specification can't stick to one thing and one thing only, i.e. either "this is bit-level futzing" or "this is byte-level fizzing." Instead it ends up as a bit-byte-bit level-hopping mad tzizzing: theoretically it works at bit level, since the state and transformations really are defined at bit level; in practice and *at the same time*, the whole thing is specified and discussed at byte (octet) level, with the Z dimension of the state pretty much ignored (since that is basically the bit level) and algorithms duly given at... word level even (speed! implementation concern! Yes, yes, but why, WHY can't you keep the two well separated and labeled as such? WHY?). Moreover, note that the padding rule for instance is specified at bit level: "10*1", i.e. two 1 bits at either end with as many 0s as needed in between, so that the whole input is brought to a length that is a multiple of the selected block (Keccak rate) length. In test vectors, however, this ends up as 0x80 because of the unstated assumptions of **bit order** in the Keccak state - basically, why not, we have 2 bit orders so why not use them both, right? MOREOVER, the whole thing is anyway meant to work as a hashing of messages, hence of characters, hence of... octets (bytes)!^{2} So on one hand there are test vectors with all sorts of numbers of bits as input; on the other hand you'll likely want it to handle *consistently* inputs that are inescapably multiples of 8 bits and nothing else.
"Oh, but this is solved already" you'll say - sure, it is, of sorts: on one hand by NIST having their own whole damned Appendix for SHA-3 just to sort-of, ass-backwards explain the absorbing into the state, because it's so unclear due to bit order; on the other hand by Keccak having one padding rule while SHA-3 has yet another (hey, we put in a "delimiter" which is 01 for some reason). What, isn't that **clear**? NO? But how surprising!

To bring back some sanity, let's see a few options:

**1. Input/Output size & padding**:

1.1 Fix both input and output at octet (byte) level, reflecting the intended reality of 8 bits/character anyway. In this case:

- valid input for Keccak is a multiple of 8 bits;
- valid output from Keccak is a multiple of 8 bits;
- Keccak rate (block size, i.e. maximum number of bits absorbed/squeezed between two scramblings of the Keccak state) is a multiple of 8 bits;
- input is padded with 10*1 up to the closest multiple of block length (Keccak rate). Consequently, the length of the padding will be a minimum of 8 bits and a maximum of block length bits.
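For option 1.1 the padding length is a simple computation; a Python sketch of it (illustrative only, with the helper name `pad101_length` being mine):

```python
def pad101_length(msg_bits, rate):
    """Bits of 10*1 padding needed to bring an octet-level message up to a
    multiple of the rate (both msg_bits and rate are multiples of 8)."""
    rem = msg_bits % rate
    return rate - rem if rem != 0 else rate  # full extra block if already aligned

# with rate 1344: a 1336-bit message gets the minimum 8 bits of padding,
# while a message already a multiple of 1344 gets a full block of padding
assert pad101_length(1336, 1344) == 8
assert pad101_length(1344, 1344) == 1344
assert pad101_length(8, 1344) == 1336
```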

1.2 Keep input and output at bit level, reflecting the bit-level nature of the Keccak permutation. In this case:

- valid input for Keccak is any number of bits > 0;
- valid output from Keccak is any number of bits > 0;
- Keccak rate (block size) is any number of bits < length of Keccak state in bits;
- input is padded with 10*1 up to the closest multiple of block length (Keccak rate). Consequently, the length of the padding will be a minimum of 2 bits and a maximum of block length bits. In this case, however, the conversion from characters to bit streams needs to be clearly specified too: does one still consider the whole octet anyway, or does one discard any trailing 0s, for instance?

**2. Input bit and byte (octet) order**:

2.1 Input is first padded with the rule 10*1 and *the result* considered to have Little Endian byte order (hence octets will be flipped on Big Endian iron) and LSB bit order (hence absorbed bit by bit in order into the Keccak state).

2.2 The padding itself is flipped first to LSB (so the end octet if it's one octet full of padding will be 0x80 rather than 0x01) and then attached to the input that is considered as a result to have Little Endian byte order (hence octets will be flipped on Big Endian iron) and LSB bit order (hence absorbed bit by bit in order into the Keccak state).
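The 0x80-versus-0x01 flip that separates options 2.1 and 2.2 comes down to reading the same bits in the two possible orders; a small Python sketch of just that ambiguity (illustrative only, not the Ada code):

```python
bits = [1, 0, 0, 0, 0, 0, 0, 0]  # a single set bit followed by zeros

# the very same octet of bits reads as two different values,
# depending on which bit order one assumes
msb_value = sum(b << (7 - i) for i, b in enumerate(bits))  # MSB-first reading
lsb_value = sum(b << i for i, b in enumerate(bits))        # LSB-first reading

assert msb_value == 0x80
assert lsb_value == 0x01
```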

**3. Output bit and byte (octet) order**:

3.1 Output is extracted bit by bit from the state, meaning that the output will have LSB bit order. For example, Keccak will output 1101 0011 that is LSB for the corresponding 1100 1011 MSB (or 0xCB). Moreover, if requested output is NOT a multiple of 8 bits, the "stray" bits at the end will be the least significant from the state's relevant octet. This is what the current SMG implementations do. Essentially this makes the absorb/squeeze quite straightforward and it leaves the interpretation of bits up to the user of Keccak, so outside the permutations themselves.

3.2 Output is extracted octet by octet from the state, meaning that the output will have MSB bit order. For example, Keccak will output 1100 1011 in the case above at 3.1 (0xCB). Note that in this case an output that is NOT a multiple of 8 bits will end in the most significant bits of the relevant octet in the state. As a concrete example: 4 bits of output in the same 0xCB case used before will be "1100" with this option but "1101" with the previous option from 3.1.
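The 0xCB example used in options 3.1 and 3.2 can be checked mechanically; a Python sketch (illustrative only) of the two extraction orders and of the differing "stray" 4 bits:

```python
octet = 0xCB  # 1100 1011 written MSB-first

lsb_first = [(octet >> i) & 1 for i in range(8)]        # option 3.1: bit by bit
msb_first = [(octet >> (7 - i)) & 1 for i in range(8)]  # option 3.2: octet by octet

assert lsb_first == [1, 1, 0, 1, 0, 0, 1, 1]  # i.e. 1101 0011
assert msb_first == [1, 1, 0, 0, 1, 0, 1, 1]  # i.e. 1100 1011

# asking for only 4 bits of output gives different stray bits per option
assert lsb_first[:4] == [1, 1, 0, 1]  # option 3.1
assert msb_first[:4] == [1, 1, 0, 0]  # option 3.2
```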

There are of course further options that go deeper into the internals of Keccak. For instance, it is certainly possible to use MSB bit ordering in the Keccak state and change accordingly the relevant transformations. However, such approaches are more time-consuming at this stage and I don't see the clear benefit that would justify that. Nevertheless, if you see such benefit, kindly argue your case in the comments section below. Once again, now it's the time to provide any input on this because once it's fixed that's how it stays and we'll have to live with it as it is.

This is by no means the fastest implementation possible. At the very least, a faster implementation would unroll the loops and parallelize tasks wherever possible. ↩

And NO, I shall not even entertain the additional madness of Unicode and other Unicorns, thank you. ↩
