Table of Contents
Before any encrypting can be done, the round keys for each round need to be created. This is referred to as the Key schedule, or key expansion. Basically what happens here is based on the various parameters, (key bits, block bits, and rounds) the key schedule array is filled. It consists of a key for each round, which is the same size as the data block size.
The expanded key is usually represented by the array W, which is a linear array of 32 bit words. The number of words in the array is the number of bytes in the plain-text multiplied by the number of rounds.
The first 0 through Nk (key bytes) of the W array are filled with the input key, and after that the rest consist of W[i-1] XORed with W[i-Nk], where i is the value you use to iterate through the array W. Whenever i is multiple of Nk, a transformation is performed on W[i-1] which consists of a shift in the bytes and a lookup in the S table.
The encryption process consists of rounds of 4 main steps. The number of rounds is dependent on the block size of both the key and the block that you are encrypting. Input data is transformed in to a 4x8, 4x6, or 4x4 byte array depending on block-size (32, 24, and 16 byte block sizes are allowed). Nb is the block size of the data block to be encrypted, and Nk is the key size.
| Rounds | Nb = 4 | Nb = 6 | Nb = 8 |
|---|---|---|---|
| Nk = 4 | 10 | 12 | 14 |
| Nk = 6 | 12 | 12 | 14 |
| Nk = 8 | 14 | 14 | 14 |
In order to make it easier for the encryption steps, the plain-text is placed into a array of similar dimensions to the key array, i.e. a 4 byte by Nb byte array. Before the first round begins, the first round key is XORed with the plain-text.
Substitution
The basis for this step is the field GF(2^8). Before encryption is performed, a S table is created, consisting of the values in the field GF(256). There are 256 values in this table which are used to perform a lookup. Iterating over each byte in the transformed data array (called DA here) it indexes into the S array by the byte value of a byte in DA, and performs a swap.
Shifting Rows
This step takes row 1, 2, and 3 of DA and cyclically rotates the values in it by an amount determined by the block size (Nb).
| Nb | C1 | C2 | C3 |
|---|---|---|---|
| 4 | 1 | 2 | 3 |
| 6 | 1 | 2 | 3 |
| 8 | 1 | 3 | 4 |
Mixing Columns
This step can be visualized as a matrix multiplication. The multiplication is done over the field GF(2^8), so the actual element multiplication uses a premade table to look up its multiplication. The multiplication is fixed, and is the following:
| [02 03 01 01] | [a0] |
| [01 02 03 01] | [a1] |
| [01 01 02 03] | [a2] |
| [03 01 01 02] | [a3] |
XORing with a Round key
This is a pretty easy step, the key array that was created by the key scheduling algorithm is simply XORed with the DA array.
Its nice to conceptually be able to see all the encryption steps in your code, but its really not the best way to code this. Each step requires multiple memory accesses, can include weird pointer arithmetic (depending on compiler), and is just too hard to do quickly.
Each step listed above can be combined into an array lookup and an XOR. This dramatically simplifies the computation and memory intensiveness of a round, and makes it realistic to perform. In order to combine all 4 steps, need to consider generic input and realize that the only data dependent step is the very first S box substitution. When completing a round of encryption for generic data, you will come out with a transform that can be applied to the S table to create four different master tables.
If are doing going to implement AES in hardware (something like an FPGA) then you do not want to use the super tables, cause they take up a couple Kb in memory. The S-Boxes can be implemented as an 8x8 LUT, and the rest of the steps can be done easily with shifts and such. There are a number of papers on high performance AES hardware design, and I have seen implementations achieve 2Gb/sec throughput on some of their designs.