Fixed instruction block was referring to the Huffman encoding. Something like 8 or 16kb per instruction block (perhaps set by a flag?). Compilers would have to optimize to stay within the block, but they optimize for sticking in L1 cache anyway.
Since we're going all-in on code density, let's go with a semi-accumulator 16-bit ISA. 8 bits for instructions, 8 bits for registers (with 32 total registers). We'll split into 5 bits and 3 bits. 5-bits gives access to all registers since quite a few are either read-only (zero register, flag register) or write occasionally (stack pointer, instruction counter). The remaining 3 bits specify 8 registers that can be the write target. There will be slightly more moves, but that just means that moves compress better and seems like it should enforce certain register patterns being used more frequently which is also better for compression.
We can take advantage of having 2 separate domains (one for each byte) to create 2 separate Huffman trees. In the worst case, it seems like we increase our code size, but in more typical cases where we're using just a few instructions a lot and using a few registers a lot, the output size should be smaller. While our worst-case lookup would be 8 deep, more domain-specific lookup would probably be more likely to keep the depth lower. In addition, two trees means we can process each instruction in parallel.
As a final optimization tradeoff, I believe you could do a modified Huffman that always encoded a fixed number of bits (eg, 2, 4, 6, or 8) which would half theoretical decode time at the expense of an extra bit on some encodings. it would be +25% for 3-bit encoding, but only 16% for 5-bit encoding (perhaps step 2, 3, 4, 6, 8). For even wider decode, we could trade off a little more by forcing the compiler to ensure that each Huffman encoding breaks evenly every N bytes so we can setup multiple encoders in advance. This would probably add quite a bit to compiling time, but would be a huge performance and scaling advantage.
Immediates are where things get a little strange. The biggest problem is that the immediate value is basically random so it messes up encoding, but otherwise it messes with data fetching. The best solution seems to be replacing the 5-bit register address with either 5 bits of data or 6 bits (one implied) of jump immediate.
Never gave it too much thought before now, but it's an interesting exercise.
Since we're going all-in on code density, let's go with a semi-accumulator 16-bit ISA. 8 bits for instructions, 8 bits for registers (with 32 total registers). We'll split into 5 bits and 3 bits. 5-bits gives access to all registers since quite a few are either read-only (zero register, flag register) or write occasionally (stack pointer, instruction counter). The remaining 3 bits specify 8 registers that can be the write target. There will be slightly more moves, but that just means that moves compress better and seems like it should enforce certain register patterns being used more frequently which is also better for compression.
We can take advantage of having 2 separate domains (one for each byte) to create 2 separate Huffman trees. In the worst case, it seems like we increase our code size, but in more typical cases where we're using just a few instructions a lot and using a few registers a lot, the output size should be smaller. While our worst-case lookup would be 8 deep, more domain-specific lookup would probably be more likely to keep the depth lower. In addition, two trees means we can process each instruction in parallel.
As a final optimization tradeoff, I believe you could do a modified Huffman that always encoded a fixed number of bits (eg, 2, 4, 6, or 8) which would half theoretical decode time at the expense of an extra bit on some encodings. it would be +25% for 3-bit encoding, but only 16% for 5-bit encoding (perhaps step 2, 3, 4, 6, 8). For even wider decode, we could trade off a little more by forcing the compiler to ensure that each Huffman encoding breaks evenly every N bytes so we can setup multiple encoders in advance. This would probably add quite a bit to compiling time, but would be a huge performance and scaling advantage.
Immediates are where things get a little strange. The biggest problem is that the immediate value is basically random so it messes up encoding, but otherwise it messes with data fetching. The best solution seems to be replacing the 5-bit register address with either 5 bits of data or 6 bits (one implied) of jump immediate.
Never gave it too much thought before now, but it's an interesting exercise.