UTF-8 Decoding on the Mill ISA

The Mill is a CPU architecture being designed to rethink how microarchitectures should work. It gives me the same feeling as when a problem is reframed and bunch of previously-tricky tradeoffs just evaporate. I try to pay attention when I get that feeling. I recommend Ivan Goddard's talks on the Mill if you're interested.

Decoding

So this is my attempt to make a UTF-8 decoder for the Mill, targeting the Silver spec. My goal is a working(ish) PoC.

I've been itching to try to write something interesting in ConAsm (Mill's assembly language) and UTF-8 in particular seemed a good choice because it's easy to describe in general terms but has a number of complications that impact decoding speed in most architectures. UTF-8 is extremely common and decode speed is important for many programs that will run on the Mill. If you're not familiar with UTF-8 I highly reccomend the Numberphile video or Wikipedia.

This implementation makes heavy use of vectors to collapse the decode logic and eliminate internal branching, and decodes one full codepoint from 1-4 bytes of UTF-8 in 14 cycles. There are only 3 branches: one for early exit for the 1-byte case, and two for decode error conditions, plus the final value return. It does error checking to test that the number of continuation bytes that follow the first byte matches the first byte prefix and only valid prefixes are allowed.

I'm almost certain that this has some problems, especially with with types, widening, latencies, and vectors. There's no public sim or a large corupus of examples, so I make some assumptions about the behavior of instructions where it isn't clear to me. I'm using named belt positions instead of belt numbers (these can be assigned automatically) and I'm ignoring belt length because my poor brain had had enough already.

If you notice some improvement feel free to let me know, I'll be updating the code as I get feedback with edit history in this gist.

// decode accepts a byte and a vector of the next 3 bytes (the %first byte and the possible 
// %continuation bytes and returns a decoded code point in an integer, and the number of bytes
// consumed (including the first byte) The vector may contain NaR bytes if at the end of a buffer
F('decode') %first, %cont;
// fast path for 1 byte
con(v(0xe0, 0xf0, 0xf8))   %prefmask, //R0    # masks of the three possible prefixes
lssu(%first, 0x80)         %onebyte,  //  E0  # check if the first byte is < 0x80
andlu(%first, %prefmask)   %masked,   //  E1  # bit-and with bitmasks of 3 possible prefixes
retntr(%onebyte, %first, 1);          //    F0# if < 0x80 return the first byte & consume one byte

// 2-4 byte decoding:                 
con(v(0xc0, 0xe0, 0xf0))   %prefixes, //R0    # the first three prefixes themselves
eqlu(%masked, %prefixes)   %picked,   //  E0  # compare the first byte with the 3 possible prefixes
andlu(%cont, 0xc0)         %conhi,    //  E1  # high bits of continuation bytes (x in 0bxxyyyyyy)
andlu(%cont, 0x3f)         %conlo;    //  E2  # low bits of continuation bytes (y in 0bxxyyyyyy)

con(v(2, 3, 4, 5))         %shufidxa, //R0    # adjusted shuffle indexes
smearx(%picked)            %excluded, //  E0  # mark all bytes excluded from decoding
any(%picked)               %matched,  //  E1  # find if any of the prefixes match
eqlu(%conhi, 0x80)         %contest,  //  E2  # check if high bits of continuation bytes match
left(%picked)              %idx;      //  E3  # index of the last continuation byte
retnfl(%matched, -1, 1);              //    F0# if no prefixes match, return error and consume one byte

con(v(0x1f, 0xf, 0x7))     %valmask,  //R0    # bitmasks of codepoint bits for each prefix
con(v(18, 12, 6, 0))       %shifts,   //R1    # a list of shift amounts
extract(%valmask, %idx)    %mask,     //  E0  # extract out the correct bitmask for the first byte
notl(%excluded)            %included, //  E1  # smearx+notl makes a logical bool vec of bytes to decode
sublu(%shufidxa, %idx)     %shufidx,  //  E2  # shuffle indexes for shift amounts
add1(%idx)                 %cnt;      //  E3  # count of continuation bytes to decode

shuffle(%shifts, %shufidx) %shift,    //  E0  # shuffle the shifts to match the codepoint layout
andu(%first, %mask)        %fbits,    //  E1  # mask off the codepoint bits from the first byte
imp(%included, %contest)   %conck;    //  E2  # the bytes to decode imply the matched continuation bytes
shuffle(%conlo, v(3, 0, 1, 2)) %cp0;  //  E3  # shift continuation bytes to make room for the first byte

inject(%cp0, %fbits, 0)    %cp1,      //  E0  # put the bits from the first byte into the codepoint vec
all(%conck)                %conok,    //  E1  # implies should return all true if everything is ok
inject(%included, %cnt, 1) %cpbytes,  //  E2  # set the last 
retnfl(%conok, -1, 1);                //    F0# else return an error and consume one byte

con(v(0, 0, 0, 0))         %zero,     //R0    # zeros
shiftluv(%cp1, %shift)     %cp2,      //  E0  # shift each set of codepoint bits to their final position
pick(%cpbytes, %cp2, %zero)%cp3;      //   P0 # fill non-cp elements (that may be NaR or None) with zero

// OR-reduce the vector and return the result
alternate(%cp3, %cp3)      %cp4 %cp5; Nope;

orl(%cp4, %cp5)            %cp6;

alternate(%cp6, %cp6)      %cp7 %cp8; Nope;

orl(%cp7, %cp8)            %cp9;

extract(%cp9, 0)           %cp,
add1(%cnt)                 %consumed
retn(%cp, %consumed);
;

Thoughts

There are a number of ways the function interface could be designed, for example it could have taken a single argument of 4 bytes including the first byte instead of separating the first byte as a different argument. I chose this because it seemed simpler to implement.

If we could assume the byte sequence is valid UTF-8 and didn't have to do error checking, a lot of the code could be eliminated, but I'm not sure how much that would actually affect the total runtime since the length is mostly determined by data dependencies. That said, I think that with some help and utilizing pipelining this vector strategy may able to get down to just a couple instructions decode time per full code point on a Gold spec. Especially if the call structure was inverted and the full decode loop call()ed a handler with the decoded code point on every loop iteration (instead of this design where the next code point is requested from somewhere outside this function).

In working with this I found a few instruction additions that I think may have helped and could be generally applicable:

  • An inverted smear would be useful and would eliminate the notl instruction. e.g. smearin, smearxn
  • A variant of extract that works with two vectors would be useful. extract(%values, %pick) where the first corresponding (logical bool) true vector element in %pick is extracted from %values as a scalar. This could remove the left instruction from the longest dependency chain.
  • I wish there was a better way to reduce a vector, it ends up taking over half the cycles. I felt like it could be easier even before I saw that alternate has a latency of 2 and requires No-ops to time correctly. In fact it would probably be faster to just extract everything and or it together manually vs doing it in-vector.

Example Runs

Here are some example runs decoding different codepoints so you can get an idea of how it works:

$/U+0024: 24

input: %first = 0x24, %cont = [NaR,NaR,NaR]

%onebyte = (0x24 < 0x80) = true
%masked = (0x24 & [0xe0, 0xf0, 0xf8]) = [20, 20, 20]
returns (0x24, 1)

¢/U+00A2: C2 A2

input: %first = C2, %cont = [A2,NaR,NaR]

%onebyte = (%first < 80) = false
%masked = (%first & [e0, f0, f8]) = [20, 20, 20]
(%onebyte is false, no branch)

%picked = (%masked == [c0, e0, f0]) = [true, false, false]
%conhi = (%cont & 0xc0) = [80, NaR, NaR]
%conlo = (%cont & 0x3f) = [22, NaR, NaR]

%excluded = [false, true, true]
%matched = true
%contest = (%conhi == 80) = [true, NaR, NaR]
%idx = 0
(%matched is true, no branch)

%mask = ([1f, 0f, 07][%idx]) = 1f
%included = (~%excluded) = [true, false, false]
%shufidx = ([2, 3, 4, 5] - %idx) = [2, 3, 4, 5]
%cnt = %idx+1 = 1

%shift = ([18, 12, 6, 0] shuffle %shufidx) = [6, 0, NaR, NaR]  // idk what happens with out-of-range shuffle indexes
%fbits = (%first & %mask) = 02
%conck = (%included -> %contest) = [true, true, true]            // false implies NaR should be true, right? recur with zeros otherwise
%cp0 = (%conlo shuffle [3, 0, 1, 2]) = [NaR, 22, NaR, NaR]

%cp1 = (%cp1[0] = %fbits) = [02, A2, NaR, NaR]
%conok = (all(%conck)) = true
%cpbytes = (%included[%cnt] = true) = [true, true, false]
(%conok is true, no branch)

%cp2 = (%cp1 << %shift) = [02 << 6, 22 << 0, NaR, NaR] = [80, 22, NaR, NaR]
%cp3 = (%cpbytes ? %cp2 : [0,0,0,0]) = [80, 22, 0, 0]

%cp4, %cp5 = [80, 80, 22, 22], [0, 0, 0, 0]

%cp6 = [80, 80, 22, 22]

%cp7, %cp8 = [80, 80, 80, 80], [22, 22, 22, 22]

%cp9 = [A2, A2, A2, A2]

%cp = A2
%consumed = 2
(return %cp, %consumed)

ह/U+0939: E0 A4 B9

input: %first = E0, %cont = [A4,B9,NaR]

%onebyte = (%first < 80) = false
%masked = (%first & [e0, f0, f8]) = [e0, e0, e0]
(%onebyte is false, no branch)

%picked = (%masked == [0xc0, 0xe0, 0xf0]) = [false, true, false]
%conhi = (%cont & 0xc0) = [80, 80, NaR]
%conlo = (%cont & 0x3f) = [24, 39, NaR]

%excluded = [false, false, true]
%matched = true
%contest = (%conhi == 80) = [true, true, NaR]
%idx = 1
(%matched is true, no branch)

%mask = ([1f, 0f, 07][%idx]) = 0f
%included = (~%excluded) = [true, true, false]
%shufidx = ([2, 3, 4, 5] - %idx) = [1, 2, 3, 4]
%cnt = %idx+1 = 2

%shift = ([18, 12, 6, 0] shuffle %shufidx) = [12, 6, 0, NaR]
%fbits = (%first & %mask) = 0
%conck = (%included -> %contest) = [true, true, true]
%cp0 = (%conlo shuffle [3, 0, 1, 2]) = [NaR, 24, 39, NaR]

%cp1 = (%cp1[0] = %fbits) = [0, 24, 39, NaR]
%conok = (all(%conck)) = true
%cpbytes = (%included[%cnt] = true) = [true, true, true]
(%conok is true, no branch)

%cp2 = (%cp1 << %shift) = [0, 900, 39, NaR]
%cp3 = (%cpbytes ? %cp2 : [0,0,0,0]) = [0, 900, 39, 0]

%cp4, %cp5 = [0, 0, 900, 900], [39, 39, 0, 0]

%cp6 = [39, 39, 900, 900]

%cp7, %cp8 = [39, 39, 39, 39], [900, 900, 900, 900]

%cp9 = [939, 939, 939, 939]

%cp = 939
%consumed = 3
(return %cp, %consumed)

𐍈/U+10348: F0 90 8D 88

input: %first = F0, %cont = [90,8D,88]

%onebyte = (%first < 80) = false
%masked = (%first & [e0, f0, f8]) = [f0, f0, f0]
(%onebyte is false, no branch)

%picked = (%masked == [0xc0, 0xe0, 0xf0]) = [false, false, true]
%conhi = (%cont & 0xc0) = [80, 80, 80]
%conlo = (%cont & 0x3f) = [10, 0d, 08]

%excluded = [false, false, false, true]
%matched = true
%contest = (%conhi == 80) = [true, true, true]
%idx = 2
(%matched is true, no branch)

%mask = ([1f, 0f, 07][%idx]) = 07
%included = (~%excluded) = [true, true, true]
%shufidx = ([2, 3, 4, 5] - %idx) = [0, 1, 2, 3]
%cnt = %idx+1 = 3

%shift = ([18, 12, 6, 0] shuffle %shufidx) = [18, 12, 6, 0]
%fbits = (%first & %mask) = 0
%conck = (%included -> %contest) = [true, true, true]
%cp0 = (%conlo shuffle [3, 0, 1, 2]) = [NaR, 10, 0d, 08]

%cp1 = (%cp1[0] = %fbits) = [0, 10, 0d, 08]
%conok = (all(%conck)) = true
%cpbytes = (%included[%cnt] = true) = [true, true, true, true]
(%conok is true, no branch)

%cp2 = (%cp1 << %shift) = [0, 10000, 340, 8]
%cp3 = (%cpbytes ? %cp2 : [0,0,0,0]) = [0, 10000, 340, 8]

%cp4, %cp5 = [0, 0, 10000, 10000], [340, 340, 8, 8]

%cp6 = [340, 340, 10008, 10008]

%cp7, %cp8 = [340, 340, 340, 340], [10008, 10008, 10008, 10008]

%cp9 = [10348, 10348, 10348, 10348]

%cp = 10348
%consumed = 4
(return %cp, %consumed)

Thanks for reading!

Comments on the Mill Forum.