More Toy Models for Debugging Generation in HF 🤗

by Manuel de Prada Corral


Continuing the saga of the Toy Transformer Carcass (which I will eventually merge here), I have built a few more toy models to debug generation algorithms in HuggingFace.

The Small Probabilities Transformer

The only reason this is called a Transformer is that it interfaces with the HF transformers library. Again, it is an autoregressive probabilistic model that does not learn from data and ignores its input.

The vocabulary consists of 12 IDs: token -2 is BOS, token 11 is both EOS and PAD, and tokens 0 to 9 are vocabulary tokens.

The intent here is to test the behavior of decoding algorithms when some of the generated sequences have very small probabilities. The idea is very simple: at each autoregressive generation step, the model assigns equal probability to the 10 vocabulary tokens, so the sequences are random strings over the 0-9 tokens. However, the first token determines the length of the sequence, and therefore both its probability and the number of distinct sequences that share that length and probability.
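To make this concrete, here is a minimal sketch of what such a model can look like when plugged into transformers' `generate()`. Everything below is illustrative rather than the actual implementation: the names `SmallProbsConfig`, `SmallProbsModel` and `target_length` are made up for this post, the length/EOS bookkeeping is simplified, and details such as the explicit `GenerationMixin` inheritance may need adjusting for your transformers version.

```python
import torch
from transformers import GenerationMixin, PretrainedConfig, PreTrainedModel
from transformers.modeling_outputs import CausalLMOutputWithPast

BOS, EOS = -2, 11              # BOS is -2, EOS/PAD is 11, vocabulary tokens are 0-9
VOCAB_TOKENS = list(range(10))


def target_length(first_token: int) -> int:
    # Length rule from the table in the next section (what exactly counts
    # towards the length is simplified here).
    return 13 * (first_token + 1) - 1


class SmallProbsConfig(PretrainedConfig):
    model_type = "small-probs-toy"

    def __init__(self, **kwargs):
        super().__init__(vocab_size=12, bos_token_id=BOS,
                         eos_token_id=EOS, pad_token_id=EOS, **kwargs)


class SmallProbsModel(PreTrainedModel, GenerationMixin):
    config_class = SmallProbsConfig

    def __init__(self, config):
        super().__init__(config)
        # generate() infers the device from the parameters, so keep a dummy one.
        self.dummy = torch.nn.Parameter(torch.zeros(1))

    def prepare_inputs_for_generation(self, input_ids, **kwargs):
        return {"input_ids": input_ids}

    def forward(self, input_ids, **kwargs):
        batch, cur_len = input_ids.shape
        logits = torch.full((batch, cur_len, self.config.vocab_size), -1e9,
                            device=input_ids.device)
        for b in range(batch):
            if cur_len > 1 and cur_len >= target_length(input_ids[b, 1].item()):
                logits[b, -1, EOS] = 0.0            # stop once the target length is reached
            else:
                logits[b, -1, VOCAB_TOKENS] = 0.0   # uniform 10% over tokens 0-9
        return CausalLMOutputWithPast(logits=logits)
```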

First step: path selection

The second token (the first after BOS) is sampled uniformly from tokens 0 to 9, each with 10% probability:

stateDiagram-v2
    state "[-2]" as 0
    state "[-2,0]" as 00
    state "[-2,1]" as 01
    state "[-2,2]" as 02
    state "[-2,3]" as 03
    state "[-2,4]" as 04
    state "[-2,5]" as 05
    state "[-2,6]" as 06
    state "[-2,7]" as 07
    state "[-2,8]" as 08
    state "[-2,9]" as 09


    [*] --> 0 : -2 (BOS)
    0 --> 00 : 10%
    0 --> 01 : 10%
    0 --> 02 : 10%
    0 --> 03 : 10%
    0 --> 04 : 10%
    0 --> 05 : 10%
    0 --> 06 : 10%
    0 --> 07 : 10%
    0 --> 08 : 10%
    0 --> 09 : 10%
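With the sketch above, this first step can be checked directly from a single forward pass (`SmallProbsModel` and `SmallProbsConfig` are the hypothetical names from that sketch):

```python
import torch

# First decoding step: conditioned only on BOS (-2), the next-token
# distribution should put 10% on each of tokens 0-9 and ~0 elsewhere.
model = SmallProbsModel(SmallProbsConfig())
probs = model(torch.tensor([[-2]])).logits[0, -1].softmax(-1)
print(probs)  # ~0.1 for ids 0-9, ~0 for the remaining ids
```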

This first token completely determines the route that the sequence will follow. Each subsequent decoding step also chooses one of the 10 vocabulary tokens with equal probability, so all sequences are random strings over the 0-9 tokens, but the length of the sequence depends on the first token.

Next steps: sequence length depending on the first token

The following table shows the properties of each sequence depending on its first token, and the number of different sequences of that length that can be generated.

| First Token | Seq Length | Prob | Number of Seqs | $\ln(p)$ |
|---|---|---|---|---|
| 0 | 12 | $10^{-13}$ | $10^{12}$ | -29.9 |
| 1 | 25 | $10^{-26}$ | $10^{25}$ | -59.9 |
| 2 | 38 | $10^{-39}$ | $10^{38}$ | -89.8 |
| 3 | 51 | $10^{-52}$ | $10^{51}$ | -119.7 |
| 4 | 64 | $10^{-65}$ | $10^{64}$ | -149.7 |
| 5 | 77 | $10^{-78}$ | $10^{77}$ | -179.6 |
| 6 | 90 | $10^{-91}$ | $10^{90}$ | -209.5 |
| 7 | 103 | $10^{-104}$ | $10^{103}$ | -239.5 |
| 8 | 116 | $10^{-117}$ | $10^{116}$ | -269.4 |
| 9 | 129 | $10^{-130}$ | $10^{129}$ | -299.3 |
| $n$ | $13(n+1)-1$ | $10^{-\text{len}-1}$ | $10^{\text{len}}$ | $\ln(10^{-\text{len}-1})$ |
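The rows follow from the formulas in the last row of the table; a few lines of Python reproduce them:

```python
import math

# For first token n: length = 13*(n+1) - 1, each sequence has probability
# 10^-(len+1), and there are 10^len distinct sequences of that length.
for n in range(10):
    length = 13 * (n + 1) - 1
    ln_p = -(length + 1) * math.log(10)   # ln(10^-(len+1))
    print(f"first token {n}: len={length}, prob=1e-{length + 1}, "
          f"num_seqs=1e{length}, ln(p)={ln_p:.1f}")
```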

Purpose

The gist of this model is to see how different generation algorithms behave when the probability of some sequences is very small. If we use ancestral sampling, we will get roughly 10% of each type of sequence, irrespective of how small the probability of the final sequence is.
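As a quick sanity check, one can sample a batch of sequences with `generate(do_sample=True, ...)` and count the first tokens; this again uses the hypothetical `SmallProbsModel` sketch from above, so treat it as illustrative:

```python
import torch
from collections import Counter

# Ancestral sampling: each first token (and hence each length/probability
# bucket) should appear ~10% of the time, however tiny the sequence probability.
model = SmallProbsModel(SmallProbsConfig())
prompts = torch.full((1000, 1), -2)                  # a batch of BOS-only prompts
out = model.generate(prompts, do_sample=True, max_new_tokens=150,
                     pad_token_id=11, eos_token_id=11)
print(Counter(out[:, 1].tolist()))                   # roughly 100 hits for each of 0-9
```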

When using a sampling-without-replacement algorithm like Stochastic Beam Search (SBS), the sample space is so large that removing already-sampled sequences barely changes the distribution, so we should also get roughly 10% of each type of sequence.

The Binary Transformer

todo...