More Toy Models for Debugging Generation in HF 🤗

by Manuel de Prada Corral


Continuing the saga of the Toy Transformer Carcass (which I will eventually merge here), I have built a few more toy models to debug generation algorithms in HuggingFace.

The Small Probabilities Transformer

The only reason this is called a Transformer is that it interfaces with the HF transformers library. Again, it is an autoregressive probabilistic model that does not learn from data and ignores its input.

The vocabulary consists of 12 IDs: token -2 is BOS, token 11 is both EOS and PAD, and tokens 0 to 9 are vocabulary tokens.

The intent here is to test the behavior of decoding algorithms when some of the generated sequences have very small probabilities. The idea is very simple: at each autoregressive generation step, the model assigns equal probability to the 10 vocabulary tokens, so the sequences are random strings over the 0-9 tokens. However, the first token determines the length of the sequence, and therefore both its probability and the number of distinct sequences that share that length and probability.
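To make this concrete, here is a minimal sketch of what such a model can look like when plugged into transformers' `generate()`. Everything below is illustrative rather than the actual implementation: the names `SmallProbsConfig`, `SmallProbsModel` and `target_length` are made up for this post, the length/EOS bookkeeping is simplified, and details such as the explicit `GenerationMixin` inheritance may need adjusting for your transformers version.

```python
import torch
from transformers import GenerationMixin, PretrainedConfig, PreTrainedModel
from transformers.modeling_outputs import CausalLMOutputWithPast

BOS, EOS = -2, 11              # BOS is -2, EOS/PAD is 11, vocabulary tokens are 0-9
VOCAB_TOKENS = list(range(10))


def target_length(first_token: int) -> int:
    # Length rule from the table in the next section (what exactly counts
    # towards the length is simplified here).
    return 13 * (first_token + 1) - 1


class SmallProbsConfig(PretrainedConfig):
    model_type = "small-probs-toy"

    def __init__(self, **kwargs):
        super().__init__(vocab_size=12, bos_token_id=BOS,
                         eos_token_id=EOS, pad_token_id=EOS, **kwargs)


class SmallProbsModel(PreTrainedModel, GenerationMixin):
    config_class = SmallProbsConfig

    def __init__(self, config):
        super().__init__(config)
        # generate() infers the device from the parameters, so keep a dummy one.
        self.dummy = torch.nn.Parameter(torch.zeros(1))

    def prepare_inputs_for_generation(self, input_ids, **kwargs):
        return {"input_ids": input_ids}

    def forward(self, input_ids, **kwargs):
        batch, cur_len = input_ids.shape
        logits = torch.full((batch, cur_len, self.config.vocab_size), -1e9,
                            device=input_ids.device)
        for b in range(batch):
            if cur_len > 1 and cur_len >= target_length(input_ids[b, 1].item()):
                logits[b, -1, EOS] = 0.0            # stop once the target length is reached
            else:
                logits[b, -1, VOCAB_TOKENS] = 0.0   # uniform 10% over tokens 0-9
        return CausalLMOutputWithPast(logits=logits)
```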

First step: path selection

The second token (the first after BOS) is sampled uniformly from tokens 0 to 9, each with 10% probability:

stateDiagram-v2
    state "[-2]" as 0
    state "[-2,0]" as 00
    state "[-2,1]" as 01
    state "[-2,2]" as 02
    state "[-2,3]" as 03
    state "[-2,4]" as 04
    state "[-2,5]" as 05
    state "[-2,6]" as 06
    state "[-2,7]" as 07
    state "[-2,8]" as 08
    state "[-2,9]" as 09


    [*] --> 0 : -2 (BOS)
    0 --> 00 : 10%
    0 --> 01 : 10%
    0 --> 02 : 10%
    0 --> 03 : 10%
    0 --> 04 : 10%
    0 --> 05 : 10%
    0 --> 06 : 10%
    0 --> 07 : 10%
    0 --> 08 : 10%
    0 --> 09 : 10%
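With the sketch above, this first step can be checked directly from a single forward pass (`SmallProbsModel` and `SmallProbsConfig` are the hypothetical names from that sketch):

```python
import torch

# First decoding step: conditioned only on BOS (-2), the next-token
# distribution should put 10% on each of tokens 0-9 and ~0 elsewhere.
model = SmallProbsModel(SmallProbsConfig())
probs = model(torch.tensor([[-2]])).logits[0, -1].softmax(-1)
print(probs)  # ~0.1 for ids 0-9, ~0 for the remaining ids
```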

This first token completely determines the route that the sequence will follow. Each subsequent decoding step also chooses one of the 10 vocabulary tokens with equal probability, so all sequences are random strings over the 0-9 tokens, but the length of the sequence depends on the first token.

Next steps: sequence length depending on the first token

The following table shows the properties of each sequence depending on its first token, and the number of different sequences of that length that can be generated.

| First Token | Seq Length | Prob | Number of Seqs | $\ln(p)$ |
|---|---|---|---|---|
| 0 | 12 | $10^{-13}$ | $10^{12}$ | -29.9 |
| 1 | 25 | $10^{-26}$ | $10^{25}$ | -59.9 |
| 2 | 38 | $10^{-39}$ | $10^{38}$ | -89.8 |
| 3 | 51 | $10^{-52}$ | $10^{51}$ | -119.7 |
| 4 | 64 | $10^{-65}$ | $10^{64}$ | -149.7 |
| 5 | 77 | $10^{-78}$ | $10^{77}$ | -179.6 |
| 6 | 90 | $10^{-91}$ | $10^{90}$ | -209.5 |
| 7 | 103 | $10^{-104}$ | $10^{103}$ | -239.5 |
| 8 | 116 | $10^{-117}$ | $10^{116}$ | -269.4 |
| 9 | 129 | $10^{-130}$ | $10^{129}$ | -299.3 |
| $n$ | $13(n+1)-1$ | $10^{-\text{len}-1}$ | $10^{\text{len}}$ | $\ln(10^{-\text{len}-1})$ |
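The rows follow from the formulas in the last row of the table; a few lines of Python reproduce them:

```python
import math

# For first token n: length = 13*(n+1) - 1, each sequence has probability
# 10^-(len+1), and there are 10^len distinct sequences of that length.
for n in range(10):
    length = 13 * (n + 1) - 1
    ln_p = -(length + 1) * math.log(10)   # ln(10^-(len+1))
    print(f"first token {n}: len={length}, prob=1e-{length + 1}, "
          f"num_seqs=1e{length}, ln(p)={ln_p:.1f}")
```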

Purpose

The gist of this model is to see how different generation algorithms behave when the probability of some sequences is very small. If we use ancestral sampling, we will get roughly 10% of each type of sequence, irrespective of how small the probability of the final sequence is.
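As a quick sanity check, one can sample a batch of sequences with `generate(do_sample=True, ...)` and count the first tokens; this again uses the hypothetical `SmallProbsModel` sketch from above, so treat it as illustrative:

```python
import torch
from collections import Counter

# Ancestral sampling: each first token (and hence each length/probability
# bucket) should appear ~10% of the time, however tiny the sequence probability.
model = SmallProbsModel(SmallProbsConfig())
prompts = torch.full((1000, 1), -2)                  # a batch of BOS-only prompts
out = model.generate(prompts, do_sample=True, max_new_tokens=150,
                     pad_token_id=11, eos_token_id=11)
print(Counter(out[:, 1].tolist()))                   # roughly 100 hits for each of 0-9
```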

When using a sampling-without-replacement algorithm like Stochastic Beam Search (SBS), the sample space is so large that removing already-sampled sequences barely changes the distribution, so we should also get roughly 10% of each type of sequence.

The Binary Transformer

todo...