A Toy Probabilistic Transformer for Debugging Generation Algorithms in HuggingFace🤗

by Manuel de Prada Corral

3 min read

A few weeks ago, I found myself implementing "Stochastic Beams and Where to Find Them" (sampling without replacement from a Transformer).

Debugging and verifying the correctness of a sampling algorithm in HuggingFace is not straightforward. So I built a fake carcass of a Transformer model with a small vocabulary and fixed, controlled probabilities, which makes it easy to keep a close eye on the logits and the generated sequences.

stateDiagram-v2
    state "[0]" as 1
    state "[0,1]" as 01
    state "[0,2]" as 02
    state "[0,1,1]" as 011
    state "[0,1,1,3]" as 0113
    state "[0,1,2]" as 012
    state "[0,1,2,3]" as 0123
    state "[0,2,1]" as 021
    state "[0,2,1,3]" as 0213
    state "[0,2,2]" as 022
    state "[0,2,2,3]" as 0223
    
    note right of 0113
        prob=0.075
        logp=-2.59
    end note
    note right of 0123
        prob=0.675
        logp=-0.39
    end note
    note right of 0213
        prob=0.025
        logp=-3.68
    end note
    note right of 0223
        prob=0.225
        logp=-1.49
    end note


    [*] --> 1 : 0 (BOS)
    1 --> 01 : 75%
    1 --> 02 : 25%
    01 --> 011 : 10%
    01 --> 012 : 90%
    02 --> 021 : 10%
    02 --> 022 : 90%
    011 --> 0113 : EOS
    012 --> 0123 : EOS
    021 --> 0213 : EOS
    022 --> 0223 : EOS


The model can be seen as a Probabilistic Finite State Automaton: it does not learn from data and ignores its input. Instead, it uses fixed rules to produce a sequence of output tokens with predefined probabilities. The simplicity of this setup is intentional: it isolates and spotlights the sequence generation process.

The vocabulary consists of only 4 token IDs: token 0 (BOS, Beginning of Sequence), token 1, token 2, and token 3 (EOS, End of Sequence). The model uses only the length of the sequence so far to decide the probability distribution over the next token.

Here is how it works: if the sequence has a length of 1 (i.e., only BOS), it assigns a 75% probability to token 1 and a 25% probability to token 2. If the sequence length is 2, it gives a 10% probability to token 1 and a 90% probability to token 2. When the sequence length is 3, it always predicts EOS (with 100% probability), marking the end of the sequence.
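
Spelled out as code, the rule looks like the sketch below. This is just a restatement of the probabilities above (not the gist implementation), and it recovers the complete-sequence probabilities shown in the diagram:

# next-token distribution over [BOS, tok1, tok2, EOS], keyed by current sequence length
NEXT_TOKEN_PROBS = {
    1: [0.00, 0.75, 0.25, 0.00],   # after [0]
    2: [0.00, 0.10, 0.90, 0.00],   # after [0, x]
    3: [0.00, 0.00, 0.00, 1.00],   # after [0, x, y]: always EOS
}

def sequence_prob(tokens):
    """Probability of a complete sequence under the fixed rules above."""
    prob = 1.0
    for step, tok in enumerate(tokens[1:], start=1):  # skip the BOS prefix
        prob *= NEXT_TOKEN_PROBS[step][tok]
    return prob

for seq in ([0, 1, 2, 3], [0, 2, 2, 3], [0, 1, 1, 3], [0, 2, 1, 3]):
    print(seq, round(sequence_prob(seq), 3))
# [0, 1, 2, 3] 0.675
# [0, 2, 2, 3] 0.225
# [0, 1, 1, 3] 0.075
# [0, 2, 1, 3] 0.025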

Interfacing with Hugging Face's generate()

This tiny transformer is fully compatible with the built-in generate() function of the Hugging Face transformers library. You can try out the different decoding algorithms, play with the probabilities and the sequence length, and see how each algorithm behaves.

However, keep in mind that this model does not compute attention or hidden states; it simply returns empty tuples for those attributes.
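
The actual implementation lives in the gist imported below; purely as an illustration of what that compatibility involves, a model of this kind can be sketched roughly as follows (class names and details here are my own assumptions, not the gist code, and the snippet assumes a reasonably recent transformers version): a config carrying the special-token ids, a forward() that returns logits for the last position, and prepare_inputs_for_generation().

import torch
from transformers import GenerationMixin, PretrainedConfig, PreTrainedModel
from transformers.modeling_outputs import CausalLMOutputWithPast

class ToyConfig(PretrainedConfig):
    model_type = "toy-probabilistic-transformer"   # illustrative name
    def __init__(self, **kwargs):
        kwargs.setdefault("bos_token_id", 0)
        kwargs.setdefault("eos_token_id", 3)
        kwargs.setdefault("pad_token_id", 3)
        super().__init__(**kwargs)
        self.vocab_size = 4

class ToyTransformer(PreTrainedModel, GenerationMixin):
    config_class = ToyConfig

    def __init__(self, config=None):
        super().__init__(config or ToyConfig())
        self.dummy = torch.nn.Parameter(torch.zeros(1))  # gives the model a device/dtype
        self.generation_config.do_sample = True          # default to ancestral sampling

    def forward(self, input_ids, **kwargs):
        # fixed next-token distribution, chosen purely by the current sequence length
        probs = {1: [0.0, 0.75, 0.25, 0.0],
                 2: [0.0, 0.10, 0.90, 0.0],
                 3: [0.0, 0.00, 0.00, 1.0]}[input_ids.shape[1]]
        # generate() only reads logits[:, -1, :]; returning log-probabilities as logits
        # keeps the reported scores easy to interpret (exp(score) == probability)
        logits = torch.tensor(probs).log().expand(input_ids.shape[0], 1, -1)
        # no attention or hidden states are computed, so those stay empty
        return CausalLMOutputWithPast(logits=logits, attentions=(), hidden_states=())

    def prepare_inputs_for_generation(self, input_ids, **kwargs):
        return {"input_ids": input_ids}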

Utility in Debugging and Exploration

This toy Transformer is useful because it allows a granular, step-by-step examination of the sequence generation process. Since it is intended for probabilistic sampling, the default generation parameters are do_sample=True and num_beams=1.

Find the code on GitHub. Try generating 10 random samples using ancestral sampling:

pip install transformers torch gist-import
from gist_import import GistImporter
import torch
model = GistImporter('00d7a84632d8e858ff0c208e5e44559b')['FakeTransformer']()
torch.manual_seed(1)
BOS = 0
#generate 10 samples with ancestral sampling
model.generate(input_ids=torch.tensor([[BOS]] * 10))
# output:
# tensor([[0, 1, 2, 3],
#         [0, 2, 2, 3],
#         [0, 1, 1, 3],
#         [0, 1, 2, 3],
#         [0, 1, 2, 3],
#         [0, 1, 2, 3],
#         [0, 1, 1, 3],
#         [0, 1, 2, 3],
#         [0, 1, 2, 3],
#         [0, 2, 2, 3]])

#token probabilities for each generation step of 1 sample
torch.cat(model.generate(input_ids=torch.tensor([[BOS]]), output_scores=True, return_dict_in_generate=True)['scores']).exp()
# output:   BOS     tok1    tok2    EOS
# tensor([[0.0000, 0.7500, 0.2500, 0.0000],
#         [0.0000, 0.1000, 0.9000, 0.0000],
#         [0.0000, 0.0000, 0.0000, 1.0000]])
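
Because all sequence probabilities are known ahead of time, deterministic decoders are just as easy to sanity-check. For instance, using only standard generate() arguments (nothing specific to this model), greedy decoding has to follow the most likely branch at every step, and beam search with two beams should recover the two most probable sequences from the diagram:

# greedy decoding: follows the highest-probability branch, so it should return [0, 1, 2, 3]
model.generate(input_ids=torch.tensor([[BOS]]), do_sample=False)

# beam search: with 2 beams it should recover the two most probable sequences,
# [0, 1, 2, 3] (p=0.675) and [0, 2, 2, 3] (p=0.225)
model.generate(input_ids=torch.tensor([[BOS]]),
               do_sample=False, num_beams=2, num_return_sequences=2)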