A Toy Probabilistic Transformer for Debugging Generation Algorithms in HuggingFace🤗
by Manuel de Prada Corral
3 min read
A few weeks ago, I found myself implementing "Stochastic Beams and Where to Find Them" (sampling without replacement from a Transformer).
Debugging and verifying the correctness of a sampling algorithm in HuggingFace is not straightforward. Thus, I built a fake shell of a Transformer model with a small vocabulary and fixed, controlled probabilities, which makes it easy to keep a close eye on the logits and the generated sequences.
stateDiagram-v2
state "[0]" as 1
state "[0,1]" as 01
state "[0,2]" as 02
state "[0,1,1]" as 011
state "[0,1,1,3]" as 0113
state "[0,1,2]" as 012
state "[0,1,2,3]" as 0123
state "[0,2,1]" as 021
state "[0,2,1,3]" as 0213
state "[0,2,2]" as 022
state "[0,2,2,3]" as 0223
note right of 0113
prob=0.075
logp=-2.59
end note
note right of 0123
prob=0.675
logp=-0.39
end note
note right of 0223
prob=0.225
logp=-1.49
end note
note right of 0213
prob=0.025
logp=-3.68
end note
[*] --> 1 : 0 (BOS)
1 --> 01 : 75%
1 --> 02 : 25%
01 --> 011 : 10%
01 --> 012 : 90%
02 --> 021 : 10%
02 --> 022 : 90%
011 --> 0113 : EOS
012 --> 0123 : EOS
021 --> 0213 : EOS
022 --> 0223 : EOS
It can be seen as a Probabilistic Finite State Automaton: it does not learn from data and ignores its input. Instead, it uses fixed rules to produce a sequence of output tokens with predefined probabilities. The simplicity of this setup is intentional, to isolate and spotlight the sequence generation process.
The vocabulary consists of only 4 IDs: token 0 (BOS, Beginning of Sequence), token 1, token 2, and token 3 (EOS, End of Sequence). It uses the length of the sequence to decide the probability distribution over the next token.
Here is how it works: if the sequence has a length of 1 (i.e., only BOS), it assigns a 75% probability to token 1 and a 25% probability to token 2. If the sequence length is 2, it gives a 10% probability to token 1 and a 90% probability to token 2. When the sequence length is 3, it always predicts EOS (with 100% probability), marking the end of the sequence.
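As a minimal sketch of that rule (the names here are illustrative, not the actual gist code), the whole "model" boils down to a lookup table indexed by sequence length:
import torch

# Vocabulary: 0 = BOS, 1, 2, 3 = EOS. The next-token distribution depends
# only on how many tokens are already in the sequence.
NEXT_TOKEN_PROBS = {
    1: torch.tensor([0.00, 0.75, 0.25, 0.00]),  # after [BOS]
    2: torch.tensor([0.00, 0.10, 0.90, 0.00]),  # after [BOS, x]
    3: torch.tensor([0.00, 0.00, 0.00, 1.00]),  # after [BOS, x, y]: always EOS
}

def next_token_distribution(input_ids: torch.Tensor) -> torch.Tensor:
    """Return the next-token probabilities for each sequence in the batch."""
    seq_len = input_ids.shape[1]
    return NEXT_TOKEN_PROBS[seq_len].expand(input_ids.shape[0], -1)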
Interfacing with Hugging Face's generate()
This tiny transformer is fully compatible with the built-in generate() function of the Hugging Face transformers library. You can try out the different decoding algorithms, play with the probabilities and the sequence length, and see how each algorithm behaves.
However, keep in mind that this model does not compute attention or hidden states, and returns an empty tuple for those attributes.
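If you want to build a similar shell yourself, a minimal sketch could look roughly like the following. The class and config names are my own (the gist's code may differ), and very recent versions of transformers may additionally require inheriting from GenerationMixin explicitly:
import torch
from transformers import PreTrainedModel, PretrainedConfig
from transformers.modeling_outputs import CausalLMOutput

class FakeConfig(PretrainedConfig):
    model_type = "fake-transformer"

    def __init__(self, **kwargs):
        # do_sample=True mirrors the sampling-by-default behaviour described below
        super().__init__(bos_token_id=0, eos_token_id=3, pad_token_id=3,
                         do_sample=True, **kwargs)
        self.vocab_size = 4

class FakeTransformerSketch(PreTrainedModel):
    config_class = FakeConfig

    def __init__(self, config=None):
        super().__init__(config or FakeConfig())
        # a dummy parameter so device/dtype resolution inside generate() works
        self.dummy = torch.nn.Parameter(torch.zeros(1))

    def prepare_inputs_for_generation(self, input_ids, **kwargs):
        return {"input_ids": input_ids}

    def forward(self, input_ids, **kwargs):
        # probabilities depend only on the current sequence length
        probs = {1: [0.00, 0.75, 0.25, 0.00],
                 2: [0.00, 0.10, 0.90, 0.00]}.get(input_ids.shape[1],
                                                  [0.00, 0.00, 0.00, 1.00])
        # log-probabilities can serve directly as logits: softmax(log p) == p
        logits = torch.tensor(probs).log().expand(input_ids.shape[0], 1, -1)
        # no attention weights or hidden states: return empty tuples
        return CausalLMOutput(logits=logits, hidden_states=(), attentions=())
Note that generate() only reads the last position of the returned logits (logits[:, -1, :]), so a single-position (batch, 1, vocab) tensor is enough.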
Utility in Debugging and Exploration
The real value of this toy Transformer is that it allows a granular, step-by-step examination of the sequence generation process. Since it is intended for probabilistic sampling, the default generation parameters are do_sample=True and num_beams=1; both can be overridden per call, as shown below.
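For instance, using the (hypothetical) FakeTransformerSketch from the previous section, switching to greedy decoding should deterministically follow the most probable branch at every step:
sketch = FakeTransformerSketch()
# greedy decoding: token 1 (75%), then token 2 (90%), then EOS (100%)
sketch.generate(input_ids=torch.tensor([[0]]), do_sample=False, num_beams=1)
# expected: tensor([[0, 1, 2, 3]])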
Find the code on GitHub. Try generating 10 random samples using ancestral sampling:
pip install transformers torch gist-import
from gist_import import GistImporter
import torch
model = GistImporter('00d7a84632d8e858ff0c208e5e44559b')['FakeTransformer']()
torch.manual_seed(1)
BOS = 0
# generate 10 samples with ancestral sampling
model.generate(input_ids=torch.tensor([[BOS]] * 10))
# output:
# tensor([[0, 1, 2, 3],
# [0, 2, 2, 3],
# [0, 1, 1, 3],
# [0, 1, 2, 3],
# [0, 1, 2, 3],
# [0, 1, 2, 3],
# [0, 1, 1, 3],
# [0, 1, 2, 3],
# [0, 1, 2, 3],
# [0, 2, 2, 3]])
# token probabilities for each generation step of 1 sample
torch.cat(model.generate(input_ids=torch.tensor([[BOS]]), output_scores=True, return_dict_in_generate=True)['scores']).exp()
# output: BOS tok1 tok2 EOS
# tensor([[0.0000, 0.7500, 0.2500, 0.0000],
# [0.0000, 0.1000, 0.9000, 0.0000],
# [0.0000, 0.0000, 0.0000, 1.0000]])
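Beam search is also worth inspecting with this toy model. Continuing with the same model and BOS as above (a hedged example; the exact output formatting may vary across transformers versions):
# beam search with 2 beams should recover the two most probable sequences,
# [0, 1, 2, 3] (p=0.675) and [0, 2, 2, 3] (p=0.225), as listed in the diagram above
model.generate(input_ids=torch.tensor([[BOS]]),
               do_sample=False, num_beams=2, num_return_sequences=2)
# expected:
# tensor([[0, 1, 2, 3],
#         [0, 2, 2, 3]])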