Itō calculus extends classical calculus to stochastic processes, particularly those driven by Brownian motion. It includes key tools like the Itō integral and Itō's Lemma.
Brownian motion $ B_t $ is a continuous-time stochastic process with:

- $B_0 = 0$;
- independent increments;
- Gaussian increments $B_t - B_s \sim \mathcal{N}(0, t - s)$ for $s < t$;
- almost surely continuous sample paths.
The Itō integral of a process $ X_t $ with respect to Brownian motion $ B_t $ is denoted: $$ \int_0^t X_s \, dB_s $$
Itō's Lemma is the stochastic counterpart of the chain rule. For a function $ f(t, X_t) $ where $ X_t $ follows: $$ dX_t = \mu(t, X_t) \, dt + \sigma(t, X_t) \, dB_t, $$ Itō's Lemma states: $$ df(t, X_t) = \left( \frac{\partial f}{\partial t} + \mu \frac{\partial f}{\partial X} + \frac{1}{2} \sigma^2 \frac{\partial^2 f}{\partial X^2} \right) dt + \sigma \frac{\partial f}{\partial X} \, dB_t. $$
Consider the SDE: $$ dX_t = \mu X_t \, dt + \sigma X_t \, dB_t $$
Applying Itō's Lemma to $ Y_t = \ln(X_t) $: $$ dY_t = \left( \mu - \frac{1}{2} \sigma^2 \right) dt + \sigma \, dB_t $$
Solving this: $$ Y_t = \ln(X_t) = \ln(X_0) + \left( \mu - \frac{1}{2} \sigma^2 \right) t + \sigma B_t $$
Exponentiating both sides: $$ X_t = X_0 \exp \left( \left( \mu - \frac{1}{2} \sigma^2 \right) t + \sigma B_t \right) $$
Here's how this is implemented in the context of a particle filter simulation:

```python
import numpy as np
import matplotlib.pyplot as plt

# Parameters
x0 = 1.0     # initial state X_0
mu = 0.1     # drift
sigma = 0.2  # volatility
T = 1.0      # time horizon
N = 100      # number of time steps
M = 1000     # number of particles (not used in this snippet)
dt = T / N

# Time grid and a Brownian motion path: B_0 = 0, increments ~ N(0, dt)
time = np.linspace(0, T, N + 1)
dB = np.random.normal(0.0, np.sqrt(dt), size=N)
B = np.concatenate(([0.0], np.cumsum(dB)))

# True state from the closed-form solution derived above
true_state = x0 * np.exp((mu - 0.5 * sigma**2) * time + sigma * B)

# Simulate and plot the true state
plt.figure(figsize=(10, 6))
plt.plot(time, true_state, label='True State')
plt.xlabel('Time')
plt.ylabel('State')
plt.title('True State Evolution Using Itō Calculus')
plt.legend()
plt.show()
```
In the code, `true_state` is computed from the closed-form solution derived via Itō's Lemma, with the Brownian motion path built by accumulating Gaussian increments.
This demonstrates how Itō calculus is applied to model and simulate stochastic processes, providing a powerful toolset for understanding systems influenced by randomness.

Batch Size | A100 80G Batch/s | A100 80G Tok/s | 1xA100 40G Batch/s | 1xA100 40G Tok/s | 2xA100 40G Batch/s | 2xA100 40G Tok/s | 3xA100 40G Batch/s | 3xA100 40G Tok/s | 4xA100 40G Batch/s | 4xA100 40G Tok/s |
---|---|---|---|---|---|---|---|---|---|---|
1 | 47.46 | 47.46 | 37.51 | 37.51 | 29.56 | 29.56 | 29.44 | 29.44 | 28.35 | 28.35 |
4 | 49.76 | 199.04 | 38.81 | 155.24 | 30.15 | 120.6 | 28.14 | 112.56 | 28.83 | 115.32 |
8 | 49.12 | 393.0 | 41.64 | 333.12 | 30.38 | 243.04 | 28.23 | 225.84 | 28.88 | 231.04 |
16 | 49.35 | 789.6 | 39.85 | 637.6 | 28.92 | 462.72 | 26.33 | 421.28 | 26.76 | 428.16 |
32 | 47.75 | 1528.0 | 37.64 | 1204.48 | 27.71 | 886.72 | 27.87 | 892.84 | 27.98 | 895.36 |
64 | 39.55 | 2531.2 | 33.44 | 2140.16 | 27.97 | 1790.08 | 22.35 | 1430.4 | 22.4 | 1433.6 |
128 | 25.86 | 3310.1 | 22.13 | 2832.64 | 22.48 | 2877.44 | 15.77 | 2018.56 | 15.67 | 2005.76 |
256 | 15.32 | 3921.9 | 12.61 | 3227.36 | 14.15 | 3622.4 | 9.53 | 2439.68 | 8.77 | 2245.12 |
512 | 8.17 | 4183.0 | OOM | OOM | 7.4 | 3788.8 | 6.88 | 3522.56 | 4.83 | 2472.96 |
750 | OOM | OOM | OOM | OOM | 5.07 | 3802.5 | 3.73 | 2797.5 | | |
1024 | OOM | OOM | 3.03 | 3102.72 | ||||||
1500 | OOM | OOM |
The only reason this is named a Transformer is that it interfaces with the HF `transformers` library. Again, it is an autoregressive probabilistic model that does not learn from data and ignores its input.
The vocabulary consists of 12 IDs: token -2 is BOS, token 11 is both EOS and PAD, and tokens 0 to 9 are vocabulary tokens.
The intent here is to test the behavior of the decoding algorithm when some of the generated sequences have very small probabilities. The idea is very simple: in each autoregressive generation step, the model always gives equal probability to the 10 vocabulary tokens, so the sequences are random sequences over the 0-9 tokens. However, depending on the first token, the sequence will have a different length, and thus a different probability, and there will be a different number of sequences with that length and probability.

The second token (the first after BOS) is drawn uniformly among tokens 0-9, each with 10% probability:
stateDiagram-v2
state "[-2]" as 0
state "[-2,0]" as 00
state "[-2,1]" as 01
state "[-2,2]" as 02
state "[-2,3]" as 03
state "[-2,4]" as 04
state "[-2,5]" as 05
state "[-2,6]" as 06
state "[-2,7]" as 07
state "[-2,8]" as 08
state "[-2,9]" as 09
[*] --> 0 : -2 (BOS)
0 --> 00 : 10%
0 --> 01 : 10%
0 --> 02 : 10%
0 --> 03 : 10%
0 --> 04 : 10%
0 --> 05 : 10%
0 --> 06 : 10%
0 --> 07 : 10%
0 --> 08 : 10%
0 --> 09 : 10%
This first token completely determines the route that the sequence will follow. The next decoding steps will also always choose one of the 10 vocabulary tokens with equal probability, so all sequences will be random sequences of the 0-9 tokens, but the length of the sequence will depend on the first token.

The following table shows the properties of each sequence depending on its first token, and the number of different sequences of that length that can be generated.
First Token | Seq Length | Prob | Number of Seqs | Ln(p) |
---|---|---|---|---|
0 | 12 | $10^{-13}$ | $10^{12}$ | -29.9 |
1 | 25 | $10^{-26}$ | $10^{25}$ | -59.9 |
2 | 38 | $10^{-39}$ | $10^{38}$ | -89.8 |
3 | 51 | $10^{-52}$ | $10^{51}$ | -119.7 |
4 | 64 | $10^{-65}$ | $10^{64}$ | -149.7 |
5 | 77 | $10^{-78}$ | $10^{77}$ | -179.6 |
6 | 90 | $10^{-91}$ | $10^{90}$ | -209.5 |
7 | 103 | $10^{-104}$ | $10^{103}$ | -239.5 |
8 | 116 | $10^{-117}$ | $10^{116}$ | -269.4 |
9 | 129 | $10^{-130}$ | $10^{129}$ | -299.3 |
n | $13(n + 1) - 1$ | $10^{-\text{len}-1}$ | $10^\text{len}$ | $\ln(10^{-\text{len}-1})$ |
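The table values can be reproduced with a short script (a sketch; per the model's construction, a sequence whose first token is $n$ has length $13(n+1)-1$ after BOS and probability $10^{-(\text{len}+1)}$):

```python
import math

# For each possible first token n, compute the sequence length, the
# log-probability of one particular sequence, and collect the rows.
rows = []
for n in range(10):
    seq_len = 13 * (n + 1) - 1
    log_p = -(seq_len + 1) * math.log(10)  # ln(10^-(len+1))
    rows.append((n, seq_len, round(log_p, 1)))

for n, seq_len, log_p in rows:
    print(f"{n} | {seq_len} | {log_p}")
```

The printed values match the Ln(p) column above.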
The gist of this model is to see how different generation algorithms behave when the probability of some sequences is very small. If we use ancestral sampling, we will get 10% of each type of sequence, irrespective of the probability of the final sequence.
When using a sampling without replacement algorithm like SBS, since the sample space is so large, we should also get 10% of each type of sequence.
todo...
Sampling from a probabilistic model can serve many purposes. The obvious one is to generate samples, such as images, text, or audio. However, we can also use sampling to compute expectations, such as the expected value of a function of the samples.
These notes were made with sampling from a Language Model in mind, but they are applicable to many autoregressive models. The key idea is that we cannot sample from the model directly; we have to recursively sample the next word from the conditional distributions^{1}.
$$ \begin{aligned} y_1 &\sim p(y_1) \\ y_2 &\sim p(y_2|y_1) \\ &\vdots \\ y_T &\sim p(y_T|y_1,\dots,y_{T-1}) \end{aligned} $$
Hurray! We got a sample $\mathbf{y}=(y_1,\dots,y_T)$ from the model! However, as it happens with LLMs, often the samples are not good enough, in the sense of what humans judge as "good text".
In practice, when sampling from LLMs, often one of the $T$ steps yields an unlikely word, ruining the whole sample. Having a good calibration for unlikely words is very difficult, and there are ad-hoc interventions such as sampling adaptors. Another approach is to avoid sampling altogether, and instead just deterministically search for the most likely sequence of words, which is called beam search^{2}.
But, what if there was a principled way to get better samples?
TODO: Introduce utility function, minimum bayes risk, and importance sampling...
In machine learning, we often fit a model to produce a vector of unnormalized log-probabilities $\mathbf{\phi}=(\phi_1,\dots,\phi_n)$, and we need to sample from the corresponding categorical distribution.
To sample from the categorical distribution, we can use the inverse transform sampling method:

1. Normalize the log-probabilities: $$ p_i = \frac{\exp(\phi_i)}{\sum_j \exp(\phi_j)} $$
2. Compute the cumulative distribution function (CDF) as $F(i)=\sum_{j=1}^i p_j$.
3. Sample from the uniform distribution $u\sim\mathcal{U}(0,1)$.
4. Finally, we pick the smallest index $i$ such that $F(i)\geq u$ (complexity $O(\log(n))$ by binary search, since the CDF is sorted).

In total, this naive approach has complexity $O(n + k\cdot \log(n))$, where $k$ is the number of samples.
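The naive approach can be sketched in a few lines of NumPy (`np.searchsorted` performs the binary search over the CDF):

```python
import numpy as np

def sample_categorical(phi, size, rng=None):
    """Sample indices from unnormalized log-probabilities phi via the CDF."""
    rng = np.random.default_rng(0) if rng is None else rng
    p = np.exp(phi - phi.max())   # stabilized normalization, O(n)
    p /= p.sum()
    cdf = np.cumsum(p)            # F(i), O(n)
    cdf[-1] = 1.0                 # guard against floating-point round-off
    u = rng.uniform(size=size)    # one uniform per sample
    # Smallest index i with F(i) >= u, by binary search: O(k log n).
    return np.searchsorted(cdf, u)

phi = np.log(np.array([0.1, 0.2, 0.7]))
samples = sample_categorical(phi, size=100_000)
print(np.bincount(samples, minlength=3) / len(samples))  # ≈ [0.1, 0.2, 0.7]
```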
We are interested in sampling from a random variable $X$ using a random variable $U\sim\mathcal{U}(0,1)$, so we need to find a function $T$ such that $X=T(U)$. Now,
$$ F_X(x) = \mathbb{P}(X\leq x) = \mathbb{P}(T(U)\leq x) = \mathbb{P}(U\leq T^{-1}(x)) = F_U(T^{-1}(x)) = T^{-1}(x) $$
Hence, $T=F_X^{-1}$, and we can sample from $X$ by sampling from $U$ and applying $T$.
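As a quick sanity check (a sketch), for the exponential distribution $\text{Exp}(\lambda)$ we have $F_X^{-1}(u) = -\log(1-u)/\lambda$, so applying $T = F_X^{-1}$ to uniforms should give samples with mean $1/\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
u = rng.uniform(size=200_000)
x = -np.log(1.0 - u) / lam   # T = F_X^{-1} applied to uniform samples
print(x.mean())              # ≈ 1 / lam = 0.5
```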
The Gumbel trick allows us to sample from the categorical distribution without computing the CDF:

1. Sample $n$ independent Gumbel variables $g_i\sim\mathcal{G}(0,1)$. This can be easily done using inverse transform sampling: $g_i = -\log(-\log(u_i)),\ \ u_i\sim\mathcal{U}(0,1)$.
2. Compute the perturbed log-probabilities $\phi_i' = \phi_i + g_i$. Since the Gumbel distributions are a location-scale family, $\phi_i'\sim\mathcal{G}(\phi_i,1)$.
3. Finally, we pick the index $i$ such that $\phi_i'\geq \phi_j'$ for all $j\neq i$ (complexity $O(n)$, since the values are not sorted). In other words, we take the index
$$ \arg\max_i \phi_i' = \arg\max_i \phi_i + g_i $$
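A quick numerical check of the trick (a sketch; the `+ 3.0` offset deliberately makes the log-probabilities unnormalized, which the argmax is invariant to):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.2, 0.7])
phi = np.log(p) + 3.0                 # unnormalized log-probabilities

u = rng.uniform(size=(100_000, 3))
g = -np.log(-np.log(u))               # standard Gumbel(0, 1) samples
samples = np.argmax(phi + g, axis=1)  # Gumbel-max trick

print(np.bincount(samples, minlength=3) / len(samples))  # ≈ [0.1, 0.2, 0.7]
```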
Lemma 1: The inverse CDF of the exponential distribution $\text{Exp}(\lambda)$ is
$$ F^{-1}(u) = -\frac{1}{\lambda}\log(1-u). $$
Lemma 2: If $X_1 \sim \text{Exp}(\lambda_1), \dots, X_n \sim \text{Exp}(\lambda_n)$ are independent, then $\min_i X_i \sim \text{Exp}(\sum_i \lambda_i)$ and $$\mathbb{P}(X_i = \min_j X_j) = \frac{\lambda_i}{\sum_j \lambda_j}.$$
Observe that the probability of a tie is zero.
Proof. We want to prove that the probability of picking the index $i$ is $p_i$, i.e., the probability of $\phi'_i$ being the biggest perturbed log-probability is $p_i$.
First part: show that $\exp(-\phi'_i)\sim \text{Exp}(p_i\alpha)$:
Recall that $\phi_i=\log p_i +\log\alpha$ are the unnormalized log-probabilities, we can write $$ \begin{aligned} \phi_i' &= \phi_i + g_i = \log p_i +\log \alpha - \log(-\log(u_i)) \\ &= -\log\left(\frac{1}{p_i\alpha} \cdot \log(\frac{1}{u_i})\right). \end{aligned} $$
Hence, $\exp(-\phi'_i) = \frac{1}{p_i\alpha} \cdot \log(\frac{1}{u_i})$, which is the inverse CDF of the exponential distribution with parameter $p_i\alpha$ (Lemma 1).
Using inverse transform sampling, since $u_i\sim\mathcal{U}(0,1)$, we have that $\exp(-\phi'_i)\sim \text{Exp}(p_i\alpha)$.
Second part: Note that $$\arg\max_i \phi_i' = \arg\min_i \exp(-\phi'_i) \sim \arg\min_i \text{Exp}(p_i\alpha). $$
Third part: Finally,
$$ \begin{aligned} \mathbb{P}(\arg\max_i \phi_i' = i) &= \mathbb{P}(\arg\min_i \text{Exp}(p_i\alpha) = i)\\ &= \mathbb{P}\left(\min_j \text{Exp}(p_j\alpha) = \text{Exp}(p_i\alpha)\right)\\ &= \frac{p_i\alpha}{\sum_j p_j\alpha} = p_i, \end{aligned} $$
where in the last step we have used Lemma 2.
We just saw how the Gumbel trick allows us to sample from the categorical distribution by computing
$$ \arg\max_i \left(\phi_i - \log(-\log(u_i)),\ \ u_i\sim\mathcal{U}(0,1)\right). $$
TODO: how Maddison et al. proved that the max and argmax are independent, and that taking the top-k is equivalent to sampling without replacement.
In order to compute expectations, if we don't have any domain-specific closed-form expression, we typically resort to Monte Carlo (MC) estimation. This involves sampling $m$ times from the model, and computing the average of the function of interest $f$:
$$ \mathbb{E}[f(X)] \approx \frac{1}{m}\sum_{i=1}^m f(x_i),\ \ x_i\sim p(x). $$
The intuition is simple: in a discrete world, the most probable samples will be sampled more often, so the average will be close to the expectation.
The Monte Carlo estimator is unbiased, but it has high variance. To compensate, we would need more samples, which is often infeasible. Also, if the distribution has low entropy, we will be inefficiently sampling the same values over and over again.
Importance sampling is a technique to reduce the variance of the Monte Carlo estimator. The idea is to sample from a different distribution $q(x)$, and then reweight the samples by the ratio of the probabilities:
$$ \mathbb{E}[f(X)] = \sum_x f(x) p(x) = \sum_x f(x) \frac{p(x)}{q(x)} q(x) = \mathbb{E}_q\left[f(X)\frac{p(X)}{q(X)}\right]. $$
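A minimal numerical sketch of this identity over a small discrete space (the distributions `p`, `q` and function `f` are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Target p, proposal q, and a function f over 4 discrete outcomes.
p = np.array([0.70, 0.20, 0.05, 0.05])
q = np.array([0.25, 0.25, 0.25, 0.25])
f = np.array([1.0, 2.0, 10.0, 20.0])

exact = np.sum(f * p)                    # E_p[f(X)] = 2.6
x = rng.choice(4, size=200_000, p=q)     # sample from the proposal q
estimate = np.mean(f[x] * p[x] / q[x])   # reweight each sample by p/q
print(exact, estimate)
```

The reweighted average converges to the exact expectation even though no sample was ever drawn from `p`.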
TODO: stratified sampling, Horvitz-Thompson estimator, weighted reservoir sampling, priority sampling, sparse vector representations.
A categorical (generalized Bernoulli) distribution is characterized as a vector $\mathbf{p}=(p_1,\dots,p_n)$, where $p_i$ represents the probability mass of the $i$-th event from a discrete outcome space $\Omega$. Commonly, we take the index random variable $I:\Omega\to\mathbb{N}$ to map the outcome space to the natural numbers, giving
$$ \mathbb{P}(I = i) := p(i) = \begin{cases} p_i & \text{if } i \in {1,\dots,n} \\ 0 & \text{otherwise.} \end{cases} $$
As a valid distribution, it satisfies $ \sum_i p_i = 1 $. We can express the expectation of the random variable $ I $ as $ \mathbb{E}[I] = \sum_i i p_i $.
If the distribution represents, for example, the possible next words in a language model, computing the expected value may not be particularly meaningful, as it would reflect the average index of the next word.
Sometimes, we are interested in the expected value of a function of the outcomes $f:\Omega\to\mathbb{R}$, that is, $ \mathbb{E}[(f\circ I^{-1})(I)] $. However, for brevity, we often write $ \mathbb{E}[f(I)] = \sum_i f(i) p_i $ (using the law of the unconscious statistician).
Thanks @Clara Meister for guiding me through the literature and thanks @Tim Vieira for pointing my initial runtime complexity mistakes!
Debugging and verifying the correctness of a sampling algorithm in HuggingFace is not straightforward. Thus, I built a fake carcass of a Transformer model with a small vocabulary and fixed, controlled probabilities, which makes it possible to keep a close eye on the logits and the generated sequence.
stateDiagram-v2
state "[0]" as 1
state "[0,1]" as 01
state "[0,2]" as 02
state "[0,1,1]" as 011
state "[0,1,1,3]" as 0113
state "[0,1,2]" as 012
state "[0,1,2,3]" as 0123
state "[0,2,1]" as 021
state "[0,2,1,3]" as 0213
state "[0,2,2]" as 022
state "[0,2,2,3]" as 0223
note right of 0113
prob=0.075
logp=-2.59
end note
note right of 0123
prob=0.675
logp=-0.39
end note
note right of 0223
prob=0.225
logp=-1.49
end note
note right of 0213
prob=0.025
logp=-3.68
end note
[*] --> 1 : 0 (BOS)
1 --> 01 : 75%
1 --> 02 : 25%
01 --> 011 : 10%
01 --> 012 : 90%
02 --> 021 : 10%
02 --> 022 : 90%
011 --> 0113 : EOS
012 --> 0123 : EOS
021 --> 0213 : EOS
022 --> 0223 : EOS
It can be seen as a Probabilistic Finite State Automaton: it does not learn from data and ignores its input. Instead, it uses fixed rules to produce a sequence of output tokens with predefined probabilities. The simplicity of this setup is intentional, to isolate and spotlight the sequence generation process.
The vocabulary consists of only 4 IDs: token 0 (BOS, Beginning of Sequence), token 1, token 2, and token 3 (EOS, End of Sequence). It uses the length of the sequence to decide the probability distribution over the next token.
Here is how it works: if the sequence has a length of 1 (i.e., only BOS), it assigns a 75% probability to token 1 and a 25% probability to token 2. If the sequence length is 2, it gives a 10% probability to token 1 and a 90% probability to token 2. When the sequence length is 3, it always predicts EOS (with 100% probability), marking the end of the sequence.
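The length-based rule can be sketched as a plain function (a simplification for illustration, not the actual FakeTransformer code, which wraps this logic in the HF model interface):

```python
import math

BOS, TOK1, TOK2, EOS = 0, 1, 2, 3

def next_token_probs(sequence):
    """Fixed next-token distribution, depending only on sequence length."""
    if len(sequence) == 1:             # only BOS so far
        return {TOK1: 0.75, TOK2: 0.25}
    if len(sequence) == 2:
        return {TOK1: 0.10, TOK2: 0.90}
    return {EOS: 1.0}                  # length 3: always predict EOS

# Probability of the most likely full sequence [0, 1, 2, 3]:
p = next_token_probs([BOS])[TOK1] * next_token_probs([BOS, TOK1])[TOK2]
print(p, math.log(p))  # 0.675 and ≈ -0.39, matching the diagram notes
```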
This tiny transformer is fully compatible with the built-in `generate()` function of the Hugging Face `transformers` library. You can try out the different decoding algorithms, play with the probabilities and the length of the sequence, and see how the different decoding algorithms behave.

However, keep in mind that this model does not compute attention or hidden states, and returns an empty tuple for those attributes.

The utility of this toy Transformer lies in its ability to facilitate a granular, step-by-step examination of the sequence generation process. Since it is intended for probabilistic sampling, the default generation params are `do_sample=True` and `num_beams=1`.
Find the code at GitHub. Try generating 10 random samples using ancestral sampling:

```shell
pip install transformers torch gist-import
```

```python
from gist_import import GistImporter
import torch

model = GistImporter('00d7a84632d8e858ff0c208e5e44559b')['FakeTransformer']()
torch.manual_seed(1)
BOS = 0

# generate 10 samples with ancestral sampling
model.generate(input_ids=torch.tensor([[BOS]] * 10))
# output:
# tensor([[0, 1, 2, 3],
#         [0, 2, 2, 3],
#         [0, 1, 1, 3],
#         [0, 1, 2, 3],
#         [0, 1, 2, 3],
#         [0, 1, 2, 3],
#         [0, 1, 1, 3],
#         [0, 1, 2, 3],
#         [0, 1, 2, 3],
#         [0, 2, 2, 3]])

# token probabilities for each generation step of 1 sample
torch.cat(model.generate(input_ids=torch.tensor([[BOS]]), output_scores=True, return_dict_in_generate=True)['scores']).exp()
# output:    BOS     tok1    tok2    EOS
# tensor([[0.0000, 0.7500, 0.2500, 0.0000],
#         [0.0000, 0.1000, 0.9000, 0.0000],
#         [0.0000, 0.0000, 0.0000, 1.0000]])
```
The algorithm boils down to perturbing the accumulated log-probabilities of the beam search algorithm. Thanks to the top-k Gumbel trick, we know that the top-k samples with respect to the perturbed accumulated log-probabilities are distributed as a sample without replacement from the model. This is like a magic trick🪄!! What if we could get the top-k perturbed log-probabilities without exploring the whole search space? Well, that's exactly what the stochastic beam search algorithm does.
Two particular observations are not very obvious at first sight.

Taking a pragmatic approach, we might first think that we can just use a `LogitsProcessor` to perturb the logits and then use the standard beam search implementation. Two problems arise with the `LogitsProcessor` API, which only allows modifying the logits of the current node without further context.

The following assumes familiarity with the HF generation pipeline, which I distilled in this post.
graph TD
A[GenerationMixin.generate] --> B[GenerationMixin.beam_search]
B --> C[BeamSearchScorer.process]
C --> B
B --> D[BeamSearchScorer.finalize]
We need to make the following changes to the beam search algorithm:

1. A `LogitsProcessor` that takes not only the next-token logits, but all generated logits. In this stage, we also need the previously generated perturbations.

To achieve point 1, we need to modify `GenerationMixin.beam_search` and pass additional parameters to the `LogitsProcessor`.

For point 2, two modifications are needed:

1. A `beam_search` method that also keeps track of the log-probabilities as the beam scores and, separately, the processed logits, which are used only for selecting the top-k candidates, so that they don't affect the true beam scores.
2. TODO: A flag in the `beam_search` method to allow disabling the use of the processed logits as beam scores, so that the default behavior would be the standard beam search.

We must keep in mind that the selection among finished beams happens inside `BeamSearchScorer`, while the selection among active beams comes from the `torch.topk` call in `GenerationMixin.beam_search`.
We should pass the perturbed logits to the `BeamSearchScorer`. It must save them into the `BeamHypotheses` object, so that only the finished beams with the highest perturbed log-probabilities are kept. The call to `torch.topk` should also be modified to use the perturbed logits, so that the active beams are selected according to them.

However, when `BeamSearchScorer` returns the updated active beam scores, we should use the original unperturbed scores.

We should also look carefully at which scores are being used in the `BeamSearchScorer.finalize` method.

Since we are looking into generalizing the beam search algorithm, it may make sense to pass to the `LogitsProcessor` not just the past processed logits, but also the past true beam scores. This would allow implementing other algorithms that require the past beam scores. Furthermore, for extensibility it would also be nice to be able to provide our own `BeamSearchScorer` class.
Here is a collection of notes I've compiled from my dive into the codebase. This may prove beneficial for anyone looking to understand or extend HuggingFace's generation pipeline.
HuggingFace Transformer models all have one common ancestor: `PreTrainedModel`. This class is defined in `transformers/modeling_utils.py`. It is a subclass of `torch.nn.Module`, `ModuleUtilsMixin`, `GenerationMixin` and `PushToHubMixin`.
graph TD;
ModuleUtilsMixin-->PreTrainedModel;
GenerationMixin-->PreTrainedModel;
PushToHubMixin-->PreTrainedModel;
torch.nn.Module-->PreTrainedModel;
PretrainedMyModel-->MyModelForConditionalGeneration;
PreTrainedModel-->PretrainedMyModel;
The generation pipeline for all Transformer models is centralized in `GenerationMixin`. This class is defined in `transformers/generation/utils.py`, and all models must implement `prepare_inputs_for_generation`. Additionally, models can implement `adjust_logits_during_generation` and `_reorder_cache`.

The main method in `GenerationMixin` is `generate`, which orchestrates the generation process and then calls the different specialized methods such as `contrastive_search`, `greedy_search`, `sample`, `beam_search`, `beam_sample`, `group_beam_search`, `constrained_beam_search` and `assisted_decoding`.
Let's break down the generation pipeline into its different steps. Note that these steps are annotated with the same numbers in the code comments. This is a permalink to the `generate` method analyzed in this post (note that HF is a fast-moving target, so some details may soon be outdated).
Another vital point to note is that generation happens in batches, meaning that the `input_ids` have a shape of `(batch_size, seq_len)`. This allows, for example, translating multiple sentences at once.
%%{init: { 'themeVariables': {'fontSize': '24px'} } }%%
timeline
1. Prepare generation_config : Merge model and users gen config
2. Set generation parameters : Prepare logits processors and stopping criteria
3. Define model_inputs : Get encoder inputs if needed
4. Define other model kwargs
5. Prepare input_ids for the decoder : Initialize with <bos> if needed
%%{init: { 'themeVariables': {'fontSize': '23px'} } }%%
timeline
6. Prepare `max_length` depending on stopping criteria
7. Determine generation mode : Set is_greedy, is_sample, is_beam, ... : check if arguments are consistent
8. Prepare distribution pre_processing samplers : Prepare logits_processor
9. Prepare stopping criteria
10. Go into different generation modes
The `logits_processor` is a list of functions that are applied to the logits before selecting or sampling the next token. There is also a `logits_warper` that is applied to the logits after the `logits_processor`, but only in stochastic generation modes (`sample`, `beam_sample`, `assisted_decoding`, `constrained_beam_search` and `contrastive_search`). Also, in `beam_sample` mode, the `logits_processor` is applied to the logits, but then the logits are integrated into the beam search scores, and the `logits_warper` is applied to the beam search scores.
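The processor/warper interface boils down to a callable from `(input_ids, scores)` to modified scores. A minimal standalone sketch of that contract (mirroring, but not importing, the HF `LogitsProcessor` API; both classes here are illustrative stand-ins):

```python
import numpy as np

class TemperatureWarper:
    """Rescale logits by a temperature (mimics HF's TemperatureLogitsWarper)."""
    def __init__(self, temperature):
        self.temperature = temperature

    def __call__(self, input_ids, scores):
        return scores / self.temperature

class MinLengthProcessor:
    """Forbid EOS until the sequence reaches a minimum length."""
    def __init__(self, min_length, eos_token_id):
        self.min_length = min_length
        self.eos_token_id = eos_token_id

    def __call__(self, input_ids, scores):
        if input_ids.shape[-1] < self.min_length:
            scores = scores.copy()
            scores[:, self.eos_token_id] = -np.inf
        return scores

ids = np.array([[0, 1]])                      # one sequence of length 2
logits = np.array([[1.0, 2.0, 3.0, 4.0]])
logits = MinLengthProcessor(3, eos_token_id=3)(ids, logits)  # processor first
logits = TemperatureWarper(0.5)(ids, logits)                 # then the warper
print(logits)  # EOS logit masked to -inf, the remaining logits doubled
```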
timeline
11. Prepare beam search scorer : initialize beam hypotheses
12. Interleave input_ids with n_beams additional sequences : tensor of shape [batch_size, seq_len] -> [batch_size*n_beams, seq_len]
13. Run beam search : call beam_search method
The beam search generation mode has two main components:

- The `beam_search` method, found in `GenerationMixin`, handles the primary decoding loop, maintains the beam scores and calls the model (referenced in step 13 of `generate`).
- The `BeamSearchScorer`, defined in `transformers/generation/beam_search.py`, has one `BeamHypotheses` object for each sequence in the batch. It is a general construction that makes sense for generalizing beam search to diverse_beam_search (keep different groups of beams to ensure diversity).
`BeamHypotheses` keeps the `n_beams` best hypotheses for each sequence in the batch, with their beam scores and beam indices. The main decoding loop proceeds as follows:

- Initialize the beam scores as a tensor of dimension `(batch_size, n_beams)`, masking out all but the first beam (`beam_scores[:,1:] = -1e9`), and flatten it to `(batch_size*n_beams)`.
- Apply the `logits_processor` to the logits, which have shape `(batch_size*n_beams, vocab_size)`, and reshape them to `(batch_size, n_beams*vocab_size)`.
- Select the `2*n_beams` best scores from next_token_scores by applying `torch.topk`. Derive the beam indices and token indices.
- Call `beam_scorer.process` to update the beam hypotheses. Get the new beam scores, indices and next_tokens for each beam. Update `input_ids` with the new tokens.

This `process` method is defined in `transformers/generation/beam_search.py` and takes as input the `2*n_beams` top-k elements and indices calculated above. The beam search scorer is initialized with a `BeamHypotheses` object for each sequence in the batch.
- The input tensors have dimension `(batch_size, group_size)` (this is because of diverse beam search; we know `group_size` = `n_beams` for normal beam search, in which case the tensors have dimension `(batch_size, n_beams)`).
- Iterate over the `2*n_beams` next scores among the `n_beams*vocab_size` scores:
  - Check whether the beam is finished and among the `n_beams` best beams. If so, add the beam to the list of hypotheses of the sentence. The beam_score for this beam would be 0, since it moves from the running beams to the finished beams.
  - (This keeps the `n_beams` best finished beams and the `n_beams` best running beams.)
best finished beams, while the n_beams
best running beams are kept in the next_scores
, next_tokens
and next_indices
tensors, which are sent back and forth between the beam_search
method and the process
method, as the main loop from the beam_search
progresses through the running beams.
Why do we need to select the `2*n_beams` best beams? It seems strange at first look. From a theoretical point of view, each new generation step will always make the sequence probabilities smaller, so the first `n_beams` that reach `<EOS>` will always have higher probability than any possible continuation. However, there are two empirical reasons to keep more beams alive.

First, in closed-vocabulary models, we might encounter that `<UNK>` is the best token at some point. Most beam search implementations will fall back to the next best token in this case, hence needing `n_beams+1` tokens. Second, beam search is commonly used with length normalization, which allows longer sequences to attain a higher normalized score as they grow longer. This means that we need to store the best finished beams and the best running beams separately, and only compare them when they are finished (thanks Clara for helping me figure this out!).
This is why HF's `beam_search` saves `2*n_beams` beams. We might encounter situations where all the alive `n_beams` sequences reach `<EOS>`, leaving no live sequences to continue. With `2*n_beams`, we are guaranteed to have at least one non-`EOS` token for each beam hypothesis.
On top of this, without length normalization, we can stop generation when `n_beams` sequences reach `<EOS>`. This is achieved in HF by setting `early_stopping=True`. When `early_stopping` is set to `False` or `"never"`, HF will use two different non-satisfactory heuristics to stop generation whenever the best running beam is thought to be worse than the worst finished beam. Surprisingly, no setting of `early_stopping` will fully disable early stopping and let the generation continue until all beams are finished or the maximum length is reached. To be fair, this would probably cause OOM problems.
Interestingly, the beam search in HuggingFace was adapted from facebookresearch/XLM. You can check out the original 2019 commit here. Early days when Thomas Wolf was coding and HuggingFace was still a chatbot for teenagers!
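The core loop described above can be condensed into a generic, heavily simplified sketch (not HF code: it keeps only `n_beams` candidates instead of `2*n_beams`, omits finished-beam handling, and all names are illustrative):

```python
import numpy as np

def beam_search(step_logprobs, n_beams, max_len, bos=0):
    """Toy beam search; step_logprobs(beams) returns (n_beams, vocab) log-probs."""
    beams = [[bos]] * n_beams
    scores = np.full(n_beams, -1e9)
    scores[0] = 0.0                              # mask duplicate initial beams
    for _ in range(max_len):
        logp = step_logprobs(beams)              # next-token log-probs per beam
        vocab = logp.shape[1]
        flat = (scores[:, None] + logp).reshape(-1)
        top = np.argsort(flat)[::-1][:n_beams]   # HF's topk keeps 2*n_beams here
        beams = [beams[i // vocab] + [i % vocab] for i in top]
        scores = flat[top]
    return beams, scores

# Deterministic toy model: token 1 always gets log-prob -0.1, token 0 gets -2.4.
def toy_model(beams):
    return np.tile(np.array([[-2.4, -0.1]]), (len(beams), 1))

beams, scores = beam_search(toy_model, n_beams=2, max_len=3)
print(beams[0], scores[0])  # best beam [0, 1, 1, 1] with score ≈ -0.3
```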
During beam search, we keep track of the following scores:

- `beam_scores`: The running scores of the beams. This is the sum of the log probabilities of the tokens generated so far for each beam. It is a tensor of dimension `(batch_size * n_beams)`. The model logits may have been modified by the logit processors or by the length penalty.

Optionally, also:

- `scores`: The word-per-word scores of the beams, that is, the log probabilities for every token in the vocabulary at each generation step. It is a tuple of size `seq_len` of tensors of dimension `(batch_size * n_beams, vocab_size)`. Beam indices are needed to recover the scores for each selected token.
- `beam_indices`: The indices of the beams that generated the scores at each time step. I believe the beam indices refer to the indices of the `n_beams * vocab_size` scores from the previous timestep's `torch.topk` call. However, I am not sure, and the indices may maintain coherence across timesteps. TODO: investigate this.

In my case, I have this Vodafone router connected to a Lucent ONT. I am a direct-fiber customer, not NEBA. The Sercomm runs the latest firmware version as of September 2020, 3.5.09.
As you probably know, to obtain the username and password this router uses for the PPPoE connection, you have to listen in between the router and the ONT and capture the traffic. Most tutorials (this one is my reference) log into the router as admin to force it to redirect all traffic between the ONT and the router to the PC's interface, so the traffic can be captured with Wireshark.

I'm not saying this doesn't work, but I wasn't able to capture anything. I suspect the router performs the firmware download before fetching the PPPoE data, and as soon as it detects an admin logged in, it kicks you out, closes the packet redirection, and only then requests the PPPoE credentials from the server. But again, I may just be clumsy and this may still work.

To make sure I captured the data, I resorted to the following method: I connected the ONT to my computer, and then also connected my computer to the router via Ethernet, using a USB-to-Ethernet adapter (you can find them on Amazon for about €10). This way my computer sat in the middle between the router and the ONT. For this man-in-the-middle connection to work, you have to configure the two Ethernet interfaces in bridge mode. On Windows I couldn't get bridge mode working so that the router and the ONT would discover each other (Windows!!!!), but on Linux there was no problem, simply following these instructions. Then you only have to reset the router and capture the data with Wireshark.

There I obtained my PPPoE user for Vodafone without any problem, as in the rest of the guides.

Now, to configure a neutral router against the ONT, it is not enough to provide the PPPoE user and password in the router's WAN configuration. You must also configure the router so that all outgoing traffic towards the WAN is tagged as VLAN 100. In OpenWRT this can be done from the LuCI interface, simply under Network -> Switch.
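On swconfig-based OpenWRT devices, the VLAN 100 tagging corresponds roughly to a stanza like this in `/etc/config/network` (a sketch: the `ports` numbers and the `eth0.100` interface name are hypothetical and depend on your hardware's switch layout, and the PPPoE credentials are placeholders):

```
config switch_vlan
	option device 'switch0'
	option vlan '100'
	option ports '0t 5t'

config interface 'wan'
	option ifname 'eth0.100'
	option proto 'pppoe'
	option username 'YOUR_PPPOE_USER'
	option password 'YOUR_PPPOE_PASS'
```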
If you have a Linksys EA8300 or any other router with the hybrid IPQ40xx switch, you will still have to dig around the web a bit to configure the switch manually, despite the driver bugs.