In this blog post, we will explore the essential concepts of Itō calculus and how they apply to solving stochastic differential equations (SDEs). We will also see how these concepts are implemented in Python code to simulate and filter stochastic processes.

Itō calculus extends classical calculus to stochastic processes, particularly those driven by Brownian motion. It includes key tools like the Itō integral and Itō's Lemma.

Brownian motion $B_t$ is a continuous-time stochastic process with:

- $B_0 = 0$.
- Independent increments.
- Normally distributed increments: $B_{t+s} - B_t \sim \mathcal{N}(0, s)$.

The Itō integral of a process $X_t$ with respect to Brownian motion $B_t$ is denoted:

$$ \int_0^t X_s \, dB_s $$

Itō's Lemma is the stochastic counterpart of the chain rule. For a function $f(t, X_t)$ where $X_t$ follows

$$ dX_t = \mu(t, X_t) \, dt + \sigma(t, X_t) \, dB_t, $$

Itō's Lemma states:

$$ df(t, X_t) = \left( \frac{\partial f}{\partial t} + \mu \frac{\partial f}{\partial X} + \frac{1}{2} \sigma^2 \frac{\partial^2 f}{\partial X^2} \right) dt + \sigma \frac{\partial f}{\partial X} \, dB_t. $$

Consider the SDE:

$$ dX_t = \mu X_t \, dt + \sigma X_t \, dB_t $$

Applying Itō's Lemma to $Y_t = \ln(X_t)$:

$$ dY_t = \left( \mu - \frac{1}{2} \sigma^2 \right) dt + \sigma \, dB_t $$

Solving this:

$$ Y_t = \ln(X_t) = \ln(X_0) + \left( \mu - \frac{1}{2} \sigma^2 \right) t + \sigma B_t $$

Exponentiating both sides:

$$ X_t = X_0 \exp \left( \left( \mu - \frac{1}{2} \sigma^2 \right) t + \sigma B_t \right) $$

Here's how this is implemented in the context of a particle filter simulation:

```python
import numpy as np
import matplotlib.pyplot as plt

# Parameters
y0 = 1.0     # initial state X_0
mu = 0.1     # drift
sigma = 0.2  # volatility
T = 1.0      # time horizon
N = 100      # number of time steps
M = 1000     # number of particles (used later by the filter)
dt = T / N

time = np.linspace(0, T, N + 1)

# Brownian path: cumulative sum of independent N(0, dt) increments
dB = np.random.normal(0, np.sqrt(dt), N)
B = np.concatenate([[0.0], np.cumsum(dB)])

# Closed-form solution derived via Itō's Lemma
true_state = y0 * np.exp((mu - 0.5 * sigma**2) * time + sigma * B)

# Simulate and plot the true state
plt.figure(figsize=(10, 6))
plt.plot(time, true_state, label='True State')
plt.xlabel('Time')
plt.ylabel('State')
plt.title('True State Evolution Using Itō Calculus')
plt.legend()
plt.show()
```

In the code:

- `true_state` is computed from the closed-form solution derived via Itō's Lemma.
- The `np.random.normal` draws simulate the Brownian motion $B_t$.

This demonstrates how Itō calculus is applied to model and simulate stochastic processes, providing a powerful toolset for understanding systems influenced by randomness.
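The closed form can also be sanity-checked against a direct Euler-Maruyama discretization of the SDE. A minimal sketch (same parameters as the simulation above, evaluated on a single shared Brownian path):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, x0 = 0.1, 0.2, 1.0
T, N = 1.0, 1000
dt = T / N

# One Brownian path from N(0, dt) increments
dB = rng.normal(0.0, np.sqrt(dt), N)
B = np.cumsum(dB)
t = np.linspace(dt, T, N)

# Exact solution from Itō's Lemma, on the same path
exact = x0 * np.exp((mu - 0.5 * sigma**2) * t + sigma * B)

# Euler-Maruyama discretization of dX = mu*X dt + sigma*X dB
x = x0
for i in range(N):
    x += mu * x * dt + sigma * x * dB[i]

rel_err = abs(x - exact[-1]) / exact[-1]  # shrinks as dt -> 0
```

The two trajectories agree up to the discretization error of the scheme, which vanishes as the step size decreases.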

- Sequential Monte Carlo
- Geometric Brownian motion
- Stochastic differential equations definition
- Full demo
- Applications in finance

Batch Size | A100 80G Batch/s | A100 80G Tok/s | 1xA100 40G Batch/s | 1xA100 40G Tok/s | 2xA100 40G Batch/s | 2xA100 40G Tok/s | 3xA100 40G Batch/s | 3xA100 40G Tok/s | 4xA100 40G Batch/s | 4xA100 40G Tok/s
---|---|---|---|---|---|---|---|---|---|---
1 | 47.46 | 47.46 | 37.51 | 37.51 | 29.56 | 29.56 | 29.44 | 29.44 | 28.35 | 28.35
4 | 49.76 | 199.04 | 38.81 | 155.24 | 30.15 | 120.6 | 28.14 | 112.56 | 28.83 | 115.32
8 | 49.12 | 393.0 | 41.64 | 333.12 | 30.38 | 243.04 | 28.23 | 225.84 | 28.88 | 231.04
16 | 49.35 | 789.6 | 39.85 | 637.6 | 28.92 | 462.72 | 26.33 | 421.28 | 26.76 | 428.16
32 | 47.75 | 1528.0 | 37.64 | 1204.48 | 27.71 | 886.72 | 27.87 | 892.84 | 27.98 | 895.36
64 | 39.55 | 2531.2 | 33.44 | 2140.16 | 27.97 | 1790.08 | 22.35 | 1430.4 | 22.4 | 1433.6
128 | 25.86 | 3310.1 | 22.13 | 2832.64 | 22.48 | 2877.44 | 15.77 | 2018.56 | 15.67 | 2005.76
256 | 15.32 | 3921.9 | 12.61 | 3227.36 | 14.15 | 3622.4 | 9.53 | 2439.68 | 8.77 | 2245.12
512 | 8.17 | 4183.0 | OOM | OOM | 7.4 | 3788.8 | 6.88 | 3522.56 | 4.83 | 2472.96
750 | OOM | OOM | OOM | OOM | 5.07 | 3802.5 | 3.73 | 2797.5 | |
1024 | OOM | OOM | 3.03 | 3102.72 | | | | | |
1500 | OOM | OOM | | | | | | | |

Continuing the saga of the Toy Transformer Carcass (which I will eventually merge here), I have built a few more toy models to debug generation algorithms in HuggingFace.

The only reason this is named a Transformer is because it interfaces with the HF `transformers` library. Again, it is an autoregressive probabilistic model that does not learn from data and ignores its input.

The vocabulary consists of 12 IDs: token -2 is BOS, token 11 is both EOS and PAD, and tokens 0 to 9 are vocabulary tokens.

The intent here is to test the behavior of the decoding algorithm when some of the generated sequences have very small probabilities. The idea is very simple: in each autoregressive generation step, the model always gives equal probability to the 10 vocabulary tokens, so the sequences are random permutations over the 0-9 tokens. However, depending on the first token, the sequence will have a different length, and thus a different probability, and there will be a different number of sequences with that length and probability.

The second token (the first after BOS) is generated uniformly, with 10% probability for each of tokens 0 through 9:

```mermaid
stateDiagram-v2
    state "[-2]" as 0
    state "[-2,0]" as 00
    state "[-2,1]" as 01
    state "[-2,2]" as 02
    state "[-2,3]" as 03
    state "[-2,4]" as 04
    state "[-2,5]" as 05
    state "[-2,6]" as 06
    state "[-2,7]" as 07
    state "[-2,8]" as 08
    state "[-2,9]" as 09

    [*] --> 0 : -2 (BOS)
    0 --> 00 : 10%
    0 --> 01 : 10%
    0 --> 02 : 10%
    0 --> 03 : 10%
    0 --> 04 : 10%
    0 --> 05 : 10%
    0 --> 06 : 10%
    0 --> 07 : 10%
    0 --> 08 : 10%
    0 --> 09 : 10%
```

This first token completely determines the route the sequence will follow. The next decoding steps also always choose one of the 10 vocabulary tokens with equal probability, so all sequences are random sequences of the 0-9 tokens, but the length of the sequence depends on the first token.

The following table shows the properties of each sequence depending on its first token, and the number of different sequences of that length that can be generated.

First Token | Seq Length | Prob | Number of Seqs | Ln(p)
---|---|---|---|---
0 | 12 | $10^{-13}$ | $10^{12}$ | -29.9
1 | 25 | $10^{-26}$ | $10^{25}$ | -59.9
2 | 38 | $10^{-39}$ | $10^{38}$ | -89.8
3 | 51 | $10^{-52}$ | $10^{51}$ | -119.7
4 | 64 | $10^{-65}$ | $10^{64}$ | -149.7
5 | 77 | $10^{-78}$ | $10^{77}$ | -179.6
6 | 90 | $10^{-91}$ | $10^{90}$ | -209.5
7 | 103 | $10^{-104}$ | $10^{103}$ | -239.5
8 | 116 | $10^{-117}$ | $10^{116}$ | -269.4
9 | 129 | $10^{-130}$ | $10^{129}$ | -299.3
n | $13(n + 1) - 1$ | $10^{-\text{len}-1}$ | $10^{\text{len}}$ | $\ln(10^{-\text{len}-1})$
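The formulas in the last row reproduce the table; a quick check:

```python
import math

# First token n fixes the sequence length; each sequence of length len
# has probability 10^{-(len+1)}, so ln(p) = -(len+1) * ln(10).
def row(n):
    length = 13 * (n + 1) - 1            # 12, 25, ..., 129
    ln_p = -(length + 1) * math.log(10)  # ln(10^{-(len+1)})
    return length, ln_p

length0, ln_p0 = row(0)   # length 12, ln(p) ~ -29.9
length9, ln_p9 = row(9)   # length 129, ln(p) ~ -299.3
```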

The gist of this model is to see how different generation algorithms behave when the probability of some sequences is very small. If we use ancestral sampling, we will get 10% of each type of sequence, irrespective of the probability of the final sequence.

When using a sampling-without-replacement algorithm like SBS, since the sample space is so large, we should also get 10% of each type of sequence.

TODO...

*Published Sun, 17 Dec 2023.*

---

## How LLMs got me into sampling theory

This WIP post collects some of my notes on sampling theory. My end goal is to have a global understanding of importance sampling and Horvitz-Thompson estimators.

Sampling from a probabilistic model can serve many purposes. The obvious one is to generate samples, such as images, text, or audio. However, we can also use sampling to compute expectations, such as the expected value of a function of the samples.

These notes were made with sampling from a Language Model in mind, but they are applicable to many autoregressive models. The key idea is that we cannot sample from the model directly, but we have to recursively sample the next word from the conditional distributions^{1}.

$$
\begin{aligned}
y_1 &\sim p(y_1) \\
y_2 &\sim p(y_2|y_1) \\
&\vdots \\
y_T &\sim p(y_T|y_1,\dots,y_{T-1})
\end{aligned}
$$

Hurray! We got a sample $\mathbf{y}=(y_1,\dots,y_T)$ from the model! However, as it happens with LLMs, often the samples are not good enough, in the sense of what humans judge as "good text".

In practice, when sampling from LLMs, often one of the $T$ steps yields an unlikely word, ruining the whole sample. Having a good calibration for unlikely words is very difficult, and there are ad-hoc interventions such as sampling adaptors. Another approach is to avoid sampling altogether, and instead just deterministically search for the most likely sequence of words, which is called beam search^{2}.

But, what if there was a principled way to get better samples?

TODO: Introduce utility function, minimum bayes risk, and importance sampling...

In machine learning, we often fit a model to produce a vector of unnormalized log-probabilities $\mathbf{\phi}=(\phi_1,\dots,\phi_n)$, and we need to sample from the corresponding categorical distribution.

To sample from the categorical distribution, we can use the inverse transform sampling method:

1. Normalize the log-probabilities (complexity $O(n)$):
   $$ p_i = \frac{\exp(\phi_i)}{\sum_j \exp(\phi_j)} $$
2. Compute the cumulative distribution function (CDF) as $F(i)=\sum_{j=1}^i p_j$.
3. Sample from the uniform distribution $u\sim\mathcal{U}(0,1)$.
4. Finally, pick the smallest index $i$ such that $F(i)\geq u$ (complexity $O(\log n)$ per sample by binary search, since the CDF is sorted).

In total, this naive approach has complexity $O(n + k\cdot \log n)$, where $k$ is the number of samples.
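These steps can be sketched with NumPy, where `np.searchsorted` performs the binary search over the CDF (a minimal illustration with made-up logits):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.array([1.0, 2.0, 0.5, 3.0])   # unnormalized log-probabilities

# Normalize via softmax, O(n)
p = np.exp(phi - phi.max())
p /= p.sum()

# Cumulative distribution function
cdf = np.cumsum(p)
cdf[-1] = 1.0  # guard against floating-point round-off

# Draw uniforms and binary-search the CDF: searchsorted returns the
# smallest index i with cdf[i] >= u, at O(log n) per sample
u = rng.random(10_000)
samples = np.searchsorted(cdf, u)

# Empirical frequencies approach p
freq = np.bincount(samples, minlength=len(p)) / len(samples)
```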

We are interested in sampling from a random variable $X$ using a random variable $U\sim\mathcal{U}(0,1)$, so we need to find a function $T$ such that $X=T(U)$. Now, assuming $T$ is increasing,

$$ F_X(x) = \mathbb{P}(X\leq x) = \mathbb{P}(T(U)\leq x) = \mathbb{P}(U\leq T^{-1}(x)) = F_U(T^{-1}(x)) = T^{-1}(x). $$

Hence, $T=F_X^{-1}$, and we can sample from $X$ by sampling from $U$ and applying $T$.

The Gumbel trick allows us to sample from the categorical distribution without computing the CDF.

1. Sample $n$ independent Gumbel variables $g_i\sim\mathcal{G}(0,1)$. This can easily be done using inverse transform sampling: $g_i = -\log(-\log(u_i)),\ \ u_i\sim\mathcal{U}(0,1)$.
2. Compute the perturbed log-probabilities $\phi_i' = \phi_i + g_i$. Since the Gumbel distributions are a location-scale family, $\phi_i'\sim\mathcal{G}(\phi_i,1)$.
3. Finally, pick the index $i$ such that $\phi_i'\geq \phi_j'$ for all $j\neq i$ (complexity $O(n)$, since the perturbed values are not sorted). In other words, we take the index

$$ \arg\max_i \phi_i' = \arg\max_i (\phi_i + g_i). $$
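A minimal NumPy sketch of the trick (illustrative values; note the logits carry an arbitrary constant offset, which the argmax ignores):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.1, 0.2, 0.3, 0.4])
phi = np.log(p) + 5.0          # unnormalized: any constant offset is fine

# Perturb with Gumbel(0, 1) noise and take the argmax, O(n) per sample
u = rng.random((50_000, 4))
g = -np.log(-np.log(u))        # Gumbel(0, 1) via inverse transform sampling
samples = np.argmax(phi + g, axis=1)

# Empirical frequencies approach p = [0.1, 0.2, 0.3, 0.4]
freq = np.bincount(samples, minlength=4) / len(samples)
```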

**Lemma 1:** The inverse CDF of the exponential distribution $\text{Exp}(\lambda)$ is

$$ F^{-1}(u) = -\frac{1}{\lambda}\log(1-u). $$

**Lemma 2:** If $X_1 \sim \text{Exp}(\lambda_1), \dots, X_n \sim \text{Exp}(\lambda_n)$ are independent, then $\min_i X_i \sim \text{Exp}(\sum_i \lambda_i)$ and
$$ \mathbb{P}(X_i = \min_j X_j) = \frac{\lambda_i}{\sum_j \lambda_j}. $$

Observe that the probability of a tie is zero.
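Both lemmas are easy to verify numerically; a quick Monte Carlo check (rates chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([0.5, 1.0, 2.5])   # rates lambda_i

# Exponential draws via the inverse CDF of Lemma 1: -(1/lambda) log(1 - u)
u = rng.random((200_000, 3))
x = -np.log1p(-u) / lam

# Lemma 2: argmin_i X_i = i with probability lambda_i / sum(lambda)
freq = np.bincount(x.argmin(axis=1), minlength=3) / len(x)
# lam / lam.sum() = [0.125, 0.25, 0.625]

# Lemma 2: min_i X_i ~ Exp(sum(lambda)), so its mean is 1 / lam.sum() = 0.25
mean_min = x.min(axis=1).mean()
```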

**Proof.** We want to prove that the probability of picking the index $i$ is $p_i$, i.e., the probability of $\phi'_i$ being the biggest perturbed log-probability is $p_i$.

**First part:** show that $\exp(-\phi'_i)\sim \text{Exp}(p_i\alpha)$.

Recall that $\phi_i=\log p_i +\log\alpha$ are the unnormalized log-probabilities. We can write
$$ \begin{aligned}
\phi_i' &= \phi_i + g_i = \log p_i +\log \alpha - \log(-\log(u_i)) \\
&= -\log\left(\frac{1}{p_i\alpha} \cdot \log\left(\frac{1}{u_i}\right)\right).
\end{aligned} $$

Hence, $\exp(-\phi'_i) = \frac{1}{p_i\alpha} \cdot \log\left(\frac{1}{u_i}\right)$, which is the inverse CDF of the exponential distribution with parameter $p_i\alpha$ (Lemma 1).

Using inverse transform sampling, since $u_i\sim\mathcal{U}(0,1)$, we have that $\exp(-\phi'_i)\sim \text{Exp}(p_i\alpha)$.

**Second part:** Note that
$$ \arg\max_i \phi_i' = \arg\min_i \exp(-\phi'_i) \sim \arg\min_i \text{Exp}(p_i\alpha). $$

**Third part:** Finally,

$$ \begin{aligned}
\mathbb{P}(\arg\max_i \phi_i' = i) &= \mathbb{P}(\arg\min_i \text{Exp}(p_i\alpha) = i)\\
&= \mathbb{P}\left(\min_j \text{Exp}(p_j\alpha) = \text{Exp}(p_i\alpha)\right)\\
&= \frac{p_i\alpha}{\sum_j p_j\alpha} = p_i,
\end{aligned} $$

where in the last step we have used Lemma 2.

We just saw how the Gumbel trick allows us to sample from the categorical distribution by computing

$$ \arg\max_i \left(\phi_i - \log(-\log(u_i))\right),\ \ u_i\sim\mathcal{U}(0,1). $$

TODO: how Maddison et al proved that max and argmax are independent, and that taking the top-k is equivalent to sampling without replacement.

In order to compute expectations, if we don't have any domain-specific closed-form expression, we typically resort to Monte Carlo (MC) estimation. This involves sampling $m$ times from the model and computing the average of the function of interest $f$:

$$ \mathbb{E}[f(X)] \approx \frac{1}{m}\sum_{i=1}^m f(x_i),\ \ x_i\sim p(x). $$

The intuition is simple: in a discrete world, the most probable samples will be sampled more often, so the average will be close to the expectation.

The Monte Carlo estimator is unbiased, but it has high variance. To compensate, we would need more samples, which is often infeasible. Also, if the distribution has low entropy, we will be inefficiently sampling the same values over and over again.

Importance sampling is a technique to reduce the variance of the Monte Carlo estimator. The idea is to sample from a different distribution $q(x)$, and then reweight the samples by the ratio of the probabilities:

$$ \mathbb{E}[f(X)] = \sum_x f(x) p(x) = \sum_x f(x) \frac{p(x)}{q(x)} q(x) = \mathbb{E}_q\left[f(X)\frac{p(X)}{q(X)}\right]. $$
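A tiny discrete sketch (the values of $p$, $q$ and $f$ are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.7, 0.1, 0.1, 0.1])      # target distribution
q = np.array([0.25, 0.25, 0.25, 0.25])  # proposal we actually sample from
f = np.array([1.0, 10.0, 100.0, 1000.0])

exact = np.sum(f * p)  # E_p[f(X)] = 111.7

# Sample from q and reweight each sample by p(x)/q(x)
x = rng.choice(len(p), size=200_000, p=q)
estimate = np.mean(f[x] * p[x] / q[x])  # unbiased estimate of E_p[f(X)]
```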

TODO: stratified sampling, Horvitz-Thompson estimator, weighted reservoir sampling, priority sampling, sparse vector representations.

A categorical (generalized Bernoulli) distribution is characterized as a vector $\mathbf{p}=(p_1,\dots,p_n)$, where $p_i$ represents the probability mass of the $i$-th event from a discrete outcome space $\Omega$. Commonly, we take the index random variable $I:\Omega\to\mathbb{N}$ to map the outcome space to the natural numbers, giving

$$ \mathbb{P}(I = i) := p(i) = \begin{cases}
p_i & \text{if } i \in \{1,\dots,n\} \\
0 & \text{otherwise.}
\end{cases} $$

As a valid distribution, it satisfies $\sum_i p_i = 1$. We can express the expectation of the random variable $I$ as $\mathbb{E}[I] = \sum_i i\, p_i$.

If the distribution represents, for example, the possible next words in a language model, computing the expected value may not be particularly meaningful, as it would reflect the average index of the next word.

Sometimes, we are interested in the expected value of a function of the outcomes $f:\Omega\to\mathbb{R}$, that is, $\mathbb{E}_I[f(I^{-1}(i))]$. However, for brevity, we often write $\mathbb{E}[f(I)] = \sum_i f(i) p_i$ (using the law of the unconscious statistician).

Thanks @Clara Meister for guiding me through the literature, and thanks @Tim Vieira for pointing out my initial runtime complexity mistakes!

*Published Wed, 04 Oct 2023.*

---

## A Toy Probabilistic Transformer for Debugging Generation Algorithms in HuggingFace🤗

A few weeks ago, I found myself implementing "Stochastic Beams and Where to Find Them" (sampling without replacement from a Transformer).

Debugging and verifying the correctness of a sampling algorithm in HuggingFace is not straightforward. Thus, I built a fake carcass for a Transformer model with a small vocabulary and fixed, controlled probabilities, which allows me to keep a close eye on the logits and the generated sequence.

```mermaid
stateDiagram-v2
    state "[0]" as 1
    state "[0,1]" as 01
    state "[0,2]" as 02
    state "[0,1,1]" as 011
    state "[0,1,1,3]" as 0113
    state "[0,1,2]" as 012
    state "[0,1,2,3]" as 0123
    state "[0,2,1]" as 021
    state "[0,2,1,3]" as 0213
    state "[0,2,2]" as 022
    state "[0,2,2,3]" as 0223

    note right of 0113
        prob=0.075
        logp=-2.59
    end note
    note right of 0123
        prob=0.675
        logp=-0.39
    end note
    note right of 0223
        prob=0.225
        logp=-1.49
    end note
    note right of 0213
        prob=0.025
        logp=-3.68
    end note

    [*] --> 1 : 0 (BOS)
    1 --> 01 : 75%
    1 --> 02 : 25%
    01 --> 011 : 10%
    01 --> 012 : 90%
    02 --> 021 : 10%
    02 --> 022 : 90%
    011 --> 0113 : EOS
    012 --> 0123 : EOS
    021 --> 0213 : EOS
    022 --> 0223 : EOS
```


It can be seen as a Probabilistic Finite State Automaton: it does not learn from data and ignores its input. Instead, it uses fixed rules to produce a sequence of output tokens with predefined probabilities. The simplicity of this setup is intentional, to isolate and spotlight the sequence generation process.

The vocabulary consists of only 4 IDs: token 0 (BOS, Beginning of Sequence), token 1, token 2, and token 3 (EOS, End of Sequence). It uses the length of the sequence to decide the probability distribution over the next token.

Here is how it works: if the sequence has a length of 1 (i.e., only BOS), it assigns a 75% probability to token 1 and a 25% probability to token 2. If the sequence length is 2, it gives a 10% probability to token 1 and a 90% probability to token 2. When the sequence length is 3, it always predicts EOS (with 100% probability), marking the end of the sequence.
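These rules can be written down in a few lines (a hypothetical sketch of the logic, not the actual gist code):

```python
# Token ids: 0 = BOS, 1, 2, 3 = EOS.
# The next-token distribution depends only on the current sequence length.
def next_token_probs(seq_len):
    if seq_len == 1:                      # only BOS so far
        return [0.0, 0.75, 0.25, 0.0]
    if seq_len == 2:
        return [0.0, 0.10, 0.90, 0.0]
    return [0.0, 0.0, 0.0, 1.0]          # length 3: force EOS

# e.g. the sequence [0, 1, 2, 3] has probability 0.75 * 0.90 * 1.0 = 0.675
prob_0123 = next_token_probs(1)[1] * next_token_probs(2)[2] * next_token_probs(3)[3]
```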

### Compatibility with `generate()`

This tiny transformer is fully compatible with the built-in `generate()` function of the Hugging Face `transformers` library. You can try out the different decoding algorithms, play with the probabilities and the length of the sequence, and see how the different decoding algorithms behave.

However, keep in mind that this model does not compute attention or hidden states, and returns an empty tuple for those attributes.

The utility of this toy Transformer lies in its ability to facilitate a granular, step-by-step examination of the sequence generation process. Since it is intended for probabilistic sampling, the default generation params are `do_sample=True` and `num_beams=1`.

Find the code at GitHub. Try generating 10 random samples using ancestral sampling:

```shell
pip install transformers torch gist-import
```

```python
from gist_import import GistImporter
import torch

model = GistImporter('00d7a84632d8e858ff0c208e5e44559b')['FakeTransformer']()
torch.manual_seed(1)
BOS = 0

# generate 10 samples with ancestral sampling
model.generate(input_ids=torch.tensor([[BOS]] * 10))
# output:
# tensor([[0, 1, 2, 3],
#         [0, 2, 2, 3],
#         [0, 1, 1, 3],
#         [0, 1, 2, 3],
#         [0, 1, 2, 3],
#         [0, 1, 2, 3],
#         [0, 1, 1, 3],
#         [0, 1, 2, 3],
#         [0, 1, 2, 3],
#         [0, 2, 2, 3]])

# token probabilities for each generation step of 1 sample
torch.cat(model.generate(input_ids=torch.tensor([[BOS]]), output_scores=True,
                         return_dict_in_generate=True)['scores']).exp()
# output:          BOS     tok1    tok2    EOS
# tensor([[0.0000, 0.7500, 0.2500, 0.0000],
#         [0.0000, 0.1000, 0.9000, 0.0000],
#         [0.0000, 0.0000, 0.0000, 1.0000]])
```

*Published Tue, 27 Jun 2023.*

---

## Porting Stochastic Beam Search to HuggingFace🤗

Stochastic beam search is a principled way of getting a sample-without-replacement from an autoregressive model, just by perturbing the scores of the beam search algorithm. This allows us to construct low-variance estimators over the model's distribution, which can be useful to estimate the model's properties and explore stochastic strategies for generation.

The algorithm boils down to perturbing the *accumulated log-probabilities* of the beam search algorithm. Thanks to the top-k Gumbel trick, we know that the top-k samples with respect to the perturbed accumulated log-probabilities are distributed as a sample without replacement from the model. This is like a magic trick🪄!! What if we could get the top-k perturbed log-probabilities without exploring the whole search space? Well, that's exactly what the stochastic beam search algorithm does.

Two particular observations that are not very obvious at first sight:

- The algorithm relies on the probabilities being locally normalized at each generation step, so that the beam scores are the actual log-probabilities of the partially generated sequences. The original Gumbel trick to sample from unnormalized logits works because the argmax distribution is the same, but SBS also needs the max of the perturbed logits to be correctly distributed!
- The key observation of SBS is that the Gumbel perturbations can be computed hierarchically and depend *only* on the beam scores and the parent Gumbel, and since $p(\text{<BOS>})=1$, we can compute the perturbation for the first step and then propagate downstream efficiently.

Taking a pragmatic approach, we might first think that we can just use a `LogitsProcessor` to perturb the logits and then use the standard beam search implementation. Two problems arise:

- Computing the Gumbel perturbations requires the perturbation of the parent node in the generation tree. This is not possible with the current `LogitsProcessor` API, which only allows modifying the logits of the current node without further context.
- We cannot return the perturbed logits as the new logits, since we need to keep track of the original logits to compute the perturbations for the next step.

The following assumes familiarity with the HF generation pipeline, which I distilled in this post.

```mermaid
graph TD
    A[GenerationMixin.generate] --> B[GenerationMixin.beam_search]
    B --> C[BeamSearchScorer.process]
    C --> B
    B --> D[BeamSearchScorer.finalize]
```

We need to make the following changes to the beam search algorithm:

- A `LogitsProcessor` that takes not only the next-token logits, but all generated logits. In this stage, we also need the previously generated perturbations.
- We need to use the perturbed logits for selecting the top-k candidates, but the original logits for updating the beam scores, which are the true log-probabilities of the generated sequences. In the current implementation, the beam scores are updated with the processed logits, which are not the true log-probabilities of the generated sequences. The perturbed score should only be saved once the beam is finished, so that we can keep track of the top-k finished beams with the highest perturbed scores.

To achieve point 1, we need to modify `GenerationMixin.beam_search` and pass additional parameters to the `LogitsProcessor`.

For point 2, two modifications are needed:

- A more flexible `beam_search` method that keeps track of the log-probabilities as the beam scores and, separately, the processed logits, which are used **only** for selecting the top-k candidates, so that they don't affect the true beam scores.
- TODO: A flag in the `beam_search` method to allow disabling the use of the processed logits as beam scores, so the default behavior would be the standard beam search.

We must keep in mind that the selection among *finished beams* happens inside `BeamSearchScorer`, while the selection among *active beams* comes from the `torch.topk` call in `GenerationMixin.beam_search`.

We should pass the perturbed logits to the `BeamSearchScorer`. It must save them into the `BeamHypotheses` object, so that only the highest perturbed-logprob *finished* beams are kept. The call to `torch.topk` should also be modified to use the perturbed logits, so that the *active* beams are selected according to the perturbed logits. However, when `BeamSearchScorer` returns the updated active beam scores, we should use the original unperturbed scores.

We should also look carefully at which scores are being used in the `BeamSearchScorer.finalize` method.

Since we are looking into generalizing the beam search algorithm, it may make sense to pass to the `LogitsProcessor` not just the past processed logits, but also the past true beam scores. This would allow implementing other algorithms that require the past beam scores. Furthermore, for extensibility it would also be nice to be able to provide our own `BeamSearchScorer` class.

While implementing a new generation strategy for Transformer models, I found myself delving deep into the HuggingFace library. The documentation is clear with respect to the usage, but not so much with respect to the implementation details.

Here is a collection of notes I've compiled from my dive into the codebase. This may prove beneficial for anyone looking to understand or extend HuggingFace's generation pipeline.

HuggingFace Transformer models all have one common ancestor: `PreTrainedModel`. This class is defined in `transformers/modeling_utils.py`. It is a subclass of `torch.nn.Module`, `ModuleUtilsMixin`, `GenerationMixin` and `PushToHubMixin`.

```mermaid
graph TD;
    ModuleUtilsMixin-->PreTrainedModel;
    GenerationMixin-->PreTrainedModel;
    PushToHubMixin-->PreTrainedModel;
    torch.Module-->PreTrainedModel;
    PreTrainedModel-->PretrainedMyModel;
    PretrainedMyModel-->MyModelForConditionalGeneration;
```

The generation pipeline for all Transformer models is centralized in `GenerationMixin`. This class is defined in `transformers/generation/utils.py`, and all models must implement `prepare_inputs_for_generation`. Additionally, models can implement `adjust_logits_during_generation` and `_reorder_cache`.

The main method in `GenerationMixin` is `generate`, which orchestrates the generation process and then calls the different specialized methods such as `contrastive_search`, `greedy_search`, `sample`, `beam_search`, `beam_sample`, `group_beam_search`, `constrained_beam_search` and `assisted_decoding`.

Let's break down the generation pipeline into its different steps. Note that these steps are numbered the same way in the code comments. This is a permalink to the `generate` method being analyzed in this post (note that HF is a fast-moving target, so some details may be outdated soon).

Another vital point to note is that generation happens in batches, meaning that the input_ids have a shape of `(batch_size, seq_len)`. This allows, for example, translating multiple sentences at once.

```mermaid
%%{init: { 'themeVariables': {'fontSize': '24px'} } }%%
timeline
    1. Prepare generation_config : Merge model and user gen config
    2. Set generation parameters : Prepare logits processors and stopping criteria
    3. Define model_inputs : Get encoder inputs if needed
    4. Define other model kwargs
    5. Prepare input_ids for the decoder : Initialize with <bos> if needed
```

```mermaid
%%{init: { 'themeVariables': {'fontSize': '23px'} } }%%
timeline
    6. Prepare max_length depending on stopping criteria
    7. Determine generation mode : Set is_greedy, is_sample, is_beam, ... : Check if arguments are consistent
    8. Prepare distribution pre-processing samplers : Prepare logits_processor
    9. Prepare stopping criteria
    10. Go into different generation modes
```

The `logits_processor` is a list of functions that are applied to the logits before selecting or sampling the next token. There is also a `logits_warper` that is applied to the logits *after* the `logits_processor`, but only in stochastic generation modes (`sample`, `beam_sample`, `assisted_decoding`, `constrained_beam_search` and `contrastive_search`). Also, in `beam_sample` mode, `logits_processor` is applied to the logits, but then the logits are integrated into the beam search scores, and the `logits_warper` is applied to the beam search scores.

```mermaid
timeline
    11. Prepare beam search scorer : Initialize beam hypotheses
    12. Interleave input_ids with n_beams additional sequences : tensor of shape [batch_size, seq_len] -> [batch_size*n_beams, seq_len]
    13. Run beam search : call beam_search method
```

The beam search generation mode has two main components:

- The `beam_search` method, found in `GenerationMixin`, handles the primary decoding loop, maintains the beam scores and calls the model (referenced in step 13 of `generate`).
- In `transformers/generation/beam_search.py`, `BeamSearchScorer` has one `BeamHypotheses` object for each sequence in the batch. It is a general construction that makes sense for generalizing beam search to diverse beam search (keeping different groups of beams to ensure diversity).
  - The `BeamHypotheses` object keeps the list of the `n_beams` best hypotheses for each sequence in the batch, with their beam scores and beam indices.

- Initialize the beam_scores to 0 as a tensor of dimension `(batch_size, n_beams)`.
- Set beam_scores to $-\infty$ for all beams except the first one (`beam_scores[:, 1:] = -1e9`).
- View beam_scores as a 1D tensor of dimension `(batch_size*n_beams)`.
- Generation loop:
  - Run the model, get outputs for the next token over all beams of all sequences in the batch.
  - Locally normalize the output (apply log_softmax).
  - Apply the `logits_processor` to the logits.
  - Add the new logits to the running beam scores. Note that now we have a tensor of dimension `(batch_size*n_beams, vocab_size)`.
  - To form the next_token_scores, view it as a tensor of dimension `(batch_size, n_beams*vocab_size)`.
  - Get the `2*n_beams` best scores from next_token_scores by applying `torch.topk`. Derive the beam indices and token indices.
  - Call `beam_scorer.process` to update the beam hypotheses. Get the new beam scores, indices and next_tokens for each beam. Update `input_ids` with the new tokens.
  - If all beams are finished or the stopping criteria are met, break the loop.

This method is defined in `transformers/generation/beam_search.py` and takes as input the `2*n_beams` topk elements and indexes calculated above. The beam search scorer is initialized with a `BeamHypotheses` object for each sequence in the batch.

- Create new tensors for the next scores, tokens and indices of dimension `(batch_size, group_size)`. (This is because of diverse beam search; for normal beam search `group_size` = `n_beams`, so the tensors have dimension `(batch_size, n_beams)`.)
- For each beam hypotheses object in the scorer (i.e. for each sentence in the batch):
  - If the sentence is finished, do nothing and continue to the next sentence.
  - For each (token, score, index) among the top `2*n_beams` of the `n_beams*vocab_size` next scores:
    - If the token is the EOS token, check whether the beam is still among the `n_beams` best beams. If so, add the beam to the list of hypotheses of the sentence. The beam score for this beam becomes 0, since it moves from the running beams to the finished beams.
    - If the token is not the EOS token, add the token, score and beam index to the next scores, tokens and indices tensors. Once we have all the running beams, break the loop (remember that we started from the top scores, so we only want to keep the `n_beams` best finished beams and the `n_beams` best running beams).
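A minimal sketch of this per-sentence logic (toy EOS id, hypothetical names, ignoring the length penalty and diverse groups):

```python
EOS = 0  # hypothetical EOS token id for this sketch

def process(top_scores, top_tokens, top_beams, n_beams, finished_hyps):
    """Sketch of the per-sentence loop inside beam_scorer.process.

    top_* are the 2*n_beams candidates from the topk call, sorted from
    best to worst. finished_hyps collects completed beams.
    Returns the n_beams (score, token, beam_index) triples that keep running.
    """
    next_scores, next_tokens, next_indices = [], [], []
    for rank, (score, token, beam) in enumerate(
        zip(top_scores, top_tokens, top_beams)
    ):
        if token == EOS:
            # Only candidates ranked among the n_beams best may finish.
            if rank < n_beams:
                finished_hyps.append((score, beam))
            continue
        next_scores.append(score)
        next_tokens.append(token)
        next_indices.append(beam)
        if len(next_scores) == n_beams:  # all running beams are filled
            break
    return next_scores, next_tokens, next_indices
```

Because we iterate from the best candidate downwards, the first `n_beams` non-EOS candidates are exactly the best running beams, which is why the early `break` is safe.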

We can see how the beam hypotheses keep the `n_beams` best finished beams, while the `n_beams` best running beams are kept in the `next_scores`, `next_tokens` and `next_indices` tensors, which are sent back and forth between the `beam_search` method and the `process` method as the main loop of `beam_search` progresses through the running beams.

Why do we need to select the `2*n_beams` best beams? It seems strange at first glance. From a theoretical point of view, each new generation step always makes the sequence probabilities smaller, so the first `n_beams` beams that reach `<EOS>` will always have higher probability than any possible continuation. However, there are two empirical reasons to keep more beams alive.

First, in closed-vocabulary models, we might find that `<UNK>` is the best token at some point. Most beam search implementations fall back to the next best token in this case, hence needing `n_beams+1` tokens. Second, beam search is commonly used with length normalization, which can give a sequence a higher score as it grows longer. This means we need to store the best finished beams and the best running beams separately, and only compare them once they are finished (thanks Clara for helping me figure this out!).
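To see why length normalization forces this separation, here is a toy example using the GNMT-style penalty of dividing the summed log probability by `length**alpha` (HF's `BeamHypotheses` applies a similar division by `length**length_penalty`; the `alpha` value here is just illustrative):

```python
def length_normalized_score(sum_logprobs, length, alpha=0.6):
    """GNMT-style length penalty: divide the cumulative log probability
    by length**alpha so longer finished sequences can compete."""
    return sum_logprobs / (length ** alpha)

# A longer finished beam can outrank a shorter one after normalization,
# even though its raw log probability is strictly lower:
shorter = length_normalized_score(-4.0, 5)
longer = length_normalized_score(-6.0, 10)
```

Here `longer > shorter` despite `-6.0 < -4.0`, so a running beam can still overtake an already-finished one, and we cannot prune it early.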

This is why HF's `beam_search` saves `2*n_beams` beams. We might encounter situations where all the alive `n_beams` sequences reach `<EOS>`, leaving no live sequences to continue. With `2*n_beams`, we are guaranteed to have at least one non-`EOS` token for each beam hypothesis.

On top of this, without length normalization we can stop generation as soon as `n_beams` sequences reach `<EOS>`. This is achieved in HF by setting `early_stopping=True`. When `early_stopping` is set to `False` or `"never"`, HF uses two different, unsatisfactory heuristics to stop generation whenever the best running beam is estimated to be worse than the worst finished beam. Surprisingly, no setting of `early_stopping` actually disables early stopping altogether and lets generation continue until all beams are finished or the maximum length is reached. To be fair, that would probably cause OOM problems.

Interestingly, the beam search in HuggingFace was adapted from facebookresearch/XLM. You can check out the original 2019 commit here. Early days when Thomas Wolf was coding and HuggingFace was still a chatbot for teenagers!

During beam search, we keep track of the following scores:

- `beam_scores`: the running scores of the beams, i.e. the sum of the log probabilities of the tokens generated so far for each beam. It is a tensor of dimension `(batch_size * n_beams)`. The model logits may have been modified by the logit processors or by the length penalty.

Optionally, also:

- `scores`: the word-per-word scores of the beams, that is, the log probabilities for every token in the vocabulary at each generation step. It is a tuple of length `seq_len` of tensors of dimension `(batch_size * n_beams, vocab_size)`. Beam indices are needed to recover the scores for each selected token.
- `beam_indices`: the indices of the beams that generated the scores at each time step. I believe the beam indices here refer to the indices of the `n_beams * vocab_size` scores in the previous timestep's `torch.topk` call. However, I am not sure, and the indices may maintain coherence across timesteps. TODO: investigate this.
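As a toy illustration of why beam indices are needed at all, here is how one could recover the per-token scores of a single finished sequence from the `scores` tuple, under the simplified (and, as noted above, unverified) assumption that `beam_indices[t]` selects the row of `scores[t]` the token was drawn from; all names and numbers are made up:

```python
import numpy as np

# Toy data: 2 beams, vocab of 3, two generation steps.
scores = (
    np.array([[-0.4, -1.6, -2.3], [-1.7, -1.9, -3.3]]),  # step 0
    np.array([[-0.1, -2.0, -3.0], [-0.5, -1.0, -2.0]]),  # step 1
)
beam_indices = [0, 1]  # beam row the sequence occupied at each step
tokens = [0, 2]        # token chosen at each step

# Gather the log probability of each chosen token from the right beam row.
per_token = [scores[t][beam_indices[t], tokens[t]] for t in range(len(tokens))]
# Summing per_token gives the sequence's cumulative (unnormalized) score.
```

Because a beam can hop between rows from one step to the next, indexing `scores[t]` with a fixed row would mix log probabilities from different hypotheses.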

A brief summary of my experience replacing Vodafone's Sercomm H500-s router with a neutral router running OpenWrt, explaining the problems obtaining the PPPoE credentials and how to solve them.

In my case I have this Vodafone router connected to a Lucent ONT. I am a direct-fiber customer, not NEBA. The Sercomm runs the latest firmware as of September 2020, version 3.5.09.

As you probably know, to obtain this router's username and password for the PPPoE connection you have to listen between the router and the ONT and capture the traffic. Most tutorials (this one was my reference) log into the router as admin to force it to redirect all traffic between the ONT and the router to the PC's interface, so the traffic can be captured with Wireshark.

I am not saying this doesn't work, but I was unable to capture anything. I suspect the router performs the remote provisioning before downloading the PPPoE data, and as soon as it detects a logged-in admin it kicks you out, closes the packet redirection and only then requests the PPPoE credentials from the server. But again, I may just be clumsy and this may still work.

To make sure I captured the data, I resorted to the following method: I connected the ONT to my computer, and then connected my computer to the router over Ethernet as well, using a USB-to-Ethernet adapter (you can find them on Amazon for about €10). This way my computer sat between the router and the ONT. For this man-in-the-middle setup to work, you have to configure the two Ethernet interfaces in bridge mode. On Windows I could not get bridge mode working so that the router and the ONT would discover each other (Windows!!!!), but on Linux there was no problem, just by following these instructions. All that remains is to reset the router and capture the data with Wireshark.

There I obtained my Vodafone PPPoE username without any problem, just like in the other guides.

Now, to configure the neutral router against the ONT, it is not enough to enter the PPPoE username and password in the router's WAN configuration. You also have to configure the router so that all outgoing traffic towards the WAN is tagged as VLAN 100. In OpenWrt this can be done from the LuCI interface, under Network -> Switch.
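For reference, the same VLAN tagging can be expressed in `/etc/config/network` (UCI). This is only a sketch: the switch port numbers depend entirely on your hardware, and the credentials below are placeholders for the values captured with Wireshark:

```
# /etc/config/network (fragment) -- port layout varies per device
config switch_vlan
        option device 'switch0'
        option vlan '100'
        option ports '0t 5t'          # CPU and WAN ports, tagged

config interface 'wan'
        option ifname 'eth0.100'      # WAN traffic tagged as VLAN 100
        option proto 'pppoe'
        option username 'CAPTURED_PPPOE_USER'
        option password 'CAPTURED_PPPOE_PASS'
```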

If you have a Linksys EA8300 or any other router with the hybrid IPQ40xx switch, you will still have to dig around the web a bit to configure the switch manually, despite the driver bugs.
