Manuel de Prada Corral
Hi!👋 I am a doctoral researcher at ETH Zürich and the Max Planck Institute for Intelligent Systems in Tübingen, advised by Ryan Cotterell (ETH) and Wieland Brendel (MPI). I am a fellow of the Max Planck–ETH Center for Learning Systems.
My research focuses on the statistical and algorithmic foundations of language models, with current emphasis on sampling theory and particle-based inference. I am interested in the interplay between reinforcement learning and sampling algorithms, and in linguistics research aided by language models. I admire mathematically grounded approaches that improve our understanding of modern NLP systems and their reliability.
Previously, I was an ML intern at Hugging Face, working on the transformers generation stack, and a teaching assistant for ETH's Natural Language Processing and Large Language Models courses. Before that, I completed a double BSc in Mathematics and Computer Science at the University of Santiago de Compostela.
§ i Background
- Nov 2025 – now PhD Researcher, Natural Language Processing. ETH Zürich & MPI
- 2022 – 2025 MSc, Computer Science. ETH Zürich.
- Apr – Oct 2025 ML Engineer Intern. Hugging Face, Paris.
Six-month internship on language models with the generation team. Implemented decoding strategies and key-value (KV) caching techniques in the main library and internal tooling.
- 2024 – 2025 Teaching Assistant, NLP & LLMs. ETH Zürich.
- 2023 – 2024 Research Assistant. Rycolab, ETH Zürich.
Inference-time scaling using sampling theory for LLMs. (1) Decoding for language models: Stochastic Beam Search and improving Minimum Bayes Risk decoding. (2) Improving constrained generation with globally normalized potentials — joint with MIT, Johns Hopkins, and McGill.
- 2016 – 2022 Double BSc, Mathematics & Computer Science. Universidade de Santiago de Compostela.
- 2021 – 2022 Research Assistant. CiTIUS, USC.
(1) Detecting misinformation in IR systems by integrating language models. (2) Dataset cleaning combining n-gram and Transformer language models. Funded by the national project eRISK: Technologies for the early prediction of signs related with psychological disorders (RTI2018-093336-B-C21).
- 2021 Seminar: Introduction to the Transformers Architecture. CiTIUS training program (link, slides).
§ ii Papers
-
A Model of Diverse Sampling from Language Models
Preprint · Under review
TL;DR. We formalize diverse language-model sampling as a global Determinantal Point Process over complete strings and use importance sampling to enable a principled quality–diversity trade-off without retraining the model.
-
An unsupervised perplexity-based method for boilerplate removal
Natural Language Engineering · 2023
TL;DR. A language-model perplexity signal separates web boilerplate from main content without supervision, beating heuristic cleaners on multilingual crawls. Released as pyplexity.
-
CiTIUS at the TREC 2022 Health Misinformation Track
TL;DR. A multi-stage retrieval pipeline that combines BM25 with transformer-based stance and credibility classifiers to surface trustworthy health information and demote misinformation.
§ iii Currently thinking about
- ▹Sampling theory. Distributions over latent reasoning traces; sampling without replacement, or otherwise non-i.i.d., from LMs; useful estimators.
- ▹Particle‑based inference. Sequential Monte Carlo for structured generation; principled decoding under uncertainty.
- ▹Linguistic interpretation of LMs. Evaluating linguistic hypotheses using language models, and assessing whether such models provide valid evidence for linguistic claims.
- ▹RL & decoding. Reinforcement‑learning algorithms for generation; minimum‑Bayes‑risk decoding.
§ iv Projects & open source
- 2023 Decoders. A library bringing custom decoding strategies on top of 🤗 Transformers; ships an implementation of Stochastic Beam Search.
- 2022 Multistage Retrieval for Health Misinformation. Two TREC participations. I built the NLP classifiers, the passage-cleaning stage, and the IR pipeline and infra.
- 2022 pyplexity. Companion code to An Unsupervised Perplexity-Based Method for Boilerplate Removal; I wrote most of it, including a high-performance distributed dataset cleaner.
- 2021 T5 pretraining in PyTorch Lightning. Implements Google’s T5 denoising objective and evaluates the resulting model.
- … Open-source contributions to 🤗 Transformers, Zellij, TeXstudio, OpenConnect, and others. Off-topic: Arduino Water Tap, TransLaTeX.
§ v Skills & languages
- NLP & statistics. Proficient in PyTorch, 🤗 Transformers, and vLLM. Statistical NLP models (CRFs, Markov networks, …); extensive research on sampling theory and efficient sampling algorithms.
- Information retrieval. Lucene and Anserini for large-scale collection processing (TREC participations). Hadoop clusters for storage and parallel processing; academic experience with Apache Spark, cloud storage, MongoDB, HBase, and Neo4j.
- Programming. Proficient in Python, Java, and C. High competence in C++, R, Matlab, SQL, and JavaScript.
- Hardware. Arduino Open Hardware projects showcased at local hackathons (see off-topic projects). Experience administering advanced network environments.
- Languages. Spanish and Galician: native. English: C1 (Cambridge CAE, July 2021). French: B1 (Alliance Française DELF, September 2015). Currently learning German.
§ vi IRL
Off the keyboard, you'll find me playing basketball, hiking, tinkering with bikes or other hardware, or with a Galician bagpipe in hand. I’m a native speaker of Galician and Spanish, I use English and French professionally, and I am currently learning German.