Manuel de Prada Corral

PhD Researcher · Natural Language Processing · Machine Learning
🇨🇭 ETH Zürich · 🇩🇪 Max Planck Institute (CLS program)

Hi!👋 I am a doctoral researcher at ETH Zürich and the Max Planck Institute for Intelligent Systems in Tübingen, advised by Ryan Cotterell (ETH) and Wieland Brendel (MPI). I am a fellow of the Max Planck–ETH Center for Learning Systems.

My research focuses on the statistical and algorithmic foundations of language models, with current emphasis on sampling theory and particle-based inference. I am interested in the interplay between reinforcement learning and sampling algorithms, and in linguistics research aided by language models. I admire mathematically grounded approaches that improve our understanding of modern NLP systems and their reliability.

Previously, I was an ML intern at Hugging Face, working on the transformers generation stack, and a teaching assistant for ETH's Natural Language Processing and Large Language Models courses. Before that, I completed a double BSc in Mathematics and Computer Science at the University of Santiago de Compostela.

§ i Background

Nov 2025 – now PhD Researcher, Natural Language Processing. ETH Zürich & MPI.
2022 – 2025
MSc, Computer Science. ETH Zürich.
- Apr – Oct 2025
  
  ML Engineer Intern. Hugging Face, Paris.
  
  Six-month internship on language models with the generation team. Implemented decoding strategies and key-value (KV) caching techniques in the main library and internal tooling.
- 2024 – 2025 Teaching Assistant, NLP & LLMs. ETH Zürich.
- 2023 – 2024
  
  Research Assistant. Rycolab, ETH Zürich.
  
  Inference-time scaling using sampling theory for LLMs. (1) Decoding for language models: Stochastic Beam Search and improving Minimum Bayes Risk decoding. (2) Improving constrained generation with globally normalized potentials — joint with MIT, Johns Hopkins, and McGill.
2016 – 2022
Double BSc, Mathematics & Computer Science. Universidade de Santiago de Compostela.
- 2021 – 2022
  
  Research Assistant. CiTIUS, USC.
  
  (1) Detecting misinformation in IR systems by integrating language models. (2) Dataset cleaning combining n-gram and Transformer language models. Funded by the national project eRISK: Technologies for the early prediction of signs related with psychological disorders (RTI2018-093336-B-C21).
- 2021 Seminar: Introduction to the Transformers Architecture. CiTIUS training program (link, slides).

§ ii Papers

A Model of Diverse Sampling from Language Models

Manuel Prada‑Corral, Yahya Emara, Timothy J. O’Donnell, Ryan Cotterell, Tim Vieira

Preprint · Under review

TL;DR. We formalize diverse language-model sampling as a global Determinantal Point Process over complete strings and use importance sampling to enable a principled quality–diversity trade-off without retraining the model.

An unsupervised perplexity-based method for boilerplate removal

Marcos Fernández-Pichel, Manuel Prada-Corral, David E. Losada, Juan C. Pichel, Pablo Gamallo

Natural Language Engineering · 2023

TL;DR. A language-model perplexity signal separates web boilerplate from main content without supervision, beating heuristic cleaners on multilingual crawls. Released as pyplexity.

CiTIUS at the TREC 2022 Health Misinformation Track

Marcos Fernández-Pichel, Manuel Prada-Corral, David E. Losada, Juan C. Pichel

TREC 2022 · NIST SP 500-338

TL;DR. A multi-stage retrieval pipeline that combines BM25 with transformer-based stance and credibility classifiers to surface trustworthy health information and demote misinformation.

§ iii Currently thinking about

▹Sampling theory. Distributions over latent reasoning traces; sampling without replacement and other non-i.i.d. sampling from LMs; useful estimators.
▹Particle‑based inference. Sequential Monte Carlo for structured generation; principled decoding under uncertainty.
▹Linguistic interpretation of LMs. Evaluating linguistic hypotheses using language models, and assessing their validity for linguistic claims.
▹RL & decoding. Reinforcement‑learning algorithms for generation; minimum‑Bayes‑risk decoding.

§ iv Projects & open source

2023 Decoders. A library bringing custom decoding strategies on top of 🤗 Transformers; ships an implementation of Stochastic Beam Search.
2022 Multistage Retrieval for Health Misinformation. Two TREC participations. I built the NLP classifiers, the passage-cleaning stage, and the IR pipeline and infra.
2022 pyplexity. Companion code to An Unsupervised Perplexity-Based Method for Boilerplate Removal; I wrote most of it, including a high-performance distributed dataset cleaner.
2021 T5 pretraining in PyTorch Lightning. Implements Google’s T5 denoising objective and evaluates the resulting model.
… Open-source contributions to 🤗 Transformers, Zellij, TeXstudio, OpenConnect, and others. Off-topic: Arduino Water Tap, TransLaTeX.

§ v Skills & languages

NLP & statistics: Proficient in PyTorch, 🤗 Transformers, and vLLM. Statistical NLP models (CRFs, Markov networks, …); extensive research on sampling theory and efficient sampling algorithms.
Information retrieval: Lucene and Anserini for large-scale collection processing (TREC participations). Hadoop clusters for storage and parallel processing; academic experience with Apache Spark, cloud storage, MongoDB, HBase, and Neo4j.
Programming: Proficient in Python, Java, and C. High competence in C++, R, Matlab, SQL, and JavaScript.
Hardware: Arduino Open Hardware projects showcased at local hackathons (see off-topic projects). Experience administering advanced network environments.
Languages: Spanish and Galician: native. English: C1 (Cambridge CAE, July 2021). French: B1 (Alliance Française DELF, September 2015). Currently learning German.

§ vi IRL

Off the keyboard, you'll find me playing basketball, hiking, tinkering with bikes or other hardware, or with a Galician bagpipe in hand. I’m a native speaker of Galician and Spanish, I use English and French professionally, and I am currently learning German.