Manuel de Prada Corral

Hi!👋 I am a doctoral researcher at ETH Zürich and the Max Planck Institute for Intelligent Systems in Tübingen, advised by Ryan Cotterell (ETH) and Wieland Brendel (MPI). I am a fellow of the Max Planck–ETH Center for Learning Systems.

My research focuses on the statistical and algorithmic foundations of language models, with a current emphasis on sampling theory and particle-based inference. I am interested in the interplay between reinforcement learning and sampling algorithms, and in linguistics research aided by language models. I admire mathematically grounded approaches that improve our understanding of modern NLP systems and their reliability.

Previously, I was an ML intern at Hugging Face, working on the transformers generation stack, and a teaching assistant for ETH's Natural Language Processing and Large Language Models courses. Before that, I completed a double BSc in Mathematics and Computer Science at the University of Santiago de Compostela.


§ i Background

  1. Nov 2025 – now PhD Researcher, Natural Language Processing. ETH Zürich & MPI.
  2. 2022 – 2025 MSc, Computer Science. ETH Zürich.
    • Apr – Oct 2025
      ML Engineer Intern. Hugging Face, Paris.

      Six-month internship on language models with the generation team. Implemented decoding strategies and key-value (KV) caching techniques in the main library and internal tooling.

    • 2024 – 2025 Teaching Assistant, NLP & LLMs. ETH Zürich.
    • 2023 – 2024
      Research Assistant. Rycolab, ETH Zürich.

      Inference-time scaling for LLMs using sampling theory. (1) Decoding for language models: Stochastic Beam Search and improvements to Minimum Bayes Risk decoding. (2) Improving constrained generation with globally normalized potentials (joint work with MIT, Johns Hopkins, and McGill).

  3. 2016 – 2022 Double BSc, Mathematics & Computer Science. Universidade de Santiago de Compostela.
    • 2021 – 2022
      Research Assistant. CiTIUS, USC.

      (1) Detecting misinformation in IR systems by integrating language models. (2) Dataset cleaning combining n-gram and Transformer language models. Funded by the national project eRISK: Technologies for the early prediction of signs related with psychological disorders (RTI2018-093336-B-C21).

    • 2021 Seminar: Introduction to the Transformers Architecture. CiTIUS training program (link, slides).

§ ii Papers

  1. A Model of Diverse Sampling from Language Models

    Manuel Prada‑Corral, Yahya Emara, Timothy J. O’Donnell, Ryan Cotterell, Tim Vieira

    Preprint · Under review

    TL;DR. We formalize diverse language-model sampling as a global Determinantal Point Process over complete strings and use importance sampling to enable a principled quality–diversity trade-off without retraining the model.

  2. An unsupervised perplexity-based method for boilerplate removal

    Marcos Fernández-Pichel, Manuel Prada-Corral, David E. Losada, Juan C. Pichel, Pablo Gamallo

    Natural Language Engineering · 2023

    TL;DR. A language-model perplexity signal separates web boilerplate from main content without supervision, beating heuristic cleaners on multilingual crawls. Released as pyplexity.

  3. CiTIUS at the TREC 2022 Health Misinformation Track

    Marcos Fernández-Pichel, Manuel Prada-Corral, David E. Losada, Juan C. Pichel

    TREC 2022 · NIST SP 500-338

    TL;DR. A multi-stage retrieval pipeline that combines BM25 with transformer-based stance and credibility classifiers to surface trustworthy health information and demote misinformation.


§ iii Currently thinking about

  • Sampling theory. Distributions over latent reasoning traces; sampling from LMs without replacement or otherwise non-i.i.d.; useful estimators.
  • Particle‑based inference. Sequential Monte Carlo for structured generation; principled decoding under uncertainty.
  • Linguistic interpretation of LMs. Evaluating linguistic hypotheses using language models, and assessing how valid language models are as evidence for linguistic claims.
  • RL & decoding. Reinforcement‑learning algorithms for generation; minimum‑Bayes‑risk decoding.

§ iv Projects & open source

  1. 2023 Decoders. A library that adds custom decoding strategies on top of 🤗 Transformers; ships an implementation of Stochastic Beam Search.
  2. 2022 Multistage Retrieval for Health Misinformation. Two TREC participations. I built the NLP classifiers, the passage-cleaning stage, and the IR pipeline and infra.
  3. 2022 pyplexity. Companion code to An unsupervised perplexity-based method for boilerplate removal; I wrote most of it, including a high-performance distributed dataset cleaner.
  4. 2021 T5 pretraining in PyTorch Lightning. Implements Google’s T5 denoising objective and evaluates the resulting model.
  5. Open-source contributions to 🤗 Transformers, Zellij, TeXstudio, OpenConnect, and others. Off-topic: Arduino Water Tap, TransLaTeX.

§ v Skills & languages

NLP & statistics
Proficient in PyTorch, 🤗 Transformers, and vLLM. Statistical NLP models (CRFs, Markov networks, …); extensive research on sampling theory and efficient sampling algorithms.
Information retrieval
Lucene and Anserini for large-scale collection processing (TREC participations). Hadoop clusters for storage and parallel processing; academic experience with Apache Spark, cloud storage, MongoDB, HBase, and Neo4j.
Programming
Proficient in Python, Java, and C. Highly competent in C++, R, MATLAB, SQL, and JavaScript.
Hardware
Arduino open-hardware projects showcased at local hackathons (see off-topic projects). Experience administering advanced network environments.
Languages
Spanish and Galician: native. English: C1 (Cambridge CAE, July 2021). French: B1 (Alliance Française DELF, September 2015). Currently learning German.

§ vi IRL

Off the keyboard, you’ll find me playing basketball, hiking, tinkering with bikes or other hardware, or with a Galician bagpipe in hand. I’m a native speaker of Galician and Spanish, I use English and French professionally, and I am currently learning German.