About Me

I am an Assistant Professor in the NLP group at IT University of Copenhagen.

I completed my PhD at the University of Edinburgh, where my thesis focused on neural planning for generating long documents from tabular data. It received the Best Dissertation in Scotland award from SICSA. During my PhD, I also interned with the Summarization team at Google Research, London.

My research interests include:

  • Pretraining Data Quality and Safety: I study how the data a model is trained on shapes what it learns, with a focus on measuring and improving pretraining data along diversity, faithfulness, and factual safety. This includes auditing web corpora for misinformation (FineWeb-Edu Misinformation Audit), studying the quality of synthetic data (Quality Study of Rephrased Web Data), and examining diversity measures for pretraining data (G-Vendi base-model proxy).

  • Multilinguality: I develop methods to make LLMs effective for low-resource languages, through romanization (RomanSetu, ACL’24 and RomanLens, ACL’25) and language-relatedness-based chunking (DecoMT, EMNLP’23). I also study how these models behave, including how well they follow instructions across Indic languages (IndicIFEval), and how they reason across languages (The Reasoning Lingua Franca, EACL’26).

  • Structured and Long-Context Modeling: I work on improving models’ ability to process structured data and long sequences. This includes neural planning for generation from tabular inputs (TACL’21, TACL’22), long-context modeling for summarization and sequences (ACL’23, TMLR’25), and re-examining how tabular language models are evaluated (ICML’26).

If you are interested in my research or potential collaboration, feel free to reach out via email.

News

  • 13 Jun 2026: A study on the G-Vendi diversity metric for evaluating synthetic pretraining data. - Details
  • 30 Apr 2026: Our paper on re-examining the evaluation of Tabular Language Models was accepted to ICML 2026. - Paper
  • 24 Apr 2026: Gave a talk, 'A Tale of Two Audits', at the Pioneer Centre for AI (P1), Copenhagen. - Slides
  • 3 Apr 2026: Released the FineWeb-Edu Misinformation Audit, a dataset auditing the prevalence and types of misinformation in the FineWeb-Edu pretraining corpus. - Dataset
  • 18 Mar 2026: A quality study of rephrased web data used as synthetic pretraining data. - Details
  • 25 Feb 2026: Preprint introducing IndicIFEval, a benchmark for verifiable instruction-following evaluation across 14 Indic languages. - Preprint
  • 3 Feb 2026: Preprint evaluating generalization claims in Tabular Language Models. - Preprint
  • 5 Dec 2025: Received a grant from Den Danske Maritime Fond for the ACCENT project, which develops automated methods for maritime chart correction preparation and English translation. - Funding Announcement
  • 24 Nov 2025: Received a grant from the Orient Fund for the Maritime Safety Assistant project.
  • 28 Oct 2025: Preprint on RiddleBench, a benchmark for evaluating complex reasoning and puzzle-solving in LLMs. - Preprint
  • 23 Oct 2025: Preprint on multilingual reasoning in LLMs, revealing the Lost-in-Translation failure mode when reasoning in English. - Preprint
  • 29 Aug 2025: Chimera accepted to Transactions on Machine Learning Research (TMLR). It proposes a unified state-space model for sequences, graphs, and images. - Paper
  • 12 Jun 2025: Paper on self-pretraining for genome modeling accepted to the ICML 2025 Workshop on Generative AI for Biology - Paper
  • 16 May 2025: RomanLens to appear in Findings of ACL 2025 - Paper
  • 11 Feb 2025: RomanLens paper on latent romanization in multilingual LLMs - Paper
  • 25 Sep 2024: Paper on vocabulary expansion and initialization strategies for LLMs accepted to CoNLL 2024 - Paper
  • 13 Jun 2024: VerityMath paper accepted to the AI4Math workshop at ICML 2024 - Paper
  • 16 May 2024: Two papers accepted to ACL 2024: RomanSetu and a paper on Indic MT Eval - RomanSetu Preprint - Indic MT Eval Preprint
  • 25 Jan 2024: Introducing Airavata, a Hindi instruction-tuned LLM - Blog
  • 24 Jan 2024: RomanSetu for unlocking multilingual capabilities of Large Language Models via Romanization - Preprint
  • 21 Nov 2023: IndicTrans2 accepted to Transactions on Machine Learning Research (TMLR) - Preprint
  • 13 Nov 2023: VerityMath, which applies unit consistency checks to math problem solving - Preprint
  • 9 Oct 2023: Two papers accepted to EMNLP 2023: DecoMT to the main conference and CTQScorer to Findings - DecoMT Preprint - CTQScorer Preprint

Selected Publications

For my latest publications, please visit my Google Scholar profile.