About Me
I am an Assistant Professor in the NLP group at the IT University of Copenhagen.
I completed my PhD at the University of Edinburgh, where my thesis focused on neural planning for generating long documents from tabular data. It received the Best Dissertation in Scotland award from SICSA. During my PhD, I also interned with the Summarization team at Google Research, London.
My research interests include:
Pretraining Data Quality and Safety: I study how the data a model is trained on shapes what it learns, with a focus on measuring and improving pretraining data along diversity, faithfulness, and factual safety. This includes auditing web corpora for misinformation (FineWeb-Edu Misinformation Audit), studying the quality of synthetic data (Quality Study of Rephrased Web Data), and measuring the diversity of pretraining data (G-Vendi base-model proxy).
Multilinguality: I develop methods to make LLMs effective for low-resource languages, through romanization (RomanSetu, ACL’24 and RomanLens, ACL’25) and language-relatedness-based chunking (DecoMT, EMNLP’23). I also study how these models behave, including how well they follow instructions across Indic languages (IndicIFEval), and how they reason across languages (The Reasoning Lingua Franca, EACL’26).
Structured and Long-Context Modeling: I work on improving models’ ability to process structured data and long sequences. This includes neural planning for generation from tabular inputs (TACL’21, TACL’22), long-context modeling for summarization and sequences (ACL’23, TMLR’25), and re-examining how tabular language models are evaluated (ICML’26).
If you are interested in my research or potential collaboration, feel free to reach out via email.
News
- 13 Jun 2026: A study on the G-Vendi diversity metric for evaluating synthetic pretraining data. - Details
- 30 Apr 2026: Our paper on re-examining the evaluation of Tabular Language Models was accepted to ICML 2026. - Paper
- 24 Apr 2026: Gave a talk 'A Tale of Two Audits' at the Pioneer Centre for AI (P1), Copenhagen. - Slides
- 3 Apr 2026: Released the FineWeb-Edu Misinformation Audit, a dataset auditing the prevalence and types of misinformation in the FineWeb-Edu pretraining corpus. - Dataset
- 18 Mar 2026: A quality study of rephrased web data used as synthetic pretraining data. - Details
- 25 Feb 2026: Preprint introducing IndicIFEval, a benchmark for verifiable instruction-following evaluation across 14 Indic languages. - Preprint
- 3 Feb 2026: Preprint evaluating generalization claims in Tabular Language Models. - Preprint
- 5 Dec 2025: Received a grant from Den Danske Maritime Fond for the ACCENT project, which develops automated methods for maritime chart correction preparation and English translation. - Funding Announcement
- 24 Nov 2025: Received a grant from the Orient Fund for the Maritime Safety Assistant project.
- 28 Oct 2025: Preprint on RiddleBench, a benchmark for evaluating complex reasoning and puzzle-solving in LLMs. - Preprint
- 23 Oct 2025: Preprint on multilingual reasoning in LLMs, revealing the Lost-in-Translation failure mode when reasoning in English. - Preprint
- 29 Aug 2025: Chimera accepted to Transactions on Machine Learning Research (TMLR). It proposes a unified state-space model for sequences, graphs, and images. - Paper
- 12 Jun 2025: Paper on self-pretraining for genome modeling accepted to ICML 2025 Workshop on Generative AI for Biology - Paper
- 16 May 2025: RomanLens to appear in Findings of ACL 2025 - Paper
- 11 Feb 2025: RomanLens paper on latent romanization in multilingual LLMs - Paper
- 25 Sep 2024: Paper on vocabulary expansion and initialization strategies for LLMs accepted to CoNLL 2024 - Paper
- 13 Jun 2024: VerityMath paper accepted to AI4Math workshop at ICML 2024 - Paper
- 16 May 2024: Two papers accepted to ACL: RomanSetu and a paper on Indic MT Eval - RomanSetu Preprint - Indic MT Eval Preprint
- 25 Jan 2024: Introducing Airavata, a Hindi instruction-tuned LLM - Blog
- 24 Jan 2024: RomanSetu for unlocking multilingual capabilities of Large Language Models via Romanization - Preprint
- 21 Nov 2023: IndicTrans2 accepted to Transactions on Machine Learning Research (TMLR) - Preprint
- 13 Nov 2023: VerityMath, which applies unit consistency checks to math problem solving - Preprint
- 9 Oct 2023: Two papers accepted to EMNLP: DecoMT to the main conference and CTQScorer to Findings - DecoMT Preprint - CTQScorer Preprint
Selected Publications
For my latest publications, please visit my Google Scholar profile.
- The Illusion of Generalization in Tabular Language Models Aditya Gorla, Ratish Puduppully. In Proceedings of the 43rd International Conference on Machine Learning (ICML). 2026.
- RiddleBench: A New Generative Reasoning Benchmark for LLMs Deepon Halder, Alan Saji, Thanmay Jayakumar, Ratish Puduppully, Anoop Kunchukuttan, Raj Dabre. In Findings of EACL. 2026.
- The Reasoning Lingua Franca: A Double-Edged Sword for Multilingual AI Alan Saji, Raj Dabre, Anoop Kunchukuttan, Ratish Puduppully. In Proceedings of EACL. 2026.
- IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages Thanmay Jayakumar, Mohammed Safi Ur Rahman Khan, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan. 2026.
- Chimera: State Space Models Beyond Sequences Aakash Lahoti, Tanya Marwah, Ratish Puduppully, Albert Gu. In Transactions on Machine Learning Research. 2025.
- RomanLens: The Role Of Latent Romanization In Multilinguality In LLMs Alan Saji, Jaavid Aktar Husain, Thanmay Jayakumar, Raj Dabre, Anoop Kunchukuttan, Ratish Puduppully. In Findings of the 63rd Annual Meeting of the Association for Computational Linguistics. 2025.
- Improving Genomic Models via Task-Specific Self-Pretraining Sohan Mupparapu, Parameswari Krishnamurthy, Ratish Puduppully. In Proceedings of the Workshop on Generative AI for Biology at the 42nd International Conference on Machine Learning. 2025.
- RomanSetu: Efficiently unlocking multilingual capabilities of Large Language Models via Romanization Jaavid Aktar Husain, Raj Dabre, Aswanth Kumar, Ratish Puduppully, Anoop Kunchukuttan. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics. 2024.
- VerityMath: Advancing Mathematical Reasoning by Self-Verification Through Unit Consistency Vernon Toh, Ratish Puduppully, Nancy F. Chen. In AI4Math Workshop at ICML. 2024.
- Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models Ratish Puduppully, Anoop Kunchukuttan, Raj Dabre, Ai Ti Aw, Nancy F. Chen. In Proceedings of EMNLP. 2023. - Code
- IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages AI4Bharat, Jay Gala, Pranjal A. Chitale, Raghavan AK, Sumanth Doddapaneni, Varun Gumma, Aswanth Kumar, Janki Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, Pratyush Kumar, Mitesh M. Khapra, Raj Dabre, Anoop Kunchukuttan. In Transactions on Machine Learning Research (TMLR). 2023. - Code
- Multi-Document Summarization with Centroid-Based Pretraining Ratish Surendran Puduppully, Parag Jain, Nancy Chen, Mark Steedman. In Proceedings of ACL. 2023. - Code
- Data-to-text Generation with Variational Sequential Planning Ratish Puduppully, Yao Fu, Mirella Lapata. In Transactions of the Association for Computational Linguistics (TACL). 2022. - Code
- Data-to-text Generation with Macro Planning Ratish Puduppully, Mirella Lapata. In Transactions of the Association for Computational Linguistics (TACL). 2021. - Code
- Data-to-text Generation with Entity Modeling Ratish Puduppully, Li Dong, Mirella Lapata. In Proceedings of ACL. 2019. - Code
- Data-to-Text Generation with Content Selection and Planning Ratish Puduppully, Li Dong, Mirella Lapata. In Proceedings of AAAI. 2019. - Code