Marko Sterbentz

Computer Science PhD Candidate

About Me

I am a computer science PhD candidate at Northwestern University in the C3 Lab advised by Kristian Hammond. My research focuses on leveraging AI and large language models (LLMs) to automate data science processes, enabling users to ask questions of their data and receive meaningful, contextualized insights.

Specifically, this involves:

  • Modeling and representing data science knowledge so that the system understands the range of actions it can take to perform data analytics.
  • Building mechanisms and tools for extracting information from databases, which includes training and developing text-to-query models capable of answering atomic questions (a minimal sketch follows this list).
  • Developing reasoning and planning methods that utilize the available data analytics knowledge and tools to provide contextualized answers to users’ inquiries.
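
To make the text-to-query step concrete, here is a minimal, self-contained sketch using Python's built-in sqlite3 module. The schema, the example question, and the template lookup that stands in for a trained text-to-query model are all illustrative assumptions, not part of the actual system:

```python
# A toy illustration of answering an atomic question over a database.
# In the real system, a trained text-to-SQL model would map the question
# to a query; a stub lookup stands in for it here so the example runs as-is.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cases (id INTEGER PRIMARY KEY, court TEXT, year INTEGER, outcome TEXT);
    INSERT INTO cases VALUES
        (1, 'N.D. Ill.', 2020, 'settled'),
        (2, 'N.D. Ill.', 2021, 'dismissed'),
        (3, 'S.D.N.Y.', 2021, 'settled');
""")

def text_to_sql(question: str) -> str:
    """Stand-in for a trained text-to-query model (hypothetical)."""
    # An atomic question maps to a single executable query over the schema.
    templates = {
        "How many cases were filed in 2021?":
            "SELECT COUNT(*) FROM cases WHERE year = 2021",
    }
    return templates[question]

question = "How many cases were filed in 2021?"
answer = conn.execute(text_to_sql(question)).fetchone()[0]
print(f"{question} -> {answer}")  # -> 2
```
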
Download CV
Interests
  • Artificial Intelligence
  • Question Answering
  • Neurosymbolic AI
  • Language Generation
Education
  • PhD Computer Science

    Northwestern University

  • MS Computer Science

    University of Southern California

  • BS Computer Science

    Idaho State University

News
  • January 2026: The preprint for our paper on generating high-quality synthetic training data for text-to-SQL is available on arXiv.
  • September 2025: I finished my internship at IBM Research. Thank you to my mentors Michael Glass, Nhan Pham, and Shankar Subramanian for an excellent summer of research!
  • June 2025: I’m excited to be starting an internship at IBM Research for the summer working on LLMs for data research.
  • November 2024: I presented the Satyrn paper at EMNLP 2024.
  • September 2024: Our paper Satyrn: A Platform for Analytics Augmented Generation was just accepted to EMNLP 2024. See you in Miami!
  • June 2024: I participated in the CASMI workshop “AI Safety: A Domain-Focused Approach to Anticipating Harm.” Read the full report.
Publications
(2026). RingSQL: Generating Synthetic Data with Schema-Independent Templates for Text-to-SQL Reasoning Models. Preprint.
(2024). Satyrn: A Platform for Analytics Augmented Generation. Empirical Methods in Natural Language Processing 2024 (EMNLP 2024 Main).
(2023). Lightweight Knowledge Representations for Automating Data Analysis. arXiv preprint arXiv:2311.12848.
(2023). Multi-domain Summarization from Leaderboards to Practice: Re-examining Automatic and Human Evaluation. Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM).
(2021). From Data to Information: Automating Data Science to Explore the US Court System. Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law (ICAIL 2021).
(2021). Requirements for Open Political Information: Transparency Beyond Open Data. arXiv preprint arXiv:2112.03119.
(2018). GPGPU Enabled Ray Directed Adaptive Volume Visualization for High Density Scans. Proceedings of the Practice and Experience on Advanced Research Computing (PEARC).

Experience

  1. Research Intern

    IBM Research
  2. Research Assistant

    Northwestern University
  3. Research Intern

    Lawrence Livermore National Laboratory
  4. Research Intern

    Idaho National Laboratory
Projects
Multi-Domain Text Summarization Evaluation

Existing literature does not give much guidance on how to build the best possible multi-domain summarization model from existing components. We present an extensive evaluation of popular pre-trained models on a wide range of datasets to inform the selection of both the model and the training data for robust summarization across several domains.

We find that fine-tuned BART performs better than T5 and PEGASUS, both on in-domain and out-of-domain data, regardless of the dataset used for fine-tuning. While BART has the best performance, it does vary considerably across domains. A multi-domain summarizer that works well for all domains can be built by simply fine-tuning on diverse domains. It performs better than an in-domain summarizer, even when using fewer total training examples.

While the success of such a multi-domain summarization model is clear through automatic evaluation, our human evaluation reveals variations that cannot be captured by any of the automatic evaluation metrics and are thus not reflected in standard leaderboards. Furthermore, we find that conducting reliable human evaluation can be complex as well: even experienced summarization researchers can be inconsistent with one another in their assessments of summary quality, and also with themselves when reannotating the same summary.

The findings of our study are two-fold. First, BART fine-tuned on heterogeneous domains is a great multi-domain summarizer for practical purposes. At the same time, we need to re-examine not just automatic evaluation metrics but also human evaluation methods to responsibly measure progress in summarization.
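
The “fine-tune on diverse domains” recipe above is straightforward to reproduce in outline. Below is a minimal sketch using the Hugging Face transformers and datasets libraries; the specific datasets, sample sizes, and hyperparameters are illustrative assumptions rather than the paper's actual setup:

```python
# Sketch: build a multi-domain summarizer by fine-tuning BART on a
# mixture of heterogeneous summarization datasets (here: news + legislation).
from datasets import load_dataset, concatenate_datasets
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Small illustrative slices from two different domains.
news = load_dataset("cnn_dailymail", "3.0.0", split="train[:2000]")
bills = load_dataset("billsum", split="train[:2000]")

def normalize(ds, src, tgt):
    """Map each dataset onto shared (document, summary) columns."""
    if src != "document":
        ds = ds.rename_column(src, "document")
    if tgt != "summary":
        ds = ds.rename_column(tgt, "summary")
    keep = ("document", "summary")
    return ds.remove_columns([c for c in ds.column_names if c not in keep])

mixed = concatenate_datasets([
    normalize(news, "article", "highlights"),
    normalize(bills, "text", "summary"),
]).shuffle(seed=0)

def preprocess(batch):
    inputs = tokenizer(batch["document"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = mixed.map(preprocess, batched=True,
                      remove_columns=["document", "summary"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="bart-multidomain",
                                  per_device_train_batch_size=4,
                                  num_train_epochs=1,
                                  learning_rate=3e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```
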

Natural Language Notebook

The U.S. court system is the nation’s arbiter of justice, tasked with the responsibility of ensuring equal protection under the law. But hurdles to information access obscure the inner workings of the system, preventing stakeholders – from legal scholars to journalists and members of the public – from understanding the state of justice in America at scale. There is an ongoing data access argument here: U.S. court records are public data and should be freely available. But open data arguments represent a half-measure; what we really need is open information. This distinction marks the difference between downloading a zip file containing a quarter-million case dockets and getting the real-time answer to a question like “Are pro se parties more or less likely to receive fee waivers?”

To help bridge that gap, we introduce a novel platform and user experience that provides users with the tools necessary to explore data and drive analysis via natural language statements. Our approach leverages an ontology configuration that adds domain-relevant data semantics to database schemas to provide support for user guidance and for search and analysis without user-entered code or SQL. The system is embodied in a “natural-language notebook” user experience, and we apply this approach to the space of case docket data from the U.S. federal court system.

Additionally, we provide detail on the collection, ingestion, and processing of the dockets themselves, including early experiments in the use of language modeling for docket entry classification, with an initial focus on motions.
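
As an illustration of the ontology-configuration idea, the sketch below attaches entity and attribute semantics to a raw docket schema and compiles an analysis request into SQL. The field names, configuration structure, and toy compiler are hypothetical stand-ins, not the platform’s actual format:

```python
# Hypothetical ontology configuration: a declarative layer that adds
# domain semantics to a raw database schema so the platform can guide
# users and run analyses without user-entered SQL.
ONTOLOGY = {
    "entities": {
        "Case": {
            "table": "dockets",
            "id_column": "case_id",
            "attributes": {
                "court": {"column": "court_abbrev", "type": "category"},
                "filing_date": {"column": "date_filed", "type": "date"},
                "pro_se": {"column": "pro_se_flag", "type": "boolean"},
                "fee_waiver_granted": {"column": "ifp_granted", "type": "boolean"},
            },
        },
    },
}

def compile_rate_question(entity: str, group_by: str, measure: str) -> str:
    """Toy compiler: turn an analysis request expressed in ontology terms
    (e.g., fee-waiver rate by pro se status) into SQL over the raw schema."""
    spec = ONTOLOGY["entities"][entity]
    g = spec["attributes"][group_by]["column"]
    m = spec["attributes"][measure]["column"]
    return (f"SELECT {g}, AVG({m}) AS rate "
            f"FROM {spec['table']} GROUP BY {g}")

print(compile_rate_question("Case", "pro_se", "fee_waiver_granted"))
# SELECT pro_se_flag, AVG(ifp_granted) AS rate FROM dockets GROUP BY pro_se_flag
```
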

Teaching