Marko Sterbentz

Computer Science PhD Candidate

About Me

I am a computer science PhD candidate at Northwestern University in the C3 Lab advised by Kristian Hammond. My research focuses on leveraging AI and large language models (LLMs) to automate data science processes, enabling users to ask questions of their data and receive meaningful, contextualized insights.

Specifically, this involves:

  • Modeling and representing data science knowledge so that the system understands the range of actions it can take to perform data analytics.
  • Building mechanisms and tools for extracting information from databases, which includes training and developing text-to-query models capable of answering atomic questions (a minimal sketch follows this list).
  • Developing reasoning and planning methods that utilize the available data analytics knowledge and tools to provide contextualized answers to users’ inquiries.
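
To make the text-to-query step concrete, here is a minimal, self-contained sketch using Python's built-in sqlite3 module. The schema, the example question, and the template lookup that stands in for a trained text-to-query model are all illustrative assumptions, not part of the actual system:

```python
# A toy illustration of answering an atomic question over a database.
# In the real system, a trained text-to-SQL model would map the question
# to a query; a stub lookup stands in for it here so the example runs as-is.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cases (id INTEGER PRIMARY KEY, court TEXT, year INTEGER, outcome TEXT);
    INSERT INTO cases VALUES
        (1, 'N.D. Ill.', 2020, 'settled'),
        (2, 'N.D. Ill.', 2021, 'dismissed'),
        (3, 'S.D.N.Y.', 2021, 'settled');
""")

def text_to_sql(question: str) -> str:
    """Stand-in for a trained text-to-query model (hypothetical)."""
    # An atomic question maps to a single executable query over the schema.
    templates = {
        "How many cases were filed in 2021?":
            "SELECT COUNT(*) FROM cases WHERE year = 2021",
    }
    return templates[question]

question = "How many cases were filed in 2021?"
answer = conn.execute(text_to_sql(question)).fetchone()[0]
print(f"{question} -> {answer}")  # -> 2
```
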
Download CV
Interests
  • Artificial Intelligence
  • Question Answering
  • Neurosymbolic AI
  • Language Generation
Education
  • PhD Computer Science

    Northwestern University

  • MS Computer Science

    University of Southern California

  • BS Computer Science

    Idaho State University

News
  • January 2026: The preprint for our paper on generating high-quality synthetic training data for text-to-SQL is available on arXiv.
  • September 2025: I finished my internship at IBM Research. Thank you to my mentors Michael Glass, Nhan Pham, and Shankar Subramanian for an excellent summer of research!
  • June 2025: I’m excited to be starting an internship at IBM Research for the summer working on LLMs for data research.
  • November 2024: I presented the Satyrn paper at EMNLP 2024.
  • September 2024: Our paper Satyrn: A Platform for Analytics Augmented Generation was just accepted to EMNLP 2024. See you in Miami!
  • June 2024: I participated in the CASMI workshop “AI Safety: A Domain-Focused Approach to Anticipating Harm.” Read the full report.
Publications
(2026). RingSQL: Generating Synthetic Data with Schema-Independent Templates for Text-to-SQL Reasoning Models. Preprint.
(2024). Satyrn: A Platform for Analytics Augmented Generation. Empirical Methods in Natural Language Processing 2024 (EMNLP 2024 Main).
(2023). Lightweight Knowledge Representations for Automating Data Analysis. arXiv preprint arXiv:2311.12848.
(2023). Multi-domain Summarization from Leaderboards to Practice: Re-examining Automatic and Human Evaluation. Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM).
(2021). From Data to Information: Automating Data Science to Explore the US Court System. Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law (ICAIL 2021).
(2021). Requirements for Open Political Information: Transparency Beyond Open Data. arXiv preprint arXiv:2112.03119.
(2018). GPGPU Enabled Ray Directed Adaptive Volume Visualization for High Density Scans. Proceedings of the Practice and Experience on Advanced Research Computing (PEARC).

Experience

  1. Research Intern

    IBM Research
  2. Research Assistant

    Northwestern University
  3. Research Intern

    Lawrence Livermore National Laboratory
  4. Research Intern

    Idaho National Laboratory
Projects
Multi-Domain Text Summarization Evaluation

Existing literature does not give much guidance on how to build the best possible multi-domain summarization model from existing components. We present an extensive evaluation of popular pre-trained models on a wide range of datasets to inform the selection of both the model and the training data for robust summarization across several domains.

We find that fine-tuned BART performs better than T5 and PEGASUS, both on in-domain and out-of-domain data, regardless of the dataset used for fine-tuning. While BART has the best performance, it does vary considerably across domains. A multi-domain summarizer that works well for all domains can be built by simply fine-tuning on diverse domains. It performs better than an in-domain summarizer, even when using fewer total training examples.

While the success of such a multi-domain summarization model is clear through automatic evaluation, our human evaluation reveals variations that cannot be captured by any of the automatic evaluation metrics and are thus not reflected in standard leaderboards. Furthermore, we find that conducting reliable human evaluation can be complex as well: even experienced summarization researchers can be inconsistent with one another in their assessments of summary quality, and also with themselves when reannotating the same summary.

The findings of our study are two-fold. First, BART fine-tuned on heterogeneous domains is a great multi-domain summarizer for practical purposes. At the same time, we need to re-examine not just automatic evaluation metrics but also human evaluation methods to responsibly measure progress in summarization.
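
The “fine-tune on diverse domains” recipe above is straightforward to reproduce in outline. Below is a minimal sketch using the Hugging Face transformers and datasets libraries; the specific datasets, sample sizes, and hyperparameters are illustrative assumptions rather than the paper's actual setup:

```python
# Sketch: build a multi-domain summarizer by fine-tuning BART on a
# mixture of heterogeneous summarization datasets (here: news + legislation).
from datasets import load_dataset, concatenate_datasets
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Small illustrative slices from two different domains.
news = load_dataset("cnn_dailymail", "3.0.0", split="train[:2000]")
bills = load_dataset("billsum", split="train[:2000]")

def normalize(ds, src, tgt):
    """Map each dataset onto shared (document, summary) columns."""
    if src != "document":
        ds = ds.rename_column(src, "document")
    if tgt != "summary":
        ds = ds.rename_column(tgt, "summary")
    keep = ("document", "summary")
    return ds.remove_columns([c for c in ds.column_names if c not in keep])

mixed = concatenate_datasets([
    normalize(news, "article", "highlights"),
    normalize(bills, "text", "summary"),
]).shuffle(seed=0)

def preprocess(batch):
    inputs = tokenizer(batch["document"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = mixed.map(preprocess, batched=True,
                      remove_columns=["document", "summary"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="bart-multidomain",
                                  per_device_train_batch_size=4,
                                  num_train_epochs=1,
                                  learning_rate=3e-5),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```
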

Natural Language Notebook

The U.S. court system is the nation’s arbiter of justice, tasked with the responsibility of ensuring equal protection under the law. But hurdles to information access obscure the inner workings of the system, preventing stakeholders – from legal scholars to journalists and members of the public – from understanding the state of justice in America at scale. There is an ongoing data access argument here: U.S. court records are public data and should be freely available. But open data arguments represent a half-measure; what we really need is open information. This distinction marks the difference between downloading a zip file containing a quarter-million case dockets and getting the real-time answer to a question like “Are pro se parties more or less likely to receive fee waivers?”

To help bridge that gap, we introduce a novel platform and user experience that provides users with the tools necessary to explore data and drive analysis via natural language statements. Our approach leverages an ontology configuration that adds domain-relevant data semantics to database schemas to provide support for user guidance and for search and analysis without user-entered code or SQL. The system is embodied in a “natural-language notebook” user experience, and we apply this approach to the space of case docket data from the U.S. federal court system.

Additionally, we provide detail on the collection, ingestion, and processing of the dockets themselves, including early experiments in the use of language modeling for docket entry classification, with an initial focus on motions.
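
As an illustration of the ontology-configuration idea, the sketch below attaches entity and attribute semantics to a raw docket schema and compiles an analysis request into SQL. The field names, configuration structure, and toy compiler are hypothetical stand-ins, not the platform’s actual format:

```python
# Hypothetical ontology configuration: a declarative layer that adds
# domain semantics to a raw database schema so the platform can guide
# users and run analyses without user-entered SQL.
ONTOLOGY = {
    "entities": {
        "Case": {
            "table": "dockets",
            "id_column": "case_id",
            "attributes": {
                "court": {"column": "court_abbrev", "type": "category"},
                "filing_date": {"column": "date_filed", "type": "date"},
                "pro_se": {"column": "pro_se_flag", "type": "boolean"},
                "fee_waiver_granted": {"column": "ifp_granted", "type": "boolean"},
            },
        },
    },
}

def compile_rate_question(entity: str, group_by: str, measure: str) -> str:
    """Toy compiler: turn an analysis request expressed in ontology terms
    (e.g., fee-waiver rate by pro se status) into SQL over the raw schema."""
    spec = ONTOLOGY["entities"][entity]
    g = spec["attributes"][group_by]["column"]
    m = spec["attributes"][measure]["column"]
    return (f"SELECT {g}, AVG({m}) AS rate "
            f"FROM {spec['table']} GROUP BY {g}")

print(compile_rate_question("Case", "pro_se", "fee_waiver_granted"))
# SELECT pro_se_flag, AVG(ifp_granted) AS rate FROM dockets GROUP BY pro_se_flag
```
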

Teaching