Building an evaluation set to benchmark code-generation LLMs at SnT: graded on whether the code actually runs, with the observability to trust the numbers.
Code-generation models are easy to demo and hard to trust. The question the group cared about was simpler than a leaderboard: how often does a model's output actually run and pass real tests, and how much does that change when you reword the prompt or swap the architecture? Answering it needs an evaluation set you can rerun and reason about, not a single accuracy figure.
A benchmark is only worth something if it's reproducible. A number that drifts between runs, or that can't be traced back to which tasks failed and why, tells you nothing. So the work had to be measurable, repeatable, and debuggable from the start.
I built an evaluation set for code-generation LLMs on top of CodeBenchGen, turning model output into executable tasks with their own test harnesses so generated code could be graded on whether it ran and passed its tests rather than on how plausible it read. I measured correctness, runtime efficiency, and performance across static and dynamic settings, and tracked how stable models stayed as prompts and architectures changed. Around the runs I set up the observability (per-task pass rates, failure modes, and run-to-run variance, all logged) so any headline number could be traced back to the tasks behind it. I also worked on ML-based code authorship attribution for provenance and IP protection.
The result was a reproducible benchmark and the evaluation pipeline behind it, built alongside PhD researchers and feeding into published research with the TRUX group.
Built on top of CodeBenchGen, the eval set turns a model's output into an executable task with its own test harness, so generated code is judged on whether it compiles, runs, and passes the tests, not on how plausible it reads. That made it possible to compare models on functional correctness rather than on surface plausibility.
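To make the "compiles, runs, and passes" grading concrete, here is a minimal sketch of an execution-based check. It is not CodeBenchGen's harness or the pipeline used in this project: the names `TaskResult` and `run_task`, the outcome labels, and the convention that tests are plain `assert` statements appended to the candidate are all assumptions made for illustration. It also runs code in a plain subprocess with no sandboxing, which a real evaluation of untrusted model output would need.

```python
import subprocess
import sys
import tempfile
import time
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TaskResult:
    task_id: str
    outcome: str    # "pass", "fail", "syntax_error", or "timeout" (hypothetical labels)
    seconds: float  # wall-clock time of the run, 0.0 if it never started


def run_task(task_id: str, candidate: str, tests: str, timeout: float = 10.0) -> TaskResult:
    """Grade one generated solution by executing it against its tests."""
    # Step 1: does the candidate compile at all?
    try:
        compile(candidate, f"{task_id}.py", "exec")
    except (SyntaxError, ValueError):
        return TaskResult(task_id, "syntax_error", 0.0)

    # Step 2: run candidate + tests in a separate interpreter process.
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "run_task.py"
        # Assumption: tests are assert statements that follow the candidate code.
        script.write_text(candidate + "\n\n" + tests + "\n", encoding="utf-8")
        start = time.perf_counter()
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return TaskResult(task_id, "timeout", timeout)
        elapsed = time.perf_counter() - start

    # Step 3: a zero exit code means every assertion held.
    outcome = "pass" if proc.returncode == 0 else "fail"
    return TaskResult(task_id, outcome, elapsed)
```

Keeping the outcome as a small set of labels, rather than a single boolean, is what lets failure modes be counted separately later: a model that mostly times out has a different problem from one that mostly produces syntax errors.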
Graded model output across static and dynamic settings: whether it passed, how fast it ran, and how it held up as tasks got harder.
Measured how much a model's results moved when prompts were reworded or the architecture changed, separating real capability from prompt luck.
Logged per-task pass rates, failure modes, and run-to-run variance so every headline number could be traced back to the specific tasks behind it, and the benchmark stayed reproducible between runs.
Worked on ML-based attribution of code authorship, aimed at provenance and protecting intellectual property.
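The per-task logging and run-to-run variance described above can be sketched as a small aggregation over logged results. This is an illustrative sketch, not the project's logging code: the record shape `(run_id, task_id, passed)`, the function name `summarize`, and the "flaky task" definition (a task that passes in some runs but not all) are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize(records):
    """Aggregate logged results.

    records: iterable of (run_id, task_id, passed) tuples (hypothetical schema).
    """
    by_task = defaultdict(list)
    by_run = defaultdict(list)
    for run_id, task_id, passed in records:
        score = 1.0 if passed else 0.0
        by_task[task_id].append(score)
        by_run[run_id].append(score)

    # Pass rate of each task across all runs.
    per_task = {task: mean(scores) for task, scores in by_task.items()}
    # Pass rate of each run across all tasks.
    run_rates = [mean(scores) for scores in by_run.values()]

    return {
        "per_task_pass_rate": per_task,
        # Assumes every run covers the same tasks, so runs weigh equally.
        "overall_pass_rate": mean(run_rates),
        # Population standard deviation of the per-run pass rates.
        "run_to_run_std": pstdev(run_rates),
        # Tasks whose result changed between runs.
        "flaky_tasks": sorted(t for t, rate in per_task.items() if 0.0 < rate < 1.0),
    }
```

For example, with two runs over tasks `a` and `b`, where `b` fails only in the first run, the summary reports an overall pass rate of 0.75, a run-to-run standard deviation of 0.25, and `b` as the one task responsible for the drift. The same aggregation could be keyed by prompt variant instead of run to show how far results move when a prompt is reworded.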