The Lab

Experiments at the frontier of AI.

Research is where we keep the edge sharp. We run focused experiments, build strange tools when the standard ones stop helping, and publish work that has been cited by teams at Google Brain, AWS, Stanford, and NYU.

Experiments

2026 · Active Research

Code Policy Models

Agentic AI · Reinforcement Learning · Program Synthesis · Multimodal Evaluation

An AI research project exploring whether large language models can serve as effective optimizers for game-playing agents by iteratively writing and refining Python code as the policy representation. Human-readable programs replace neural network weights, and LLM-guided code editing replaces gradient descent. The system runs an evolutionary loop: in each generation, the LLM produces candidate policy edits, which are evaluated through parallel rollouts in a target environment (currently Pokémon Blue running on a headless Game Boy emulator), with optional Gemini video analysis providing multimodal feedback on agent behavior. A tournament selection mechanism pits multiple LLM-generated candidates against an elite policy to balance exploration with stability, and the full rollout trajectory and reward signal are fed back as context for the next generation's edits.
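The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the project's implementation: `propose_edits` is a hypothetical stand-in for the LLM call (it perturbs a single number rather than editing code), `rollout` is a made-up scoring function in place of the emulator, and all names are invented for the example.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# The policy is plain Python source text; only the gain constant varies here.
POLICY_TEMPLATE = "def act(obs):\n    return {gain!r} * obs\n"


def compile_policy(source: str) -> Callable[[float], float]:
    """Turn policy source text into a callable `act` function."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["act"]


def rollout(source: str) -> float:
    """Toy environment (assumption): reward peaks when act(obs) == 2 * obs."""
    act = compile_policy(source)
    return -sum((act(obs) - 2.0 * obs) ** 2 for obs in (1.0, 2.0, 3.0))


def propose_edits(elite_gain: float, feedback: str, n: int,
                  rng: random.Random) -> List[float]:
    """Stand-in for the LLM (assumption): a real system would read `feedback`
    and rewrite the code; this stub ignores it and perturbs the gain."""
    return [elite_gain + rng.uniform(-0.5, 0.5) for _ in range(n)]


def evolve(generations: int = 30, candidates: int = 4,
           seed: int = 0) -> Tuple[float, float]:
    rng = random.Random(seed)
    elite_gain = 0.0
    elite_score = rollout(POLICY_TEMPLATE.format(gain=elite_gain))
    feedback = ""
    with ThreadPoolExecutor() as pool:
        for _ in range(generations):
            gains = propose_edits(elite_gain, feedback, candidates, rng)
            sources = [POLICY_TEMPLATE.format(gain=g) for g in gains]
            scores = list(pool.map(rollout, sources))  # parallel rollouts
            best_score, best_gain = max(zip(scores, gains))
            # Tournament against the elite: the best challenger replaces it
            # only when it scores strictly higher, so the elite never regresses.
            if best_score > elite_score:
                elite_gain, elite_score = best_gain, best_score
            # Reward signal carried forward as context for the next generation.
            feedback = f"elite gain={elite_gain:.3f} score={elite_score:.3f}"
    return elite_gain, elite_score
```

Calling `evolve()` returns the final elite's gain and score. Keeping the elite unless a challenger beats it is what gives the loop stability, while the number of candidates per generation controls how much it explores.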

Read article→

2026 · Active Research

Code Language Models

Code Models · Agentic Optimization · Evaluation · Interpretability

A research project exploring whether LLMs can act as optimizers for language models by writing and refining Python code as the model itself. Human-readable rules updated by LLM-guided code edits replace learned weights updated by gradient descent. The system runs a multi-agent optimization loop: a planner agent reviews past results and proposes improvement ideas, parallel improver agents implement each idea on isolated branches, and an integrator agent evaluates the results and merges the best performers back into the main line. A constraint scanner enforces that the model stays purely rule-based: no neural networks, no corpus statistics, no learned parameters.
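One way a constraint scanner of this kind could work is by statically inspecting each candidate's source before it is merged. The sketch below is an assumption about the approach, not the project's scanner: the `BANNED_MODULES` list, the `scan_source` name, and the rule that flags `open()` calls as possible corpus access are all hypothetical.

```python
import ast
from typing import List

# Hypothetical deny-list: libraries that would bring in learned parameters.
BANNED_MODULES = {"torch", "tensorflow", "jax", "sklearn", "numpy"}


def scan_source(source: str) -> List[str]:
    """Return a list of rule violations found in a candidate's source code."""
    violations: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            modules = []
        for module in modules:
            root = module.split(".")[0]
            if root in BANNED_MODULES:
                violations.append(
                    f"line {node.lineno}: import of banned module '{root}'"
                )
        # Assumed rule: reading files could smuggle in corpus statistics.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "open"):
            violations.append(f"line {node.lineno}: file access via open()")
    return violations
```

An integrator step could call `scan_source` on each improver branch and refuse to merge any candidate that returns a non-empty list. A static check like this is easy to audit but can be evaded by dynamic imports, so it would likely be paired with other safeguards.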

Read article→

2025

LucentBench

Financial Intelligence · Benchmarking · LLM Evaluation · RAG

A benchmarking framework for evaluating AI performance on financial intelligence tasks. It is designed to rigorously test how well language models handle real-world fund management scenarios, from research synthesis to portfolio analysis.

Read article→

2023

GPT CLI

Developer Tools · LLM Interfaces · CLI · Open Source

An early open-source command-line interface for ChatGPT, enabling developers to interact with GPT models directly from their terminal.

Read article→

2022

Halo Lang

Programming Languages · Compiler Design · Developer Experience · Systems

An experimental programming language project exploring novel approaches to language design and compilation.

Read article→

2022

Torchwindow

Computer Vision · PyTorch · Model Observability · ML Tooling

An open-source ML visualization library for PyTorch, providing real-time training visualization and debugging tools for machine learning engineers.

Read article→

2020

Temporal Probability Calibration

Probability Calibration · Temporal Models · Forecasting · Uncertainty

Published research on calibrating probabilistic predictions over time, cited by Google Brain, AWS, Stanford, NYU, and other leading institutions.

Read article→