Lab Article•2026

Code Policy Models

Code Policy Models is an AI research project exploring whether large language models can serve as effective optimizers for game-playing agents by iteratively writing and refining Python code as the policy representation. Human-readable programs replace neural network weights, and LLM-guided code editing replaces gradient descent. The system runs an evolutionary loop: in each generation, the LLM produces candidate policy edits, which are evaluated through parallel rollouts in a target environment (currently Pokemon Blue running on a headless Game Boy emulator), with optional Gemini video analysis providing multimodal feedback on agent behavior. A tournament selection mechanism pits multiple LLM-generated candidates against an elite policy to balance exploration with stability, and the full rollout trajectory and reward signal are fed back as context for the next generation's edits.
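The loop described above can be illustrated with a minimal sketch. It is not the project's implementation: `toy_propose` is a hypothetical stand-in for the LLM editor (it only nudges a constant), `toy_evaluate` is a hypothetical stand-in for emulator rollouts, and the function names, the hidden target of 7, and the tournament rule (the best candidate replaces the elite only if it scores strictly higher) are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Tuple


def load_policy(source: str) -> Callable[[int], int]:
    """Compile policy source text and return its `act` function."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["act"]


def toy_evaluate(source: str) -> float:
    """Hypothetical stand-in for rollouts: score is higher the closer
    act(step) stays to a hidden target value (assumed to be 7)."""
    act = load_policy(source)
    return -sum(abs(act(step) - 7) for step in range(5))


def toy_propose(source: str, feedback: str, rng: random.Random) -> str:
    """Hypothetical stand-in for the LLM: rewrite the constant in the policy.
    A real proposer would read `feedback`; this one ignores it."""
    current = int(source.rsplit(" ", 1)[1])
    return f"def act(step):\n    return {current + rng.choice([-2, -1, 1, 2])}"


def evolve(
    elite: str,
    evaluate: Callable[[str], float],
    propose: Callable[[str, str, random.Random], str],
    generations: int = 20,
    n_candidates: int = 4,
    seed: int = 0,
) -> Tuple[str, List[float]]:
    """Run an elitist tournament: candidates must beat the elite to replace it."""
    rng = random.Random(seed)
    elite_score = evaluate(elite)
    history = [elite_score]
    for _ in range(generations):
        # Reward signal fed back as context for the next round of edits.
        feedback = f"elite reward: {elite_score}"
        candidates = [propose(elite, feedback, rng) for _ in range(n_candidates)]
        scored = [(evaluate(c), c) for c in candidates]
        best_score, best = max(scored, key=lambda pair: pair[0])
        if best_score > elite_score:
            elite, elite_score = best, best_score
        history.append(elite_score)
    return elite, history


if __name__ == "__main__":
    start = "def act(step):\n    return 0"
    final_policy, scores = evolve(start, toy_evaluate, toy_propose)
    print(final_policy)
    print(scores)
```

Keeping the elite unless a challenger strictly beats it is what gives the loop its stability: a generation of poor edits costs evaluation time but never lowers the best known score.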

Code Policy Models asks whether an LLM can improve an agent by editing the agent's source code directly. Instead of training neural weights, the system treats a Python policy as the object being searched, revised, tested, and carried forward.

The current testbed uses parallel rollouts in Pokemon Blue running on a headless Game Boy emulator. Each candidate policy plays, receives a reward trace, and returns enough behavioral evidence for the next generation of edits.
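As a rough sketch of what a parallel rollout step could look like, the example below runs one policy across several seeds and condenses the results into text that could be handed to the next round of edits. Everything here is assumed for illustration: the emulator is replaced by a toy one-dimensional environment, and `Rollout`, `run_rollout`, `evaluate_parallel`, and `summarize` are hypothetical names. Threads are used to keep the example short; a real emulator would more plausibly run in separate processes.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    seed: int
    rewards: List[float]  # per-step reward trace
    actions: List[str]    # actions taken, kept as behavioral evidence

    @property
    def total(self) -> float:
        return sum(self.rewards)


def run_rollout(policy_source: str, seed: int, steps: int = 20) -> Rollout:
    """Toy stand-in for an emulator episode: the agent walks a line and is
    rewarded each time it reaches a position further right than ever before."""
    namespace: dict = {}
    exec(policy_source, namespace)
    act = namespace["act"]
    position = random.Random(seed).randint(0, 3)  # seed sets the start state
    best = position
    rewards: List[float] = []
    actions: List[str] = []
    for _ in range(steps):
        action = act(position)
        position += 1 if action == "right" else -1
        rewards.append(1.0 if position > best else 0.0)
        best = max(best, position)
        actions.append(action)
    return Rollout(seed, rewards, actions)


def evaluate_parallel(policy_source: str, seeds: List[int], max_workers: int = 4) -> List[Rollout]:
    """Run one rollout per seed concurrently; results keep the order of `seeds`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda s: run_rollout(policy_source, s), seeds))


def summarize(rollouts: List[Rollout]) -> str:
    """Condense reward traces and recent actions into text for the next edit."""
    mean = sum(r.total for r in rollouts) / len(rollouts)
    lines = [
        f"seed={r.seed} total_reward={r.total:.1f} last_actions={r.actions[-3:]}"
        for r in rollouts
    ]
    return f"mean_reward={mean:.2f}\n" + "\n".join(lines)


if __name__ == "__main__":
    always_right = "def act(position):\n    return 'right'"
    print(summarize(evaluate_parallel(always_right, seeds=[0, 1, 2])))
```

Returning the action history alongside the reward trace matters because a scalar score alone rarely tells the editor why a policy failed, whereas a short record of what the agent actually did often does.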

The interesting part is not nostalgia for an old game. It is whether code can become a practical policy representation: inspectable, debuggable, versioned, and shaped by language-model feedback rather than by gradient descent.
