June 24, 2025
When AI Writes All the Code: Quality Gates and Context That Actually Work
Over the past few months, I’ve significantly ramped up my use of LLM tools for writing software, both to acutely feel the shortcomings myself and to start systematically filling in the gaps.
I think everyone has experienced the amazement of one-shotting an impressive demo and the frustration of watching most coding “agents” fall apart once a project grows beyond trivial size and complexity.
Back in March, I shared my experiments with a Next.js starter project designed as an AI-friendly development environment—with rigorous TypeScript, extensive ESLint rules, and explicit coding guidelines. The key finding wasn’t that AI couldn’t write tests or implement features, but that it consistently prioritized generating implementation code first, then backfilling tests—despite clear TDD instructions. AI would also selectively follow some constraints while ignoring others as context degraded. I concluded that current AI tools are “code generation engines, not participants in disciplined development processes.”
That experience taught me something crucial: comprehensive rules weren’t enough. AI needed more than instructions—it needed systematic constraints and evolving context based on real failure patterns.
The Core Challenge
If I could summarize the challenge simply, it would be this: while humans learn and carry over experience, an AI coding agent starts from scratch with each new ticket or feature. So we need to find a way to help the agent “learn” (or at least improve). I’ve addressed this with two key pieces:
- Systematic constraints that prevent AI failure modes
- Comprehensive context that teaches AI to write better code from the first attempt (or at least with fewer iterations)
What’s Different This Time
I’m now at a place where I really want to share this with others to get feedback, start a conversation, and maybe even help one or two people. In that vein, I’m sharing a TypeScript project (although I believe the techniques apply broadly).
The project includes comprehensive quality gates:
- Custom ESLint rules that make architectural violations impossible
- Mutation testing to catch “coverage theater”
- Validation everywhere (AI doesn’t understand trust boundaries)
The toolchain: ESLint + Prettier + TypeScript + Zod + dependency-cruiser + Stryker + ...
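To make the first bullet concrete, here's a minimal sketch of what such a custom rule can look like (not one of the project's actual rules; the single-config-module convention is purely illustrative). It forbids reading process.env anywhere except one validated config module, so that particular architectural violation fails lint instead of code review.

```typescript
// Hypothetical custom ESLint rule (illustrative only): disallow direct
// process.env access so all configuration flows through one validated module.
import type { Rule } from "eslint";

const noDirectEnvAccess: Rule.RuleModule = {
  meta: {
    type: "problem",
    messages: {
      noEnv: "Read settings through the validated config module, not process.env.",
    },
    schema: [],
  },
  create(context) {
    return {
      MemberExpression(node) {
        // Flag every access to `process.env`; the one permitted file
        // (e.g. src/config.ts, a made-up path) would switch the rule off
        // via an override in the flat config.
        if (
          node.object.type === "Identifier" &&
          node.object.name === "process" &&
          node.property.type === "Identifier" &&
          node.property.name === "env"
        ) {
          context.report({ node, messageId: "noEnv" });
        }
      },
    };
  },
};

export default noDirectEnvAccess;
```

The specific rule matters less than the pattern: whenever the agent repeats a bad habit, the habit becomes a lint error.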
What makes this different from my Next.js experiment is the focus on constraint-driven development with multiple feedback loops:
- `pnpm ai:quick` (<5 seconds): Immediate type and lint feedback
- `pnpm ai:check` (<30 seconds): Comprehensive validation
- `pnpm ai:compliance`: Full quality pipeline including mutation testing
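To give a sense of how such tiers fit together, they could be wired up in package.json roughly like this. The script names and time budgets come from above; the commands behind them (including Vitest as the test runner) are a plausible guess, not the project's actual scripts.

```json
{
  "scripts": {
    "ai:quick": "tsc --noEmit && eslint . --cache",
    "ai:check": "pnpm ai:quick && prettier --check . && vitest run && depcruise --config .dependency-cruiser.cjs src",
    "ai:compliance": "pnpm ai:check && stryker run"
  }
}
```

The important property is the ordering: the cheap checks run constantly, and the expensive mutation-testing pass only runs once everything cheaper is green.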
The Real Game Changer
I think what’s worked best is systematic context refinement. When I notice patterns in AI failures or inefficiencies, I have it reflect on those issues and update the context it receives (`AGENTS.md`, `CLAUDE.md`, cursor rules). The guidelines have evolved based on actual mistakes, creating a systematic approach that reduces iteration cycles.
But here’s where it gets interesting: I’m using the AI agents themselves to build better guardrails. When code fails a quality check, I have the AI analyze what went wrong and suggest new ESLint rules or validation patterns. When I spot architectural drift, the AI helps codify the pattern into dependency-cruiser rules. It’s AI improving the constraints on AI.
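As an illustration of what codifying architectural drift can look like, here is a single hypothetical dependency-cruiser rule; the src/domain and src/infrastructure layer names are placeholders, not the project's actual structure.

```js
// Hypothetical .dependency-cruiser.cjs rule; the layer names below are
// placeholders for whatever boundary the AI keeps drifting across.
/** @type {import('dependency-cruiser').IConfiguration} */
module.exports = {
  forbidden: [
    {
      name: "domain-not-to-infrastructure",
      severity: "error",
      comment: "Domain logic must not import infrastructure details directly.",
      from: { path: "^src/domain" },
      to: { path: "^src/infrastructure" },
    },
  ],
};
```

Once a rule like this exists, the drift the AI introduced last week becomes a build failure this week.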
Even better—after each session, I have the AI update the context documents based on what it learned. Did it waste time on a particular pattern? Update `AGENTS.md`. Did it miss a security consideration? Add it to the guidelines. The next AI agent gets all this accumulated wisdom.
This addresses a fundamental asymmetry: humans get better at a codebase over time, but AI starts fresh every time. By capturing and refining project wisdom based on real failure patterns, we give AI something closer to institutional memory.
Does It Actually Work?
The results have been surprisingly good. Unlike my March experiments where AI would skip TDD or selectively follow rules, this approach doesn’t give it a choice. Shortcuts simply don’t work anymore.
AI agents now write tests that actually validate behavior (mutation testing won’t let them fake it), follow architectural patterns (ESLint and dependency-cruiser enforce it), and handle validation consistently (Zod schemas everywhere).
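The Zod part is cheap to demonstrate. A minimal sketch of validation at a trust boundary, with the schema and field names invented for illustration:

```typescript
import { z } from "zod";

// Parse untrusted input at the boundary instead of casting it.
// The schema and field names below are made up for illustration.
const CreateUserInput = z.object({
  email: z.string().email(),
  displayName: z.string().min(1).max(64),
});

type CreateUserInput = z.infer<typeof CreateUserInput>;

export function parseCreateUserInput(raw: unknown): CreateUserInput {
  // Throws a descriptive ZodError if the payload doesn't match the schema,
  // so nothing downstream ever sees unvalidated data.
  return CreateUserInput.parse(raw);
}
```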
But the real win is the feedback loop. Instead of me manually updating instructions after every failure, I have AI do the reflection and documentation. Each coding session literally makes the next one better.
The key insight? We can’t make AI be disciplined, but we can make discipline the only option. And when we use AI itself to evolve the constraints and context, we get something that actually improves over time.
This is still an experiment, and I’d love feedback, particularly from those who are skeptical! Does this approach scale? Are there failure modes I haven’t encountered yet? What constraints have you found effective?