100x AI Coding for Real Work: Summon The Ghost Army

The Goal: Warp Speed Experimentation

For the past 12 weeks I’ve been using AI for coding & product development at work while building a mini “agent company” on the weekends. The mission? Figure out how to experiment 100x faster in real codebases. Or, as we call it at work, “summon the ghost army” (a LOTR reference).

Last week, after tens of failed attempts, that goal became a reality. I achieved 100x speed twice on tasks within a large, unfamiliar codebase. What would have taken me roughly four weeks (approximately 336 hours) took 3.5 hours. It’s still early days: repeatability is fragile, and the process remains largely manual. But I wanted to start “working in public” and share what feel like breakthroughs for me.

Why 100x? The Quest For Ultimate Value Creation

Over the last decade at Amazon, WhiteBox, Microsoft, Leanrr, Asgard Analytics, SigFig, and now SeekOut, I’ve been trying to find the ultimate value-creation process. Key insight: building is the least efficient way to experiment, but it is by far the most effective. So if you can bring the cost of experimentation down to effectively zero through speed, who does the power flow to next?

I think power flows to those who can create accurate mental models of distribution channels and value creation by actively “poking” the market with rapid experiments. Even with a hypothetical Artificial Super Intelligence (ASI) building products, progress would be rate-limited by the number of humans willing to dedicate their attention and time to each experiment. Barring a near-perfect simulation of human thought and reaction, this remains an inherent limitation. My belief is that this effective upper bound on experimentation arrives in the next one to two years.

Background: Experience Meets Experimentation

About me: I’ve been coding since I was 10 and have 12 years of applied science experience. I started by working on deep belief networks, moved on to deep learning, and now keep up with research by experimenting with OSS models. I vividly recall the day Google released “Attention Is All You Need.” My co-founder and I, like many others, wished we had the connections and funding to scale it up immediately.

All that to say, this isn’t a surface-level understanding. Rather, my approach has been to use a deep understanding of AI models’ capabilities and limitations to systematically eliminate the most time-consuming parts of my existing product and engineering workflow.

It’s been crucial to understand benchmarks on release, read the research on emerging capabilities, and most importantly, experiment extensively to discover what truly works. To learn, I’ve used these two codebases:

  • GoldTeam.ai (768k tokens): A product my team built at SeekOut.
  • Leanrr (1.2M tokens): My health & fitness app with iOS and web versions.

Both are large, complex codebases with infrastructure, integrations, async background-task processing, and so on, representative of the sophisticated systems I’ve worked on in big tech.

Understanding the Evolution of AI Capabilities

To grasp how we’ve arrived at this point, it’s helpful to see how AI model generations have unlocked new approaches and how the progress has been exponential:

  • GPT-3: Limited “IQ” meant handling only “paragraph-sized” tasks bottom-up, with a lot of algorithmic help to post-process the result; “function-level” code editing.
  • GPT-4: The complexity of context and task jumped to “one-pager-sized” writing tasks and “class-level” editing.
  • Claude 3.5 Haiku: Optimized for coding, so it could handle complicated tasks; 3-5 file code edits.
  • o1 pro: A jump in the complexity of planning and reasoning; “5-6 page essay” tasks, 10-20 file coding tasks.
  • Claude 3.7 Sonnet: Huge jump in IQ, creativity, planning, and coding. The takeoff point, though its ~100k effective context window still means a lot of careful context planning, as with o1 pro.
  • Gemini 2.5 Pro: A ~600k-token effective context window means project-level planning and editing is now possible; a game changer for real work.

The Key Insight: Savants, Not Geniuses

Every day I see 10+ posts on LinkedIn and Twitter trashing AI or expressing disappointment with it. This has puzzled me deeply. I think people expect the models to be something they can simply dump a task into and get a perfect response back, and that’s the disconnect.

Unfortunately, they are nowhere near that good yet. The models, even the most advanced ones like Gemini 2.5 Pro and o3, should be thought of as savants, NOT geniuses. They need carefully designed environments and guardrails to operate in to achieve success.

Why the Savant Analogy Works

Think of AI as a gifted specialist who’s brilliant in specific domains but lacks broad context and taste outside of them. Just as you wouldn’t ask a mathematical savant to wing a business presentation without preparation, you can’t expect AI to excel without the right framework. You have to carefully design the environment for it to succeed. And once you do, new models with careful prompting will exceed you not only in speed but also in quality. The right questions, with the right context, get you the right answers.

The Guardrails That Make It Work

1. Context Is Everything, But Context Windows Are Limited

The biggest misconception is that you can just dump your entire problem into ChatGPT and get a solution. In reality, even Gemini Pro 2.5’s stated 1M-token limit has an “effective context” of around 600k tokens; after that point, quality degrades significantly. Claude 3.7 Sonnet similarly has a much smaller effective context window than its stated limit.

My solution (a rough sketch in code follows this list):

  • Use Repo Prompt to somewhat carefully select only the relevant files
  • Create detailed architecture documentation that becomes part of every AI interaction
  • Build context hierarchically, not all at once
  • Refine context selection by leveraging the architectural layout from Step 0 below
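
To make the budget idea concrete, here’s a minimal Python sketch of packing files into context in priority order under an effective-context cap. It’s illustrative only: the ~600k-token budget, the priority ordering, and the tiktoken tokenizer are my assumptions, not how Repo Prompt works internally.

```python
# Hypothetical sketch: pack the most relevant files into a ~600k-token budget.
# Assumes `tiktoken` for counting; any tokenizer approximation would do.
from pathlib import Path
import tiktoken

EFFECTIVE_CONTEXT = 600_000          # observed "effective" limit, not the stated 1M
enc = tiktoken.get_encoding("cl100k_base")

def pack_context(files_by_priority: list[str], budget: int = EFFECTIVE_CONTEXT) -> str:
    """Concatenate files in priority order until the token budget is exhausted."""
    chunks, used = [], 0
    for path in files_by_priority:           # e.g. ordered via the Step 0 architecture map
        text = Path(path).read_text(errors="ignore")
        tokens = len(enc.encode(text))
        if used + tokens > budget:
            print(f"skipping {path}: would exceed budget ({used + tokens} > {budget})")
            continue
        chunks.append(f"### FILE: {path}\n{text}")
        used += tokens
    print(f"packed {len(chunks)} files, ~{used} tokens")
    return "\n\n".join(chunks)
```

The point is the budget discipline, not the particular tokenizer: once you know your effective context, everything upstream (file selection, architecture docs) gets designed to fit inside it.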

Future Enhancement: I’m working on using Joern and TreeSitter (as aider does) to create code property graphs for automatic context loading. This unlocks powerful runtime and static context analysis, closer to how a human mind maps a codebase.
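
As a toy illustration of the idea (not the Joern/TreeSitter pipeline itself), here’s a sketch that builds a crude “who calls what” map for a Python codebase with the standard library’s ast module; a real code property graph adds types, data flow, and cross-language support on top of this.

```python
# Toy stand-in for a code property graph: map each function to the names it calls.
# A real Joern/TreeSitter pipeline would be far richer; this just shows the shape.
import ast
from collections import defaultdict
from pathlib import Path

def build_call_map(root: str) -> dict[str, set[str]]:
    calls = defaultdict(set)
    for path in Path(root).rglob("*.py"):
        tree = ast.parse(path.read_text(errors="ignore"), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                for call in ast.walk(node):
                    if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                        calls[f"{path}:{node.name}"].add(call.func.id)
    return calls

# Usage idea: given a target function, pull its callees' files in as context candidates.
```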

2. Personas and Constraints Create Quality

Instead of generic prompts, I use specific personas that trigger significantly better outputs:

"Act as Linus Torvalds and John Carmack debating the best implementation..."

Like a ritual from horror movies: “I SUMMON THE POWER OF LINUS TORVALDS”. Linus Torvalds works so well because he has three decades of being an asshole online, which neutralizes the sycophancy built into models. You get opinionated, no-nonsense code and commentary.

Combine this with constraints that define not just who to code as, but how to code and what steps to take when an approach keeps failing (see the sketch after this list), e.g.:

  • Follow test-driven development (TDD).
  • Implement in specific order: e.g. stubs → pseudocode → data layer → business logic → FE.
    • Speed with AI means the optimal path to a solution isn’t always the most direct one or what you’d traditionally do, much like a scooter rider might take a longer path faster than a walker.
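
Here’s a minimal sketch of how persona and constraints compose into a single system prompt. The helper and the exact rule wording are mine, purely for illustration; in practice these live as saved prompts in Repo Prompt and change per task.

```python
# Hypothetical prompt composer: persona + coding constraints in one system prompt.
PERSONAS = {
    "carmack_torvalds": (
        "Act as Linus Torvalds and John Carmack debating the best implementation. "
        "Be blunt, opinionated, and allergic to unnecessary complexity."
    ),
}

CONSTRAINTS = [
    "Follow test-driven development: write the failing test first.",
    "Implement in order: stubs -> pseudocode -> data layer -> business logic -> frontend.",
    "If an approach fails twice, stop and propose an alternative before continuing.",
]

def build_system_prompt(persona_key: str, extra_rules: list[str] | None = None) -> str:
    rules = CONSTRAINTS + (extra_rules or [])
    numbered = "\n".join(f"{i + 1}. {rule}" for i, rule in enumerate(rules))
    return f"{PERSONAS[persona_key]}\n\nNon-negotiable rules:\n{numbered}"

print(build_system_prompt("carmack_torvalds"))
```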

Future Enhancement: Here I’m working on a “meta-analysis” agent that tracks the time it takes to complete a task and then samples the content to refine the “approach” rules. (For now, that agent is mostly me. 🙂 )

3. Iterative Planning, Not One-Shot Execution

The “genius” approach: Give AI a request and expect complete implementation.

The “savant” approach:

  1. First, create a plan by providing ALL relevant context.
  2. Break it into discrete tasks using tools like Taskmaster-AI.
  3. Some tasks are one-shotted with Repo Prompt. For most tasks, Augment Code executes one task at a time, because it is currently the state of the art for large-codebase engineering and, unlike Repo Prompt, can execute tools like git and the build.
  4. Critical discriminator step: After each change, have Carmack and Torvalds independently grade the git diff of the commit (and the whole branch) on an F-to-A+ rubric.
  5. Review and correct at each checkpoint.

My Complete Workflow

Below is a typical feature implementation workflow in more detail.

Pro-tip: Install the “Office Viewer” VSC extension so you can see Mermaid diagrams and other embedded content in your .md architecture files as you work in VSC.

Step 0: One-Time Architecture Mapping

Intuition: As an engineer, you spend months, sometimes a year, truly understanding a large codebase. This step aims to replicate that deep understanding for the agent in about 4-8 hours using a bottom-up architectural mapping process.

Process

  • Explicitly tell Claude Code to spin off an Agent tool to read each file in your codebase and return the relevant documentation & Mermaid charts, then combine them into one large .md file (a rough scripted sketch follows this list).
    • Each agent has its own context window, allowing a comprehensive bottom-up analysis that’s expensive but detailed.
  • You can go file by file, folder by folder, or whatever bottom-up construct you feel comfortable paying for; it’s a trade-off of detail vs. cost.
    • You’ll be creating a more detailed architecture per idea anyway, so it’s a matter of time upfront vs. later.
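
For intuition, here’s a stripped-down sketch of what that bottom-up pass amounts to if you scripted it yourself against the Anthropic API instead of driving it through Claude Code’s Agent tool. The model name, the file glob, and the prompt wording are placeholders.

```python
# Sketch of a bottom-up architecture pass: one model call per file, results
# concatenated into architecture.md. Claude Code's Agent tool does this with
# per-agent context windows; this is just the scripted equivalent.
from pathlib import Path
import anthropic  # assumes ANTHROPIC_API_KEY is set in the environment

client = anthropic.Anthropic()
PROMPT = ("Summarize this file's responsibilities, key data structures, and external "
          "dependencies, then add a small Mermaid diagram of its main flow.\n\n{code}")

def map_codebase(root: str, out: str = "architecture.md",
                 model: str = "claude-3-7-sonnet-latest") -> None:
    sections = []
    for path in sorted(Path(root).rglob("*.ts")):      # adjust the glob(s) to your stack
        code = path.read_text(errors="ignore")
        msg = client.messages.create(
            model=model,
            max_tokens=1500,
            messages=[{"role": "user", "content": PROMPT.format(code=code)}],
        )
        sections.append(f"## {path}\n\n{msg.content[0].text}")
    Path(out).write_text("\n\n".join(sections))
```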

Outcome: A comprehensive architecture document that forms the foundation for all subsequent AI interactions. You can keep this document up to date by hooking Claude Code into your build pipeline so it refreshes the doc on each merge.

Step 1: Deciding What To Do & Why

This part is still more art than science, but it’s part of my “AI company” lineup of agents. It’s outside the scope of this article, but roughly:

  • Context Gathering: MCPs connected to email, Slack, Confluence, Notion to gather business context, user feedback, and existing documentation.
  • Deep Research: Employ o3 deep research and Perplexity for deep dives into market trends, competitive analysis, scientific research, and state-of-the-art approaches in product, science, and marketing relevant to what you’re building.
  • Ideation & Framing: Use TalkToFigma (MCP for I/O) to brainstorm and visualize ideas. Run ideas through strategic frameworks like Teresa Torres’ Opportunity Solution Trees (OSTs), Brian Balfour’s Four Fits (Market/Product, Product/Channel, Channel/Model, Model/Market), and so on.
  • Impersonation: My lineup is usually Teresa Torres, Sachin Rekhi, Brian Balfour, Steve Jobs, & Elon Musk, but I’m constantly changing these depending on the task at hand. You can subtract and add characteristics into one persona or have them work together.
    • o3 Prompt: “Who are the world-class experts for this task?”
    • Vanilla agents always perform worse than impersonations of people with large digital footprints.

Outcome: Clear strategic direction and defined objectives.

Step 2: Capturing The Idea

Once you have a clear outcome and approach, the idea dump and structuring phase begins:

  • Voice Dump: Speak your raw thoughts into a transcription tool. (I use a Limitless Pendant for this and its “star” feature helps mark key idea dumps.)
  • Structure with AI: Feed the transcript into Gemini Pro 2.5 to restructure the dump into coherent writing. Review with Claude 3.7 for creativity.
  • Refine with AI: Some key questions to ensure you’re working in the Problem Space instead of the Solution Space:
    • “Reverse engineer the problem I’m trying to solve based on this dump.”
    • “Take a step back. What open-ended questions should I be asking myself before diving into implementation?”
  • Output: Generate a [feature-name]-idea.md file summarizing the core concept, goals, and key considerations.

Step 3: Detailed Context Generation

If you’re building a new page or feature, or modifying an existing one, provide Augment Code with your architecture.md (from Step 0) and idea.md (from Step 2).

  • Task: Ask it to trace the entire project as if it were walking a Code Property Graph (CPG).
  • Output: Create a current-implementation.md document that captures an extreme level of detail about how the existing system relates to your proposed changes. This serves as highly specific context for the implementation phase.

Step 4: Implementation Planning

This is where the AI personas (Carmack & Torvalds) “figure things out” based on all prior context, queued up by Repo Prompt:

  • Input: architecture.md, [feature-name]-idea.md, current-implementation-for-[feature].md.
  • Process: Prompt the personas to debate and outline a detailed implementation plan, including API endpoints, data structures, class responsibilities, and potential challenges.
Carmack + Torvalds Figuring Things Out

You are John Carmack, the legendary developer whose role is to provide clear, actionable code changes and architectural decisions. You are pair programming with Linus Torvalds, who’s in the room with you. You’re actively having a back-and-forth with Linus to debate but ultimately agree. Your role is to:
1. Analyze the requested changes or questions and break them down into clear, actionable steps.
2. Explain using mermaid 8.8.0 old syntax to get the point across.
3. Create a detailed implementation plan that includes:
   - Files that need to be modified
   - Specific code sections requiring changes
   - New functions, methods, or classes to be added
   - Dependencies or imports to be updated
   - Data structure modifications
   - Interface changes
   - Configuration updates
For each change:
- Describe the exact location in the code where changes are needed
- Explain the logic and reasoning behind each modification
- Provide example signatures, parameters, and return types
- Note any potential side effects or impacts on other parts of the codebase
- Highlight critical architectural decisions that need to be made
You may include short code snippets to illustrate specific patterns, signatures, or structures, but do not implement the full solution.
Focus solely on the technical implementation plan - exclude testing, validation, and deployment considerations unless they directly impact the architecture.
Please proceed with your analysis based on the following <user instructions>
  • Output: A PRD-[feature-name].md to be fed into Taskmaster-AI.

Step 5: Create Tasks

Feed the PRD-[feature-name].md into Taskmaster-AI and have it generate the task breakdown. You can play with the complexity of each task and its subtasks until the breakdown feels right.

Step 6: One Shot Implementation (Throwaway Code?)

Now, using Repo Prompt, select all of your generated files, the tasks, and the relevant swaths of the codebase as context, and dump it all into Gemini Pro 2.5. Limit this to 450k tokens. The prompt is:

You are legendary developers John Carmack and Linus Torvalds, known for your minimalism, cleanliness, and incredibly high standards. You've been provided with several tasks and .md files for context. Make sure to examine them carefully. Your role is to debate and then provide clear, actionable code changes to create high-ROI code. For each edit required:
1. Specify locations and changes:
   - File path/name
   - Function/class being modified
   - The type of change (add/modify/remove)
2. Show complete code for:
   - Any modified functions (entire function)
   - New functions or methods
   - Changed class definitions
   - Modified configuration blocks
   Only show code units that actually change.
3. Format all responses as:
   File: path/filename.ext
   Change: Brief description of what's changing
   ```language
   [Complete code block for this change]
   ```
You only need to specify the file and path for the first change in a file, and split the rest into separate code blocks. If you do not have access to a file, don’t make stuff up or rewrite it! Let me know what file is missing.

Then, using Repo Prompt’s “merge” functionality, check whether everything makes sense. If not, refine the prompt ad hoc with some more instructions. This is where I keep a “general instructions” prompt that I add in, containing my set of rules.

DO NOT try to iterate here; just improve the plan and re-run from scratch if you identify mistakes. There’s no iterating. Sometimes this isn’t throwaway code at all; sometimes it’s the final solution. Running it this way also helps you identify whether your tasks are too small for the state-of-the-art model.

Step 7: Iterative Implementation (The Coding Loop)

Most of the time this is where the code gets written, driven by your task breakdown. If the code looked really good from the previous step, sometimes I’ll just have Augment Code go through and validate that each task was completed correctly.

  • Process:
    • Augment Code takes one task at a time using the Taskmaster MCP.
    • For frontend changes, make sure you have an MCP connected to Chrome to visually verify changes.
    • For backend changes, make sure it builds unit tests and integration tests as part of the change.
    • The system runs in a loop: implement code, run tests/verifications, AI self-corrects based on failures or your feedback, until the task is complete and passes all checks.

Even when tasks can be one-shotted by Gemini Pro or Claude 3.7 Sonnet in the previous step, having Augment Code run this verification loop provides more robust and reliable results for complex work. Just ask it to “check if the tasks were implemented” instead of asking it to implement them.
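
The shape of that loop, sketched in Python: the implement_task helper and the npm test command are placeholders for whatever Augment Code and your test runner actually do, and the retry cap is my own convention.

```python
# Sketch of the implement -> verify -> self-correct loop. Augment Code drives this
# internally; the helpers here are placeholders to show the control flow.
import subprocess

def run_tests() -> tuple[bool, str]:
    """Run the project's test suite; return (passed, output)."""
    result = subprocess.run(["npm", "test"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def coding_loop(tasks: list[str], implement_task, max_attempts: int = 3) -> None:
    for task in tasks:                      # one Taskmaster task at a time
        feedback = ""
        for attempt in range(max_attempts):
            implement_task(task, feedback)  # model edits files for this task
            passed, output = run_tests()
            if passed:
                break
            feedback = output               # failures feed the next self-correction
        else:
            print(f"escalate to human review: {task}")
```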

Step 8: Feedback Loops & Quality Assurance

Continuous testing is built into Step 7 but broader QA is still needed.

  • Multi-Level Testing: At the end, take a git diff and ask Gemini Pro 2.5 which parts of your changes are most at risk of breaking your app, then ensure unit tests, component tests, integration tests, and end-to-end (e2e) tests are created and passing.
  • Persona Review (Discriminator): The critical step from Guardrail #3: have your Linus and Carmack personas independently review and grade the git diff of commits and the overall state of the feature branch using your F-to-A+ rubric (sketched below). Their feedback guides further refinement.
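
A minimal sketch of that discriminator pass when scripted by hand; the rubric wording and model name are my stand-ins, and in practice I run it through the same persona prompts described in Step 4.

```python
# Sketch of the discriminator step: grade the branch diff on an F-to-A+ rubric.
# The persona prompt and model name are placeholders; the git invocation is standard.
import subprocess
import anthropic

RUBRIC = ("You are {persona}. Grade the following git diff from F to A+ on correctness, "
          "simplicity, and fit with the existing architecture. Start your reply with the "
          "grade, then list the issues that must be fixed to reach an A.")

def grade_branch(base: str = "main", persona: str = "Linus Torvalds") -> str:
    diff = subprocess.run(["git", "diff", f"{base}...HEAD"],
                          capture_output=True, text=True).stdout
    client = anthropic.Anthropic()
    msg = client.messages.create(
        model="claude-3-7-sonnet-latest",   # placeholder model name
        max_tokens=1200,
        messages=[{"role": "user",
                   "content": RUBRIC.format(persona=persona) + "\n\n" + diff}],
    )
    return msg.content[0].text

# Run it once as Torvalds and once as Carmack, and only merge on two passing grades.
```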

Tool Composition Snapshot

This changes almost every week, but the workflow is built on a composition of tools, each chosen for its strengths:

  • Planning & Research: o3 for deep research, Gemini Pro 2.5 for world class synthesis, Grok & Sonnet 3.7 for ideation and alternative perspectives.
  • Context & Repo Prompt: Repo Prompt for context selection
  • Large-Context Planning & Initial Drafting: Gemini Pro 2.5 and Sonnet 3.7 (which replaced my earlier use of o1 pro for these tasks).
  • Task Management: Taskmaster-AI which replaced memory bank from Cline community recently for task breakdown and tracking.
  • Iterative Coding & Refinement: Augment Code / Claude Code for looping, implementation, and self-correction.

The Results: A New Skill, Not an Easier One

While the early results are exciting and I’m eager to see how 100x speed impacts our work and my personal pursuits, the main takeaway is this: AI coding is no easier than regular programming. It’s just a separate skill to learn.

For now, until AI advances, you have to:

  1. Know coding & product (strategy, leverage, planning) deeply.
  2. Be a first principles thinker.
  3. Understand AI through hands-on experience, benchmarks, and ongoing research.

Looking Forward: The Primacy of Planning and Taste

As models improve, I believe only the planning and “taste” (i.e., high-level design choices, aesthetic considerations, strategic alignment) decisions will remain human responsibilities. This means either taste becomes the overarching value you bring to the table, or you’ll need extreme specialization in fields where fundamental forward progress needs to be made.

What’s Next: Working in Public

I wanted to get a baseline of this process out there, but I know it’s still missing many details and concrete examples. I’m going to try to “work in public” more consistently. Follow along with my experiments and updates on LinkedIn and Twitter as I continue to chase the “ghost army.”
