I Built a Shark Tank in Claude Code. It Works.

I needed to validate business ideas fast. Instead of hiring consultants, I built a 5-agent team in Claude Code. It took four versions to get right. Here's what broke, what I fixed, and what the system actually produces now.

    I run Ringier Slovakia - one of the largest digital media groups in CEE. I also build things with AI on evenings and weekends. Not because it's trendy, but because I want to understand where it actually helps before I push 200 people to adopt it.

    A few weeks ago I needed rapid business validation for a new product idea. The kind of analysis that usually comes from a consulting sprint. Market sizing. Competitive landscape. Unit economics. The realistic kind, not the kind where everything looks great on slide 12. I've used ChatGPT, Gemini, deep research - they're good when I drive. I wanted agents that work without me and present findings for my review.

    I didn't hire anyone. I built a team of AI agents instead. It took four versions to get it right.

    What it does now

    Five agents, each with a specific job. They run inside Claude Code from my terminal.

    The researchers (4 agents, running in parallel):

    • Market researcher - validates whether the problem is real, sizes the market three different ways, checks if the numbers agree
    • Competitive intel - maps direct and indirect competitors, tests every claimed moat with a simple question: can someone copy this in two years?
    • Digital scout - checks if anyone is actually searching for this, maps the communities where target users hang out
    • Financial modeler - builds unit economics from scratch, runs three-scenario projections, answers the one question that matters: default alive or default dead?

    The shark (1 agent, runs after researchers finish):

    • Reads everything the researchers wrote
    • Writes go-to-market strategy, risk matrix, scalability assessment
    • Runs a YC-style evaluation AND a bootstrap viability check
    • Produces a final scorecard across 13 dimensions
    • Delivers a verdict: GO, CONDITIONAL, or NO-GO

    Under the hood

    The whole thing runs in a terminal. No web interface, no custom app, no code I wrote myself. Just Claude Code - Anthropic's command-line tool for working with AI.

    Through Version 3, the entire system was a single markdown file - 639 lines of instructions at its largest. I type /business-sharks followed by the idea, and it takes over.

    But "639 lines" undersells what those lines contain. Each agent carries a full analytical framework. The market researcher runs three independent sizing methods with actual formulas - if they disagree by more than 30%, it flags the discrepancy. The competitive intel has a four-part moat durability test where all four must pass or the moat gets classified as "WISHFUL THINKING." The financial modeler breaks CAC by channel and runs three scenarios. These aren't summaries - they're structured analyses with built-in skepticism.
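    The two checks described above can be sketched in a few lines of Python. This is an illustration, not the skill's actual logic (the skill is prose instructions, not code). The choice to measure disagreement relative to the smallest estimate, the function names, and the "DURABLE" label for a passing moat are all assumptions; only the 30% threshold, the all-four-must-pass rule, and the "WISHFUL THINKING" label come from the description.

```python
def sizing_discrepancy(estimates: dict[str, float], tolerance: float = 0.30) -> tuple[float, bool]:
    """Return (spread, flagged) for a set of independent market-size estimates.

    Spread is measured relative to the smallest estimate. That baseline is an
    assumption of this sketch; the text only says "disagree by more than 30%".
    """
    if len(estimates) < 2:
        raise ValueError("need at least two estimates to compare")
    low, high = min(estimates.values()), max(estimates.values())
    if low <= 0:
        raise ValueError("estimates must be positive")
    spread = (high - low) / low
    return spread, spread > tolerance


def classify_moat(checks: dict[str, bool]) -> str:
    """All four parts must pass; anything less is WISHFUL THINKING.

    The four part names are not given in the text, so callers supply their own
    keys. "DURABLE" is a hypothetical label for the passing case.
    """
    if len(checks) != 4:
        raise ValueError("the durability test has exactly four parts")
    return "DURABLE" if all(checks.values()) else "WISHFUL THINKING"
```

    Calling sizing_discrepancy with three estimates returns how far apart they are and whether that gap crosses the threshold; classify_moat returns the label for one claimed moat.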

    Every factual claim gets tagged at the source: FACT (verified data with URL), INFERRED (extrapolated from comparable signals), or ASSUMPTION (modeled without evidence). When I read "TAM of $2.3B" I know immediately whether that came from a Gartner report or a napkin estimate.
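    As a rough model of that tagging rule, here is a small Python sketch. The class and field names are hypothetical; the three tags and the requirement that a FACT carries a URL come from the description above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Evidence(Enum):
    FACT = "FACT"              # verified data with URL
    INFERRED = "INFERRED"      # extrapolated from comparable signals
    ASSUMPTION = "ASSUMPTION"  # modeled without evidence


@dataclass(frozen=True)
class Claim:
    text: str
    tag: Evidence
    source_url: Optional[str] = None

    def __post_init__(self) -> None:
        # A FACT without a source is really an INFERRED or ASSUMPTION claim.
        if self.tag is Evidence.FACT and not self.source_url:
            raise ValueError("a FACT claim needs a source URL")

    def render(self) -> str:
        """Format the claim the way a reader would scan it in a report."""
        suffix = f" ({self.source_url})" if self.source_url else ""
        return f"[{self.tag.value}] {self.text}{suffix}"
```

    The point of the constructor check is that the tag cannot be upgraded to FACT without evidence attached, which is the behavior the prompt asks the agents to follow.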

    The agents search the web in real time and read and write files on my machine. No APIs, no databases. The file system is the coordination layer - 16 numbered markdown files, each owned by exactly one agent.
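    The one-owner-per-file rule is enforced by instructions in the prompt, not by code. If you wanted to express the same rule programmatically, a minimal Python sketch might look like this. The Workspace class, the OwnershipError name, and any file names passed in are hypothetical.

```python
from pathlib import Path


class OwnershipError(PermissionError):
    """Raised when an agent tries to write a file it does not own."""


class Workspace:
    """File-system coordination layer: every file has exactly one writer."""

    def __init__(self, root: Path, owners: dict[str, str]) -> None:
        self.root = root
        self.owners = owners  # filename -> owning agent

    def write(self, agent: str, filename: str, text: str) -> Path:
        owner = self.owners.get(filename)
        if owner != agent:
            raise OwnershipError(f"{agent} may not write {filename} (owner: {owner})")
        path = self.root / filename
        path.write_text(text, encoding="utf-8")
        return path

    def read(self, filename: str) -> str:
        # Reading is unrestricted: any agent may read any file.
        return (self.root / filename).read_text(encoding="utf-8")
```

    Writes are restricted and reads are open, which matches the idea that nobody touches anyone else's file while everyone can still see it.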

    A real verdict looks like this:

    Average Score: 5.1/10 | Verdict: CONDITIONAL

    Defensibility/Moat: 3/10 - "NO MOAT IDENTIFIED. Five incumbents with 3-5 year head starts."
    Problem Intensity: 6/10 - "Real pain, but frequency is monthly, not daily."
    Unit Economics: 7/10 - "LTV:CAC of 4.2x if churn assumptions hold."

    What Would Need To Be True: "50 cold emails to mid-market publishers get >5% reply rate. If not, the demand is theoretical."
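    The "if churn assumptions hold" caveat on the LTV:CAC score is worth making concrete. The Python sketch below uses a common simplification (LTV = monthly revenue per customer x gross margin / monthly churn) and a plain month-by-month simulation of Paul Graham's default-alive question with costs held constant. Both are my assumptions for illustration, not the formulas in the skill, and all input values are made up.

```python
from dataclasses import dataclass


@dataclass
class UnitEconomics:
    arpu_monthly: float   # average revenue per customer per month
    gross_margin: float   # 0..1
    monthly_churn: float  # 0..1, share of customers lost each month
    cac: float            # blended customer acquisition cost

    def ltv(self) -> float:
        # Simplified lifetime value: margin-adjusted revenue over expected lifetime.
        return self.arpu_monthly * self.gross_margin / self.monthly_churn

    def ltv_to_cac(self) -> float:
        return self.ltv() / self.cac


def is_default_alive(cash: float, monthly_revenue: float, monthly_costs: float,
                     monthly_growth: float, horizon_months: int = 60) -> bool:
    """True if revenue covers costs before the cash runs out.

    Assumes constant costs and constant month-over-month revenue growth.
    """
    for _ in range(horizon_months):
        if monthly_revenue >= monthly_costs:
            return True
        cash -= monthly_costs - monthly_revenue
        if cash < 0:
            return False
        monthly_revenue *= 1 + monthly_growth
    return False
```

    Doubling monthly_churn halves the LTV:CAC ratio in this model, which is why the scorecard attaches the churn caveat to the score instead of reporting the ratio on its own.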

    No softening. No encouragement. Just the analysis and the next test to run. That's the current version. Getting here was not clean.

    Version 1: everything in one file, and it showed

    The first version was 281 lines. Four researchers wrote their sections into a single report file; the shark read it and synthesized.

    It ran. The output was mediocre.

    The biggest problem: problem validation was a checkbox. The market researcher would write two paragraphs about whether the problem was real, then move on to TAM calculations. But if the problem isn't real, nothing else matters. The TAM, the competitive landscape, the financial model - all meaningless if nobody actually has this pain badly enough to pay for a solution.

    I fixed that the same day. Added specific instructions: rate the hair-on-fire score, map every workaround people use today, assess willingness to pay, check problem frequency. If evidence is weak, say "INSUFFICIENT PROBLEM EVIDENCE." Don't fabricate demand.

    Better problem validation. Still a mess everywhere else.

    Version 2: the real rewrite

    A day later I tore it apart. Three things were broken:

    The single-file problem. All four researchers writing to the same file caused conflicts. One agent would overwrite another's section. The fix: 16 numbered files, each owned by exactly one agent. Nobody touches anyone else's file.

    The repetition problem. The financial modeler would re-describe the competitive landscape. The shark would repeat the market sizing. I added a no-repeat rule: reference other sections by number. Never re-explain what another analyst already covered.

    The silo problem. Each researcher worked in isolation. The financial modeler would build unit economics without knowing the competitive intel found five well-funded competitors offering the same thing for free. The fix was a two-pass system. In pass one, each researcher writes only their key findings. In pass two, they read each other's findings before writing the full analysis. Small change. Big difference in quality.

    Version 3: self-contained and cost-aware

    The system worked but depended on external plugins - SEO tools, metrics frameworks. Sometimes they weren't available. Sometimes they added noise. I ripped all of that out. Gave each agent its own analytical framework, built into the prompt. No external dependencies.

    I also added things I didn't know I needed until I ran it on real ideas:

    Tarpit detection and evidence tagging. The market researcher checks if an idea matches a known failure pattern. Every claim gets tagged FACT, INFERRED, or ASSUMPTION - so I never mistake a guess for a data point.

    Execution modes. I run Claude Code on the Max plan - flat subscription, no per-token billing. Haven't hit a usage limit yet, even running five Opus agents in parallel. But I wanted the system to work for API users too, so I added STANDARD, HYBRID, and QUICK modes.

    Bootstrap viability. The original only evaluated ideas through a VC lens. I added a full bootstrap assessment: path to $5K MRR, solo founder bottleneck analysis, stair-step methodology. The scorecard shows both lenses and uses whichever scores higher.
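    To show how the two lenses and the verdict fit together, here is a hedged Python sketch. The real scorecard has 13 dimensions with weights defined in the prompt; this version uses an unweighted mean, and the verdict thresholds (7.0 and 4.0) are hypothetical values I picked for illustration. Only the "use whichever lens scores higher" rule and the three verdict labels come from the text.

```python
from statistics import mean


def lens_average(scores: dict[str, int]) -> float:
    """Unweighted average of 0-10 dimension scores, rounded to one decimal."""
    return round(mean(scores.values()), 1)


def pick_lens(vc_scores: dict[str, int], bootstrap_scores: dict[str, int]) -> tuple[str, float]:
    """Score the idea through both lenses and keep the higher one."""
    vc, boot = lens_average(vc_scores), lens_average(bootstrap_scores)
    return ("VC", vc) if vc >= boot else ("BOOTSTRAP", boot)


def verdict(average: float, go_at: float = 7.0, no_go_below: float = 4.0) -> str:
    # Thresholds are assumptions of this sketch, not values from the skill.
    if average >= go_at:
        return "GO"
    if average < no_go_below:
        return "NO-GO"
    return "CONDITIONAL"
```

    With these assumed thresholds, an average of 5.1 lands in CONDITIONAL, which is consistent with the example verdict earlier in the article.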

    What I learned

    The first version is never good enough - and you can't predict how it will fail. The single-file conflicts, the shallow validation, the silo problem - I only found these by running the system on real ideas and reading what came back. Build, run, read, fix.

    The prompt is the product. 281 lines to 639. Most of that growth is analytical frameworks and coordination rules. This is the same work a product manager does when writing specs - except the spec IS the product. Every sentence changes the output. And making five agents work together took more thought than any individual agent's instructions. File ownership, two-pass research, no-repeat rules - coordination problems, not intelligence problems. The AI is smart enough. The hard part is designing how the pieces fit together.

    The thing that surprised me most: I didn't write a single line of code. But knowing what was wrong - that's product judgment built over 12 years of shipping. AI agents make building accessible. They don't replace knowing what to build.

    I also started running business-sharks on existing Ringier products - not just new ideas. Feed it a product that's been running for two years and it gives you a fresh outside view. It flagged blind spots we'd missed - and when I compared its estimates against our actual numbers, they were close enough to be useful. Not perfect, but the kind of ballpark where you trust the direction.

    The gap that collapsed

    I built products for a decade before going corporate. Car marketplaces, property portals, a dating product for a New York startup. I understand how development works, how products ship, how markets behave. But I moved into management. The last few years I steered teams, not code.

    Claude Code put me back in the builder's seat. Not because I suddenly learned the latest frameworks - I didn't need to. I know what a good analysis looks like. I know what breaks when five people work in parallel without coordination rules. I know when a financial model is missing the question that matters. That's the job. The AI handles the execution. I handle the judgment.

    A 639-line markdown file replaced a consulting engagement. I'm back to building - not because I learned new tech, but because the tech finally caught up to product judgment. The bottleneck moved. It's no longer "can I build this?" It's "do I know what to build?"

    V4 refactor: SKILL.md cut from 639 to 274 lines, with the rest split into 7 reference files loaded only when needed.

    Version 4: Skill engineering as a discipline

    Three weeks after Version 3, I tore the file apart again.

    The V3 SKILL.md had grown to 639 lines. Every invocation loaded the whole file into context - every persona description, every output schema, every execution mode, every analytical framework - even when the user only needed one mode. Token cost per /business-sharks run was higher than it needed to be, and the context the agents had left for the actual idea was being squeezed by their own instructions.

    Progressive disclosure: 274 lines plus 7 reference files. Anthropic's own engineering posts describe progressive disclosure as the canonical shape for skills: a Claude Code skill loads metadata first (about 100 tokens), the SKILL.md only when triggered, and reference files only when the work demands them. V3 collapsed all three layers into one. V4 splits them properly. SKILL.md is now 274 lines - the orchestration logic and phase flow only. Each persona's analytical framework moved into its own reference file:

    • references/defensive-principles.md - shared across every teammate. Section format, length limits, evidence-tagging rules (FACT / INFERRED / ASSUMPTION), the no-soften rule.
    • references/market-researcher.md - three independent market-sizing methods, problem-validation framework, tarpit-pattern detection.
    • references/competitive-intel.md - competitor mapping, four-part moat durability test, "WISHFUL THINKING" classification when claimed moats fail.
    • references/digital-scout.md - search-trend validation, target-audience community mapping, GEO/AEO signal scan.
    • references/financial-modeler.md - unit economics from scratch, three-scenario projections, the default-alive vs default-dead test.
    • references/chief-shark.md - synthesis rules, 13-dimension scorecard weights, WWNBT (What Would Need To Be True) framework, VC-vs-Bootstrap lens switch.
    • references/update-protocol.md - how the skill amends an existing analysis when methodology changes (covered below).

    A teammate now loads defensive-principles.md plus its own role file - about 40% less context per teammate than V3, and the framework details only enter the picture when an agent actually uses them.
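    The loading rule is simple enough to write down as a lookup. Claude Code does the actual loading when the skill's instructions point a teammate at a file, so the Python below is only a model of which files end up in a teammate's context. The file paths are the ones listed above; the function names and the idea of keying by role name are assumptions.

```python
from pathlib import Path

SHARED = "references/defensive-principles.md"

# Role name -> role-specific reference file (paths as listed in the article).
ROLE_FILES = {
    "market-researcher": "references/market-researcher.md",
    "competitive-intel": "references/competitive-intel.md",
    "digital-scout": "references/digital-scout.md",
    "financial-modeler": "references/financial-modeler.md",
    "chief-shark": "references/chief-shark.md",
}


def files_for(role: str) -> list[str]:
    """Shared principles plus the teammate's own framework, nothing else."""
    if role not in ROLE_FILES:
        raise KeyError(f"unknown role: {role}")
    return [SHARED, ROLE_FILES[role]]


def load_context(skill_root: Path, role: str) -> str:
    """Concatenate the reference files a teammate needs, in load order."""
    parts = [(skill_root / rel).read_text(encoding="utf-8") for rel in files_for(role)]
    return "\n\n".join(parts)
```

    update-protocol.md is deliberately absent from the role table: per the description it is only needed when an existing analysis is being amended.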

    Two-pass cross-read between agents. V3 had each researcher write a full section in one pass, then chief-shark synthesized at the end. V4 splits the research into two passes. In Pass 1, each researcher writes only their Key Findings block - ten to fifteen lines. In Pass 2, each researcher reads the OTHER three Key Findings blocks first, then overwrites their own file with the full analysis. Two-pass cross-read means agent B reads agent A's raw output, not a summary - it catches what summarization would have flattened. Contradictions surface at write-time, not at synthesis-time.
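    The control flow of the two passes can be sketched as follows. In the real system each researcher is an agent and the passes run in parallel; here each researcher is a pair of plain Python functions and the loop is sequential, purely to show who reads what and when. All names are hypothetical.

```python
from typing import Callable, Dict

KeyFindingsFn = Callable[[str], str]                   # idea -> Key Findings block
FullAnalysisFn = Callable[[str, Dict[str, str]], str]  # idea, others' findings -> full analysis


def run_two_pass(idea: str,
                 pass_one: Dict[str, KeyFindingsFn],
                 pass_two: Dict[str, FullAnalysisFn]) -> Dict[str, str]:
    """Pass 1: everyone writes Key Findings. Pass 2: everyone reads the
    OTHER researchers' findings, then writes the full analysis."""
    findings = {name: write(idea) for name, write in pass_one.items()}

    reports: Dict[str, str] = {}
    for name, analyze in pass_two.items():
        # A researcher sees every Key Findings block except its own.
        others = {n: text for n, text in findings.items() if n != name}
        reports[name] = analyze(idea, others)
    return reports
```

    The important property is the barrier between the passes: no full analysis starts until every Key Findings block exists, so a researcher cannot miss a finding that arrived late.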

    HYBRID and QUICK execution modes. STANDARD runs all four researchers as full team members in parallel - fastest, most context-heavy. HYBRID forks context for parallel research agents (background subagents that auto-terminate after writing their section files), then runs synthesis inline with chief-shark only - about 40% cheaper in tokens. QUICK skips Pass 1 drafts and runs all four researchers on Sonnet in a single pass - about 60% cheaper, lower depth, fast turnaround for early-stage ideas. The mode is decided once at session start and is inherited across the brainstormers-idea pipeline.
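    For a quick comparison, the three modes can be laid out as a small table in Python. The relative-cost figures are the approximate savings quoted above, read as fractions of STANDARD; that baseline is my assumption, as are the field and function names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeProfile:
    research_passes: int   # QUICK skips the Pass 1 drafts
    relative_cost: float   # approximate token cost, assuming STANDARD = 1.0


MODES = {
    "STANDARD": ModeProfile(research_passes=2, relative_cost=1.0),
    "HYBRID": ModeProfile(research_passes=2, relative_cost=0.6),   # "about 40% cheaper"
    "QUICK": ModeProfile(research_passes=1, relative_cost=0.4),    # "about 60% cheaper"
}


def estimated_cost(mode: str, standard_cost: float) -> float:
    """Scale a known STANDARD-run cost to another mode (rough estimate only)."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    return MODES[mode].relative_cost * standard_cost
```

    If you have measured what one STANDARD run costs on the API, estimated_cost gives a ballpark for the other two modes; the percentages are approximations, so treat the result the same way.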

    Pipeline composition: brainstormers-idea to business-sharks. The skill now detects when its input is structured output from brainstormers-idea - a sister skill that does the optimistic creative pass before the adversarial one. If docs/05-refined-idea.md exists in the path the user points to, business-sharks reads it as the starting business idea instead of asking from scratch.

    The prior creative research gets handed to each analyst with one instruction: challenge and extend, do not summarize. The mapping:

    • brainstormers/01-market-landscape.md -> market-researcher: challenge the market sizing, validate or dispute demand signals, dig deeper into problem intensity.
    • brainstormers/02-competition-map.md -> competitive-intel: stress-test the competitive gaps, find competitors brainstormers missed, assess the moat honestly.
    • brainstormers/04-revenue-models.md -> financial-modeler: take the proposed revenue models and run the actual numbers, challenge the pricing assumptions, build unit economics that brainstormers didn't.
    • (digital-scout has no equivalent prior file - it researches from scratch.)
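    The detection rule and the hand-off mapping could be modeled like this in Python. The file paths are the ones named above; where the brainstormers/ files sit relative to the project path is not spelled out, so the sketch keeps them as relative strings. The function names are hypothetical.

```python
from pathlib import Path
from typing import Optional

REFINED_IDEA = Path("docs/05-refined-idea.md")

# Analyst -> prior brainstormers-idea file to challenge and extend.
PRIOR_RESEARCH = {
    "market-researcher": "brainstormers/01-market-landscape.md",
    "competitive-intel": "brainstormers/02-competition-map.md",
    "financial-modeler": "brainstormers/04-revenue-models.md",
    "digital-scout": None,  # no equivalent prior file: researches from scratch
}


def starting_idea(project_root: Path) -> Optional[str]:
    """Return the refined idea if brainstormers-idea output exists, else None."""
    path = project_root / REFINED_IDEA
    return path.read_text(encoding="utf-8") if path.is_file() else None


def prior_file_for(analyst: str) -> Optional[str]:
    """Which prior-research file, if any, an analyst should challenge."""
    return PRIOR_RESEARCH[analyst]
```

    A None from starting_idea corresponds to the fallback described above, where business-sharks asks for the idea from scratch.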

    brainstormers-idea is optimistic by design; business-sharks is critical by design. The pipeline keeps both stances honest by feeding one into the other rather than blending them. If the prior creative research was wrong about something, the analysts have explicit permission to say so. That permission - "if it's wrong, say so" - is the line that makes the pipeline more useful than either skill on its own.

    Update protocol. Each reference file carries a version field. When V4 changes a persona's analytical framework, the skill writes a one-line note to existing analysis files flagging which sections are out of date, and offers to re-run only the affected analysts. No more re-running the whole pipeline to apply a single methodology fix. The full repo with all four V4 skills - business-sharks, brainstormers-idea, app-factory, domain-validator - lives at github.com/feronovak/claude-skills.
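    The core of the update protocol is a version comparison. A minimal Python sketch, assuming each analysis records the framework version it was produced with and that versions are plain strings compared for equality (the actual format of the version field is not described here), could look like this:

```python
from typing import Dict, List


def stale_analysts(current_versions: Dict[str, str],
                   analysis_versions: Dict[str, str]) -> List[str]:
    """Analysts whose framework version differs from the one used in the
    existing analysis; only these need a re-run."""
    return sorted(
        role for role, version in current_versions.items()
        if analysis_versions.get(role) != version
    )


def stale_note(role: str, old: str, new: str) -> str:
    """One-line flag to write into an out-of-date analysis file (hypothetical wording)."""
    return f"NOTE: {role} framework changed ({old} -> {new}); this section is out of date."
```

    Everything not returned by stale_analysts is left alone, which is what avoids re-running the whole pipeline for a single methodology fix.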

    FAQ

    What is the business-sharks skill?

    A Claude Code skill that runs a five-agent panel on a business idea - four parallel researchers (market, competitive, digital, financial) plus a chief-shark synthesizer. It produces a 13-dimension scorecard and a GO / CONDITIONAL / NO-GO verdict. Public repo: github.com/feronovak/claude-skills.

    How do I install business-sharks?

    With the Claude Code plugin marketplace:

    /plugin marketplace add feronovak/claude-skills
    /plugin install business-sharks

    Then invoke with /business-sharks <your idea>.

    What is the difference between a Claude Code skill and a subagent?

    A skill is a packaged workflow definition with reference files and a trigger phrase. A subagent is one of the parallel agents the skill spawns at runtime. business-sharks is a skill that orchestrates four research subagents plus chief-shark. The same skill-spawns-subagents pattern shows up in my Discord + Claude Code remote dev setup - different surface, same shape.

    Can business-sharks validate any business idea?

    It works best for software, services, and products with sub-50M EUR total addressable markets - the agents can find evidence at that scale. Hardware, regulated industries (medical, financial), and deep-tech ideas benefit from it but should not rely on it alone. Always treat the verdict as a signal, not a decision.

    How does business-sharks differ from ValidatorAI or IdeaProof?

    business-sharks is a methodology, not a SaaS - five named adversarial personas, two-pass cross-read between them, forkable code. The output is a 13-dimension scorecard you can audit. ValidatorAI and IdeaProof return single-paragraph summaries from a black box.

    Try V4

    Repo: github.com/feronovak/claude-skills. Install /business-sharks, point it at the next idea you're considering. If something breaks, the code is right there - file an issue or fork it. That's the whole offer.

    I'm Fero Novak, Managing Director at Ringier Slovakia. 12+ years in digital media, from product specialist to MD. Before that, I built products at my own company for years. I build with AI personally because I want to understand it before I ask my teams to adopt it. This is the first in a series - next up: what else I built in days, not weeks.