I run Ringier Slovakia - one of the largest digital media groups in CEE. I also build things with AI on evenings and weekends. Not because it's trendy, but because I want to understand where it actually helps before I push 200 people to adopt it.
A few weeks ago I needed rapid business validation for a new product idea. The kind of analysis that usually comes from a consulting sprint. Market sizing. Competitive landscape. Unit economics. The realistic kind, not the kind where everything looks great on slide 12. I've used ChatGPT, Gemini, and deep research tools - they're good when I drive. I wanted agents that work without me and present findings for my review.
I didn't hire anyone. I built a team of AI agents instead. It took four versions to get it right.
What it does now
Five agents, each with a specific job. They run inside Claude Code from my terminal.
The researchers (4 agents, running in parallel):
- Market researcher - validates whether the problem is real, sizes the market three different ways, checks if the numbers agree
- Competitive intel - maps direct and indirect competitors, tests every claimed moat with a simple question: can someone copy this in two years?
- Digital scout - checks if anyone is actually searching for this, maps the communities where target users hang out
- Financial modeler - builds unit economics from scratch, runs three-scenario projections, answers the one question that matters: default alive or default dead?
The shark (1 agent, runs after researchers finish):
- Reads everything the researchers wrote
- Writes go-to-market strategy, risk matrix, scalability assessment
- Runs a YC-style evaluation AND a bootstrap viability check
- Produces a final scorecard across 13 dimensions
- Delivers a verdict: GO, CONDITIONAL, or NO-GO
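The "default alive or default dead" question is Paul Graham's: at the current burn and growth rate, does revenue cross expenses before the cash runs out? The agent reasons about this in prose, but the arithmetic underneath boils down to something like this (the function and numbers are my illustration, not the actual prompt):

```python
def default_alive(cash, revenue, expenses, monthly_growth):
    """Simulate month by month: does revenue cross expenses
    before cash runs out? Returns (alive, months_simulated)."""
    month = 0
    while cash > 0:
        if revenue >= expenses:
            return True, month          # profitable before the money ran out
        cash += revenue - expenses      # burn the gap
        revenue *= 1 + monthly_growth   # compound the growth
        month += 1
    return False, month                 # cash ran out first

# Illustrative: $200K cash, $10K MRR, $30K monthly costs, 10% growth
print(default_alive(200_000, 10_000, 30_000, 0.10))  # → (True, 12)
```

At 10% monthly growth this company turns profitable in month 12 with cash to spare; freeze the growth and the same company dies in three months. That sensitivity is exactly why the question matters more than any single projection.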
Under the hood
The whole thing runs in a terminal. No web interface, no custom app, no code I wrote myself. Just Claude Code - Anthropic's command-line tool for working with AI.
The entire system is a single markdown file - 639 lines of instructions. I type /business-sharks followed by the idea, and it takes over.
But "639 lines" undersells what those lines contain. Each agent carries a full analytical framework. The market researcher runs three independent sizing methods with actual formulas - if they disagree by more than 30%, it flags the discrepancy. The competitive intel agent runs a four-part moat durability test where all four parts must pass or the moat gets classified as "WISHFUL THINKING." The financial modeler breaks CAC down by channel and runs three scenarios. These aren't summaries - they're structured analyses with built-in skepticism.
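The triangulation rule is easy to state: size the market three independent ways and distrust agreement by accident. The agent does this in prose, but the check itself reduces to a few lines (the 30% threshold is from my prompt; the function and method names are illustrative):

```python
def triangulation_flag(estimates, tolerance=0.30):
    """Given TAM estimates from independent sizing methods,
    flag any estimate more than `tolerance` away from the median."""
    values = sorted(estimates.values())
    median = values[len(values) // 2]
    outliers = {name: v for name, v in estimates.items()
                if abs(v - median) / median > tolerance}
    return outliers  # empty dict = the methods agree

# Illustrative: three sizing methods, in $M
tam = {"top_down": 2300, "bottom_up": 1400, "value_theory": 2100}
print(triangulation_flag(tam))  # → {'bottom_up': 1400}
```

Here the bottom-up number sits 33% below the median, so the report would carry a flagged discrepancy instead of a confidently averaged TAM.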
Every factual claim gets tagged at the source: FACT (verified data with URL), INFERRED (extrapolated from comparable signals), or ASSUMPTION (modeled without evidence). When I read "TAM of $2.3B" I know immediately whether that came from a Gartner report or a napkin estimate.
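In the reports, the convention looks roughly like this (the tags are real; these specific claims are invented for illustration):

```markdown
- TAM of $2.3B for mid-market publishing tools [FACT: analyst report, URL]
- ~40% of that reachable self-serve [INFERRED: comparable SaaS adoption curves]
- $99/mo price point acceptable [ASSUMPTION: no willingness-to-pay data yet]
```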
The agents search the web in real time and read and write files on my machine. No APIs, no databases. The file system is the coordination layer - 16 numbered markdown files, each owned by exactly one agent.
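I won't list all 16 files here, but the ownership scheme looks roughly like this (file names illustrative, not the actual ones):

```
reports/
  01-problem-validation.md   # market researcher
  05-competitor-map.md       # competitive intel
  09-unit-economics.md       # financial modeler
  16-final-scorecard.md      # the shark
```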
A real verdict looks like this:
Average Score: 5.1/10 | Verdict: CONDITIONAL
Defensibility/Moat: 3/10 - "NO MOAT IDENTIFIED. Five incumbents with 3-5 year head starts."
Problem Intensity: 6/10 - "Real pain, but frequency is monthly, not daily."
Unit Economics: 7/10 - "LTV:CAC of 4.2x if churn assumptions hold."
What Would Need To Be True: "50 cold emails to mid-market publishers get >5% reply rate. If not, the demand is theoretical."
No softening. No encouragement. Just the analysis and the next test to run. That's the current version. Getting here was not clean.
Version 1: everything in one file, and it showed
The first version was 281 lines. Four researchers wrote their sections into a single report file; the shark read it and synthesized.
It ran. The output was mediocre.
The biggest problem: problem validation was a checkbox. The market researcher would write two paragraphs about whether the problem was real, then move on to TAM calculations. But if the problem isn't real, nothing else matters. The TAM, the competitive landscape, the financial model - all meaningless if nobody actually has this pain badly enough to pay for a solution.
I fixed that the same day. Added specific instructions: rate the hair-on-fire score, map every workaround people use today, assess willingness to pay, check problem frequency. If evidence is weak, say "INSUFFICIENT PROBLEM EVIDENCE." Don't fabricate demand.
Better problem validation. Still a mess everywhere else.
Version 2: the real rewrite
A day later I tore it apart. Three things were broken:
The single-file problem. All four researchers writing to the same file caused conflicts. One agent would overwrite another's section. The fix: 16 numbered files, each owned by exactly one agent. Nobody touches anyone else's file.
The repetition problem. The financial modeler would re-describe the competitive landscape. The shark would repeat the market sizing. I added a no-repeat rule: reference other sections by number. Never re-explain what another analyst already covered.
The silo problem. Each researcher worked in isolation. The financial modeler would build unit economics without knowing the competitive intel found five well-funded competitors offering the same thing for free. The fix was a two-pass system. In pass one, each researcher writes only their key findings. In pass two, they read each other's findings before writing the full analysis. Small change. Big difference in quality.
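The two-pass flow is the same pattern you'd use with human analysts: share headlines first, then write the deep dive. Stripped of the agents, it reduces to this (stand-in names, not the real prompt):

```python
agents = ["market", "competitive", "digital", "financial"]

# Pass 1: every researcher writes only its key findings to its own file.
findings = {agent: f"{agent}: key findings" for agent in agents}

# Pass 2: every researcher reads everyone else's findings,
# then writes its full analysis informed by them.
analyses = {}
for agent in agents:
    context = [findings[a] for a in agents if a != agent]
    analyses[agent] = f"{agent} analysis informed by {len(context)} peers"

print(analyses["financial"])  # → financial analysis informed by 3 peers
```

The point of the sketch: nothing in pass 2 starts until every pass-1 file exists, so the financial modeler can no longer price a product without knowing what the competitive intel found.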
Version 3: self-contained and cost-aware
The system worked but depended on external plugins - SEO tools, metrics frameworks. Sometimes they weren't available. Sometimes they added noise. I ripped all of that out. Gave each agent its own analytical framework, built into the prompt. No external dependencies.
I also added things I didn't know I needed until I ran it on real ideas:
Tarpit detection and evidence tagging. The market researcher checks if an idea matches a known failure pattern. Every claim gets tagged FACT, INFERRED, or ASSUMPTION - so I never mistake a guess for a data point.
Execution modes. I run Claude Code Max - flat subscription, no per-token billing. Haven't hit a usage limit yet, even running five Opus agents in parallel. But I wanted the system to work for API users too, so I added STANDARD, HYBRID, and QUICK modes.
Bootstrap viability. The original only evaluated ideas through a VC lens. I added a full bootstrap assessment: path to $5K MRR, solo founder bottleneck analysis, stair-step methodology. The scorecard shows both lenses and uses whichever scores higher.
What I learned
The first version is never good enough - and you can't predict how it will fail. The single-file conflicts, the shallow validation, the silo problem - I only found these by running the system on real ideas and reading what came back. Build, run, read, fix.
The prompt is the product. 281 lines to 639. Most of that growth is analytical frameworks and coordination rules. This is the same work a product manager does when writing specs - except the spec IS the product. Every sentence changes the output. And making five agents work together took more thought than any individual agent's instructions. File ownership, two-pass research, no-repeat rules - coordination problems, not intelligence problems. The AI is smart enough. The hard part is designing how the pieces fit together.
The thing that surprised me most: I didn't write a single line of code. But knowing what was wrong - that's product judgment built over 12 years of shipping. AI agents make building accessible. They don't replace knowing what to build.
I also started running business-sharks on existing Ringier products - not just new ideas. Feed it a product that's been running for two years and it gives you a fresh outside view. It flagged blind spots we'd missed - and when I compared its estimates against our actual numbers, they were close enough to be useful. Not perfect, but the kind of ballpark where you trust the direction.
The gap that collapsed
I built products for a decade before going corporate. Car marketplaces, property portals, a dating product for a New York startup. I understand how development works, how products ship, how markets behave. But I moved into management. The last few years I steered teams, not code.
Claude Code put me back in the builder's seat. Not because I suddenly learned the latest frameworks - I didn't need to. I know what a good analysis looks like. I know what breaks when five people work in parallel without coordination rules. I know when a financial model is missing the question that matters. That's the job. The AI handles the execution. I handle the judgment.
A 639-line markdown file replaced a consulting engagement. I'm back to building - not because I learned new tech, but because the tech finally caught up to product judgment. The bottleneck moved. It's no longer "can I build this?" It's "do I know what to build?"
I'm Fero Novak, Managing Director at Ringier Slovakia. 12+ years in digital media, from product specialist to MD. Before that, I built products at my own company for years. I build with AI personally because I want to understand it before I ask my teams to adopt it. This is the first in a series - next up: what else I built in days, not weeks.