Yumina Studio Agent + "Roblox for AI Games" — Findings, Plan & Recommendations
Date: 2026-06-01
Author: Jefray (+ Claude)
Status: Direction agreed. Phase 1 ready to build. Phase 2 (substrate) pending founder decisions / CEO review.
TL;DR
- The studio AI assistant was failing ~30-52% of runs (describe-and-stop + read-spiral on large TSX). The describe-and-stop half is fixed and shipped (PR #24). The large-file half is Phase 1 below.
- The Gemini chat errors are mostly Google's upstream + non-bypassable safety filter (not ours); a diagnostics patch is staged.
- Strategic question answered: do not rent the agent loop. Own the loop (it holds the moat: credit metering, held-edit gate, crash recovery, multi-model OpenRouter arbitrage). Steal the large-file tooling from the frontier agents. Build the substrate for the games future. Keep play client-executed.
- How to grow toward "Roblox for AI games": copy Godot's project model (real files, real compiler, a curated library over a platform engine SDK) and pair it with Yumina's own server for the AI/state/social layer. Do not copy Roblox's server-authoritative cost model, and do not open raw `npm install`.
1. The vision (what we're building toward)
Yumina = "Roblox for AI-native games." Worlds evolve text → visual novel → real games (eventually game-engine-class). More code, smarter agents, larger worlds over time. The architecture must raise the complexity ceiling without breaking the free-tier economics or the security model.
2. What we found this session
2a. The two production bugs (investigated + acted on)
Bug A — Studio AI assistant "describes a change then stops."
- Root cause: the agent loop ends on any text-only LLM turn. A one-time nudge only fired for text > 200 chars. Dense CJK plans are short (the reported screenshot was exactly 195 chars), so they slipped the gate and the run completed without editing.
- Prod evidence (14d): 734/2466 (29.8%) completed runs made zero edits; daily rate climbed to 52% on 5/31. Worst models: gpt-5.4 59%, deepseek-v4-pro 54%, sonnet-4.6 26%.
- Fix SHIPPED (PR #24, merged to main): nudge on any non-empty text-only turn, up to 2×/run, softened wording so a genuinely-done model can decline instead of fabricating an edit.
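The gate change behind the fix can be sketched as a small decision function. This is a minimal illustration, not the code in `streamAgentLoop`: the names `TurnSummary`, `decideAfterTurn`, and `MAX_NUDGES_PER_RUN` are hypothetical, and only the rules themselves (nudge on any non-empty text-only turn, at most twice per run) come from the description above.

```typescript
// Hypothetical shapes; the real loop's data structures are not shown in this doc.
interface TurnSummary {
  text: string;          // assistant text emitted this turn
  toolCallCount: number; // number of tool calls emitted this turn
}

type NudgeDecision = "continue" | "nudge" | "finish";

const MAX_NUDGES_PER_RUN = 2; // "up to 2x/run" from the fix description

function decideAfterTurn(turn: TurnSummary, nudgesSent: number): NudgeDecision {
  // A turn that called tools keeps the loop going as usual.
  if (turn.toolCallCount > 0) return "continue";

  // Old gate (the bug): only text longer than 200 chars triggered a nudge,
  // so a dense 195-char CJK plan ended the run with zero edits.
  const hasText = turn.text.trim().length > 0;
  if (hasText && nudgesSent < MAX_NUDGES_PER_RUN) return "nudge";

  // Empty turn, or the nudge budget is spent: let the run complete.
  return "finish";
}
```

Capping the nudge count keeps a model that is genuinely done from being pushed into fabricating an edit, which is the reason the wording was softened as well.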
Bug B — Gemini "Generation stopped unexpectedly (reason: error)."
- Not in the DB (failed chats aren't persisted). Only in PostHog `llm_error`. 7d: gemini-3-flash-preview = 1,575 errors / 557 sessions (it's the #1 chat model at ~29k msgs/7d, so even a few % = many users).
- ~half is Google's non-bypassable safety filter (our `BLOCK_NONE` can't override child-safety → `PROHIBITED_CONTENT`), ~half is Google's preview-model upstream instability. Not temperature (no provider rejects on sampling). Context length only bites the 163K-context models (deepseek-v3.2), not Gemini (1M window).
- Staged (uncommitted): capture the OpenRouter generation id + real upstream reason in logs so the cause stops being a mystery. Real cure is a product call (move off `-preview` / add a fallback model).
- Also surfaced two real server bugs: `"cannot execute INSERT in a read-only transaction"` (~106/7d) and a `messages` foreign-key violation (18/7d). Worth a separate fix.
2b. How the studio agent works today (the moat)
streamAgentLoop (packages/server/src/routes/agent.ts): server-side SSE tool loop, MAX_ITERATIONS=50, read/write tool split, 14 tools, DB-backed agent_runs with crash recovery (heartbeat + Redis cross-replica stop), per-turn credit metering (agent.ts:1526), the held-edit publish-review gate (pending-edit.ts), undo snapshots, multi-model OpenRouter routing. None of this comes from any framework — it is the moat.
2c. The large-file failures (mechanics + scale)
- Read-spiral: TSX > 200K chars is snipped to a 60K preview; weak models re-read the same id, hit the 3-read guard, and quit (the "Stopped: re-read index.tsx three times" terminations).
- Unclosed-bracket class: `edit_custom_ui` requires a unique `old_code` match and recompiles the WHOLE file (sucrase); on a 4,500-line file the model edits from partial context, unbalances a brace far from the edit, and every edit then fails until it widens context.
- Prod scale (3,077 worlds): 58% multi-file; 80% of published worlds multi-file; 345 (11%) have index.tsx > 50KB; largest 435KB (Life Restart Simulator), 370KB, 180KB. All still sucrase-compiled TSX, with zero npm/engines yet. So today's pain is "giant TSX files," not "game engines."
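One way to keep a re-read guard without punishing legitimate paging is to key it on the exact read request rather than on the file id alone. The sketch below is an assumption-laden illustration of that idea: `ReadDedupGuard`, its verdict names, and the `invalidate` hook are hypothetical, and only the three-read threshold mirrors the guard described above.

```typescript
type ReadVerdict = "serve" | "stub" | "block";

// Hypothetical guard: tracks identical (fileId, offset, limit) reads within one run.
class ReadDedupGuard {
  private readonly seen = new Map<string, number>();
  private readonly maxIdenticalReads: number;

  constructor(maxIdenticalReads = 3) {
    this.maxIdenticalReads = maxIdenticalReads;
  }

  check(fileId: string, offset: number, limit: number): ReadVerdict {
    const key = `${fileId}:${offset}:${limit}`;
    const count = (this.seen.get(key) ?? 0) + 1;
    this.seen.set(key, count);
    if (count === 1) return "serve";                     // first time: return real content
    if (count < this.maxIdenticalReads) return "stub";   // repeat: return a short "already read" stub
    return "block";                                      // third identical read: hard-block
  }

  // After an edit the cached view is stale, so reads of that file are allowed again.
  invalidate(fileId: string): void {
    for (const key of Array.from(this.seen.keys())) {
      if (key.startsWith(`${fileId}:`)) this.seen.delete(key);
    }
  }
}
```

Reading a different page of the same file produces a different key, so a model that walks a large file in windows is never penalised; only byte-identical re-reads escalate.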
2d. Framework research (build-on-top vs adopt)
- OpenClaw (TS monorepo): has a genuinely embeddable `@openclaw/agent-core` + a pluggable file-tool `operations` bridge (the pattern that lets an agent edit a virtual FS). But the good parts are `private`/unversioned + coupled to a vendored "pi" fork → vendor-and-own-forever, no stability contract.
- Hermes (Python "agent OS", 90+ tools): drivable only as a Python sidecar over ACP-stdio + an FS-materialization shim. Heavy infra/second-language for ~5% use. Conditional NO.
- Both have great patterns to steal: paged reads, self-healing/fuzzy edits, read-dedup loop-breaker, 3-layer output budget, diff-as-permission, sub-agent context isolation, and LSP-after-every-edit.
- Verdict: the agent loop is the most commoditized AND most moat-wired layer → own it, steal the tooling.
2e. Substrate / harness / play-runtime (research)
- Substrate: don't build your own sandbox yet. For the heavy/game tier use a managed snapshot-first provider (Daytona or E2B bake-off); the idle policy (aggressive pause + snapshot) matters more than the provider — keep-warm = ~$37/world/mo, snapshot-first = ~$0.03-0.08/world/mo. Self-host Firecracker is the endgame, deferred until volume justifies a platform team.
- Harness: keep our loop. Claude Agent SDK is Anthropic-only + individual-use-licensed → can't back a multi-tenant free tier and kills multi-model arbitrage. Add MCP so the agent learns Yumina's domain.
- Play-runtime: keep it client-executed in the iframe forever (~$0/player compute). Widen the renderer DOM → Canvas2D/Pixi → WebGL/Three, lazy-loaded, tied to world-complexity tiers. Server-side / pixel-streaming for free players is economically fatal. (Rosebud/Websim ship the engine to the browser; only Roblox runs server sim, and Roblox isn't also paying an LLM bill per session.)
- Migration/moat: a v20 world is already a "project" (`rootComponent.files`). Two-tier behind one `rootComponent` contract: simple/text worlds keep today's lightweight JSONB+iframe path (cheap); complex/game worlds opt into "project mode" (real compiler + real FS + stronger sandbox). Strong lean: own the harness, swap the substrate.
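To make the idle-policy point concrete, the per-world estimates above can be scaled to a fleet. This is back-of-envelope arithmetic only: the per-world figures are the ones quoted in the substrate bullet, while the function name and the 1,000-world example are illustrative assumptions, not a forecast.

```typescript
// Per-world monthly estimates quoted in the substrate research above (USD).
const KEEP_WARM_USD = 37;
const SNAPSHOT_USD_LOW = 0.03;
const SNAPSHOT_USD_HIGH = 0.08;

interface MonthlyCostEstimate {
  keepWarm: number;
  snapshotLow: number;
  snapshotHigh: number;
}

// Round to cents so floating-point noise does not leak into the figures.
function roundCents(usd: number): number {
  return Math.round(usd * 100) / 100;
}

// `sandboxedWorlds` is a hypothetical input: how many worlds get a sandbox at all.
function estimateMonthlyCost(sandboxedWorlds: number): MonthlyCostEstimate {
  return {
    keepWarm: roundCents(sandboxedWorlds * KEEP_WARM_USD),
    snapshotLow: roundCents(sandboxedWorlds * SNAPSHOT_USD_LOW),
    snapshotHigh: roundCents(sandboxedWorlds * SNAPSHOT_USD_HIGH),
  };
}

// Example: 1,000 sandboxed worlds.
const example = estimateMonthlyCost(1000);
console.log(example); // keepWarm: 37000, snapshotLow: 30, snapshotHigh: 80
```

At these figures keep-warm is several hundred times more expensive than snapshot-first regardless of fleet size, which is why the idle policy dominates the provider choice.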
2f. How Roblox & Godot actually work (the principle)
Neither uses npm.
| | Roblox | Godot | → concept |
|---|---|---|---|
| Engine/API | Roblox APIs (Luau) | Node API (GDScript/C#) | the runtime contract you code against |
| Project | instance tree (.rbxl) | scene/node tree + files | composable structure |
| Packages | Wally (curated) | Asset Library (curated) | a VETTED library, NOT npm install * |
| Build | publish to host | export to WASM/native | a real build/bundle step |
| Execution | server-authoritative (expensive) | client-side (cheap) | where the sim runs = where cost lives |
The lesson: a real module system over a curated library over a platform engine SDK, with a real build step and a clear client/server split. They curate on purpose: open npm in a multi-tenant untrusted platform is a supply-chain/malware surface, a download-size bomb, and a runtime-contract killer.
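A curated library implies a build-time check that rejects any import outside the blessed set. The following is a minimal sketch of such a check under stated assumptions: the allowlist contents and the function names are hypothetical, and relative imports are assumed to refer to the world's own project files.

```typescript
// Illustrative allowlist; the real curated set is a product decision.
const BLESSED_PACKAGES: ReadonlySet<string> = new Set(["pixi.js", "three"]);

// "three/examples/jsm/x.js" -> "three"; "@scope/pkg/sub" -> "@scope/pkg".
function packageName(specifier: string): string {
  const parts = specifier.split("/");
  if (specifier.startsWith("@")) return parts.slice(0, 2).join("/");
  return parts[0] ?? specifier;
}

function isAllowedImport(specifier: string): boolean {
  // Relative imports point at the world's own files and are always allowed.
  if (specifier.startsWith("./") || specifier.startsWith("../")) return true;
  return BLESSED_PACKAGES.has(packageName(specifier));
}

// Returns the specifiers a build should reject, in input order.
function findRejectedImports(specifiers: readonly string[]): string[] {
  return specifiers.filter((s) => !isAllowedImport(s));
}
```

Running this before the bundler means an unvetted package never reaches resolution, which addresses the supply-chain and download-size concerns at the same point.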
3. The decision & strategy
Three layers, treated differently. Two horizons, phased.
ADOPT (don't reinvent)     OWN (the moat)           BUILD (the real unlock)
┌────────────────────────┐ ┌──────────────────────┐ ┌─────────────────────────────┐
│ large-file TOOLING:    │ │ agent loop SHELL:    │ │ SUBSTRATE (games bet):      │
│ paged reads, self-heal │▶│ credit metering,     │ │ real compiler + curated     │
│ edits, grep, sub-agents│ │ held-edit gate, crash│ │ library over the useYumina  │
│ + TS parser/LSP tool   │ │ recovery, multi-model│ │ engine SDK; stronger play   │
│                        │ │ OpenRouter arbitrage │ │ sandbox; renderer tiers     │
│ steal from openclaw/   │ │ DO NOT rent - billing│ │ PLAY: client-executed iframe│
│ hermes                 │ │ + recovery live here │ │ (Godot model), server=AI/   │
└────────────────────────┘ └──────────────────────┘ │ state/social/credits        │
                                                    └─────────────────────────────┘

Yumina's "engine" is the useYumina() SDK + iframe runtime (analogous to Roblox's APIs / Godot's node API). Creators compose against THAT + a curated library, not against raw npm. The per-world sandbox with a terminal is the AI agent's workshop (to run builds/tests), NOT how the creator's world is modeled.
4. The plan
Phase 0 — SHIPPED
- Nudge fix for describe-and-stop (PR #24, live on main).
Phase 1 — Large-file tooling on our loop (NOW, ~weeks, in-process, no new infra)
Targets the 345 worlds with 50KB+ TSX and the read-spiral/bracket failures.
- W1 (P1) TS parser / LSP-after-edit: a `validate_tsx` capability (TS compiler API / oxc/swc/esbuild parse) auto-run after every `edit_custom_ui`, injecting the exact error line so the agent fixes the brace in-turn. Kills the unclosed-bracket class. (Approved: full parser.)
- W2 (P1) Self-healing fuzzy edit for `edit_custom_ui` (escalating match + structured no-match hint).
- W3 (P2) Read-dedup loop-breaker (cache id+offset+limit; stub + hard-block on re-read).
- W4 (P2) Structural/anchored edits (replace a function/JSX block by name/range via W1's parser).
- W5 (P2) Sub-agent-per-file (delegate one huge file to an isolated child; parent context stays clean).
- W6 (P2) File-splitting tool/skill (refactor a giant index.tsx into multiple files — attacks the 435KB-single-file root cause and is the on-ramp to the project model).
- W7 (P3) Lightweight planning/todo tool (keeps multi-step builds on track; complements the nudge).
All behind flags, independently testable, credits/recovery/held-edit/multi-model intact.
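The "escalating match + structured no-match hint" idea in W2 could look roughly like the sketch below. It is one possible shape, not the planned implementation: `applyFuzzyEdit`, `EditOutcome`, and the two-stage escalation (exact, then whitespace-insensitive line matching) are assumptions chosen for illustration.

```typescript
type EditOutcome =
  | { ok: true; code: string; strategy: "exact" | "whitespace-insensitive" }
  | { ok: false; hint: string };

function countOccurrences(haystack: string, needle: string): number {
  if (needle.length === 0) return 0;
  let count = 0;
  let idx = haystack.indexOf(needle);
  while (idx !== -1) {
    count++;
    idx = haystack.indexOf(needle, idx + needle.length);
  }
  return count;
}

function applyFuzzyEdit(source: string, oldCode: string, newCode: string): EditOutcome {
  if (oldCode.trim().length === 0) {
    return { ok: false, hint: "old_code is empty; quote the lines you want to replace" };
  }

  // Stage 1: exact, unique substring match.
  const exact = countOccurrences(source, oldCode);
  if (exact === 1) {
    // Function replacer so "$&"-style sequences in newCode are inserted literally.
    return { ok: true, code: source.replace(oldCode, () => newCode), strategy: "exact" };
  }
  if (exact > 1) {
    return { ok: false, hint: `old_code matches ${exact} places; add surrounding lines to make it unique` };
  }

  // Stage 2: compare line by line, ignoring leading/trailing whitespace.
  const wanted = oldCode.trim().split("\n").map((line) => line.trim());
  const lines = source.split("\n");
  const starts: number[] = [];
  for (let i = 0; i + wanted.length <= lines.length; i++) {
    if (wanted.every((w, j) => (lines[i + j] ?? "").trim() === w)) starts.push(i);
  }
  const start = starts[0];
  if (starts.length === 1 && start !== undefined) {
    const patched = [...lines.slice(0, start), newCode, ...lines.slice(start + wanted.length)];
    return { ok: true, code: patched.join("\n"), strategy: "whitespace-insensitive" };
  }

  // Stage 3: no usable match, so return a hint the model can act on.
  const firstWanted = wanted[0] ?? "";
  const near = lines.findIndex((line) => line.trim() === firstWanted);
  const where = near === -1 ? "its first line was not found anywhere" : `its first line appears at line ${near + 1}`;
  return { ok: false, hint: `old_code did not match uniquely (${starts.length} loose matches); ${where}` };
}
```

The hint is the part that breaks the failure loop: instead of a bare "no match", the model learns whether to widen context or re-read a specific region.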
Phase 2 — Substrate (CEO-gated, the games bet)
- `mode: "simple" | "project"` flag; `rootComponent.files` stays the single contract.
- Real compiler (esbuild/SWC; esbuild-wasm client-side where the threat model allows) replacing the regex bundler → real modules.
- Curated library (blessed Pixi/Three/etc. exposed via the SDK): the Wally/Asset-Library model. No open `npm install` on the default/free tier.
- Real FS / blob storage + history for project-tier worlds.
- Stronger play sandbox: a true separate origin (`sandbox.yumina.io`) before running real code (audit flagged current sandbox is opaque-same-origin).
- Play-runtime renderer tiers inside the iframe (Canvas2D/Pixi → WebGL/Three), lazy-loaded; never server-executed on the free tier.
- Managed sandbox (Daytona/E2B) ONLY for project-tier server builds, paid-gated, aggressive pause-on-idle.
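The gating rules for the managed sandbox can be written down as a single pure policy function. The sketch below is illustrative only: the context fields, the action names, and the five-minute idle threshold are assumptions; the rules it encodes (project tier only, paid-gated, pause on idle) are the ones listed above.

```typescript
// Hypothetical inputs for the policy; field names are not from the codebase.
interface SandboxContext {
  mode: "simple" | "project";
  paidTier: boolean;
  lastActivityMs: number; // timestamp of the last authoring/build activity
  nowMs: number;
}

type SandboxAction = "none" | "run" | "pause-and-snapshot";

const IDLE_PAUSE_AFTER_MS = 5 * 60 * 1000; // assumed threshold, to be tuned

function decideSandboxAction(ctx: SandboxContext): SandboxAction {
  // Simple-tier and unpaid worlds never get a managed sandbox.
  if (ctx.mode !== "project" || !ctx.paidTier) return "none";

  // Aggressive pause-on-idle: snapshot as soon as the world goes quiet.
  const idleForMs = ctx.nowMs - ctx.lastActivityMs;
  return idleForMs >= IDLE_PAUSE_AFTER_MS ? "pause-and-snapshot" : "run";
}
```

Keeping the policy free of provider calls makes it testable on its own and independent of the Daytona/E2B choice.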
Spike — rented harness (time-boxed, decision-gated)
Port project-tier authoring onto a rented harness (Claude Agent SDK / Codex / openclaw-core) in a throwaway branch; measure credit/recovery integration loss + switching cost; decide rent-vs-own from data.
5. My suggestions (knowing the vision)
- The agent loop is not the risk to the vision; the substrate ceiling, security, and play-runtime are. The regex bundler + JSONB + opaque-same-origin iframe is what caps complexity. Invest there, not in renting a loop you'd have to re-wire your billing and recovery around.
- Make `useYumina()` a real engine contract, and ship a curated library: that's the move that lets worlds get arbitrarily complex without opening npm. This is exactly what Roblox/Godot do. It's safer, cheaper, and gives you a stable platform contract.
- Be Godot's project model + Yumina's server. Not Roblox's cost model. Client runs the game (cheap); your server provides the AI brain, persistence, multiplayer, and credits, the things a local engine can't. That hybrid is your real engine and your moat.
- Sequence: ship Phase 1 now (relief for the power users hurting today), settle the founder decisions in CEO review, then build Phase 2 as the durable moat. Keep the rent-the-harness option alive as a measured spike, not a blind bet.
- Start W6 (file-splitting) early — it both fixes today's 435KB-file pain and is the natural on-ramp to the multi-file project model, so it pays off in both horizons.
- Treat free-tier sandbox policy as the load-bearing economic decision. If sandboxes are credit/paid-gated (only during active heavy authoring/build), the cost problem largely evaporates.
6. Open founder decisions (for /plan-ceo-review)
- Do free users get real sandboxes, or is project-mode/sandbox paid-gated? (drives free-tier economics)
- Is "game" visual fidelity or interactive systems? (decides Canvas2D-in-iframe vs needing real engines)
- Will heavy 3D / streamed worlds EVER touch the free tier? (recommend: structurally no)
- Is multi-model OpenRouter arbitrage permanent, or a bootstrapping phase? (decides if Claude SDK ever fits)
- Self-host Firecracker eventually, or stay on managed sandboxes? (sets whether the endgame substrate happens)
- One-way or reversible "project-ification" of a world? (reversible keeps the cost lever)
7. NOT in scope (Phase 1) / Risks
NOT in scope now: open npm, game engines, server-side game logic, pixel streaming, the harness swap, moving text-tier worlds off JSONB, replacing the iframe runtime.
Risks: idle sandbox cost blowout (Phase 2 — mitigate with pause-on-idle + snapshot); untrusted-code escape (Phase 2 — microVM isolation + egress lockdown + a real sandbox origin); two-codepath drift (keep rootComponent.files as the single contract); Phase 1 regressions (gate the auto-validate + fuzzy-edit behind size thresholds + flags so the common small-world case stays fast).
Foundation for AI-native Roblox (added 2026-06-01, code-grounded audit)
The core finding: today the AI is the rules engine — state changes because the model types [var: op value] in prose and the server regex-scrapes + applies it (response-parser.ts → messages.ts:1040 applyEffects → runReactionChain). Reactions only react to what the AI already did. Consequence: client/AI state writes are taken on faith (sessions.ts:428-472 merges unvalidated client variables; sandbox setVariable is fire-and-forget) — a player can set gold=999999, a creator's code can mint coins. Fine for narrative; fatal for games (need deterministic rules), multiplayer (need shared truth), and economy (need unforgeable value).
The foundational move: keep AI-prose-driven state for narrative; add a server-authoritative layer for value / shared state / game rules. The AI and client request changes via validated verbs; the server decides. This is the single shift from text-sim to game engine.
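A minimal sketch of "request via validated verbs; the server decides" is shown below. Everything named here is hypothetical: the verb shapes, the `Actor` values, and the rule that only server-side rules may grant value are assumptions used to illustrate the pattern, not the planned API.

```typescript
type Actor = "client" | "ai" | "server-rule";

interface StateVerb {
  kind: "spend" | "grant";
  currency: string;
  amount: number;
}

type Balances = Readonly<Record<string, number>>;

type VerbOutcome =
  | { ok: true; balances: Balances }
  | { ok: false; reason: string };

// Pure server-side reducer: callers propose a verb, the server accepts or rejects it.
function applyVerb(balances: Balances, verb: StateVerb, actor: Actor): VerbOutcome {
  if (!Number.isSafeInteger(verb.amount) || verb.amount <= 0) {
    return { ok: false, reason: "amount must be a positive integer" };
  }
  const current = balances[verb.currency] ?? 0;

  if (verb.kind === "grant") {
    // Assumed rule: value is only minted by server-side game rules, never on request.
    if (actor !== "server-rule") {
      return { ok: false, reason: "only server-side rules may grant value" };
    }
    return { ok: true, balances: { ...balances, [verb.currency]: current + verb.amount } };
  }

  // Spend: reject overdrafts instead of trusting the caller's view of the balance.
  if (current < verb.amount) {
    return { ok: false, reason: "insufficient balance" };
  }
  return { ok: true, balances: { ...balances, [verb.currency]: current - verb.amount } };
}
```

Under this shape a client that tries to set gold to 999999 has no verb that expresses it: the only path to more value is a grant the server itself issues.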
Architecture: one stable contract (useYumina() + rootComponent), four layers:
- Project (authoring): multi-file + real compiler (esbuild) + curated library. NEW (Phase 2). W6 is the on-ramp.
- Authoritative state (the keystone): server-validated ledger the client/AI can't forge. REUSE the existing hash-chained ledger (`transaction-hash.ts`) + atomic overdraft-guarded deduction (`credit-service.ts`).
- Multiplayer (shared sessions): host-as-DM rooms + Redis pub/sub + a `state_version`. REUSE the dormant `origin/multiplayer` branch (~80% of co-op v1) + main's Redis room-manager.
- Economy (creator-defined): per-world economy manifest + economy ledger + server-mediated `economy.buy/spend/grant` verbs. REUSE the mature money rails (Stripe Connect payouts, `creatorEarnings`, 20% take, 7-day hold, dispute handling, mushie P2P transfer).
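For readers unfamiliar with the term, a hash-chained ledger links each entry to the hash of the one before it, so editing or dropping an entry breaks verification. The sketch below shows the technique only and makes no claim about how `transaction-hash.ts` works: all names are hypothetical, and `toyHash` is a small FNV-1a-style stand-in that is NOT cryptographically secure (a real ledger would use a cryptographic hash such as SHA-256).

```typescript
const GENESIS_HASH = "00000000";

// Toy 32-bit FNV-1a-style hash over UTF-16 code units. Illustration only, not secure.
function toyHash(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

interface LedgerEntrySketch {
  seq: number;
  delta: number;   // signed amount
  memo: string;
  prevHash: string; // hash of the previous entry, or GENESIS_HASH for the first
  hash: string;     // hash over this entry's fields plus prevHash
}

function appendEntry(chain: readonly LedgerEntrySketch[], delta: number, memo: string): LedgerEntrySketch[] {
  const prev = chain[chain.length - 1];
  const prevHash = prev ? prev.hash : GENESIS_HASH;
  const seq = chain.length;
  const hash = toyHash(`${seq}|${delta}|${memo}|${prevHash}`);
  return [...chain, { seq, delta, memo, prevHash, hash }];
}

function verifyChain(chain: readonly LedgerEntrySketch[]): boolean {
  let expectedPrev = GENESIS_HASH;
  for (let i = 0; i < chain.length; i++) {
    const entry = chain[i];
    if (!entry) return false;
    // Each entry must sit at its sequence number and point at its predecessor.
    if (entry.seq !== i || entry.prevHash !== expectedPrev) return false;
    // Recompute the hash from the stored fields to detect in-place edits.
    if (entry.hash !== toyHash(`${entry.seq}|${entry.delta}|${entry.memo}|${entry.prevHash}`)) return false;
    expectedPrev = entry.hash;
  }
  return true;
}
```

Because each hash covers the previous hash, rewriting one entry forces a rewrite of every later entry, which is what makes silent tampering detectable.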
~70% of the plumbing already exists (money rails mature; multiplayer branch ~80%; reactions are a clean deterministic substrate; rootComponent is already a project). The foundation is wiring existing pieces onto a server-authoritative spine, not a rewrite.
Fastest path: (1) finish Phase 1 large-file tooling + Phase 1.5 validate-and-stamp-on-write (also closes the forgeable-state hole) → (2) the server-authoritative state keystone → (3) Economy v1: virtual currencies bought with mushies, riding existing payouts (weeks, lowest legal risk) → (4) Multiplayer v1: rebase the branch, co-op narrative behind a flag (weeks) → (5) Project substrate (real compiler + curated library + render tiers), CEO-gated, parallel → deferred: real-money DevEx (legal-heavy), realtime-twitch (websockets + tick loop, different cost class), rented-harness spike.
Founder decisions (recommendations):
- Economy convertibility line — recommend NO direct cash-out of in-game currency. Players spend real money / mushies; creators earn cashable mushies via the existing rail; in-game currency stays virtual + non-redeemable. Keeps you out of money-transmitter/KYC/tax weight. This is the highest-stakes call.
- "Game" = visual fidelity vs deterministic systems (decides project-substrate vs authoritative-state priority; lean authoritative-state first).
- Multiplayer = co-op narrative now, realtime deferred (recommend yes).
- Free-tier stays client-executed; gate server-sim/sandbox behind paid.
Biggest risks: forgeable state must be closed before ANY economy of value (Phase 1.5 first); real-money convertibility = regulatory weight; the "AI-types-brackets" mutation model is antithetical to deterministic games (bolting an authoritative rule layer onto it is the highest-uncertainty piece); host-pays multiplayer has griefing/cost-abuse exposure; the opaque same-origin iframe (NOT the sandbox.yumina.io the docs claim) needs a real separate origin before running heavier/curated-library code.
Appendix — key file references
- Agent loop: `packages/server/src/routes/agent.ts` (streamAgentLoop, nudge ~1585, read-spiral guard ~1786, credit metering ~1526, snipLargeResult ~1022).
- Tools: `packages/server/src/lib/studio-tools/{tools.ts,tool-executor.ts,system-prompt.ts}` (edit_custom_ui ~1235, grep_world ~340, read pagination ~116).
- World persistence + held-edit gate: `packages/server/src/lib/pending-edit.ts`.
- World schema (v20, rootComponent.files): `packages/engine/src/world/schema.ts` (~302-351).
- Play runtime: `packages/app/src/features/chat/world-renderer.tsx` + `packages/app/sandbox/*`.
- LLM/Gemini handling: `packages/server/src/lib/llm/openrouter.ts` (finish_reason ~426/~649), `packages/server/src/routes/messages.ts` (stream error ~929).
- Prior architecture audit: `docs/architecture-audit-2026-05-29.md`.
