Yumina Studio Agent + "Roblox for AI Games" — Findings, Plan & Recommendations

Date: 2026-06-01
Author: Jefray (+ Claude)
Status: Direction agreed. Phase 1 ready to build. Phase 2 (substrate) pending founder decisions / CEO review.

TL;DR

The studio AI assistant was failing ~30-52% of runs (describe-and-stop + read-spiral on large TSX). The describe-and-stop half is fixed and shipped (PR #24). The large-file half is Phase 1 below.
The Gemini chat errors are mostly Google's upstream + non-bypassable safety filter (not ours); a diagnostics patch is staged.
Strategic question answered: do not rent the agent loop. Own the loop (it holds the moat: credit metering, held-edit gate, crash recovery, multi-model OpenRouter arbitrage). Steal the large-file tooling from the frontier agents. Build the substrate for the games future. Keep play client-executed.
How to grow toward "Roblox for AI games": copy Godot's project model (real files, real compiler, a curated library over a platform engine SDK) and pair it with Yumina's own server for the AI/state/social layer. Do not copy Roblox's server-authoritative cost model, and do not open raw npm install.

1. The vision (what we're building toward)

Yumina = "Roblox for AI-native games." Worlds evolve text → visual novel → real games (eventually game-engine-class). More code, smarter agents, larger worlds over time. The architecture must raise the complexity ceiling without breaking the free-tier economics or the security model.

2. What we found this session

2a. The two production bugs (investigated + acted on)

Bug A — Studio AI assistant "describes a change then stops."

Root cause: the agent loop ends on any text-only LLM turn. A one-time nudge only fired for text > 200 chars. Dense CJK plans are short (the reported screenshot was exactly 195 chars), so they slipped the gate and the run completed without editing.
Prod evidence (14d): 734/2466 (29.8%) completed runs made zero edits; daily rate climbed to 52% on 5/31. Worst models: gpt-5.4 59%, deepseek-v4-pro 54%, sonnet-4.6 26%.
Fix SHIPPED (PR #24, merged to main): nudge on any non-empty text-only turn, up to 2×/run, softened wording so a genuinely-done model can decline instead of fabricating an edit.
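
A minimal sketch of the gate change described above, to make the 195-char miss concrete. All names (LlmTurn, shouldNudgeOld, shouldNudge, MAX_NUDGES_PER_RUN) are hypothetical; the shipped logic lives in streamAgentLoop and may differ in detail.

```typescript
// Hypothetical names throughout; this only illustrates the before/after gate.
interface LlmTurn {
  text: string;
  toolCalls: unknown[]; // an empty array means a text-only turn
}

const MAX_NUDGES_PER_RUN = 2; // "up to 2x/run" per the fix description

// Old gate: one nudge, and only for text-only turns longer than 200 chars.
function shouldNudgeOld(turn: LlmTurn, nudgesSent: number): boolean {
  return turn.toolCalls.length === 0 && turn.text.length > 200 && nudgesSent < 1;
}

// New gate: any non-empty text-only turn, capped per run.
function shouldNudge(turn: LlmTurn, nudgesSent: number): boolean {
  const textOnly = turn.toolCalls.length === 0;
  const hasText = turn.text.trim().length > 0;
  return textOnly && hasText && nudgesSent < MAX_NUDGES_PER_RUN;
}

// A dense CJK plan of 195 characters, like the reported screenshot.
const densePlan: LlmTurn = { text: "计".repeat(195), toolCalls: [] };

console.log(shouldNudgeOld(densePlan, 0)); // false: slipped under the 200-char gate
console.log(shouldNudge(densePlan, 0));    // true: nudged on the first text-only turn
console.log(shouldNudge(densePlan, 2));    // false: cap reached, the run may end
```

The cap matters as much as the widened gate: once it is reached the loop ends normally, which is what lets a genuinely-done model decline rather than fabricate an edit.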

Bug B — Gemini "Generation stopped unexpectedly (reason: error)."

Not in the DB (failed chats aren't persisted). Only in PostHog llm_error. 7d: gemini-3-flash-preview = 1,575 errors / 557 sessions (it's the #1 chat model at ~29k msgs/7d, so even a few % = many users).
~half is Google's non-bypassable safety filter (our BLOCK_NONE can't override child-safety → PROHIBITED_CONTENT), ~half is Google's preview-model upstream instability. Not temperature (no provider rejects on sampling). Context length only bites the 163K-context models (deepseek-v3.2), not Gemini (1M window).
Staged (uncommitted): capture the OpenRouter generation id + real upstream reason in logs so the cause stops being a mystery. Real cure is a product call (move off -preview / add a fallback model).
Also surfaced two real server bugs: "cannot execute INSERT in a read-only transaction" (~106/7d) and a messages foreign-key violation (18/7d). Worth a separate fix.

2b. How the studio agent works today (the moat)

streamAgentLoop (packages/server/src/routes/agent.ts): server-side SSE tool loop, MAX_ITERATIONS=50, read/write tool split, 14 tools, DB-backed agent_runs with crash recovery (heartbeat + Redis cross-replica stop), per-turn credit metering (agent.ts:1526), the held-edit publish-review gate (pending-edit.ts), undo snapshots, multi-model OpenRouter routing. None of this comes from any framework — it is the moat.

2c. The large-file failures (mechanics + scale)

Read-spiral: TSX > 200K chars is snipped to a 60K preview; weak models re-read the same id, hit the 3-read guard, and quit (the "Stopped: re-read index.tsx three times" terminations).
Unclosed-bracket class: edit_custom_ui requires a unique old_code match and recompiles the WHOLE file (sucrase); on a 4,500-line file the model edits from partial context, unbalances a brace far from the edit, and every edit then fails until it widens context.
Prod scale (3,077 worlds): 58% multi-file; 80% of published worlds multi-file; 345 (11%) have index.tsx > 50KB; largest 435KB (Life Restart Simulator), 370KB, 180KB. All still sucrase-compiled TSX — zero npm/engines yet. So today's pain is "giant TSX files," not "game engines."
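
The read-spiral above comes from re-requesting the same window of the same file. A rough sketch of a guard keyed on id + offset + limit (the W3 idea in the plan below) shows how an identical re-read can be stubbed and then blocked while paging forward stays allowed. ReadDedupGuard and its messages are hypothetical, not the existing 3-read guard.

```typescript
// Hypothetical sketch: none of these names come from the Yumina codebase.
type ReadOutcome =
  | { kind: "content"; text: string }
  | { kind: "stub"; text: string }
  | { kind: "blocked"; text: string };

class ReadDedupGuard {
  // Counts how often each exact window (id + offset + limit) was requested.
  private readonly seen = new Map<string, number>();

  read(id: string, offset: number, limit: number, source: string): ReadOutcome {
    const key = `${id}:${offset}:${limit}`;
    const previous = this.seen.get(key) ?? 0;
    this.seen.set(key, previous + 1);

    if (previous === 0) {
      return { kind: "content", text: source.slice(offset, offset + limit) };
    }
    const range = `${id} [${offset}, ${offset + limit})`;
    if (previous === 1) {
      // Second identical request: return a cheap stub instead of the content.
      return { kind: "stub", text: `Already read ${range}. Request another offset or make the edit.` };
    }
    // Third and later identical requests: hard-block.
    return { kind: "blocked", text: `Blocked: ${range} was already returned. Page forward or grep instead.` };
  }

  // An edit changes the file, so cached windows for that id become stale.
  invalidate(id: string): void {
    Array.from(this.seen.keys())
      .filter((key) => key.startsWith(`${id}:`))
      .forEach((key) => this.seen.delete(key));
  }
}

const guard = new ReadDedupGuard();
const file = "x".repeat(1000);
const kinds = [
  guard.read("index.tsx", 0, 100, file).kind,   // first read of the window
  guard.read("index.tsx", 0, 100, file).kind,   // same window again
  guard.read("index.tsx", 0, 100, file).kind,   // third identical request
  guard.read("index.tsx", 100, 100, file).kind, // paging forward is still allowed
];
console.log(kinds.join(", ")); // content, stub, blocked, content
```

Keying on the window rather than the file id alone is the design point: a guard that counts reads per id punishes legitimate paging through a 435KB file, while this one only punishes repetition.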

2d. Framework research (build-on-top vs adopt)

OpenClaw (TS monorepo): has a genuinely embeddable @openclaw/agent-core + a pluggable file-tool operations bridge (the pattern that lets an agent edit a virtual FS). But the good parts are private/unversioned + coupled to a vendored "pi" fork → vendor-and-own-forever, no stability contract.
Hermes (Python "agent OS", 90+ tools): drivable only as a Python sidecar over ACP-stdio + an FS-materialization shim. Heavy infra/second-language for ~5% use. Conditional NO.
Both have great patterns to steal: paged reads, self-healing/fuzzy edits, read-dedup loop-breaker, 3-layer output budget, diff-as-permission, sub-agent context isolation, and LSP-after-every-edit.
Verdict: the agent loop is the most commoditized AND most moat-wired layer → own it, steal the tooling.

2e. Substrate / harness / play-runtime (research)

Substrate: don't build your own sandbox yet. For the heavy/game tier use a managed snapshot-first provider (Daytona or E2B bake-off); the idle policy (aggressive pause + snapshot) matters more than the provider — keep-warm = ~$37/world/mo, snapshot-first = ~$0.03-0.08/world/mo. Self-host Firecracker is the endgame, deferred until volume justifies a platform team.
Harness: keep our loop. Claude Agent SDK is Anthropic-only + individual-use-licensed → can't back a multi-tenant free tier and kills multi-model arbitrage. Add MCP so the agent learns Yumina's domain.
Play-runtime: keep it client-executed in the iframe forever (~$0/player compute). Widen the renderer DOM → Canvas2D/Pixi → WebGL/Three, lazy-loaded, tied to world-complexity tiers. Server-side / pixel-streaming for free players is economically fatal. (Rosebud/Websim ship the engine to the browser; only Roblox runs server sim, and Roblox isn't also paying an LLM bill per session.)
Migration/moat: a v20 world is already a "project" (rootComponent.files). Two-tier behind one rootComponent contract: simple/text worlds keep today's lightweight JSONB+iframe path (cheap); complex/game worlds opt into "project mode" (real compiler + real FS + stronger sandbox). Strong lean: own the harness, swap the substrate.
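
To make the idle-policy point concrete, here is a back-of-envelope calculation using the per-world figures quoted above. The fleet size is an assumption: it borrows the 345 worlds with index.tsx > 50KB as a stand-in for "worlds that might need a sandbox", which the research did not claim.

```typescript
// Per-world monthly figures quoted above; everything else here is illustrative.
const KEEP_WARM_USD_PER_WORLD = 37;
const SNAPSHOT_FIRST_USD_PER_WORLD = { low: 0.03, high: 0.08 };

// Rounds to cents so floating-point noise does not leak into the report.
function monthlyFleetCost(worlds: number, usdPerWorld: number): number {
  return Math.round(worlds * usdPerWorld * 100) / 100;
}

// Assumption: 345 heavy worlds as a stand-in fleet size.
const heavyWorlds = 345;

const keepWarm = monthlyFleetCost(heavyWorlds, KEEP_WARM_USD_PER_WORLD);
const snapshotLow = monthlyFleetCost(heavyWorlds, SNAPSHOT_FIRST_USD_PER_WORLD.low);
const snapshotHigh = monthlyFleetCost(heavyWorlds, SNAPSHOT_FIRST_USD_PER_WORLD.high);

console.log(`keep-warm: $${keepWarm}/mo`);                          // keep-warm: $12765/mo
console.log(`snapshot-first: $${snapshotLow}-$${snapshotHigh}/mo`); // snapshot-first: $10.35-$27.6/mo
```

At these rates keep-warm costs roughly 460 to 1,230 times more per world than snapshot-first, which is why the idle policy dominates the provider choice.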

2f. How Roblox & Godot actually work (the principle)

Neither uses npm.

	Roblox	Godot	→ concept
Engine/API	Roblox APIs (Luau)	Node API (GDScript/C#)	the runtime contract you code against
Project	instance tree (`.rbxl`)	scene/node tree + files	composable structure
Packages	Wally (curated)	Asset Library (curated)	a VETTED library, NOT `npm install *`
Build	publish to host	export to WASM/native	a real build/bundle step
Execution	server-authoritative (expensive)	client-side (cheap)	where the sim runs = where cost lives

The lesson: a real module system over a curated library over a platform engine SDK, with a real build step and a clear client/server split. They curate on purpose: open npm in a multi-tenant untrusted platform is a supply-chain/malware surface, a download-size bomb, and a runtime-contract killer.

3. The decision & strategy

Three layers, treated differently. Two horizons, phased.

   ADOPT (don't reinvent)      OWN (the moat)              BUILD (the real unlock)
 ┌────────────────────────┐ ┌──────────────────────┐ ┌────────────────────────────┐
 │ large-file TOOLING:    │ │ agent loop SHELL:    │ │ SUBSTRATE (games bet):     │
 │ paged reads, self-heal │▶│ credit metering,     │ │ real compiler + curated    │
 │ edits, grep, sub-agents│ │ held-edit gate, crash│ │ library over the useYumina │
 │ + TS parser/LSP tool   │ │ recovery, multi-model│ │ engine SDK; stronger play  │
 │                        │ │ OpenRouter arbitrage │ │ sandbox; renderer tiers    │
 │ steal from openclaw/   │ │ DO NOT rent — billing│ │ PLAY: client-executed iframe│
 │ hermes                 │ │ + recovery live here │ │ (Godot model), server=AI/  │
 └────────────────────────┘ └──────────────────────┘ │ state/social/credits       │
                                                      └────────────────────────────┘

Yumina's "engine" is the useYumina() SDK + iframe runtime (analogous to Roblox's APIs / Godot's node API). Creators compose against THAT + a curated library, not against raw npm. The per-world sandbox with a terminal is the AI agent's workshop (to run builds/tests), NOT how the creator's world is modeled.

4. The plan

Phase 0 — SHIPPED

Nudge fix for describe-and-stop (PR #24, live on main).

Phase 1 — Large-file tooling on our loop (NOW, ~weeks, in-process, no new infra)

Targets the 345 worlds with 50KB+ TSX and the read-spiral/bracket failures.

W1 (P1) TS parser / LSP-after-edit: a validate_tsx capability (TS compiler API / oxc/swc/esbuild parse) auto-run after every edit_custom_ui, injecting the exact error line so the agent fixes the brace in-turn. Kills the unclosed-bracket class. (Approved: full parser.)
W2 (P1) Self-healing fuzzy edit for edit_custom_ui (escalating match + structured no-match hint).
W3 (P2) Read-dedup loop-breaker (cache id+offset+limit; stub + hard-block on re-read).
W4 (P2) Structural/anchored edits (replace a function/JSX block by name/range via W1's parser).
W5 (P2) Sub-agent-per-file (delegate one huge file to an isolated child; parent context stays clean).
W6 (P2) File-splitting tool/skill (refactor a giant index.tsx into multiple files — attacks the 435KB-single-file root cause and is the on-ramp to the project model).
W7 (P3) Lightweight planning/todo tool (keeps multi-step builds on track; complements the nudge).

All behind flags, independently testable, credits/recovery/held-edit/multi-model intact.
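
As an illustration of W2, the sketch below shows one possible shape for an "escalating match + structured no-match hint": try an exact unique match, fall back to a whitespace-insensitive line match, and otherwise return a hint the model can act on. fuzzyEdit, EditOutcome, and the hint wording are hypothetical; the real edit_custom_ui contract may differ.

```typescript
// Hypothetical sketch of an escalating matcher; not the edit_custom_ui implementation.
type EditOutcome =
  | { ok: true; strategy: "exact" | "whitespace"; result: string }
  | { ok: false; reason: "no_match" | "ambiguous"; hint: string };

function countOccurrences(haystack: string, needle: string): number {
  let count = 0;
  let idx = haystack.indexOf(needle);
  while (idx !== -1) {
    count++;
    idx = haystack.indexOf(needle, idx + needle.length);
  }
  return count;
}

// Collapses indentation and runs of whitespace so formatting drift does not block a match.
const normalize = (line: string): string => line.trim().replace(/\s+/g, " ");

function fuzzyEdit(source: string, oldCode: string, newCode: string): EditOutcome {
  if (oldCode.trim() === "") {
    return { ok: false, reason: "no_match", hint: "old_code is empty." };
  }
  const ambiguous = (n: number): EditOutcome => ({
    ok: false,
    reason: "ambiguous",
    hint: `old_code matches ${n} places. Include more surrounding lines.`,
  });

  // Stage 1: exact match, which must be unique.
  const exact = countOccurrences(source, oldCode);
  if (exact === 1) {
    return { ok: true, strategy: "exact", result: source.replace(oldCode, () => newCode) };
  }
  if (exact > 1) return ambiguous(exact);

  // Stage 2: whole-line match ignoring whitespace differences.
  const srcLines = source.split("\n");
  const oldLines = oldCode.split("\n").map(normalize);
  const starts: number[] = [];
  for (let i = 0; i + oldLines.length <= srcLines.length; i++) {
    if (oldLines.every((line, j) => normalize(srcLines[i + j] ?? "") === line)) starts.push(i);
  }
  if (starts.length === 1) {
    const start = starts[0] ?? 0;
    const result = [
      ...srcLines.slice(0, start),
      newCode,
      ...srcLines.slice(start + oldLines.length),
    ].join("\n");
    return { ok: true, strategy: "whitespace", result };
  }
  if (starts.length > 1) return ambiguous(starts.length);

  // Stage 3: no match. Point the model at the closest anchor instead of failing blind.
  const firstLine = oldLines[0] ?? "";
  const near = srcLines.findIndex((line) => normalize(line) === firstLine);
  const hint =
    near >= 0
      ? `First line of old_code found at line ${near + 1}, but later lines differ. Re-read around it.`
      : "No line matches the first line of old_code. Re-read the target region.";
  return { ok: false, reason: "no_match", hint };
}

const file = "function App() {\n    return <div>hi</div>;\n}";
const edit = fuzzyEdit(file, "  return   <div>hi</div>;", "    return <div>bye</div>;");
console.log(edit.ok ? edit.strategy : edit.reason); // whitespace
```

Refusing ambiguous matches at every stage preserves the existing unique-match guarantee; the fuzziness only relaxes whitespace, never which region gets edited.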

Phase 2 — Substrate (CEO-gated, the games bet)

mode: "simple" | "project" flag; rootComponent.files stays the single contract.
Real compiler (esbuild/SWC; esbuild-wasm client-side where the threat model allows) replacing the regex bundler → real modules.
Curated library (blessed Pixi/Three/etc. exposed via the SDK) — the Wally/Asset-Library model. No open npm install on the default/free tier.
Real FS / blob storage + history for project-tier worlds.
Stronger play sandbox: a true separate origin (sandbox.yumina.io) before running real code (the audit flagged that the current sandbox is opaque-same-origin).
Play-runtime renderer tiers inside the iframe (Canvas2D/Pixi → WebGL/Three), lazy-loaded; never server-executed on the free tier.
Managed sandbox (Daytona/E2B) ONLY for project-tier server builds, paid-gated, aggressive pause-on-idle.
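
A small sketch of how the mode flag and the curated-library rule could sit behind the one rootComponent.files contract. The type shapes, the allowlist contents, and both function names are assumptions for illustration; the real schema is in packages/engine/src/world/schema.ts and is not reproduced here.

```typescript
// Hypothetical shapes; only the mode values and rootComponent.files come from the plan.
interface WorldFile {
  path: string;
  content: string;
}

interface World {
  mode: "simple" | "project";
  rootComponent: { files: WorldFile[] }; // the single contract for both tiers
}

// Illustrative allowlist standing in for the curated library.
const CURATED_LIBRARIES: readonly string[] = ["react", "pixi.js", "three"];

type Compiler = "sucrase" | "esbuild";

// Simple worlds keep today's lightweight path; project worlds get the real compiler.
function pickCompiler(world: World): Compiler {
  return world.mode === "project" ? "esbuild" : "sucrase";
}

// Naive import scan. A real build would read specifiers from the compiler's
// module graph rather than a regex.
function disallowedImports(world: World): string[] {
  const found: string[] = [];
  const importRe = /from\s+["']([^"']+)["']/g;
  world.rootComponent.files.forEach((file) => {
    importRe.lastIndex = 0;
    let match: RegExpExecArray | null;
    while ((match = importRe.exec(file.content)) !== null) {
      const spec = match[1] ?? "";
      const isRelative = spec.startsWith(".");
      const isCurated = CURATED_LIBRARIES.indexOf(spec) !== -1;
      if (!isRelative && !isCurated && found.indexOf(spec) === -1) found.push(spec);
    }
  });
  return found;
}

const game: World = {
  mode: "project",
  rootComponent: {
    files: [
      { path: "index.tsx", content: 'import { Stage } from "./stage";\nimport * as PIXI from "pixi.js";' },
      { path: "stage.tsx", content: 'import _ from "lodash";' },
    ],
  },
};

console.log(pickCompiler(game));                 // esbuild
console.log(disallowedImports(game).join(", ")); // lodash
```

Because both tiers share the same files array, the flag only selects a build path; a world can move between tiers without a data migration, which keeps the reversibility question in the founder decisions open.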

Spike — rented harness (time-boxed, decision-gated)

Port project-tier authoring onto a rented harness (Claude Agent SDK / Codex / openclaw-core) in a throwaway branch; measure credit/recovery integration loss + switching cost; decide rent-vs-own from data.

5. My suggestions (knowing the vision)

The agent loop is not the risk to the vision; the substrate ceiling, security, and play-runtime are. The regex bundler + JSONB + opaque-same-origin iframe is what caps complexity. Invest there, not in renting a loop you'd have to re-wire your billing and recovery around.
Make useYumina() a real engine contract, and ship a curated library — that's the move that lets worlds get arbitrarily complex without opening npm. This is exactly what Roblox/Godot do. It's safer, cheaper, and gives you a stable platform contract.
Be Godot's project model + Yumina's server. Not Roblox's cost model. Client runs the game (cheap); your server provides the AI brain, persistence, multiplayer, and credits — the things a local engine can't. That hybrid is your real engine and your moat.
Sequence: ship Phase 1 now (relief for the power users hurting today), settle the founder decisions in CEO review, then build Phase 2 as the durable moat. Keep the rent-the-harness option alive as a measured spike, not a blind bet.
Start W6 (file-splitting) early — it both fixes today's 435KB-file pain and is the natural on-ramp to the multi-file project model, so it pays off in both horizons.
Treat free-tier sandbox policy as the load-bearing economic decision. If sandboxes are credit/paid-gated (only during active heavy authoring/build), the cost problem largely evaporates.

6. Open founder decisions (for /plan-ceo-review)

Do free users get real sandboxes, or is project-mode/sandbox paid-gated? (drives free-tier economics)
Does "game" mean visual fidelity or interactive systems? (decides Canvas2D-in-iframe vs needing real engines)
Will heavy 3D / streamed worlds EVER touch the free tier? (recommend: structurally no)
Is multi-model OpenRouter arbitrage permanent, or a bootstrapping phase? (decides if Claude SDK ever fits)
Self-host Firecracker eventually, or stay on managed sandboxes? (sets whether the endgame substrate happens)
One-way or reversible "project-ification" of a world? (reversible keeps the cost lever)

7. NOT in scope (Phase 1) / Risks

NOT in scope now: open npm, game engines, server-side game logic, pixel streaming, the harness swap, moving text-tier worlds off JSONB, replacing the iframe runtime.

Risks: idle sandbox cost blowout (Phase 2 — mitigate with pause-on-idle + snapshot); untrusted-code escape (Phase 2 — microVM isolation + egress lockdown + a real sandbox origin); two-codepath drift (keep rootComponent.files as the single contract); Phase 1 regressions (gate the auto-validate + fuzzy-edit behind size thresholds + flags so the common small-world case stays fast).

Foundation for AI-native Roblox (added 2026-06-01, code-grounded audit)

The core finding: today the AI is the rules engine — state changes because the model types [var: op value] in prose and the server regex-scrapes + applies it (response-parser.ts → messages.ts:1040 applyEffects → runReactionChain). Reactions only react to what the AI already did. Consequence: client/AI state writes are taken on faith (sessions.ts:428-472 merges unvalidated client variables; sandbox setVariable is fire-and-forget) — a player can set gold=999999, a creator's code can mint coins. Fine for narrative; fatal for games (need deterministic rules), multiplayer (need shared truth), and economy (need unforgeable value).

The foundational move: keep AI-prose-driven state for narrative; add a server-authoritative layer for value / shared state / game rules. The AI and client request changes via validated verbs; the server decides. This is the single shift from text-sim to game engine.
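
A minimal sketch of "request via validated verbs; the server decides". Note there is deliberately no "set" verb, so the gold=999999 case has nothing to call. The verb names, the Actor values, and the rule that only server-side rules may grant are assumptions for illustration, not the agreed design.

```typescript
// Hypothetical sketch: verb names, actors, and shapes are illustrative.
type Actor = "client" | "ai" | "server_rule";

interface VerbRequest {
  actor: Actor;
  verb: "spend" | "grant";
  variable: string;
  amount: number;
}

type Balances = Readonly<Record<string, number>>;

type Decision =
  | { accepted: true; balances: Balances }
  | { accepted: false; reason: string };

const reject = (reason: string): Decision => ({ accepted: false, reason });

// Pure decision function: the caller persists the new balances only when accepted.
function decide(balances: Balances, req: VerbRequest): Decision {
  if (!Number.isInteger(req.amount) || req.amount <= 0) {
    return reject("amount must be a positive integer");
  }
  const current = balances[req.variable] ?? 0;

  if (req.verb === "grant") {
    // Assumption: value is minted only by server-side rules, never by the client or the AI.
    if (req.actor !== "server_rule") return reject(`${req.actor} may not grant`);
    return { accepted: true, balances: { ...balances, [req.variable]: current + req.amount } };
  }

  // Spend: overdraft-guarded.
  if (current < req.amount) return reject("insufficient balance");
  return { accepted: true, balances: { ...balances, [req.variable]: current - req.amount } };
}

const start: Balances = { gold: 10 };
const forged = decide(start, { actor: "client", verb: "grant", variable: "gold", amount: 999999 });
const spend = decide(start, { actor: "client", verb: "spend", variable: "gold", amount: 4 });
const overdraft = decide(start, { actor: "ai", verb: "spend", variable: "gold", amount: 50 });

console.log(forged.accepted, spend.accepted, overdraft.accepted); // false true false
```

Narrative variables can keep flowing through the existing prose-driven path; only variables that carry value or shared truth would need to route through a function like this.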

Architecture: one stable contract (useYumina() + rootComponent), four layers:

Project (authoring): multi-file + real compiler (esbuild) + curated library. NEW (Phase 2). W6 is the on-ramp.
Authoritative state (the keystone): server-validated ledger the client/AI can't forge. REUSE the existing hash-chained ledger (transaction-hash.ts) + atomic overdraft-guarded deduction (credit-service.ts).
Multiplayer (shared sessions): host-as-DM rooms + Redis pub/sub + a state_version. REUSE the dormant origin/multiplayer branch (~80% of co-op v1) + main's Redis room-manager.
Economy (creator-defined): per-world economy manifest + economy ledger + server-mediated economy.buy/spend/grant verbs. REUSE the mature money rails (Stripe Connect payouts, creatorEarnings, 20% take, 7-day hold, dispute handling, mushie P2P transfer).
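
For readers unfamiliar with why a hash-chained ledger makes state unforgeable, here is a generic sketch of the technique. It is not the code in transaction-hash.ts: the entry shape and function names are hypothetical, and FNV-1a is used only to keep the example dependency-free (it is not a cryptographic hash).

```typescript
// Hypothetical sketch of hash chaining; the real ledger is not shown here.
interface LedgerEntry {
  seq: number;
  verb: string;
  amount: number;
  prevHash: string;
  hash: string;
}

// FNV-1a (32-bit) over UTF-16 code units. NOT cryptographic; a production
// ledger would use a real hash such as SHA-256.
function fnv1a(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ("00000000" + h.toString(16)).slice(-8);
}

const GENESIS = "00000000";

// Each hash covers the entry's own fields plus the previous entry's hash.
function entryHash(seq: number, verb: string, amount: number, prevHash: string): string {
  return fnv1a(`${seq}|${verb}|${amount}|${prevHash}`);
}

function append(ledger: readonly LedgerEntry[], verb: string, amount: number): LedgerEntry[] {
  const last = ledger[ledger.length - 1];
  const prevHash = last ? last.hash : GENESIS;
  const seq = ledger.length + 1;
  return [...ledger, { seq, verb, amount, prevHash, hash: entryHash(seq, verb, amount, prevHash) }];
}

// Walks the chain and recomputes every hash; any edited entry breaks verification.
function verify(ledger: readonly LedgerEntry[]): boolean {
  let prevHash = GENESIS;
  for (let i = 0; i < ledger.length; i++) {
    const entry = ledger[i];
    if (!entry) return false;
    if (entry.prevHash !== prevHash) return false;
    if (entry.hash !== entryHash(entry.seq, entry.verb, entry.amount, entry.prevHash)) return false;
    prevHash = entry.hash;
  }
  return true;
}

let ledger: LedgerEntry[] = [];
ledger = append(ledger, "economy.grant", 5);
ledger = append(ledger, "economy.spend", 2);
console.log(verify(ledger)); // true

// Editing a stored amount without recomputing the hashes breaks the chain.
const tampered = ledger.map((e) => (e.seq === 1 ? { ...e, amount: 9 } : e));
console.log(verify(tampered)); // false
```

The same chain could carry the economy verbs and a per-session state_version, so multiplayer and economy would share one audit trail rather than two.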

~70% of the plumbing already exists (money rails mature; multiplayer branch ~80%; reactions are a clean deterministic substrate; rootComponent is already a project). The foundation is wiring existing pieces onto a server-authoritative spine, not a rewrite.

Fastest path:

(1) Finish Phase 1 large-file tooling + Phase 1.5 validate-and-stamp-on-write (also closes the forgeable-state hole).
(2) The server-authoritative state keystone.
(3) Economy v1: virtual currencies bought with mushies, riding existing payouts (weeks, lowest legal risk).
(4) Multiplayer v1: rebase the branch, co-op narrative behind a flag (weeks).
(5) Project substrate (real compiler + curated library + render tiers), CEO-gated, parallel.
Deferred: real-money DevEx (legal-heavy), realtime-twitch (websockets + tick loop, different cost class), rented-harness spike.

Founder decisions (recommendations):

Economy convertibility line — recommend NO direct cash-out of in-game currency. Players spend real money / mushies; creators earn cashable mushies via the existing rail; in-game currency stays virtual + non-redeemable. Keeps you out of money-transmitter/KYC/tax weight. This is the highest-stakes call.
"Game" = visual fidelity vs deterministic systems (decides project-substrate vs authoritative-state priority; lean authoritative-state first).
Multiplayer = co-op narrative now, realtime deferred (recommend yes).
Free-tier stays client-executed; gate server-sim/sandbox behind paid.

Biggest risks: forgeable state must be closed before ANY economy of value (Phase 1.5 first); real-money convertibility = regulatory weight; the "AI-types-brackets" mutation model is antithetical to deterministic games (bolting an authoritative rule layer onto it is the highest-uncertainty piece); host-pays multiplayer has griefing/cost-abuse exposure; the opaque same-origin iframe (NOT the sandbox.yumina.io the docs claim) needs a real separate origin before running heavier/curated-library code.

Appendix — key file references

Agent loop: packages/server/src/routes/agent.ts (streamAgentLoop, nudge ~1585, read-spiral guard ~1786, credit metering ~1526, snipLargeResult ~1022).
Tools: packages/server/src/lib/studio-tools/{tools.ts,tool-executor.ts,system-prompt.ts} (edit_custom_ui ~1235, grep_world ~340, read pagination ~116).
World persistence + held-edit gate: packages/server/src/lib/pending-edit.ts.
World schema (v20, rootComponent.files): packages/engine/src/world/schema.ts (~302-351).
Play runtime: packages/app/src/features/chat/world-renderer.tsx + packages/app/sandbox/*.
LLM/Gemini handling: packages/server/src/lib/llm/openrouter.ts (finish_reason ~426/~649), packages/server/src/routes/messages.ts (stream error ~929).
Prior architecture audit: docs/architecture-audit-2026-05-29.md.
