Scenario
Self-Improvement Ping-Pong Smoke (v7)
- The ping-pong tool spec is preloaded in the workspace config. Reply with exactly: Spec saved.
- Call self_improve_backup_live, then call self_improve_create_sandbox. Reply with exactly: Sandbox ready.
- Call self_improve_generate_tool with an empty JSON object {} (uses the saved spec). Reply with exactly: Tool generated.
- Call self_improve_build_sandbox. Reply with exactly: Build done.
- Call self_improve_invoke_sandbox_tool with {}. If the output contains TOOL TEST PASSED, reply exactly: Tool works. Otherwise reply exactly: Tool broken.
- Call self_improve_invoke_sandbox_tool with {} again (second execution). If the output contains TOOL TEST PASSED, reply exactly: Tool works. Otherwise reply exactly: Tool broken.
- Call self_improve_invoke_sandbox_tool with {} again (third execution). If the output contains TOOL TEST PASSED, reply exactly: Tool works. Otherwise reply exactly: Tool broken.
- Call self_improve_verify_sandbox_tool with {}. If the output contains both VERIFICATION PASSED and TOOL TEST PASSED, reply exactly: PASS. Otherwise reply exactly: FAIL.
- Call self_improve_promote_sandbox. Reply with exactly: Promoted.
- Use tool calling to invoke ping_self with {}. If the tool result is pong, reply exactly: Live OK. Otherwise reply exactly: Live FAIL.
- Use tool calling to invoke ping_self with {} again (second live execution). If the tool result is pong, reply exactly: Live OK. Otherwise reply exactly: Live FAIL.
- Use tool calling to invoke ping_self with {} again (third live execution). If the tool result is pong, reply exactly: Live OK. Otherwise reply exactly: Live FAIL.
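
The steps above can be sketched as a small table-driven driver, which may help when checking transcripts against the expected replies. This is only an illustrative sketch, not the harness itself: the tool names, marker strings, and reply strings are copied from the scenario, while `STEPS`, `run_scenario`, the helper functions, and the `call_tool(name, args)` callback are hypothetical names introduced here. It also assumes that every tool is called with `{}` (the scenario does not state arguments for some calls) and that surrounding whitespace around `pong` can be ignored.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Chooser = Callable[[str], str]


def always(reply: str) -> Chooser:
    """Reply that does not depend on tool output."""
    return lambda _output: reply


def contains_all(markers: Sequence[str], ok: str, bad: str) -> Chooser:
    """Reply `ok` only if every marker appears in the output."""
    return lambda output: ok if all(m in output for m in markers) else bad


def equals(value: str, ok: str, bad: str) -> Chooser:
    """Reply `ok` if the output equals `value` (whitespace trimmed: an assumption)."""
    return lambda output: ok if output.strip() == value else bad


invoke = (["self_improve_invoke_sandbox_tool"],
          contains_all(["TOOL TEST PASSED"], "Tool works.", "Tool broken."))
live = (["ping_self"], equals("pong", "Live OK.", "Live FAIL."))

# One entry per bullet in the scenario: (tools to call in order, reply chooser).
STEPS: List[Tuple[List[str], Chooser]] = [
    ([], always("Spec saved.")),
    (["self_improve_backup_live", "self_improve_create_sandbox"],
     always("Sandbox ready.")),
    (["self_improve_generate_tool"], always("Tool generated.")),
    (["self_improve_build_sandbox"], always("Build done.")),
    invoke, invoke, invoke,  # three sandbox executions
    (["self_improve_verify_sandbox_tool"],
     contains_all(["VERIFICATION PASSED", "TOOL TEST PASSED"], "PASS", "FAIL")),
    (["self_improve_promote_sandbox"], always("Promoted.")),
    live, live, live,  # three live executions
]


def run_scenario(call_tool: Callable[[str, Dict], str]) -> List[str]:
    """Run every step; the reply is chosen from the last tool output of the step."""
    replies = []
    for tools, choose in STEPS:
        output = ""
        for name in tools:
            output = call_tool(name, {})  # assumption: all calls take {}
        replies.append(choose(output))
    return replies


if __name__ == "__main__":
    # Stand-in tool backend with canned outputs, for illustration only.
    canned = {
        "self_improve_invoke_sandbox_tool": "TOOL TEST PASSED",
        "self_improve_verify_sandbox_tool": "VERIFICATION PASSED\nTOOL TEST PASSED",
        "ping_self": "pong",
    }
    for reply in run_scenario(lambda name, args: canned.get(name, "")):
        print(reply)
```

Keeping the steps in a table makes the repeated executions explicit: the second and third runs reuse the same check, so a tool that passes once but fails on a later call shows up as a differing reply at that position.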