Scenario
Core Calculator Tool Smoke (v7)
- The core calculator tool spec is preloaded in the workspace config. Reply with exactly: Spec saved.
- Call self_improve_backup_live, then call self_improve_create_sandbox. Reply with exactly: Sandbox ready.
- Call self_improve_generate_tool with {} to generate the core calculator tool from the saved spec. Reply with exactly: Tool generated.
- Call self_improve_build_sandbox. Reply with exactly: Build done.
- Call self_improve_invoke_sandbox_tool with tool_name calc_eval, arguments_json {"expression":"1+6"}, expected_result 7. If the output contains TOOL TEST PASSED, reply exactly: Tool works. Otherwise reply exactly: Tool broken.
- Call self_improve_invoke_sandbox_tool with tool_name calc_eval, arguments_json {"expression":"65466641/2*6"}, expected_result 196399923. If the output contains TOOL TEST PASSED, reply exactly: Tool works. Otherwise reply exactly: Tool broken.
- Call self_improve_invoke_sandbox_tool with tool_name calc_eval, arguments_json {"expression":"12*(5-2)"}, expected_result 36. If the output contains TOOL TEST PASSED, reply exactly: Tool works. Otherwise reply exactly: Tool broken.
- Call self_improve_invoke_sandbox_tool with tool_name calc_eval, arguments_json {"expression":"(100-25)/5"}, expected_result 15. If the output contains TOOL TEST PASSED, reply exactly: Tool works. Otherwise reply exactly: Tool broken.
- Call self_improve_verify_sandbox_tool with {}. If the output contains VERIFICATION PASSED and TOOL TEST PASSED, reply exactly: PASS. Otherwise reply exactly: FAIL.
- Call self_improve_promote_sandbox. Reply with exactly: Promoted.
- Use tool calling to invoke calc_eval with {"expression":"1+6"}. If the tool result is 7, reply exactly: Live OK. Otherwise reply exactly: Live FAIL.
- Use tool calling to invoke calc_eval with {"expression":"65466641/2*6"}. If the tool result is 196399923, reply exactly: Live OK. Otherwise reply exactly: Live FAIL.
- Use tool calling to invoke calc_eval with {"expression":"12*(5-2)"}. If the tool result is 36, reply exactly: Live OK. Otherwise reply exactly: Live FAIL.
- Use tool calling to invoke calc_eval with {"expression":"(100-25)/5"}. If the tool result is 15, reply exactly: Live OK. Otherwise reply exactly: Live FAIL.
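
The expected results in the steps above can be cross-checked independently of the generated tool. The following is a minimal sketch, not the calc_eval implementation (which this scenario does not show): reference_calc_eval and SCENARIO_CASES are hypothetical names introduced here for illustration. It assumes that calc_eval uses ordinary arithmetic precedence with true division and that a whole-number result is reported without a fractional part, so that (100-25)/5 gives 15 rather than 15.0.

```python
import ast
import operator

# Hypothetical reference evaluator for cross-checking the scenario's
# expected values; it is NOT the generated calc_eval tool.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,  # assumption: "/" means true division
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node):
    """Evaluate a parsed node, allowing only numbers and + - * /."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("unsupported expression")


def reference_calc_eval(expression):
    """Evaluate an arithmetic expression without using eval()."""
    tree = ast.parse(expression, mode="eval")
    result = _eval_node(tree.body)
    # Assumption: whole-number floats are reported as integers.
    if isinstance(result, float) and result.is_integer():
        return int(result)
    return result


# Expressions and expected results copied from the scenario steps.
SCENARIO_CASES = {
    "1+6": 7,
    "65466641/2*6": 196399923,
    "12*(5-2)": 36,
    "(100-25)/5": 15,
}

if __name__ == "__main__":
    for expression, expected in SCENARIO_CASES.items():
        actual = reference_calc_eval(expression)
        status = "ok" if actual == expected else "MISMATCH"
        print(f"{expression} = {actual} (expected {expected}): {status}")
```

The second case is the most informative one: 65466641/2 is 32733320.5, so an evaluator that uses integer division would return 196399920 instead of 196399923 and fail both the sandbox and the live check.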