Open large language models (LLMs) like Meta's LLaMA, Google's Gemma, and OpenAI's GPT-OSS now support tool-calling (aka function-calling) to let the model invoke external APIs or tools. We survey how each model supports tool-calling, methods for improving its reliability (prompt engineering, supervised fine-tuning, instruction tuning, RLHF, parameter-efficient tuning like LoRA), and when to use each approach. We cover failure modes, benchmarks, integration architecture, training requirements, and a decision table comparing approaches on performance, cost, latency, safety, dev effort, and maintainability.
Open large language models (LLMs) like Meta's LLaMA, Google's Gemma, and OpenAI's GPT-OSS now support tool-calling (aka function-calling) to let the model invoke external APIs or tools. In brief: tool-calling is defined as having the model decide which external function to use and with what arguments, then output a structured call (often JSON or code) that a surrounding system executes. These open LLMs provide different mechanisms: for example, Google's Gemma uses prompt-based schemas (and even a specialized "FunctionGemma" variant), LLaMA models integrate via frameworks like Llama Stack or llama.cpp with JSON tool definitions, and GPT-OSS (OpenAI's open weights) exposes a Chat API with a tools parameter for function calling. Tool-calling can be enabled purely by prompting (giving the model the function schema in the prompt) or by fine-tuning/adapters (to bias the model's outputs). We survey how each model supports tool-calling, methods for improving its reliability (prompt engineering, supervised fine-tuning (SFT), instruction tuning, RLHF, parameter-efficient tuning like LoRA), and when to use each approach. Concrete examples show how prompting alone often suffices for many use-cases (fast, low-cost, flexible), whereas fine-tuning is justified when tasks are complex, domain-specific, or require utmost accuracy. We list common failure modes (e.g. format/parsing errors, hallucinated function names or arguments, multi-turn inconsistency) and discuss benchmarks (BFCL, HammerBench) that measure function-calling accuracy, coverage, and end-to-end success. We also cover practical integration architecture, with code sketches for each model's API, training requirements (data format, hyperparameters), and a decision table comparing approaches (prompting vs. fine-tuning vs. RLHF) on performance, cost, latency, safety, dev effort, and maintainability. Finally, we discuss compute & data needs, safety/alignment issues (e.g. ensuring valid code execution), and lifecycle concerns (updating tool schemas vs. retraining). All statements are supported by primary sources, official docs, and recent research.
Tool/Function Calling: The ability of an LLM to invoke external programs or APIs. Concretely, the model is given a set of available functions (each with a name, parameters, and description, often as a JSON schema) and, upon a user query, it decides whether and which function to call. It then outputs a structured function call with arguments (e.g. a JSON or code snippet) that a wrapper executes. This turns the LLM from a passive answer generator into an agent that performs actions. Databricks explains that "function calling" (often used interchangeably with "tool use") requires the model to interpret the user request, choose relevant functions, and construct a correctly formatted function call with arguments. Agentic Workflow: A pipeline in which the LLM's output (the function call) triggers code execution or API calls, whose results are fed back (possibly in multiple turns) into the LLM to produce a final answer. For example, an agent may: (1) parse the user query, (2) select and call an external tool/function, (3) execute it, (4) return the result, and (5) produce a natural-language answer. The AWS pattern describes this: "the LLM receives the query and tool metadata … it chooses the most relevant tool, constructs input arguments, and returns a structured function call", then the tool is run and its output is provided back to the LLM for a final response. Instruction Tuning & Supervised Fine-Tuning (SFT): Techniques where a base LLM is further trained on examples. Instruction tuning uses labeled examples of user instructions and correct outputs, possibly including tool calls. SFT on function-calling would involve training the model on (query, function call) pairs so that it learns to produce well-formed calls. Google notes that Gemma and other open models "can be fine-tuned with the intent of improving its performance on a specific task or domain", and supports tuning via full-parameter or PEFT (LoRA) methods. RLHF (Reinforcement Learning with Human Feedback): A training loop where the model's outputs are judged by a reward model. In theory, one can assign higher rewards to outputs that correctly call the function (compared to missing or wrong calls), reinforcing proper tool use. Parameter-Efficient Tuning (Adapters/LoRA): Instead of updating all model weights, only a small set of adapter weights are trained. Google's docs specifically recommend LoRA to fine-tune Gemma "with less compute resources," noting LoRA can achieve similar results with far less cost. Similar strategies (QLoRA, Prefix Tuning, etc.) can be applied to LLaMA and GPT-OSS. Prompt Engineering: Crafting the input prompt (system+user messages, tool definitions) to elicit correct function calls. This often means providing a schema (e.g. JSON with function name/types) and instructions like "only respond with JSON function calls". For Gemma and GPT-OSS, it is currently the primary mechanism. FunctionGemma: A special 270M variant of Gemma-3 trained for agentic tasks. It introduces control tokens so that developers can reliably prompt for tool use. By contrast, the regular Gemma outputs plain text and the caller must parse the format. Response Parsing: After the model outputs a function call (or final answer), an external component must parse that output. For Gemma or GPT-OSS, one checks if the output matches the function schema (since these models don't insert special token for calls). In LLaMA stacks, the agent often detects a function call and then executes it, appending results back into the message history.
Meta's LLaMA 3 models support tool use via community frameworks. Although Meta's docs note that when tool calling is enabled the model "automatically decides if it needs one or more of the available tools" to answer a prompt, in practice users leverage tools like llama.cpp or Llama Stack. For example, llama.cpp (the popular C++ implementation) added OpenAI-style function calling support: it recognizes native formats for Llama 3 models (including built-in tools like wolfram_alpha, web_search, code_interpreter) and can parse them. If a prompt's format isn't recognized, llama.cpp falls back to a generic parser (less efficient). In Python frameworks (e.g. Llama Stack or llama-server), the procedure is: define tool metadata and tell the model about it in the system prompt. The Red Hat Llama Stack example shows defining tools as JSON-like entries (e.g. { "tool_name": "favorite_color_tool", "description": "...", "parameters": {"city": {"type":"string"}, "country": {"type":"string"}} }). One then calls a chat-completion API (passing model_id and tools list); LLaMA-3 will reply with a tool_call entry when it wants to use a tool. The application code must detect this, execute the appropriate function, and append the tool's output back as a "tool" message with the same call ID. This loop continues until the model finally answers in text. Note that Llama's tool calling can be multi-turn and stateful, and supports both one-off and parallel tool use if enabled. According to llama.cpp docs, LLaMA3's native function format handling reduces token use; otherwise the generic fallback still works.
Google's Gemma (part of the Gemini family) supports function-calling via carefully crafted prompts and a plugin-like interface. The official Gemma guide explains that you give the model a block of function definitions (as JSON schema with names, descriptions, parameters) and instruct it to respond with a function call in a specific format. For example, a prompt might include: You have access to functions. If you call a function, output ONLY in this format: [{"name": "get_product_name_by_PID", "parameters": {"PID": "..."}}] When asked a query, Gemma will then output a JSON function call matching that schema. Alternatively, one can use a Python-style pseudo-call format in brackets. Gemma itself does not mark tool calls with a special token; the caller must detect calls by matching the prompted structure. Importantly, Google released a special variant called FunctionGemma (270M) that is trained for tool use: it recognizes control tokens for tool definitions and reliably outputs structured calls. For regular Gemma (27B or 70B), users simply rely on the JSON prompt. Google's docs also warn that Gemma cannot execute code, so any generated code or API call must be run by the developer in a sandbox. In practice, one can use Google's Gemini API or a local Gemma model. Prompt structure (with system message describing available functions) is the primary mechanism. After the model outputs the function call, the application code parses the JSON, invokes the function (e.g., a web API, database query, or Python function), and supplies the result back to Gemma in the conversation. Because Gemma is open-weight, it can also be fine-tuned to improve this behavior.
OpenAI's GPT-OSS (e.g. models on Hugging Face like openai/gpt-oss-20b and 120b) provides OpenAI-compatible function-calling out of the box. These models are quantized "MXFP4" weights of GPT-style models. Using an inference engine like vLLM or Ollama, one can serve GPT-OSS and interact with it via an OpenAI-like Chat Completions API. Crucially, the Chat API for GPT-OSS supports a tools parameter just like OpenAI's models. For example: This yields a tool_call in the response that looks like a JSON function call (e.g. {"name":"get_weather","parameters":{"city":"Berlin"}}). The application then runs get_weather(city="Berlin") and feeds the result back to the model in the next turn or directly includes it in the final answer. GPT-OSS thus natively supports tool calling in any environment (local or cloud) that emulates the OpenAI Chat interface. It also supports OpenAI's Agents SDK, allowing asynchronous tools via decorators. If running GPT-OSS directly (e.g. via vLLM's Python API), one must use the "Harmony" formatting where the prompt is first encoded, then parsed after generation. But for most uses, the OpenAI-compatible API makes GPT-OSS function-calling identical to GPT-4's. OpenAI's docs explicitly note that GPT-OSS can perform tool calls and browsing via this API.
There are two broad approaches to enable and improve tool-calling in LLMs: Prompt Engineering: The simplest method. Provide the model with (1) a description of each tool (name, description, parameters) and (2) instructions to use them, all in the prompt. Google's Gemma guide shows this approach: by formatting the prompt as JSON schemas and telling the model "you have access to functions…only output this JSON when calling". Similarly, for GPT-OSS/Llama, one uses the tools= argument or system messages with function metadata. Prompting is fast to implement and requires no weight updates. Databricks and others note that with careful prompting and parsing, even small open models can reach near state-of-the-art tool use. For example, Friendli.ai reports that LLaMA 3 (70B) can match GPT-4 on function-calling tasks using only prompting techniques. Fahey's industry guide advises: "Start with function calling and prompt engineering — they're fast, cheap, and powerful." This is ideal for rapidly adding tools to an assistant without retraining. Instruction Tuning / SFT: If simple prompting is insufficient (e.g. the model often mis-formats calls), one can fine-tune the model on tool-use examples. For instance, one could assemble (query, function definition, correct function call) triplets and train the model to reproduce the call. Google's Gemma documentation emphasizes that as an open-weight model, Gemma can be tuned on task-specific data. LLaMA likewise can be instruction-tuned on synthetic or collected function-calling dialogs. In practice, SFT for tool-calling could be done with an open-source LLaMA or GPT-OSS on GPUs, using Hugging Face Transformers or LoRA frameworks. This embeds the tool-calling behavior into the model's weights, helping with consistency and rare cases. Parameter-Efficient Fine-Tuning like LoRA can be used: Google notes LoRA can make Gemma reliably adopt new tasks with far less compute. Reinforcement Learning (RLHF/RL): One can also train with feedback: e.g. give a positive reward when the model issues the correct function call. This approach is used by OpenAI for aligning responses (ChatGPT's RLHF), and could in principle align tool-use. However, no published protocol specifically for RLHF on function calls in open LLMs exists yet. It remains a possible advanced method. Adapters/LoRA: Adapter-based tuning lets one achieve most of fine-tuning's benefits cheaply. If one only needs to refine the model's tendency to call functions properly, it may suffice to train LoRA adapters on a moderate dataset of function-calling examples. This is often done via frameworks like Unsloth or Axolotl for LLaMA/Gemma. Each method has trade-offs: prompting is easiest and best for fast deployment, but relies on model's native ability. SFT/LoRA can yield higher accuracy and custom behavior, but requires labeled data and compute. RLHF can enforce alignment (safety, style) but is costly. Best practices often combine them: e.g. prompt engineering + a light fine-tuning on edge cases.
Use-case considerations guide the choice. Prompt engineering and runtime orchestration (using APIs or tool servers) are usually preferred when speed, cost, and flexibility are key. For instance, if you control neither the base model nor need massive customization, it's wise to "start with function calling or prompt templates." This works well for most agents or chatbots, where you simply list available tools and rely on the LLM to use them. It's also ideal if you expect to change tools frequently: updating the prompt is easier than retraining. Fine-tuning (full or adapter-based) is worthwhile when you need a robust, consistent behavior that prompt templates can't guarantee. Examples include: High-stakes domains: e.g. medical/legal assistants where output format and accuracy are critical. Fahey's guide notes fine-tuning for domain-specific tone and facts. Small, closed environments: e.g. an internal chatbot with a fixed set of functions, where you can afford the compute up-front to train the model once and then deploy indefinitely. Efficiency/resource constraints: ironically, sometimes running a larger fine-tuned model offline can be cheaper at scale than orchestrating many API calls or managing chains. Legacy models: if stuck on an older LLM that poorly understands structured calls, fine-tuning is the only way to teach it to call functions. In contrast, prompt engineering (with or without RAG) wins when the knowledge base is dynamic or when you want end-to-end traceability. For example, a customer service bot that uses retrieval (RAG) and function calling for tasks might use minimal fine-tuning. The decision tree in Fahey's article sums it up: "Do you control the base model? If No, start with function calling or prompts. If Yes, then add tools as needed; only fine-tune for tone/behavior if necessary."
Tool-calling systems have many pitfalls. Common failure modes include: Format/Parsing Errors: Models sometimes generate JSON or code that isn't well-formed. Extra text, missing quotes/brackets, or wrong types can break the parsing. For instance, generating a raw Python call without exact schema can confuse the agent. This is mitigated by strict prompt enforcement (e.g. "ONLY output JSON") and regex validation. Hallucinated Functions or Params: The model may output a function name or parameter not in the provided schema. HammerBench's analysis found "parameter name hallucinations" (PN_HR) to be the primary source of failure. For example, Gemma might invent a field that isn't defined. Similarly, GPT-OSS/GPT might hallucinate a function or argument. Verifying outputs against allowed schemas is essential; some frameworks reject calls with unknown names. Missing Arguments: The model may forget to include required parameters, or request a function without all needed slots. HammerBench measures PN_MR (parameter missing rate) when a predicted call omits some arguments. In practice, this means the agent must detect incomplete calls and possibly re-prompt the model for missing info. Wrong Tool Selection: The model might pick the wrong tool from the list, especially if tools have overlapping domains. Accuracy of function name selection tends to be high, but when it goes wrong, the entire answer fails. Careful naming and description of functions in prompts can reduce this. Multi-turn Incoherence: In conversations requiring multiple calls, the agent might "forget" earlier context or mix up tool states. For instance, after two tool calls, the third answer might ignore one result. Ensuring the conversation log correctly interleaves user, assistant, and tool messages is key. Latency/Cost Issues: If a function call triggers a slow API or a heavy computation, user experience can suffer. For example, embedding-based or web-search tools add latency. The orchestration code must manage timeouts or asynchronous calls if supported. Safety & Security: A big risk is that the model generates malicious code or calls unintended services. Google explicitly warns: "Always put safeguards in place to validate any generated code before executing it." Moreover, guardrails should check that function outputs do not leak sensitive info or violate policies. Infinite Loops: In poorly designed agents, the model might keep invoking tools repeatedly without reaching a final answer. Heuristics (like a max number of tool calls) and clear stopping criteria in the prompt can mitigate this. Incompatible Updates: If tool APIs change (parameters or schemas), the model may still output old formats. This can break the agent unexpectedly. Hence, versioning tool schemas in the prompt and monitoring call failures is important. Framework Mismatch: For LLaMA, if the prompt template doesn't match what llama.cpp expects, it defaults to generic parsing (costly). This may waste tokens or produce less reliable calls.
Function-calling capability is measured by specialized metrics beyond generic LLM scores. Recent work (BFCL, HammerBench) identifies key metrics: Function Name Accuracy (Func Acc): Correctness of the chosen function name in each call. Argument Accuracy (Args Acc): Given the function name, whether all parameter values are correctly provided. HammerBench's Args Acc is true iff both function and all arguments are correct. Irrelevant Functions Accuracy: Fraction of times the model correctly does not call any function when none is needed (no false positives). Parameter Hallucination/Missing Rates: HammerBench defines PN_HR (hallucination rate of spurious parameter names) and PN_MR (missing required parameter names). High PN_HR/MR indicates poor slot filling. Conversation-level Success Rate (SR): In multi-turn tasks, the proportion of dialogs where every function call is correct (complete accuracy across the session). Progress Rate (PR): For conversations, PR tracks how many turns are correct in sequence. Benchmarks use these and more. For instance, the Berkeley Function-Calling Leaderboard (BFCL) reports overall accuracy and even includes cost and latency metrics. In BFCL's interface, "Overall Accuracy is the unweighted average of all sub-categories" and it tracks computation cost (USD estimate for the test suite) and latency (seconds per query). HammerBench (OPPO/SJTU) explicitly evaluates multi-turn scenarios with these metrics. Their findings echo common sense: function names are usually easy (high Func Acc), but the hard part is filling arguments under ambiguous or noisy instructions. They observed that parameter-name hallucinations in the first turn can doom the conversation, and that providing values earlier (steady user input) improves performance. Other practical metrics include overall accuracy/F1 of tool use, latency (time from user query to final answer including tool execution), safety/hallucination rate (how often the model makes an irrelevant or dangerous call), and cost (for cloud deployment, cost of API calls or extra compute). In summary, evaluating tool use means measuring not just language quality but execution correctness.
LLaMA Models Setup: Obtain a LLaMA-3 checkpoint (or LLaMA-2 if relevant) in a supported framework. Install Llama Stack or llama.cpp (with MCP). Ensure the model supports --jinja or Chat API with tools. Data Formatting: If fine-tuning, create a dataset of user queries, function schemas, and correct tool-call outputs. For prompt-only, prepare JSON schemas of your tools. Fine-Tuning (Optional): You can SFT on tool-calling data using Hugging Face or Keras. For LoRA, you would add adapter layers and train on (query→function call) pairs. Hyperparameters: often a low LR (e.g. 1e-5), batch-size dictated by GPU. Train until the model reliably outputs valid JSON calls. Inference (Prompt Approach): In code, define tools=[{name, description, parameters}]. Call client.chat_completion(messages, tools=tools). On receiving a tool_calls field, execute the tool in your code, append the result with role=tool and same call_id, and repeat the completion call. Gemma (Google) Setup: Use Google AI services or run Gemma locally via Colab/LLM tools. Google provides a "Fine-tuning Gemma" colab and Keras scripts. Data: If fine-tuning, gather (prompt→target) pairs. For tool-calling, you'd include the function definitions in the prompt and the desired function call in target. Fine-Tuning: Google's docs show Keras+LoRA, or Hugging Face PEFT. Typical steps: convert your data to Gemma's format, use a TensorFlow or PyTorch trainer (they mention Vertex AI or Unsloth). Hyperparams: a few epochs with small LR, batch-size limited by GPU. Inference (Prompt): Construct a prompt string with system instructions and JSON schema. The model's response will contain a JSON call (e.g. {"name":"get_product_name_by_PID","parameters":{"PID":"1234"}}). Parse it with json.loads. GPT-OSS Setup: Install vLLM or Ollama. For server deployment, run vllm serve openai/gpt-oss-20b (or 120b). This downloads the model and serves a local API. API Use: Use OpenAI's Python SDK (set base_url to your server). Provide tools list in the API call. The response yields response.choices[0].message.tool_calls which you execute. Fine-Tuning: GPT-OSS is open-source, so one can fine-tune it via Hugging Face. For example, one could use the openai-harmony library to encode/decode its special format if fine-tuning on the responses format. Inference (Harmony): If calling the model directly (not via API), use the Harmony encoding for prompts so that tool-calls can be parsed. OpenAI provides openai-harmony and an Agents SDK which can proxy to vLLM.
Model Size & GPUs: LLaMA/GPT-OSS 20B can be served on a single A100 or H100. The 120B GPT-OSS needs multi-GPU (≥60GB VRAM). Gemma 27B/70B similarly requires GPUs or TPUs. Fine-tuning 20–70B models generally requires multiple A100s/H100s or cloud TPUs for days. LoRA can reduce this to one or two GPUs. Training Data: For supervised fine-tuning, you need examples of (query → function call). This can be hand-crafted or generated. Real user logs (with manual annotations) are ideal but expensive. Some use GPT-4 to synthesize tool-call data. Google suggests even a few dozen examples can nudge behavior, but more data gives more robust results. Inference Latency: Every tool call adds time. Benchmarks like BFCL include latency; expect slower response when calling heavy tools. On modern GPUs, the LLM generation itself (per query) may take ~1–2 seconds for 20B, ~5+ seconds for 70–120B. Smaller fine-tuned models (like Gemma 270M) can be very fast. Costs: Running open models on your hardware costs (electricity, infra) vs cloud API charges. The BFCL board even estimates USD per 1M tokens. Fine-tuning costs (cloud) can be thousands of dollars for large models. Prompt-only approaches incur minimal compute cost beyond inference. Integration Effort: Serving GPT-OSS via vLLM is turnkey (OpenAI-like API). LLaMA requires more setup (llama.cpp or stack). Gemma might require Google Cloud or custom deployment. Using MCP-compliant frameworks (e.g. Llama Stack) simplifies integration but a learning curve exists.
Tool-enabled agents magnify AI risks, so alignment is crucial. Strategies include: Strict schemas and validation: Only accept function calls that match the allowed schema exactly. Discard or re-prompt on anything else. Code Execution Safeguards: If using a code-execution tool, run code in a secure sandbox or use static analysis to block dangerous operations. Google explicitly warns to validate any generated code before execution. Authorization Check: For sensitive tools (e.g. financial, personal data), implement access controls. Don't let the model arbitrarily call any tool. Output Filtering: After a tool returns data, ensure it doesn't contain disallowed content or PII. For example, if a search tool returns something malicious, sanitize it. Prompt Guardrails: Use system prompts to enforce style/safety ("You should not output profanity or disallowed content"). Instructions can include an explicit refusal style if the requested function is illegal or harmful. Monitoring & Logging: Record every tool invocation and LLM response for review. If the model repeatedly hallucinates or goes off-rails, you can identify and fix it (e.g. by adding RLHF or extra training data). Human-in-the-Loop: For critical tasks (medical/legal), require human approval before final action. The AI can suggest tool calls, but a human verifies. Published guidance (e.g. Google's Responsible GenAI Toolkit) emphasizes designing tests for "failure conditions" where the model should refuse. In practice, a combination of thorough testing (success/failure/border cases) and external checks is used.
Updating Tools: If you add or modify tools, you must update the model's prompts or fine-tuning dataset. Prompt-based systems can just change the JSON schema in the system prompt. Fine-tuned models may need incremental training or have adapters that can be patched. Keeping a registry of tools (with versions) is advisable. Schema Versioning: Similar to APIs, version your function schemas so the model doesn't produce outdated calls after an update. Model Updates: Open LLMs evolve (e.g. LLaMA 3.3 → 3.4, Gemma 3 → 4). A fine-tuned model may not transfer to a new base easily. You might need to re-fine-tune or adapt your LoRA to the new model. Drift Monitoring: Over time, user questions may change style or domain. Continuously evaluate the agent (e.g. via tools like Langfuse or Langchain's Eval) to catch regressions. Maintainability: Prompt solutions are easiest to maintain (edit a few JSON lines). Fine-tuned systems require version control of models and data. Using parameter-efficient tunings (LoRA) helps because you can keep the base model separate and update only small weight files. In summary, design your system to be modular: keep tool definitions external, log all interactions, and treat models as replaceable components. That way, when you want to upgrade Gemma or switch to a new LLaMA, you only need to retrain adapters or fine-tune on updated data rather than overhaul your entire pipeline.
The following table summarizes the trade-offs across approaches: Prompt Engineering + Tools — Performance: Moderate to High (if model is strong); depends on model's baseline skill. Commonly very effective for many queries. — Latency: Low (no extra compute at inference except tools) — Cost: Low (no training cost, just prompt design) — Safety/Reliability: Variable — prone to hallucinations without validation, but easy to iterate. Encourages transparency. — Dev Effort: Low (just write good prompts/schemas) — Maintainability: High (easy to update prompts/tools on the fly) Supervised Fine-Tuning (SFT) — Performance: High (can perfect output format and handle edge cases if well-trained) — Latency: Moderate (same inference cost as base model) — Cost: High (GPU cost for training) — Safety/Reliability: Better — model can be taught rules, but errors may be harder to fix than prompt tweaks. — Dev Effort: High (need data, training pipelines) — Maintainability: Moderate (changes require retraining or more tuning) Parameter-Efficient Tuning (LoRA) — Performance: High (similar to SFT) — Latency: Moderate (slightly extra compute) — Cost: Medium (much less than full SFT) — Safety/Reliability: Better — retains base safety but adds specificity; still needs validation. — Dev Effort: Medium (training simpler, less resource) — Maintainability: High (can attach/detach adapters) RLHF — Performance: Variable (potentially highest if reward model is good) — Latency: Low (no extra inference cost beyond base model) — Cost: Very High (iterative human labeling or feedback loops) — Safety/Reliability: Potentially Best for alignment, but risky if reward is mis-specified. — Dev Effort: Very High (complex to set up) — Maintainability: Low (hard to iterate once locked in) Runtime Orchestration (MCP Frameworks) — Performance: N/A (this is integration, not a model change) — Latency: Variable (depends on external tool speeds) — Cost: N/A (depends on tools used) — Safety/Reliability: High (keeps model and tools separate, so model errors can be caught at interface) — Dev Effort: Medium (setup framework, define tools) — Maintainability: High (tools/services are modular) Notes: Prompt-based agents are "fast, cheap, and powerful", but if high precision or domain knowledge is needed, adding fine-tuning/LoRA gives more control. Dev effort refers to how much work is needed to implement and update the approach; maintainability is how easy it is to change over time.
A typical agent architecture for tool-calling works as follows. The user's query is sent to the LLM along with metadata of available tools. The LLM selects a tool, outputs a structured function call, and a "tool runner" executes that call. The result is then fed back into the LLM (if needed) to produce the final answer. This loop can repeat for multiple steps. In practice, this is implemented with a JSON-based "Model Context Protocol" (MCP) or API. For instance, using Llama Stack or an OpenAI-compatible API, the system prompt includes all tool schemas. When the model's response contains a function call, the framework detects it (via schema match) and invokes the tool. The tool's output is then appended to the conversation (role "tool") and the model is called again to incorporate that information into its reasoning. The flow: 1. User Query → LLM receives query + tool metadata 2. Tool Selection → LLM chooses the most relevant tool 3. Argument Construction → LLM constructs structured input arguments 4. Execution → Tool runner executes the function call 5. Result Integration → Tool output is appended to conversation 6. Final Response → LLM incorporates tool result and produces natural-language answer This loop can repeat for multi-step tasks. The orchestration code must robustly handle bad model outputs, often by validation (checking JSON), fallback paths (re-prompt the user or model), and logging errors for human review.