
Thresholds

Has Specs, Requires API Keys

Shows how to set minimum acceptable thresholds for metrics. This example demonstrates:

- Defining threshold requirements (e.g., success_rate >= 0.95)
- Evaluation failure when thresholds aren't met
- Multiple threshold configurations
- Using thresholds for quality gates in CI/CD
- Balancing strictness with practical tolerance
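The gate logic behind a thresholds table can be sketched in plain Lua. This is only an illustration, not how Tactus evaluates thresholds internally: `check_thresholds`, the `rules` list, and the shape of the `metrics` table are all hypothetical names invented here. Only the threshold keys (`min_success_rate`, `max_cost_per_run`, `max_duration`, `max_tokens_per_run`) come from the example source.

```lua
-- Hypothetical helper: compare aggregated metrics against a thresholds table.
-- Threshold keys mirror the example; the metrics table shape is an assumption.
local function check_thresholds(metrics, thresholds)
    local failures = {}

    -- Each rule pairs a threshold key with a metric and says whether
    -- the threshold is a floor ("min") or a ceiling ("max").
    local rules = {
        {key = "min_success_rate",   metric = "success_rate",   kind = "min"},
        {key = "max_cost_per_run",   metric = "cost_per_run",   kind = "max"},
        {key = "max_duration",       metric = "duration",       kind = "max"},
        {key = "max_tokens_per_run", metric = "tokens_per_run", kind = "max"},
    }

    for _, rule in ipairs(rules) do
        local limit = thresholds[rule.key]
        local value = metrics[rule.metric]
        -- Skip rules whose threshold or metric is not present
        if limit ~= nil and value ~= nil then
            local ok
            if rule.kind == "min" then
                ok = value >= limit
            else
                ok = value <= limit
            end
            if not ok then
                failures[#failures + 1] = string.format(
                    "%s: got %s, limit %s",
                    rule.key, tostring(value), tostring(limit))
            end
        end
    end

    -- The gate passes only when no rule was violated
    return #failures == 0, failures
end

local thresholds = {
    min_success_rate = 0.80,
    max_cost_per_run = 0.01,
    max_duration = 10.0,
    max_tokens_per_run = 500,
}

-- Made-up metrics for illustration: success rate is below the floor
local metrics = {
    success_rate = 0.73,
    cost_per_run = 0.004,
    duration = 6.2,
    tokens_per_run = 410,
}

local passed, failures = check_thresholds(metrics, thresholds)
print(passed)
for _, msg in ipairs(failures) do
    print(msg)
end
```

In a CI job, a script built on this idea would typically finish with something like `os.exit(passed and 0 or 1)` so that a violated threshold fails the pipeline step.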

Source Code

-- Example: CI/CD Thresholds
-- This demonstrates quality gates for automated testing pipelines

-- Import completion tool from standard library
local done = require("tactus.tools.done")

greeter = Agent {
    model = "openai/gpt-4o-mini",
    system_prompt = [[You are a friendly greeter.

Generate a warm, personalized greeting for the given name.
Call the 'done' tool with your greeting.]],
    initial_message = "Generate a greeting for {name}",
    tools = {done}
}

Procedure {
    input = {
        name = field.string{required = true}
    },
    output = {
        greeting = field.string{required = true}
    },
    function(input)
        -- Have agent generate greeting
        greeter()

        -- Get result
        if done.called() then
            return {
                -- Fall back to a fixed string if the tool returned no result
                greeting = done.last_result() or "Task completed"
            }
        end

        return {greeting = "No greeting generated"}
    end
}

-- BDD Specifications
Specification([[
Feature: Greeting Generation with Thresholds

  Scenario: Agent generates greeting
    Given the procedure has started
    And the input name is "Alice"
    And the agent "greeter" responds with "I've generated a warm greeting."
    And the agent "greeter" calls tool "done" with args {"reason": "Hello! Welcome, it's great to see you!"}
    When the procedure runs
    Then the done tool should be called
    And the procedure should complete successfully
]])

-- Pydantic AI Evaluations with CI/CD Thresholds
-- Note: Evaluations framework is partially implemented.
-- Commented out until field.contains and field.llm_judge are available.
--[[
Evaluation({
    runs = 5,
    parallel = true,

    dataset = {
        {
            name = "greeting_alice",
            inputs = {name = "Alice"}
        },
        {
            name = "greeting_bob",
            inputs = {name = "Bob"}
        },
        {
            name = "greeting_charlie",
            inputs = {name = "Charlie"}
        }
    },

    evaluators = {
        -- Check greeting includes the name
        field.contains{},

        -- LLM judge for quality
        field.llm_judge{}
    },

    -- Quality gates for CI/CD
    thresholds = {
        min_success_rate = 0.80,  -- Require 80% success rate
        max_cost_per_run = 0.01,  -- Max $0.01 per run
        max_duration = 10.0,      -- Max 10 seconds per run
        max_tokens_per_run = 500  -- Max 500 tokens per run
    }
})
]]--
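To get a feel for how strict `min_success_rate = 0.80` is, it helps to turn the rate into a run count. The sketch below assumes that every dataset case is executed `runs` times, giving 5 x 3 = 15 runs in total; the source does not confirm how `runs` and `dataset` combine, so treat that multiplication as an assumption. `min_passes` is a hypothetical helper written for this illustration.

```lua
-- Hypothetical helper: smallest number of successful runs whose success
-- rate meets the floor. Assumes 0 <= min_success_rate <= 1.
local function min_passes(total_runs, min_success_rate)
    for passes = 0, total_runs do
        if passes / total_runs >= min_success_rate then
            return passes
        end
    end
    return total_runs
end

local runs_per_case = 5   -- `runs` in the Evaluation block
local dataset_size = 3    -- greeting_alice, greeting_bob, greeting_charlie

-- Assumption: each dataset case is executed `runs` times
local total_runs = runs_per_case * dataset_size

local required = min_passes(total_runs, 0.80)
print(string.format("%d of %d runs must succeed; %d may fail",
    required, total_runs, total_runs - required))
```

Under that assumption the gate tolerates 3 failed runs out of 15. Raising the floor to 0.95 would leave no room for failure at this sample size, which is the strictness versus tolerance trade-off the example description mentions.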

Quick Start

Run the example:

$ tactus run 05-evaluations/03-thresholds.tac

Test with mocks:

$ tactus test 05-evaluations/03-thresholds.tac --mock

Note

This example requires API keys. Set your OPENAI_API_KEY environment variable before running.

Tactus

Tactus is a programming language and runtime for durable AI agent procedures with checkpointing, sandboxing, and built-in human-in-the-loop controls.

Free and open-source software. Designed cybernetically by Ryan Porter.