Evaluator
Assess content quality using customizable evaluation metrics
The Evaluator block allows you to assess the quality of content using customizable evaluation metrics. This is particularly useful for evaluating AI-generated text, ensuring outputs meet specific criteria, and building quality-control mechanisms into your workflows.
Overview
The Evaluator block uses LLMs to evaluate content objectively against custom metrics you define. This is especially useful for:
- Assessing the quality of AI-generated content
- Evaluating responses against specific criteria
- Creating scoring frameworks for different types of content
- Building objective feedback loops in your workflows
Configuration Options
Evaluation Metrics
Define custom metrics to evaluate content against. Each metric includes:
- Name: A short identifier for the metric
- Description: A detailed explanation of what the metric measures
- Range: The numeric range for scoring (e.g., 1-5, 0-10)
Example metrics:
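The exact configuration format depends on your setup; the TypeScript sketch below only illustrates the Name/Description/Range structure described above, and the metric names shown are hypothetical.

```typescript
// Hypothetical metric definitions -- the actual block configuration format may differ.
interface EvaluationMetric {
  name: string;                         // short identifier for the metric
  description: string;                  // what the metric measures
  range: { min: number; max: number };  // numeric scoring range
}

const exampleMetrics: EvaluationMetric[] = [
  {
    name: "accuracy",
    description: "How factually correct and precise the content is",
    range: { min: 1, max: 5 },
  },
  {
    name: "clarity",
    description: "How clear and easy to understand the writing is",
    range: { min: 1, max: 5 },
  },
  {
    name: "relevance",
    description: "How well the content addresses the original request",
    range: { min: 1, max: 5 },
  },
];
```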
Content
The content to be evaluated. This can be:
- Directly provided in the block configuration
- Connected from another block's output (typically an Agent block)
- Dynamically generated during workflow execution
Model Selection
Choose an LLM provider to perform the evaluation:
- OpenAI (GPT-4o, o1, o3-mini)
- Anthropic (Claude 3.7 Sonnet)
- Google (Gemini 2.5 Pro, Gemini 2.0 Flash)
- Groq, Cerebras
- Ollama (local models)
- And more
The chosen model should have strong reasoning capabilities to provide accurate evaluations.
API Key
Your API key for the selected LLM provider. This is securely stored and used for authentication.
How It Works
1. The Evaluator block takes the provided content and your custom metrics
2. It generates a specialized prompt that instructs the LLM to evaluate the content (see the sketch after this list)
3. The prompt includes clear guidelines on how to score each metric
4. The LLM evaluates the content and returns a numeric score for each metric
5. The Evaluator block formats these scores as structured output for use in your workflow
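The block's internal implementation isn't shown here; the TypeScript sketch below is a rough, hypothetical illustration of how a metric list can be turned into an evaluation prompt and how an LLM's JSON reply might be parsed into scores. The function names and prompt wording are assumptions, not the block's actual API.

```typescript
// Hypothetical sketch of the evaluation flow; not the block's actual implementation.
interface EvaluationMetric {
  name: string;
  description: string;
  range: { min: number; max: number };
}

// Build a prompt that asks the model to score each metric and reply with JSON only.
function buildEvaluationPrompt(content: string, metrics: EvaluationMetric[]): string {
  const metricLines = metrics
    .map(m => `- ${m.name} (${m.range.min}-${m.range.max}): ${m.description}`)
    .join("\n");
  return [
    "Evaluate the following content against each metric.",
    "Respond with a JSON object mapping metric names to numeric scores within the given ranges.",
    "",
    "Metrics:",
    metricLines,
    "",
    "Content:",
    content,
  ].join("\n");
}

// Parse the model's JSON reply into numeric scores, clamping each value to its metric's range.
function parseScores(reply: string, metrics: EvaluationMetric[]): Record<string, number> {
  const raw = JSON.parse(reply) as Record<string, number>;
  const scores: Record<string, number> = {};
  for (const m of metrics) {
    const value = Number(raw[m.name]);
    scores[m.name] = Math.min(m.range.max, Math.max(m.range.min, isNaN(value) ? m.range.min : value));
  }
  return scores;
}
```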
Inputs and Outputs
Inputs
- Content: The text or structured data to evaluate
- Metrics: Custom evaluation criteria with scoring ranges
- Model Settings: LLM provider and parameters
Outputs
- Content: A summary of the evaluation
- Model: The model used for evaluation
- Tokens: Usage statistics
- Metric Scores: Numeric scores for each defined metric
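To show how these outputs might be consumed downstream, the sketch below models the output shape as a TypeScript type and branches on a score threshold. The field names and token structure are assumptions based on the list above, not the block's exact schema.

```typescript
// Hypothetical shape of the Evaluator block's output; field names are assumptions.
interface EvaluatorOutput {
  content: string;                                                // summary of the evaluation
  model: string;                                                  // model used for evaluation
  tokens: { prompt: number; completion: number; total: number };  // usage statistics
  metricScores: Record<string, number>;                           // numeric score per defined metric
}

// Example: a downstream step could branch on a score threshold to build a feedback loop.
function needsRevision(output: EvaluatorOutput, metric: string, threshold: number): boolean {
  return (output.metricScores[metric] ?? 0) < threshold;
}
```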
Example Usage
Here's an example of how an Evaluator block might be configured for assessing customer service responses:
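The sketch below is one possible configuration for this use case; the metric names, ranges, placeholder content reference, and model choice are illustrative assumptions, not a prescribed setup.

```typescript
// Illustrative configuration for evaluating customer service responses.
// Metric names, ranges, and the model choice are assumptions for this example.
const customerServiceEvaluator = {
  model: "gpt-4o",
  content: "<output of the upstream Agent block>", // typically connected from an Agent block
  metrics: [
    { name: "helpfulness", description: "Does the response fully resolve the customer's issue?", range: { min: 1, max: 5 } },
    { name: "tone",        description: "Is the response polite, empathetic, and professional?",  range: { min: 1, max: 5 } },
    { name: "accuracy",    description: "Are policy details and product facts stated correctly?", range: { min: 1, max: 5 } },
  ],
};
```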
Best Practices
- Use specific metric descriptions: Clearly define what each metric measures to get more accurate evaluations
- Choose appropriate ranges: Select scoring ranges that provide enough granularity without being overly complex
- Connect with Agent blocks: Use Evaluator blocks to assess Agent block outputs and create feedback loops
- Use consistent metrics: For comparative analysis, maintain consistent metrics across similar evaluations
- Combine multiple metrics: Use several metrics to get a comprehensive evaluation