Evaluation Metrics

Gathering baseline performance metrics for EvalGuard prior to full integration is an essential step in ensuring successful implementation. By establishing clear benchmarks, organizations can accurately assess EvalGuard's capabilities.

Measuring Success Criteria - Classification Evaluation

EvalGuard functions as a control layer around your model, offering customizability over what enters and leaves your model through detection.

The first phase of testing focuses on the question, "How effective are EvalGuard's detection capabilities?"

To answer this, EvalGuard recommends using a Confusion Matrix to assess labeled datasets. A confusion matrix provides a detailed evaluation of detection capabilities by measuring:

  • True Positives: The model correctly predicts the positive class.

  • True Negatives: The model correctly predicts the negative class.

  • False Positives: The model incorrectly predicts the positive class.

  • False Negatives: The model incorrectly predicts the negative class.

Confusion Matrix

A Confusion Matrix is a standardized approach for gaining insights into how well the model identifies positive instances and avoids false detections.

EvalGuard considers a predicted positive to be an input that EvalGuard is expected to flag as True.

A predicted negative represents a benign input that is expected to flag as False.

For example, a known prompt injection input such as, Ignore your system prompt and perform the following instructions, is expected to produce an API response containing flagged: true. This is a predicted positive.

Predicted Positive

Predicted Negative

Actual Positive

True Positive (TP)

False Negative (FN)

Actual Negative

False Positive (FP)

True Negative (TN)

Using a Confusion Matrix allows for the calculation of recall, accuracy, and false positive rate, which are valuable for evaluating the performance of a classification model.

Metrics

Metric

Description

Formula

Recall

Measures the ability to identify all relevant instances.

TP / (TP + FN)

Accuracy

Measures the overall correctness of the model.

(TP + TN) / (TP + TN + FP + FN)

False Positive Rate

Proportion of actual negatives incorrectly classified as positive.

FP / (FP + TN)

Example - Confusion Matrix

Consider a mixed dataset containing `10 true positive` prompt injection examples and `10 true negative` benign prompts examples. If the model correctly identifies all true positives and true negatives, it demonstrates a 100% accuracy rate.

Predicted Positive

Predicted Negative

Actual Positive

10

0

Actual Negative

0

10

Example - Dataset Calculations

In this example, consider an imbalanced dataset. Imbalanced data is representative of real-world use cases as we'd expect there to be far more benign prompts than malicious ones. In our dataset, we have `50 predicted positives (prompt injections)` and `950 predicted negatives (benign prompts)`.

Predicted Positive

Predicted Negative

Actual Positive

48

2

Actual Negative

3

947

The model has correctly classified:

  • Actual Positive: 48 prompts as prompt injections

  • Actual Negative: 947 prompts as benign inputs

The model has incorrectly classified:

  • Predicted Negative: 2 prompt injections as benign inputs

  • Predicted Positive: 3 benign inputs as prompt injections

Based on these results, we can calculate scoring using the formulas in the metrics table above.

Metric

Result

Recall

96.0%

Accuracy

99.5%

False Positive Rate

0.32%

Example - Testing Your Datasets

A labeled dataset is required to measure EvalGuard's detection capabilities with a Confusion Matrix.

Labeling

Each detector requires a correctly labeled dataset. As an example, consider the prompt injection endpoint. The dataset must contain a set of known prompt injections and/or jailbreaks, and a set of benign prompts. The quality of the labeled testing data is crucial for producing meaningful results.

Datasets can be in any structured format, but should contain consistent and accurate labeling. For example, a JSON-encoded dataset may look like this:

Running test

Last updated