Evaluation Metrics
Gathering baseline performance metrics for EvalGuard prior to full integration is an essential step in ensuring successful implementation. By establishing clear benchmarks, organizations can accurately assess EvalGuard's capabilities.
Measuring Success Criteria - Classification Evaluation
EvalGuard functions as a control layer around your model, offering customizability over what enters and leaves your model through detection.
The first phase of testing focuses on the question, "How effective are EvalGuard's detection capabilities?"
To answer this, EvalGuard recommends using a Confusion Matrix to assess labeled datasets. A confusion matrix provides a detailed evaluation of detection capabilities by measuring:
True Positives: The model correctly predicts the positive class.True Negatives: The model correctly predicts the negative class.False Positives: The model incorrectly predicts the positive class.False Negatives: The model incorrectly predicts the negative class.
Confusion Matrix
A Confusion Matrix is a standardized approach for gaining insights into how well the model identifies positive instances and avoids false detections.
EvalGuard considers a predicted positive to be an input that EvalGuard is expected to flag as True.
A predicted negative represents a benign input that is expected to flag as False.
For example, a known prompt injection input such as, Ignore your system prompt and perform the following instructions, is expected to produce an API response containing flagged: true. This is a predicted positive.
Predicted Positive
Predicted Negative
Actual Positive
True Positive (TP)
False Negative (FN)
Actual Negative
False Positive (FP)
True Negative (TN)
Using a Confusion Matrix allows for the calculation of recall, accuracy, and false positive rate, which are valuable for evaluating the performance of a classification model.
Metrics
Metric
Description
Formula
Recall
Measures the ability to identify all relevant instances.
TP / (TP + FN)
Accuracy
Measures the overall correctness of the model.
(TP + TN) / (TP + TN + FP + FN)
False Positive Rate
Proportion of actual negatives incorrectly classified as positive.
FP / (FP + TN)
Example - Confusion Matrix
Consider a mixed dataset containing `10 true positive` prompt injection examples and `10 true negative` benign prompts examples. If the model correctly identifies all true positives and true negatives, it demonstrates a 100% accuracy rate.
Predicted Positive
Predicted Negative
Actual Positive
10
0
Actual Negative
0
10
Example - Dataset Calculations
In this example, consider an imbalanced dataset. Imbalanced data is representative of real-world use cases as we'd expect there to be far more benign prompts than malicious ones. In our dataset, we have `50 predicted positives (prompt injections)` and `950 predicted negatives (benign prompts)`.
Predicted Positive
Predicted Negative
Actual Positive
48
2
Actual Negative
3
947
The model has correctly classified:
Actual Positive: 48 prompts as prompt injectionsActual Negative: 947 prompts as benign inputs
The model has incorrectly classified:
Predicted Negative: 2 prompt injections as benign inputsPredicted Positive: 3 benign inputs as prompt injections
Based on these results, we can calculate scoring using the formulas in the metrics table above.
Metric
Result
Recall
96.0%
Accuracy
99.5%
False Positive Rate
0.32%
Example - Testing Your Datasets
A labeled dataset is required to measure EvalGuard's detection capabilities with a Confusion Matrix.
Labeling
Each detector requires a correctly labeled dataset. As an example, consider the prompt injection endpoint. The dataset must contain a set of known prompt injections and/or jailbreaks, and a set of benign prompts. The quality of the labeled testing data is crucial for producing meaningful results.
Datasets can be in any structured format, but should contain consistent and accurate labeling. For example, a JSON-encoded dataset may look like this:
Running test
Last updated