🏋️♀️ Health & Fitness Evaluations with Azure AI Foundry 🏋️♂️¶
This notebook demonstrates how to evaluate a Generative AI model using the Azure AI Foundry ecosystem. We'll highlight the interplay of three key SDKs:
- azure-ai-projects (AIProjectClient): to manage & orchestrate evaluations from the cloud.
- azure-ai-inference: to perform model inference (optional, but relevant if you want to generate responses for evaluation).
- azure-ai-evaluation: to run automated metrics for LLM output quality & safety.
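As a quick orientation, here's a minimal import sketch showing where the main entry points of each SDK live (the azure-ai-inference client is only needed if you choose to generate responses yourself):
# Minimal orientation sketch (assumes all three SDKs are installed)
from azure.ai.projects import AIProjectClient              # cloud-side evaluation orchestration
from azure.ai.inference import ChatCompletionsClient       # optional: generate model responses
from azure.ai.evaluation import evaluate, F1ScoreEvaluator, RelevanceEvaluator  # local metrics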
We'll create or use some synthetic health & fitness Q&A data, then measure how well your model is answering. We'll do a local evaluation and a cloud evaluation on an Azure AI Project. 🚀
🍉 Notebook Contents¶
- Setup & Imports
- Mermaid Diagram of the Flow
- Local Evaluation Example
- Cloud Evaluation with AIProjectClient
- Conclusion
⚠️Disclaimer¶
This notebook deals with a hypothetical health & fitness scenario. No real medical advice is provided. Always seek professional guidance when needed!
1. Setup and Imports¶
We'll install necessary libraries, import them, and define some synthetic data.
Dependencies¶
- azure-ai-projects (manages project-based evaluations in the cloud)
- azure-ai-evaluation (provides built-in metrics like F1ScoreEvaluator, RelevanceEvaluator, etc.)
- azure-ai-inference (optionally used if you want to generate completions/chats to produce data to evaluate)
- azure-identity (for Azure authentication)
- opentelemetry-sdk and azure-core-tracing-opentelemetry if you want to enable advanced tracing (optional)
Synthetic Data¶
We'll create a small JSONL with health & fitness Q&A pairs for demonstration.
%%capture
# If you need to install these, uncomment:
# !pip install azure-ai-projects azure-ai-evaluation azure-ai-inference azure-identity
import json
import os
from pathlib import Path
from azure.identity import DefaultAzureCredential  # used later for Azure authentication
# We'll create a synthetic dataset in JSON Lines format
synthetic_eval_data = [
{
"query": "How can I start a beginner workout routine at home?",
"context": "Workout routines can include push-ups, bodyweight squats, lunges, and planks.",
"response": "You can just go for 10 push-ups total.",
"ground_truth": "At home, you can start with short, low-intensity workouts. Examples: push-ups, lunges, and planks in short sets."
},
{
"query": "Are diet sodas healthy for daily consumption?",
"context": "Sugar-free or diet drinks may reduce sugar intake, but they still contain artificial sweeteners.",
"response": "Yes, diet sodas are 100% healthy.",
"ground_truth": "Diet sodas are lower in sugar than regular soda, but they're not necessarily 'healthy' for daily consumption due to artificial additives."
},
{
"query": "What's the capital of France?",
"context": "France is a country in Europe. Paris is the capital.",
"response": "London.",
"ground_truth": "Paris."
}
]
# Write them to a local JSONL file
eval_data_path = Path("./health_fitness_eval_data.jsonl")
with eval_data_path.open("w", encoding="utf-8") as f:
for row in synthetic_eval_data:
f.write(json.dumps(row) + "\n")
print(f"Sample evaluation data written to {eval_data_path.resolve()}")
3. Local Evaluation¶
We'll show how to run local, code-based evaluation. Specifically, we'll combine a couple of built-in evaluators (F1ScoreEvaluator and RelevanceEvaluator) with a simple custom one, then see how they do on each row in our synthetic data.
Steps¶
- Import the evaluators
- Construct a local evaluate(...) run specifying each evaluator.
- Inspect results.
Note¶
For RelevanceEvaluator (and other AI-assisted evaluators like Groundedness or Coherence), we need a GPT model config. We'll skip real model endpoints here, but the code is shown.
We'll also show how we can do code-based custom evaluators (like a simple function that checks length!).
import os
from azure.ai.evaluation import (
evaluate,
F1ScoreEvaluator,
RelevanceEvaluator
)
# We'll define a dummy custom evaluator that just logs the length of the response.
def response_length_eval(response, **kwargs):
return {"resp_length": len(response)}
# If you have a real Azure OpenAI model, fill in:
my_model_config = {
"azure_endpoint": os.environ.get("MY_AOAI_ENDPOINT", "https://dummy-endpoint.azure.com/"),
"api_key": os.environ.get("MY_AOAI_KEY", "fake-key"),
"azure_deployment": os.environ.get("MY_AOAI_DEPLOYMENT_NAME", "gpt-4"),
"api_version": os.environ.get("MY_AOAI_API_VERSION", "2023-03-15-preview"),
}
# Let's instantiate them
f1_eval = F1ScoreEvaluator() # no GPT needed
relevance_eval = RelevanceEvaluator(model_config=my_model_config) # GPT-based
result = evaluate(
data=str(eval_data_path),
evaluators={
"f1_score": f1_eval,
"relevance": relevance_eval,
"resp_len": response_length_eval
},
# We can specify how to map the columns from our data to the arguments of each evaluator.
evaluator_config={
"f1_score": {
"column_mapping": {
"response": "${data.response}",
"ground_truth": "${data.ground_truth}"
}
},
"relevance": {
"column_mapping": {
"query": "${data.query}",
"response": "${data.response}"
}
},
"resp_len": {
"column_mapping": {
"response": "${data.response}"
}
},
},
# We won't specify azure_ai_project or output_path to keep it local.
)
print("Local evaluation result =>\n")
print(result)
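Custom evaluators don't have to be plain functions; any callable works, so a small class with __call__ is handy when you want configurable behavior. A minimal sketch (the class name and threshold below are made up for illustration):
# Hypothetical class-based custom evaluator: flags responses shorter than a minimum length.
class MinLengthEvaluator:
    def __init__(self, min_length: int = 20):
        self.min_length = min_length

    def __call__(self, *, response: str, **kwargs):
        return {
            "resp_length": len(response),
            "meets_min_length": len(response) >= self.min_length,
        }

# It plugs into evaluate(...) exactly like the function-based evaluator above, e.g.:
# evaluators={"min_len": MinLengthEvaluator(min_length=30)}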
Inspecting Local Results¶
The evaluate(...) function returns a dictionary with:
- metrics: aggregated metrics across rows (like average f1_score or average relevance)
- rows: row-by-row results with inputs and the computed evaluator outputs
- traces: debugging or additional info, if you had any
Example:
{
'metrics': { ... },
'rows': [
{
'inputs.response': 'Yes, diet sodas are 100% healthy.',
'outputs.f1_score.f1_score': 0.0,
'outputs.relevance.relevance': 3.0,
'outputs.resp_len.resp_length': 31,
...
},
...
]
}
Our result object can be used for analysis or exported to another location.
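For example, one convenient way to slice the row-level results is to load them into a pandas DataFrame (this sketch assumes pandas is available; the column names follow the inputs./outputs. pattern shown above):
import json
import pandas as pd  # assumes pandas is installed in your environment

# Row-level results as a DataFrame for filtering and sorting.
rows_df = pd.DataFrame(result["rows"])
print(rows_df[["inputs.response", "outputs.f1_score.f1_score", "outputs.resp_len.resp_length"]])

# Persist the aggregated metrics for later comparison.
with open("local_eval_metrics.json", "w", encoding="utf-8") as f:
    json.dump(result["metrics"], f, indent=2)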
4. Azure Evaluation with AIProjectClient¶
Sometimes, we want to scale our evaluation to the cloud, track the results in an Azure AI Project, or schedule recurring evaluations. We'll do that by using:
- AIProjectClient from azure-ai-projects
- Evaluation from azure.ai.projects.models
We'll show how you might:
- Upload the local JSONL to your Azure AI Project
- Create an Evaluation referencing a built-in evaluator
- Submit the evaluation & poll for results
- Fetch & check the final status and (optionally) get a link to AI Studio.
Prerequisites¶
- Azure AI Foundry (AI Hub / Project) with a project-level connection string.
- A GPT-based Azure OpenAI deployment if you want to run GPT-based evaluators like Relevance.
Let's demonstrate now. 🎉¶
import os
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import (
Evaluation, Dataset, EvaluatorConfiguration, ConnectionType
)
from azure.ai.evaluation import F1ScoreEvaluator, RelevanceEvaluator, ViolenceEvaluator
# We'll show a minimal example, based on the SDK sample code.
# 1. Connect to Azure AI Project
project_connection_string = os.environ.get("PROJECT_CONNECTION_STRING", "<YOUR_CONNECTION_STRING>")
credential = DefaultAzureCredential()
project_client = AIProjectClient.from_connection_string(
credential=credential,
conn_str=project_connection_string,
)
print("✅ Created AIProjectClient.")
# 2. Upload data for evaluation
data_id, _ = project_client.upload_file(str(eval_data_path))
print("✅ Uploaded local JSONL. Data asset ID:", data_id)
# 3. Let's define a connection for GPT-based evaluator (Relevance). We'll assume you have a default AOAI conn.
default_connection = project_client.connections.get_default(connection_type=ConnectionType.AZURE_OPEN_AI)
deployment_name = os.environ.get("DEPLOYMENT_NAME", "gpt-4")
api_version = os.environ.get("AOAI_API_VERSION", "2023-06-01-preview")
# 4. Construct the evaluation object
evaluation = Evaluation(
display_name="Health Fitness Remote Evaluation",
description="Evaluating dataset for correctness and violence.",
data=Dataset(id=data_id),
evaluators={
# NLP-based
"f1_score": EvaluatorConfiguration(
id=F1ScoreEvaluator.id,
),
# GPT-based
"relevance": EvaluatorConfiguration(
id=RelevanceEvaluator.id,
init_params={
"model_config": default_connection.to_evaluator_model_config(
deployment_name=deployment_name, api_version=api_version
)
},
),
# Safety-based (violence)
"violence": EvaluatorConfiguration(
id=ViolenceEvaluator.id,
init_params={"azure_ai_project": project_client.scope},
),
},
)
evaluation_response = project_client.evaluations.create(
evaluation=evaluation,
)
print("✅ Created evaluation job. ID:", evaluation_response.id)
# Optionally, we can fetch the status.
get_evaluation_response = project_client.evaluations.get(evaluation_response.id)
print("Current status:", get_evaluation_response.status)
# The evaluation may still be in progress. We can poll or just wait.
print("You can check the Azure AI Project UI to see the final results!")
if isinstance(get_evaluation_response.properties, dict):
print("AI Studio link:", get_evaluation_response.properties.get("AiStudioEvaluationUri"))
Observing AI Foundry Evaluation Results¶
In the output logs, you'll see an AiStudioEvaluationUri that links to your Azure AI Project in the Azure portal. There, you can:
- View aggregated metrics (like average F1 Score or average Relevance Score)
- Inspect row-level details to see which queries had the highest or lowest performance.
Once the job completes, the final status is Succeeded, Failed, or Cancelled. You can store these metrics for auditing or continuous improvement.
Scheduling Evaluations¶
Using AIProjectClient, you can also schedule recurring evaluations (e.g., daily) on new or streaming data. Check out EvaluationSchedule in the docs for more advanced usage.
5. Conclusion 🏁¶
In this notebook, we:
- Created synthetic health & fitness Q&A data.
- Ran a local evaluation with the evaluate() function from azure-ai-evaluation.
- Demonstrated a cloud evaluation using the AIProjectClient to create an Evaluation.
Next Steps & Tips¶
- Add OpenTelemetry to trace your calls for advanced debugging.
- Combine more metrics: GroundednessEvaluator, SelfHarmEvaluator, etc. for a thorough analysis.
- Create your own custom code-based or prompt-based evaluators to handle domain-specific success metrics.
- Explore the Adversarial or Simulator features in azure-ai-evaluation to generate test data.
Thanks for following along – keep building healthy, high-quality AI apps! 🍏✨