Evaluation
This guide covers the benchmark evaluation workflows in examples/evaluation/.
Overview
The evaluation examples provide two benchmark tracks:
- QReCC rewrite evaluation (examples/evaluation/qrecc_eval.py)
- MultiDoc2Dial grounded generation evaluation (examples/evaluation/md2d_eval.py)
A separate indexing workflow is included for MultiDoc2Dial corpus ingestion:
- MultiDoc2Dial corpus indexing (examples/evaluation/md2d_indexing.py)
Prerequisites
- Install dependencies:
- Create required credentials:
orcheo credential create openai_api_key --secret sk-your-key-here
orcheo credential create pinecone_api_key --secret your-pinecone-key
Data Sources
Default configs use hosted benchmark artifacts:
- QReCC test set via Hugging Face (config_qrecc.json)
- MultiDoc2Dial validation dialogs and corpus (config_md2d.json, config_md2d_indexing.json)
Workflow: QReCC
Upload and run:
orcheo workflow upload examples/evaluation/qrecc_eval.py \
--config-file examples/evaluation/config_qrecc.json
orcheo workflow run <workflow-id> --verbose
Pipeline summary:
- QReCCDatasetNode loads and structures conversations.
- ConversationalBatchEvalNode executes per-turn rewrite evaluation.
- RougeMetricsNode + SemanticSimilarityMetricsNode score outputs.
- AnalyticsExportNode merges corpus metrics and per-conversation views.
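To make the scoring step concrete, the following is a minimal sketch of a ROUGE-1 style unigram overlap between a gold rewrite and a predicted rewrite. It is not the implementation used by RougeMetricsNode; the function name `rouge1`, the whitespace tokenization, and the sample sentences are assumptions chosen for illustration.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict[str, float]:
    """Unigram-overlap ROUGE-1 on lowercased, whitespace-split tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts at most as often as it appears in both.
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    precision = overlap / sum(cand_counts.values()) if cand_counts else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical turn: gold rewrite vs. model rewrite (illustrative text only).
gold = "what is the capital of France"
predicted = "what is the capital city of France"
scores = rouge1(gold, predicted)
```

Here every gold token appears in the prediction, so recall is 1.0, while the extra token "city" lowers precision to 6/7. Production ROUGE implementations typically add stemming and other normalization, so their scores can differ from this sketch.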
Workflow: MultiDoc2Dial
Step 1: Index corpus (run once per index/namespace)
orcheo workflow upload examples/evaluation/md2d_indexing.py \
--config-file examples/evaluation/config_md2d_indexing.json
orcheo workflow run <workflow-id>
Step 2: Run evaluation
orcheo workflow upload examples/evaluation/md2d_eval.py \
--config-file examples/evaluation/config_md2d.json
orcheo workflow run <workflow-id> --verbose
Pipeline summary:
- MultiDoc2DialDatasetNode loads/normalizes conversations.
- ConversationalBatchEvalNode runs rewrite -> retrieval -> generation per turn.
- TokenF1MetricsNode, BleuMetricsNode, and RougeMetricsNode score outputs.
- AnalyticsExportNode aggregates metric outputs into a report payload.
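The per-turn rewrite -> retrieval -> generation flow can be sketched as a plain loop. This is an illustration of the control flow only, not the code inside ConversationalBatchEvalNode; `run_conversation` and the three stub callables are hypothetical names, and a real run would call a rewrite model, a vector store, and a generation model in their place.

```python
from typing import Callable


def run_conversation(
    turns: list[str],
    rewrite: Callable[[list[str], str], str],
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
) -> list[dict]:
    """Run rewrite -> retrieval -> generation for each user turn, in order."""
    history: list[str] = []
    results = []
    for turn in turns:
        query = rewrite(history, turn)      # make the turn self-contained
        passages = retrieve(query)          # fetch grounding passages
        answer = generate(query, passages)  # produce the grounded answer
        results.append({"query": query, "passages": passages, "answer": answer})
        history.extend([turn, answer])      # later turns see earlier context
    return results


# Hypothetical stand-ins for the model and vector-store calls.
def stub_rewrite(history: list[str], turn: str) -> str:
    return turn if not history else f"{turn} (context: {history[0]})"


def stub_retrieve(query: str) -> list[str]:
    return [f"passage about {query}"]


def stub_generate(query: str, passages: list[str]) -> str:
    return f"answer using {len(passages)} passage(s)"


outputs = run_conversation(
    ["How do I renew a license?", "What does it cost?"],
    stub_rewrite,
    stub_retrieve,
    stub_generate,
)
```

Passing the three stages in as callables keeps the loop testable without credentials, which can be useful when checking data loading before spending API calls.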
Config Notes
Key config values you will typically tune:
- max_conversations/max_documents for faster iteration loops.
- retrieval.embed_model and retrieval.dimensions for embedding behavior.
- vector_store.pinecone.index_name and namespace for isolation.
- generation.model and rewrite model settings.
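One way to produce a small smoke-test variant of a config is to apply dotted-path overrides to a copy of it. The sketch below assumes the dotted names above map to nested JSON objects, which the shipped config files may or may not follow; `with_overrides` is a hypothetical helper and every value in `base` is a placeholder.

```python
import copy


def with_overrides(config: dict, overrides: dict) -> dict:
    """Return a deep copy of config with dotted-path overrides applied."""
    result = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = result
        for part in parents:
            # Create intermediate objects if the path does not exist yet.
            node = node.setdefault(part, {})
        node[leaf] = value
    return result


# Assumed config shape and placeholder values, for illustration only.
base = {
    "max_conversations": 100,
    "retrieval": {"embed_model": "example-embedding-model", "dimensions": 512},
    "vector_store": {
        "pinecone": {"index_name": "example-index", "namespace": "dev"}
    },
}

quick = with_overrides(
    base,
    {"max_conversations": 5, "vector_store.pinecone.namespace": "smoke-test"},
)
```

Writing `quick` out with `json.dump` would give a file that could be passed to `--config-file`. Using a separate namespace for the small run keeps it from mixing with vectors from a full indexing run.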
Expected Outputs
AnalyticsExportNode returns:
- report.metrics: corpus-level metric map
- report.per_conversation: per-conversation aggregated scores
- table: markdown-style metric table
- report_json: pretty-printed JSON payload
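As a rough sketch of consuming this payload, the snippet below pulls one corpus-level metric and lists the lowest-scoring conversations for it. It assumes `report.per_conversation` is a mapping from conversation id to a metric map, which is a guess at the shape; `summarize`, the metric name `rouge1`, and all sample values are hypothetical.

```python
def summarize(result: dict, metric: str, n: int = 2) -> dict:
    """Return the corpus score and the n lowest-scoring conversation ids."""
    report = result["report"]
    per_conversation = report["per_conversation"]
    # Sort ascending so the weakest conversations come first.
    ranked = sorted(per_conversation.items(), key=lambda item: item[1][metric])
    return {
        "corpus": report["metrics"][metric],
        "worst": [conversation_id for conversation_id, _ in ranked[:n]],
    }


# Made-up payload in the assumed shape; not real benchmark results.
sample = {
    "report": {
        "metrics": {"rouge1": 0.5},
        "per_conversation": {
            "conv-a": {"rouge1": 0.7},
            "conv-b": {"rouge1": 0.2},
            "conv-c": {"rouge1": 0.6},
        },
    }
}

summary = summarize(sample, "rouge1")
```

Looking at the weakest conversations first tends to be a quicker way to spot retrieval or rewrite failures than reading the corpus-level table alone.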