Advanced RAG 03: Evaluating RAG Applications with RAGAs and LlamaIndex

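Editor's note: Retrieval Augmented Generation (RAG) is now widely used across large language model applications. How to evaluate the performance and effectiveness of a RAG system accurately, however, remains a central concern for both industry and academia. Without a comprehensive, objective evaluation, it is hard to optimize and improve a RAG system in a targeted way. Developing a sound, reliable set of evaluation metrics is therefore important for advancing RAG technology.

This article is the third in the Advanced RAG series. It introduces the RAG evaluation metrics proposed by RAGAs (Retrieval Augmented Generation Assessment) and explains how to implement the whole evaluation process with RAGAs and LlamaIndex.

Author | Florian June

Translator | 岳扬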

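If you have built a Retrieval Augmented Generation (RAG) application for a real business system, you will probably want to know how well it works. In other words, you will want to evaluate its performance.

Moreover, if your existing RAG application is not effective enough, you may need to verify whether the optimization methods you apply actually help. That, too, calls for evaluation.

In this article, we first introduce the RAG evaluation metrics proposed by RAGAs (Retrieval Augmented Generation Assessment) [1], a framework for evaluating RAG pipelines. We then explain how to implement the entire evaluation process using RAGAs + LlamaIndex.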

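01 RAG Evaluation Metrics

Simply put, a RAG process involves three main parts: the input query, the retrieved context, and the response generated by the LLM. These three elements form the most important triad in the RAG process, and they are interdependent.

Therefore, as shown in Figure 1, the effectiveness of RAG can be evaluated by measuring the relevance between the elements of this triad.

Figure 1: The effectiveness of RAG can be evaluated by measuring the relevance between the elements of this triad. Image by author.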

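The paper [1] mentions three metrics: Faithfulness (translator's note: whether the generated answer is grounded in the retrieved context), Answer Relevance (translator's note: whether the generated answer addresses the user's question), and Context Relevance (translator's note: whether the retrieved context is focused on the user's question). None of these metrics requires access to human-annotated datasets or reference answers.

In addition, the RAGAs website [2] introduces two more metrics: Context Precision (translator's note: whether the relevant chunks in the retrieved context are ranked near the top) and Context Recall (translator's note: how much of the annotated ground-truth answer is covered by the retrieved context).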

1.1 Faithfulness/Groundedness

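Faithfulness measures whether the answer is grounded in the given context. It helps developers detect and avoid hallucinations, and it ensures that the retrieved context actually serves as the justification for the generated answer.

A low score indicates that the LLM's response does not adhere to the retrieved knowledge, which makes hallucinated answers more likely. For example:

Figure 2: A high-faithfulness answer and a low-faithfulness answer.

Source: https://docs.ragas.io/en/latest/concepts/metrics/faithfulness.html

To estimate faithfulness, we first use an LLM to extract a set of statements, S(a(q)), from the answer. The prompt is as follows: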

Given a question and answer, create one or more statements from each sentence in the given answer.
question: [question]
answer: [answer]

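After generating S(a(q)), the LLM determines whether each statement s_i can be inferred from the retrieved context c(q). This verification step uses the following prompt: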

Consider the given context and following statements, then determine whether they are supported by the information present in the context. Provide a brief explanation for each statement before arriving at the verdict (Yes/No). Provide a final verdict for each statement in order at the end in the given format. Do not deviate from the specified format.

statement: [statement 1]
...
statement: [statement n]

The final faithfulness score F is computed as F = |V| / |S|, where |V| is the number of statements that the LLM judged to be supported by the retrieved context c(q), and |S| is the total number of statements.
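As a minimal sketch of the last step only, the score F = |V| / |S| can be computed once the LLM's Yes/No verdicts have been parsed into booleans. The function name and the example verdicts below are hypothetical; this is not the RAGAs implementation.

```python
def faithfulness_score(verdicts):
    """Return F = |V| / |S| given one boolean verdict per extracted statement.

    verdicts[i] is True when the LLM judged statement i to be supported
    by the retrieved context, and False otherwise.
    """
    if not verdicts:
        raise ValueError("at least one statement is required")
    supported = sum(1 for verdict in verdicts if verdict)  # |V|
    return supported / len(verdicts)                       # |V| / |S|


# Hypothetical verdicts for three statements: two supported, one not.
example_verdicts = [True, True, False]
example_f = faithfulness_score(example_verdicts)
print(f"faithfulness = {example_f:.3f}")
```

An answer with no supported statements scores 0, and an answer whose statements are all supported scores 1.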

1.2 Answer Relevance

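This metric measures how relevant the generated answer is to the query. The higher the score, the more relevant the answer. For example:

Figure 3: A high-relevance answer and a low-relevance answer.

Source: https://docs.ragas.io/en/latest/concepts/metrics/answer_relevance.html

To estimate the relevance of an answer, we prompt the LLM to generate n potential questions q_i based on the given answer a(q), as follows: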

Generate a question for the given answer.

answer: [answer]

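Then we use a text embedding model to obtain embeddings for all the questions.

For each q_i, we compute its similarity sim(q, q_i) with the original question q, defined as the cosine similarity between the corresponding embeddings. The answer relevance score AR for question q is the mean of these similarities:

AR = (1/n) × Σ sim(q, q_i), summed over i = 1, ..., n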
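The following sketch illustrates the scoring step, assuming AR is the mean cosine similarity between the embedding of the original question and the embeddings of the n generated questions. The two-dimensional vectors are toy stand-ins for real embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def answer_relevance(query_embedding, generated_question_embeddings):
    """Mean of sim(q, q_i) over the n generated questions."""
    if not generated_question_embeddings:
        raise ValueError("at least one generated question is required")
    sims = [
        cosine_similarity(query_embedding, emb)
        for emb in generated_question_embeddings
    ]
    return sum(sims) / len(sims)


# Toy embeddings: one generated question matches q exactly, one is orthogonal.
q = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
ar = answer_relevance(q, generated)
print(f"answer relevance = {ar:.2f}")
```

In practice the vectors would come from the same text embedding model for both q and every q_i, so that the similarities are comparable.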

1.3 Context Relevance

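This metric measures retrieval quality: it evaluates how well the retrieved context supports the query. A low score indicates that a large amount of irrelevant content was retrieved, which may affect the final answer generated by the LLM. For example:

Figure 4: High context relevance and low context relevance.

Source: https://docs.ragas.io/en/latest/concepts/metrics/context_relevancy.html

To estimate context relevance, an LLM extracts a set of key sentences (S_ext) from the context c(q). These are the sentences that are crucial for answering the question. The prompt is as follows: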

Please extract relevant sentences from the provided context that can potentially help answer the following question. If no relevant sentences are found, or if you believe the question cannot be answered from the given context, return the phrase "Insufficient Information". While extracting candidate sentences you’re not allowed to make any changes to sentences from given context.

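In RAGAs, relevance is computed at the sentence level, as the share of the context's sentences that were extracted as relevant to the query:

CR = (number of extracted sentences) / (total number of sentences in c(q))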
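A small sketch of this ratio, assuming the score is the fraction of context sentences that the LLM extracted verbatim. The sentence splitting and the LLM call are outside the sketch; the function name and the example sentences are hypothetical.

```python
def context_relevance(extracted_sentences, context_sentences):
    """Fraction of context sentences that were extracted as relevant.

    Because the prompt forbids editing sentences, an extracted sentence is
    counted only if it appears verbatim in the context. A reply such as
    "Insufficient Information" therefore contributes nothing.
    """
    if not context_sentences:
        raise ValueError("the context must contain at least one sentence")
    extracted = set(extracted_sentences)
    kept = sum(1 for sentence in context_sentences if sentence in extracted)
    return kept / len(context_sentences)


# Hypothetical context of four sentences, only one of which helps the query.
context = [
    "TinyLlama is a compact 1.1B language model.",
    "The weather was pleasant that day.",
    "The conference took place in spring.",
    "Lunch was served at noon.",
]
cr = context_relevance(["TinyLlama is a compact 1.1B language model."], context)
cr_none = context_relevance(["Insufficient Information"], context)
print(cr, cr_none)
```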

1.4 Context Recall

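This metric measures how consistent the retrieved context is with the annotated answer. It is computed from the ground truth and the retrieved context, and higher values indicate better performance. For example:

Figure 5: High context recall and low context recall.

Source: https://docs.ragas.io/en/latest/concepts/metrics/context_recall.html

Human-annotated ground-truth data must be provided when running this evaluation.

The formula is as follows:

Context Recall = |ground-truth sentences that can be attributed to the retrieved context| / |sentences in the ground truth|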
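As an illustration, assuming context recall is the fraction of ground-truth sentences that can be attributed to the retrieved context, the final ratio can be sketched as below. The attribution flags would normally come from an LLM judgment; here they are hypothetical values, and the function name is an assumption.

```python
def context_recall(attribution):
    """Fraction of ground-truth sentences supported by the retrieved context.

    attribution maps each ground-truth sentence to True when it can be
    attributed to the retrieved context, and to False otherwise.
    """
    if not attribution:
        raise ValueError("the ground truth must contain at least one sentence")
    attributed = sum(1 for supported in attribution.values() if supported)
    return attributed / len(attribution)


# Hypothetical attribution flags for a three-sentence ground-truth answer.
example_attribution = {
    "TinyLlama is a compact 1.1B language model.": True,
    "It builds on the architecture and tokenizer of Llama 2.": True,
    "It was pretrained for approximately 3 epochs.": False,
}
example_recall = context_recall(example_attribution)
print(f"context recall = {example_recall:.3f}")
```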

1.5 Context Precision

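This metric is relatively complex. It measures whether all the relevant chunks in the retrieved context (those that contain ground-truth facts) are ranked at the top. A higher score indicates higher precision.

The formula is as follows:

Context Precision@K = Σ (Precision@k × v_k) / (total number of relevant items in the top K results), summed over k = 1, ..., K

Precision@k = true positives@k / (true positives@k + false positives@k)

Here K is the number of retrieved chunks, and v_k is 1 if the chunk at rank k is relevant and 0 otherwise.

The advantage of Context Precision is its ability to perceive the ranking effect (translator's note: whether relevant content is placed at the top of the retrieved results). Its drawback is that if only a few relevant results are retrieved but they are all ranked highly, the score will still be high. It is therefore necessary to consider the overall effect together with several other metrics.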
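The ranking sensitivity and the drawback described above can both be seen in a small sketch. It assumes the score averages Precision@k over the ranks that hold a relevant chunk, with v_k as a 0/1 relevance flag; the function name and the example rankings are hypothetical.

```python
def context_precision_at_k(relevance):
    """Rank-aware precision for a ranked list of 0/1 relevance flags.

    relevance[0] is the flag v_1 of the top-ranked chunk. The score is the
    sum of Precision@k * v_k over all ranks, divided by the number of
    relevant chunks in the list.
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for k, v_k in enumerate(relevance, start=1):
        hits += v_k                      # relevant chunks seen up to rank k
        precision_at_k = hits / k
        score += precision_at_k * v_k    # only relevant ranks contribute
    return score / total_relevant


ranked_well = context_precision_at_k([1, 1, 0, 0])    # relevant chunks first
ranked_badly = context_precision_at_k([0, 0, 1, 1])   # relevant chunks last
single_hit = context_precision_at_k([1, 0, 0, 0])     # one relevant chunk, at the top
print(ranked_well, ranked_badly, single_hit)
```

The last case is the drawback in action: a single relevant chunk at rank 1 yields the same perfect score as two relevant chunks ranked first, which is why the metric is best read alongside context recall.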

02 Using RAGAs + LlamaIndex to Evaluate a RAG Application

The main process is shown in Figure 6:

Figure 6: Main process. Image by author.

2.1 Environment Configuration

Install ragas with pip:

pip install ragas

Then check the installed version of ragas.

(py) Florian:~ Florian$ pip list | grep ragas
ragas                        0.0.22

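Note that if you install the latest version (v0.1.0rc1) with pip install git+https://github.com/explodinggradients/ragas.git, LlamaIndex is not supported.

Then import the relevant libraries and set the environment variables and global variables.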

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"
dir_path = "YOUR_DIR_PATH"

from llama_index import VectorStoreIndex, SimpleDirectoryReader

from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
    context_precision
)

from ragas.llama_index import evaluate

The directory contains only one PDF file, the paper "TinyLlama: An Open Source Small Language Model" [3].

(py) Florian:~ Florian$ ls /Users/Florian/Downloads/pdf_test/
tinyllama.pdf

2.2 Building a Simple RAG Query Engine with LlamaIndex

documents = SimpleDirectoryReader(dir_path).load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

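By default, LlamaIndex uses an OpenAI model; the LLM and the embedding model can be easily configured through ServiceContext.

2.3 Building the Evaluation Dataset

Since some metrics require a human-annotated dataset, I wrote a few questions myself, together with their corresponding answers.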

eval_questions = [
 "Can you provide a concise description of the TinyLlama model?",
 "I would like to know the speed optimizations that TinyLlama has made.",
 "Why TinyLlama uses Grouped-query Attention?",
 "Is the TinyLlama model open source?",
 "Tell me about starcoderdata dataset",
]
eval_answers = [
 "TinyLlama is a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes.",
 "During training, our codebase has integrated FSDP to leverage multi-GPU and multi-node setups efficiently. Another critical improvement is the integration of Flash Attention, an optimized attention mechanism. We have replaced the fused SwiGLU module from the xFormers (Lefaudeux et al., 2022) repository with the original SwiGLU module, further enhancing the efficiency of our codebase. With these features, we can reduce the memory footprint, enabling the 1.1B model to fit within 40GB of GPU RAM.", 
 "To reduce memory bandwidth overhead and speed up inference, we use grouped-query attention in our model. We have 32 heads for query attention and use 4 groups of key-value heads. With this technique, the model can share key and value representations across multiple heads without sacrificing much performance",
 "Yes, TinyLlama is open-source",
 "This dataset was collected to train StarCoder (Li et al., 2023), a powerful opensource large code language model. It comprises approximately 250 billion tokens across 86 programming languages. In addition to code, it also includes GitHub issues and text-code pairs that involve natural languages.",
]
eval_answers = [[a] for a in eval_answers]

2.4 Selecting Metrics and Running the RAGAs Evaluation

metrics = [
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_precision,
    context_recall,
]

result = evaluate(query_engine, metrics, eval_questions, eval_answers)
result.to_pandas().to_csv('YOUR_CSV_PATH', sep=',')
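Once the results have been written to a CSV file, per-metric averages give a quick overall picture. The sketch below uses only the standard library and averages every column whose cells all parse as numbers. The demo file and its column names are hypothetical; the columns actually produced by RAGAs may differ.

```python
import csv
import os
import tempfile


def mean_numeric_columns(csv_path):
    """Average every column whose non-empty cells all parse as floats."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    means = {}
    if not rows:
        return means
    for column in rows[0].keys():
        if column is None:  # surplus cells without a header; skip them
            continue
        values = []
        numeric = True
        for row in rows:
            cell = (row[column] or "").strip()
            if cell == "":
                continue  # ignore missing scores
            try:
                values.append(float(cell))
            except ValueError:
                numeric = False  # a text column such as the question
                break
        if numeric and values:
            means[column] = sum(values) / len(values)
    return means


# Demo on a tiny hypothetical file; replace demo_path with your own CSV path.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["question", "faithfulness", "context_recall"])
        writer.writerow(["q0", "1.0", "0.5"])
        writer.writerow(["q1", "0.5", "0.0"])
    demo_means = mean_numeric_columns(demo_path)

print(demo_means)
```

Note that pandas writes the DataFrame index as an unnamed first column by default, so that column would be averaged as well unless the file is exported with index=False.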

Note that RAGAs uses OpenAI models by default.

If you want to use another LLM (such as Gemini) with RAGAs to evaluate a LlamaIndex-based RAG system, I did not find any way to do so in RAGAs 0.0.22, even after debugging the RAGAs source code.

2.5 Final code

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"
dir_path = "YOUR_DIR_PATH"

from llama_index import VectorStoreIndex, SimpleDirectoryReader

from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
    context_precision
)

from ragas.llama_index import evaluate

documents = SimpleDirectoryReader(dir_path).load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

eval_questions = [
 "Can you provide a concise description of the TinyLlama model?",
 "I would like to know the speed optimizations that TinyLlama has made.",
 "Why TinyLlama uses Grouped-query Attention?",
 "Is the TinyLlama model open source?",
 "Tell me about starcoderdata dataset",
]
eval_answers = [
 "TinyLlama is a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes.",
 "During training, our codebase has integrated FSDP to leverage multi-GPU and multi-node setups efficiently. Another critical improvement is the integration of Flash Attention, an optimized attention mechanism. We have replaced the fused SwiGLU module from the xFormers (Lefaudeux et al., 2022) repository with the original SwiGLU module, further enhancing the efficiency of our codebase. With these features, we can reduce the memory footprint, enabling the 1.1B model to fit within 40GB of GPU RAM.", 
 "To reduce memory bandwidth overhead and speed up inference, we use grouped-query attention in our model. We have 32 heads for query attention and use 4 groups of key-value heads. With this technique, the model can share key and value representations across multiple heads without sacrificing much performance",
 "Yes, TinyLlama is open-source",
 "This dataset was collected to train StarCoder (Li et al., 2023), a powerful opensource large code language model. It comprises approximately 250 billion tokens across 86 programming languages. In addition to code, it also includes GitHub issues and text-code pairs that involve natural languages.",
]
eval_answers = [[a] for a in eval_answers]

metrics = [
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_precision,
    context_recall,
]

result = evaluate(query_engine, metrics, eval_questions, eval_answers)
result.to_pandas().to_csv('YOUR_CSV_PATH', sep=',')

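Note that when the program is run in a terminal, the pandas DataFrame may not be displayed in full. To view it, export it to a CSV file, as shown in Figure 7.

Figure 7: Final result. Image by author.

Figure 7 shows that the question at index 4, "Tell me about starcoderdata dataset", scores 0 on every metric. This is because the LLM could not provide an answer to it. The second and third questions have a context precision of 0, which indicates that the relevant chunks in the retrieved context were not ranked at the top. The context recall of the second question is 0, which indicates that the retrieved context does not match the annotated answer.

Now let's look at questions 0 to 3. Their answer relevance scores are high, indicating a strong correlation between the answers and the questions. In addition, the faithfulness scores are not low, which indicates that the answers are mainly derived or summarized from the context; we can conclude that they were not hallucinated by the LLM.

Furthermore, although the Context Relevance scores are low, gpt-3.5-turbo-16k (the default model used by RAGAs) was still able to deduce the answers from the context.

Judging from these results, this basic RAG system clearly still has significant room for improvement.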