Implementing Contextual Compression and Filtering in a RAG Pipeline

wxchong 2024-08-16 06:09:48

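Each day we recommend one foreign-language article focused on solving a practical problem, translate it accurately, and walk through its key points, to help readers build practical problem-solving and hands-on coding skills.

Original title: Implement Contextual Compression And Filtering In RAG Pipeline

Original article: https://medium.com/dphi-tech/implement-contextual-compression-and-filtering-in-rag-pipeline-4e9d4a92aa8f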



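Contextual compressors and filters

One of the biggest problems in RAG is controlling what the retriever actually returns.

Not all of the retrieved context is useful. Within a large retrieved chunk, often only a small portion matters for the answer.

Sometimes a question needs answers or facts from several chunks, and these have to be combined.

In other cases the question has to be disambiguated, and we do not want unrelated facts pulled into the in-context learning window along with the actual query.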

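What is contextual compression?

One challenge with retrieval is that we usually do not know, at ingestion time, which specific queries the document store will face.

This means the information most relevant to a query may be buried in a document full of irrelevant text. Passing the full document through the application can lead to more expensive LLM calls and poorer responses.

This is where "contextual compression" comes in. The idea is:

  • We have some base retriever that retrieves a lot of varied information.
  • We then pass that information to a document compressor.
  • The compressor filters and processes it, extracting only what is useful for answering the question.

To use a contextual compression retriever, you need:

  • A base retriever
  • A document compressor

How contextual compression works:

  • The contextual compression retriever passes the query to the base retriever.
  • It then passes the initial documents through the document compressor.
  • The document compressor takes the list of documents and shortens it by reducing the contents of documents or dropping documents altogether.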
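The steps above can be sketched without any library. This is a minimal illustration, not LangChain's implementation: the word-overlap retriever, the sentence-level compressor, and the three-sentence corpus are all assumptions made up for the example.

```python
from typing import List


def base_retriever(query: str, corpus: List[str], k: int = 3) -> List[str]:
    """Toy base retriever: rank documents by how many query words they contain."""
    words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(words & set(d.lower().split())))
    return ranked[:k]


def sentence_compressor(query: str, docs: List[str]) -> List[str]:
    """Toy compressor: keep only sentences that share a word with the query,
    and drop a document entirely when none of its sentences match."""
    words = set(query.lower().split())
    compressed = []
    for doc in docs:
        kept = [s.strip() for s in doc.split(".") if words & set(s.lower().split())]
        if kept:
            compressed.append(". ".join(kept) + ".")
    return compressed


def compression_retrieve(query: str, corpus: List[str]) -> List[str]:
    """Query -> base retriever -> compressor, mirroring the three steps above."""
    return sentence_compressor(query, base_retriever(query, corpus))


corpus = [
    "Coinsurance is the share of a bill you pay. Agents sell policies.",
    "An adjuster settles claims. Premiums are paid monthly.",
    "Coinsurance maximum caps what you pay. Weather was nice.",
]
result = compression_retrieve("coinsurance", corpus)
print(result)
```

Two documents come back shortened to their one relevant sentence, and the document about adjusters is dropped altogether, which covers both ways a compressor can shrink the list.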

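Technology stack

  • LangChain: a framework for building applications with LLMs
  • Llmware BLING models: large language models (experimental)
  • ChromaDB: vector store

Code implementation

Install the required dependencies

pip install -qU langchain huggingface_hub chromadb pypdf python-dotenv transformers sentence-transformers

Import the required libraries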

from langchain.llms import HuggingFaceHub
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.prompts import PromptTemplate
from dotenv import load_dotenv

Set the HuggingFace Hub token

import os
from getpass import getpass
os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass("Enter HuggingFace Hub Token:")

Load the data

loader = PyPDFLoader("/content/CommonInsuranceTerms (1).pdf")
documents = loader.load()
print(len(documents))
print(documents[0].page_content)

Output:

16
Glossary of Common Insurance Terms 
NOTICE:  This document is for informational purposes only and is not in tended to alter or replace the 
insurance policy. Additionally, this informational sheet is not  intended to fully set out your rights and 
obligations or the rights and obligations of the insurance comp any. If you have questions about your insurance, 
you should consult your insurance agent, the insurance company,  or the language of the insurance policy. 
A 
Accelerated death benefits  - An insurance policy with an accelerated death benefits provi sion will pay - 
under certain conditions - all or part of the policy death bene fits while the policyholder is still alive. These 
conditions include proof that the policyholder is terminally il l, has a specified life-thr eatening disease or is in a 
long-term care facility such as a nursing home. By accepting an  accelerated benefit payment, a person could be 
ruled ineligible for Medicaid or  other government benefits. The  proceeds may also be taxable. 
Accident  - An unforeseen, unintended event. 
Accident-only policies  - Policies that pay only in cas es arising from an accident or injury. 
Accidental death benefits  - If a life insurance policy includes an accidental death bene fit, the cause of death 
will be examined to determine whether the insured′s death meets  the policy′s definition of accidental. 
Actual cash value (ACV)  - The value of your property, based on the current cost to rep lace it minus 
depreciation. Also see "replacement cost." 
Additional livin g expenses (ALE)  - Reimburses the policyholder for the cost of temporary housin g, food, and 
other essential living expenses, if the home is damaged by a co vered peril that makes the home temporarily 
uninhabitable.  
Adjuster  - An individual employed by an insurer to evaluate losses and settle policyholder claims.  
Administrative expense charge  - An amount deducted, usually monthly, from the policy. 
Agent  - A person who sells insurance policies. Must be licensed by t he Alabama Department of Insurance to 
legally sell and transact insurance business.  
Annuitant  - A person who receives the payments from an annuity during hi s or her lifetime. 

Set up the text splitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=700,chunk_overlap=70)
split_documents = text_splitter.split_documents(documents)
print(len(split_documents))
print(split_documents[0])

Output:

65
Document(page_content='Glossary of Common Insurance Terms \nNOTICE:  This document is for informational purposes only and is not in tended to alter or replace the \ninsurance policy. Additionally, this informational sheet is not  intended to fully set out your rights and \nobligations or the rights and obligations of the insurance comp any. If you have questions about your insurance, \nyou should consult your insurance agent, the insurance company,  or the language of the insurance policy. \nA \nAccelerated death benefits  - An insurance policy with an accelerated death benefits provi sion will pay - \nunder certain conditions - all or part of the policy death bene fits while the policyholder is still alive. These', metadata={'source': '/content/CommonInsuranceTerms (1).pdf', 'page': 0})
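To make the chunk_size=700 and chunk_overlap=70 settings concrete, here is a simplified sketch of overlapping chunking. It uses fixed-size character windows only; RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraphs and newlines, so its chunk boundaries will differ. The function name split_with_overlap is made up for this illustration.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 700, chunk_overlap: int = 70) -> List[str]:
    """Cut text into character windows; each window starts
    chunk_size - chunk_overlap characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(2000))  # 2000 characters of dummy text
chunks = split_with_overlap(text)
print(len(chunks), [len(c) for c in chunks])
print(chunks[0][-70:] == chunks[1][:70])  # neighbouring chunks share 70 characters
```

The overlap means a definition that straddles a chunk boundary still appears intact in at least one chunk, at the cost of some duplicated text across neighbours.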

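Using the BLING (Best Little Instruct-following No-GPU) model series on HuggingFace

Link: https://huggingface.co/llmware

Providing enterprise-grade LLM-based development framework, tools, and fine-tuned models

https://github.com/llmware-ai/llmware/tree/main/examples

Set up the embeddings

industry-bert-insurance-v0.1 is part of a series of domain fine-tuned sentence_transformer embedding models.

Model description

industry-bert-insurance-v0.1 is a domain fine-tuned, BERT-based sentence-transformer model that produces 768-dimensional embeddings, intended as a "drop-in" replacement for embeddings in the insurance industry domain. It was trained on a large corpus of public insurance-industry documents.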

embeddings = SentenceTransformerEmbeddings(model_name="llmware/industry-bert-insurance-v0.1")

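Set up the LLM

Model description

https://medium.com/@darrenoberst/small-instruct-following-llms-for-rag-use-case-54c55e4b41a8

bling-sheared-llama-1.3b-0.1 is part of the BLING ("Best Little Instruction-following No-GPU-required") model series, instruct-trained on top of the Sheared-LLaMA-1.3B base model.

BLING models are fine-tuned on distilled, high-quality custom instruct datasets that target a specific subset of instruction tasks. The goal is a high-quality instruct model that is "inference-ready" on a CPU laptop, even without any advanced quantization optimizations.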

repo_id ="llmware/bling-sheared-llama-1.3b-0.1"
llm = HuggingFaceHub(repo_id=repo_id,
                     model_kwargs={"temperature":0.3,"max_length":500})

Helper function for printing documents

def pretty_print_docs(docs):
  print(f"\n{'-'* 100}\n".join([F"Document{i+1}:\n\n" + d.page_content for i,d in enumerate(docs)]))

Set up the vector store

vectorstore = Chroma.from_documents(split_documents,
                                    embeddings,
                                    collection_metadata={"hnsw:space":"cosine"},
                                    persist_directory="/content/stores/insurance")
vectorstore.persist()

Set up the retriever

retriever = vectorstore.as_retriever(search_kwargs={"k":2})
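Conceptually, a retriever with k=2 over a cosine-distance collection returns the two chunks whose embeddings are closest in angle to the query embedding. The sketch below shows that idea with an exact brute-force search over made-up 2-dimensional vectors; Chroma itself uses an approximate HNSW index, so this illustrates the ranking rule rather than how the store works internally.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int = 2) -> List[int]:
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Hypothetical embeddings: documents 0 and 1 point almost the same way as the query.
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
nearest = top_k([1.0, 0.0], doc_vecs, k=2)
print(nearest)
```

Note that nothing in this ranking removes near-duplicate chunks: two overlapping chunks can both land in the top k, which is exactly what the next output shows.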

Retrieve the relevant context for a query

docs = retriever.get_relevant_documents(query="What is Group life insurance?")
pretty_print_docs(docs)

Output:

Document1:

or claim payment. Insurance companies also may have grievance p rocedures. 
Group life insurance  - This type of life insurance provides coverage to a group of people under one contract. 
Most group contracts are sold to businesses that w ant to provid e life insurance for their employees. Group life 
insurance can also be sold to associations to cover their membe rs and to lending institutions to cover the 
amounts of their debtor loans. Most group policies are for term  insurance. Generally, the business will be 
issued a master policy and each person in the group will receiv e a certificate of insurance. 
Group of companies  - Several insurance companies u nder common ownership and often  common 
management.
----------------------------------------------------------------------------------------------------
Document2:

Most group contracts are sold to businesses that w ant to provid e life insurance for their employees. Group life 
insurance can also be sold to associations to cover their membe rs and to lending institutions to cover the 
amounts of their debtor loans. Most group policies are for term  insurance. Generally, the business will be 
issued a master policy and each person in the group will receiv e a certificate of insurance. 
Group of companies  - Several insurance companies u nder common ownership and often  common 
management.

Adding contextual compression with LLMChainExtractor

  • Add an LLMChainExtractor that iterates over the initially returned documents.
  • Extract from each document only the context relevant to the query.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
# build the compressor
compressor = LLMChainExtractor.from_llm(llm=llm)
# compression retriever = base retriever + compressor
compression_retriever = ContextualCompressionRetriever(base_retriever=retriever,
                                                       base_compressor=compressor)

Default compressor prompt

print(compressor.llm_chain.prompt.template)

The printed prompt:

Given the following question and context, extract any part of the context *AS IS* that is relevant to answer the question. If none of the context is relevant return NO_OUTPUT. 

Remember, *DO NOT* edit the extracted parts of the context.

> Question: {question}
> Context:
>>>
{context}
>>>
Extracted relevant parts:
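The prompt above implies a simple loop: fill the template once per retrieved document, send it to the LLM, and discard any document for which the model answers NO_OUTPUT. The following is a rough sketch of that loop under stated assumptions: fake_llm is a hard-coded stand-in for a real model, the sample documents are invented, and the actual LLMChainExtractor handles parsing in its own way.

```python
from typing import Callable, List

PROMPT = (
    "Given the following question and context, extract any part of the context "
    "*AS IS* that is relevant to answer the question. If none of the context is "
    "relevant return NO_OUTPUT.\n\n"
    "Remember, *DO NOT* edit the extracted parts of the context.\n\n"
    "> Question: {question}\n"
    "> Context:\n"
    ">>>\n"
    "{context}\n"
    ">>>\n"
    "Extracted relevant parts:"
)


def fake_llm(prompt: str) -> str:
    """Stand-in for an LLM: returns the context lines that mention 'coinsurance'."""
    context = prompt.split(">>>\n")[1]
    hits = [line for line in context.splitlines() if "coinsurance" in line.lower()]
    return "\n".join(hits) if hits else "NO_OUTPUT"


def extract(question: str, docs: List[str], llm: Callable[[str], str]) -> List[str]:
    """Run the extraction prompt on each document and drop NO_OUTPUT results."""
    extracted = []
    for doc in docs:
        answer = llm(PROMPT.format(question=question, context=doc)).strip()
        if answer and answer != "NO_OUTPUT":
            extracted.append(answer)
    return extracted


docs = [
    "Claimant - A person who makes a claim.\nCoinsurance - The share of each bill you pay.",
    "Adjuster - Evaluates losses.",
]
extracted = extract("What is Coinsurance?", docs, fake_llm)
print(extracted)
```

Because this costs one LLM call per retrieved document, extraction quality and latency both depend heavily on the model used for the compressor.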

Adding a filter to contextual compression

  • Use an EmbeddingsFilter to select which of the retrieved documents are passed on to the LLM
#
from getpass import getpass
import os
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers.document_compressors import EmbeddingsFilter
# store the OpenAI key; the filter below reuses the insurance embeddings, not OpenAI
os.environ["OPENAI_API_KEY"] = getpass()
# EmbeddingsFilter ranks the retrieved documents by embedding similarity to the query
embeddings_filter = EmbeddingsFilter(embeddings=embeddings)
compression_retriever_filter = ContextualCompressionRetriever(base_retriever=retriever,
                                                       base_compressor=embeddings_filter)
#
compressed_docs = compression_retriever_filter.get_relevant_documents(query="What is Group Life Insurance?")
pretty_print_docs(compressed_docs)

Output:

Document1:

or claim payment. Insurance companies also may have grievance p rocedures. 
Group life insurance  - This type of life insurance provides coverage to a group of people under one contract. 
Most group contracts are sold to businesses that w ant to provid e life insurance for their employees. Group life 
insurance can also be sold to associations to cover their membe rs and to lending institutions to cover the 
amounts of their debtor loans. Most group policies are for term  insurance. Generally, the business will be 
issued a master policy and each person in the group will receiv e a certificate of insurance. 
Group of companies  - Several insurance companies u nder common ownership and often  common 
management.
----------------------------------------------------------------------------------------------------
Document2:

Most group contracts are sold to businesses that w ant to provid e life insurance for their employees. Group life 
insurance can also be sold to associations to cover their membe rs and to lending institutions to cover the 
amounts of their debtor loans. Most group policies are for term  insurance. Generally, the business will be 
issued a master policy and each person in the group will receiv e a certificate of insurance. 
Group of companies  - Several insurance companies u nder common ownership and often  common 
management.

Implementing a RetrievalQA chain for question answering

from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=compression_retriever_filter,
                                 verbose=True)
# Ask a question
qa("What is Coinsurance?")

Output:

> Entering new RetrievalQA chain...

> Finished chain.
{'query': 'What is Coinsurance?',
 'result': ' Coinsurance is the percentage of each health care bill a person must pay out of their own pocket. Non-covered charges and deductibles are in addition to this amount. Coinsurance maximum is the most you will have to pay in coinsurance during a policy period (usually a year) before your health plan begins paying 100 percent of the cost of your covered health services. The coinsurance maximum generally does not apply to copayments or other expenses you might be required to pay.\n\nC'}

qa("What is Group Life Insurance?")

Output:

> Entering new RetrievalQA chain...

> Finished chain.
{'query': 'What is Group Life Insurance?',
 'result': ' Group life insurance provides coverage to a group of people under one contract. \nMost group contracts are sold to businesses that w ant to provid e life insurance for their employees. Group life \ninsurance can also be sold to associations to cover their membe rs and to lending institutions to cover the \namounts of their debtor loans. Most group policies are for term  insurance. Generally, the business will be \nissued a master policy and each person in the group will receiv e a certificate of insurance. \nGroup of companies  - Several insurance companies u nder common ownership and often  common \nmanagement'}

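Pipelines

Chaining compressors and document transformers

EmbeddingsFilter takes the following parameters:

  • embeddings: langchain_core.embeddings.Embeddings [Required] Embeddings used to embed the document contents and the query.
  • k: Optional[int] = 20 The number of relevant documents to return. Can be set to None, in which case similarity_threshold must be specified. Defaults to 20.
  • similarity_fn: Callable The similarity function used to compare documents. It takes two matrices (List[List[float]]) as input and returns a matrix of scores, where higher values indicate greater similarity.
  • similarity_threshold: Optional[float] = None Threshold for deciding whether two documents are similar enough to be considered redundant. Defaults to None; must be specified if k is set to None.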
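How k and similarity_threshold interact is easier to see in code. The sketch below is an illustrative re-implementation over made-up 2-dimensional vectors, not the library's source: it ranks documents by cosine similarity to the query, truncates to k when k is set, and then applies the threshold when one is set.

```python
import math
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def embeddings_filter(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    k: Optional[int] = 20,
    similarity_threshold: Optional[float] = None,
) -> List[int]:
    """Return indices of the documents to keep, most similar first."""
    if k is None and similarity_threshold is None:
        raise ValueError("set k, similarity_threshold, or both")
    scores = [cosine(query_vec, v) for v in doc_vecs]
    order = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    if k is not None:
        order = order[:k]  # keep at most k documents
    if similarity_threshold is not None:
        order = [i for i in order if scores[i] > similarity_threshold]
    return order


# Hypothetical embeddings with similarities of about 1.0, 0.6 and 0.0 to the query.
doc_vecs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query_vec = [1.0, 0.0]
print(embeddings_filter(query_vec, doc_vecs, k=2))
print(embeddings_filter(query_vec, doc_vecs, k=None, similarity_threshold=0.5))
print(embeddings_filter(query_vec, doc_vecs, k=2, similarity_threshold=0.8))
```

With k alone the filter always returns something, even for an off-topic query; adding a threshold lets it return fewer documents, or none, when nothing is close enough.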

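Here we build a pipeline of a redundancy filter plus a relevance filter: the redundancy filter removes duplicate context, and the relevance filter keeps only the context relevant to the query.

  • EmbeddingsRedundantFilter identifies similar documents and filters out the redundant ones.
  • EmbeddingsFilter embeds the documents and the query and returns only those documents whose embeddings are sufficiently similar to the query, which makes it a cheaper and faster option.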
from langchain.document_transformers import EmbeddingsRedundantFilter
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
#
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
relevant_filter = EmbeddingsFilter(embeddings=embeddings,k=5)
#making the pipeline
pipeline_compressor = DocumentCompressorPipeline(transformers=[redundant_filter,relevant_filter])
# compressor retriever
compression_retriever_pipeline = ContextualCompressionRetriever(base_retriever=retriever,
                                                       base_compressor=pipeline_compressor)
## print the compression retriever's configuration
print(compression_retriever_pipeline)
## Get relevant documents
compressed_docs = compression_retriever_pipeline.get_relevant_documents(query="What is Coinsurance?")
pretty_print_docs(compressed_docs)

Output:

Document1:

Claimant  - A person who makes an insurance claim. 
Coinsurance  - The percentage of each health care bill a person must pay ou t of their own pocket. Non-covered 
charges and deductibles are in addition to this amount. 
Coinsurance maximum  - The most you will have to pay in coinsurance during a policy  period (usually a 
year) before your health plan begins paying 100 percent of the cost of your covered health services. The 
coinsurance maximum generally does not apply to copayments or o ther expenses you might be required to pay. 
Collision coverage  - Pays for damage to a car with out regard to who caused an acc ident. The company must
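Only one document comes back here, whereas the plain retriever earlier returned two heavily overlapping chunks for its query. A redundancy filter produces this kind of result by comparing documents with each other rather than with the query. The following is a simplified greedy sketch over invented vectors; the 0.95 threshold is an illustrative value, and the library's own pairwise procedure may differ in detail.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drop_redundant(doc_vecs: List[Sequence[float]], threshold: float = 0.95) -> List[int]:
    """Greedy de-duplication: keep a document only if it is not too similar
    to any document that has already been kept."""
    kept: List[int] = []
    for i, vec in enumerate(doc_vecs):
        if all(cosine(vec, doc_vecs[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical embeddings: document 1 is a near-copy of document 0.
doc_vecs = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
kept = drop_redundant(doc_vecs)
print(kept)
```

Running the redundancy filter before the relevance filter means the k slots are not wasted on several copies of the same passage.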

compressed_docs = compression_retriever_pipeline.get_relevant_documents(query="What is Earned premium?")
pretty_print_docs(compressed_docs)

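Implementing a question-answering RAG pipeline with an LLM (the llmware/bling-sheared-llama-1.3b-0.1 model)

This model is meant for generating short text replies, mainly for chatbot applications that do not need long responses. In my experiments, it did not produce responses as effective as those from Zephyr-beta-7b or OpenAI models. I used it here purely for experimentation. Choosing the right LLM improves the correctness of the generated responses.

Prompt format for llmware models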

from langchain.prompts import PromptTemplate
template ="""
<human>:
Context:{context}

Question:{question}

Use the above Context to answer the user's question. Consider only the Context provided above to formulate the response. If the Question asked does not match the Context provided, just say 'I do not know the answer'.
<bot>:

"""
prompt = PromptTemplate(input_variables=["context","question"],template=template)
chain_type_kwargs = {"prompt":prompt}
print(prompt)
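With chain_type="stuff", the retrieved documents are concatenated into a single context string and substituted into the template together with the question. The sketch below imitates that step with plain string formatting; the blank-line separator and the helper name build_stuff_prompt are assumptions made for illustration, and the sample documents are invented.

```python
from typing import List

TEMPLATE = """
<human>:
Context:{context}

Question:{question}

Use the above Context to answer the user's question. Consider only the Context provided above to formulate the response. If the Question asked does not match the Context provided, just say 'I do not know the answer'.
<bot>:
"""


def build_stuff_prompt(docs: List[str], question: str, separator: str = "\n\n") -> str:
    """Join all documents into one context block and fill the template."""
    context = separator.join(docs)
    return TEMPLATE.format(context=context, question=question)


docs = [
    "Coinsurance - The share of each bill you pay.",
    "Coinsurance maximum - The most you pay in a policy period.",
]
prompt_text = build_stuff_prompt(docs, "What is Coinsurance?")
print(prompt_text)
```

Since every retrieved document ends up in one prompt, compressing and filtering before this step directly reduces the prompt length a small model has to handle.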

####
from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=compression_retriever_pipeline,
                                 chain_type_kwargs=chain_type_kwargs,
                                 return_source_documents=True,
                                 verbose=True)
#
qa("What is Group Insurance Policy?")


response = qa("What is Long-term care benefits?")
print(response['result'].split("<|endoftext|>")[0])

Output:

Entering new RetrievalQA chain...

> Finished chain.
Long-term care benefits - Coverage that provides help for people when they are unable to care for themselves because of prolonged illness or disability. Benefits are triggered by specific findings of "cognitive impairment" or inability to perform certain actions known as "Activities of Daily Living." Benefits can range from help with daily activities while recuperating at home to skilled nursing care provided in a nursing home.
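The split("<|endoftext|>")[0] call used above keeps only the text before the first end-of-text marker, since the raw completion may continue past it. A small helper makes that behavior explicit; the name trim_at_eos and the extra whitespace stripping are additions for this sketch, and the raw string is a made-up example.

```python
def trim_at_eos(text: str, eos: str = "<|endoftext|>") -> str:
    """Return the text before the first end-of-text marker, without
    surrounding whitespace. Text with no marker is returned unchanged
    apart from the stripping."""
    return text.split(eos)[0].strip()


raw = " Long-term care benefits - Coverage for prolonged illness.<|endoftext|><human>: unrelated continuation"
answer = trim_at_eos(raw)
print(answer)
```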

print(response)

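Building a new pipeline

Compressor + redundancy filter + relevance filter

Compressor: LLMChainExtractor iterates over the initially returned documents and extracts from each only the content relevant to the query.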

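from langchain.llms import OpenAI
# reuse the OpenAI key from the environment, or prompt for it if it is not set
api_key = os.environ.get("OPENAI_API_KEY") or getpass("Enter OpenAI API key:")
# LLM-based extractor that trims each document down to the query-relevant text
compressor = LLMChainExtractor.from_llm(llm=OpenAI(temperature=0.3,openai_api_key=api_key))
# run the extractor first, then drop redundant documents, then keep the most relevant ones
new_pipeline = DocumentCompressorPipeline(transformers=[compressor,redundant_filter,relevant_filter])
new_compression_retriever = ContextualCompressionRetriever(base_retriever=retriever,
                                                       base_compressor=new_pipeline)
compressed_docs = new_compression_retriever.get_relevant_documents(query="What is Coinsurance?")
pretty_print_docs(compressed_docs)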

Output:

Document1:

Coinsurance - The percentage of each health care bill a person must pay out of their own pocket. Coinsurance maximum - The most you will have to pay in coinsurance during a policy period (usually a year) before your health plan begins paying 100 percent of the cost of your covered health services.
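A pipeline like this simply applies its steps in order, each one receiving the output of the previous one. The toy version below shows why the order matters: once extraction has reduced two overlapping chunks to the same sentence, the de-duplication step can collapse them into one. All function names, the keyword heuristic, and the sample chunks are assumptions made for the sketch, and exact-match de-duplication stands in for the embedding-based redundancy filter.

```python
from typing import Callable, List

Step = Callable[[str, List[str]], List[str]]


def keyword(query: str) -> str:
    """Hypothetical helper: treat the last word of the query as the keyword."""
    return query.rstrip("?").split()[-1].lower()


def extract_step(query: str, docs: List[str]) -> List[str]:
    """Keep only the lines of each document that mention the keyword."""
    key = keyword(query)
    out = []
    for doc in docs:
        kept = [line for line in doc.splitlines() if key in line.lower()]
        if kept:
            out.append("\n".join(kept))
    return out


def dedupe_step(query: str, docs: List[str]) -> List[str]:
    """Drop exact duplicates while preserving order."""
    seen, out = set(), []
    for doc in docs:
        if doc not in seen:
            seen.add(doc)
            out.append(doc)
    return out


def run_pipeline(steps: List[Step], query: str, docs: List[str]) -> List[str]:
    """Apply each step to the output of the previous one."""
    for step in steps:
        docs = step(query, docs)
    return docs


chunks = [
    "Claimant - A person who makes a claim.\nCoinsurance - The share of each bill you pay.",
    "Coinsurance - The share of each bill you pay.\nCollision coverage - Pays for car damage.",
]
result = run_pipeline([extract_step, dedupe_step], "What is Coinsurance?", chunks)
print(result)
```

Reversing the two steps would leave both chunks in place, because before extraction they are not duplicates of each other.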

Implementing the QA chain

from langchain.chains import RetrievalQA
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=new_compression_retriever,
                                 chain_type_kwargs=chain_type_kwargs,
                                 return_source_documents=True,
                                 verbose=True)
#
response = qa("What is Coinsurance?")
print(response['result'].split("<|endoftext|>")[0])

Output:

Finished chain.
 No, Coinsurance is the percentage of each health care bill a person must pay out of their own pocket.

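Conclusion

In short, addressing retrieval challenges in a document store takes a deliberate approach to efficiency and response quality. Because the queries are not known at ingestion time, retrieved documents often contain irrelevant information, which in turn raises cost and degrades responses when a large language model is used. Contextual compression is an important way to address this. By gathering a broad set of information with a base retriever and then applying a document compressor, the system can filter and process the data, keeping only the details needed to answer the user's query. This optimizes resource use and helps improve overall system performance and user experience.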
