4️⃣Evaluation-Driven Development

EDD (Evaluation-Driven Development) is a methodology for developing LLM applications on the basis of evaluation. With LlamaIndex, EDD can be set up in two forms.

Why EDD?

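Improved accuracy and relevance: EDD helps identify and resolve potential problems in the pipeline's output, so that generated responses stay accurate, relevant, and consistent with the provided context.
Identifying weaknesses and opportunities: EDD makes it easier to spot areas where the pipeline can be improved, letting developers focus on the specific aspects that need work. This continuous evaluation process leads to better overall performance.
Guidance for model selection and parameter tuning: EDD evaluates different models and parameter configurations, guiding the choice of the model architecture and hyperparameters best suited to the task at hand.
Robustness and generalization: EDD helps ensure that the pipeline behaves consistently across varied input scenarios and data distributions, which improves robustness and generalization.
Alignment with user expectations: EDD aligns the pipeline's output with user expectations and requirements, so that generated responses fit the specific needs of the target audience.
Continuous improvement and iteration: EDD encourages a culture of continuous improvement and iteration, helping developers make informed decisions based on objective evaluation metrics.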

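In summary, EDD promotes continuous improvement and optimization, and it plays an important role in building high-quality RAG pipelines by helping ensure that generated responses are accurate, relevant, and aligned with user expectations.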

How to Implement EDD?

Use DatasetGenerator to automatically generate evaluation questions.
Define evaluators for faithfulness and relevancy.
Use BatchEvalRunner to run the evaluations on the responses asynchronously.
Compare the evaluation results.
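The four steps above can be sketched without any library as a small loop. This is only an illustration under stated assumptions: `EvalResult`, `run_edd`, the toy pipelines, and the toy "relevancy" check are hypothetical stand-ins, not LlamaIndex APIs. The real evaluators in the cases that follow use an LLM as the judge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalResult:
    """Hypothetical stand-in for an evaluator verdict on one response."""
    passing: bool


# An evaluator takes (question, answer) and returns a verdict.
Evaluator = Callable[[str, str], EvalResult]
# A pipeline takes a question and returns an answer.
Pipeline = Callable[[str], str]


def pass_rate(results: List[EvalResult]) -> float:
    """Fraction of results whose verdict is passing."""
    return sum(r.passing for r in results) / len(results)


def run_edd(
    pipelines: Dict[str, Pipeline],
    evaluators: Dict[str, Evaluator],
    questions: List[str],
) -> Dict[str, Dict[str, float]]:
    """Score every candidate pipeline on every metric over the same questions."""
    scores: Dict[str, Dict[str, float]] = {}
    for name, pipeline in pipelines.items():
        answers = [pipeline(q) for q in questions]
        scores[name] = {
            metric: pass_rate([ev(q, a) for q, a in zip(questions, answers)])
            for metric, ev in evaluators.items()
        }
    return scores


# Toy candidates and a toy metric, purely for illustration.
questions = ["What is Harden-Runner?", "What does Trivy scan?"]
pipelines = {
    "echo": lambda q: q,                   # repeats the question back
    "canned": lambda q: "I do not know.",  # ignores the question
}
evaluators = {
    # Toy "relevancy": the answer mentions the last word of the question.
    "relevancy": lambda q, a: EvalResult(q.rstrip("?").split()[-1] in a),
}

scores = run_edd(pipelines, evaluators, questions)
print(scores)
```

The key design point is that every candidate is scored on the same question set with the same evaluators, so the resulting numbers are directly comparable.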

EDD use cases

Deciding which LLM to use in a RAG pipeline
Deciding which RAG strategy suits a RAG pipeline

Let's now implement the two examples above with LlamaIndex code.

Case 1. Evaluation-Driven Development (EDD) for Multi Document RAG Pipeline with GPT-3.5 and Zephyr-7b-beta

This case shows how to use EDD to decide which of the following two LLMs is the better fit for a multi-document RAG pipeline built on metadata replacement + node sentence window:

gpt-3.5-turbo
zephyr-7b-beta

Install libraries

%pip install -q llama-index-embeddings-openai
%pip install -q llama-index-embeddings-huggingface
%pip install -q llama-index-llms-openai
%pip install -q llama-index-llms-huggingface

%pip install 'pymemgpt[local]'==0.3.6
%pip install --upgrade llama-index-embeddings-huggingface

%pip install -q llama_index pypdf sentence-transformers transformers accelerate bitsandbytes

POC with Metadata Replacement + Node Sentence Window

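SentenceWindowNodeParser is a tool for building sentence-level representations that take the surrounding words and sentences into account. During retrieval, before a retrieved sentence is passed to the LLM, MetadataReplacementPostProcessor replaces the single sentence with a window that includes its surrounding sentences. This can help with tasks such as machine translation or summarization, where understanding the full meaning of a sentence is essential. Because it helps retrieve finer-grained details, it is most useful for large documents.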
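To make the idea concrete, here is a minimal pure-Python sketch of the sentence-window technique. It is not the LlamaIndex implementation: `build_sentence_nodes` and `replace_with_window` are hypothetical helpers, the regex sentence splitter is a simplification, and it assumes that `window_size` counts sentences on each side of the matched sentence.

```python
import re
from typing import Dict, List


def build_sentence_nodes(text: str, window_size: int = 3) -> List[Dict[str, str]]:
    """Split text into sentences; each node keeps its sentence plus a window."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append({
            "text": sentence,                       # what gets embedded and matched
            "original_text": sentence,
            "window": " ".join(sentences[lo:hi]),   # what the LLM should read
        })
    return nodes


def replace_with_window(node: Dict[str, str], target_metadata_key: str = "window") -> str:
    """Mimic metadata replacement: return the window instead of the sentence."""
    return node.get(target_metadata_key, node["text"])


text = "A runs first. B runs second. C runs third. D runs fourth. E runs last."
nodes = build_sentence_nodes(text, window_size=1)
hit = nodes[2]  # pretend retrieval matched the sentence about C
print(hit["text"])
print(replace_with_window(hit))
```

Matching happens on the single sentence, which keeps retrieval precise, while the LLM receives the wider window, which restores the context that a lone sentence lacks.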

Load documents

import logging, sys, os
import nest_asyncio
from dotenv import load_dotenv  

nest_asyncio.apply()

!echo "OPENAI_API_KEY=<Your OpenAI Key>" >> .env
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")

from llama_index.core import SimpleDirectoryReader

titles = [
    "DevOps_Self-Service_Pipeline_Architecture",
    "DevOps_Self-Service_Terraform_Project_Structure",
    "DevOps_Self-Service_Pipeline_Security_Guardrails"
    ]

documents = {}
for title in titles:
    documents[title] = SimpleDirectoryReader(input_files=[f"./data/{title}.pdf"]).load_data()
print(f"loaded documents with {len(documents)} documents")

Set up node parser

# only the imports this case actually uses
from llama_index.llms.openai import OpenAI
from llama_index.core.node_parser import SentenceWindowNodeParser, SimpleNodeParser

# create the sentence window node parser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
simple_node_parser = SimpleNodeParser.from_defaults()

gpt-3.5-turbo

Extract nodes and build index

#define LLM and embedding model
llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
embed_model = "local:BAAI/bge-base-en-v1.5"

from llama_index.core import VectorStoreIndex

# extract nodes and build index
document_list = SimpleDirectoryReader("data").load_data()
nodes = node_parser.get_nodes_from_documents(document_list)
sentence_index = VectorStoreIndex(nodes, embed_model=embed_model)


Define query engine

from llama_index.core.postprocessor import MetadataReplacementPostProcessor

metadata_query_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    llm=llm,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ]
)

Run test queries

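# Korean query: "Can you summarize security and guardrails for the DevOps self-service-centric pipeline?"
response = metadata_query_engine.query("DevOps 셀프 서비스 중심 파이프라인 보안 및 가드레일에 대해 요약해줄래?")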
print(str(response))

The article discusses a curated list of actions for pipeline security and guardrails, covering source code, dependent libraries, base image, infrastructure, and pipelines. These actions are implemented in reusable GitHub Actions workflows for both infrastructure and application pipelines to ensure compliance by developers when creating workflows for their applications.

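# Korean query: "What is Harden Runner in DevOps self-service-centric pipeline security and guardrails?"
response = metadata_query_engine.query("DevOps 셀프 서비스 중심 파이프라인 보안 및 가드레일에서 Harden Runner란?")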
print(str(response))

Harden-Runner is a purpose-built security monitoring agent for pipelines that automatically discovers and correlates outbound traffic with each step in the pipeline, prevents exfiltration of credentials, and detects tampering of source code during the build. It is a crucial component used in all pipelines - infrastructure and application pipelines for both CI and CD workflows due to its unique nature and purpose.

zephyr-7b-beta

Let's download the model from the Hugging Face Hub.
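As a rough guide to why this case loads the model in 4-bit, the following back-of-the-envelope sketch estimates weight memory at different precisions. The parameter count of 7e9 is an assumed nominal figure for a "7B" model, and the estimate covers weights only: it ignores quantization metadata, activations, and the KV cache. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


params = 7e9  # assumed nominal parameter count for a "7B" model
fp16_gib = weight_memory_gib(params, 16)
nf4_gib = weight_memory_gib(params, 4)

print(f"fp16 weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights: ~{nf4_gib:.1f} GiB")
```

Under these assumptions, 4-bit weights take about a quarter of the fp16 footprint, which is what makes a 7B model practical on a single consumer GPU.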

import torch
from transformers import BitsAndBytesConfig
from llama_index.core import PromptTemplate
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import set_global_tokenizer

# load the model in 4-bit using NF4 quantization with double quantization; computation runs in float16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

try:
    llm_zephyr = HuggingFaceLLM(
        model_name="HuggingFaceH4/zephyr-7b-beta",
        tokenizer_name="HuggingFaceH4/zephyr-7b-beta",
        query_wrapper_prompt=PromptTemplate(
            "<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"
        ),
        context_window=3900,
        max_new_tokens=256,
        model_kwargs={"quantization_config": quantization_config},
        generate_kwargs={
            "do_sample": True,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        },
        device_map="auto",
    )
except Exception:
    print(
        "Failed to load and quantize model, likely due to CUDA being missing. "
        "Loading full precision model instead."
    )
    llm_zephyr = HuggingFaceLLM(
        model_name="HuggingFaceH4/zephyr-7b-beta",
        tokenizer_name="HuggingFaceH4/zephyr-7b-beta",
        query_wrapper_prompt=PromptTemplate(
            "<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"
        ),
        context_window=3900,
        max_new_tokens=256,
        generate_kwargs={
            "do_sample": True,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        },
        device_map="auto",
    )

# set tokenizer for proper token counting
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
set_global_tokenizer(tokenizer.encode)


Extract nodes and build index

from llama_index.core import VectorStoreIndex

document_list = SimpleDirectoryReader("data").load_data()
nodes = node_parser.get_nodes_from_documents(document_list)
sentence_index_zephyr = VectorStoreIndex(nodes, embed_model="local:BAAI/bge-base-en-v1.5")

Define query engine

from llama_index.core.postprocessor import MetadataReplacementPostProcessor

metadata_query_engine_zephyr = sentence_index_zephyr.as_query_engine(
    similarity_top_k=2,
    llm=llm_zephyr,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ]
)

Run test queries

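# Korean query: "Can you summarize security and guardrails for the DevOps self-service-centric pipeline?"
query = "DevOps 셀프 서비스 중심 파이프라인 보안 및 가드레일에 대해 요약해줄래?"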
response = metadata_query_engine_zephyr.query(query)

print(str(response))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


여러 보안 액션들을 수집해 보관하였으며, 소스 코드, 종속 라이브러리, 기반 이미지, 인프라스터ktur, 파이프라인에 대한 보안 가드레일도 포함하였다. 이 액션들은 GitHub Actions의 수동 작업으로 설정되고, 개발자들이 애플리케이션의 호출 워플로우를 개발할 때 준수할 수 있도록 설정되었다. 보안 가드레일은 애플리케이션의 인프라스터ktur에서 시작하여 소스 코드에 이르기까지 앱 전반에 걸쳐 있다. 저는 다음의 네 개의 글에 대해 추가로 참고하도록 권고한다:

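# Korean query: "What is Harden Runner in DevOps self-service-centric pipeline security and guardrails?"
query = "DevOps 셀프 서비스 중심 파이프라인 보안 및 가드레일에서 Harden Runner란 무엇이야?"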
response = metadata_query_engine_zephyr.query(query)

print(str(response))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Harden Runner은 DevOps 셀프 서비스 중심 파이프라인 보안 및 가드레일에서 사용되는 유일한 액션입니다. 이는 모든 파이프라인에서 사용되는 이유는 하arden 러너의 고유한 성격과 목적으로 설명됩니다. 하arden 러너는 인프라스터와 애플리케이션 파이프라인에서 CI 및 CD 워크플로우에서 사용됩니다. 하arden 러너의 주요 기능은 다음과 같습니다:

1. 출발 경로에서 각 단계에 대해 자동으로 디스커버하고 연관시켜 전달되는 외부 트래픽입니다. 솔루션 공급 체인 보안 위반에

Evaluations

Use the dataset generator to automatically generate evaluation questions.
Define evaluators for faithfulness and relevancy.
Use BatchEvalRunner to run the evaluations on the responses asynchronously.
Compare the evaluation results.

Generate evaluation questions

#%pip install spacy

import random
random.seed(42)
from llama_index.core.evaluation import DatasetGenerator
import nest_asyncio

nest_asyncio.apply()

# load data
document_list = SimpleDirectoryReader("data").load_data()

question_dataset = []
if os.path.exists("question_dataset.txt"):
    with open("question_dataset.txt", encoding='utf-8') as f:
        for line in f:
            question_dataset.append(line.strip())
else:
    # generate questions
    data_generator = DatasetGenerator.from_documents(document_list)
    generated_questions = data_generator.generate_questions_from_nodes()
    print(f"Generated {len(generated_questions)} questions.")

    # randomly pick 30 questions
    generated_questions = random.sample(generated_questions, 30)
    question_dataset.extend(generated_questions)
    print(f"Randomly picked {len(question_dataset)} questions.")

    # save the questions into a txt file
    with open("question_dataset.txt", "w", encoding="utf-8") as f:
        for question in question_dataset:
            f.write(f"{question.strip()}\n")

for i, question in enumerate(question_dataset, start=1):
    print(f"{i}. {question}")

/home/kubwa/anaconda3/envs/pytorch/lib/python3.11/site-packages/llama_index/core/evaluation/dataset_generation.py:212: DeprecationWarning: Call to deprecated class DatasetGenerator. (Deprecated in favor of `RagDatasetGenerator` which should be used instead.)
  return cls(


Generated 490 questions.
Randomly picked 30 questions.
1. Who is credited for the photo included in the document?
2. How are Microservice CD GitHub Actions workflow and Infrastructure/Application Pipelines integrated?
3. What types of source code are mentioned in the self-service pipeline architecture?
4. How does the recommended file structure for Terraform modules contribute to better organization and understanding of the codebase?
5. What is the purpose of Infracost in the context of cloud cost management?
6. What is the significance of whitelisting outbound endpoints for actions in the workflow?
7. Can you explain the role of Harden-Runner in both CI and CD workflows?
8. How does image immutability drive the workflows orchestrated by GitHub Actions?
9. What additional files might need to be included in the file structure of a reusable Terraform module based on its complexity?
10. Where does the Terraform code and GitHub Actions workflow code reside for the application?
11. Why is it important to reduce code duplication when working with infrastructure code?
12. How does the workflow determine which environment to run Terraform for based on the manual trigger?
13. How can developers ensure that their CI pipeline is not affected by the Trivy timeout error mentioned in the context information?
14. Can you describe the high-level overview of the two typical pipelines mentioned in the document?
15. What aspects of the development process do the actions for pipeline security and guardrails cover?
16. What is the purpose of setting the log level in the Checkov action?
17. What is the role of the "pipeline integration glue" in the self-service pipeline architecture?
18. How does the author differentiate between pipeline architecture for microservices and serverless workloads?
19. How can alternative flows be incorporated into the pipelines according to the document?
20. Why is Harden-Runner the only action used in all pipelines, including infrastructure and application pipelines?
21. In what ways does Harden-Runner support the secure deployment of applications in the pipeline workflows?
22. How does Trivy help in identifying critical security issues in container images?
23. How do the actions for pipeline security and guardrails ensure adherence by developers in their workflow development?
24. Where are the three types of source code supposed to reside for a microservice?
25. Why is it recommended to include TruffleHog in your pipelines as a security measure for your application?
26. Why do developers not need to know the details of security and guardrail measures when using high-quality reusable workflows and modules?
27. When was the last modified date of the file "DevOps_Self-Service_Terraform_Project_Structure.pdf"?
28. What is the file size of "DevOps_Self-Service_Terraform_Project_Structure.pdf"?
29. How can the use of reusable Terraform modules contribute to efficiency and consistency in infrastructure projects?
30. What is the default output format for the results of the Checkov action?

Define evaluators

from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

# use gpt-4 to evaluate
# the keyword is `model`; with `llm=` the model name is not applied
llm = OpenAI(temperature=0.1, model="gpt-4-1106-preview")

faithfulness_gpt4 = FaithfulnessEvaluator(llm=llm)
relevancy_gpt4 = RelevancyEvaluator(llm=llm)

Define evaluation batch runner

from llama_index.core.evaluation import BatchEvalRunner

runner = BatchEvalRunner(
    {"faithfulness": faithfulness_gpt4, "relevancy": relevancy_gpt4},
    workers=8,
    show_progress=True
)

def get_eval_results(key, eval_results):
    results = eval_results[key]
    correct = 0
    for result in results:
        if result.passing:
            correct += 1
    score = correct / len(results)
    print(f"{key} Correct: {correct}. Score: {score}")
    return score

Evaluation on gpt-3.5-turbo

eval_results = await runner.aevaluate_queries(
    metadata_query_engine, queries=question_dataset
)

print("------------------")
score = get_eval_results("faithfulness", eval_results)
score = get_eval_results("relevancy", eval_results)

100%|██████████| 30/30 [00:05<00:00,  5.08it/s]
100%|██████████| 60/60 [00:04<00:00, 12.29it/s]

------------------
faithfulness Correct: 28. Score: 0.9333333333333333
relevancy Correct: 25. Score: 0.8333333333333334

Evaluation on zephyr-7b

eval_results = await runner.aevaluate_queries(
    metadata_query_engine_zephyr, queries=question_dataset
)

print("------------------")
score = get_eval_results("faithfulness", eval_results)
score = get_eval_results("relevancy", eval_results)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
100%|██████████| 30/30 [04:51<00:00,  9.71s/it]  
100%|██████████| 60/60 [00:03<00:00, 17.07it/s]

------------------
faithfulness Correct: 26. Score: 0.8666666666666667
relevancy Correct: 26. Score: 0.8666666666666667

Final result comparison

| LLM | GPT3.5-Turbo | Zephyr-7B |
| --- | --- | --- |
| Faithfulness score | 0.933 | 0.867 |
| Relevancy score | 0.833 | 0.867 |

Case 2. Evaluation-Driven Development (EDD) for Multi Document RAG Pipeline

This case shows how to use EDD (Evaluation-Driven Development) to decide which of the following two strategies is the better fit for a multi-document RAG pipeline:

Recursive retriever + document agent
Metadata replacement + node sentence window

Setup Environments

%pip install -q llama-index-embeddings-openai
%pip install -q llama-index-embeddings-huggingface
%pip install -q llama-index-llms-openai

%pip install -q llama_index pypdf sentence-transformers

import logging, sys, os
import nest_asyncio
from dotenv import load_dotenv  

nest_asyncio.apply()

!echo "OPENAI_API_KEY=<Your OpenAI Key>" >> .env
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")

Common Tasks

Load documents

from llama_index.core import SimpleDirectoryReader

titles = [
    "DevOps_Self-Service_Pipeline_Architecture",
    "DevOps_Self-Service_Terraform_Project_Structure",
    "DevOps_Self-Service_Pipeline_Security_Guardrails"
    ]

documents = {}
for title in titles:
    documents[title] = SimpleDirectoryReader(input_files=[f"./data/{title}.pdf"]).load_data()
print(f"loaded documents with {len(documents)} documents")

loaded documents with 3 documents

Recursive retriever + document agent

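To serve different kinds of queries and summaries, it works best to build several indexes per document. The document agent then acts as a dispatcher, routing each question through the query engines to the right index so that an accurate answer is retrieved.

Index nodes are introduced as an additional layer in front of the document agents. Each index node corresponds to one document agent, which in turn maps to a list query engine (for summarization) and a vector query engine (for Q&A) for that document.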
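The two-level routing described above can be illustrated with a toy dispatcher. Everything here is a simplification for illustration: word overlap with the title stands in for vector similarity over index nodes, a keyword check stands in for the agent's LLM-driven tool choice, and `make_agent` and `route` are hypothetical names. The shortened titles are made up for the example.

```python
import re
from typing import Callable, Dict

titles = ["Pipeline_Architecture", "Terraform_Project_Structure"]


def make_agent(title: str) -> Callable[[str], str]:
    """Build a toy document agent that owns one summary tool and one vector tool."""
    tools: Dict[str, Callable[[str], str]] = {
        "summary_tool": lambda q: f"[summary of {title}]",
        "vector_tool": lambda q: f"[passage from {title} matching: {q}]",
    }

    def agent(query: str) -> str:
        # Stand-in for the LLM picking a tool from the tool descriptions.
        name = "summary_tool" if "summar" in query.lower() else "vector_tool"
        return tools[name](query)

    return agent


# One agent per document, keyed by the same id the index node carries.
agents = {title: make_agent(title) for title in titles}


def route(query: str) -> str:
    """Pick the top-1 index node, then hand the query to its document agent."""
    words = set(re.findall(r"[a-z]+", query.lower()))

    def overlap(title: str) -> int:
        return len(words & set(title.lower().split("_")))

    best_title = max(titles, key=overlap)  # stand-in for similarity_top_k=1
    return agents[best_title](query)


print(route("Summarize the Terraform project structure"))
print(route("Which pipeline architecture steps exist?"))
```

The first level decides which document is relevant, and the second level decides how to answer from it, which is why each document gets both a summary engine and a vector engine.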

from llama_index.core import (
    VectorStoreIndex,
    SummaryIndex,
    Response
)
from llama_index.core.schema import IndexNode
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.llms.openai import OpenAI
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import get_response_synthesizer
from llama_index.agent.openai import OpenAIAgent
import pandas as pd
import openai
import os

#define LLM
llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo")

Create document agents

# Build agents dictionary
agents = {}

for title in titles:

    # build vector index
    vector_index = VectorStoreIndex.from_documents(documents[title])

    # build summary index
    list_index = SummaryIndex.from_documents(documents[title])

    # define query engines
    vector_query_engine = vector_index.as_query_engine(llm=llm)
    list_query_engine = list_index.as_query_engine(llm=llm)

    # define tools
    query_engine_tools = [
        QueryEngineTool(
            query_engine=vector_query_engine,
            metadata=ToolMetadata(
                name="vector_tool",
                description=f"Useful for retrieving specific context related to {title}",
            ),
        ),
        QueryEngineTool(
            query_engine=list_query_engine,
            metadata=ToolMetadata(
                name="summary_tool",
                description=f"Useful for summarization questions related to {title}",
            ),
        ),
    ]

    # build agent
    function_llm = OpenAI(model="gpt-3.5-turbo-0613")
    agent = OpenAIAgent.from_tools(
        query_engine_tools,
        llm=function_llm,
        verbose=False,
    )

    agents[title] = agent

Create index nodes

# define index nodes that link to the document agents
nodes = []
for title in titles:
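    # English: "This content contains details about {title}. Use this index if you need to
    # look up specific facts about {title}. Do not use this index if you want to query multiple documents."
    doc_summary = (
        f"이 콘텐츠에는 {title}에 대한 자세한 내용이 포함되어 있습니다. "
        f"{title}에 대한 특정 사실을 조회해야 하는 경우 이 인덱스를 사용합니다.\n"
        "여러 문서를 쿼리하려는 경우 이 인덱스를 사용하지 마세요."
    )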
    node = IndexNode(text=doc_summary, index_id=title)
    nodes.append(node)

# define retriever
vector_index = VectorStoreIndex(nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=1)

Define recursive retriever and query engine

# define recursive retriever
# note: can pass `agents` dict as `query_engine_dict` since every agent can be used as a query engine
recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    query_engine_dict=agents,
    verbose=False,
)

response_synthesizer = get_response_synthesizer(response_mode="compact")

# define query engine
recursive_query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever,
    response_synthesizer=response_synthesizer,
    llm=llm,
)

Run test queries

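# Korean query: "Can you summarize security and guardrails for the DevOps self-service-centric pipeline?"
response = recursive_query_engine.query("DevOps 셀프 서비스 중심 파이프라인 보안 및 가드레일에 대해 요약해줄래?")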
print(str(response))

DevOps 셀프 서비스 중심 파이프라인 보안 및 가드레일은 소프트웨어 개발과 배포를 자동화하는 파이프라인에서 보안 설정과 제한 사항을 정의하여 안전한 코드 작성, 배포, 시스템 보안을 유지하는 것을 의미합니다. 가드레일은 허용되는 작업과 동작을 제한하는 정책으로, 코드 스캔, 보안 검사, 권한 제한 등이 포함될 수 있습니다. 이러한 조치는 시스템의 안전성을 높이고 악의적인 코드나 취약점을 방지하는 역할을 합니다.

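# Korean query: "What is Harden Runner in DevOps self-service-centric pipeline security and guardrails?"
response = recursive_query_engine.query("DevOps 셀프 서비스 중심 파이프라인 보안 및 가드레일에서 Harden Runner란 무엇이야?")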
print(str(response))

Harden Runner는 파이프라인에서 악성 패턴을 감지하고 예방하기 위해 설계된 특별한 보안 모니터링 에이전트입니다. 이는 파이프라인의 각 단계에서 발생하는 아웃바운드 트래픽을 자동으로 탐지하고 연관시키며, 자격 증명 유출을 방지하고, 소스 코드의 조작을 감지하며, 침해된 종속성과 빌드 도구를 식별합니다. 초기에 감사 모드로 구성한 후에는 일부 워크플로우 실행 후에 블록 모드로 전환할 수 있으며, 아웃바운드 엔드포인트를 화이트리스트로 설정하고 무단 엔드포인트 호출에 대한 알림을 설정함으로써 작동합니다. 이는 고유한 보안 기능으로 인해 모든 유형의 파이프라인에서 사용되는 중요한 구성 요소입니다.

Metadata Replacement + Node Sentence Window

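SentenceWindowNodeParser is a tool for building sentence-level representations that take the surrounding words and sentences into account. During retrieval, before a retrieved sentence is passed to the LLM, MetadataReplacementPostProcessor replaces the single sentence with a window that includes its surrounding sentences. This can help with tasks such as machine translation or summarization, where understanding the full meaning of a sentence is essential. Because it helps retrieve finer-grained details, it is most useful for large documents.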

Set up node parser

# only the imports this section actually uses
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceWindowNodeParser, SimpleNodeParser

# create the sentence window node parser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
simple_node_parser = SimpleNodeParser.from_defaults()

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-mpnet-base-v2", max_length=512
)


Extract nodes and build index

from llama_index.core import VectorStoreIndex

document_list = SimpleDirectoryReader("data").load_data()
nodes = node_parser.get_nodes_from_documents(document_list)
sentence_index = VectorStoreIndex(nodes, embed_model=embed_model)

Define query engine

from llama_index.core.postprocessor import MetadataReplacementPostProcessor

metadata_query_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    llm=llm,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)

Run test queries

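# Korean query: "Can you summarize security and guardrails for the DevOps self-service-centric pipeline?"
query = "DevOps 셀프 서비스 중심 파이프라인 보안 및 가드레일에 대해 요약해줄래?"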
response = metadata_query_engine.query(query)
print(str(response))

DevOps 셀프 서비스 중심 파이프라인은 Terraform을 사용하여 시크릿을 자동으로 GitHub에 삽입하여 수동 시크릿 생성을 제거합니다. 애플리케이션 코드의 PR 생성/병합 또는 수동 트리거 시, GitHub Actions 워크플로우에 의해 CI 및 CD를 위한 애플리케이션 파이프라인이 트리거되어 앱을 빌드, 테스트, 스캔 및 클라우드로 배포합니다. DevOps 셀프 서비스는 클라우드 네이티브 아키텍처의 성장 트렌드로 인해 최근 몇 년간 주목을 받았습니다. 3-2-1 규칙은 셀프 서비스 DevOps 실천의 전체 아키텍처를 개요화하며, 이를 통해 DevOps 셀프 서비스의 주요 구성 요소를 설명합니다.

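# Korean query: "What is Harden Runner in DevOps self-service-centric pipeline security and guardrails?"
query = "DevOps 셀프 서비스 중심 파이프라인 보안 및 가드레일에서 Harden Runner란 무엇이야?"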
response = metadata_query_engine.query(query)
print(str(response))

Harden-Runner is a purpose-built security monitoring agent developed by StepSecurity for pipelines. It is designed to detect and prevent malicious patterns in software supply chain security breaches. Harden-Runner automatically discovers and correlates outbound traffic with each step in the pipeline, prevents exfiltration of credentials, detects tampering of source code during the build, and identifies compromised dependencies and build tools.

Evaluations

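Use the dataset generator to automatically generate 30 evaluation questions.
Define evaluators for faithfulness and relevancy.
Use BatchEvalRunner to asynchronously evaluate the responses of the two query engines, one per strategy.
Compare the evaluation results and choose the strategy with the higher scores.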

Generate evaluation questions

import random
random.seed(42)
from llama_index.core.evaluation import DatasetGenerator
import nest_asyncio

nest_asyncio.apply()

# load data
document_list = SimpleDirectoryReader("data").load_data()

question_dataset = []
if os.path.exists("question_dataset.txt"):
    with open("question_dataset.txt", "r") as f:
        for line in f:
            question_dataset.append(line.strip())
else:
    # generate questions
    data_generator = DatasetGenerator.from_documents(document_list)
    generated_questions = data_generator.generate_questions_from_nodes()
    print(f"Generated {len(generated_questions)} questions.")

    # randomly pick 30 questions
    generated_questions = random.sample(generated_questions, 30)
    question_dataset.extend(generated_questions)
    print(f"Randomly picked {len(question_dataset)} questions.")

    # save the questions into a txt file
    with open("question_dataset.txt", "w") as f:
        for question in question_dataset:
            f.write(f"{question.strip()}\n")

for i, question in enumerate(question_dataset, start=1):
    print(f"{i}. {question}")

1. Who is credited for the photo included in the document?
2. How are Microservice CD GitHub Actions workflow and Infrastructure/Application Pipelines integrated?
3. What types of source code are mentioned in the self-service pipeline architecture?
4. How does the recommended file structure for Terraform modules contribute to better organization and understanding of the codebase?
5. What is the purpose of Infracost in the context of cloud cost management?
6. What is the significance of whitelisting outbound endpoints for actions in the workflow?
7. Can you explain the role of Harden-Runner in both CI and CD workflows?
8. How does image immutability drive the workflows orchestrated by GitHub Actions?
9. What additional files might need to be included in the file structure of a reusable Terraform module based on its complexity?
10. Where does the Terraform code and GitHub Actions workflow code reside for the application?
11. Why is it important to reduce code duplication when working with infrastructure code?
12. How does the workflow determine which environment to run Terraform for based on the manual trigger?
13. How can developers ensure that their CI pipeline is not affected by the Trivy timeout error mentioned in the context information?
14. Can you describe the high-level overview of the two typical pipelines mentioned in the document?
15. What aspects of the development process do the actions for pipeline security and guardrails cover?
16. What is the purpose of setting the log level in the Checkov action?
17. What is the role of the "pipeline integration glue" in the self-service pipeline architecture?
18. How does the author differentiate between pipeline architecture for microservices and serverless workloads?
19. How can alternative flows be incorporated into the pipelines according to the document?
20. Why is Harden-Runner the only action used in all pipelines, including infrastructure and application pipelines?
21. In what ways does Harden-Runner support the secure deployment of applications in the pipeline workflows?
22. How does Trivy help in identifying critical security issues in container images?
23. How do the actions for pipeline security and guardrails ensure adherence by developers in their workflow development?
24. Where are the three types of source code supposed to reside for a microservice?
25. Why is it recommended to include TruffleHog in your pipelines as a security measure for your application?
26. Why do developers not need to know the details of security and guardrail measures when using high-quality reusable workflows and modules?
27. When was the last modified date of the file "DevOps_Self-Service_Terraform_Project_Structure.pdf"?
28. What is the file size of "DevOps_Self-Service_Terraform_Project_Structure.pdf"?
29. How can the use of reusable Terraform modules contribute to efficiency and consistency in infrastructure projects?
30. What is the default output format for the results of the Checkov action?

Define evaluators

from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

# use gpt-4 to evaluate
# the keyword is `model`; with `llm=` the model name is not applied
llm = OpenAI(temperature=0.1, model="gpt-4")

faithfulness_gpt4 = FaithfulnessEvaluator(llm=llm)
relevancy_gpt4 = RelevancyEvaluator(llm=llm)

Define evaluation batch runner

from llama_index.core.evaluation import BatchEvalRunner

runner = BatchEvalRunner(
    {"faithfulness": faithfulness_gpt4, "relevancy": relevancy_gpt4},
    workers=10,
    show_progress=True
)

Keep workers modest: a higher value sends more evaluation requests concurrently and can run into OpenAI rate limits.
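To make the rate-limit concern concrete, here is a minimal stdlib-only sketch of capping the number of requests in flight. It is an illustration of the general idea, not BatchEvalRunner's actual implementation; `run_capped` and `fake_eval_call` are hypothetical names introduced only for this example.

```python
import asyncio


async def run_capped(jobs, workers):
    """Run coroutine factories with at most `workers` in flight at once."""
    semaphore = asyncio.Semaphore(workers)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


async def fake_eval_call():
    """Hypothetical placeholder for one rate-limited API request."""
    await asyncio.sleep(0.01)
    return True


if __name__ == "__main__":
    jobs = [fake_eval_call for _ in range(20)]
    results, peak = asyncio.run(run_capped(jobs, workers=4))
    print(f"completed={len(results)} peak_concurrency={peak}")
```

Inside a notebook, where an event loop is already running, call `await run_capped(jobs, workers=4)` instead of `asyncio.run(...)`. Lowering `workers` trades wall-clock time for fewer simultaneous requests.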

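def get_eval_results(key, eval_results):
    """Print and return the fraction of results under `key` that passed."""
    results = eval_results[key]
    # Count the evaluation results whose `passing` flag is truthy
    correct = 0
    for result in results:
        if result.passing:
            correct += 1
    score = correct / len(results)
    print(f"{key} Correct: {correct}. Score: {score}")
    return score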
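The helper only reads the `passing` attribute of each result, so the scoring logic can be checked without calling an LLM. Below is a minimal sketch of a variant that also returns 0.0 for an empty result list instead of dividing by zero. `FakeResult` and `pass_rate` are hypothetical names for this example, and the data is illustrative, not taken from the runs in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResult:
    """Hypothetical stand-in for an evaluation result; only `passing` is used."""
    passing: Optional[bool]


def pass_rate(results):
    """Return (passed_count, fraction_passed); the fraction is 0.0 for an empty list."""
    correct = sum(1 for r in results if r.passing)
    score = correct / len(results) if results else 0.0
    return correct, score


# Illustrative data only, not results from the evaluation runs.
demo = {
    "faithfulness": [FakeResult(True), FakeResult(True), FakeResult(False)],
    "relevancy": [],
}
for key, results in demo.items():
    correct, score = pass_rate(results)
    print(f"{key} Correct: {correct}. Score: {score:.3f}")
```

A result whose `passing` is `None` is counted as not passing here, which matches the truthiness check in the original helper.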

Evaluation of recursive retriever + document agent

eval_results = await runner.aevaluate_queries(
    recursive_query_engine, queries=question_dataset
)

print("------------------")
score = get_eval_results("faithfulness", eval_results)
score = get_eval_results("relevancy", eval_results)

100%|██████████| 30/30 [02:00<00:00,  4.02s/it]
100%|██████████| 60/60 [00:03<00:00, 19.95it/s]

------------------
faithfulness Correct: 29. Score: 0.9666666666666667
relevancy Correct: 29. Score: 0.9666666666666667

Evaluation of metadata replacement + node sentence window

eval_results = await runner.aevaluate_queries(
    metadata_query_engine, queries=question_dataset
)

print("------------------")
score = get_eval_results("faithfulness", eval_results)
score = get_eval_results("relevancy", eval_results)

100%|██████████| 30/30 [00:04<00:00,  6.51it/s]
100%|██████████| 60/60 [00:03<00:00, 19.03it/s]

------------------
faithfulness Correct: 28. Score: 0.9333333333333333
relevancy Correct: 26. Score: 0.8666666666666667

Final results comparison

| Metric | Recursive retriever + document agent | Metadata replacement + node sentence window |
| --- | --- | --- |
| Faithfulness (fraction correct) | 0.967 | 0.933 |
| Relevancy (fraction correct) | 0.967 | 0.867 |
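The comparison can also be produced programmatically. The sketch below recomputes the scores from the pass counts printed in the two evaluation runs above (29/29 and 28/26 out of 30 questions) and reports which strategy scored higher on each metric. The `summarize` helper is a hypothetical name introduced for this example.

```python
# Pass counts copied from the two evaluation runs above (30 questions each).
TOTAL_QUESTIONS = 30
passes = {
    "Recursive retriever + document agent": {"faithfulness": 29, "relevancy": 29},
    "Metadata replacement + node sentence window": {"faithfulness": 28, "relevancy": 26},
}


def summarize(passes, total):
    """Return {metric: (best_strategy, {strategy: score})}."""
    metrics = sorted({m for counts in passes.values() for m in counts})
    summary = {}
    for metric in metrics:
        scores = {name: counts[metric] / total for name, counts in passes.items()}
        best = max(scores, key=scores.get)
        summary[metric] = (best, scores)
    return summary


for metric, (best, scores) in summarize(passes, TOTAL_QUESTIONS).items():
    for name, score in scores.items():
        print(f"{metric:<13} {name:<45} {score:.3f}")
    print(f"  higher {metric}: {best}")
```

With only 30 questions, a single answer changes a score by about 0.033, so the one-question gap in faithfulness is small, while the three-question gap in relevancy is a somewhat stronger signal.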
