IT 세계의 후아

[Programmers] LV1_The Player Who Did Not Finish (Hash)

후__아 — Fri, 27 Sep 2024 23:49:50 +0900

participant: list of participants
completion: list of runners who finished
return: the name of the one participant who did not finish
*Names may be duplicated

# My first solution
def solution(participant, completion):
    names_dic = dict(zip(participant, [0]*len(participant)))
    
    for i in participant:
        names_dic[i] += 1
    
    for j in completion:
        names_dic[j] -= 1
    
    return [k for k,v in names_dic.items() if v!=0][0]

# Based on another person's solution: using hash()
def solution(participant, completion):
    answer = ''
    temp = 0
    dic = {}
    
    for part in participant:
        dic[hash(part)] = part
        temp += int(hash(part))
    for com in completion:
        temp -= hash(com)
    answer = dic[temp]

    return answer
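
Both solutions above come down to counting names. As a minimal sketch of the same idea, `collections.Counter` can do the counting and the subtraction directly. The name `solution_counter` is only chosen here to keep it apart from the two versions above, and it relies on the problem's guarantee that exactly one participant is missing.

```python
from collections import Counter

def solution_counter(participant, completion):
    # Counter subtraction keeps only names with a positive remaining count,
    # so duplicated names are handled without any extra bookkeeping.
    remaining = Counter(participant) - Counter(completion)
    # Exactly one name is left over by the problem statement.
    return next(iter(remaining))

print(solution_counter(["mislav", "stanko", "mislav", "ana"],
                       ["stanko", "ana", "mislav"]))
```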

[Paper] QLoRA: Efficient Finetuning of Quantized LLMs

후__아 — Thu, 22 Aug 2024 15:13:22 +0900

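While looking into LLM fine-tuning I came across QLoRA and felt I needed to study it properly, so I am going to review the paper

Before that, though, I also have to study quantization... (studying always calls for more studying...)

Before we begin..

※ Quantization

Precise, fine-grained input values → simplified unit values (a lighter model)

In other words, it reduces the number of bits needed to represent information

ex) In a neural network, quantizing the weight parameters and the activations (the outputs of the activation functions)

→ lower-bit arithmetic & quantized intermediate values in the network

※ Pros and cons

Memory access↓ Compute↓ Power efficiency↑

but information is lost during compression, because fewer distinct values can be represented

so accuracy usually ends up lower than that of the original model

∴ The goal is to reduce size and compute cost while damaging the model as little as possible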

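※ Types

Parameters in tensorflow/pytorch are usually stored as 32-bit floating point, FP32 (float32)

Quantization converts them to INT8/INT4 or FP8/FP4

- Dynamic Quantization

Weights are quantized ahead of time; activations are quantized dynamically right before the computation runs

- Static Quantization (Post Training Quantization)

Quantization is applied after training; effective for models with a large parameter size

- Quantization Aware Training

The model is adjusted with quantization in mind during training; training includes simulated weight quantization (fake quantization nodes)

→ makes the original model more robust to quantization

Higher accuracy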

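Note) Techniques for efficient machine learning

I kept mixing up quantization with other related techniques (because I had forgotten even what I used to know),

so here is a review of the basics to get a better grip!!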

https://www.youtube.com/watch?v=2ySpRWvUShI

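Techniques that help training run efficiently

(1) Normalization

Needed when the values of the input features x differ too much in scale

Compensating for that by tuning the weights is very hard and inefficient

StandardScaler (standardization), which rescales each feature to zero mean and unit variance, is the typical example

- Batch Normalization

A large training dataset can be split into small batches for training

For example, with 15 samples, one epoch (one pass over the whole dataset) takes 1 iteration at batch size=15, 3 iterations at batch size=5, and 15 iterations at batch size=1

Values belonging to the same feature (the same position) are normalized across the samples in a batch

∴ performance can depend on the batch size

- Layer Normalization

Normalizes the values of a single sample/token (the values that came in together) → does not depend on the batch, so it is simpler to compute than batch normalization

Widely used in transformers (most language models since the transformer use it)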
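
To make the difference concrete, here is a small pure-Python sketch (not taken from the video) that normalizes the same toy batch both ways. The batch values and helper names are made up for illustration, and the learnable scale/shift parameters that real layers carry are left out.

```python
from statistics import fmean, pstdev

def normalize(values, eps=1e-5):
    """Shift to zero mean and scale to unit variance (eps avoids division by zero)."""
    mean = fmean(values)
    variance = pstdev(values) ** 2
    return [(v - mean) / (variance + eps) ** 0.5 for v in values]

def batch_norm(batch):
    """Normalize each feature (column) across the samples in the batch."""
    columns = [normalize(col) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    """Normalize each sample (row) across its own features."""
    return [normalize(row) for row in batch]

# Hypothetical toy batch: 3 samples x 2 features on very different scales.
batch = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
bn = batch_norm(batch)
ln = layer_norm(batch)
print(bn)  # each column now has mean 0
print(ln)  # each row now has mean 0
```

Batch normalization needs the other samples in the batch to compute its statistics, while layer normalization only looks at one sample, which is why the latter fits variable-length token sequences well.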

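(2) Optimization

- Gradient Descent

1) Batch Gradient Descent: feeds in all the data at once for each weight update

2) Mini-Batch Gradient Descent: splits the data into small batches and updates per batch

3) Stochastic Gradient Descent: the batch size is 1, so the weights are updated once per sample → the updates are noisy, and the weights can move in somewhat random directions

- Momentum

Reflects the direction of the previous weight update

update(t) = γ * update(t-1) + η∇w   (γ: momentum coefficient, η: learning rate)

W(t+1) = W(t) - update(t)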
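
The two update lines can be tried out on a tiny problem. This is a minimal sketch under my own assumptions: the 1-D objective f(w) = (w - 3)^2 and the hyperparameter values are made up for illustration.

```python
def momentum_descent(grad, w0, lr=0.1, gamma=0.9, steps=200):
    """Minimize a 1-D function with the momentum update rule."""
    w, update = w0, 0.0
    for _ in range(steps):
        update = gamma * update + lr * grad(w)   # update(t) = gamma * update(t-1) + lr * gradient
        w = w - update                           # W(t+1) = W(t) - update(t)
    return w

# Hypothetical objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = momentum_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)  # ends up close to the minimum at w = 3
```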

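- RMSprop

Adjusts the learning rate separately for each weight

Weights whose recent gradients have been relatively large get a smaller effective learning rate, and weights with small gradients get a larger one, which speeds up convergence

∴ the more a weight has been updated, the less it learns

- Adam (Adaptive Moment Estimation)

Adjusts both the gradient and the learning rate (momentum + RMSprop)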
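
A short sketch of how the two ideas combine in Adam, again on a made-up 1-D objective. The hyperparameter values are common defaults chosen by me for illustration, not values from the video.

```python
import math

def adam_minimize(grad, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a 1-D function with Adam."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g          # momentum part: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g      # RMSprop part: running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)  # large recent gradients -> smaller step
    return w

# Hypothetical objective f(w) = (w - 3)^2 with gradient 2 * (w - 3).
w_adam = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_adam)  # ends up close to the minimum at w = 3
```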

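(3) Dropout

A way to reduce overfitting, and a representative regularization method

During training, some nodes of the model are randomly and deliberately switched off

Dependence on specific nodes↓ The effect of ensembling several different neural networks ∴ more generalized patterns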
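
As a rough illustration, the sketch below implements the common "inverted dropout" variant, which rescales the surviving activations so their expected value stays the same. The function name, the drop probability, and the toy activations are my own assumptions.

```python
import random

def dropout(activations, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each value with probability p, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)      # at inference time nothing is dropped
    rng = rng or random.Random()
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

acts = [1.0] * 10                     # hypothetical layer outputs
dropped = dropout(acts, p=0.5, rng=random.Random(0))
print(dropped)                        # a mix of 0.0 and 2.0
print(dropout(acts, training=False))  # unchanged
```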

https://arxiv.org/pdf/2305.14314

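The paper I am reading today is "QLoRA: Efficient Finetuning of Quantized LLMs"

(A review in which silly me, having started without remembering that LoRA should be studied before QLoRA, stopped reading QLoRA partway and read LoRA instead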

→ )

0. Abstract

- QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance

- backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA)

- introduces a number of innovations

(a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights

(b) Double Quantization to reduce the average memory footprint by quantizing the quantization constants*

(c) Paged Optimizers to manage memory spikes

- provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation

- find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots.

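* the quantization constant

- A scaling factor that determines how values are scaled into the range of the quantized format during quantization

- Because scaling keeps the relative differences between values, the behavior of the neural network is preserved

- Used during dequantization to map the quantized values back to an approximation of the original values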

1. Introduction

- Finetuning an LLM is an effective way to improve its performance, but it requires a lot of GPU memory and is therefore very expensive

- With QLoRA, it is possible to finetune a quantized 4-bit model without any performance degradation

- QLORA’s efficiency enables us to perform an in-depth study of instruction finetuning and chatbot performance on model scales that would be impossible using regular finetuning due to memory overhead

- By training Guanaco (a family of models finetuned from LLaMA with QLoRA), the authors found trends in the trained models

first, data quality is far more important than dataset size

second, dataset suitability matters more than size for a given task.

also provide an extensive analysis of chatbot performance that uses both human raters and GPT-4 for evaluation

- Models are compared in a tournament format & ranked with Elo scores (determine the ranking of chatbot performance)

[Image: differences between finetuning methods]

2. Background

Block-wise k-bit Quantization

To ensure that the entire range of the low-bit data type is used, the input data type is commonly rescaled into the target data type range through normalization by the absolute maximum of the input elements, which are usually structured as a tensor*.

quantizing a FP32 tensor into a Int8 tensor with range [-127, 127]

The problem with this approach is that if a large magnitude value (i.e., an outlier) occurs in the input tensor, then the quantization bins**—certain bit combinations—are not utilized well with few or no numbers quantized in some bins.

To prevent the outlier issue, a common approach is to chunk the input tensor into blocks that are independently quantized, each with their own quantization constant c. We chunk the input tensor X ∈ R^(b×h) into n contiguous blocks of size B by flattening the input tensor and slicing the linear segment into n = (b × h)/B blocks. We quantize these blocks independently with Equation 1 to create a quantized tensor and n quantization constants c_i.

In other words, to deal with outliers the input tensor X is flattened and split into n contiguous blocks of size B,

and each block is quantized on its own with its own quantization constant (c_i).

* tensor: an array of data, ex) scalar - vector - matrix - 3d tensor - nd tensor

** quantization bin: for a quantization function y=Q(x), a K-level scalar quantizer consists of K+1 decision levels (d_0, d_1, ..., d_K) and K output levels (y_1, y_2, ..., y_K). The region between d_i and d_(i+1) is called a quantization bin.
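
The absmax step behind the FP32 → Int8 example rescales by 127 / absmax(X) and rounds, i.e. X_int8 = round(127 / absmax(X) · X). The pure-Python sketch below tries this on a made-up flattened tensor with one outlier, once as a whole and once block by block. The helper names, the toy values, and the block size of 4 are my own assumptions (the paper works with much larger blocks).

```python
def absmax_quantize(values, qmax=127):
    """Quantize floats to integers in [-qmax, qmax]; returns (ints, constant c)."""
    absmax = max(abs(v) for v in values)
    c = qmax / absmax if absmax > 0 else 1.0
    return [round(c * v) for v in values], c

def dequantize(q_values, c):
    return [q / c for q in q_values]

def blockwise_quantize(values, block_size, qmax=127):
    """Quantize each contiguous block independently with its own constant."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [absmax_quantize(block, qmax) for block in blocks]

def blockwise_dequantize(quantized_blocks):
    out = []
    for q_values, c in quantized_blocks:
        out.extend(dequantize(q_values, c))
    return out

def total_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical flattened tensor: small weights plus one outlier at the end.
x = [0.01, -0.02, 0.03, -0.04, 0.05, -0.06, 0.07, 100.0]

q_all, c_all = absmax_quantize(x)
print(q_all)  # the outlier takes the whole range, every small value collapses to 0
err_whole = total_error(x, dequantize(q_all, c_all))

blocks = blockwise_quantize(x, block_size=4)
print(blocks[0][0])  # the first block now uses the full Int8 range
err_block = total_error(x, blockwise_dequantize(blocks))
print(err_whole, err_block)
```

The outlier still hurts the block it sits in, but it no longer ruins the rest of the tensor.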

Low-rank Adapters

Low-rank Adapter* (LoRA) finetuning is a method that reduces memory requirements by using a small set of trainable parameters, often termed adapters, while not updating the full model parameters which remain fixed.

LoRA augments a linear projection through an additional factorized projection.

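+ Finetunes two small matrices whose product approximates the update to the LLM's weight matrix

How LoRA computes the output for a projection XW = Y

* adapter: a structure that inserts small trainable feed-forward networks between the layers of an already-trained model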
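
In the paper the LoRA projection is written as Y = XW + sXL1L2, with L1 of shape h×r, L2 of shape r×o, and a scalar s. A minimal pure-Python sketch of that forward pass follows. The tiny sizes, the identity W, and the initial values are assumptions made only for illustration.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w, l1, l2, s=1.0):
    """Y = XW + s * X L1 L2, with W frozen and only L1, L2 trainable."""
    base = matmul(x, w)
    delta = matmul(matmul(x, l1), l2)
    return [[b + s * d for b, d in zip(b_row, d_row)] for b_row, d_row in zip(base, delta)]

# Hypothetical sizes: h = 4 inputs, o = 4 outputs, rank r = 1.
h, o, r = 4, 4, 1
w = [[1.0 if i == j else 0.0 for j in range(o)] for i in range(h)]  # frozen weights (identity here)
l1 = [[0.5] for _ in range(h)]   # h x r
l2 = [[0.0] * o]                 # r x o, zero here so the sketch starts from Y = XW
x = [[1.0, 2.0, 3.0, 4.0]]

y = lora_forward(x, w, l1, l2)
full_params = h * o              # parameters a full update of W would train
lora_params = h * r + r * o      # parameters LoRA trains instead
print(y, full_params, lora_params)
```

With realistic sizes (h and o in the thousands, r in the tens) the gap between `full_params` and `lora_params` becomes very large, which is where the memory saving on trainable parameters comes from.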

Memory Requirement of Parameter-Efficient Finetuning

While LoRA was designed as a Parameter Efficient Finetuning (PEFT) method*, most of the memory footprint** for LLM finetuning comes from activation gradients and not from the learned LoRA parameters. ... ... gradient checkpointing is important but also that aggressively reducing the amount of LoRA parameter yields only minor memory benefits. This means we can use more adapters without significantly increasing the overall training memory footprint. As discussed later, this is crucial for recovering full 16-bit precision performance.

* PEFT: keeps most parameters of the pretrained LLM frozen and finetunes only the small subset that is needed → storage & compute requirements↓, and it mitigates catastrophic forgetting

** memory footprint: the amount of main memory that a program uses or references while running

3. QLoRA Finetuning

- QLoRA achieves high-fidelity 4-bit finetuning via two techniques we propose—4-bit NormalFloat (NF4) quantization and Double Quantization. Additionally, we introduce Paged Optimizers, to prevent memory spikes during gradient checkpointing from causing out-of-memory errors that have traditionally made finetuning on a single machine difficult for large models.

4-bit NormalFloat Quantization

- NormalFloat (NF) data type builds on Quantile Quantization which is an information-theoretically optimal data type that ensures each quantization bin has an equal number of values assigned from the input tensor. Quantile quantization works by estimating the quantile of the input tensor through the empirical cumulative distribution function.

In other words, quantile quantization estimates the quantiles of the cumulative distribution function and uses them to perform 4-bit quantization

- Fast quantile approximation algorithms such as SRAM quantiles are used, but they have the limitation of large quantization errors for outliers

- Expensive quantile estimates and approximation errors can be avoided when input tensors come from a distribution fixed up to a quantization constant. ... ... transform all weights to a single fixed distribution by scaling σ such that the distribution fits exactly into the range of our data type.

Both the data type and the neural network weights are normalized into [-1, 1]
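
To get a feel for the idea, the sketch below builds 16 equal-probability quantile levels of a standard normal distribution, rescales them into [-1, 1], and snaps absmax-normalized weights to the nearest level. This is a simplified illustration, not the paper's exact NF4 construction (which, among other things, is built so that zero is represented exactly); the function names and toy weights are my own.

```python
from statistics import NormalDist

def normal_quantile_levels(bits=4):
    """Equal-probability quantile levels of N(0, 1), rescaled into [-1, 1]."""
    n = 2 ** bits
    std_normal = NormalDist()
    # Use the midpoint of each of the n equal-mass bins, which avoids the
    # infinite quantiles at probabilities 0 and 1.
    raw = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    scale = max(abs(v) for v in raw)
    return [v / scale for v in raw]

def quantize_to_levels(weights, levels):
    """Absmax-normalize the weights into [-1, 1] and snap each to its nearest level."""
    absmax = max(abs(w) for w in weights)
    indices = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in weights
    ]
    return indices, absmax

levels = normal_quantile_levels(4)
indices, absmax = quantize_to_levels([0.3, -1.2, 0.05, 0.9], levels)
print([round(v, 3) for v in levels])  # denser around 0, sparser near -1 and 1
print(indices, absmax)                # 4-bit codes plus one quantization constant
```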

Double Quantization

- the process of quantizing the quantization constants for additional memory savings.

- treats quantization constants c2^FP32 of the first quantization as inputs to a second quantization.

- On average, for a blocksize of 64, this quantization reduces the memory footprint per parameter from 32/64 = 0.5 bits, to 8/64 + 32/(64 · 256) = 0.127 bits, a reduction of 0.373 bits per parameter.

The FP32 constants c2 from the first (4-bit NF4) quantization are quantized once more to 8 bits, using second-level constants c1 → saves about 0.373 bits per parameter
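
The arithmetic in the quote can be reproduced in a few lines. The function names are mine; the block sizes (64 and 256) and bit widths (8 and 32) are the ones quoted from the paper.

```python
def constant_overhead_bits(block_size=64, const_bits=32):
    """Extra bits per parameter when each block stores one FP32 constant."""
    return const_bits / block_size

def double_quant_overhead_bits(block_size=64, second_block_size=256,
                               first_bits=8, second_bits=32):
    """Extra bits per parameter when the constants themselves are quantized to 8 bits."""
    return first_bits / block_size + second_bits / (block_size * second_block_size)

before = constant_overhead_bits()
after = double_quant_overhead_bits()
print(before, round(after, 3), round(before - after, 3))  # 0.5 0.127 0.373
```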

Paged Optimizers

- The feature works like regular memory paging* between CPU RAM and the disk. We use this feature to allocate paged memory for the optimizer states which are then automatically evicted to CPU RAM when the GPU runs out-of-memory and paged back into GPU memory when the memory is needed in the optimizer update step.

* memory paging: a virtual memory management technique that divides a process's logical address space into pages and physical memory into frames, then maps pages onto frames

QLoRA

- QLORA has one storage data type (usually 4-bit NormalFloat) and a computation data type (16-bit BrainFloat). We dequantize the storage data type to the computation data type to perform the forward and backward pass, but we only compute weight gradients for the LoRA parameters which use 16-bit BrainFloat*.

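The weights are stored compressed in 4-bit, but they are dequantized to 16-bit BrainFloat for the forward and backward pass; weight gradients are computed only for the LoRA parameters

* 16-bit BrainFloat (BF16): precision↓ and memory requirements↓ compared with the 32-bit floating point format ∴ convenient for model training

cf) FP16 also uses less memory. Models are commonly trained in FP32, and FP16 tends to be used at inference time to speed up computation.

≫ Mixed Precision: a training approach that mixes FP16 and FP32

4. QLoRA vs Standard Finetuning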

- whether QLoRA can perform as well as full-model finetuning.

- want to analyze the components of QLoRA including the impact of NormalFloat4 over standard Float4.

- Our results consistently show that 4-bit QLORA with NF4 data type matches 16-bit full finetuning and 16-bit LoRA finetuning performance on academic benchmarks with well-established evaluation setups. We have also shown that NF4 is more effective than FP4 and that double quantization does not degrade performance.

5. Pushing the Chatbot State-of-the-art with QLoRA & 6. Qualitative Analysis

- we use the MMLU (Massively Multitask Language Understanding) benchmark to measure performance on a range of language understanding tasks. This is a multiple-choice benchmark covering 57 tasks including elementary mathematics, US history, computer science, law, and more.

- also test generative language capabilities through both automated and human evaluations.

[Image: MMLU test accuracy comparison]

Elo rating for a tournament between models where models compete to generate the best response for a prompt, judged by human raters or GPT-4.
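
For reference, this is how a standard Elo update works for such pairwise comparisons. A minimal sketch: the starting rating of 1000, K = 32, the model names, and the three judgements are assumptions for illustration, not results from the paper.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a, rating_b, score_a, k=32):
    """score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta

# Hypothetical models and judgements: (model A, model B, score for A).
ratings = {"model_a": 1000.0, "model_b": 1000.0}
judgements = [("model_a", "model_b", 1.0),
              ("model_a", "model_b", 1.0),
              ("model_a", "model_b", 0.0)]

for a, b, score_a in judgements:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)

print(ratings)
```

A win against an equally rated opponent moves both ratings by K/2, and an upset against a higher-rated model moves them by more.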

7. Related Works

- Quantization of Large Language Models

- Finetuning with Adapters

- Instruction Finetuning

To help a pretrained LLM follow the instructions provided in a prompt, instruction finetuning uses input-output pairs of various data sources to finetune a pretrained LLM to generate the output given the input as a prompt.

- Chatbots

We do not use reinforcement learning, but our best model, Guanaco, is finetuned on multi-turn chat interactions from the Open Assistant dataset which was designed to be used for RLHF training*.

* RLHF training (Reinforcement Learning from Human Feedback): an ML technique that optimizes a model using human feedback so that it learns more efficiently. It trains AI systems to respond in a more human-like way.

8. Limitations and Discussion

- we did not establish that QLORA can match full 16-bit finetuning performance at 33B and 65B scales.

- we did not evaluate on other benchmarks such as BigBench, RAFT, and HELM, and it is not ensured that our evaluations generalize to these benchmarks. On the other hand, we perform a very broad study on MMLU and develop new methods for evaluating chatbots.

- another limitation is that we only do a limited responsible AI evaluation of Guanaco.

- we did not evaluate different bit-precisions, such as using 3-bit base models, or different adapter methods

9. Broader Impacts

- Our QLORA finetuning method is the first method that enables the finetuning of 33B parameter models on a single consumer GPU and 65B parameter models on a single professional GPU, while not degrading performance relative to a full finetuning baseline. We have demonstrated that our best 33B model trained on the Open Assistant dataset can rival ChatGPT on the Vicuna benchmark.

- Another potential source of impact is deployment to mobile phones. We believe our QLORA method might enable the critical milestone of enabling the finetuning of LLMs on phones and other low resource settings.

cf)

https://aws.amazon.com/ko/what-is/reinforcement-learning-from-human-feedback/

https://huggingface.co/blog/4bit-transformers-bitsandbytes

https://jaeyung1001.tistory.com/entry/bf16-fp16-fp32%EC%9D%98-%EC%B0%A8%EC%9D%B4%EC%A0%90

https://training.continuumlabs.ai/training/the-fine-tuning-process/parameter-efficient-fine-tuning/the-quantization-constant

https://wikidocs.net/232761

https://devocean.sk.com/blog/techBoardDetail.do?ID=164779&boardType=techBlog

https://www.databricks.com/kr/blog/efficient-fine-tuning-lora-guide-llms

https://www.sciencedirect.com/topics/engineering/quantization-bin

https://guanaco-model.github.io/

https://pytorch.org/blog/introduction-to-quantization-on-pytorch/

https://pytorch.org/docs/stable/quantization.html

https://m.post.naver.com/viewer/postView.nhn?volumeNo=19437431&memberNo=20717909

[AI] RAG Basic Theory & Practice (3)

후__아 — Mon, 5 Aug 2024 17:26:02 +0900

The last RAG write-up, continuing from [AI] RAG Basic Theory & Practice (2) (https://hoooa.tistory.com/67)!

4. Vector Store

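A system (DB) that stores and searches embedding vectors efficiently

- Vector storage: needs a storage structure that can handle high-dimensional embedding vectors (text/images/audio, etc.)

- Vector search: finding the stored vectors most similar to the user query, ex) cosine similarity/Euclidean distance

- Returning the results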
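
The three steps can be sketched without any library. This toy store keeps (text, vector) pairs in a list and does a brute-force cosine search; real vector stores such as Chroma or FAISS use index structures instead. The class name, the 3-dimensional vectors, and the labels are all made up for illustration.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean_distance(a, b):
    return math.dist(a, b)

class ToyVectorStore:
    """Brute-force store: keeps every vector and compares the query against all of them."""

    def __init__(self):
        self.items = []                      # list of (text, vector)

    def add(self, text, vector):             # 1) store
        self.items.append((text, vector))

    def search(self, query_vector, k=2):     # 2) search by cosine similarity
        scored = [(cosine_similarity(query_vector, vec), text) for text, vec in self.items]
        scored.sort(reverse=True)
        return [(text, score) for score, text in scored[:k]]   # 3) return the top k

store = ToyVectorStore()
store.add("gojoseon", [0.9, 0.1, 0.0])       # hypothetical embeddings
store.add("alibaba", [0.1, 0.9, 0.1])
store.add("agents", [0.0, 0.2, 0.9])

results = store.search([1.0, 0.2, 0.0], k=2)
print(results)
```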

1. Chroma

- Can store embeddings/metadata, embed documents/queries, and search embeddings

(1) Similarity search

### Chroma - similarity search
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

path = '/content/drive/MyDrive/재정정보경진대회/'  # base directory, the same one defined in part (2)
loader = TextLoader(path + 'test.txt')
data = loader.load()

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=250,
    chunk_overlap=50,
    encoding_name='cl100k_base'
)

texts = text_splitter.split_text(data[0].page_content)
embeddings_model = OpenAIEmbeddings()
db = Chroma.from_texts(
    texts, 
    embeddings_model,
    collection_name = 'test',
    persist_directory = path,
    collection_metadata = {'hnsw:space': 'cosine'}, # l2 is the default
)

print(texts[0])
db

한국의 역사는 수천 년에 걸쳐 이어져 온 긴 여정 속에서 다양한 문화와 전통이 형성되고 발전해 왔습니다. 고조선에서 시작해 삼국 시대의 경쟁, 그리고 통일 신라와 고려를 거쳐 조선까지, 한반도는 많은 변화를 겪었습니다.

고조선은 기원전 2333년 단군왕검에 의해 세워졌다고 전해집니다. 이는 한국 역사상 최초의 국가로, 한민족의 시원이라 할 수 있습니다. 이후 기원전 1세기경에는 한반도와 만주 일대에서 여러 소국이 성장하며 삼한 시대로 접어듭니다.
<langchain_community.vectorstores.chroma.Chroma at 0x7ac8a7e18ac0>

query = '한국의 최초 국가는 어디인가요?'
docs = db.similarity_search(query)
print(docs[0].page_content)

WARNING:chromadb.segment.impl.vector.local_persistent_hnsw:Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3
한국의 역사는 수천 년에 걸쳐 이어져 온 긴 여정 속에서 다양한 문화와 전통이 형성되고 발전해 왔습니다. 고조선에서 시작해 삼국 시대의 경쟁, 그리고 통일 신라와 고려를 거쳐 조선까지, 한반도는 많은 변화를 겪었습니다.

고조선은 기원전 2333년 단군왕검에 의해 세워졌다고 전해집니다. 이는 한국 역사상 최초의 국가로, 한민족의 시원이라 할 수 있습니다. 이후 기원전 1세기경에는 한반도와 만주 일대에서 여러 소국이 성장하며 삼한 시대로 접어듭니다.

(2) MMR

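Maximal Marginal Relevance (MMR) search

Balances similarity and diversity → better search results

Picks documents that are highly relevant to the query while covering different aspects/information (chosen from the fetch_k most similar documents)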
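
The selection rule can be written out in plain Python. This is a sketch of the general MMR idea, not LangChain's actual implementation; the parameter names `k`, `fetch_k`, and `lambda_mult` mirror the ones used in the calls below, and the 2-dimensional vectors are made up.

```python
import math

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

def mmr_select(query_vec, doc_vecs, k=2, fetch_k=3, lambda_mult=0.5):
    """Return indices of k documents chosen by maximal marginal relevance."""
    # 1) Candidate pool: the fetch_k documents most similar to the query.
    by_similarity = sorted(range(len(doc_vecs)),
                           key=lambda i: cosine(query_vec, doc_vecs[i]),
                           reverse=True)
    candidates = by_similarity[:fetch_k]
    selected = []
    # 2) Greedily add the candidate with the best relevance/redundancy trade-off.
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical embeddings: documents 0 and 1 are near-duplicates.
docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8], [-1.0, 0.0]]
query = [1.0, 0.1]

print(mmr_select(query, docs, lambda_mult=0.5))  # skips the near-duplicate
print(mmr_select(query, docs, lambda_mult=1.0))  # pure similarity ranking
```

With `lambda_mult=1.0` the redundancy term vanishes and the result is the same as a plain similarity search; lowering it trades relevance for diversity.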

### Chroma - MMR

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma


loader = PyMuPDFLoader(path+'SPRI_AI_Brief_2023년12월호_F.pdf')
data = loader.load()
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000,
    chunk_overlap=200,
    encoding_name='cl100k_base'
)

documents = text_splitter.split_documents(data)
print(len(documents))

embeddings_model = OpenAIEmbeddings()
db2 = Chroma.from_documents(
    documents, 
    embeddings_model,
    collection_name = 'esg',
    persist_directory = path,
    collection_metadata = {'hnsw:space': 'cosine'}, # l2 is the default
)

db2

<langchain_community.vectorstores.chroma.Chroma at 0x7ac8a426e980>

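# MMR
# The query has to be set before searching; the output below corresponds to this question
query = '통이치엔원의 세부내용을 알려줘?'   # "Tell me the details of Tongyi Qianwen"
# From the 10 most similar documents, pick 4 that provide different information
mmr_docs = db2.max_marginal_relevance_search(query, k=4, fetch_k=10)
print(len(mmr_docs))
print(mmr_docs[0].page_content)


## With a plain similarity search instead
# docs = db2.similarity_search(query)
# print(docs[0].page_content)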

4
1. 정책/법제  
2. 기업/산업 
3. 기술/연구 
 4. 인력/교육
알리바바 클라우드, 최신 LLM ‘통이치엔원 2.0’ 공개
n 알리바바 클라우드가 복잡한 지침 이해, 광고문구 작성, 추론, 암기 등에서 성능이 향상된 최신 
LLM ‘통이치엔원 2.0’을 공개
n 알리바바 클라우드는 산업별로 특화된 생성 AI 모델을 공개하는 한편, 모델 개발과 애플리케이션 
구축 절차를 간소화하는 올인원 AI 모델 구축 플랫폼도 출시
KEY Contents
£ 알리바바의 통이치엔원 2.0, 주요 벤치마크 테스트에서 여타 LLM 능가
n 중국의 알리바바 클라우드가 2023년 10월 31일 열린 연례 기술 컨퍼런스에서 최신 LLM ‘통이
치엔원(Tongyi Qianwen) 2.0’을 공개
∙알리바바 클라우드는 통이치엔원 2.0이 2023년 4월 출시된 1.0 버전보다 복잡한 지침 이해, 
광고문구 작성, 추론, 암기 등에서 성능이 향상되었다고 설명
∙통이치엔원 2.0은 언어 이해 테스트(MMLU), 수학(GSM8k), 질문 답변(ARC-C)과 같은 벤치마크 
테스트에서 라마(Llama-2-70B)와 GPT-3.5를 비롯한 주요 AI 모델을 능가 
∙통이치엔원 2.0은 알리바바 클라우드의 웹사이트와 모바일 앱을 통해 대중에 제공되며 개발자는 
API를 통해 사용 가능 
n 알리바바 클라우드는 여러 산업 영역에서 생성 AI를 활용해 사업 성과를 개선할 수 있도록 지원
하는 산업별 모델도 출시
∙산업 영역은 고객지원, 법률 상담, 의료, 금융, 문서관리, 오디오와 동영상 관리, 코드 개발, 캐릭터 
제작을 포함
n 알리바바 클라우드는 급증하는 생성 AI 수요에 대응해 모델 개발과 애플리케이션 구축 절차를 
간소화하는 올인원 AI 모델 구축 플랫폼 ‘젠AI(GenAI)’도 공개
∙이 플랫폼은 데이터 관리, 모델 배포와 평가, 신속한 엔지니어링을 위한 종합 도구 모음을 제공하여 
다양한 기업들이 맞춤형 AI 모델을 한층 쉽게 개발할 수 있도록 지원

print(mmr_docs[-1].page_content)   # the last document picked by MMR (usually the least similar of the four)

£ AI 에이전트가 의료와 교육, 생산성, 엔터테인먼트·쇼핑 영역의 서비스 대중화를 주도할 것
n 에이전트로 인해 주목할 만한 변화는 고비용 서비스의 대중화로 특히 △의료 △교육 △생산성 △
엔터테인먼트·쇼핑의 4개 영역에서 대규모 변화 예상
∙(의료) 에이전트가 환자 분류를 지원하고 건강 문제에 대한 조언을 제공하며 치료의 필요 여부를 결정하면서 
의료진의 의사결정과 생산성 향상에 기여
∙(교육) 에이전트가 1대 1 가정교사의 역할을 맡아 모든 학생에게 평등한 교육 기회를 제공할 수 있으며, 
아이가 좋아하는 게임이나 노래 등을 활용해 시청각 기반의 풍부한 맞춤형 교육 경험을 제공
∙(생산성) 사용자의 아이디어를 기반으로 에이전트가 사업계획과 발표 자료 작성, 제품 이미지 생성을 
지원하며, 임원의 개인 비서와 같은 역할도 수행 
∙(엔터테인먼트·쇼핑) 쇼핑 시 에이전트가 모든 리뷰를 읽고 요약해 최적의 제품을 추천하고 사용자 대신 
주문할 수 있으며 사용자의 관심사에 맞춤화된 뉴스와 엔터테인먼트를 구독 가능
☞ 출처 : GatesNotes, AI is about to completely change how you use computers, 2023.11.09.

2. FAISS

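Facebook AI Similarity Search; uses compressed representations of vectors - memory use↓ search speed↑

(1) Similarity search

- l2 (default): Euclidean distance

- ip (inner product): reflects how well the directions of two vectors align (and their magnitudes)

- cosine: the smaller the angle, the higher the similarity

# pip install faiss-cpu sentence-transformers

### FAISS - similarity search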
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings_model = HuggingFaceEmbeddings(
    model_name='jhgan/ko-sbert-nli',
    model_kwargs={'device':'cpu'},
    encode_kwargs={'normalize_embeddings':True},
)

vectorstore = FAISS.from_documents(documents,
                                   embedding = embeddings_model,
                                   distance_strategy = DistanceStrategy.COSINE
                                  )

query = 'your question here'   # placeholder
docs = vectorstore.similarity_search(query)
print(len(docs))
print(docs[0].page_content)

mmr_docs = vectorstore.max_marginal_relevance_search(query, k=4, fetch_k=10)
print(len(mmr_docs))
print(mmr_docs[0].page_content)

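## Save the FAISS DB locally
vectorstore.save_local(path+'faiss')

db3 = FAISS.load_local(path+'faiss', embeddings_model)   # must be the same embedding model that was used when saving
# Note: newer langchain_community releases additionally require allow_dangerous_deserialization=True here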

5. Retriever

The search tool of Retrieval Augmented Generation

LangChain provides a variety of retrievers

1. Vector Store Retriever

Efficient search over large amounts of text data

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain_community.embeddings import HuggingFaceEmbeddings

# Load the data and split it into chunks
loader = PyMuPDFLoader(path+'SPRI_AI_Brief_2023년12월호_F.pdf')
data = loader.load()
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000,
    chunk_overlap=200,
    encoding_name='cl100k_base'
)

documents = text_splitter.split_documents(data)

# Embed and store
embeddings_model = HuggingFaceEmbeddings(
    model_name='jhgan/ko-sbert-nli',
    model_kwargs={'device':'cpu'},
    encode_kwargs={'normalize_embeddings':True},
)
vectorstore = FAISS.from_documents(documents,
                                   embedding = embeddings_model,
                                   distance_strategy = DistanceStrategy.COSINE  
                                  )

# Single-result search
query = '통이치엔원의 세부내용을 알려줘'   # "Tell me the details of Tongyi Qianwen"
retriever = vectorstore.as_retriever(search_kwargs={'k': 1})   # only the single most similar chunk
docs = retriever.get_relevant_documents(query)
print("*******************Single-result search*******************")
print(len(docs))
print(docs[0])

# MMR search
retriever = vectorstore.as_retriever(
    search_type='mmr',
    search_kwargs={'k': 5, 'fetch_k': 50}
)
docs = retriever.get_relevant_documents(query)
print("*******************MMR search*******************")
print(len(docs))
print(docs[0])

# MMR search 2
retriever = vectorstore.as_retriever(
    search_type='mmr',
    search_kwargs={'k': 5, 'lambda_mult': 0.15}   # lambda_mult: relevance-diversity balance; smaller values favor diversity
)
docs = retriever.get_relevant_documents(query)
print(len(docs))
print(docs[-1])

# Similarity score threshold search
# (only documents whose score is at or above the threshold are returned)
retriever = vectorstore.as_retriever(
    search_type='similarity_score_threshold',
    search_kwargs={'score_threshold': 0.3}  # only documents with a similarity of at least 0.3 to the query
)
docs = retriever.get_relevant_documents(query)
print(len(docs))

# Metadata filtering
retriever = vectorstore.as_retriever(
    search_kwargs={'filter': {'format':'PDF 1.4'}}
)
docs = retriever.get_relevant_documents(query)
print(len(docs))

# Generating the actual answer
# retrieval - prompt - model - document formatting - chain - run

from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# retrieval
retriever = vectorstore.as_retriever(
    search_type = 'mmr',
    search_kwargs = {'k': 5, 'lambda_mult': 0.15}
)
docs = retriever.get_relevant_documents(query)

# prompt
template = '''Answer the question based only on the following context:
{context}
Question: {question}
'''
prompt = ChatPromptTemplate.from_template(template)

# model
llm = ChatOpenAI(
    model = 'gpt-3.5-turbo-0125',
    temperature = 0,
    max_tokens = 500,
)

def format_docs(docs):
  return '\n\n'.join([d.page_content for d in docs])

# chain
chain = prompt | llm | StrOutputParser()

# run
response = chain.invoke({'context': (format_docs(docs)), 'question': query})

2. Multi Query Retriever

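Overcomes the limits of the Vector Store Retriever

Captures the meaning of the input query from several angles == automatically generates multiple queries from different perspectives out of a single query

└ The LLM paraphrases the input sentence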
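
The flow can be sketched with stand-ins for the LLM and the retriever. Everything here is hypothetical: `fake_generate_queries` returns hard-coded paraphrases where a real setup would call an LLM, and `fake_retrieve` is a keyword match over a made-up corpus where a real setup would query a vector store. The part that matters is running every generated query and keeping the unique union of the results.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Run every generated query and return the unique union of retrieved documents."""
    seen, unique_docs = set(), []
    for q in generate_queries(question):
        for doc in retrieve(q):
            if doc not in seen:          # drop documents already found by another query
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

# Hypothetical corpus and stand-in components.
CORPUS = [
    "Alibaba Cloud released Tongyi Qianwen 2.0",
    "Alibaba Cloud also launched the GenAI platform",
    "AI agents will change how people use computers",
]

def fake_generate_queries(question):
    # A real implementation would ask an LLM to paraphrase `question`.
    return ["Tongyi Qianwen release", "Alibaba Cloud LLM", "Alibaba Cloud platform"]

def fake_retrieve(query):
    # Keyword overlap instead of embedding similarity.
    words = set(query.lower().split())
    return [doc for doc in CORPUS if words & set(doc.lower().split())]

found = multi_query_retrieve("Tell me about Tongyi Qianwen",
                             fake_generate_queries, fake_retrieve)
print(found)
```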

# The example downloaded an embedding model from huggingface separately and used FAISS, but that did not seem to suit the document I was testing with, so I changed the setup

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma


# Load the data and split it into chunks
loader = PyMuPDFLoader(path+'SPRI_AI_Brief_2023년12월호_F.pdf')
data = loader.load()
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000,
    chunk_overlap=200,
    encoding_name='cl100k_base'
)

documents = text_splitter.split_documents(data)

embeddings_model = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents, 
    embeddings_model,
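    collection_name = 'test',   # note: same name and persist_directory as the first Chroma example, so the test.txt chunks stored there can also show up in results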
    persist_directory = path,
    collection_metadata = {'hnsw:space': 'cosine'}, # l2 is the default
)

from langchain.retrievers.multi_query import MultiQueryRetriever

quest = '통이치엔원에 대해 알려줘'

llm = ChatOpenAI(
    model='gpt-3.5-turbo-0125',
    temperature=0,
    max_tokens=500,
)
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever = vectorstore.as_retriever(), llm = llm
)

# Logging setup: record & inspect the generated multi-queries in the log
import logging
logging.basicConfig()
logging.getLogger('langchain.retrievers.multi_query').setLevel(logging.INFO)

unique_docs = retriever_from_llm.get_relevant_documents(query=quest)
print(len(unique_docs))
print(unique_docs[0])

INFO:langchain.retrievers.multi_query:Generated queries: ['1. 어떤 정보가 통이치엔원에 대해 있는가?', '2. 통이치엔원에 관련된 자료를 찾아볼까요?', '3. 통이치엔원에 대한 내용을 알고 싶어요.']
7
page_content='1. 정책/법제  
2. 기업/산업 
3. 기술/연구 
 4. 인력/교육
알리바바 클라우드, 최신 LLM ‘통이치엔원 2.0’ 공개
n 알리바바 클라우드가 복잡한 지침 이해, 광고문구 작성, 추론, 암기 등에서 성능이 향상된 최신 
LLM ‘통이치엔원 2.0’을 공개
n 알리바바 클라우드는 산업별로 특화된 생성 AI 모델을 공개하는 한편, 모델 개발과 애플리케이션 
구축 절차를 간소화하는 올인원 AI 모델 구축 플랫폼도 출시
KEY Contents
£ 알리바바의 통이치엔원 2.0, 주요 벤치마크 테스트에서 여타 LLM 능가
n 중국의 알리바바 클라우드가 2023년 10월 31일 열린 연례 기술 컨퍼런스에서 최신 LLM ‘통이
치엔원(Tongyi Qianwen) 2.0’을 공개
∙알리바바 클라우드는 통이치엔원 2.0이 2023년 4월 출시된 1.0 버전보다 복잡한 지침 이해, 
광고문구 작성, 추론, 암기 등에서 성능이 향상되었다고 설명
∙통이치엔원 2.0은 언어 이해 테스트(MMLU), 수학(GSM8k), 질문 답변(ARC-C)과 같은 벤치마크 
테스트에서 라마(Llama-2-70B)와 GPT-3.5를 비롯한 주요 AI 모델을 능가 
∙통이치엔원 2.0은 알리바바 클라우드의 웹사이트와 모바일 앱을 통해 대중에 제공되며 개발자는 
API를 통해 사용 가능 
n 알리바바 클라우드는 여러 산업 영역에서 생성 AI를 활용해 사업 성과를 개선할 수 있도록 지원
하는 산업별 모델도 출시
∙산업 영역은 고객지원, 법률 상담, 의료, 금융, 문서관리, 오디오와 동영상 관리, 코드 개발, 캐릭터 
제작을 포함
n 알리바바 클라우드는 급증하는 생성 AI 수요에 대응해 모델 개발과 애플리케이션 구축 절차를 
간소화하는 올인원 AI 모델 구축 플랫폼 ‘젠AI(GenAI)’도 공개
∙이 플랫폼은 데이터 관리, 모델 배포와 평가, 신속한 엔지니어링을 위한 종합 도구 모음을 제공하여 
다양한 기업들이 맞춤형 AI 모델을 한층 쉽게 개발할 수 있도록 지원' metadata={'author': 'dj', 'creationDate': "D:20231208132838+09'00'", 'creator': 'Hwp 2018 10.0.0.13462', 'file_path': '/content/drive/MyDrive/재정정보경진대회/SPRI_AI_Brief_2023년12월호_F.pdf', 'format': 'PDF 1.4', 'keywords': '', 'modDate': "D:20231208132838+09'00'", 'page': 11, 'producer': 'Hancom PDF 1.3.0.542', 'source': '/content/drive/MyDrive/재정정보경진대회/SPRI_AI_Brief_2023년12월호_F.pdf', 'subject': '', 'title': '', 'total_pages': 23, 'trapped': ''}

from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough


# Prompt
template = '''Answer the question based only on the following context:
{context}

Question: {question}
'''

prompt = ChatPromptTemplate.from_template(template)

# Model
llm = ChatOpenAI(
    model='gpt-3.5-turbo-0125',
    temperature=0,
)

def format_docs(docs):
    return '\n\n'.join([d.page_content for d in docs])

# Chain
chain = (
    {'context': retriever_from_llm | format_docs, 'question': RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Run
response = chain.invoke('통이치엔원에 대해 요약해서 알려주세요.')
response

INFO:langchain.retrievers.multi_query:Generated queries: ['1. 요약해서 통이치엔원에 대한 정보를 알려드릴까요?', '2. 통이치엔원에 대한 간략한 설명을 드릴까요?', '3. 통이치엔원에 대한 요약 정보를 제공해 드릴까요?']
알리바바 클라우드가 최신 LLM '통이치엔원 2.0'을 공개했는데, 이는 복잡한 지침 이해, 광고문구 작성, 추론, 암기 등에서 성능이 향상된 AI 모델이다. 이 모델은 다양한 벤치마크 테스트에서 다른 주요 AI 모델을 능가하며, 산업별로 특화된 생성 AI 모델을 제공하고 올인원 AI 모델 구축 플랫폼도 출시했다.

3. Contextual compression

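Extracts and returns only the information relevant to the query from the retrieved documents

i.e. it removes the irrelevant information

First set up a base retriever,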

quest = '통이치엔원에 대해 알려줘'

llm = ChatOpenAI(
    model='gpt-3.5-turbo-0125',
    temperature=0,
    max_tokens=500,
)
base_retriever = vectorstore.as_retriever(
                                search_type='mmr',
                                search_kwargs={'k':7, 'fetch_k': 20})
docs = base_retriever.get_relevant_documents(quest)

then compress the retrieved documents

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=base_retriever
)

compressed_docs = compression_retriever.get_relevant_documents(quest)
print(len(compressed_docs))

[Document(metadata={'author': 'dj', 'creationDate': "D:20231208132838+09'00'", 'creator': 'Hwp 2018 10.0.0.13462', 'file_path': '/content/drive/MyDrive/재정정보경진대회/SPRI_AI_Brief_2023년12월호_F.pdf', 'format': 'PDF 1.4', 'keywords': '', 'modDate': "D:20231208132838+09'00'", 'page': 11, 'producer': 'Hancom PDF 1.3.0.542', 'source': '/content/drive/MyDrive/재정정보경진대회/SPRI_AI_Brief_2023년12월호_F.pdf', 'subject': '', 'title': '', 'total_pages': 23, 'trapped': ''}, page_content='알리바바 클라우드, 최신 LLM ‘통이치엔원 2.0’ 공개\n알리바바 클라우드가 복잡한 지침 이해, 광고문구 작성, 추론, 암기 등에서 성능이 향상된 최신 \nLLM ‘통이치엔원 2.0’을 공개\n알리바바 클라우드는 산업별로 특화된 생성 AI 모델을 공개하는 한편, 모델 개발과 애플리케이션 \n구축 절차를 간소화하는 올인원 AI 모델 구축 플랫폼도 출시\n알리바바의 통이치엔원 2.0, 주요 벤치마크 테스트에서 여타 LLM 능가\n중국의 알리바바 클라우드가 2023년 10월 31일 열린 연례 기술 컨퍼런스에서 최신 LLM ‘통이\n치엔원(Tongyi Qianwen) 2.0’을 공개\n알리바바 클라우드는 통이치엔원 2.0이 2023년 4월 출시된 1.0 버전보다 복잡한 지침 이해, \n광고문구 작성, 추론, 암기 등에서 성능이 향상되었다고 설명\n통이치엔원 2.0은 언어 이해 테스트(MMLU), 수학(GSM8k), 질문 답변(ARC-C)과 같은 벤치마크 \n테스트에서 라마(Llama-2-70B)와 GPT-3.5를 비롯한 주요 AI 모델을'),
 Document(page_content='한국전쟁이 발발하여 큰 피해를 입었습니다. 전쟁 후 남한은 빠른 경제 발전을 이루며 오늘날에 이르렀습니다.'),
 Document(metadata={'author': 'dj', 'creationDate': "D:20231208132838+09'00'", 'creator': 'Hwp 2018 10.0.0.13462', 'file_path': '/content/drive/MyDrive/재정정보경진대회/SPRI_AI_Brief_2023년12월호_F.pdf', 'format': 'PDF 1.4', 'keywords': '', 'modDate': "D:20231208132838+09'00'", 'page': 7, 'producer': 'Hancom PDF 1.3.0.542', 'source': '/content/drive/MyDrive/재정정보경진대회/SPRI_AI_Brief_2023년12월호_F.pdf', 'subject': '', 'title': '', 'total_pages': 23, 'trapped': ''}, page_content='FTC는 아마존 AI 비서 ‘알렉사(Alexa)’와 스마트홈 보안 기기 ‘링(Ring)’이 소비자의 사적 \n정보를 알고리즘 훈련에 사용하여 프라이버시를 침해한 혐의를 조사하는 등 법적 권한을 활용해 AI \n관련 불법 행위에 대처하고 있음\n* FTC는 2023년 5월 31일 동의를 받지 않고 어린이들의 음성과 위치 정보를 활용한 ‘알렉사’와 고객의 사적 영상에 대하여 \n직원에게 무제한 접근 권한을 부여한 ‘링’에 3,080만 달러(약 420억 원)의 과징금을 부과')]

cf)

https://wikidocs.net/231364

[error]ModuleNotFoundError: No module named 'pillow_heif'

후__아 — Thu, 1 Aug 2024 17:46:00 +0900

An error I ran into while practicing with UnstructuredPDFLoader in https://hoooa.tistory.com/67,,

from langchain_community.document_loaders import UnstructuredPDFLoader

pdf = '/content/drive/MyDrive/재정정보경진대회/data/train_source/1-1 2024 주요 재정통계 1권.pdf'
loader = UnstructuredPDFLoader(pdf, mode='elements')
pages = loader.load()

print(len(pages))
pages[20].page_content[:10]

I had never heard of the pillow_heif module.. and searching did not help, until

pip install unstructured[all-docs]

I saw a suggestion to reinstall this one, did so, and it worked,,, haha

The Python Unstructured library

Converts unstructured data → structured data

PDF, HTML, JSON, XML, etc.

'pip install unstructured[file type]' # or [all-docs]

- Widely used as a data loader

ex) from langchain_unstructured import UnstructuredLoader

from langchain_community.document_loaders import UnstructuredCSVLoader

cf)

https://python.langchain.com/v0.2/docs/integrations/providers/unstructured/

[AI]RAG Basic Theory & Practice (2)

후__아 — Thu, 1 Aug 2024 17:31:33 +0900

Let's go one level deeper into the basic RAG pipeline (Data Load, Text Split, Indexing, Retrieval, Generation) summarized in https://hoooa.tistory.com/65.

1. Data Load

Depending on the format of the data you want to load, you can use a variety of Document Loaders.

Web documents: WebBaseLoader

Fetches documents from specific web pages

import bs4
from langchain_community.document_loaders import WebBaseLoader

url1 = "https://blog.langchain.dev/week-of-1-22-24-langchain-release-notes/"
url2 = "https://blog.langchain.dev/week-of-2-5-24-langchain-release-notes/"

loader = WebBaseLoader(
    web_paths=(url1, url2),	# URLs of the web pages to load (a sequence of strings)
    bs_kwargs=dict(
        parse_only = bs4.SoupStrainer( # extract only HTML elements with these class names
            class_ = ("article-header", "article-content")
        )
    ),
)
docs = loader.load()
print(len(docs))
docs[0]

Text documents: TextLoader

# Text document
path = '/content/drive/MyDrive/재정정보경진대회/'
from langchain_community.document_loaders import TextLoader

loader = TextLoader(path + 'test.txt')
data = loader.load()

print(type(data))
data[0].page_content

<class 'list'>
안녕하세요~ 테스트 load 테스트 하고 있습니다~

Folders: DirectoryLoader

CSV files: CSVLoader

PDF

- PyPDFLoader (loads a PDF page by page) / UnstructuredPDFLoader (PDFs without a fixed format) / PyMuPDFLoader (detailed metadata) / OnlinePDFLoader (PDFs hosted online) / PyPDFDirectoryLoader (every PDF in a given folder)

!pip install -q pypdf
#!pip install unstructured unstructured-inference
!pip install unstructured[all-docs]
!pip install pymupdf

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.document_loaders import OnlinePDFLoader
from langchain_community.document_loaders import PyPDFDirectoryLoader

pdf = 'test.pdf'

# PyPDFLoader
loader = PyPDFLoader(pdf)
pages = loader.load()

print(len(pages))
print(pages[20])

# UnstructuredPDFLoader
loader = UnstructuredPDFLoader(pdf, mode='elements')	# elements: text chunks stay separated, close to the original layout
pages = loader.load()

# PyMuPDFLoader
loader = PyMuPDFLoader(pdf)
pages = loader.load()

# OnlinePDFLoader
loader = OnlinePDFLoader("https://arxiv.org/pdf/1706.03762.pdf")    # the Transformer paper
pages = loader.load()
pages[0].page_content[:1000]

# PyPDFDirectoryLoader
loader = PyPDFDirectoryLoader('./')
data = loader.load()

+ Fix for the UnstructuredPDFLoader error: https://hoooa.tistory.com/68

2. Text Split

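Long documents are split into chunks to fit the LLM's input token limit

- Each chunk should carry a self-contained meaning

- The chunk size can be tuned with the LLM's input size and cost in mind

CharacterTextSplitter

Splits text into chunks on a given separator string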

from langchain_community.document_loaders import TextLoader

loader = TextLoader(path+'test.txt')
data = loader.load()

from langchain_text_splitters import CharacterTextSplitter
ts = CharacterTextSplitter(
    separator = '',       # separator used to split chunks
    chunk_size = 500,     # maximum chunk length
    chunk_overlap = 100,  # number of characters shared between adjacent chunks
    length_function = len,  # function used to measure chunk length
)

texts = ts.split_text(data[0].page_content)
print(len(texts))
print(len(texts[0]))


text_splitter = CharacterTextSplitter(
    separator = '\n',   # split chunks on newline characters
    chunk_size = 500,
    chunk_overlap  = 100,
    length_function = len,
)

texts = text_splitter.split_text(data[0].page_content)
texts[0]

한국의 역사는 수천 년에 걸쳐 이어져 온 긴 여정 속에서 다양한 문화와 전통이 형성되고 발전해 왔습니다. 고조선에서 시작해 삼국 시대의 경쟁, 그리고 통일 신라와 고려를 거쳐 조선까지, 한반도는 많은 변화를 겪었습니다.\n고조선은 기원전 2333년 단군왕검에 의해 세워졌다고 전해집니다. 이는 한국 역사상 최초의 국가로, 한민족의 시원이라 할 수 있습니다. 이후 기원전 1세기경에는 한반도와 만주 일대에서 여러 소국이 성장하며 삼한 시대로 접어듭니다.\n...<중략>...\n해방 후 한반도는 남북으로 분단되어 각각 다른 정부가 수립되었고, 1950년에는 한국전쟁이 발발하여 큰 피해를 입었습니다. 전쟁 후 남한은 빠른 경제 발전을 이루며 오늘날에 이르렀습니다.
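
To make chunk_size and chunk_overlap concrete, here is a minimal pure-Python sketch. It is not how CharacterTextSplitter works internally (that splits on the separator and then merges pieces); the helper name sliding_chunks is made up for illustration and only shows what "size" and "overlap" mean for neighbouring chunks.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size that share chunk_overlap characters.

    Hypothetical helper for illustration only, not a LangChain API.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window moves forward
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

Each chunk repeats the last two characters of the previous one, which is the same idea as chunk_overlap = 100 above, just at a smaller scale.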

RecursiveCharacterTextSplitter

Splits text recursively, trying a list of separators in order, so that semantically related pieces end up in the same chunk

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap  = 100,
    length_function = len,
)

texts = text_splitter.split_text(data[0].page_content)
print(len(texts[0]), len(texts[1]))
texts[0]
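
The recursive idea can be sketched in plain Python. This is a simplified illustration under my own assumptions, not LangChain's implementation: it ignores chunk_overlap, and the function name recursive_split is made up. The separator order ("\n\n", "\n", " ", "") mirrors the splitter's default list.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text so no chunk exceeds chunk_size, preferring coarse separators.

    Simplified sketch: no overlap handling, hypothetical helper name.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbours into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb\ncccc dddd\n\neeee"
parts = recursive_split(sample, chunk_size=10)
print(parts)
```

Paragraph breaks are tried first, then line breaks, then spaces, so text that belongs together tends to stay in one chunk.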

Tokenizer

Splits by token count

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=600,
    chunk_overlap=200,
    encoding_name='cl100k_base'
)

docs = text_splitter.split_documents(data)

3. Embedding

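Text → numeric vectors

- Handles text data in a vector space: enables similarity calculation between texts and machine learning/NLP

- Uses

ㄴSemantic search: finds semantically similar text and highly relevant documents/information

ㄴDocument classification: assigns documents to a category/topic

ㄴText similarity calculation

Embedding model 1. OpenAIEmbeddings

embed_documents (a list of documents), embed_query (a single query)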

from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()
embeddings = embeddings_model.embed_documents(
    [
        '안녕하세요!',
        '어! 오랜만이에요',
        '이름이 어떻게 되세요?',
        '날씨가 추워요',
        'Hello LLM!'
    ]
)

# number of texts (documents that were embedded), vector dimension of the first document
print(len(embeddings), len(embeddings[0]))
embeddings[0][:10]

5 1536
[-0.010432514362037182,
 -0.013580637983977795,
 -0.0064862752333283424,
 -0.018673377111554146,
 -0.018267985433340073,
 0.01667175441980362,
 -0.009222672320902348,
 0.003898732829838991,
 -0.00743641285225749,
 0.010071462020277977]

# embed_query: embeds a single query string
embedded_query = embeddings_model.embed_query('첫인사를 하고 이름을 물어봤나요?')
embedded_query[:10]

[0.003605559002608061,
 -0.024263586848974228,
 0.010929940268397331,
 -0.04110211506485939,
 -0.004533691331744194,
 0.021859880536794662,
 -0.004130976274609566,
 0.020613981410861015,
 -0.006814695429056883,
 0.007387306075543165]

Measuring the similarity between these documents and the query:

# Cosine similarity (-1 ~ 1)
# Measure similarity between the documents above and the query
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cos_sim(a, b):
  return dot(a,b) / (norm(a) * norm(b))

for embedding in embeddings:
  print(cos_sim(embedding, embedded_query))

0.8348635137337618
0.8153783857089105
0.8844739248939817
0.7899103053431074
0.7468845030598241

The third document ('이름이 어떻게 되세요?', "What is your name?") has the highest similarity to the query ('첫인사를 하고 이름을 물어봤나요?', "Did they greet and ask for a name?")
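
Picking the best match by eye gets tedious with more documents, so here is a small sketch that ranks them. The scores are copied from the output above; the variable names are my own.

```python
documents = [
    '안녕하세요!',
    '어! 오랜만이에요',
    '이름이 어떻게 되세요?',
    '날씨가 추워요',
    'Hello LLM!',
]
# Cosine similarities printed above, in the same order as the documents.
scores = [
    0.8348635137337618,
    0.8153783857089105,
    0.8844739248939817,
    0.7899103053431074,
    0.7468845030598241,
]

# Sort (score, document) pairs from most to least similar.
ranked = sorted(zip(scores, documents), reverse=True)
best_score, best_doc = ranked[0]
print(best_doc, best_score)
```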

2. HuggingFaceEmbeddings

Uses pretrained embedding models through the sentence-transformers library

### HuggingFaceEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings_model = HuggingFaceEmbeddings(
    model_name = 'jhgan/ko-sroberta-nli',   # model to use: a ko-sroberta variant for natural language inference (NLI)
    model_kwargs = {'device': 'cpu'},       # use 'cuda' for GPU
    # normalize embeddings so every vector has unit length -> more consistent similarity scores
    encode_kwargs = {'normalize_embeddings': True},
)

embeddings_model

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
), model_name='jhgan/ko-sroberta-nli', cache_folder=None, model_kwargs={'device': 'cpu'}, encode_kwargs={'normalize_embeddings': True}, multi_process=False, show_progress=False)

embeddings = embeddings_model.embed_documents(
    [
        '안녕하세요!',
        '어! 오랜만이에요',
        '이름이 어떻게 되세요?',
        '날씨가 추워요',
        'Hello LLM!'
    ]
)
embedded_query = embeddings_model.embed_query('첫인사를 하고 이름을 물어봤나요?')

for embedding in embeddings:
    print(cos_sim(embedding, embedded_query))

0.5899016189601531
0.4182631225980652
0.7240604521610333
0.05702662997392148
0.4316418328113528
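
A side note on normalize_embeddings: once vectors are scaled to unit length, cosine similarity is just the dot product. A tiny stdlib sketch with made-up 2-D vectors (not real embeddings) shows this:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale v to length 1 (L2 normalization)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cos_sim(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors chosen for easy arithmetic; real embeddings have hundreds of dimensions.
a, b = [3.0, 4.0], [4.0, 3.0]
unit_a, unit_b = l2_normalize(a), l2_normalize(b)

print(cos_sim(a, b))        # cosine similarity of the raw vectors
print(dot(unit_a, unit_b))  # plain dot product of the normalized vectors
```

Both values agree up to floating-point rounding, which is why normalized embeddings make similarity scores easier to compare.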

cf)

https://wikidocs.net/231364

[AI]RAG Basic Theory & Practice (1)

후__아 — Thu, 1 Aug 2024 16:06:59 +0900

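I had only touched on RAG very briefly before, so I need to study it properly...

I have to do this if I want to solve the competition problems properly, let's go!!! ~~(there is nowhere left to retreat)~~

Continuing from https://hoooa.tistory.com/58, I'll study and practice RAG with LangChain!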

※ RAG(Retrieval-Augmented Generation)

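Extends an existing LLM so it can provide more accurate and richer information

Retrieves external data that was not in the training data in real time (retrieval) and generates an answer from it (generation)

≫ Reduces hallucination & reflects up-to-date information

>> Basic structure

- Retrieval Phase: question/context in → search for relevant external data from a search engine, DB, etc.

- Generation Phase: retrieved information + existing knowledge → generate an answer to the given question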

1. Load Data

Load the data to use for RAG

# Load data
from langchain_community.document_loaders import WebBaseLoader
url = 'https://ko.wikipedia.org/wiki/%EC%9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC:%EC%A0%95%EC%B1%85%EA%B3%BC_%EC%A7%80%EC%B9%A8'
loader = WebBaseLoader(url)

docs = loader.load()    # web page text -> Documents
print(len(docs))
print(len(docs[0].page_content))
print(docs[0].page_content[5000:6000])

1
13153
좀 더 빠르게 강력한 수단을 이용해야 합니다. 특히 정책 문서에 명시된 원칙을 지키지 않는 것은 대부분의 경우 다른 사용자에게 받아들여지지 않습니다 (다른 분들에게 예외 상황임을 설득할 수 있다면 가능하기는 하지만요). 이는 당신을 포함해서 편집자 개개인이 정책과 지침을 직접 집행 및 적용한다는 것을 의미합니다.
특정 사용자가 명백히 정책에 반하는 행동을 하거나 정책과 상충되는 방식으로 지침을 어기는 경우, 특히 의도적이고 지속적으로 그런 행위를 하는 경우 해당 사용자는 관리자의 제재 조치로 일시적, 혹은 영구적으로 편집이 차단될 수 있습니다. 영어판을 비롯한 타 언어판에서는 일반적인 분쟁 해결 절차로 끝낼 수 없는 사안은 중재위원회가 개입하기도 합니다.

문서 내용
정책과 지침의 문서 내용은 처음 읽는 사용자라도 원칙과 규범을 잘 이해할 수 있도록 다음 원칙을 지켜야 합니다.

명확하게 작성하세요. 소수만 알아듣거나 준법률적인 단어, 혹은 지나치게 단순한 표현은 피해야 합니다. 명확하고, 직접적이고, 모호하지 않고, 구체적으로 작성하세요. 지나치게 상투적인 표현이나 일반론은 피하세요. 지침, 도움말 문서 및 기타 정보문 문서에서도 "해야 합니다" 혹은 "하지 말아야 합니다" 같이 직접적인 표현을 굳이 꺼릴 필요는 없습니다.
가능한 간결하게, 너무 단순하지는 않게. 정책이 중언부언하면 오해를 부릅니다. 불필요한 말은 생략하세요. 직접적이고 간결한 설명이 마구잡이식 예시 나열보다 더 이해하기 쉽습니다. 각주나 관련 문서 링크를 이용하여 더 상세히 설명할 수도 있습니다.
규칙을 만든 의도를 강조하세요. 사용자들이 상식대로 행동하리라 기대하세요. 정책의 의도가 명료하다면, 추가 설명은 필요 없죠. 즉 규칙을 '어떻게' 지키는지와 더불어 '왜' 지켜야 하는지 확실하게 밝혀야 합니다.
범위는 분명히, 중복은 피하기. 되도록 앞부분에서 정책 및 지침의 목적과 범위를 분명하게 밝혀야 합니다. 독자 대부분은 도입부 초반만 읽고 나가버리니까요. 각 정책 문서의 내용은 해당 정

2. Text Split

Split the data into chunks

Search efficiency ↑

# Split text
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = splitter.split_documents(docs)

print(len(splits))
print(splits[10])   # page_content: the split text chunk / metadata: information about the source document

18
page_content='제안과 채택
 백:아님 § 관료주의  문서를 참고하십시오. 단축백:제안
제안 문서란 정책과 지침으로 채택하자고 의견을 묻는 문서이나 아직 위키백과 내에 받아들여지는 원칙으로 확립되지는 않은 문서입니다. {{제안}} 틀을 붙여 공동체 내에서 정책이나 지침으로 채택할 지 의견을 물을 수 있습니다. 제안 문서는 정책과 지침이 아니므로 아무리 실제 있는 정책이나 지침을 요약하거나 인용해서 다른 문서에 쓴다고 해도 함부로 정책이나 지침 틀을 붙여서는 안 됩니다.
'제안'은 완전 새로운 원칙이라기보다, 기존의 불문율이나 토론 총의의 문서를 통한 구체화에 가깝습니다. 많은 사람들이 쉽게 제안을 받아들이도록 하기 위해서는, 기초적인 원칙을 우선 정하고 기본 틀을 짜야 합니다. 정책과 지침의 기본 원칙은 "왜 지켜야 하는가?", "어떻게 지켜야 하는가?" 두 가지입니다. 특정 원칙을 정책이나 지침으로 확립하기 위해서는 우선 저 두 가지 물음에 성실하게 답하는 제안 문서를 작성해야 합니다.
좋은 아이디어를 싣기 위해 사랑방이나 관련 위키프로젝트에 도움을 구해 피드백을 요청할 수 있습니다. 이 과정에서 공동체가 어느 정도 받아들일 수 있는 원칙이 구체화됩니다. 많은 이와의 토론을 통해 공감대가 형성되고 제안을 개선할 수 있습니다.
정책이나 지침은 위키백과 내의 모든 편집자들에게 적용되는 원칙이므로 높은 수준의 총의가 요구됩니다. 제안 문서가 잘 짜여졌고 충분히 논의되었다면, 더 많은 공동체의 편집자와 논의를 하기 위해 승격 제안을 올려야 합니다. 제안 문서 맨 위에 {{제안}}을 붙여 제안 안건임을 알려주고, 토론 문서에 {{의견 요청}}을 붙인 뒤 채택 제안에 관한 토론 문단을 새로 만들면 됩니다. 많은 편집자들에게 알리기 위해 관련 내용을 {{위키백과 소식}}에 올리고 사랑방에 이를 공지해야 하며, 합의가 있을 경우 미디어위키의 sitenotice(위키백과 최상단에 노출되는 구역)에 공지할 수도 있습니다.' metadata={'source': 'https://ko.wikipedia.org/wiki/%EC%9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC:%EC%A0%95%EC%B1%85%EA%B3%BC_%EC%A7%80%EC%B9%A8', 'title': '위키백과:정책과 지침 - 위키백과, 우리 모두의 백과사전', 'language': 'ko'}

3. Indexing

Convert the split text into a searchable form: text → embedding → store in a vector store → similarity search

Search time ↓ accuracy ↑

# indexing
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vs = Chroma.from_documents(documents=splits,
                           embedding=OpenAIEmbeddings())

docs = vs.similarity_search("격하 과정에 대해 설명해주세요.")
print(len(docs))  # number of stored documents returned as most similar
print(docs[0].page_content)   # the first one, with the highest similarity

4
격하
특정 정책이나 지침이 편집 관행이나 공동체 규범이 바뀌며 쓸모없어질 수 있고, 다른 문서가 개선되어 내용이 중복될 수 있으며, 불필요한 내용이 증식할 수도 있습니다. 이 경우 편집자들은 정책을 지침으로 격하하거나, 정책 또는 지침을 보충 설명, 정보문, 수필 또는 중단 문서로 격하할 것을 제안할 수 있습니다. 
격하 과정은 채택 과정과 비슷합니다. 일반적으로 토론 문서 내 논의가 시작되고 프로젝트 문서 상단에 {{새로운 토론|문단=진행 중인 토론 문단}} 틀을 붙여 공동체의 참여를 요청합니다. 논의가 충분히 이루어진 후, 제3의 편집자가 토론을 종료하고 평가한 후 상태 변경 총의가 형성되었는지 판단해야 합니다. 폐지된 정책이나 지침은 최상단에 {{중단}} 틀을 붙여 더 이상 사용하지 않는 정책/지침임을 알립니다.
소수의 공동체 인원만 지지하는 수필, 정보문 및 기타 비공식 문서는 일반적으로 주된 작성자의 사용자 이름공간으로 이동합니다. 이러한 논의는 일반적으로 해당 문서의 토론란에서 이루어지며, 간혹 위키백과:의견 요청을 통해 처리되기도 합니다.

같이 보기
위키백과:위키백과의 정책과 지침 목록
위키백과:의견 요청
수필

위키백과:제품, 절차, 정책
위키백과:위키백과 공동체의 기대와 규범
기타 링크
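
What a vector store does at its core can be sketched with the standard library. Everything here is a toy under my own assumptions: TinyVectorStore and letter_counts are made-up names, and counting letters stands in for a real embedding model. Only the shape of the flow (embed, store, rank by similarity, return the top k) matches the indexing step above.

```python
import math
import string


def letter_counts(text):
    """Toy 'embedding': one count per lowercase ASCII letter (assumption, not a real model)."""
    lowered = text.lower()
    return [lowered.count(ch) for ch in string.ascii_lowercase]


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0  # an all-zero vector matches nothing
    return sum(x * y for x, y in zip(a, b)) / norm


class TinyVectorStore:
    """In-memory stand-in for a vector store such as Chroma (illustration only)."""

    def __init__(self, embed):
        self.embed = embed
        self.items = []  # list of (vector, text) pairs

    def add_texts(self, texts):
        for text in texts:
            self.items.append((self.embed(text), text))

    def similarity_search(self, query, k=4):
        query_vec = self.embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vec, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = TinyVectorStore(letter_counts)
store.add_texts(["apple pie", "zzz", "banana"])
print(store.similarity_search("apple", k=2))
```

The default k=4 here simply mirrors the four documents returned in the output above.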

4. Retrieval & Generation

Build a query from the user input, then retrieve the most relevant information from the indexed data with a LangChain retriever (vs.as_retriever())

# Full code including retrieval & generation

from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

def format_docs(docs):  # join the retrieved docs into one string
  return '\n\n'.join(doc.page_content for doc in docs)

url = 'https://ko.wikipedia.org/wiki/%EC%9C%84%ED%82%A4%EB%B0%B1%EA%B3%BC:%EC%A0%95%EC%B1%85%EA%B3%BC_%EC%A7%80%EC%B9%A8'
loader = WebBaseLoader(url)

docs = loader.load()    # web page text -> Documents

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = splitter.split_documents(docs)

vs = Chroma.from_documents(documents=splits,
                           embedding=OpenAIEmbeddings())

template = '''Answer the question based only on the following context:
{context}
Question: {question}
'''

prom = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(model='gpt-3.5-turbo-0125', temperature=0)
retriever = vs.as_retriever()   # retrieval

rag_chain = (
    {'context': retriever | format_docs,
     'question': RunnablePassthrough()}
    | prom
    | model
    | StrOutputParser()
)

rag_chain.invoke("격하 과정에 대해 설명해주세요.")

격하 과정은 특정 정책이나 지침이 더 이상 필요하지 않거나 개선이 필요한 경우에 해당 정책이나 지침을 수정하거나 중단하는 과정을 말합니다. 이를 위해 편집자들은 해당 정책이나 지침을 다른 형태로 변형하거나 중단할 것을 제안하고, 이에 대한 토론을 거친 후 결정이 내려집니다. 격하 과정은 채택 과정과 유사하며, 토론을 통해 공동체의 참여를 유도하고, 결정이 내려진 후에는 해당 정책이나 지침이 중단되었음을 알리는 틀을 붙여줍니다.
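
The dict at the start of rag_chain was the part I found least obvious, so here is a plain-Python sketch of what it does: the same question is sent down two paths, one through the retriever and format_docs, one passed through unchanged. fake_retriever and build_prompt are made-up stand-ins, not LangChain APIs, and the returned chunks are dummy strings.

```python
TEMPLATE = (
    "Answer the question based only on the following context:\n"
    "{context}\n"
    "Question: {question}\n"
)


def fake_retriever(question):
    """Hypothetical stand-in for vs.as_retriever(); returns dummy chunk texts."""
    return ["chunk about demotion", "chunk about proposals"]


def format_docs(docs):
    return "\n\n".join(docs)


def build_prompt(question):
    # The same input feeds both keys of the dict step.
    inputs = {
        "context": format_docs(fake_retriever(question)),  # retriever | format_docs
        "question": question,                              # RunnablePassthrough()
    }
    return TEMPLATE.format(**inputs)


prompt_text = build_prompt("Please explain the demotion process.")
print(prompt_text)
```

In the real chain the resulting prompt then goes to the model and StrOutputParser; this sketch stops at the prompt.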

cf)

https://wikidocs.net/231364

https://aws.amazon.com/ko/what-is/langchain/

https://www.samsungsds.com/kr/insights/what-is-langchain.html

[error]ValidationError: 1 validation error for ChatOpenAI

후__아 — Thu, 1 Aug 2024 14:28:25 +0900

https://wikidocs.net/231375

An error occurred during the LangChain LLM practice

Parameters {'presence_penalty', 'frequency_penalty', 'stop'} should be specified explicitly. Instead they were passed in as part of `model_kwargs` parameter. (type=value_error)

# LLM
from langchain_openai import ChatOpenAI

params = {# basic parameters
    "temperature": 0.7,
    "max_tokens": 100,
}

kwargs = {# optional parameters
    "frequency_penalty": 0.5,
    "presence_penalty": 0.5,
    "stop": ['\n']
}

model = ChatOpenAI(model = "gpt-3.5-turbo-0125", **params, model_kwargs = kwargs)
quest = "태양계에서 가장 큰 행성은 무엇인가요?"
resp = model.invoke(input = quest)

resp

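It looked like a parameter error, so I tried changing the parameters of the LangChain LLM model.

Looking at https://api.python.langchain.com/en/latest/llms/langchain_openai.llms.base.OpenAI.html,

the error says these parameters 'should be specified explicitly',

which means they should not be passed through model_kwargs.

So I checked how the 'presence_penalty' argument is defined,

and it looked like it could go in params as a regular argument. I tried that and it worked.. yay~

# Code after the fix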

from langchain_openai import ChatOpenAI

params = {# basic parameters
    "temperature": 0.7,
    "max_tokens": 100,
    "frequency_penalty": 0.5,
    "presence_penalty": 0.5,
    "stop": ['\n']
}

model = ChatOpenAI(model = "gpt-3.5-turbo-0125", **params)
quest = "태양계에서 가장 큰 행성은 무엇인가요?"
resp = model.invoke(input = quest)

resp.content
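
To picture why the first version failed, here is a rough sketch of the kind of check that produces this message. It is my own guess at the behaviour, not LangChain's actual code: EXPLICIT_FIELDS and check_model_kwargs are hypothetical names, and the field list only covers the parameters used in this post.

```python
# Parameters the model class accepts as named arguments (assumed subset for illustration).
EXPLICIT_FIELDS = {"temperature", "max_tokens", "frequency_penalty", "presence_penalty", "stop"}


def check_model_kwargs(model_kwargs):
    """Reject keys in model_kwargs that should have been passed as named arguments."""
    misplaced = EXPLICIT_FIELDS & set(model_kwargs)
    if misplaced:
        raise ValueError(
            f"Parameters {sorted(misplaced)} should be specified explicitly. "
            "Instead they were passed in as part of `model_kwargs` parameter."
        )
    return model_kwargs


try:
    check_model_kwargs({"frequency_penalty": 0.5, "presence_penalty": 0.5, "stop": ["\n"]})
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

So model_kwargs is meant for extra options the class does not already expose as its own arguments.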

[AI]LangChain Basic Theory & Practice (3)

후__아 — Wed, 31 Jul 2024 18:25:10 +0900

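The two kinds of language models LangChain provides

※ LLM

Generates complex output for a single request ex) document summarization, question answering, etc.

Text string in → text string out

+ Standardized interface → compatibility across LLM providers → flexible model switching / multi-LLM integration possible

※ ChatModel

Manages continuous conversation through interaction with the user ex) chatbots

List of messages in → one message out

Generates appropriate responses while keeping the conversation context

+ Various model providers / operating modes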

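※ LLM model parameters

- Temperature: controls the diversity of generated text (lower is more deterministic, higher is more varied)

- Max Tokens: maximum number of tokens to generate (limits text length)

- Top P (Probability): during generation, considers only the most likely tokens whose cumulative probability reaches P

- Frequency Penalty: the larger the value, the less likely already-used tokens are to reappear; repetition ↓ diversity ↑

- Presence Penalty: adjusts a token's selection probability based on whether it has already appeared in the text

- Stop Sequences: stops generation when a given word/phrase appears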
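
Temperature and Top P are easier to see with numbers. The sketch below is a simplified illustration with made-up logits, not how any particular provider implements sampling; the function names are my own.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Return indices of the most likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return sorted(kept)


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
print(sharp)
print(flat)
print(top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.75))
```

With the lower temperature the top token takes a larger share of the probability, and with top_p=0.75 only the first two tokens remain candidates.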

Creating an LLM model
1. Passing parameters directly

from langchain_openai import ChatOpenAI

params = {# basic parameters
    "temperature": 0.7,
    "max_tokens": 100,
    "frequency_penalty": 0.5,
    "presence_penalty": 0.5,
    "stop": ['\n']
}

model = ChatOpenAI(model = "gpt-3.5-turbo-0125", **params)
quest = "태양계에서 가장 큰 행성은 무엇인가요?"
resp = model.invoke(input = quest)

resp.content

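2. Adding model parameters: the bind method

Use bind when you want to keep a model's settings as defaults or apply different values for only some parameters

Code readability & reusability ↑

# bind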
from langchain_core.prompts import ChatPromptTemplate

prom = ChatPromptTemplate.from_messages([
    ('system', "이 시스템은 역사학 질문에 답변할 수 있습니다."),
    ('user', '{user_input}'),
])

model = ChatOpenAI(model='gpt-3.5-turbo-0125', max_tokens=100)
messages = prom.format_messages(user_input = "한국의 독립기념일은 언제인가요?")
answer1 = model.invoke(messages)
print(answer1)    # before binding

chain = prom | model.bind(max_tokens = 10)
answer2 = chain.invoke({'user_input': "한국의 독립기념일은 언제인가요?"})
print(answer2)

content='한국의 독립기념일은 3월 1일입니다. 이 날은 1919년 3월 1일 대한민국의 광복을 위한 독립운동이 시작된 날로 기념됩니다. 현재 대한민국에서는 3월 1일을 독립운동 기념일로 지' response_metadata={'token_usage': {'completion_tokens': 100, 'prompt_tokens': 56, 'total_tokens': 156}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'length', 'logprobs': None} id='run-861a8964-0965-4bdf-a1ad-79fab9a4c1e0-0' usage_metadata={'input_tokens': 56, 'output_tokens': 100, 'total_tokens': 156}
content='대한민국의 독' response_metadata={'token_usage': {'completion_tokens': 9, 'prompt_tokens': 56, 'total_tokens': 65}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'length', 'logprobs': None} id='run-bc0f9ec8-d128-46ca-8e74-522bc6b1f6a2-0' usage_metadata={'input_tokens': 56, 'output_tokens': 9, 'total_tokens': 65}
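
The two outputs above (100 tokens vs. 9 tokens) show that bind overrides a setting without touching the original model. A toy analogue in plain Python, where ToyModel is a made-up class and whitespace-separated words stand in for real tokens:

```python
class ToyModel:
    """Made-up stand-in for a chat model; only illustrates the bind pattern."""

    def __init__(self, **defaults):
        self.defaults = dict(defaults)

    def bind(self, **overrides):
        # Return a new object with merged settings; the original stays unchanged.
        return ToyModel(**{**self.defaults, **overrides})

    def invoke(self, prompt):
        limit = self.defaults.get("max_tokens", 100)
        # Words stand in for tokens here, which is a simplification.
        return " ".join(prompt.split()[:limit])


base = ToyModel(max_tokens=5)
short = base.bind(max_tokens=2)

print(base.invoke("a b c d e f"))
print(short.invoke("a b c d e f"))
```

The base model keeps its own limit, so the same object can be reused with different bound settings in different chains.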

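※ Output Parser

- Change the output format: output in the desired format

- Extract information: pull out only the information you need

- Refine results: perform post-processing

- Apply conditional logic: perform different processing based on the output data

1. CSV Parser

LangChain's CommaSeparatedListOutputParser

Extracts ','-separated items from the text the model generated & parses them into a list

get_format_instructions(): the format instructions to pass to the model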

from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import CommaSeparatedListOutputParser

output_parser = CommaSeparatedListOutputParser()
format_instructions = output_parser.get_format_instructions()

prom = PromptTemplate(
    template = "다섯 명의 {subject}을 나열해주세요. \n{format_instructions}",
    input_variables = ["subject"],    # must match the placeholder name in the template
    partial_variables = {'format_instructions': format_instructions},
)

llm = ChatOpenAI(model='gpt-3.5-turbo-0125', temperature = 0)  # temperature=0: more consistent output
chain = prom | llm | output_parser
chain.invoke({"subject": "세계 최고의 부자"})

['Jeff Bezos', 'Elon Musk', 'Bernard Arnault', 'Bill Gates', 'Mark Zuckerberg']
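
What the parser does with the model's raw text can be approximated in one line of plain Python. This is a simplified stand-in (it does not handle quoted items that contain commas), and parse_comma_separated is a made-up name:

```python
def parse_comma_separated(text):
    """Split a comma-separated string into a list of trimmed, non-empty items."""
    return [item.strip() for item in text.strip().split(",") if item.strip()]


raw_output = "Jeff Bezos, Elon Musk, Bernard Arnault, Bill Gates, Mark Zuckerberg"
names = parse_comma_separated(raw_output)
print(names)
```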

2. JSON Parser

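The following example uses JsonOutputParser with Pydantic → the Pydantic model defines the schema for the format instructions, and the model output is parsed as JSON into a dict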

#JSON
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field

# Define the data structure (pydantic)
class CusineRecipe(BaseModel):
    name: str = Field(description="name of a cusine")
    recipe: str = Field(description="recipe to cook the cusine")

# Define the output parser
output_parser = JsonOutputParser(pydantic_object=CusineRecipe)
format_instructions = output_parser.get_format_instructions()

print(format_instructions)

prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": format_instructions},
)
chain = prompt | model | output_parser

chain.invoke({"query": "Let me know how to cook Bibimbap"})

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"name": {"title": "Name", "description": "name of a cusine", "type": "string"}, "recipe": {"title": "Recipe", "description": "recipe to cook the cusine", "type": "string"}}, "required": ["name", "recipe"]}
```

{'name': 'Bibimbap',
 'recipe': 'Bibimbap is a Korean mixed rice dish made with warm white rice topped with sautéed and seasoned vegetables, chili pepper paste, soy sauce, or doenjang, and a raw or fried egg. The ingredients are stirred together just before eating. It can be served either cold or hot.'}
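
The parsing step itself boils down to reading JSON and checking that the expected keys are there. A stdlib-only sketch under my own assumptions (parse_recipe and REQUIRED_KEYS are made-up names, and the sample string is shortened dummy text, not the real model output):

```python
import json

# Keys taken from the "required" list in the schema printed above.
REQUIRED_KEYS = ("name", "recipe")


def parse_recipe(raw_text):
    """Parse model output as JSON and make sure the required keys exist."""
    data = json.loads(raw_text)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


raw_text = '{"name": "Bibimbap", "recipe": "Mix rice with seasoned vegetables."}'
parsed = parse_recipe(raw_text)
print(parsed["name"])
```

If the model leaves a field out, this fails loudly instead of passing an incomplete dict downstream.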

cf)

https://wikidocs.net/231346

https://aws.amazon.com/ko/what-is/langchain/

https://www.samsungsds.com/kr/insights/what-is-langchain.html

[AI]LangChain Theory & Practice (2)

후__아 — Wed, 31 Jul 2024 15:39:33 +0900

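※ Prompt

The input text, in the form of a question/request, in a conversation between the user and the language model

→ Prompt templates matter

※ Writing principles

- Clarity & specificity: the question must not be ambiguous

- Include background information: provide enough information to understand the context → hallucination ↓ response relevance ↑

- Conciseness: unnecessary information is bad (B), keeping it as concise as possible is good (G)

- Open-ended questions: yes/no questions are bad (B), open-ended questions that draw out more information are good (G)

- Clear goal: define exactly what information/result you want

- Language/style: appropriate to the context

※ PromptTemplate

A single sentence or simple instruction == string-based

Uses the "PromptTemplate" class from the "langchain_core.prompts" module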

PromptTemplate

from langchain_core.prompts import PromptTemplate
text = "안녕하세요, 제 이름은 {name}이고, 나이는 {age}살입니다."
prom_temp = PromptTemplate.from_template(text)    # PromptTemplate instance
print(prom_temp.format(name = '홍길동', age = 30))
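
For a template this simple, the formatting behaves like Python's own str.format. A stdlib-only sketch, assuming nothing beyond the template string above, that also lists the placeholder names:

```python
from string import Formatter

text = "안녕하세요, 제 이름은 {name}이고, 나이는 {age}살입니다."

# Placeholder names found in the template string.
variables = [field for _, field, _, _ in Formatter().parse(text) if field]
print(variables)

# Fill the placeholders the same way the template example does.
filled = text.format(name='홍길동', age=30)
print(filled)
```

This only covers plain f-string-style placeholders; PromptTemplate adds things like partial variables and composition on top.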

PromptTemplate + PromptTemplate

Multiple prompt templates can be combined into a single template

# Combine prompt templates
prom_temp2 = (
    prom_temp
    + PromptTemplate.from_template("\n아버지를 아버지라 부를 수 없습니다.")
    + "위의 문장을 \n{language}로 번역해주세요"
)
print(prom_temp2.format(name = '홍길동', age = 30, language = '중국어'))

Building the chain as well, the final code looks like this

from langchain_openai import ChatOpenAI    # older releases: from langchain.chat_models import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model = "gpt-3.5-turbo-0125")
chain = prom_temp2 | llm | StrOutputParser()
chain.invoke({"name": '홍길동', "age": 30, "language": '중국어'})

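※ ChatPromptTemplate

Generates a single message response from multiple messages in a conversation → for developing conversational models/chatbots

Input: a list whose elements are messages

Message: consists of role & content

ㄴMessage types: System/Human/AI/Function/Tool

1. ChatPromptTemplate.from_messages form

Builds the prompt from the messages passed in

# ChatPromptTemplate
# list of messages as 2-tuples (role, content)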
from langchain_core.prompts import ChatPromptTemplate

chat_prom = ChatPromptTemplate.from_messages([
    ("system", "이 시스템은 음식 질문에 답변할 수 있습니다."),
    ("user", "{user_input}"),
])

messages = chat_prom.format_messages(user_input = "대한민국 음식 중 조리과정이 3분 이내인 요리는 무엇이 있나요?")
# messages

Set the 'role' in System and the 'question' in User

from langchain_core.output_parsers import StrOutputParser

chain = chat_prom | llm | StrOutputParser()
chain.invoke({"user_input": "대한민국 음식 중 조리과정이 3분 이내인 요리는 무엇이 있나요?"})

2. MessagePromptTemplate form

Expresses the Role (System/Human/AI/Function/Tool) and Content of each message in the list explicitly

[SystemMessage(content='이 시스템은 음식 질문에 답변할 수 있습니다.'),
 HumanMessage(content='대한민국 음식 중 조리과정이 3분 이내인 요리는 무엇이 있나요?')]

# MessagePromptTemplate
from langchain_core.prompts import SystemMessagePromptTemplate, HumanMessagePromptTemplate

chat_prom2 = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template("이 시스템은 음식 질문에 답변할 수 있습니다."),
    HumanMessagePromptTemplate.from_template("{user_input}"),
])

chain2 = chat_prom2 | llm | StrOutputParser()
chain2.invoke({"user_input": "대한민국 음식 중 조리과정이 3분 이내인 요리는 무엇이 있나요?"})

cf)

https://blog.naver.com/htk1019/223413412145

https://wikidocs.net/231346

https://aws.amazon.com/ko/what-is/langchain/

https://www.samsungsds.com/kr/insights/what-is-langchain.html

[error]ModuleNotFoundError: No module named 'langchain_community' / 'langchain_openai'

후__아 — Wed, 31 Jul 2024 14:19:25 +0900

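Import errors after installing LangChain
1. langchain_community

!pip install langchain-community langchain-core

solved it!

2. langchain_openai

!pip install langchain-openai

Try this, or

from langchain.chat_models import ChatOpenAI

import it from a different package altogether! (this is the older import path, deprecated in newer LangChain releases)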
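
Before running the imports, it can help to check which packages are actually installed. A small stdlib sketch; missing_packages is a made-up helper name, and it only checks top-level module names:

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Module names use underscores, while the pip package names use hyphens.
print(missing_packages(["langchain_community", "langchain_openai"]))
```

Anything printed in the list still needs a pip install (for example langchain_openai → langchain-openai).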