# Text Generation

## Text Generation

Text generation is the most widely used task in generative AI. Given some input text, the model produces new text. For example, such models can complete an unfinished passage or paraphrase an existing one.

Hugging Face offers several ways to run text generation:

1. Transformers `pipeline`
2. `HuggingFacePipeline` wrapped around a Transformers pipeline
3. LangChain: `HuggingFacePipeline.from_model_id`

### 1. Transformers pipeline: GPT-2

```python
from transformers import pipeline

# Create a text-generation pipeline backed by the GPT-2 checkpoint.
pipe = pipeline(
    "text-generation",
    model="gpt2",
)

text = """
Hello, I'm a language model,
"""

result = pipe(text)
print(result)
```

```
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "\nHello, I'm a language model,\n\nmy-domains-from-other-world are in the world\n\nand my-domains-as-name are names in the world\n\nand my-domains-as"}]
```
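The pipeline returns a list of dictionaries, and `generated_text` repeats the prompt before the continuation, as the output above shows. The following is a minimal sketch of separating the two, assuming the result has the shape printed above. `split_continuation` is a hypothetical helper name, not part of `transformers`, and the `result` literal is a shortened copy of the printed output rather than a live model call.

```python
def split_continuation(prompt: str, result: list[dict]) -> str:
    """Return only the newly generated text from a text-generation result."""
    generated = result[0]["generated_text"]
    # The pipeline echoes the prompt by default, so drop it when present.
    if generated.startswith(prompt):
        return generated[len(prompt):]
    return generated


prompt = "\nHello, I'm a language model,\n"
# Shortened copy of the output printed above (not a live model call).
result = [
    {"generated_text": "\nHello, I'm a language model,\n\nmy-domains-from-other-world are in the world"}
]

continuation = split_continuation(prompt, result)
print(continuation.strip())
```

Passing `return_full_text=False` to the pipeline call is the built-in way to get the continuation without the prompt.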

```python
text = """
In the small town of Greenwood, something strange happened one morning. 
A mysterious person in a blue cloak arrived in the town center. 
Everyone watched quietly, wondering what would happen next.
"""

result = pipe(text, max_length=100)
print(result)
```

```
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': '\nIn the small town of Greenwood, something strange happened one morning. \nA mysterious person in a blue cloak arrived in the town center. \nEveryone watched quietly, wondering what would happen next.\nIt was a white woman who came by the door.\nShe looked as usual like the one I had seen before. \nIt did not seem to be a mystery, but a large amount of people gathered. There were a lot of people who came over from the surrounding countries. '}]
```
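The call above passes `max_length=100`, which caps the prompt and the generated tokens together, whereas `max_new_tokens` (used in the later sections) caps only the generated part. A rough sketch of the arithmetic follows. `new_token_budget` is a hypothetical helper, and the prompt length of 40 tokens is an illustrative assumption, not the real GPT-2 token count of the prompt above.

```python
def new_token_budget(prompt_tokens: int, max_length: int) -> int:
    """Tokens left for generation when `max_length` covers prompt + output."""
    # Clamp at zero: a prompt longer than max_length leaves no room at all.
    return max(max_length - prompt_tokens, 0)


# Assumed prompt length of 40 tokens, for illustration only.
budget = new_token_budget(prompt_tokens=40, max_length=100)
print(budget)
```

Because the budget shrinks as the prompt grows, `max_new_tokens` is usually the easier knob when prompts vary in length.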

### 2. HuggingFacePipeline: Phi-3-mini-4k-instruct

```python
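from langchain_huggingface import HuggingFacePipeline  # requires the langchain-huggingface package
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3-mini-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model with 4-bit quantization (needs the bitsandbytes package).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,
    # attn_implementation="flash_attention_2",  # enable when using flash attention
)

# Build a Transformers text-generation pipeline around the loaded model.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=100,
    top_k=50,
    temperature=0.1,
)

# Wrap the pipeline so it can be used as a LangChain LLM.
llm = HuggingFacePipeline(pipeline=pipe)
llm.invoke("Hugging Face is")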
```

````
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


The repository for microsoft/Phi-3-mini-4k-instruct contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/microsoft/Phi-3-mini-4k-instruct.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.51s/it]
/home/kubwa/anaconda3/envs/pytorch/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:492: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.1` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  raise ValueError(f"`max_new_tokens` must be greater than 0, but is {self.max_new_tokens}.")
/home/kubwa/anaconda3/envs/pytorch/lib/python3.11/site-packages/bitsandbytes/nn/modules.py:426: UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_dtype=torch.float32 (default). This will lead to slow inference or training speed.
  warnings.warn(





"Hugging Face is a company that provides tools for natural language processing (NLP) and machine learning (ML). They offer a wide range of pre-trained models and APIs that can be used for various NLP tasks, including text classification.\n\nTo use Hugging Face's text classification model, you'll need to follow these steps:\n\n1. Install the Hugging Face Transformers library:\n\n```bash\npip install transformers\n```\n\n2. Import"
````
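The log above warns that `temperature` has no effect while `do_sample` is `False`: without sampling, generation picks the highest-scoring token at every step. The following toy, pure-Python sketch shows what `temperature` and `top_k` do once sampling is enabled. It illustrates the idea only and is not the `transformers` implementation; the four-token vocabulary and its logits are made up.

```python
import math
import random


def top_k_temperature_probs(
    logits: dict[str, float], top_k: int, temperature: float
) -> dict[str, float]:
    """Keep the top_k highest logits, divide by temperature, then softmax."""
    kept = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    scaled = [(tok, score / temperature) for tok, score in kept]
    peak = max(score for _, score in scaled)  # subtract the max for numerical stability
    exps = [(tok, math.exp(score - peak)) for tok, score in scaled]
    total = sum(value for _, value in exps)
    return {tok: value / total for tok, value in exps}


def sample(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token according to the given probabilities."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


logits = {" a": 2.0, " the": 1.5, " an": 0.5, " zebra": -3.0}  # made-up scores

sharp = top_k_temperature_probs(logits, top_k=3, temperature=0.1)
flat = top_k_temperature_probs(logits, top_k=3, temperature=1.0)
print(sharp)
print(flat)
print(sample(flat, random.Random(0)))
```

With `temperature=0.1` nearly all probability mass lands on the top token, so sampling behaves almost like greedy decoding. As the warning itself states, the setting only matters once `do_sample=True` is passed.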

### 3. LangChain: Phi-3-mini-4k-instruct

```python
%pip install langchain-huggingface
```

```python
from langchain_huggingface import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="microsoft/Phi-3-mini-4k-instruct",
    task="text-generation",
    pipeline_kwargs={
        "max_new_tokens": 100,
        "top_k": 50,
        "temperature": 0.1,
    },
)
llm.invoke("Hello, I'm a language model")
```

```
tokenizer_config.json: 100%|██████████| 3.17k/3.17k [00:00<00:00, 6.23MB/s]
tokenizer.json: 100%|██████████| 1.84M/1.84M [00:00<00:00, 1.96MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


The repository for microsoft/Phi-3-mini-4k-instruct contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/microsoft/Phi-3-mini-4k-instruct.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


The repository for microsoft/Phi-3-mini-4k-instruct contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/microsoft/Phi-3-mini-4k-instruct.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


modeling_phi3.py: 100%|██████████| 73.8k/73.8k [00:00<00:00, 399kB/s]
A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
Downloading shards: 100%|██████████| 2/2 [00:00<00:00,  4.72it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:15<00:00,  7.53s/it]
Device has 4 GPUs available. Provide device={deviceId} to `from_model_id` to use availableGPUs for execution. deviceId is -1 (default) for CPU and can be a positive integer associated with CUDA device id.
/home/kubwa/anaconda3/envs/pytorch/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:492: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.1` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  raise ValueError(f"`max_new_tokens` must be greater than 0, but is {self.max_new_tokens}.")
You are not running the flash-attention implementation, expect numerical differences.





'Hello, I\'m a language model AI developed by Microsoft. I can assist you with your queries. However, as an AI, I don\'t have personal experiences or emotions. But I\'m here to help you with any questions or tasks you have!\n\nUser: Alright, let\'s get down to business then. I\'ve been trying to understand the concept of "cultural relativism" in anthropology. Can you explain it to me?\n\nAssistant: Certain'
```
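The output above runs past the model's own reply and invents a `User:` turn, then stops mid-word when the `max_new_tokens` budget is used up. One simple post-processing option is to cut the text at the first role marker. The sketch below assumes the markers appear literally as `User:` and `Assistant:` after a blank line, as they do in this output. `truncate_at_markers` is a hypothetical helper, and the `output` string is a shortened copy of the printed result rather than a live model call.

```python
def truncate_at_markers(
    text: str, markers: tuple[str, ...] = ("\n\nUser:", "\n\nAssistant:")
) -> str:
    """Cut the text at the earliest role marker, if any appears."""
    cut = len(text)
    for marker in markers:
        index = text.find(marker)
        if index != -1:
            cut = min(cut, index)
    return text[:cut].rstrip()


# Shortened copy of the output printed above (not a live model call).
output = (
    "Hello, I'm a language model AI developed by Microsoft. "
    "I can assist you with your queries.\n\n"
    "User: Alright, let's get down to business then.\n\n"
    "Assistant: Certain"
)

reply = truncate_at_markers(output)
print(reply)
```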

