4️⃣ Tabular Question & Answering

Question & Answering

  1. General Q&A

  2. Tabular Q&A

  3. TAPEX

General Q&A
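
Extractive Q&A pulls an answer span directly out of a supplied context. The example below runs the question-answering pipeline with timpal0l/mdeberta-v3-base-squad2, a multilingual DeBERTa checkpoint fine-tuned on SQuAD 2.0.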

from transformers import pipeline

qa_model = pipeline(
    "question-answering", 
    "timpal0l/mdeberta-v3-base-squad2"
)

context = "The Great Wall of China is one of the world's most famous landmarks. It was built over several centuries and is thousands of kilometers long. The wall was primarily constructed to protect against invasions and raids from various nomadic groups from the Eurasian Steppe."
question = "What was the primary purpose of building the Great Wall of China?"

qa_model(question=question, context=context)

{'score': 0.31094032526016235,
 'start': 176,
 'end': 215,
 'answer': ' to protect against invasions and raids'}
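
The start and end fields are character offsets into the context, so the answer span can be recovered (or highlighted) by slicing the original string. A minimal sketch:

result = qa_model(question=question, context=context)

# 'start'/'end' index into the original context string,
# so slicing it reproduces the extracted answer span.
span = context[result["start"]:result["end"]]
print(span == result["answer"])  # True
print(f"score={result['score']:.3f}, answer={span.strip()!r}")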
qa_model(
    question=question, 
    context=context, 
    top_k=3,
    max_answer_len=30,
    max_seq_len=400,
    handle_impossible_answer=False,
)

[{'score': 0.31094032526016235,
  'start': 176,
  'end': 215,
  'answer': ' to protect against invasions and raids'},
 {'score': 0.23844484984874725,
  'start': 179,
  'end': 215,
  'answer': ' protect against invasions and raids'},
 {'score': 0.1176154762506485,
  'start': 176,
  'end': 243,
  'answer': ' to protect against invasions and raids from various nomadic groups'}]
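
Because the model was fine-tuned on SQuAD 2.0, setting handle_impossible_answer=True lets it signal that the context contains no answer, in which case the pipeline returns an empty answer string. A minimal sketch with a deliberately unanswerable question (the question text is made up for illustration):

# A question the context cannot answer.
result = qa_model(
    question="Who designed the Eiffel Tower?",
    context=context,
    handle_impossible_answer=True,
)
# An empty 'answer' (start == end == 0) means the model preferred
# the "no answer" option over any span in the context.
print(result)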

Tabular Q&A
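
Table Q&A answers natural-language questions over structured data. The example below builds a small pandas DataFrame of football player statistics and queries it with TAPAS (google/tapas-large-finetuned-wtq), fine-tuned on WikiTableQuestions.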

Pandas DataFrame

import pandas as pd

# Sample data for a football player statistics dataset
data = {
    "Player Name": ["Lionel Messi", "Cristiano Ronaldo", "Neymar Jr", "Kevin De Bruyne", "Robert Lewandowski"],
    "Team": ["Paris Saint-Germain", "Al Nassr", "Paris Saint-Germain", "Manchester City", "Barcelona"],
    "Nationality": ["Argentina", "Portugal", "Brazil", "Belgium", "Poland"],
    "Goals": [25, 30, 18, 12, 34],
    "Assists": [18, 15, 20, 25, 10],
    "Passes Completed": [2050, 1800, 1900, 2300, 1500],
    "Matches Played": [30, 33, 29, 32, 31],
    "Yellow Cards": [2, 3, 4, 1, 5],
    "Red Cards": [0, 1, 0, 0, 1]
}
df = pd.DataFrame(data)
df
          Player Name                 Team  Nationality  Goals  Assists  Passes Completed  Matches Played  Yellow Cards  Red Cards
0        Lionel Messi  Paris Saint-Germain    Argentina     25       18              2050              30             2          0
1   Cristiano Ronaldo             Al Nassr     Portugal     30       15              1800              33             3          1
2           Neymar Jr  Paris Saint-Germain       Brazil     18       20              1900              29             4          0
3     Kevin De Bruyne      Manchester City      Belgium     12       25              2300              32             1          0
4  Robert Lewandowski            Barcelona       Poland     34       10              1500              31             5          1

DataFrame to String
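
The TAPAS tokenizer expects every table cell as text, so all columns are cast to strings before tokenization.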

df = df.astype(str)

Tokenizer & Models

from transformers import AutoTokenizer, AutoModelForTableQuestionAnswering, pipeline

model = AutoModelForTableQuestionAnswering.from_pretrained(
    "google/tapas-large-finetuned-wtq"
)

tokenizer = AutoTokenizer.from_pretrained("google/tapas-large-finetuned-wtq")

Pipeline

nlp = pipeline('table-question-answering', model=model, tokenizer=tokenizer)

Inference

question_list = [
    "Who scored the highest number of goals?",
    "How many assists were made by Kevin De Bruyne?",
    "Which player has the least yellow cards?",
    "What is the total number of red cards received by players from Paris Saint-Germain?",
    "Who has the highest passes completed, and how many passes did they complete?"
]

result = nlp({'table': df, 'query': question_list[0]})
print(result)
for question in question_list:
    result = nlp({'table': df, 'query': question})
    print(result['cells'][0].strip())

{'answer': 'Robert Lewandowski', 'coordinates': [(4, 0)], 'cells': ['Robert Lewandowski'], 'aggregator': 'NONE'}
Robert Lewandowski
25
Kevin De Bruyne
0
Kevin De Bruyne
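
TAPAS predicts an aggregation operator (NONE, SUM, AVERAGE, COUNT) alongside the selected cells, but it does not apply the operator for you; for aggregating questions the final value has to be computed from the cells. A minimal post-processing sketch, assuming the selected cells are numeric whenever an aggregator other than NONE is predicted:

def aggregate(result):
    # Combine the selected cells according to TAPAS's predicted operator.
    cells, agg = result["cells"], result["aggregator"]
    if agg == "NONE":
        return ", ".join(cells)
    values = [float(c) for c in cells]  # assumes numeric cells
    if agg == "SUM":
        return sum(values)
    if agg == "AVERAGE":
        return sum(values) / len(values)
    return len(values)  # COUNT

result = nlp({"table": df, "query": question_list[3]})
print(result["aggregator"], aggregate(result))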

TAPEX Model

The TAPEX model, developed by Microsoft, is a strong option for table question answering in NLP.

  • Strong performance: TAPEX has shown impressive results on a range of table question-answering benchmarks, often outperforming other models.

  • Versatility: It handles both natural-language questions and fact-verification tasks, so it applies to a wide range of use cases.

  • Accessibility: TAPEX is distributed through the Hugging Face Transformers library, making it accessible to a broad range of developers and researchers (see the sketch below).
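
Because TAPEX ships as a standard seq2seq (BART) checkpoint, it can also be dropped into the same table-question-answering pipeline used above for TAPAS; unlike TAPAS it generates the answer as free text, so only an answer string comes back. A minimal sketch (the code that follows instead drives the tokenizer and model directly):

from transformers import pipeline

tapex_qa = pipeline(
    "table-question-answering",
    model="microsoft/tapex-large-finetuned-wtq",
)
# TAPEX generates the answer text, so the result carries no
# cell coordinates or aggregator fields.
print(tapex_qa(table=df, query="Who scored the highest number of goals?"))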

# DataFrame to str
df = df.astype(str)

# Transformer
from transformers import TapexTokenizer, BartForConditionalGeneration

tokenizer = TapexTokenizer.from_pretrained("microsoft/tapex-large-finetuned-wtq")
model = BartForConditionalGeneration.from_pretrained("microsoft/tapex-large-finetuned-wtq")

# Question
question_list = [
    "Who scored the highest number of goals?",
    "How many assists were made by Kevin De Bruyne?",
    "Which player has the least yellow cards?",
    "What is the total number of red cards received by players from Paris Saint-Germain?",
    "Who has the highest passes completed, and how many passes did they complete?"
]

# Encode the table and question, then generate the answer
encoding = tokenizer(table=df, query=question_list[0], return_tensors="pt")
outputs = model.generate(**encoding)
result = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Result
print(result)
for question in question_list:
    encoding = tokenizer(table=df, query=question, return_tensors="pt")
    outputs = model.generate(**encoding)
    result = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    print(result[0].strip())

[' robert lewandowski']
robert lewandowski
25
kevin de bruyne
0
lionel messi, 2050
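
TapexTokenizer also supports batched encoding, so the loop above can be replaced with one padded batch and a single generate() call. A minimal sketch, assuming one copy of the table is paired with each query in batch mode:

# Encode all questions against the same table in one padded batch.
encoding = tokenizer(
    table=[df] * len(question_list),  # one table copy per query
    query=question_list,
    padding=True,
    return_tensors="pt",
)
outputs = model.generate(**encoding)
answers = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for question, answer in zip(question_list, answers):
    print(f"{question} -> {answer.strip()}")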
