3 types of content and rule extraction - prompt and datasets
This commit is contained in:
parent
9ca5c24110
commit
f3f0cde8f1
9072
Context_update/Bime - 622 sections with context.json
Normal file
File diff suppressed because it is too large
221
Rule_extraction/3 type contents generator and rule extraction.py
Normal file
@@ -0,0 +1,221 @@
import json
from tqdm import tqdm
import time

import re
import torch
import os
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

os.environ['HF_HOME'] = "/home/admin/HFHOME"

#model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
#model_id = "meta-llama/Llama-3.1-70B-Instruct"

# use quantization to lower GPU memory usage
# 4 bit:
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
# )
# 8 bit:
# bnb_config = BitsAndBytesConfig(
#     load_in_8bit=True
# )

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    #quantization_config=bnb_config
)
# stop generation at the generic EOS token or Llama 3's end-of-turn token
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
# pad with EOS to silence the missing-pad-token warning
model.generation_config.pad_token_id = tokenizer.eos_token_id


def remove_think_tags(strings):
    # strip <think>...</think> blocks emitted by reasoning models such as the
    # DeepSeek-R1 distill above; not called when using the Llama 3.1 instruct model
    if isinstance(strings, str):
        return re.sub(r'<think>[\s\S]*?</think>', '', strings)
    else:
        return [re.sub(r'<think>[\s\S]*?</think>', '', s) for s in strings]


SYS_PROMPT_simpler = """
Analyze the input sentence and divide it into several simpler and clearer sentences. The new sentences should retain the main meaning of the original sentence but be easier to understand. Use short and direct sentences and avoid complex terms or confusing grammatical structures.
The output should be Persian sentences only.
Each sentence should either be factual and express a minimal fact, or be conditional and express a minimal rule. A minimal fact or rule is a fact or rule that cannot be expressed as two separate and independent sentences.
"""

SYS_PROMPT_modified = """
Carefully edit the text below for punctuation errors (including punctuation marks, spaces, and half-spaces) and spelling errors, and perform a complete editorial review covering all editing points. The output should be only the corrected text, without any additional explanation.
The text should be preserved completely and only edited. Do not alter the sentence structure, add or remove words, or change the original meaning in any way. Pay attention to homophones and correct them based on the meaning of the sentence.
"""

SYS_PROMPT_translated = """
Translate the Persian content below into fluent and accurate English, ensuring the following criteria are met:

Meaning and Context: Preserve the original meaning and context of the text throughout the translation. These texts are all parts of legal laws in different domains.
Linguistic Accuracy: Adhere to proper English grammar, vocabulary, and syntax, demonstrating native-level proficiency in the language.
Proper Nouns: Do not translate words that are proper nouns.
Cultural Sensitivity: Translate any technical terms, idiomatic expressions, and culturally specific references appropriately to ensure clarity and relevance in English.
The final English translation should be professional, natural, and suitable for its intended audience, which is a legal audience.

Note that the output must be nothing but the final translation.
"""

# few-shot prompt for rule extraction; the examples are Persian legal provisions
# analyzed into "if {...} then {...} unless {...}" conditionals (حکم = ruling)
SYS_PROMPT_fewshot = """
You are an expert in law and also an expert in logic. When faced with any legal text, you can understand it well and grasp the propositions it expresses.
You can also correctly analyze and understand the logical relationships between these propositions in the text.

Legal sentences are sometimes factual sentences and sometimes rule-like sentences.
If the sentences are factual and contain no rule-like content, they do not require logical analysis, so say "no Rule". If the sentences are rule-based, you need to find their propositions and then express them in the form of an "if, then, unless" conditional.
To understand the output structure, follow the examples below.

Note that in some cases, to understand the logical relationships in a section of the law, you must consider other information outside the text as context. If you are given a sentence with context, you may use the context in your logical analysis of the sentence.
Do not say anything other than the output, formatted like these examples:
Input:
"سرمایه های صرف شده که در اختیار واحدهای عملیات نفتی گذاشته شده یا میشود، جزء دارایی های واحد مزبور خواهد بود ولی هرگونه نقل و انتقال آنها منوط به اجازه وزارت نفت میباشد."
output:
حکم 1:
اگر {سرمایه ای صرف شود و در اختیار یکی از واحدهای عملیاتی نفت باشد}
آنگاه {آن واحد مالک آن سرمایه خواهد بود}

حکم 2:
اگر {سرمایه ای در اختیار یکی از واحدهای عملیاتی نفت باشد و آن سرمایه نقل یا انتقال داده شود}
آنگاه {آن نقل یا انتقال جایز نیست}
مگر {آن نقل یا انتقال با اجازه وزارت نفت باشد}

Input:
"صرف درآمد موقوفات به منظور بقاء عین آنها بر سایر مصارف مقدم است"
output:
اگر {موقوفهای درآمدی داشت}
آنگاه {آن درآمد باید قبل از هر مصرف دیگری به منظور بقاء عین صرف شود.}
مگر {صرف درآمد به منظور بقاء عین لازم نباشد}

Input:
"کود: هر ماده آلی، زیستی یا معدنی با منشأ طبیعی یا مصنوعی که به خاک یا گیاه اضافه می شود تا یک یا چند عنصر ضروری برای رشد گیاه را تأمین کند."
output:
اگر{مادهای ماده آلی، زیستی یا معدنی با منشأ طبیعی یا مصنوعی باشد که به خاک یا گیاه اضافه میشود که یک یا چند عنصر ضروری برای رشد گیاه را تأمین کند.}
آنگاه {آن ماده کود است}

Input:
"حفظ و افزایش بهره وری منابع معدنی که سرمایه های ملی تجدیدناپذیر هستند"
context:
"این عبارت، دنباله عبارت زیر است:
ماده 2 - اهداف و وظایف نظام مهندسی معدن عبارتند از:"

output:
حکم 1:
اگر {منابع معدنی وجود داشته باشد}
آنگاه {سرمایه ملی تجدیدناپذیر است}

حکم 2:
اگر {منابع معدنی وجود داشته باشد}
آنگاه {حفظ این منابع بر نظام مهندسی معدن واجب است.}

حکم 3:
اگر {منابع معدنی وجود داشت}
آنگاه {افزایش بهرهوری این منابع بر نظام مهندسی معدن واجب است.}
"""


def generate(system_prompt, formatted_prompt):
    formatted_prompt = formatted_prompt[:50000]  # truncate very long inputs to avoid GPU OOM
    messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": formatted_prompt}]
    # build the chat-formatted input and generate
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(
        input_ids,
        max_new_tokens=2048,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    # decode only the newly generated tokens, not the prompt
    response = outputs[0][input_ids.shape[-1]:]
    return tokenizer.decode(response, skip_special_tokens=True)


def simpler(text, context):
    user_prompt_with_context = f"""input:
{text}
context:
"{context}"
"""
    return generate(SYS_PROMPT_simpler, user_prompt_with_context)


def modified(text, context):
    user_prompt_with_context = f"""input:
{text}
context:
"{context}"
"""
    return generate(SYS_PROMPT_modified, user_prompt_with_context)


def translated(text):
    user_prompt_with_context = f"""input:
{text}
"""
    return generate(SYS_PROMPT_translated, user_prompt_with_context)


def fewshot(text, context):
    user_prompt_with_context = f"""input:
{text}
context:
"{context}"
"""
    # rule extraction uses the few-shot prompt
    return generate(SYS_PROMPT_fewshot, user_prompt_with_context)


if __name__ == "__main__":
    print('start')
    start_time = time.time()
    #inputfile = open('./merged_output.json', "r", encoding='utf-8')
    inputfile = open('./merged_output_01.json', "r", encoding='utf-8')
    data = json.load(inputfile)
    inputfile.close()
    for index, item in enumerate(tqdm(data)):
        # the earlier content-generation steps can be uncommented to run them
        # together with rule extraction in a single pass
        # content = item.get("content", "")
        contextdict = item.get("context", {})
        context_sentences = contextdict.get("context_sentences", [])
        # join the context sentences into one space-separated string
        context = " ".join(context_sentences)
        # context1_simpler = simpler(content, context)
        # item['simplified_content'] = context1_simpler
        # context1_modified = modified(content, context)
        # item['modified_content'] = context1_modified
        # context1_translated = translated(content)
        # item['translated_content'] = context1_translated

        # extract rules from each of the three content variants
        simplified_content = item['simplified_content']
        simplified_content_rule = fewshot(simplified_content, context)
        item['simplified_content_rule'] = simplified_content_rule

        modified_content = item['modified_content']
        modified_content_rule = fewshot(modified_content, context)
        item['modified_content_rule'] = modified_content_rule

        translated_content = item['translated_content']
        translated_content_rule = fewshot(translated_content, context)
        item['translated_content_rule'] = translated_content_rule

    outputfile = open('./merged_output_02.json', "w", encoding='utf-8')
    outputfile.write(json.dumps(data, ensure_ascii=False, indent=4))
    outputfile.close()
    end_time = time.time()
    print(f"elapsed time: {end_time - start_time}")
    print("end")
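For reference, the loop above assumes each input record already carries the three content variants produced by the companion generator script below. A minimal sketch of one record as this script consumes and extends it; the key names are taken from the code, while all values are illustrative placeholders, not data from the repository:

record = {
    "content": "<original legal section, Persian>",
    "context": {"context_sentences": ["<context sentence 1>", "<context sentence 2>"]},
    # produced by "3 type contents generator.py":
    "simplified_content": "<simplified Persian sentences>",
    "modified_content": "<punctuation- and spelling-corrected text>",
    "translated_content": "<English translation>",
    # added by the rule-extraction script above:
    "simplified_content_rule": "<if/then/unless analysis>",
    "modified_content_rule": "<if/then/unless analysis>",
    "translated_content_rule": "<if/then/unless analysis>",
}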
145
Rule_extraction/3 type contents generator.py
Normal file
@@ -0,0 +1,145 @@
import json
from tqdm import tqdm
import time

import re
import torch
import os
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

os.environ['HF_HOME'] = "/home/admin/HFHOME"

#model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
#model_id = "meta-llama/Llama-3.1-70B-Instruct"

# use quantization to lower GPU memory usage
# 4 bit:
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
# )
# 8 bit:
# bnb_config = BitsAndBytesConfig(
#     load_in_8bit=True
# )

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    #quantization_config=bnb_config
)
# stop generation at the generic EOS token or Llama 3's end-of-turn token
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
# pad with EOS to silence the missing-pad-token warning
model.generation_config.pad_token_id = tokenizer.eos_token_id


def remove_think_tags(strings):
    # strip <think>...</think> blocks emitted by reasoning models such as the
    # DeepSeek-R1 distill above; not called when using the Llama 3.1 instruct model
    if isinstance(strings, str):
        return re.sub(r'<think>[\s\S]*?</think>', '', strings)
    else:
        return [re.sub(r'<think>[\s\S]*?</think>', '', s) for s in strings]


SYS_PROMPT_simpler = """
Analyze the input sentence and divide it into several simpler and clearer sentences. The new sentences should retain the main meaning of the original sentence but be easier to understand. Use short and direct sentences and avoid complex terms or confusing grammatical structures.
The output should be Persian sentences only.
Each sentence should either be factual and express a minimal fact, or be conditional and express a minimal rule. A minimal fact or rule is a fact or rule that cannot be expressed as two separate and independent sentences.
"""

SYS_PROMPT_modified = """
Carefully edit the text below for punctuation errors (including punctuation marks, spaces, and half-spaces) and spelling errors, and perform a complete editorial review covering all editing points. The output should be only the corrected text, without any additional explanation.
The text should be preserved completely and only edited. Do not alter the sentence structure, add or remove words, or change the original meaning in any way. Pay attention to homophones and correct them based on the meaning of the sentence.
"""

SYS_PROMPT_translated = """
Translate the Persian content below into fluent and accurate English, ensuring the following criteria are met:

Meaning and Context: Preserve the original meaning and context of the text throughout the translation. These texts are all parts of legal laws in different domains.
Linguistic Accuracy: Adhere to proper English grammar, vocabulary, and syntax, demonstrating native-level proficiency in the language.
Proper Nouns: Do not translate words that are proper nouns.
Cultural Sensitivity: Translate any technical terms, idiomatic expressions, and culturally specific references appropriately to ensure clarity and relevance in English.
The final English translation should be professional, natural, and suitable for its intended audience, which is a legal audience.

Note that the output must be nothing but the final translation.
"""


def generate(system_prompt, formatted_prompt):
    formatted_prompt = formatted_prompt[:50000]  # truncate very long inputs to avoid GPU OOM
    messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": formatted_prompt}]
    # build the chat-formatted input and generate
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(
        input_ids,
        max_new_tokens=2048,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    # decode only the newly generated tokens, not the prompt
    response = outputs[0][input_ids.shape[-1]:]
    return tokenizer.decode(response, skip_special_tokens=True)


def simpler(text, context):
    user_prompt_with_context = f"""input:
{text}
context:
"{context}"
"""
    return generate(SYS_PROMPT_simpler, user_prompt_with_context)


def modified(text, context):
    user_prompt_with_context = f"""input:
{text}
context:
"{context}"
"""
    return generate(SYS_PROMPT_modified, user_prompt_with_context)


def translated(text):
    user_prompt_with_context = f"""input:
{text}
"""
    return generate(SYS_PROMPT_translated, user_prompt_with_context)


if __name__ == "__main__":
    print('start')
    start_time = time.time()
    inputfile = open('./merged_output.json', "r", encoding='utf-8')
    data = json.load(inputfile)
    inputfile.close()
    for index, item in enumerate(tqdm(data)):
        content = item.get("content", "")
        contextdict = item.get("context", {})
        context_sentences = contextdict.get("context_sentences", [])
        # join the context sentences into one space-separated string
        context = " ".join(context_sentences)

        # generate the three content variants for each record
        context1_simpler = simpler(content, context)
        item['simplified_content'] = context1_simpler
        context1_modified = modified(content, context)
        item['modified_content'] = context1_modified
        context1_translated = translated(content)
        item['translated_content'] = context1_translated

    outputfile = open('./merged_output_01.json', "w", encoding='utf-8')
    outputfile.write(json.dumps(data, ensure_ascii=False, indent=4))
    outputfile.close()
    end_time = time.time()
    print(f"elapsed time: {end_time - start_time}")
    print("end")
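Taken together, the two scripts form a two-stage pipeline; the intended run order below is inferred from the input and output file names in the code above:

# Stage 1 - "3 type contents generator.py":
#   reads ./merged_output.json and writes ./merged_output_01.json,
#   adding simplified_content, modified_content, and translated_content to each record.
# Stage 2 - "3 type contents generator and rule extraction.py":
#   reads ./merged_output_01.json and writes ./merged_output_02.json,
#   adding the three *_rule fields via the few-shot rule-extraction prompt.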
14329
Rule_extraction/Bime - 3 Content - Llama 3.2.json
Normal file
File diff suppressed because one or more lines are too long
16195
Rule_extraction/Bime - 3 Content and Rule extraction - Llama 3.2.json
Normal file
File diff suppressed because one or more lines are too long