3 types of content and rule extraction - prompt and datasets
This commit is contained in:
parent
9ca5c24110
commit
f3f0cde8f1
9072
Context_update/Bime - 622 sections with context.json
Normal file
File diff suppressed because it is too large
221
Rule_extraction/3 type contents generator and rule extraction.py
Normal file
@@ -0,0 +1,221 @@
import json
from tqdm import tqdm
import time

import re
import torch
import os
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

os.environ['HF_HOME'] = "/home/admin/HFHOME"

#model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
#model_id = "meta-llama/Llama-3.1-70B-Instruct"

# use quantization to lower GPU memory usage
# 4 bit:
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
# )
# 8 bit:
# bnb_config = BitsAndBytesConfig(
#     load_in_8bit=True
# )

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    #quantization_config=bnb_config
)
# stop generation at the generic EOS token or Llama 3's end-of-turn token
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
# pad with EOS to silence the missing-pad-token warning
model.generation_config.pad_token_id = tokenizer.eos_token_id


def remove_think_tags(strings):
    # strip <think>...</think> blocks emitted by reasoning models such as the
    # DeepSeek-R1 distill above; not called when using the Llama 3.1 instruct model
    if isinstance(strings, str):
        return re.sub(r'<think>[\s\S]*?</think>', '', strings)
    else:
        return [re.sub(r'<think>[\s\S]*?</think>', '', s) for s in strings]


SYS_PROMPT_simpler = """
Analyze the input sentence and divide it into several simpler and clearer sentences. The new sentences should retain the main meaning of the original sentence but be easier to understand. Use short and direct sentences and avoid complex terms or confusing grammatical structures.
The output should be Persian sentences only.
Each sentence should either be factual and express a minimal fact, or be conditional and express a minimal rule. A minimal fact or rule is a fact or rule that cannot be expressed as two separate and independent sentences.
"""

SYS_PROMPT_modified = """
Carefully edit the text below for punctuation errors (including punctuation marks, spaces, and half-spaces) and spelling errors, and perform a complete editorial review covering all editing points. The output should be only the corrected text, without any additional explanation.
The text should be preserved completely and only edited. Do not alter the sentence structure, add or remove words, or change the original meaning in any way. Pay attention to homophones and correct them based on the meaning of the sentence.
"""

SYS_PROMPT_translated = """
Translate the Persian content below into fluent and accurate English, ensuring the following criteria are met:

Meaning and Context: Preserve the original meaning and context of the text throughout the translation. These texts are all parts of legal laws in different domains.
Linguistic Accuracy: Adhere to proper English grammar, vocabulary, and syntax, demonstrating native-level proficiency in the language.
Proper Nouns: Do not translate words that are proper nouns.
Cultural Sensitivity: Translate any technical terms, idiomatic expressions, and culturally specific references appropriately to ensure clarity and relevance in English.
The final English translation should be professional, natural, and suitable for its intended audience, which is a legal audience.

Note that the output must be nothing but the final translation.
"""

# few-shot prompt for rule extraction; the examples are Persian legal provisions
# analyzed into "if {...} then {...} unless {...}" conditionals (حکم = ruling)
SYS_PROMPT_fewshot = """
You are an expert in law and also an expert in logic. When faced with any legal text, you can understand it well and grasp the propositions it expresses.
You can also correctly analyze and understand the logical relationships between these propositions in the text.

Legal sentences are sometimes factual sentences and sometimes rule-like sentences.
If the sentences are factual and contain no rule-like content, they do not require logical analysis, so say "no Rule". If the sentences are rule-based, you need to find their propositions and then express them in the form of an "if, then, unless" conditional.
To understand the output structure, follow the examples below.

Note that in some cases, to understand the logical relationships in a section of the law, you must consider other information outside the text as context. If you are given a sentence with context, you may use the context in your logical analysis of the sentence.
Do not say anything other than the output, formatted like these examples:
Input:
"سرمایه های صرف شده که در اختیار واحدهای عملیات نفتی گذاشته شده یا میشود، جزء دارایی های واحد مزبور خواهد بود ولی هرگونه نقل و انتقال آنها منوط به اجازه وزارت نفت میباشد."
output:
حکم 1:
اگر {سرمایه ای صرف شود و در اختیار یکی از واحدهای عملیاتی نفت باشد}
آنگاه {آن واحد مالک آن سرمایه خواهد بود}

حکم 2:
اگر {سرمایه ای در اختیار یکی از واحدهای عملیاتی نفت باشد و آن سرمایه نقل یا انتقال داده شود}
آنگاه {آن نقل یا انتقال جایز نیست}
مگر {آن نقل یا انتقال با اجازه وزارت نفت باشد}

Input:
"صرف درآمد موقوفات به منظور بقاء عین آنها بر سایر مصارف مقدم است"
output:
اگر {موقوفهای درآمدی داشت}
آنگاه {آن درآمد باید قبل از هر مصرف دیگری به منظور بقاء عین صرف شود.}
مگر {صرف درآمد به منظور بقاء عین لازم نباشد}

Input:
"کود: هر ماده آلی، زیستی یا معدنی با منشأ طبیعی یا مصنوعی که به خاک یا گیاه اضافه می شود تا یک یا چند عنصر ضروری برای رشد گیاه را تأمین کند."
output:
اگر{مادهای ماده آلی، زیستی یا معدنی با منشأ طبیعی یا مصنوعی باشد که به خاک یا گیاه اضافه میشود که یک یا چند عنصر ضروری برای رشد گیاه را تأمین کند.}
آنگاه {آن ماده کود است}

Input:
"حفظ و افزایش بهره وری منابع معدنی که سرمایه های ملی تجدیدناپذیر هستند"
context:
"این عبارت، دنباله عبارت زیر است:
ماده 2 - اهداف و وظایف نظام مهندسی معدن عبارتند از:"

output:
حکم 1:
اگر {منابع معدنی وجود داشته باشد}
آنگاه {سرمایه ملی تجدیدناپذیر است}

حکم 2:
اگر {منابع معدنی وجود داشته باشد}
آنگاه {حفظ این منابع بر نظام مهندسی معدن واجب است.}

حکم 3:
اگر {منابع معدنی وجود داشت}
آنگاه {افزایش بهرهوری این منابع بر نظام مهندسی معدن واجب است.}
"""


def generate(system_prompt, formatted_prompt):
    formatted_prompt = formatted_prompt[:50000]  # truncate very long inputs to avoid GPU OOM
    messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": formatted_prompt}]
    # build the chat-formatted input and generate
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(
        input_ids,
        max_new_tokens=2048,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    # decode only the newly generated tokens, not the prompt
    response = outputs[0][input_ids.shape[-1]:]
    return tokenizer.decode(response, skip_special_tokens=True)


def simpler(text, context):
    user_prompt_with_context = f"""input:
{text}
context:
"{context}"
"""
    return generate(SYS_PROMPT_simpler, user_prompt_with_context)


def modified(text, context):
    user_prompt_with_context = f"""input:
{text}
context:
"{context}"
"""
    return generate(SYS_PROMPT_modified, user_prompt_with_context)


def translated(text):
    user_prompt_with_context = f"""input:
{text}
"""
    return generate(SYS_PROMPT_translated, user_prompt_with_context)


def fewshot(text, context):
    user_prompt_with_context = f"""input:
{text}
context:
"{context}"
"""
    # rule extraction uses the few-shot prompt
    return generate(SYS_PROMPT_fewshot, user_prompt_with_context)


if __name__ == "__main__":
    print('start')
    start_time = time.time()
    #inputfile = open('./merged_output.json', "r", encoding='utf-8')
    inputfile = open('./merged_output_01.json', "r", encoding='utf-8')
    data = json.load(inputfile)
    inputfile.close()
    for index, item in enumerate(tqdm(data)):
        # the earlier content-generation steps can be uncommented to run them
        # together with rule extraction in a single pass
        # content = item.get("content", "")
        contextdict = item.get("context", {})
        context_sentences = contextdict.get("context_sentences", [])
        # join the context sentences into one space-separated string
        context = " ".join(context_sentences)
        # context1_simpler = simpler(content, context)
        # item['simplified_content'] = context1_simpler
        # context1_modified = modified(content, context)
        # item['modified_content'] = context1_modified
        # context1_translated = translated(content)
        # item['translated_content'] = context1_translated

        # extract rules from each of the three content variants
        simplified_content = item['simplified_content']
        simplified_content_rule = fewshot(simplified_content, context)
        item['simplified_content_rule'] = simplified_content_rule

        modified_content = item['modified_content']
        modified_content_rule = fewshot(modified_content, context)
        item['modified_content_rule'] = modified_content_rule

        translated_content = item['translated_content']
        translated_content_rule = fewshot(translated_content, context)
        item['translated_content_rule'] = translated_content_rule

    outputfile = open('./merged_output_02.json', "w", encoding='utf-8')
    outputfile.write(json.dumps(data, ensure_ascii=False, indent=4))
    outputfile.close()
    end_time = time.time()
    print(f"elapsed time: {end_time - start_time}")
    print("end")
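For reference, the loop above assumes each input record already carries the three content variants produced by the companion generator script below. A minimal sketch of one record as this script consumes and extends it; the key names are taken from the code, while all values are illustrative placeholders, not data from the repository:

record = {
    "content": "<original legal section, Persian>",
    "context": {"context_sentences": ["<context sentence 1>", "<context sentence 2>"]},
    # produced by "3 type contents generator.py":
    "simplified_content": "<simplified Persian sentences>",
    "modified_content": "<punctuation- and spelling-corrected text>",
    "translated_content": "<English translation>",
    # added by the rule-extraction script above:
    "simplified_content_rule": "<if/then/unless analysis>",
    "modified_content_rule": "<if/then/unless analysis>",
    "translated_content_rule": "<if/then/unless analysis>",
}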
145
Rule_extraction/3 type contents generator.py
Normal file
@@ -0,0 +1,145 @@
import json
from tqdm import tqdm
import time

import re
import torch
import os
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

os.environ['HF_HOME'] = "/home/admin/HFHOME"

#model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
#model_id = "meta-llama/Llama-3.1-70B-Instruct"

# use quantization to lower GPU memory usage
# 4 bit:
# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
# )
# 8 bit:
# bnb_config = BitsAndBytesConfig(
#     load_in_8bit=True
# )

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    #quantization_config=bnb_config
)
# stop generation at the generic EOS token or Llama 3's end-of-turn token
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
# pad with EOS to silence the missing-pad-token warning
model.generation_config.pad_token_id = tokenizer.eos_token_id


def remove_think_tags(strings):
    # strip <think>...</think> blocks emitted by reasoning models such as the
    # DeepSeek-R1 distill above; not called when using the Llama 3.1 instruct model
    if isinstance(strings, str):
        return re.sub(r'<think>[\s\S]*?</think>', '', strings)
    else:
        return [re.sub(r'<think>[\s\S]*?</think>', '', s) for s in strings]


SYS_PROMPT_simpler = """
Analyze the input sentence and divide it into several simpler and clearer sentences. The new sentences should retain the main meaning of the original sentence but be easier to understand. Use short and direct sentences and avoid complex terms or confusing grammatical structures.
The output should be Persian sentences only.
Each sentence should either be factual and express a minimal fact, or be conditional and express a minimal rule. A minimal fact or rule is a fact or rule that cannot be expressed as two separate and independent sentences.
"""

SYS_PROMPT_modified = """
Carefully edit the text below for punctuation errors (including punctuation marks, spaces, and half-spaces) and spelling errors, and perform a complete editorial review covering all editing points. The output should be only the corrected text, without any additional explanation.
The text should be preserved completely and only edited. Do not alter the sentence structure, add or remove words, or change the original meaning in any way. Pay attention to homophones and correct them based on the meaning of the sentence.
"""

SYS_PROMPT_translated = """
Translate the Persian content below into fluent and accurate English, ensuring the following criteria are met:

Meaning and Context: Preserve the original meaning and context of the text throughout the translation. These texts are all parts of legal laws in different domains.
Linguistic Accuracy: Adhere to proper English grammar, vocabulary, and syntax, demonstrating native-level proficiency in the language.
Proper Nouns: Do not translate words that are proper nouns.
Cultural Sensitivity: Translate any technical terms, idiomatic expressions, and culturally specific references appropriately to ensure clarity and relevance in English.
The final English translation should be professional, natural, and suitable for its intended audience, which is a legal audience.

Note that the output must be nothing but the final translation.
"""


def generate(system_prompt, formatted_prompt):
    formatted_prompt = formatted_prompt[:50000]  # truncate very long inputs to avoid GPU OOM
    messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": formatted_prompt}]
    # build the chat-formatted input and generate
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(
        input_ids,
        max_new_tokens=2048,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    # decode only the newly generated tokens, not the prompt
    response = outputs[0][input_ids.shape[-1]:]
    return tokenizer.decode(response, skip_special_tokens=True)


def simpler(text, context):
    user_prompt_with_context = f"""input:
{text}
context:
"{context}"
"""
    return generate(SYS_PROMPT_simpler, user_prompt_with_context)


def modified(text, context):
    user_prompt_with_context = f"""input:
{text}
context:
"{context}"
"""
    return generate(SYS_PROMPT_modified, user_prompt_with_context)


def translated(text):
    user_prompt_with_context = f"""input:
{text}
"""
    return generate(SYS_PROMPT_translated, user_prompt_with_context)


if __name__ == "__main__":
    print('start')
    start_time = time.time()
    inputfile = open('./merged_output.json', "r", encoding='utf-8')
    data = json.load(inputfile)
    inputfile.close()
    for index, item in enumerate(tqdm(data)):
        content = item.get("content", "")
        contextdict = item.get("context", {})
        context_sentences = contextdict.get("context_sentences", [])
        # join the context sentences into one space-separated string
        context = " ".join(context_sentences)

        # generate the three content variants for each record
        context1_simpler = simpler(content, context)
        item['simplified_content'] = context1_simpler
        context1_modified = modified(content, context)
        item['modified_content'] = context1_modified
        context1_translated = translated(content)
        item['translated_content'] = context1_translated

    outputfile = open('./merged_output_01.json', "w", encoding='utf-8')
    outputfile.write(json.dumps(data, ensure_ascii=False, indent=4))
    outputfile.close()
    end_time = time.time()
    print(f"elapsed time: {end_time - start_time}")
    print("end")
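Taken together, the two scripts form a two-stage pipeline; the intended run order below is inferred from the input and output file names in the code above:

# Stage 1 - "3 type contents generator.py":
#   reads ./merged_output.json and writes ./merged_output_01.json,
#   adding simplified_content, modified_content, and translated_content to each record.
# Stage 2 - "3 type contents generator and rule extraction.py":
#   reads ./merged_output_01.json and writes ./merged_output_02.json,
#   adding the three *_rule fields via the few-shot rule-extraction prompt.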
14329
Rule_extraction/Bime - 3 Content - Llama 3.2.json
Normal file
File diff suppressed because one or more lines are too long
16195
Rule_extraction/Bime - 3 Content and Rule extraction - Llama 3.2.json
Normal file
File diff suppressed because one or more lines are too long