data_processes/readme/readme-representer-en.md

2.1 KiB

Persian Sentence Representation Script

This script (p5_representer.py) is designed to simplify and represent complex Persian legal sentences as a set of simpler, more understandable sentences. It uses the meta-llama/Meta-Llama-3.1-8B-Instruct model for this task.

Note: For library versions, please refer to the requirements.txt file.

Model Used

  • Model: meta-llama/Meta-Llama-3.1-8B-Instruct
  • Loaded via HuggingFace Transformers (AutoModelForCausalLM, AutoTokenizer)

System and User Prompts

  • System prompt: Sets the model as a legal expert who explains legal texts in simple language for non-experts, without changing technical terms.
  • User prompt: Asks the model to rewrite the input legal text in a specified number of simple sentences in Persian.

Main Methods

1. single_section_representation(content)

  • Purpose: Simplifies a single legal text section.
  • Inputs:
    • content (str): The legal text to be simplified.
  • Outputs:
    • result (bool): Operation status.
    • desc (str): Description of the result.
    • sentences (list): List of simplified sentences.

2. do_representation(sections)

  • Purpose: Processes multiple sections and saves the results.
  • Inputs:
    • sections (dict): Dictionary where each key is a section ID and each value contains a content field.
  • Outputs:
    • operation_result (bool): Overall operation status.
    • sections (dict): The input dictionary with an added represented_sentences field for each section.

Example Input

sections = {
    "1": {"content": "این یک متن حقوقی پیچیده است که باید ساده شود."},
    "2": {"content": "متن حقوقی دوم برای بازنمایی."}
}
result, output_sections = do_representation(sections)

Output

Each section will have a new field represented_sentences containing the simplified sentences.

Notes

  • The script automatically uses GPU if available.
  • Errors for each section are logged in the ./data/represent/ directory.
  • The output JSON file is saved in ./data/represent/.