data_processes/readme/readme-representer-en.md

55 lines
2.1 KiB
Markdown

# Persian Sentence Representation Script
This script (`p5_representer.py`) is designed to simplify and represent complex Persian legal sentences as a set of simpler, more understandable sentences. It uses the `meta-llama/Meta-Llama-3.1-8B-Instruct` model for this task.
**Note:** For library versions, please refer to the `requirements.txt` file.
## Model Used
- Model: `meta-llama/Meta-Llama-3.1-8B-Instruct`
- Loaded via HuggingFace Transformers (`AutoModelForCausalLM`, `AutoTokenizer`)
## System and User Prompts
- **System prompt:** Sets the model as a legal expert who explains legal texts in simple language for non-experts, without changing technical terms.
- **User prompt:** Asks the model to rewrite the input legal text in a specified number of simple sentences in Persian.
## Main Methods
### 1. `single_section_representation(content)`
- **Purpose:** Simplifies a single legal text section.
- **Inputs:**
- `content` (str): The legal text to be simplified.
- **Outputs:**
- `result` (bool): Operation status.
- `desc` (str): Description of the result.
- `sentences` (list): List of simplified sentences.
### 2. `do_representation(sections)`
- **Purpose:** Processes multiple sections and saves the results.
- **Inputs:**
- `sections` (dict): Dictionary where each key is a section ID and each value contains a `content` field.
- **Outputs:**
- `operation_result` (bool): Overall operation status.
- `sections` (dict): The input dictionary with an added `represented_sentences` field for each section.
## Example Input
```python
sections = {
"1": {"content": "این یک متن حقوقی پیچیده است که باید ساده شود."},
"2": {"content": "متن حقوقی دوم برای بازنمایی."}
}
result, output_sections = do_representation(sections)
```
## Output
Each section will have a new field `represented_sentences` containing the simplified sentences.
## Notes
- The script automatically uses GPU if available.
- Errors for each section are logged in the `./data/represent/` directory.
- The output JSON file is saved in `./data/represent/`.