data_processes/readme/readme-keyword-extractor-en.md

# Persian Sentence Keyword Extractor

This project provides a Python script (`p5_representer.py`) for extracting **keywords** from Persian sentences and legal text sections using **transformer-based models**.

## How it works
The script uses the pre-trained **Meta-Llama-3.1-8B-Instruct** model (with quantization for efficiency).
It processes Persian text input, system and user prompts, and extracts the most relevant keywords.

## Requirements
- Python 3.8+
- torch, transformers, bitsandbytes
- elasticsearch helper (custom ElasticHelper class)
- Other utilities as listed in the `requirements.txt` file

For exact versions of the libraries, please check **`requirements.txt`**.

## Prompt Usage
- **System Prompt (SYS_PROMPT):** Defines the assistant role. Example: "You are a highly accurate and detail-oriented assistant specialized in analyzing Persian legal texts."
- **User Prompt:** Guides the model to extract a minimum number of keywords, returned as a clean Persian list without extra symbols or explanations.

This combination ensures consistent keyword extraction.

## Main Methods

### `format_prompt(SENTENCE: str) -> str`
Formats the raw Persian sentence into a model-ready input.
**Input:** A single Persian sentence (`str`)
**Output:** A formatted string (`str`)

### `kw_count_calculator(text: str) -> int`
Calculates the number of keywords to extract based on text length.
**Input:** Text (`str`)
**Output:** Keyword count (`int`)

### `generate(formatted_prompt: str) -> str`
Core generation method that sends the prompt to the model.
**Input:** Formatted text prompt (`str`)
**Output:** Generated keywords as a string (`str`)

### `single_section_get_keyword(sentence: str) -> list[str]`
Main method for extracting keywords from a sentence.
**Input:** Sentence (`str`)
**Output:** List of unique keywords (`list[str]`)

### `get_sections() -> dict`
Loads section data from a compressed JSON source (via ElasticHelper).
**Output:** Dictionary of sections (`dict`)

### `convert_to_dict(sections: list) -> dict`
Converts raw section list into a dictionary with IDs as keys.

### `do_keyword_extract(sections: dict) -> tuple`
Main execution loop for processing multiple sections, saving output to JSON files, and logging errors.
**Input:** Sections (`dict`)
**Output:** Tuple `(operation_result: bool, sections: dict)`

## Example Input/Output

**Input:**
```text
"حقوق و تکالیف شهروندی در قانون اساسی ایران مورد تاکید قرار گرفته است."
```

**Output:**
```text
حقوق شهروندی
قانون اساسی
تکالیف
ایران
```

## Notes
- Large models (Llama 3.1) require GPU with sufficient memory.
- The script handles repeated keywords by removing duplicates.
- Output is automatically saved in JSON format after processing.