# Persian Sentence Keyword Extractor

This project provides a Python script (`p5_representer.py`) for extracting **keywords** from Persian sentences and legal text sections using **transformer-based models**.

## How it works

The script uses the pre-trained **Meta-Llama-3.1-8B-Instruct** model (with quantization for efficiency). It processes Persian text input, generates system and user prompts, and extracts the most relevant keywords.

## Requirements

- Python 3.8+
- torch, transformers, bitsandbytes
- elasticsearch helper (custom ElasticHelper class)
- Other utilities as listed in the `requirements.txt` file

For exact versions of the libraries, please check **`requirements.txt`**.

## Prompt Usage

- **System Prompt (SYS_PROMPT):** Defines the assistant role. Example: "You are a highly accurate and detail-oriented assistant specialized in analyzing Persian legal texts."
- **User Prompt:** Guides the model to extract a minimum number of keywords, returned as a clean Persian list without extra symbols or explanations.

This combination ensures consistent keyword extraction.

## Main Methods

### `format_prompt(SENTENCE: str) -> str`

Formats the raw Persian sentence into a model-ready input.

**Input:** A single Persian sentence (`str`)

**Output:** A formatted string (`str`)

### `kw_count_calculator(text: str) -> int`

Calculates the number of keywords to extract based on text length.

**Input:** Text (`str`)

**Output:** Keyword count (`int`)

### `generate(formatted_prompt: str) -> str`

Core generation method that sends the prompt to the model.

**Input:** Formatted text prompt (`str`)

**Output:** Generated keywords as a string (`str`)

### `single_section_get_keyword(sentence: str) -> list[str]`

Main method for extracting keywords from a sentence.

**Input:** Sentence (`str`)

**Output:** List of unique keywords (`list[str]`)

### `get_sections() -> dict`

Loads section data from a compressed JSON source (via ElasticHelper).

**Output:** Dictionary of sections (`dict`)

### `convert_to_dict(sections: list) -> dict`

Converts raw section list into a dictionary with IDs as keys.

### `do_keyword_extract(sections: dict) -> tuple`

Main execution loop for processing multiple sections, saving output to JSON files, and logging errors.

**Input:** Sections (`dict`)

**Output:** Tuple `(operation_result: bool, sections: dict)`

## Example Input/Output

**Input:**

```text
"حقوق و تکالیف شهروندی در قانون اساسی ایران مورد تاکید قرار گرفته است."
```

**Output:**

```text
حقوق شهروندی قانون اساسی تکالیف ایران
```

## Notes

- Large models (Llama 3.1) require GPU with sufficient memory.
- The script handles repeated keywords by removing duplicates.
- Output is automatically saved in JSON format after processing.
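
The duplicate handling mentioned above can be illustrated with a minimal sketch. This is not the code from `p5_representer.py`: the helper name `parse_keywords` is hypothetical, and the sketch assumes the model returns one keyword per line, possibly with stray list markers. It shows one common way to turn the raw string from `generate` into a list of unique keywords while preserving first-seen order.

```python
import re
from typing import List


def parse_keywords(generated: str) -> List[str]:
    """Hypothetical helper: split raw model output into unique keywords.

    Assumes one keyword per line; the real script may parse differently.
    """
    seen = set()
    keywords = []
    for line in generated.splitlines():
        # Drop a leading list marker ("-", "*", "•", "1.", "2)") and whitespace.
        cleaned = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", line).strip()
        # Keep only non-empty keywords that have not been seen yet.
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            keywords.append(cleaned)
    return keywords


# Illustrative input only, not actual model output.
raw_output = "حقوق شهروندی\nقانون اساسی\n- حقوق شهروندی\nایران\n"
print(parse_keywords(raw_output))
```

A set tracks what has been seen while a separate list keeps the order, so the result stays in the order the model produced it. Calling `set()` alone on the list would also remove duplicates but would lose that ordering.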