2.8 KiB
Persian Sentence Keyword Extractor
This project provides a Python script (p5_representer.py
) for extracting keywords from Persian sentences and legal text sections using transformer-based models.
How it works
The script uses the pre-trained Meta-Llama-3.1-8B-Instruct model (with quantization for efficiency).
It processes Persian text input, system and user prompts, and extracts the most relevant keywords.
Requirements
- Python 3.8+
- torch, transformers, bitsandbytes
- elasticsearch helper (custom ElasticHelper class)
- Other utilities as listed in the
requirements.txt
file
For exact versions of the libraries, please check requirements.txt
.
Prompt Usage
- System Prompt (SYS_PROMPT): Defines the assistant role. Example: "You are a highly accurate and detail-oriented assistant specialized in analyzing Persian legal texts."
- User Prompt: Guides the model to extract a minimum number of keywords, returned as a clean Persian list without extra symbols or explanations.
This combination ensures consistent keyword extraction.
Main Methods
format_prompt(SENTENCE: str) -> str
Formats the raw Persian sentence into a model-ready input.
Input: A single Persian sentence (str
)
Output: A formatted string (str
)
kw_count_calculator(text: str) -> int
Calculates the number of keywords to extract based on text length.
Input: Text (str
)
Output: Keyword count (int
)
generate(formatted_prompt: str) -> str
Core generation method that sends the prompt to the model.
Input: Formatted text prompt (str
)
Output: Generated keywords as a string (str
)
single_section_get_keyword(sentence: str) -> list[str]
Main method for extracting keywords from a sentence.
Input: Sentence (str
)
Output: List of unique keywords (list[str]
)
get_sections() -> dict
Loads section data from a compressed JSON source (via ElasticHelper).
Output: Dictionary of sections (dict
)
convert_to_dict(sections: list) -> dict
Converts raw section list into a dictionary with IDs as keys.
do_keyword_extract(sections: dict) -> tuple
Main execution loop for processing multiple sections, saving output to JSON files, and logging errors.
Input: Sections (dict
)
Output: Tuple (operation_result: bool, sections: dict)
Example Input/Output
Input:
"حقوق و تکالیف شهروندی در قانون اساسی ایران مورد تاکید قرار گرفته است."
Output:
حقوق شهروندی
قانون اساسی
تکالیف
ایران
Notes
- Large models (Llama 3.1) require GPU with sufficient memory.
- The script handles repeated keywords by removing duplicates.
- Output is automatically saved in JSON format after processing.