Persian Sentence Keyword Extractor

This project provides a Python script (p5_representer.py) for extracting keywords from Persian sentences and legal text sections using transformer-based models.

How it works

The script uses the pre-trained Meta-Llama-3.1-8B-Instruct model (with quantization for efficiency).
It processes Persian text input, system and user prompts, and extracts the most relevant keywords.

Requirements

Python 3.8+
torch, transformers, bitsandbytes
elasticsearch helper (custom ElasticHelper class)
Other utilities as listed in the requirements.txt file

For exact versions of the libraries, please check requirements.txt.

Prompt Usage

System Prompt (SYS_PROMPT): Defines the assistant role. Example: "You are a highly accurate and detail-oriented assistant specialized in analyzing Persian legal texts."
User Prompt: Guides the model to extract a minimum number of keywords, returned as a clean Persian list without extra symbols or explanations.

This combination ensures consistent keyword extraction.

Main Methods

`format_prompt(SENTENCE: str) -> str`

Formats the raw Persian sentence into a model-ready input.
Input: A single Persian sentence (str)
Output: A formatted string (str)

`kw_count_calculator(text: str) -> int`

Calculates the number of keywords to extract based on text length.
Input: Text (str)
Output: Keyword count (int)

`generate(formatted_prompt: str) -> str`

Core generation method that sends the prompt to the model.
Input: Formatted text prompt (str)
Output: Generated keywords as a string (str)

`single_section_get_keyword(sentence: str) -> list[str]`

Main method for extracting keywords from a sentence.
Input: Sentence (str)
Output: List of unique keywords (list[str])

`get_sections() -> dict`

Loads section data from a compressed JSON source (via ElasticHelper).
Output: Dictionary of sections (dict)

`convert_to_dict(sections: list) -> dict`

Converts raw section list into a dictionary with IDs as keys.

`do_keyword_extract(sections: dict) -> tuple`

Main execution loop for processing multiple sections, saving output to JSON files, and logging errors.
Input: Sections (dict)
Output: Tuple (operation_result: bool, sections: dict)

Example Input/Output

Input:

"حقوق و تکالیف شهروندی در قانون اساسی ایران مورد تاکید قرار گرفته است."

Output:

حقوق شهروندی
قانون اساسی
تکالیف
ایران

Notes

Large models (Llama 3.1) require GPU with sufficient memory.
The script handles repeated keywords by removing duplicates.
Output is automatically saved in JSON format after processing.

2.8 KiB Raw Permalink Blame History

Persian Sentence Keyword Extractor

How it works

Requirements

Prompt Usage

Main Methods

format_prompt(SENTENCE: str) -> str

kw_count_calculator(text: str) -> int

generate(formatted_prompt: str) -> str

single_section_get_keyword(sentence: str) -> list[str]

get_sections() -> dict

convert_to_dict(sections: list) -> dict

do_keyword_extract(sections: dict) -> tuple