data_processes/readme/readme-keyword-extractor-en.md
2025-08-16 16:54:29 +03:30

2.8 KiB

Persian Sentence Keyword Extractor

This project provides a Python script (p5_representer.py) for extracting keywords from Persian sentences and legal text sections using transformer-based models.

How it works

The script uses the pre-trained Meta-Llama-3.1-8B-Instruct model (with quantization for efficiency).
It processes Persian text input, system and user prompts, and extracts the most relevant keywords.

Requirements

  • Python 3.8+
  • torch, transformers, bitsandbytes
  • elasticsearch helper (custom ElasticHelper class)
  • Other utilities as listed in the requirements.txt file

For exact versions of the libraries, please check requirements.txt.

Prompt Usage

  • System Prompt (SYS_PROMPT): Defines the assistant role. Example: "You are a highly accurate and detail-oriented assistant specialized in analyzing Persian legal texts."
  • User Prompt: Guides the model to extract a minimum number of keywords, returned as a clean Persian list without extra symbols or explanations.

This combination ensures consistent keyword extraction.

Main Methods

format_prompt(SENTENCE: str) -> str

Formats the raw Persian sentence into a model-ready input.
Input: A single Persian sentence (str)
Output: A formatted string (str)

kw_count_calculator(text: str) -> int

Calculates the number of keywords to extract based on text length.
Input: Text (str)
Output: Keyword count (int)

generate(formatted_prompt: str) -> str

Core generation method that sends the prompt to the model.
Input: Formatted text prompt (str)
Output: Generated keywords as a string (str)

single_section_get_keyword(sentence: str) -> list[str]

Main method for extracting keywords from a sentence.
Input: Sentence (str)
Output: List of unique keywords (list[str])

get_sections() -> dict

Loads section data from a compressed JSON source (via ElasticHelper).
Output: Dictionary of sections (dict)

convert_to_dict(sections: list) -> dict

Converts raw section list into a dictionary with IDs as keys.

do_keyword_extract(sections: dict) -> tuple

Main execution loop for processing multiple sections, saving output to JSON files, and logging errors.
Input: Sections (dict)
Output: Tuple (operation_result: bool, sections: dict)

Example Input/Output

Input:

"حقوق و تکالیف شهروندی در قانون اساسی ایران مورد تاکید قرار گرفته است."

Output:

حقوق شهروندی
قانون اساسی
تکالیف
ایران

Notes

  • Large models (Llama 3.1) require GPU with sufficient memory.
  • The script handles repeated keywords by removing duplicates.
  • Output is automatically saved in JSON format after processing.