new kw readme

2025-08-16 16:47:04 +03:30 · 2025-08-16 16:47:04 +03:30 · ab99c73b03
commit ab99c73b03
parent 16edcb599d
2 changed files with 150 additions and 0 deletions
--- a/readme/1.md
+++ b/readme/1.md
@ -0,0 +1,75 @@
 # Persian Sentence Keyword Extractor
 This project provides a Python script (`p5_representer.py`) for extracting **keywords** from Persian sentences and legal text sections using **transformer-based models**.
 ## How it works
 The script uses the pre-trained **Meta-Llama-3.1-8B-Instruct** model (with quantization for efficiency).  
 It processes Persian text input, generates system and user prompts, and extracts the most relevant keywords.
 ## Requirements
 - Python 3.8+
 - torch, transformers, bitsandbytes
 - elasticsearch helper (custom ElasticHelper class)
 - Other utilities as listed in the `requirements.txt` file
 For exact versions of the libraries, please check **`requirements.txt`**.
 ## Prompt Usage
 - **System Prompt (SYS_PROMPT):** Defines the assistant role. Example: "You are a highly accurate and detail-oriented assistant specialized in analyzing Persian legal texts."
 - **User Prompt:** Guides the model to extract a minimum number of keywords, returned as a clean Persian list without extra symbols or explanations.
 This combination ensures consistent keyword extraction.
 ## Main Methods
 ### `format_prompt(SENTENCE: str) -> str`
 Formats the raw Persian sentence into a model-ready input.  
 **Input:** A single Persian sentence (`str`)  
 **Output:** A formatted string (`str`)  
 ### `kw_count_calculator(text: str) -> int`
 Calculates the number of keywords to extract based on text length.  
 **Input:** Text (`str`)  
 **Output:** Keyword count (`int`)  
 ### `generate(formatted_prompt: str) -> str`
 Core generation method that sends the prompt to the model.  
 **Input:** Formatted text prompt (`str`)  
 **Output:** Generated keywords as a string (`str`)  
 ### `single_section_get_keyword(sentence: str) -> list[str]`
 Main method for extracting keywords from a sentence.  
 **Input:** Sentence (`str`)  
 **Output:** List of unique keywords (`list[str]`)  
 ### `get_sections() -> dict`
 Loads section data from a compressed JSON source (via ElasticHelper).  
 **Output:** Dictionary of sections (`dict`)  
 ### `convert_to_dict(sections: list) -> dict`
 Converts raw section list into a dictionary with IDs as keys.  
 ### `do_keyword_extract(sections: dict) -> tuple`
 Main execution loop for processing multiple sections, saving output to JSON files, and logging errors.  
 **Input:** Sections (`dict`)  
 **Output:** Tuple `(operation_result: bool, sections: dict)`  
 ## Example Input/Output
 **Input:**  
 ```text
 "حقوق و تکالیف شهروندی در قانون اساسی ایران مورد تاکید قرار گرفته است."
 ```
 **Output:**  
 ```text
 حقوق شهروندی
 قانون اساسی
 تکالیف
 ایران
 ```
 ## Notes
 - Large models (Llama 3.1) require GPU with sufficient memory.  
 - The script handles repeated keywords by removing duplicates.  
 - Output is automatically saved in JSON format after processing.  
--- a/readme/2.md
+++ b/readme/2.md
@ -0,0 +1,75 @@
 # استخراج‌گر کلیدواژه جملات فارسی
 این پروژه یک اسکریپت پایتون (`p5_representer.py`) برای استخراج **کلیدواژه‌ها** از جملات و سکشن‌های متون حقوقی فارسی با استفاده از **مدل‌های مبتنی بر Transformer** است.
 ## نحوه عملکرد
 این اسکریپت از مدل **Meta-Llama-3.1-8B-Instruct** (با فشرده‌سازی و کوانتش برای کارایی بیشتر) استفاده می‌کند.  
 ابتدا متن ورودی دریافت شده، با استفاده از پرامپت‌های سیستمی و کاربری آماده‌سازی می‌شود و سپس کلمات کلیدی مرتبط از متن استخراج می‌شوند.
 ## پیش‌نیازها
 - پایتون 3.8 یا بالاتر
 - کتابخانه‌های torch، transformers، bitsandbytes
 - کلاس ElasticHelper برای بارگذاری داده‌ها
 - سایر ابزارها در فایل `requirements.txt`
 برای مشاهده نسخه دقیق کتابخانه‌ها به فایل **`requirements.txt`** مراجعه کنید.
 ## استفاده از پرامپت‌ها
 - **پرامپت سیستمی (SYS_PROMPT):** نقش دستیار را تعریف می‌کند. نمونه: "شما یک دستیار حقوقی هستید."
 - **پرامپت کاربری (USER_PROMPT):** به مدل می‌گوید حداقل تعداد مشخصی کلیدواژه استخراج کند. خروجی باید فهرستی فارسی باشد، بدون علائم اضافی.
 این ترکیب باعث پایداری و دقت در استخراج کلیدواژه می‌شود.
 ## متدهای اصلی
 ### `format_prompt(SENTENCE: str) -> str`
 متن خام فارسی را به فرمت مناسب برای مدل تبدیل می‌کند.  
 **ورودی:** یک جمله فارسی (`str`)  
 **خروجی:** متن قالب‌بندی‌شده (`str`)  
 ### `kw_count_calculator(text: str) -> int`
 تعداد کلیدواژه‌ها را بر اساس طول متن محاسبه می‌کند.  
 **ورودی:** متن (`str`)  
 **خروجی:** تعداد کلیدواژه‌ها (`int`)  
 ### `generate(formatted_prompt: str) -> str`
 متد اصلی برای ارسال پرامپت به مدل و دریافت خروجی.  
 **ورودی:** پرامپت آماده‌شده (`str`)  
 **خروجی:** متن کلیدواژه‌ها (`str`)  
 ### `single_section_get_keyword(sentence: str) -> list[str]`
 متد اصلی برای استخراج کلیدواژه‌ها از یک جمله.  
 **ورودی:** جمله (`str`)  
 **خروجی:** لیستی از کلیدواژه‌های یکتا (`list[str]`)  
 ### `get_sections() -> dict`
 بارگذاری سکشن‌ها از فایل فشرده JSON با کمک کلاس ElasticHelper.  
 **خروجی:** دیکشنری سکشن‌ها (`dict`)  
 ### `convert_to_dict(sections: list) -> dict`
 تبدیل لیست سکشن‌ها به دیکشنری با کلید ID.  
 ### `do_keyword_extract(sections: dict) -> tuple`
 حلقه اصلی پردازش سکشن‌ها، ذخیره خروجی در فایل JSON و ثبت خطاها.  
 **ورودی:** سکشن‌ها (`dict`)  
 **خروجی:** تاپل `(operation_result: bool, sections: dict)`  
 ## مثال ورودی/خروجی
 **ورودی:**  
 ```text
 "حقوق و تکالیف شهروندی در قانون اساسی ایران مورد تاکید قرار گرفته است."
 ```
 **خروجی:**  
 ```text
 حقوق شهروندی
 قانون اساسی
 تکالیف
 ایران
 ```
 ## نکات
 - مدل‌های بزرگ (Llama 3.1) به GPU با حافظه بالا نیاز دارند.  
 - کلیدواژه‌های تکراری حذف می‌شوند.  
 - نتایج پردازش به‌صورت خودکار در فایل JSON ذخیره می‌شود.