new kw readme

2025-08-16 16:47:04 +03:30 · 2025-08-16 16:47:04 +03:30 · ab99c73b03
commit ab99c73b03
parent 16edcb599d
2 changed files with 150 additions and 0 deletions
--- a/readme/1.md
+++ b/readme/1.md
@ -0,0 +1,75 @@
+# Persian Sentence Keyword Extractor
+
+This project provides a Python script (`p5_representer.py`) for extracting **keywords** from Persian sentences and legal text sections using **transformer-based models**.
+
+## How it works
+The script uses the pre-trained **Meta-Llama-3.1-8B-Instruct** model (with quantization for efficiency).  
+It processes Persian text input, generates system and user prompts, and extracts the most relevant keywords.
+
+## Requirements
+- Python 3.8+
+- torch, transformers, bitsandbytes
+- elasticsearch helper (custom ElasticHelper class)
+- Other utilities as listed in the `requirements.txt` file
+
+For exact versions of the libraries, please check **`requirements.txt`**.
+
+## Prompt Usage
+- **System Prompt (SYS_PROMPT):** Defines the assistant role. Example: "You are a highly accurate and detail-oriented assistant specialized in analyzing Persian legal texts."
+- **User Prompt:** Guides the model to extract a minimum number of keywords, returned as a clean Persian list without extra symbols or explanations.
+
+This combination ensures consistent keyword extraction.
+
+## Main Methods
+
+### `format_prompt(SENTENCE: str) -> str`
+Formats the raw Persian sentence into a model-ready input.  
+**Input:** A single Persian sentence (`str`)  
+**Output:** A formatted string (`str`)  
+
+### `kw_count_calculator(text: str) -> int`
+Calculates the number of keywords to extract based on text length.  
+**Input:** Text (`str`)  
+**Output:** Keyword count (`int`)  
+
+### `generate(formatted_prompt: str) -> str`
+Core generation method that sends the prompt to the model.  
+**Input:** Formatted text prompt (`str`)  
+**Output:** Generated keywords as a string (`str`)  
+
+### `single_section_get_keyword(sentence: str) -> list[str]`
+Main method for extracting keywords from a sentence.  
+**Input:** Sentence (`str`)  
+**Output:** List of unique keywords (`list[str]`)  
+
+### `get_sections() -> dict`
+Loads section data from a compressed JSON source (via ElasticHelper).  
+**Output:** Dictionary of sections (`dict`)  
+
+### `convert_to_dict(sections: list) -> dict`
+Converts raw section list into a dictionary with IDs as keys.  
+
+### `do_keyword_extract(sections: dict) -> tuple`
+Main execution loop for processing multiple sections, saving output to JSON files, and logging errors.  
+**Input:** Sections (`dict`)  
+**Output:** Tuple `(operation_result: bool, sections: dict)`  
+
+## Example Input/Output
+
+**Input:**  
+```text
+"حقوق و تکالیف شهروندی در قانون اساسی ایران مورد تاکید قرار گرفته است."
+```
+
+**Output:**  
+```text
+حقوق شهروندی
+قانون اساسی
+تکالیف
+ایران
+```
+
+## Notes
+- Large models (Llama 3.1) require GPU with sufficient memory.  
+- The script handles repeated keywords by removing duplicates.  
+- Output is automatically saved in JSON format after processing.  
--- a/readme/2.md
+++ b/readme/2.md
@ -0,0 +1,75 @@
+# استخراج‌گر کلیدواژه جملات فارسی
+
+این پروژه یک اسکریپت پایتون (`p5_representer.py`) برای استخراج **کلیدواژه‌ها** از جملات و سکشن‌های متون حقوقی فارسی با استفاده از **مدل‌های مبتنی بر Transformer** است.
+
+## نحوه عملکرد
+این اسکریپت از مدل **Meta-Llama-3.1-8B-Instruct** (با فشرده‌سازی و کوانتش برای کارایی بیشتر) استفاده می‌کند.  
+ابتدا متن ورودی دریافت شده، با استفاده از پرامپت‌های سیستمی و کاربری آماده‌سازی می‌شود و سپس کلمات کلیدی مرتبط از متن استخراج می‌شوند.
+
+## پیش‌نیازها
+- پایتون 3.8 یا بالاتر
+- کتابخانه‌های torch، transformers، bitsandbytes
+- کلاس ElasticHelper برای بارگذاری داده‌ها
+- سایر ابزارها در فایل `requirements.txt`
+
+برای مشاهده نسخه دقیق کتابخانه‌ها به فایل **`requirements.txt`** مراجعه کنید.
+
+## استفاده از پرامپت‌ها
+- **پرامپت سیستمی (SYS_PROMPT):** نقش دستیار را تعریف می‌کند. نمونه: "شما یک دستیار حقوقی هستید."
+- **پرامپت کاربری (USER_PROMPT):** به مدل می‌گوید حداقل تعداد مشخصی کلیدواژه استخراج کند. خروجی باید فهرستی فارسی باشد، بدون علائم اضافی.
+
+این ترکیب باعث پایداری و دقت در استخراج کلیدواژه می‌شود.
+
+## متدهای اصلی
+
+### `format_prompt(SENTENCE: str) -> str`
+متن خام فارسی را به فرمت مناسب برای مدل تبدیل می‌کند.  
+**ورودی:** یک جمله فارسی (`str`)  
+**خروجی:** متن قالب‌بندی‌شده (`str`)  
+
+### `kw_count_calculator(text: str) -> int`
+تعداد کلیدواژه‌ها را بر اساس طول متن محاسبه می‌کند.  
+**ورودی:** متن (`str`)  
+**خروجی:** تعداد کلیدواژه‌ها (`int`)  
+
+### `generate(formatted_prompt: str) -> str`
+متد اصلی برای ارسال پرامپت به مدل و دریافت خروجی.  
+**ورودی:** پرامپت آماده‌شده (`str`)  
+**خروجی:** متن کلیدواژه‌ها (`str`)  
+
+### `single_section_get_keyword(sentence: str) -> list[str]`
+متد اصلی برای استخراج کلیدواژه‌ها از یک جمله.  
+**ورودی:** جمله (`str`)  
+**خروجی:** لیستی از کلیدواژه‌های یکتا (`list[str]`)  
+
+### `get_sections() -> dict`
+بارگذاری سکشن‌ها از فایل فشرده JSON با کمک کلاس ElasticHelper.  
+**خروجی:** دیکشنری سکشن‌ها (`dict`)  
+
+### `convert_to_dict(sections: list) -> dict`
+تبدیل لیست سکشن‌ها به دیکشنری با کلید ID.  
+
+### `do_keyword_extract(sections: dict) -> tuple`
+حلقه اصلی پردازش سکشن‌ها، ذخیره خروجی در فایل JSON و ثبت خطاها.  
+**ورودی:** سکشن‌ها (`dict`)  
+**خروجی:** تاپل `(operation_result: bool, sections: dict)`  
+
+## مثال ورودی/خروجی
+
+**ورودی:**  
+```text
+"حقوق و تکالیف شهروندی در قانون اساسی ایران مورد تاکید قرار گرفته است."
+```
+
+**خروجی:**  
+```text
+حقوق شهروندی
+قانون اساسی
+تکالیف
+ایران
+```
+
+## نکات
+- مدل‌های بزرگ (Llama 3.1) به GPU با حافظه بالا نیاز دارند.  
+- کلیدواژه‌های تکراری حذف می‌شوند.  
+- نتایج پردازش به‌صورت خودکار در فایل JSON ذخیره می‌شود.