edit kw readme

2025-08-16 16:54:29 +03:30 · 2025-08-16 16:54:29 +03:30 · 6318c7ce1e
commit 6318c7ce1e
parent ab99c73b03
4 changed files with 125 additions and 193 deletions
--- a/readme/1.md
+++ b/readme/1.md
@ -1,75 +0,0 @@
 # Persian Sentence Keyword Extractor
 This project provides a Python script (`p5_representer.py`) for extracting **keywords** from Persian sentences and legal text sections using **transformer-based models**.
 ## How it works
 The script uses the pre-trained **Meta-Llama-3.1-8B-Instruct** model (with quantization for efficiency).  
 It processes Persian text input, generates system and user prompts, and extracts the most relevant keywords.
 ## Requirements
 - Python 3.8+
 - torch, transformers, bitsandbytes
 - elasticsearch helper (custom ElasticHelper class)
 - Other utilities as listed in the `requirements.txt` file
 For exact versions of the libraries, please check **`requirements.txt`**.
 ## Prompt Usage
 - **System Prompt (SYS_PROMPT):** Defines the assistant role. Example: "You are a highly accurate and detail-oriented assistant specialized in analyzing Persian legal texts."
 - **User Prompt:** Guides the model to extract a minimum number of keywords, returned as a clean Persian list without extra symbols or explanations.
 This combination ensures consistent keyword extraction.
 ## Main Methods
 ### `format_prompt(SENTENCE: str) -> str`
 Formats the raw Persian sentence into a model-ready input.  
 **Input:** A single Persian sentence (`str`)  
 **Output:** A formatted string (`str`)  
 ### `kw_count_calculator(text: str) -> int`
 Calculates the number of keywords to extract based on text length.  
 **Input:** Text (`str`)  
 **Output:** Keyword count (`int`)  
 ### `generate(formatted_prompt: str) -> str`
 Core generation method that sends the prompt to the model.  
 **Input:** Formatted text prompt (`str`)  
 **Output:** Generated keywords as a string (`str`)  
 ### `single_section_get_keyword(sentence: str) -> list[str]`
 Main method for extracting keywords from a sentence.  
 **Input:** Sentence (`str`)  
 **Output:** List of unique keywords (`list[str]`)  
 ### `get_sections() -> dict`
 Loads section data from a compressed JSON source (via ElasticHelper).  
 **Output:** Dictionary of sections (`dict`)  
 ### `convert_to_dict(sections: list) -> dict`
 Converts raw section list into a dictionary with IDs as keys.  
 ### `do_keyword_extract(sections: dict) -> tuple`
 Main execution loop for processing multiple sections, saving output to JSON files, and logging errors.  
 **Input:** Sections (`dict`)  
 **Output:** Tuple `(operation_result: bool, sections: dict)`  
 ## Example Input/Output
 **Input:**  
 ```text
 "حقوق و تکالیف شهروندی در قانون اساسی ایران مورد تاکید قرار گرفته است."
 ```
 **Output:**  
 ```text
 حقوق شهروندی
 قانون اساسی
 تکالیف
 ایران
 ```
 ## Notes
 - Large models (Llama 3.1) require GPU with sufficient memory.  
 - The script handles repeated keywords by removing duplicates.  
 - Output is automatically saved in JSON format after processing.  
--- a/readme/2.md
+++ b/readme/2.md
@ -1,75 +0,0 @@
 # استخراج‌گر کلیدواژه جملات فارسی
 این پروژه یک اسکریپت پایتون (`p5_representer.py`) برای استخراج **کلیدواژه‌ها** از جملات و سکشن‌های متون حقوقی فارسی با استفاده از **مدل‌های مبتنی بر Transformer** است.
 ## نحوه عملکرد
 این اسکریپت از مدل **Meta-Llama-3.1-8B-Instruct** (با فشرده‌سازی و کوانتش برای کارایی بیشتر) استفاده می‌کند.  
 ابتدا متن ورودی دریافت شده، با استفاده از پرامپت‌های سیستمی و کاربری آماده‌سازی می‌شود و سپس کلمات کلیدی مرتبط از متن استخراج می‌شوند.
 ## پیش‌نیازها
 - پایتون 3.8 یا بالاتر
 - کتابخانه‌های torch، transformers، bitsandbytes
 - کلاس ElasticHelper برای بارگذاری داده‌ها
 - سایر ابزارها در فایل `requirements.txt`
 برای مشاهده نسخه دقیق کتابخانه‌ها به فایل **`requirements.txt`** مراجعه کنید.
 ## استفاده از پرامپت‌ها
 - **پرامپت سیستمی (SYS_PROMPT):** نقش دستیار را تعریف می‌کند. نمونه: "شما یک دستیار حقوقی هستید."
 - **پرامپت کاربری (USER_PROMPT):** به مدل می‌گوید حداقل تعداد مشخصی کلیدواژه استخراج کند. خروجی باید فهرستی فارسی باشد، بدون علائم اضافی.
 این ترکیب باعث پایداری و دقت در استخراج کلیدواژه می‌شود.
 ## متدهای اصلی
 ### `format_prompt(SENTENCE: str) -> str`
 متن خام فارسی را به فرمت مناسب برای مدل تبدیل می‌کند.  
 **ورودی:** یک جمله فارسی (`str`)  
 **خروجی:** متن قالب‌بندی‌شده (`str`)  
 ### `kw_count_calculator(text: str) -> int`
 تعداد کلیدواژه‌ها را بر اساس طول متن محاسبه می‌کند.  
 **ورودی:** متن (`str`)  
 **خروجی:** تعداد کلیدواژه‌ها (`int`)  
 ### `generate(formatted_prompt: str) -> str`
 متد اصلی برای ارسال پرامپت به مدل و دریافت خروجی.  
 **ورودی:** پرامپت آماده‌شده (`str`)  
 **خروجی:** متن کلیدواژه‌ها (`str`)  
 ### `single_section_get_keyword(sentence: str) -> list[str]`
 متد اصلی برای استخراج کلیدواژه‌ها از یک جمله.  
 **ورودی:** جمله (`str`)  
 **خروجی:** لیستی از کلیدواژه‌های یکتا (`list[str]`)  
 ### `get_sections() -> dict`
 بارگذاری سکشن‌ها از فایل فشرده JSON با کمک کلاس ElasticHelper.  
 **خروجی:** دیکشنری سکشن‌ها (`dict`)  
 ### `convert_to_dict(sections: list) -> dict`
 تبدیل لیست سکشن‌ها به دیکشنری با کلید ID.  
 ### `do_keyword_extract(sections: dict) -> tuple`
 حلقه اصلی پردازش سکشن‌ها، ذخیره خروجی در فایل JSON و ثبت خطاها.  
 **ورودی:** سکشن‌ها (`dict`)  
 **خروجی:** تاپل `(operation_result: bool, sections: dict)`  
 ## مثال ورودی/خروجی
 **ورودی:**  
 ```text
 "حقوق و تکالیف شهروندی در قانون اساسی ایران مورد تاکید قرار گرفته است."
 ```
 **خروجی:**  
 ```text
 حقوق شهروندی
 قانون اساسی
 تکالیف
 ایران
 ```
 ## نکات
 - مدل‌های بزرگ (Llama 3.1) به GPU با حافظه بالا نیاز دارند.  
 - کلیدواژه‌های تکراری حذف می‌شوند.  
 - نتایج پردازش به‌صورت خودکار در فایل JSON ذخیره می‌شود.  
--- a/readme/readme-keyword-extractor-en.md
+++ b/readme/readme-keyword-extractor-en.md
@ -1,34 +1,75 @@
-# Keyword Extractor
+# Persian Sentence Keyword Extractor
-This source is a script for extracting keywords from text using local LLM such as llama based on user prompts.
+This project provides a Python script (`p5_representer.py`) for extracting **keywords** from Persian sentences and legal text sections using **transformer-based models**.
 ## How it works
-The script processes input text and extracts the most relevant keywords using a large language model(llm) and system and user prompts which are embedded in the source code.
+The script uses the pre-trained **Meta-Llama-3.1-8B-Instruct** model (with quantization for efficiency).  
 It processes Persian text input, system and user prompts, and extracts the most relevant keywords.
 ## Requirements
 - Python 3.8+
- NLP libraries (transformers, torch, etc.)
+- torch, transformers, bitsandbytes
- Other utilities as listed in the requirements file
+- elasticsearch helper (custom ElasticHelper class)
 - Other utilities as listed in the `requirements.txt` file
-For exact versions of the libraries, please check the **`requirements.txt`** file.
+For exact versions of the libraries, please check **`requirements.txt`**.
-## Usage
+## Prompt Usage
-1. Clone the repository.
+- **System Prompt (SYS_PROMPT):** Defines the assistant role. Example: "You are a highly accurate and detail-oriented assistant specialized in analyzing Persian legal texts."
-2. Install dependencies:
+- **User Prompt:** Guides the model to extract a minimum number of keywords, returned as a clean Persian list without extra symbols or explanations.
-   ```bash
+
-   pip install -r requirements.txt
+This combination ensures consistent keyword extraction.
   ```
 3. Run the script:
   ```bash
   python keyword_extractor.py
   ```
 ## Main Methods
 - `load_model()`: Loads the pre-trained transformer model for text processing. This is the main method for model initialization.
 - `preprocess_text(text)`: Cleans and prepares the input text (e.g., lowercasing, removing stopwords, etc.).
 - `extract_keywords(text, top_n=10)`: The core method that applies the model and retrieves the top keywords from the input text.
 - `display_results(keywords)`: Prints or saves the extracted keywords for further use.
-## Model
+### `format_prompt(SENTENCE: str) -> str`
-The script uses a LLM such as llama3.1-8B for keyword extraction. The exact model can be changed in the code if needed.
+Formats the raw Persian sentence into a model-ready input.  
 **Input:** A single Persian sentence (`str`)  
 **Output:** A formatted string (`str`)  
 ### `kw_count_calculator(text: str) -> int`
 Calculates the number of keywords to extract based on text length.  
 **Input:** Text (`str`)  
 **Output:** Keyword count (`int`)  
 ### `generate(formatted_prompt: str) -> str`
 Core generation method that sends the prompt to the model.  
 **Input:** Formatted text prompt (`str`)  
 **Output:** Generated keywords as a string (`str`)  
 ### `single_section_get_keyword(sentence: str) -> list[str]`
 Main method for extracting keywords from a sentence.  
 **Input:** Sentence (`str`)  
 **Output:** List of unique keywords (`list[str]`)  
 ### `get_sections() -> dict`
 Loads section data from a compressed JSON source (via ElasticHelper).  
 **Output:** Dictionary of sections (`dict`)  
 ### `convert_to_dict(sections: list) -> dict`
 Converts raw section list into a dictionary with IDs as keys.  
 ### `do_keyword_extract(sections: dict) -> tuple`
 Main execution loop for processing multiple sections, saving output to JSON files, and logging errors.  
 **Input:** Sections (`dict`)  
 **Output:** Tuple `(operation_result: bool, sections: dict)`  
 ## Example Input/Output
 **Input:**  
 ```text
 "حقوق و تکالیف شهروندی در قانون اساسی ایران مورد تاکید قرار گرفته است."
 ```
 **Output:**  
 ```text
 حقوق شهروندی
 قانون اساسی
 تکالیف
 ایران
 ```
 ## Notes
 - Large models (Llama 3.1) require GPU with sufficient memory.  
 - The script handles repeated keywords by removing duplicates.  
 - Output is automatically saved in JSON format after processing.  
--- a/readme/readme-keyword-extractor-fa.md
+++ b/readme/readme-keyword-extractor-fa.md
@ -1,34 +1,75 @@
-# استخراج‌گر کلمات کلیدی
+# استخراج‌گر کلیدواژه جملات فارسی
-این سورس، یک اسکریپت برای استخراج کلمات کلیدی از متن با استفاده از مدل های زبانی بزرگی مانند لاما و بر اساس پرامپت های کاربر است.
+این پروژه یک اسکریپت پایتون (`p5_representer.py`) برای استخراج **کلیدواژه‌ها** از جملات و سکشن‌های متون حقوقی فارسی با استفاده از **مدل‌های مبتنی بر Transformer** است.
 ## نحوه عملکرد
-این اسکریپت متن ورودی را پردازش کرده و مرتبط‌ترین کلمات کلیدی را با استفاده از یک مدل زبانی بزرگ با پرامپت های سیستمی و کاربری که در سورس قابل مشاهده است، استخراج می کند
+این اسکریپت از مدل **Meta-Llama-3.1-8B-Instruct** (با فشرده‌سازی و کوانتایز مناسب به منظور کارایی بیشتر) استفاده می‌کند.  
 ابتدا متن ورودی دریافت شده، با استفاده از پرامپت‌های سیستمی و کاربری آماده‌سازی می‌شود و سپس کلمات کلیدی مرتبط از متن استخراج می‌شوند.
 ## پیش‌نیازها
 - پایتون 3.8 یا بالاتر
- کتابخانه‌های NLP (مانند transformers، torch و …)
+- کتابخانه‌های torch، transformers، bitsandbytes
- سایر ابزارهای مورد نیاز در فایل requirements.txt
+- کلاس ElasticHelper برای بارگذاری داده‌ها
 - سایر ابزارها در فایل `requirements.txt`
 برای مشاهده نسخه دقیق کتابخانه‌ها به فایل **`requirements.txt`** مراجعه کنید.
-## روش اجرا
+## استفاده از پرامپت‌ها
-1. مخزن (repository) را کلون کنید.
+- **پرامپت سیستمی (SYS_PROMPT):** نقش دستیار را تعریف می‌کند. نمونه: "شما یک دستیار حقوقی هستید."
-2. پیش‌نیازها را نصب کنید:
+- **پرامپت کاربری (USER_PROMPT):** به مدل می‌گوید حداقل تعداد مشخصی کلیدواژه استخراج کند. خروجی باید فهرستی فارسی باشد، بدون علائم اضافی.
-   ```bash
+
-   pip install -r requirements.txt
+این ترکیب باعث پایداری و دقت در استخراج کلیدواژه می‌شود.
   ```
 3. اسکریپت را اجرا کنید:
   ```bash
   python keyword_extractor.py
   ```
 ## متدهای اصلی
 - `load_model()`: بارگذاری مدل از پیش آموزش‌دیده برای پردازش متن. این متد اصلی برای آماده‌سازی مدل است.
 - `preprocess_text(text)`: پاک‌سازی و آماده‌سازی متن ورودی (مانند کوچک‌سازی حروف، حذف توقف‌واژه‌ها و ...).
 - `extract_keywords(text, top_n=10)`: متد اصلی استخراج که کلمات کلیدی را با استفاده از مدل انتخاب کرده و n کلمه برتر را برمی‌گرداند.
 - `display_results(keywords)`: نمایش یا ذخیره‌سازی کلمات کلیدی استخراج‌شده برای استفاده‌های بعدی.
-## مدل
+### `format_prompt(SENTENCE: str) -> str`
-این اسکریپت از یک مدل زبانی بزرگ مانند llama3.1-8B برای استخراج کلمات کلیدی استفاده می‌کند. در صورت نیاز می‌توانید مدل را در کد تغییر دهید.
+متن خام فارسی را به فرمت مناسب برای مدل تبدیل می‌کند.  
 **ورودی:** یک جمله فارسی (`str`)  
 **خروجی:** متن قالب‌بندی‌شده (`str`)  
 ### `kw_count_calculator(text: str) -> int`
 تعداد کلیدواژه‌ها را بر اساس طول متن محاسبه می‌کند.  
 **ورودی:** متن (`str`)  
 **خروجی:** تعداد کلیدواژه‌ها (`int`)  
 ### `generate(formatted_prompt: str) -> str`
 متد اصلی برای ارسال پرامپت به مدل و دریافت خروجی.  
 **ورودی:** پرامپت آماده‌شده (`str`)  
 **خروجی:** متن کلیدواژه‌ها (`str`)  
 ### `single_section_get_keyword(sentence: str) -> list[str]`
 متد اصلی برای استخراج کلیدواژه‌ها از یک جمله.  
 **ورودی:** جمله (`str`)  
 **خروجی:** لیستی از کلیدواژه‌های یکتا (`list[str]`)  
 ### `get_sections() -> dict`
 بارگذاری سکشن‌ها از فایل فشرده JSON با کمک کلاس ElasticHelper.  
 **خروجی:** دیکشنری سکشن‌ها (`dict`)  
 ### `convert_to_dict(sections: list) -> dict`
 تبدیل لیست سکشن‌ها به دیکشنری با کلید ID.  
 ### `do_keyword_extract(sections: dict) -> tuple`
 حلقه اصلی پردازش سکشن‌ها، ذخیره خروجی در فایل JSON و ثبت خطاها.  
 **ورودی:** سکشن‌ها (`dict`)  
 **خروجی:** تاپل `(operation_result: bool, sections: dict)`  
 ## مثال ورودی/خروجی
 **ورودی:**  
 ```text
 "حقوق و تکالیف شهروندی در قانون اساسی ایران مورد تاکید قرار گرفته است."
 ```
 **خروجی:**  
 ```text
 حقوق شهروندی
 قانون اساسی
 تکالیف
 ایران
 ```
 ## نکات
 - مدل‌های بزرگ (Llama 3.1) به GPU با حافظه بالا نیاز دارند.  
 - کلیدواژه‌های تکراری حذف می‌شوند.  
 - نتایج پردازش به‌صورت خودکار در فایل JSON ذخیره می‌شود.