edit kw readme

2025-08-16 16:54:29 +03:30 · 2025-08-16 16:54:29 +03:30 · 6318c7ce1e
commit 6318c7ce1e
parent ab99c73b03
4 changed files with 125 additions and 193 deletions
--- a/readme/1.md
+++ b/readme/1.md
@@ -1,75 +0,0 @@
-# Persian Sentence Keyword Extractor
-
-This project provides a Python script (`p5_representer.py`) for extracting **keywords** from Persian sentences and legal text sections using **transformer-based models**.
-
-## How it works
-The script uses the pre-trained **Meta-Llama-3.1-8B-Instruct** model (with quantization for efficiency).  
-It processes Persian text input, generates system and user prompts, and extracts the most relevant keywords.
-
-## Requirements
- Python 3.8+
- torch, transformers, bitsandbytes
- elasticsearch helper (custom ElasticHelper class)
- Other utilities as listed in the `requirements.txt` file
-
-For exact versions of the libraries, please check **`requirements.txt`**.
-
-## Prompt Usage
- **System Prompt (SYS_PROMPT):** Defines the assistant role. Example: "You are a highly accurate and detail-oriented assistant specialized in analyzing Persian legal texts."
- **User Prompt:** Instructs the model to extract at least a given number of keywords, returned as a plain Persian list without extra symbols or explanations.
-
-This combination ensures consistent keyword extraction.
-
-## Main Methods
-
-### `format_prompt(SENTENCE: str) -> str`
-Formats the raw Persian sentence into a model-ready input.  
-**Input:** A single Persian sentence (`str`)  
-**Output:** A formatted string (`str`)  
-
-### `kw_count_calculator(text: str) -> int`
-Calculates the number of keywords to extract based on text length.  
-**Input:** Text (`str`)  
-**Output:** Keyword count (`int`)  
-
-### `generate(formatted_prompt: str) -> str`
-Core generation method that sends the prompt to the model.  
-**Input:** Formatted text prompt (`str`)  
-**Output:** Generated keywords as a string (`str`)  
-
-### `single_section_get_keyword(sentence: str) -> list[str]`
-Main method for extracting keywords from a sentence.  
-**Input:** Sentence (`str`)  
-**Output:** List of unique keywords (`list[str]`)  
-
-### `get_sections() -> dict`
-Loads section data from a compressed JSON source (via ElasticHelper).  
-**Output:** Dictionary of sections (`dict`)  
-
-### `convert_to_dict(sections: list) -> dict`
-Converts raw section list into a dictionary with IDs as keys.  
-
-### `do_keyword_extract(sections: dict) -> tuple`
-Main execution loop for processing multiple sections, saving output to JSON files, and logging errors.  
-**Input:** Sections (`dict`)  
-**Output:** Tuple `(operation_result: bool, sections: dict)`  
-
-## Example Input/Output
-
-**Input:**  
-```text
-"حقوق و تکالیف شهروندی در قانون اساسی ایران مورد تاکید قرار گرفته است."
-```
-
-**Output:**  
-```text
-حقوق شهروندی
-قانون اساسی
-تکالیف
-ایران
-```
-
-## Notes
- Large models (Llama 3.1) require a GPU with sufficient memory.  
- The script handles repeated keywords by removing duplicates.  
- Output is automatically saved in JSON format after processing.  
--- a/readme/2.md
+++ b/readme/2.md
@@ -1,75 +0,0 @@
-# استخراج‌گر کلیدواژه جملات فارسی
-
-این پروژه یک اسکریپت پایتون (`p5_representer.py`) برای استخراج **کلیدواژه‌ها** از جملات و سکشن‌های متون حقوقی فارسی با استفاده از **مدل‌های مبتنی بر Transformer** است.
-
-## نحوه عملکرد
-این اسکریپت از مدل **Meta-Llama-3.1-8B-Instruct** (با فشرده‌سازی و کوانتش برای کارایی بیشتر) استفاده می‌کند.  
-ابتدا متن ورودی دریافت شده، با استفاده از پرامپت‌های سیستمی و کاربری آماده‌سازی می‌شود و سپس کلمات کلیدی مرتبط از متن استخراج می‌شوند.
-
-## پیش‌نیازها
- پایتون 3.8 یا بالاتر
- کتابخانه‌های torch، transformers، bitsandbytes
- کلاس ElasticHelper برای بارگذاری داده‌ها
- سایر ابزارها در فایل `requirements.txt`
-
-برای مشاهده نسخه دقیق کتابخانه‌ها به فایل **`requirements.txt`** مراجعه کنید.
-
-## استفاده از پرامپت‌ها
- **پرامپت سیستمی (SYS_PROMPT):** نقش دستیار را تعریف می‌کند. نمونه: "شما یک دستیار حقوقی هستید."
- **پرامپت کاربری (USER_PROMPT):** به مدل می‌گوید حداقل تعداد مشخصی کلیدواژه استخراج کند. خروجی باید فهرستی فارسی باشد، بدون علائم اضافی.
-
-این ترکیب باعث پایداری و دقت در استخراج کلیدواژه می‌شود.
-
-## متدهای اصلی
-
-### `format_prompt(SENTENCE: str) -> str`
-متن خام فارسی را به فرمت مناسب برای مدل تبدیل می‌کند.  
-**ورودی:** یک جمله فارسی (`str`)  
-**خروجی:** متن قالب‌بندی‌شده (`str`)  
-
-### `kw_count_calculator(text: str) -> int`
-تعداد کلیدواژه‌ها را بر اساس طول متن محاسبه می‌کند.  
-**ورودی:** متن (`str`)  
-**خروجی:** تعداد کلیدواژه‌ها (`int`)  
-
-### `generate(formatted_prompt: str) -> str`
-متد اصلی برای ارسال پرامپت به مدل و دریافت خروجی.  
-**ورودی:** پرامپت آماده‌شده (`str`)  
-**خروجی:** متن کلیدواژه‌ها (`str`)  
-
-### `single_section_get_keyword(sentence: str) -> list[str]`
-متد اصلی برای استخراج کلیدواژه‌ها از یک جمله.  
-**ورودی:** جمله (`str`)  
-**خروجی:** لیستی از کلیدواژه‌های یکتا (`list[str]`)  
-
-### `get_sections() -> dict`
-بارگذاری سکشن‌ها از فایل فشرده JSON با کمک کلاس ElasticHelper.  
-**خروجی:** دیکشنری سکشن‌ها (`dict`)  
-
-### `convert_to_dict(sections: list) -> dict`
-تبدیل لیست سکشن‌ها به دیکشنری با کلید ID.  
-
-### `do_keyword_extract(sections: dict) -> tuple`
-حلقه اصلی پردازش سکشن‌ها، ذخیره خروجی در فایل JSON و ثبت خطاها.  
-**ورودی:** سکشن‌ها (`dict`)  
-**خروجی:** تاپل `(operation_result: bool, sections: dict)`  
-
-## مثال ورودی/خروجی
-
-**ورودی:**  
-```text
-"حقوق و تکالیف شهروندی در قانون اساسی ایران مورد تاکید قرار گرفته است."
-```
-
-**خروجی:**  
-```text
-حقوق شهروندی
-قانون اساسی
-تکالیف
-ایران
-```
-
-## نکات
- مدل‌های بزرگ (Llama 3.1) به GPU با حافظه بالا نیاز دارند.  
- کلیدواژه‌های تکراری حذف می‌شوند.  
- نتایج پردازش به‌صورت خودکار در فایل JSON ذخیره می‌شود.  
--- a/readme/readme-keyword-extractor-en.md
+++ b/readme/readme-keyword-extractor-en.md
@@ -1,34 +1,75 @@
-# Keyword Extractor
+# Persian Sentence Keyword Extractor

-This source is a script for extracting keywords from text using local LLM such as llama based on user prompts.
+This project provides a Python script (`p5_representer.py`) for extracting **keywords** from Persian sentences and legal text sections using **transformer-based models**.

 ## How it works
-The script processes input text and extracts the most relevant keywords using a large language model(llm) and system and user prompts which are embedded in the source code.
+The script uses the pre-trained **Meta-Llama-3.1-8B-Instruct** model (with quantization for efficiency).  
+It processes Persian text input together with the system and user prompts, and extracts the most relevant keywords.

 ## Requirements
 - Python 3.8+
- NLP libraries (transformers, torch, etc.)
- Other utilities as listed in the requirements file
+- torch, transformers, bitsandbytes
+- elasticsearch helper (custom ElasticHelper class)
+- Other utilities as listed in the `requirements.txt` file

-For exact versions of the libraries, please check the **`requirements.txt`** file.
+For exact versions of the libraries, please check **`requirements.txt`**.

-## Usage
-1. Clone the repository.
-2. Install dependencies:
-   ```bash
-   pip install -r requirements.txt
-   ```
-3. Run the script:
-   ```bash
-   python keyword_extractor.py
-   ```
+## Prompt Usage
+- **System Prompt (SYS_PROMPT):** Defines the assistant role. Example: "You are a highly accurate and detail-oriented assistant specialized in analyzing Persian legal texts."
+- **User Prompt:** Instructs the model to extract at least a given number of keywords, returned as a plain Persian list without extra symbols or explanations.
+
+This combination is intended to keep keyword extraction consistent across inputs.

 ## Main Methods
- `load_model()`: Loads the pre-trained transformer model for text processing. This is the main method for model initialization.
- `preprocess_text(text)`: Cleans and prepares the input text (e.g., lowercasing, removing stopwords, etc.).
- `extract_keywords(text, top_n=10)`: The core method that applies the model and retrieves the top keywords from the input text.
- `display_results(keywords)`: Prints or saves the extracted keywords for further use.

-## Model
-The script uses a LLM such as llama3.1-8B for keyword extraction. The exact model can be changed in the code if needed.
+### `format_prompt(SENTENCE: str) -> str`
+Formats the raw Persian sentence into a model-ready input.  
+**Input:** A single Persian sentence (`str`)  
+**Output:** A formatted string (`str`)  

+### `kw_count_calculator(text: str) -> int`
+Calculates the number of keywords to extract based on text length.  
+**Input:** Text (`str`)  
+**Output:** Keyword count (`int`)  
+
+### `generate(formatted_prompt: str) -> str`
+Core generation method that sends the prompt to the model.  
+**Input:** Formatted text prompt (`str`)  
+**Output:** Generated keywords as a string (`str`)  
+
+### `single_section_get_keyword(sentence: str) -> list[str]`
+Main method for extracting keywords from a sentence.  
+**Input:** Sentence (`str`)  
+**Output:** List of unique keywords (`list[str]`)  
+
+### `get_sections() -> dict`
+Loads section data from a compressed JSON source (via ElasticHelper).  
+**Output:** Dictionary of sections (`dict`)  
+
+### `convert_to_dict(sections: list) -> dict`
+Converts raw section list into a dictionary with IDs as keys.  
+
+### `do_keyword_extract(sections: dict) -> tuple`
+Main execution loop for processing multiple sections, saving output to JSON files, and logging errors.  
+**Input:** Sections (`dict`)  
+**Output:** Tuple `(operation_result: bool, sections: dict)`  
+
+## Example Input/Output
+
+**Input:**  
+```text
+"حقوق و تکالیف شهروندی در قانون اساسی ایران مورد تاکید قرار گرفته است."
+```
+
+**Output:**  
+```text
+حقوق شهروندی
+قانون اساسی
+تکالیف
+ایران
+```
+
+## Notes
+- Large models (Llama 3.1) require a GPU with sufficient memory.  
+- The script handles repeated keywords by removing duplicates.  
+- Output is automatically saved in JSON format after processing.  
--- a/readme/readme-keyword-extractor-fa.md
+++ b/readme/readme-keyword-extractor-fa.md
@@ -1,34 +1,75 @@
-# استخراج‌گر کلمات کلیدی
+# استخراج‌گر کلیدواژه جملات فارسی

-این سورس، یک اسکریپت برای استخراج کلمات کلیدی از متن با استفاده از مدل های زبانی بزرگی مانند لاما و بر اساس پرامپت های کاربر است.
+این پروژه یک اسکریپت پایتون (`p5_representer.py`) برای استخراج **کلیدواژه‌ها** از جملات و سکشن‌های متون حقوقی فارسی با استفاده از **مدل‌های مبتنی بر Transformer** است.

 ## نحوه عملکرد
-این اسکریپت متن ورودی را پردازش کرده و مرتبط‌ترین کلمات کلیدی را با استفاده از یک مدل زبانی بزرگ با پرامپت های سیستمی و کاربری که در سورس قابل مشاهده است، استخراج می کند
+این اسکریپت از مدل **Meta-Llama-3.1-8B-Instruct** (با فشرده‌سازی و کوانتایز مناسب به منظور کارایی بیشتر) استفاده می‌کند.  
+ابتدا متن ورودی دریافت شده، با استفاده از پرامپت‌های سیستمی و کاربری آماده‌سازی می‌شود و سپس کلمات کلیدی مرتبط از متن استخراج می‌شوند.

 ## پیش‌نیازها
 - پایتون 3.8 یا بالاتر
- کتابخانه‌های NLP (مانند transformers، torch و …)
- سایر ابزارهای مورد نیاز در فایل requirements.txt
+- کتابخانه‌های torch، transformers، bitsandbytes
+- کلاس ElasticHelper برای بارگذاری داده‌ها
+- سایر ابزارها در فایل `requirements.txt`

 برای مشاهده نسخه دقیق کتابخانه‌ها به فایل **`requirements.txt`** مراجعه کنید.

-## روش اجرا
-1. مخزن (repository) را کلون کنید.
-2. پیش‌نیازها را نصب کنید:
-   ```bash
-   pip install -r requirements.txt
-   ```
-3. اسکریپت را اجرا کنید:
-   ```bash
-   python keyword_extractor.py
-   ```
+## استفاده از پرامپت‌ها
+- **پرامپت سیستمی (SYS_PROMPT):** نقش دستیار را تعریف می‌کند. نمونه: "شما یک دستیار حقوقی هستید."
+- **پرامپت کاربری (USER_PROMPT):** به مدل می‌گوید حداقل تعداد مشخصی کلیدواژه استخراج کند. خروجی باید فهرستی فارسی باشد، بدون علائم اضافی.
+
+این ترکیب باعث پایداری و دقت در استخراج کلیدواژه می‌شود.

 ## متدهای اصلی
- `load_model()`: بارگذاری مدل از پیش آموزش‌دیده برای پردازش متن. این متد اصلی برای آماده‌سازی مدل است.
- `preprocess_text(text)`: پاک‌سازی و آماده‌سازی متن ورودی (مانند کوچک‌سازی حروف، حذف توقف‌واژه‌ها و ...).
- `extract_keywords(text, top_n=10)`: متد اصلی استخراج که کلمات کلیدی را با استفاده از مدل انتخاب کرده و n کلمه برتر را برمی‌گرداند.
- `display_results(keywords)`: نمایش یا ذخیره‌سازی کلمات کلیدی استخراج‌شده برای استفاده‌های بعدی.

-## مدل
-این اسکریپت از یک مدل زبانی بزرگ مانند llama3.1-8B برای استخراج کلمات کلیدی استفاده می‌کند. در صورت نیاز می‌توانید مدل را در کد تغییر دهید.
+### `format_prompt(SENTENCE: str) -> str`
+متن خام فارسی را به فرمت مناسب برای مدل تبدیل می‌کند.  
+**ورودی:** یک جمله فارسی (`str`)  
+**خروجی:** متن قالب‌بندی‌شده (`str`)  

+### `kw_count_calculator(text: str) -> int`
+تعداد کلیدواژه‌ها را بر اساس طول متن محاسبه می‌کند.  
+**ورودی:** متن (`str`)  
+**خروجی:** تعداد کلیدواژه‌ها (`int`)  
+
+### `generate(formatted_prompt: str) -> str`
+متد اصلی برای ارسال پرامپت به مدل و دریافت خروجی.  
+**ورودی:** پرامپت آماده‌شده (`str`)  
+**خروجی:** متن کلیدواژه‌ها (`str`)  
+
+### `single_section_get_keyword(sentence: str) -> list[str]`
+متد اصلی برای استخراج کلیدواژه‌ها از یک جمله.  
+**ورودی:** جمله (`str`)  
+**خروجی:** لیستی از کلیدواژه‌های یکتا (`list[str]`)  
+
+### `get_sections() -> dict`
+بارگذاری سکشن‌ها از فایل فشرده JSON با کمک کلاس ElasticHelper.  
+**خروجی:** دیکشنری سکشن‌ها (`dict`)  
+
+### `convert_to_dict(sections: list) -> dict`
+تبدیل لیست سکشن‌ها به دیکشنری با کلید ID.  
+
+### `do_keyword_extract(sections: dict) -> tuple`
+حلقه اصلی پردازش سکشن‌ها، ذخیره خروجی در فایل JSON و ثبت خطاها.  
+**ورودی:** سکشن‌ها (`dict`)  
+**خروجی:** تاپل `(operation_result: bool, sections: dict)`  
+
+## مثال ورودی/خروجی
+
+**ورودی:**  
+```text
+"حقوق و تکالیف شهروندی در قانون اساسی ایران مورد تاکید قرار گرفته است."
+```
+
+**خروجی:**  
+```text
+حقوق شهروندی
+قانون اساسی
+تکالیف
+ایران
+```
+
+## نکات
+- مدل‌های بزرگ (Llama 3.1) به GPU با حافظه بالا نیاز دارند.  
+- کلیدواژه‌های تکراری حذف می‌شوند.  
+- نتایج پردازش به‌صورت خودکار در فایل JSON ذخیره می‌شود.
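
Reviewer note: the new README says that duplicate keywords are removed and that a section list is converted into a dictionary keyed by ID, but it does not show how. The sketch below is one possible reading of those two steps, not the code in `p5_representer.py`. The helper names `parse_keywords` and `index_sections_by_id`, the newline-separated output format, and the `"id"` field name are all assumptions made for illustration.

```python
from typing import Dict, List


def parse_keywords(generated: str) -> List[str]:
    """Split newline-separated model output into unique keywords.

    Assumes one keyword per line; first-seen order is preserved.
    """
    seen = set()
    keywords = []
    for line in generated.splitlines():
        keyword = line.strip()
        # Skip blank lines and anything already collected.
        if keyword and keyword not in seen:
            seen.add(keyword)
            keywords.append(keyword)
    return keywords


def index_sections_by_id(sections: List[dict]) -> Dict[str, dict]:
    """Index section records by ID (the "id" field name is hypothetical)."""
    return {str(section["id"]): section for section in sections}


# Usage with the README's example output plus one repeated line.
raw_output = "حقوق شهروندی\nقانون اساسی\nتکالیف\nقانون اساسی\nایران\n"
keywords = parse_keywords(raw_output)
assert keywords == ["حقوق شهروندی", "قانون اساسی", "تکالیف", "ایران"]
```

Keeping first-seen order, rather than collapsing through a plain `set`, means the saved JSON stays stable between runs, which makes the output files easier to diff.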