data_processes/readme/readme-classifier.md
2025-08-16 14:24:11 +03:30

# Section Classification Script
This project provides a Python script (`classification.py`) for classifying text sections using a fine-tuned transformer model. The script is designed to suggest the most relevant classes for each section of text, which is useful for legal documents, content categorization, and similar NLP tasks.
## Requirements
Before using this script, please install the required libraries:
```bash
pip install transformers pandas
```
You also need a fine-tuned classification model and its tokenizer. Update the `model_checkpoint` path in the script to point to your model.
## How It Works
- The script loads a fine-tuned transformer model for text classification.
- It processes each section of text, splitting texts that exceed the model's maximum input length into windows that fit.
- For each section, it predicts the top classes and saves the results.
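
The windowing step above can be sketched as follows. This is a minimal illustration, not the code from `classification.py`: the function name `split_into_windows`, the `window_size` and `stride` parameters, and the use of overlapping windows are all assumptions made for the example.

```python
def split_into_windows(tokens, window_size=512, stride=256):
    """Split a token list into overlapping windows (hypothetical helper).

    Each window holds at most `window_size` tokens, and consecutive
    windows start `stride` tokens apart.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if stride > window_size:
        raise ValueError("stride larger than window_size would skip tokens")
    if len(tokens) <= window_size:
        return [tokens]  # short texts need no splitting

    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # the last window reached the end of the text
        start += stride
    return windows


tokens = "a b c d e f g h i j".split()
for window in split_into_windows(tokens, window_size=4, stride=2):
    print(window)
```

Overlapping windows reduce the chance that a phrase relevant to the classification is cut in half at a window boundary, at the cost of classifying some tokens twice.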
## Main Functions
- `get_class(sentences, top_k=4)`: Classifies a sentence or text and returns the `top_k` highest-scoring classes.
- `mean_classes(input_classes)`: Aggregates class results from multiple windows of a long text.
- `get_window_classes(text)`: Handles splitting long texts into windows and aggregates their classification results.
- `single_section_classification(id, section_source)`: Classifies a single section and returns the best and other suggested classes.
- `do_classify(sections)`: Classifies all sections in a dictionary and saves the results to a JSON file.
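
One plausible way to aggregate per-window results, in the spirit of `mean_classes`, is to average each label's score over all windows. The sketch below is an assumption about the approach, not the script's actual implementation: the name `average_window_scores` is hypothetical, and a label missing from a window is treated as scoring zero there.

```python
from collections import defaultdict


def average_window_scores(window_results, top_k=4):
    """Average label scores across windows and keep the top_k labels.

    `window_results` is a list with one entry per window; each entry is a
    list of {"label": ..., "score": ...} dicts (assumed input shape).
    """
    if not window_results:
        return []

    totals = defaultdict(float)
    for result in window_results:
        for item in result:
            totals[item["label"]] += item["score"]

    # Dividing by the number of windows means a label that is absent
    # from a window contributes a score of zero for that window.
    n = len(window_results)
    averaged = [
        {"label": label, "score": total / n}
        for label, total in totals.items()
    ]
    averaged.sort(key=lambda item: item["score"], reverse=True)
    return averaged[:top_k]


window_results = [
    [{"label": "ClassA", "score": 0.8}, {"label": "ClassB", "score": 0.2}],
    [{"label": "ClassA", "score": 0.6}, {"label": "ClassC", "score": 0.4}],
]
print(average_window_scores(window_results, top_k=2))
```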
## Usage Example
Suppose you have your sections data as a dictionary:
```python
sections = {
    "1": {"content": "First section text", "other_info": {"full_path": "..."}, "qanon_title": "..."},
    "2": {"content": "Second section text", "other_info": {"full_path": "..."}, "qanon_title": "..."}
}
```
```
You can classify all sections as follows:
```python
from classification import do_classify
result = do_classify(sections)
```
After running, the results will be saved in a JSON file in the `./data/classification/` directory.
## Output Structure
Each section will have a new field `ai_codes` with the classification results:
```json
{
  "1": {
    "content": "First section text",
    "ai_codes": {
      "best-class": {"label": "ClassA", "score": 0.85},
      "other-classes": [
        {"label": "ClassB", "score": 0.10},
        {"label": "ClassC", "score": 0.05}
      ]
    }
  }
}
```
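
As an illustration of consuming this structure, the sketch below collects the best label for each section. The helper `best_labels` and its `min_score` threshold are hypothetical and not part of the script; the sample data mirrors the structure shown above.

```python
import json


def best_labels(classified_sections, min_score=0.0):
    """Map each section id to its best label (hypothetical helper).

    Sections without `ai_codes`, or whose best score is below
    `min_score`, are left out of the result.
    """
    labels = {}
    for section_id, section in classified_sections.items():
        best = section.get("ai_codes", {}).get("best-class")
        if best and best["score"] >= min_score:
            labels[section_id] = best["label"]
    return labels


# Sample data in the output format described above.
raw = """
{
  "1": {
    "content": "First section text",
    "ai_codes": {
      "best-class": {"label": "ClassA", "score": 0.85},
      "other-classes": [{"label": "ClassB", "score": 0.10}]
    }
  },
  "2": {"content": "Section that has not been classified"}
}
"""
classified = json.loads(raw)
print(best_labels(classified, min_score=0.5))
```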
## Notes
- Make sure the model path in `model_checkpoint` is correct and the model files are available.
- Language support depends on your model: the script works with Persian and other languages as long as the fine-tuned model and its tokenizer were trained on them.
- The output JSON file will be saved in `./data/classification/`.
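
A quick sanity check on the model path can catch a wrong `model_checkpoint` before the model is loaded. The sketch below assumes the checkpoint is a local directory saved in the usual Hugging Face layout, which includes a `config.json` file; the helper name `checkpoint_looks_valid` is hypothetical and not part of the script.

```python
from pathlib import Path


def checkpoint_looks_valid(model_checkpoint):
    """Return True if the path is a directory containing config.json.

    This only checks that the directory looks like a saved checkpoint;
    it does not verify the weights or the tokenizer files.
    """
    path = Path(model_checkpoint)
    return path.is_dir() and (path / "config.json").is_file()


# The path below is a placeholder; replace it with your own checkpoint.
if not checkpoint_looks_valid("./path/to/your/model"):
    print("Model checkpoint not found; check the model_checkpoint path.")
```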