
# Section Classification Script

This project provides a Python script (`classification.py`) for classifying text sections using a fine-tuned transformer model. The script is designed to suggest the most relevant classes for each section of text, which is useful for legal documents, content categorization, and similar NLP tasks.

## Requirements

Before using this script, please install the required libraries:

```bash
pip install transformers pandas
```

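You also need a fine-tuned classification model and its tokenizer. Set `model_checkpoint` in the script to the path of your model. Note that `transformers` needs a deep-learning backend (for example PyTorch) to run the model, so install one if it is not already present.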

## How It Works

- The script loads a fine-tuned transformer model for text classification.
- It processes each section of text, splitting texts that exceed the model's maximum input size into windows.
- For each section, it predicts the top classes and saves the results.
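
The windowing step can be illustrated with a minimal sketch. It assumes word-based windows with a fixed overlap; the helper name `split_into_windows` and the `window_size` and `overlap` values are hypothetical and may differ from what `classification.py` actually does (for example, the script may count tokens rather than words).

```python
def split_into_windows(text, window_size=200, overlap=50):
    """Split text into overlapping word windows (hypothetical helper)."""
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")
    words = text.split()
    # Short texts fit in a single window and are returned unchanged.
    if len(words) <= window_size:
        return [text] if words else []
    step = window_size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + window_size]))
        # Stop once the current window reaches the end of the text.
        if start + window_size >= len(words):
            break
    return windows
```

Overlapping windows reduce the chance that a phrase relevant to a class is cut in half at a window boundary.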

## Main Functions

- `get_class(sentences, top_k=4)`: Classifies a sentence or text and returns the `top_k` highest-scoring classes.
- `mean_classes(input_classes)`: Aggregates class results from multiple windows of a long text.
- `get_window_classes(text)`: Handles splitting long texts into windows and aggregates their classification results.
- `single_section_classification(id, section_source)`: Classifies a single section and returns the best and other suggested classes.
- `do_classify(sections)`: Classifies all sections in a dictionary and saves the results to a JSON file.
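
As an illustration of how per-window results might be aggregated, here is a minimal sketch. It assumes each window yields a list of `{"label", "score"}` dictionaries and that scores are averaged over all windows, with a label missing from a window counted as 0. The name `mean_classes_sketch` is hypothetical, and the real `mean_classes` may use a different rule.

```python
from collections import defaultdict


def mean_classes_sketch(window_results):
    """Average label scores across windows (assumed aggregation rule)."""
    if not window_results:
        return []
    totals = defaultdict(float)
    for window in window_results:
        for item in window:
            totals[item["label"]] += item["score"]
    n_windows = len(window_results)
    averaged = [
        {"label": label, "score": total / n_windows}
        for label, total in totals.items()
    ]
    # Highest average score first, so element 0 is the best class.
    averaged.sort(key=lambda item: item["score"], reverse=True)
    return averaged


windows = [
    [{"label": "ClassA", "score": 0.8}, {"label": "ClassB", "score": 0.2}],
    [{"label": "ClassA", "score": 0.6}, {"label": "ClassC", "score": 0.4}],
]
ranked = mean_classes_sketch(windows)
best_class, other_classes = ranked[0], ranked[1:]
```

Under this rule, a label must score well in several windows to rank highly, which dampens the effect of a single outlier window.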

## Usage Example

Suppose you have your sections data as a dictionary:

```python
sections = {
    "1": {"content": "First section text", "other_info": {"full_path": "..."}, "qanon_title": "..."},
    "2": {"content": "Second section text", "other_info": {"full_path": "..."}, "qanon_title": "..."}
}
```

You can classify all sections as follows:

```python
from classification import do_classify

result = do_classify(sections)
```

After running, the results will be saved in a JSON file in the `./data/classification/` directory.
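
Because the name of the output file is not documented here, one way to read the results back is to load the most recently modified JSON file in that directory. The sketch below does this; `load_latest_results` is a hypothetical helper, not part of `classification.py`, and it assumes the file is UTF-8 encoded.

```python
import json
from pathlib import Path


def load_latest_results(directory="./data/classification/"):
    """Load the most recently modified JSON file in `directory` (hypothetical helper)."""
    json_files = sorted(
        Path(directory).glob("*.json"),
        key=lambda path: path.stat().st_mtime,
    )
    if not json_files:
        raise FileNotFoundError(f"No JSON files found in {directory}")
    # The last entry has the newest modification time.
    with json_files[-1].open(encoding="utf-8") as f:
        return json.load(f)
```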

## Output Structure

Each section will have a new field `ai_codes` with the classification results:

```json
{
  "1": {
    "content": "First section text",
    "ai_codes": {
      "best-class": {"label": "ClassA", "score": 0.85},
      "other-classes": [
        {"label": "ClassB", "score": 0.10},
        {"label": "ClassC", "score": 0.05}
      ]
    }
  }
}
```
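
To show how this structure can be consumed, the following sketch collects the best label for every section, optionally skipping low-confidence predictions. The helper `best_labels` and its `min_score` parameter are hypothetical and only illustrate reading the `ai_codes` field.

```python
def best_labels(results, min_score=0.0):
    """Map each section id to its best label if the score reaches min_score."""
    summary = {}
    for section_id, section in results.items():
        best = section.get("ai_codes", {}).get("best-class")
        # Sections without a prediction, or below the threshold, are skipped.
        if best and best["score"] >= min_score:
            summary[section_id] = best["label"]
    return summary


results = {
    "1": {
        "content": "First section text",
        "ai_codes": {
            "best-class": {"label": "ClassA", "score": 0.85},
            "other-classes": [
                {"label": "ClassB", "score": 0.10},
                {"label": "ClassC", "score": 0.05},
            ],
        },
    }
}
print(best_labels(results))  # {'1': 'ClassA'}
```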

## Notes

- Make sure the model path in `model_checkpoint` is correct and the model files are available.
- Language support depends on the model: the script works with Persian or any other language your fine-tuned model was trained on.
- The output JSON file will be saved in `./data/classification/`.