# Section Classification Script

This project provides a Python script (`classification.py`) for classifying text sections using a fine-tuned transformer model. The script is designed to suggest the most relevant classes for each section of text, which is useful for legal documents, content categorization, and similar NLP tasks.
## Requirements

Before using this script, please install the required libraries:

```bash
pip install transformers pandas
```
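You also need a fine-tuned classification model and its tokenizer. Update the `model_checkpoint` path in the script to point to your model. Note that `transformers` needs a deep-learning backend such as PyTorch to run a model, so install one if it is not already present.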
## How It Works

- The script loads a fine-tuned transformer model for text classification.
- It processes each section of text, splitting texts that exceed the model's input size into windows.
- For each section, it predicts the top classes and saves the results.
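The windowing step can be sketched as follows. This is a minimal illustration, not the code in `classification.py`: the function name `split_into_windows`, the `window_size` and `stride` parameters, and the whitespace split that stands in for the model tokenizer are all assumptions made for the example.

```python
from typing import List


def split_into_windows(tokens: List[str], window_size: int, stride: int) -> List[List[str]]:
    """Split a token list into windows of at most `window_size` tokens.

    Hypothetical helper: consecutive windows start `stride` tokens apart,
    so they overlap whenever `stride` is smaller than `window_size`.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if len(tokens) <= window_size:
        return [tokens]  # short texts fit in a single window
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # this window already reaches the end of the text
        start += stride
    return windows


# Whitespace tokens stand in for real tokenizer output (an assumption).
words = "a b c d e f g h i j".split()
windows = split_into_windows(words, window_size=4, stride=3)
```

Keeping `stride` at or below `window_size` ensures that no token is skipped between windows. The overlap gives the model some shared context at each window boundary.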
## Main Functions

- `get_class(sentences, top_k=4)`: Classifies a sentence or text and returns the `top_k` highest-scoring classes.
- `mean_classes(input_classes)`: Aggregates class results from multiple windows of a long text.
- `get_window_classes(text)`: Splits a long text into windows and aggregates their classification results.
- `single_section_classification(id, section_source)`: Classifies a single section and returns the best class and the other suggested classes.
- `do_classify(sections)`: Classifies all sections in a dictionary and saves the results to a JSON file.
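To make the aggregation idea concrete, here is one plausible way to combine per-window predictions and split them into a best class and the remaining suggestions. The actual `mean_classes` and `single_section_classification` may work differently. The helper names `average_window_scores` and `split_best_and_others` are hypothetical, and treating a label that is absent from a window as a score of 0 is an assumption.

```python
from collections import defaultdict
from typing import Dict, List


def average_window_scores(window_predictions: List[List[dict]], top_k: int = 4) -> List[dict]:
    """Average each label's score over all windows and return the top_k labels.

    Assumption: a label missing from a window contributes 0 for that window.
    """
    if not window_predictions:
        return []
    totals: Dict[str, float] = defaultdict(float)
    for predictions in window_predictions:
        for item in predictions:
            totals[item["label"]] += item["score"]
    count = len(window_predictions)
    averaged = [{"label": label, "score": total / count} for label, total in totals.items()]
    averaged.sort(key=lambda item: item["score"], reverse=True)
    return averaged[:top_k]


def split_best_and_others(ranked: List[dict]) -> dict:
    """Shape ranked predictions like the `ai_codes` field (hypothetical helper)."""
    if not ranked:
        return {"best-class": None, "other-classes": []}
    return {"best-class": ranked[0], "other-classes": ranked[1:]}


# Two windows of one long text, each with its own predictions.
window_predictions = [
    [{"label": "ClassA", "score": 0.75}, {"label": "ClassB", "score": 0.25}],
    [{"label": "ClassA", "score": 0.5}, {"label": "ClassC", "score": 0.5}],
]
ranked = average_window_scores(window_predictions)
ai_codes = split_best_and_others(ranked)
```

Dividing by the number of windows, rather than by the number of windows in which a label appears, penalizes labels that show up in only one part of a long text. Whether that is desirable depends on the data.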
## Usage Example

Suppose you have your sections data as a dictionary:

```python
sections = {
    "1": {"content": "First section text", "other_info": {"full_path": "..."}, "qanon_title": "..."},
    "2": {"content": "Second section text", "other_info": {"full_path": "..."}, "qanon_title": "..."}
}
```

You can classify all sections as follows:

```python
from classification import do_classify

result = do_classify(sections)
```

After running, the results will be saved in a JSON file in the `./data/classification/` directory.
## Output Structure

Each section gains a new field, `ai_codes`, that holds the classification results:

```json
{
  "1": {
    "content": "First section text",
    "ai_codes": {
      "best-class": {"label": "ClassA", "score": 0.85},
      "other-classes": [
        {"label": "ClassB", "score": 0.10},
        {"label": "ClassC", "score": 0.05}
      ]
    }
  }
}
```
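Once results have this shape, a common follow-up is to flag sections whose best class has a low score so they can be reviewed by hand. The sketch below assumes the `ai_codes` layout shown above. The function name `low_confidence_sections` and the `threshold` default are hypothetical and are not part of `classification.py`.

```python
from typing import Dict, List


def low_confidence_sections(results: Dict[str, dict], threshold: float = 0.5) -> List[str]:
    """Return IDs of sections with no best class or a best-class score below threshold."""
    flagged = []
    for section_id, section in results.items():
        best = section.get("ai_codes", {}).get("best-class")
        if best is None or best["score"] < threshold:
            flagged.append(section_id)
    return flagged


# Illustrative data only; the scores are made up for the example.
results = {
    "1": {"content": "First section text",
          "ai_codes": {"best-class": {"label": "ClassA", "score": 0.85}, "other-classes": []}},
    "2": {"content": "Second section text",
          "ai_codes": {"best-class": {"label": "ClassB", "score": 0.4}, "other-classes": []}},
    "3": {"content": "Unclassified section text"},
}
to_review = low_confidence_sections(results)
```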
## Notes

- Make sure the model path in `model_checkpoint` is correct and the model files are available.
- The script supports Persian and other languages, depending on your model.
- The output JSON file will be saved in `./data/classification/`.
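A quick pre-flight check on the model path can catch a wrong `model_checkpoint` before the script runs. The sketch below assumes the checkpoint is a local directory written by `save_pretrained` with unsharded PyTorch weights. A Hub model ID, sharded weights, or TensorFlow weights would need different checks. The function name `missing_checkpoint_files` is hypothetical, and tokenizer files are not checked because their names vary by tokenizer type.

```python
from pathlib import Path
from typing import List

# Common unsharded weight file names (an assumption about the checkpoint layout).
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")


def missing_checkpoint_files(model_checkpoint: str) -> List[str]:
    """Return a list of problems found in a local checkpoint directory (empty if none)."""
    path = Path(model_checkpoint)
    if not path.is_dir():
        return [f"{model_checkpoint} is not a directory"]
    problems = []
    if not (path / "config.json").is_file():
        problems.append("config.json")
    if not any((path / name).is_file() for name in WEIGHT_FILES):
        problems.append("weights (" + " or ".join(WEIGHT_FILES) + ")")
    return problems
```

An empty list means the basic files are present. It does not guarantee that the model is a text-classification model or that it loads successfully.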