Section Classification Script
This project provides a Python script (classification.py) for classifying text sections using a fine-tuned transformer model. The script is designed to suggest the most relevant classes for each section of text, which is useful for legal documents, content categorization, and similar NLP tasks.
Requirements
Before using this script, please install the required libraries:
pip install transformers pandas
You also need a fine-tuned classification model and its tokenizer. Update the model_checkpoint path in the script to point to your model.
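The script's own loading code is not shown here, so the following is only a minimal sketch of how a checkpoint path is commonly turned into a classifier with the transformers library. The function name load_classifier and the placeholder path in the usage comment are assumptions made for illustration; only model_checkpoint and the default top_k=4 come from this README.

```python
def load_classifier(model_checkpoint, top_k=4):
    """Build a text-classification pipeline from a checkpoint directory.

    Hypothetical helper: the real script may load its model differently.
    """
    # Imported inside the function so this sketch can be imported
    # even when transformers is not installed yet.
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        pipeline,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
    # top_k sets how many labels the pipeline returns for each input.
    return pipeline(
        "text-classification",
        model=model,
        tokenizer=tokenizer,
        top_k=top_k,
    )


# Usage (the path below is a placeholder, not a real model):
# classifier = load_classifier("./path/to/your/model")
# print(classifier("First section text"))
```

If loading fails, the usual cause is a checkpoint directory that is missing the model weights or the tokenizer files.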
How It Works
- The script loads a fine-tuned transformer model for text classification.
- It processes each section of text, splitting texts that exceed the model's maximum input length into windows.
- For each section, it predicts the top classes and saves the results.
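The windowing step can be sketched as follows. This is an illustration under stated assumptions, not the script's actual code: split_into_windows, window_size, and stride are hypothetical names, and the sketch counts whitespace-separated words, whereas the real script may count tokenizer tokens.

```python
def split_into_windows(text, window_size=200, stride=100):
    """Split text into overlapping windows of at most window_size words.

    Hypothetical helper; the sizes are illustrative defaults only.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    words = text.split()
    if not words:
        return []
    if len(words) <= window_size:
        # Short texts fit in a single window and are returned unchanged.
        return [text]
    windows = []
    for start in range(0, len(words), stride):
        windows.append(" ".join(words[start:start + window_size]))
        # Stop once the current window already reaches the end of the text.
        if start + window_size >= len(words):
            break
    return windows
```

A stride smaller than the window size makes consecutive windows overlap, so a sentence cut at one window boundary still appears intact in a neighbouring window.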
Main Functions
- get_class(sentences, top_k=4): Classifies a sentence or text and returns the top_k most likely classes.
- mean_classes(input_classes): Aggregates class results from multiple windows of a long text.
- get_window_classes(text): Handles splitting long texts into windows and aggregates their classification results.
- single_section_classification(id, section_source): Classifies a single section and returns the best and other suggested classes.
- do_classify(sections): Classifies all sections in a dictionary and saves the results to a JSON file.
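One plausible way to aggregate per-window results is to average each label's score over all windows. The sketch below assumes that each window yields a list of {"label", "score"} dictionaries, matching the shape of the output shown in this README; average_window_scores is a hypothetical name, and the script's own mean_classes may weight or filter results differently.

```python
from collections import defaultdict


def average_window_scores(window_results, top_k=4):
    """Average each label's score across windows and keep the top_k labels.

    window_results: list of windows, each a list of
    {"label": str, "score": float} dictionaries (assumed shape).
    """
    if not window_results:
        return []
    totals = defaultdict(float)
    for result in window_results:
        for item in result:
            totals[item["label"]] += item["score"]
    # Dividing by the number of windows means a label that is missing
    # from a window contributes a score of zero for that window.
    count = len(window_results)
    averaged = [
        {"label": label, "score": total / count}
        for label, total in totals.items()
    ]
    averaged.sort(key=lambda item: item["score"], reverse=True)
    return averaged[:top_k]
```

Averaging keeps a single confident window from dominating a long section. Taking the per-label maximum instead would favour such windows.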
Usage Example
Suppose you have your sections data as a dictionary:
sections = {
    "1": {"content": "First section text", "other_info": {"full_path": "..."}, "qanon_title": "..."},
    "2": {"content": "Second section text", "other_info": {"full_path": "..."}, "qanon_title": "..."}
}
You can classify all sections as follows:
from classification import do_classify
result = do_classify(sections)
After running, the results will be saved in a JSON file in the ./data/classification/ directory.
Output Structure
Each section will have a new field ai_codes with the classification results:
"1": {
    "content": "First section text",
    "ai_codes": {
        "best-class": {"label": "ClassA", "score": 0.85},
        "other-classes": [
            {"label": "ClassB", "score": 0.10},
            {"label": "ClassC", "score": 0.05}
        ]
    }
}
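To read the results back, a small helper like the one below may be convenient. It is a sketch that assumes the classified data is a dictionary keyed by section id, with the ai_codes layout shown above; summarize_best_classes is a hypothetical name and is not part of classification.py.

```python
def summarize_best_classes(result):
    """Map each section id to its (label, score) best class.

    Sections without an "ai_codes" field are skipped.
    """
    summary = {}
    for section_id, section in result.items():
        ai_codes = section.get("ai_codes")
        if not ai_codes:
            continue
        best = ai_codes["best-class"]
        summary[section_id] = (best["label"], best["score"])
    return summary


# Example input in the documented output shape.
example = {
    "1": {
        "content": "First section text",
        "ai_codes": {
            "best-class": {"label": "ClassA", "score": 0.85},
            "other-classes": [
                {"label": "ClassB", "score": 0.10},
                {"label": "ClassC", "score": 0.05},
            ],
        },
    },
    "2": {"content": "Second section text"},
}
print(summarize_best_classes(example))
```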
Notes
- Make sure the model path in model_checkpoint is correct and the model files are available.
- The script supports Persian and other languages, depending on your model.
- The output JSON file will be saved in ./data/classification/.