Section Classification Script

This project provides a Python script (classification.py) for classifying text sections using a fine-tuned transformer model. The script is designed to suggest the most relevant classes for each section of text, which is useful for legal documents, content categorization, and similar NLP tasks.

Requirements

Before using this script, install the required libraries (transformers also needs a deep learning backend such as PyTorch):

pip install transformers pandas

You also need a fine-tuned classification model and its tokenizer. Update the model_checkpoint path in the script to point to your model.

How It Works

The script loads a fine-tuned transformer model for text classification.
It processes each section of text, splitting texts that exceed the model's input size into windows.
For each section, it predicts the top classes and saves the results.
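
The windowing step can be pictured with a minimal sketch. It is not the code from classification.py: the function name split_into_windows, the parameters window_size and stride, and the overlap between windows are all assumptions made for illustration.

```python
def split_into_windows(tokens, window_size=512, stride=256):
    """Split a token list into overlapping windows of at most window_size tokens.

    Hypothetical helper: the names and the overlapping-window strategy are
    assumptions, not taken from classification.py.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if stride > window_size:
        raise ValueError("stride must not exceed window_size, or tokens would be skipped")
    # Short texts fit in a single window and need no splitting.
    if len(tokens) <= window_size:
        return [tokens]
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + window_size])
        # Stop once the current window reaches the end of the text.
        if start + window_size >= len(tokens):
            break
        start += stride
    return windows


# Ten dummy token ids, windows of four tokens, moving two tokens at a time.
windows = split_into_windows(list(range(10)), window_size=4, stride=2)
```

Overlapping windows keep some shared context between neighbours, at the cost of running the model on more windows. Setting stride equal to window_size gives non-overlapping windows.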

Main Functions

get_class(sentences, top_k=4): Classifies a sentence or text and returns the top k classes.
mean_classes(input_classes): Aggregates class results from multiple windows of a long text.
get_window_classes(text): Handles splitting long texts into windows and aggregates their classification results.
single_section_classification(id, section_source): Classifies a single section and returns the best and other suggested classes.
do_classify(sections): Classifies all sections in a dictionary and saves the results to a JSON file.
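
One plausible way to aggregate per-window results, in the spirit of mean_classes, is to average each label's score across windows. The sketch below is an assumption about the approach, not the script's actual implementation. The name average_window_scores is hypothetical, and so is the input shape (one list of {"label", "score"} dictionaries per window).

```python
from collections import defaultdict


def average_window_scores(window_results, top_k=4):
    """Average per-label scores over all windows and return the top_k labels.

    Hypothetical helper: a label missing from a window counts as score 0
    for that window, which is an assumption of this sketch.
    """
    if not window_results:
        return []
    totals = defaultdict(float)
    for result in window_results:
        for item in result:
            totals[item["label"]] += item["score"]
    count = len(window_results)
    averaged = [
        {"label": label, "score": total / count}
        for label, total in totals.items()
    ]
    # Highest average score first.
    averaged.sort(key=lambda item: item["score"], reverse=True)
    return averaged[:top_k]


# Two windows of the same long section, each with its own predictions.
window_results = [
    [{"label": "ClassA", "score": 0.8}, {"label": "ClassB", "score": 0.2}],
    [{"label": "ClassA", "score": 0.6}, {"label": "ClassC", "score": 0.4}],
]
ranked = average_window_scores(window_results)
```

Here ClassA ranks first because it scores highly in both windows, while ClassB and ClassC each appear in only one window and are averaged down accordingly.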

Usage Example

Suppose you have your sections data as a dictionary:

sections = {
    "1": {"content": "First section text", "other_info": {"full_path": "..."}, "qanon_title": "..."},
    "2": {"content": "Second section text", "other_info": {"full_path": "..."}, "qanon_title": "..."}
}

You can classify all sections as follows:

from classification import do_classify

result = do_classify(sections)

After running, the results will be saved in a JSON file in the ./data/classification/ directory.

Output Structure

Each section will have a new field ai_codes with the classification results:

"1": {
  "content": "First section text",
  "ai_codes": {
    "best-class": {"label": "ClassA", "score": 0.85},
    "other-classes": [
      {"label": "ClassB", "score": 0.10},
      {"label": "ClassC", "score": 0.05}
    ]
  }
}
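
Given that structure, the best class of each section can be collected with a few lines of Python. This is a minimal sketch that assumes the classified data is a dictionary shaped like the example above. The helper name summarize_best_classes and the min_score parameter are hypothetical.

```python
def summarize_best_classes(result, min_score=0.0):
    """Return (section_id, label, score) for each section's best class.

    Hypothetical helper: sections without ai_codes, or whose best score is
    below min_score, are skipped.
    """
    rows = []
    for section_id, section in result.items():
        best = section.get("ai_codes", {}).get("best-class")
        if best and best["score"] >= min_score:
            rows.append((section_id, best["label"], best["score"]))
    return rows


# Sample data in the shape shown above.
result = {
    "1": {
        "content": "First section text",
        "ai_codes": {
            "best-class": {"label": "ClassA", "score": 0.85},
            "other-classes": [
                {"label": "ClassB", "score": 0.10},
                {"label": "ClassC", "score": 0.05},
            ],
        },
    },
    "2": {"content": "Second section text"},  # not classified yet
}
rows = summarize_best_classes(result, min_score=0.5)
```

A threshold such as min_score can help separate confident suggestions from sections that may need manual review.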

Notes

Make sure the model path in model_checkpoint is correct and the model files are available.
The script supports Persian and other languages, depending on your model.
The output JSON file will be saved in ./data/classification/.
