data_processes/readme/readme-classifier-en.md
2025-08-16 15:14:27 +03:30

Section Classification Script

This project provides a Python script (p1_classifier.py) that classifies text sections with a fine-tuned transformer model. For each section it suggests the most relevant classes, which is useful for legal documents, content categorization, and similar NLP tasks.

Requirements

Install the required libraries before running the script:

pip install transformers pandas

You also need a fine-tuned classification model, its tokenizer, and a deep learning backend for transformers (for example PyTorch). Set model_checkpoint in the script to the path of your model.
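
A minimal sketch of how a checkpoint might be verified and loaded. It assumes model_checkpoint is a local directory written by save_pretrained (which stores a config.json next to the weights) and that a backend such as PyTorch is installed. The helper names check_checkpoint and load_classifier are hypothetical and are not taken from p1_classifier.py.

```python
from pathlib import Path


def check_checkpoint(model_checkpoint):
    """Return the checkpoint path if it looks like a saved model directory."""
    path = Path(model_checkpoint)
    if not path.is_dir():
        raise FileNotFoundError(f"Checkpoint directory not found: {path}")
    # save_pretrained() writes config.json next to the model weights.
    if not (path / "config.json").is_file():
        raise FileNotFoundError(f"No config.json in {path}")
    return path


def load_classifier(model_checkpoint):
    """Build a text-classification pipeline from a local checkpoint (assumed layout)."""
    # Imported lazily so the path check above works without transformers installed.
    from transformers import pipeline

    path = str(check_checkpoint(model_checkpoint))
    return pipeline("text-classification", model=path, tokenizer=path)
```

Checking the path first gives a clear error message before the slower model load starts.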

How It Works

  • The script loads a fine-tuned transformer model for text classification.
  • It processes each section of text, splitting texts that exceed the model's input size into windows.
  • For each section, it predicts the top classes and saves the results.
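
The windowing step can be sketched as follows. This is an illustration under assumptions: the window size of 512 tokens, the stride of 256, and the helper name split_into_windows are hypothetical, and tokens are represented as a plain list so the example runs without a tokenizer. The actual script may split texts differently.

```python
def split_into_windows(tokens, window_size=512, stride=256):
    """Split a token list into overlapping windows of at most window_size items."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if len(tokens) <= window_size:
        return [list(tokens)]  # short text: a single window is enough
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + window_size]))
        if start + window_size >= len(tokens):
            break  # this window already reaches the end of the text
        start += stride
    return windows
```

With a stride smaller than the window size, neighbouring windows overlap, so a sentence cut at one window boundary still appears whole in the next window.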

Main Functions

  • get_class(sentences, top_k=4): Classifies a sentence or longer text and returns the top_k highest-scoring classes.
  • mean_classes(input_classes): Combines the class results from the windows of a long text into a single result.
  • get_window_classes(text): Splits a long text into windows and aggregates the windows' classification results.
  • single_section_classification(id, section_source): Classifies a single section and returns the best and other suggested classes.
  • do_classify(sections): Classifies all sections in a dictionary and saves the results to a JSON file.
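
One plausible way to aggregate per-window predictions is to average each label's score over all windows, then split the ranking into a best class and the remaining suggestions. The sketch below does that. The averaging rule (a label missing from a window counts as 0 for that window) and the helper names mean_window_scores and to_ai_codes are assumptions, not the script's confirmed behaviour.

```python
from collections import defaultdict


def mean_window_scores(window_results, top_k=4):
    """Average label scores over windows and return the top_k labels.

    window_results holds one list per window of {"label": str, "score": float}.
    """
    if not window_results:
        return []
    totals = defaultdict(float)
    for window in window_results:
        for item in window:
            totals[item["label"]] += item["score"]
    n = len(window_results)  # labels absent from a window contribute 0 for it
    averaged = [{"label": label, "score": total / n} for label, total in totals.items()]
    averaged.sort(key=lambda item: item["score"], reverse=True)
    return averaged[:top_k]


def to_ai_codes(ranked):
    """Shape a ranked list into the best-class / other-classes layout."""
    if not ranked:
        return {"best-class": None, "other-classes": []}
    return {"best-class": ranked[0], "other-classes": ranked[1:]}
```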

Usage Example

Suppose you have your sections data as a dictionary:

sections = {
    "1": {"content": "First section text", "other_info": {"full_path": "..."}, "qanon_title": "..."},
    "2": {"content": "Second section text", "other_info": {"full_path": "..."}, "qanon_title": "..."}
}

You can classify all sections as follows:

from p1_classifier import do_classify

result = do_classify(sections)

After running, the results will be saved in a JSON file in the ./data/classification/ directory.
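
The save step might look like the sketch below. The directory ./data/classification/ comes from the text above. The file name classified_sections.json, the helper name save_results, and the ensure_ascii=False setting are assumptions for illustration.

```python
import json
from pathlib import Path


def save_results(sections, out_dir="./data/classification",
                 file_name="classified_sections.json"):
    """Write the classified sections to a JSON file and return its path."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    target = out_path / file_name
    with target.open("w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-ASCII section text readable in the file.
        json.dump(sections, handle, ensure_ascii=False, indent=2)
    return target
```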

Output Structure

Each section gains a new field, ai_codes, that holds the classification results:

"1": {
  "content": "First section text",
  "ai_codes": {
    "best-class": {"label": "ClassA", "score": 0.85},
    "other-classes": [
      {"label": "ClassB", "score": 0.10},
      {"label": "ClassC", "score": 0.05}
    ]
  }
}
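
To work with this output, a downstream step could keep only the sections whose best class clears a confidence threshold. The sketch assumes the structure shown above. The 0.5 threshold and the helper name confident_labels are hypothetical.

```python
def confident_labels(results, min_score=0.5):
    """Map section id to best label for sections scoring at least min_score."""
    selected = {}
    for section_id, section in results.items():
        # Sections without ai_codes or without a best class are skipped.
        best = section.get("ai_codes", {}).get("best-class")
        if best and best["score"] >= min_score:
            selected[section_id] = best["label"]
    return selected
```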

Notes

  • Make sure the model path in model_checkpoint is correct and the model files are available.
  • The output JSON file will be saved in ./data/classification/.