data_processes/readme/readme-words-embedder-en.md
2025-08-16 14:44:56 +03:30

Sentence Embedding Generator

This project provides a Python script (embedding.py) that generates sentence embeddings with the Sentence Transformers library.

Requirements

Before using this script, please install the required libraries:

pip install sentence-transformers numpy

How It Works

  • The script uses the pre-trained model sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, which maps each input text to a 384-dimensional vector.
  • There are two main functions:
    • single_section_embedder(sentence): Takes a sentence (string) and returns its embedding as a vector.
    • do_word_embedder(sections): Takes a dictionary of sections (each with a content field), generates embeddings for each section, and saves the results as a JSON file.
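
The two functions described above could be sketched roughly as follows. This is not the actual content of embedding.py: the names get_model, embed_sentence and embed_sections, the optional encode parameter, and the lazy model loading are all assumptions made for illustration, and the step that saves the JSON file is left out.

```python
from functools import lru_cache

MODEL_NAME = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"


@lru_cache(maxsize=1)
def get_model():
    """Load the model once and reuse it (hypothetical helper)."""
    # Imported lazily so this module can be imported without loading the model.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(MODEL_NAME)


def embed_sentence(sentence, encode=None):
    """Return the embedding of one sentence as a plain list of floats."""
    if encode is None:
        encode = get_model().encode
    vector = encode(sentence)
    # Convert to built-in floats so the result can be written to JSON.
    return [float(x) for x in vector]


def embed_sections(sections, encode=None):
    """Return a copy of `sections` with an "embeddings" list added to each entry."""
    result = {}
    for key, section in sections.items():
        embedding = embed_sentence(section["content"], encode)
        result[key] = {**section, "embeddings": embedding}
    return result
```

Passing a custom encode callable is only a convenience for trying the code without downloading the model; when it is omitted, the sketch falls back to the real model's encode method.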

Usage

1. Get Embedding for a Single Sentence

from embedding import single_section_embedder

sentence = "This is a sample sentence."
embedding = single_section_embedder(sentence)
print(embedding)
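
A common next step is to compare two embeddings. The following is a minimal sketch, assuming each embedding is a flat sequence of numbers. cosine_similarity is a hypothetical helper, not part of embedding.py, and the short vectors are made-up stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for single_section_embedder(...) results.
vec_a = [1.0, 0.0, 1.0]
vec_b = [1.0, 0.0, 1.0]
vec_c = [0.0, 1.0, 0.0]

print(cosine_similarity(vec_a, vec_b))  # identical direction, close to 1
print(cosine_similarity(vec_a, vec_c))  # orthogonal, 0
```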

2. Generate Embeddings for Multiple Sections and Save to File

Suppose your data is structured like this:

sections = {
    "1": {"content": "First section text"},
    "2": {"content": "Second section text"}
}

You can generate and save embeddings as follows:

from embedding import do_word_embedder

# `sections` is the dictionary defined above.
result = do_word_embedder(sections)

After running, a file named sections_embeddings_YEAR-MONTH-DAY-HOUR.json (with the date and hour of the run filled in) is created in the ./data/embeddings/ directory. It contains the embeddings for each section.
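
A file name of this shape can be built from the current time. The sketch below is an assumption about how the name might be produced, not the script's actual code. embeddings_path is a hypothetical helper, and whether the real script zero-pads the fields the same way is not confirmed here.

```python
from datetime import datetime
from pathlib import Path


def embeddings_path(now=None, directory="./data/embeddings"):
    """Build a path like sections_embeddings_2025-08-16-14.json (hypothetical helper)."""
    now = now or datetime.now()
    # Year-month-day-hour, each field zero-padded.
    stamp = now.strftime("%Y-%m-%d-%H")
    return Path(directory) / f"sections_embeddings_{stamp}.json"


print(embeddings_path(datetime(2025, 8, 16, 14)).name)
```

Because the name only goes down to the hour, two runs within the same hour would map to the same file under this scheme.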

Output Structure

The output is a JSON file where each section has its embedding added:

{
  "1": {
    "content": "First section text",
    "embeddings": [0.123, 0.456, ...]
  },
  ...
}
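
To use the saved file later, it can be read back with the standard json module. This is a minimal sketch that assumes the structure shown above. load_embeddings and check_dimensions are hypothetical helpers and not part of embedding.py.

```python
import json


def load_embeddings(path):
    """Read the output file and return {section_key: embedding_list}."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return {key: section["embeddings"] for key, section in data.items()}


def check_dimensions(vectors):
    """Return the shared vector length, or raise if the lengths differ."""
    lengths = {len(v) for v in vectors.values()}
    if len(lengths) > 1:
        raise ValueError(f"inconsistent embedding lengths: {sorted(lengths)}")
    return lengths.pop() if lengths else 0
```

Checking that all vectors have the same length is a cheap way to catch a truncated or hand-edited file before the embeddings are used for anything else.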

Notes

  • Make sure the folder ./data/embeddings/ exists before running the script.
  • The script supports Persian text; the underlying model is multilingual.
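
The output folder can be created from Python before calling the script, instead of by hand. A minimal sketch, where ensure_output_dir is a hypothetical helper that is not part of embedding.py:

```python
from pathlib import Path


def ensure_output_dir(directory="./data/embeddings"):
    """Create the output directory (and any parents) if it does not exist yet."""
    path = Path(directory)
    # exist_ok=True makes the call safe to repeat on every run.
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Calling ensure_output_dir() once at the start of a run is enough; it does nothing if the folder is already there.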