
# Sentence Embedding Generator

This project provides a Python script (`embedding.py`) for generating sentence embeddings using the Sentence Transformers library.

## Requirements

Before using this script, please install the required libraries:

```bash
pip install sentence-transformers numpy
```

## How It Works

- The script uses the pre-trained model: `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`.
- There are two main functions:
  - `single_section_embedder(sentence)`: Takes a sentence (string) and returns its embedding as a vector.
  - `do_word_embedder(sections)`: Takes a dictionary of sections (each with a `content` field), generates embeddings for each section, and saves the results as a JSON file.
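
The flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual contents of `embedding.py`: the names `embed_sentence` and `embed_sections`, and the `encode` and `output_path` parameters, are hypothetical. They are introduced so that the sketch runs without downloading the model. With the real library, `encode` would typically be the `encode` method of a `SentenceTransformer` instance loaded with the model name above.

```python
import json


def embed_sentence(sentence, encode):
    """Return the embedding of one sentence as a plain list of floats.

    `encode` is a hypothetical stand-in for the model's encode function.
    """
    vector = encode(sentence)
    # SentenceTransformer.encode returns a NumPy array by default; tolist()
    # makes it JSON-serializable. Plain sequences are copied into a list.
    return vector.tolist() if hasattr(vector, "tolist") else list(vector)


def embed_sections(sections, encode, output_path):
    """Add an "embeddings" field to every section and save the result as JSON."""
    result = {}
    for key, section in sections.items():
        # Copy the section so the caller's dictionary is left untouched.
        result[key] = {
            **section,
            "embeddings": embed_sentence(section["content"], encode),
        }

    # ensure_ascii=False keeps non-Latin text (for example Persian) readable.
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=2)
    return result
```

Passing the encoder in as an argument is only a convenience for this sketch; the real script may load the model once at import time instead.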

## Usage

### 1. Get Embedding for a Single Sentence

```python
from embedding import single_section_embedder

sentence = "This is a sample sentence."
embedding = single_section_embedder(sentence)
print(embedding)
```

### 2. Generate Embeddings for Multiple Sections and Save to File

Suppose your data is structured like this:

```python
sections = {
    "1": {"content": "First section text"},
    "2": {"content": "Second section text"}
}
```

You can generate and save embeddings as follows:

```python
from embedding import do_word_embedder

# `sections` is the dictionary defined in the previous snippet.
result = do_word_embedder(sections)
```

After the call completes, a file named in the pattern `sections_embeddings_YEAR-MONTH-DAY-HOUR.json` is created in the `./data/embeddings/` directory. It contains the embedding for each section.
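
As a rough illustration of the naming pattern, the sketch below builds such a path with the standard library. The helper name `embeddings_file_path` and the exact format string `%Y-%m-%d-%H` (four-digit year, zero-padded month, day, and 24-hour hour) are assumptions; the script may format the timestamp differently.

```python
import os
from datetime import datetime


def embeddings_file_path(now=None, output_dir="./data/embeddings"):
    """Build an output path following the sections_embeddings_<timestamp>.json pattern."""
    now = now or datetime.now()
    # Assumed format: YEAR-MONTH-DAY-HOUR, all zero-padded.
    stamp = now.strftime("%Y-%m-%d-%H")
    return os.path.join(output_dir, f"sections_embeddings_{stamp}.json")


# Example with a fixed time so the result is reproducible.
print(embeddings_file_path(datetime(2024, 5, 17, 9)))
```

Under this assumed format, two runs within the same hour map to the same file name, so a later run may overwrite an earlier one.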

## Output Structure

The output is a JSON file where each section has its embedding added:

```json
{
  "1": {
    "content": "First section text",
    "embeddings": [0.123, 0.456, ...]
  },
  ...
}
```
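
To consume a generated file, something like the following sketch may be enough. The helper name `load_embeddings` is hypothetical; it relies only on the structure shown above, where every section carries an `embeddings` list.

```python
import json


def load_embeddings(path):
    """Return a {section_key: embedding_vector} mapping from an output file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Keep only the vectors; the original text stays in the file under "content".
    return {key: section["embeddings"] for key, section in data.items()}
```

Calling it with the path of a generated file gives a dictionary whose values are lists of floats, one per section.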

## Notes

- Make sure the folder `./data/embeddings/` exists before running the script.
- The script supports Persian text; the model it uses is multilingual.
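
The output folder mentioned in the first note can be created from Python before calling the script. This is a small sketch using the standard library; the helper name `ensure_output_dir` is hypothetical.

```python
import os


def ensure_output_dir(path="./data/embeddings"):
    """Create the output directory (and any missing parents) if it does not exist."""
    # exist_ok=True makes the call safe to repeat when the folder is already there.
    os.makedirs(path, exist_ok=True)
    return path
```

Running `ensure_output_dir()` once at the start of a session has the same effect as creating the folder by hand.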