data_processes/readme/readme-words-embedder-en.md
2025-08-16 14:44:56 +03:30

# Sentence Embedding Generator
This project provides a Python script (`embedding.py`) for generating sentence embeddings using the [Sentence Transformers](https://www.sbert.net/) library.
## Requirements
Before using this script, please install the required libraries:
```bash
pip install sentence-transformers numpy
```
## How It Works
- The script uses the pre-trained model: `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`.
- There are two main functions:
  - `single_section_embedder(sentence)`: Takes a sentence (string) and returns its embedding as a vector.
  - `do_word_embedder(sections)`: Takes a dictionary of sections (each with a `content` field), generates an embedding for each section, and saves the results as a JSON file.
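
The flow described above can be sketched roughly as follows. This is not the code in `embedding.py`: the names `Encoder`, `load_default_encoder`, `embed_sections`, and `fake_encode` are hypothetical and exist only for illustration. The sketch assumes the encoder is passed in as a function, so the demonstration at the bottom can run with a stand-in encoder and without downloading the model.

```python
import json
from typing import Callable, Dict, List

# Hypothetical type alias: any function that maps a string to a list of floats.
Encoder = Callable[[str], List[float]]


def load_default_encoder() -> Encoder:
    """Build an encoder from the model named in this README.

    The import is deferred so the rest of the sketch works even when
    sentence-transformers is not installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(
        "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
    )
    # encode() returns a NumPy array for a single string;
    # tolist() turns it into a JSON-serializable list.
    return lambda text: model.encode(text).tolist()


def embed_sections(sections: Dict[str, dict], encode: Encoder) -> Dict[str, dict]:
    """Return a copy of `sections` with an `embeddings` list added to each entry."""
    return {
        key: {**section, "embeddings": encode(section["content"])}
        for key, section in sections.items()
    }


def fake_encode(text: str) -> List[float]:
    # Stand-in encoder for demonstration only: NOT a real embedding.
    return [float(len(text)), float(text.count(" "))]


sections = {
    "1": {"content": "First section text"},
    "2": {"content": "Second section text"},
}
embedded = embed_sections(sections, fake_encode)
print(json.dumps(embedded, indent=2))
```

To try the sketch with the real model, pass `load_default_encoder()` instead of `fake_encode`.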
## Usage
### 1. Get Embedding for a Single Sentence
```python
from embedding import single_section_embedder
sentence = "This is a sample sentence."
embedding = single_section_embedder(sentence)
print(embedding)
```
### 2. Generate Embeddings for Multiple Sections and Save to File
Suppose your data is structured like this:
```python
sections = {
    "1": {"content": "First section text"},
    "2": {"content": "Second section text"},
}
```
You can generate and save embeddings as follows:
```python
from embedding import do_word_embedder
result = do_word_embedder(sections)
```
After the call completes, a file with a name of the form `sections_embeddings_YEAR-MONTH-DAY-HOUR.json` is created in the `./data/embeddings/` directory. It contains the embedding for each section.
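
For illustration, a file name of that form could be built as shown below. This is a minimal sketch under assumptions: the helper `embeddings_path` is hypothetical, and the zero-padded `%Y-%m-%d-%H` format is a guess at what `YEAR-MONTH-DAY-HOUR` means, not something confirmed by `embedding.py`.

```python
from datetime import datetime
from pathlib import Path


def embeddings_path(now: datetime, out_dir: str = "./data/embeddings") -> Path:
    """Build an output path of the form sections_embeddings_YEAR-MONTH-DAY-HOUR.json."""
    # Assumption: each field is zero-padded and the hour uses the 24-hour clock.
    stamp = now.strftime("%Y-%m-%d-%H")
    return Path(out_dir) / f"sections_embeddings_{stamp}.json"


# A fixed timestamp keeps the example reproducible.
example = embeddings_path(datetime(2025, 8, 16, 14))
print(example.name)
```

If the name really carries only hour resolution, two runs within the same hour would target the same file, so it may be worth copying earlier results elsewhere before re-running.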
## Output Structure
The output is a JSON file where each section has its embedding added:
```json
{
  "1": {
    "content": "First section text",
    "embeddings": [0.123, 0.456, ...]
  },
  ...
}
```
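
Once the file exists, the stored vectors can be compared directly. The sketch below computes cosine similarity between two sections using only the standard library. The two-dimensional vectors are made-up values chosen to keep the arithmetic easy to follow, and the inline JSON string stands in for the real output file, which you would read with `json.load` instead.

```python
import json
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # similarity is undefined for a zero vector; 0.0 by convention here
    return dot / (norm_a * norm_b)


# Illustrative data in the documented output shape (values are invented).
raw = """
{
  "1": {"content": "First section text", "embeddings": [1.0, 0.0]},
  "2": {"content": "Second section text", "embeddings": [0.6, 0.8]}
}
"""
data = json.loads(raw)
score = cosine_similarity(data["1"]["embeddings"], data["2"]["embeddings"])
print(round(score, 3))
```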
## Notes
- Make sure the folder `./data/embeddings/` exists before running the script.
- The script supports Persian text, since the underlying model is multilingual.
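
Regarding the first note, one way to make sure the output folder exists is to create it before calling `do_word_embedder`. The helper name `ensure_output_dir` is hypothetical. The demonstration uses a temporary directory so that running it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def ensure_output_dir(path="./data/embeddings") -> Path:
    """Create the directory (and any missing parents) if it does not exist yet."""
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return out


# Demonstration inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    created = ensure_output_dir(Path(tmp) / "data" / "embeddings")
    existed = created.is_dir()
    # Calling it a second time is harmless.
    ensure_output_dir(created)

print(existed)
```

In a real run, calling `ensure_output_dir()` with its default argument just before `do_word_embedder(sections)` should be enough.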