# Sentence Embedding Generator

This project provides a Python script (`p3_words_embedder.py`) for generating sentence embeddings using the Sentence Transformers library.

## Requirements

Before using this script, please install the required libraries:

```bash
pip install sentence-transformers numpy
```

## How It Works

- The script uses the pre-trained model: `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`.
- There are two main functions:
  - `single_section_embedder(sentence)`: Takes a sentence (string) and returns its embedding as a vector.
  - `do_word_embedder(sections)`: Takes a dictionary of sections (each with a `content` field), generates embeddings for each section, and saves the results as a JSON file.

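The two steps described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `p3_words_embedder.py`: the names `load_model`, `embed_sentence`, and `embed_sections`, and the `out_path` parameter, are hypothetical. The only library call used is `SentenceTransformer(...).encode(...)`.

```python
import json
from pathlib import Path

MODEL_NAME = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"


def load_model(name=MODEL_NAME):
    """Load the pre-trained model (it is downloaded on first use)."""
    # Imported lazily so the helpers below can be tried with any object
    # that exposes an encode() method.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(name)


def embed_sentence(model, sentence):
    """Return the embedding of one sentence as a plain list of floats."""
    return [float(x) for x in model.encode(sentence)]


def embed_sections(model, sections, out_path):
    """Add an "embeddings" field to every section and save the result as JSON."""
    result = {
        key: {**section, "embeddings": embed_sentence(model, section["content"])}
        for key, section in sections.items()
    }
    with Path(out_path).open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII text (e.g. Persian) readable in the file.
        json.dump(result, f, ensure_ascii=False, indent=2)
    return result
```

Under these assumptions, `embed_sections(load_model(), sections, "out.json")` would write one entry per section. Converting each vector to a plain list is what makes it JSON-serializable.
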
## Usage

### 1. Get Embedding for a Single Sentence

```python
from p3_words_embedder import single_section_embedder

sentence = "This is a sample sentence."
embedding = single_section_embedder(sentence)
print(embedding)
```

### 2. Generate Embeddings for Multiple Sections and Save to File

Suppose your data is structured like this:

```python
sections = {
    "1": {"content": "First section text"},
    "2": {"content": "Second section text"}
}
```

You can generate and save embeddings as follows:

```python
from p3_words_embedder import do_word_embedder

result = do_word_embedder(sections)
```

After it runs, the script creates a file named `sections_embeddings_YEAR-MONTH-DAY-HOUR.json` in the `./data/embeddings/` directory, containing the embeddings for each section.

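A timestamped name of this shape could be built as in the sketch below. The exact format string is an assumption (zero-padded year, month, day, and 24-hour hour), and `embeddings_filename` is a hypothetical helper, not a function exported by the script.

```python
from datetime import datetime


def embeddings_filename(now=None):
    """Build a name such as sections_embeddings_2024-05-17-09.json."""
    now = now or datetime.now()
    # Assumed format: YEAR-MONTH-DAY-HOUR, zero-padded.
    return f"sections_embeddings_{now.strftime('%Y-%m-%d-%H')}.json"


print(embeddings_filename(datetime(2024, 5, 17, 9)))
# sections_embeddings_2024-05-17-09.json
```

If the script does follow this pattern, two runs within the same hour would produce the same file name, so the later run would likely overwrite the earlier file.
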
## Output Structure

The output is a JSON file where each section has its embedding added:

```json
{
  "1": {
    "content": "First section text",
    "embeddings": [0.123, 0.456, ...]
  },
  ...
}
```

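Reading such a file back is straightforward with the standard library. The sketch below assumes the layout shown above (a top-level object keyed by section ID, each entry carrying an `embeddings` list); `load_embeddings` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_embeddings(path):
    """Read an embeddings file and return {section_id: embedding}."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    # Keep only the vectors; the "content" text stays available in `data`.
    return {key: section["embeddings"] for key, section in data.items()}
```

The returned dictionary maps each section ID (for example `"1"`) to its list of floats.
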
## Notes

- Make sure the folder `./data/embeddings/` exists before running the script.
- The script supports the Persian language.
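
One way to satisfy the first note from Python is to create the folder up front. This is a small sketch using only `pathlib`; `ensure_output_dir` is a hypothetical helper, and the default path simply mirrors the directory named above.

```python
from pathlib import Path


def ensure_output_dir(path="./data/embeddings"):
    """Create the output folder (and any missing parents) if it is absent."""
    out_dir = Path(path)
    # exist_ok=True makes the call safe to repeat on every run.
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```

Calling `ensure_output_dir()` once before `do_word_embedder(sections)` should be enough; from a shell, `mkdir -p ./data/embeddings` has the same effect.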