# Discord Chat Embeddings Visualizer

A Streamlit application that plots Discord chat messages in 2D space based on their vector embeddings.

## Features

- **2D Visualization**: View chat messages plotted in 2D using PCA or t-SNE dimensionality reduction
- **Interactive Plotting**: Hover over points to see message content, author, and timestamp
- **Filtering**: Filter by source chat log file and by author
- **Multiple Datasets**: Automatically loads all CSV files from the `discord_chat_logs` folder

## Installation

1. Install the required dependencies:

```bash
pip install -r requirements.txt
```

## Usage

Run the Streamlit application:

```bash
streamlit run streamlit_app.py
```

The app will automatically load all CSV files from the `../../discord_chat_logs/` directory.

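
As a rough illustration of the loading step, the sketch below collects every CSV file in a folder and tags each row with the file it came from. The function name `load_chat_logs` and the added `source_file` key are hypothetical and not taken from the app; the real implementation may differ (for example, it may use pandas).

```python
import csv
from pathlib import Path


def load_chat_logs(log_dir):
    """Read every *.csv file in log_dir into a list of row dicts."""
    rows = []
    # sorted() keeps the load order stable across runs.
    for csv_path in sorted(Path(log_dir).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Hypothetical extra key so rows can later be filtered by source file.
                row["source_file"] = csv_path.name
                rows.append(row)
    return rows
```

Calling `load_chat_logs("../../discord_chat_logs")` would then return one dict per message, keyed by the CSV column names.
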
## Data Format

The application expects CSV files with the following columns:

- `message_id`: Unique identifier for the message
- `timestamp_utc`: When the message was sent (UTC)
- `author_id`: Author's Discord ID
- `author_name`: Author's username
- `author_nickname`: Author's server nickname
- `content`: The message content
- `attachment_urls`: URLs of any attached files
- `embeds`: Embedded content
- `content_embedding`: Vector embedding of the message content, stored as the string representation of a list

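
Because `content_embedding` is stored as text, it has to be turned back into numbers before any dimensionality reduction. A minimal sketch of that step, assuming the string looks like a Python list literal such as `"[0.1, -0.2, 0.3]"` (the helper name `parse_embedding` is hypothetical, and the app may parse the column differently):

```python
import ast


def parse_embedding(raw):
    """Convert a stringified list such as "[0.1, -0.2]" into a list of floats."""
    # ast.literal_eval only evaluates Python literals, so it is safer than eval().
    values = ast.literal_eval(raw)
    if not isinstance(values, (list, tuple)):
        raise ValueError("expected a list of numbers")
    return [float(v) for v in values]
```

Rows whose embedding is empty or malformed raise an exception here, so a real loader would probably skip or report them.
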
## Visualization Options

- **PCA**: Principal Component Analysis. Faster, and good for getting an overview
- **t-SNE**: t-Distributed Stochastic Neighbor Embedding. Slower, but may separate clusters more clearly

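
To make the PCA option more concrete, here is a dependency-free sketch of what it computes: center the embeddings, find the two directions of greatest variance by power iteration, and project each vector onto them. This is illustrative only. The function name `project_to_2d` is hypothetical, and the app itself most likely relies on a library implementation rather than code like this.

```python
import math
import random


def project_to_2d(vectors, iterations=100, seed=0):
    """Project equal-length vectors onto their top two principal components."""
    n, dim = len(vectors), len(vectors[0])
    # Center each dimension so the components describe variance, not offset.
    means = [sum(v[j] for v in vectors) / n for j in range(dim)]
    centered = [[v[j] - means[j] for j in range(dim)] for v in vectors]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    rng = random.Random(seed)
    components = []
    for _ in range(2):
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(iterations):
            # Multiply by the (unnormalized) covariance matrix: X^T (X w).
            scores = [dot(row, w) for row in centered]
            w = [sum(s * row[j] for s, row in zip(scores, centered))
                 for j in range(dim)]
            # Deflate: remove directions already captured by earlier components.
            for c in components:
                overlap = dot(w, c)
                w = [x - overlap * y for x, y in zip(w, c)]
            norm = math.sqrt(dot(w, w))
            if norm == 0.0:
                break  # no variance left in any remaining direction
            w = [x / norm for x in w]
        components.append(w)
    # Each output point holds the coordinates along the two components.
    return [[dot(row, c) for c in components] for row in centered]
```

The sign of each output axis is arbitrary, which is also true of library PCA implementations. t-SNE is considerably more involved and is not sketched here.
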
## Controls

- **Dimension Reduction Method**: Choose between PCA and t-SNE
- **Filter by Source Files**: Select which chat log files to include
- **Filter by Authors**: Select which authors to display
- **Show Data Table**: View the underlying data in table format

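
The two filter controls amount to keeping only the rows whose source file and author are in the current selection. A minimal sketch, assuming each message is a dict with an `author_name` column and a hypothetical `source_file` key recording which CSV it came from, and assuming that an empty selection means "show everything" (the app may behave differently):

```python
def filter_rows(rows, source_files=None, authors=None):
    """Keep rows matching the selected source files and authors.

    A selection of None (or an empty one) is treated as "no filter".
    """
    files = set(source_files or [])
    names = set(authors or [])
    return [
        row for row in rows
        if (not files or row.get("source_file") in files)
        and (not names or row.get("author_name") in names)
    ]
```

For example, `filter_rows(rows, authors=["alice"])` keeps only messages by `alice` from any file.
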
|
## Performance Notes

- For large datasets, consider filtering by authors or source files to improve performance
- t-SNE is computationally intensive and can be slow on large datasets
- The app caches loaded data and computation results for better performance
