cult-scraper/apps/cluster_map/README.md
2025-08-11 01:59:48 +01:00

# Discord Chat Embeddings Visualizer
A Streamlit application that visualizes Discord chat messages using their vector embeddings in 2D space.
## Features
- **2D Visualization**: View chat messages plotted using PCA or t-SNE dimension reduction
- **Interactive Plotting**: Hover over points to see message content, author, and timestamp
- **Filtering**: Filter by source chat log files and authors
- **Multiple Datasets**: Automatically loads all CSV files from the `discord_chat_logs` folder
## Installation
1. Install the required dependencies:
```bash
pip install -r requirements.txt
```
## Usage
Run the Streamlit application:
```bash
streamlit run streamlit_app.py
```
The app will automatically load all CSV files from the `../../discord_chat_logs/` directory.
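The loading step described above might look roughly like the following. This is a minimal sketch using only the standard library, not the app's actual code: the function name `load_chat_logs` and the added `source_file` key are hypothetical, chosen here so each row remembers which log it came from.

```python
import csv
from pathlib import Path


def load_chat_logs(log_dir):
    """Read every *.csv file in log_dir into a list of row dicts."""
    rows = []
    # Sort the paths so the load order is deterministic.
    for csv_path in sorted(Path(log_dir).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                # "source_file" is a hypothetical key, added to support
                # filtering by chat log later on.
                row["source_file"] = csv_path.name
                rows.append(row)
    return rows
```

Calling `load_chat_logs("../../discord_chat_logs")` would then return one dict per message across all log files.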
## Data Format
The application expects CSV files with the following columns:
- `message_id`: Unique identifier for the message
- `timestamp_utc`: When the message was sent (UTC)
- `author_id`: Author's Discord ID
- `author_name`: Author's username
- `author_nickname`: Author's server nickname
- `content`: The message content
- `attachment_urls`: Any attached files
- `embeds`: Embedded content
- `content_embedding`: Vector embedding of the message content (as a string representation of a list)
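Because `content_embedding` is stored as text, it has to be parsed back into numbers before plotting. The sketch below assumes the string is a Python-style or JSON-style list such as `"[0.12, -0.03, 0.88]"`; the helper name `parse_embedding` is hypothetical, and a different serialization (for example a space-separated NumPy printout) would need a different parser.

```python
import ast


def parse_embedding(text):
    """Convert a string like "[0.1, -0.2]" into a list of floats."""
    # literal_eval only evaluates literals, so it is safe on untrusted CSV data.
    values = ast.literal_eval(text)
    if not isinstance(values, (list, tuple)):
        raise ValueError("content_embedding must be a list of numbers")
    return [float(v) for v in values]
```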
## Visualization Options
- **PCA**: Principal Component Analysis. A fast, linear projection; good for getting an overview
- **t-SNE**: t-Distributed Stochastic Neighbor Embedding. Slower and nonlinear, but often separates local clusters more clearly than PCA
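To illustrate what the PCA option does, here is a minimal sketch that projects embeddings onto their first two principal components using NumPy's SVD. It is an illustration of the technique only: the README does not say which library the app uses (scikit-learn's `sklearn.decomposition.PCA` and `sklearn.manifold.TSNE` are common choices), and the function name `pca_2d` is hypothetical.

```python
import numpy as np


def pca_2d(embeddings):
    """Project an (n_samples, n_dims) array onto its top two principal components."""
    X = np.asarray(embeddings, dtype=float)
    # Center each dimension so the components describe variance, not the mean.
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:2].T
```

The result has one `(x, y)` row per message, ready to hand to a scatter plot. The sketch assumes at least two messages and embeddings with at least two dimensions.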
## Controls
- **Dimension Reduction Method**: Choose between PCA and t-SNE
- **Filter by Source Files**: Select which chat log files to include
- **Filter by Authors**: Select which authors to display
- **Show Data Table**: View the underlying data in table format
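The two filters amount to keeping only rows whose source file and author are both in the current selection. The following is a plain-Python sketch of that logic, not the app's Streamlit code: `filter_rows` is a hypothetical helper, `source_file` is an assumed key recording which CSV a row came from, and `author_name` is the column listed under Data Format.

```python
def filter_rows(rows, source_files=None, authors=None):
    """Keep rows matching the selected source files and authors.

    Passing None for a filter means "no restriction" on that field.
    """
    selected = []
    for row in rows:
        if source_files is not None and row["source_file"] not in source_files:
            continue
        if authors is not None and row["author_name"] not in authors:
            continue
        selected.append(row)
    return selected
```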
## Performance Notes
- For large datasets, consider filtering by authors or source files to improve performance
- t-SNE is computationally intensive, so expect noticeably longer waits than PCA on large datasets
- The app caches data and computations for better performance