cult-scraper/apps/cluster_map/README.md
2025-08-11 01:59:48 +01:00

Discord Chat Embeddings Visualizer

A Streamlit application that plots Discord chat messages in 2D by reducing the dimensionality of their vector embeddings.

Features

  • 2D Visualization: View chat messages plotted using PCA or t-SNE dimensionality reduction
  • Interactive Plotting: Hover over points to see message content, author, and timestamp
  • Filtering: Filter by source chat log files and authors
  • Multiple Datasets: Automatically loads all CSV files from the discord_chat_logs folder

Installation

  1. Install the required dependencies:
pip install -r requirements.txt

Usage

Run the Streamlit application:

streamlit run streamlit_app.py

The app will automatically load all CSV files from the ../../discord_chat_logs/ directory.
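
As a rough illustration of the loading step, the sketch below reads every CSV file in a directory using only the standard library. It is not the app's actual code: the function name load_chat_logs and the added source_file column are assumptions made for this example.

```python
import csv
from pathlib import Path


def load_chat_logs(log_dir):
    """Read every *.csv file in log_dir into one list of row dicts."""
    rows = []
    # Sort the paths so the load order is stable across runs.
    for path in sorted(Path(log_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # "source_file" is a hypothetical column added here so rows
                # can later be filtered by the file they came from.
                row["source_file"] = path.name
                rows.append(row)
    return rows
```

Calling load_chat_logs("../../discord_chat_logs") would return one dict per message, keyed by the CSV column names.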

Data Format

The application expects CSV files with the following columns:

  • message_id: Unique identifier for the message
  • timestamp_utc: When the message was sent (UTC)
  • author_id: Author's Discord ID
  • author_name: Author's username
  • author_nickname: Author's server nickname
  • content: The message content
  • attachment_urls: URLs of any attached files
  • embeds: Embedded content
  • content_embedding: Vector embedding of the message content (as a string representation of a list)
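
Because content_embedding is stored as text, it has to be converted back into numbers before any dimensionality reduction. A minimal sketch, assuming the string is a Python-style list literal such as "[0.12, -0.5, 3]" (the helper name parse_embedding is hypothetical):

```python
import ast


def parse_embedding(text):
    """Convert a string such as "[0.1, -0.2]" into a list of floats."""
    # literal_eval only accepts Python literals, so it is safer than eval().
    values = ast.literal_eval(text)
    return [float(v) for v in values]


vector = parse_embedding("[0.12, -0.5, 3]")
print(vector)  # [0.12, -0.5, 3.0]
```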

Visualization Options

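  • PCA: Principal Component Analysis - a fast linear projection; good for getting an overview
  • t-SNE: t-Distributed Stochastic Neighbor Embedding - slower and non-linear; it preserves local neighborhoods, so it may reveal clusters that PCA blurs together, but distances between clusters in the plot are not meaningful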
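
To make the PCA option concrete, here is a small NumPy sketch that projects embeddings onto their top two principal components via an SVD. It only illustrates the technique; the app itself may well use a library implementation instead, and the function name pca_2d is an assumption.

```python
import numpy as np


def pca_2d(embeddings):
    """Project row vectors onto their top two principal components."""
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0)  # center each dimension
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T


points = pca_2d([[0, 0, 0], [1, 2, 0], [2, 4, 0], [3, 6, 0.1]])
print(points.shape)  # (4, 2)
```

Each row of points is the (x, y) position of one message in the scatter plot.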

Controls

  • Dimension Reduction Method: Choose between PCA and t-SNE
  • Filter by Source Files: Select which chat log files to include
  • Filter by Authors: Select which authors to display
  • Show Data Table: View the underlying data in table format
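
The two filters can be thought of as simple membership tests applied to each row. The sketch below assumes rows are dicts that carry an author_name column plus a hypothetical source_file column recording the originating CSV; filter_rows is an illustrative name, not a function from the app.

```python
def filter_rows(rows, sources=None, authors=None):
    """Keep rows matching the selected sources and authors.

    Passing None for either argument means "no restriction".
    """
    kept = []
    for row in rows:
        # "source_file" is a hypothetical column naming the originating CSV.
        if sources is not None and row["source_file"] not in sources:
            continue
        if authors is not None and row["author_name"] not in authors:
            continue
        kept.append(row)
    return kept


sample = [
    {"source_file": "general.csv", "author_name": "alice"},
    {"source_file": "general.csv", "author_name": "bob"},
    {"source_file": "random.csv", "author_name": "alice"},
]
selected = filter_rows(sample, sources={"general.csv"}, authors={"alice"})
print(len(selected))  # 1
```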

Performance Notes

  • For large datasets, consider filtering by authors or source files to improve performance
  • t-SNE is computationally intensive; expect noticeably longer run times than PCA on large datasets
  • The app caches data and computations for better performance