Compare commits: fb3fb70cc5...main (9 commits: ce906e4f9a, fd9b25f256, 2b8659fc95, 647111e9d3, 4ca7e8ab61, 6d35b42b27, 248cc5765f, 80c115b57d, aa9f2dc618)
README.md (replaces the previous one-line "# cult-scraper" README, +281 lines)

# Discord Data Analysis & Visualization Suite

A comprehensive toolkit for scraping, processing, and analyzing Discord chat data with advanced visualization capabilities.

## 🌟 Features

### 📥 Data Collection
- **Discord Bot Scraper**: Automated extraction of complete message history from Discord servers
- **Image Downloader**: Downloads and processes images from Discord attachments with base64 conversion
- **Text Embeddings**: Generate semantic embeddings for chat messages using sentence transformers
### 📊 Visualization & Analysis
- **Interactive Chat Visualizer**: 2D visualization of chat messages using dimensionality reduction (PCA, t-SNE)
- **Clustering Analysis**: Automated grouping of similar messages with DBSCAN and HDBSCAN
- **Image Dataset Viewer**: Browse and explore downloaded images by channel

### 🔧 Data Processing
- **Batch Processing**: Process multiple CSV files with embeddings
- **Metadata Extraction**: Comprehensive message metadata including timestamps, authors, and content
- **Data Filtering**: Advanced filtering by authors, channels, and timeframes

## 📁 Repository Structure

```
cult-scraper-1/
├── scripts/                  # Core data collection scripts
│   ├── bot.py                # Discord bot for message scraping
│   ├── image_downloader.py   # Download and convert Discord images
│   ├── embedder.py           # Batch text embedding processor
│   └── embed_class.py        # Text embedding utilities
├── apps/                     # Interactive applications
│   ├── cluster_map/          # Chat message clustering & visualization
│   │   ├── main.py           # Main Streamlit application
│   │   ├── data_loader.py    # Data loading utilities
│   │   ├── clustering.py     # Clustering algorithms
│   │   ├── visualization.py  # Plotting and visualization
│   │   └── requirements.txt  # Dependencies
│   └── image_viewer/         # Image dataset browser
│       ├── image_viewer.py   # Streamlit image viewer
│       └── requirements.txt  # Dependencies
├── discord_chat_logs/        # Exported CSV files from Discord
└── images_dataset/           # Downloaded images and metadata
    └── images_dataset.json   # Image dataset with base64 data
```
## 🚀 Quick Start

### 1. Discord Data Scraping

First, set up and run the Discord bot to collect message data:

```bash
cd scripts
# Configure your bot token in bot.py
python bot.py
```

**Requirements:**
- Discord bot token with message content intent enabled
- Bot must have read permissions in target channels
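`bot.py` itself is not part of this diff, so the scraper's internals are not shown. As a sketch of the per-message shaping it would need, here is a hypothetical helper that flattens a discord.py-style message object into the CSV column schema documented under "Data Formats" (the function names and the fallback for missing nicknames are assumptions, not code from the repo):

```python
import csv

def message_to_row(msg):
    """Flatten one message-like object into the repo's CSV schema.

    `msg` stands in for a discord.py Message; only attribute access is assumed.
    """
    return {
        "message_id": msg.id,
        "timestamp_utc": msg.created_at.strftime("%Y-%m-%d %H:%M:%S"),
        "author_id": msg.author.id,
        "author_name": msg.author.name,
        # Fall back to the username when no server nickname is set (assumption)
        "author_nickname": getattr(msg.author, "nick", None) or msg.author.name,
        "content": msg.content,
        "attachment_urls": ";".join(a.url for a in msg.attachments),
        "embeds": "{}",
    }

def write_rows(path, rows):
    """Write flattened rows to CSV with the documented header order."""
    fields = ["message_id", "timestamp_utc", "author_id", "author_name",
              "author_nickname", "content", "attachment_urls", "embeds"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

In the real bot this would run inside an async `channel.history(...)` loop; the sketch keeps only the synchronous row shaping so it can be tested in isolation.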
### 2. Generate Text Embeddings

Process the collected chat data to add semantic embeddings:

```bash
cd scripts
python embedder.py
```

This will:
- Process all CSV files in `discord_chat_logs/`
- Add embeddings to message content using sentence transformers
- Save updated files with embedding vectors
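`embedder.py` is not included in this diff. A minimal sketch of the batch step, with the encoder injected so the sentence-transformers dependency stays out of the core logic (`add_embeddings` is a hypothetical name; in the real script the encoder would presumably be `SentenceTransformer("all-MiniLM-L6-v2").encode`, the default model named under Configuration):

```python
def add_embeddings(rows, encode):
    """Attach a "content_embedding" field to each row dict.

    `encode` maps a list of strings to a list of float vectors; injecting it
    keeps this logic testable without the sentence-transformers package.
    """
    texts = [row.get("content") or "" for row in rows]
    vectors = encode(texts)
    for row, vec in zip(rows, vectors):
        # Stored as the string repr of a list, matching the documented CSV format
        row["content_embedding"] = repr([float(x) for x in vec])
    return rows
```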
### 3. Download Images

Extract and download images from Discord attachments:

```bash
cd scripts
python image_downloader.py
```

Features:
- Downloads images from attachment URLs
- Converts to base64 for storage
- Handles multiple image formats (PNG, JPG, GIF, WebP, etc.)
- Implements retry logic and rate limiting
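`image_downloader.py` is likewise not shown here. The retry-with-backoff pattern the feature list describes can be sketched with the HTTP call injected (in the real script `fetch` would wrap `requests.get`; the names and backoff policy are assumptions):

```python
import base64
import time

def download_with_retry(url, fetch, retries=3, backoff=0.1):
    """Fetch `url` as bytes via `fetch(url)`, retrying with a growing delay.

    Returns the base64-encoded payload as str, or None if every attempt fails.
    """
    for attempt in range(retries):
        try:
            data = fetch(url)
            return base64.b64encode(data).decode("ascii")
        except OSError:
            if attempt < retries - 1:
                # Linear backoff between attempts doubles as crude rate limiting
                time.sleep(backoff * (attempt + 1))
    return None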
### 4. Visualize Chat Data

Launch the interactive chat visualization tool:

```bash
cd apps/cluster_map
pip install -r requirements.txt
streamlit run main.py
```

**Capabilities:**
- 2D visualization using PCA or t-SNE
- Interactive clustering with DBSCAN/HDBSCAN
- Filter by channels, authors, and time periods
- Hover to see message content and metadata

### 5. Browse Image Dataset

View downloaded images in an organized interface:

```bash
cd apps/image_viewer
pip install -r requirements.txt
streamlit run image_viewer.py
```

**Features:**
- Channel-based organization
- Navigation controls (previous/next)
- Image metadata display
- Responsive layout
## 📋 Data Formats

### Discord Chat Logs (CSV)
```csv
message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds,content_embedding
1234567890,2025-08-11 12:34:56,9876543210,username,nickname,"Hello world!","https://cdn.discordapp.com/...",{},"[0.123, -0.456, ...]"
```
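Because `content_embedding` is stored as the string repr of a Python list, consumers have to parse it back before doing any math — the same approach `parse_embeddings` in `apps/cluster_map/data_loader.py` takes. A small standalone sketch of parsing one cell:

```python
import ast

def parse_embedding(cell):
    """Parse one content_embedding CSV cell; returns a list of floats or None."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return None
    if isinstance(value, list) and value and all(isinstance(x, (int, float)) for x in value):
        return [float(x) for x in value]
    return None
```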
### Image Dataset (JSON)
```json
{
  "metadata": {
    "created_at": "2025-08-11 12:34:56 UTC",
    "summary": {
      "total_images": 42,
      "channels": ["memes", "general"],
      "total_size_bytes": 1234567,
      "file_extensions": [".png", ".jpg"],
      "authors": ["user1", "user2"]
    }
  },
  "images": [
    {
      "url": "https://cdn.discordapp.com/attachments/...",
      "channel": "memes",
      "author_name": "username",
      "timestamp_utc": "2025-08-11 12:34:56+00:00",
      "content": "Message text",
      "file_extension": ".png",
      "file_size": 54321,
      "base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
    }
  ]
}
```
## 🔧 Configuration

### Discord Bot Setup
1. Create a Discord application at https://discord.com/developers/applications
2. Create a bot and copy the token
3. Enable the following intents:
   - Message Content Intent
   - Server Members Intent (optional)
4. Invite the bot to your server with appropriate permissions

### Bot Token
The token is currently hardcoded in the scraper script rather than read from an environment variable:

```python
# Set in scripts/bot.py
BOT_TOKEN = "your_discord_bot_token_here"
```
### Embedding Models
The system uses sentence-transformers models. Default: `all-MiniLM-L6-v2`

Supported models:
- `all-MiniLM-L6-v2` (lightweight, fast)
- `all-mpnet-base-v2` (higher quality)
- `sentence-transformers/all-roberta-large-v1` (best quality, slower)
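Whichever model is chosen, downstream code only ever sees fixed-length float vectors, and "similar meaning" is typically measured with cosine similarity. A dependency-free sketch of that comparison (the helper name is an assumption, not part of the repo):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```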
## 📊 Visualization Features

### Chat Message Clustering
- **Dimensionality Reduction**: PCA, t-SNE, UMAP
- **Clustering Algorithms**: DBSCAN, HDBSCAN with automatic parameter tuning
- **Interactive Controls**: Filter by source files, authors, and clusters
- **Hover Information**: View message content, author, timestamp on hover

### Image Analysis
- **Channel Organization**: Browse images by Discord channel
- **Metadata Display**: Author, timestamp, message context
- **Navigation**: Previous/next controls with slider
- **Format Support**: PNG, JPG, GIF, WebP, BMP, TIFF

## 🛠️ Dependencies

### Core Scripts
- `discord.py` - Discord bot framework
- `pandas` - Data manipulation
- `sentence-transformers` - Text embeddings
- `requests` - HTTP requests for image downloads

### Visualization Apps
- `streamlit` - Web interface framework
- `plotly` - Interactive plotting
- `scikit-learn` - Machine learning algorithms
- `numpy` - Numerical computations
- `umap-learn` - Dimensionality reduction
- `hdbscan` - Density-based clustering

## 📈 Use Cases

### Research & Analytics
- **Community Analysis**: Understand conversation patterns and topics
- **Sentiment Analysis**: Track mood and sentiment over time
- **User Behavior**: Analyze posting patterns and engagement
- **Content Moderation**: Identify problematic content clusters

### Data Science Projects
- **NLP Research**: Experiment with text embeddings and clustering
- **Social Network Analysis**: Study communication patterns
- **Visualization Techniques**: Explore dimensionality reduction methods
- **Image Processing**: Analyze visual content sharing patterns

### Content Management
- **Archive Creation**: Preserve Discord community history
- **Content Discovery**: Find similar messages and discussions
- **Moderation Tools**: Identify spam or inappropriate content
- **Backup Solutions**: Create comprehensive data backups

## 🔒 Privacy & Ethics

- **Data Protection**: All processing happens locally
- **User Consent**: Ensure proper permissions before scraping
- **Compliance**: Follow Discord's Terms of Service
- **Anonymization**: Consider removing or hashing user IDs for research

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## 📄 License

This project is intended for educational and research purposes. Please ensure compliance with Discord's Terms of Service and applicable privacy laws when using this toolkit.

## 🆘 Troubleshooting

### Common Issues

**Bot can't read messages:**
- Ensure Message Content Intent is enabled
- Check bot permissions in the Discord server
- Verify the bot token is correct

**Embeddings not generating:**
- Install sentence-transformers: `pip install sentence-transformers`
- Check available GPU memory for large models
- Try a smaller model like `all-MiniLM-L6-v2`

**Images not downloading:**
- Check internet connectivity
- Verify Discord CDN URLs are accessible
- Increase retry limits for unreliable connections

**Visualization not loading:**
- Ensure all requirements are installed
- Check that CSV files have embeddings
- Try reducing dataset size for better performance

## 📚 Additional Resources

- [Discord.py Documentation](https://discordpy.readthedocs.io/)
- [Sentence Transformers Models](https://www.sbert.net/docs/pretrained_models.html)
- [Streamlit Documentation](https://docs.streamlit.io/)
- [scikit-learn Clustering](https://scikit-learn.org/stable/modules/clustering.html)
apps/cluster_map/README.md (new file, 58 lines)
# Discord Chat Embeddings Visualizer

A Streamlit application that visualizes Discord chat messages using their vector embeddings in 2D space.

## Features

- **2D Visualization**: View chat messages plotted using PCA or t-SNE dimension reduction
- **Interactive Plotting**: Hover over points to see message content, author, and timestamp
- **Filtering**: Filter by source chat log files and authors
- **Multiple Datasets**: Automatically loads all CSV files from the discord_chat_logs folder

## Installation

1. Install the required dependencies:
```bash
pip install -r requirements.txt
```

## Usage

Run the Streamlit application:

```bash
streamlit run main.py
```
The app will automatically load all CSV files from the `../../discord_chat_logs/` directory.

## Data Format

The application expects CSV files with the following columns:
- `message_id`: Unique identifier for the message
- `timestamp_utc`: When the message was sent
- `author_id`: Author's Discord ID
- `author_name`: Author's username
- `author_nickname`: Author's server nickname
- `content`: The message content
- `attachment_urls`: Any attached files
- `embeds`: Embedded content
- `content_embedding`: Vector embedding of the message content (as a string representation of a list)
## Visualization Options

- **PCA**: Principal Component Analysis - faster, good for getting an overview
- **t-SNE**: t-Distributed Stochastic Neighbor Embedding - slower but may reveal better clusters
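For intuition on what the PCA option computes, here is a dependency-free sketch that finds the first principal axis of 2D points by power iteration on the covariance matrix (the app itself uses scikit-learn's `PCA`; this helper is illustrative only):

```python
import math

def first_principal_axis(points):
    """First principal axis (unit vector) of 2D points via power iteration.

    Illustrative only: assumes non-degenerate data and that the dominant
    eigenvector is not exactly orthogonal to the starting guess.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the 2x2 covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    vx, vy = 1.0, 0.5  # starting guess; avoids axis-aligned blind spots
    for _ in range(100):
        nx, ny = sxx * vx + sxy * vy, sxy * vx + syy * vy
        norm = math.hypot(nx, ny)
        if norm == 0.0:
            return 1.0, 0.0  # degenerate data: every point identical
        vx, vy = nx / norm, ny / norm
    return vx, vy
```

Projecting each point onto this axis (and onto the axis orthogonal to it) gives the 2D coordinates PCA plots.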
## Controls

- **Dimension Reduction Method**: Choose between PCA and t-SNE
- **Filter by Source Files**: Select which chat log files to include
- **Filter by Authors**: Select which authors to display
- **Show Data Table**: View the underlying data in table format

## Performance Notes

- For large datasets, consider filtering by authors or source files to improve performance
- t-SNE is computationally intensive and may take longer with large datasets
- The app caches data and computations for better performance
apps/cluster_map/cluster.py (new file, 12 lines)
"""
|
||||
Discord Chat Embeddings Visualizer - Legacy Entry Point
|
||||
|
||||
This file serves as a compatibility layer for the original cluster.py.
|
||||
The application has been refactored into modular components for better maintainability.
|
||||
"""
|
||||
|
||||
# Import and run the main application
|
||||
from main import main
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
apps/cluster_map/clustering.py (new file, 226 lines)
"""
|
||||
Clustering algorithms and evaluation metrics.
|
||||
"""
|
||||
|
||||
import numpy as np
|
||||
import streamlit as st
|
||||
from sklearn.cluster import SpectralClustering, AgglomerativeClustering, OPTICS
|
||||
from sklearn.mixture import GaussianMixture
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.metrics import silhouette_score, calinski_harabasz_score
|
||||
import hdbscan
|
||||
import pandas as pd
|
||||
from collections import Counter
|
||||
import re
|
||||
from config import DEFAULT_RANDOM_STATE
|
||||
|
||||
|
||||
def summarize_cluster_content(cluster_messages, max_words=3):
|
||||
"""
|
||||
Generate a meaningful name for a cluster based on its message content.
|
||||
|
||||
Args:
|
||||
cluster_messages: List of message contents in the cluster
|
||||
max_words: Maximum number of words in the cluster name
|
||||
|
||||
Returns:
|
||||
str: Generated cluster name
|
||||
"""
|
||||
if not cluster_messages:
|
||||
return "Empty Cluster"
|
||||
|
||||
# Combine all messages and clean text
|
||||
all_text = " ".join([str(msg) for msg in cluster_messages if pd.notna(msg)])
|
||||
if not all_text.strip():
|
||||
return "Empty Content"
|
||||
|
||||
# Basic text cleaning
|
||||
text = all_text.lower()
|
||||
|
||||
# Remove URLs, mentions, and special characters
|
||||
text = re.sub(r'http[s]?://\S+', '', text) # Remove URLs
|
||||
text = re.sub(r'<@\d+>', '', text) # Remove Discord mentions
|
||||
text = re.sub(r'<:\w+:\d+>', '', text) # Remove custom emojis
|
||||
text = re.sub(r'[^\w\s]', ' ', text) # Remove punctuation
|
||||
text = re.sub(r'\s+', ' ', text).strip() # Normalize whitespace
|
||||
|
||||
if not text:
|
||||
return "Special Characters"
|
||||
|
||||
# Split into words and filter out common words
|
||||
words = text.split()
|
||||
|
||||
# Common stop words to filter out
|
||||
stop_words = {
|
||||
'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with',
|
||||
'by', 'from', 'up', 'about', 'into', 'through', 'during', 'before', 'after',
|
||||
'above', 'below', 'between', 'among', 'until', 'without', 'under', 'over',
|
||||
'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
|
||||
'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might',
|
||||
'i', 'you', 'he', 'she', 'it', 'we', 'they', 'me', 'him', 'her', 'us', 'them',
|
||||
'my', 'your', 'his', 'her', 'its', 'our', 'their', 'this', 'that', 'these', 'those',
|
||||
'just', 'like', 'get', 'know', 'think', 'see', 'go', 'come', 'say', 'said',
|
||||
'yeah', 'yes', 'no', 'oh', 'ok', 'okay', 'well', 'so', 'but', 'if', 'when',
|
||||
'what', 'where', 'why', 'how', 'who', 'which', 'than', 'then', 'now', 'here',
|
||||
'there', 'also', 'too', 'very', 'really', 'pretty', 'much', 'more', 'most',
|
||||
'some', 'any', 'all', 'many', 'few', 'little', 'big', 'small', 'good', 'bad'
|
||||
}
|
||||
|
||||
# Filter out stop words and very short/long words
|
||||
filtered_words = [
|
||||
word for word in words
|
||||
if word not in stop_words
|
||||
and len(word) >= 3
|
||||
and len(word) <= 15
|
||||
and word.isalpha() # Only alphabetic words
|
||||
]
|
||||
|
||||
if not filtered_words:
|
||||
return f"Chat ({len(cluster_messages)} msgs)"
|
||||
|
||||
# Count word frequencies
|
||||
word_counts = Counter(filtered_words)
|
||||
|
||||
# Get most common words
|
||||
most_common = word_counts.most_common(max_words * 2) # Get more than needed for filtering
|
||||
|
||||
# Select diverse words (avoid very similar words)
|
||||
selected_words = []
|
||||
for word, count in most_common:
|
||||
# Avoid adding very similar words
|
||||
if not any(word.startswith(existing[:4]) or existing.startswith(word[:4])
|
||||
for existing in selected_words):
|
||||
selected_words.append(word)
|
||||
if len(selected_words) >= max_words:
|
||||
break
|
||||
|
||||
if not selected_words:
|
||||
return f"Discussion ({len(cluster_messages)} msgs)"
|
||||
|
||||
# Create cluster name
|
||||
cluster_name = " + ".join(selected_words[:max_words]).title()
|
||||
|
||||
# Add message count for context
|
||||
cluster_name += f" ({len(cluster_messages)})"
|
||||
|
||||
return cluster_name
|
||||
|
||||
|
||||
def generate_cluster_names(filtered_df, cluster_labels):
|
||||
"""
|
||||
Generate names for all clusters based on their content.
|
||||
|
||||
Args:
|
||||
filtered_df: DataFrame with message data
|
||||
cluster_labels: Array of cluster labels for each message
|
||||
|
||||
Returns:
|
||||
dict: Mapping from cluster_id to cluster_name
|
||||
"""
|
||||
if cluster_labels is None:
|
||||
return {}
|
||||
|
||||
cluster_names = {}
|
||||
unique_clusters = np.unique(cluster_labels)
|
||||
|
||||
for cluster_id in unique_clusters:
|
||||
if cluster_id == -1:
|
||||
cluster_names[cluster_id] = "Noise/Outliers"
|
||||
continue
|
||||
|
||||
# Get messages in this cluster
|
||||
cluster_mask = cluster_labels == cluster_id
|
||||
cluster_messages = filtered_df[cluster_mask]['content'].tolist()
|
||||
|
||||
# Generate name
|
||||
cluster_name = summarize_cluster_content(cluster_messages)
|
||||
cluster_names[cluster_id] = cluster_name
|
||||
|
||||
return cluster_names
|
||||
|
||||
|
||||
def apply_clustering(embeddings, clustering_method="None", n_clusters=5):
|
||||
"""
|
||||
Apply clustering algorithm to embeddings and return labels and metrics.
|
||||
|
||||
Args:
|
||||
embeddings: High-dimensional embeddings to cluster
|
||||
clustering_method: Name of clustering algorithm
|
||||
n_clusters: Number of clusters (for methods that require it)
|
||||
|
||||
Returns:
|
||||
tuple: (cluster_labels, silhouette_score, calinski_harabasz_score)
|
||||
"""
|
||||
if clustering_method == "None" or len(embeddings) <= n_clusters:
|
||||
return None, None, None
|
||||
|
||||
# Standardize embeddings for better clustering
|
||||
scaler = StandardScaler()
|
||||
scaled_embeddings = scaler.fit_transform(embeddings)
|
||||
|
||||
cluster_labels = None
|
||||
silhouette_avg = None
|
||||
calinski_harabasz = None
|
||||
|
||||
try:
|
||||
if clustering_method == "HDBSCAN":
|
||||
min_cluster_size = max(2, len(embeddings) // 20) # Adaptive min cluster size
|
||||
clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
|
||||
min_samples=1, cluster_selection_epsilon=0.5)
|
||||
cluster_labels = clusterer.fit_predict(scaled_embeddings)
|
||||
|
||||
elif clustering_method == "Spectral Clustering":
|
||||
clusterer = SpectralClustering(n_clusters=n_clusters, random_state=DEFAULT_RANDOM_STATE,
|
||||
affinity='rbf', gamma=1.0)
|
||||
cluster_labels = clusterer.fit_predict(scaled_embeddings)
|
||||
|
||||
elif clustering_method == "Gaussian Mixture":
|
||||
clusterer = GaussianMixture(n_components=n_clusters, random_state=DEFAULT_RANDOM_STATE,
|
||||
covariance_type='full', max_iter=200)
|
||||
cluster_labels = clusterer.fit_predict(scaled_embeddings)
|
||||
|
||||
elif clustering_method == "Agglomerative (Ward)":
|
||||
clusterer = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
|
||||
cluster_labels = clusterer.fit_predict(scaled_embeddings)
|
||||
|
||||
elif clustering_method == "Agglomerative (Complete)":
|
||||
clusterer = AgglomerativeClustering(n_clusters=n_clusters, linkage='complete')
|
||||
cluster_labels = clusterer.fit_predict(scaled_embeddings)
|
||||
|
||||
elif clustering_method == "OPTICS":
|
||||
min_samples = max(2, len(embeddings) // 50)
|
||||
clusterer = OPTICS(min_samples=min_samples, xi=0.05, min_cluster_size=0.1)
|
||||
cluster_labels = clusterer.fit_predict(scaled_embeddings)
|
||||
|
||||
# Calculate clustering quality metrics
|
||||
if cluster_labels is not None and len(np.unique(cluster_labels)) > 1:
|
||||
# Only calculate if we have multiple clusters and no noise-only clustering
|
||||
valid_labels = cluster_labels[cluster_labels != -1] # Remove noise points for HDBSCAN/OPTICS
|
||||
valid_embeddings = scaled_embeddings[cluster_labels != -1]
|
||||
|
||||
if len(valid_labels) > 0 and len(np.unique(valid_labels)) > 1:
|
||||
silhouette_avg = silhouette_score(valid_embeddings, valid_labels)
|
||||
calinski_harabasz = calinski_harabasz_score(valid_embeddings, valid_labels)
|
||||
|
||||
except Exception as e:
|
||||
st.warning(f"Clustering failed: {str(e)}")
|
||||
cluster_labels = None
|
||||
|
||||
return cluster_labels, silhouette_avg, calinski_harabasz
|
||||
|
||||
|
||||
def get_cluster_statistics(cluster_labels):
|
||||
"""Get basic statistics about clustering results"""
|
||||
if cluster_labels is None:
|
||||
return {}
|
||||
|
||||
unique_clusters = np.unique(cluster_labels)
|
||||
n_clusters = len(unique_clusters[unique_clusters != -1]) # Exclude noise cluster (-1)
|
||||
n_noise = np.sum(cluster_labels == -1)
|
||||
|
||||
return {
|
||||
"n_clusters": n_clusters,
|
||||
"n_noise_points": n_noise,
|
||||
"cluster_distribution": np.bincount(cluster_labels[cluster_labels != -1]) if n_clusters > 0 else [],
|
||||
"unique_clusters": unique_clusters
|
||||
}
|
||||
apps/cluster_map/config.py (new file, 75 lines)
"""
|
||||
Configuration settings and constants for the Discord Chat Embeddings Visualizer.
|
||||
"""
|
||||
|
||||
# Application settings
|
||||
APP_TITLE = "The Cult - Visualised"
|
||||
APP_ICON = "🗨️"
|
||||
APP_LAYOUT = "wide"
|
||||
|
||||
# File paths
|
||||
CHAT_LOGS_PATH = "../../discord_chat_logs"
|
||||
|
||||
# Algorithm parameters
|
||||
DEFAULT_RANDOM_STATE = 42
|
||||
DEFAULT_N_COMPONENTS = 2
|
||||
DEFAULT_N_CLUSTERS = 5
|
||||
DEFAULT_DIMENSION_REDUCTION_METHOD = "t-SNE"
|
||||
DEFAULT_CLUSTERING_METHOD = "None"
|
||||
|
||||
# Visualization settings
|
||||
DEFAULT_POINT_SIZE = 8
|
||||
DEFAULT_POINT_OPACITY = 0.7
|
||||
MAX_DISPLAYED_AUTHORS = 10
|
||||
MESSAGE_CONTENT_PREVIEW_LENGTH = 200
|
||||
MESSAGE_CONTENT_DISPLAY_LENGTH = 100
|
||||
|
||||
# Performance thresholds
|
||||
LARGE_DATASET_WARNING_THRESHOLD = 1000
|
||||
|
||||
# Color palettes
|
||||
PRIMARY_COLORS = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
|
||||
"#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]
|
||||
|
||||
# Clustering method categories
|
||||
CLUSTERING_METHODS_REQUIRING_N_CLUSTERS = [
|
||||
"Spectral Clustering",
|
||||
"Gaussian Mixture",
|
||||
"Agglomerative (Ward)",
|
||||
"Agglomerative (Complete)"
|
||||
]
|
||||
|
||||
COMPUTATIONALLY_INTENSIVE_METHODS = {
|
||||
"dimension_reduction": ["t-SNE", "Spectral Embedding"],
|
||||
"clustering": ["Spectral Clustering", "OPTICS"]
|
||||
}
|
||||
|
||||
# Method explanations
|
||||
METHOD_EXPLANATIONS = {
|
||||
"dimension_reduction": {
|
||||
"PCA": "Linear, fast, preserves global variance",
|
||||
"t-SNE": "Non-linear, good for local structure, slower",
|
||||
"UMAP": "Balanced speed/quality, preserves local & global structure",
|
||||
"Spectral Embedding": "Uses graph theory, good for non-convex clusters",
|
||||
"Force-Directed": "Physics-based layout, creates natural spacing"
|
||||
},
|
||||
"clustering": {
|
||||
"HDBSCAN": "Density-based, finds variable density clusters, handles noise",
|
||||
"Spectral Clustering": "Uses eigenvalues, good for non-convex shapes",
|
||||
"Gaussian Mixture": "Probabilistic, assumes gaussian distributions",
|
||||
"Agglomerative (Ward)": "Hierarchical, minimizes within-cluster variance",
|
||||
"Agglomerative (Complete)": "Hierarchical, minimizes maximum distance",
|
||||
"OPTICS": "Density-based, finds clusters of varying densities"
|
||||
},
|
||||
"separation": {
|
||||
"Spread Factor": "Applies repulsive forces between nearby points",
|
||||
"Smart Jittering": "Adds intelligent noise to separate overlapping points",
|
||||
"Density-Based Jittering": "Stronger separation in crowded areas",
|
||||
"Perplexity Factor": "Controls t-SNE's focus on local vs global structure",
|
||||
"Min Distance Factor": "Controls UMAP's point packing tightness"
|
||||
},
|
||||
"metrics": {
|
||||
"Silhouette Score": "Higher is better (range: -1 to 1)",
|
||||
"Calinski-Harabasz": "Higher is better, measures cluster separation"
|
||||
}
|
||||
}
|
||||
apps/cluster_map/data_loader.py (new file, 86 lines)
"""
|
||||
Data loading and parsing utilities for Discord chat logs.
|
||||
"""
|
||||
|
||||
import pandas as pd
|
||||
import numpy as np
|
||||
import streamlit as st
|
||||
import ast
|
||||
from pathlib import Path
|
||||
from config import CHAT_LOGS_PATH
|
||||
|
||||
|
||||
@st.cache_data
|
||||
def load_all_chat_data():
|
||||
"""Load all CSV files from the discord_chat_logs folder"""
|
||||
chat_logs_path = Path(CHAT_LOGS_PATH)
|
||||
|
||||
with st.expander("📁 Loading Details", expanded=False):
|
||||
# Display the path for debugging
|
||||
st.write(f"Looking for CSV files in: {chat_logs_path}")
|
||||
st.write(f"Path exists: {chat_logs_path.exists()}")
|
||||
|
||||
all_data = []
|
||||
|
||||
for csv_file in chat_logs_path.glob("*.csv"):
|
||||
try:
|
||||
df = pd.read_csv(csv_file)
|
||||
df['source_file'] = csv_file.stem # Add source file name
|
||||
all_data.append(df)
|
||||
st.write(f"✅ Loaded {len(df)} messages from {csv_file.name}")
|
||||
except Exception as e:
|
||||
st.error(f"❌ Error loading {csv_file.name}: {e}")
|
||||
|
||||
if all_data:
|
||||
combined_df = pd.concat(all_data, ignore_index=True)
|
||||
st.success(f"🎉 Successfully loaded {len(combined_df)} total messages from {len(all_data)} files")
|
||||
else:
|
||||
st.error("No data loaded!")
|
||||
combined_df = pd.DataFrame()
|
||||
|
||||
return combined_df if all_data else pd.DataFrame()
|
||||
|
||||
|
||||
@st.cache_data
|
||||
def parse_embeddings(df):
|
||||
"""Parse the content_embedding column from string to numpy array"""
|
||||
embeddings = []
|
||||
valid_indices = []
|
||||
|
||||
for idx, embedding_str in enumerate(df['content_embedding']):
|
||||
try:
|
||||
# Parse the string representation of the list
|
||||
embedding = ast.literal_eval(embedding_str)
|
||||
if isinstance(embedding, list) and len(embedding) > 0:
|
||||
embeddings.append(embedding)
|
||||
valid_indices.append(idx)
|
||||
except Exception as e:
|
||||
continue
|
||||
|
||||
embeddings_array = np.array(embeddings)
|
||||
valid_df = df.iloc[valid_indices].copy()
|
||||
|
||||
st.info(f"📊 Parsed {len(embeddings)} valid embeddings from {len(df)} messages")
|
||||
st.info(f"🔢 Embedding dimension: {embeddings_array.shape[1] if len(embeddings) > 0 else 0}")
|
||||
|
||||
return embeddings_array, valid_df
|
||||
|
||||
|
||||
def filter_data(df, selected_sources, selected_authors):
|
||||
"""Filter dataframe by selected sources and authors"""
|
||||
if not selected_sources:
|
||||
selected_sources = df['source_file'].unique()
|
||||
|
||||
filtered_df = df[
|
||||
(df['source_file'].isin(selected_sources)) &
|
||||
(df['author_name'].isin(selected_authors))
|
||||
]
|
||||
|
||||
return filtered_df
|
||||
|
||||
|
||||
def get_filtered_embeddings(embeddings, valid_df, filtered_df):
|
||||
"""Get embeddings corresponding to filtered dataframe"""
|
||||
filtered_indices = filtered_df.index.tolist()
|
||||
filtered_embeddings = embeddings[[i for i, idx in enumerate(valid_df.index) if idx in filtered_indices]]
|
||||
return filtered_embeddings
|
||||
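The index bookkeeping in `get_filtered_embeddings` is the subtle part: embedding rows are positional, while the DataFrame index is a set of labels that survives filtering. The same alignment in a dependency-free sketch (list stand-ins for the pandas objects; names are illustrative only):

```python
def align_embeddings(embeddings, valid_index, kept_index):
    """Keep the embedding rows whose original row label survives filtering.

    embeddings[i] corresponds to label valid_index[i]; kept_index is the
    label set remaining after filtering.
    """
    kept = set(kept_index)
    return [emb for emb, label in zip(embeddings, valid_index) if label in kept]
```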
apps/cluster_map/dimensionality_reduction.py (new file, 211 lines; the listing below is truncated)
"""
|
||||
Dimensionality reduction algorithms and point separation techniques.
|
||||
"""
|
||||
|
||||
import numpy as np
|
||||
import streamlit as st
|
||||
from sklearn.decomposition import PCA
|
||||
from sklearn.manifold import TSNE, SpectralEmbedding
|
||||
from sklearn.preprocessing import StandardScaler
|
||||
from sklearn.neighbors import NearestNeighbors
|
||||
from scipy.spatial.distance import pdist, squareform
|
||||
from scipy.optimize import minimize
|
||||
import umap
|
||||
from config import DEFAULT_RANDOM_STATE
|
||||
|
||||
|
||||
def apply_adaptive_spreading(embeddings, spread_factor=1.0):
|
||||
"""
|
||||
Apply adaptive spreading to push apart nearby points while preserving global structure.
|
||||
Uses a force-based approach where closer points repel more strongly.
|
||||
"""
|
||||
if spread_factor <= 0:
|
||||
return embeddings
|
||||
|
||||
embeddings = embeddings.copy()
|
||||
n_points = len(embeddings)
|
||||
|
||||
print(f"DEBUG: Applying adaptive spreading to {n_points} points with factor {spread_factor}")
|
||||
|
||||
if n_points < 2:
|
||||
return embeddings
|
||||
|
||||
# For very large datasets, skip spreading to avoid hanging
|
||||
if n_points > 1000:
|
||||
print(f"DEBUG: Large dataset ({n_points} points), skipping adaptive spreading...")
|
||||
return embeddings
|
||||
|
||||
# Calculate pairwise distances
|
||||
distances = squareform(pdist(embeddings))
|
||||
|
||||
# Apply force-based spreading with fewer iterations for large datasets
|
||||
max_iterations = 3 if n_points > 500 else 5
|
||||
|
||||
for iteration in range(max_iterations):
|
||||
if iteration % 2 == 0: # Progress indicator
|
||||
print(f"DEBUG: Spreading iteration {iteration + 1}/{max_iterations}")
|
||||
|
||||
forces = np.zeros_like(embeddings)
|
||||
|
||||
for i in range(n_points):
|
||||
for j in range(i + 1, n_points):
|
||||
diff = embeddings[i] - embeddings[j]
|
||||
dist = np.linalg.norm(diff)
|
||||
|
||||
if dist > 0:
|
||||
# Repulsive force inversely proportional to distance
|
||||
force_magnitude = spread_factor / (dist ** 2 + 0.01)
|
||||
force_direction = diff / dist
|
||||
force = force_magnitude * force_direction
|
||||
|
||||
forces[i] += force
|
||||
forces[j] -= force
|
||||
|
||||
# Apply forces with damping
|
||||
embeddings += forces * 0.1
|
||||
|
||||
print(f"DEBUG: Adaptive spreading complete")
|
||||
return embeddings
|
||||
|
||||
|
||||
def force_directed_layout(high_dim_embeddings, n_components=2, spread_factor=1.0):
    """
    Create a force-directed layout from high-dimensional embeddings.
    This creates more natural spacing between similar points.
    """
    print(f"DEBUG: Starting force-directed layout with {len(high_dim_embeddings)} points...")

    # For large datasets, fall back to PCA + spreading to avoid hanging
    if len(high_dim_embeddings) > 500:
        print(f"DEBUG: Large dataset ({len(high_dim_embeddings)} points), using PCA + spreading instead...")
        pca = PCA(n_components=n_components, random_state=DEFAULT_RANDOM_STATE)
        result = pca.fit_transform(high_dim_embeddings)
        return apply_adaptive_spreading(result, spread_factor)

    # Start with PCA as the initial layout
    pca = PCA(n_components=n_components, random_state=DEFAULT_RANDOM_STATE)
    initial_layout = pca.fit_transform(high_dim_embeddings)
    print("DEBUG: Initial PCA layout computed...")

    # For simplicity, just apply spreading to the PCA result;
    # the original optimization was too computationally intensive
    result = apply_adaptive_spreading(initial_layout, spread_factor)
    print("DEBUG: Force-directed layout complete...")
    return result

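Above the 500-point cutoff the function reduces to PCA followed by spreading. The PCA step it leans on can be sketched with a numpy-only SVD (a stand-in for sklearn's `PCA`, illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))  # stand-in for high-dimensional embeddings

# Centre, then project onto the top-2 right singular vectors --
# the same initial 2-D layout PCA(n_components=2) produces (up to sign).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
layout = Xc @ Vt[:2].T
print(layout.shape)  # (100, 2)
```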
def calculate_local_density_scaling(embeddings, k=5):
    """
    Calculate local density scaling factors to emphasize differences in dense regions.
    Returns densities normalized to [0, 1].
    """
    if len(embeddings) <= k:  # need at least k+1 samples to query k+1 neighbors
        return np.ones(len(embeddings))

    # Find k nearest neighbors for each point
    nn = NearestNeighbors(n_neighbors=k + 1)  # +1 because the first neighbor is the point itself
    nn.fit(embeddings)
    distances, indices = nn.kneighbors(embeddings)

    # Local density = inverse of the average distance to the k nearest neighbors
    local_densities = 1.0 / (np.mean(distances[:, 1:], axis=1) + 1e-6)

    # Normalize densities to [0, 1]
    local_densities = (local_densities - np.min(local_densities)) / (np.max(local_densities) - np.min(local_densities) + 1e-6)

    return local_densities

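A brute-force sketch of the same estimate (hypothetical helper, plain numpy instead of `NearestNeighbors`): density is the inverse mean distance to the k nearest neighbours, min-max normalised, so isolated points score near 0:

```python
import numpy as np

def local_density(X, k=2):
    # Full pairwise distance matrix; after sorting, column 0 is the
    # point itself (distance 0), so take columns 1..k.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn = np.sort(D, axis=1)[:, 1:k + 1]
    dens = 1.0 / (knn.mean(axis=1) + 1e-6)
    return (dens - dens.min()) / (dens.max() - dens.min() + 1e-6)

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
print(local_density(X).argmin())  # 3 -- the isolated point has the lowest density
```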
def apply_density_based_jittering(embeddings, density_scaling=True, jitter_strength=0.1):
    """
    Apply smart jittering that is stronger in dense regions, to separate overlapping points.
    """
    if not density_scaling:
        # Simple uniform random jittering
        noise = np.random.normal(0, jitter_strength, embeddings.shape)
        return embeddings + noise

    # Calculate local densities
    densities = calculate_local_density_scaling(embeddings)

    # Apply density-proportional jittering
    jittered = embeddings.copy()
    for i in range(len(embeddings)):
        # More jitter in denser regions
        jitter_amount = jitter_strength * (1 + densities[i])
        noise = np.random.normal(0, jitter_amount, embeddings.shape[1])
        jittered[i] += noise

    return jittered

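The effect can be shown directly: with density-scaled noise, coincident points become pairwise distinct. A small sketch with made-up densities (vectorised here for brevity; the function above does the same per row):

```python
import numpy as np

rng = np.random.default_rng(0)
points = np.zeros((5, 2))   # five perfectly overlapping points
densities = np.ones(5)      # pretend they all sit in a maximally dense region

jitter_strength = 0.1
scale = (jitter_strength * (1 + densities))[:, None]  # stronger jitter where denser
jittered = points + rng.normal(0.0, scale, points.shape)

unique_positions = {tuple(p) for p in jittered.round(6)}
print(len(unique_positions))
```

Gaussian noise makes collisions vanishingly unlikely, so all five points land at distinct positions.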
def reduce_dimensions(embeddings, method="PCA", n_components=2, spread_factor=1.0,
                      perplexity_factor=1.0, min_dist_factor=1.0):
    """Apply dimensionality reduction with enhanced point separation."""

    # Convert to a numpy array if it is not one already
    embeddings = np.array(embeddings)

    print(f"DEBUG: Starting {method} with {len(embeddings)} embeddings, shape: {embeddings.shape}")

    # Standardize embeddings for better processing
    scaler = StandardScaler()
    scaled_embeddings = scaler.fit_transform(embeddings)
    print("DEBUG: Embeddings standardized")

    # Apply the selected dimensionality reduction method
    if method == "PCA":
        print("DEBUG: Applying PCA...")
        reducer = PCA(n_components=n_components, random_state=DEFAULT_RANDOM_STATE)
        reduced_embeddings = reducer.fit_transform(scaled_embeddings)
        # Apply spreading to the PCA result
        print("DEBUG: Applying spreading...")
        reduced_embeddings = apply_adaptive_spreading(reduced_embeddings, spread_factor)

    elif method == "t-SNE":
        # Adjust perplexity based on user preference and data size
        base_perplexity = min(30, len(embeddings) - 1)
        adjusted_perplexity = max(5, min(50, int(base_perplexity * perplexity_factor)))
        print(f"DEBUG: Applying t-SNE with perplexity {adjusted_perplexity}...")

        reducer = TSNE(n_components=n_components, random_state=DEFAULT_RANDOM_STATE,
                       perplexity=adjusted_perplexity, n_iter=1000,
                       early_exaggeration=12.0 * spread_factor,  # more early exaggeration for more separation
                       learning_rate='auto')
        reduced_embeddings = reducer.fit_transform(scaled_embeddings)

    elif method == "UMAP":
        # Adjust UMAP parameters for better local separation
        n_neighbors = min(15, len(embeddings) - 1)
        min_dist = 0.1 * min_dist_factor
        spread = 1.0 * spread_factor
        print(f"DEBUG: Applying UMAP with n_neighbors={n_neighbors}, min_dist={min_dist}...")

        reducer = umap.UMAP(n_components=n_components, random_state=DEFAULT_RANDOM_STATE,
                            n_neighbors=n_neighbors, min_dist=min_dist,
                            spread=spread, local_connectivity=2.0)
        reduced_embeddings = reducer.fit_transform(scaled_embeddings)

    elif method == "Spectral Embedding":
        n_neighbors = min(10, len(embeddings) - 1)
        print(f"DEBUG: Applying Spectral Embedding with n_neighbors={n_neighbors}...")
        reducer = SpectralEmbedding(n_components=n_components, random_state=DEFAULT_RANDOM_STATE,
                                    n_neighbors=n_neighbors)
        reduced_embeddings = reducer.fit_transform(scaled_embeddings)
        # Apply spreading to the spectral result
        print("DEBUG: Applying spreading...")
        reduced_embeddings = apply_adaptive_spreading(reduced_embeddings, spread_factor)

    elif method == "Force-Directed":
        # Use a force-directed layout for natural spreading
        print("DEBUG: Applying Force-Directed layout...")
        reduced_embeddings = force_directed_layout(scaled_embeddings, n_components, spread_factor)

    else:
        # Fall back to PCA for unknown methods
        print(f"DEBUG: Unknown method {method}, falling back to PCA...")
        reducer = PCA(n_components=n_components, random_state=DEFAULT_RANDOM_STATE)
        reduced_embeddings = reducer.fit_transform(scaled_embeddings)
        reduced_embeddings = apply_adaptive_spreading(reduced_embeddings, spread_factor)

    print(f"DEBUG: Dimensionality reduction complete. Output shape: {reduced_embeddings.shape}")
    return reduced_embeddings
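The t-SNE branch clamps the perplexity into [5, 50] after scaling by the user factor; the clamp in isolation (hypothetical helper name, same arithmetic):

```python
def adjusted_perplexity(n_points, perplexity_factor=1.0):
    # Mirrors the t-SNE branch: base capped at 30 (and at n-1),
    # scaled by the user factor, then clamped into [5, 50].
    base = min(30, n_points - 1)
    return max(5, min(50, int(base * perplexity_factor)))

print(adjusted_perplexity(10))         # 9
print(adjusted_perplexity(10, 0.1))    # 5  (lower clamp)
print(adjusted_perplexity(5000, 2.0))  # 50 (upper clamp)
```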
apps/cluster_map/main.py (new file, 169 lines)
@@ -0,0 +1,169 @@
"""
Main application logic for the Discord Chat Embeddings Visualizer.
"""

import streamlit as st
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Import custom modules
from ui_components import (
    setup_page_config, display_title_and_description, get_all_ui_parameters,
    display_performance_warnings
)
from data_loader import (
    load_all_chat_data, parse_embeddings, filter_data, get_filtered_embeddings
)
from dimensionality_reduction import (
    reduce_dimensions, apply_density_based_jittering
)
from clustering import apply_clustering, generate_cluster_names
from visualization import (
    create_visualization_plot, display_clustering_metrics, display_summary_stats,
    display_clustering_results, display_data_table, display_cluster_summary
)


def main():
    """Main application function"""
    # Set up page configuration
    setup_page_config()

    # Display title and description
    display_title_and_description()

    # Load data
    with st.spinner("Loading chat data..."):
        df = load_all_chat_data()

    if df.empty:
        st.error("No data could be loaded. Please check the data directory.")
        st.stop()

    # Parse embeddings
    with st.spinner("Parsing embeddings..."):
        embeddings, valid_df = parse_embeddings(df)

    if len(embeddings) == 0:
        st.error("No valid embeddings found!")
        st.stop()

    # Get UI parameters
    params = get_all_ui_parameters(valid_df)

    # Check that at least one source is selected before proceeding
    if not params['selected_sources']:
        st.info("📂 **Select source files from the sidebar to begin visualization**")
        st.markdown("### Available Data Sources:")

        # Show available sources as an informational table
        source_info = []
        for source in valid_df['source_file'].unique():
            source_data = valid_df[valid_df['source_file'] == source]
            source_info.append({
                'Source File': source,
                'Messages': len(source_data),
                'Unique Authors': source_data['author_name'].nunique(),
                'Date Range': f"{source_data['timestamp_utc'].min()} to {source_data['timestamp_utc'].max()}"
            })

        source_df = pd.DataFrame(source_info)
        st.dataframe(source_df, use_container_width=True, hide_index=True)

        st.markdown("👈 **Use the sidebar to select which sources to visualize**")
        st.stop()

    # Filter data
    filtered_df = filter_data(valid_df, params['selected_sources'], params['selected_authors'])

    if filtered_df.empty:
        st.warning("No data matches the current filters! Try selecting different sources or authors.")
        st.stop()

    # Display performance warnings
    display_performance_warnings(filtered_df, params['method'], params['clustering_method'])

    # Get the embeddings corresponding to the filtered rows
    filtered_embeddings = get_filtered_embeddings(embeddings, valid_df, filtered_df)

    st.info(f"📈 Visualizing {len(filtered_df)} messages")

    # Reduce dimensions
    n_components = 3 if params['enable_3d'] else 2
    with st.spinner(f"Reducing dimensions using {params['method']}..."):
        reduced_embeddings = reduce_dimensions(
            filtered_embeddings,
            method=params['method'],
            n_components=n_components,
            spread_factor=params['spread_factor'],
            perplexity_factor=params['perplexity_factor'],
            min_dist_factor=params['min_dist_factor']
        )

    # Apply clustering
    with st.spinner(f"Applying {params['clustering_method']}..."):
        cluster_labels, silhouette_avg, calinski_harabasz = apply_clustering(
            filtered_embeddings,
            clustering_method=params['clustering_method'],
            n_clusters=params['n_clusters']
        )

    # Apply jittering if requested
    if params['apply_jittering']:
        with st.spinner("Applying smart jittering to separate overlapping points..."):
            reduced_embeddings = apply_density_based_jittering(
                reduced_embeddings,
                density_scaling=params['density_based_jitter'],
                jitter_strength=params['jitter_strength']
            )

    # Generate cluster names if clustering was applied
    cluster_names = None
    if cluster_labels is not None:
        with st.spinner("Generating cluster names..."):
            cluster_names = generate_cluster_names(filtered_df, cluster_labels)

    # Display clustering metrics
    display_clustering_metrics(
        cluster_labels, silhouette_avg, calinski_harabasz,
        params['show_cluster_metrics']
    )

    # Display cluster summary with names
    if cluster_names:
        display_cluster_summary(cluster_names, cluster_labels)

    # Create and display the main plot
    fig = create_visualization_plot(
        reduced_embeddings=reduced_embeddings,
        filtered_df=filtered_df,
        cluster_labels=cluster_labels,
        selected_sources=params['selected_sources'] if params['selected_sources'] else None,
        method=params['method'],
        clustering_method=params['clustering_method'],
        point_size=params['point_size'],
        point_opacity=params['point_opacity'],
        density_based_sizing=params['density_based_sizing'],
        size_variation=params['size_variation'],
        enable_3d=params['enable_3d'],
        cluster_names=cluster_names
    )

    st.plotly_chart(fig, use_container_width=True)

    # Display summary statistics
    display_summary_stats(filtered_df, params['selected_sources'] or filtered_df['source_file'].unique())

    # Display clustering results and export options
    display_clustering_results(
        filtered_df, cluster_labels, reduced_embeddings,
        params['method'], params['clustering_method'], params['enable_3d']
    )

    # Display data table
    display_data_table(filtered_df, cluster_labels)


if __name__ == "__main__":
    main()
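Assuming the dependencies from `requirements.txt` are installed, the app would typically be launched with Streamlit from the `apps/cluster_map/` directory:

```
streamlit run main.py
```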
apps/cluster_map/requirements.txt (new file, 8 lines)
@@ -0,0 +1,8 @@
streamlit>=1.28.0
pandas>=1.5.0
numpy>=1.24.0
plotly>=5.15.0
scikit-learn>=1.3.0
umap-learn>=0.5.3
hdbscan>=0.8.29
scipy>=1.10.0
apps/cluster_map/test_debug.py (new file, 43 lines)
@@ -0,0 +1,43 @@
#!/usr/bin/env python3
"""
Test script to debug the hanging issue in the modular app.
"""

import numpy as np
import sys
import os

# Add the current directory to the Python path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))


def test_dimensionality_reduction():
    """Test the dimensionality reduction functions"""
    print("Testing dimensionality reduction functions...")

    from dimensionality_reduction import reduce_dimensions

    # Create test data similar to what we'd expect
    n_samples = 796   # same size as the user's dataset
    n_features = 384  # common sentence-transformer embedding dimension

    print(f"Creating test embeddings: {n_samples} x {n_features}")
    test_embeddings = np.random.randn(n_samples, n_features)

    # Test PCA (should be fast)
    print("Testing PCA...")
    try:
        result = reduce_dimensions(test_embeddings, method="PCA")
        print(f"✓ PCA successful, output shape: {result.shape}")
    except Exception as e:
        print(f"✗ PCA failed: {e}")

    # Test UMAP (might be slower)
    print("Testing UMAP...")
    try:
        result = reduce_dimensions(test_embeddings, method="UMAP")
        print(f"✓ UMAP successful, output shape: {result.shape}")
    except Exception as e:
        print(f"✗ UMAP failed: {e}")


if __name__ == "__main__":
    test_dimensionality_reduction()
apps/cluster_map/ui_components.py (new file, 267 lines)
@@ -0,0 +1,267 @@
"""
Streamlit UI components and controls for the Discord Chat Embeddings Visualizer.
"""

import streamlit as st
import numpy as np
from config import (
    APP_TITLE, APP_ICON, APP_LAYOUT, METHOD_EXPLANATIONS,
    CLUSTERING_METHODS_REQUIRING_N_CLUSTERS, COMPUTATIONALLY_INTENSIVE_METHODS,
    LARGE_DATASET_WARNING_THRESHOLD, MAX_DISPLAYED_AUTHORS,
    DEFAULT_DIMENSION_REDUCTION_METHOD, DEFAULT_CLUSTERING_METHOD
)


def setup_page_config():
    """Set up the Streamlit page configuration"""
    st.set_page_config(
        page_title=APP_TITLE,
        page_icon=APP_ICON,
        layout=APP_LAYOUT
    )


def display_title_and_description():
    """Display the main title and description"""
    st.title(f"{APP_ICON} {APP_TITLE}")
    st.markdown("Explore Discord chat messages through their vector embeddings in 2D space")


def create_method_controls():
    """Create controls for the dimension reduction and clustering methods"""
    st.sidebar.header("🎛️ Visualization Controls")

    # 3D visualization toggle
    enable_3d = st.sidebar.checkbox(
        "Enable 3D Visualization",
        value=False,
        help="Switch between 2D and 3D visualization. 3D uses 3 components instead of 2."
    )

    # Dimension reduction method
    method_options = ["PCA", "t-SNE", "UMAP", "Spectral Embedding", "Force-Directed"]
    default_index = method_options.index(DEFAULT_DIMENSION_REDUCTION_METHOD) if DEFAULT_DIMENSION_REDUCTION_METHOD in method_options else 0
    method = st.sidebar.selectbox(
        "Dimension Reduction Method",
        method_options,
        index=default_index,
        help="PCA is fastest, UMAP balances speed and quality, t-SNE and Spectral are slower but may reveal better structure. Force-Directed creates natural spacing."
    )

    # Clustering method
    clustering_options = ["None", "HDBSCAN", "Spectral Clustering", "Gaussian Mixture",
                          "Agglomerative (Ward)", "Agglomerative (Complete)", "OPTICS"]
    clustering_default_index = clustering_options.index(DEFAULT_CLUSTERING_METHOD) if DEFAULT_CLUSTERING_METHOD in clustering_options else 0
    clustering_method = st.sidebar.selectbox(
        "Clustering Method",
        clustering_options,
        index=clustering_default_index,
        help="Apply clustering to identify groups. HDBSCAN and OPTICS can find variable-density clusters."
    )

    return method, clustering_method, enable_3d

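Both selectboxes use a fall-back-to-first-option pattern for their defaults, so a misconfigured constant degrades gracefully instead of raising. The pattern in isolation (hypothetical helper name):

```python
def default_index(options, preferred):
    # Index of the configured default, or 0 when it is not a valid option
    return options.index(preferred) if preferred in options else 0

options = ["PCA", "t-SNE", "UMAP", "Spectral Embedding", "Force-Directed"]
print(default_index(options, "UMAP"))    # 2
print(default_index(options, "Isomap"))  # 0
```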
def create_clustering_controls(clustering_method):
    """Create controls for the clustering parameters"""
    # Always show the clusters slider, but indicate when it is actually used
    if clustering_method in CLUSTERING_METHODS_REQUIRING_N_CLUSTERS:
        help_text = "Number of clusters to create. This setting affects the clustering algorithm."
        disabled = False
    elif clustering_method == "None":
        help_text = "Clustering is disabled. This setting has no effect."
        disabled = True
    else:
        help_text = f"{clustering_method} automatically determines the number of clusters. This setting has no effect."
        disabled = True

    n_clusters = st.sidebar.slider(
        "Number of Clusters",
        min_value=2,
        max_value=20,
        value=5,
        disabled=disabled,
        help=help_text
    )

    return n_clusters


def create_separation_controls(method):
    """Create controls for point separation and method-specific parameters"""
    st.sidebar.subheader("🎯 Point Separation Controls")

    spread_factor = st.sidebar.slider(
        "Spread Factor",
        0.5, 3.0, 1.0, 0.1,
        help="Increase to spread apart nearby points. Higher values create more separation."
    )

    # Method-specific parameters
    perplexity_factor = 1.0
    min_dist_factor = 1.0

    if method == "t-SNE":
        perplexity_factor = st.sidebar.slider(
            "Perplexity Factor",
            0.1, 2.0, 1.0, 0.1,
            help="Balances local vs. global structure. Lower values focus on local details."
        )

    if method == "UMAP":
        min_dist_factor = st.sidebar.slider(
            "Min Distance Factor",
            0.1, 2.0, 1.0, 0.1,
            help="Controls how tightly points are packed. Lower values create tighter clusters."
        )

    return spread_factor, perplexity_factor, min_dist_factor


def create_jittering_controls():
    """Create controls for the jittering options"""
    apply_jittering = st.sidebar.checkbox(
        "Apply Smart Jittering",
        value=False,
        help="Add intelligent noise to separate overlapping points"
    )

    jitter_strength = 0.1
    density_based_jitter = True

    if apply_jittering:
        jitter_strength = st.sidebar.slider(
            "Jitter Strength",
            0.01, 0.5, 0.1, 0.01,
            help="Strength of jittering. Higher values spread points more."
        )
        density_based_jitter = st.sidebar.checkbox(
            "Density-Based Jittering",
            value=True,
            help="Apply stronger jittering in dense regions"
        )

    return apply_jittering, jitter_strength, density_based_jitter


def create_advanced_options():
    """Create advanced visualization options"""
    with st.sidebar.expander("⚙️ Advanced Options"):
        show_cluster_metrics = st.checkbox("Show Clustering Metrics", value=True)
        point_size = st.slider("Point Size", 4, 15, 8)
        point_opacity = st.slider("Point Opacity", 0.3, 1.0, 0.7)

        # Density-based visualization
        density_based_sizing = st.checkbox(
            "Density-Based Point Sizing",
            value=False,
            help="Make points larger in sparse regions, smaller in dense regions"
        )

        size_variation = 2.0
        if density_based_sizing:
            size_variation = st.slider(
                "Size Variation Factor",
                1.5, 4.0, 2.0, 0.1,
                help="How much point sizes vary based on local density"
            )

    return show_cluster_metrics, point_size, point_opacity, density_based_sizing, size_variation


def create_filter_controls(valid_df):
    """Create controls for filtering data by source and author"""
    # Source file filter
    source_files = valid_df['source_file'].unique()
    selected_sources = st.sidebar.multiselect(
        "Filter by Source Files",
        source_files,
        default=[],
        help="Select which chat log files to include"
    )

    # Author filter
    authors = valid_df['author_name'].unique()
    default_authors = authors[:MAX_DISPLAYED_AUTHORS] if len(authors) > MAX_DISPLAYED_AUTHORS else authors
    selected_authors = st.sidebar.multiselect(
        "Filter by Authors",
        authors,
        default=default_authors,
        help="Select which authors to include"
    )

    return selected_sources, selected_authors

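The default author selection is capped at `MAX_DISPLAYED_AUTHORS` to keep the multiselect manageable on busy servers. The slicing in isolation (constant value assumed for illustration):

```python
MAX_DISPLAYED_AUTHORS = 3  # stand-in for the config constant

authors = ["alice", "bob", "carol", "dave", "erin"]
default_authors = authors[:MAX_DISPLAYED_AUTHORS] if len(authors) > MAX_DISPLAYED_AUTHORS else authors
print(default_authors)  # ['alice', 'bob', 'carol']
```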
def display_method_explanations():
    """Display explanations for the different methods"""
    st.sidebar.markdown("---")
    with st.sidebar.expander("📚 Method Explanations"):
        st.markdown("**Dimensionality Reduction:**")
        for method, explanation in METHOD_EXPLANATIONS["dimension_reduction"].items():
            st.markdown(f"- **{method}**: {explanation}")

        st.markdown("\n**Clustering Methods:**")
        for method, explanation in METHOD_EXPLANATIONS["clustering"].items():
            st.markdown(f"- **{method}**: {explanation}")

        st.markdown("\n**Separation Techniques:**")
        for technique, explanation in METHOD_EXPLANATIONS["separation"].items():
            st.markdown(f"- **{technique}**: {explanation}")

        st.markdown("\n**Metrics:**")
        for metric, explanation in METHOD_EXPLANATIONS["metrics"].items():
            st.markdown(f"- **{metric}**: {explanation}")


def display_performance_warnings(filtered_df, method, clustering_method):
    """Display performance warnings for computationally intensive operations"""
    if len(filtered_df) > LARGE_DATASET_WARNING_THRESHOLD:
        if method in COMPUTATIONALLY_INTENSIVE_METHODS["dimension_reduction"]:
            st.warning(f"⚠️ {method} with {len(filtered_df)} points may take several minutes to compute.")
        if clustering_method in COMPUTATIONALLY_INTENSIVE_METHODS["clustering"]:
            st.warning(f"⚠️ {clustering_method} with {len(filtered_df)} points may be computationally intensive.")


def get_all_ui_parameters(valid_df):
    """Gather all UI parameters in a single function call"""
    # Method selection
    method, clustering_method, enable_3d = create_method_controls()

    # Clustering parameters
    n_clusters = create_clustering_controls(clustering_method)

    # Separation controls
    spread_factor, perplexity_factor, min_dist_factor = create_separation_controls(method)

    # Jittering controls
    apply_jittering, jitter_strength, density_based_jitter = create_jittering_controls()

    # Advanced options
    show_cluster_metrics, point_size, point_opacity, density_based_sizing, size_variation = create_advanced_options()

    # Filters
    selected_sources, selected_authors = create_filter_controls(valid_df)

    # Method explanations
    display_method_explanations()

    return {
        'method': method,
        'clustering_method': clustering_method,
        'enable_3d': enable_3d,
        'n_clusters': n_clusters,
        'spread_factor': spread_factor,
        'perplexity_factor': perplexity_factor,
        'min_dist_factor': min_dist_factor,
        'apply_jittering': apply_jittering,
        'jitter_strength': jitter_strength,
        'density_based_jitter': density_based_jitter,
        'show_cluster_metrics': show_cluster_metrics,
        'point_size': point_size,
        'point_opacity': point_opacity,
        'density_based_sizing': density_based_sizing,
        'size_variation': size_variation,
        'selected_sources': selected_sources,
        'selected_authors': selected_authors
    }
apps/cluster_map/visualization.py (new file, 311 lines)
@@ -0,0 +1,311 @@
"""
Visualization functions for creating interactive plots and displays.
"""

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import streamlit as st
from dimensionality_reduction import calculate_local_density_scaling
from config import MESSAGE_CONTENT_PREVIEW_LENGTH, DEFAULT_POINT_SIZE, DEFAULT_POINT_OPACITY


def create_hover_text(df):
    """Create hover text for the Plotly traces"""
    hover_text = []
    for _, row in df.iterrows():
        text = f"<b>Author:</b> {row['author_name']}<br>"
        text += f"<b>Timestamp:</b> {row['timestamp_utc']}<br>"
        text += f"<b>Source:</b> {row['source_file']}<br>"

        # Handle potential NaN or non-string content
        content = row['content']
        if pd.isna(content) or content is None:
            content_text = "[No content]"
        else:
            content_str = str(content)
            content_text = content_str[:MESSAGE_CONTENT_PREVIEW_LENGTH] + ('...' if len(content_str) > MESSAGE_CONTENT_PREVIEW_LENGTH else '')

        text += f"<b>Content:</b> {content_text}"
        hover_text.append(text)
    return hover_text

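The content-preview truncation used above, in isolation (with a short stand-in value for `MESSAGE_CONTENT_PREVIEW_LENGTH` so the ellipsis case is easy to see):

```python
PREVIEW_LENGTH = 20  # stand-in for MESSAGE_CONTENT_PREVIEW_LENGTH

def preview(content):
    # None-safe preview with a trailing ellipsis for long messages
    if content is None:
        return "[No content]"
    s = str(content)
    return s[:PREVIEW_LENGTH] + ('...' if len(s) > PREVIEW_LENGTH else '')

print(preview(None))      # [No content]
print(preview("short"))   # short
print(preview("x" * 25))  # xxxxxxxxxxxxxxxxxxxx...
```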
def calculate_point_sizes(reduced_embeddings, density_based_sizing=False,
                          point_size=DEFAULT_POINT_SIZE, size_variation=2.0):
    """Calculate point sizes, scaled by local density when enabled"""
    if not density_based_sizing:
        return [point_size] * len(reduced_embeddings)

    local_densities = calculate_local_density_scaling(reduced_embeddings)
    # Invert densities so sparse areas get larger points
    inverted_densities = 1.0 - local_densities
    # Scale point sizes
    point_sizes = point_size * (1.0 + inverted_densities * (size_variation - 1.0))
    return point_sizes

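The inverse-density sizing rule, evaluated on made-up normalised densities: a fully sparse point (density 0) gets `point_size * size_variation`, a fully dense one gets the base size:

```python
import numpy as np

point_size, size_variation = 8, 2.0
densities = np.array([0.0, 0.5, 1.0])  # normalised local densities
sizes = point_size * (1.0 + (1.0 - densities) * (size_variation - 1.0))
print(sizes)  # [16. 12.  8.]
```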
def create_clustered_plot(reduced_embeddings, filtered_df, cluster_labels, hover_text,
                          point_sizes, point_opacity=DEFAULT_POINT_OPACITY, method="PCA", enable_3d=False,
                          cluster_names=None):
    """Create a plot colored by cluster"""
    fig = go.Figure()

    unique_clusters = np.unique(cluster_labels)
    colors = px.colors.qualitative.Set3 + px.colors.qualitative.Pastel

    for i, cluster_id in enumerate(unique_clusters):
        cluster_mask = cluster_labels == cluster_id
        if cluster_mask.any():
            cluster_embeddings = reduced_embeddings[cluster_mask]
            cluster_hover = [hover_text[j] for j, mask in enumerate(cluster_mask) if mask]
            cluster_sizes = [point_sizes[j] for j, mask in enumerate(cluster_mask) if mask]

            # Use a generated name if available, otherwise fall back to a default label
            if cluster_names and cluster_id in cluster_names:
                cluster_name = cluster_names[cluster_id]
            else:
                cluster_name = f"Cluster {cluster_id}" if cluster_id != -1 else "Noise"

            if enable_3d:
                fig.add_trace(go.Scatter3d(
                    x=cluster_embeddings[:, 0],
                    y=cluster_embeddings[:, 1],
                    z=cluster_embeddings[:, 2],
                    mode='markers',
                    name=cluster_name,
                    marker=dict(
                        size=cluster_sizes,
                        color=colors[i % len(colors)],
                        opacity=point_opacity,
                        line=dict(width=1, color='white')
                    ),
                    hovertemplate='%{hovertext}<extra></extra>',
                    hovertext=cluster_hover
                ))
            else:
                fig.add_trace(go.Scatter(
                    x=cluster_embeddings[:, 0],
                    y=cluster_embeddings[:, 1],
                    mode='markers',
                    name=cluster_name,
                    marker=dict(
                        size=cluster_sizes,
                        color=colors[i % len(colors)],
                        opacity=point_opacity,
                        line=dict(width=1, color='white')
                    ),
                    hovertemplate='%{hovertext}<extra></extra>',
                    hovertext=cluster_hover
                ))

    return fig

def create_source_colored_plot(reduced_embeddings, filtered_df, selected_sources, hover_text,
                               point_sizes, point_opacity=DEFAULT_POINT_OPACITY, enable_3d=False):
    """Create a plot colored by source file"""
    fig = go.Figure()
    colors = px.colors.qualitative.Set1

    for i, source in enumerate(selected_sources):
        source_mask = filtered_df['source_file'] == source
        if source_mask.any():
            source_embeddings = reduced_embeddings[source_mask]
            source_hover = [hover_text[j] for j, mask in enumerate(source_mask) if mask]
            source_sizes = [point_sizes[j] for j, mask in enumerate(source_mask) if mask]

            if enable_3d:
                fig.add_trace(go.Scatter3d(
                    x=source_embeddings[:, 0],
                    y=source_embeddings[:, 1],
                    z=source_embeddings[:, 2],
                    mode='markers',
                    name=source,
                    marker=dict(
                        size=source_sizes,
                        color=colors[i % len(colors)],
                        opacity=point_opacity,
                        line=dict(width=1, color='white')
                    ),
                    hovertemplate='%{hovertext}<extra></extra>',
                    hovertext=source_hover
                ))
            else:
                fig.add_trace(go.Scatter(
                    x=source_embeddings[:, 0],
                    y=source_embeddings[:, 1],
                    mode='markers',
                    name=source,
                    marker=dict(
                        size=source_sizes,
                        color=colors[i % len(colors)],
                        opacity=point_opacity,
                        line=dict(width=1, color='white')
                    ),
                    hovertemplate='%{hovertext}<extra></extra>',
                    hovertext=source_hover
                ))

    return fig

def create_visualization_plot(reduced_embeddings, filtered_df, cluster_labels=None,
                              selected_sources=None, method="PCA", clustering_method="None",
                              point_size=DEFAULT_POINT_SIZE, point_opacity=DEFAULT_POINT_OPACITY,
                              density_based_sizing=False, size_variation=2.0, enable_3d=False,
                              cluster_names=None):
    """Create the main visualization plot"""

    # Create hover text
    hover_text = create_hover_text(filtered_df)

    # Calculate point sizes
    point_sizes = calculate_point_sizes(reduced_embeddings, density_based_sizing,
                                        point_size, size_variation)

    # Create the plot based on the coloring strategy
    if cluster_labels is not None:
        fig = create_clustered_plot(reduced_embeddings, filtered_df, cluster_labels,
                                    hover_text, point_sizes, point_opacity, method, enable_3d,
                                    cluster_names)
    else:
        if selected_sources is None:
            selected_sources = filtered_df['source_file'].unique()
        fig = create_source_colored_plot(reduced_embeddings, filtered_df, selected_sources,
                                         hover_text, point_sizes, point_opacity, enable_3d)

    # Update layout
    title_suffix = f" with {clustering_method}" if clustering_method != "None" else ""
    dimension_text = "3D" if enable_3d else "2D"

    if enable_3d:
        fig.update_layout(
            title=f"Discord Chat Messages - {method} {dimension_text} Visualization{title_suffix}",
            scene=dict(
                xaxis_title=f"{method} Component 1",
                yaxis_title=f"{method} Component 2",
                zaxis_title=f"{method} Component 3"
            ),
            width=1000,
            height=700
        )
    else:
        fig.update_layout(
            title=f"Discord Chat Messages - {method} {dimension_text} Visualization{title_suffix}",
            xaxis_title=f"{method} Component 1",
            yaxis_title=f"{method} Component 2",
            hovermode='closest',
            width=1000,
            height=700
        )

    return fig

def display_clustering_metrics(cluster_labels, silhouette_avg, calinski_harabasz, show_metrics=True):
    """Display clustering quality metrics"""
    if cluster_labels is not None and show_metrics:
        col1, col2, col3 = st.columns(3)
        with col1:
            n_clusters_found = len(np.unique(cluster_labels[cluster_labels != -1]))
            st.metric("Clusters Found", n_clusters_found)
        with col2:
            if silhouette_avg is not None:
                st.metric("Silhouette Score", f"{silhouette_avg:.3f}")
            else:
                st.metric("Silhouette Score", "N/A")
        with col3:
            if calinski_harabasz is not None:
                st.metric("Calinski-Harabasz Index", f"{calinski_harabasz:.1f}")
            else:
                st.metric("Calinski-Harabasz Index", "N/A")

def display_summary_stats(filtered_df, selected_sources):
    """Display summary statistics"""
    col1, col2, col3 = st.columns(3)

    with col1:
        st.metric("Total Messages", len(filtered_df))

    with col2:
        st.metric("Unique Authors", filtered_df['author_name'].nunique())

    with col3:
        st.metric("Source Files", len(selected_sources))

def display_clustering_results(filtered_df, cluster_labels, reduced_embeddings, method, clustering_method, enable_3d=False):
    """Display clustering results and export options"""
    if cluster_labels is None:
        return

    st.subheader("📊 Clustering Results")

    # Add cluster information to dataframe for export
    export_df = filtered_df.copy()
    export_df['cluster_id'] = cluster_labels
    export_df['x_coordinate'] = reduced_embeddings[:, 0]
    export_df['y_coordinate'] = reduced_embeddings[:, 1]

    # Add z coordinate if 3D
    if enable_3d and reduced_embeddings.shape[1] >= 3:
        export_df['z_coordinate'] = reduced_embeddings[:, 2]

    # Show cluster distribution
    cluster_dist = pd.Series(cluster_labels).value_counts().sort_index()
    st.bar_chart(cluster_dist)

    # Download option
    csv_data = export_df.to_csv(index=False)
    dimension_text = "3D" if enable_3d else "2D"
    st.download_button(
        label="📥 Download Clustering Results (CSV)",
        data=csv_data,
        file_name=f"chat_clusters_{method}_{clustering_method}_{dimension_text}.csv",
        mime="text/csv"
    )

def display_data_table(filtered_df, cluster_labels=None):
    """Display the data table with optional clustering information"""
    if not st.checkbox("Show Data Table"):
        return

    st.subheader("📋 Message Data")
    display_df = filtered_df[['timestamp_utc', 'author_name', 'source_file', 'content']].copy()

    # Add clustering info if available
    if cluster_labels is not None:
        display_df['cluster'] = cluster_labels

    display_df['content'] = display_df['content'].str[:100] + '...'  # Truncate for display
    st.dataframe(display_df, use_container_width=True)

def display_cluster_summary(cluster_names, cluster_labels):
    """Display a summary of cluster names and their sizes"""
    if not cluster_names or cluster_labels is None:
        return

    st.subheader("🏷️ Cluster Summary")

    # Create summary data
    cluster_summary = []
    for cluster_id, name in cluster_names.items():
        count = np.sum(cluster_labels == cluster_id)
        cluster_summary.append({
            'Cluster ID': cluster_id,
            'Cluster Name': name,
            'Message Count': count,
            'Percentage': f"{100 * count / len(cluster_labels):.1f}%"
        })

    # Sort by message count
    cluster_summary.sort(key=lambda x: x['Message Count'], reverse=True)

    # Display as table
    summary_df = pd.DataFrame(cluster_summary)
    st.dataframe(summary_df, use_container_width=True, hide_index=True)
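The cluster-summary logic above (per-cluster counts and percentages) can be exercised outside Streamlit; this is a minimal sketch with made-up labels and hypothetical cluster names, mirroring `display_cluster_summary`'s inputs:

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: labels from a clusterer (-1 = noise) and human-readable names
cluster_labels = np.array([0, 0, 1, 1, 1, -1])
cluster_names = {0: "greetings", 1: "memes"}

rows = []
for cluster_id, name in cluster_names.items():
    count = int(np.sum(cluster_labels == cluster_id))
    rows.append({
        "Cluster ID": cluster_id,
        "Cluster Name": name,
        "Message Count": count,
        "Percentage": f"{100 * count / len(cluster_labels):.1f}%",
    })

# Largest clusters first, as in the app
rows.sort(key=lambda r: r["Message Count"], reverse=True)
summary_df = pd.DataFrame(rows)
print(summary_df)
```

Note that percentages are taken over all labels including noise (`-1`), so named clusters need not sum to 100%.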
File diff suppressed because one or more lines are too long
@@ -1 +1 @@
message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds
message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds
@@ -1 +1 @@
message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds
message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds
@@ -100,11 +100,11 @@ def process_csvs_in_directory(directory_path: str, model_name: str = 'all-MiniLM
if __name__ == '__main__':
    # Define the directory where your CSV files are located.
    # The script will look for a folder named 'csv_data' in the current directory.
    CSV_DIRECTORY = 'csv_data'
    CSV_DIRECTORY = '../discord_chat_logs'

    # This function will create the 'csv_data' directory and some sample
    # files if they don't exist. You can comment this out if you have your own files.
    create_sample_files(CSV_DIRECTORY)
    # create_sample_files(CSV_DIRECTORY)

    # Run the main processing function on the directory
    process_csvs_in_directory(CSV_DIRECTORY)