Compare commits

...

6 Commits

Author SHA1 Message Date
ce906e4f9a udpated perplexity factor 2025-08-11 16:11:21 +01:00
fd9b25f256 updated readme 2025-08-11 03:07:44 +01:00
2b8659fc95 beter clusters and qol 2025-08-11 03:04:50 +01:00
647111e9d3 3d viz 2025-08-11 02:49:41 +01:00
4ca7e8ab61 refactor 2025-08-11 02:37:21 +01:00
6d35b42b27 updated reqs from clusteing 2025-08-11 02:22:59 +01:00
12 changed files with 1683 additions and 234 deletions

README.md

@@ -1,2 +1,281 @@
# cult-scraper
# Discord Data Analysis & Visualization Suite
A comprehensive toolkit for scraping, processing, and analyzing Discord chat data with advanced visualization capabilities.
## 🌟 Features
### 📥 Data Collection
- **Discord Bot Scraper**: Automated extraction of complete message history from Discord servers
- **Image Downloader**: Downloads and processes images from Discord attachments with base64 conversion
- **Text Embeddings**: Generate semantic embeddings for chat messages using sentence transformers
### 📊 Visualization & Analysis
- **Interactive Chat Visualizer**: 2D visualization of chat messages using dimensionality reduction (PCA, t-SNE)
- **Clustering Analysis**: Automated grouping of similar messages with DBSCAN and HDBSCAN
- **Image Dataset Viewer**: Browse and explore downloaded images by channel
### 🔧 Data Processing
- **Batch Processing**: Process multiple CSV files with embeddings
- **Metadata Extraction**: Comprehensive message metadata including timestamps, authors, and content
- **Data Filtering**: Advanced filtering by authors, channels, and timeframes
## 📁 Repository Structure
```
cult-scraper-1/
├── scripts/                            # Core data collection scripts
│   ├── bot.py                          # Discord bot for message scraping
│   ├── image_downloader.py             # Download and convert Discord images
│   ├── embedder.py                     # Batch text embedding processor
│   └── embed_class.py                  # Text embedding utilities
├── apps/                               # Interactive applications
│   ├── cluster_map/                    # Chat message clustering & visualization
│   │   ├── main.py                     # Main Streamlit application
│   │   ├── config.py                   # Settings and constants
│   │   ├── data_loader.py              # Data loading utilities
│   │   ├── clustering.py               # Clustering algorithms
│   │   ├── dimensionality_reduction.py # Projection & point-separation methods
│   │   ├── visualization.py            # Plotting and visualization
│   │   └── requirements.txt            # Dependencies
│   └── image_viewer/                   # Image dataset browser
│       ├── image_viewer.py             # Streamlit image viewer
│       └── requirements.txt            # Dependencies
├── discord_chat_logs/                  # Exported CSV files from Discord
└── images_dataset/                     # Downloaded images and metadata
    └── images_dataset.json             # Image dataset with base64 data
```
## 🚀 Quick Start
### 1. Discord Data Scraping
First, set up and run the Discord bot to collect message data:
```bash
cd scripts
# Configure your bot token in bot.py
python bot.py
```
**Requirements:**
- Discord bot token with message content intent enabled
- Bot must have read permissions in target channels
### 2. Generate Text Embeddings
Process the collected chat data to add semantic embeddings:
```bash
cd scripts
python embedder.py
```
This will:
- Process all CSV files in `discord_chat_logs/`
- Add embeddings to message content using sentence transformers
- Save updated files with embedding vectors
### 3. Download Images
Extract and download images from Discord attachments:
```bash
cd scripts
python image_downloader.py
```
Features:
- Downloads images from attachment URLs
- Converts to base64 for storage
- Handles multiple image formats (PNG, JPG, GIF, WebP, etc.)
- Implements retry logic and rate limiting
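The retry logic might look like the following sketch (the function name and parameters here are illustrative, not the actual `image_downloader.py` implementation): exponential backoff around `requests.get`, re-raising only after the final attempt.

```python
import time
import requests

def fetch_with_retry(url: str, retries: int = 3, backoff: float = 1.0) -> bytes:
    """Download a URL, retrying with exponential backoff on transient errors."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return resp.content
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # Out of attempts: surface the error to the caller
            time.sleep(backoff * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Sleeping between attempts doubles as a crude rate limiter for the Discord CDN.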
### 4. Visualize Chat Data
Launch the interactive chat visualization tool:
```bash
cd apps/cluster_map
pip install -r requirements.txt
streamlit run main.py
```
**Capabilities:**
- 2D visualization using PCA or t-SNE
- Interactive clustering with DBSCAN/HDBSCAN
- Filter by channels, authors, and time periods
- Hover to see message content and metadata
### 5. Browse Image Dataset
View downloaded images in an organized interface:
```bash
cd apps/image_viewer
pip install -r requirements.txt
streamlit run image_viewer.py
```
**Features:**
- Channel-based organization
- Navigation controls (previous/next)
- Image metadata display
- Responsive layout
## 📋 Data Formats
### Discord Chat Logs (CSV)
```csv
message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds,content_embedding
1234567890,2025-08-11 12:34:56,9876543210,username,nickname,"Hello world!","https://cdn.discordapp.com/...",{},"[0.123, -0.456, ...]"
```
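Note that `content_embedding` stores each vector as a stringified Python list, so it must be parsed back before use. A minimal sketch of that step (the helper name is illustrative; `data_loader.py` does the equivalent with `ast.literal_eval`):

```python
import ast
import numpy as np

def parse_embedding(cell: str):
    """Turn a stringified list like "[0.123, -0.456]" back into a float vector."""
    try:
        values = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return None  # Malformed cell: skip this row
    if isinstance(values, list) and values:
        return np.asarray(values, dtype=np.float32)
    return None

vec = parse_embedding("[0.123, -0.456, 0.789]")  # → vector of shape (3,)
```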
### Image Dataset (JSON)
```json
{
  "metadata": {
    "created_at": "2025-08-11 12:34:56 UTC",
    "summary": {
      "total_images": 42,
      "channels": ["memes", "general"],
      "total_size_bytes": 1234567,
      "file_extensions": [".png", ".jpg"],
      "authors": ["user1", "user2"]
    }
  },
  "images": [
    {
      "url": "https://cdn.discordapp.com/attachments/...",
      "channel": "memes",
      "author_name": "username",
      "timestamp_utc": "2025-08-11 12:34:56+00:00",
      "content": "Message text",
      "file_extension": ".png",
      "file_size": 54321,
      "base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
    }
  ]
}
```
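Because `base64_data` holds the encoded image bytes, recovering a viewable file is a one-line decode. A sketch, assuming an entry dict shaped like the schema above:

```python
import base64

def decode_image(entry: dict) -> bytes:
    """Recover raw image bytes from an images_dataset.json entry."""
    return base64.b64decode(entry["base64_data"])

# Round-trip demo with placeholder bytes standing in for a real PNG
raw = b"\x89PNG fake bytes"
entry = {"file_extension": ".png",
         "base64_data": base64.b64encode(raw).decode("ascii")}
assert decode_image(entry) == raw
```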
## 🔧 Configuration
### Discord Bot Setup
1. Create a Discord application at https://discord.com/developers/applications
2. Create a bot and copy the token
3. Enable the following intents:
- Message Content Intent
- Server Members Intent (optional)
4. Invite bot to your server with appropriate permissions
### Environment Variables
```python
# Set in scripts/bot.py
BOT_TOKEN = "your_discord_bot_token_here"
```
### Embedding Models
The system uses sentence-transformers models. Default: `all-MiniLM-L6-v2`
Supported models:
- `all-MiniLM-L6-v2` (lightweight, fast)
- `all-mpnet-base-v2` (higher quality)
- `sentence-transformers/all-roberta-large-v1` (best quality, slower)
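Whichever model you choose, the output is a plain float vector per message; everything downstream (clustering, visualization) only assumes vectors can be compared, typically by cosine similarity. A minimal NumPy sketch of that comparison:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])  # stand-ins for two message embeddings
b = np.array([1.0, 0.0])
c = np.array([0.0, 1.0])
assert cosine_similarity(a, b) > cosine_similarity(a, c)
```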
## 📊 Visualization Features
### Chat Message Clustering
- **Dimensionality Reduction**: PCA, t-SNE, UMAP
- **Clustering Algorithms**: DBSCAN, HDBSCAN with automatic parameter tuning
- **Interactive Controls**: Filter by source files, authors, and clusters
- **Hover Information**: View message content, author, timestamp on hover
### Image Analysis
- **Channel Organization**: Browse images by Discord channel
- **Metadata Display**: Author, timestamp, message context
- **Navigation**: Previous/next controls with slider
- **Format Support**: PNG, JPG, GIF, WebP, BMP, TIFF
## 🛠️ Dependencies
### Core Scripts
- `discord.py` - Discord bot framework
- `pandas` - Data manipulation
- `sentence-transformers` - Text embeddings
- `requests` - HTTP requests for image downloads
### Visualization Apps
- `streamlit` - Web interface framework
- `plotly` - Interactive plotting
- `scikit-learn` - Machine learning algorithms
- `numpy` - Numerical computations
- `umap-learn` - Dimensionality reduction
- `hdbscan` - Density-based clustering
## 📈 Use Cases
### Research & Analytics
- **Community Analysis**: Understand conversation patterns and topics
- **Sentiment Analysis**: Track mood and sentiment over time
- **User Behavior**: Analyze posting patterns and engagement
- **Content Moderation**: Identify problematic content clusters
### Data Science Projects
- **NLP Research**: Experiment with text embeddings and clustering
- **Social Network Analysis**: Study communication patterns
- **Visualization Techniques**: Explore dimensionality reduction methods
- **Image Processing**: Analyze visual content sharing patterns
### Content Management
- **Archive Creation**: Preserve Discord community history
- **Content Discovery**: Find similar messages and discussions
- **Moderation Tools**: Identify spam or inappropriate content
- **Backup Solutions**: Create comprehensive data backups
## 🔒 Privacy & Ethics
- **Data Protection**: All processing happens locally
- **User Consent**: Ensure proper permissions before scraping
- **Compliance**: Follow Discord's Terms of Service
- **Anonymization**: Consider removing or hashing user IDs for research
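For example, author IDs could be replaced with salted hashes before sharing any dataset. A sketch (the helper and the salt value are placeholders; keep a real salt secret and out of version control):

```python
import hashlib

def anonymize_id(user_id: str, salt: str) -> str:
    """Deterministic pseudonym: the same ID + salt always maps to the same token."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()[:16]

token = anonymize_id("9876543210", salt="keep-this-secret")
```

Determinism preserves cross-message analysis (same author, same token) while breaking the link back to the Discord account.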
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request
## 📄 License
This project is intended for educational and research purposes. Please ensure compliance with Discord's Terms of Service and applicable privacy laws when using this toolkit.
## 🆘 Troubleshooting
### Common Issues
**Bot can't read messages:**
- Ensure Message Content Intent is enabled
- Check bot permissions in Discord server
- Verify bot token is correct
**Embeddings not generating:**
- Install sentence-transformers: `pip install sentence-transformers`
- Check available GPU memory for large models
- Try a smaller model like `all-MiniLM-L6-v2`
**Images not downloading:**
- Check internet connectivity
- Verify Discord CDN URLs are accessible
- Increase retry limits for unreliable connections
**Visualization not loading:**
- Ensure all requirements are installed
- Check that CSV files have embeddings
- Try reducing dataset size for better performance
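A quick way to confirm an exported CSV actually contains usable embeddings (the helper name is illustrative):

```python
import pandas as pd

def has_embeddings(df: pd.DataFrame) -> bool:
    """True if the embedding column exists and at least one row is populated."""
    return bool("content_embedding" in df.columns
                and df["content_embedding"].notna().any())

# In practice: df = pd.read_csv("discord_chat_logs/<export>.csv")
sample = pd.DataFrame({"content": ["hi"], "content_embedding": ["[0.1, 0.2]"]})
assert has_embeddings(sample)
```

If this returns `False`, re-run `scripts/embedder.py` on the log before launching the visualizer.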
## 📚 Additional Resources
- [Discord.py Documentation](https://discordpy.readthedocs.io/)
- [Sentence Transformers Models](https://www.sbert.net/docs/pretrained_models.html)
- [Streamlit Documentation](https://docs.streamlit.io/)
- [scikit-learn Clustering](https://scikit-learn.org/stable/modules/clustering.html)

apps/cluster_map/cluster.py

@@ -0,0 +1,12 @@
"""
Discord Chat Embeddings Visualizer - Legacy Entry Point
This file serves as a compatibility layer for the original cluster.py.
The application has been refactored into modular components for better maintainability.
"""
# Import and run the main application
from main import main

if __name__ == "__main__":
    main()

apps/cluster_map/clustering.py

@@ -0,0 +1,226 @@
"""
Clustering algorithms and evaluation metrics.
"""
import numpy as np
import streamlit as st
from sklearn.cluster import SpectralClustering, AgglomerativeClustering, OPTICS
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, calinski_harabasz_score
import hdbscan
import pandas as pd
from collections import Counter
import re
from config import DEFAULT_RANDOM_STATE
def summarize_cluster_content(cluster_messages, max_words=3):
"""
Generate a meaningful name for a cluster based on its message content.
Args:
cluster_messages: List of message contents in the cluster
max_words: Maximum number of words in the cluster name
Returns:
str: Generated cluster name
"""
if not cluster_messages:
return "Empty Cluster"
# Combine all messages and clean text
all_text = " ".join([str(msg) for msg in cluster_messages if pd.notna(msg)])
if not all_text.strip():
return "Empty Content"
# Basic text cleaning
text = all_text.lower()
# Remove URLs, mentions, and special characters
text = re.sub(r'http[s]?://\S+', '', text) # Remove URLs
text = re.sub(r'<@\d+>', '', text) # Remove Discord mentions
text = re.sub(r'<:\w+:\d+>', '', text) # Remove custom emojis
text = re.sub(r'[^\w\s]', ' ', text) # Remove punctuation
text = re.sub(r'\s+', ' ', text).strip() # Normalize whitespace
if not text:
return "Special Characters"
# Split into words and filter out common words
words = text.split()
# Common stop words to filter out
stop_words = {
'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with',
'by', 'from', 'up', 'about', 'into', 'through', 'during', 'before', 'after',
'above', 'below', 'between', 'among', 'until', 'without', 'under', 'over',
'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might',
'i', 'you', 'he', 'she', 'it', 'we', 'they', 'me', 'him', 'her', 'us', 'them',
'my', 'your', 'his', 'her', 'its', 'our', 'their', 'this', 'that', 'these', 'those',
'just', 'like', 'get', 'know', 'think', 'see', 'go', 'come', 'say', 'said',
'yeah', 'yes', 'no', 'oh', 'ok', 'okay', 'well', 'so', 'but', 'if', 'when',
'what', 'where', 'why', 'how', 'who', 'which', 'than', 'then', 'now', 'here',
'there', 'also', 'too', 'very', 'really', 'pretty', 'much', 'more', 'most',
'some', 'any', 'all', 'many', 'few', 'little', 'big', 'small', 'good', 'bad'
}
# Filter out stop words and very short/long words
filtered_words = [
word for word in words
if word not in stop_words
and len(word) >= 3
and len(word) <= 15
and word.isalpha() # Only alphabetic words
]
if not filtered_words:
return f"Chat ({len(cluster_messages)} msgs)"
# Count word frequencies
word_counts = Counter(filtered_words)
# Get most common words
most_common = word_counts.most_common(max_words * 2) # Get more than needed for filtering
# Select diverse words (avoid very similar words)
selected_words = []
for word, count in most_common:
# Avoid adding very similar words
if not any(word.startswith(existing[:4]) or existing.startswith(word[:4])
for existing in selected_words):
selected_words.append(word)
if len(selected_words) >= max_words:
break
if not selected_words:
return f"Discussion ({len(cluster_messages)} msgs)"
# Create cluster name
cluster_name = " + ".join(selected_words[:max_words]).title()
# Add message count for context
cluster_name += f" ({len(cluster_messages)})"
return cluster_name
def generate_cluster_names(filtered_df, cluster_labels):
"""
Generate names for all clusters based on their content.
Args:
filtered_df: DataFrame with message data
cluster_labels: Array of cluster labels for each message
Returns:
dict: Mapping from cluster_id to cluster_name
"""
if cluster_labels is None:
return {}
cluster_names = {}
unique_clusters = np.unique(cluster_labels)
for cluster_id in unique_clusters:
if cluster_id == -1:
cluster_names[cluster_id] = "Noise/Outliers"
continue
# Get messages in this cluster
cluster_mask = cluster_labels == cluster_id
cluster_messages = filtered_df[cluster_mask]['content'].tolist()
# Generate name
cluster_name = summarize_cluster_content(cluster_messages)
cluster_names[cluster_id] = cluster_name
return cluster_names
def apply_clustering(embeddings, clustering_method="None", n_clusters=5):
"""
Apply clustering algorithm to embeddings and return labels and metrics.
Args:
embeddings: High-dimensional embeddings to cluster
clustering_method: Name of clustering algorithm
n_clusters: Number of clusters (for methods that require it)
Returns:
tuple: (cluster_labels, silhouette_score, calinski_harabasz_score)
"""
if clustering_method == "None" or len(embeddings) <= n_clusters:
return None, None, None
# Standardize embeddings for better clustering
scaler = StandardScaler()
scaled_embeddings = scaler.fit_transform(embeddings)
cluster_labels = None
silhouette_avg = None
calinski_harabasz = None
try:
if clustering_method == "HDBSCAN":
min_cluster_size = max(2, len(embeddings) // 20) # Adaptive min cluster size
clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
min_samples=1, cluster_selection_epsilon=0.5)
cluster_labels = clusterer.fit_predict(scaled_embeddings)
elif clustering_method == "Spectral Clustering":
clusterer = SpectralClustering(n_clusters=n_clusters, random_state=DEFAULT_RANDOM_STATE,
affinity='rbf', gamma=1.0)
cluster_labels = clusterer.fit_predict(scaled_embeddings)
elif clustering_method == "Gaussian Mixture":
clusterer = GaussianMixture(n_components=n_clusters, random_state=DEFAULT_RANDOM_STATE,
covariance_type='full', max_iter=200)
cluster_labels = clusterer.fit_predict(scaled_embeddings)
elif clustering_method == "Agglomerative (Ward)":
clusterer = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
cluster_labels = clusterer.fit_predict(scaled_embeddings)
elif clustering_method == "Agglomerative (Complete)":
clusterer = AgglomerativeClustering(n_clusters=n_clusters, linkage='complete')
cluster_labels = clusterer.fit_predict(scaled_embeddings)
elif clustering_method == "OPTICS":
min_samples = max(2, len(embeddings) // 50)
clusterer = OPTICS(min_samples=min_samples, xi=0.05, min_cluster_size=0.1)
cluster_labels = clusterer.fit_predict(scaled_embeddings)
# Calculate clustering quality metrics
if cluster_labels is not None and len(np.unique(cluster_labels)) > 1:
# Only calculate if we have multiple clusters and no noise-only clustering
valid_labels = cluster_labels[cluster_labels != -1] # Remove noise points for HDBSCAN/OPTICS
valid_embeddings = scaled_embeddings[cluster_labels != -1]
if len(valid_labels) > 0 and len(np.unique(valid_labels)) > 1:
silhouette_avg = silhouette_score(valid_embeddings, valid_labels)
calinski_harabasz = calinski_harabasz_score(valid_embeddings, valid_labels)
except Exception as e:
st.warning(f"Clustering failed: {str(e)}")
cluster_labels = None
return cluster_labels, silhouette_avg, calinski_harabasz
def get_cluster_statistics(cluster_labels):
"""Get basic statistics about clustering results"""
if cluster_labels is None:
return {}
unique_clusters = np.unique(cluster_labels)
n_clusters = len(unique_clusters[unique_clusters != -1]) # Exclude noise cluster (-1)
n_noise = np.sum(cluster_labels == -1)
return {
"n_clusters": n_clusters,
"n_noise_points": n_noise,
"cluster_distribution": np.bincount(cluster_labels[cluster_labels != -1]) if n_clusters > 0 else [],
"unique_clusters": unique_clusters
}

apps/cluster_map/config.py

@@ -0,0 +1,75 @@
"""
Configuration settings and constants for the Discord Chat Embeddings Visualizer.
"""
# Application settings
APP_TITLE = "The Cult - Visualised"
APP_ICON = "🗨️"
APP_LAYOUT = "wide"
# File paths
CHAT_LOGS_PATH = "../../discord_chat_logs"
# Algorithm parameters
DEFAULT_RANDOM_STATE = 42
DEFAULT_N_COMPONENTS = 2
DEFAULT_N_CLUSTERS = 5
DEFAULT_DIMENSION_REDUCTION_METHOD = "t-SNE"
DEFAULT_CLUSTERING_METHOD = "None"
# Visualization settings
DEFAULT_POINT_SIZE = 8
DEFAULT_POINT_OPACITY = 0.7
MAX_DISPLAYED_AUTHORS = 10
MESSAGE_CONTENT_PREVIEW_LENGTH = 200
MESSAGE_CONTENT_DISPLAY_LENGTH = 100
# Performance thresholds
LARGE_DATASET_WARNING_THRESHOLD = 1000
# Color palettes
PRIMARY_COLORS = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
                  "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]

# Clustering method categories
CLUSTERING_METHODS_REQUIRING_N_CLUSTERS = [
    "Spectral Clustering",
    "Gaussian Mixture",
    "Agglomerative (Ward)",
    "Agglomerative (Complete)",
]

COMPUTATIONALLY_INTENSIVE_METHODS = {
    "dimension_reduction": ["t-SNE", "Spectral Embedding"],
    "clustering": ["Spectral Clustering", "OPTICS"],
}

# Method explanations
METHOD_EXPLANATIONS = {
    "dimension_reduction": {
        "PCA": "Linear, fast, preserves global variance",
        "t-SNE": "Non-linear, good for local structure, slower",
        "UMAP": "Balanced speed/quality, preserves local & global structure",
        "Spectral Embedding": "Uses graph theory, good for non-convex clusters",
        "Force-Directed": "Physics-based layout, creates natural spacing",
    },
    "clustering": {
        "HDBSCAN": "Density-based, finds variable density clusters, handles noise",
        "Spectral Clustering": "Uses eigenvalues, good for non-convex shapes",
        "Gaussian Mixture": "Probabilistic, assumes gaussian distributions",
        "Agglomerative (Ward)": "Hierarchical, minimizes within-cluster variance",
        "Agglomerative (Complete)": "Hierarchical, minimizes maximum distance",
        "OPTICS": "Density-based, finds clusters of varying densities",
    },
    "separation": {
        "Spread Factor": "Applies repulsive forces between nearby points",
        "Smart Jittering": "Adds intelligent noise to separate overlapping points",
        "Density-Based Jittering": "Stronger separation in crowded areas",
        "Perplexity Factor": "Controls t-SNE's focus on local vs global structure",
        "Min Distance Factor": "Controls UMAP's point packing tightness",
    },
    "metrics": {
        "Silhouette Score": "Higher is better (range: -1 to 1)",
        "Calinski-Harabasz": "Higher is better, measures cluster separation",
    },
}

apps/cluster_map/data_loader.py

@@ -0,0 +1,86 @@
"""
Data loading and parsing utilities for Discord chat logs.
"""
import pandas as pd
import numpy as np
import streamlit as st
import ast
from pathlib import Path
from config import CHAT_LOGS_PATH
@st.cache_data
def load_all_chat_data():
"""Load all CSV files from the discord_chat_logs folder"""
chat_logs_path = Path(CHAT_LOGS_PATH)
with st.expander("📁 Loading Details", expanded=False):
# Display the path for debugging
st.write(f"Looking for CSV files in: {chat_logs_path}")
st.write(f"Path exists: {chat_logs_path.exists()}")
all_data = []
for csv_file in chat_logs_path.glob("*.csv"):
try:
df = pd.read_csv(csv_file)
df['source_file'] = csv_file.stem # Add source file name
all_data.append(df)
st.write(f"✅ Loaded {len(df)} messages from {csv_file.name}")
except Exception as e:
st.error(f"❌ Error loading {csv_file.name}: {e}")
if all_data:
combined_df = pd.concat(all_data, ignore_index=True)
st.success(f"🎉 Successfully loaded {len(combined_df)} total messages from {len(all_data)} files")
else:
st.error("No data loaded!")
combined_df = pd.DataFrame()
return combined_df if all_data else pd.DataFrame()
@st.cache_data
def parse_embeddings(df):
"""Parse the content_embedding column from string to numpy array"""
embeddings = []
valid_indices = []
for idx, embedding_str in enumerate(df['content_embedding']):
try:
# Parse the string representation of the list
embedding = ast.literal_eval(embedding_str)
if isinstance(embedding, list) and len(embedding) > 0:
embeddings.append(embedding)
valid_indices.append(idx)
except Exception as e:
continue
embeddings_array = np.array(embeddings)
valid_df = df.iloc[valid_indices].copy()
st.info(f"📊 Parsed {len(embeddings)} valid embeddings from {len(df)} messages")
st.info(f"🔢 Embedding dimension: {embeddings_array.shape[1] if len(embeddings) > 0 else 0}")
return embeddings_array, valid_df
def filter_data(df, selected_sources, selected_authors):
"""Filter dataframe by selected sources and authors"""
if not selected_sources:
selected_sources = df['source_file'].unique()
filtered_df = df[
(df['source_file'].isin(selected_sources)) &
(df['author_name'].isin(selected_authors))
]
return filtered_df
def get_filtered_embeddings(embeddings, valid_df, filtered_df):
"""Get embeddings corresponding to filtered dataframe"""
filtered_indices = filtered_df.index.tolist()
filtered_embeddings = embeddings[[i for i, idx in enumerate(valid_df.index) if idx in filtered_indices]]
return filtered_embeddings

apps/cluster_map/dimensionality_reduction.py

@@ -0,0 +1,211 @@
"""
Dimensionality reduction algorithms and point separation techniques.
"""
import numpy as np
import streamlit as st
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, SpectralEmbedding
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import pdist, squareform
from scipy.optimize import minimize
import umap
from config import DEFAULT_RANDOM_STATE
def apply_adaptive_spreading(embeddings, spread_factor=1.0):
"""
Apply adaptive spreading to push apart nearby points while preserving global structure.
Uses a force-based approach where closer points repel more strongly.
"""
if spread_factor <= 0:
return embeddings
embeddings = embeddings.copy()
n_points = len(embeddings)
print(f"DEBUG: Applying adaptive spreading to {n_points} points with factor {spread_factor}")
if n_points < 2:
return embeddings
# For very large datasets, skip spreading to avoid hanging
if n_points > 1000:
print(f"DEBUG: Large dataset ({n_points} points), skipping adaptive spreading...")
return embeddings
# Calculate pairwise distances
distances = squareform(pdist(embeddings))
# Apply force-based spreading with fewer iterations for large datasets
max_iterations = 3 if n_points > 500 else 5
for iteration in range(max_iterations):
if iteration % 2 == 0: # Progress indicator
print(f"DEBUG: Spreading iteration {iteration + 1}/{max_iterations}")
forces = np.zeros_like(embeddings)
for i in range(n_points):
for j in range(i + 1, n_points):
diff = embeddings[i] - embeddings[j]
dist = np.linalg.norm(diff)
if dist > 0:
# Repulsive force inversely proportional to distance
force_magnitude = spread_factor / (dist ** 2 + 0.01)
force_direction = diff / dist
force = force_magnitude * force_direction
forces[i] += force
forces[j] -= force
# Apply forces with damping
embeddings += forces * 0.1
print(f"DEBUG: Adaptive spreading complete")
return embeddings
def force_directed_layout(high_dim_embeddings, n_components=2, spread_factor=1.0):
"""
Create a force-directed layout from high-dimensional embeddings.
This creates more natural spacing between similar points.
"""
print(f"DEBUG: Starting force-directed layout with {len(high_dim_embeddings)} points...")
# For large datasets, fall back to PCA + spreading to avoid hanging
if len(high_dim_embeddings) > 500:
print(f"DEBUG: Large dataset ({len(high_dim_embeddings)} points), using PCA + spreading instead...")
pca = PCA(n_components=n_components, random_state=DEFAULT_RANDOM_STATE)
result = pca.fit_transform(high_dim_embeddings)
return apply_adaptive_spreading(result, spread_factor)
# Start with PCA as initial layout
pca = PCA(n_components=n_components, random_state=DEFAULT_RANDOM_STATE)
initial_layout = pca.fit_transform(high_dim_embeddings)
print(f"DEBUG: Initial PCA layout computed...")
# For simplicity, just apply spreading to the PCA result
# The original optimization was too computationally intensive
result = apply_adaptive_spreading(initial_layout, spread_factor)
print(f"DEBUG: Force-directed layout complete...")
return result
def calculate_local_density_scaling(embeddings, k=5):
"""
Calculate local density scaling factors to emphasize differences in dense regions.
"""
if len(embeddings) < k:
return np.ones(len(embeddings))
# Find k nearest neighbors for each point
nn = NearestNeighbors(n_neighbors=k+1) # +1 because first neighbor is the point itself
nn.fit(embeddings)
distances, indices = nn.kneighbors(embeddings)
# Calculate local density (inverse of average distance to k nearest neighbors)
local_densities = 1.0 / (np.mean(distances[:, 1:], axis=1) + 1e-6)
# Normalize densities
local_densities = (local_densities - np.min(local_densities)) / (np.max(local_densities) - np.min(local_densities) + 1e-6)
return local_densities
def apply_density_based_jittering(embeddings, density_scaling=True, jitter_strength=0.1):
"""
Apply smart jittering that's stronger in dense regions to separate overlapping points.
"""
if not density_scaling:
# Simple random jittering
noise = np.random.normal(0, jitter_strength, embeddings.shape)
return embeddings + noise
# Calculate local densities
densities = calculate_local_density_scaling(embeddings)
# Apply density-proportional jittering
jittered = embeddings.copy()
for i in range(len(embeddings)):
# More jitter in denser regions
jitter_amount = jitter_strength * (1 + densities[i])
noise = np.random.normal(0, jitter_amount, embeddings.shape[1])
jittered[i] += noise
return jittered
def reduce_dimensions(embeddings, method="PCA", n_components=2, spread_factor=1.0,
perplexity_factor=1.0, min_dist_factor=1.0):
"""Apply dimensionality reduction with enhanced separation"""
# Convert to numpy array if it's not already
embeddings = np.array(embeddings)
print(f"DEBUG: Starting {method} with {len(embeddings)} embeddings, shape: {embeddings.shape}")
# Standardize embeddings for better processing
scaler = StandardScaler()
scaled_embeddings = scaler.fit_transform(embeddings)
print(f"DEBUG: Embeddings standardized")
# Apply the selected dimensionality reduction method
if method == "PCA":
print(f"DEBUG: Applying PCA...")
reducer = PCA(n_components=n_components, random_state=DEFAULT_RANDOM_STATE)
reduced_embeddings = reducer.fit_transform(scaled_embeddings)
# Apply spreading to PCA results
print(f"DEBUG: Applying spreading...")
reduced_embeddings = apply_adaptive_spreading(reduced_embeddings, spread_factor)
elif method == "t-SNE":
# Adjust perplexity based on user preference and data size
base_perplexity = min(30, len(embeddings)-1)
adjusted_perplexity = max(5, min(50, int(base_perplexity * perplexity_factor)))
print(f"DEBUG: Applying t-SNE with perplexity {adjusted_perplexity}...")
reducer = TSNE(n_components=n_components, random_state=DEFAULT_RANDOM_STATE,
perplexity=adjusted_perplexity, n_iter=1000,
early_exaggeration=12.0 * spread_factor, # Increase early exaggeration for more separation
learning_rate='auto')
reduced_embeddings = reducer.fit_transform(scaled_embeddings)
elif method == "UMAP":
# Adjust UMAP parameters for better local separation
n_neighbors = min(15, len(embeddings)-1)
min_dist = 0.1 * min_dist_factor
spread = 1.0 * spread_factor
print(f"DEBUG: Applying UMAP with n_neighbors={n_neighbors}, min_dist={min_dist}...")
reducer = umap.UMAP(n_components=n_components, random_state=DEFAULT_RANDOM_STATE,
n_neighbors=n_neighbors, min_dist=min_dist,
spread=spread, local_connectivity=2.0)
reduced_embeddings = reducer.fit_transform(scaled_embeddings)
elif method == "Spectral Embedding":
n_neighbors = min(10, len(embeddings)-1)
print(f"DEBUG: Applying Spectral Embedding with n_neighbors={n_neighbors}...")
reducer = SpectralEmbedding(n_components=n_components, random_state=DEFAULT_RANDOM_STATE,
n_neighbors=n_neighbors)
reduced_embeddings = reducer.fit_transform(scaled_embeddings)
# Apply spreading to spectral results
print(f"DEBUG: Applying spreading...")
reduced_embeddings = apply_adaptive_spreading(reduced_embeddings, spread_factor)
elif method == "Force-Directed":
# New method: Use force-directed layout for natural spreading
print(f"DEBUG: Applying Force-Directed layout...")
reduced_embeddings = force_directed_layout(scaled_embeddings, n_components, spread_factor)
else:
# Fallback to PCA
print(f"DEBUG: Unknown method {method}, falling back to PCA...")
reducer = PCA(n_components=n_components, random_state=DEFAULT_RANDOM_STATE)
reduced_embeddings = reducer.fit_transform(scaled_embeddings)
reduced_embeddings = apply_adaptive_spreading(reduced_embeddings, spread_factor)
print(f"DEBUG: Dimensionality reduction complete. Output shape: {reduced_embeddings.shape}")
return reduced_embeddings

apps/cluster_map/main.py

@@ -0,0 +1,169 @@
"""
Main application logic for the Discord Chat Embeddings Visualizer.
"""
import streamlit as st
import warnings
warnings.filterwarnings('ignore')
# Import custom modules
from ui_components import (
setup_page_config, display_title_and_description, get_all_ui_parameters,
display_performance_warnings
)
from data_loader import (
load_all_chat_data, parse_embeddings, filter_data, get_filtered_embeddings
)
from dimensionality_reduction import (
reduce_dimensions, apply_density_based_jittering
)
from clustering import apply_clustering, generate_cluster_names
from visualization import (
create_visualization_plot, display_clustering_metrics, display_summary_stats,
display_clustering_results, display_data_table, display_cluster_summary
)
def main():
"""Main application function"""
# Set up page configuration
setup_page_config()
# Display title and description
display_title_and_description()
# Load data
with st.spinner("Loading chat data..."):
df = load_all_chat_data()
if df.empty:
st.error("No data could be loaded. Please check the data directory.")
st.stop()
# Parse embeddings
with st.spinner("Parsing embeddings..."):
embeddings, valid_df = parse_embeddings(df)
if len(embeddings) == 0:
st.error("No valid embeddings found!")
st.stop()
# Get UI parameters
params = get_all_ui_parameters(valid_df)
# Check if any sources are selected before proceeding
if not params['selected_sources']:
st.info("📂 **Select source files from the sidebar to begin visualization**")
st.markdown("### Available Data Sources:")
# Show available sources as an informational table
source_info = []
for source in valid_df['source_file'].unique():
source_data = valid_df[valid_df['source_file'] == source]
source_info.append({
'Source File': source,
'Messages': len(source_data),
'Unique Authors': source_data['author_name'].nunique(),
'Date Range': f"{source_data['timestamp_utc'].min()} to {source_data['timestamp_utc'].max()}"
})
import pandas as pd
source_df = pd.DataFrame(source_info)
st.dataframe(source_df, use_container_width=True, hide_index=True)
st.markdown("👈 **Use the sidebar to select which sources to visualize**")
st.stop()
# Filter data
filtered_df = filter_data(valid_df, params['selected_sources'], params['selected_authors'])
if filtered_df.empty:
st.warning("No data matches the current filters! Try selecting different sources or authors.")
st.stop()
# Display performance warnings
display_performance_warnings(filtered_df, params['method'], params['clustering_method'])
# Get corresponding embeddings
filtered_embeddings = get_filtered_embeddings(embeddings, valid_df, filtered_df)
st.info(f"📈 Visualizing {len(filtered_df)} messages")
# Reduce dimensions
n_components = 3 if params['enable_3d'] else 2
with st.spinner(f"Reducing dimensions using {params['method']}..."):
reduced_embeddings = reduce_dimensions(
filtered_embeddings,
method=params['method'],
n_components=n_components,
spread_factor=params['spread_factor'],
perplexity_factor=params['perplexity_factor'],
min_dist_factor=params['min_dist_factor']
)
# Apply clustering
with st.spinner(f"Applying {params['clustering_method']}..."):
cluster_labels, silhouette_avg, calinski_harabasz = apply_clustering(
filtered_embeddings,
clustering_method=params['clustering_method'],
n_clusters=params['n_clusters']
)
# Apply jittering if requested
if params['apply_jittering']:
with st.spinner("Applying smart jittering to separate overlapping points..."):
reduced_embeddings = apply_density_based_jittering(
reduced_embeddings,
density_scaling=params['density_based_jitter'],
jitter_strength=params['jitter_strength']
)
# Generate cluster names if clustering was applied
cluster_names = None
if cluster_labels is not None:
with st.spinner("Generating cluster names..."):
cluster_names = generate_cluster_names(filtered_df, cluster_labels)
# Display clustering metrics
display_clustering_metrics(
cluster_labels, silhouette_avg, calinski_harabasz,
params['show_cluster_metrics']
)
# Display cluster summary with names
if cluster_names:
display_cluster_summary(cluster_names, cluster_labels)
# Create and display the main plot
fig = create_visualization_plot(
reduced_embeddings=reduced_embeddings,
filtered_df=filtered_df,
cluster_labels=cluster_labels,
selected_sources=params['selected_sources'] if params['selected_sources'] else None,
method=params['method'],
clustering_method=params['clustering_method'],
point_size=params['point_size'],
point_opacity=params['point_opacity'],
density_based_sizing=params['density_based_sizing'],
size_variation=params['size_variation'],
enable_3d=params['enable_3d'],
cluster_names=cluster_names
)
st.plotly_chart(fig, use_container_width=True)
# Display summary statistics
display_summary_stats(filtered_df, params['selected_sources'] or filtered_df['source_file'].unique())
# Display clustering results and export options
display_clustering_results(
filtered_df, cluster_labels, reduced_embeddings,
params['method'], params['clustering_method'], params['enable_3d']
)
# Display data table
display_data_table(filtered_df, cluster_labels)
if __name__ == "__main__":
main()


@@ -3,3 +3,6 @@ pandas>=1.5.0
numpy>=1.24.0
plotly>=5.15.0
scikit-learn>=1.3.0
umap-learn>=0.5.3
hdbscan>=0.8.29
scipy>=1.10.0


@@ -1,233 +0,0 @@
import streamlit as st
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import json
import os
from pathlib import Path
import ast
# Set page config
st.set_page_config(
page_title="Discord Chat Embeddings Visualizer",
page_icon="🗨️",
layout="wide"
)
# Title and description
st.title("🗨️ Discord Chat Embeddings Visualizer")
st.markdown("Explore Discord chat messages through their vector embeddings in 2D space")
@st.cache_data
def load_all_chat_data():
"""Load all CSV files from the discord_chat_logs folder"""
chat_logs_path = Path("../../discord_chat_logs")
# Display the path for debugging
st.write(f"Looking for CSV files in: {chat_logs_path}")
st.write(f"Path exists: {chat_logs_path.exists()}")
all_data = []
for csv_file in chat_logs_path.glob("*.csv"):
try:
df = pd.read_csv(csv_file)
df['source_file'] = csv_file.stem # Add source file name
all_data.append(df)
st.write(f"✅ Loaded {len(df)} messages from {csv_file.name}")
except Exception as e:
st.error(f"❌ Error loading {csv_file.name}: {e}")
if all_data:
combined_df = pd.concat(all_data, ignore_index=True)
st.success(f"🎉 Successfully loaded {len(combined_df)} total messages from {len(all_data)} files")
return combined_df
else:
st.error("No data loaded!")
return pd.DataFrame()
@st.cache_data
def parse_embeddings(df):
"""Parse the content_embedding column from string to numpy array"""
embeddings = []
valid_indices = []
for idx, embedding_str in enumerate(df['content_embedding']):
try:
# Parse the string representation of the list
embedding = ast.literal_eval(embedding_str)
if isinstance(embedding, list) and len(embedding) > 0:
embeddings.append(embedding)
valid_indices.append(idx)
except Exception as e:
continue
embeddings_array = np.array(embeddings)
valid_df = df.iloc[valid_indices].copy()
st.info(f"📊 Parsed {len(embeddings)} valid embeddings from {len(df)} messages")
st.info(f"🔢 Embedding dimension: {embeddings_array.shape[1] if len(embeddings) > 0 else 0}")
return embeddings_array, valid_df
@st.cache_data
def reduce_dimensions(embeddings, method="PCA", n_components=2):
"""Reduce embeddings to 2D using PCA or t-SNE"""
if method == "PCA":
reducer = PCA(n_components=n_components, random_state=42)
elif method == "t-SNE":
reducer = TSNE(n_components=n_components, random_state=42, perplexity=min(30, len(embeddings)-1))
reduced_embeddings = reducer.fit_transform(embeddings)
return reduced_embeddings
def create_hover_text(df):
"""Create hover text for plotly"""
hover_text = []
for _, row in df.iterrows():
text = f"<b>Author:</b> {row['author_name']}<br>"
text += f"<b>Timestamp:</b> {row['timestamp_utc']}<br>"
text += f"<b>Source:</b> {row['source_file']}<br>"
# Handle potential NaN or non-string content
content = row['content']
if pd.isna(content) or content is None:
content_text = "[No content]"
else:
content_str = str(content)
content_text = content_str[:200] + ('...' if len(content_str) > 200 else '')
text += f"<b>Content:</b> {content_text}"
hover_text.append(text)
return hover_text
def main():
# Load data
with st.spinner("Loading chat data..."):
df = load_all_chat_data()
if df.empty:
st.stop()
# Parse embeddings
with st.spinner("Parsing embeddings..."):
embeddings, valid_df = parse_embeddings(df)
if len(embeddings) == 0:
st.error("No valid embeddings found!")
st.stop()
# Sidebar controls
st.sidebar.header("🎛️ Visualization Controls")
# Dimension reduction method
method = st.sidebar.selectbox(
"Dimension Reduction Method",
["PCA", "t-SNE"],
help="PCA is faster, t-SNE may reveal better clusters"
)
# Source file filter
source_files = valid_df['source_file'].unique()
selected_sources = st.sidebar.multiselect(
"Filter by Source Files",
source_files,
default=source_files,
help="Select which chat log files to include"
)
# Author filter
authors = valid_df['author_name'].unique()
selected_authors = st.sidebar.multiselect(
"Filter by Authors",
authors,
default=authors[:10] if len(authors) > 10 else authors, # Limit to first 10 for performance
help="Select which authors to include"
)
# Filter data
filtered_df = valid_df[
(valid_df['source_file'].isin(selected_sources)) &
(valid_df['author_name'].isin(selected_authors))
]
if filtered_df.empty:
st.warning("No data matches the current filters!")
st.stop()
# Get corresponding embeddings
filtered_indices = filtered_df.index.tolist()
filtered_embeddings = embeddings[[i for i, idx in enumerate(valid_df.index) if idx in filtered_indices]]
st.info(f"📈 Visualizing {len(filtered_df)} messages")
# Reduce dimensions
with st.spinner(f"Reducing dimensions using {method}..."):
reduced_embeddings = reduce_dimensions(filtered_embeddings, method)
# Create hover text
hover_text = create_hover_text(filtered_df)
# Create the plot
fig = go.Figure()
# Color by source file
colors = px.colors.qualitative.Set1
for i, source in enumerate(selected_sources):
source_mask = filtered_df['source_file'] == source
if source_mask.any():
source_data = filtered_df[source_mask]
source_embeddings = reduced_embeddings[source_mask]
source_hover = [hover_text[j] for j, mask in enumerate(source_mask) if mask]
fig.add_trace(go.Scatter(
x=source_embeddings[:, 0],
y=source_embeddings[:, 1],
mode='markers',
name=source,
marker=dict(
size=8,
color=colors[i % len(colors)],
opacity=0.7,
line=dict(width=1, color='white')
),
hovertemplate='%{hovertext}<extra></extra>',
hovertext=source_hover
))
fig.update_layout(
title=f"Discord Chat Messages - {method} Visualization",
xaxis_title=f"{method} Component 1",
yaxis_title=f"{method} Component 2",
hovermode='closest',
width=1000,
height=700
)
# Display the plot
st.plotly_chart(fig, use_container_width=True)
# Statistics
col1, col2, col3 = st.columns(3)
with col1:
st.metric("Total Messages", len(filtered_df))
with col2:
st.metric("Unique Authors", filtered_df['author_name'].nunique())
with col3:
st.metric("Source Files", len(selected_sources))
# Show data table
if st.checkbox("Show Data Table"):
st.subheader("📋 Message Data")
display_df = filtered_df[['timestamp_utc', 'author_name', 'source_file', 'content']].copy()
display_df['content'] = display_df['content'].str[:100] + '...' # Truncate for display
st.dataframe(display_df, use_container_width=True)
if __name__ == "__main__":
main()


@@ -0,0 +1,43 @@
#!/usr/bin/env python3
"""
Test script to debug the hanging issue in the modular app
"""
import numpy as np
import sys
import os
# Add the current directory to Python path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
def test_dimensionality_reduction():
"""Test dimensionality reduction functions"""
print("Testing dimensionality reduction functions...")
from dimensionality_reduction import reduce_dimensions
# Create test data similar to what we'd expect
n_samples = 796 # Same as the user's dataset
n_features = 384 # Common embedding dimension
print(f"Creating test embeddings: {n_samples} x {n_features}")
test_embeddings = np.random.randn(n_samples, n_features)
# Test PCA (should be fast)
print("Testing PCA...")
try:
result = reduce_dimensions(test_embeddings, method="PCA")
print(f"✓ PCA successful, output shape: {result.shape}")
except Exception as e:
print(f"✗ PCA failed: {e}")
# Test UMAP (might be slower)
print("Testing UMAP...")
try:
result = reduce_dimensions(test_embeddings, method="UMAP")
print(f"✓ UMAP successful, output shape: {result.shape}")
except Exception as e:
print(f"✗ UMAP failed: {e}")
if __name__ == "__main__":
test_dimensionality_reduction()


@@ -0,0 +1,267 @@
"""
Streamlit UI components and controls for the Discord Chat Embeddings Visualizer.
"""
import streamlit as st
import numpy as np
from config import (
APP_TITLE, APP_ICON, APP_LAYOUT, METHOD_EXPLANATIONS,
CLUSTERING_METHODS_REQUIRING_N_CLUSTERS, COMPUTATIONALLY_INTENSIVE_METHODS,
LARGE_DATASET_WARNING_THRESHOLD, MAX_DISPLAYED_AUTHORS,
DEFAULT_DIMENSION_REDUCTION_METHOD, DEFAULT_CLUSTERING_METHOD
)
def setup_page_config():
"""Set up the Streamlit page configuration"""
st.set_page_config(
page_title=APP_TITLE,
page_icon=APP_ICON,
layout=APP_LAYOUT
)
def display_title_and_description():
"""Display the main title and description"""
st.title(f"{APP_ICON} {APP_TITLE}")
st.markdown("Explore Discord chat messages through their vector embeddings in 2D or 3D space")
def create_method_controls():
"""Create controls for dimension reduction and clustering methods"""
st.sidebar.header("🎛️ Visualization Controls")
# 3D visualization toggle
enable_3d = st.sidebar.checkbox(
"Enable 3D Visualization",
value=False,
help="Switch between 2D and 3D visualization. 3D uses 3 components instead of 2."
)
# Dimension reduction method
method_options = ["PCA", "t-SNE", "UMAP", "Spectral Embedding", "Force-Directed"]
default_index = method_options.index(DEFAULT_DIMENSION_REDUCTION_METHOD) if DEFAULT_DIMENSION_REDUCTION_METHOD in method_options else 0
method = st.sidebar.selectbox(
"Dimension Reduction Method",
method_options,
index=default_index,
help="PCA is fastest, UMAP balances speed and quality, t-SNE and Spectral are slower but may reveal better structures. Force-Directed creates natural spacing."
)
# Clustering method
clustering_options = ["None", "HDBSCAN", "Spectral Clustering", "Gaussian Mixture",
"Agglomerative (Ward)", "Agglomerative (Complete)", "OPTICS"]
clustering_default_index = clustering_options.index(DEFAULT_CLUSTERING_METHOD) if DEFAULT_CLUSTERING_METHOD in clustering_options else 0
clustering_method = st.sidebar.selectbox(
"Clustering Method",
clustering_options,
index=clustering_default_index,
help="Apply clustering to identify groups. HDBSCAN and OPTICS can find variable density clusters."
)
return method, clustering_method, enable_3d
def create_clustering_controls(clustering_method):
"""Create controls for clustering parameters"""
# Always show the clusters slider, but indicate when it's used
if clustering_method in CLUSTERING_METHODS_REQUIRING_N_CLUSTERS:
help_text = "Number of clusters to create. This setting affects the clustering algorithm."
disabled = False
elif clustering_method == "None":
help_text = "Clustering is disabled. This setting has no effect."
disabled = True
else:
help_text = f"{clustering_method} automatically determines the number of clusters. This setting has no effect."
disabled = True
n_clusters = st.sidebar.slider(
"Number of Clusters",
min_value=2,
max_value=20,
value=5,
disabled=disabled,
help=help_text
)
return n_clusters
def create_separation_controls(method):
"""Create controls for point separation and method-specific parameters"""
st.sidebar.subheader("🎯 Point Separation Controls")
spread_factor = st.sidebar.slider(
"Spread Factor",
0.5, 3.0, 1.0, 0.1,
help="Increase to spread apart nearby points. Higher values create more separation."
)
# Method-specific parameters
perplexity_factor = 1.0
min_dist_factor = 1.0
if method == "t-SNE":
perplexity_factor = st.sidebar.slider(
"Perplexity Factor",
0.1, 2.0, 1.0, 0.1,
help="Affects local vs global structure balance. Lower values focus on local details."
)
if method == "UMAP":
min_dist_factor = st.sidebar.slider(
"Min Distance Factor",
0.1, 2.0, 1.0, 0.1,
help="Controls how tightly points are packed. Lower values create tighter clusters."
)
return spread_factor, perplexity_factor, min_dist_factor
def create_jittering_controls():
"""Create controls for jittering options"""
apply_jittering = st.sidebar.checkbox(
"Apply Smart Jittering",
value=False,
help="Add intelligent noise to separate overlapping points"
)
jitter_strength = 0.1
density_based_jitter = True
if apply_jittering:
jitter_strength = st.sidebar.slider(
"Jitter Strength",
0.01, 0.5, 0.1, 0.01,
help="Strength of jittering. Higher values spread points more."
)
density_based_jitter = st.sidebar.checkbox(
"Density-Based Jittering",
value=True,
help="Apply stronger jittering in dense regions"
)
return apply_jittering, jitter_strength, density_based_jitter
def create_advanced_options():
"""Create advanced visualization options"""
with st.sidebar.expander("⚙️ Advanced Options"):
show_cluster_metrics = st.checkbox("Show Clustering Metrics", value=True)
point_size = st.slider("Point Size", 4, 15, 8)
point_opacity = st.slider("Point Opacity", 0.3, 1.0, 0.7)
# Density-based visualization
density_based_sizing = st.checkbox(
"Density-Based Point Sizing",
value=False,
help="Make points larger in sparse regions, smaller in dense regions"
)
size_variation = 2.0
if density_based_sizing:
size_variation = st.slider(
"Size Variation Factor",
1.5, 4.0, 2.0, 0.1,
help="How much point sizes vary based on local density"
)
return show_cluster_metrics, point_size, point_opacity, density_based_sizing, size_variation
def create_filter_controls(valid_df):
"""Create controls for filtering data by source and author"""
# Source file filter
source_files = valid_df['source_file'].unique()
selected_sources = st.sidebar.multiselect(
"Filter by Source Files",
source_files,
default=[],
help="Select which chat log files to include"
)
# Author filter
authors = valid_df['author_name'].unique()
default_authors = authors[:MAX_DISPLAYED_AUTHORS] if len(authors) > MAX_DISPLAYED_AUTHORS else authors
selected_authors = st.sidebar.multiselect(
"Filter by Authors",
authors,
default=default_authors,
help="Select which authors to include"
)
return selected_sources, selected_authors
def display_method_explanations():
"""Display explanations for different methods"""
st.sidebar.markdown("---")
with st.sidebar.expander("📚 Method Explanations"):
st.markdown("**Dimensionality Reduction:**")
for method, explanation in METHOD_EXPLANATIONS["dimension_reduction"].items():
st.markdown(f"- **{method}**: {explanation}")
st.markdown("\n**Clustering Methods:**")
for method, explanation in METHOD_EXPLANATIONS["clustering"].items():
st.markdown(f"- **{method}**: {explanation}")
st.markdown("\n**Separation Techniques:**")
for technique, explanation in METHOD_EXPLANATIONS["separation"].items():
st.markdown(f"- **{technique}**: {explanation}")
st.markdown("\n**Metrics:**")
for metric, explanation in METHOD_EXPLANATIONS["metrics"].items():
st.markdown(f"- **{metric}**: {explanation}")
def display_performance_warnings(filtered_df, method, clustering_method):
"""Display performance warnings for computationally intensive operations"""
if len(filtered_df) > LARGE_DATASET_WARNING_THRESHOLD:
if method in COMPUTATIONALLY_INTENSIVE_METHODS["dimension_reduction"]:
st.warning(f"⚠️ {method} with {len(filtered_df)} points may take several minutes to compute.")
if clustering_method in COMPUTATIONALLY_INTENSIVE_METHODS["clustering"]:
st.warning(f"⚠️ {clustering_method} with {len(filtered_df)} points may be computationally intensive.")
def get_all_ui_parameters(valid_df):
"""Get all UI parameters in a single function call"""
# Method selection
method, clustering_method, enable_3d = create_method_controls()
# Clustering parameters
n_clusters = create_clustering_controls(clustering_method)
# Separation controls
spread_factor, perplexity_factor, min_dist_factor = create_separation_controls(method)
# Jittering controls
apply_jittering, jitter_strength, density_based_jitter = create_jittering_controls()
# Advanced options
show_cluster_metrics, point_size, point_opacity, density_based_sizing, size_variation = create_advanced_options()
# Filters
selected_sources, selected_authors = create_filter_controls(valid_df)
# Method explanations
display_method_explanations()
return {
'method': method,
'clustering_method': clustering_method,
'enable_3d': enable_3d,
'n_clusters': n_clusters,
'spread_factor': spread_factor,
'perplexity_factor': perplexity_factor,
'min_dist_factor': min_dist_factor,
'apply_jittering': apply_jittering,
'jitter_strength': jitter_strength,
'density_based_jitter': density_based_jitter,
'show_cluster_metrics': show_cluster_metrics,
'point_size': point_size,
'point_opacity': point_opacity,
'density_based_sizing': density_based_sizing,
'size_variation': size_variation,
'selected_sources': selected_sources,
'selected_authors': selected_authors
}


@@ -0,0 +1,311 @@
"""
Visualization functions for creating interactive plots and displays.
"""
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import streamlit as st
from dimensionality_reduction import calculate_local_density_scaling
from config import MESSAGE_CONTENT_PREVIEW_LENGTH, DEFAULT_POINT_SIZE, DEFAULT_POINT_OPACITY
def create_hover_text(df):
"""Create hover text for plotly"""
hover_text = []
for _, row in df.iterrows():
text = f"<b>Author:</b> {row['author_name']}<br>"
text += f"<b>Timestamp:</b> {row['timestamp_utc']}<br>"
text += f"<b>Source:</b> {row['source_file']}<br>"
# Handle potential NaN or non-string content
content = row['content']
if pd.isna(content):  # pd.isna also covers None
content_text = "[No content]"
else:
content_str = str(content)
content_text = content_str[:MESSAGE_CONTENT_PREVIEW_LENGTH] + ('...' if len(content_str) > MESSAGE_CONTENT_PREVIEW_LENGTH else '')
text += f"<b>Content:</b> {content_text}"
hover_text.append(text)
return hover_text
def calculate_point_sizes(reduced_embeddings, density_based_sizing=False,
point_size=DEFAULT_POINT_SIZE, size_variation=2.0):
"""Calculate point sizes based on density if enabled"""
if not density_based_sizing:
return [point_size] * len(reduced_embeddings)
local_densities = calculate_local_density_scaling(reduced_embeddings)
# Invert densities so sparse areas get larger points
inverted_densities = 1.0 - local_densities
# Scale point sizes
point_sizes = point_size * (1.0 + inverted_densities * (size_variation - 1.0))
return point_sizes
def create_clustered_plot(reduced_embeddings, filtered_df, cluster_labels, hover_text,
point_sizes, point_opacity=DEFAULT_POINT_OPACITY, method="PCA", enable_3d=False,
cluster_names=None):
"""Create a plot colored by clusters"""
fig = go.Figure()
unique_clusters = np.unique(cluster_labels)
colors = px.colors.qualitative.Set3 + px.colors.qualitative.Pastel
for i, cluster_id in enumerate(unique_clusters):
cluster_mask = cluster_labels == cluster_id
if cluster_mask.any():
cluster_embeddings = reduced_embeddings[cluster_mask]
cluster_hover = [hover_text[j] for j, mask in enumerate(cluster_mask) if mask]
cluster_sizes = [point_sizes[j] for j, mask in enumerate(cluster_mask) if mask]
# Use generated name if available, otherwise fall back to default
if cluster_names and cluster_id in cluster_names:
cluster_name = cluster_names[cluster_id]
else:
cluster_name = f"Cluster {cluster_id}" if cluster_id != -1 else "Noise"
if enable_3d:
fig.add_trace(go.Scatter3d(
x=cluster_embeddings[:, 0],
y=cluster_embeddings[:, 1],
z=cluster_embeddings[:, 2],
mode='markers',
name=cluster_name,
marker=dict(
size=cluster_sizes,
color=colors[i % len(colors)],
opacity=point_opacity,
line=dict(width=1, color='white')
),
hovertemplate='%{hovertext}<extra></extra>',
hovertext=cluster_hover
))
else:
fig.add_trace(go.Scatter(
x=cluster_embeddings[:, 0],
y=cluster_embeddings[:, 1],
mode='markers',
name=cluster_name,
marker=dict(
size=cluster_sizes,
color=colors[i % len(colors)],
opacity=point_opacity,
line=dict(width=1, color='white')
),
hovertemplate='%{hovertext}<extra></extra>',
hovertext=cluster_hover
))
return fig
def create_source_colored_plot(reduced_embeddings, filtered_df, selected_sources, hover_text,
point_sizes, point_opacity=DEFAULT_POINT_OPACITY, enable_3d=False):
"""Create a plot colored by source files"""
fig = go.Figure()
colors = px.colors.qualitative.Set1
for i, source in enumerate(selected_sources):
source_mask = filtered_df['source_file'] == source
if source_mask.any():
source_embeddings = reduced_embeddings[source_mask]
source_hover = [hover_text[j] for j, mask in enumerate(source_mask) if mask]
source_sizes = [point_sizes[j] for j, mask in enumerate(source_mask) if mask]
if enable_3d:
fig.add_trace(go.Scatter3d(
x=source_embeddings[:, 0],
y=source_embeddings[:, 1],
z=source_embeddings[:, 2],
mode='markers',
name=source,
marker=dict(
size=source_sizes,
color=colors[i % len(colors)],
opacity=point_opacity,
line=dict(width=1, color='white')
),
hovertemplate='%{hovertext}<extra></extra>',
hovertext=source_hover
))
else:
fig.add_trace(go.Scatter(
x=source_embeddings[:, 0],
y=source_embeddings[:, 1],
mode='markers',
name=source,
marker=dict(
size=source_sizes,
color=colors[i % len(colors)],
opacity=point_opacity,
line=dict(width=1, color='white')
),
hovertemplate='%{hovertext}<extra></extra>',
hovertext=source_hover
))
return fig
def create_visualization_plot(reduced_embeddings, filtered_df, cluster_labels=None,
selected_sources=None, method="PCA", clustering_method="None",
point_size=DEFAULT_POINT_SIZE, point_opacity=DEFAULT_POINT_OPACITY,
density_based_sizing=False, size_variation=2.0, enable_3d=False,
cluster_names=None):
"""Create the main visualization plot"""
# Create hover text
hover_text = create_hover_text(filtered_df)
# Calculate point sizes
point_sizes = calculate_point_sizes(reduced_embeddings, density_based_sizing,
point_size, size_variation)
# Create plot based on coloring strategy
if cluster_labels is not None:
fig = create_clustered_plot(reduced_embeddings, filtered_df, cluster_labels,
hover_text, point_sizes, point_opacity, method, enable_3d,
cluster_names)
else:
if selected_sources is None:
selected_sources = filtered_df['source_file'].unique()
fig = create_source_colored_plot(reduced_embeddings, filtered_df, selected_sources,
hover_text, point_sizes, point_opacity, enable_3d)
# Update layout
title_suffix = f" with {clustering_method}" if clustering_method != "None" else ""
dimension_text = "3D" if enable_3d else "2D"
if enable_3d:
fig.update_layout(
title=f"Discord Chat Messages - {method} {dimension_text} Visualization{title_suffix}",
scene=dict(
xaxis_title=f"{method} Component 1",
yaxis_title=f"{method} Component 2",
zaxis_title=f"{method} Component 3"
),
width=1000,
height=700
)
else:
fig.update_layout(
title=f"Discord Chat Messages - {method} {dimension_text} Visualization{title_suffix}",
xaxis_title=f"{method} Component 1",
yaxis_title=f"{method} Component 2",
hovermode='closest',
width=1000,
height=700
)
return fig
def display_clustering_metrics(cluster_labels, silhouette_avg, calinski_harabasz, show_metrics=True):
"""Display clustering quality metrics"""
if cluster_labels is not None and show_metrics:
col1, col2, col3 = st.columns(3)
with col1:
n_clusters_found = len(np.unique(cluster_labels[cluster_labels != -1]))
st.metric("Clusters Found", n_clusters_found)
with col2:
if silhouette_avg is not None:
st.metric("Silhouette Score", f"{silhouette_avg:.3f}")
else:
st.metric("Silhouette Score", "N/A")
with col3:
if calinski_harabasz is not None:
st.metric("Calinski-Harabasz Index", f"{calinski_harabasz:.1f}")
else:
st.metric("Calinski-Harabasz Index", "N/A")
def display_summary_stats(filtered_df, selected_sources):
"""Display summary statistics"""
col1, col2, col3 = st.columns(3)
with col1:
st.metric("Total Messages", len(filtered_df))
with col2:
st.metric("Unique Authors", filtered_df['author_name'].nunique())
with col3:
st.metric("Source Files", len(selected_sources))
def display_clustering_results(filtered_df, cluster_labels, reduced_embeddings, method, clustering_method, enable_3d=False):
"""Display clustering results and export options"""
if cluster_labels is None:
return
st.subheader("📊 Clustering Results")
# Add cluster information to dataframe for export
export_df = filtered_df.copy()
export_df['cluster_id'] = cluster_labels
export_df['x_coordinate'] = reduced_embeddings[:, 0]
export_df['y_coordinate'] = reduced_embeddings[:, 1]
# Add z coordinate if 3D
if enable_3d and reduced_embeddings.shape[1] >= 3:
export_df['z_coordinate'] = reduced_embeddings[:, 2]
# Show cluster distribution
cluster_dist = pd.Series(cluster_labels).value_counts().sort_index()
st.bar_chart(cluster_dist)
# Download option
csv_data = export_df.to_csv(index=False)
dimension_text = "3D" if enable_3d else "2D"
st.download_button(
label="📥 Download Clustering Results (CSV)",
data=csv_data,
file_name=f"chat_clusters_{method}_{clustering_method}_{dimension_text}.csv",
mime="text/csv"
)
def display_data_table(filtered_df, cluster_labels=None):
"""Display the data table with optional clustering information"""
if not st.checkbox("Show Data Table"):
return
st.subheader("📋 Message Data")
display_df = filtered_df[['timestamp_utc', 'author_name', 'source_file', 'content']].copy()
# Add clustering info if available
if cluster_labels is not None:
display_df['cluster'] = cluster_labels
display_df['content'] = display_df['content'].astype(str).apply(lambda s: s[:100] + '...' if len(s) > 100 else s)  # Truncate long messages for display; leave short ones unchanged
st.dataframe(display_df, use_container_width=True)
def display_cluster_summary(cluster_names, cluster_labels):
"""Display a summary of cluster names and their sizes"""
if not cluster_names or cluster_labels is None:
return
st.subheader("🏷️ Cluster Summary")
# Create summary data
cluster_summary = []
for cluster_id, name in cluster_names.items():
count = np.sum(cluster_labels == cluster_id)
cluster_summary.append({
'Cluster ID': cluster_id,
'Cluster Name': name,
'Message Count': count,
'Percentage': f"{100 * count / len(cluster_labels):.1f}%"
})
# Sort by message count
cluster_summary.sort(key=lambda x: x['Message Count'], reverse=True)
# Display as table
summary_df = pd.DataFrame(cluster_summary)
st.dataframe(summary_df, use_container_width=True, hide_index=True)