Compare commits

...

17 Commits

Author SHA1 Message Date
ce906e4f9a udpated perplexity factor 2025-08-11 16:11:21 +01:00
fd9b25f256 updated readme 2025-08-11 03:07:44 +01:00
2b8659fc95 beter clusters and qol 2025-08-11 03:04:50 +01:00
647111e9d3 3d viz 2025-08-11 02:49:41 +01:00
4ca7e8ab61 refactor 2025-08-11 02:37:21 +01:00
6d35b42b27 updated reqs from clusteing 2025-08-11 02:22:59 +01:00
248cc5765f clustermap app 2025-08-11 01:59:48 +01:00
80c115b57d embedded datasets 2025-08-11 01:51:43 +01:00
aa9f2dc618 updated embedder 2025-08-11 01:51:34 +01:00
fb3fb70cc5 text embedding script and class 2025-08-11 01:47:52 +01:00
7ca86d7751 image viewer app 2025-08-11 01:35:14 +01:00
245cc81289 images dataset 2025-08-11 01:22:03 +01:00
ba528a3806 image downloader +read me 2025-08-11 01:21:35 +01:00
e22705600a DATASETS 2025-08-11 01:10:41 +01:00
9aaad019a5 added new bot script 2025-08-11 01:10:36 +01:00
45190bd0ff env 2025-08-11 01:10:24 +01:00
458a8c4881 updated bot dir 2025-08-11 01:10:17 +01:00
31 changed files with 14020 additions and 55 deletions

.env Normal file

@@ -0,0 +1 @@
MTQwNDI0NTI1MTk4Nzg2OTgyOA.G_GnSa.wsi4qZ_4F40EU19wxfRLA3UG521_r9TSxOL4Q0


@@ -0,0 +1,98 @@
# Discord Image Downloader
This script processes Discord chat log CSV files to download and convert images to a base64 dataset.
## Features
- Parses all CSV files in the `discord_chat_logs/` directory
- Extracts attachment URLs from the `attachment_urls` column
- Downloads images using wget-like functionality (via Python requests)
- Converts images to base64 format for easy storage and processing
- Saves metadata including channel, sender, timestamp, and message context
- Handles Discord CDN URLs with query parameters
- Implements retry logic and rate limiting
- Deduplicates images based on URL hash
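The README doesn't show how the URL-hash deduplication is implemented; below is a minimal sketch of the idea, assuming a SHA-256 over the URL with Discord's expiring CDN query parameters stripped (the helper names `url_hash` and `is_duplicate` are illustrative, not taken from the script):

```python
import hashlib

seen_hashes = set()

def url_hash(url):
    # Drop query parameters (?ex=...&is=...) so the same attachment fetched
    # with different, expiring CDN tokens maps to one stable key.
    base = url.split("?", 1)[0]
    return hashlib.sha256(base.encode("utf-8")).hexdigest()[:12]

def is_duplicate(url):
    """Return True if this attachment URL was already seen."""
    h = url_hash(url)
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False
```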
## Setup
1. Install dependencies:
```bash
./setup.sh
```
Or manually:
```bash
pip3 install -r requirements.txt
```
2. Run the image downloader:
```bash
cd scripts
python3 image_downloader.py
```
## Output
The script creates an `images_dataset/` directory containing:
- `images_dataset.json` - Complete dataset with images in base64 format
### Dataset Structure
```json
{
"metadata": {
"created_at": "2025-08-11 12:34:56 UTC",
"summary": {
"total_images": 42,
"channels": ["memes", "general", "nsfw"],
"total_size_bytes": 1234567,
"file_extensions": [".png", ".jpg", ".gif"],
"authors": ["user1", "user2"]
}
},
"images": [
{
"url": "https://cdn.discordapp.com/attachments/...",
"channel": "memes",
"author_name": "username",
"author_nickname": "User Nickname",
"author_id": "123456789",
"message_id": "987654321",
"timestamp_utc": "2020-03-11 18:25:49.086000+00:00",
"content": "Message text content",
"file_extension": ".png",
"file_size": 54321,
"url_hash": "abc123def456",
"base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
}
]
}
```
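To consume the dataset downstream, each `base64_data` field just needs decoding back to raw bytes. A small sketch (the path matches the output layout above; error handling omitted):

```python
import base64
import json

def load_images(path="images_dataset/images_dataset.json"):
    """Yield (channel, extension, raw_bytes) for every image in the dataset."""
    with open(path) as f:
        dataset = json.load(f)
    for entry in dataset["images"]:
        yield entry["channel"], entry["file_extension"], base64.b64decode(entry["base64_data"])
```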
## Supported Image Formats
- PNG (.png)
- JPEG (.jpg, .jpeg)
- GIF (.gif)
- WebP (.webp)
- BMP (.bmp)
- TIFF (.tiff)
## Configuration
You can modify the following variables in `image_downloader.py`:
- `MAX_RETRIES` - Number of download retry attempts (default: 3)
- `DELAY_BETWEEN_REQUESTS` - Delay between requests in seconds (default: 0.5)
- `SUPPORTED_EXTENSIONS` - Set of supported image file extensions
## Error Handling
The script includes robust error handling:
- Skips non-image URLs
- Retries failed downloads with exponential backoff
- Validates content types from server responses
- Continues processing even if individual downloads fail
- Logs all activities and errors to console
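The retry-with-exponential-backoff behavior described above can be factored generically. A sketch — the script's internals aren't shown here, so `with_backoff` is an illustrative helper, sleeping 0.5 s, 1 s, 2 s across the default three attempts:

```python
import time

def with_backoff(fn, max_retries=3, base_delay=0.5):
    """Call fn(); on failure, retry with exponentially growing delays."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * (2 ** attempt))
```

Usage would look like `image_bytes = with_backoff(lambda: download(url))`, keeping the download logic itself free of retry bookkeeping.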

README.md

@@ -1,2 +1,281 @@
# cult-scraper
# Discord Data Analysis & Visualization Suite
A comprehensive toolkit for scraping, processing, and analyzing Discord chat data with advanced visualization capabilities.
## 🌟 Features
### 📥 Data Collection
- **Discord Bot Scraper**: Automated extraction of complete message history from Discord servers
- **Image Downloader**: Downloads and processes images from Discord attachments with base64 conversion
- **Text Embeddings**: Generate semantic embeddings for chat messages using sentence transformers
### 📊 Visualization & Analysis
- **Interactive Chat Visualizer**: 2D visualization of chat messages using dimensionality reduction (PCA, t-SNE)
- **Clustering Analysis**: Automated grouping of similar messages with DBSCAN and HDBSCAN
- **Image Dataset Viewer**: Browse and explore downloaded images by channel
### 🔧 Data Processing
- **Batch Processing**: Process multiple CSV files with embeddings
- **Metadata Extraction**: Comprehensive message metadata including timestamps, authors, and content
- **Data Filtering**: Advanced filtering by authors, channels, and timeframes
## 📁 Repository Structure
```
cult-scraper-1/
├── scripts/ # Core data collection scripts
│ ├── bot.py # Discord bot for message scraping
│ ├── image_downloader.py # Download and convert Discord images
│ ├── embedder.py # Batch text embedding processor
│ └── embed_class.py # Text embedding utilities
├── apps/ # Interactive applications
│ ├── cluster_map/ # Chat message clustering & visualization
│ │ ├── main.py # Main Streamlit application
│ │ ├── data_loader.py # Data loading utilities
│ │ ├── clustering.py # Clustering algorithms
│ │ ├── visualization.py # Plotting and visualization
│ │ └── requirements.txt # Dependencies
│ └── image_viewer/ # Image dataset browser
│ ├── image_viewer.py # Streamlit image viewer
│ └── requirements.txt # Dependencies
├── discord_chat_logs/ # Exported CSV files from Discord
└── images_dataset/ # Downloaded images and metadata
└── images_dataset.json # Image dataset with base64 data
```
## 🚀 Quick Start
### 1. Discord Data Scraping
First, set up and run the Discord bot to collect message data:
```bash
cd scripts
# Configure your bot token in bot.py
python bot.py
```
**Requirements:**
- Discord bot token with message content intent enabled
- Bot must have read permissions in target channels
### 2. Generate Text Embeddings
Process the collected chat data to add semantic embeddings:
```bash
cd scripts
python embedder.py
```
This will:
- Process all CSV files in `discord_chat_logs/`
- Add embeddings to message content using sentence transformers
- Save updated files with embedding vectors
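The embedding step presumably wraps something like the following. This is a sketch: `embed` is a stand-in so the example is self-contained, where the real `embedder.py` would call `SentenceTransformer("all-MiniLM-L6-v2").encode(...)`; the stored format matches the CSV column shown later (string form of a list):

```python
import pandas as pd

def embed(texts):
    # Stand-in for: SentenceTransformer("all-MiniLM-L6-v2").encode(texts).tolist()
    return [[float(len(t)), 0.0] for t in texts]

def add_embeddings(df):
    """Add a content_embedding column, stored as the string form of a list."""
    vectors = embed(df["content"].fillna("").tolist())
    df["content_embedding"] = [str(v) for v in vectors]
    return df
```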
### 3. Download Images
Extract and download images from Discord attachments:
```bash
cd scripts
python image_downloader.py
```
Features:
- Downloads images from attachment URLs
- Converts to base64 for storage
- Handles multiple image formats (PNG, JPG, GIF, WebP, etc.)
- Implements retry logic and rate limiting
### 4. Visualize Chat Data
Launch the interactive chat visualization tool:
```bash
cd apps/cluster_map
pip install -r requirements.txt
streamlit run main.py
```
**Capabilities:**
- 2D visualization using PCA or t-SNE
- Interactive clustering with DBSCAN/HDBSCAN
- Filter by channels, authors, and time periods
- Hover to see message content and metadata
### 5. Browse Image Dataset
View downloaded images in an organized interface:
```bash
cd apps/image_viewer
pip install -r requirements.txt
streamlit run image_viewer.py
```
**Features:**
- Channel-based organization
- Navigation controls (previous/next)
- Image metadata display
- Responsive layout
## 📋 Data Formats
### Discord Chat Logs (CSV)
```csv
message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds,content_embedding
1234567890,2025-08-11 12:34:56,9876543210,username,nickname,"Hello world!","https://cdn.discordapp.com/...",{},"[0.123, -0.456, ...]"
```
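Since `content_embedding` is stored as the string form of a list, consumers have to parse it back into a vector. A small sketch, consistent with how `data_loader.py` handles the column:

```python
import ast

import numpy as np

def parse_embedding(cell):
    """Parse one stringified embedding; return a float array, or None if invalid."""
    try:
        vec = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return None
    if isinstance(vec, list) and vec:
        return np.asarray(vec, dtype=float)
    return None
```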
### Image Dataset (JSON)
```json
{
"metadata": {
"created_at": "2025-08-11 12:34:56 UTC",
"summary": {
"total_images": 42,
"channels": ["memes", "general"],
"total_size_bytes": 1234567,
"file_extensions": [".png", ".jpg"],
"authors": ["user1", "user2"]
}
},
"images": [
{
"url": "https://cdn.discordapp.com/attachments/...",
"channel": "memes",
"author_name": "username",
"timestamp_utc": "2025-08-11 12:34:56+00:00",
"content": "Message text",
"file_extension": ".png",
"file_size": 54321,
"base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
}
]
}
```
## 🔧 Configuration
### Discord Bot Setup
1. Create a Discord application at https://discord.com/developers/applications
2. Create a bot and copy the token
3. Enable the following intents:
- Message Content Intent
- Server Members Intent (optional)
4. Invite bot to your server with appropriate permissions
### Environment Variables
```bash
# Set in scripts/bot.py
BOT_TOKEN = "your_discord_bot_token_here"
```
### Embedding Models
The system uses sentence-transformers models. Default: `all-MiniLM-L6-v2`
Supported models:
- `all-MiniLM-L6-v2` (lightweight, fast)
- `all-mpnet-base-v2` (higher quality)
- `sentence-transformers/all-roberta-large-v1` (best quality, slower)
## 📊 Visualization Features
### Chat Message Clustering
- **Dimensionality Reduction**: PCA, t-SNE, UMAP
- **Clustering Algorithms**: DBSCAN, HDBSCAN with automatic parameter tuning
- **Interactive Controls**: Filter by source files, authors, and clusters
- **Hover Information**: View message content, author, timestamp on hover
### Image Analysis
- **Channel Organization**: Browse images by Discord channel
- **Metadata Display**: Author, timestamp, message context
- **Navigation**: Previous/next controls with slider
- **Format Support**: PNG, JPG, GIF, WebP, BMP, TIFF
## 🛠️ Dependencies
### Core Scripts
- `discord.py` - Discord bot framework
- `pandas` - Data manipulation
- `sentence-transformers` - Text embeddings
- `requests` - HTTP requests for image downloads
### Visualization Apps
- `streamlit` - Web interface framework
- `plotly` - Interactive plotting
- `scikit-learn` - Machine learning algorithms
- `numpy` - Numerical computations
- `umap-learn` - Dimensionality reduction
- `hdbscan` - Density-based clustering
## 📈 Use Cases
### Research & Analytics
- **Community Analysis**: Understand conversation patterns and topics
- **Sentiment Analysis**: Track mood and sentiment over time
- **User Behavior**: Analyze posting patterns and engagement
- **Content Moderation**: Identify problematic content clusters
### Data Science Projects
- **NLP Research**: Experiment with text embeddings and clustering
- **Social Network Analysis**: Study communication patterns
- **Visualization Techniques**: Explore dimensionality reduction methods
- **Image Processing**: Analyze visual content sharing patterns
### Content Management
- **Archive Creation**: Preserve Discord community history
- **Content Discovery**: Find similar messages and discussions
- **Moderation Tools**: Identify spam or inappropriate content
- **Backup Solutions**: Create comprehensive data backups
## 🔒 Privacy & Ethics
- **Data Protection**: All processing happens locally
- **User Consent**: Ensure proper permissions before scraping
- **Compliance**: Follow Discord's Terms of Service
- **Anonymization**: Consider removing or hashing user IDs for research
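For the anonymization point, a salted one-way hash keeps user IDs joinable across files without keeping them recoverable. A sketch (the salt value is a placeholder you would replace and keep secret):

```python
import hashlib

def anonymize_id(author_id, salt="replace-with-a-secret-salt"):
    """Deterministic, irreversible pseudonym for a Discord user ID."""
    return hashlib.sha256((salt + str(author_id)).encode()).hexdigest()[:16]
```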
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request
## 📄 License
This project is intended for educational and research purposes. Please ensure compliance with Discord's Terms of Service and applicable privacy laws when using this toolkit.
## 🆘 Troubleshooting
### Common Issues
**Bot can't read messages:**
- Ensure Message Content Intent is enabled
- Check bot permissions in Discord server
- Verify bot token is correct
**Embeddings not generating:**
- Install sentence-transformers: `pip install sentence-transformers`
- Check available GPU memory for large models
- Try a smaller model like `all-MiniLM-L6-v2`
**Images not downloading:**
- Check internet connectivity
- Verify Discord CDN URLs are accessible
- Increase retry limits for unreliable connections
**Visualization not loading:**
- Ensure all requirements are installed
- Check that CSV files have embeddings
- Try reducing dataset size for better performance
## 📚 Additional Resources
- [Discord.py Documentation](https://discordpy.readthedocs.io/)
- [Sentence Transformers Models](https://www.sbert.net/docs/pretrained_models.html)
- [Streamlit Documentation](https://docs.streamlit.io/)
- [scikit-learn Clustering](https://scikit-learn.org/stable/modules/clustering.html)


@@ -0,0 +1,58 @@
# Discord Chat Embeddings Visualizer
A Streamlit application that visualizes Discord chat messages using their vector embeddings in 2D space.
## Features
- **2D Visualization**: View chat messages plotted using PCA or t-SNE dimension reduction
- **Interactive Plotting**: Hover over points to see message content, author, and timestamp
- **Filtering**: Filter by source chat log files and authors
- **Multiple Datasets**: Automatically loads all CSV files from the discord_chat_logs folder
## Installation
1. Install the required dependencies:
```bash
pip install -r requirements.txt
```
## Usage
Run the Streamlit application:
```bash
streamlit run streamlit_app.py
```
The app will automatically load all CSV files from the `../../discord_chat_logs/` directory.
## Data Format
The application expects CSV files with the following columns:
- `message_id`: Unique identifier for the message
- `timestamp_utc`: When the message was sent
- `author_id`: Author's Discord ID
- `author_name`: Author's username
- `author_nickname`: Author's server nickname
- `content`: The message content
- `attachment_urls`: Any attached files
- `embeds`: Embedded content
- `content_embedding`: Vector embedding of the message content (as a string representation of a list)
## Visualization Options
- **PCA**: Principal Component Analysis - faster, good for getting an overview
- **t-SNE**: t-Distributed Stochastic Neighbor Embedding - slower but may reveal better clusters
## Controls
- **Dimension Reduction Method**: Choose between PCA and t-SNE
- **Filter by Source Files**: Select which chat log files to include
- **Filter by Authors**: Select which authors to display
- **Show Data Table**: View the underlying data in table format
## Performance Notes
- For large datasets, consider filtering by authors or source files to improve performance
- t-SNE is computationally intensive and may take longer with large datasets
- The app caches data and computations for better performance


@@ -0,0 +1,12 @@
"""
Discord Chat Embeddings Visualizer - Legacy Entry Point
This file serves as a compatibility layer for the original cluster.py.
The application has been refactored into modular components for better maintainability.
"""
# Import and run the main application
from main import main

if __name__ == "__main__":
    main()


@@ -0,0 +1,226 @@
"""
Clustering algorithms and evaluation metrics.
"""
import numpy as np
import streamlit as st
from sklearn.cluster import SpectralClustering, AgglomerativeClustering, OPTICS
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, calinski_harabasz_score
import hdbscan
import pandas as pd
from collections import Counter
import re
from config import DEFAULT_RANDOM_STATE
def summarize_cluster_content(cluster_messages, max_words=3):
    """
    Generate a meaningful name for a cluster based on its message content.

    Args:
        cluster_messages: List of message contents in the cluster
        max_words: Maximum number of words in the cluster name

    Returns:
        str: Generated cluster name
    """
    if not cluster_messages:
        return "Empty Cluster"
    # Combine all messages and clean text
    all_text = " ".join([str(msg) for msg in cluster_messages if pd.notna(msg)])
    if not all_text.strip():
        return "Empty Content"
    # Basic text cleaning
    text = all_text.lower()
    # Remove URLs, mentions, and special characters
    text = re.sub(r'http[s]?://\S+', '', text)  # Remove URLs
    text = re.sub(r'<@\d+>', '', text)          # Remove Discord mentions
    text = re.sub(r'<:\w+:\d+>', '', text)      # Remove custom emojis
    text = re.sub(r'[^\w\s]', ' ', text)        # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()    # Normalize whitespace
    if not text:
        return "Special Characters"
    # Split into words and filter out common words
    words = text.split()
    # Common stop words to filter out
    stop_words = {
        'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with',
        'by', 'from', 'up', 'about', 'into', 'through', 'during', 'before', 'after',
        'above', 'below', 'between', 'among', 'until', 'without', 'under', 'over',
        'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might',
        'i', 'you', 'he', 'she', 'it', 'we', 'they', 'me', 'him', 'her', 'us', 'them',
        'my', 'your', 'his', 'her', 'its', 'our', 'their', 'this', 'that', 'these', 'those',
        'just', 'like', 'get', 'know', 'think', 'see', 'go', 'come', 'say', 'said',
        'yeah', 'yes', 'no', 'oh', 'ok', 'okay', 'well', 'so', 'but', 'if', 'when',
        'what', 'where', 'why', 'how', 'who', 'which', 'than', 'then', 'now', 'here',
        'there', 'also', 'too', 'very', 'really', 'pretty', 'much', 'more', 'most',
        'some', 'any', 'all', 'many', 'few', 'little', 'big', 'small', 'good', 'bad'
    }
    # Filter out stop words and very short/long words
    filtered_words = [
        word for word in words
        if word not in stop_words
        and 3 <= len(word) <= 15
        and word.isalpha()  # Only alphabetic words
    ]
    if not filtered_words:
        return f"Chat ({len(cluster_messages)} msgs)"
    # Count word frequencies
    word_counts = Counter(filtered_words)
    # Get most common words (more than needed, to allow filtering below)
    most_common = word_counts.most_common(max_words * 2)
    # Select diverse words (avoid very similar words)
    selected_words = []
    for word, count in most_common:
        # Avoid adding very similar words
        if not any(word.startswith(existing[:4]) or existing.startswith(word[:4])
                   for existing in selected_words):
            selected_words.append(word)
        if len(selected_words) >= max_words:
            break
    if not selected_words:
        return f"Discussion ({len(cluster_messages)} msgs)"
    # Create cluster name and add message count for context
    cluster_name = " + ".join(selected_words[:max_words]).title()
    cluster_name += f" ({len(cluster_messages)})"
    return cluster_name


def generate_cluster_names(filtered_df, cluster_labels):
    """
    Generate names for all clusters based on their content.

    Args:
        filtered_df: DataFrame with message data
        cluster_labels: Array of cluster labels for each message

    Returns:
        dict: Mapping from cluster_id to cluster_name
    """
    if cluster_labels is None:
        return {}
    cluster_names = {}
    unique_clusters = np.unique(cluster_labels)
    for cluster_id in unique_clusters:
        if cluster_id == -1:
            cluster_names[cluster_id] = "Noise/Outliers"
            continue
        # Get messages in this cluster
        cluster_mask = cluster_labels == cluster_id
        cluster_messages = filtered_df[cluster_mask]['content'].tolist()
        # Generate name
        cluster_names[cluster_id] = summarize_cluster_content(cluster_messages)
    return cluster_names


def apply_clustering(embeddings, clustering_method="None", n_clusters=5):
    """
    Apply clustering algorithm to embeddings and return labels and metrics.

    Args:
        embeddings: High-dimensional embeddings to cluster
        clustering_method: Name of clustering algorithm
        n_clusters: Number of clusters (for methods that require it)

    Returns:
        tuple: (cluster_labels, silhouette_score, calinski_harabasz_score)
    """
    if clustering_method == "None" or len(embeddings) <= n_clusters:
        return None, None, None
    # Standardize embeddings for better clustering
    scaler = StandardScaler()
    scaled_embeddings = scaler.fit_transform(embeddings)
    cluster_labels = None
    silhouette_avg = None
    calinski_harabasz = None
    try:
        if clustering_method == "HDBSCAN":
            min_cluster_size = max(2, len(embeddings) // 20)  # Adaptive min cluster size
            clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                        min_samples=1, cluster_selection_epsilon=0.5)
            cluster_labels = clusterer.fit_predict(scaled_embeddings)
        elif clustering_method == "Spectral Clustering":
            clusterer = SpectralClustering(n_clusters=n_clusters, random_state=DEFAULT_RANDOM_STATE,
                                           affinity='rbf', gamma=1.0)
            cluster_labels = clusterer.fit_predict(scaled_embeddings)
        elif clustering_method == "Gaussian Mixture":
            clusterer = GaussianMixture(n_components=n_clusters, random_state=DEFAULT_RANDOM_STATE,
                                        covariance_type='full', max_iter=200)
            cluster_labels = clusterer.fit_predict(scaled_embeddings)
        elif clustering_method == "Agglomerative (Ward)":
            clusterer = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
            cluster_labels = clusterer.fit_predict(scaled_embeddings)
        elif clustering_method == "Agglomerative (Complete)":
            clusterer = AgglomerativeClustering(n_clusters=n_clusters, linkage='complete')
            cluster_labels = clusterer.fit_predict(scaled_embeddings)
        elif clustering_method == "OPTICS":
            min_samples = max(2, len(embeddings) // 50)
            clusterer = OPTICS(min_samples=min_samples, xi=0.05, min_cluster_size=0.1)
            cluster_labels = clusterer.fit_predict(scaled_embeddings)
        # Calculate clustering quality metrics
        if cluster_labels is not None and len(np.unique(cluster_labels)) > 1:
            # Only calculate if we have multiple clusters and no noise-only clustering
            valid_labels = cluster_labels[cluster_labels != -1]  # Remove noise points for HDBSCAN/OPTICS
            valid_embeddings = scaled_embeddings[cluster_labels != -1]
            if len(valid_labels) > 0 and len(np.unique(valid_labels)) > 1:
                silhouette_avg = silhouette_score(valid_embeddings, valid_labels)
                calinski_harabasz = calinski_harabasz_score(valid_embeddings, valid_labels)
    except Exception as e:
        st.warning(f"Clustering failed: {str(e)}")
        cluster_labels = None
    return cluster_labels, silhouette_avg, calinski_harabasz


def get_cluster_statistics(cluster_labels):
    """Get basic statistics about clustering results"""
    if cluster_labels is None:
        return {}
    unique_clusters = np.unique(cluster_labels)
    n_clusters = len(unique_clusters[unique_clusters != -1])  # Exclude noise cluster (-1)
    n_noise = np.sum(cluster_labels == -1)
    return {
        "n_clusters": n_clusters,
        "n_noise_points": n_noise,
        "cluster_distribution": np.bincount(cluster_labels[cluster_labels != -1]) if n_clusters > 0 else [],
        "unique_clusters": unique_clusters
    }


@@ -0,0 +1,75 @@
"""
Configuration settings and constants for the Discord Chat Embeddings Visualizer.
"""
# Application settings
APP_TITLE = "The Cult - Visualised"
APP_ICON = "🗨️"
APP_LAYOUT = "wide"
# File paths
CHAT_LOGS_PATH = "../../discord_chat_logs"
# Algorithm parameters
DEFAULT_RANDOM_STATE = 42
DEFAULT_N_COMPONENTS = 2
DEFAULT_N_CLUSTERS = 5
DEFAULT_DIMENSION_REDUCTION_METHOD = "t-SNE"
DEFAULT_CLUSTERING_METHOD = "None"
# Visualization settings
DEFAULT_POINT_SIZE = 8
DEFAULT_POINT_OPACITY = 0.7
MAX_DISPLAYED_AUTHORS = 10
MESSAGE_CONTENT_PREVIEW_LENGTH = 200
MESSAGE_CONTENT_DISPLAY_LENGTH = 100
# Performance thresholds
LARGE_DATASET_WARNING_THRESHOLD = 1000
# Color palettes
PRIMARY_COLORS = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
                  "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]

# Clustering method categories
CLUSTERING_METHODS_REQUIRING_N_CLUSTERS = [
    "Spectral Clustering",
    "Gaussian Mixture",
    "Agglomerative (Ward)",
    "Agglomerative (Complete)"
]
COMPUTATIONALLY_INTENSIVE_METHODS = {
    "dimension_reduction": ["t-SNE", "Spectral Embedding"],
    "clustering": ["Spectral Clustering", "OPTICS"]
}

# Method explanations
METHOD_EXPLANATIONS = {
    "dimension_reduction": {
        "PCA": "Linear, fast, preserves global variance",
        "t-SNE": "Non-linear, good for local structure, slower",
        "UMAP": "Balanced speed/quality, preserves local & global structure",
        "Spectral Embedding": "Uses graph theory, good for non-convex clusters",
        "Force-Directed": "Physics-based layout, creates natural spacing"
    },
    "clustering": {
        "HDBSCAN": "Density-based, finds variable density clusters, handles noise",
        "Spectral Clustering": "Uses eigenvalues, good for non-convex shapes",
        "Gaussian Mixture": "Probabilistic, assumes gaussian distributions",
        "Agglomerative (Ward)": "Hierarchical, minimizes within-cluster variance",
        "Agglomerative (Complete)": "Hierarchical, minimizes maximum distance",
        "OPTICS": "Density-based, finds clusters of varying densities"
    },
    "separation": {
        "Spread Factor": "Applies repulsive forces between nearby points",
        "Smart Jittering": "Adds intelligent noise to separate overlapping points",
        "Density-Based Jittering": "Stronger separation in crowded areas",
        "Perplexity Factor": "Controls t-SNE's focus on local vs global structure",
        "Min Distance Factor": "Controls UMAP's point packing tightness"
    },
    "metrics": {
        "Silhouette Score": "Higher is better (range: -1 to 1)",
        "Calinski-Harabasz": "Higher is better, measures cluster separation"
    }
}


@@ -0,0 +1,86 @@
"""
Data loading and parsing utilities for Discord chat logs.
"""
import pandas as pd
import numpy as np
import streamlit as st
import ast
from pathlib import Path
from config import CHAT_LOGS_PATH
@st.cache_data
def load_all_chat_data():
    """Load all CSV files from the discord_chat_logs folder"""
    chat_logs_path = Path(CHAT_LOGS_PATH)
    all_data = []
    with st.expander("📁 Loading Details", expanded=False):
        # Display the path for debugging
        st.write(f"Looking for CSV files in: {chat_logs_path}")
        st.write(f"Path exists: {chat_logs_path.exists()}")
        for csv_file in chat_logs_path.glob("*.csv"):
            try:
                df = pd.read_csv(csv_file)
                df['source_file'] = csv_file.stem  # Add source file name
                all_data.append(df)
                st.write(f"✅ Loaded {len(df)} messages from {csv_file.name}")
            except Exception as e:
                st.error(f"❌ Error loading {csv_file.name}: {e}")
        if all_data:
            combined_df = pd.concat(all_data, ignore_index=True)
            st.success(f"🎉 Successfully loaded {len(combined_df)} total messages from {len(all_data)} files")
        else:
            st.error("No data loaded!")
            combined_df = pd.DataFrame()
    return combined_df if all_data else pd.DataFrame()


@st.cache_data
def parse_embeddings(df):
    """Parse the content_embedding column from string to numpy array"""
    embeddings = []
    valid_indices = []
    for idx, embedding_str in enumerate(df['content_embedding']):
        try:
            # Parse the string representation of the list
            embedding = ast.literal_eval(embedding_str)
            if isinstance(embedding, list) and len(embedding) > 0:
                embeddings.append(embedding)
                valid_indices.append(idx)
        except Exception:
            continue
    embeddings_array = np.array(embeddings)
    valid_df = df.iloc[valid_indices].copy()
    st.info(f"📊 Parsed {len(embeddings)} valid embeddings from {len(df)} messages")
    st.info(f"🔢 Embedding dimension: {embeddings_array.shape[1] if len(embeddings) > 0 else 0}")
    return embeddings_array, valid_df


def filter_data(df, selected_sources, selected_authors):
    """Filter dataframe by selected sources and authors"""
    if not selected_sources:
        selected_sources = df['source_file'].unique()
    filtered_df = df[
        (df['source_file'].isin(selected_sources)) &
        (df['author_name'].isin(selected_authors))
    ]
    return filtered_df


def get_filtered_embeddings(embeddings, valid_df, filtered_df):
    """Get embeddings corresponding to filtered dataframe"""
    filtered_indices = filtered_df.index.tolist()
    filtered_embeddings = embeddings[[i for i, idx in enumerate(valid_df.index) if idx in filtered_indices]]
    return filtered_embeddings


@@ -0,0 +1,211 @@
"""
Dimensionality reduction algorithms and point separation techniques.
"""
import numpy as np
import streamlit as st
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, SpectralEmbedding
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import pdist, squareform
from scipy.optimize import minimize
import umap
from config import DEFAULT_RANDOM_STATE
def apply_adaptive_spreading(embeddings, spread_factor=1.0):
    """
    Apply adaptive spreading to push apart nearby points while preserving global structure.
    Uses a force-based approach where closer points repel more strongly.
    """
    if spread_factor <= 0:
        return embeddings
    embeddings = embeddings.copy()
    n_points = len(embeddings)
    print(f"DEBUG: Applying adaptive spreading to {n_points} points with factor {spread_factor}")
    if n_points < 2:
        return embeddings
    # For very large datasets, skip spreading to avoid hanging
    if n_points > 1000:
        print(f"DEBUG: Large dataset ({n_points} points), skipping adaptive spreading...")
        return embeddings
    # Calculate pairwise distances
    distances = squareform(pdist(embeddings))
    # Apply force-based spreading with fewer iterations for large datasets
    max_iterations = 3 if n_points > 500 else 5
    for iteration in range(max_iterations):
        if iteration % 2 == 0:  # Progress indicator
            print(f"DEBUG: Spreading iteration {iteration + 1}/{max_iterations}")
        forces = np.zeros_like(embeddings)
        for i in range(n_points):
            for j in range(i + 1, n_points):
                diff = embeddings[i] - embeddings[j]
                dist = np.linalg.norm(diff)
                if dist > 0:
                    # Repulsive force inversely proportional to distance
                    force_magnitude = spread_factor / (dist ** 2 + 0.01)
                    force_direction = diff / dist
                    force = force_magnitude * force_direction
                    forces[i] += force
                    forces[j] -= force
        # Apply forces with damping
        embeddings += forces * 0.1
    print("DEBUG: Adaptive spreading complete")
    return embeddings


def force_directed_layout(high_dim_embeddings, n_components=2, spread_factor=1.0):
    """
    Create a force-directed layout from high-dimensional embeddings.
    This creates more natural spacing between similar points.
    """
    print(f"DEBUG: Starting force-directed layout with {len(high_dim_embeddings)} points...")
    # For large datasets, fall back to PCA + spreading to avoid hanging
    if len(high_dim_embeddings) > 500:
        print(f"DEBUG: Large dataset ({len(high_dim_embeddings)} points), using PCA + spreading instead...")
        pca = PCA(n_components=n_components, random_state=DEFAULT_RANDOM_STATE)
        result = pca.fit_transform(high_dim_embeddings)
        return apply_adaptive_spreading(result, spread_factor)
    # Start with PCA as initial layout
    pca = PCA(n_components=n_components, random_state=DEFAULT_RANDOM_STATE)
    initial_layout = pca.fit_transform(high_dim_embeddings)
    print("DEBUG: Initial PCA layout computed...")
    # For simplicity, just apply spreading to the PCA result;
    # the original optimization was too computationally intensive
    result = apply_adaptive_spreading(initial_layout, spread_factor)
    print("DEBUG: Force-directed layout complete...")
    return result


def calculate_local_density_scaling(embeddings, k=5):
    """
    Calculate local density scaling factors to emphasize differences in dense regions.
    """
    if len(embeddings) < k:
        return np.ones(len(embeddings))
    # Find k nearest neighbors for each point
    nn = NearestNeighbors(n_neighbors=k + 1)  # +1 because first neighbor is the point itself
    nn.fit(embeddings)
    distances, indices = nn.kneighbors(embeddings)
    # Calculate local density (inverse of average distance to k nearest neighbors)
    local_densities = 1.0 / (np.mean(distances[:, 1:], axis=1) + 1e-6)
    # Normalize densities
    local_densities = (local_densities - np.min(local_densities)) / (np.max(local_densities) - np.min(local_densities) + 1e-6)
    return local_densities


def apply_density_based_jittering(embeddings, density_scaling=True, jitter_strength=0.1):
    """
    Apply smart jittering that's stronger in dense regions to separate overlapping points.
    """
    if not density_scaling:
        # Simple random jittering
        noise = np.random.normal(0, jitter_strength, embeddings.shape)
        return embeddings + noise
    # Calculate local densities
    densities = calculate_local_density_scaling(embeddings)
    # Apply density-proportional jittering
    jittered = embeddings.copy()
    for i in range(len(embeddings)):
        # More jitter in denser regions
        jitter_amount = jitter_strength * (1 + densities[i])
        noise = np.random.normal(0, jitter_amount, embeddings.shape[1])
        jittered[i] += noise
    return jittered


def reduce_dimensions(embeddings, method="PCA", n_components=2, spread_factor=1.0,
                      perplexity_factor=1.0, min_dist_factor=1.0):
    """Apply dimensionality reduction with enhanced separation"""
    # Convert to numpy array if it's not already
    embeddings = np.array(embeddings)
    print(f"DEBUG: Starting {method} with {len(embeddings)} embeddings, shape: {embeddings.shape}")
    # Standardize embeddings for better processing
    scaler = StandardScaler()
    scaled_embeddings = scaler.fit_transform(embeddings)
    print("DEBUG: Embeddings standardized")
    # Apply the selected dimensionality reduction method
    if method == "PCA":
        print("DEBUG: Applying PCA...")
        reducer = PCA(n_components=n_components, random_state=DEFAULT_RANDOM_STATE)
        reduced_embeddings = reducer.fit_transform(scaled_embeddings)
        # Apply spreading to PCA results
        print("DEBUG: Applying spreading...")
        reduced_embeddings = apply_adaptive_spreading(reduced_embeddings, spread_factor)
    elif method == "t-SNE":
        # Adjust perplexity based on user preference and data size
        base_perplexity = min(30, len(embeddings) - 1)
        adjusted_perplexity = max(5, min(50, int(base_perplexity * perplexity_factor)))
        print(f"DEBUG: Applying t-SNE with perplexity {adjusted_perplexity}...")
        reducer = TSNE(n_components=n_components, random_state=DEFAULT_RANDOM_STATE,
                       perplexity=adjusted_perplexity, n_iter=1000,
                       early_exaggeration=12.0 * spread_factor,  # Increase early exaggeration for more separation
                       learning_rate='auto')
        reduced_embeddings = reducer.fit_transform(scaled_embeddings)
    elif method == "UMAP":
        # Adjust UMAP parameters for better local separation
        n_neighbors = min(15, len(embeddings) - 1)
        min_dist = 0.1 * min_dist_factor
        spread = 1.0 * spread_factor
        print(f"DEBUG: Applying UMAP with n_neighbors={n_neighbors}, min_dist={min_dist}...")
        reducer = umap.UMAP(n_components=n_components, random_state=DEFAULT_RANDOM_STATE,
                            n_neighbors=n_neighbors, min_dist=min_dist,
                            spread=spread, local_connectivity=2.0)
        reduced_embeddings = reducer.fit_transform(scaled_embeddings)
    elif method == "Spectral Embedding":
        n_neighbors = min(10, len(embeddings) - 1)
        print(f"DEBUG: Applying Spectral Embedding with n_neighbors={n_neighbors}...")
        reducer = SpectralEmbedding(n_components=n_components, random_state=DEFAULT_RANDOM_STATE,
                                    n_neighbors=n_neighbors)
        reduced_embeddings = reducer.fit_transform(scaled_embeddings)
        # Apply spreading to spectral results
        print("DEBUG: Applying spreading...")
        reduced_embeddings = apply_adaptive_spreading(reduced_embeddings, spread_factor)
    elif method == "Force-Directed":
        # New method: Use force-directed layout for natural spreading
print(f"DEBUG: Applying Force-Directed layout...")
reduced_embeddings = force_directed_layout(scaled_embeddings, n_components, spread_factor)
else:
# Fallback to PCA
print(f"DEBUG: Unknown method {method}, falling back to PCA...")
reducer = PCA(n_components=n_components, random_state=DEFAULT_RANDOM_STATE)
reduced_embeddings = reducer.fit_transform(scaled_embeddings)
reduced_embeddings = apply_adaptive_spreading(reduced_embeddings, spread_factor)
print(f"DEBUG: Dimensionality reduction complete. Output shape: {reduced_embeddings.shape}")
return reduced_embeddings
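Stripped of the UI, spreading, and debug output, the PCA branch above reduces to standardise-then-project; a standalone sketch (the random seed here is illustrative, not the module's `DEFAULT_RANDOM_STATE`):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(50, 384))   # 50 fake 384-dim embeddings

scaled = StandardScaler().fit_transform(embeddings)
coords = PCA(n_components=2, random_state=42).fit_transform(scaled)
assert coords.shape == (50, 2)
```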

apps/cluster_map/main.py Normal file

@@ -0,0 +1,169 @@
"""
Main application logic for the Discord Chat Embeddings Visualizer.
"""
import streamlit as st
import warnings
warnings.filterwarnings('ignore')
# Import custom modules
from ui_components import (
setup_page_config, display_title_and_description, get_all_ui_parameters,
display_performance_warnings
)
from data_loader import (
load_all_chat_data, parse_embeddings, filter_data, get_filtered_embeddings
)
from dimensionality_reduction import (
reduce_dimensions, apply_density_based_jittering
)
from clustering import apply_clustering, generate_cluster_names
from visualization import (
create_visualization_plot, display_clustering_metrics, display_summary_stats,
display_clustering_results, display_data_table, display_cluster_summary
)
def main():
"""Main application function"""
# Set up page configuration
setup_page_config()
# Display title and description
display_title_and_description()
# Load data
with st.spinner("Loading chat data..."):
df = load_all_chat_data()
if df.empty:
st.error("No data could be loaded. Please check the data directory.")
st.stop()
# Parse embeddings
with st.spinner("Parsing embeddings..."):
embeddings, valid_df = parse_embeddings(df)
if len(embeddings) == 0:
st.error("No valid embeddings found!")
st.stop()
# Get UI parameters
params = get_all_ui_parameters(valid_df)
# Check if any sources are selected before proceeding
if not params['selected_sources']:
st.info("📂 **Select source files from the sidebar to begin visualization**")
st.markdown("### Available Data Sources:")
# Show available sources as an informational table
source_info = []
for source in valid_df['source_file'].unique():
source_data = valid_df[valid_df['source_file'] == source]
source_info.append({
'Source File': source,
'Messages': len(source_data),
'Unique Authors': source_data['author_name'].nunique(),
'Date Range': f"{source_data['timestamp_utc'].min()} to {source_data['timestamp_utc'].max()}"
})
import pandas as pd
source_df = pd.DataFrame(source_info)
st.dataframe(source_df, use_container_width=True, hide_index=True)
st.markdown("👈 **Use the sidebar to select which sources to visualize**")
st.stop()
# Filter data
filtered_df = filter_data(valid_df, params['selected_sources'], params['selected_authors'])
if filtered_df.empty:
st.warning("No data matches the current filters! Try selecting different sources or authors.")
st.stop()
# Display performance warnings
display_performance_warnings(filtered_df, params['method'], params['clustering_method'])
# Get corresponding embeddings
filtered_embeddings = get_filtered_embeddings(embeddings, valid_df, filtered_df)
st.info(f"📈 Visualizing {len(filtered_df)} messages")
# Reduce dimensions
n_components = 3 if params['enable_3d'] else 2
with st.spinner(f"Reducing dimensions using {params['method']}..."):
reduced_embeddings = reduce_dimensions(
filtered_embeddings,
method=params['method'],
n_components=n_components,
spread_factor=params['spread_factor'],
perplexity_factor=params['perplexity_factor'],
min_dist_factor=params['min_dist_factor']
)
# Apply clustering
with st.spinner(f"Applying {params['clustering_method']}..."):
cluster_labels, silhouette_avg, calinski_harabasz = apply_clustering(
filtered_embeddings,
clustering_method=params['clustering_method'],
n_clusters=params['n_clusters']
)
# Apply jittering if requested
if params['apply_jittering']:
with st.spinner("Applying smart jittering to separate overlapping points..."):
reduced_embeddings = apply_density_based_jittering(
reduced_embeddings,
density_scaling=params['density_based_jitter'],
jitter_strength=params['jitter_strength']
)
# Generate cluster names if clustering was applied
cluster_names = None
if cluster_labels is not None:
with st.spinner("Generating cluster names..."):
cluster_names = generate_cluster_names(filtered_df, cluster_labels)
# Display clustering metrics
display_clustering_metrics(
cluster_labels, silhouette_avg, calinski_harabasz,
params['show_cluster_metrics']
)
# Display cluster summary with names
if cluster_names:
display_cluster_summary(cluster_names, cluster_labels)
# Create and display the main plot
fig = create_visualization_plot(
reduced_embeddings=reduced_embeddings,
filtered_df=filtered_df,
cluster_labels=cluster_labels,
selected_sources=params['selected_sources'] if params['selected_sources'] else None,
method=params['method'],
clustering_method=params['clustering_method'],
point_size=params['point_size'],
point_opacity=params['point_opacity'],
density_based_sizing=params['density_based_sizing'],
size_variation=params['size_variation'],
enable_3d=params['enable_3d'],
cluster_names=cluster_names
)
st.plotly_chart(fig, use_container_width=True)
# Display summary statistics
display_summary_stats(filtered_df, params['selected_sources'] or filtered_df['source_file'].unique())
# Display clustering results and export options
display_clustering_results(
filtered_df, cluster_labels, reduced_embeddings,
params['method'], params['clustering_method'], params['enable_3d']
)
# Display data table
display_data_table(filtered_df, cluster_labels)
if __name__ == "__main__":
main()


@@ -0,0 +1,8 @@
streamlit>=1.28.0
pandas>=1.5.0
numpy>=1.24.0
plotly>=5.15.0
scikit-learn>=1.3.0
umap-learn>=0.5.3
hdbscan>=0.8.29
scipy>=1.10.0


@@ -0,0 +1,43 @@
#!/usr/bin/env python3
"""
Test script to debug the hanging issue in the modular app
"""
import numpy as np
import sys
import os
# Add the current directory to Python path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
def test_dimensionality_reduction():
"""Test dimensionality reduction functions"""
print("Testing dimensionality reduction functions...")
from dimensionality_reduction import reduce_dimensions
# Create test data similar to what we'd expect
n_samples = 796 # Same as the user's dataset
n_features = 384 # Common embedding dimension
print(f"Creating test embeddings: {n_samples} x {n_features}")
test_embeddings = np.random.randn(n_samples, n_features)
# Test PCA (should be fast)
print("Testing PCA...")
try:
result = reduce_dimensions(test_embeddings, method="PCA")
print(f"✓ PCA successful, output shape: {result.shape}")
except Exception as e:
print(f"✗ PCA failed: {e}")
# Test UMAP (might be slower)
print("Testing UMAP...")
try:
result = reduce_dimensions(test_embeddings, method="UMAP")
print(f"✓ UMAP successful, output shape: {result.shape}")
except Exception as e:
print(f"✗ UMAP failed: {e}")
if __name__ == "__main__":
test_dimensionality_reduction()


@@ -0,0 +1,267 @@
"""
Streamlit UI components and controls for the Discord Chat Embeddings Visualizer.
"""
import streamlit as st
import numpy as np
from config import (
APP_TITLE, APP_ICON, APP_LAYOUT, METHOD_EXPLANATIONS,
CLUSTERING_METHODS_REQUIRING_N_CLUSTERS, COMPUTATIONALLY_INTENSIVE_METHODS,
LARGE_DATASET_WARNING_THRESHOLD, MAX_DISPLAYED_AUTHORS,
DEFAULT_DIMENSION_REDUCTION_METHOD, DEFAULT_CLUSTERING_METHOD
)
def setup_page_config():
"""Set up the Streamlit page configuration"""
st.set_page_config(
page_title=APP_TITLE,
page_icon=APP_ICON,
layout=APP_LAYOUT
)
def display_title_and_description():
"""Display the main title and description"""
st.title(f"{APP_ICON} {APP_TITLE}")
    st.markdown("Explore Discord chat messages through their vector embeddings in 2D or 3D space")
def create_method_controls():
"""Create controls for dimension reduction and clustering methods"""
st.sidebar.header("🎛️ Visualization Controls")
# 3D visualization toggle
enable_3d = st.sidebar.checkbox(
"Enable 3D Visualization",
value=False,
help="Switch between 2D and 3D visualization. 3D uses 3 components instead of 2."
)
# Dimension reduction method
method_options = ["PCA", "t-SNE", "UMAP", "Spectral Embedding", "Force-Directed"]
default_index = method_options.index(DEFAULT_DIMENSION_REDUCTION_METHOD) if DEFAULT_DIMENSION_REDUCTION_METHOD in method_options else 0
method = st.sidebar.selectbox(
"Dimension Reduction Method",
method_options,
index=default_index,
help="PCA is fastest, UMAP balances speed and quality, t-SNE and Spectral are slower but may reveal better structures. Force-Directed creates natural spacing."
)
# Clustering method
clustering_options = ["None", "HDBSCAN", "Spectral Clustering", "Gaussian Mixture",
"Agglomerative (Ward)", "Agglomerative (Complete)", "OPTICS"]
clustering_default_index = clustering_options.index(DEFAULT_CLUSTERING_METHOD) if DEFAULT_CLUSTERING_METHOD in clustering_options else 0
clustering_method = st.sidebar.selectbox(
"Clustering Method",
clustering_options,
index=clustering_default_index,
help="Apply clustering to identify groups. HDBSCAN and OPTICS can find variable density clusters."
)
return method, clustering_method, enable_3d
def create_clustering_controls(clustering_method):
"""Create controls for clustering parameters"""
# Always show the clusters slider, but indicate when it's used
if clustering_method in CLUSTERING_METHODS_REQUIRING_N_CLUSTERS:
help_text = "Number of clusters to create. This setting affects the clustering algorithm."
disabled = False
elif clustering_method == "None":
help_text = "Clustering is disabled. This setting has no effect."
disabled = True
else:
help_text = f"{clustering_method} automatically determines the number of clusters. This setting has no effect."
disabled = True
n_clusters = st.sidebar.slider(
"Number of Clusters",
min_value=2,
max_value=20,
value=5,
disabled=disabled,
help=help_text
)
return n_clusters
def create_separation_controls(method):
"""Create controls for point separation and method-specific parameters"""
st.sidebar.subheader("🎯 Point Separation Controls")
spread_factor = st.sidebar.slider(
"Spread Factor",
0.5, 3.0, 1.0, 0.1,
help="Increase to spread apart nearby points. Higher values create more separation."
)
# Method-specific parameters
perplexity_factor = 1.0
min_dist_factor = 1.0
if method == "t-SNE":
perplexity_factor = st.sidebar.slider(
"Perplexity Factor",
0.1, 2.0, 1.0, 0.1,
help="Affects local vs global structure balance. Lower values focus on local details."
)
if method == "UMAP":
min_dist_factor = st.sidebar.slider(
"Min Distance Factor",
0.1, 2.0, 1.0, 0.1,
help="Controls how tightly points are packed. Lower values create tighter clusters."
)
return spread_factor, perplexity_factor, min_dist_factor
def create_jittering_controls():
"""Create controls for jittering options"""
apply_jittering = st.sidebar.checkbox(
"Apply Smart Jittering",
value=False,
help="Add intelligent noise to separate overlapping points"
)
jitter_strength = 0.1
density_based_jitter = True
if apply_jittering:
jitter_strength = st.sidebar.slider(
"Jitter Strength",
0.01, 0.5, 0.1, 0.01,
help="Strength of jittering. Higher values spread points more."
)
density_based_jitter = st.sidebar.checkbox(
"Density-Based Jittering",
value=True,
help="Apply stronger jittering in dense regions"
)
return apply_jittering, jitter_strength, density_based_jitter
def create_advanced_options():
"""Create advanced visualization options"""
with st.sidebar.expander("⚙️ Advanced Options"):
show_cluster_metrics = st.checkbox("Show Clustering Metrics", value=True)
point_size = st.slider("Point Size", 4, 15, 8)
point_opacity = st.slider("Point Opacity", 0.3, 1.0, 0.7)
# Density-based visualization
density_based_sizing = st.checkbox(
"Density-Based Point Sizing",
value=False,
help="Make points larger in sparse regions, smaller in dense regions"
)
size_variation = 2.0
if density_based_sizing:
size_variation = st.slider(
"Size Variation Factor",
1.5, 4.0, 2.0, 0.1,
help="How much point sizes vary based on local density"
)
return show_cluster_metrics, point_size, point_opacity, density_based_sizing, size_variation
def create_filter_controls(valid_df):
"""Create controls for filtering data by source and author"""
# Source file filter
source_files = valid_df['source_file'].unique()
selected_sources = st.sidebar.multiselect(
"Filter by Source Files",
source_files,
default=[],
help="Select which chat log files to include"
)
# Author filter
authors = valid_df['author_name'].unique()
default_authors = authors[:MAX_DISPLAYED_AUTHORS] if len(authors) > MAX_DISPLAYED_AUTHORS else authors
selected_authors = st.sidebar.multiselect(
"Filter by Authors",
authors,
default=default_authors,
help="Select which authors to include"
)
return selected_sources, selected_authors
def display_method_explanations():
"""Display explanations for different methods"""
st.sidebar.markdown("---")
with st.sidebar.expander("📚 Method Explanations"):
st.markdown("**Dimensionality Reduction:**")
for method, explanation in METHOD_EXPLANATIONS["dimension_reduction"].items():
st.markdown(f"- **{method}**: {explanation}")
st.markdown("\n**Clustering Methods:**")
for method, explanation in METHOD_EXPLANATIONS["clustering"].items():
st.markdown(f"- **{method}**: {explanation}")
st.markdown("\n**Separation Techniques:**")
for technique, explanation in METHOD_EXPLANATIONS["separation"].items():
st.markdown(f"- **{technique}**: {explanation}")
st.markdown("\n**Metrics:**")
for metric, explanation in METHOD_EXPLANATIONS["metrics"].items():
st.markdown(f"- **{metric}**: {explanation}")
def display_performance_warnings(filtered_df, method, clustering_method):
"""Display performance warnings for computationally intensive operations"""
if len(filtered_df) > LARGE_DATASET_WARNING_THRESHOLD:
if method in COMPUTATIONALLY_INTENSIVE_METHODS["dimension_reduction"]:
st.warning(f"⚠️ {method} with {len(filtered_df)} points may take several minutes to compute.")
if clustering_method in COMPUTATIONALLY_INTENSIVE_METHODS["clustering"]:
st.warning(f"⚠️ {clustering_method} with {len(filtered_df)} points may be computationally intensive.")
def get_all_ui_parameters(valid_df):
"""Get all UI parameters in a single function call"""
# Method selection
method, clustering_method, enable_3d = create_method_controls()
# Clustering parameters
n_clusters = create_clustering_controls(clustering_method)
# Separation controls
spread_factor, perplexity_factor, min_dist_factor = create_separation_controls(method)
# Jittering controls
apply_jittering, jitter_strength, density_based_jitter = create_jittering_controls()
# Advanced options
show_cluster_metrics, point_size, point_opacity, density_based_sizing, size_variation = create_advanced_options()
# Filters
selected_sources, selected_authors = create_filter_controls(valid_df)
# Method explanations
display_method_explanations()
return {
'method': method,
'clustering_method': clustering_method,
'enable_3d': enable_3d,
'n_clusters': n_clusters,
'spread_factor': spread_factor,
'perplexity_factor': perplexity_factor,
'min_dist_factor': min_dist_factor,
'apply_jittering': apply_jittering,
'jitter_strength': jitter_strength,
'density_based_jitter': density_based_jitter,
'show_cluster_metrics': show_cluster_metrics,
'point_size': point_size,
'point_opacity': point_opacity,
'density_based_sizing': density_based_sizing,
'size_variation': size_variation,
'selected_sources': selected_sources,
'selected_authors': selected_authors
}


@@ -0,0 +1,311 @@
"""
Visualization functions for creating interactive plots and displays.
"""
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import streamlit as st
from dimensionality_reduction import calculate_local_density_scaling
from config import MESSAGE_CONTENT_PREVIEW_LENGTH, DEFAULT_POINT_SIZE, DEFAULT_POINT_OPACITY
def create_hover_text(df):
"""Create hover text for plotly"""
hover_text = []
for _, row in df.iterrows():
text = f"<b>Author:</b> {row['author_name']}<br>"
text += f"<b>Timestamp:</b> {row['timestamp_utc']}<br>"
text += f"<b>Source:</b> {row['source_file']}<br>"
# Handle potential NaN or non-string content
content = row['content']
if pd.isna(content) or content is None:
content_text = "[No content]"
else:
content_str = str(content)
content_text = content_str[:MESSAGE_CONTENT_PREVIEW_LENGTH] + ('...' if len(content_str) > MESSAGE_CONTENT_PREVIEW_LENGTH else '')
text += f"<b>Content:</b> {content_text}"
hover_text.append(text)
return hover_text
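The truncation rule used for hover content, in isolation (a sketch; the `PREVIEW` value is an assumption standing in for `MESSAGE_CONTENT_PREVIEW_LENGTH`, and only `None` is handled here rather than full `pd.isna`):

```python
PREVIEW = 200  # assumed stand-in for MESSAGE_CONTENT_PREVIEW_LENGTH

def preview(content, limit=PREVIEW):
    # Missing content becomes a placeholder; long content is cut with an ellipsis.
    if content is None:
        return "[No content]"
    s = str(content)
    return s[:limit] + ("..." if len(s) > limit else "")

assert preview(None) == "[No content]"
assert preview("hi") == "hi"
assert len(preview("x" * 300)) == 203  # 200 chars + "..."
```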
def calculate_point_sizes(reduced_embeddings, density_based_sizing=False,
point_size=DEFAULT_POINT_SIZE, size_variation=2.0):
"""Calculate point sizes based on density if enabled"""
if not density_based_sizing:
return [point_size] * len(reduced_embeddings)
local_densities = calculate_local_density_scaling(reduced_embeddings)
# Invert densities so sparse areas get larger points
inverted_densities = 1.0 - local_densities
# Scale point sizes
point_sizes = point_size * (1.0 + inverted_densities * (size_variation - 1.0))
return point_sizes
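The sizing rule above is linear in inverted density: sparse points (density 0) get `base * variation`, dense points (density 1) stay at `base`. In isolation:

```python
import numpy as np

def point_size(density, base=8, variation=2.0):
    # base * (1 + (1 - density) * (variation - 1))
    return base * (1.0 + (1.0 - np.asarray(density)) * (variation - 1.0))

s = point_size(np.array([0.0, 0.5, 1.0]))
assert s.tolist() == [16.0, 12.0, 8.0]
```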
def create_clustered_plot(reduced_embeddings, filtered_df, cluster_labels, hover_text,
point_sizes, point_opacity=DEFAULT_POINT_OPACITY, method="PCA", enable_3d=False,
cluster_names=None):
"""Create a plot colored by clusters"""
fig = go.Figure()
unique_clusters = np.unique(cluster_labels)
colors = px.colors.qualitative.Set3 + px.colors.qualitative.Pastel
for i, cluster_id in enumerate(unique_clusters):
cluster_mask = cluster_labels == cluster_id
if cluster_mask.any():
cluster_embeddings = reduced_embeddings[cluster_mask]
cluster_hover = [hover_text[j] for j, mask in enumerate(cluster_mask) if mask]
cluster_sizes = [point_sizes[j] for j, mask in enumerate(cluster_mask) if mask]
# Use generated name if available, otherwise fall back to default
if cluster_names and cluster_id in cluster_names:
cluster_name = cluster_names[cluster_id]
else:
cluster_name = f"Cluster {cluster_id}" if cluster_id != -1 else "Noise"
if enable_3d:
fig.add_trace(go.Scatter3d(
x=cluster_embeddings[:, 0],
y=cluster_embeddings[:, 1],
z=cluster_embeddings[:, 2],
mode='markers',
name=cluster_name,
marker=dict(
size=cluster_sizes,
color=colors[i % len(colors)],
opacity=point_opacity,
line=dict(width=1, color='white')
),
hovertemplate='%{hovertext}<extra></extra>',
hovertext=cluster_hover
))
else:
fig.add_trace(go.Scatter(
x=cluster_embeddings[:, 0],
y=cluster_embeddings[:, 1],
mode='markers',
name=cluster_name,
marker=dict(
size=cluster_sizes,
color=colors[i % len(colors)],
opacity=point_opacity,
line=dict(width=1, color='white')
),
hovertemplate='%{hovertext}<extra></extra>',
hovertext=cluster_hover
))
return fig
def create_source_colored_plot(reduced_embeddings, filtered_df, selected_sources, hover_text,
point_sizes, point_opacity=DEFAULT_POINT_OPACITY, enable_3d=False):
"""Create a plot colored by source files"""
fig = go.Figure()
colors = px.colors.qualitative.Set1
for i, source in enumerate(selected_sources):
source_mask = filtered_df['source_file'] == source
if source_mask.any():
source_embeddings = reduced_embeddings[source_mask]
source_hover = [hover_text[j] for j, mask in enumerate(source_mask) if mask]
source_sizes = [point_sizes[j] for j, mask in enumerate(source_mask) if mask]
if enable_3d:
fig.add_trace(go.Scatter3d(
x=source_embeddings[:, 0],
y=source_embeddings[:, 1],
z=source_embeddings[:, 2],
mode='markers',
name=source,
marker=dict(
size=source_sizes,
color=colors[i % len(colors)],
opacity=point_opacity,
line=dict(width=1, color='white')
),
hovertemplate='%{hovertext}<extra></extra>',
hovertext=source_hover
))
else:
fig.add_trace(go.Scatter(
x=source_embeddings[:, 0],
y=source_embeddings[:, 1],
mode='markers',
name=source,
marker=dict(
size=source_sizes,
color=colors[i % len(colors)],
opacity=point_opacity,
line=dict(width=1, color='white')
),
hovertemplate='%{hovertext}<extra></extra>',
hovertext=source_hover
))
return fig
def create_visualization_plot(reduced_embeddings, filtered_df, cluster_labels=None,
selected_sources=None, method="PCA", clustering_method="None",
point_size=DEFAULT_POINT_SIZE, point_opacity=DEFAULT_POINT_OPACITY,
density_based_sizing=False, size_variation=2.0, enable_3d=False,
cluster_names=None):
"""Create the main visualization plot"""
# Create hover text
hover_text = create_hover_text(filtered_df)
# Calculate point sizes
point_sizes = calculate_point_sizes(reduced_embeddings, density_based_sizing,
point_size, size_variation)
# Create plot based on coloring strategy
if cluster_labels is not None:
fig = create_clustered_plot(reduced_embeddings, filtered_df, cluster_labels,
hover_text, point_sizes, point_opacity, method, enable_3d,
cluster_names)
else:
if selected_sources is None:
selected_sources = filtered_df['source_file'].unique()
fig = create_source_colored_plot(reduced_embeddings, filtered_df, selected_sources,
hover_text, point_sizes, point_opacity, enable_3d)
# Update layout
title_suffix = f" with {clustering_method}" if clustering_method != "None" else ""
dimension_text = "3D" if enable_3d else "2D"
if enable_3d:
fig.update_layout(
title=f"Discord Chat Messages - {method} {dimension_text} Visualization{title_suffix}",
scene=dict(
xaxis_title=f"{method} Component 1",
yaxis_title=f"{method} Component 2",
zaxis_title=f"{method} Component 3"
),
width=1000,
height=700
)
else:
fig.update_layout(
title=f"Discord Chat Messages - {method} {dimension_text} Visualization{title_suffix}",
xaxis_title=f"{method} Component 1",
yaxis_title=f"{method} Component 2",
hovermode='closest',
width=1000,
height=700
)
return fig
def display_clustering_metrics(cluster_labels, silhouette_avg, calinski_harabasz, show_metrics=True):
"""Display clustering quality metrics"""
if cluster_labels is not None and show_metrics:
col1, col2, col3 = st.columns(3)
with col1:
n_clusters_found = len(np.unique(cluster_labels[cluster_labels != -1]))
st.metric("Clusters Found", n_clusters_found)
with col2:
if silhouette_avg is not None:
st.metric("Silhouette Score", f"{silhouette_avg:.3f}")
else:
st.metric("Silhouette Score", "N/A")
with col3:
if calinski_harabasz is not None:
st.metric("Calinski-Harabasz Index", f"{calinski_harabasz:.1f}")
else:
st.metric("Calinski-Harabasz Index", "N/A")
def display_summary_stats(filtered_df, selected_sources):
"""Display summary statistics"""
col1, col2, col3 = st.columns(3)
with col1:
st.metric("Total Messages", len(filtered_df))
with col2:
st.metric("Unique Authors", filtered_df['author_name'].nunique())
with col3:
st.metric("Source Files", len(selected_sources))
def display_clustering_results(filtered_df, cluster_labels, reduced_embeddings, method, clustering_method, enable_3d=False):
"""Display clustering results and export options"""
if cluster_labels is None:
return
st.subheader("📊 Clustering Results")
# Add cluster information to dataframe for export
export_df = filtered_df.copy()
export_df['cluster_id'] = cluster_labels
export_df['x_coordinate'] = reduced_embeddings[:, 0]
export_df['y_coordinate'] = reduced_embeddings[:, 1]
# Add z coordinate if 3D
if enable_3d and reduced_embeddings.shape[1] >= 3:
export_df['z_coordinate'] = reduced_embeddings[:, 2]
# Show cluster distribution
cluster_dist = pd.Series(cluster_labels).value_counts().sort_index()
st.bar_chart(cluster_dist)
# Download option
csv_data = export_df.to_csv(index=False)
dimension_text = "3D" if enable_3d else "2D"
st.download_button(
label="📥 Download Clustering Results (CSV)",
data=csv_data,
file_name=f"chat_clusters_{method}_{clustering_method}_{dimension_text}.csv",
mime="text/csv"
)
def display_data_table(filtered_df, cluster_labels=None):
"""Display the data table with optional clustering information"""
if not st.checkbox("Show Data Table"):
return
st.subheader("📋 Message Data")
display_df = filtered_df[['timestamp_utc', 'author_name', 'source_file', 'content']].copy()
# Add clustering info if available
if cluster_labels is not None:
display_df['cluster'] = cluster_labels
display_df['content'] = display_df['content'].str[:100] + '...' # Truncate for display
st.dataframe(display_df, use_container_width=True)
def display_cluster_summary(cluster_names, cluster_labels):
"""Display a summary of cluster names and their sizes"""
if not cluster_names or cluster_labels is None:
return
st.subheader("🏷️ Cluster Summary")
# Create summary data
cluster_summary = []
for cluster_id, name in cluster_names.items():
count = np.sum(cluster_labels == cluster_id)
cluster_summary.append({
'Cluster ID': cluster_id,
'Cluster Name': name,
'Message Count': count,
'Percentage': f"{100 * count / len(cluster_labels):.1f}%"
})
# Sort by message count
cluster_summary.sort(key=lambda x: x['Message Count'], reverse=True)
# Display as table
summary_df = pd.DataFrame(cluster_summary)
st.dataframe(summary_df, use_container_width=True, hide_index=True)


@@ -0,0 +1,59 @@
# Image Dataset Viewer
A simple Streamlit application to browse images from your Discord chat dataset.
## Features
- 📋 Dropdown to select different channels
- 🖼️ View images with navigation controls
- ⬅️➡️ Previous/Next buttons and slider navigation
- 📊 Display metadata for each image
- 📱 Responsive layout
## Setup and Usage
### Option 1: Using the run script (Recommended)
```bash
./run.sh
```
### Option 2: Manual setup
1. Create a virtual environment:
```bash
python3 -m venv venv
source venv/bin/activate
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Run the application:
```bash
streamlit run image_viewer.py
```
## How it works
The application:
1. Loads the `images_dataset.json` file from the `images_dataset/` directory one level up
2. Extracts unique channel names from the dataset
3. Allows you to select a channel from a dropdown
4. Displays images from that channel with navigation controls
5. Shows metadata including author, timestamp, and message content
## Dataset Structure
The app expects your dataset to have entries with:
- `channel`: The channel name
- `image_url`, `image_path`, `url`, or `attachment_url`: The image location
- `author`: The message author (optional)
- `timestamp`: When the message was sent (optional)
- `content` or `message`: The message text (optional)
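For illustration, a hypothetical entry in this shape (every value below is made up):

```python
entry = {
    "channel": "general",
    "image_url": "https://cdn.example.com/attachments/123/456/cat.png",
    "author": "alice",
    "timestamp": "2025-08-11T01:22:03+01:00",
    "content": "look at this cat",
}
# The viewer needs a channel plus at least one image source field;
# the rest is optional metadata.
assert "channel" in entry and "image_url" in entry
```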
## Troubleshooting
- If images don't load, check that the URLs in your dataset are accessible
- For local images, ensure the paths are relative to the project root
- Large datasets may take a moment to load initially


@@ -0,0 +1,226 @@
import streamlit as st
import json
import os
from pathlib import Path
import requests
from PIL import Image
from io import BytesIO
# Set page config
st.set_page_config(
page_title="Image Dataset Viewer",
page_icon="🖼️",
layout="wide"
)
# Cache the dataset loading
@st.cache_data
def load_dataset():
"""Load the images dataset JSON file"""
dataset_path = "../images_dataset/images_dataset.json"
try:
with open(dataset_path, 'r', encoding='utf-8') as f:
data = json.load(f)
return data
except Exception as e:
st.error(f"Error loading dataset: {e}")
return {}
@st.cache_data
def get_channels(data):
"""Extract unique channels from the dataset"""
# First try to get channels from metadata
if isinstance(data, dict) and 'metadata' in data and 'summary' in data['metadata']:
channels = data['metadata']['summary'].get('channels', [])
if channels:
return sorted(channels)
# Fallback: extract from images array
channels = set()
images = data.get('images', []) if isinstance(data, dict) else []
for item in images:
if isinstance(item, dict) and 'channel' in item:
channels.add(item['channel'])
return sorted(list(channels))
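A standalone sketch of the metadata-first, scan-second lookup above (sample data is made up):

```python
data = {
    "metadata": {"summary": {"channels": ["memes", "general"]}},
    "images": [{"channel": "general"}, {"channel": "art"}],
}

def channels_of(d):
    # Prefer the metadata summary; fall back to scanning the images array.
    meta = d.get("metadata", {}).get("summary", {}).get("channels", [])
    if meta:
        return sorted(meta)
    return sorted({img["channel"] for img in d.get("images", []) if "channel" in img})

assert channels_of(data) == ["general", "memes"]
assert channels_of({"images": data["images"]}) == ["art", "general"]
```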
def display_image(image_url, caption="", base64_data=None):
"""Display an image from URL, local path, or base64 data"""
try:
if base64_data and base64_data != "image datta ...........":
# Load image from base64 data
import base64
image_data = base64.b64decode(base64_data)
image = Image.open(BytesIO(image_data))
elif image_url and image_url.startswith(('http://', 'https://')):
# Load image from URL
response = requests.get(image_url, timeout=10)
response.raise_for_status()
image = Image.open(BytesIO(response.content))
elif image_url:
# Load local image
image_path = Path(__file__).parent.parent / image_url
if image_path.exists():
image = Image.open(image_path)
else:
st.error(f"Image not found: {image_url}")
return False
else:
st.error("No valid image source found")
return False
st.image(image, caption=caption, use_column_width=True)
return True
except Exception as e:
st.error(f"Error loading image: {e}")
return False
def main():
st.title("🖼️ Image Dataset Viewer")
st.markdown("Browse images from your dataset by channel")
# Load dataset
with st.spinner("Loading dataset..."):
data = load_dataset()
if not data:
st.error("No data loaded. Please check your dataset file.")
return
# Display dataset summary if available
if isinstance(data, dict) and 'metadata' in data:
metadata = data['metadata']
if 'summary' in metadata:
summary = metadata['summary']
col1, col2, col3, col4 = st.columns(4)
with col1:
st.metric("Total Images", summary.get('total_images', 'Unknown'))
with col2:
st.metric("Channels", len(summary.get('channels', [])))
with col3:
st.metric("Authors", len(summary.get('authors', [])))
with col4:
size_mb = summary.get('total_size_bytes', 0) / (1024 * 1024)
st.metric("Total Size", f"{size_mb:.1f} MB")
# Get channels
channels = get_channels(data)
if not channels:
st.error("No channels found in the dataset.")
return
# Channel selection
selected_channel = st.selectbox(
"Select a channel:",
channels,
help="Choose a channel to view its images"
)
# Filter images by channel
channel_images = []
images = data.get('images', []) if isinstance(data, dict) else []
for i, item in enumerate(images):
if isinstance(item, dict) and item.get('channel') == selected_channel:
if 'url' in item or 'base64_data' in item:
channel_images.append({
'id': i,
'data': item
})
if not channel_images:
st.warning(f"No images found for channel: {selected_channel}")
return
st.success(f"Found {len(channel_images)} images in #{selected_channel}")
# Image navigation
if len(channel_images) > 1:
col1, col2, col3 = st.columns([1, 2, 1])
with col1:
if st.button("⬅️ Previous", use_container_width=True):
if 'image_index' in st.session_state and st.session_state.image_index > 0:
st.session_state.image_index -= 1
else:
st.session_state.image_index = len(channel_images) - 1
with col2:
# Initialize or get current index
if 'image_index' not in st.session_state:
st.session_state.image_index = 0
# Image selector
st.session_state.image_index = st.slider(
"Image",
0,
len(channel_images) - 1,
st.session_state.image_index,
help=f"Navigate through {len(channel_images)} images"
)
with col3:
if st.button("Next ➡️", use_container_width=True):
if 'image_index' in st.session_state and st.session_state.image_index < len(channel_images) - 1:
st.session_state.image_index += 1
else:
st.session_state.image_index = 0
else:
st.session_state.image_index = 0
# Display current image
current_image = channel_images[st.session_state.image_index]
image_data = current_image['data']
# Get image URL and base64 data
image_url = image_data.get('url')
base64_data = image_data.get('base64_data')
if image_url or base64_data:
# Create two columns for image and metadata
col1, col2 = st.columns([2, 1])
with col1:
st.subheader(f"Image {st.session_state.image_index + 1} of {len(channel_images)}")
caption = f"Channel: #{selected_channel}"
if 'author_name' in image_data:
caption += f" | Author: {image_data['author_name']}"
if 'timestamp_utc' in image_data:
caption += f" | Time: {image_data['timestamp_utc']}"
display_image(image_url, caption, base64_data)
with col2:
st.subheader("Metadata")
# Display metadata in an organized way
metadata_to_show = {
'ID': current_image['id'],
'Channel': image_data.get('channel', 'Unknown'),
'Author': image_data.get('author_name', 'Unknown'),
'Nickname': image_data.get('author_nickname', 'Unknown'),
'Author ID': image_data.get('author_id', 'Unknown'),
'Message ID': image_data.get('message_id', 'Unknown'),
'Timestamp': image_data.get('timestamp_utc', 'Unknown'),
'File Extension': image_data.get('file_extension', 'Unknown'),
'File Size': f"{image_data.get('file_size', 0):,} bytes" if image_data.get('file_size') else 'Unknown',
'Message': image_data.get('content', 'No message'),
}
for key, value in metadata_to_show.items():
if value and value != 'Unknown':
st.write(f"**{key}:** {value}")
# Show all other metadata
st.subheader("Raw Data")
with st.expander("Show all metadata"):
st.json(image_data)
else:
st.error("No image URL or base64 data found in this entry")
st.json(image_data)
if __name__ == "__main__":
main()
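The Previous/Next button handlers above implement wrap-around navigation by hand (reset to the last index when stepping back from the first, and to zero when stepping past the last). The same behaviour can be expressed more compactly with modulo arithmetic; a minimal sketch (the `step` parameter is illustrative, not part of the app):

```python
def next_index(current: int, total: int, step: int = 1) -> int:
    """Wrap-around navigation: stepping past either end loops to the other."""
    return (current + step) % total

# Stepping forward from the last image returns to the first,
# and stepping back from the first image jumps to the last.
```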


@@ -0,0 +1,3 @@
streamlit>=1.28.0
requests>=2.31.0
Pillow>=10.0.0

File diff suppressed because one or more lines are too long


@@ -0,0 +1 @@
message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds


@@ -0,0 +1 @@
message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds


@@ -1,9 +1,7 @@
# discord_export_bot.py
# discord_export_bot_v2.py
# This bot connects to a Discord server and exports the entire message
# history from every accessible text channel into separate CSV files.
# Make sure to install the discord.py library first:
# pip install discord.py
# This version uses a more robust task-based approach to prevent hanging.
import discord
import csv
@@ -11,47 +9,34 @@ import os
import asyncio
# --- Configuration ---
# Place your Bot Token here. Treat this like a password!
# It's recommended to use environment variables for security.
BOT_TOKEN = "YOUR_BOT_TOKEN_HERE"
# The directory where the CSV files will be saved.
# The script will create this directory if it doesn't exist.
BOT_TOKEN = "___"
OUTPUT_DIRECTORY = "discord_chat_logs"
# Optional: If you want to lock the bot to one server
# ALLOWED_SERVER_ID = 123456789012345678
# -------------------
# --- Bot Setup ---
# Define the necessary "Intents" for the bot. Intents tell Discord what
# events your bot needs to receive. To read messages, we need the
# `messages` and `message_content` intents. You MUST enable these
# in the Discord Developer Portal for your bot.
# The intents MUST be enabled in the Discord Developer Portal.
intents = discord.Intents.default()
intents.guilds = True
intents.messages = True
intents.message_content = True # This is a privileged intent!
intents.message_content = True # This is the most important one!
# Create the bot client instance with the specified intents.
client = discord.Client(intents=intents)
# --- Main Export Logic ---
async def export_channel_history(channel):
"""
Asynchronously fetches all messages from a given text channel
and saves them to a CSV file.
"""
print(f"Starting export for channel: #{channel.name} (ID: {channel.id})")
print(f"-> Starting export for channel: #{channel.name}")
# Sanitize channel name to create a valid filename
# Replaces invalid file name characters with an underscore
sanitized_channel_name = "".join(c if c.isalnum() else '_' for c in channel.name)
file_path = os.path.join(OUTPUT_DIRECTORY, f"{sanitized_channel_name}.csv")
try:
message_count = 0
with open(file_path, 'w', newline='', encoding='utf-8') as csvfile:
# Define the headers for the CSV file. This includes all the
# useful information we can easily get from a message object.
header = [
'message_id', 'timestamp_utc', 'author_id', 'author_name',
'author_nickname', 'content', 'attachment_urls', 'embeds'
@@ -59,95 +44,98 @@ async def export_channel_history(channel):
writer = csv.DictWriter(csvfile, fieldnames=header)
writer.writeheader()
# Fetch the channel's history. `limit=None` tells the library to
# fetch all messages. This can take a very long time and consume
# significant memory for channels with a large history.
# This is the part that fails without the Message Content Intent
async for message in channel.history(limit=None):
message_count += 1
if message_count % 100 == 0:
if message_count % 250 == 0: # Log progress less frequently
print(f" ... processed {message_count} messages in #{channel.name}")
# Extract attachment URLs
attachment_urls = ", ".join([att.url for att in message.attachments])
# Serialize embed objects to a string representation (e.g., JSON)
# This gives a detailed look into rich embeds.
embeds_str = ", ".join([str(embed.to_dict()) for embed in message.embeds])
# Write the message data as a row in the CSV
# Handle nickname - only Member objects have nick attribute, not User objects
author_nickname = getattr(message.author, 'nick', None) or message.author.display_name
writer.writerow({
'message_id': message.id,
'timestamp_utc': message.created_at,
'author_id': message.author.id,
'author_name': message.author.name,
'author_nickname': message.author.nick,
'author_nickname': author_nickname,
'content': message.content,
'attachment_urls': attachment_urls,
'embeds': embeds_str
})
print(f"✅ Finished exporting {message_count} messages from #{channel.name}.")
if message_count > 0:
print(f"✅ Finished exporting {message_count} messages from #{channel.name}.")
else:
print(f"⚠️ Channel #{channel.name} is empty or unreadable. 0 messages exported.")
return True
except discord.errors.Forbidden:
print(f"❌ ERROR: Permission denied for channel #{channel.name}. Skipping.")
print(f"❌ ERROR: Permission denied for channel #{channel.name}. Check bot permissions. Skipping.")
return False
except Exception as e:
print(f"❌ An unexpected error occurred for channel #{channel.name}: {e}")
return False
# --- Bot Events ---
@client.event
async def on_ready():
async def main_export_task():
"""
This event is triggered once the bot has successfully connected to Discord.
The main logic for the bot's export process.
This is run as a background task to avoid blocking.
"""
print(f'Logged in as: {client.user.name} (ID: {client.user.id})')
# Wait until the bot is fully ready before starting
await client.wait_until_ready()
print('------')
print("Bot is ready. Starting export process...")
# Create the output directory if it doesn't exist
if not os.path.exists(OUTPUT_DIRECTORY):
os.makedirs(OUTPUT_DIRECTORY)
print(f"Created output directory: {OUTPUT_DIRECTORY}")
# Get the server (guild) the bot is in. This script assumes the bot
# is only in ONE server. If it's in multiple, you may need to specify
# which one to target.
guild = client.guilds[0]
if not guild:
# Use the first guild the bot is in. For specific server, use client.get_guild(ALLOWED_SERVER_ID)
if not client.guilds:
print("Error: Bot does not appear to be in any server.")
await client.close()
return
guild = client.guilds[0]
print(f"Targeting server: {guild.name} (ID: {guild.id})")
# Get a list of all text channels the bot can see
text_channels = [channel for channel in guild.text_channels]
print(f"Found {len(text_channels)} text channels to export.")
# Loop through each channel and run the export function
for channel in text_channels:
await export_channel_history(channel)
# A small delay to be respectful to Discord's API, although
# the library handles rate limiting automatically.
await asyncio.sleep(1)
print('------')
print("All channels have been processed. The bot will now shut down.")
# Shuts down the bot once the export is complete.
# This properly closes the bot's connection.
await client.close()
@client.event
async def on_ready():
"""
This event is triggered once the bot has successfully connected.
It now only prints a ready message and starts the main task.
"""
print(f'Logged in as: {client.user.name} (ID: {client.user.id})')
# Schedule the main task to run in the background
client.loop.create_task(main_export_task())
# --- Run the Bot ---
if __name__ == "__main__":
if BOT_TOKEN == "YOUR_BOT_TOKEN_HERE":
print("!!! ERROR: Please replace 'YOUR_BOT_TOKEN_HERE' with your actual bot token in the script.")
else:
try:
# This starts the bot. The `on_ready` event will be called once it's connected.
client.run(BOT_TOKEN)
except discord.errors.LoginFailure:
print("!!! ERROR: Login failed. The token is likely invalid or incorrect.")
except Exception as e:
print(f"!!! An error occurred while running the bot: {e}")
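The script's own comments recommend environment variables over a hardcoded token, and the repository already carries a `.env` file. A minimal sketch of reading the token from the environment instead (the `DISCORD_BOT_TOKEN` variable name is an assumption; with python-dotenv installed, a `load_dotenv()` call could populate the environment from a `KEY=VALUE` style `.env` first):

```python
import os

def get_bot_token() -> str:
    """Read the bot token from the environment rather than the source file.

    The variable name is illustrative; python-dotenv's load_dotenv() could
    be called beforehand to populate os.environ from a .env file.
    """
    token = os.environ.get("DISCORD_BOT_TOKEN", "")
    if not token:
        raise RuntimeError("DISCORD_BOT_TOKEN is not set")
    return token
```

The token string returned here would then be passed to `client.run(...)` in place of the `BOT_TOKEN` constant.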

147
scripts/embed_class.py Normal file

@@ -0,0 +1,147 @@
# embed_class.py
# Description: A simple Python class to generate text embeddings using sentence-transformers.
#
# Required libraries:
# pip install sentence-transformers pandas torch
#
# This script defines a TextEmbedder class that can be used to:
# 1. Load a pre-trained sentence-transformer model.
# 2. Embed a single string or a list of strings into vectors.
# 3. Embed an entire text column in a pandas DataFrame and add the embeddings as a new column.
import pandas as pd
from sentence_transformers import SentenceTransformer
from typing import List, Union
class TextEmbedder:
"""
A simple class to handle text embedding using sentence-transformers.
"""
def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
"""
Initializes the TextEmbedder and loads the specified model.
Args:
model_name (str): The name of the sentence-transformer model to use.
Defaults to 'all-MiniLM-L6-v2', a small and efficient model.
"""
self.model_name = model_name
self.model = None
self.load_model()
def load_model(self):
"""
Loads the sentence-transformer model from Hugging Face.
This method is called automatically during initialization.
"""
try:
print(f"Loading model: '{self.model_name}'...")
self.model = SentenceTransformer(self.model_name)
print("Model loaded successfully.")
except Exception as e:
print(f"Error loading model: {e}")
self.model = None
def embed(self, text: Union[str, List[str]]):
"""
Generates vector embeddings for a given string or list of strings.
Args:
text (Union[str, List[str]]): A single string or a list of strings to embed.
Returns:
A list of vector embeddings. Each embedding is a list of floats.
Returns None if the model is not loaded.
"""
if self.model is None:
print("Model is not loaded. Cannot perform inference.")
return None
        print("Embedding text...")
# The model's encode function handles both single strings and lists of strings.
embeddings = self.model.encode(text, convert_to_numpy=False)
# We convert to a list of lists for easier use with pandas.
if isinstance(text, str):
return embeddings.tolist()
return [emb.tolist() for emb in embeddings]
def embed_dataframe_column(self, df: pd.DataFrame, column_name: str) -> pd.DataFrame:
"""
Embeds the text in a specified DataFrame column and adds the embeddings
as a new column to the DataFrame.
Args:
df (pd.DataFrame): The pandas DataFrame to process.
column_name (str): The name of the column containing the text to embed.
Returns:
pd.DataFrame: The original DataFrame with a new column containing the embeddings.
Returns the original DataFrame unmodified if an error occurs.
"""
if self.model is None:
print("Model is not loaded. Cannot process DataFrame.")
return df
if column_name not in df.columns:
print(f"Error: Column '{column_name}' not found in the DataFrame.")
return df
# Ensure the column is of string type and handle potential missing values (NaN)
# by filling them with an empty string.
text_to_embed = df[column_name].astype(str).fillna('').tolist()
# Generate embeddings for the entire column's text
embeddings = self.embed(text_to_embed)
if embeddings:
# Add the embeddings as a new column
new_column_name = f'{column_name}_embedding'
df[new_column_name] = embeddings
print(f"Successfully added '{new_column_name}' to the DataFrame.")
return df
# --- Example Usage ---
if __name__ == '__main__':
# 1. Initialize the embedder. This will automatically load the model.
embedder = TextEmbedder(model_name='all-MiniLM-L6-v2')
# 2. Embed a single string
print("\n--- Embedding a single string ---")
single_string = "This is a simple test sentence."
vector = embedder.embed(single_string)
if vector:
print(f"Original string: '{single_string}'")
# Print the first 5 dimensions of the vector for brevity
print(f"Resulting vector (first 5 dims): {vector[:5]}")
print(f"Vector dimension: {len(vector)}")
# 3. Embed a list of strings
print("\n--- Embedding a list of strings ---")
list_of_strings = ["The quick brown fox jumps over the lazy dog.", "Hello, world!"]
vectors = embedder.embed(list_of_strings)
if vectors:
for i, text in enumerate(list_of_strings):
print(f"Original string: '{text}'")
print(f"Resulting vector (first 5 dims): {vectors[i][:5]}")
print(f"Vector dimension: {len(vectors[i])}\n")
# 4. Embed a pandas DataFrame column
print("\n--- Embedding a DataFrame column ---")
# Create a sample DataFrame
data = {'product_id': [1, 2, 3],
'description': ['A comfortable cotton t-shirt.', 'High-quality noise-cancelling headphones.', 'A book about the history of computing.']}
my_df = pd.DataFrame(data)
print("Original DataFrame:")
print(my_df)
# Embed the 'description' column
df_with_embeddings = embedder.embed_dataframe_column(my_df, 'description')
print("\nDataFrame with embeddings:")
# Using .to_string() to ensure the full content is displayed
print(df_with_embeddings.to_string())
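The vectors returned by `embed` are plain lists of floats, so downstream similarity comparisons need no extra machinery. A minimal cosine-similarity sketch (pure Python, no model required — shown only to illustrate how the class's output could be consumed):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```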

111
scripts/embedder.py Normal file

@@ -0,0 +1,111 @@
# embedder.py
# Description: A script to process all CSV files in a directory,
# add text embeddings to a specified column, and
# save the results back to the original files.
#
# This script assumes the TextEmbedder class is in a file named `embed_class.py`
# in the same directory.
import os
import pandas as pd
from embed_class import TextEmbedder  # Importing the class from embed_class.py
def create_sample_files(directory: str):
"""Creates a few sample CSV files for demonstration purposes."""
if not os.path.exists(directory):
print(f"Creating sample directory: '{directory}'")
os.makedirs(directory)
# Sample file 1: Product descriptions
df1_data = {'product_name': ['Smart Watch', 'Wireless Mouse', 'Keyboard'],
'description': ['A watch that tracks fitness and notifications.', 'Ergonomic mouse with long battery life.', 'Mechanical keyboard with RGB lighting.']}
df1 = pd.DataFrame(df1_data)
df1.to_csv(os.path.join(directory, 'products.csv'), index=False)
# Sample file 2: Customer reviews
df2_data = {'review_id': [101, 102, 103],
'comment_text': ['The product exceeded my expectations!', 'It arrived late and was the wrong color.', 'I would definitely recommend this to a friend.']}
df2 = pd.DataFrame(df2_data)
df2.to_csv(os.path.join(directory, 'reviews.csv'), index=False)
print(f"Created sample files in '{directory}'.")
def process_csvs_in_directory(directory_path: str, model_name: str = 'all-MiniLM-L6-v2'):
"""
Finds all CSV files in a directory, embeds a user-specified text column,
and overwrites the original CSV with the new data.
Args:
directory_path (str): The path to the directory containing CSV files.
model_name (str): The sentence-transformer model to use for embedding.
"""
print(f"Starting batch processing for directory: '{directory_path}'")
# 1. Initialize the TextEmbedder
# This will load the model, which can take a moment.
try:
embedder = TextEmbedder(model_name)
except Exception as e:
print(f"Failed to initialize TextEmbedder. Aborting. Error: {e}")
return
# 2. Find all CSV files in the directory
try:
all_files = os.listdir(directory_path)
csv_files = [f for f in all_files if f.endswith('.csv')]
except FileNotFoundError:
print(f"Error: Directory not found at '{directory_path}'. Please create it and add CSV files.")
return
if not csv_files:
print("No CSV files found in the directory.")
return
print(f"Found {len(csv_files)} CSV files to process.")
# 3. Loop through each CSV file
for filename in csv_files:
file_path = os.path.join(directory_path, filename)
        print(f"\n--- Processing file: {filename} ---")
try:
# Read the CSV into a DataFrame
df = pd.read_csv(file_path)
print("Available columns:", list(df.columns))
# Ask the user for the column to embed
            column_to_embed = input(f"Enter the name of the column to embed for '{filename}': ")
# Check if the column exists
if column_to_embed not in df.columns:
print(f"Column '{column_to_embed}' not found. Skipping this file.")
continue
# 4. Use the embedder to add the new column
df_with_embeddings = embedder.embed_dataframe_column(df, column_to_embed)
# 5. Save the modified DataFrame back to the original file
df_with_embeddings.to_csv(file_path, index=False)
            print(f"Successfully processed and saved '{filename}'.")
except Exception as e:
            print(f"An error occurred while processing {filename}: {e}")
continue # Move to the next file
print("\nBatch processing complete.")
# --- Main Execution Block ---
if __name__ == '__main__':
    # Define the directory where your CSV files are located.
    # By default the script targets the 'discord_chat_logs' folder one level up.
    CSV_DIRECTORY = '../discord_chat_logs'
    # Optionally create the directory with sample CSV files for a demo run.
    # Uncomment if you don't have your own data.
    # create_sample_files(CSV_DIRECTORY)
# Run the main processing function on the directory
process_csvs_in_directory(CSV_DIRECTORY)
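Because `to_csv` serialises each embedding list into a string, reading a processed file back needs one extra parsing step. A minimal sketch using `ast.literal_eval` (the column name follows the `<column>_embedding` convention used by `embed_dataframe_column`; the in-memory round trip is just for demonstration):

```python
import ast
import io
import pandas as pd

def load_embeddings(csv_source, column: str) -> pd.DataFrame:
    """Read a CSV written by this script and parse the stringified vectors."""
    df = pd.read_csv(csv_source)
    df[column] = df[column].apply(ast.literal_eval)
    return df

# Round-trip demonstration with an in-memory CSV:
buf = io.StringIO()
pd.DataFrame({"text": ["hi"], "text_embedding": [[0.1, 0.2]]}).to_csv(buf, index=False)
buf.seek(0)
df = load_embeddings(buf, "text_embedding")
```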

228
scripts/image_downloader.py Executable file

@@ -0,0 +1,228 @@
#!/usr/bin/env python3
"""
Discord Image Downloader and Base64 Converter
This script parses all CSV files in the discord_chat_logs directory,
extracts attachment URLs, downloads the images, and saves them in base64
format with associated metadata (channel and sender information).
"""
import csv
import os
import base64
import json
import requests
import urllib.parse
from pathlib import Path
from typing import Dict, List, Optional
import time
import hashlib
# Configuration
CSV_DIRECTORY = "../discord_chat_logs"
OUTPUT_DIRECTORY = "../images_dataset"
OUTPUT_JSON_FILE = "images_dataset.json"
MAX_RETRIES = 3
DELAY_BETWEEN_REQUESTS = 0.5 # seconds
# Supported image extensions
SUPPORTED_EXTENSIONS = {'.png', '.jpg', '.jpeg', '.gif', '.webp', '.bmp', '.tiff'}
class ImageDownloader:
def __init__(self, csv_dir: str, output_dir: str):
self.csv_dir = Path(csv_dir)
self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
self.session = requests.Session()
self.session.headers.update({
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
})
self.images_data = []
self.processed_urls = set()
def get_file_extension_from_url(self, url: str) -> Optional[str]:
"""Extract file extension from URL, handling Discord CDN URLs."""
# Parse the URL to get the path
parsed = urllib.parse.urlparse(url)
path = parsed.path.lower()
        # Check for the extension at the end of the URL path
        # (urlparse has already stripped any query string)
        for ext in SUPPORTED_EXTENSIONS:
            if path.endswith(ext):
                return ext
# Check query parameters for format info
query_params = urllib.parse.parse_qs(parsed.query)
if 'format' in query_params:
format_val = query_params['format'][0].lower()
if f'.{format_val}' in SUPPORTED_EXTENSIONS:
return f'.{format_val}'
return None
def is_image_url(self, url: str) -> bool:
"""Check if URL points to an image file."""
if not url or not url.startswith(('http://', 'https://')):
return False
return self.get_file_extension_from_url(url) is not None
def download_image(self, url: str) -> Optional[bytes]:
"""Download image from URL with retries."""
for attempt in range(MAX_RETRIES):
try:
print(f"Downloading: {url} (attempt {attempt + 1})")
response = self.session.get(url, timeout=30)
response.raise_for_status()
# Verify content is actually an image
content_type = response.headers.get('content-type', '').lower()
if not content_type.startswith('image/'):
print(f"Warning: URL doesn't return image content: {url}")
return None
return response.content
except requests.exceptions.RequestException as e:
print(f"Error downloading {url}: {e}")
if attempt < MAX_RETRIES - 1:
time.sleep(DELAY_BETWEEN_REQUESTS * (attempt + 1))
else:
print(f"Failed to download after {MAX_RETRIES} attempts: {url}")
return None
return None
def process_csv_file(self, csv_path: Path) -> None:
"""Process a single CSV file to extract and download images."""
channel_name = csv_path.stem
print(f"\nProcessing channel: {channel_name}")
try:
with open(csv_path, 'r', encoding='utf-8') as csvfile:
reader = csv.DictReader(csvfile)
for row_num, row in enumerate(reader, 1):
attachment_urls = row.get('attachment_urls', '').strip()
if not attachment_urls:
continue
# Split multiple URLs if they exist (comma-separated)
urls = [url.strip() for url in attachment_urls.split(',') if url.strip()]
for url in urls:
if url in self.processed_urls:
continue
if not self.is_image_url(url):
continue
self.processed_urls.add(url)
# Download the image
image_data = self.download_image(url)
if image_data is None:
continue
# Create unique filename based on URL hash
url_hash = hashlib.md5(url.encode()).hexdigest()[:12]
file_extension = self.get_file_extension_from_url(url) or '.unknown'
# Convert to base64
base64_data = base64.b64encode(image_data).decode('utf-8')
# Create metadata
image_metadata = {
'url': url,
'channel': channel_name,
'author_name': row.get('author_name', ''),
'author_nickname': row.get('author_nickname', ''),
'author_id': row.get('author_id', ''),
'message_id': row.get('message_id', ''),
'timestamp_utc': row.get('timestamp_utc', ''),
'content': row.get('content', ''),
'file_extension': file_extension,
'file_size': len(image_data),
'url_hash': url_hash,
'base64_data': base64_data
}
self.images_data.append(image_metadata)
print(f"✓ Downloaded and converted: {url} ({len(image_data)} bytes)")
# Small delay to be respectful
time.sleep(DELAY_BETWEEN_REQUESTS)
except Exception as e:
print(f"Error processing {csv_path}: {e}")
def save_dataset(self) -> None:
"""Save the collected images dataset to JSON file."""
output_file = self.output_dir / OUTPUT_JSON_FILE
# Create summary statistics
summary = {
'total_images': len(self.images_data),
'channels': list(set(img['channel'] for img in self.images_data)),
'total_size_bytes': sum(img['file_size'] for img in self.images_data),
'file_extensions': list(set(img['file_extension'] for img in self.images_data)),
'authors': list(set(img['author_name'] for img in self.images_data if img['author_name']))
}
# Prepare final dataset
dataset = {
'metadata': {
'created_at': time.strftime('%Y-%m-%d %H:%M:%S UTC', time.gmtime()),
'summary': summary
},
'images': self.images_data
}
# Save to JSON file
with open(output_file, 'w', encoding='utf-8') as jsonfile:
json.dump(dataset, jsonfile, indent=2, ensure_ascii=False)
print(f"\n✓ Dataset saved to: {output_file}")
print(f"Total images: {summary['total_images']}")
print(f"Total size: {summary['total_size_bytes']:,} bytes")
print(f"Channels: {', '.join(summary['channels'])}")
def run(self) -> None:
"""Main execution function."""
print("Discord Image Downloader and Base64 Converter")
print("=" * 50)
# Find all CSV files
csv_files = list(self.csv_dir.glob("*.csv"))
if not csv_files:
print(f"No CSV files found in {self.csv_dir}")
return
print(f"Found {len(csv_files)} CSV files to process")
# Process each CSV file
for csv_file in csv_files:
self.process_csv_file(csv_file)
# Save the final dataset
if self.images_data:
self.save_dataset()
else:
print("\nNo images were found or downloaded.")
def main():
"""Main entry point."""
script_dir = Path(__file__).parent
csv_directory = script_dir / CSV_DIRECTORY
output_directory = script_dir / OUTPUT_DIRECTORY
if not csv_directory.exists():
print(f"Error: CSV directory not found: {csv_directory}")
return
downloader = ImageDownloader(str(csv_directory), str(output_directory))
downloader.run()
if __name__ == "__main__":
main()
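The dataset stores raw image bytes as base64 strings, so consumers must decode before use. A minimal sketch of recovering the bytes from one entry of the `images` list in `images_dataset.json` (opening the result with PIL, as the viewer app does, would be one line further):

```python
import base64

def decode_entry(entry: dict) -> bytes:
    """Recover the original image bytes from one dataset entry.

    `entry` is one element of the `images` list in images_dataset.json;
    only its `base64_data` field is required here.
    """
    return base64.b64decode(entry["base64_data"])

# With PIL installed: Image.open(BytesIO(decode_entry(entry)))
```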