Compare commits

...

17 Commits

Author SHA1 Message Date
ce906e4f9a udpated perplexity factor 2025-08-11 16:11:21 +01:00
fd9b25f256 updated readme 2025-08-11 03:07:44 +01:00
2b8659fc95 beter clusters and qol 2025-08-11 03:04:50 +01:00
647111e9d3 3d viz 2025-08-11 02:49:41 +01:00
4ca7e8ab61 refactor 2025-08-11 02:37:21 +01:00
6d35b42b27 updated reqs from clusteing 2025-08-11 02:22:59 +01:00
248cc5765f clustermap app 2025-08-11 01:59:48 +01:00
80c115b57d embedded datasets 2025-08-11 01:51:43 +01:00
aa9f2dc618 updated embedder 2025-08-11 01:51:34 +01:00
fb3fb70cc5 text embedding script and class 2025-08-11 01:47:52 +01:00
7ca86d7751 image viewer app 2025-08-11 01:35:14 +01:00
245cc81289 images dataset 2025-08-11 01:22:03 +01:00
ba528a3806 image downloader +read me 2025-08-11 01:21:35 +01:00
e22705600a DATASETS 2025-08-11 01:10:41 +01:00
9aaad019a5 added new bot script 2025-08-11 01:10:36 +01:00
45190bd0ff env 2025-08-11 01:10:24 +01:00
458a8c4881 updated bot dir 2025-08-11 01:10:17 +01:00
31 changed files with 14020 additions and 55 deletions

.env Normal file

@@ -0,0 +1 @@
MTQwNDI0NTI1MTk4Nzg2OTgyOA.G_GnSa.wsi4qZ_4F40EU19wxfRLA3UG521_r9TSxOL4Q0


@@ -0,0 +1,98 @@
# Discord Image Downloader
This script processes Discord chat log CSV files to download and convert images to a base64 dataset.
## Features
- Parses all CSV files in the `discord_chat_logs/` directory
- Extracts attachment URLs from the `attachment_urls` column
- Downloads images using wget-like functionality (via Python requests)
- Converts images to base64 format for easy storage and processing
- Saves metadata including channel, sender, timestamp, and message context
- Handles Discord CDN URLs with query parameters
- Implements retry logic and rate limiting
- Deduplicates images based on URL hash
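The README doesn't show how the URL-hash deduplication is implemented; below is a minimal sketch of the idea, assuming a SHA-256 over the URL with Discord's expiring CDN query parameters stripped (the helper names `url_hash` and `is_duplicate` are illustrative, not taken from the script):

```python
import hashlib

seen_hashes = set()

def url_hash(url):
    # Drop query parameters (?ex=...&is=...) so the same attachment fetched
    # with different, expiring CDN tokens maps to one stable key.
    base = url.split("?", 1)[0]
    return hashlib.sha256(base.encode("utf-8")).hexdigest()[:12]

def is_duplicate(url):
    """Return True if this attachment URL was already seen."""
    h = url_hash(url)
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False
```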
## Setup
1. Install dependencies:
```bash
./setup.sh
```
Or manually:
```bash
pip3 install -r requirements.txt
```
2. Run the image downloader:
```bash
cd scripts
python3 image_downloader.py
```
## Output
The script creates an `images_dataset/` directory containing:
- `images_dataset.json` - Complete dataset with images in base64 format
### Dataset Structure
```json
{
"metadata": {
"created_at": "2025-08-11 12:34:56 UTC",
"summary": {
"total_images": 42,
"channels": ["memes", "general", "nsfw"],
"total_size_bytes": 1234567,
"file_extensions": [".png", ".jpg", ".gif"],
"authors": ["user1", "user2"]
}
},
"images": [
{
"url": "https://cdn.discordapp.com/attachments/...",
"channel": "memes",
"author_name": "username",
"author_nickname": "User Nickname",
"author_id": "123456789",
"message_id": "987654321",
"timestamp_utc": "2020-03-11 18:25:49.086000+00:00",
"content": "Message text content",
"file_extension": ".png",
"file_size": 54321,
"url_hash": "abc123def456",
"base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
}
]
}
```
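To consume the dataset downstream, each `base64_data` field just needs decoding back to raw bytes. A small sketch (the path matches the output layout above; error handling omitted):

```python
import base64
import json

def load_images(path="images_dataset/images_dataset.json"):
    """Yield (channel, extension, raw_bytes) for every image in the dataset."""
    with open(path) as f:
        dataset = json.load(f)
    for entry in dataset["images"]:
        yield entry["channel"], entry["file_extension"], base64.b64decode(entry["base64_data"])
```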
## Supported Image Formats
- PNG (.png)
- JPEG (.jpg, .jpeg)
- GIF (.gif)
- WebP (.webp)
- BMP (.bmp)
- TIFF (.tiff)
## Configuration
You can modify the following variables in `image_downloader.py`:
- `MAX_RETRIES` - Number of download retry attempts (default: 3)
- `DELAY_BETWEEN_REQUESTS` - Delay between requests in seconds (default: 0.5)
- `SUPPORTED_EXTENSIONS` - Set of supported image file extensions
## Error Handling
The script includes robust error handling:
- Skips non-image URLs
- Retries failed downloads with exponential backoff
- Validates content types from server responses
- Continues processing even if individual downloads fail
- Logs all activities and errors to console
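The retry-with-exponential-backoff behavior described above can be factored generically. A sketch — the script's internals aren't shown here, so `with_backoff` is an illustrative helper, sleeping 0.5 s, 1 s, 2 s across the default three attempts:

```python
import time

def with_backoff(fn, max_retries=3, base_delay=0.5):
    """Call fn(); on failure, retry with exponentially growing delays."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * (2 ** attempt))
```

Usage would look like `image_bytes = with_backoff(lambda: download(url))`, keeping the download logic itself free of retry bookkeeping.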

README.md

@@ -1,2 +1,281 @@
# cult-scraper
# Discord Data Analysis & Visualization Suite
A comprehensive toolkit for scraping, processing, and analyzing Discord chat data with advanced visualization capabilities.
## 🌟 Features
### 📥 Data Collection
- **Discord Bot Scraper**: Automated extraction of complete message history from Discord servers
- **Image Downloader**: Downloads and processes images from Discord attachments with base64 conversion
- **Text Embeddings**: Generate semantic embeddings for chat messages using sentence transformers
### 📊 Visualization & Analysis
- **Interactive Chat Visualizer**: 2D visualization of chat messages using dimensionality reduction (PCA, t-SNE)
- **Clustering Analysis**: Automated grouping of similar messages with DBSCAN and HDBSCAN
- **Image Dataset Viewer**: Browse and explore downloaded images by channel
### 🔧 Data Processing
- **Batch Processing**: Process multiple CSV files with embeddings
- **Metadata Extraction**: Comprehensive message metadata including timestamps, authors, and content
- **Data Filtering**: Advanced filtering by authors, channels, and timeframes
## 📁 Repository Structure
```
cult-scraper-1/
├── scripts/ # Core data collection scripts
│ ├── bot.py # Discord bot for message scraping
│ ├── image_downloader.py # Download and convert Discord images
│ ├── embedder.py # Batch text embedding processor
│ └── embed_class.py # Text embedding utilities
├── apps/ # Interactive applications
│ ├── cluster_map/ # Chat message clustering & visualization
│ │ ├── main.py # Main Streamlit application
│ │ ├── data_loader.py # Data loading utilities
│ │ ├── clustering.py # Clustering algorithms
│ │ ├── visualization.py # Plotting and visualization
│ │ └── requirements.txt # Dependencies
│ └── image_viewer/ # Image dataset browser
│ ├── image_viewer.py # Streamlit image viewer
│ └── requirements.txt # Dependencies
├── discord_chat_logs/ # Exported CSV files from Discord
└── images_dataset/ # Downloaded images and metadata
└── images_dataset.json # Image dataset with base64 data
```
## 🚀 Quick Start
### 1. Discord Data Scraping
First, set up and run the Discord bot to collect message data:
```bash
cd scripts
# Configure your bot token in bot.py
python bot.py
```
**Requirements:**
- Discord bot token with message content intent enabled
- Bot must have read permissions in target channels
### 2. Generate Text Embeddings
Process the collected chat data to add semantic embeddings:
```bash
cd scripts
python embedder.py
```
This will:
- Process all CSV files in `discord_chat_logs/`
- Add embeddings to message content using sentence transformers
- Save updated files with embedding vectors
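The embedding step presumably wraps something like the following. This is a sketch: `embed` is a stand-in so the example is self-contained, where the real `embedder.py` would call `SentenceTransformer("all-MiniLM-L6-v2").encode(...)`; the stored format matches the CSV column shown later (string form of a list):

```python
import pandas as pd

def embed(texts):
    # Stand-in for: SentenceTransformer("all-MiniLM-L6-v2").encode(texts).tolist()
    return [[float(len(t)), 0.0] for t in texts]

def add_embeddings(df):
    """Add a content_embedding column, stored as the string form of a list."""
    vectors = embed(df["content"].fillna("").tolist())
    df["content_embedding"] = [str(v) for v in vectors]
    return df
```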
### 3. Download Images
Extract and download images from Discord attachments:
```bash
cd scripts
python image_downloader.py
```
Features:
- Downloads images from attachment URLs
- Converts to base64 for storage
- Handles multiple image formats (PNG, JPG, GIF, WebP, etc.)
- Implements retry logic and rate limiting
### 4. Visualize Chat Data
Launch the interactive chat visualization tool:
```bash
cd apps/cluster_map
pip install -r requirements.txt
streamlit run main.py
```
**Capabilities:**
- 2D visualization using PCA or t-SNE
- Interactive clustering with DBSCAN/HDBSCAN
- Filter by channels, authors, and time periods
- Hover to see message content and metadata
### 5. Browse Image Dataset
View downloaded images in an organized interface:
```bash
cd apps/image_viewer
pip install -r requirements.txt
streamlit run image_viewer.py
```
**Features:**
- Channel-based organization
- Navigation controls (previous/next)
- Image metadata display
- Responsive layout
## 📋 Data Formats
### Discord Chat Logs (CSV)
```csv
message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds,content_embedding
1234567890,2025-08-11 12:34:56,9876543210,username,nickname,"Hello world!","https://cdn.discordapp.com/...",{},"[0.123, -0.456, ...]"
```
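Since `content_embedding` is stored as the string form of a list, consumers have to parse it back into a vector. A small sketch, consistent with how `data_loader.py` handles the column:

```python
import ast

import numpy as np

def parse_embedding(cell):
    """Parse one stringified embedding; return a float array, or None if invalid."""
    try:
        vec = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return None
    if isinstance(vec, list) and vec:
        return np.asarray(vec, dtype=float)
    return None
```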
### Image Dataset (JSON)
```json
{
"metadata": {
"created_at": "2025-08-11 12:34:56 UTC",
"summary": {
"total_images": 42,
"channels": ["memes", "general"],
"total_size_bytes": 1234567,
"file_extensions": [".png", ".jpg"],
"authors": ["user1", "user2"]
}
},
"images": [
{
"url": "https://cdn.discordapp.com/attachments/...",
"channel": "memes",
"author_name": "username",
"timestamp_utc": "2025-08-11 12:34:56+00:00",
"content": "Message text",
"file_extension": ".png",
"file_size": 54321,
"base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
}
]
}
```
## 🔧 Configuration
### Discord Bot Setup
1. Create a Discord application at https://discord.com/developers/applications
2. Create a bot and copy the token
3. Enable the following intents:
- Message Content Intent
- Server Members Intent (optional)
4. Invite bot to your server with appropriate permissions
### Environment Variables
```bash
# Set in scripts/bot.py
BOT_TOKEN = "your_discord_bot_token_here"
```
### Embedding Models
The system uses sentence-transformers models. Default: `all-MiniLM-L6-v2`
Supported models:
- `all-MiniLM-L6-v2` (lightweight, fast)
- `all-mpnet-base-v2` (higher quality)
- `sentence-transformers/all-roberta-large-v1` (best quality, slower)
## 📊 Visualization Features
### Chat Message Clustering
- **Dimensionality Reduction**: PCA, t-SNE, UMAP
- **Clustering Algorithms**: DBSCAN, HDBSCAN with automatic parameter tuning
- **Interactive Controls**: Filter by source files, authors, and clusters
- **Hover Information**: View message content, author, timestamp on hover
### Image Analysis
- **Channel Organization**: Browse images by Discord channel
- **Metadata Display**: Author, timestamp, message context
- **Navigation**: Previous/next controls with slider
- **Format Support**: PNG, JPG, GIF, WebP, BMP, TIFF
## 🛠️ Dependencies
### Core Scripts
- `discord.py` - Discord bot framework
- `pandas` - Data manipulation
- `sentence-transformers` - Text embeddings
- `requests` - HTTP requests for image downloads
### Visualization Apps
- `streamlit` - Web interface framework
- `plotly` - Interactive plotting
- `scikit-learn` - Machine learning algorithms
- `numpy` - Numerical computations
- `umap-learn` - Dimensionality reduction
- `hdbscan` - Density-based clustering
## 📈 Use Cases
### Research & Analytics
- **Community Analysis**: Understand conversation patterns and topics
- **Sentiment Analysis**: Track mood and sentiment over time
- **User Behavior**: Analyze posting patterns and engagement
- **Content Moderation**: Identify problematic content clusters
### Data Science Projects
- **NLP Research**: Experiment with text embeddings and clustering
- **Social Network Analysis**: Study communication patterns
- **Visualization Techniques**: Explore dimensionality reduction methods
- **Image Processing**: Analyze visual content sharing patterns
### Content Management
- **Archive Creation**: Preserve Discord community history
- **Content Discovery**: Find similar messages and discussions
- **Moderation Tools**: Identify spam or inappropriate content
- **Backup Solutions**: Create comprehensive data backups
## 🔒 Privacy & Ethics
- **Data Protection**: All processing happens locally
- **User Consent**: Ensure proper permissions before scraping
- **Compliance**: Follow Discord's Terms of Service
- **Anonymization**: Consider removing or hashing user IDs for research
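For the anonymization point, a salted one-way hash keeps user IDs joinable across files without keeping them recoverable. A sketch (the salt value is a placeholder you would replace and keep secret):

```python
import hashlib

def anonymize_id(author_id, salt="replace-with-a-secret-salt"):
    """Deterministic, irreversible pseudonym for a Discord user ID."""
    return hashlib.sha256((salt + str(author_id)).encode()).hexdigest()[:16]
```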
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request
## 📄 License
This project is intended for educational and research purposes. Please ensure compliance with Discord's Terms of Service and applicable privacy laws when using this toolkit.
## 🆘 Troubleshooting
### Common Issues
**Bot can't read messages:**
- Ensure Message Content Intent is enabled
- Check bot permissions in Discord server
- Verify bot token is correct
**Embeddings not generating:**
- Install sentence-transformers: `pip install sentence-transformers`
- Check available GPU memory for large models
- Try a smaller model like `all-MiniLM-L6-v2`
**Images not downloading:**
- Check internet connectivity
- Verify Discord CDN URLs are accessible
- Increase retry limits for unreliable connections
**Visualization not loading:**
- Ensure all requirements are installed
- Check that CSV files have embeddings
- Try reducing dataset size for better performance
## 📚 Additional Resources
- [Discord.py Documentation](https://discordpy.readthedocs.io/)
- [Sentence Transformers Models](https://www.sbert.net/docs/pretrained_models.html)
- [Streamlit Documentation](https://docs.streamlit.io/)
- [scikit-learn Clustering](https://scikit-learn.org/stable/modules/clustering.html)


@@ -0,0 +1,58 @@
# Discord Chat Embeddings Visualizer
A Streamlit application that visualizes Discord chat messages using their vector embeddings in 2D space.
## Features
- **2D Visualization**: View chat messages plotted using PCA or t-SNE dimension reduction
- **Interactive Plotting**: Hover over points to see message content, author, and timestamp
- **Filtering**: Filter by source chat log files and authors
- **Multiple Datasets**: Automatically loads all CSV files from the discord_chat_logs folder
## Installation
1. Install the required dependencies:
```bash
pip install -r requirements.txt
```
## Usage
Run the Streamlit application:
```bash
streamlit run streamlit_app.py
```
The app will automatically load all CSV files from the `../../discord_chat_logs/` directory.
## Data Format
The application expects CSV files with the following columns:
- `message_id`: Unique identifier for the message
- `timestamp_utc`: When the message was sent
- `author_id`: Author's Discord ID
- `author_name`: Author's username
- `author_nickname`: Author's server nickname
- `content`: The message content
- `attachment_urls`: Any attached files
- `embeds`: Embedded content
- `content_embedding`: Vector embedding of the message content (as a string representation of a list)
## Visualization Options
- **PCA**: Principal Component Analysis - faster, good for getting an overview
- **t-SNE**: t-Distributed Stochastic Neighbor Embedding - slower but may reveal better clusters
## Controls
- **Dimension Reduction Method**: Choose between PCA and t-SNE
- **Filter by Source Files**: Select which chat log files to include
- **Filter by Authors**: Select which authors to display
- **Show Data Table**: View the underlying data in table format
## Performance Notes
- For large datasets, consider filtering by authors or source files to improve performance
- t-SNE is computationally intensive and may take longer with large datasets
- The app caches data and computations for better performance


@@ -0,0 +1,12 @@
"""
Discord Chat Embeddings Visualizer - Legacy Entry Point
This file serves as a compatibility layer for the original cluster.py.
The application has been refactored into modular components for better maintainability.
"""
# Import and run the main application
from main import main

if __name__ == "__main__":
    main()


@@ -0,0 +1,226 @@
"""
Clustering algorithms and evaluation metrics.
"""
import numpy as np
import streamlit as st
from sklearn.cluster import SpectralClustering, AgglomerativeClustering, OPTICS
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, calinski_harabasz_score
import hdbscan
import pandas as pd
from collections import Counter
import re
from config import DEFAULT_RANDOM_STATE
def summarize_cluster_content(cluster_messages, max_words=3):
    """
    Generate a meaningful name for a cluster based on its message content.

    Args:
        cluster_messages: List of message contents in the cluster
        max_words: Maximum number of words in the cluster name

    Returns:
        str: Generated cluster name
    """
    if not cluster_messages:
        return "Empty Cluster"
    # Combine all messages and clean text
    all_text = " ".join([str(msg) for msg in cluster_messages if pd.notna(msg)])
    if not all_text.strip():
        return "Empty Content"
    # Basic text cleaning
    text = all_text.lower()
    # Remove URLs, mentions, and special characters
    text = re.sub(r'http[s]?://\S+', '', text)  # Remove URLs
    text = re.sub(r'<@\d+>', '', text)          # Remove Discord mentions
    text = re.sub(r'<:\w+:\d+>', '', text)      # Remove custom emojis
    text = re.sub(r'[^\w\s]', ' ', text)        # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()    # Normalize whitespace
    if not text:
        return "Special Characters"
    # Split into words and filter out common words
    words = text.split()
    # Common stop words to filter out
    stop_words = {
        'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with',
        'by', 'from', 'up', 'about', 'into', 'through', 'during', 'before', 'after',
        'above', 'below', 'between', 'among', 'until', 'without', 'under', 'over',
        'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
        'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might',
        'i', 'you', 'he', 'she', 'it', 'we', 'they', 'me', 'him', 'her', 'us', 'them',
        'my', 'your', 'his', 'her', 'its', 'our', 'their', 'this', 'that', 'these', 'those',
        'just', 'like', 'get', 'know', 'think', 'see', 'go', 'come', 'say', 'said',
        'yeah', 'yes', 'no', 'oh', 'ok', 'okay', 'well', 'so', 'but', 'if', 'when',
        'what', 'where', 'why', 'how', 'who', 'which', 'than', 'then', 'now', 'here',
        'there', 'also', 'too', 'very', 'really', 'pretty', 'much', 'more', 'most',
        'some', 'any', 'all', 'many', 'few', 'little', 'big', 'small', 'good', 'bad'
    }
    # Filter out stop words and very short/long words
    filtered_words = [
        word for word in words
        if word not in stop_words
        and 3 <= len(word) <= 15
        and word.isalpha()  # Only alphabetic words
    ]
    if not filtered_words:
        return f"Chat ({len(cluster_messages)} msgs)"
    # Count word frequencies
    word_counts = Counter(filtered_words)
    # Get most common words (more than needed, to allow filtering below)
    most_common = word_counts.most_common(max_words * 2)
    # Select diverse words (avoid very similar words)
    selected_words = []
    for word, count in most_common:
        # Avoid adding very similar words
        if not any(word.startswith(existing[:4]) or existing.startswith(word[:4])
                   for existing in selected_words):
            selected_words.append(word)
        if len(selected_words) >= max_words:
            break
    if not selected_words:
        return f"Discussion ({len(cluster_messages)} msgs)"
    # Create cluster name and add message count for context
    cluster_name = " + ".join(selected_words[:max_words]).title()
    cluster_name += f" ({len(cluster_messages)})"
    return cluster_name


def generate_cluster_names(filtered_df, cluster_labels):
    """
    Generate names for all clusters based on their content.

    Args:
        filtered_df: DataFrame with message data
        cluster_labels: Array of cluster labels for each message

    Returns:
        dict: Mapping from cluster_id to cluster_name
    """
    if cluster_labels is None:
        return {}
    cluster_names = {}
    unique_clusters = np.unique(cluster_labels)
    for cluster_id in unique_clusters:
        if cluster_id == -1:
            cluster_names[cluster_id] = "Noise/Outliers"
            continue
        # Get messages in this cluster
        cluster_mask = cluster_labels == cluster_id
        cluster_messages = filtered_df[cluster_mask]['content'].tolist()
        # Generate name
        cluster_names[cluster_id] = summarize_cluster_content(cluster_messages)
    return cluster_names


def apply_clustering(embeddings, clustering_method="None", n_clusters=5):
    """
    Apply clustering algorithm to embeddings and return labels and metrics.

    Args:
        embeddings: High-dimensional embeddings to cluster
        clustering_method: Name of clustering algorithm
        n_clusters: Number of clusters (for methods that require it)

    Returns:
        tuple: (cluster_labels, silhouette_score, calinski_harabasz_score)
    """
    if clustering_method == "None" or len(embeddings) <= n_clusters:
        return None, None, None
    # Standardize embeddings for better clustering
    scaler = StandardScaler()
    scaled_embeddings = scaler.fit_transform(embeddings)
    cluster_labels = None
    silhouette_avg = None
    calinski_harabasz = None
    try:
        if clustering_method == "HDBSCAN":
            min_cluster_size = max(2, len(embeddings) // 20)  # Adaptive min cluster size
            clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                                        min_samples=1, cluster_selection_epsilon=0.5)
            cluster_labels = clusterer.fit_predict(scaled_embeddings)
        elif clustering_method == "Spectral Clustering":
            clusterer = SpectralClustering(n_clusters=n_clusters, random_state=DEFAULT_RANDOM_STATE,
                                           affinity='rbf', gamma=1.0)
            cluster_labels = clusterer.fit_predict(scaled_embeddings)
        elif clustering_method == "Gaussian Mixture":
            clusterer = GaussianMixture(n_components=n_clusters, random_state=DEFAULT_RANDOM_STATE,
                                        covariance_type='full', max_iter=200)
            cluster_labels = clusterer.fit_predict(scaled_embeddings)
        elif clustering_method == "Agglomerative (Ward)":
            clusterer = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
            cluster_labels = clusterer.fit_predict(scaled_embeddings)
        elif clustering_method == "Agglomerative (Complete)":
            clusterer = AgglomerativeClustering(n_clusters=n_clusters, linkage='complete')
            cluster_labels = clusterer.fit_predict(scaled_embeddings)
        elif clustering_method == "OPTICS":
            min_samples = max(2, len(embeddings) // 50)
            clusterer = OPTICS(min_samples=min_samples, xi=0.05, min_cluster_size=0.1)
            cluster_labels = clusterer.fit_predict(scaled_embeddings)
        # Calculate clustering quality metrics
        if cluster_labels is not None and len(np.unique(cluster_labels)) > 1:
            # Only calculate if we have multiple clusters and no noise-only clustering
            valid_labels = cluster_labels[cluster_labels != -1]  # Remove noise points for HDBSCAN/OPTICS
            valid_embeddings = scaled_embeddings[cluster_labels != -1]
            if len(valid_labels) > 0 and len(np.unique(valid_labels)) > 1:
                silhouette_avg = silhouette_score(valid_embeddings, valid_labels)
                calinski_harabasz = calinski_harabasz_score(valid_embeddings, valid_labels)
    except Exception as e:
        st.warning(f"Clustering failed: {str(e)}")
        cluster_labels = None
    return cluster_labels, silhouette_avg, calinski_harabasz


def get_cluster_statistics(cluster_labels):
    """Get basic statistics about clustering results"""
    if cluster_labels is None:
        return {}
    unique_clusters = np.unique(cluster_labels)
    n_clusters = len(unique_clusters[unique_clusters != -1])  # Exclude noise cluster (-1)
    n_noise = np.sum(cluster_labels == -1)
    return {
        "n_clusters": n_clusters,
        "n_noise_points": n_noise,
        "cluster_distribution": np.bincount(cluster_labels[cluster_labels != -1]) if n_clusters > 0 else [],
        "unique_clusters": unique_clusters
    }


@@ -0,0 +1,75 @@
"""
Configuration settings and constants for the Discord Chat Embeddings Visualizer.
"""
# Application settings
APP_TITLE = "The Cult - Visualised"
APP_ICON = "🗨️"
APP_LAYOUT = "wide"
# File paths
CHAT_LOGS_PATH = "../../discord_chat_logs"
# Algorithm parameters
DEFAULT_RANDOM_STATE = 42
DEFAULT_N_COMPONENTS = 2
DEFAULT_N_CLUSTERS = 5
DEFAULT_DIMENSION_REDUCTION_METHOD = "t-SNE"
DEFAULT_CLUSTERING_METHOD = "None"
# Visualization settings
DEFAULT_POINT_SIZE = 8
DEFAULT_POINT_OPACITY = 0.7
MAX_DISPLAYED_AUTHORS = 10
MESSAGE_CONTENT_PREVIEW_LENGTH = 200
MESSAGE_CONTENT_DISPLAY_LENGTH = 100
# Performance thresholds
LARGE_DATASET_WARNING_THRESHOLD = 1000
# Color palettes
PRIMARY_COLORS = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
                  "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]

# Clustering method categories
CLUSTERING_METHODS_REQUIRING_N_CLUSTERS = [
    "Spectral Clustering",
    "Gaussian Mixture",
    "Agglomerative (Ward)",
    "Agglomerative (Complete)"
]
COMPUTATIONALLY_INTENSIVE_METHODS = {
    "dimension_reduction": ["t-SNE", "Spectral Embedding"],
    "clustering": ["Spectral Clustering", "OPTICS"]
}

# Method explanations
METHOD_EXPLANATIONS = {
    "dimension_reduction": {
        "PCA": "Linear, fast, preserves global variance",
        "t-SNE": "Non-linear, good for local structure, slower",
        "UMAP": "Balanced speed/quality, preserves local & global structure",
        "Spectral Embedding": "Uses graph theory, good for non-convex clusters",
        "Force-Directed": "Physics-based layout, creates natural spacing"
    },
    "clustering": {
        "HDBSCAN": "Density-based, finds variable density clusters, handles noise",
        "Spectral Clustering": "Uses eigenvalues, good for non-convex shapes",
        "Gaussian Mixture": "Probabilistic, assumes gaussian distributions",
        "Agglomerative (Ward)": "Hierarchical, minimizes within-cluster variance",
        "Agglomerative (Complete)": "Hierarchical, minimizes maximum distance",
        "OPTICS": "Density-based, finds clusters of varying densities"
    },
    "separation": {
        "Spread Factor": "Applies repulsive forces between nearby points",
        "Smart Jittering": "Adds intelligent noise to separate overlapping points",
        "Density-Based Jittering": "Stronger separation in crowded areas",
        "Perplexity Factor": "Controls t-SNE's focus on local vs global structure",
        "Min Distance Factor": "Controls UMAP's point packing tightness"
    },
    "metrics": {
        "Silhouette Score": "Higher is better (range: -1 to 1)",
        "Calinski-Harabasz": "Higher is better, measures cluster separation"
    }
}


@@ -0,0 +1,86 @@
"""
Data loading and parsing utilities for Discord chat logs.
"""
import pandas as pd
import numpy as np
import streamlit as st
import ast
from pathlib import Path
from config import CHAT_LOGS_PATH
@st.cache_data
def load_all_chat_data():
    """Load all CSV files from the discord_chat_logs folder"""
    chat_logs_path = Path(CHAT_LOGS_PATH)
    all_data = []
    with st.expander("📁 Loading Details", expanded=False):
        # Display the path for debugging
        st.write(f"Looking for CSV files in: {chat_logs_path}")
        st.write(f"Path exists: {chat_logs_path.exists()}")
        for csv_file in chat_logs_path.glob("*.csv"):
            try:
                df = pd.read_csv(csv_file)
                df['source_file'] = csv_file.stem  # Add source file name
                all_data.append(df)
                st.write(f"✅ Loaded {len(df)} messages from {csv_file.name}")
            except Exception as e:
                st.error(f"❌ Error loading {csv_file.name}: {e}")
        if all_data:
            combined_df = pd.concat(all_data, ignore_index=True)
            st.success(f"🎉 Successfully loaded {len(combined_df)} total messages from {len(all_data)} files")
        else:
            st.error("No data loaded!")
            combined_df = pd.DataFrame()
    return combined_df if all_data else pd.DataFrame()


@st.cache_data
def parse_embeddings(df):
    """Parse the content_embedding column from string to numpy array"""
    embeddings = []
    valid_indices = []
    for idx, embedding_str in enumerate(df['content_embedding']):
        try:
            # Parse the string representation of the list
            embedding = ast.literal_eval(embedding_str)
            if isinstance(embedding, list) and len(embedding) > 0:
                embeddings.append(embedding)
                valid_indices.append(idx)
        except Exception:
            continue
    embeddings_array = np.array(embeddings)
    valid_df = df.iloc[valid_indices].copy()
    st.info(f"📊 Parsed {len(embeddings)} valid embeddings from {len(df)} messages")
    st.info(f"🔢 Embedding dimension: {embeddings_array.shape[1] if len(embeddings) > 0 else 0}")
    return embeddings_array, valid_df


def filter_data(df, selected_sources, selected_authors):
    """Filter dataframe by selected sources and authors"""
    if not selected_sources:
        selected_sources = df['source_file'].unique()
    filtered_df = df[
        (df['source_file'].isin(selected_sources)) &
        (df['author_name'].isin(selected_authors))
    ]
    return filtered_df


def get_filtered_embeddings(embeddings, valid_df, filtered_df):
    """Get embeddings corresponding to filtered dataframe"""
    filtered_indices = filtered_df.index.tolist()
    filtered_embeddings = embeddings[[i for i, idx in enumerate(valid_df.index) if idx in filtered_indices]]
    return filtered_embeddings


@@ -0,0 +1,211 @@
"""
Dimensionality reduction algorithms and point separation techniques.
"""
import numpy as np
import streamlit as st
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, SpectralEmbedding
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from scipy.spatial.distance import pdist, squareform
from scipy.optimize import minimize
import umap
from config import DEFAULT_RANDOM_STATE
def apply_adaptive_spreading(embeddings, spread_factor=1.0):
    """
    Apply adaptive spreading to push apart nearby points while preserving global structure.
    Uses a force-based approach where closer points repel more strongly.
    """
    if spread_factor <= 0:
        return embeddings
    embeddings = embeddings.copy()
    n_points = len(embeddings)
    print(f"DEBUG: Applying adaptive spreading to {n_points} points with factor {spread_factor}")
    if n_points < 2:
        return embeddings
    # For very large datasets, skip spreading to avoid hanging
    if n_points > 1000:
        print(f"DEBUG: Large dataset ({n_points} points), skipping adaptive spreading...")
        return embeddings
    # Calculate pairwise distances
    distances = squareform(pdist(embeddings))
    # Apply force-based spreading with fewer iterations for large datasets
    max_iterations = 3 if n_points > 500 else 5
    for iteration in range(max_iterations):
        if iteration % 2 == 0:  # Progress indicator
            print(f"DEBUG: Spreading iteration {iteration + 1}/{max_iterations}")
        forces = np.zeros_like(embeddings)
        for i in range(n_points):
            for j in range(i + 1, n_points):
                diff = embeddings[i] - embeddings[j]
                dist = np.linalg.norm(diff)
                if dist > 0:
                    # Repulsive force inversely proportional to distance
                    force_magnitude = spread_factor / (dist ** 2 + 0.01)
                    force_direction = diff / dist
                    force = force_magnitude * force_direction
                    forces[i] += force
                    forces[j] -= force
        # Apply forces with damping
        embeddings += forces * 0.1
    print("DEBUG: Adaptive spreading complete")
    return embeddings


def force_directed_layout(high_dim_embeddings, n_components=2, spread_factor=1.0):
    """
    Create a force-directed layout from high-dimensional embeddings.
    This creates more natural spacing between similar points.
    """
    print(f"DEBUG: Starting force-directed layout with {len(high_dim_embeddings)} points...")
    # For large datasets, fall back to PCA + spreading to avoid hanging
    if len(high_dim_embeddings) > 500:
        print(f"DEBUG: Large dataset ({len(high_dim_embeddings)} points), using PCA + spreading instead...")
        pca = PCA(n_components=n_components, random_state=DEFAULT_RANDOM_STATE)
        result = pca.fit_transform(high_dim_embeddings)
        return apply_adaptive_spreading(result, spread_factor)
    # Start with PCA as initial layout
    pca = PCA(n_components=n_components, random_state=DEFAULT_RANDOM_STATE)
    initial_layout = pca.fit_transform(high_dim_embeddings)
    print("DEBUG: Initial PCA layout computed...")
    # For simplicity, just apply spreading to the PCA result;
    # the original optimization was too computationally intensive
    result = apply_adaptive_spreading(initial_layout, spread_factor)
    print("DEBUG: Force-directed layout complete...")
    return result


def calculate_local_density_scaling(embeddings, k=5):
    """
    Calculate local density scaling factors to emphasize differences in dense regions.
    """
    if len(embeddings) < k:
        return np.ones(len(embeddings))
    # Find k nearest neighbors for each point
    nn = NearestNeighbors(n_neighbors=k + 1)  # +1 because first neighbor is the point itself
    nn.fit(embeddings)
    distances, indices = nn.kneighbors(embeddings)
    # Calculate local density (inverse of average distance to k nearest neighbors)
    local_densities = 1.0 / (np.mean(distances[:, 1:], axis=1) + 1e-6)
    # Normalize densities
    local_densities = (local_densities - np.min(local_densities)) / (np.max(local_densities) - np.min(local_densities) + 1e-6)
    return local_densities


def apply_density_based_jittering(embeddings, density_scaling=True, jitter_strength=0.1):
    """
    Apply smart jittering that's stronger in dense regions to separate overlapping points.
    """
    if not density_scaling:
        # Simple random jittering
        noise = np.random.normal(0, jitter_strength, embeddings.shape)
        return embeddings + noise
    # Calculate local densities
    densities = calculate_local_density_scaling(embeddings)
    # Apply density-proportional jittering
    jittered = embeddings.copy()
    for i in range(len(embeddings)):
        # More jitter in denser regions
        jitter_amount = jitter_strength * (1 + densities[i])
        noise = np.random.normal(0, jitter_amount, embeddings.shape[1])
        jittered[i] += noise
    return jittered


def reduce_dimensions(embeddings, method="PCA", n_components=2, spread_factor=1.0,
                      perplexity_factor=1.0, min_dist_factor=1.0):
    """Apply dimensionality reduction with enhanced separation"""
    # Convert to numpy array if it's not already
    embeddings = np.array(embeddings)
    print(f"DEBUG: Starting {method} with {len(embeddings)} embeddings, shape: {embeddings.shape}")
    # Standardize embeddings for better processing
    scaler = StandardScaler()
    scaled_embeddings = scaler.fit_transform(embeddings)
    print("DEBUG: Embeddings standardized")
    # Apply the selected dimensionality reduction method
    if method == "PCA":
        print("DEBUG: Applying PCA...")
        reducer = PCA(n_components=n_components, random_state=DEFAULT_RANDOM_STATE)
        reduced_embeddings = reducer.fit_transform(scaled_embeddings)
        # Apply spreading to PCA results
        print("DEBUG: Applying spreading...")
        reduced_embeddings = apply_adaptive_spreading(reduced_embeddings, spread_factor)
    elif method == "t-SNE":
        # Adjust perplexity based on user preference and data size
        base_perplexity = min(30, len(embeddings) - 1)
        adjusted_perplexity = max(5, min(50, int(base_perplexity * perplexity_factor)))
        print(f"DEBUG: Applying t-SNE with perplexity {adjusted_perplexity}...")
        reducer = TSNE(n_components=n_components, random_state=DEFAULT_RANDOM_STATE,
                       perplexity=adjusted_perplexity, n_iter=1000,
                       early_exaggeration=12.0 * spread_factor,  # Increase early exaggeration for more separation
                       learning_rate='auto')
        reduced_embeddings = reducer.fit_transform(scaled_embeddings)
    elif method == "UMAP":
        # Adjust UMAP parameters for better local separation
        n_neighbors = min(15, len(embeddings) - 1)
        min_dist = 0.1 * min_dist_factor
        spread = 1.0 * spread_factor
        print(f"DEBUG: Applying UMAP with n_neighbors={n_neighbors}, min_dist={min_dist}...")
        reducer = umap.UMAP(n_components=n_components, random_state=DEFAULT_RANDOM_STATE,
                            n_neighbors=n_neighbors, min_dist=min_dist,
                            spread=spread, local_connectivity=2.0)
        reduced_embeddings = reducer.fit_transform(scaled_embeddings)
    elif method == "Spectral Embedding":
        n_neighbors = min(10, len(embeddings) - 1)
        print(f"DEBUG: Applying Spectral Embedding with n_neighbors={n_neighbors}...")
        reducer = SpectralEmbedding(n_components=n_components, random_state=DEFAULT_RANDOM_STATE,
                                    n_neighbors=n_neighbors)
        reduced_embeddings = reducer.fit_transform(scaled_embeddings)
        # Apply spreading to spectral results
        print("DEBUG: Applying spreading...")
        reduced_embeddings = apply_adaptive_spreading(reduced_embeddings, spread_factor)
    elif method == "Force-Directed":
        # New method: Use force-directed layout for natural spreading
print(f"DEBUG: Applying Force-Directed layout...")
reduced_embeddings = force_directed_layout(scaled_embeddings, n_components, spread_factor)
else:
# Fallback to PCA
print(f"DEBUG: Unknown method {method}, falling back to PCA...")
reducer = PCA(n_components=n_components, random_state=DEFAULT_RANDOM_STATE)
reduced_embeddings = reducer.fit_transform(scaled_embeddings)
reduced_embeddings = apply_adaptive_spreading(reduced_embeddings, spread_factor)
print(f"DEBUG: Dimensionality reduction complete. Output shape: {reduced_embeddings.shape}")
return reduced_embeddings
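Stripped of the UI, spreading, and debug output, the PCA branch above reduces to standardise-then-project; a standalone sketch (the random seed here is illustrative, not the module's `DEFAULT_RANDOM_STATE`):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
embeddings = rng.normal(size=(50, 384))   # 50 fake 384-dim embeddings

scaled = StandardScaler().fit_transform(embeddings)
coords = PCA(n_components=2, random_state=42).fit_transform(scaled)
assert coords.shape == (50, 2)
```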

apps/cluster_map/main.py Normal file

@@ -0,0 +1,169 @@
"""
Main application logic for the Discord Chat Embeddings Visualizer.
"""
import streamlit as st
import warnings
warnings.filterwarnings('ignore')
# Import custom modules
from ui_components import (
setup_page_config, display_title_and_description, get_all_ui_parameters,
display_performance_warnings
)
from data_loader import (
load_all_chat_data, parse_embeddings, filter_data, get_filtered_embeddings
)
from dimensionality_reduction import (
reduce_dimensions, apply_density_based_jittering
)
from clustering import apply_clustering, generate_cluster_names
from visualization import (
create_visualization_plot, display_clustering_metrics, display_summary_stats,
display_clustering_results, display_data_table, display_cluster_summary
)
def main():
"""Main application function"""
# Set up page configuration
setup_page_config()
# Display title and description
display_title_and_description()
# Load data
with st.spinner("Loading chat data..."):
df = load_all_chat_data()
if df.empty:
st.error("No data could be loaded. Please check the data directory.")
st.stop()
# Parse embeddings
with st.spinner("Parsing embeddings..."):
embeddings, valid_df = parse_embeddings(df)
if len(embeddings) == 0:
st.error("No valid embeddings found!")
st.stop()
# Get UI parameters
params = get_all_ui_parameters(valid_df)
# Check if any sources are selected before proceeding
if not params['selected_sources']:
st.info("📂 **Select source files from the sidebar to begin visualization**")
st.markdown("### Available Data Sources:")
# Show available sources as an informational table
source_info = []
for source in valid_df['source_file'].unique():
source_data = valid_df[valid_df['source_file'] == source]
source_info.append({
'Source File': source,
'Messages': len(source_data),
'Unique Authors': source_data['author_name'].nunique(),
'Date Range': f"{source_data['timestamp_utc'].min()} to {source_data['timestamp_utc'].max()}"
})
import pandas as pd
source_df = pd.DataFrame(source_info)
st.dataframe(source_df, use_container_width=True, hide_index=True)
st.markdown("👈 **Use the sidebar to select which sources to visualize**")
st.stop()
# Filter data
filtered_df = filter_data(valid_df, params['selected_sources'], params['selected_authors'])
if filtered_df.empty:
st.warning("No data matches the current filters! Try selecting different sources or authors.")
st.stop()
# Display performance warnings
display_performance_warnings(filtered_df, params['method'], params['clustering_method'])
# Get corresponding embeddings
filtered_embeddings = get_filtered_embeddings(embeddings, valid_df, filtered_df)
st.info(f"📈 Visualizing {len(filtered_df)} messages")
# Reduce dimensions
n_components = 3 if params['enable_3d'] else 2
with st.spinner(f"Reducing dimensions using {params['method']}..."):
reduced_embeddings = reduce_dimensions(
filtered_embeddings,
method=params['method'],
n_components=n_components,
spread_factor=params['spread_factor'],
perplexity_factor=params['perplexity_factor'],
min_dist_factor=params['min_dist_factor']
)
# Apply clustering
with st.spinner(f"Applying {params['clustering_method']}..."):
cluster_labels, silhouette_avg, calinski_harabasz = apply_clustering(
filtered_embeddings,
clustering_method=params['clustering_method'],
n_clusters=params['n_clusters']
)
# Apply jittering if requested
if params['apply_jittering']:
with st.spinner("Applying smart jittering to separate overlapping points..."):
reduced_embeddings = apply_density_based_jittering(
reduced_embeddings,
density_scaling=params['density_based_jitter'],
jitter_strength=params['jitter_strength']
)
# Generate cluster names if clustering was applied
cluster_names = None
if cluster_labels is not None:
with st.spinner("Generating cluster names..."):
cluster_names = generate_cluster_names(filtered_df, cluster_labels)
# Display clustering metrics
display_clustering_metrics(
cluster_labels, silhouette_avg, calinski_harabasz,
params['show_cluster_metrics']
)
# Display cluster summary with names
if cluster_names:
display_cluster_summary(cluster_names, cluster_labels)
# Create and display the main plot
fig = create_visualization_plot(
reduced_embeddings=reduced_embeddings,
filtered_df=filtered_df,
cluster_labels=cluster_labels,
selected_sources=params['selected_sources'] if params['selected_sources'] else None,
method=params['method'],
clustering_method=params['clustering_method'],
point_size=params['point_size'],
point_opacity=params['point_opacity'],
density_based_sizing=params['density_based_sizing'],
size_variation=params['size_variation'],
enable_3d=params['enable_3d'],
cluster_names=cluster_names
)
st.plotly_chart(fig, use_container_width=True)
# Display summary statistics
display_summary_stats(filtered_df, params['selected_sources'] or filtered_df['source_file'].unique())
# Display clustering results and export options
display_clustering_results(
filtered_df, cluster_labels, reduced_embeddings,
params['method'], params['clustering_method'], params['enable_3d']
)
# Display data table
display_data_table(filtered_df, cluster_labels)
if __name__ == "__main__":
main()


@@ -0,0 +1,8 @@
streamlit>=1.28.0
pandas>=1.5.0
numpy>=1.24.0
plotly>=5.15.0
scikit-learn>=1.3.0
umap-learn>=0.5.3
hdbscan>=0.8.29
scipy>=1.10.0


@@ -0,0 +1,43 @@
#!/usr/bin/env python3
"""
Test script to debug the hanging issue in the modular app
"""
import numpy as np
import sys
import os
# Add the current directory to Python path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
def test_dimensionality_reduction():
"""Test dimensionality reduction functions"""
print("Testing dimensionality reduction functions...")
from dimensionality_reduction import reduce_dimensions
# Create test data similar to what we'd expect
n_samples = 796 # Same as the user's dataset
n_features = 384 # Common embedding dimension
print(f"Creating test embeddings: {n_samples} x {n_features}")
test_embeddings = np.random.randn(n_samples, n_features)
# Test PCA (should be fast)
print("Testing PCA...")
try:
result = reduce_dimensions(test_embeddings, method="PCA")
print(f"✓ PCA successful, output shape: {result.shape}")
except Exception as e:
print(f"✗ PCA failed: {e}")
# Test UMAP (might be slower)
print("Testing UMAP...")
try:
result = reduce_dimensions(test_embeddings, method="UMAP")
print(f"✓ UMAP successful, output shape: {result.shape}")
except Exception as e:
print(f"✗ UMAP failed: {e}")
if __name__ == "__main__":
test_dimensionality_reduction()


@@ -0,0 +1,267 @@
"""
Streamlit UI components and controls for the Discord Chat Embeddings Visualizer.
"""
import streamlit as st
import numpy as np
from config import (
APP_TITLE, APP_ICON, APP_LAYOUT, METHOD_EXPLANATIONS,
CLUSTERING_METHODS_REQUIRING_N_CLUSTERS, COMPUTATIONALLY_INTENSIVE_METHODS,
LARGE_DATASET_WARNING_THRESHOLD, MAX_DISPLAYED_AUTHORS,
DEFAULT_DIMENSION_REDUCTION_METHOD, DEFAULT_CLUSTERING_METHOD
)
def setup_page_config():
"""Set up the Streamlit page configuration"""
st.set_page_config(
page_title=APP_TITLE,
page_icon=APP_ICON,
layout=APP_LAYOUT
)
def display_title_and_description():
"""Display the main title and description"""
st.title(f"{APP_ICON} {APP_TITLE}")
    st.markdown("Explore Discord chat messages through their vector embeddings in 2D or 3D space")
def create_method_controls():
"""Create controls for dimension reduction and clustering methods"""
st.sidebar.header("🎛️ Visualization Controls")
# 3D visualization toggle
enable_3d = st.sidebar.checkbox(
"Enable 3D Visualization",
value=False,
help="Switch between 2D and 3D visualization. 3D uses 3 components instead of 2."
)
# Dimension reduction method
method_options = ["PCA", "t-SNE", "UMAP", "Spectral Embedding", "Force-Directed"]
default_index = method_options.index(DEFAULT_DIMENSION_REDUCTION_METHOD) if DEFAULT_DIMENSION_REDUCTION_METHOD in method_options else 0
method = st.sidebar.selectbox(
"Dimension Reduction Method",
method_options,
index=default_index,
help="PCA is fastest, UMAP balances speed and quality, t-SNE and Spectral are slower but may reveal better structures. Force-Directed creates natural spacing."
)
# Clustering method
clustering_options = ["None", "HDBSCAN", "Spectral Clustering", "Gaussian Mixture",
"Agglomerative (Ward)", "Agglomerative (Complete)", "OPTICS"]
clustering_default_index = clustering_options.index(DEFAULT_CLUSTERING_METHOD) if DEFAULT_CLUSTERING_METHOD in clustering_options else 0
clustering_method = st.sidebar.selectbox(
"Clustering Method",
clustering_options,
index=clustering_default_index,
help="Apply clustering to identify groups. HDBSCAN and OPTICS can find variable density clusters."
)
return method, clustering_method, enable_3d
def create_clustering_controls(clustering_method):
"""Create controls for clustering parameters"""
# Always show the clusters slider, but indicate when it's used
if clustering_method in CLUSTERING_METHODS_REQUIRING_N_CLUSTERS:
help_text = "Number of clusters to create. This setting affects the clustering algorithm."
disabled = False
elif clustering_method == "None":
help_text = "Clustering is disabled. This setting has no effect."
disabled = True
else:
help_text = f"{clustering_method} automatically determines the number of clusters. This setting has no effect."
disabled = True
n_clusters = st.sidebar.slider(
"Number of Clusters",
min_value=2,
max_value=20,
value=5,
disabled=disabled,
help=help_text
)
return n_clusters
def create_separation_controls(method):
"""Create controls for point separation and method-specific parameters"""
st.sidebar.subheader("🎯 Point Separation Controls")
spread_factor = st.sidebar.slider(
"Spread Factor",
0.5, 3.0, 1.0, 0.1,
help="Increase to spread apart nearby points. Higher values create more separation."
)
# Method-specific parameters
perplexity_factor = 1.0
min_dist_factor = 1.0
if method == "t-SNE":
perplexity_factor = st.sidebar.slider(
"Perplexity Factor",
0.1, 2.0, 1.0, 0.1,
help="Affects local vs global structure balance. Lower values focus on local details."
)
if method == "UMAP":
min_dist_factor = st.sidebar.slider(
"Min Distance Factor",
0.1, 2.0, 1.0, 0.1,
help="Controls how tightly points are packed. Lower values create tighter clusters."
)
return spread_factor, perplexity_factor, min_dist_factor
def create_jittering_controls():
"""Create controls for jittering options"""
apply_jittering = st.sidebar.checkbox(
"Apply Smart Jittering",
value=False,
help="Add intelligent noise to separate overlapping points"
)
jitter_strength = 0.1
density_based_jitter = True
if apply_jittering:
jitter_strength = st.sidebar.slider(
"Jitter Strength",
0.01, 0.5, 0.1, 0.01,
help="Strength of jittering. Higher values spread points more."
)
density_based_jitter = st.sidebar.checkbox(
"Density-Based Jittering",
value=True,
help="Apply stronger jittering in dense regions"
)
return apply_jittering, jitter_strength, density_based_jitter
def create_advanced_options():
"""Create advanced visualization options"""
with st.sidebar.expander("⚙️ Advanced Options"):
show_cluster_metrics = st.checkbox("Show Clustering Metrics", value=True)
point_size = st.slider("Point Size", 4, 15, 8)
point_opacity = st.slider("Point Opacity", 0.3, 1.0, 0.7)
# Density-based visualization
density_based_sizing = st.checkbox(
"Density-Based Point Sizing",
value=False,
help="Make points larger in sparse regions, smaller in dense regions"
)
size_variation = 2.0
if density_based_sizing:
size_variation = st.slider(
"Size Variation Factor",
1.5, 4.0, 2.0, 0.1,
help="How much point sizes vary based on local density"
)
return show_cluster_metrics, point_size, point_opacity, density_based_sizing, size_variation
def create_filter_controls(valid_df):
"""Create controls for filtering data by source and author"""
# Source file filter
source_files = valid_df['source_file'].unique()
selected_sources = st.sidebar.multiselect(
"Filter by Source Files",
source_files,
default=[],
help="Select which chat log files to include"
)
# Author filter
authors = valid_df['author_name'].unique()
default_authors = authors[:MAX_DISPLAYED_AUTHORS] if len(authors) > MAX_DISPLAYED_AUTHORS else authors
selected_authors = st.sidebar.multiselect(
"Filter by Authors",
authors,
default=default_authors,
help="Select which authors to include"
)
return selected_sources, selected_authors
def display_method_explanations():
"""Display explanations for different methods"""
st.sidebar.markdown("---")
with st.sidebar.expander("📚 Method Explanations"):
st.markdown("**Dimensionality Reduction:**")
for method, explanation in METHOD_EXPLANATIONS["dimension_reduction"].items():
st.markdown(f"- **{method}**: {explanation}")
st.markdown("\n**Clustering Methods:**")
for method, explanation in METHOD_EXPLANATIONS["clustering"].items():
st.markdown(f"- **{method}**: {explanation}")
st.markdown("\n**Separation Techniques:**")
for technique, explanation in METHOD_EXPLANATIONS["separation"].items():
st.markdown(f"- **{technique}**: {explanation}")
st.markdown("\n**Metrics:**")
for metric, explanation in METHOD_EXPLANATIONS["metrics"].items():
st.markdown(f"- **{metric}**: {explanation}")
def display_performance_warnings(filtered_df, method, clustering_method):
"""Display performance warnings for computationally intensive operations"""
if len(filtered_df) > LARGE_DATASET_WARNING_THRESHOLD:
if method in COMPUTATIONALLY_INTENSIVE_METHODS["dimension_reduction"]:
st.warning(f"⚠️ {method} with {len(filtered_df)} points may take several minutes to compute.")
if clustering_method in COMPUTATIONALLY_INTENSIVE_METHODS["clustering"]:
st.warning(f"⚠️ {clustering_method} with {len(filtered_df)} points may be computationally intensive.")
def get_all_ui_parameters(valid_df):
"""Get all UI parameters in a single function call"""
# Method selection
method, clustering_method, enable_3d = create_method_controls()
# Clustering parameters
n_clusters = create_clustering_controls(clustering_method)
# Separation controls
spread_factor, perplexity_factor, min_dist_factor = create_separation_controls(method)
# Jittering controls
apply_jittering, jitter_strength, density_based_jitter = create_jittering_controls()
# Advanced options
show_cluster_metrics, point_size, point_opacity, density_based_sizing, size_variation = create_advanced_options()
# Filters
selected_sources, selected_authors = create_filter_controls(valid_df)
# Method explanations
display_method_explanations()
return {
'method': method,
'clustering_method': clustering_method,
'enable_3d': enable_3d,
'n_clusters': n_clusters,
'spread_factor': spread_factor,
'perplexity_factor': perplexity_factor,
'min_dist_factor': min_dist_factor,
'apply_jittering': apply_jittering,
'jitter_strength': jitter_strength,
'density_based_jitter': density_based_jitter,
'show_cluster_metrics': show_cluster_metrics,
'point_size': point_size,
'point_opacity': point_opacity,
'density_based_sizing': density_based_sizing,
'size_variation': size_variation,
'selected_sources': selected_sources,
'selected_authors': selected_authors
}


@@ -0,0 +1,311 @@
"""
Visualization functions for creating interactive plots and displays.
"""
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import streamlit as st
from dimensionality_reduction import calculate_local_density_scaling
from config import MESSAGE_CONTENT_PREVIEW_LENGTH, DEFAULT_POINT_SIZE, DEFAULT_POINT_OPACITY
def create_hover_text(df):
"""Create hover text for plotly"""
hover_text = []
for _, row in df.iterrows():
text = f"<b>Author:</b> {row['author_name']}<br>"
text += f"<b>Timestamp:</b> {row['timestamp_utc']}<br>"
text += f"<b>Source:</b> {row['source_file']}<br>"
# Handle potential NaN or non-string content
content = row['content']
if pd.isna(content) or content is None:
content_text = "[No content]"
else:
content_str = str(content)
content_text = content_str[:MESSAGE_CONTENT_PREVIEW_LENGTH] + ('...' if len(content_str) > MESSAGE_CONTENT_PREVIEW_LENGTH else '')
text += f"<b>Content:</b> {content_text}"
hover_text.append(text)
return hover_text
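The truncation rule used for hover content, in isolation (a sketch; the `PREVIEW` value is an assumption standing in for `MESSAGE_CONTENT_PREVIEW_LENGTH`, and only `None` is handled here rather than full `pd.isna`):

```python
PREVIEW = 200  # assumed stand-in for MESSAGE_CONTENT_PREVIEW_LENGTH

def preview(content, limit=PREVIEW):
    # Missing content becomes a placeholder; long content is cut with an ellipsis.
    if content is None:
        return "[No content]"
    s = str(content)
    return s[:limit] + ("..." if len(s) > limit else "")

assert preview(None) == "[No content]"
assert preview("hi") == "hi"
assert len(preview("x" * 300)) == 203  # 200 chars + "..."
```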
def calculate_point_sizes(reduced_embeddings, density_based_sizing=False,
point_size=DEFAULT_POINT_SIZE, size_variation=2.0):
"""Calculate point sizes based on density if enabled"""
if not density_based_sizing:
return [point_size] * len(reduced_embeddings)
local_densities = calculate_local_density_scaling(reduced_embeddings)
# Invert densities so sparse areas get larger points
inverted_densities = 1.0 - local_densities
# Scale point sizes
point_sizes = point_size * (1.0 + inverted_densities * (size_variation - 1.0))
return point_sizes
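The sizing rule above is linear in inverted density: sparse points (density 0) get `base * variation`, dense points (density 1) stay at `base`. In isolation:

```python
import numpy as np

def point_size(density, base=8, variation=2.0):
    # base * (1 + (1 - density) * (variation - 1))
    return base * (1.0 + (1.0 - np.asarray(density)) * (variation - 1.0))

s = point_size(np.array([0.0, 0.5, 1.0]))
assert s.tolist() == [16.0, 12.0, 8.0]
```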
def create_clustered_plot(reduced_embeddings, filtered_df, cluster_labels, hover_text,
point_sizes, point_opacity=DEFAULT_POINT_OPACITY, method="PCA", enable_3d=False,
cluster_names=None):
"""Create a plot colored by clusters"""
fig = go.Figure()
unique_clusters = np.unique(cluster_labels)
colors = px.colors.qualitative.Set3 + px.colors.qualitative.Pastel
for i, cluster_id in enumerate(unique_clusters):
cluster_mask = cluster_labels == cluster_id
if cluster_mask.any():
cluster_embeddings = reduced_embeddings[cluster_mask]
cluster_hover = [hover_text[j] for j, mask in enumerate(cluster_mask) if mask]
cluster_sizes = [point_sizes[j] for j, mask in enumerate(cluster_mask) if mask]
# Use generated name if available, otherwise fall back to default
if cluster_names and cluster_id in cluster_names:
cluster_name = cluster_names[cluster_id]
else:
cluster_name = f"Cluster {cluster_id}" if cluster_id != -1 else "Noise"
if enable_3d:
fig.add_trace(go.Scatter3d(
x=cluster_embeddings[:, 0],
y=cluster_embeddings[:, 1],
z=cluster_embeddings[:, 2],
mode='markers',
name=cluster_name,
marker=dict(
size=cluster_sizes,
color=colors[i % len(colors)],
opacity=point_opacity,
line=dict(width=1, color='white')
),
hovertemplate='%{hovertext}<extra></extra>',
hovertext=cluster_hover
))
else:
fig.add_trace(go.Scatter(
x=cluster_embeddings[:, 0],
y=cluster_embeddings[:, 1],
mode='markers',
name=cluster_name,
marker=dict(
size=cluster_sizes,
color=colors[i % len(colors)],
opacity=point_opacity,
line=dict(width=1, color='white')
),
hovertemplate='%{hovertext}<extra></extra>',
hovertext=cluster_hover
))
return fig
def create_source_colored_plot(reduced_embeddings, filtered_df, selected_sources, hover_text,
point_sizes, point_opacity=DEFAULT_POINT_OPACITY, enable_3d=False):
"""Create a plot colored by source files"""
fig = go.Figure()
colors = px.colors.qualitative.Set1
for i, source in enumerate(selected_sources):
source_mask = filtered_df['source_file'] == source
if source_mask.any():
source_embeddings = reduced_embeddings[source_mask]
source_hover = [hover_text[j] for j, mask in enumerate(source_mask) if mask]
source_sizes = [point_sizes[j] for j, mask in enumerate(source_mask) if mask]
if enable_3d:
fig.add_trace(go.Scatter3d(
x=source_embeddings[:, 0],
y=source_embeddings[:, 1],
z=source_embeddings[:, 2],
mode='markers',
name=source,
marker=dict(
size=source_sizes,
color=colors[i % len(colors)],
opacity=point_opacity,
line=dict(width=1, color='white')
),
hovertemplate='%{hovertext}<extra></extra>',
hovertext=source_hover
))
else:
fig.add_trace(go.Scatter(
x=source_embeddings[:, 0],
y=source_embeddings[:, 1],
mode='markers',
name=source,
marker=dict(
size=source_sizes,
color=colors[i % len(colors)],
opacity=point_opacity,
line=dict(width=1, color='white')
),
hovertemplate='%{hovertext}<extra></extra>',
hovertext=source_hover
))
return fig
def create_visualization_plot(reduced_embeddings, filtered_df, cluster_labels=None,
selected_sources=None, method="PCA", clustering_method="None",
point_size=DEFAULT_POINT_SIZE, point_opacity=DEFAULT_POINT_OPACITY,
density_based_sizing=False, size_variation=2.0, enable_3d=False,
cluster_names=None):
"""Create the main visualization plot"""
# Create hover text
hover_text = create_hover_text(filtered_df)
# Calculate point sizes
point_sizes = calculate_point_sizes(reduced_embeddings, density_based_sizing,
point_size, size_variation)
# Create plot based on coloring strategy
if cluster_labels is not None:
fig = create_clustered_plot(reduced_embeddings, filtered_df, cluster_labels,
hover_text, point_sizes, point_opacity, method, enable_3d,
cluster_names)
else:
if selected_sources is None:
selected_sources = filtered_df['source_file'].unique()
fig = create_source_colored_plot(reduced_embeddings, filtered_df, selected_sources,
hover_text, point_sizes, point_opacity, enable_3d)
# Update layout
title_suffix = f" with {clustering_method}" if clustering_method != "None" else ""
dimension_text = "3D" if enable_3d else "2D"
if enable_3d:
fig.update_layout(
title=f"Discord Chat Messages - {method} {dimension_text} Visualization{title_suffix}",
scene=dict(
xaxis_title=f"{method} Component 1",
yaxis_title=f"{method} Component 2",
zaxis_title=f"{method} Component 3"
),
width=1000,
height=700
)
else:
fig.update_layout(
title=f"Discord Chat Messages - {method} {dimension_text} Visualization{title_suffix}",
xaxis_title=f"{method} Component 1",
yaxis_title=f"{method} Component 2",
hovermode='closest',
width=1000,
height=700
)
return fig
def display_clustering_metrics(cluster_labels, silhouette_avg, calinski_harabasz, show_metrics=True):
"""Display clustering quality metrics"""
if cluster_labels is not None and show_metrics:
col1, col2, col3 = st.columns(3)
with col1:
n_clusters_found = len(np.unique(cluster_labels[cluster_labels != -1]))
st.metric("Clusters Found", n_clusters_found)
with col2:
if silhouette_avg is not None:
st.metric("Silhouette Score", f"{silhouette_avg:.3f}")
else:
st.metric("Silhouette Score", "N/A")
with col3:
if calinski_harabasz is not None:
st.metric("Calinski-Harabasz Index", f"{calinski_harabasz:.1f}")
else:
st.metric("Calinski-Harabasz Index", "N/A")
def display_summary_stats(filtered_df, selected_sources):
"""Display summary statistics"""
col1, col2, col3 = st.columns(3)
with col1:
st.metric("Total Messages", len(filtered_df))
with col2:
st.metric("Unique Authors", filtered_df['author_name'].nunique())
with col3:
st.metric("Source Files", len(selected_sources))
def display_clustering_results(filtered_df, cluster_labels, reduced_embeddings, method, clustering_method, enable_3d=False):
"""Display clustering results and export options"""
if cluster_labels is None:
return
st.subheader("📊 Clustering Results")
# Add cluster information to dataframe for export
export_df = filtered_df.copy()
export_df['cluster_id'] = cluster_labels
export_df['x_coordinate'] = reduced_embeddings[:, 0]
export_df['y_coordinate'] = reduced_embeddings[:, 1]
# Add z coordinate if 3D
if enable_3d and reduced_embeddings.shape[1] >= 3:
export_df['z_coordinate'] = reduced_embeddings[:, 2]
# Show cluster distribution
cluster_dist = pd.Series(cluster_labels).value_counts().sort_index()
st.bar_chart(cluster_dist)
# Download option
csv_data = export_df.to_csv(index=False)
dimension_text = "3D" if enable_3d else "2D"
st.download_button(
label="📥 Download Clustering Results (CSV)",
data=csv_data,
file_name=f"chat_clusters_{method}_{clustering_method}_{dimension_text}.csv",
mime="text/csv"
)
def display_data_table(filtered_df, cluster_labels=None):
"""Display the data table with optional clustering information"""
if not st.checkbox("Show Data Table"):
return
st.subheader("📋 Message Data")
display_df = filtered_df[['timestamp_utc', 'author_name', 'source_file', 'content']].copy()
# Add clustering info if available
if cluster_labels is not None:
display_df['cluster'] = cluster_labels
display_df['content'] = display_df['content'].str[:100] + '...' # Truncate for display
st.dataframe(display_df, use_container_width=True)
def display_cluster_summary(cluster_names, cluster_labels):
"""Display a summary of cluster names and their sizes"""
if not cluster_names or cluster_labels is None:
return
st.subheader("🏷️ Cluster Summary")
# Create summary data
cluster_summary = []
for cluster_id, name in cluster_names.items():
count = np.sum(cluster_labels == cluster_id)
cluster_summary.append({
'Cluster ID': cluster_id,
'Cluster Name': name,
'Message Count': count,
'Percentage': f"{100 * count / len(cluster_labels):.1f}%"
})
# Sort by message count
cluster_summary.sort(key=lambda x: x['Message Count'], reverse=True)
# Display as table
summary_df = pd.DataFrame(cluster_summary)
st.dataframe(summary_df, use_container_width=True, hide_index=True)


@@ -0,0 +1,59 @@
# Image Dataset Viewer
A simple Streamlit application to browse images from your Discord chat dataset.
## Features
- 📋 Dropdown to select different channels
- 🖼️ View images with navigation controls
- ⬅️➡️ Previous/Next buttons and slider navigation
- 📊 Display metadata for each image
- 📱 Responsive layout
## Setup and Usage
### Option 1: Using the run script (Recommended)
```bash
./run.sh
```
### Option 2: Manual setup
1. Create a virtual environment:
```bash
python3 -m venv venv
source venv/bin/activate
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Run the application:
```bash
streamlit run image_viewer.py
```
## How it works
The application:
1. Loads the `images_dataset.json` file from the `images_dataset/` directory one level up
2. Extracts unique channel names from the dataset
3. Allows you to select a channel from a dropdown
4. Displays images from that channel with navigation controls
5. Shows metadata including author, timestamp, and message content
## Dataset Structure
The app expects your dataset to have entries with:
- `channel`: The channel name
- `image_url`, `image_path`, `url`, or `attachment_url`: The image location
- `author`: The message author (optional)
- `timestamp`: When the message was sent (optional)
- `content` or `message`: The message text (optional)
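For illustration, a hypothetical entry in this shape (every value below is made up):

```python
entry = {
    "channel": "general",
    "image_url": "https://cdn.example.com/attachments/123/456/cat.png",
    "author": "alice",
    "timestamp": "2025-08-11T01:22:03+01:00",
    "content": "look at this cat",
}
# The viewer needs a channel plus at least one image source field;
# the rest is optional metadata.
assert "channel" in entry and "image_url" in entry
```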
## Troubleshooting
- If images don't load, check that the URLs in your dataset are accessible
- For local images, ensure the paths are relative to the project root
- Large datasets may take a moment to load initially


@@ -0,0 +1,226 @@
import streamlit as st
import json
import os
from pathlib import Path
import requests
from PIL import Image
from io import BytesIO
# Set page config
st.set_page_config(
page_title="Image Dataset Viewer",
page_icon="🖼️",
layout="wide"
)
# Cache the dataset loading
@st.cache_data
def load_dataset():
"""Load the images dataset JSON file"""
dataset_path = "../images_dataset/images_dataset.json"
try:
with open(dataset_path, 'r', encoding='utf-8') as f:
data = json.load(f)
return data
except Exception as e:
st.error(f"Error loading dataset: {e}")
return {}
@st.cache_data
def get_channels(data):
"""Extract unique channels from the dataset"""
# First try to get channels from metadata
if isinstance(data, dict) and 'metadata' in data and 'summary' in data['metadata']:
channels = data['metadata']['summary'].get('channels', [])
if channels:
return sorted(channels)
# Fallback: extract from images array
channels = set()
images = data.get('images', []) if isinstance(data, dict) else []
for item in images:
if isinstance(item, dict) and 'channel' in item:
channels.add(item['channel'])
return sorted(list(channels))
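A standalone sketch of the metadata-first, scan-second lookup above (sample data is made up):

```python
data = {
    "metadata": {"summary": {"channels": ["memes", "general"]}},
    "images": [{"channel": "general"}, {"channel": "art"}],
}

def channels_of(d):
    # Prefer the metadata summary; fall back to scanning the images array.
    meta = d.get("metadata", {}).get("summary", {}).get("channels", [])
    if meta:
        return sorted(meta)
    return sorted({img["channel"] for img in d.get("images", []) if "channel" in img})

assert channels_of(data) == ["general", "memes"]
assert channels_of({"images": data["images"]}) == ["art", "general"]
```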
def display_image(image_url, caption="", base64_data=None):
"""Display an image from URL, local path, or base64 data"""
try:
if base64_data and base64_data != "image datta ...........":
# Load image from base64 data
import base64
image_data = base64.b64decode(base64_data)
image = Image.open(BytesIO(image_data))
elif image_url and image_url.startswith(('http://', 'https://')):
# Load image from URL
response = requests.get(image_url, timeout=10)
response.raise_for_status()
image = Image.open(BytesIO(response.content))
elif image_url:
# Load local image
image_path = Path(__file__).parent.parent / image_url
if image_path.exists():
image = Image.open(image_path)
else:
st.error(f"Image not found: {image_url}")
return False
else:
st.error("No valid image source found")
return False
st.image(image, caption=caption, use_column_width=True)
return True
except Exception as e:
st.error(f"Error loading image: {e}")
return False
def main():
st.title("🖼️ Image Dataset Viewer")
st.markdown("Browse images from your dataset by channel")
# Load dataset
with st.spinner("Loading dataset..."):
data = load_dataset()
if not data:
st.error("No data loaded. Please check your dataset file.")
return
# Display dataset summary if available
if isinstance(data, dict) and 'metadata' in data:
metadata = data['metadata']
if 'summary' in metadata:
summary = metadata['summary']
col1, col2, col3, col4 = st.columns(4)
with col1:
st.metric("Total Images", summary.get('total_images', 'Unknown'))
with col2:
st.metric("Channels", len(summary.get('channels', [])))
with col3:
st.metric("Authors", len(summary.get('authors', [])))
with col4:
size_mb = summary.get('total_size_bytes', 0) / (1024 * 1024)
st.metric("Total Size", f"{size_mb:.1f} MB")
# Get channels
channels = get_channels(data)
if not channels:
st.error("No channels found in the dataset.")
return
# Channel selection
selected_channel = st.selectbox(
"Select a channel:",
channels,
help="Choose a channel to view its images"
)
# Filter images by channel
channel_images = []
images = data.get('images', []) if isinstance(data, dict) else []
for i, item in enumerate(images):
if isinstance(item, dict) and item.get('channel') == selected_channel:
if 'url' in item or 'base64_data' in item:
channel_images.append({
'id': i,
'data': item
})
if not channel_images:
st.warning(f"No images found for channel: {selected_channel}")
return
st.success(f"Found {len(channel_images)} images in #{selected_channel}")
# Image navigation
if len(channel_images) > 1:
col1, col2, col3 = st.columns([1, 2, 1])
with col1:
if st.button("⬅️ Previous", use_container_width=True):
if 'image_index' in st.session_state and st.session_state.image_index > 0:
st.session_state.image_index -= 1
else:
st.session_state.image_index = len(channel_images) - 1
with col2:
# Initialize or get current index
if 'image_index' not in st.session_state:
st.session_state.image_index = 0
# Image selector
st.session_state.image_index = st.slider(
"Image",
0,
len(channel_images) - 1,
st.session_state.image_index,
help=f"Navigate through {len(channel_images)} images"
)
with col3:
if st.button("Next ➡️", use_container_width=True):
if 'image_index' in st.session_state and st.session_state.image_index < len(channel_images) - 1:
st.session_state.image_index += 1
else:
st.session_state.image_index = 0
else:
st.session_state.image_index = 0
# Display current image
current_image = channel_images[st.session_state.image_index]
image_data = current_image['data']
# Get image URL and base64 data
image_url = image_data.get('url')
base64_data = image_data.get('base64_data')
if image_url or base64_data:
# Create two columns for image and metadata
col1, col2 = st.columns([2, 1])
with col1:
st.subheader(f"Image {st.session_state.image_index + 1} of {len(channel_images)}")
caption = f"Channel: #{selected_channel}"
if 'author_name' in image_data:
caption += f" | Author: {image_data['author_name']}"
if 'timestamp_utc' in image_data:
caption += f" | Time: {image_data['timestamp_utc']}"
display_image(image_url, caption, base64_data)
with col2:
st.subheader("Metadata")
# Display metadata in an organized way
metadata_to_show = {
'ID': current_image['id'],
'Channel': image_data.get('channel', 'Unknown'),
'Author': image_data.get('author_name', 'Unknown'),
'Nickname': image_data.get('author_nickname', 'Unknown'),
'Author ID': image_data.get('author_id', 'Unknown'),
'Message ID': image_data.get('message_id', 'Unknown'),
'Timestamp': image_data.get('timestamp_utc', 'Unknown'),
'File Extension': image_data.get('file_extension', 'Unknown'),
'File Size': f"{image_data.get('file_size', 0):,} bytes" if image_data.get('file_size') else 'Unknown',
'Message': image_data.get('content', 'No message'),
}
for key, value in metadata_to_show.items():
if value and value != 'Unknown':
st.write(f"**{key}:** {value}")
# Show all other metadata
st.subheader("Raw Data")
with st.expander("Show all metadata"):
st.json(image_data)
else:
st.error("No image URL or base64 data found in this entry")
st.json(image_data)
if __name__ == "__main__":
main()
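The Previous/Next button handlers above implement wrap-around navigation by hand (reset to the last index when stepping back from the first, and to zero when stepping past the last). The same behaviour can be expressed more compactly with modulo arithmetic; a minimal sketch (the `step` parameter is illustrative, not part of the app):

```python
def next_index(current: int, total: int, step: int = 1) -> int:
    """Wrap-around navigation: stepping past either end loops to the other."""
    return (current + step) % total

# Stepping forward from the last image returns to the first,
# and stepping back from the first image jumps to the last.
```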


@@ -0,0 +1,3 @@
streamlit>=1.28.0
requests>=2.31.0
Pillow>=10.0.0

File diff suppressed because one or more lines are too long


@@ -0,0 +1 @@
message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds


@@ -0,0 +1 @@
message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds


@@ -1,9 +1,7 @@
# discord_export_bot.py
# discord_export_bot_v2.py
# This bot connects to a Discord server and exports the entire message
# history from every accessible text channel into separate CSV files.
# Make sure to install the discord.py library first:
# pip install discord.py
# This version uses a more robust task-based approach to prevent hanging.
import discord
import csv
@@ -11,47 +9,34 @@ import os
import asyncio
# --- Configuration ---
# Place your Bot Token here. Treat this like a password!
# It's recommended to use environment variables for security.
BOT_TOKEN = "YOUR_BOT_TOKEN_HERE"
# The directory where the CSV files will be saved.
# The script will create this directory if it doesn't exist.
BOT_TOKEN = "___"
OUTPUT_DIRECTORY = "discord_chat_logs"
# Optional: If you want to lock the bot to one server
# ALLOWED_SERVER_ID = 123456789012345678
# -------------------
# --- Bot Setup ---
# Define the necessary "Intents" for the bot. Intents tell Discord what
# events your bot needs to receive. To read messages, we need the
# `messages` and `message_content` intents. You MUST enable these
# in the Discord Developer Portal for your bot.
# The intents MUST be enabled in the Discord Developer Portal.
intents = discord.Intents.default()
intents.guilds = True
intents.messages = True
intents.message_content = True # This is a privileged intent!
intents.message_content = True # This is the most important one!
# Create the bot client instance with the specified intents.
client = discord.Client(intents=intents)
# --- Main Export Logic ---
async def export_channel_history(channel):
"""
Asynchronously fetches all messages from a given text channel
and saves them to a CSV file.
"""
print(f"Starting export for channel: #{channel.name} (ID: {channel.id})")
print(f"-> Starting export for channel: #{channel.name}")
# Sanitize channel name to create a valid filename
# Replaces invalid file name characters with an underscore
sanitized_channel_name = "".join(c if c.isalnum() else '_' for c in channel.name)
file_path = os.path.join(OUTPUT_DIRECTORY, f"{sanitized_channel_name}.csv")
try:
message_count = 0
with open(file_path, 'w', newline='', encoding='utf-8') as csvfile:
# Define the headers for the CSV file. This includes all the
# useful information we can easily get from a message object.
header = [
'message_id', 'timestamp_utc', 'author_id', 'author_name',
'author_nickname', 'content', 'attachment_urls', 'embeds'
@@ -59,95 +44,98 @@ async def export_channel_history(channel):
writer = csv.DictWriter(csvfile, fieldnames=header)
writer.writeheader()
# Fetch the channel's history. `limit=None` tells the library to
# fetch all messages. This can take a very long time and consume
# significant memory for channels with a large history.
# This is the part that fails without the Message Content Intent
async for message in channel.history(limit=None):
message_count += 1
if message_count % 100 == 0:
if message_count % 250 == 0: # Log progress less frequently
print(f" ... processed {message_count} messages in #{channel.name}")
# Extract attachment URLs
attachment_urls = ", ".join([att.url for att in message.attachments])
# Serialize embed objects to a string representation (e.g., JSON)
# This gives a detailed look into rich embeds.
embeds_str = ", ".join([str(embed.to_dict()) for embed in message.embeds])
# Write the message data as a row in the CSV
# Handle nickname - only Member objects have nick attribute, not User objects
author_nickname = getattr(message.author, 'nick', None) or message.author.display_name
writer.writerow({
'message_id': message.id,
'timestamp_utc': message.created_at,
'author_id': message.author.id,
'author_name': message.author.name,
'author_nickname': message.author.nick,
'author_nickname': author_nickname,
'content': message.content,
'attachment_urls': attachment_urls,
'embeds': embeds_str
})
print(f"✅ Finished exporting {message_count} messages from #{channel.name}.")
if message_count > 0:
print(f"✅ Finished exporting {message_count} messages from #{channel.name}.")
else:
print(f"⚠️ Channel #{channel.name} is empty or unreadable. 0 messages exported.")
return True
except discord.errors.Forbidden:
print(f"❌ ERROR: Permission denied for channel #{channel.name}. Skipping.")
print(f"❌ ERROR: Permission denied for channel #{channel.name}. Check bot permissions. Skipping.")
return False
except Exception as e:
print(f"❌ An unexpected error occurred for channel #{channel.name}: {e}")
return False
# --- Bot Events ---
@client.event
async def on_ready():
async def main_export_task():
"""
This event is triggered once the bot has successfully connected to Discord.
The main logic for the bot's export process.
This is run as a background task to avoid blocking.
"""
print(f'Logged in as: {client.user.name} (ID: {client.user.id})')
# Wait until the bot is fully ready before starting
await client.wait_until_ready()
print('------')
print("Bot is ready. Starting export process...")
# Create the output directory if it doesn't exist
if not os.path.exists(OUTPUT_DIRECTORY):
os.makedirs(OUTPUT_DIRECTORY)
print(f"Created output directory: {OUTPUT_DIRECTORY}")
# Get the server (guild) the bot is in. This script assumes the bot
# is only in ONE server. If it's in multiple, you may need to specify
# which one to target.
guild = client.guilds[0]
if not guild:
# Use the first guild the bot is in. For specific server, use client.get_guild(ALLOWED_SERVER_ID)
if not client.guilds:
print("Error: Bot does not appear to be in any server.")
await client.close()
return
guild = client.guilds[0]
print(f"Targeting server: {guild.name} (ID: {guild.id})")
# Get a list of all text channels the bot can see
text_channels = [channel for channel in guild.text_channels]
print(f"Found {len(text_channels)} text channels to export.")
# Loop through each channel and run the export function
for channel in text_channels:
await export_channel_history(channel)
# A small delay to be respectful to Discord's API, although
# the library handles rate limiting automatically.
await asyncio.sleep(1)
print('------')
print("All channels have been processed. The bot will now shut down.")
# Shuts down the bot once the export is complete.
# This properly closes the bot's connection.
await client.close()
@client.event
async def on_ready():
"""
This event is triggered once the bot has successfully connected.
It now only prints a ready message and starts the main task.
"""
print(f'Logged in as: {client.user.name} (ID: {client.user.id})')
# Schedule the main task to run in the background
client.loop.create_task(main_export_task())
# --- Run the Bot ---
if __name__ == "__main__":
if BOT_TOKEN == "YOUR_BOT_TOKEN_HERE":
print("!!! ERROR: Please replace 'YOUR_BOT_TOKEN_HERE' with your actual bot token in the script.")
else:
try:
# This starts the bot. The `on_ready` event will be called once it's connected.
client.run(BOT_TOKEN)
except discord.errors.LoginFailure:
print("!!! ERROR: Login failed. The token is likely invalid or incorrect.")
except Exception as e:
print(f"!!! An error occurred while running the bot: {e}")
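The script's own comments recommend environment variables over a hardcoded token, and the repository already carries a `.env` file. A minimal sketch of reading the token from the environment instead (the `DISCORD_BOT_TOKEN` variable name is an assumption; with python-dotenv installed, a `load_dotenv()` call could populate the environment from a `KEY=VALUE` style `.env` first):

```python
import os

def get_bot_token() -> str:
    """Read the bot token from the environment rather than the source file.

    The variable name is illustrative; python-dotenv's load_dotenv() could
    be called beforehand to populate os.environ from a .env file.
    """
    token = os.environ.get("DISCORD_BOT_TOKEN", "")
    if not token:
        raise RuntimeError("DISCORD_BOT_TOKEN is not set")
    return token
```

The token string returned here would then be passed to `client.run(...)` in place of the `BOT_TOKEN` constant.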

147
scripts/embed_class.py Normal file

@@ -0,0 +1,147 @@
# embed_class.py
# Description: A simple Python class to generate text embeddings using sentence-transformers.
#
# Required libraries:
# pip install sentence-transformers pandas torch
#
# This script defines a TextEmbedder class that can be used to:
# 1. Load a pre-trained sentence-transformer model.
# 2. Embed a single string or a list of strings into vectors.
# 3. Embed an entire text column in a pandas DataFrame and add the embeddings as a new column.
import pandas as pd
from sentence_transformers import SentenceTransformer
from typing import List, Union
class TextEmbedder:
"""
A simple class to handle text embedding using sentence-transformers.
"""
def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
"""
Initializes the TextEmbedder and loads the specified model.
Args:
model_name (str): The name of the sentence-transformer model to use.
Defaults to 'all-MiniLM-L6-v2', a small and efficient model.
"""
self.model_name = model_name
self.model = None
self.load_model()
def load_model(self):
"""
Loads the sentence-transformer model from Hugging Face.
This method is called automatically during initialization.
"""
try:
print(f"Loading model: '{self.model_name}'...")
self.model = SentenceTransformer(self.model_name)
print("Model loaded successfully.")
except Exception as e:
print(f"Error loading model: {e}")
self.model = None
def embed(self, text: Union[str, List[str]]):
"""
Generates vector embeddings for a given string or list of strings.
Args:
text (Union[str, List[str]]): A single string or a list of strings to embed.
Returns:
A list of vector embeddings. Each embedding is a list of floats.
Returns None if the model is not loaded.
"""
if self.model is None:
print("Model is not loaded. Cannot perform inference.")
return None
        print("Embedding text...")
# The model's encode function handles both single strings and lists of strings.
embeddings = self.model.encode(text, convert_to_numpy=False)
# We convert to a list of lists for easier use with pandas.
if isinstance(text, str):
return embeddings.tolist()
return [emb.tolist() for emb in embeddings]
def embed_dataframe_column(self, df: pd.DataFrame, column_name: str) -> pd.DataFrame:
"""
Embeds the text in a specified DataFrame column and adds the embeddings
as a new column to the DataFrame.
Args:
df (pd.DataFrame): The pandas DataFrame to process.
column_name (str): The name of the column containing the text to embed.
Returns:
pd.DataFrame: The original DataFrame with a new column containing the embeddings.
Returns the original DataFrame unmodified if an error occurs.
"""
if self.model is None:
print("Model is not loaded. Cannot process DataFrame.")
return df
if column_name not in df.columns:
print(f"Error: Column '{column_name}' not found in the DataFrame.")
return df
# Ensure the column is of string type and handle potential missing values (NaN)
# by filling them with an empty string.
text_to_embed = df[column_name].astype(str).fillna('').tolist()
# Generate embeddings for the entire column's text
embeddings = self.embed(text_to_embed)
if embeddings:
# Add the embeddings as a new column
new_column_name = f'{column_name}_embedding'
df[new_column_name] = embeddings
print(f"Successfully added '{new_column_name}' to the DataFrame.")
return df
# --- Example Usage ---
if __name__ == '__main__':
# 1. Initialize the embedder. This will automatically load the model.
embedder = TextEmbedder(model_name='all-MiniLM-L6-v2')
# 2. Embed a single string
print("\n--- Embedding a single string ---")
single_string = "This is a simple test sentence."
vector = embedder.embed(single_string)
if vector:
print(f"Original string: '{single_string}'")
# Print the first 5 dimensions of the vector for brevity
print(f"Resulting vector (first 5 dims): {vector[:5]}")
print(f"Vector dimension: {len(vector)}")
# 3. Embed a list of strings
print("\n--- Embedding a list of strings ---")
list_of_strings = ["The quick brown fox jumps over the lazy dog.", "Hello, world!"]
vectors = embedder.embed(list_of_strings)
if vectors:
for i, text in enumerate(list_of_strings):
print(f"Original string: '{text}'")
print(f"Resulting vector (first 5 dims): {vectors[i][:5]}")
print(f"Vector dimension: {len(vectors[i])}\n")
# 4. Embed a pandas DataFrame column
print("\n--- Embedding a DataFrame column ---")
# Create a sample DataFrame
data = {'product_id': [1, 2, 3],
'description': ['A comfortable cotton t-shirt.', 'High-quality noise-cancelling headphones.', 'A book about the history of computing.']}
my_df = pd.DataFrame(data)
print("Original DataFrame:")
print(my_df)
# Embed the 'description' column
df_with_embeddings = embedder.embed_dataframe_column(my_df, 'description')
print("\nDataFrame with embeddings:")
# Using .to_string() to ensure the full content is displayed
print(df_with_embeddings.to_string())
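The vectors returned by `embed` are plain lists of floats, so downstream similarity comparisons need no extra machinery. A minimal cosine-similarity sketch (pure Python, no model required — shown only to illustrate how the class's output could be consumed):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```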

111
scripts/embedder.py Normal file

@@ -0,0 +1,111 @@
# embedder.py
# Description: A script to process all CSV files in a directory,
# add text embeddings to a specified column, and
# save the results back to the original files.
#
# This script assumes the TextEmbedder class is in a file named `embed_class.py`
# in the same directory.
import os
import pandas as pd
from embed_class import TextEmbedder  # Importing the class from embed_class.py
def create_sample_files(directory: str):
"""Creates a few sample CSV files for demonstration purposes."""
if not os.path.exists(directory):
print(f"Creating sample directory: '{directory}'")
os.makedirs(directory)
# Sample file 1: Product descriptions
df1_data = {'product_name': ['Smart Watch', 'Wireless Mouse', 'Keyboard'],
'description': ['A watch that tracks fitness and notifications.', 'Ergonomic mouse with long battery life.', 'Mechanical keyboard with RGB lighting.']}
df1 = pd.DataFrame(df1_data)
df1.to_csv(os.path.join(directory, 'products.csv'), index=False)
# Sample file 2: Customer reviews
df2_data = {'review_id': [101, 102, 103],
'comment_text': ['The product exceeded my expectations!', 'It arrived late and was the wrong color.', 'I would definitely recommend this to a friend.']}
df2 = pd.DataFrame(df2_data)
df2.to_csv(os.path.join(directory, 'reviews.csv'), index=False)
print(f"Created sample files in '{directory}'.")
def process_csvs_in_directory(directory_path: str, model_name: str = 'all-MiniLM-L6-v2'):
"""
Finds all CSV files in a directory, embeds a user-specified text column,
and overwrites the original CSV with the new data.
Args:
directory_path (str): The path to the directory containing CSV files.
model_name (str): The sentence-transformer model to use for embedding.
"""
print(f"Starting batch processing for directory: '{directory_path}'")
# 1. Initialize the TextEmbedder
# This will load the model, which can take a moment.
try:
embedder = TextEmbedder(model_name)
except Exception as e:
print(f"Failed to initialize TextEmbedder. Aborting. Error: {e}")
return
# 2. Find all CSV files in the directory
try:
all_files = os.listdir(directory_path)
csv_files = [f for f in all_files if f.endswith('.csv')]
except FileNotFoundError:
print(f"Error: Directory not found at '{directory_path}'. Please create it and add CSV files.")
return
if not csv_files:
print("No CSV files found in the directory.")
return
print(f"Found {len(csv_files)} CSV files to process.")
# 3. Loop through each CSV file
for filename in csv_files:
file_path = os.path.join(directory_path, filename)
        print(f"\n--- Processing file: {filename} ---")
try:
# Read the CSV into a DataFrame
df = pd.read_csv(file_path)
print("Available columns:", list(df.columns))
# Ask the user for the column to embed
            column_to_embed = input(f"Enter the name of the column to embed for '{filename}': ")
# Check if the column exists
if column_to_embed not in df.columns:
print(f"Column '{column_to_embed}' not found. Skipping this file.")
continue
# 4. Use the embedder to add the new column
df_with_embeddings = embedder.embed_dataframe_column(df, column_to_embed)
# 5. Save the modified DataFrame back to the original file
df_with_embeddings.to_csv(file_path, index=False)
            print(f"Successfully processed and saved '{filename}'.")
except Exception as e:
            print(f"An error occurred while processing {filename}: {e}")
continue # Move to the next file
print("\nBatch processing complete.")
# --- Main Execution Block ---
if __name__ == '__main__':
    # Define the directory where your CSV files are located.
    # By default the script targets the 'discord_chat_logs' folder one level up.
    CSV_DIRECTORY = '../discord_chat_logs'
    # Optionally create the directory with sample CSV files for a demo run.
    # Uncomment if you don't have your own data.
    # create_sample_files(CSV_DIRECTORY)
# Run the main processing function on the directory
process_csvs_in_directory(CSV_DIRECTORY)
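Because `to_csv` serialises each embedding list into a string, reading a processed file back needs one extra parsing step. A minimal sketch using `ast.literal_eval` (the column name follows the `<column>_embedding` convention used by `embed_dataframe_column`; the in-memory round trip is just for demonstration):

```python
import ast
import io
import pandas as pd

def load_embeddings(csv_source, column: str) -> pd.DataFrame:
    """Read a CSV written by this script and parse the stringified vectors."""
    df = pd.read_csv(csv_source)
    df[column] = df[column].apply(ast.literal_eval)
    return df

# Round-trip demonstration with an in-memory CSV:
buf = io.StringIO()
pd.DataFrame({"text": ["hi"], "text_embedding": [[0.1, 0.2]]}).to_csv(buf, index=False)
buf.seek(0)
df = load_embeddings(buf, "text_embedding")
```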

228
scripts/image_downloader.py Executable file

@@ -0,0 +1,228 @@
#!/usr/bin/env python3
"""
Discord Image Downloader and Base64 Converter
This script parses all CSV files in the discord_chat_logs directory,
extracts attachment URLs, downloads the images, and saves them in base64
format with associated metadata (channel and sender information).
"""
import csv
import os
import base64
import json
import requests
import urllib.parse
from pathlib import Path
from typing import Dict, List, Optional
import time
import hashlib
# Configuration
CSV_DIRECTORY = "../discord_chat_logs"
OUTPUT_DIRECTORY = "../images_dataset"
OUTPUT_JSON_FILE = "images_dataset.json"
MAX_RETRIES = 3
DELAY_BETWEEN_REQUESTS = 0.5 # seconds
# Supported image extensions
SUPPORTED_EXTENSIONS = {'.png', '.jpg', '.jpeg', '.gif', '.webp', '.bmp', '.tiff'}
class ImageDownloader:
def __init__(self, csv_dir: str, output_dir: str):
self.csv_dir = Path(csv_dir)
self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
self.session = requests.Session()
self.session.headers.update({
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
})
self.images_data = []
self.processed_urls = set()
def get_file_extension_from_url(self, url: str) -> Optional[str]:
"""Extract file extension from URL, handling Discord CDN URLs."""
# Parse the URL to get the path
parsed = urllib.parse.urlparse(url)
path = parsed.path.lower()
        # Check for the extension at the end of the URL path
        # (urlparse has already stripped any query string)
        for ext in SUPPORTED_EXTENSIONS:
            if path.endswith(ext):
                return ext
# Check query parameters for format info
query_params = urllib.parse.parse_qs(parsed.query)
if 'format' in query_params:
format_val = query_params['format'][0].lower()
if f'.{format_val}' in SUPPORTED_EXTENSIONS:
return f'.{format_val}'
return None
def is_image_url(self, url: str) -> bool:
"""Check if URL points to an image file."""
if not url or not url.startswith(('http://', 'https://')):
return False
return self.get_file_extension_from_url(url) is not None
def download_image(self, url: str) -> Optional[bytes]:
"""Download image from URL with retries."""
for attempt in range(MAX_RETRIES):
try:
print(f"Downloading: {url} (attempt {attempt + 1})")
response = self.session.get(url, timeout=30)
response.raise_for_status()
# Verify content is actually an image
content_type = response.headers.get('content-type', '').lower()
if not content_type.startswith('image/'):
print(f"Warning: URL doesn't return image content: {url}")
return None
return response.content
except requests.exceptions.RequestException as e:
print(f"Error downloading {url}: {e}")
if attempt < MAX_RETRIES - 1:
time.sleep(DELAY_BETWEEN_REQUESTS * (attempt + 1))
else:
print(f"Failed to download after {MAX_RETRIES} attempts: {url}")
return None
return None
def process_csv_file(self, csv_path: Path) -> None:
"""Process a single CSV file to extract and download images."""
channel_name = csv_path.stem
print(f"\nProcessing channel: {channel_name}")
try:
with open(csv_path, 'r', encoding='utf-8') as csvfile:
reader = csv.DictReader(csvfile)
for row_num, row in enumerate(reader, 1):
attachment_urls = row.get('attachment_urls', '').strip()
if not attachment_urls:
continue
# Split multiple URLs if they exist (comma-separated)
urls = [url.strip() for url in attachment_urls.split(',') if url.strip()]
for url in urls:
if url in self.processed_urls:
continue
if not self.is_image_url(url):
continue
self.processed_urls.add(url)
# Download the image
image_data = self.download_image(url)
if image_data is None:
continue
# Create unique filename based on URL hash
url_hash = hashlib.md5(url.encode()).hexdigest()[:12]
file_extension = self.get_file_extension_from_url(url) or '.unknown'
# Convert to base64
base64_data = base64.b64encode(image_data).decode('utf-8')
# Create metadata
image_metadata = {
'url': url,
'channel': channel_name,
'author_name': row.get('author_name', ''),
'author_nickname': row.get('author_nickname', ''),
'author_id': row.get('author_id', ''),
'message_id': row.get('message_id', ''),
'timestamp_utc': row.get('timestamp_utc', ''),
'content': row.get('content', ''),
'file_extension': file_extension,
'file_size': len(image_data),
'url_hash': url_hash,
'base64_data': base64_data
}
self.images_data.append(image_metadata)
print(f"✓ Downloaded and converted: {url} ({len(image_data)} bytes)")
# Small delay to be respectful
time.sleep(DELAY_BETWEEN_REQUESTS)
except Exception as e:
print(f"Error processing {csv_path}: {e}")
def save_dataset(self) -> None:
"""Save the collected images dataset to JSON file."""
output_file = self.output_dir / OUTPUT_JSON_FILE
# Create summary statistics
summary = {
'total_images': len(self.images_data),
'channels': list(set(img['channel'] for img in self.images_data)),
'total_size_bytes': sum(img['file_size'] for img in self.images_data),
'file_extensions': list(set(img['file_extension'] for img in self.images_data)),
'authors': list(set(img['author_name'] for img in self.images_data if img['author_name']))
}
# Prepare final dataset
dataset = {
'metadata': {
'created_at': time.strftime('%Y-%m-%d %H:%M:%S UTC', time.gmtime()),
'summary': summary
},
'images': self.images_data
}
# Save to JSON file
with open(output_file, 'w', encoding='utf-8') as jsonfile:
json.dump(dataset, jsonfile, indent=2, ensure_ascii=False)
print(f"\n✓ Dataset saved to: {output_file}")
print(f"Total images: {summary['total_images']}")
print(f"Total size: {summary['total_size_bytes']:,} bytes")
print(f"Channels: {', '.join(summary['channels'])}")
def run(self) -> None:
"""Main execution function."""
print("Discord Image Downloader and Base64 Converter")
print("=" * 50)
# Find all CSV files
csv_files = list(self.csv_dir.glob("*.csv"))
if not csv_files:
print(f"No CSV files found in {self.csv_dir}")
return
print(f"Found {len(csv_files)} CSV files to process")
# Process each CSV file
for csv_file in csv_files:
self.process_csv_file(csv_file)
# Save the final dataset
if self.images_data:
self.save_dataset()
else:
print("\nNo images were found or downloaded.")
def main():
"""Main entry point."""
script_dir = Path(__file__).parent
csv_directory = script_dir / CSV_DIRECTORY
output_directory = script_dir / OUTPUT_DIRECTORY
if not csv_directory.exists():
print(f"Error: CSV directory not found: {csv_directory}")
return
downloader = ImageDownloader(str(csv_directory), str(output_directory))
downloader.run()
if __name__ == "__main__":
main()
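The dataset stores raw image bytes as base64 strings, so consumers must decode before use. A minimal sketch of recovering the bytes from one entry of the `images` list in `images_dataset.json` (opening the result with PIL, as the viewer app does, would be one line further):

```python
import base64

def decode_entry(entry: dict) -> bytes:
    """Recover the original image bytes from one dataset entry.

    `entry` is one element of the `images` list in images_dataset.json;
    only its `base64_data` field is required here.
    """
    return base64.b64decode(entry["base64_data"])

# With PIL installed: Image.open(BytesIO(decode_entry(entry)))
```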