Compare commits

4 Commits: 4ca7e8ab61...main

| Author | SHA1 | Date |
|---|---|---|
| | ce906e4f9a | |
| | fd9b25f256 | |
| | 2b8659fc95 | |
| | 647111e9d3 | |

**README.md** (281 lines changed)
@@ -1,2 +1,281 @@
-# cult-scraper
+# Discord Data Analysis & Visualization Suite

A comprehensive toolkit for scraping, processing, and analyzing Discord chat data with advanced visualization capabilities.

## 🌟 Features

### 📥 Data Collection
- **Discord Bot Scraper**: Automated extraction of complete message history from Discord servers
- **Image Downloader**: Downloads and processes images from Discord attachments with base64 conversion
- **Text Embeddings**: Generate semantic embeddings for chat messages using sentence transformers

### 📊 Visualization & Analysis
- **Interactive Chat Visualizer**: 2D visualization of chat messages using dimensionality reduction (PCA, t-SNE)
- **Clustering Analysis**: Automated grouping of similar messages with DBSCAN and HDBSCAN
- **Image Dataset Viewer**: Browse and explore downloaded images by channel

### 🔧 Data Processing
- **Batch Processing**: Process multiple CSV files with embeddings
- **Metadata Extraction**: Comprehensive message metadata including timestamps, authors, and content
- **Data Filtering**: Advanced filtering by authors, channels, and timeframes

## 📁 Repository Structure

```
cult-scraper-1/
├── scripts/                    # Core data collection scripts
│   ├── bot.py                  # Discord bot for message scraping
│   ├── image_downloader.py     # Download and convert Discord images
│   ├── embedder.py             # Batch text embedding processor
│   └── embed_class.py          # Text embedding utilities
├── apps/                       # Interactive applications
│   ├── cluster_map/            # Chat message clustering & visualization
│   │   ├── main.py             # Main Streamlit application
│   │   ├── data_loader.py      # Data loading utilities
│   │   ├── clustering.py       # Clustering algorithms
│   │   ├── visualization.py    # Plotting and visualization
│   │   └── requirements.txt    # Dependencies
│   └── image_viewer/           # Image dataset browser
│       ├── image_viewer.py     # Streamlit image viewer
│       └── requirements.txt    # Dependencies
├── discord_chat_logs/          # Exported CSV files from Discord
└── images_dataset/             # Downloaded images and metadata
    └── images_dataset.json     # Image dataset with base64 data
```

## 🚀 Quick Start

### 1. Discord Data Scraping

First, set up and run the Discord bot to collect message data:

```bash
cd scripts
# Configure your bot token in bot.py
python bot.py
```

**Requirements:**
- Discord bot token with message content intent enabled
- Bot must have read permissions in target channels

### 2. Generate Text Embeddings

Process the collected chat data to add semantic embeddings:

```bash
cd scripts
python embedder.py
```

This will:
- Process all CSV files in `discord_chat_logs/`
- Add embeddings to message content using sentence transformers
- Save updated files with embedding vectors
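The `embedder.py` source isn't shown here, so the following is only a sketch of what this step does, assuming the `content` and `content_embedding` column names from the Data Formats section and the default model from the Configuration section. `embed_chat_log` is a hypothetical helper for illustration, not the real `embedder.py` API:

```python
from pathlib import Path

import pandas as pd


def embed_chat_log(csv_path, model_name="all-MiniLM-L6-v2"):
    """Sketch of the embedding step: add a content_embedding column to one CSV."""
    # Imported here so the helper fails clearly if the package is missing.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    df = pd.read_csv(csv_path)
    texts = df["content"].fillna("").astype(str).tolist()
    vectors = model.encode(texts)
    # Store each vector as a list so it round-trips through CSV as "[0.1, -0.2, ...]"
    df["content_embedding"] = [list(map(float, v)) for v in vectors]
    df.to_csv(csv_path, index=False)
    return df


if __name__ == "__main__":
    for csv_file in sorted(Path("../discord_chat_logs").glob("*.csv")):
        embed_chat_log(csv_file)
```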
### 3. Download Images

Extract and download images from Discord attachments:

```bash
cd scripts
python image_downloader.py
```

Features:
- Downloads images from attachment URLs
- Converts to base64 for storage
- Handles multiple image formats (PNG, JPG, GIF, WebP, etc.)
- Implements retry logic and rate limiting

### 4. Visualize Chat Data

Launch the interactive chat visualization tool:

```bash
cd apps/cluster_map
pip install -r requirements.txt
streamlit run main.py
```

**Capabilities:**
- 2D visualization using PCA or t-SNE
- Interactive clustering with DBSCAN/HDBSCAN
- Filter by channels, authors, and time periods
- Hover to see message content and metadata

### 5. Browse Image Dataset

View downloaded images in an organized interface:

```bash
cd apps/image_viewer
pip install -r requirements.txt
streamlit run image_viewer.py
```

**Features:**
- Channel-based organization
- Navigation controls (previous/next)
- Image metadata display
- Responsive layout

## 📋 Data Formats

### Discord Chat Logs (CSV)
```csv
message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds,content_embedding
1234567890,2025-08-11 12:34:56,9876543210,username,nickname,"Hello world!","https://cdn.discord.com/...",{},"[0.123, -0.456, ...]"
```
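Because CSV stores each vector as a quoted string like `"[0.123, -0.456, ...]"`, downstream code has to parse it back into floats before use. A minimal stdlib sketch of that step (the `load_embedded_messages` helper is illustrative, not part of the repo):

```python
import ast
import csv


def load_embedded_messages(csv_path):
    """Return (content, embedding) pairs from a chat-log CSV in the format above."""
    pairs = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # The embedding column holds a Python-literal list as a string.
            vector = ast.literal_eval(row["content_embedding"])
            pairs.append((row["content"], [float(x) for x in vector]))
    return pairs
```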
### Image Dataset (JSON)
```json
{
  "metadata": {
    "created_at": "2025-08-11 12:34:56 UTC",
    "summary": {
      "total_images": 42,
      "channels": ["memes", "general"],
      "total_size_bytes": 1234567,
      "file_extensions": [".png", ".jpg"],
      "authors": ["user1", "user2"]
    }
  },
  "images": [
    {
      "url": "https://cdn.discordapp.com/attachments/...",
      "channel": "memes",
      "author_name": "username",
      "timestamp_utc": "2025-08-11 12:34:56+00:00",
      "content": "Message text",
      "file_extension": ".png",
      "file_size": 54321,
      "base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
    }
  ]
}
```
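Since every image is stored as base64 text, recovering the original files is just a matter of decoding each payload. A sketch using only the stdlib (the `export_images` helper, its file-naming scheme, and the output directory name are made up for illustration):

```python
import base64
import json
from pathlib import Path


def export_images(dataset_path, out_dir="exported_images"):
    """Decode the base64 payloads in images_dataset.json back into image files."""
    data = json.loads(Path(dataset_path).read_text(encoding="utf-8"))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i, img in enumerate(data["images"]):
        name = f"{img['channel']}_{i:04d}{img['file_extension']}"
        (out / name).write_bytes(base64.b64decode(img["base64_data"]))
        written.append(name)
    return written
```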
## 🔧 Configuration

### Discord Bot Setup
1. Create a Discord application at https://discord.com/developers/applications
2. Create a bot and copy the token
3. Enable the following intents:
   - Message Content Intent
   - Server Members Intent (optional)
4. Invite the bot to your server with appropriate permissions

### Environment Variables
```bash
# Set in scripts/bot.py
BOT_TOKEN = "your_discord_bot_token_here"
```
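The snippet above hardcodes the token in source, which is easy to leak via commits. If you adapt `bot.py`, one common alternative is reading the token from an actual environment variable; `get_bot_token` below is a hypothetical helper sketching that pattern, not existing repo code:

```python
import os


def get_bot_token():
    """Read the Discord bot token from the environment instead of hardcoding it."""
    token = os.environ.get("BOT_TOKEN", "").strip()
    if not token:
        raise RuntimeError("Set the BOT_TOKEN environment variable before running bot.py")
    return token
```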
### Embedding Models
The system uses sentence-transformers models. Default: `all-MiniLM-L6-v2`

Supported models:
- `all-MiniLM-L6-v2` (lightweight, fast)
- `all-mpnet-base-v2` (higher quality)
- `sentence-transformers/all-roberta-large-v1` (best quality, slower)
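Whichever model you pick, the resulting vectors are typically compared with cosine similarity, where 1.0 means two messages point in the same semantic direction. A small illustrative helper (not part of the repo):

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```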
## 📊 Visualization Features

### Chat Message Clustering
- **Dimensionality Reduction**: PCA, t-SNE, UMAP
- **Clustering Algorithms**: DBSCAN, HDBSCAN with automatic parameter tuning
- **Interactive Controls**: Filter by source files, authors, and clusters
- **Hover Information**: View message content, author, timestamp on hover
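As a rough sketch of the density-based clustering step, using scikit-learn's `DBSCAN` (the app's actual code and parameters may differ, and `cluster_embeddings` is an illustrative helper):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler


def cluster_embeddings(embeddings, eps=0.5, min_samples=5):
    """Group embedding vectors with DBSCAN; the label -1 marks noise/outliers."""
    scaled = StandardScaler().fit_transform(np.asarray(embeddings, dtype=float))
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(scaled)
```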
### Image Analysis
- **Channel Organization**: Browse images by Discord channel
- **Metadata Display**: Author, timestamp, message context
- **Navigation**: Previous/next controls with slider
- **Format Support**: PNG, JPG, GIF, WebP, BMP, TIFF

## 🛠️ Dependencies

### Core Scripts
- `discord.py` - Discord bot framework
- `pandas` - Data manipulation
- `sentence-transformers` - Text embeddings
- `requests` - HTTP requests for image downloads

### Visualization Apps
- `streamlit` - Web interface framework
- `plotly` - Interactive plotting
- `scikit-learn` - Machine learning algorithms
- `numpy` - Numerical computations
- `umap-learn` - Dimensionality reduction
- `hdbscan` - Density-based clustering

## 📈 Use Cases

### Research & Analytics
- **Community Analysis**: Understand conversation patterns and topics
- **Sentiment Analysis**: Track mood and sentiment over time
- **User Behavior**: Analyze posting patterns and engagement
- **Content Moderation**: Identify problematic content clusters

### Data Science Projects
- **NLP Research**: Experiment with text embeddings and clustering
- **Social Network Analysis**: Study communication patterns
- **Visualization Techniques**: Explore dimensionality reduction methods
- **Image Processing**: Analyze visual content sharing patterns

### Content Management
- **Archive Creation**: Preserve Discord community history
- **Content Discovery**: Find similar messages and discussions
- **Moderation Tools**: Identify spam or inappropriate content
- **Backup Solutions**: Create comprehensive data backups

## 🔒 Privacy & Ethics

- **Data Protection**: All processing happens locally
- **User Consent**: Ensure proper permissions before scraping
- **Compliance**: Follow Discord's Terms of Service
- **Anonymization**: Consider removing or hashing user IDs for research

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## 📄 License

This project is intended for educational and research purposes. Please ensure compliance with Discord's Terms of Service and applicable privacy laws when using this toolkit.

## 🆘 Troubleshooting

### Common Issues

**Bot can't read messages:**
- Ensure Message Content Intent is enabled
- Check bot permissions in the Discord server
- Verify the bot token is correct

**Embeddings not generating:**
- Install sentence-transformers: `pip install sentence-transformers`
- Check available GPU memory for large models
- Try a smaller model like `all-MiniLM-L6-v2`

**Images not downloading:**
- Check internet connectivity
- Verify Discord CDN URLs are accessible
- Increase retry limits for unreliable connections

**Visualization not loading:**
- Ensure all requirements are installed
- Check that CSV files have embeddings
- Try reducing dataset size for better performance
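For the embeddings check, here is a quick stdlib test you could run against a log file (assumes the `content_embedding` column name from the Data Formats section; `has_embeddings` is an illustrative helper, not repo code):

```python
import csv


def has_embeddings(csv_path):
    """Return True if a chat-log CSV has a populated content_embedding column."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if "content_embedding" not in (reader.fieldnames or []):
            return False
        first = next(reader, None)
        return bool(first and first["content_embedding"].strip())
```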
## 📚 Additional Resources

- [Discord.py Documentation](https://discordpy.readthedocs.io/)
- [Sentence Transformers Models](https://www.sbert.net/docs/pretrained_models.html)
- [Streamlit Documentation](https://docs.streamlit.io/)
- [scikit-learn Clustering](https://scikit-learn.org/stable/modules/clustering.html)
```diff
@@ -9,9 +9,136 @@ from sklearn.mixture import GaussianMixture
 from sklearn.preprocessing import StandardScaler
 from sklearn.metrics import silhouette_score, calinski_harabasz_score
 import hdbscan
+import pandas as pd
+from collections import Counter
+import re
 from config import DEFAULT_RANDOM_STATE
+
+
+def summarize_cluster_content(cluster_messages, max_words=3):
+    """
+    Generate a meaningful name for a cluster based on its message content.
+
+    Args:
+        cluster_messages: List of message contents in the cluster
+        max_words: Maximum number of words in the cluster name
+
+    Returns:
+        str: Generated cluster name
+    """
+    if not cluster_messages:
+        return "Empty Cluster"
+
+    # Combine all messages and clean text
+    all_text = " ".join([str(msg) for msg in cluster_messages if pd.notna(msg)])
+    if not all_text.strip():
+        return "Empty Content"
+
+    # Basic text cleaning
+    text = all_text.lower()
+
+    # Remove URLs, mentions, and special characters
+    text = re.sub(r'http[s]?://\S+', '', text)  # Remove URLs
+    text = re.sub(r'<@\d+>', '', text)          # Remove Discord mentions
+    text = re.sub(r'<:\w+:\d+>', '', text)      # Remove custom emojis
+    text = re.sub(r'[^\w\s]', ' ', text)        # Remove punctuation
+    text = re.sub(r'\s+', ' ', text).strip()    # Normalize whitespace
+
+    if not text:
+        return "Special Characters"
+
+    # Split into words and filter out common words
+    words = text.split()
+
+    # Common stop words to filter out
+    stop_words = {
+        'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with',
+        'by', 'from', 'up', 'about', 'into', 'through', 'during', 'before', 'after',
+        'above', 'below', 'between', 'among', 'until', 'without', 'under', 'over',
+        'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
+        'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might',
+        'i', 'you', 'he', 'she', 'it', 'we', 'they', 'me', 'him', 'her', 'us', 'them',
+        'my', 'your', 'his', 'her', 'its', 'our', 'their', 'this', 'that', 'these', 'those',
+        'just', 'like', 'get', 'know', 'think', 'see', 'go', 'come', 'say', 'said',
+        'yeah', 'yes', 'no', 'oh', 'ok', 'okay', 'well', 'so', 'but', 'if', 'when',
+        'what', 'where', 'why', 'how', 'who', 'which', 'than', 'then', 'now', 'here',
+        'there', 'also', 'too', 'very', 'really', 'pretty', 'much', 'more', 'most',
+        'some', 'any', 'all', 'many', 'few', 'little', 'big', 'small', 'good', 'bad'
+    }
+
+    # Filter out stop words and very short/long words
+    filtered_words = [
+        word for word in words
+        if word not in stop_words
+        and len(word) >= 3
+        and len(word) <= 15
+        and word.isalpha()  # Only alphabetic words
+    ]
+
+    if not filtered_words:
+        return f"Chat ({len(cluster_messages)} msgs)"
+
+    # Count word frequencies
+    word_counts = Counter(filtered_words)
+
+    # Get most common words
+    most_common = word_counts.most_common(max_words * 2)  # Get more than needed for filtering
+
+    # Select diverse words (avoid very similar words)
+    selected_words = []
+    for word, count in most_common:
+        # Avoid adding very similar words
+        if not any(word.startswith(existing[:4]) or existing.startswith(word[:4])
+                   for existing in selected_words):
+            selected_words.append(word)
+        if len(selected_words) >= max_words:
+            break
+
+    if not selected_words:
+        return f"Discussion ({len(cluster_messages)} msgs)"
+
+    # Create cluster name
+    cluster_name = " + ".join(selected_words[:max_words]).title()
+
+    # Add message count for context
+    cluster_name += f" ({len(cluster_messages)})"
+
+    return cluster_name
+
+
+def generate_cluster_names(filtered_df, cluster_labels):
+    """
+    Generate names for all clusters based on their content.
+
+    Args:
+        filtered_df: DataFrame with message data
+        cluster_labels: Array of cluster labels for each message
+
+    Returns:
+        dict: Mapping from cluster_id to cluster_name
+    """
+    if cluster_labels is None:
+        return {}
+
+    cluster_names = {}
+    unique_clusters = np.unique(cluster_labels)
+
+    for cluster_id in unique_clusters:
+        if cluster_id == -1:
+            cluster_names[cluster_id] = "Noise/Outliers"
+            continue
+
+        # Get messages in this cluster
+        cluster_mask = cluster_labels == cluster_id
+        cluster_messages = filtered_df[cluster_mask]['content'].tolist()
+
+        # Generate name
+        cluster_name = summarize_cluster_content(cluster_messages)
+        cluster_names[cluster_id] = cluster_name
+
+    return cluster_names
+
+
 def apply_clustering(embeddings, clustering_method="None", n_clusters=5):
     """
     Apply clustering algorithm to embeddings and return labels and metrics.
```
```diff
@@ -3,7 +3,7 @@ Configuration settings and constants for the Discord Chat Embeddings Visualizer.
 """
 
 # Application settings
-APP_TITLE = "Discord Chat Embeddings Visualizer"
+APP_TITLE = "The Cult - Visualised"
 APP_ICON = "🗨️"
 APP_LAYOUT = "wide"
 
@@ -14,6 +14,8 @@ CHAT_LOGS_PATH = "../../discord_chat_logs"
 DEFAULT_RANDOM_STATE = 42
 DEFAULT_N_COMPONENTS = 2
 DEFAULT_N_CLUSTERS = 5
+DEFAULT_DIMENSION_REDUCTION_METHOD = "t-SNE"
+DEFAULT_CLUSTERING_METHOD = "None"
 
 # Visualization settings
 DEFAULT_POINT_SIZE = 8
```
```diff
@@ -17,10 +17,10 @@ from data_loader import (
 from dimensionality_reduction import (
     reduce_dimensions, apply_density_based_jittering
 )
-from clustering import apply_clustering
+from clustering import apply_clustering, generate_cluster_names
 from visualization import (
     create_visualization_plot, display_clustering_metrics, display_summary_stats,
-    display_clustering_results, display_data_table
+    display_clustering_results, display_data_table, display_cluster_summary
 )
@@ -51,11 +51,34 @@ def main():
     # Get UI parameters
     params = get_all_ui_parameters(valid_df)
 
+    # Check if any sources are selected before proceeding
+    if not params['selected_sources']:
+        st.info("📂 **Select source files from the sidebar to begin visualization**")
+        st.markdown("### Available Data Sources:")
+
+        # Show available sources as an informational table
+        source_info = []
+        for source in valid_df['source_file'].unique():
+            source_data = valid_df[valid_df['source_file'] == source]
+            source_info.append({
+                'Source File': source,
+                'Messages': len(source_data),
+                'Unique Authors': source_data['author_name'].nunique(),
+                'Date Range': f"{source_data['timestamp_utc'].min()} to {source_data['timestamp_utc'].max()}"
+            })
+
+        import pandas as pd
+        source_df = pd.DataFrame(source_info)
+        st.dataframe(source_df, use_container_width=True, hide_index=True)
+
+        st.markdown("👈 **Use the sidebar to select which sources to visualize**")
+        st.stop()
+
     # Filter data
     filtered_df = filter_data(valid_df, params['selected_sources'], params['selected_authors'])
 
     if filtered_df.empty:
-        st.warning("No data matches the current filters!")
+        st.warning("No data matches the current filters! Try selecting different sources or authors.")
         st.stop()
 
     # Display performance warnings
@@ -67,10 +90,12 @@ def main():
     st.info(f"📈 Visualizing {len(filtered_df)} messages")
 
     # Reduce dimensions
+    n_components = 3 if params['enable_3d'] else 2
     with st.spinner(f"Reducing dimensions using {params['method']}..."):
         reduced_embeddings = reduce_dimensions(
             filtered_embeddings,
             method=params['method'],
+            n_components=n_components,
             spread_factor=params['spread_factor'],
             perplexity_factor=params['perplexity_factor'],
             min_dist_factor=params['min_dist_factor']
@@ -93,12 +118,22 @@ def main():
         jitter_strength=params['jitter_strength']
     )
 
+    # Generate cluster names if clustering was applied
+    cluster_names = None
+    if cluster_labels is not None:
+        with st.spinner("Generating cluster names..."):
+            cluster_names = generate_cluster_names(filtered_df, cluster_labels)
+
     # Display clustering metrics
     display_clustering_metrics(
         cluster_labels, silhouette_avg, calinski_harabasz,
         params['show_cluster_metrics']
     )
 
+    # Display cluster summary with names
+    if cluster_names:
+        display_cluster_summary(cluster_names, cluster_labels)
+
     # Create and display the main plot
     fig = create_visualization_plot(
         reduced_embeddings=reduced_embeddings,
@@ -110,7 +145,9 @@ def main():
         point_size=params['point_size'],
         point_opacity=params['point_opacity'],
         density_based_sizing=params['density_based_sizing'],
-        size_variation=params['size_variation']
+        size_variation=params['size_variation'],
+        enable_3d=params['enable_3d'],
+        cluster_names=cluster_names
     )
 
     st.plotly_chart(fig, use_container_width=True)
@@ -121,7 +158,7 @@ def main():
     # Display clustering results and export options
     display_clustering_results(
         filtered_df, cluster_labels, reduced_embeddings,
-        params['method'], params['clustering_method']
+        params['method'], params['clustering_method'], params['enable_3d']
     )
 
     # Display data table
```
```diff
@@ -7,7 +7,8 @@ import numpy as np
 from config import (
     APP_TITLE, APP_ICON, APP_LAYOUT, METHOD_EXPLANATIONS,
     CLUSTERING_METHODS_REQUIRING_N_CLUSTERS, COMPUTATIONALLY_INTENSIVE_METHODS,
-    LARGE_DATASET_WARNING_THRESHOLD, MAX_DISPLAYED_AUTHORS
+    LARGE_DATASET_WARNING_THRESHOLD, MAX_DISPLAYED_AUTHORS,
+    DEFAULT_DIMENSION_REDUCTION_METHOD, DEFAULT_CLUSTERING_METHOD
 )
@@ -30,29 +31,58 @@ def create_method_controls():
     """Create controls for dimension reduction and clustering methods"""
     st.sidebar.header("🎛️ Visualization Controls")
 
+    # 3D visualization toggle
+    enable_3d = st.sidebar.checkbox(
+        "Enable 3D Visualization",
+        value=False,
+        help="Switch between 2D and 3D visualization. 3D uses 3 components instead of 2."
+    )
+
     # Dimension reduction method
+    method_options = ["PCA", "t-SNE", "UMAP", "Spectral Embedding", "Force-Directed"]
+    default_index = method_options.index(DEFAULT_DIMENSION_REDUCTION_METHOD) if DEFAULT_DIMENSION_REDUCTION_METHOD in method_options else 0
     method = st.sidebar.selectbox(
         "Dimension Reduction Method",
-        ["PCA", "t-SNE", "UMAP", "Spectral Embedding", "Force-Directed"],
+        method_options,
+        index=default_index,
         help="PCA is fastest, UMAP balances speed and quality, t-SNE and Spectral are slower but may reveal better structures. Force-Directed creates natural spacing."
     )
 
     # Clustering method
+    clustering_options = ["None", "HDBSCAN", "Spectral Clustering", "Gaussian Mixture",
+                          "Agglomerative (Ward)", "Agglomerative (Complete)", "OPTICS"]
+    clustering_default_index = clustering_options.index(DEFAULT_CLUSTERING_METHOD) if DEFAULT_CLUSTERING_METHOD in clustering_options else 0
     clustering_method = st.sidebar.selectbox(
         "Clustering Method",
-        ["None", "HDBSCAN", "Spectral Clustering", "Gaussian Mixture",
-         "Agglomerative (Ward)", "Agglomerative (Complete)", "OPTICS"],
+        clustering_options,
+        index=clustering_default_index,
         help="Apply clustering to identify groups. HDBSCAN and OPTICS can find variable density clusters."
     )
 
-    return method, clustering_method
+    return method, clustering_method, enable_3d
 
 
 def create_clustering_controls(clustering_method):
     """Create controls for clustering parameters"""
-    n_clusters = 5
+    # Always show the clusters slider, but indicate when it's used
     if clustering_method in CLUSTERING_METHODS_REQUIRING_N_CLUSTERS:
-        n_clusters = st.sidebar.slider("Number of Clusters", 2, 15, 5)
+        help_text = "Number of clusters to create. This setting affects the clustering algorithm."
+        disabled = False
+    elif clustering_method == "None":
+        help_text = "Clustering is disabled. This setting has no effect."
+        disabled = True
+    else:
+        help_text = f"{clustering_method} automatically determines the number of clusters. This setting has no effect."
+        disabled = True
+
+    n_clusters = st.sidebar.slider(
+        "Number of Clusters",
+        min_value=2,
+        max_value=20,
+        value=5,
+        disabled=disabled,
+        help=help_text
+    )
+
     return n_clusters
@@ -74,7 +104,7 @@ def create_separation_controls(method):
     if method == "t-SNE":
         perplexity_factor = st.sidebar.slider(
             "Perplexity Factor",
-            0.5, 2.0, 1.0, 0.1,
+            0.1, 2.0, 1.0, 0.1,
             help="Affects local vs global structure balance. Lower values focus on local details."
         )
@@ -196,7 +226,7 @@ def display_performance_warnings(filtered_df, method, clustering_method):
 def get_all_ui_parameters(valid_df):
     """Get all UI parameters in a single function call"""
     # Method selection
-    method, clustering_method = create_method_controls()
+    method, clustering_method, enable_3d = create_method_controls()
 
     # Clustering parameters
     n_clusters = create_clustering_controls(clustering_method)
@@ -219,6 +249,7 @@ def get_all_ui_parameters(valid_df):
     return {
         'method': method,
         'clustering_method': clustering_method,
+        'enable_3d': enable_3d,
         'n_clusters': n_clusters,
         'spread_factor': spread_factor,
         'perplexity_factor': perplexity_factor,
```
@@ -47,7 +47,8 @@ def calculate_point_sizes(reduced_embeddings, density_based_sizing=False,
|
|||||||
|
|
||||||
|
|
||||||
def create_clustered_plot(reduced_embeddings, filtered_df, cluster_labels, hover_text,
|
def create_clustered_plot(reduced_embeddings, filtered_df, cluster_labels, hover_text,
|
||||||
point_sizes, point_opacity=DEFAULT_POINT_OPACITY, method="PCA"):
|
point_sizes, point_opacity=DEFAULT_POINT_OPACITY, method="PCA", enable_3d=False,
|
||||||
|
cluster_names=None):
|
||||||
"""Create a plot colored by clusters"""
|
"""Create a plot colored by clusters"""
|
||||||
fig = go.Figure()
|
fig = go.Figure()
|
||||||
|
|
||||||
```diff
@@ -61,28 +62,49 @@ def create_clustered_plot(reduced_embeddings, filtered_df, cluster_labels, hover
         cluster_hover = [hover_text[j] for j, mask in enumerate(cluster_mask) if mask]
         cluster_sizes = [point_sizes[j] for j, mask in enumerate(cluster_mask) if mask]

-        cluster_name = f"Cluster {cluster_id}" if cluster_id != -1 else "Noise"
+        # Use generated name if available, otherwise fall back to default
+        if cluster_names and cluster_id in cluster_names:
+            cluster_name = cluster_names[cluster_id]
+        else:
+            cluster_name = f"Cluster {cluster_id}" if cluster_id != -1 else "Noise"

-        fig.add_trace(go.Scatter(
-            x=cluster_embeddings[:, 0],
-            y=cluster_embeddings[:, 1],
-            mode='markers',
-            name=cluster_name,
-            marker=dict(
-                size=cluster_sizes,
-                color=colors[i % len(colors)],
-                opacity=point_opacity,
-                line=dict(width=1, color='white')
-            ),
-            hovertemplate='%{hovertext}<extra></extra>',
-            hovertext=cluster_hover
-        ))
+        if enable_3d:
+            fig.add_trace(go.Scatter3d(
+                x=cluster_embeddings[:, 0],
+                y=cluster_embeddings[:, 1],
+                z=cluster_embeddings[:, 2],
+                mode='markers',
+                name=cluster_name,
+                marker=dict(
+                    size=cluster_sizes,
+                    color=colors[i % len(colors)],
+                    opacity=point_opacity,
+                    line=dict(width=1, color='white')
+                ),
+                hovertemplate='%{hovertext}<extra></extra>',
+                hovertext=cluster_hover
+            ))
+        else:
+            fig.add_trace(go.Scatter(
+                x=cluster_embeddings[:, 0],
+                y=cluster_embeddings[:, 1],
+                mode='markers',
+                name=cluster_name,
+                marker=dict(
+                    size=cluster_sizes,
+                    color=colors[i % len(colors)],
+                    opacity=point_opacity,
+                    line=dict(width=1, color='white')
+                ),
+                hovertemplate='%{hovertext}<extra></extra>',
+                hovertext=cluster_hover
+            ))

     return fig


 def create_source_colored_plot(reduced_embeddings, filtered_df, selected_sources, hover_text,
-                               point_sizes, point_opacity=DEFAULT_POINT_OPACITY):
+                               point_sizes, point_opacity=DEFAULT_POINT_OPACITY, enable_3d=False):
     """Create a plot colored by source files"""
     fig = go.Figure()
     colors = px.colors.qualitative.Set1
```
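The naming fallback in the hunk above is small enough to exercise on its own. A minimal sketch of the same logic (the helper name `resolve_cluster_name` is ours for illustration; the real code inlines this inside the plotting loop):

```python
def resolve_cluster_name(cluster_id, cluster_names=None):
    """Prefer a generated label; otherwise 'Cluster N', with -1 shown as noise."""
    # Mirrors the diff: generated names win, DBSCAN/HDBSCAN's -1 label is "Noise"
    if cluster_names and cluster_id in cluster_names:
        return cluster_names[cluster_id]
    return f"Cluster {cluster_id}" if cluster_id != -1 else "Noise"

print(resolve_cluster_name(0))    # prints "Cluster 0"
print(resolve_cluster_name(-1))   # prints "Noise"
```

Note that an empty `cluster_names` dict falls through to the default, so callers can always pass the dict unconditionally.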
```diff
@@ -94,20 +116,37 @@ def create_source_colored_plot(reduced_embeddings, filtered_df, selected_sources
         source_hover = [hover_text[j] for j, mask in enumerate(source_mask) if mask]
         source_sizes = [point_sizes[j] for j, mask in enumerate(source_mask) if mask]

-        fig.add_trace(go.Scatter(
-            x=source_embeddings[:, 0],
-            y=source_embeddings[:, 1],
-            mode='markers',
-            name=source,
-            marker=dict(
-                size=source_sizes,
-                color=colors[i % len(colors)],
-                opacity=point_opacity,
-                line=dict(width=1, color='white')
-            ),
-            hovertemplate='%{hovertext}<extra></extra>',
-            hovertext=source_hover
-        ))
+        if enable_3d:
+            fig.add_trace(go.Scatter3d(
+                x=source_embeddings[:, 0],
+                y=source_embeddings[:, 1],
+                z=source_embeddings[:, 2],
+                mode='markers',
+                name=source,
+                marker=dict(
+                    size=source_sizes,
+                    color=colors[i % len(colors)],
+                    opacity=point_opacity,
+                    line=dict(width=1, color='white')
+                ),
+                hovertemplate='%{hovertext}<extra></extra>',
+                hovertext=source_hover
+            ))
+        else:
+            fig.add_trace(go.Scatter(
+                x=source_embeddings[:, 0],
+                y=source_embeddings[:, 1],
+                mode='markers',
+                name=source,
+                marker=dict(
+                    size=source_sizes,
+                    color=colors[i % len(colors)],
+                    opacity=point_opacity,
+                    line=dict(width=1, color='white')
+                ),
+                hovertemplate='%{hovertext}<extra></extra>',
+                hovertext=source_hover
+            ))

     return fig
```
```diff
@@ -115,7 +154,8 @@ def create_source_colored_plot(reduced_embeddings, filtered_df, selected_sources
 def create_visualization_plot(reduced_embeddings, filtered_df, cluster_labels=None,
                               selected_sources=None, method="PCA", clustering_method="None",
                               point_size=DEFAULT_POINT_SIZE, point_opacity=DEFAULT_POINT_OPACITY,
-                              density_based_sizing=False, size_variation=2.0):
+                              density_based_sizing=False, size_variation=2.0, enable_3d=False,
+                              cluster_names=None):
     """Create the main visualization plot"""

     # Create hover text
```
```diff
@@ -128,23 +168,38 @@ def create_visualization_plot(reduced_embeddings, filtered_df, cluster_labels=No
     # Create plot based on coloring strategy
     if cluster_labels is not None:
        fig = create_clustered_plot(reduced_embeddings, filtered_df, cluster_labels,
-                                   hover_text, point_sizes, point_opacity, method)
+                                   hover_text, point_sizes, point_opacity, method, enable_3d,
+                                   cluster_names)
     else:
         if selected_sources is None:
             selected_sources = filtered_df['source_file'].unique()
         fig = create_source_colored_plot(reduced_embeddings, filtered_df, selected_sources,
-                                         hover_text, point_sizes, point_opacity)
+                                         hover_text, point_sizes, point_opacity, enable_3d)

     # Update layout
     title_suffix = f" with {clustering_method}" if clustering_method != "None" else ""
-    fig.update_layout(
-        title=f"Discord Chat Messages - {method} Visualization{title_suffix}",
-        xaxis_title=f"{method} Component 1",
-        yaxis_title=f"{method} Component 2",
-        hovermode='closest',
-        width=1000,
-        height=700
-    )
+    dimension_text = "3D" if enable_3d else "2D"
+
+    if enable_3d:
+        fig.update_layout(
+            title=f"Discord Chat Messages - {method} {dimension_text} Visualization{title_suffix}",
+            scene=dict(
+                xaxis_title=f"{method} Component 1",
+                yaxis_title=f"{method} Component 2",
+                zaxis_title=f"{method} Component 3"
+            ),
+            width=1000,
+            height=700
+        )
+    else:
+        fig.update_layout(
+            title=f"Discord Chat Messages - {method} {dimension_text} Visualization{title_suffix}",
+            xaxis_title=f"{method} Component 1",
+            yaxis_title=f"{method} Component 2",
+            hovermode='closest',
+            width=1000,
+            height=700
+        )

     return fig
```
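The `enable_3d` flag only works if the upstream reduction actually produced a third component; that step is outside this diff. A minimal numpy-only sketch of PCA switching between 2 and 3 components (the function name `reduce_embeddings` is an assumption, not taken from the repo):

```python
import numpy as np

def reduce_embeddings(embeddings, enable_3d=False):
    """PCA via SVD down to 2 or 3 components, matching the enable_3d toggle."""
    n_components = 3 if enable_3d else 2
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal axes; project onto the leading ones
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

emb = np.random.rand(50, 384)  # e.g. sentence-transformer vectors
print(reduce_embeddings(emb).shape)                  # prints (50, 2)
print(reduce_embeddings(emb, enable_3d=True).shape)  # prints (50, 3)
```

The repo appears to use scikit-learn's reducers (PCA, t-SNE); the SVD form above is just the dependency-free equivalent for PCA.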
```diff
@@ -182,7 +237,7 @@ def display_summary_stats(filtered_df, selected_sources):
     st.metric("Source Files", len(selected_sources))


-def display_clustering_results(filtered_df, cluster_labels, reduced_embeddings, method, clustering_method):
+def display_clustering_results(filtered_df, cluster_labels, reduced_embeddings, method, clustering_method, enable_3d=False):
     """Display clustering results and export options"""
     if cluster_labels is None:
         return
```
```diff
@@ -195,16 +250,21 @@ def display_clustering_results(filtered_df, cluster_labels, reduced_embeddings,
     export_df['x_coordinate'] = reduced_embeddings[:, 0]
     export_df['y_coordinate'] = reduced_embeddings[:, 1]

+    # Add z coordinate if 3D
+    if enable_3d and reduced_embeddings.shape[1] >= 3:
+        export_df['z_coordinate'] = reduced_embeddings[:, 2]
+
     # Show cluster distribution
     cluster_dist = pd.Series(cluster_labels).value_counts().sort_index()
     st.bar_chart(cluster_dist)

     # Download option
     csv_data = export_df.to_csv(index=False)
+    dimension_text = "3D" if enable_3d else "2D"
     st.download_button(
         label="📥 Download Clustering Results (CSV)",
         data=csv_data,
-        file_name=f"chat_clusters_{method}_{clustering_method}.csv",
+        file_name=f"chat_clusters_{method}_{clustering_method}_{dimension_text}.csv",
         mime="text/csv"
     )

```
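The export hunk above guards the z column on both the flag and the array shape, so a stale `enable_3d` flag over a 2D reduction can never raise an IndexError. That pattern in isolation (the helper name `build_export_frame` is hypothetical; the real code builds `export_df` inline):

```python
import numpy as np
import pandas as pd

def build_export_frame(filtered_df, cluster_labels, reduced_embeddings, enable_3d=False):
    """Attach cluster labels and plot coordinates for CSV download."""
    export_df = filtered_df.copy()
    export_df['cluster'] = cluster_labels
    export_df['x_coordinate'] = reduced_embeddings[:, 0]
    export_df['y_coordinate'] = reduced_embeddings[:, 1]
    # Double guard: flag requested 3D AND the reduction really has a 3rd column
    if enable_3d and reduced_embeddings.shape[1] >= 3:
        export_df['z_coordinate'] = reduced_embeddings[:, 2]
    return export_df

df = pd.DataFrame({'content': ['hi', 'yo']})
print(build_export_frame(df, [0, 1], np.zeros((2, 3)), enable_3d=True).columns.tolist())
```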
```diff
@@ -223,3 +283,29 @@ def display_data_table(filtered_df, cluster_labels=None):

     display_df['content'] = display_df['content'].str[:100] + '...'  # Truncate for display
     st.dataframe(display_df, use_container_width=True)
+
+
+def display_cluster_summary(cluster_names, cluster_labels):
+    """Display a summary of cluster names and their sizes"""
+    if not cluster_names or cluster_labels is None:
+        return
+
+    st.subheader("🏷️ Cluster Summary")
+
+    # Create summary data
+    cluster_summary = []
+    for cluster_id, name in cluster_names.items():
+        count = np.sum(cluster_labels == cluster_id)
+        cluster_summary.append({
+            'Cluster ID': cluster_id,
+            'Cluster Name': name,
+            'Message Count': count,
+            'Percentage': f"{100 * count / len(cluster_labels):.1f}%"
+        })
+
+    # Sort by message count
+    cluster_summary.sort(key=lambda x: x['Message Count'], reverse=True)
+
+    # Display as table
+    summary_df = pd.DataFrame(cluster_summary)
+    st.dataframe(summary_df, use_container_width=True, hide_index=True)
```
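The new `display_cluster_summary` mixes the count/percentage computation with Streamlit rendering. The computation alone can be sketched without `st` (the helper name `summarize_clusters` is ours; it returns the rows the real function feeds to `pd.DataFrame`):

```python
import numpy as np

def summarize_clusters(cluster_names, cluster_labels):
    """Tabulate per-cluster counts and percentages, largest cluster first."""
    labels = np.asarray(cluster_labels)
    rows = []
    for cluster_id, name in cluster_names.items():
        count = int(np.sum(labels == cluster_id))
        rows.append({
            'Cluster ID': cluster_id,
            'Cluster Name': name,
            'Message Count': count,
            'Percentage': f"{100 * count / len(labels):.1f}%",
        })
    rows.sort(key=lambda r: r['Message Count'], reverse=True)
    return rows

print(summarize_clusters({0: 'greetings', 1: 'links'}, [0, 0, 0, 1]))
```

Separating the computation this way also makes the sorting behavior easy to verify without spinning up a Streamlit session.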