Compare commits

4 Commits: 4ca7e8ab61...main

| Author | SHA1 | Date |
|---|---|---|
| | ce906e4f9a | |
| | fd9b25f256 | |
| | 2b8659fc95 | |
| | 647111e9d3 | |

**README.md** (281 lines changed)
@@ -1,2 +1,281 @@
-# cult-scraper
+# Discord Data Analysis & Visualization Suite

A comprehensive toolkit for scraping, processing, and analyzing Discord chat data with advanced visualization capabilities.

## 🌟 Features

### 📥 Data Collection
- **Discord Bot Scraper**: Automated extraction of complete message history from Discord servers
- **Image Downloader**: Downloads and processes images from Discord attachments with base64 conversion
- **Text Embeddings**: Generate semantic embeddings for chat messages using sentence transformers

### 📊 Visualization & Analysis
- **Interactive Chat Visualizer**: 2D visualization of chat messages using dimensionality reduction (PCA, t-SNE)
- **Clustering Analysis**: Automated grouping of similar messages with DBSCAN and HDBSCAN
- **Image Dataset Viewer**: Browse and explore downloaded images by channel

### 🔧 Data Processing
- **Batch Processing**: Process multiple CSV files with embeddings
- **Metadata Extraction**: Comprehensive message metadata including timestamps, authors, and content
- **Data Filtering**: Advanced filtering by authors, channels, and timeframes

## 📁 Repository Structure

```
cult-scraper-1/
├── scripts/                    # Core data collection scripts
│   ├── bot.py                  # Discord bot for message scraping
│   ├── image_downloader.py     # Download and convert Discord images
│   ├── embedder.py             # Batch text embedding processor
│   └── embed_class.py          # Text embedding utilities
├── apps/                       # Interactive applications
│   ├── cluster_map/            # Chat message clustering & visualization
│   │   ├── main.py             # Main Streamlit application
│   │   ├── data_loader.py      # Data loading utilities
│   │   ├── clustering.py       # Clustering algorithms
│   │   ├── visualization.py    # Plotting and visualization
│   │   └── requirements.txt    # Dependencies
│   └── image_viewer/           # Image dataset browser
│       ├── image_viewer.py     # Streamlit image viewer
│       └── requirements.txt    # Dependencies
├── discord_chat_logs/          # Exported CSV files from Discord
└── images_dataset/             # Downloaded images and metadata
    └── images_dataset.json     # Image dataset with base64 data
```

## 🚀 Quick Start

### 1. Discord Data Scraping

First, set up and run the Discord bot to collect message data:

```bash
cd scripts
# Configure your bot token in bot.py
python bot.py
```

**Requirements:**
- Discord bot token with message content intent enabled
- Bot must have read permissions in target channels

### 2. Generate Text Embeddings

Process the collected chat data to add semantic embeddings:

```bash
cd scripts
python embedder.py
```

This will:
- Process all CSV files in `discord_chat_logs/`
- Add embeddings to message content using sentence transformers
- Save updated files with embedding vectors
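The `embedder.py` source isn't shown here, so the following is only a sketch of what this step does, assuming the `content` and `content_embedding` column names from the Data Formats section and the default model from the Configuration section. `embed_chat_log` is a hypothetical helper for illustration, not the real `embedder.py` API:

```python
from pathlib import Path

import pandas as pd


def embed_chat_log(csv_path, model_name="all-MiniLM-L6-v2"):
    """Sketch of the embedding step: add a content_embedding column to one CSV."""
    # Imported here so the helper fails clearly if the package is missing.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    df = pd.read_csv(csv_path)
    texts = df["content"].fillna("").astype(str).tolist()
    vectors = model.encode(texts)
    # Store each vector as a list so it round-trips through CSV as "[0.1, -0.2, ...]"
    df["content_embedding"] = [list(map(float, v)) for v in vectors]
    df.to_csv(csv_path, index=False)
    return df


if __name__ == "__main__":
    for csv_file in sorted(Path("../discord_chat_logs").glob("*.csv")):
        embed_chat_log(csv_file)
```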
### 3. Download Images

Extract and download images from Discord attachments:

```bash
cd scripts
python image_downloader.py
```

Features:
- Downloads images from attachment URLs
- Converts to base64 for storage
- Handles multiple image formats (PNG, JPG, GIF, WebP, etc.)
- Implements retry logic and rate limiting

### 4. Visualize Chat Data

Launch the interactive chat visualization tool:

```bash
cd apps/cluster_map
pip install -r requirements.txt
streamlit run main.py
```

**Capabilities:**
- 2D visualization using PCA or t-SNE
- Interactive clustering with DBSCAN/HDBSCAN
- Filter by channels, authors, and time periods
- Hover to see message content and metadata

### 5. Browse Image Dataset

View downloaded images in an organized interface:

```bash
cd apps/image_viewer
pip install -r requirements.txt
streamlit run image_viewer.py
```

**Features:**
- Channel-based organization
- Navigation controls (previous/next)
- Image metadata display
- Responsive layout

## 📋 Data Formats

### Discord Chat Logs (CSV)
```csv
message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds,content_embedding
1234567890,2025-08-11 12:34:56,9876543210,username,nickname,"Hello world!","https://cdn.discord.com/...",{},"[0.123, -0.456, ...]"
```
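Because CSV stores each vector as a quoted string like `"[0.123, -0.456, ...]"`, downstream code has to parse it back into floats before use. A minimal stdlib sketch of that step (the `load_embedded_messages` helper is illustrative, not part of the repo):

```python
import ast
import csv


def load_embedded_messages(csv_path):
    """Return (content, embedding) pairs from a chat-log CSV in the format above."""
    pairs = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # The embedding column holds a Python-literal list as a string.
            vector = ast.literal_eval(row["content_embedding"])
            pairs.append((row["content"], [float(x) for x in vector]))
    return pairs
```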
### Image Dataset (JSON)
```json
{
  "metadata": {
    "created_at": "2025-08-11 12:34:56 UTC",
    "summary": {
      "total_images": 42,
      "channels": ["memes", "general"],
      "total_size_bytes": 1234567,
      "file_extensions": [".png", ".jpg"],
      "authors": ["user1", "user2"]
    }
  },
  "images": [
    {
      "url": "https://cdn.discordapp.com/attachments/...",
      "channel": "memes",
      "author_name": "username",
      "timestamp_utc": "2025-08-11 12:34:56+00:00",
      "content": "Message text",
      "file_extension": ".png",
      "file_size": 54321,
      "base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
    }
  ]
}
```
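Since every image is stored as base64 text, recovering the original files is just a matter of decoding each payload. A sketch using only the stdlib (the `export_images` helper, its file-naming scheme, and the output directory name are made up for illustration):

```python
import base64
import json
from pathlib import Path


def export_images(dataset_path, out_dir="exported_images"):
    """Decode the base64 payloads in images_dataset.json back into image files."""
    data = json.loads(Path(dataset_path).read_text(encoding="utf-8"))
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i, img in enumerate(data["images"]):
        name = f"{img['channel']}_{i:04d}{img['file_extension']}"
        (out / name).write_bytes(base64.b64decode(img["base64_data"]))
        written.append(name)
    return written
```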
## 🔧 Configuration

### Discord Bot Setup
1. Create a Discord application at https://discord.com/developers/applications
2. Create a bot and copy the token
3. Enable the following intents:
   - Message Content Intent
   - Server Members Intent (optional)
4. Invite the bot to your server with appropriate permissions

### Environment Variables
```bash
# Set in scripts/bot.py
BOT_TOKEN = "your_discord_bot_token_here"
```
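The snippet above hardcodes the token in source, which is easy to leak via commits. If you adapt `bot.py`, one common alternative is reading the token from an actual environment variable; `get_bot_token` below is a hypothetical helper sketching that pattern, not existing repo code:

```python
import os


def get_bot_token():
    """Read the Discord bot token from the environment instead of hardcoding it."""
    token = os.environ.get("BOT_TOKEN", "").strip()
    if not token:
        raise RuntimeError("Set the BOT_TOKEN environment variable before running bot.py")
    return token
```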
### Embedding Models
The system uses sentence-transformers models. Default: `all-MiniLM-L6-v2`

Supported models:
- `all-MiniLM-L6-v2` (lightweight, fast)
- `all-mpnet-base-v2` (higher quality)
- `sentence-transformers/all-roberta-large-v1` (best quality, slower)
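Whichever model you pick, the resulting vectors are typically compared with cosine similarity, where 1.0 means two messages point in the same semantic direction. A small illustrative helper (not part of the repo):

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```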
## 📊 Visualization Features

### Chat Message Clustering
- **Dimensionality Reduction**: PCA, t-SNE, UMAP
- **Clustering Algorithms**: DBSCAN, HDBSCAN with automatic parameter tuning
- **Interactive Controls**: Filter by source files, authors, and clusters
- **Hover Information**: View message content, author, timestamp on hover
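As a rough sketch of the density-based clustering step, using scikit-learn's `DBSCAN` (the app's actual code and parameters may differ, and `cluster_embeddings` is an illustrative helper):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler


def cluster_embeddings(embeddings, eps=0.5, min_samples=5):
    """Group embedding vectors with DBSCAN; the label -1 marks noise/outliers."""
    scaled = StandardScaler().fit_transform(np.asarray(embeddings, dtype=float))
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(scaled)
```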
### Image Analysis
- **Channel Organization**: Browse images by Discord channel
- **Metadata Display**: Author, timestamp, message context
- **Navigation**: Previous/next controls with slider
- **Format Support**: PNG, JPG, GIF, WebP, BMP, TIFF

## 🛠️ Dependencies

### Core Scripts
- `discord.py` - Discord bot framework
- `pandas` - Data manipulation
- `sentence-transformers` - Text embeddings
- `requests` - HTTP requests for image downloads

### Visualization Apps
- `streamlit` - Web interface framework
- `plotly` - Interactive plotting
- `scikit-learn` - Machine learning algorithms
- `numpy` - Numerical computations
- `umap-learn` - Dimensionality reduction
- `hdbscan` - Density-based clustering

## 📈 Use Cases

### Research & Analytics
- **Community Analysis**: Understand conversation patterns and topics
- **Sentiment Analysis**: Track mood and sentiment over time
- **User Behavior**: Analyze posting patterns and engagement
- **Content Moderation**: Identify problematic content clusters

### Data Science Projects
- **NLP Research**: Experiment with text embeddings and clustering
- **Social Network Analysis**: Study communication patterns
- **Visualization Techniques**: Explore dimensionality reduction methods
- **Image Processing**: Analyze visual content sharing patterns

### Content Management
- **Archive Creation**: Preserve Discord community history
- **Content Discovery**: Find similar messages and discussions
- **Moderation Tools**: Identify spam or inappropriate content
- **Backup Solutions**: Create comprehensive data backups

## 🔒 Privacy & Ethics

- **Data Protection**: All processing happens locally
- **User Consent**: Ensure proper permissions before scraping
- **Compliance**: Follow Discord's Terms of Service
- **Anonymization**: Consider removing or hashing user IDs for research

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## 📄 License

This project is intended for educational and research purposes. Please ensure compliance with Discord's Terms of Service and applicable privacy laws when using this toolkit.

## 🆘 Troubleshooting

### Common Issues

**Bot can't read messages:**
- Ensure Message Content Intent is enabled
- Check bot permissions in the Discord server
- Verify the bot token is correct

**Embeddings not generating:**
- Install sentence-transformers: `pip install sentence-transformers`
- Check available GPU memory for large models
- Try a smaller model like `all-MiniLM-L6-v2`

**Images not downloading:**
- Check internet connectivity
- Verify Discord CDN URLs are accessible
- Increase retry limits for unreliable connections

**Visualization not loading:**
- Ensure all requirements are installed
- Check that CSV files have embeddings
- Try reducing dataset size for better performance
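For the embeddings check, here is a quick stdlib test you could run against a log file (assumes the `content_embedding` column name from the Data Formats section; `has_embeddings` is an illustrative helper, not repo code):

```python
import csv


def has_embeddings(csv_path):
    """Return True if a chat-log CSV has a populated content_embedding column."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if "content_embedding" not in (reader.fieldnames or []):
            return False
        first = next(reader, None)
        return bool(first and first["content_embedding"].strip())
```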
## 📚 Additional Resources

- [Discord.py Documentation](https://discordpy.readthedocs.io/)
- [Sentence Transformers Models](https://www.sbert.net/docs/pretrained_models.html)
- [Streamlit Documentation](https://docs.streamlit.io/)
- [scikit-learn Clustering](https://scikit-learn.org/stable/modules/clustering.html)
```diff
@@ -9,9 +9,136 @@ from sklearn.mixture import GaussianMixture
 from sklearn.preprocessing import StandardScaler
 from sklearn.metrics import silhouette_score, calinski_harabasz_score
 import hdbscan
+import pandas as pd
+from collections import Counter
+import re
 from config import DEFAULT_RANDOM_STATE
+
+
+def summarize_cluster_content(cluster_messages, max_words=3):
+    """
+    Generate a meaningful name for a cluster based on its message content.
+
+    Args:
+        cluster_messages: List of message contents in the cluster
+        max_words: Maximum number of words in the cluster name
+
+    Returns:
+        str: Generated cluster name
+    """
+    if not cluster_messages:
+        return "Empty Cluster"
+
+    # Combine all messages and clean text
+    all_text = " ".join([str(msg) for msg in cluster_messages if pd.notna(msg)])
+    if not all_text.strip():
+        return "Empty Content"
+
+    # Basic text cleaning
+    text = all_text.lower()
+
+    # Remove URLs, mentions, and special characters
+    text = re.sub(r'http[s]?://\S+', '', text)  # Remove URLs
+    text = re.sub(r'<@\d+>', '', text)          # Remove Discord mentions
+    text = re.sub(r'<:\w+:\d+>', '', text)      # Remove custom emojis
+    text = re.sub(r'[^\w\s]', ' ', text)        # Remove punctuation
+    text = re.sub(r'\s+', ' ', text).strip()    # Normalize whitespace
+
+    if not text:
+        return "Special Characters"
+
+    # Split into words and filter out common words
+    words = text.split()
+
+    # Common stop words to filter out
+    stop_words = {
+        'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with',
+        'by', 'from', 'up', 'about', 'into', 'through', 'during', 'before', 'after',
+        'above', 'below', 'between', 'among', 'until', 'without', 'under', 'over',
+        'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
+        'do', 'does', 'did', 'will', 'would', 'could', 'should', 'may', 'might',
+        'i', 'you', 'he', 'she', 'it', 'we', 'they', 'me', 'him', 'her', 'us', 'them',
+        'my', 'your', 'his', 'her', 'its', 'our', 'their', 'this', 'that', 'these', 'those',
+        'just', 'like', 'get', 'know', 'think', 'see', 'go', 'come', 'say', 'said',
+        'yeah', 'yes', 'no', 'oh', 'ok', 'okay', 'well', 'so', 'but', 'if', 'when',
+        'what', 'where', 'why', 'how', 'who', 'which', 'than', 'then', 'now', 'here',
+        'there', 'also', 'too', 'very', 'really', 'pretty', 'much', 'more', 'most',
+        'some', 'any', 'all', 'many', 'few', 'little', 'big', 'small', 'good', 'bad'
+    }
+
+    # Filter out stop words and very short/long words
+    filtered_words = [
+        word for word in words
+        if word not in stop_words
+        and len(word) >= 3
+        and len(word) <= 15
+        and word.isalpha()  # Only alphabetic words
+    ]
+
+    if not filtered_words:
+        return f"Chat ({len(cluster_messages)} msgs)"
+
+    # Count word frequencies
+    word_counts = Counter(filtered_words)
+
+    # Get most common words
+    most_common = word_counts.most_common(max_words * 2)  # Get more than needed for filtering
+
+    # Select diverse words (avoid very similar words)
+    selected_words = []
+    for word, count in most_common:
+        # Avoid adding very similar words
+        if not any(word.startswith(existing[:4]) or existing.startswith(word[:4])
+                   for existing in selected_words):
+            selected_words.append(word)
+        if len(selected_words) >= max_words:
+            break
+
+    if not selected_words:
+        return f"Discussion ({len(cluster_messages)} msgs)"
+
+    # Create cluster name
+    cluster_name = " + ".join(selected_words[:max_words]).title()
+
+    # Add message count for context
+    cluster_name += f" ({len(cluster_messages)})"
+
+    return cluster_name
+
+
+def generate_cluster_names(filtered_df, cluster_labels):
+    """
+    Generate names for all clusters based on their content.
+
+    Args:
+        filtered_df: DataFrame with message data
+        cluster_labels: Array of cluster labels for each message
+
+    Returns:
+        dict: Mapping from cluster_id to cluster_name
+    """
+    if cluster_labels is None:
+        return {}
+
+    cluster_names = {}
+    unique_clusters = np.unique(cluster_labels)
+
+    for cluster_id in unique_clusters:
+        if cluster_id == -1:
+            cluster_names[cluster_id] = "Noise/Outliers"
+            continue
+
+        # Get messages in this cluster
+        cluster_mask = cluster_labels == cluster_id
+        cluster_messages = filtered_df[cluster_mask]['content'].tolist()
+
+        # Generate name
+        cluster_name = summarize_cluster_content(cluster_messages)
+        cluster_names[cluster_id] = cluster_name
+
+    return cluster_names
+
+
 def apply_clustering(embeddings, clustering_method="None", n_clusters=5):
     """
     Apply clustering algorithm to embeddings and return labels and metrics.
```
```diff
@@ -3,7 +3,7 @@ Configuration settings and constants for the Discord Chat Embeddings Visualizer.
 """
 
 # Application settings
-APP_TITLE = "Discord Chat Embeddings Visualizer"
+APP_TITLE = "The Cult - Visualised"
 APP_ICON = "🗨️"
 APP_LAYOUT = "wide"
 
@@ -14,6 +14,8 @@ CHAT_LOGS_PATH = "../../discord_chat_logs"
 DEFAULT_RANDOM_STATE = 42
 DEFAULT_N_COMPONENTS = 2
 DEFAULT_N_CLUSTERS = 5
+DEFAULT_DIMENSION_REDUCTION_METHOD = "t-SNE"
+DEFAULT_CLUSTERING_METHOD = "None"
 
 # Visualization settings
 DEFAULT_POINT_SIZE = 8
```
```diff
@@ -17,10 +17,10 @@ from data_loader import (
 from dimensionality_reduction import (
     reduce_dimensions, apply_density_based_jittering
 )
-from clustering import apply_clustering
+from clustering import apply_clustering, generate_cluster_names
 from visualization import (
     create_visualization_plot, display_clustering_metrics, display_summary_stats,
-    display_clustering_results, display_data_table
+    display_clustering_results, display_data_table, display_cluster_summary
 )
@@ -51,11 +51,34 @@ def main():
     # Get UI parameters
     params = get_all_ui_parameters(valid_df)
 
+    # Check if any sources are selected before proceeding
+    if not params['selected_sources']:
+        st.info("📂 **Select source files from the sidebar to begin visualization**")
+        st.markdown("### Available Data Sources:")
+
+        # Show available sources as an informational table
+        source_info = []
+        for source in valid_df['source_file'].unique():
+            source_data = valid_df[valid_df['source_file'] == source]
+            source_info.append({
+                'Source File': source,
+                'Messages': len(source_data),
+                'Unique Authors': source_data['author_name'].nunique(),
+                'Date Range': f"{source_data['timestamp_utc'].min()} to {source_data['timestamp_utc'].max()}"
+            })
+
+        import pandas as pd
+        source_df = pd.DataFrame(source_info)
+        st.dataframe(source_df, use_container_width=True, hide_index=True)
+
+        st.markdown("👈 **Use the sidebar to select which sources to visualize**")
+        st.stop()
+
     # Filter data
     filtered_df = filter_data(valid_df, params['selected_sources'], params['selected_authors'])
 
     if filtered_df.empty:
-        st.warning("No data matches the current filters!")
+        st.warning("No data matches the current filters! Try selecting different sources or authors.")
         st.stop()
 
     # Display performance warnings
@@ -67,10 +90,12 @@ def main():
     st.info(f"📈 Visualizing {len(filtered_df)} messages")
 
     # Reduce dimensions
+    n_components = 3 if params['enable_3d'] else 2
     with st.spinner(f"Reducing dimensions using {params['method']}..."):
         reduced_embeddings = reduce_dimensions(
             filtered_embeddings,
             method=params['method'],
+            n_components=n_components,
             spread_factor=params['spread_factor'],
             perplexity_factor=params['perplexity_factor'],
             min_dist_factor=params['min_dist_factor']
@@ -93,12 +118,22 @@ def main():
         jitter_strength=params['jitter_strength']
     )
 
+    # Generate cluster names if clustering was applied
+    cluster_names = None
+    if cluster_labels is not None:
+        with st.spinner("Generating cluster names..."):
+            cluster_names = generate_cluster_names(filtered_df, cluster_labels)
+
     # Display clustering metrics
     display_clustering_metrics(
         cluster_labels, silhouette_avg, calinski_harabasz,
         params['show_cluster_metrics']
     )
 
+    # Display cluster summary with names
+    if cluster_names:
+        display_cluster_summary(cluster_names, cluster_labels)
+
     # Create and display the main plot
     fig = create_visualization_plot(
         reduced_embeddings=reduced_embeddings,
@@ -110,7 +145,9 @@ def main():
         point_size=params['point_size'],
         point_opacity=params['point_opacity'],
         density_based_sizing=params['density_based_sizing'],
-        size_variation=params['size_variation']
+        size_variation=params['size_variation'],
+        enable_3d=params['enable_3d'],
+        cluster_names=cluster_names
     )
 
     st.plotly_chart(fig, use_container_width=True)
@@ -121,7 +158,7 @@ def main():
     # Display clustering results and export options
     display_clustering_results(
         filtered_df, cluster_labels, reduced_embeddings,
-        params['method'], params['clustering_method']
+        params['method'], params['clustering_method'], params['enable_3d']
     )
 
     # Display data table
```
```diff
@@ -7,7 +7,8 @@ import numpy as np
 from config import (
     APP_TITLE, APP_ICON, APP_LAYOUT, METHOD_EXPLANATIONS,
     CLUSTERING_METHODS_REQUIRING_N_CLUSTERS, COMPUTATIONALLY_INTENSIVE_METHODS,
-    LARGE_DATASET_WARNING_THRESHOLD, MAX_DISPLAYED_AUTHORS
+    LARGE_DATASET_WARNING_THRESHOLD, MAX_DISPLAYED_AUTHORS,
+    DEFAULT_DIMENSION_REDUCTION_METHOD, DEFAULT_CLUSTERING_METHOD
 )
@@ -30,29 +31,58 @@ def create_method_controls():
     """Create controls for dimension reduction and clustering methods"""
     st.sidebar.header("🎛️ Visualization Controls")
 
+    # 3D visualization toggle
+    enable_3d = st.sidebar.checkbox(
+        "Enable 3D Visualization",
+        value=False,
+        help="Switch between 2D and 3D visualization. 3D uses 3 components instead of 2."
+    )
+
     # Dimension reduction method
+    method_options = ["PCA", "t-SNE", "UMAP", "Spectral Embedding", "Force-Directed"]
+    default_index = method_options.index(DEFAULT_DIMENSION_REDUCTION_METHOD) if DEFAULT_DIMENSION_REDUCTION_METHOD in method_options else 0
     method = st.sidebar.selectbox(
         "Dimension Reduction Method",
-        ["PCA", "t-SNE", "UMAP", "Spectral Embedding", "Force-Directed"],
+        method_options,
+        index=default_index,
         help="PCA is fastest, UMAP balances speed and quality, t-SNE and Spectral are slower but may reveal better structures. Force-Directed creates natural spacing."
     )
 
     # Clustering method
+    clustering_options = ["None", "HDBSCAN", "Spectral Clustering", "Gaussian Mixture",
+                          "Agglomerative (Ward)", "Agglomerative (Complete)", "OPTICS"]
+    clustering_default_index = clustering_options.index(DEFAULT_CLUSTERING_METHOD) if DEFAULT_CLUSTERING_METHOD in clustering_options else 0
     clustering_method = st.sidebar.selectbox(
         "Clustering Method",
-        ["None", "HDBSCAN", "Spectral Clustering", "Gaussian Mixture",
-         "Agglomerative (Ward)", "Agglomerative (Complete)", "OPTICS"],
+        clustering_options,
+        index=clustering_default_index,
         help="Apply clustering to identify groups. HDBSCAN and OPTICS can find variable density clusters."
     )
 
-    return method, clustering_method
+    return method, clustering_method, enable_3d
 
 
 def create_clustering_controls(clustering_method):
     """Create controls for clustering parameters"""
-    n_clusters = 5
+    # Always show the clusters slider, but indicate when it's used
     if clustering_method in CLUSTERING_METHODS_REQUIRING_N_CLUSTERS:
-        n_clusters = st.sidebar.slider("Number of Clusters", 2, 15, 5)
+        help_text = "Number of clusters to create. This setting affects the clustering algorithm."
+        disabled = False
+    elif clustering_method == "None":
+        help_text = "Clustering is disabled. This setting has no effect."
+        disabled = True
+    else:
+        help_text = f"{clustering_method} automatically determines the number of clusters. This setting has no effect."
+        disabled = True
+
+    n_clusters = st.sidebar.slider(
+        "Number of Clusters",
+        min_value=2,
+        max_value=20,
+        value=5,
+        disabled=disabled,
+        help=help_text
+    )
+
     return n_clusters
@@ -74,7 +104,7 @@ def create_separation_controls(method):
     if method == "t-SNE":
         perplexity_factor = st.sidebar.slider(
             "Perplexity Factor",
-            0.5, 2.0, 1.0, 0.1,
+            0.1, 2.0, 1.0, 0.1,
             help="Affects local vs global structure balance. Lower values focus on local details."
         )
@@ -196,7 +226,7 @@ def display_performance_warnings(filtered_df, method, clustering_method):
 def get_all_ui_parameters(valid_df):
     """Get all UI parameters in a single function call"""
     # Method selection
-    method, clustering_method = create_method_controls()
+    method, clustering_method, enable_3d = create_method_controls()
 
     # Clustering parameters
     n_clusters = create_clustering_controls(clustering_method)
@@ -219,6 +249,7 @@ def get_all_ui_parameters(valid_df):
     return {
         'method': method,
         'clustering_method': clustering_method,
+        'enable_3d': enable_3d,
         'n_clusters': n_clusters,
         'spread_factor': spread_factor,
         'perplexity_factor': perplexity_factor,
```
@@ -47,7 +47,8 @@ def calculate_point_sizes(reduced_embeddings, density_based_sizing=False,
|
|||||||
|
|
||||||
|
|
||||||
def create_clustered_plot(reduced_embeddings, filtered_df, cluster_labels, hover_text,
|
def create_clustered_plot(reduced_embeddings, filtered_df, cluster_labels, hover_text,
|
||||||
point_sizes, point_opacity=DEFAULT_POINT_OPACITY, method="PCA"):
|
point_sizes, point_opacity=DEFAULT_POINT_OPACITY, method="PCA", enable_3d=False,
|
||||||
|
cluster_names=None):
|
||||||
"""Create a plot colored by clusters"""
|
"""Create a plot colored by clusters"""
|
||||||
fig = go.Figure()
|
fig = go.Figure()
|
||||||
|
|
||||||
```diff
@@ -61,28 +62,49 @@ def create_clustered_plot(reduced_embeddings, filtered_df, cluster_labels, hover
         cluster_hover = [hover_text[j] for j, mask in enumerate(cluster_mask) if mask]
         cluster_sizes = [point_sizes[j] for j, mask in enumerate(cluster_mask) if mask]

-        cluster_name = f"Cluster {cluster_id}" if cluster_id != -1 else "Noise"
+        # Use generated name if available, otherwise fall back to default
+        if cluster_names and cluster_id in cluster_names:
+            cluster_name = cluster_names[cluster_id]
+        else:
+            cluster_name = f"Cluster {cluster_id}" if cluster_id != -1 else "Noise"

-        fig.add_trace(go.Scatter(
-            x=cluster_embeddings[:, 0],
-            y=cluster_embeddings[:, 1],
-            mode='markers',
-            name=cluster_name,
-            marker=dict(
-                size=cluster_sizes,
-                color=colors[i % len(colors)],
-                opacity=point_opacity,
-                line=dict(width=1, color='white')
-            ),
-            hovertemplate='%{hovertext}<extra></extra>',
-            hovertext=cluster_hover
-        ))
+        if enable_3d:
+            fig.add_trace(go.Scatter3d(
+                x=cluster_embeddings[:, 0],
+                y=cluster_embeddings[:, 1],
+                z=cluster_embeddings[:, 2],
+                mode='markers',
+                name=cluster_name,
+                marker=dict(
+                    size=cluster_sizes,
+                    color=colors[i % len(colors)],
+                    opacity=point_opacity,
+                    line=dict(width=1, color='white')
+                ),
+                hovertemplate='%{hovertext}<extra></extra>',
+                hovertext=cluster_hover
+            ))
+        else:
+            fig.add_trace(go.Scatter(
+                x=cluster_embeddings[:, 0],
+                y=cluster_embeddings[:, 1],
+                mode='markers',
+                name=cluster_name,
+                marker=dict(
+                    size=cluster_sizes,
+                    color=colors[i % len(colors)],
+                    opacity=point_opacity,
+                    line=dict(width=1, color='white')
+                ),
+                hovertemplate='%{hovertext}<extra></extra>',
+                hovertext=cluster_hover
+            ))

     return fig


 def create_source_colored_plot(reduced_embeddings, filtered_df, selected_sources, hover_text,
-                               point_sizes, point_opacity=DEFAULT_POINT_OPACITY):
+                               point_sizes, point_opacity=DEFAULT_POINT_OPACITY, enable_3d=False):
     """Create a plot colored by source files"""
     fig = go.Figure()
     colors = px.colors.qualitative.Set1
```
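The naming fallback in the hunk above is small enough to exercise on its own. A minimal sketch of the same logic (the helper name `resolve_cluster_name` is ours for illustration; the real code inlines this inside the plotting loop):

```python
def resolve_cluster_name(cluster_id, cluster_names=None):
    """Prefer a generated label; otherwise 'Cluster N', with -1 shown as noise."""
    # Mirrors the diff: generated names win, DBSCAN/HDBSCAN's -1 label is "Noise"
    if cluster_names and cluster_id in cluster_names:
        return cluster_names[cluster_id]
    return f"Cluster {cluster_id}" if cluster_id != -1 else "Noise"

print(resolve_cluster_name(0))    # prints "Cluster 0"
print(resolve_cluster_name(-1))   # prints "Noise"
```

Note that an empty `cluster_names` dict falls through to the default, so callers can always pass the dict unconditionally.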
```diff
@@ -94,20 +116,37 @@ def create_source_colored_plot(reduced_embeddings, filtered_df, selected_sources
         source_hover = [hover_text[j] for j, mask in enumerate(source_mask) if mask]
         source_sizes = [point_sizes[j] for j, mask in enumerate(source_mask) if mask]

-        fig.add_trace(go.Scatter(
-            x=source_embeddings[:, 0],
-            y=source_embeddings[:, 1],
-            mode='markers',
-            name=source,
-            marker=dict(
-                size=source_sizes,
-                color=colors[i % len(colors)],
-                opacity=point_opacity,
-                line=dict(width=1, color='white')
-            ),
-            hovertemplate='%{hovertext}<extra></extra>',
-            hovertext=source_hover
-        ))
+        if enable_3d:
+            fig.add_trace(go.Scatter3d(
+                x=source_embeddings[:, 0],
+                y=source_embeddings[:, 1],
+                z=source_embeddings[:, 2],
+                mode='markers',
+                name=source,
+                marker=dict(
+                    size=source_sizes,
+                    color=colors[i % len(colors)],
+                    opacity=point_opacity,
+                    line=dict(width=1, color='white')
+                ),
+                hovertemplate='%{hovertext}<extra></extra>',
+                hovertext=source_hover
+            ))
+        else:
+            fig.add_trace(go.Scatter(
+                x=source_embeddings[:, 0],
+                y=source_embeddings[:, 1],
+                mode='markers',
+                name=source,
+                marker=dict(
+                    size=source_sizes,
+                    color=colors[i % len(colors)],
+                    opacity=point_opacity,
+                    line=dict(width=1, color='white')
+                ),
+                hovertemplate='%{hovertext}<extra></extra>',
+                hovertext=source_hover
+            ))

     return fig
```
```diff
@@ -115,7 +154,8 @@ def create_source_colored_plot(reduced_embeddings, filtered_df, selected_sources
 def create_visualization_plot(reduced_embeddings, filtered_df, cluster_labels=None,
                               selected_sources=None, method="PCA", clustering_method="None",
                               point_size=DEFAULT_POINT_SIZE, point_opacity=DEFAULT_POINT_OPACITY,
-                              density_based_sizing=False, size_variation=2.0):
+                              density_based_sizing=False, size_variation=2.0, enable_3d=False,
+                              cluster_names=None):
     """Create the main visualization plot"""

     # Create hover text
```
```diff
@@ -128,23 +168,38 @@ def create_visualization_plot(reduced_embeddings, filtered_df, cluster_labels=No
     # Create plot based on coloring strategy
     if cluster_labels is not None:
        fig = create_clustered_plot(reduced_embeddings, filtered_df, cluster_labels,
-                                   hover_text, point_sizes, point_opacity, method)
+                                   hover_text, point_sizes, point_opacity, method, enable_3d,
+                                   cluster_names)
     else:
         if selected_sources is None:
             selected_sources = filtered_df['source_file'].unique()
         fig = create_source_colored_plot(reduced_embeddings, filtered_df, selected_sources,
-                                         hover_text, point_sizes, point_opacity)
+                                         hover_text, point_sizes, point_opacity, enable_3d)

     # Update layout
     title_suffix = f" with {clustering_method}" if clustering_method != "None" else ""
-    fig.update_layout(
-        title=f"Discord Chat Messages - {method} Visualization{title_suffix}",
-        xaxis_title=f"{method} Component 1",
-        yaxis_title=f"{method} Component 2",
-        hovermode='closest',
-        width=1000,
-        height=700
-    )
+    dimension_text = "3D" if enable_3d else "2D"
+
+    if enable_3d:
+        fig.update_layout(
+            title=f"Discord Chat Messages - {method} {dimension_text} Visualization{title_suffix}",
+            scene=dict(
+                xaxis_title=f"{method} Component 1",
+                yaxis_title=f"{method} Component 2",
+                zaxis_title=f"{method} Component 3"
+            ),
+            width=1000,
+            height=700
+        )
+    else:
+        fig.update_layout(
+            title=f"Discord Chat Messages - {method} {dimension_text} Visualization{title_suffix}",
+            xaxis_title=f"{method} Component 1",
+            yaxis_title=f"{method} Component 2",
+            hovermode='closest',
+            width=1000,
+            height=700
+        )

     return fig
```
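The `enable_3d` flag only works if the upstream reduction actually produced a third component; that step is outside this diff. A minimal numpy-only sketch of PCA switching between 2 and 3 components (the function name `reduce_embeddings` is an assumption, not taken from the repo):

```python
import numpy as np

def reduce_embeddings(embeddings, enable_3d=False):
    """PCA via SVD down to 2 or 3 components, matching the enable_3d toggle."""
    n_components = 3 if enable_3d else 2
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal axes; project onto the leading ones
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

emb = np.random.rand(50, 384)  # e.g. sentence-transformer vectors
print(reduce_embeddings(emb).shape)                  # prints (50, 2)
print(reduce_embeddings(emb, enable_3d=True).shape)  # prints (50, 3)
```

The repo appears to use scikit-learn's reducers (PCA, t-SNE); the SVD form above is just the dependency-free equivalent for PCA.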
```diff
@@ -182,7 +237,7 @@ def display_summary_stats(filtered_df, selected_sources):
     st.metric("Source Files", len(selected_sources))


-def display_clustering_results(filtered_df, cluster_labels, reduced_embeddings, method, clustering_method):
+def display_clustering_results(filtered_df, cluster_labels, reduced_embeddings, method, clustering_method, enable_3d=False):
     """Display clustering results and export options"""
     if cluster_labels is None:
         return
```
```diff
@@ -195,16 +250,21 @@ def display_clustering_results(filtered_df, cluster_labels, reduced_embeddings,
     export_df['x_coordinate'] = reduced_embeddings[:, 0]
     export_df['y_coordinate'] = reduced_embeddings[:, 1]

+    # Add z coordinate if 3D
+    if enable_3d and reduced_embeddings.shape[1] >= 3:
+        export_df['z_coordinate'] = reduced_embeddings[:, 2]
+
     # Show cluster distribution
     cluster_dist = pd.Series(cluster_labels).value_counts().sort_index()
     st.bar_chart(cluster_dist)

     # Download option
     csv_data = export_df.to_csv(index=False)
+    dimension_text = "3D" if enable_3d else "2D"
     st.download_button(
         label="📥 Download Clustering Results (CSV)",
         data=csv_data,
-        file_name=f"chat_clusters_{method}_{clustering_method}.csv",
+        file_name=f"chat_clusters_{method}_{clustering_method}_{dimension_text}.csv",
         mime="text/csv"
     )

```
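The export hunk above guards the z column on both the flag and the array shape, so a stale `enable_3d` flag over a 2D reduction can never raise an IndexError. That pattern in isolation (the helper name `build_export_frame` is hypothetical; the real code builds `export_df` inline):

```python
import numpy as np
import pandas as pd

def build_export_frame(filtered_df, cluster_labels, reduced_embeddings, enable_3d=False):
    """Attach cluster labels and plot coordinates for CSV download."""
    export_df = filtered_df.copy()
    export_df['cluster'] = cluster_labels
    export_df['x_coordinate'] = reduced_embeddings[:, 0]
    export_df['y_coordinate'] = reduced_embeddings[:, 1]
    # Double guard: flag requested 3D AND the reduction really has a 3rd column
    if enable_3d and reduced_embeddings.shape[1] >= 3:
        export_df['z_coordinate'] = reduced_embeddings[:, 2]
    return export_df

df = pd.DataFrame({'content': ['hi', 'yo']})
print(build_export_frame(df, [0, 1], np.zeros((2, 3)), enable_3d=True).columns.tolist())
```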
```diff
@@ -223,3 +283,29 @@ def display_data_table(filtered_df, cluster_labels=None):

     display_df['content'] = display_df['content'].str[:100] + '...'  # Truncate for display
     st.dataframe(display_df, use_container_width=True)
+
+
+def display_cluster_summary(cluster_names, cluster_labels):
+    """Display a summary of cluster names and their sizes"""
+    if not cluster_names or cluster_labels is None:
+        return
+
+    st.subheader("🏷️ Cluster Summary")
+
+    # Create summary data
+    cluster_summary = []
+    for cluster_id, name in cluster_names.items():
+        count = np.sum(cluster_labels == cluster_id)
+        cluster_summary.append({
+            'Cluster ID': cluster_id,
+            'Cluster Name': name,
+            'Message Count': count,
+            'Percentage': f"{100 * count / len(cluster_labels):.1f}%"
+        })
+
+    # Sort by message count
+    cluster_summary.sort(key=lambda x: x['Message Count'], reverse=True)
+
+    # Display as table
+    summary_df = pd.DataFrame(cluster_summary)
+    st.dataframe(summary_df, use_container_width=True, hide_index=True)
```
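The new `display_cluster_summary` mixes the count/percentage computation with Streamlit rendering. The computation alone can be sketched without `st` (the helper name `summarize_clusters` is ours; it returns the rows the real function feeds to `pd.DataFrame`):

```python
import numpy as np

def summarize_clusters(cluster_names, cluster_labels):
    """Tabulate per-cluster counts and percentages, largest cluster first."""
    labels = np.asarray(cluster_labels)
    rows = []
    for cluster_id, name in cluster_names.items():
        count = int(np.sum(labels == cluster_id))
        rows.append({
            'Cluster ID': cluster_id,
            'Cluster Name': name,
            'Message Count': count,
            'Percentage': f"{100 * count / len(labels):.1f}%",
        })
    rows.sort(key=lambda r: r['Message Count'], reverse=True)
    return rows

print(summarize_clusters({0: 'greetings', 1: 'links'}, [0, 0, 0, 1]))
```

Separating the computation this way also makes the sorting behavior easy to verify without spinning up a Streamlit session.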