Discord Data Analysis & Visualization Suite
A comprehensive toolkit for scraping, processing, and analyzing Discord chat data with advanced visualization capabilities.
🌟 Features
📥 Data Collection
- Discord Bot Scraper: Automated extraction of complete message history from Discord servers
- Image Downloader: Downloads and processes images from Discord attachments with base64 conversion
- Text Embeddings: Generate semantic embeddings for chat messages using sentence transformers
📊 Visualization & Analysis
- Interactive Chat Visualizer: 2D visualization of chat messages using dimensionality reduction (PCA, t-SNE, UMAP)
- Clustering Analysis: Automated grouping of similar messages with DBSCAN and HDBSCAN
- Image Dataset Viewer: Browse and explore downloaded images by channel
🔧 Data Processing
- Batch Processing: Process multiple CSV files with embeddings
- Metadata Extraction: Comprehensive message metadata including timestamps, authors, and content
- Data Filtering: Advanced filtering by authors, channels, and timeframes
📁 Repository Structure
cult-scraper-1/
├── scripts/                      # Core data collection scripts
│   ├── bot.py                    # Discord bot for message scraping
│   ├── image_downloader.py       # Download and convert Discord images
│   ├── embedder.py               # Batch text embedding processor
│   └── embed_class.py            # Text embedding utilities
├── apps/                         # Interactive applications
│   ├── cluster_map/              # Chat message clustering & visualization
│   │   ├── main.py               # Main Streamlit application
│   │   ├── data_loader.py        # Data loading utilities
│   │   ├── clustering.py         # Clustering algorithms
│   │   ├── visualization.py      # Plotting and visualization
│   │   └── requirements.txt      # Dependencies
│   └── image_viewer/             # Image dataset browser
│       ├── image_viewer.py       # Streamlit image viewer
│       └── requirements.txt      # Dependencies
├── discord_chat_logs/            # Exported CSV files from Discord
└── images_dataset/               # Downloaded images and metadata
    └── images_dataset.json       # Image dataset with base64 data
🚀 Quick Start
1. Discord Data Scraping
First, set up and run the Discord bot to collect message data:
cd scripts
# Configure your bot token in bot.py
python bot.py
Requirements:
- Discord bot token with message content intent enabled
- Bot must have read permissions in target channels
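The scraping step can be sketched roughly as follows. This is a minimal illustration, not the contents of bot.py: the function names (`message_to_row`, `run_scraper`), the `;` separator for attachment URLs, and the single-channel export are all assumptions. The discord.py calls used (`Intents.message_content`, `channel.history`) are the standard ones for reading history with the Message Content Intent.

```python
import csv
import json
from typing import Any

# Column names follow the CSV layout documented in this README.
CSV_COLUMNS = [
    "message_id", "timestamp_utc", "author_id", "author_name",
    "author_nickname", "content", "attachment_urls", "embeds",
]


def message_to_row(message: Any) -> dict:
    """Flatten one discord.py message into a CSV-ready dict (hypothetical helper)."""
    author = message.author
    return {
        "message_id": message.id,
        "timestamp_utc": message.created_at.strftime("%Y-%m-%d %H:%M:%S"),
        "author_id": author.id,
        "author_name": author.name,
        # `nick` only exists on guild members, so fall back to an empty string.
        "author_nickname": getattr(author, "nick", None) or "",
        "content": message.content,
        # Assumption: multiple attachment URLs are joined with ';'.
        "attachment_urls": ";".join(a.url for a in message.attachments),
        "embeds": json.dumps([e.to_dict() for e in message.embeds]),
    }


def run_scraper(token: str, channel_id: int, out_path: str) -> None:
    """Export one channel's full history to CSV, then disconnect."""
    import discord  # imported lazily so message_to_row works without discord.py

    intents = discord.Intents.default()
    intents.message_content = True  # must also be enabled in the developer portal
    client = discord.Client(intents=intents)

    @client.event
    async def on_ready():
        channel = client.get_channel(channel_id)
        with open(out_path, "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=CSV_COLUMNS)
            writer.writeheader()
            # limit=None walks the whole history; oldest_first keeps rows chronological.
            async for message in channel.history(limit=None, oldest_first=True):
                writer.writerow(message_to_row(message))
        await client.close()

    client.run(token)
```

Calling `run_scraper(token, channel_id, "discord_chat_logs/general.csv")` would write one file per channel; the output filename here is only an example.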
2. Generate Text Embeddings
Process the collected chat data to add semantic embeddings:
cd scripts
python embedder.py
This will:
- Process all CSV files in discord_chat_logs/
- Add embeddings to message content using sentence transformers
- Save updated files with embedding vectors
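As a rough sketch of what this step does, assuming the embedding is stored as a JSON list in a `content_embedding` column (as in the CSV format documented in this README): `load_encoder` and `add_embeddings` are hypothetical names, not functions from embedder.py. The encoder is passed in as a callable so the column logic can be tried without downloading a model.

```python
import json
from typing import Callable, Sequence

import pandas as pd


def load_encoder(model_name: str = "all-MiniLM-L6-v2") -> Callable:
    """Return the model's encode method; imported lazily because loading is slow."""
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name).encode


def add_embeddings(df: pd.DataFrame, encode: Callable[[Sequence[str]], Sequence]) -> pd.DataFrame:
    """Return a copy of df with a JSON-encoded `content_embedding` column."""
    # Empty or missing messages are embedded as empty strings.
    texts = df["content"].fillna("").astype(str).tolist()
    vectors = encode(texts)
    out = df.copy()
    out["content_embedding"] = [json.dumps([float(x) for x in vec]) for vec in vectors]
    return out
```

A batch run would then loop over the CSV files, call `add_embeddings(pd.read_csv(path), load_encoder())`, and write each result back with `to_csv(path, index=False)`.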
3. Download Images
Extract and download images from Discord attachments:
cd scripts
python image_downloader.py
Features:
- Downloads images from attachment URLs
- Converts to base64 for storage
- Handles multiple image formats (PNG, JPG, GIF, WebP, etc.)
- Implements retry logic and rate limiting
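The download-and-convert behavior listed above can be sketched like this. It is an illustration under assumptions, not the code in image_downloader.py: the function names, the retry count, and the exponential backoff schedule are all hypothetical choices.

```python
import base64
import time
from typing import Callable, Optional


def fetch_bytes(url: str, timeout: float = 10.0) -> bytes:
    """Fetch a URL with requests; raises on HTTP errors."""
    import requests  # lazy import so the retry helper can be exercised offline

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content


def download_as_base64(
    url: str,
    fetch: Callable[[str], bytes] = fetch_bytes,
    retries: int = 3,
    backoff_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return the image as base64 text, or None after `retries` failed attempts."""
    for attempt in range(retries):
        try:
            return base64.b64encode(fetch(url)).decode("ascii")
        except Exception:
            if attempt < retries - 1:
                # Exponential backoff between attempts: 1s, 2s, 4s, ...
                sleep(backoff_seconds * 2 ** attempt)
    return None
```

Passing `fetch` and `sleep` as parameters keeps the retry logic testable without network access; a fixed delay between successful downloads could be added on top for rate limiting.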
4. Visualize Chat Data
Launch the interactive chat visualization tool:
cd apps/cluster_map
pip install -r requirements.txt
streamlit run main.py
Capabilities:
- 2D visualization using PCA, t-SNE, or UMAP
- Interactive clustering with DBSCAN/HDBSCAN
- Filter by channels, authors, and time periods
- Hover to see message content and metadata
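For orientation, the core of the projection and clustering can be sketched with scikit-learn. This assumes clustering runs on the 2D projection rather than on the full embeddings, which may differ from what clustering.py does; `project_and_cluster` and its default parameters are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA


def project_and_cluster(embeddings: np.ndarray, eps: float = 0.5, min_samples: int = 5):
    """Reduce embeddings to 2D with PCA, then cluster the 2D points with DBSCAN."""
    coords = PCA(n_components=2, random_state=0).fit_transform(embeddings)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
    return coords, labels  # label -1 marks points DBSCAN treats as noise
```

The returned `coords` can be plotted directly (for example as a Plotly scatter colored by `labels`); `eps` usually needs tuning per dataset because it depends on the scale of the projected points.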
5. Browse Image Dataset
View downloaded images in an organized interface:
cd apps/image_viewer
pip install -r requirements.txt
streamlit run image_viewer.py
Features:
- Channel-based organization
- Navigation controls (previous/next)
- Image metadata display
- Responsive layout
📋 Data Formats
Discord Chat Logs (CSV)
message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds,content_embedding
1234567890,2025-08-11 12:34:56,9876543210,username,nickname,"Hello world!","https://cdn.discordapp.com/...",{},"[0.123, -0.456, ...]"
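Reading such a file back might look like the sketch below. It assumes the `content_embedding` column holds a JSON list of floats (as the example row suggests) and may be empty for rows that have not been embedded yet; `read_chat_log` is a hypothetical helper.

```python
import csv
import json
from typing import IO, List


def read_chat_log(fh: IO[str]) -> List[dict]:
    """Parse a chat-log CSV, decoding `content_embedding` into a list of floats."""
    rows = []
    for row in csv.DictReader(fh):
        raw = row.get("content_embedding") or ""
        # Rows without an embedding keep None instead of a vector.
        row["content_embedding"] = json.loads(raw) if raw else None
        rows.append(row)
    return rows
```

Open the file with `open(path, newline="", encoding="utf-8")` so quoted fields containing commas or line breaks are handled by the csv module.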
Image Dataset (JSON)
{
  "metadata": {
    "created_at": "2025-08-11 12:34:56 UTC",
    "summary": {
      "total_images": 42,
      "channels": ["memes", "general"],
      "total_size_bytes": 1234567,
      "file_extensions": [".png", ".jpg"],
      "authors": ["user1", "user2"]
    }
  },
  "images": [
    {
      "url": "https://cdn.discordapp.com/attachments/...",
      "channel": "memes",
      "author_name": "username",
      "timestamp_utc": "2025-08-11 12:34:56+00:00",
      "content": "Message text",
      "file_extension": ".png",
      "file_size": 54321,
      "base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
    }
  ]
}
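A consumer of this file, such as a channel-based viewer, could load it along these lines. The sketch relies only on the `images`, `channel`, and `base64_data` keys shown above; `load_images_by_channel` is a hypothetical name, not a function from image_viewer.py.

```python
import base64
import json
from collections import defaultdict


def load_images_by_channel(path: str) -> dict:
    """Group dataset entries by channel, pairing each entry with its decoded bytes."""
    with open(path, encoding="utf-8") as fh:
        dataset = json.load(fh)
    by_channel = defaultdict(list)
    for entry in dataset["images"]:
        image_bytes = base64.b64decode(entry["base64_data"])
        by_channel[entry["channel"]].append((entry, image_bytes))
    return dict(by_channel)
```

The decoded bytes can be passed to an image library or written to disk using the entry's `file_extension`.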
🔧 Configuration
Discord Bot Setup
- Create a Discord application at https://discord.com/developers/applications
- Create a bot and copy the token
- Enable the following intents:
  - Message Content Intent
  - Server Members Intent (optional)
- Invite bot to your server with appropriate permissions
Environment Variables
# Set in scripts/bot.py
BOT_TOKEN = "your_discord_bot_token_here"
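If you would rather not keep the token in source code, one option is to read it from the process environment. This is a suggested alternative, not how bot.py is currently written; the helper name and the `BOT_TOKEN` variable name are assumptions.

```python
import os


def load_bot_token(env_var: str = "BOT_TOKEN") -> str:
    """Read the bot token from the environment and fail loudly if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your bot token")
    return token
```

With this in place, `BOT_TOKEN = load_bot_token()` would replace the hardcoded string, which keeps the token out of version control.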
Embedding Models
The system uses sentence-transformers models. Default: all-MiniLM-L6-v2
Supported models:
- all-MiniLM-L6-v2 (lightweight, fast)
- all-mpnet-base-v2 (higher quality)
- sentence-transformers/all-roberta-large-v1 (best quality, slower)
📊 Visualization Features
Chat Message Clustering
- Dimensionality Reduction: PCA, t-SNE, UMAP
- Clustering Algorithms: DBSCAN, HDBSCAN with automatic parameter tuning
- Interactive Controls: Filter by source files, authors, and clusters
- Hover Information: View message content, author, timestamp on hover
Image Analysis
- Channel Organization: Browse images by Discord channel
- Metadata Display: Author, timestamp, message context
- Navigation: Previous/next controls with slider
- Format Support: PNG, JPG, GIF, WebP, BMP, TIFF
🛠️ Dependencies
Core Scripts
- discord.py - Discord bot framework
- pandas - Data manipulation
- sentence-transformers - Text embeddings
- requests - HTTP requests for image downloads
Visualization Apps
- streamlit - Web interface framework
- plotly - Interactive plotting
- scikit-learn - Machine learning algorithms
- numpy - Numerical computations
- umap-learn - Dimensionality reduction
- hdbscan - Density-based clustering
📈 Use Cases
Research & Analytics
- Community Analysis: Understand conversation patterns and topics
- Sentiment Analysis: Track mood and sentiment over time
- User Behavior: Analyze posting patterns and engagement
- Content Moderation: Identify problematic content clusters
Data Science Projects
- NLP Research: Experiment with text embeddings and clustering
- Social Network Analysis: Study communication patterns
- Visualization Techniques: Explore dimensionality reduction methods
- Image Processing: Analyze visual content sharing patterns
Content Management
- Archive Creation: Preserve Discord community history
- Content Discovery: Find similar messages and discussions
- Moderation Tools: Identify spam or inappropriate content
- Backup Solutions: Create comprehensive data backups
🔒 Privacy & Ethics
- Data Protection: All processing happens locally
- User Consent: Ensure proper permissions before scraping
- Compliance: Follow Discord's Terms of Service
- Anonymization: Consider removing or hashing user IDs for research
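For the anonymization point, a keyed hash is one way to replace user IDs with stable pseudonyms. The sketch below uses HMAC-SHA256 from the standard library; the function name, the 16-character truncation, and the key handling are assumptions, and the toolkit itself does not necessarily include this step.

```python
import hashlib
import hmac


def pseudonymize_id(user_id, secret_key: bytes) -> str:
    """Map a user ID to a stable pseudonym that cannot be reversed without the key."""
    digest = hmac.new(secret_key, str(user_id).encode("utf-8"), hashlib.sha256)
    # Truncated for readability; keep the full digest if collisions are a concern.
    return digest.hexdigest()[:16]
```

A keyed hash is preferable to a plain hash here because Discord IDs are numeric and could otherwise be recovered by hashing candidate IDs and comparing; keep the key private and separate from the published data.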
🤝 Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
📄 License
This project is intended for educational and research purposes. Please ensure compliance with Discord's Terms of Service and applicable privacy laws when using this toolkit.
🆘 Troubleshooting
Common Issues
Bot can't read messages:
- Ensure Message Content Intent is enabled
- Check bot permissions in Discord server
- Verify bot token is correct
Embeddings not generating:
- Install sentence-transformers: pip install sentence-transformers
- Check available GPU memory for large models
- Try a smaller model like all-MiniLM-L6-v2
Images not downloading:
- Check internet connectivity
- Verify Discord CDN URLs are accessible
- Increase retry limits for unreliable connections
Visualization not loading:
- Ensure all requirements are installed
- Check that CSV files have embeddings
- Try reducing dataset size for better performance