# Discord Data Analysis & Visualization Suite

A comprehensive toolkit for scraping, processing, and analyzing Discord chat data with advanced visualization capabilities.

## 🌟 Features

### 📥 Data Collection
- **Discord Bot Scraper**: Automated extraction of complete message history from Discord servers
- **Image Downloader**: Downloads images from Discord attachments and converts them to base64
- **Text Embeddings**: Generates semantic embeddings for chat messages using sentence transformers

### 📊 Visualization & Analysis
- **Interactive Chat Visualizer**: 2D visualization of chat messages using dimensionality reduction (PCA, t-SNE, UMAP)
- **Clustering Analysis**: Automated grouping of similar messages with DBSCAN and HDBSCAN
- **Image Dataset Viewer**: Browse and explore downloaded images by channel

### 🔧 Data Processing
- **Batch Processing**: Process multiple CSV files with embeddings
- **Metadata Extraction**: Comprehensive message metadata including timestamps, authors, and content
- **Data Filtering**: Advanced filtering by authors, channels, and timeframes

## 📁 Repository Structure

```
cult-scraper-1/
├── scripts/                    # Core data collection scripts
│   ├── bot.py                  # Discord bot for message scraping
│   ├── image_downloader.py     # Download and convert Discord images
│   ├── embedder.py             # Batch text embedding processor
│   └── embed_class.py          # Text embedding utilities
├── apps/                       # Interactive applications
│   ├── cluster_map/            # Chat message clustering & visualization
│   │   ├── main.py             # Main Streamlit application
│   │   ├── data_loader.py      # Data loading utilities
│   │   ├── clustering.py       # Clustering algorithms
│   │   ├── visualization.py    # Plotting and visualization
│   │   └── requirements.txt    # Dependencies
│   └── image_viewer/           # Image dataset browser
│       ├── image_viewer.py     # Streamlit image viewer
│       └── requirements.txt    # Dependencies
├── discord_chat_logs/          # Exported CSV files from Discord
└── images_dataset/             # Downloaded images and metadata
    └── images_dataset.json     # Image dataset with base64 data
```

## 🚀 Quick Start

The commands below assume you start each step from the repository root.

### 1. Discord Data Scraping

First, set up and run the Discord bot to collect message data:

```bash
cd scripts
# Configure your bot token in bot.py
python bot.py
```

**Requirements:**
- Discord bot token with message content intent enabled
- Bot must have read permissions in target channels

### 2. Generate Text Embeddings

Process the collected chat data to add semantic embeddings:

```bash
cd scripts
python embedder.py
```

This will:
- Process all CSV files in `discord_chat_logs/`
- Add embeddings to message content using sentence transformers
- Save updated files with embedding vectors

### 3. Download Images

Extract and download images from Discord attachments:

```bash
cd scripts
python image_downloader.py
```

Features:
- Downloads images from attachment URLs
- Converts to base64 for storage
- Handles multiple image formats (PNG, JPG, GIF, WebP, etc.)
- Implements retry logic and rate limiting

### 4. Visualize Chat Data

Launch the interactive chat visualization tool:

```bash
cd apps/cluster_map
pip install -r requirements.txt
streamlit run main.py
```

**Capabilities:**
- 2D visualization using PCA, t-SNE, or UMAP
- Interactive clustering with DBSCAN/HDBSCAN
- Filter by channels, authors, and time periods
- Hover to see message content and metadata

### 5. Browse Image Dataset

View downloaded images in an organized interface:

```bash
cd apps/image_viewer
pip install -r requirements.txt
streamlit run image_viewer.py
```

**Features:**
- Channel-based organization
- Navigation controls (previous/next)
- Image metadata display
- Responsive layout

## 📋 Data Formats

### Discord Chat Logs (CSV)

```csv
message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds,content_embedding
1234567890,2025-08-11 12:34:56,9876543210,username,nickname,"Hello world!","https://cdn.discordapp.com/...",{},"[0.123, -0.456, ...]"
```

### Image Dataset (JSON)

```json
{
  "metadata": {
    "created_at": "2025-08-11 12:34:56 UTC",
    "summary": {
      "total_images": 42,
      "channels": ["memes", "general"],
      "total_size_bytes": 1234567,
      "file_extensions": [".png", ".jpg"],
      "authors": ["user1", "user2"]
    }
  },
  "images": [
    {
      "url": "https://cdn.discordapp.com/attachments/...",
      "channel": "memes",
      "author_name": "username",
      "timestamp_utc": "2025-08-11 12:34:56+00:00",
      "content": "Message text",
      "file_extension": ".png",
      "file_size": 54321,
      "base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
    }
  ]
}
```

## 🔧 Configuration

### Discord Bot Setup

1. Create a Discord application at https://discord.com/developers/applications
2. Create a bot and copy the token
3. Enable the following intents:
   - Message Content Intent
   - Server Members Intent (optional)
4. Invite the bot to your server with appropriate permissions

### Environment Variables

The token is set as a constant in `scripts/bot.py` rather than read from the shell environment:

```python
# Set in scripts/bot.py
BOT_TOKEN = "your_discord_bot_token_here"
```

### Embedding Models

The system uses sentence-transformers models.

Default: `all-MiniLM-L6-v2`

Supported models:
- `all-MiniLM-L6-v2` (lightweight, fast)
- `all-mpnet-base-v2` (higher quality)
- `sentence-transformers/all-roberta-large-v1` (best quality, slower)

## 📊 Visualization Features

### Chat Message Clustering
- **Dimensionality Reduction**: PCA, t-SNE, UMAP
- **Clustering Algorithms**: DBSCAN, HDBSCAN with automatic parameter tuning
- **Interactive Controls**: Filter by source files, authors, and clusters
- **Hover Information**: View message content, author, and timestamp on hover

### Image Analysis
- **Channel Organization**: Browse images by Discord channel
- **Metadata Display**: Author, timestamp, message context
- **Navigation**: Previous/next controls with slider
- **Format Support**: PNG, JPG, GIF, WebP, BMP, TIFF

## 🛠️ Dependencies

### Core Scripts
- `discord.py` - Discord bot framework
- `pandas` - Data manipulation
- `sentence-transformers` - Text embeddings
- `requests` - HTTP requests for image downloads

### Visualization Apps
- `streamlit` - Web interface framework
- `plotly` - Interactive plotting
- `scikit-learn` - Machine learning algorithms
- `numpy` - Numerical computations
- `umap-learn` - Dimensionality reduction
- `hdbscan` - Density-based clustering

## 📈 Use Cases

### Research & Analytics
- **Community Analysis**: Understand conversation patterns and topics
- **Sentiment Analysis**: Track mood and sentiment over time
- **User Behavior**: Analyze posting patterns and engagement
- **Content Moderation**: Identify problematic content clusters

### Data Science Projects
- **NLP Research**: Experiment with text embeddings and clustering
- **Social Network Analysis**: Study communication patterns
- **Visualization Techniques**: Explore dimensionality reduction methods
- **Image Processing**: Analyze visual content sharing patterns

### Content Management
- **Archive Creation**: Preserve Discord community history
- **Content Discovery**: Find similar messages and discussions
- **Moderation Tools**: Identify spam or inappropriate content
- **Backup Solutions**: Create comprehensive data backups

## 🔒 Privacy & Ethics

- **Data Protection**: All processing happens locally
- **User Consent**: Ensure proper permissions before scraping
- **Compliance**: Follow Discord's Terms of Service
- **Anonymization**: Consider removing or hashing user IDs for research

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## 📄 License

This project is intended for educational and research purposes. Please ensure compliance with Discord's Terms of Service and applicable privacy laws when using this toolkit.

## 🆘 Troubleshooting

### Common Issues

**Bot can't read messages:**
- Ensure Message Content Intent is enabled
- Check bot permissions in the Discord server
- Verify the bot token is correct

**Embeddings not generating:**
- Install sentence-transformers: `pip install sentence-transformers`
- Check available GPU memory for large models
- Try a smaller model like `all-MiniLM-L6-v2`

**Images not downloading:**
- Check internet connectivity
- Verify Discord CDN URLs are accessible
- Increase retry limits for unreliable connections

**Visualization not loading:**
- Ensure all requirements are installed
- Check that CSV files have embeddings
- Try reducing dataset size for better performance

## 📚 Additional Resources

- [discord.py Documentation](https://discordpy.readthedocs.io/)
- [Sentence Transformers Models](https://www.sbert.net/docs/pretrained_models.html)
- [Streamlit Documentation](https://docs.streamlit.io/)
- [scikit-learn Clustering](https://scikit-learn.org/stable/modules/clustering.html)
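
## 🧪 Example: Reading a Chat Log CSV

The chat log layout described under Data Formats stores each embedding as a bracketed list inside a quoted CSV cell. The following is a minimal sketch of how such a file might be read using only the standard library. It assumes the `content_embedding` cell parses as a JSON list of numbers; the function name `load_embeddings` and the sample row are hypothetical and are not taken from the repository's own `data_loader.py`.

```python
import csv
import io
import json

# Hypothetical sample row following the documented column layout.
SAMPLE_CSV = (
    "message_id,timestamp_utc,author_id,author_name,author_nickname,"
    "content,attachment_urls,embeds,content_embedding\n"
    "1234567890,2025-08-11 12:34:56,9876543210,username,nickname,"
    '"Hello world!","",{},"[0.123, -0.456, 0.789]"\n'
)


def load_embeddings(csv_file):
    """Return (message_id, content, embedding) tuples from a chat log.

    Assumes content_embedding holds a JSON-style list of numbers. Rows with
    an empty or unparseable embedding are skipped rather than raising.
    """
    rows = []
    for row in csv.DictReader(csv_file):
        raw = (row.get("content_embedding") or "").strip()
        if not raw:
            continue
        try:
            vector = [float(x) for x in json.loads(raw)]
        except (ValueError, TypeError):
            continue
        rows.append((row["message_id"], row["content"], vector))
    return rows


messages = load_embeddings(io.StringIO(SAMPLE_CSV))
print(len(messages), len(messages[0][2]))
```

For a real file, pass an open handle instead of the `io.StringIO` wrapper, for example `open(path, newline="", encoding="utf-8")`. Skipping rows without embeddings mirrors the troubleshooting advice to check that CSV files have embeddings before visualizing them.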
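
## 🧪 Example: Decoding the Image Dataset

The image dataset layout described under Data Formats keeps each image as a base64 string next to its channel and author. The sketch below shows one way those entries could be decoded and grouped by channel with the standard library. The helper name `images_by_channel` and the miniature dataset are assumptions made for illustration; the payload here is only the 8-byte PNG signature, not a full image.

```python
import base64
import json
from collections import defaultdict

# Hypothetical miniature dataset in the documented JSON layout.
SAMPLE_JSON = json.dumps({
    "metadata": {"summary": {"total_images": 1, "channels": ["memes"]}},
    "images": [{
        "url": "https://cdn.discordapp.com/attachments/...",
        "channel": "memes",
        "author_name": "username",
        "file_extension": ".png",
        # PNG file signature only, standing in for real image bytes.
        "base64_data": base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii"),
    }],
})


def images_by_channel(dataset):
    """Group (file_extension, decoded_bytes) pairs by channel name."""
    grouped = defaultdict(list)
    for entry in dataset.get("images", []):
        raw = base64.b64decode(entry["base64_data"])
        grouped[entry["channel"]].append((entry["file_extension"], raw))
    return dict(grouped)


grouped = images_by_channel(json.loads(SAMPLE_JSON))
for channel, items in grouped.items():
    print(channel, [(ext, len(raw)) for ext, raw in items])
```

With a real `images_dataset.json`, replace `json.loads(SAMPLE_JSON)` with `json.load` on the opened file. The decoded bytes can then be written to disk or handed to an image library.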
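
## 🧪 Example: Hashing User IDs

The Privacy & Ethics section suggests removing or hashing user IDs for research. One possible approach, sketched here, is a keyed hash (HMAC-SHA256) so that the same author always maps to the same pseudonym while the raw ID is not stored. The function name `pseudonymize`, the key value, and the 16-character truncation are illustrative assumptions; the toolkit itself may not do any of this.

```python
import hashlib
import hmac


def pseudonymize(user_id, secret_key):
    """Map a user ID to a stable pseudonym with HMAC-SHA256.

    The same (user_id, secret_key) pair always yields the same token, so
    per-author analysis still works without exposing the raw ID.
    """
    digest = hmac.new(secret_key, str(user_id).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated to 64 bits for readability


# Hypothetical key; in practice keep it secret and out of the repository.
KEY = b"replace-with-a-random-secret"

alias = pseudonymize("9876543210", KEY)
print(alias)
```

A keyed hash is used instead of a bare SHA-256 because numeric IDs are easier to guess and re-hash when no secret is involved. Anyone who holds the key can still link pseudonyms back to known IDs, so this is pseudonymization rather than full anonymization.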