diff --git a/README.md b/README.md
index 52ebaba..645ded3 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,281 @@
-# cult-scraper
+# Discord Data Analysis & Visualization Suite
 
+A comprehensive toolkit for scraping, processing, and analyzing Discord chat data with advanced visualization capabilities.
+
+## 🌟 Features
+
+### 📥 Data Collection
+- **Discord Bot Scraper**: Automated extraction of complete message history from Discord servers
+- **Image Downloader**: Downloads and processes images from Discord attachments with base64 conversion
+- **Text Embeddings**: Generate semantic embeddings for chat messages using sentence transformers
+
+### 📊 Visualization & Analysis
+- **Interactive Chat Visualizer**: 2D visualization of chat messages using dimensionality reduction (PCA, t-SNE)
+- **Clustering Analysis**: Automated grouping of similar messages with DBSCAN and HDBSCAN
+- **Image Dataset Viewer**: Browse and explore downloaded images by channel
+
+### 🔧 Data Processing
+- **Batch Processing**: Process multiple CSV files with embeddings
+- **Metadata Extraction**: Comprehensive message metadata including timestamps, authors, and content
+- **Data Filtering**: Advanced filtering by authors, channels, and timeframes
+
+## 📁 Repository Structure
+
+```
+cult-scraper-1/
+├── scripts/                    # Core data collection scripts
+│   ├── bot.py                  # Discord bot for message scraping
+│   ├── image_downloader.py     # Download and convert Discord images
+│   ├── embedder.py             # Batch text embedding processor
+│   └── embed_class.py          # Text embedding utilities
+├── apps/                       # Interactive applications
+│   ├── cluster_map/            # Chat message clustering & visualization
+│   │   ├── main.py             # Main Streamlit application
+│   │   ├── data_loader.py      # Data loading utilities
+│   │   ├── clustering.py       # Clustering algorithms
+│   │   ├── visualization.py    # Plotting and visualization
+│   │   └── requirements.txt    # Dependencies
+│   └── image_viewer/           # Image dataset browser
+│       ├── image_viewer.py     # Streamlit image viewer
+│       └── requirements.txt    # Dependencies
+├── discord_chat_logs/          # Exported CSV files from Discord
+└── images_dataset/             # Downloaded images and metadata
+    └── images_dataset.json     # Image dataset with base64 data
+```
+
+## 🚀 Quick Start
+
+### 1. Discord Data Scraping
+
+First, set up and run the Discord bot to collect message data:
+
+```bash
+cd scripts
+# Configure your bot token in bot.py
+python bot.py
+```
+
+**Requirements:**
+- Discord bot token with message content intent enabled
+- Bot must have read permissions in target channels
+
+### 2. Generate Text Embeddings
+
+Process the collected chat data to add semantic embeddings:
+
+```bash
+cd scripts
+python embedder.py
+```
+
+This will:
+- Process all CSV files in `discord_chat_logs/`
+- Add embeddings to message content using sentence transformers
+- Save updated files with embedding vectors
+
+### 3. Download Images
+
+Extract and download images from Discord attachments:
+
+```bash
+cd scripts
+python image_downloader.py
+```
+
+Features:
+- Downloads images from attachment URLs
+- Converts to base64 for storage
+- Handles multiple image formats (PNG, JPG, GIF, WebP, etc.)
+- Implements retry logic and rate limiting
+
+### 4. Visualize Chat Data
+
+Launch the interactive chat visualization tool:
+
+```bash
+cd apps/cluster_map
+pip install -r requirements.txt
+streamlit run main.py
+```
+
+**Capabilities:**
+- 2D visualization using PCA or t-SNE
+- Interactive clustering with DBSCAN/HDBSCAN
+- Filter by channels, authors, and time periods
+- Hover to see message content and metadata
+
+### 5. Browse Image Dataset
+
+View downloaded images in an organized interface:
+
+```bash
+cd apps/image_viewer
+pip install -r requirements.txt
+streamlit run image_viewer.py
+```
+
+**Features:**
+- Channel-based organization
+- Navigation controls (previous/next)
+- Image metadata display
+- Responsive layout
+
+## 📋 Data Formats
+
+### Discord Chat Logs (CSV)
+```csv
+message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds,content_embedding
+1234567890,2025-08-11 12:34:56,9876543210,username,nickname,"Hello world!","https://cdn.discordapp.com/...",{},"[0.123, -0.456, ...]"
+```
+
+### Image Dataset (JSON)
+```json
+{
+  "metadata": {
+    "created_at": "2025-08-11 12:34:56 UTC",
+    "summary": {
+      "total_images": 42,
+      "channels": ["memes", "general"],
+      "total_size_bytes": 1234567,
+      "file_extensions": [".png", ".jpg"],
+      "authors": ["user1", "user2"]
+    }
+  },
+  "images": [
+    {
+      "url": "https://cdn.discordapp.com/attachments/...",
+      "channel": "memes",
+      "author_name": "username",
+      "timestamp_utc": "2025-08-11 12:34:56+00:00",
+      "content": "Message text",
+      "file_extension": ".png",
+      "file_size": 54321,
+      "base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
+    }
+  ]
+}
+```
+
+## 🔧 Configuration
+
+### Discord Bot Setup
+1. Create a Discord application at https://discord.com/developers/applications
+2. Create a bot and copy the token
+3. Enable the following intents:
+   - Message Content Intent
+   - Server Members Intent (optional)
+4. Invite bot to your server with appropriate permissions
+
+### Environment Variables
+```python
+# Set in scripts/bot.py
+BOT_TOKEN = "your_discord_bot_token_here"
+```
+
+### Embedding Models
+The system uses sentence-transformers models. Default: `all-MiniLM-L6-v2`
+
+Supported models:
+- `all-MiniLM-L6-v2` (lightweight, fast)
+- `all-mpnet-base-v2` (higher quality)
+- `sentence-transformers/all-roberta-large-v1` (best quality, slower)
+
+## 📊 Visualization Features
+
+### Chat Message Clustering
+- **Dimensionality Reduction**: PCA, t-SNE, UMAP
+- **Clustering Algorithms**: DBSCAN, HDBSCAN with automatic parameter tuning
+- **Interactive Controls**: Filter by source files, authors, and clusters
+- **Hover Information**: View message content, author, timestamp on hover
+
+### Image Analysis
+- **Channel Organization**: Browse images by Discord channel
+- **Metadata Display**: Author, timestamp, message context
+- **Navigation**: Previous/next controls with slider
+- **Format Support**: PNG, JPG, GIF, WebP, BMP, TIFF
+
+## 🛠️ Dependencies
+
+### Core Scripts
+- `discord.py` - Discord bot framework
+- `pandas` - Data manipulation
+- `sentence-transformers` - Text embeddings
+- `requests` - HTTP requests for image downloads
+
+### Visualization Apps
+- `streamlit` - Web interface framework
+- `plotly` - Interactive plotting
+- `scikit-learn` - Machine learning algorithms
+- `numpy` - Numerical computations
+- `umap-learn` - Dimensionality reduction
+- `hdbscan` - Density-based clustering
+
+## 📈 Use Cases
+
+### Research & Analytics
+- **Community Analysis**: Understand conversation patterns and topics
+- **Sentiment Analysis**: Track mood and sentiment over time
+- **User Behavior**: Analyze posting patterns and engagement
+- **Content Moderation**: Identify problematic content clusters
+
+### Data Science Projects
+- **NLP Research**: Experiment with text embeddings and clustering
+- **Social Network Analysis**: Study communication patterns
+- **Visualization Techniques**: Explore dimensionality reduction methods
+- **Image Processing**: Analyze visual content sharing patterns
+
+### Content Management
+- **Archive Creation**: Preserve Discord community history
+- **Content Discovery**: Find similar messages and discussions
+- **Moderation Tools**: Identify spam or inappropriate content
+- **Backup Solutions**: Create comprehensive data backups
+
+## 🔒 Privacy & Ethics
+
+- **Data Protection**: All processing happens locally
+- **User Consent**: Ensure proper permissions before scraping
+- **Compliance**: Follow Discord's Terms of Service
+- **Anonymization**: Consider removing or hashing user IDs for research
+
+## 🤝 Contributing
+
+1. Fork the repository
+2. Create a feature branch
+3. Make your changes
+4. Test thoroughly
+5. Submit a pull request
+
+## 📄 License
+
+This project is intended for educational and research purposes. Please ensure compliance with Discord's Terms of Service and applicable privacy laws when using this toolkit.
+
+## 🆘 Troubleshooting
+
+### Common Issues
+
+**Bot can't read messages:**
+- Ensure Message Content Intent is enabled
+- Check bot permissions in Discord server
+- Verify bot token is correct
+
+**Embeddings not generating:**
+- Install sentence-transformers: `pip install sentence-transformers`
+- Check available GPU memory for large models
+- Try a smaller model like `all-MiniLM-L6-v2`
+
+**Images not downloading:**
+- Check internet connectivity
+- Verify Discord CDN URLs are accessible
+- Increase retry limits for unreliable connections
+
+**Visualization not loading:**
+- Ensure all requirements are installed
+- Check that CSV files have embeddings
+- Try reducing dataset size for better performance
+
+## 📚 Additional Resources
+
+- [Discord.py Documentation](https://discordpy.readthedocs.io/)
+- [Sentence Transformers Models](https://www.sbert.net/docs/pretrained_models.html)
+- [Streamlit Documentation](https://docs.streamlit.io/)
+- [scikit-learn Clustering](https://scikit-learn.org/stable/modules/clustering.html)
\ No newline at end of file
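Reviewer sketch (not part of the diff): the README documents a `content_embedding` CSV column and suggests hashing user IDs, but does not show how a consumer would read either. The snippet below is a minimal illustration using only the standard library. It assumes the embedding is stored as a JSON-style list string, as the sample row suggests. The sample data, the `load_rows` and `hash_author_id` helpers, and the salt value are all hypothetical and do not come from the repository.

```python
import csv
import hashlib
import io
import json

# Hypothetical sample in the CSV layout shown under "Data Formats";
# the values are made up for illustration.
SAMPLE_CSV = '''message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds,content_embedding
1,2025-08-11 12:34:56,9876543210,username,nickname,"Hello world!",,{},"[0.5, -0.25, 1.0]"
2,2025-08-11 12:35:10,9876543210,username,nickname,"No embedding yet",,{},
'''


def hash_author_id(author_id: str, salt: str) -> str:
    """Return a salted SHA-256 hex digest so the raw user ID is not kept."""
    return hashlib.sha256((salt + author_id).encode("utf-8")).hexdigest()


def load_rows(csv_file, salt: str = "change-me"):
    """Yield (hashed_author_id, content, embedding) for rows that have an embedding."""
    for row in csv.DictReader(csv_file):
        raw = row.get("content_embedding") or ""
        if not raw.strip():
            continue  # skip rows that were never embedded
        # Assumption: the column holds a JSON-style list of numbers.
        embedding = [float(x) for x in json.loads(raw)]
        yield hash_author_id(row["author_id"], salt), row["content"], embedding


rows = list(load_rows(io.StringIO(SAMPLE_CSV)))
```

If the files written by `embedder.py` serialize vectors differently (for example, NumPy's space-separated `repr`), the `json.loads` call would need to be replaced with a matching parser.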