updated readme

2025-08-11 03:07:44 +01:00
parent 2b8659fc95
commit fd9b25f256
1 changed files with 280 additions and 1 deletions
--- a/README.md
+++ b/README.md
@@ -1,2 +1,281 @@
-# cult-scraper
+# Discord Data Analysis & Visualization Suite

+A toolkit for scraping Discord chat history, embedding and clustering the messages, and exploring the results in interactive Streamlit apps.
+
+## 🌟 Features
+
+### 📥 Data Collection
+- **Discord Bot Scraper**: Automated extraction of complete message history from Discord servers
+- **Image Downloader**: Downloads and processes images from Discord attachments with base64 conversion
+- **Text Embeddings**: Generate semantic embeddings for chat messages using sentence transformers
+
+### 📊 Visualization & Analysis
+- **Interactive Chat Visualizer**: 2D visualization of chat messages using dimensionality reduction (PCA, t-SNE, UMAP)
+- **Clustering Analysis**: Automated grouping of similar messages with DBSCAN and HDBSCAN
+- **Image Dataset Viewer**: Browse and explore downloaded images by channel
+
+### 🔧 Data Processing
+- **Batch Processing**: Process multiple CSV files with embeddings
+- **Metadata Extraction**: Comprehensive message metadata including timestamps, authors, and content
+- **Data Filtering**: Advanced filtering by authors, channels, and timeframes
+
+## 📁 Repository Structure
+
+```
+cult-scraper-1/
+├── scripts/                          # Core data collection scripts
+│   ├── bot.py                        # Discord bot for message scraping
+│   ├── image_downloader.py           # Download and convert Discord images
+│   ├── embedder.py                   # Batch text embedding processor
+│   └── embed_class.py                # Text embedding utilities
+├── apps/                             # Interactive applications
+│   ├── cluster_map/                  # Chat message clustering & visualization
+│   │   ├── main.py                   # Main Streamlit application
+│   │   ├── data_loader.py            # Data loading utilities
+│   │   ├── clustering.py             # Clustering algorithms
+│   │   ├── visualization.py          # Plotting and visualization
+│   │   └── requirements.txt          # Dependencies
+│   └── image_viewer/                 # Image dataset browser
+│       ├── image_viewer.py           # Streamlit image viewer
+│       └── requirements.txt          # Dependencies
+├── discord_chat_logs/                # Exported CSV files from Discord
+└── images_dataset/                   # Downloaded images and metadata
+    └── images_dataset.json           # Image dataset with base64 data
+```
+
+## 🚀 Quick Start
+
+### 1. Discord Data Scraping
+
+First, set up and run the Discord bot to collect message data:
+
+```bash
+cd scripts
+# Configure your bot token in bot.py
+python bot.py
+```
+
+**Requirements:**
+- Discord bot token with message content intent enabled
+- Bot must have read permissions in target channels
+
+### 2. Generate Text Embeddings
+
+Process the collected chat data to add semantic embeddings:
+
+```bash
+cd scripts
+python embedder.py
+```
+
+This will:
+- Process all CSV files in `discord_chat_logs/`
+- Add embeddings to message content using sentence transformers
+- Save updated files with embedding vectors
+
+### 3. Download Images
+
+Extract and download images from Discord attachments:
+
+```bash
+cd scripts
+python image_downloader.py
+```
+
+Features:
+- Downloads images from attachment URLs
+- Converts to base64 for storage
+- Handles multiple image formats (PNG, JPG, GIF, WebP, etc.)
+- Implements retry logic and rate limiting
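
The retry and base64 steps listed above can be sketched as follows. This is a minimal illustration using only the standard library, not the code in `image_downloader.py`; the names `fetch_bytes` and `download_as_base64` and the default delays are assumptions.

```python
import base64
import time
import urllib.request


def fetch_bytes(url, timeout=10.0):
    """Download a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_as_base64(url, fetch=fetch_bytes, retries=3, base_delay=1.0, sleep=time.sleep):
    """Fetch `url`, retrying with exponential backoff, and return base64 text."""
    last_error = None
    for attempt in range(retries):
        try:
            return base64.b64encode(fetch(url)).decode("ascii")
        except OSError as error:  # urllib.error.URLError is a subclass of OSError
            last_error = error
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)  # wait 1s, then 2s, then 4s, ...
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

Passing `fetch` and `sleep` as parameters keeps the function testable without network access. A fixed pause between successive downloads would be the simplest form of rate limiting to layer on top.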
+
+### 4. Visualize Chat Data
+
+Launch the interactive chat visualization tool:
+
+```bash
+cd apps/cluster_map
+pip install -r requirements.txt
+streamlit run main.py
+```
+
+**Capabilities:**
+- 2D visualization using PCA, t-SNE, or UMAP
+- Interactive clustering with DBSCAN/HDBSCAN
+- Filter by channels, authors, and time periods
+- Hover to see message content and metadata
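
Filtering by author and time period amounts to a few comparisons per row. The sketch below assumes rows are dictionaries with the `author_name` and `timestamp_utc` columns described under Data Formats; `filter_messages` is a hypothetical helper, not a function from the app.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # e.g. 2025-08-11 12:34:56


def filter_messages(rows, authors=None, start=None, end=None):
    """Keep rows whose author is in `authors` and whose time lies in [start, end]."""
    kept = []
    for row in rows:
        # Slice to 19 characters so a trailing "+00:00" offset is ignored.
        sent_at = datetime.strptime(row["timestamp_utc"][:19], TIMESTAMP_FORMAT)
        if authors is not None and row["author_name"] not in authors:
            continue
        if start is not None and sent_at < start:
            continue
        if end is not None and sent_at > end:
            continue
        kept.append(row)
    return kept


rows = [
    {"author_name": "user1", "timestamp_utc": "2025-08-11 12:34:56", "content": "Hello world!"},
    {"author_name": "user2", "timestamp_utc": "2025-08-12 09:00:00", "content": "Later"},
]
recent = filter_messages(rows, authors={"user1", "user2"}, start=datetime(2025, 8, 12))
```

Leaving a filter as `None` disables it, so the same function covers author-only, time-only, and combined queries.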
+
+### 5. Browse Image Dataset
+
+View downloaded images in an organized interface:
+
+```bash
+cd apps/image_viewer
+pip install -r requirements.txt
+streamlit run image_viewer.py
+```
+
+**Features:**
+- Channel-based organization
+- Navigation controls (previous/next)
+- Image metadata display
+- Responsive layout
+
+## 📋 Data Formats
+
+### Discord Chat Logs (CSV)
+```csv
+message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds,content_embedding
+1234567890,2025-08-11 12:34:56,9876543210,username,nickname,"Hello world!","https://cdn.discordapp.com/...",{},"[0.123, -0.456, ...]"
+```
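
Reading a log in this format takes only the standard library. The sketch below assumes the `content_embedding` cell holds a JSON-style list of floats; the three-element vector is a shortened, made-up stand-in for a real embedding, and `read_chat_log` is a hypothetical helper.

```python
import csv
import io
import json

SAMPLE_CSV = '''message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds,content_embedding
1234567890,2025-08-11 12:34:56,9876543210,username,nickname,"Hello world!","",{},"[0.123, -0.456, 0.789]"
'''


def read_chat_log(handle):
    """Yield one dict per message, with the embedding decoded to a list of floats."""
    for row in csv.DictReader(handle):
        raw = row.get("content_embedding") or ""
        # Rows that were never embedded get None instead of a vector.
        row["content_embedding"] = json.loads(raw) if raw.strip() else None
        yield row


rows = list(read_chat_log(io.StringIO(SAMPLE_CSV)))
print(rows[0]["author_name"], len(rows[0]["content_embedding"]))
```

For a real file, pass `open(path, newline="", encoding="utf-8")` instead of the `StringIO` object. IDs are deliberately left as strings so that long Discord IDs are never rounded by a float conversion.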
+
+### Image Dataset (JSON)
+```json
+{
+  "metadata": {
+    "created_at": "2025-08-11 12:34:56 UTC",
+    "summary": {
+      "total_images": 42,
+      "channels": ["memes", "general"],
+      "total_size_bytes": 1234567,
+      "file_extensions": [".png", ".jpg"],
+      "authors": ["user1", "user2"]
+    }
+  },
+  "images": [
+    {
+      "url": "https://cdn.discordapp.com/attachments/...",
+      "channel": "memes",
+      "author_name": "username",
+      "timestamp_utc": "2025-08-11 12:34:56+00:00",
+      "content": "Message text",
+      "file_extension": ".png",
+      "file_size": 54321,
+      "base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
+    }
+  ]
+}
+```
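
A file with this layout can be grouped by channel and decoded back to image bytes as sketched below. The helper names are assumptions, and the two records use tiny placeholder payloads rather than real image data.

```python
import base64
import json
from collections import defaultdict


def group_images_by_channel(dataset):
    """Map each channel name to the list of its image records."""
    by_channel = defaultdict(list)
    for image in dataset.get("images", []):
        by_channel[image.get("channel", "unknown")].append(image)
    return dict(by_channel)


def image_bytes(image):
    """Decode the base64 payload of one record back into raw bytes."""
    return base64.b64decode(image["base64_data"])


# Tiny in-memory stand-in for images_dataset/images_dataset.json.
dataset = json.loads("""
{"metadata": {}, "images": [
  {"channel": "memes", "file_extension": ".png", "base64_data": "aGVsbG8="},
  {"channel": "general", "file_extension": ".jpg", "base64_data": "d29ybGQ="}
]}
""")
groups = group_images_by_channel(dataset)
```

With the real dataset, replace the inline string with `json.load` on the opened file. Because every image is stored inline as base64, the whole file is read into memory at once, which is worth keeping in mind for large servers.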
+
+## 🔧 Configuration
+
+### Discord Bot Setup
+1. Create a Discord application at https://discord.com/developers/applications
+2. Create a bot and copy the token
+3. Enable the following intents:
+   - Message Content Intent
+   - Server Members Intent (optional)
+4. Invite bot to your server with appropriate permissions
+
+### Environment Variables
+```python
+# Set in scripts/bot.py (a Python constant, not a shell variable)
+BOT_TOKEN = "your_discord_bot_token_here"
+```
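
Hardcoding the token in `scripts/bot.py` makes it easy to commit a secret by accident. One alternative, sketched here, reads it from the process environment. The variable name `DISCORD_BOT_TOKEN` and the helper `load_bot_token` are assumptions for illustration, not names the repository defines.

```python
import os


def load_bot_token(env=None, var_name="DISCORD_BOT_TOKEN"):
    """Return the bot token from the environment, or raise if it is missing.

    `var_name` is a hypothetical variable name; use whatever your setup defines.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable before starting the bot")
    return token
```

In `bot.py` this would replace the literal with `BOT_TOKEN = load_bot_token()`, after setting the variable in the shell that launches the bot.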
+
+### Embedding Models
+The system uses sentence-transformers models. Default: `all-MiniLM-L6-v2`
+
+Supported models:
+- `all-MiniLM-L6-v2` (lightweight, fast)
+- `all-mpnet-base-v2` (higher quality)
+- `all-roberta-large-v1` (largest of the three, slowest)
+
+## 📊 Visualization Features
+
+### Chat Message Clustering
+- **Dimensionality Reduction**: PCA, t-SNE, UMAP
+- **Clustering Algorithms**: DBSCAN, HDBSCAN with automatic parameter tuning
+- **Interactive Controls**: Filter by source files, authors, and clusters
+- **Hover Information**: View message content, author, timestamp on hover
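
The app's `clustering.py` is not shown here and presumably relies on library implementations (the dependency list includes `scikit-learn` and `hdbscan`). For intuition only, this is a dependency-free sketch of the DBSCAN algorithm on 2D points; the function name, the sample points, and the parameter values are illustrative assumptions.

```python
import math


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # Every point within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: reachable but not dense itself
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_samples:
                queue.extend(reach)  # j is a core point, so keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
```

The two tight groups become clusters 0 and 1, and the isolated point at `(50, 50)` is labeled `-1`. The neighbor search here is quadratic in the number of points, which is one reason to prefer the library versions on real chat logs.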
+
+### Image Analysis
+- **Channel Organization**: Browse images by Discord channel
+- **Metadata Display**: Author, timestamp, message context
+- **Navigation**: Previous/next controls with slider
+- **Format Support**: PNG, JPG, GIF, WebP, BMP, TIFF
+
+## 🛠️ Dependencies
+
+### Core Scripts
+- `discord.py` - Discord bot framework
+- `pandas` - Data manipulation
+- `sentence-transformers` - Text embeddings
+- `requests` - HTTP requests for image downloads
+
+### Visualization Apps
+- `streamlit` - Web interface framework
+- `plotly` - Interactive plotting
+- `scikit-learn` - Machine learning algorithms
+- `numpy` - Numerical computations
+- `umap-learn` - Dimensionality reduction
+- `hdbscan` - Density-based clustering
+
+## 📈 Use Cases
+
+### Research & Analytics
+- **Community Analysis**: Understand conversation patterns and topics
+- **Sentiment Analysis**: Track mood and sentiment over time
+- **User Behavior**: Analyze posting patterns and engagement
+- **Content Moderation**: Identify problematic content clusters
+
+### Data Science Projects
+- **NLP Research**: Experiment with text embeddings and clustering
+- **Social Network Analysis**: Study communication patterns
+- **Visualization Techniques**: Explore dimensionality reduction methods
+- **Image Processing**: Analyze visual content sharing patterns
+
+### Content Management
+- **Archive Creation**: Preserve Discord community history
+- **Content Discovery**: Find similar messages and discussions
+- **Moderation Tools**: Identify spam or inappropriate content
+- **Backup Solutions**: Create comprehensive data backups
+
+## 🔒 Privacy & Ethics
+
+- **Data Protection**: All processing happens locally
+- **User Consent**: Ensure proper permissions before scraping
+- **Compliance**: Follow Discord's Terms of Service
+- **Anonymization**: Consider removing or hashing user IDs for research
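
One way to hash user IDs, sketched under the assumption that a secret key is kept outside the shared dataset, is a keyed SHA-256 digest. `pseudonymize_id` and the 16-character truncation are illustrative choices, not part of the toolkit.

```python
import hashlib
import hmac


def pseudonymize_id(user_id, secret_key):
    """Return a stable, keyed SHA-256 pseudonym for a Discord user ID."""
    digest = hmac.new(secret_key, str(user_id).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # 16 hex characters, shortened for readability


key = b"replace-with-a-random-secret"  # placeholder; generate and store your own
alias = pseudonymize_id("9876543210", key)
```

The same ID always maps to the same alias, so per-author analysis still works. A keyed hash is used rather than a plain one because anyone holding a list of candidate IDs could otherwise hash them and match the results.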
+
+## 🤝 Contributing
+
+1. Fork the repository
+2. Create a feature branch
+3. Make your changes
+4. Test thoroughly
+5. Submit a pull request
+
+## 📄 License
+
+This project is intended for educational and research purposes. Please ensure compliance with Discord's Terms of Service and applicable privacy laws when using this toolkit.
+
+## 🆘 Troubleshooting
+
+### Common Issues
+
+**Bot can't read messages:**
+- Ensure Message Content Intent is enabled
+- Check bot permissions in Discord server
+- Verify bot token is correct
+
+**Embeddings not generating:**
+- Install sentence-transformers: `pip install sentence-transformers`
+- Check available GPU memory for large models
+- Try a smaller model like `all-MiniLM-L6-v2`
+
+**Images not downloading:**
+- Check internet connectivity
+- Verify Discord CDN URLs are accessible
+- Increase retry limits for unreliable connections
+
+**Visualization not loading:**
+- Ensure all requirements are installed
+- Check that CSV files have embeddings
+- Try reducing dataset size for better performance
+
+## 📚 Additional Resources
+
+- [discord.py Documentation](https://discordpy.readthedocs.io/)
+- [Sentence Transformers Models](https://www.sbert.net/docs/pretrained_models.html)
+- [Streamlit Documentation](https://docs.streamlit.io/)
+- [scikit-learn Clustering](https://scikit-learn.org/stable/modules/clustering.html)