updated readme

2025-08-11 03:07:44 +01:00
parent 2b8659fc95
commit fd9b25f256
1 changed files with 280 additions and 1 deletions
--- a/README.md
+++ b/README.md
@@ -1,2 +1,281 @@
-# cult-scraper
+# Discord Data Analysis & Visualization Suite
 A comprehensive toolkit for scraping, processing, and analyzing Discord chat data with advanced visualization capabilities.
 ## 🌟 Features
 ### 📥 Data Collection
 - **Discord Bot Scraper**: Automated extraction of complete message history from Discord servers
 - **Image Downloader**: Downloads and processes images from Discord attachments with base64 conversion
 - **Text Embeddings**: Generate semantic embeddings for chat messages using sentence transformers
 ### 📊 Visualization & Analysis
 - **Interactive Chat Visualizer**: 2D visualization of chat messages using dimensionality reduction (PCA, t-SNE, UMAP)
 - **Clustering Analysis**: Automated grouping of similar messages with DBSCAN and HDBSCAN
 - **Image Dataset Viewer**: Browse and explore downloaded images by channel
 ### 🔧 Data Processing
 - **Batch Processing**: Process multiple CSV files with embeddings
 - **Metadata Extraction**: Comprehensive message metadata including timestamps, authors, and content
 - **Data Filtering**: Advanced filtering by authors, channels, and timeframes
 ## 📁 Repository Structure
 ```
 cult-scraper-1/
 ├── scripts/                          # Core data collection scripts
 │   ├── bot.py                        # Discord bot for message scraping
 │   ├── image_downloader.py           # Download and convert Discord images
 │   ├── embedder.py                   # Batch text embedding processor
 │   └── embed_class.py                # Text embedding utilities
 ├── apps/                             # Interactive applications
 │   ├── cluster_map/                  # Chat message clustering & visualization
 │   │   ├── main.py                   # Main Streamlit application
 │   │   ├── data_loader.py            # Data loading utilities
 │   │   ├── clustering.py             # Clustering algorithms
 │   │   ├── visualization.py          # Plotting and visualization
 │   │   └── requirements.txt          # Dependencies
 │   └── image_viewer/                 # Image dataset browser
 │       ├── image_viewer.py           # Streamlit image viewer
 │       └── requirements.txt          # Dependencies
 ├── discord_chat_logs/                # Exported CSV files from Discord
 └── images_dataset/                   # Downloaded images and metadata
    └── images_dataset.json           # Image dataset with base64 data
 ```
 ## 🚀 Quick Start
 ### 1. Discord Data Scraping
 First, set up and run the Discord bot to collect message data:
 ```bash
 cd scripts
 # Configure your bot token in bot.py
 python bot.py
 ```
 **Requirements:**
 - Discord bot token with message content intent enabled
 - Bot must have read permissions in target channels
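As a rough illustration of what the scraping step involves, the sketch below flattens a discord.py message into the CSV columns listed under Data Formats and walks one channel's history. It is not the code in `bot.py`: the names `message_to_row` and `run_scraper`, the `;` separator for attachment URLs, and the way embeds are serialized are all assumptions made for the example.

```python
import csv

# Column names taken from the CSV format shown in this README
# (content_embedding is added later by the embedding step).
FIELDS = ["message_id", "timestamp_utc", "author_id", "author_name",
          "author_nickname", "content", "attachment_urls", "embeds"]


def message_to_row(message):
    """Flatten a discord.py Message (or any object with the same attributes)."""
    author = message.author
    return {
        "message_id": message.id,
        "timestamp_utc": message.created_at.strftime("%Y-%m-%d %H:%M:%S"),
        "author_id": author.id,
        "author_name": author.name,
        # Only guild members carry a nickname; plain users do not.
        "author_nickname": getattr(author, "nick", None) or "",
        "content": message.content,
        # Assumption: several URLs are joined with ";" in one cell.
        "attachment_urls": ";".join(a.url for a in message.attachments),
        # Assumption: embeds are stored as the string form of their dicts.
        "embeds": str([e.to_dict() for e in message.embeds]),
    }


def run_scraper(token, channel_id, out_path):
    """Hypothetical entry point: dump one channel's history to a CSV file."""
    import discord  # imported lazily so message_to_row works without it

    intents = discord.Intents.default()
    intents.message_content = True  # must also be enabled in the developer portal
    client = discord.Client(intents=intents)

    @client.event
    async def on_ready():
        channel = client.get_channel(channel_id)
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            writer.writeheader()
            # limit=None walks the full history, oldest message first.
            async for message in channel.history(limit=None, oldest_first=True):
                writer.writerow(message_to_row(message))
        await client.close()

    client.run(token)
```

Keeping the row conversion separate from the Discord client makes it possible to check the CSV layout with stand-in objects before connecting a real bot.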
 ### 2. Generate Text Embeddings
 Process the collected chat data to add semantic embeddings:
 ```bash
 cd scripts
 python embedder.py
 ```
 This will:
 - Process all CSV files in `discord_chat_logs/`
 - Add embeddings to message content using sentence transformers
 - Save updated files with embedding vectors
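The steps above could look roughly like the following sketch. It is not the contents of `embedder.py`: `load_encoder` and `embed_csv` are hypothetical names, and storing each vector as a JSON list in a `content_embedding` column is an assumption based on the sample row under Data Formats. The `SentenceTransformer(...).encode(...)` call is the standard sentence-transformers API.

```python
import csv
import json


def load_encoder(model_name="all-MiniLM-L6-v2"):
    """Return a function mapping a list of strings to a list of float vectors."""
    from sentence_transformers import SentenceTransformer  # lazy import

    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(texts).tolist()


def embed_csv(in_path, out_path, encode, text_column="content",
              embedding_column="content_embedding"):
    """Copy a chat-log CSV, adding one JSON-encoded vector per row."""
    with open(in_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        fieldnames = list(reader.fieldnames or [])
    if embedding_column not in fieldnames:
        fieldnames.append(embedding_column)

    # Encode all messages in one batch; empty cells become empty strings.
    vectors = encode([row[text_column] or "" for row in rows]) if rows else []
    for row, vector in zip(rows, vectors):
        row[embedding_column] = json.dumps(vector)

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

With sentence-transformers installed, a call such as `embed_csv(src, dst, load_encoder())` would process one file; passing the encoder in as an argument also lets the CSV handling be exercised with a stub.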
 ### 3. Download Images
 Extract and download images from Discord attachments:
 ```bash
 cd scripts
 python image_downloader.py
 ```
 **Features:**
 - Downloads images from attachment URLs
 - Converts to base64 for storage
 - Handles multiple image formats (PNG, JPG, GIF, WebP, etc.)
 - Implements retry logic and rate limiting
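One way to picture the retry logic and rate limiting is the sketch below. It uses the standard library's `urllib` instead of `requests` so that it runs without extra packages, and the function names, retry count, backoff factor, and pause between requests are all illustrative assumptions rather than the values used in `image_downloader.py`.

```python
import base64
import time
import urllib.request


def fetch_bytes(url, timeout=30):
    """Download a URL with the standard library and return the raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_as_base64(url, fetch=fetch_bytes, retries=3, backoff=1.0,
                       sleep=time.sleep):
    """Call fetch(url) up to `retries` times, doubling the wait after each failure."""
    delay = backoff
    for attempt in range(1, retries + 1):
        try:
            return base64.b64encode(fetch(url)).decode("ascii")
        except OSError:  # urllib's URLError and socket timeouts subclass OSError
            if attempt == retries:
                raise
            sleep(delay)
            delay *= 2


def download_all(urls, min_interval=0.5, sleep=time.sleep, **kwargs):
    """Download URLs one by one, pausing between requests as a crude rate limit."""
    results = {}
    for i, url in enumerate(urls):
        if i:
            sleep(min_interval)
        results[url] = download_as_base64(url, sleep=sleep, **kwargs)
    return results
```

The `fetch` and `sleep` parameters exist so the retry behavior can be tried out with a stand-in downloader and without real waiting.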
 ### 4. Visualize Chat Data
 Launch the interactive chat visualization tool:
 ```bash
 cd apps/cluster_map
 pip install -r requirements.txt
 streamlit run main.py
 ```
 **Capabilities:**
 - 2D visualization using PCA, t-SNE, or UMAP
 - Interactive clustering with DBSCAN/HDBSCAN
 - Filter by channels, authors, and time periods
 - Hover to see message content and metadata
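The core of this view, reducing embeddings to two dimensions and grouping nearby points, can be sketched with scikit-learn as follows. This is only an illustration under assumptions: `project_and_cluster` is a hypothetical helper, the `eps` and `min_samples` defaults are arbitrary, and the app itself may cluster in the original embedding space or use different settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA


def project_and_cluster(embeddings, eps=0.5, min_samples=5):
    """Reduce embeddings to 2D with PCA, then cluster the 2D points with DBSCAN."""
    vectors = np.asarray(embeddings, dtype=float)
    coords = PCA(n_components=2).fit_transform(vectors)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
    return coords, labels  # label -1 marks points DBSCAN treats as noise
```

The returned `coords` give the x/y positions for a scatter plot and `labels` can drive the point colors. Suitable `eps` values depend heavily on the scale of the projected data.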
 ### 5. Browse Image Dataset
 View downloaded images in an organized interface:
 ```bash
 cd apps/image_viewer
 pip install -r requirements.txt
 streamlit run image_viewer.py
 ```
 **Features:**
 - Channel-based organization
 - Navigation controls (previous/next)
 - Image metadata display
 - Responsive layout
 ## 📋 Data Formats
 ### Discord Chat Logs (CSV)
 ```csv
 message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds,content_embedding
 1234567890,2025-08-11 12:34:56,9876543210,username,nickname,"Hello world!","https://cdn.discordapp.com/...",{},"[0.123, -0.456, ...]"
 ```
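To read a file in this layout back into memory, something like the sketch below should work. It assumes the `content_embedding` cell holds a JSON-style list of numbers, as the sample row suggests; `load_embeddings` is a hypothetical helper, not part of the repository.

```python
import csv
import json


def load_embeddings(path, column="content_embedding"):
    """Return (rows, vectors) for rows whose embedding cell parses as a JSON list."""
    rows, vectors = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            try:
                vector = json.loads(row.get(column) or "")
            except json.JSONDecodeError:
                continue  # skip rows with a missing or malformed embedding
            if isinstance(vector, list):
                rows.append(row)
                vectors.append([float(x) for x in vector])
    return rows, vectors
```

Rows without a usable embedding are skipped rather than raising, so a partially processed log can still be loaded.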
 ### Image Dataset (JSON)
 ```json
 {
  "metadata": {
    "created_at": "2025-08-11 12:34:56 UTC",
    "summary": {
      "total_images": 42,
      "channels": ["memes", "general"],
      "total_size_bytes": 1234567,
      "file_extensions": [".png", ".jpg"],
      "authors": ["user1", "user2"]
    }
  },
  "images": [
    {
      "url": "https://cdn.discordapp.com/attachments/...",
      "channel": "memes",
      "author_name": "username",
      "timestamp_utc": "2025-08-11 12:34:56+00:00",
      "content": "Message text",
      "file_extension": ".png",
      "file_size": 54321,
      "base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
    }
  ]
 }
 ```
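A dataset in this shape can be explored with a few lines of standard-library Python. The helpers below are a sketch: `images_by_channel` and `decode_image` are hypothetical names, and the `"unknown"` fallback for records without a channel is an assumption.

```python
import base64
import json
from collections import defaultdict


def load_dataset(path):
    """Parse an images_dataset.json file into a dictionary."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def images_by_channel(dataset):
    """Group image records by their channel name."""
    grouped = defaultdict(list)
    for image in dataset.get("images", []):
        grouped[image.get("channel", "unknown")].append(image)
    return dict(grouped)


def decode_image(image):
    """Return the raw image bytes stored in a record's base64_data field."""
    return base64.b64decode(image["base64_data"])
```

The decoded bytes can be written to a file named with the record's `file_extension`, or handed to an image library for display.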
 ## 🔧 Configuration
 ### Discord Bot Setup
 1. Create a Discord application at https://discord.com/developers/applications
 2. Create a bot and copy the token
 3. Enable the following intents:
   - Message Content Intent
   - Server Members Intent (optional)
 4. Invite bot to your server with appropriate permissions
 ### Environment Variables
 ```python
 # Set in scripts/bot.py
 BOT_TOKEN = "your_discord_bot_token_here"
 ```
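If you would rather not keep the token in source code, one option is to read it from an actual environment variable. The sketch below assumes a variable named `DISCORD_BOT_TOKEN`, which is a hypothetical name and not something `bot.py` is known to read.

```python
import os


def get_bot_token(env_var="DISCORD_BOT_TOKEN"):
    """Read the bot token from the environment and fail early if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your bot token")
    return token
```

In `bot.py`, `BOT_TOKEN = get_bot_token()` could then replace the hard-coded string, which keeps the token out of version control.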
 ### Embedding Models
 The system uses sentence-transformers models. Default: `all-MiniLM-L6-v2`
 Supported models:
 - `all-MiniLM-L6-v2` (lightweight, fast)
 - `all-mpnet-base-v2` (higher quality)
 - `sentence-transformers/all-roberta-large-v1` (largest of the three, slowest)
 ## 📊 Visualization Features
 ### Chat Message Clustering
 - **Dimensionality Reduction**: PCA, t-SNE, UMAP
 - **Clustering Algorithms**: DBSCAN, HDBSCAN with automatic parameter tuning
 - **Interactive Controls**: Filter by source files, authors, and clusters
 - **Hover Information**: View message content, author, timestamp on hover
 ### Image Analysis
 - **Channel Organization**: Browse images by Discord channel
 - **Metadata Display**: Author, timestamp, message context
 - **Navigation**: Previous/next controls with slider
 - **Format Support**: PNG, JPG, GIF, WebP, BMP, TIFF
 ## 🛠️ Dependencies
 ### Core Scripts
 - `discord.py` - Discord bot framework
 - `pandas` - Data manipulation
 - `sentence-transformers` - Text embeddings
 - `requests` - HTTP requests for image downloads
 ### Visualization Apps
 - `streamlit` - Web interface framework
 - `plotly` - Interactive plotting
 - `scikit-learn` - Machine learning algorithms
 - `numpy` - Numerical computations
 - `umap-learn` - Dimensionality reduction
 - `hdbscan` - Density-based clustering
 ## 📈 Use Cases
 ### Research & Analytics
 - **Community Analysis**: Understand conversation patterns and topics
 - **Sentiment Analysis**: Track mood and sentiment over time
 - **User Behavior**: Analyze posting patterns and engagement
 - **Content Moderation**: Identify problematic content clusters
 ### Data Science Projects
 - **NLP Research**: Experiment with text embeddings and clustering
 - **Social Network Analysis**: Study communication patterns
 - **Visualization Techniques**: Explore dimensionality reduction methods
 - **Image Processing**: Analyze visual content sharing patterns
 ### Content Management
 - **Archive Creation**: Preserve Discord community history
 - **Content Discovery**: Find similar messages and discussions
 - **Moderation Tools**: Identify spam or inappropriate content
 - **Backup Solutions**: Create comprehensive data backups
 ## 🔒 Privacy & Ethics
 - **Data Protection**: All processing happens locally
 - **User Consent**: Ensure proper permissions before scraping
 - **Compliance**: Follow Discord's Terms of Service
 - **Anonymization**: Consider removing or hashing user IDs for research
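For the anonymization point, one possible approach is to replace user IDs with keyed hashes and drop name columns before sharing data. The sketch below is an assumption about how that could be done, not something the toolkit does for you: the helper names, the 16-character truncation, and the choice of HMAC-SHA256 are illustrative, and the field names come from the CSV format above.

```python
import hashlib
import hmac


def pseudonymize_id(user_id, secret_key):
    """Map a user ID to a stable pseudonym using keyed HMAC-SHA256."""
    digest = hmac.new(secret_key, str(user_id).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def anonymize_rows(rows, secret_key, id_field="author_id",
                   drop_fields=("author_name", "author_nickname")):
    """Return copies of the rows with names removed and IDs pseudonymized."""
    cleaned = []
    for row in rows:
        row = {k: v for k, v in row.items() if k not in drop_fields}
        row[id_field] = pseudonymize_id(row[id_field], secret_key)
        cleaned.append(row)
    return cleaned
```

A keyed hash is used because Discord IDs are numeric and guessable, so a plain unsalted hash could be reversed by hashing candidate IDs. The key should be kept private and out of the published dataset. Message content can still identify people, so this covers only the ID and name columns.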
 ## 🤝 Contributing
 1. Fork the repository
 2. Create a feature branch
 3. Make your changes
 4. Test thoroughly
 5. Submit a pull request
 ## 📄 License
 This project is intended for educational and research purposes. Please ensure compliance with Discord's Terms of Service and applicable privacy laws when using this toolkit.
 ## 🆘 Troubleshooting
 ### Common Issues
 **Bot can't read messages:**
 - Ensure Message Content Intent is enabled
 - Check bot permissions in Discord server
 - Verify bot token is correct
 **Embeddings not generating:**
 - Install sentence-transformers: `pip install sentence-transformers`
 - Check available GPU memory for large models
 - Try a smaller model like `all-MiniLM-L6-v2`
 **Images not downloading:**
 - Check internet connectivity
 - Verify Discord CDN URLs are accessible
 - Increase retry limits for unreliable connections
 **Visualization not loading:**
 - Ensure all requirements are installed
 - Check that CSV files have embeddings
 - Try reducing dataset size for better performance
 ## 📚 Additional Resources
 - [Discord.py Documentation](https://discordpy.readthedocs.io/)
 - [Sentence Transformers Models](https://www.sbert.net/docs/pretrained_models.html)
 - [Streamlit Documentation](https://docs.streamlit.io/)
 - [scikit-learn Clustering](https://scikit-learn.org/stable/modules/clustering.html)