cult-scraper/README.md
2025-08-11 03:07:44 +01:00

Discord Data Analysis & Visualization Suite

A toolkit for scraping, processing, and analyzing Discord chat data, with Streamlit apps for exploring message embeddings and browsing downloaded images.

🌟 Features

📥 Data Collection

  • Discord Bot Scraper: Automated extraction of complete message history from Discord servers
  • Image Downloader: Downloads and processes images from Discord attachments with base64 conversion
  • Text Embeddings: Generate semantic embeddings for chat messages using sentence transformers

📊 Visualization & Analysis

  • Interactive Chat Visualizer: 2D visualization of chat messages using dimensionality reduction (PCA, t-SNE, UMAP)
  • Clustering Analysis: Automated grouping of similar messages with DBSCAN and HDBSCAN
  • Image Dataset Viewer: Browse and explore downloaded images by channel

🔧 Data Processing

  • Batch Processing: Process multiple CSV files with embeddings
  • Metadata Extraction: Comprehensive message metadata including timestamps, authors, and content
  • Data Filtering: Advanced filtering by authors, channels, and timeframes

📁 Repository Structure

cult-scraper-1/
├── scripts/                          # Core data collection scripts
│   ├── bot.py                        # Discord bot for message scraping
│   ├── image_downloader.py           # Download and convert Discord images
│   ├── embedder.py                   # Batch text embedding processor
│   └── embed_class.py                # Text embedding utilities
├── apps/                             # Interactive applications
│   ├── cluster_map/                  # Chat message clustering & visualization
│   │   ├── main.py                   # Main Streamlit application
│   │   ├── data_loader.py            # Data loading utilities
│   │   ├── clustering.py             # Clustering algorithms
│   │   ├── visualization.py          # Plotting and visualization
│   │   └── requirements.txt          # Dependencies
│   └── image_viewer/                 # Image dataset browser
│       ├── image_viewer.py           # Streamlit image viewer
│       └── requirements.txt          # Dependencies
├── discord_chat_logs/                # Exported CSV files from Discord
└── images_dataset/                   # Downloaded images and metadata
    └── images_dataset.json           # Image dataset with base64 data

🚀 Quick Start

1. Discord Data Scraping

First, set up and run the Discord bot to collect message data:

cd scripts
# Configure your bot token in bot.py
python bot.py

Requirements:

  • Discord bot token with message content intent enabled
  • Bot must have read permissions in target channels
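
The scraping step can be sketched roughly as follows. This is a minimal illustration built on discord.py, not the contents of bot.py: the names `message_to_row` and `scrape_channel`, the `;` separator for multiple attachment URLs, and the JSON encoding of embeds are assumptions made for the example. The column order follows the CSV format documented later in this README.

```python
import csv
import json

# Columns from the "Discord Chat Logs (CSV)" format in this README,
# minus content_embedding, which is added by the embedding step.
COLUMNS = ["message_id", "timestamp_utc", "author_id", "author_name",
           "author_nickname", "content", "attachment_urls", "embeds"]


def message_to_row(message):
    """Flatten one discord.py Message-like object into a CSV row dict."""
    author = message.author
    return {
        "message_id": message.id,
        "timestamp_utc": message.created_at.strftime("%Y-%m-%d %H:%M:%S"),
        "author_id": author.id,
        "author_name": author.name,
        # Only guild Members carry .nick; plain Users do not.
        "author_nickname": getattr(author, "nick", None) or "",
        "content": message.content,
        # Assumption: several attachment URLs are joined with ";".
        "attachment_urls": ";".join(a.url for a in message.attachments),
        # Assumption: embeds are stored as a JSON list of dicts.
        "embeds": json.dumps([e.to_dict() for e in message.embeds]),
    }


def scrape_channel(token, channel_id, out_path):
    """Dump one channel's full history to CSV (hypothetical helper)."""
    import discord  # lazy import so message_to_row works without discord.py

    intents = discord.Intents.default()
    intents.message_content = True  # also enable it in the developer portal
    client = discord.Client(intents=intents)

    @client.event
    async def on_ready():
        channel = client.get_channel(channel_id)
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=COLUMNS)
            writer.writeheader()
            # limit=None walks the whole history, oldest message first.
            async for message in channel.history(limit=None, oldest_first=True):
                writer.writerow(message_to_row(message))
        await client.close()

    client.run(token)
```

Keeping the row conversion separate from the network code makes it possible to check the CSV layout without connecting to Discord.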

2. Generate Text Embeddings

Process the collected chat data to add semantic embeddings:

cd scripts
python embedder.py

This will:

  • Process all CSV files in discord_chat_logs/
  • Add embeddings to message content using sentence transformers
  • Save updated files with embedding vectors
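
As a rough sketch of what this step does (not the actual embedder.py), the functions below add a JSON-encoded vector to every row of each CSV. The names `embed_csv` and `embed_directory` are hypothetical, and the encoder is passed in as a plain callable so the file handling can be tried without loading a model.

```python
import csv
import json
from pathlib import Path


def embed_csv(in_path, out_path, encode, text_column="content",
              out_column="content_embedding"):
    """Copy a chat-log CSV, adding one JSON-encoded embedding per row.

    `encode` is any callable mapping a list of strings to a sequence of
    vectors, for example the encode method of a SentenceTransformer.
    """
    with open(in_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        fieldnames = list(reader.fieldnames or [])
    if out_column not in fieldnames:
        fieldnames.append(out_column)

    # Empty or missing message text is embedded as the empty string.
    texts = [row.get(text_column) or "" for row in rows]
    vectors = encode(texts) if texts else []
    for row, vector in zip(rows, vectors):
        row[out_column] = json.dumps([float(x) for x in vector])

    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


def embed_directory(log_dir, encode):
    """Embed every CSV in `log_dir` in place; returns row counts per file."""
    return {path.name: embed_csv(path, path, encode)
            for path in sorted(Path(log_dir).glob("*.csv"))}


# Possible usage with the default model named in this README:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   embed_directory("discord_chat_logs", model.encode)
```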

3. Download Images

Extract and download images from Discord attachments:

cd scripts
python image_downloader.py

Features:

  • Downloads images from attachment URLs
  • Converts to base64 for storage
  • Handles multiple image formats (PNG, JPG, GIF, WebP, etc.)
  • Implements retry logic and rate limiting
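
The retry and base64 parts can be illustrated with a short standard-library sketch. Note the substitution: image_downloader.py is listed as using requests, while this example uses urllib so it runs without extra packages. The function names, the retry count, and the exponential backoff schedule are assumptions, not the script's actual settings.

```python
import base64
import time
import urllib.request


def fetch_bytes(url, timeout=30):
    """Default fetcher: a plain HTTP GET using the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_as_base64(url, fetch=fetch_bytes, retries=3, backoff=1.0,
                       sleep=time.sleep):
    """Return the file at `url` as a base64 string, or None if all attempts fail."""
    for attempt in range(retries):
        try:
            data = fetch(url)
            return base64.b64encode(data).decode("ascii")
        except OSError:
            # URLError, HTTPError, timeouts and connection resets are all
            # OSError subclasses. Wait 1x, 2x, 4x ... backoff between tries.
            if attempt < retries - 1:
                sleep(backoff * 2 ** attempt)
    return None
```

Passing `fetch` and `sleep` as parameters keeps the retry behaviour easy to exercise with a fake fetcher.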

4. Visualize Chat Data

Launch the interactive chat visualization tool:

cd apps/cluster_map
pip install -r requirements.txt
streamlit run main.py

Capabilities:

  • 2D visualization using PCA, t-SNE, or UMAP
  • Interactive clustering with DBSCAN/HDBSCAN
  • Filter by channels, authors, and time periods
  • Hover to see message content and metadata
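
The author and time filters amount to a simple row selection. The sketch below is illustrative only: `filter_messages` is a hypothetical name, and it assumes rows are dicts keyed by the CSV columns documented in this README, with timestamps in the format shown in the CSV example.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # matches the CSV example in this README


def filter_messages(rows, authors=None, start=None, end=None):
    """Keep rows by the given authors whose timestamp lies in [start, end]."""
    kept = []
    for row in rows:
        if authors is not None and row["author_name"] not in authors:
            continue
        sent_at = datetime.strptime(row["timestamp_utc"], TIMESTAMP_FORMAT)
        if start is not None and sent_at < start:
            continue
        if end is not None and sent_at > end:
            continue
        kept.append(row)
    return kept
```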

5. Browse Image Dataset

View downloaded images in an organized interface:

cd apps/image_viewer
pip install -r requirements.txt
streamlit run image_viewer.py

Features:

  • Channel-based organization
  • Navigation controls (previous/next)
  • Image metadata display
  • Responsive layout

📋 Data Formats

Discord Chat Logs (CSV)

message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds,content_embedding
1234567890,2025-08-11 12:34:56,9876543210,username,nickname,"Hello world!","https://cdn.discordapp.com/...",{},"[0.123, -0.456, ...]"
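
A file in this format can be read back with the standard library alone. The sketch below assumes the content_embedding column holds a JSON list of floats (the "..." in the example row stands for the remaining values and is not valid JSON itself); `load_chat_log` is a hypothetical helper name.

```python
import csv
import json


def load_chat_log(file_obj):
    """Read chat-log rows, decoding content_embedding into a list of floats.

    Rows without an embedding get None so callers can skip or report them.
    """
    messages = []
    for row in csv.DictReader(file_obj):
        raw = row.get("content_embedding") or ""
        row["content_embedding"] = json.loads(raw) if raw else None
        messages.append(row)
    return messages
```

Counting the rows where `content_embedding` is None is a quick way to see whether the embedding step has been run on a file.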

Image Dataset (JSON)

{
  "metadata": {
    "created_at": "2025-08-11 12:34:56 UTC",
    "summary": {
      "total_images": 42,
      "channels": ["memes", "general"],
      "total_size_bytes": 1234567,
      "file_extensions": [".png", ".jpg"],
      "authors": ["user1", "user2"]
    }
  },
  "images": [
    {
      "url": "https://cdn.discordapp.com/attachments/...",
      "channel": "memes",
      "author_name": "username",
      "timestamp_utc": "2025-08-11 12:34:56+00:00",
      "content": "Message text",
      "file_extension": ".png",
      "file_size": 54321,
      "base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
    }
  ]
}
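
Given that structure, grouping images by channel and recovering the raw bytes takes only a few lines. This is a sketch against the JSON layout shown above; the helper names are hypothetical and it is not the code used by image_viewer.py.

```python
import base64
import json
from collections import defaultdict


def load_dataset(path):
    """Load images_dataset.json into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def images_by_channel(dataset):
    """Group the image entries by their "channel" field."""
    grouped = defaultdict(list)
    for image in dataset["images"]:
        grouped[image["channel"]].append(image)
    return dict(grouped)


def decode_image(image):
    """Turn one entry's base64_data back into the original file bytes."""
    return base64.b64decode(image["base64_data"])
```

The decoded bytes can be written to disk with the entry's file_extension or handed to an image library.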

🔧 Configuration

Discord Bot Setup

  1. Create a Discord application at https://discord.com/developers/applications
  2. Create a bot and copy the token
  3. Enable the following intents:
    • Message Content Intent
    • Server Members Intent (optional)
  4. Invite bot to your server with appropriate permissions

Environment Variables

# Set directly in scripts/bot.py (a constant in the file, not a shell environment variable)
BOT_TOKEN = "your_discord_bot_token_here"
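
To keep the token out of version control, one option is to read it from the process environment instead of editing the file. This is a suggestion rather than how bot.py currently works, and the variable name `DISCORD_BOT_TOKEN` is an assumption chosen for the example.

```python
import os


def load_bot_token(env=os.environ, name="DISCORD_BOT_TOKEN"):
    """Return the bot token from the environment, failing loudly if unset."""
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {name} environment variable before starting the bot"
        )
    return token
```

In bot.py this could replace the hard-coded value with `BOT_TOKEN = load_bot_token()`.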

Embedding Models

The system uses sentence-transformers models. Default: all-MiniLM-L6-v2

Supported models:

  • all-MiniLM-L6-v2 (lightweight, fast)
  • all-mpnet-base-v2 (higher quality)
  • sentence-transformers/all-roberta-large-v1 (best quality, slower)

📊 Visualization Features

Chat Message Clustering

  • Dimensionality Reduction: PCA, t-SNE, UMAP
  • Clustering Algorithms: DBSCAN, HDBSCAN with automatic parameter tuning
  • Interactive Controls: Filter by source files, authors, and clusters
  • Hover Information: View message content, author, timestamp on hover
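
To show what density-based clustering does with the reduced points, here is a small dependency-free DBSCAN. It is for illustration only: the apps are listed as relying on scikit-learn and hdbscan, and this sketch does not reproduce their parameter tuning. As in scikit-learn, a point counts toward its own neighbourhood and noise is labelled -1.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Precompute eps-neighbourhoods; every point is its own neighbour.
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep growing
    return labels
```

The quadratic neighbourhood search is fine for a few thousand points; larger datasets are better served by the library implementations.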

Image Analysis

  • Channel Organization: Browse images by Discord channel
  • Metadata Display: Author, timestamp, message context
  • Navigation: Previous/next controls with slider
  • Format Support: PNG, JPG, GIF, WebP, BMP, TIFF

🛠️ Dependencies

Core Scripts

  • discord.py - Discord bot framework
  • pandas - Data manipulation
  • sentence-transformers - Text embeddings
  • requests - HTTP requests for image downloads

Visualization Apps

  • streamlit - Web interface framework
  • plotly - Interactive plotting
  • scikit-learn - Machine learning algorithms
  • numpy - Numerical computations
  • umap-learn - Dimensionality reduction
  • hdbscan - Density-based clustering

📈 Use Cases

Research & Analytics

  • Community Analysis: Understand conversation patterns and topics
  • Sentiment Analysis: Track mood and sentiment over time
  • User Behavior: Analyze posting patterns and engagement
  • Content Moderation: Identify problematic content clusters

Data Science Projects

  • NLP Research: Experiment with text embeddings and clustering
  • Social Network Analysis: Study communication patterns
  • Visualization Techniques: Explore dimensionality reduction methods
  • Image Processing: Analyze visual content sharing patterns

Content Management

  • Archive Creation: Preserve Discord community history
  • Content Discovery: Find similar messages and discussions
  • Moderation Tools: Identify spam or inappropriate content
  • Backup Solutions: Create comprehensive data backups

🔒 Privacy & Ethics

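  • Data Protection: All processing happens locally; network access is only needed to scrape Discord, download attachments, and fetch the embedding model on first use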
  • User Consent: Ensure proper permissions before scraping
  • Compliance: Follow Discord's Terms of Service
  • Anonymization: Consider removing or hashing user IDs for research
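
One way to hash user IDs is a keyed hash, sketched below. Because Discord IDs are numeric and often already known, a plain unsalted hash can be matched by hashing candidate IDs, so this example uses HMAC with a secret key kept outside the dataset. The function names and the 16-character truncation are assumptions made for the illustration.

```python
import hashlib
import hmac


def pseudonymize_id(user_id, secret_key):
    """Map a user ID to a stable pseudonym using HMAC-SHA256.

    `secret_key` must be bytes. The same key always yields the same
    pseudonym, so per-author analysis still works after replacement.
    """
    digest = hmac.new(secret_key, str(user_id).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def anonymize_row(row, secret_key):
    """Return a copy of a chat-log row with author fields pseudonymized."""
    cleaned = dict(row)
    alias = pseudonymize_id(row["author_id"], secret_key)
    cleaned["author_id"] = alias
    cleaned["author_name"] = alias
    cleaned["author_nickname"] = ""
    return cleaned
```

Message content can still identify people, so this covers only the author columns.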

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

📄 License

This project is intended for educational and research purposes. Please ensure compliance with Discord's Terms of Service and applicable privacy laws when using this toolkit.

🆘 Troubleshooting

Common Issues

Bot can't read messages:

  • Ensure Message Content Intent is enabled
  • Check bot permissions in Discord server
  • Verify bot token is correct

Embeddings not generating:

  • Install sentence-transformers: pip install sentence-transformers
  • Check available GPU memory for large models
  • Try a smaller model like all-MiniLM-L6-v2

Images not downloading:

  • Check internet connectivity
  • Verify Discord CDN URLs are accessible
  • Increase retry limits for unreliable connections

Visualization not loading:

  • Ensure all requirements are installed
  • Check that CSV files have embeddings
  • Try reducing dataset size for better performance

📚 Additional Resources