# Discord Data Analysis & Visualization Suite

A toolkit for scraping Discord chat history, embedding the messages, and exploring the results through interactive Streamlit apps.

## 🌟 Features

### 📥 Data Collection
- **Discord Bot Scraper**: Automated extraction of complete message history from Discord servers
- **Image Downloader**: Downloads images from Discord attachments and stores them as base64
- **Text Embeddings**: Generates semantic embeddings for chat messages using sentence transformers

### 📊 Visualization & Analysis
- **Interactive Chat Visualizer**: 2D visualization of chat messages using dimensionality reduction (PCA, t-SNE, UMAP)
- **Clustering Analysis**: Automated grouping of similar messages with DBSCAN and HDBSCAN
- **Image Dataset Viewer**: Browse and explore downloaded images by channel

### 🔧 Data Processing
- **Batch Processing**: Adds embeddings to multiple CSV files in a single run
- **Metadata Extraction**: Comprehensive message metadata including timestamps, authors, and content
- **Data Filtering**: Advanced filtering by authors, channels, and timeframes

## 📁 Repository Structure

```
cult-scraper-1/
├── scripts/                          # Core data collection scripts
│   ├── bot.py                        # Discord bot for message scraping
│   ├── image_downloader.py           # Download and convert Discord images
│   ├── embedder.py                   # Batch text embedding processor
│   └── embed_class.py                # Text embedding utilities
├── apps/                             # Interactive applications
│   ├── cluster_map/                  # Chat message clustering & visualization
│   │   ├── main.py                   # Main Streamlit application
│   │   ├── data_loader.py            # Data loading utilities
│   │   ├── clustering.py             # Clustering algorithms
│   │   ├── visualization.py          # Plotting and visualization
│   │   └── requirements.txt          # Dependencies
│   └── image_viewer/                 # Image dataset browser
│       ├── image_viewer.py           # Streamlit image viewer
│       └── requirements.txt          # Dependencies
├── discord_chat_logs/                # Exported CSV files from Discord
└── images_dataset/                   # Downloaded images and metadata
    └── images_dataset.json           # Image dataset with base64 data
```

## 🚀 Quick Start

### 1. Discord Data Scraping

First, set up and run the Discord bot to collect message data:

```bash
cd scripts
# Configure your bot token in bot.py
python bot.py
```

**Requirements:**
- Discord bot token with message content intent enabled
- The bot needs the View Channel and Read Message History permissions in each target channel
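
A minimal sketch of how a scraper like `bot.py` might walk a channel's history with discord.py 2.x. The `CHANNEL_ID` value, the `message_to_row` and `run_scraper` helpers, the output file name, and the `;` separator for attachment URLs are assumptions for illustration and are not taken from the repository; the column names follow the CSV format documented in this README.

```python
import csv
import json

CHANNEL_ID = 123456789012345678  # hypothetical channel ID, replace with your own


def message_to_row(message):
    """Flatten one discord.py Message-like object into a CSV row dict."""
    return {
        "message_id": message.id,
        "timestamp_utc": message.created_at.strftime("%Y-%m-%d %H:%M:%S"),
        "author_id": message.author.id,
        "author_name": message.author.name,
        # Only guild members have a nickname; plain users do not.
        "author_nickname": getattr(message.author, "nick", None) or "",
        "content": message.content,
        # Assumption: multiple attachment URLs are joined with ";".
        "attachment_urls": ";".join(a.url for a in message.attachments),
        "embeds": json.dumps([e.to_dict() for e in message.embeds]),
    }


def run_scraper(token, out_path="channel_export.csv"):
    """Export the full history of CHANNEL_ID to out_path, then disconnect."""
    import discord  # imported here so message_to_row works without discord.py

    intents = discord.Intents.default()
    intents.message_content = True  # must also be enabled in the developer portal
    client = discord.Client(intents=intents)

    @client.event
    async def on_ready():
        channel = client.get_channel(CHANNEL_ID)
        if channel is None:  # unknown ID or the bot cannot see the channel
            await client.close()
            return
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = None
            # limit=None walks the whole history; oldest_first keeps rows in order.
            async for message in channel.history(limit=None, oldest_first=True):
                row = message_to_row(message)
                if writer is None:
                    writer = csv.DictWriter(f, fieldnames=list(row))
                    writer.writeheader()
                writer.writerow(row)
        await client.close()

    client.run(token)
```

Calling `run_scraper("your_discord_bot_token_here")` would connect, write one CSV, and exit. Without the Message Content Intent, `message.content` comes back empty for most messages, which is why that requirement is listed first.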

### 2. Generate Text Embeddings

Process the collected chat data to add semantic embeddings:

```bash
cd scripts
python embedder.py
```

This will:
- Process all CSV files in `discord_chat_logs/`
- Add embeddings to message content using sentence transformers
- Save updated files with embedding vectors
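
The steps above can be sketched roughly as follows. This is not the actual `embedder.py`; `load_encoder` and `embed_csv` are hypothetical helpers, and storing each vector as a JSON list in a `content_embedding` column is an assumption based on the CSV example in this README.

```python
import csv
import json


def load_encoder(model_name="all-MiniLM-L6-v2"):
    """Return a function mapping a list of strings to a list of float vectors."""
    from sentence_transformers import SentenceTransformer  # heavy import, done lazily

    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(texts).tolist()


def embed_csv(src_path, dst_path, encode):
    """Copy src_path to dst_path, adding a JSON-encoded content_embedding column."""
    with open(src_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        fieldnames = list(reader.fieldnames or [])
    if "content_embedding" not in fieldnames:
        fieldnames.append("content_embedding")

    # Encode everything in one batch; missing content is embedded as "".
    vectors = encode([row.get("content") or "" for row in rows]) if rows else []
    for row, vector in zip(rows, vectors):
        row["content_embedding"] = json.dumps(vector)

    with open(dst_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A call such as `embed_csv("in.csv", "out.csv", load_encoder())` would download the model on first use. Passing the encoder in as a function keeps the CSV handling independent of the model, so a different sentence-transformers model can be swapped in by name.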

### 3. Download Images

Extract and download images from Discord attachments:

```bash
cd scripts
python image_downloader.py
```

**Features:**
- Downloads images from attachment URLs
- Converts to base64 for storage
- Handles multiple image formats (PNG, JPG, GIF, WebP, etc.)
- Implements retry logic and rate limiting
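
As an illustration of the retry and base64 steps, here is a standard-library sketch. The function names, the retry count, and the exponential backoff schedule are assumptions; `image_downloader.py` may use different values and a different HTTP client (the dependency list mentions `requests`).

```python
import base64
import time
import urllib.error
import urllib.request


def fetch_bytes(url, timeout=10.0):
    """Download a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_as_base64(url, fetch=fetch_bytes, retries=3, backoff=1.0, sleep=time.sleep):
    """Fetch url with exponential backoff and return the body as a base64 string.

    Returns None if every attempt fails.
    """
    for attempt in range(retries):
        try:
            data = fetch(url)
            return base64.b64encode(data).decode("ascii")
        except (urllib.error.URLError, TimeoutError, ConnectionError):
            if attempt < retries - 1:
                # Wait 1x, 2x, 4x ... the base delay between attempts.
                sleep(backoff * (2 ** attempt))
    return None
```

The `fetch` and `sleep` parameters exist only so the retry logic can be exercised without network access. A simple form of rate limiting would be an extra fixed pause between successive calls to `download_as_base64`.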

### 4. Visualize Chat Data

Launch the interactive chat visualization tool:

```bash
cd apps/cluster_map
pip install -r requirements.txt
streamlit run main.py
```

**Capabilities:**
- 2D visualization using PCA, t-SNE, or UMAP
- Interactive clustering with DBSCAN/HDBSCAN
- Filter by channels, authors, and time periods
- Hover to see message content and metadata
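
To make the projection and clustering steps concrete, here is a small scikit-learn sketch that runs on synthetic data. The `project_and_cluster` helper and the `eps` and `min_samples` values are assumptions; the app itself may cluster on the full embeddings instead of the 2D coordinates and may pick its parameters differently.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA


def project_and_cluster(embeddings, eps=0.5, min_samples=5):
    """Reduce embeddings to 2D with PCA, then label the 2D points with DBSCAN.

    Returns (coords, labels); DBSCAN marks noise points with the label -1.
    """
    X = np.asarray(embeddings, dtype=float)
    coords = PCA(n_components=2).fit_transform(X)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
    return coords, labels


# Synthetic stand-in for message embeddings: two well-separated groups.
rng = np.random.default_rng(0)
fake = np.vstack([
    rng.normal(0.0, 0.05, size=(30, 8)),
    rng.normal(3.0, 0.05, size=(30, 8)),
])
coords, labels = project_and_cluster(fake, eps=0.5, min_samples=5)
print(coords.shape, sorted(set(labels.tolist())))
```

With real sentence embeddings the distances are on a different scale, so `eps` usually needs tuning per dataset.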

### 5. Browse Image Dataset

View downloaded images in an organized interface:

```bash
cd apps/image_viewer
pip install -r requirements.txt
streamlit run image_viewer.py
```

**Features:**
- Channel-based organization
- Navigation controls (previous/next)
- Image metadata display
- Responsive layout

## 📋 Data Formats

### Discord Chat Logs (CSV)
```csv
message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds,content_embedding
1234567890,2025-08-11 12:34:56,9876543210,username,nickname,"Hello world!","https://cdn.discordapp.com/...",{},"[0.123, -0.456, ...]"
```
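
A sketch of reading this format back with the standard library. It assumes the `content_embedding` column holds a JSON list of numbers, as the example row suggests; `load_embeddings` is a hypothetical helper, not part of the repository.

```python
import csv
import json


def load_embeddings(csv_path):
    """Return (rows, vectors) for rows whose content_embedding is a non-empty JSON list."""
    rows, vectors = [], []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            raw = row.get("content_embedding") or ""
            try:
                parsed = json.loads(raw)
                vector = [float(x) for x in parsed] if isinstance(parsed, list) else []
            except (ValueError, TypeError):
                continue  # skip rows without a usable embedding
            if vector:
                rows.append(row)
                vectors.append(vector)
    return rows, vectors
```

Rows with a missing or malformed embedding are skipped, so `rows` and `vectors` stay aligned index by index.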

### Image Dataset (JSON)
```json
{
  "metadata": {
    "created_at": "2025-08-11 12:34:56 UTC",
    "summary": {
      "total_images": 42,
      "channels": ["memes", "general"],
      "total_size_bytes": 1234567,
      "file_extensions": [".png", ".jpg"],
      "authors": ["user1", "user2"]
    }
  },
  "images": [
    {
      "url": "https://cdn.discordapp.com/attachments/...",
      "channel": "memes",
      "author_name": "username",
      "timestamp_utc": "2025-08-11 12:34:56+00:00",
      "content": "Message text",
      "file_extension": ".png",
      "file_size": 54321,
      "base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
    }
  ]
}
```
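
A minimal sketch of consuming this file: it decodes each `base64_data` field back to raw bytes and groups the records by `channel`. The helper name `load_images_by_channel` and the `"unknown"` fallback are assumptions for illustration.

```python
import base64
import json
from collections import defaultdict


def load_images_by_channel(json_path):
    """Group decoded images by channel: {channel: [(record, image_bytes), ...]}."""
    with open(json_path, encoding="utf-8") as f:
        dataset = json.load(f)

    by_channel = defaultdict(list)
    for record in dataset.get("images", []):
        image_bytes = base64.b64decode(record["base64_data"])
        by_channel[record.get("channel", "unknown")].append((record, image_bytes))
    return dict(by_channel)
```

The decoded bytes can be written to disk using the record's `file_extension`, or passed to an image library for display.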

## 🔧 Configuration

### Discord Bot Setup
1. Create a Discord application at https://discord.com/developers/applications
2. Create a bot and copy the token
3. Enable the following intents:
   - Message Content Intent
   - Server Members Intent (optional)
4. Invite the bot to your server with permission to view the target channels and read their message history

### Environment Variables
```python
# Set in scripts/bot.py
BOT_TOKEN = "your_discord_bot_token_here"
```
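
The snippet above places the token directly in `scripts/bot.py`. One possible alternative, shown here as a sketch only, is to read it from an actual environment variable so the token never lands in version control. The variable name `DISCORD_BOT_TOKEN` and the helper `get_bot_token` are assumptions and do not exist in the repository.

```python
import os


def get_bot_token(env_var="DISCORD_BOT_TOKEN"):
    """Read the bot token from the environment, failing loudly if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before running the bot")
    return token
```

With this in place, `BOT_TOKEN = get_bot_token()` could replace the hard-coded string, and the token would be supplied with `export DISCORD_BOT_TOKEN=...` before starting the bot.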

### Embedding Models
The system uses sentence-transformers models. Default: `all-MiniLM-L6-v2`

Supported models:
- `all-MiniLM-L6-v2` (lightweight, fast)
- `all-mpnet-base-v2` (higher quality)
- `all-roberta-large-v1` (largest model, slowest)

## 📊 Visualization Features

### Chat Message Clustering
- **Dimensionality Reduction**: PCA, t-SNE, UMAP
- **Clustering Algorithms**: DBSCAN, HDBSCAN with automatic parameter tuning
- **Interactive Controls**: Filter by source files, authors, and clusters
- **Hover Information**: View message content, author, timestamp on hover

### Image Analysis
- **Channel Organization**: Browse images by Discord channel
- **Metadata Display**: Author, timestamp, message context
- **Navigation**: Previous/next controls with slider
- **Format Support**: PNG, JPG, GIF, WebP, BMP, TIFF

## 🛠️ Dependencies

### Core Scripts
- `discord.py` - Discord bot framework
- `pandas` - Data manipulation
- `sentence-transformers` - Text embeddings
- `requests` - HTTP requests for image downloads

### Visualization Apps
- `streamlit` - Web interface framework
- `plotly` - Interactive plotting
- `scikit-learn` - Machine learning algorithms
- `numpy` - Numerical computations
- `umap-learn` - Dimensionality reduction
- `hdbscan` - Density-based clustering

## 📈 Use Cases

### Research & Analytics
- **Community Analysis**: Understand conversation patterns and topics
- **Sentiment Analysis**: Track mood and sentiment over time
- **User Behavior**: Analyze posting patterns and engagement
- **Content Moderation**: Identify problematic content clusters

### Data Science Projects
- **NLP Research**: Experiment with text embeddings and clustering
- **Social Network Analysis**: Study communication patterns
- **Visualization Techniques**: Explore dimensionality reduction methods
- **Image Processing**: Analyze visual content sharing patterns

### Content Management
- **Archive Creation**: Preserve Discord community history
- **Content Discovery**: Find similar messages and discussions
- **Moderation Tools**: Identify spam or inappropriate content
- **Backup Solutions**: Create comprehensive data backups

## 🔒 Privacy & Ethics

- **Data Protection**: All processing happens locally
- **User Consent**: Ensure proper permissions before scraping
- **Compliance**: Follow Discord's Terms of Service
- **Anonymization**: Consider removing or hashing user IDs for research
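
For the anonymization point, one option is a keyed hash, sketched below. Discord IDs are numeric, so an unkeyed hash could in principle be reversed by enumerating candidate IDs; HMAC with a secret key avoids that as long as the key stays private. The helper name and the 16-character truncation are assumptions.

```python
import hashlib
import hmac


def pseudonymize_id(user_id, secret_key):
    """Map a user ID to a stable pseudonym using keyed HMAC-SHA256."""
    digest = hmac.new(secret_key, str(user_id).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]
```

The same ID and key always yield the same pseudonym, so per-author analysis still works after the `author_id` column is replaced. The `author_name` and `author_nickname` columns would need to be dropped or replaced as well.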

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## 📄 License

This project is intended for educational and research purposes. Please ensure compliance with Discord's Terms of Service and applicable privacy laws when using this toolkit.

## 🆘 Troubleshooting

### Common Issues

**Bot can't read messages:**
- Ensure Message Content Intent is enabled
- Check bot permissions in Discord server
- Verify bot token is correct

**Embeddings not generating:**
- Install sentence-transformers: `pip install sentence-transformers`
- Check available GPU memory for large models
- Try a smaller model like `all-MiniLM-L6-v2`

**Images not downloading:**
- Check internet connectivity
- Verify Discord CDN URLs are accessible
- Increase retry limits for unreliable connections

**Visualization not loading:**
- Ensure all requirements are installed
- Check that CSV files have embeddings
- Try reducing dataset size for better performance

## 📚 Additional Resources

- [Discord.py Documentation](https://discordpy.readthedocs.io/)
- [Sentence Transformers Models](https://www.sbert.net/docs/pretrained_models.html)
- [Streamlit Documentation](https://docs.streamlit.io/)
- [scikit-learn Clustering](https://scikit-learn.org/stable/modules/clustering.html)