updated readme
281
README.md
@@ -1,2 +1,281 @@
# cult-scraper

# Discord Data Analysis & Visualization Suite

A toolkit for scraping, processing, and analyzing Discord chat data, with interactive visualization of message embeddings and downloaded images.

## 🌟 Features

### 📥 Data Collection
- **Discord Bot Scraper**: Automated extraction of complete message history from Discord servers
- **Image Downloader**: Downloads and processes images from Discord attachments with base64 conversion
- **Text Embeddings**: Generate semantic embeddings for chat messages using sentence transformers

### 📊 Visualization & Analysis
- **Interactive Chat Visualizer**: 2D visualization of chat messages using dimensionality reduction (PCA, t-SNE)
- **Clustering Analysis**: Automated grouping of similar messages with DBSCAN and HDBSCAN
- **Image Dataset Viewer**: Browse and explore downloaded images by channel

### 🔧 Data Processing
- **Batch Processing**: Process multiple CSV files with embeddings
- **Metadata Extraction**: Comprehensive message metadata including timestamps, authors, and content
- **Data Filtering**: Advanced filtering by authors, channels, and timeframes

## 📁 Repository Structure

```
cult-scraper-1/
├── scripts/                    # Core data collection scripts
│   ├── bot.py                  # Discord bot for message scraping
│   ├── image_downloader.py     # Download and convert Discord images
│   ├── embedder.py             # Batch text embedding processor
│   └── embed_class.py          # Text embedding utilities
├── apps/                       # Interactive applications
│   ├── cluster_map/            # Chat message clustering & visualization
│   │   ├── main.py             # Main Streamlit application
│   │   ├── data_loader.py      # Data loading utilities
│   │   ├── clustering.py       # Clustering algorithms
│   │   ├── visualization.py    # Plotting and visualization
│   │   └── requirements.txt    # Dependencies
│   └── image_viewer/           # Image dataset browser
│       ├── image_viewer.py     # Streamlit image viewer
│       └── requirements.txt    # Dependencies
├── discord_chat_logs/          # Exported CSV files from Discord
└── images_dataset/             # Downloaded images and metadata
    └── images_dataset.json     # Image dataset with base64 data
```

## 🚀 Quick Start

### 1. Discord Data Scraping

First, set up and run the Discord bot to collect message data:

```bash
cd scripts
# Configure your bot token in bot.py
python bot.py
```

**Requirements:**
- Discord bot token with Message Content Intent enabled
- Bot must have the View Channel and Read Message History permissions in target channels
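
As a rough illustration of what such a scraper involves (this is not the actual contents of `bot.py`), the sketch below uses `discord.py` to walk every text channel the bot can see and write one CSV per channel. The names `message_to_row` and `scrape`, the output file naming, the `;` separator for multiple attachment URLs, and the column order (borrowed from the CSV format described under Data Formats) are all assumptions.

```python
import csv
import json
import os

# Column order mirrors the documented chat-log CSV, minus the embedding column.
COLUMNS = ["message_id", "timestamp_utc", "author_id", "author_name",
           "author_nickname", "content", "attachment_urls", "embeds"]


def message_to_row(message):
    """Flatten a discord.Message-like object into a list of CSV cell values."""
    author = message.author
    return [
        message.id,
        str(message.created_at),
        author.id,
        author.name,
        getattr(author, "nick", None) or "",  # only guild members carry a nickname
        message.content,
        ";".join(a.url for a in message.attachments),  # separator is an assumption
        json.dumps([e.to_dict() for e in message.embeds]),
    ]


def scrape(token, out_dir="."):
    """Connect, dump the history of every readable text channel, then disconnect."""
    import discord  # imported lazily so message_to_row works without discord.py

    intents = discord.Intents.default()
    intents.message_content = True  # must also be enabled in the developer portal
    client = discord.Client(intents=intents)

    @client.event
    async def on_ready():
        for guild in client.guilds:
            for channel in guild.text_channels:
                path = os.path.join(out_dir, f"{guild.id}_{channel.name}.csv")
                with open(path, "w", newline="", encoding="utf-8") as fh:
                    writer = csv.writer(fh)
                    writer.writerow(COLUMNS)
                    try:
                        async for msg in channel.history(limit=None, oldest_first=True):
                            writer.writerow(message_to_row(msg))
                    except discord.Forbidden:
                        print(f"skipping #{channel.name}: missing permissions")
        await client.close()

    client.run(token)
```

Calling `scrape("your_discord_bot_token_here", out_dir="../discord_chat_logs")` would run the whole export once and exit. Keeping `message_to_row` free of any `discord` import makes the row layout easy to check without a live connection.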

### 2. Generate Text Embeddings

Process the collected chat data to add semantic embeddings:

```bash
cd scripts
python embedder.py
```

This will:
- Process all CSV files in `discord_chat_logs/`
- Add embeddings to message content using sentence transformers
- Save updated files with embedding vectors
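
The steps above can be sketched roughly as follows, assuming the `sentence-transformers` package and the `all-MiniLM-L6-v2` model. The helper names `add_embeddings` and `embed_csv`, the injectable `encode` parameter, and storing each vector as a JSON string are illustrative assumptions, not the actual interface of `embedder.py`.

```python
import csv
import json


def add_embeddings(rows, encode):
    """Return copies of `rows` with a `content_embedding` column (JSON list of floats).

    `rows` is a list of dicts with a "content" key; `encode` maps a list of
    strings to a sequence of numeric vectors, one per string.
    """
    texts = [row.get("content") or "" for row in rows]
    vectors = encode(texts)
    return [
        {**row, "content_embedding": json.dumps([float(x) for x in vec])}
        for row, vec in zip(rows, vectors)
    ]


def embed_csv(in_path, out_path, model_name="all-MiniLM-L6-v2"):
    """Read a chat-log CSV, embed the `content` column, and write a new CSV."""
    from sentence_transformers import SentenceTransformer  # heavy import, loaded on demand

    with open(in_path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    if not rows:
        return  # nothing to embed
    model = SentenceTransformer(model_name)
    rows = add_embeddings(rows, model.encode)
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Passing the encoder in as a function keeps `add_embeddings` usable with any model, or with a stub when no model is available.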

### 3. Download Images

Extract and download images from Discord attachments:

```bash
cd scripts
python image_downloader.py
```

**Features:**
- Downloads images from attachment URLs
- Converts to base64 for storage
- Handles multiple image formats (PNG, JPG, GIF, WebP, etc.)
- Implements retry logic and rate limiting
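
One plausible shape for the retry and rate-limiting behaviour listed above is sketched here with the standard library only. The function names, the exponential backoff schedule, and the fixed delay between requests are assumptions for illustration; `image_downloader.py` may do this differently.

```python
import base64
import time
import urllib.request


def http_get(url, timeout=30):
    """Default fetcher: return the response body as bytes (raises on HTTP errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_retry(url, fetch=http_get, retries=3, backoff=1.0, sleep=time.sleep):
    """Call `fetch(url)`, retrying on OSError with exponential backoff between attempts."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError:  # urllib.error.URLError and socket timeouts are OSError subclasses
            if attempt == retries - 1:
                raise
            sleep(backoff * 2 ** attempt)


def download_as_base64(urls, delay=0.5, fetch=http_get, retries=3, sleep=time.sleep):
    """Download each URL in turn, pausing `delay` seconds between requests."""
    results = {}
    for index, url in enumerate(urls):
        if index:
            sleep(delay)  # crude rate limit: fixed pause between consecutive downloads
        data = fetch_with_retry(url, fetch=fetch, retries=retries, sleep=sleep)
        results[url] = base64.b64encode(data).decode("ascii")
    return results
```

The `fetch` and `sleep` parameters exist so the logic can be exercised with a stub instead of real network calls and real waiting.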

### 4. Visualize Chat Data

Launch the interactive chat visualization tool:

```bash
cd apps/cluster_map
pip install -r requirements.txt
streamlit run main.py
```

**Capabilities:**
- 2D visualization using PCA or t-SNE
- Interactive clustering with DBSCAN/HDBSCAN
- Filter by channels, authors, and time periods
- Hover to see message content and metadata

### 5. Browse Image Dataset

View downloaded images in an organized interface:

```bash
cd apps/image_viewer
pip install -r requirements.txt
streamlit run image_viewer.py
```

**Features:**
- Channel-based organization
- Navigation controls (previous/next)
- Image metadata display
- Responsive layout

## 📋 Data Formats

### Discord Chat Logs (CSV)
```csv
message_id,timestamp_utc,author_id,author_name,author_nickname,content,attachment_urls,embeds,content_embedding
1234567890,2025-08-11 12:34:56,9876543210,username,nickname,"Hello world!","https://cdn.discordapp.com/...",{},"[0.123, -0.456, ...]"
```
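
A minimal sketch of reading this format back with the standard library, assuming the `content_embedding` cell holds a JSON-style list as in the sample row. The function name `load_chat_log` is hypothetical, and the sample data is shortened for the demo.

```python
import csv
import io
import json


def load_chat_log(fh):
    """Parse an open chat-log CSV; `content_embedding` becomes a list of floats or None."""
    rows = []
    for row in csv.DictReader(fh):
        raw = row.get("content_embedding") or ""
        row["content_embedding"] = json.loads(raw) if raw else None
        rows.append(row)
    return rows


# In-memory stand-in for a file in discord_chat_logs/ (embedding shortened to two values).
sample = io.StringIO(
    "message_id,timestamp_utc,author_id,author_name,author_nickname,"
    "content,attachment_urls,embeds,content_embedding\n"
    "1234567890,2025-08-11 12:34:56,9876543210,username,nickname,"
    '"Hello, world!",,{},"[0.123, -0.456]"\n'
)
rows = load_chat_log(sample)
print(rows[0]["content"], rows[0]["content_embedding"])
```

IDs are deliberately left as strings: Discord IDs are 64-bit integers, and tools that parse them as floating-point numbers can silently lose precision.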

### Image Dataset (JSON)
```json
{
  "metadata": {
    "created_at": "2025-08-11 12:34:56 UTC",
    "summary": {
      "total_images": 42,
      "channels": ["memes", "general"],
      "total_size_bytes": 1234567,
      "file_extensions": [".png", ".jpg"],
      "authors": ["user1", "user2"]
    }
  },
  "images": [
    {
      "url": "https://cdn.discordapp.com/attachments/...",
      "channel": "memes",
      "author_name": "username",
      "timestamp_utc": "2025-08-11 12:34:56+00:00",
      "content": "Message text",
      "file_extension": ".png",
      "file_size": 54321,
      "base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
    }
  ]
}
```
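
To show how this structure can be consumed, here is a small sketch that counts images per channel and decodes one entry back to raw bytes. The helper names are hypothetical, and the in-memory dataset is a cut-down stand-in that carries only the fields the helpers touch.

```python
import base64
import json
from collections import Counter


def images_per_channel(dataset):
    """Count entries in `images` grouped by their `channel` field."""
    return Counter(image["channel"] for image in dataset["images"])


def decode_image(entry):
    """Return the raw image bytes stored in an entry's `base64_data` field."""
    return base64.b64decode(entry["base64_data"])


# Stand-in data: the "image" is just the 8-byte PNG file signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
dataset = {
    "metadata": {"summary": {"total_images": 1, "channels": ["memes"]}},
    "images": [{
        "url": "https://cdn.discordapp.com/attachments/...",
        "channel": "memes",
        "file_extension": ".png",
        "base64_data": base64.b64encode(PNG_SIGNATURE).decode("ascii"),
    }],
}
dataset = json.loads(json.dumps(dataset))  # round-trip through JSON, as if read from disk

print(images_per_channel(dataset))
print(decode_image(dataset["images"][0])[:4])
```

For the real file, replace the stand-in with `json.load` on `images_dataset/images_dataset.json`; the decoded bytes can then be written to disk using the entry's `file_extension`.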

## 🔧 Configuration

### Discord Bot Setup
1. Create a Discord application at https://discord.com/developers/applications
2. Create a bot and copy the token
3. Enable the following intents:
   - Message Content Intent
   - Server Members Intent (optional)
4. Invite the bot to your server with appropriate permissions

### Environment Variables
Despite the heading, the token is set as a Python constant in `scripts/bot.py`:

```python
# Set in scripts/bot.py
BOT_TOKEN = "your_discord_bot_token_here"
```
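
A token written into a source file is easy to commit by accident. One alternative, sketched here on the assumption that `bot.py` could be adapted to call it, is to read the token from an actual environment variable. The variable name `DISCORD_BOT_TOKEN` and the helper `load_bot_token` are hypothetical.

```python
import os


def load_bot_token(env=os.environ, var="DISCORD_BOT_TOKEN"):
    """Return the bot token from the environment, failing loudly if it is missing."""
    token = env.get(var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var} environment variable before starting the bot")
    return token
```

With this in place, the shell session would export `DISCORD_BOT_TOKEN` before running `python bot.py`, and the script would use `BOT_TOKEN = load_bot_token()`.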

### Embedding Models
The system uses sentence-transformers models. Default: `all-MiniLM-L6-v2`

Supported models:
- `all-MiniLM-L6-v2` (lightweight, fast)
- `all-mpnet-base-v2` (higher quality)
- `sentence-transformers/all-roberta-large-v1` (largest model, slowest)

## 📊 Visualization Features

### Chat Message Clustering
- **Dimensionality Reduction**: PCA, t-SNE, UMAP
- **Clustering Algorithms**: DBSCAN, HDBSCAN with automatic parameter tuning
- **Interactive Controls**: Filter by source files, authors, and clusters
- **Hover Information**: View message content, author, timestamp on hover
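
The core of this pipeline (reduce the embeddings to two dimensions, then cluster) can be sketched with scikit-learn as below. This is a simplified illustration using PCA and DBSCAN on synthetic data; the function name, the parameter defaults, and the choice to cluster the 2D projection rather than the original vectors are assumptions, not a description of `clustering.py`.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA


def project_and_cluster(embeddings, eps=0.5, min_samples=5, random_state=0):
    """Reduce embeddings to 2-D with PCA, then label the 2-D points with DBSCAN.

    Returns (coords, labels); DBSCAN marks noise points with the label -1.
    """
    X = np.asarray(embeddings, dtype=float)
    coords = PCA(n_components=2, random_state=random_state).fit_transform(X)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
    return coords, labels


# Synthetic stand-in for real message embeddings: two tight, well-separated blobs in 16-D.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.05, size=(30, 16))
blob_b = rng.normal(loc=3.0, scale=0.05, size=(30, 16))
coords, labels = project_and_cluster(np.vstack([blob_a, blob_b]))
print(coords.shape, sorted(set(labels)))
```

Real sentence embeddings are far less cleanly separated, so `eps` and `min_samples` usually need tuning per dataset. Swapping `PCA` for `sklearn.manifold.TSNE`, or `DBSCAN` for HDBSCAN from the `hdbscan` package, follows the same fit-then-label pattern.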

### Image Analysis
- **Channel Organization**: Browse images by Discord channel
- **Metadata Display**: Author, timestamp, message context
- **Navigation**: Previous/next controls with slider
- **Format Support**: PNG, JPG, GIF, WebP, BMP, TIFF

## 🛠️ Dependencies

### Core Scripts
- `discord.py` - Discord bot framework
- `pandas` - Data manipulation
- `sentence-transformers` - Text embeddings
- `requests` - HTTP requests for image downloads

### Visualization Apps
- `streamlit` - Web interface framework
- `plotly` - Interactive plotting
- `scikit-learn` - Machine learning algorithms
- `numpy` - Numerical computations
- `umap-learn` - Dimensionality reduction
- `hdbscan` - Density-based clustering

## 📈 Use Cases

### Research & Analytics
- **Community Analysis**: Understand conversation patterns and topics
- **Sentiment Analysis**: Track mood and sentiment over time
- **User Behavior**: Analyze posting patterns and engagement
- **Content Moderation**: Identify problematic content clusters

### Data Science Projects
- **NLP Research**: Experiment with text embeddings and clustering
- **Social Network Analysis**: Study communication patterns
- **Visualization Techniques**: Explore dimensionality reduction methods
- **Image Processing**: Analyze visual content sharing patterns

### Content Management
- **Archive Creation**: Preserve Discord community history
- **Content Discovery**: Find similar messages and discussions
- **Moderation Tools**: Identify spam or inappropriate content
- **Backup Solutions**: Create comprehensive data backups

## 🔒 Privacy & Ethics

- **Data Protection**: All processing happens locally
- **User Consent**: Ensure proper permissions before scraping
- **Compliance**: Follow Discord's Terms of Service
- **Anonymization**: Consider removing or hashing user IDs for research
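
For the hashing suggestion above, a keyed hash is generally safer than a plain one, because Discord IDs are numeric and a bare SHA-256 of them could be reversed by enumeration. The sketch below uses HMAC-SHA256 from the standard library; the function name, the 16-character truncation, and the key handling are assumptions for illustration.

```python
import hashlib
import hmac


def pseudonymize(user_id, secret_key):
    """Map a user ID to a stable pseudonym using HMAC-SHA256 keyed with `secret_key`.

    `secret_key` is bytes and must be kept out of the published dataset.
    """
    digest = hmac.new(secret_key, str(user_id).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # 64 bits keeps pseudonyms short; collisions are unlikely


key = b"replace-with-a-random-secret"
print(pseudonymize(9876543210, key))
```

The same ID always maps to the same pseudonym under one key, so per-author analysis still works. Note that this is pseudonymization rather than full anonymization: message content, names, and nicknames can still identify people and may need separate handling.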

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## 📄 License

This project is intended for educational and research purposes. Please ensure compliance with Discord's Terms of Service and applicable privacy laws when using this toolkit.

## 🆘 Troubleshooting

### Common Issues

**Bot can't read messages:**
- Ensure Message Content Intent is enabled
- Check bot permissions in Discord server
- Verify bot token is correct

**Embeddings not generating:**
- Install sentence-transformers: `pip install sentence-transformers`
- Check available GPU memory for large models
- Try a smaller model like `all-MiniLM-L6-v2`

**Images not downloading:**
- Check internet connectivity
- Verify Discord CDN URLs are accessible
- Increase retry limits for unreliable connections

**Visualization not loading:**
- Ensure all requirements are installed
- Check that CSV files have embeddings
- Try reducing dataset size for better performance

## 📚 Additional Resources

- [Discord.py Documentation](https://discordpy.readthedocs.io/)
- [Sentence Transformers Models](https://www.sbert.net/docs/pretrained_models.html)
- [Streamlit Documentation](https://docs.streamlit.io/)
- [scikit-learn Clustering](https://scikit-learn.org/stable/modules/clustering.html)