# Discord Image Downloader

This script processes Discord chat log CSV files, downloads the images they reference, and converts them into a base64-encoded dataset.

## Features

- Parses all CSV files in the `discord_chat_logs/` directory
- Extracts attachment URLs from the `attachment_urls` column
- Downloads images with the Python `requests` library (wget-like behavior)
- Converts images to base64 for easy storage and processing
- Saves metadata including channel, sender, timestamp, and message context
- Handles Discord CDN URLs with query parameters
- Implements retry logic and rate limiting
- Deduplicates images based on URL hash

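
The URL-hash deduplication listed above could work roughly as follows. This is a minimal sketch, not the script's actual code: the `url_hash` and `dedupe_urls` helpers, the choice of SHA-256, and the decision to hash the URL without its query string are all assumptions made for illustration.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def url_hash(url):
    """Hash a URL with its query string and fragment removed (an assumption)."""
    parts = urlsplit(url)
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe_urls(urls):
    """Return the URLs in their original order, keeping the first of each hash."""
    seen = set()
    unique = []
    for url in urls:
        digest = url_hash(url)
        if digest not in seen:
            seen.add(digest)
            unique.append(url)
    return unique
```

Dropping the query string before hashing means two links to the same attachment that differ only in their query parameters count as one image. If the script hashes the full URL instead, such links would be stored separately.
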
## Setup

1. Install dependencies:

```bash
./setup.sh
```

Or manually:

```bash
pip3 install -r requirements.txt
```

2. Run the image downloader:

```bash
cd scripts
python3 image_downloader.py
```

## Output

The script creates an `images_dataset/` directory containing:

- `images_dataset.json` - Complete dataset with images in base64 format

### Dataset Structure

```json
{
  "metadata": {
    "created_at": "2025-08-11 12:34:56 UTC",
    "summary": {
      "total_images": 42,
      "channels": ["memes", "general", "nsfw"],
      "total_size_bytes": 1234567,
      "file_extensions": [".png", ".jpg", ".gif"],
      "authors": ["user1", "user2"]
    }
  },
  "images": [
    {
      "url": "https://cdn.discordapp.com/attachments/...",
      "channel": "memes",
      "author_name": "username",
      "author_nickname": "User Nickname",
      "author_id": "123456789",
      "message_id": "987654321",
      "timestamp_utc": "2020-03-11 18:25:49.086000+00:00",
      "content": "Message text content",
      "file_extension": ".png",
      "file_size": 54321,
      "url_hash": "abc123def456",
      "base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
    }
  ]
}
```

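
To turn the dataset back into image files, something like the following should work. It is a sketch that relies only on the `images`, `base64_data`, `url_hash`, and `file_extension` fields shown above. The `export_images` helper and the output directory name are hypothetical and not part of the script.

```python
import base64
import json
from pathlib import Path


def export_images(dataset_path, output_dir):
    """Decode every image in the dataset and write it to output_dir."""
    dataset = json.loads(Path(dataset_path).read_text(encoding="utf-8"))
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for entry in dataset["images"]:
        raw = base64.b64decode(entry["base64_data"])
        # Name files by URL hash so repeated exports overwrite rather than duplicate.
        target = out / f"{entry['url_hash']}{entry['file_extension']}"
        target.write_bytes(raw)
        written.append(target)
    return written
```

For example, `export_images("images_dataset/images_dataset.json", "decoded_images")` would write one file per dataset entry into a `decoded_images/` directory.
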
## Supported Image Formats

- PNG (.png)
- JPEG (.jpg, .jpeg)
- GIF (.gif)
- WebP (.webp)
- BMP (.bmp)
- TIFF (.tiff)

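
Because Discord CDN URLs carry query parameters, an extension check has to look at the URL path rather than the end of the whole string. The sketch below shows one way to do that. The `image_extension` helper is hypothetical, and the set simply mirrors the formats listed above, so the real `SUPPORTED_EXTENSIONS` in `image_downloader.py` may differ.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Mirrors the formats listed above; the real set lives in image_downloader.py.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp", ".tiff"}


def image_extension(url):
    """Return the lowercase image extension of a URL, or None if unsupported."""
    path = urlsplit(url).path  # drops the query string and fragment
    suffix = PurePosixPath(path).suffix.lower()
    return suffix if suffix in SUPPORTED_EXTENSIONS else None
```

A URL whose path ends in `.PNG` followed by a query string is reported as `.png`, while a `.txt` file with `.png` somewhere in its query string is rejected.
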
## Configuration

You can modify the following variables in `image_downloader.py`:

- `MAX_RETRIES` - Number of download retry attempts (default: 3)
- `DELAY_BETWEEN_REQUESTS` - Delay between requests in seconds (default: 0.5)
- `SUPPORTED_EXTENSIONS` - Set of supported image file extensions

## Error Handling

The script handles errors as follows:

- Skips non-image URLs
- Retries failed downloads with exponential backoff
- Validates content types from server responses
- Continues processing even if individual downloads fail
- Logs all activity and errors to the console
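
The retry and content-type behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the script's code. The `download_with_retries` helper and its `fetch` callable are hypothetical, and using `DELAY_BETWEEN_REQUESTS` as the base of the backoff is a guess. Only the two default values come from the Configuration section.

```python
import time

MAX_RETRIES = 3               # documented default
DELAY_BETWEEN_REQUESTS = 0.5  # seconds, documented default


def download_with_retries(url, fetch, sleep=time.sleep):
    """Try fetch(url) up to MAX_RETRIES times, doubling the wait after each failure.

    fetch is any callable returning (content_type, body) or raising an exception.
    Returns the body bytes, or None so the caller can move on to the next URL.
    """
    for attempt in range(MAX_RETRIES):
        try:
            content_type, body = fetch(url)
            if not content_type.startswith("image/"):
                print(f"Skipping non-image response ({content_type}): {url}")
                return None  # retrying will not change the content type
            return body
        except Exception as exc:
            print(f"Attempt {attempt + 1}/{MAX_RETRIES} failed for {url}: {exc}")
            if attempt < MAX_RETRIES - 1:
                sleep(DELAY_BETWEEN_REQUESTS * 2 ** attempt)  # 0.5 s, then 1 s
    return None
```

With `requests`, `fetch` could wrap `requests.get(url, timeout=30)`, call `raise_for_status()` on the response, and return `(response.headers.get("Content-Type", ""), response.content)`. Returning `None` instead of raising lets the calling loop continue with the remaining URLs.
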