image downloader +read me

2025-08-11 01:21:35 +01:00
parent e22705600a
commit ba528a3806
2 changed files with 326 additions and 0 deletions

@@ -0,0 +1,98 @@
# Discord Image Downloader
This script reads Discord chat log CSV files, downloads the images attached to messages, and stores them as base64 strings in a single JSON dataset.
## Features
- Parses all CSV files in the `discord_chat_logs/` directory
- Extracts attachment URLs from the `attachment_urls` column
- Downloads images over HTTP using the Python `requests` library (similar in effect to `wget`)
- Converts images to base64 format for easy storage and processing
- Saves metadata including channel, sender, timestamp, and message context
- Handles Discord CDN URLs with query parameters
- Implements retry logic and rate limiting
- Deduplicates images based on URL hash
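
The URL handling and deduplication features above can be sketched as follows. This is a minimal illustration, not the code in `image_downloader.py`: the function names are hypothetical, and the hash scheme (the first 12 hex characters of a SHA-256 digest of the full URL) is an assumption, since the actual algorithm is not documented here.

```python
import hashlib
from pathlib import PurePosixPath
from typing import Iterable, List, Optional
from urllib.parse import urlparse

# Mirrors the formats this README lists as supported.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp", ".tiff"}


def image_extension(url: str) -> Optional[str]:
    """Return the lowercased extension of the URL path, or None if unsupported."""
    # urlparse separates the path from the query string, so CDN query
    # parameters do not leak into the extension.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix if suffix in SUPPORTED_EXTENSIONS else None


def url_hash(url: str) -> str:
    """Hypothetical hash: first 12 hex characters of the SHA-256 of the URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]


def unique_image_urls(urls: Iterable[str]) -> List[str]:
    """Keep supported image URLs, dropping any whose hash was already seen."""
    seen = set()
    unique = []
    for url in urls:
        if image_extension(url) is None:
            continue  # not a supported image URL
        digest = url_hash(url)
        if digest in seen:
            continue  # duplicate URL
        seen.add(digest)
        unique.append(url)
    return unique
```

Because this sketch hashes the full URL, two links to the same file with different query parameters count as distinct images; hashing only the path would merge them.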
## Setup
1. Install dependencies:
```bash
./setup.sh
```
Or manually:
```bash
pip3 install -r requirements.txt
```
2. Run the image downloader:
```bash
cd scripts
python3 image_downloader.py
```
## Output
The script creates an `images_dataset/` directory containing:
- `images_dataset.json` - Complete dataset with images in base64 format
### Dataset Structure
```json
{
  "metadata": {
    "created_at": "2025-08-11 12:34:56 UTC",
    "summary": {
      "total_images": 42,
      "channels": ["memes", "general", "nsfw"],
      "total_size_bytes": 1234567,
      "file_extensions": [".png", ".jpg", ".gif"],
      "authors": ["user1", "user2"]
    }
  },
  "images": [
    {
      "url": "https://cdn.discordapp.com/attachments/...",
      "channel": "memes",
      "author_name": "username",
      "author_nickname": "User Nickname",
      "author_id": "123456789",
      "message_id": "987654321",
      "timestamp_utc": "2020-03-11 18:25:49.086000+00:00",
      "content": "Message text content",
      "file_extension": ".png",
      "file_size": 54321,
      "url_hash": "abc123def456",
      "base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
    }
  ]
}
```
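
To turn the dataset back into image files, each entry's `base64_data` can be decoded with the standard library. The sketch below relies only on the field names shown in the structure above; the helper names and the `<url_hash><file_extension>` file naming are illustrative assumptions, not something the script itself does.

```python
import base64
import json
from pathlib import Path
from typing import Tuple


def load_dataset(path: str) -> dict:
    """Load images_dataset.json into a dictionary."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def decode_entry(entry: dict) -> Tuple[str, bytes]:
    """Return a suggested file name and the raw image bytes for one entry."""
    # Assumed naming scheme: "<url_hash><file_extension>", e.g. "abc123def456.png".
    name = entry["url_hash"] + entry["file_extension"]
    raw = base64.b64decode(entry["base64_data"])
    return name, raw


def export_images(dataset: dict, out_dir: str) -> int:
    """Write every image in the dataset to out_dir and return the count."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    count = 0
    for entry in dataset["images"]:
        name, raw = decode_entry(entry)
        (target / name).write_bytes(raw)
        count += 1
    return count
```

For example, `export_images(load_dataset("images_dataset/images_dataset.json"), "decoded/")` would write one file per entry into a `decoded/` directory.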
## Supported Image Formats
- PNG (.png)
- JPEG (.jpg, .jpeg)
- GIF (.gif)
- WebP (.webp)
- BMP (.bmp)
- TIFF (.tiff)
## Configuration
You can modify the following variables in `image_downloader.py`:
- `MAX_RETRIES` - Number of download retry attempts (default: 3)
- `DELAY_BETWEEN_REQUESTS` - Delay between requests in seconds (default: 0.5)
- `SUPPORTED_EXTENSIONS` - Set of supported image file extensions
## Error Handling
The script handles the following failure cases:
- Skips non-image URLs
- Retries failed downloads with exponential backoff
- Validates content types from server responses
- Continues processing even if individual downloads fail
- Logs progress and errors to the console
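
The retry and content-type behaviour listed above can be sketched as below. This is an illustration under assumptions, not the script's actual code: `download_with_retries` and `is_image_content_type` are hypothetical names, the delay is assumed to double after each failure starting from `DELAY_BETWEEN_REQUESTS`, and the HTTP call is passed in as a `fetch` callable so the logic can be exercised without network access. In the real script that callable would presumably wrap a `requests` call.

```python
import time
from typing import Callable, Optional, Tuple

MAX_RETRIES = 3                # default documented under Configuration
DELAY_BETWEEN_REQUESTS = 0.5   # seconds, default documented under Configuration


def is_image_content_type(content_type: Optional[str]) -> bool:
    """True if a Content-Type header value denotes an image."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    return content_type.split(";", 1)[0].strip().lower().startswith("image/")


def download_with_retries(
    fetch: Callable[[str], Tuple[str, bytes]],
    url: str,
    max_retries: int = MAX_RETRIES,
    base_delay: float = DELAY_BETWEEN_REQUESTS,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Call fetch(url) up to max_retries times; return the body or None."""
    for attempt in range(max_retries):
        try:
            content_type, body = fetch(url)
        except Exception as exc:
            print(f"attempt {attempt + 1}/{max_retries} failed for {url}: {exc}")
            if attempt < max_retries - 1:
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** attempt)
            continue
        if not is_image_content_type(content_type):
            print(f"skipping {url}: content type {content_type!r} is not an image")
            return None  # retrying will not change the content type
        return body
    return None  # all attempts failed; the caller moves on to the next URL
```

Returning `None` instead of raising lets the caller keep processing the remaining URLs when a single download fails.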