# Discord Image Downloader

This script processes Discord chat log CSV files to download and convert images to a base64 dataset.

## Features

- Parses all CSV files in the `discord_chat_logs/` directory
- Extracts attachment URLs from the `attachment_urls` column
- Downloads images using wget-like functionality (via Python `requests`)
- Converts images to base64 format for easy storage and processing
- Saves metadata including channel, sender, timestamp, and message context
- Handles Discord CDN URLs with query parameters
- Implements retry logic and rate limiting
- Deduplicates images based on URL hash

## Setup

1. Install dependencies:

   ```bash
   ./setup.sh
   ```

   Or manually:

   ```bash
   pip3 install -r requirements.txt
   ```

2. Run the image downloader:

   ```bash
   cd scripts
   python3 image_downloader.py
   ```

## Output

The script creates an `images_dataset/` directory containing:

- `images_dataset.json` - Complete dataset with images in base64 format

### Dataset Structure

```json
{
  "metadata": {
    "created_at": "2025-08-11 12:34:56 UTC",
    "summary": {
      "total_images": 42,
      "channels": ["memes", "general", "nsfw"],
      "total_size_bytes": 1234567,
      "file_extensions": [".png", ".jpg", ".gif"],
      "authors": ["user1", "user2"]
    }
  },
  "images": [
    {
      "url": "https://cdn.discordapp.com/attachments/...",
      "channel": "memes",
      "author_name": "username",
      "author_nickname": "User Nickname",
      "author_id": "123456789",
      "message_id": "987654321",
      "timestamp_utc": "2020-03-11 18:25:49.086000+00:00",
      "content": "Message text content",
      "file_extension": ".png",
      "file_size": 54321,
      "url_hash": "abc123def456",
      "base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
    }
  ]
}
```

## Supported Image Formats

- PNG (.png)
- JPEG (.jpg, .jpeg)
- GIF (.gif)
- WebP (.webp)
- BMP (.bmp)
- TIFF (.tiff)

## Configuration

You can modify the following variables in `image_downloader.py`:

- `MAX_RETRIES` - Number of download retry attempts (default: 3)
- `DELAY_BETWEEN_REQUESTS` - Delay between requests in seconds (default: 0.5)
- `SUPPORTED_EXTENSIONS` - Set of supported image file extensions

## Error Handling

The script includes robust error handling:

- Skips non-image URLs
- Retries failed downloads with exponential backoff
- Validates content types from server responses
- Continues processing even if individual downloads fail
- Logs all activities and errors to console
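
The extension filtering, URL-hash deduplication, retry with exponential backoff, and base64 conversion described above can be sketched roughly as follows. This is an illustrative sketch, not the code in `image_downloader.py`: the function names (`file_extension`, `url_hash`, `fetch_with_retries`, `build_records`) are hypothetical, the hash is assumed to be a truncated SHA-256 of the URL, and `DELAY_BETWEEN_REQUESTS` is assumed to double as the backoff base delay. The download itself is passed in as a `fetch` callable so the sketch runs without network access.

```python
import base64
import hashlib
import os
import time
from urllib.parse import urlparse

# Values mirror the documented defaults; the real script may define them differently.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp", ".tiff"}
MAX_RETRIES = 3
DELAY_BETWEEN_REQUESTS = 0.5


def file_extension(url):
    """Return the lowercased extension of the URL path, ignoring query parameters."""
    path = urlparse(url).path
    return os.path.splitext(path)[1].lower()


def url_hash(url):
    """Short, stable identifier for a URL (assumption: SHA-256 cut to 12 hex chars)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]


def fetch_with_retries(fetch, url, max_retries=MAX_RETRIES,
                       base_delay=DELAY_BETWEEN_REQUESTS, sleep=time.sleep):
    """Call fetch(url) up to max_retries times, doubling the wait after each failure."""
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries - 1:
                return None  # give up so the caller can move on to the next URL
            sleep(base_delay * (2 ** attempt))
    return None


def build_records(urls, fetch):
    """Skip non-image and duplicate URLs, then base64-encode each downloaded image."""
    seen = set()
    records = []
    for url in urls:
        ext = file_extension(url)
        if ext not in SUPPORTED_EXTENSIONS:
            continue  # non-image URL
        digest = url_hash(url)
        if digest in seen:
            continue  # already processed this URL
        seen.add(digest)
        data = fetch_with_retries(fetch, url)
        if data is None:
            continue  # download failed; keep processing the rest
        records.append({
            "url": url,
            "file_extension": ext,
            "file_size": len(data),
            "url_hash": digest,
            "base64_data": base64.b64encode(data).decode("ascii"),
        })
    return records
```

With the `requests` library, `fetch` could be a small function that calls `requests.get(url, timeout=30)`, then `raise_for_status()` on the response, and returns `response.content`. Raising on HTTP errors is what lets the retry loop see a failed download.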