image downloader +read me
IMAGE_DOWNLOADER_README.md
# Discord Image Downloader

This script reads Discord chat log CSV files, downloads the images they reference, and stores them as base64 in a single JSON dataset.

## Features

- Parses all CSV files in the `discord_chat_logs/` directory
- Extracts attachment URLs from the `attachment_urls` column
- Downloads images with the Python `requests` library (wget-like behavior)
- Converts images to base64 format for easy storage and processing
- Saves metadata including channel, sender, timestamp, and message context
- Handles Discord CDN URLs with query parameters
- Implements retry logic and rate limiting
- Deduplicates images based on URL hash

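
The extraction and deduplication steps listed above could look roughly like the sketch below. It is not taken from `image_downloader.py`: the helper names `url_hash` and `extract_unique_urls`, the comma delimiter inside `attachment_urls`, and the 12-character MD5 prefix are all assumptions made for illustration.

```python
import csv
import hashlib
import io


def url_hash(url: str) -> str:
    """Short, stable fingerprint of a URL (assumed: first 12 hex chars of MD5)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()[:12]


def extract_unique_urls(csv_file, delimiter=","):
    """Yield each attachment URL once, in first-seen order."""
    seen = set()
    for row in csv.DictReader(csv_file):
        cell = row.get("attachment_urls") or ""
        # Assumption: several URLs in one cell are separated by `delimiter`.
        for url in (part.strip() for part in cell.split(delimiter)):
            if not url:
                continue
            digest = url_hash(url)
            if digest in seen:
                continue  # already queued, skip the duplicate
            seen.add(digest)
            yield url


# In-memory stand-in for one file from discord_chat_logs/.
sample = io.StringIO(
    "content,attachment_urls\n"
    'hello,"https://cdn.discordapp.com/attachments/1/2/a.png?ex=1,'
    'https://cdn.discordapp.com/attachments/1/2/b.jpg"\n'
    "again,https://cdn.discordapp.com/attachments/1/2/a.png?ex=1\n"
    "text only,\n"
)
unique_urls = list(extract_unique_urls(sample))
print(unique_urls)
```

Because this sketch hashes the whole URL, two links to the same file that differ only in their query string would count as different images. Hashing only the path part of the URL would be one way around that, if the real script does not already do so.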
## Setup

1. Install dependencies:
   ```bash
   ./setup.sh
   ```

   Or manually:
   ```bash
   pip3 install -r requirements.txt
   ```

2. Run the image downloader:
   ```bash
   cd scripts
   python3 image_downloader.py
   ```

## Output

The script creates an `images_dataset/` directory containing:

- `images_dataset.json` - Complete dataset with images in base64 format

### Dataset Structure

```json
{
  "metadata": {
    "created_at": "2025-08-11 12:34:56 UTC",
    "summary": {
      "total_images": 42,
      "channels": ["memes", "general", "nsfw"],
      "total_size_bytes": 1234567,
      "file_extensions": [".png", ".jpg", ".gif"],
      "authors": ["user1", "user2"]
    }
  },
  "images": [
    {
      "url": "https://cdn.discordapp.com/attachments/...",
      "channel": "memes",
      "author_name": "username",
      "author_nickname": "User Nickname",
      "author_id": "123456789",
      "message_id": "987654321",
      "timestamp_utc": "2020-03-11 18:25:49.086000+00:00",
      "content": "Message text content",
      "file_extension": ".png",
      "file_size": 54321,
      "url_hash": "abc123def456",
      "base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
    }
  ]
}
```

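
To get image bytes back out of the dataset, each record's `base64_data` field can be decoded with the standard library. The following is a minimal sketch against the structure shown above; the helper name `decode_images` and the `<url_hash><file_extension>` naming scheme are illustrative assumptions, not part of the script.

```python
import base64
import json


def decode_images(dataset: dict):
    """Yield (suggested_filename, raw_bytes) for every record in the dataset."""
    for record in dataset["images"]:
        # Hypothetical naming scheme: <url_hash><file_extension>
        name = record["url_hash"] + record["file_extension"]
        yield name, base64.b64decode(record["base64_data"])


# Tiny in-memory stand-in for images_dataset/images_dataset.json.
payload = b"\x89PNG\r\n\x1a\n"  # PNG signature only, not a complete image
dataset = json.loads(json.dumps({
    "metadata": {"summary": {"total_images": 1}},
    "images": [{
        "url_hash": "abc123def456",
        "file_extension": ".png",
        "file_size": len(payload),
        "base64_data": base64.b64encode(payload).decode("ascii"),
    }],
}))

decoded = dict(decode_images(dataset))
for name, raw in decoded.items():
    print(name, len(raw), "bytes")
```

For a real run, `dataset` would instead come from `json.load()` on the opened `images_dataset/images_dataset.json` file.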
## Supported Image Formats

- PNG (.png)
- JPEG (.jpg, .jpeg)
- GIF (.gif)
- WebP (.webp)
- BMP (.bmp)
- TIFF (.tiff)

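
Discord CDN links usually carry query parameters after the file name, so an extension check has to look at the URL path only. One way to do that is sketched below; the function name `image_extension` is hypothetical, and the set contents are assumed to match the list above.

```python
import os
from urllib.parse import urlparse

# Assumed to mirror the formats listed above.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp", ".tiff"}


def image_extension(url: str):
    """Return the lowercase extension if the URL path names a supported image, else None."""
    path = urlparse(url).path  # drops the ?query and #fragment parts
    ext = os.path.splitext(path)[1].lower()
    return ext if ext in SUPPORTED_EXTENSIONS else None


print(image_extension("https://cdn.discordapp.com/attachments/1/2/cat.PNG?ex=abc&is=def"))
print(image_extension("https://example.com/notes.txt?preview=.png"))
```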
## Configuration

You can modify the following variables in `image_downloader.py`:

- `MAX_RETRIES` - Number of download retry attempts (default: 3)
- `DELAY_BETWEEN_REQUESTS` - Delay between requests in seconds (default: 0.5)
- `SUPPORTED_EXTENSIONS` - Set of supported image file extensions

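
For orientation, these settings might appear near the top of the script as plain module-level constants, with the delay applied between consecutive downloads. The sketch below is an assumption about layout only: the `throttled` helper is hypothetical, and the contents of `SUPPORTED_EXTENSIONS` are guessed from the supported formats.

```python
import time

MAX_RETRIES = 3               # download attempts per URL (documented default)
DELAY_BETWEEN_REQUESTS = 0.5  # seconds to wait between requests (documented default)
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp", ".tiff"}  # assumed contents


def throttled(items, delay=DELAY_BETWEEN_REQUESTS, sleep=time.sleep):
    """Yield items one by one, pausing `delay` seconds between consecutive items."""
    for index, item in enumerate(items):
        if index:  # no pause before the first item
            sleep(delay)
        yield item


# Record the pauses instead of really sleeping, to show the behavior.
pauses = []
processed = list(throttled(["a.png", "b.png", "c.png"], sleep=pauses.append))
print(processed, pauses)
```

Raising `DELAY_BETWEEN_REQUESTS` makes a run slower but gentler on the CDN.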
## Error Handling

The script handles errors as follows:

- Skips non-image URLs
- Retries failed downloads with exponential backoff
- Validates content types from server responses
- Continues processing even if individual downloads fail
- Logs all activities and errors to the console

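
The retry and validation behavior described above can be sketched as follows. This is an illustration, not the script's actual code: `download_with_retry`, its `fetch` callback, and the one-second base delay are assumptions, and `MAX_RETRIES` is treated here as the total number of attempts.

```python
import time

MAX_RETRIES = 3  # documented default; treated here as total attempts


def download_with_retry(fetch, url, max_retries=MAX_RETRIES, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns image bytes or the attempts run out.

    `fetch` must return a (content_type, body) tuple or raise an exception.
    Returns the body, or None if the URL is not an image or every attempt fails.
    """
    for attempt in range(max_retries):
        try:
            content_type, body = fetch(url)
        except Exception as exc:
            print(f"attempt {attempt + 1}/{max_retries} failed for {url}: {exc}")
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
            continue
        if not content_type.startswith("image/"):
            print(f"skipping {url}: unexpected content type {content_type!r}")
            return None
        return body
    return None  # caller carries on with the next URL


# Demo with a fake fetcher that fails twice and then succeeds.
calls = []

def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated network error")
    return "image/png", b"fake image bytes"

waits = []
result = download_with_retry(flaky_fetch, "https://cdn.discordapp.com/attachments/...", sleep=waits.append)
print(result, waits)
```

With `requests`, a suitable `fetch` could call `requests.get(url, timeout=30)`, then `response.raise_for_status()`, and return `(response.headers.get("Content-Type", ""), response.content)`. Returning `None` rather than raising lets the main loop continue when a single download fails.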