Discord Image Downloader

This script reads Discord chat log CSV files, downloads the images they reference, and stores them as base64 strings in a single JSON dataset.

Features

  • Parses all CSV files in the discord_chat_logs/ directory
  • Extracts attachment URLs from the attachment_urls column
  • Downloads images over HTTP with the Python requests library (wget-like behavior)
  • Converts images to base64 format for easy storage and processing
  • Saves metadata including channel, sender, timestamp, and message context
  • Handles Discord CDN URLs with query parameters
  • Implements retry logic and rate limiting
  • Deduplicates images based on URL hash
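
The extraction and deduplication steps listed above can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper names, the assumption that multiple URLs in one attachment_urls cell are separated by commas, semicolons, or whitespace, and the choice of a truncated SHA-256 digest as the URL hash are all hypothetical.

```python
import csv
import hashlib
import re

# Assumption: several URLs in one cell are separated by commas,
# semicolons, or whitespace. The real export format may differ.
URL_SPLIT = re.compile(r"[,;\s]+")


def url_hash(url: str) -> str:
    """Hypothetical hash: first 12 hex characters of the URL's SHA-256 digest."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]


def unique_attachment_urls(csv_file):
    """Yield each attachment URL once, in first-seen order."""
    seen = set()
    for row in csv.DictReader(csv_file):
        cell = (row.get("attachment_urls") or "").strip()
        for url in URL_SPLIT.split(cell):
            if not url:
                continue  # empty cell or trailing separator
            key = url_hash(url)
            if key not in seen:
                seen.add(key)
                yield url
```

Any open text file (or an io.StringIO) with an attachment_urls header column can be passed to unique_attachment_urls. Note that hashing the full URL treats the same file with different query parameters as two distinct images; whether the script strips the query string before hashing is not stated here.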

Setup

  1. Install dependencies:

    ./setup.sh
    

    Or manually:

    pip3 install -r requirements.txt
    
  2. Run the image downloader:

    cd scripts
    python3 image_downloader.py
    

Output

The script creates an images_dataset/ directory containing:

  • images_dataset.json - Complete dataset with images in base64 format

Dataset Structure

{
  "metadata": {
    "created_at": "2025-08-11 12:34:56 UTC",
    "summary": {
      "total_images": 42,
      "channels": ["memes", "general", "nsfw"],
      "total_size_bytes": 1234567,
      "file_extensions": [".png", ".jpg", ".gif"],
      "authors": ["user1", "user2"]
    }
  },
  "images": [
    {
      "url": "https://cdn.discordapp.com/attachments/...",
      "channel": "memes",
      "author_name": "username",
      "author_nickname": "User Nickname",
      "author_id": "123456789",
      "message_id": "987654321",
      "timestamp_utc": "2020-03-11 18:25:49.086000+00:00",
      "content": "Message text content",
      "file_extension": ".png",
      "file_size": 54321,
      "url_hash": "abc123def456",
      "base64_data": "iVBORw0KGgoAAAANSUhEUgAA..."
    }
  ]
}
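
To turn an entry from the structure above back into an image file, something like the following should work. It is a sketch that relies only on the field names shown in the example (url_hash, file_extension, base64_data); the function names and the output file naming scheme are assumptions made for illustration.

```python
import base64
import json
from pathlib import Path


def load_dataset(path):
    """Read the JSON dataset from disk."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def save_image(entry, out_dir):
    """Decode one entry's base64 payload and write it to out_dir.

    The file name (<url_hash><file_extension>) is a hypothetical choice.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{entry['url_hash']}{entry['file_extension']}"
    target.write_bytes(base64.b64decode(entry["base64_data"]))
    return target
```

For example, `for entry in load_dataset("images_dataset/images_dataset.json")["images"]: save_image(entry, "decoded")` would write every image into a decoded/ directory.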

Supported Image Formats

  • PNG (.png)
  • JPEG (.jpg, .jpeg)
  • GIF (.gif)
  • WebP (.webp)
  • BMP (.bmp)
  • TIFF (.tiff)
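
Because Discord CDN URLs carry query parameters, the extension has to be read from the URL path rather than from the end of the string. The sketch below shows one way to do that with the standard library; the function name is hypothetical, and the set contents are simply the formats listed above, which may not match the script's SUPPORTED_EXTENSIONS exactly.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Assumed to mirror the formats listed above.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".bmp", ".tiff"}


def image_extension(url):
    """Return the lowercased extension if it is supported, else None."""
    # urlsplit separates the path from the query string and fragment.
    suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
    return suffix if suffix in SUPPORTED_EXTENSIONS else None
```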

Configuration

You can modify the following variables in image_downloader.py:

  • MAX_RETRIES - Number of download retry attempts (default: 3)
  • DELAY_BETWEEN_REQUESTS - Delay between requests in seconds (default: 0.5)
  • SUPPORTED_EXTENSIONS - Set of supported image file extensions
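
As an illustration of what DELAY_BETWEEN_REQUESTS controls, the sketch below enforces a minimum gap between consecutive requests. The Throttle class is hypothetical and is not taken from image_downloader.py; the script may simply sleep for a fixed time after each download.

```python
import time

DELAY_BETWEEN_REQUESTS = 0.5  # seconds, the documented default


class Throttle:
    """Hypothetical helper: wait until `delay` seconds have passed since the last call."""

    def __init__(self, delay=DELAY_BETWEEN_REQUESTS,
                 clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock   # injectable so the helper can be tested without waiting
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each download keeps the request rate at or below one request per delay interval, without adding extra waiting when processing between downloads already took longer than the delay.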

Error Handling

The script handles errors as follows:

  • Skips non-image URLs
  • Retries failed downloads with exponential backoff
  • Validates content types from server responses
  • Continues processing even if individual downloads fail
  • Logs all activities and errors to the console
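
The retry, content-type, and keep-going behaviors described above can be sketched like this. It is not the script's implementation: the function name, the base delay, and the doubling schedule are assumptions, and the network call is passed in as a hypothetical `fetch` callable (in practice it could wrap a requests call) so the logic can be exercised without network access.

```python
import time

MAX_RETRIES = 3  # the documented default


def download_with_retries(fetch, url, max_retries=MAX_RETRIES,
                          base_delay=0.5, sleep=time.sleep):
    """Return image bytes, or None if the URL is not an image or all attempts fail.

    `fetch(url)` is a hypothetical callable that returns
    (content_type, body_bytes) and raises an exception on network errors.
    """
    for attempt in range(max_retries):
        try:
            content_type, body = fetch(url)
        except Exception as exc:
            print(f"attempt {attempt + 1}/{max_retries} failed for {url}: {exc}")
            if attempt < max_retries - 1:
                # Exponential backoff: base_delay, then 2x, then 4x, ...
                sleep(base_delay * 2 ** attempt)
            continue
        if not content_type.lower().startswith("image/"):
            # A non-image response will not change on retry, so give up at once.
            print(f"skipping non-image response ({content_type}) for {url}")
            return None
        return body
    return None
```

Returning None instead of raising lets the caller skip the failed URL and continue with the next one.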