Managing Large Numbers of Small Files in S3: Best Practices for Buckets and Prefixes

Teams often misunderstand how Amazon S3 works internally. As a result, they run into slow object listings, 503 Service Unavailable errors, and other performance and operational problems.

Let’s walk through a practical scenario:

  • 600 million small objects in a single S3 bucket (with ongoing growth)
  • Frequent listing and access operations

We’ll explore how to design your object key structure to ensure stable performance and scalability.

Key Concepts

To solve this, we rely on:

  • Sharding
  • Prefix distribution
  • Hash-based key design

How S3 Handles Prefixes

A critical concept:

Amazon S3 is not a file system or a traditional database.
It is a distributed key-value store where the object key is simply a string.

S3 automatically scales by distributing requests across key prefixes (e.g., data/, logs/2024/05/). Each prefix scales independently and supports at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second.
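
For example (hypothetical keys), a prefix is just the leading portion of the key string:

logs/2024/05/app-01.log   # prefix "logs/2024/05/" is part of the key, not a real directory
logs/2024/05/app-02.log   # same prefix, so it shares that prefix's request limits
data/users/u-123.json     # different prefix, scales independently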

Key takeaway

  • The more evenly distributed your prefixes are, the higher your total throughput.
  • If most requests hit the same prefix, you’ll eventually face:
    • Increased latency
    • 503 errors
    • Bottlenecks during bulk operations

Scenario Overview

  • 600 million objects
  • Small object size
  • Continuous growth
  • High read/list activity

Let’s evaluate different design approaches.

1. No Prefix Structure (All Objects in Root)

All objects are stored without any prefix hierarchy.

Problems

  • Effectively a single prefix
  • All traffic hits one partition
  • Quickly reaches throughput limits
  • Leads to 503 errors

Verdict

Worst possible design — avoid entirely

2. One Prefix per Object (Millions of Unique Prefixes)

Each object gets its own unique “folder.”

Advantages

  • Perfect load distribution
  • No hot prefixes

Problems

  • Listing becomes impractical
  • Navigation via console/API is difficult
  • S3 internals (metadata, billing, indexing) aren’t optimized for this
  • High operational complexity

Verdict

Overengineered — not recommended in practice

3. Controlled Sharding (~600 Prefixes)

Split objects across a fixed number of prefixes (~1 million objects per prefix).

Advantages

  • Even load distribution
  • Parallel processing across prefixes
  • Manageable structure
  • Aligns with S3 best practices
  • Scales without redesign

Verdict

Recommended approach

Prefix Naming Strategy

Avoid sequential naming like:

data-1/, data-2/, data-3/

This is especially harmful when object keys are also sequential: every new write lands at the tail of a single key range, creating a hot partition (see the sketch below).
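
A minimal sketch of the difference, using hypothetical keys and a 1-character hex shard:

import hashlib
from collections import Counter

# Sequential layout: keys are written in order, so at any moment every
# new write targets the newest prefix and one partition takes all the load.
sequential = [f"data-{i // 500 + 1}/file-{i:06d}" for i in range(1_000)]
print(Counter(k.split("/", 1)[0] for k in sequential))
# Counter({'data-1': 500, 'data-2': 500}) -- filled strictly one after the other

# Hashed layout: the same objects spread across all 16 hex prefixes at once.
hashed = Counter(hashlib.md5(f"file-{i:06d}".encode()).hexdigest()[:1] for i in range(1_000))
print(hashed)  # roughly 60 objects per prefix, written in no particular order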

Best practice: Use hashing

Algorithm:

  1. Take a unique object identifier (e.g., user_id, file_id)
  2. Compute a hash (MD5, SHA-1, SHA-256)
  3. Use the first N characters as the prefix
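
A minimal sketch of these three steps (the identifier is hypothetical and the digest shown is illustrative):

import hashlib

object_id = "user_12345"                                  # step 1: unique identifier
digest = hashlib.sha256(object_id.encode()).hexdigest()   # step 2: compute a hash
prefix = digest[:3]                                       # step 3: first N hex characters
key = f"{prefix}/{object_id}.json"                        # e.g. "a7f/user_12345.json"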

Why Hex Prefixes?

Hexadecimal (0–9, a–f) is widely used because:

  • 16 possible values per character → strong distribution
  • Hash digests are conventionally rendered as hex strings
  • Easy to implement
  • Predictable scaling

Sharding Depth Options

1 Hex Character → 16 Prefixes

0/, 1/, ..., f/
  • ~37.5M objects per prefix

Use when:

  • Low traffic
  • Prototypes

Downside:

  • Too many objects per prefix

2 Hex Characters → 256 Prefixes

00/, 01/, ..., ff/
  • ~2.34M objects per prefix

Use when:

  • Moderate load (thousands of RPS)
  • Typical production workloads

3 Hex Characters → 4096 Prefixes

000/, 001/, ..., fff/
  • ~146K objects per prefix

Use when:

  • High-load systems
  • Tens of thousands of RPS
  • Critical services

Best default choice for most production systems
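
These per-prefix figures are simply the 600 million objects of our scenario divided across 16^N prefixes, as this quick sanity check shows:

TOTAL_OBJECTS = 600_000_000

for n in (1, 2, 3):
    prefixes = 16 ** n
    print(f"{n} hex char(s): {prefixes} prefixes, "
          f"~{TOTAL_OBJECTS / prefixes:,.0f} objects per prefix")

# 1 hex char(s): 16 prefixes, ~37,500,000 objects per prefix
# 2 hex char(s): 256 prefixes, ~2,343,750 objects per prefix
# 3 hex char(s): 4096 prefixes, ~146,484 objects per prefix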

Example Implementation (Python)

import hashlib

def get_s3_path(object_id: str, prefix_level: int = 3) -> str:
    """
    Generate an S3 object key using hex-based prefix sharding.

    Args:
        object_id: Unique object identifier
        prefix_level: Number of hex characters to shard on (1–4)

    Returns:
        A key of the form "a/b/c/object_id", with one hex character
        of the identifier's MD5 digest per path level.
    """
    # MD5 is sufficient here: we need uniform key distribution,
    # not cryptographic strength.
    hash_hex = hashlib.md5(object_id.encode()).hexdigest()
    prefix = hash_hex[:prefix_level]

    # Create hierarchy: "7f" -> "7/f", so the key becomes a/b/c/object_id
    prefix_path = "/".join(prefix)

    return f"{prefix_path}/{object_id}"


# Examples (actual prefixes depend on each identifier's MD5 digest):
print(get_s3_path("user_12345.pdf", 2))   # e.g. 7/f/user_12345.pdf
print(get_s3_path("image_67890.jpg", 3))  # e.g. a/1/b/image_67890.jpg
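
Note that this implementation nests one hex character per path level (7/f/...), while the depth options above show flat prefixes (7f/). Both derive the shard from the same hash characters and distribute load equally; the nested layout matches the multi-level prefixes recommended below, while the flat layout keeps keys shorter. Pick one convention and apply it consistently.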

Additional Recommendations

  • Consider multiple buckets if storing more than ~500 million objects
  • Always apply key sharding at scale
  • Use multi-level hash prefixes (e.g., f4c/3a/9/...)
  • Avoid monotonically increasing keys at the beginning of object names
  • Never generate prefixes manually — always derive them via hashing
  • Use at least 3 hex characters (~4096 prefixes) for production
  • Avoid full-bucket LIST operations; scope listings with prefixes and delimiters (see the sketch below)
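
As an illustration of prefix-scoped listing, here is a minimal boto3 sketch; the bucket name is hypothetical, and for Cloud4U storage you would also pass your endpoint_url and credentials when creating the client:

import boto3

s3 = boto3.client("s3")  # for Cloud4U, add endpoint_url=... here
BUCKET = "my-sharded-bucket"  # hypothetical bucket name

# Enumerate objects one shard at a time instead of listing the whole
# bucket; the 16 top-level hex shards can also be processed in parallel.
paginator = s3.get_paginator("list_objects_v2")
for shard in "0123456789abcdef":
    for page in paginator.paginate(Bucket=BUCKET, Prefix=f"{shard}/"):
        for obj in page.get("Contents", []):
            print(obj["Key"])

# A delimiter turns the listing into "one level only": it returns the
# immediate sub-prefixes under a prefix rather than every object.
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix="0/", Delimiter="/")
for cp in resp.get("CommonPrefixes", []):
    print(cp["Prefix"])  # e.g. "0/a/", "0/b/", ...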
