Covers designing a highly consistent file storage system, chunk-level synchronization, deduplication strategies, and client sync states.
What is Google Drive / Dropbox?
A cloud file storage and synchronization service lets users upload, edit, and sync files across multiple devices. The key design challenges are minimizing network bandwidth and using storage efficiently.
Core Storage Pattern: Chunking & Deduplication
If we upload entire files on every edit, we waste massive amounts of network bandwidth and disk storage. We solve this using **Chunking** and **Deduplication**:
Figure: Chunking & Deduplication. Files are split into blocks, hashed, checked for duplication, and saved in block storage.
Chunking (Block split): Split a file into smaller, fixed-size blocks (typically 4MB).
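Bandwidth benefit: If a user makes a small in-place change to a 50MB file, only the modified 4MB block needs to be re-uploaded rather than the entire 50MB. With fixed-size blocks this holds for edits that do not shift later bytes; an insertion near the start of the file moves every following block boundary and changes those blocks too.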
Deduplication (Block-level): Calculate the SHA-256 hash of each block and check whether that hash already exists in the database. If it does, skip the upload and reference the existing block.
Storage benefit: If 10 different users upload the same template file, block storage saves its blocks only once. The metadata database simply creates 10 file records that point to the same block hashes.
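The chunking and deduplication steps above can be sketched in a few lines. This is a minimal illustration, not how any real service implements it: `BlockStore`, `put_if_absent`, and `upload_file` are hypothetical names, the store is an in-memory dict standing in for block storage, and the demo uses a 4-byte block size so the effect is visible on a tiny input.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4MB, the block size used in the text


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Split raw file bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class BlockStore:
    """Hypothetical in-memory stand-in for content-addressed block storage."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}  # SHA-256 hex digest -> block bytes
        self.uploads = 0                    # number of blocks actually written

    def put_if_absent(self, block: bytes) -> str:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in self.blocks:       # deduplication check
            self.blocks[digest] = block
            self.uploads += 1
        return digest


def upload_file(store: BlockStore, data: bytes,
                block_size: int = BLOCK_SIZE) -> list[str]:
    """Store any new blocks and return the file's ordered list of block hashes."""
    return [store.put_if_absent(b) for b in split_blocks(data, block_size)]


# Demo with a tiny block size (assumption: 4 bytes instead of 4MB).
store = BlockStore()
v1 = upload_file(store, b"AAAABBBBCCCCDDDD", block_size=4)
v2 = upload_file(store, b"AAAABxBBCCCCDDDD", block_size=4)  # one in-place edit
changed = [i for i, (a, b) in enumerate(zip(v1, v2)) if a != b]
print(store.uploads, changed)  # prints: 5 [1]
```

The first upload writes four blocks; the edited version writes only the one block whose hash changed, and a second user uploading the original bytes would write nothing new.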
Data Structure & Synchronization
A file sync system separates its data into two stores:
Block Storage (e.g. Amazon S3): Stores immutable, content-addressed raw block bytes.
Metadata Database: Stores file records, access control lists, file paths, versions, and the ordered list of block hashes that makes up each file version.
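One way to picture the split between the two stores is the sketch below. The class and field names (`FileRecord`, `FileVersion`, `allowed_users`) are assumptions made for illustration, and a plain dict stands in for block storage; a real metadata database would hold these as tables or documents.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FileVersion:
    version: int
    block_hashes: tuple[str, ...]  # ordered: position i is the i-th block


@dataclass
class FileRecord:
    path: str
    owner: str
    allowed_users: set[str] = field(default_factory=set)    # access list
    versions: list[FileVersion] = field(default_factory=list)

    def latest(self) -> FileVersion:
        return self.versions[-1]


def reassemble(record: FileRecord, block_store: dict[str, bytes]) -> bytes:
    """Rebuild the latest file contents by fetching blocks in order."""
    return b"".join(block_store[h] for h in record.latest().block_hashes)


def sha(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


# Block storage: immutable bytes keyed by content hash.
block_store = {sha(b"hello "): b"hello ", sha(b"world"): b"world"}

# Metadata: two users' files point at the same stored blocks.
hashes = (sha(b"hello "), sha(b"world"))
alice = FileRecord("/alice/note.txt", "alice", {"alice"}, [FileVersion(1, hashes)])
bob = FileRecord("/bob/copy.txt", "bob", {"bob"}, [FileVersion(1, hashes)])

print(reassemble(alice, block_store))  # prints: b'hello world'
```

Because the metadata only holds hashes, adding `bob`'s file costs one small record and no extra block bytes.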
Sync conflict resolution:
If two users edit the same file offline and sync at the same time:
• The first user to sync succeeds, updating the file version from V1 to V2 in the metadata DB.
• The second user's sync fails because their edit is based on V1 but the server is now at V2.
• The conflict service notifies the second user and saves their edits as a local duplicate copy (e.g. report (Conflict).pdf) so they can merge the changes manually.
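The flow above is a form of optimistic concurrency: each sync states which version it was based on, and the server accepts it only if that is still the current version. The sketch below assumes a hypothetical in-memory `MetadataDB` and a hypothetical `sync` helper; in a real system the check-and-update in `commit` would need to be atomic (for example, a transaction or conditional write).

```python
import os


class VersionConflict(Exception):
    pass


class MetadataDB:
    """Hypothetical in-memory metadata store keyed by file path."""

    def __init__(self) -> None:
        self.versions: dict[str, int] = {}
        self.contents: dict[str, list[str]] = {}

    def commit(self, path: str, base_version: int, block_hashes: list[str]) -> int:
        current = self.versions.get(path, 0)
        if base_version != current:  # client edited a stale version
            raise VersionConflict(f"{path}: based on V{base_version}, server at V{current}")
        self.versions[path] = current + 1
        self.contents[path] = list(block_hashes)
        return current + 1


def conflict_copy_name(path: str) -> str:
    root, ext = os.path.splitext(path)
    return f"{root} (Conflict){ext}"


def sync(db: MetadataDB, path: str, base_version: int, block_hashes: list[str]) -> str:
    """Return the path the edits ended up under: the original, or a local conflict copy."""
    try:
        db.commit(path, base_version, block_hashes)
        return path
    except VersionConflict:
        # The client keeps its edits locally under a new name for manual merge.
        return conflict_copy_name(path)


db = MetadataDB()
db.commit("report.pdf", 0, ["h0"])               # file exists at V1
first = sync(db, "report.pdf", 1, ["hA"])        # user A: V1 -> V2
second = sync(db, "report.pdf", 1, ["hB"])       # user B: still based on V1
print(first, "|", second, "|", db.versions["report.pdf"])
# prints: report.pdf | report (Conflict).pdf | 2
```

User A's blocks remain the server's V2; user B's edits are never silently overwritten or dropped.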
Check Your Understanding
1. In a file sync service, what is the main benefit of block-level chunking?
2. How does block-level deduplication optimize disk storage?
3. When a sync conflict occurs because two users updated the same file concurrently, how does Google Drive/Dropbox resolve it?