Encrypted File Storage System – Initial Discovery Document
1. Overview
This document outlines the discovery and design considerations for a secure, encrypted file storage system intended for self-hosting enthusiasts. The system is designed not only to securely store and version files with client-side encryption and an immutable manifest but also to extend into a peer-to-peer (P2P) network for redundancy, backup, and collaborative sharing between trusted nodes.
2. Goals & High-Level Vision
2.1. Security & Privacy
- End-to-End Encryption:
  - Files are encrypted on the client before being uploaded, ensuring that the server or any peer only ever handles ciphertext.
  - Decryption occurs solely on the client, preserving data privacy even in a distributed network.
- Robust Key Management:
  - Encryption keys are derived client-side using strong key derivation functions (e.g., Argon2, scrypt, or PBKDF2).
  - Integration with OS-level or hardware-based key management solutions is considered for enhanced security.
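As a minimal sketch of client-side key derivation, the following uses scrypt from the Python standard library. The function name `derive_key`, the 16-byte salt, and the cost parameters are illustrative assumptions, not values fixed by this design.

```python
import hashlib
import os


def derive_key(passphrase: str, salt: bytes, length: int = 32) -> bytes:
    """Derive a symmetric key from a passphrase with scrypt.

    The cost parameters below are illustrative assumptions; a real
    deployment would tune them to the target hardware.
    """
    return hashlib.scrypt(
        passphrase.encode("utf-8"),
        salt=salt,
        n=2**14,  # CPU/memory cost factor
        r=8,      # block size
        p=1,      # parallelism
        dklen=length,
    )


# The salt is random per user or per vault and is not secret;
# it must be stored so the same key can be derived again later.
salt = os.urandom(16)
key = derive_key("correct horse battery staple", salt)
```

The same passphrase and salt always yield the same key, so only the salt needs to be persisted; the key itself never has to leave the client.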
2.2. Efficient Versioning & Storage
- File Partitioning & Chunking:
  - Files are divided into fixed-size or content-defined chunks (e.g., 1 MB per chunk for fixed-size chunking, or about 1 MB on average for content-defined chunking).
  - Only changed chunks are re-uploaded, which reduces upload times and storage consumption.
- Content Addressable Storage & Deduplication:
  - Each chunk is encrypted separately and indexed by its SHA-256 hash.
  - This method allows for deduplication, reducing redundant uploads across versions and even across files.
- Verifiable Append-Only Manifest:
  - An append-only log (referred to as the manifest) maintains a complete history of chunk operations and file versions.
  - Periodic snapshots (which include a hash of the log state) allow older log entries to be pruned while keeping data integrity verifiable.
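The chunking and content-addressing ideas above can be sketched as follows. `ChunkStore`, `split_fixed`, and `store_file` are hypothetical names, and the store is an in-memory stand-in. Encryption is deliberately left out: in the real system each blob would be ciphertext, and the example assumes identical chunks arrive as identical blobs.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MB, the example size used in the text


def split_fixed(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks; the last chunk may be shorter."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    """In-memory stand-in for a content-addressed blob store (hypothetical)."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def put(self, blob: bytes) -> str:
        """Store a blob under its SHA-256 digest; duplicates are stored once."""
        digest = hashlib.sha256(blob).hexdigest()
        self.blobs.setdefault(digest, blob)
        return digest


def store_file(store: ChunkStore, data: bytes, size: int = CHUNK_SIZE) -> list[str]:
    """Store every chunk of a file and return the ordered list of chunk ids."""
    return [store.put(chunk) for chunk in split_fixed(data, size)]


# Two versions of a file that differ only in the middle chunk
# (tiny 8-byte chunks are used here purely for readability).
store = ChunkStore()
v1 = b"a" * 8 + b"b" * 8 + b"c" * 8
v2 = b"a" * 8 + b"X" * 8 + b"c" * 8
ids_v1 = store_file(store, v1, size=8)
ids_v2 = store_file(store, v2, size=8)
print(len(store.blobs))  # 4 blobs stored for 6 chunk references
```

The second version only adds one new blob; its first and last chunk ids are identical to those of the first version, which is what lets a client skip re-uploading them.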
2.3. P2P Node Network & Redundancy
- Peer-to-Peer (P2P) Connectivity:
  - Users can connect their nodes with those of friends using an IP address plus public key combination for secure identification and mutual authentication.
  - Nodes establish encrypted channels (e.g., via TLS or an authenticated Diffie-Hellman key exchange) to share encrypted file blobs.
- Data Redundancy and Backup:
  - Shared file blobs are replicated across nodes to provide data redundancy and backup.
  - Users can define replication policies, choosing which peers should store copies.
- Dynamic Trust and Access Control:
  - Nodes employ certificate or public-key-based validation mechanisms to grant or restrict access.
  - A decentralized trust framework ensures that only authorized nodes participate in file sharing.
3. Architecture & Component Overview
3.1. Client Application
- User Interaction:
  - Initially, a command-line interface (CLI) is used for file management; a web interface is planned for future development.
- Processing & Encryption:
  - Files are partitioned into chunks, each of which is encrypted individually with an authenticated encryption algorithm (e.g., AES-GCM or ChaCha20-Poly1305).
  - The client maintains an immutable, append-only manifest that records all operations such as additions, modifications, and deletions.
- Synchronization:
  - The manifest is used for multi-device synchronization, allowing devices to merge their change logs into a single consistent history.
  - The manifest incorporates periodic snapshots to prune old entries while keeping a cryptographic chain for integrity verification.
3.2. Server/P2P Node
- Self-Hosting Friendly:
  - Designed to run on user-managed servers with simple configuration.
  - Containerized deployment (e.g., Docker) ensures consistent and easy installation.
- API & Data Handling:
  - The node provides a secure, stateless API (RESTful or gRPC) for interactions with clients.
  - Encrypted blobs (chunks) are stored and indexed using content addressing to support deduplication and versioning.
- P2P Functionality & Node Connectivity:
  - Nodes establish direct secure connections using an IP address and public key identifier.
  - Support for NAT traversal techniques (e.g., UPnP, STUN, TURN) is built in to facilitate peer connections.
  - A decentralized discovery mechanism (potentially via a DHT or bootstrap nodes) helps nodes locate one another.
- Redundancy & Data Synchronization:
  - Each node maintains information in the manifest regarding which peers store which file chunks.
  - Health checks and heartbeat signals keep this replication information up to date and trigger rebalancing across the network.
4. Detailed Design Considerations
4.1. File Partitioning and Deduplication
- Chunking Strategy:
  - Fixed-size chunks provide predictability, while content-defined chunking (e.g., using Rabin fingerprinting) may help reduce unnecessary re-uploads when an insertion or deletion shifts the rest of the file's content.
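- Content Addressing:
  - SHA-256 is used to hash each chunk before storage, allowing duplicate chunks to be recognized and stored only once across nodes.
  - Because chunks are encrypted client-side, deduplication only works if identical plaintext chunks map to the same identifier, which constrains how per-chunk keys and nonces are chosen.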
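To illustrate why content-defined chunking tolerates shifted content, here is a minimal sketch. It uses a Gear-style rolling hash rather than Rabin fingerprinting, purely because it is short to write; the function name `split_cdc` and all size parameters are assumptions chosen for illustration.

```python
import hashlib

# One pseudo-random 64-bit value per possible byte, derived deterministically
# so that every client computes the same chunk boundaries.
GEAR = [
    int.from_bytes(hashlib.sha256(bytes([b])).digest()[:8], "big")
    for b in range(256)
]
MASK64 = (1 << 64) - 1


def split_cdc(
    data: bytes,
    avg_bits: int = 12,     # boundary probability 1 / 2**avg_bits per byte
    min_size: int = 1024,   # never cut a chunk shorter than this
    max_size: int = 16384,  # always cut once a chunk reaches this length
) -> list[bytes]:
    """Split data at positions chosen by a rolling hash of recent bytes."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Older bytes are shifted out of the 64-bit state, so the hash
        # depends only on roughly the last 64 bytes of content.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        at_boundary = length >= min_size and (h >> (64 - avg_bits)) == 0
        if at_boundary or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Because a boundary depends on nearby content rather than on a byte offset, inserting data near the start of a file tends to change only the chunks around the insertion, while later boundaries usually fall at the same content positions as before. With fixed-size chunks, the same insertion would shift every subsequent chunk.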
4.2. Manifest (Append-Only Log)
- Structure & Integrity:
  - The manifest is an append-only log where each entry records a discrete change (addition, deletion, or modification of chunks), along with metadata such as timestamps, device identifiers, and operation details.
  - Cryptographic chaining (each log entry contains a hash of the previous one) makes the history tamper-evident.
- Snapshot Mechanism:
  - Periodic snapshots capture the full state of the manifest up to that point by incorporating an aggregate hash.
  - Future log entries reference the last snapshot, allowing previous data to be pruned while maintaining verifiability.
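The chaining and snapshot ideas can be sketched as below. The entry layout, the `Manifest` class, and its method names are hypothetical; the point is only to show how each entry commits to its predecessor and how a snapshot lets older entries be dropped without breaking verification.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the start of an empty log


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash an entry's payload together with the hash of the previous entry."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class Manifest:
    """Hash-chained append-only log with snapshot-based pruning (sketch)."""

    def __init__(self) -> None:
        self.base_hash = GENESIS  # hash that the first retained entry chains from
        self.entries: list[dict] = []

    def head(self) -> str:
        return self.entries[-1]["hash"] if self.entries else self.base_hash

    def append(self, payload: dict) -> str:
        prev = self.head()
        digest = entry_hash(prev, payload)
        self.entries.append({"prev": prev, "payload": payload, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain and check every link."""
        prev = self.base_hash
        for entry in self.entries:
            if entry["prev"] != prev:
                return False
            if entry["hash"] != entry_hash(prev, entry["payload"]):
                return False
            prev = entry["hash"]
        return True

    def snapshot_and_prune(self, state: dict) -> str:
        """Append a snapshot carrying the full state, then drop older entries."""
        digest = self.append({"op": "snapshot", "state": state})
        snapshot = self.entries[-1]
        self.base_hash = snapshot["prev"]  # commits to the pruned history
        self.entries = [snapshot]
        return digest


manifest = Manifest()
manifest.append({"op": "add", "file": "notes.txt", "chunks": ["c1", "c2"]})
manifest.append({"op": "modify", "file": "notes.txt", "chunks": ["c1", "c3"]})
manifest.snapshot_and_prune({"notes.txt": ["c1", "c3"]})
manifest.append({"op": "delete", "file": "notes.txt"})
```

After pruning, the snapshot's `prev` field still commits to the discarded history, so anyone who recorded an earlier head hash can confirm that the snapshot descends from it. Changing any retained payload makes `verify()` return `False`.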
4.3. Multi-Device & P2P Synchronization
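- Manifest Merging:
  - Devices (including those on separate nodes) merge their manifest logs using ordering methods such as vector clocks, which can detect concurrent updates, or Lamport timestamps, which give a consistent order but cannot by themselves tell concurrent updates from causally ordered ones.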
- Node-to-Node Sharing:
  - Secure node connections based on IP address and public key identifiers facilitate the sharing of encrypted blobs between trusted peers.
  - A decentralized model ensures that file updates, redundancy, and replication policies are distributed and maintained across the network.
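As a small illustration of how vector clocks separate ordered updates from concurrent ones during a manifest merge, consider the sketch below. The dictionary representation (device id mapped to a counter) and the function names are assumptions.

```python
def compare(a: dict[str, int], b: dict[str, int]) -> str:
    """Classify the causal relationship between two vector clocks."""
    devices = set(a) | set(b)
    a_le_b = all(a.get(d, 0) <= b.get(d, 0) for d in devices)
    b_le_a = all(b.get(d, 0) <= a.get(d, 0) for d in devices)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened before b
    if b_le_a:
        return "after"       # a happened after b
    return "concurrent"      # neither saw the other: a conflict to resolve


def merge(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Combine two clocks by taking the per-device maximum."""
    return {d: max(a.get(d, 0), b.get(d, 0)) for d in set(a) | set(b)}


laptop = {"laptop": 3, "phone": 1}
phone = {"laptop": 2, "phone": 3}
relation = compare(laptop, phone)  # "concurrent"
merged = merge(laptop, phone)      # {"laptop": 3, "phone": 3}
```

Only the "concurrent" case needs a conflict resolution rule; "before" and "after" can be applied in order without user involvement.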
4.4. Security and Access Control
- Connection Security:
  - All node-to-node communications are encrypted in transit (e.g., with TLS, or with a channel keyed by an authenticated Diffie-Hellman exchange or an equivalent mechanism).
- Trust & Authentication:
  - Nodes exchange public keys during the initial handshake for mutual authentication.
  - Certificate-based or signed permission systems safeguard against unauthorized access, ensuring only trusted peers participate.
4.5. Scalability, Resilience, and Management
- NAT Traversal & Discovery:
  - Implementing NAT traversal techniques (UPnP, STUN, TURN) and decentralized peer discovery improves connectivity for nodes behind NATs and firewalls.
- Monitoring & Conflict Resolution:
  - Systems to monitor node availability (heartbeats, health checks) are essential for maintaining replication and redundancy.
  - Conflict resolution protocols are implemented in the manifest for consistent state management across devices and nodes.
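The link between heartbeats and replication can be sketched as follows. The data shapes (a `last_seen` timestamp per peer, a `placement` map from chunk id to the peers holding it), the 30-second timeout, and the two-copy minimum are all illustrative assumptions.

```python
def live_peers(last_seen: dict[str, float], now: float, timeout: float = 30.0) -> set[str]:
    """Return the peers whose most recent heartbeat is within the timeout."""
    return {peer for peer, seen in last_seen.items() if now - seen <= timeout}


def under_replicated(
    placement: dict[str, set[str]],
    alive: set[str],
    min_copies: int = 2,
) -> dict[str, int]:
    """Map each chunk id to the number of additional live copies it needs."""
    missing = {}
    for chunk_id, peers in placement.items():
        copies = len(peers & alive)
        if copies < min_copies:
            missing[chunk_id] = min_copies - copies
    return missing


last_seen = {"alice": 100.0, "bob": 95.0, "carol": 40.0}
placement = {
    "c1": {"alice", "bob"},
    "c2": {"bob", "carol"},
    "c3": {"carol"},
}
alive = live_peers(last_seen, now=110.0)       # carol has timed out
needed = under_replicated(placement, alive)    # {"c2": 1, "c3": 2}
```

A node could run a check like this periodically and schedule re-replication of the listed chunks to other trusted peers, subject to the user's replication policy.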
5. Threat Model & Security Audit
- Threat Model Perspective:
  - Protect against threats including unauthorized access, tampering with the manifest, compromised nodes, and interception of communications.
  - Emphasize that all sensitive operations (encryption, key management, and manifest maintenance) are performed client-side or within a trusted node environment.
- Audit & Transparency:
  - An immutable, cryptographically chained manifest offers auditability.
  - The design encourages regular security audits and community reviews to validate the security framework.
6. Roadmap & Future Enhancements
- Initial Development:
  - Build a robust CLI client for file operations, including adding files, updating versions, and restoring files, backed by an append-only manifest.
  - Develop the server/P2P node with secure API endpoints and replication functionality.
- P2P Network Expansion:
  - Implement automated peer discovery, NAT traversal support, and dynamic trust mechanisms.
  - Create user-friendly tools for configuring peer connections and managing replication policies.
- Advanced Features:
  - Extend the system with a web interface.
  - Explore integration with existing distributed systems or version control systems.
  - Consider plugins or additional tools to automate backup management and network health monitoring.
7. Conclusion
This discovery document establishes the fundamental framework for a secure, self-hosted encrypted file storage system that integrates efficient versioning, a verifiable append-only manifest, and a P2P network for file sharing and redundancy. By combining client-side encryption, content-addressable storage, robust manifest management, and secure node connectivity, the product aims to deliver high security, privacy, and resilience in a distributed self-hosting environment.