# Encrypted File Storage System – Initial Discovery Document

## 1. Overview

This document outlines the discovery and design considerations for a secure, encrypted file storage system intended for self-hosting enthusiasts. The system is designed not only to securely store and version files with client-side encryption and an immutable manifest but also to extend into a peer-to-peer (P2P) network for redundancy, backup, and collaborative sharing between trusted nodes.

## 2. Goals & High-Level Vision

### 2.1. Security & Privacy

- **End-to-End Encryption:**
  - Files are encrypted on the client before being uploaded, ensuring that the server or any peer only ever handles ciphertext.
  - Decryption occurs solely on the client, preserving data privacy even in a distributed network.
- **Robust Key Management:**
  - Encryption keys are derived client-side using strong key derivation functions (e.g., Argon2, scrypt, or PBKDF2).
  - Integration with OS-level or hardware-based key management solutions is considered for enhanced security.

### 2.2. Efficient Versioning & Storage

- **File Partitioning & Chunking:**
  - Files are divided into fixed-size or content-defined chunks (e.g., 1 MB per chunk).
  - Only updated chunks are re-uploaded which reduces upload times and storage consumption.
- **Content Addressable Storage & Deduplication:**
  - Each chunk is encrypted separately and indexed by its SHA-256 hash.
  - This method allows for deduplication, reducing redundant uploads across versions and even across files.
- **Verifiable Append-Only Manifest:**
  - An append-only log (referred to as the manifest) maintains a complete history of chunk operations and file versions.
  - Periodic snapshots (that include a hash of the log state) allow for pruning of older logs while ensuring verifiability of data integrity.

### 2.3. P2P Node Network & Redundancy

- **Peer-to-Peer (P2P) Connectivity:**
  - Users can connect their nodes with those of friends using an IP address plus public key combination for secure identification and mutual authentication.
  - Nodes establish encrypted channels (e.g., via TLS or secure Diffie-Hellman exchanges) to share encrypted file blobs.
- **Data Redundancy and Backup:**
  - Shared file blobs are replicated across nodes to provide data redundancy and backup.
  - Users can define replication policies, choosing which peers should store copies.
- **Dynamic Trust and Access Control:**
  - Nodes employ certificate or public-key-based validation mechanisms to grant or restrict access.
  - A decentralized trust framework ensures that only authorized nodes participate in file sharing.

## 3. Architecture & Component Overview

### 3.1. Client Application

- **User Interaction:**
  - Initially a command-line interface (CLI) is used for file management; a web interface is planned for future development.
- **Processing & Encryption:**
  - Files are partitioned into chunks which are each individually encrypted using authenticated encryption algorithms (e.g., AES-GCM or ChaCha20-Poly1305).
  - The client maintains an immutable, append-only manifest that records all operations such as additions, modifications, and deletions.
- **Synchronization:**
  - The manifest is used for multi-device synchronization, allowing devices to merge their change logs seamlessly.
  - The manifest incorporates periodic snapshots to prune old entries while keeping a cryptographic chain for integrity verification.

### 3.2. Server/P2P Node

- **Self-Hosting Friendly:**
  - Designed to run on user-managed servers with simple configuration.
  - Containerized deployment (e.g., Docker) ensures consistent and easy installation.
- **API & Data Handling:**
  - The node provides a secure, stateless API (RESTful or gRPC) for interactions with clients.
  - Encrypted blobs (chunks) are stored and indexed using content addressing to support deduplication and versioning.
- **P2P Functionality & Node Connectivity:**
  - Nodes establish direct secure connections using an IP address and public key identifier.
  - Support for NAT traversal techniques (e.g., UPnP, STUN, TURN) is built in to facilitate peer connections.
  - A decentralized discovery mechanism (potentially via a DHT or bootstrap nodes) helps nodes locate one another.
- **Redundancy & Data Synchronization:**
  - Each node maintains information in the manifest regarding which peers store which file chunks.
  - Health checks and heartbeat signals maintain up-to-date replication and rebalancing across the network.

## 4. Detailed Design Considerations

### 4.1. File Partitioning and Deduplication

- **Chunking Strategy:**
  - Fixed-size chunks provide predictability, while content-defined chunking (e.g., using Rabin fingerprinting) may help reduce unnecessary re-uploads when file content shifts.
- **Content Addressing:**
  - SHA-256 is used to hash each chunk before storage, allowing duplicate chunks to be recognized and only stored once across nodes.

### 4.2. Manifest (Append-Only Log)

- **Structure & Integrity:**
  - The manifest is an append-only log where each entry records a discrete change (addition, deletion, modification of chunks), along with metadata such as timestamps, device identifiers, and operation details.
  - Cryptographic chaining (each log entry containing a hash of the previous one) ensures tamper-evident history.
- **Snapshot Mechanism:**
  - Periodic snapshots capture the full state of the manifest up to that point by incorporating an aggregate hash.
  - Future log entries reference the last snapshot, allowing previous data to be pruned while maintaining verifiability.

### 4.3. Multi-Device & P2P Synchronization

- **Manifest Merging:**
  - Devices (including those on separate nodes) merge their manifest logs using ordering methods such as vector clocks or Lamport timestamps to resolve concurrent updates.
- **Node-to-Node Sharing:**
  - Secure node connections based on IP address and public key identifiers facilitate the sharing of encrypted blobs between trusted peers.
  - A decentralized model ensures that file updates, redundancy, and replication policies are distributed and maintained across the network.

### 4.4. Security and Access Control

- **Connection Security:**
  - All node-to-node communications use strong encryption (TLS, Diffie-Hellman key exchange, or equivalent) to protect data in transit.
- **Trust & Authentication:**
  - Nodes exchange public keys during initial handshake for mutual authentication.
  - Certificate-based or signed permission systems safeguard against unauthorized access, ensuring only trusted peers participate.

### 4.5. Scalability, Resilience, and Management

- **NAT Traversal & Discovery:**
  - Implementation of NAT traversal techniques (UPnP, STUN, TURN) and decentralized peer discovery ensures reliable connectivity, even behind firewalls.
- **Monitoring & Conflict Resolution:**
  - Systems to monitor node availability (heartbeats, health checks) are essential for maintaining replication and redundancy.
  - Conflict resolution protocols are implemented in the manifest for consistent state management across devices and nodes.

## 5. Threat Model & Security Audit

- **Threat Model Perspective:**
  - Protect against potential vulnerabilities including unauthorized access, tampering with the manifest, compromised nodes, and interception of communications.
  - Emphasize that all sensitive operations (encryption, key management, and manifest maintenance) are performed client-side or within a trusted node environment.
- **Audit & Transparency:**
  - An immutable, cryptographically chained manifest offers auditability.
  - The design encourages regular security audits and community reviews to validate the security framework.

## 6. Roadmap & Future Enhancements

- **Initial Development:**
  - Build a robust CLI client for file operations, including adding files, updating versions, and restoring files, backed by an append-only manifest.
  - Develop the server/P2P node with secure API endpoints and replication functionality.
- **P2P Network Expansion:**
  - Implement automated peer discovery, NAT traversal support, and dynamic trust mechanisms.
  - Create user-friendly tools for configuring peer connections and managing replication policies.
- **Advanced Features:**
  - Extend the system with a web interface.
  - Explore integration with existing distributed systems or version control systems.
  - Consider plugins or additional tools to automate backup management and network health monitoring.

## 7. Conclusion

This discovery document establishes the fundamental framework for a secure, self-hosted encrypted file storage system that integrates efficient versioning, a verifiable append-only manifest, and a P2P network for file sharing and redundancy. By combining client-side encryption, content-addressable storage, robust manifest management, and secure node connectivity, the product aims to deliver high security, privacy, and resilience in a distributed self-hosting environment.
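As an illustration of the fixed-size chunking and SHA-256 content addressing described in sections 2.2 and 4.1, the following is a minimal sketch rather than a proposed implementation. The names `split_fixed`, `ChunkStore`, `store_file`, and `restore_file` are hypothetical, the store is an in-memory dictionary standing in for a node's blob storage, and chunk bytes are treated as opaque. Encryption is left out, and this document does not specify whether the hash is taken over plaintext or ciphertext.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MB, the example chunk size mentioned in section 2.2


def split_fixed(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks; the final chunk may be shorter."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    """Hypothetical in-memory content-addressed store, keyed by SHA-256 hex digest."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def put(self, chunk: bytes) -> str:
        """Store a chunk under its digest; an already-known chunk is not stored again."""
        digest = hashlib.sha256(chunk).hexdigest()
        self.blobs.setdefault(digest, chunk)
        return digest

    def get(self, digest: str) -> bytes:
        return self.blobs[digest]


def store_file(store: ChunkStore, data: bytes, size: int = CHUNK_SIZE) -> list[str]:
    """Store every chunk of a file and return the ordered list of chunk addresses."""
    return [store.put(chunk) for chunk in split_fixed(data, size)]


def restore_file(store: ChunkStore, addresses: list[str]) -> bytes:
    """Reassemble a file version from its ordered chunk addresses."""
    return b"".join(store.get(address) for address in addresses)


if __name__ == "__main__":
    store = ChunkStore()
    # A tiny chunk size keeps the example readable.
    v1 = store_file(store, b"AAAABBBBCCCC", size=4)
    v2 = store_file(store, b"AAAAXXXXCCCC", size=4)
    new_chunks = set(v2) - set(v1)
    print(f"unique blobs stored: {len(store.blobs)}, new in v2: {len(new_chunks)}")
```

In this sketch the second version shares two of its three chunks with the first, so only one additional blob is stored. Content-defined chunking would replace `split_fixed` while leaving the store unchanged.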
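The cryptographic chaining from section 4.2, where each log entry contains a hash of the previous one, can be sketched as follows. This is an assumption-laden illustration: the `Manifest` class, the entry layout, the all-zero `GENESIS` anchor, and the JSON serialization are hypothetical choices, not a format defined by this document. The `prune` method only shows how a retained head hash can anchor later entries; a real snapshot would also need to carry the file state that the dropped entries described.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder meaning "no previous entry"


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash an entry together with the previous hash, using a stable JSON encoding."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class Manifest:
    """Hypothetical append-only log in which every entry commits to its predecessor."""

    def __init__(self) -> None:
        self.base = GENESIS          # hash the first retained entry must point to
        self.entries: list[dict] = []

    def head(self) -> str:
        """Hash of the latest entry, or the base anchor if the log is empty."""
        return self.entries[-1]["hash"] if self.entries else self.base

    def append(self, payload: dict) -> dict:
        prev = self.head()
        entry = {"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain from the base anchor; any edited entry breaks it."""
        prev = self.base
        for entry in self.entries:
            if entry["prev"] != prev:
                return False
            if entry["hash"] != entry_hash(entry["prev"], entry["payload"]):
                return False
            prev = entry["hash"]
        return True

    def prune(self) -> None:
        """Drop old entries but keep the head hash as the anchor for future ones."""
        self.base = self.head()
        self.entries = []


if __name__ == "__main__":
    manifest = Manifest()
    manifest.append({"op": "add", "path": "a.txt", "chunks": ["c1"], "device": "laptop"})
    manifest.append({"op": "modify", "path": "a.txt", "chunks": ["c2"], "device": "laptop"})
    print("intact chain verifies:", manifest.verify())
    manifest.entries[0]["payload"]["chunks"] = ["tampered"]
    print("tampered chain verifies:", manifest.verify())
```

Because each hash covers the previous hash, changing any earlier entry invalidates verification from that entry onward, which is what makes the history tamper-evident.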
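Section 4.3 names vector clocks or Lamport timestamps as candidate ordering methods for manifest merging. The sketch below uses Lamport timestamps only, with the device identifier as a tie-breaker so that every device arrives at the same total order. `Op` and `DeviceLog` are hypothetical names. The sketch orders operations deterministically but does not decide how conflicting operations on the same file should be resolved, which this document leaves open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    counter: int  # Lamport timestamp assigned by the originating device
    device: str   # device identifier, used to break ties between equal counters
    action: str   # free-form description of the change


class DeviceLog:
    """Hypothetical per-device change log with a Lamport clock."""

    def __init__(self, device: str) -> None:
        self.device = device
        self.clock = 0
        self.ops: set[Op] = set()

    def record(self, action: str) -> Op:
        """Record a local change, advancing the clock first."""
        self.clock += 1
        op = Op(self.clock, self.device, action)
        self.ops.add(op)
        return op

    def merge_from(self, other: "DeviceLog") -> None:
        """Take over the other log's operations and catch up with its timestamps."""
        for op in other.ops:
            self.clock = max(self.clock, op.counter)
        self.ops |= other.ops

    def ordered(self) -> list[Op]:
        """Deterministic total order: by counter, then by device identifier."""
        return sorted(self.ops, key=lambda op: (op.counter, op.device))


if __name__ == "__main__":
    laptop, phone = DeviceLog("laptop"), DeviceLog("phone")
    laptop.record("add a.txt")
    phone.record("add b.txt")
    phone.record("modify b.txt")
    laptop.merge_from(phone)
    laptop.record("delete b.txt")  # timestamped after everything the laptop has seen
    phone.merge_from(laptop)
    for op in phone.ordered():
        print(op.counter, op.device, op.action)
```

Lamport timestamps cannot distinguish concurrent updates from causally ordered ones; vector clocks can, at the cost of one counter per device.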