dsfx/docs/discovery.md
2025-03-21 16:42:01 -04:00

# Encrypted File Storage System: Initial Discovery Document
## 1. Overview
This document outlines the discovery and design considerations for a secure, encrypted file storage system intended for self-hosting enthusiasts. The system is designed not only to securely store and version files with client-side encryption and an immutable manifest but also to extend into a peer-to-peer (P2P) network for redundancy, backup, and collaborative sharing between trusted nodes.
## 2. Goals & High-Level Vision
### 2.1. Security & Privacy
- **End-to-End Encryption:**
  - Files are encrypted on the client before being uploaded, ensuring that the server or any peer only ever handles ciphertext.
  - Decryption occurs solely on the client, preserving data privacy even in a distributed network.
- **Robust Key Management:**
  - Encryption keys are derived client-side using strong key derivation functions (e.g., Argon2, scrypt, or PBKDF2).
  - Integration with OS-level or hardware-based key management solutions is considered for enhanced security.
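
As a minimal sketch of client-side key derivation, the example below uses PBKDF2-HMAC-SHA256 from the Python standard library. The function name `derive_key`, the iteration count, and the 16-byte salt are assumptions made for illustration; the design does not fix a KDF or its parameters, and Argon2 or scrypt could be substituted.

```python
import hashlib
import os


def derive_key(passphrase: str, salt: bytes, iterations: int = 600_000) -> bytes:
    """Derive a 32-byte key from a passphrase using PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), salt, iterations, dklen=32
    )


# The salt is random and can be stored in the clear next to the data;
# only the passphrase has to stay secret.
salt = os.urandom(16)
key = derive_key("example passphrase", salt)
```

The same passphrase and salt always yield the same key, so a second device can re-derive the key as long as it knows the salt.
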
### 2.2. Efficient Versioning & Storage
- **File Partitioning & Chunking:**
  - Files are divided into chunks, either fixed-size (e.g., 1 MB per chunk) or content-defined.
  - Only updated chunks are re-uploaded, which reduces upload times and storage consumption.
- **Content Addressable Storage & Deduplication:**
  - Each chunk is encrypted separately and indexed by its SHA-256 hash.
  - This method allows for deduplication, reducing redundant uploads across versions and even across files.
- **Verifiable Append-Only Manifest:**
  - An append-only log (referred to as the manifest) maintains a complete history of chunk operations and file versions.
  - Periodic snapshots (that include a hash of the log state) allow older log entries to be pruned while keeping data integrity verifiable.
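
The idea of re-uploading only updated chunks can be sketched as follows, assuming fixed-size chunks identified by their SHA-256 hash. The helper names (`split_fixed`, `chunk_ids`, `chunks_to_upload`) are hypothetical, and encryption is left out to keep the example short.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MiB; an assumed default based on the example size above


def split_fixed(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into consecutive chunks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def chunk_ids(chunks: list[bytes]) -> list[str]:
    """Identify each chunk by the hex SHA-256 digest of its bytes."""
    return [hashlib.sha256(c).hexdigest() for c in chunks]


def chunks_to_upload(old_ids: list[str], new_ids: list[str]) -> list[int]:
    """Return indices of new-version chunks whose id is not already stored."""
    known = set(old_ids)
    return [i for i, cid in enumerate(new_ids) if cid not in known]


# Tiny chunk size so the effect is visible: only the middle chunk changes.
v1 = b"a" * 10 + b"b" * 10 + b"c" * 10
v2 = b"a" * 10 + b"X" * 10 + b"c" * 10
old_ids = chunk_ids(split_fixed(v1, size=10))
new_ids = chunk_ids(split_fixed(v2, size=10))
print(chunks_to_upload(old_ids, new_ids))  # [1]
```
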
### 2.3. P2P Node Network & Redundancy
- **Peer-to-Peer (P2P) Connectivity:**
  - Users can connect their nodes with those of friends using an IP address plus public key combination for secure identification and mutual authentication.
  - Nodes establish encrypted channels (e.g., via TLS or secure Diffie-Hellman exchanges) to share encrypted file blobs.
- **Data Redundancy and Backup:**
  - Shared file blobs are replicated across nodes to provide data redundancy and backup.
  - Users can define replication policies, choosing which peers should store copies.
- **Dynamic Trust and Access Control:**
  - Nodes employ certificate or public-key-based validation mechanisms to grant or restrict access.
  - A decentralized trust framework ensures that only authorized nodes participate in file sharing.
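
One possible shape for a user-defined replication policy is sketched below. The `ReplicationPolicy` type, its fields, and the `under_replicated` helper are assumptions for illustration only; the design does not yet specify how policies are expressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicationPolicy:
    min_copies: int                # desired number of copies per chunk
    allowed_peers: frozenset[str]  # peers the user has chosen to hold copies


def under_replicated(
    locations: dict[str, set[str]], policy: ReplicationPolicy
) -> dict[str, int]:
    """Map chunk id -> number of extra copies needed to satisfy the policy.

    Copies held by peers outside `allowed_peers` are not counted.
    """
    missing = {}
    for chunk_id, peers in locations.items():
        copies = len(peers & policy.allowed_peers)
        if copies < policy.min_copies:
            missing[chunk_id] = policy.min_copies - copies
    return missing


policy = ReplicationPolicy(min_copies=2, allowed_peers=frozenset({"alice", "bob", "carol"}))
locations = {"c1": {"alice", "bob"}, "c2": {"alice"}, "c3": {"mallory"}}
print(under_replicated(locations, policy))  # {'c2': 1, 'c3': 2}
```
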
## 3. Architecture & Component Overview
### 3.1. Client Application
- **User Interaction:**
  - Initially, a command-line interface (CLI) is used for file management; a web interface is planned for future development.
- **Processing & Encryption:**
  - Files are partitioned into chunks, each of which is encrypted individually using an authenticated encryption algorithm (e.g., AES-GCM or ChaCha20-Poly1305).
  - The client maintains an immutable, append-only manifest that records all operations such as additions, modifications, and deletions.
- **Synchronization:**
  - The manifest is used for multi-device synchronization, allowing devices to merge their change logs.
  - The manifest incorporates periodic snapshots to prune old entries while keeping a cryptographic chain for integrity verification.
### 3.2. Server/P2P Node
- **Self-Hosting Friendly:**
  - Designed to run on user-managed servers with simple configuration.
  - Containerized deployment (e.g., Docker) ensures consistent and easy installation.
- **API & Data Handling:**
  - The node provides a secure, stateless API (RESTful or gRPC) for interactions with clients.
  - Encrypted blobs (chunks) are stored and indexed using content addressing to support deduplication and versioning.
- **P2P Functionality & Node Connectivity:**
  - Nodes establish direct secure connections using an IP address and public key identifier.
  - Support for NAT traversal techniques (e.g., UPnP, STUN, TURN) is built in to facilitate peer connections.
  - A decentralized discovery mechanism (potentially via a DHT or bootstrap nodes) helps nodes locate one another.
- **Redundancy & Data Synchronization:**
  - Each node maintains information in the manifest regarding which peers store which file chunks.
  - Health checks and heartbeat signals keep replication state up to date and trigger rebalancing across the network.
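
The content-addressed blob handling on the node can be sketched as an in-memory store, shown below. This is an illustration under stated assumptions: the `BlobStore` class is hypothetical, the id is taken to be the SHA-256 of the bytes exactly as stored (the encrypted blob), and a real node would persist blobs to disk.

```python
import hashlib


class BlobStore:
    """In-memory store that indexes opaque blobs by their SHA-256 digest."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, blob: bytes) -> tuple[str, bool]:
        """Store a blob and return (blob_id, newly_stored)."""
        blob_id = hashlib.sha256(blob).hexdigest()
        if blob_id in self._blobs:
            return blob_id, False  # duplicate: nothing new is written
        self._blobs[blob_id] = blob
        return blob_id, True

    def get(self, blob_id: str) -> bytes:
        """Return a blob, re-checking that it still matches its id."""
        blob = self._blobs[blob_id]  # raises KeyError for unknown ids
        if hashlib.sha256(blob).hexdigest() != blob_id:
            raise ValueError("stored blob does not match its id")
        return blob


store = BlobStore()
first_id, first_new = store.put(b"encrypted-chunk-bytes")
second_id, second_new = store.put(b"encrypted-chunk-bytes")
```

Here the second `put` returns the same id with `newly_stored` set to `False`, which is the deduplication behavior described above.
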
## 4. Detailed Design Considerations
### 4.1. File Partitioning and Deduplication
- **Chunking Strategy:**
  - Fixed-size chunks provide predictability, while content-defined chunking (e.g., using Rabin fingerprinting) may help reduce unnecessary re-uploads when file content shifts.
- **Content Addressing:**
  - SHA-256 is used to hash each chunk before storage, allowing duplicate chunks to be recognized and only stored once across nodes.
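
To illustrate why content-defined chunking tolerates shifted content, here is a small sketch. It uses a gear-style rolling hash as a simpler stand-in for Rabin fingerprinting, so it is not the algorithm named above. The function name `split_cdc`, the lookup table construction, and the size parameters are all assumptions chosen to keep the example small.

```python
import hashlib

# One pseudo-random 32-bit value per byte value, derived deterministically.
GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:4], "big") for b in range(256)]


def split_cdc(data: bytes, min_size: int = 64, avg_bits: int = 8,
              max_size: int = 1024) -> list[bytes]:
    """Cut data where the rolling hash's top `avg_bits` bits are all zero."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Each byte's contribution is shifted out of the 32-bit state after
        # 32 steps, so the hash effectively covers the last 32 bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        at_boundary = length >= min_size and (h >> (32 - avg_bits)) == 0
        if at_boundary or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


# Deterministic pseudo-random test data (20,000 bytes).
data = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(625))
original = split_cdc(data)
shifted = split_cdc(b"!" + data)  # one byte inserted at the front
shared = set(original) & set(shifted)
```

Because boundaries depend on local content rather than absolute offsets, the two chunk lists typically realign after the first boundary, and most chunks in `shifted` are already present in `original`. With fixed-size chunks, the same one-byte insertion would change every chunk.
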
### 4.2. Manifest (Append-Only Log)
- **Structure & Integrity:**
  - The manifest is an append-only log where each entry records a discrete change (addition, deletion, modification of chunks), along with metadata such as timestamps, device identifiers, and operation details.
  - Cryptographic chaining (each log entry containing a hash of the previous one) makes the history tamper-evident.
- **Snapshot Mechanism:**
  - Periodic snapshots capture the full state of the manifest up to that point by incorporating an aggregate hash.
  - Future log entries reference the last snapshot, allowing previous data to be pruned while maintaining verifiability.
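
The chaining and snapshot ideas can be sketched together as follows. The entry layout, the JSON serialization, and the function names are assumptions for illustration. In particular, the snapshot here is reduced to the hash of the last pruned entry, which commits to everything before it; a real snapshot would also carry the materialized file state.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the very first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash an entry's payload together with the previous entry's hash."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(log: list[dict], payload: dict) -> None:
    """Append an entry that is chained to the current head of the log."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)})


def verify(log: list[dict], anchor: str = GENESIS) -> bool:
    """Check that every entry chains back to `anchor` without gaps or edits."""
    prev = anchor
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


def snapshot_and_prune(log: list[dict], keep_from: int) -> tuple[str, list[dict]]:
    """Drop entries before `keep_from`; return (snapshot_hash, remaining_entries)."""
    snapshot = log[keep_from - 1]["hash"] if keep_from > 0 else GENESIS
    return snapshot, log[keep_from:]


log: list[dict] = []
append(log, {"op": "add", "chunk": "c1"})
append(log, {"op": "add", "chunk": "c2"})
append(log, {"op": "delete", "chunk": "c1"})
snapshot, rest = snapshot_and_prune(log, keep_from=2)
```

After pruning, `verify(rest, anchor=snapshot)` still succeeds, while changing any retained payload or verifying against the wrong anchor fails.
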
### 4.3. Multi-Device & P2P Synchronization
- **Manifest Merging:**
  - Devices (including those on separate nodes) merge their manifest logs using ordering methods such as vector clocks or Lamport timestamps to order entries and handle concurrent updates.
- **Node-to-Node Sharing:**
  - Secure node connections based on IP address and public key identifiers facilitate the sharing of encrypted blobs between trusted peers.
  - A decentralized model ensures that file updates, redundancy, and replication policies are distributed and maintained across the network.
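
A minimal sketch of manifest merging with Lamport timestamps follows. The `Op` and `Device` types are hypothetical, and breaking ties by device name is an assumption; it yields a deterministic total order but does not by itself detect conflicts, for which vector clocks would be the better fit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    lamport: int  # Lamport timestamp assigned by the recording device
    device: str   # device identifier, also used as a tie-breaker
    action: str


class Device:
    def __init__(self, name: str) -> None:
        self.name = name
        self.clock = 0
        self.log: list[Op] = []

    def record(self, action: str) -> Op:
        """Record a local operation with the next Lamport timestamp."""
        self.clock += 1
        op = Op(self.clock, self.name, action)
        self.log.append(op)
        return op

    def receive(self, ops: list[Op]) -> None:
        """Merge remote operations and advance the clock past what was seen."""
        known = set(self.log)
        for op in ops:
            self.clock = max(self.clock, op.lamport)
            if op not in known:
                self.log.append(op)
                known.add(op)


def merged_order(*logs: list[Op]) -> list[Op]:
    """Deduplicate and sort by (timestamp, device) for a deterministic order."""
    unique = {op for log in logs for op in log}
    return sorted(unique, key=lambda op: (op.lamport, op.device))


laptop, phone = Device("laptop"), Device("phone")
laptop.record("add a.txt")     # timestamp 1 on laptop
phone.record("add b.txt")      # timestamp 1 on phone (concurrent)
phone.receive(laptop.log)      # phone's clock is now at least 1
phone.record("modify a.txt")   # timestamp 2, ordered after both adds
history = merged_order(laptop.log, phone.log)
```

Every device that merges the same set of operations arrives at the same `history`, regardless of the order in which logs were exchanged.
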
### 4.4. Security and Access Control
- **Connection Security:**
  - All node-to-node communications use strong encryption (TLS, Diffie-Hellman key exchange, or equivalent) to protect data in transit.
- **Trust & Authentication:**
  - Nodes exchange public keys during the initial handshake for mutual authentication.
  - Certificate-based or signed permission systems safeguard against unauthorized access, ensuring only trusted peers participate.
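
One way to bind a peer's address to its public key is to pin a fingerprint of the key, sketched below. The `TrustStore` class, the address format, and the placeholder key bytes are assumptions for illustration. The check only confirms that the presented key is the pinned one; proving that the peer holds the matching private key is left to the handshake (for example TLS or a signed challenge).

```python
import hashlib
import hmac


def fingerprint(public_key: bytes) -> str:
    """Hex SHA-256 digest of the raw public key bytes."""
    return hashlib.sha256(public_key).hexdigest()


class TrustStore:
    """Maps a peer address to the fingerprint of its pinned public key."""

    def __init__(self) -> None:
        self._pinned: dict[str, str] = {}

    def trust(self, address: str, public_key: bytes) -> None:
        self._pinned[address] = fingerprint(public_key)

    def is_trusted(self, address: str, presented_key: bytes) -> bool:
        expected = self._pinned.get(address)
        if expected is None:
            return False  # unknown peers are rejected by default
        # Constant-time comparison of the two fingerprints.
        return hmac.compare_digest(expected, fingerprint(presented_key))


trusted = TrustStore()
friend_key = b"placeholder-public-key-bytes"  # stands in for a real public key
trusted.trust("192.0.2.10:7000", friend_key)  # documentation-range address
```
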
### 4.5. Scalability, Resilience, and Management
- **NAT Traversal & Discovery:**
  - Implementation of NAT traversal techniques (UPnP, STUN, TURN) and decentralized peer discovery helps peers connect even when they sit behind NATs and firewalls.
- **Monitoring & Conflict Resolution:**
  - Systems to monitor node availability (heartbeats, health checks) are essential for maintaining replication and redundancy.
  - Conflict resolution protocols are implemented in the manifest for consistent state management across devices and nodes.
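
Heartbeat-based availability monitoring can be sketched with a small tracker like the one below. The `HeartbeatTracker` class and the 30-second timeout are assumptions; the clock is injectable so the behavior can be demonstrated without waiting.

```python
import time
from typing import Callable


class HeartbeatTracker:
    """Tracks when each peer was last heard from."""

    def __init__(self, timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout
        self._clock = clock
        self._last_seen: dict[str, float] = {}

    def beat(self, peer: str) -> None:
        """Record a heartbeat from `peer` at the current time."""
        self._last_seen[peer] = self._clock()

    def alive(self) -> set[str]:
        """Peers whose last heartbeat is within the timeout."""
        now = self._clock()
        return {p for p, t in self._last_seen.items() if now - t <= self.timeout}

    def stale(self) -> set[str]:
        """Peers that have been silent for longer than the timeout."""
        return set(self._last_seen) - self.alive()


# Simulated clock: a one-element list that the lambda reads on each call.
now = [0.0]
tracker = HeartbeatTracker(timeout=30.0, clock=lambda: now[0])
tracker.beat("alice")  # heard at t=0
now[0] = 20.0
tracker.beat("bob")    # heard at t=20
now[0] = 40.0
print(tracker.alive(), tracker.stale())  # {'bob'} {'alice'}
```

A node could feed the `stale()` set into its replication logic to decide which chunks need additional copies.
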
## 5. Threat Model & Security Audit
- **Threat Model Perspective:**
  - The design aims to protect against threats including unauthorized access, tampering with the manifest, compromised nodes, and interception of communications.
  - All sensitive operations (encryption, key management, and manifest maintenance) are performed client-side or within a trusted node environment.
- **Audit & Transparency:**
  - An immutable, cryptographically chained manifest offers auditability.
  - The design encourages regular security audits and community reviews to validate the security framework.
## 6. Roadmap & Future Enhancements
- **Initial Development:**
  - Build a robust CLI client for file operations, including adding files, updating versions, and restoring files, backed by an append-only manifest.
  - Develop the server/P2P node with secure API endpoints and replication functionality.
- **P2P Network Expansion:**
  - Implement automated peer discovery, NAT traversal support, and dynamic trust mechanisms.
  - Create user-friendly tools for configuring peer connections and managing replication policies.
- **Advanced Features:**
  - Extend the system with a web interface.
  - Explore integration with existing distributed systems or version control systems.
  - Consider plugins or additional tools to automate backup management and network health monitoring.
## 7. Conclusion
This discovery document establishes the fundamental framework for a secure, self-hosted encrypted file storage system that integrates efficient versioning, a verifiable append-only manifest, and a P2P network for file sharing and redundancy. By combining client-side encryption, content-addressable storage, robust manifest management, and secure node connectivity, the product aims to deliver high security, privacy, and resilience in a distributed self-hosting environment.