dsfx/docs/discovery.md
2025-03-21 16:42:01 -04:00


Encrypted File Storage System Initial Discovery Document

1. Overview

This document outlines the discovery and design considerations for a secure, encrypted file storage system intended for self-hosting enthusiasts. The system is designed not only to securely store and version files with client-side encryption and an immutable manifest but also to extend into a peer-to-peer (P2P) network for redundancy, backup, and collaborative sharing between trusted nodes.

2. Goals & High-Level Vision

2.1. Security & Privacy

  • End-to-End Encryption:

    • Files are encrypted on the client before being uploaded, ensuring that the server or any peer only ever handles ciphertext.
    • Decryption occurs solely on the client, preserving data privacy even in a distributed network.
  • Robust Key Management:

    • Encryption keys are derived client-side using strong key derivation functions (e.g., Argon2, scrypt, or PBKDF2).
    • Integration with OS-level or hardware-based key management solutions is considered for enhanced security.
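
As an illustration of client-side key derivation, here is a minimal sketch using PBKDF2, one of the functions named above, from Python's standard library. The iteration count, salt size, key length, and function names are assumptions chosen for the example, not decisions made by this document.

```python
import hashlib
import os

# Assumed parameters for illustration; the design does not fix them.
PBKDF2_ITERATIONS = 600_000
SALT_BYTES = 16
KEY_BYTES = 32  # 256-bit key


def new_salt() -> bytes:
    """Return a fresh random salt; it is stored with the data and is not secret."""
    return os.urandom(SALT_BYTES)


def derive_key(passphrase: str, salt: bytes, iterations: int = PBKDF2_ITERATIONS) -> bytes:
    """Derive a fixed-length key from a passphrase with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), salt, iterations, dklen=KEY_BYTES
    )
```

The same passphrase and salt always yield the same key, so only the salt needs to be persisted. A memory-hard function such as Argon2 or scrypt could be substituted behind the same interface.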

2.2. Efficient Versioning & Storage

  • File Partitioning & Chunking:

    • Files are divided into fixed-size chunks (e.g., 1 MB each) or content-defined chunks.
    • Only updated chunks are re-uploaded, which reduces upload times and storage consumption.
  • Content Addressable Storage & Deduplication:

    • Each chunk is encrypted separately and indexed by its SHA-256 hash.
    • This method allows for deduplication, reducing redundant uploads across versions and even across files.
  • Verifiable Append-Only Manifest:

    • An append-only log (referred to as the manifest) maintains a complete history of chunk operations and file versions.
    • Periodic snapshots (that include a hash of the log state) allow for pruning of older logs while ensuring verifiability of data integrity.
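
The "only updated chunks are re-uploaded" idea can be sketched with fixed-size chunks as follows. The function names are hypothetical, and the sketch hashes whatever bytes it is given; whether the hash covers plaintext or ciphertext is left open here.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MB, the example size given above


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split data into fixed-size chunks and return the SHA-256 hex digest of each."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def chunks_to_upload(old: list[str], new: list[str]) -> list[int]:
    """Return indexes of chunks in the new version whose hash is not already stored."""
    known = set(old)
    return [i for i, digest in enumerate(new) if digest not in known]
```

For a three-chunk file in which only the middle chunk changes, `chunks_to_upload` returns `[1]`, so a single chunk is transferred.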

2.3. P2P Node Network & Redundancy

  • Peer-to-Peer (P2P) Connectivity:

    • Users can connect their nodes with those of friends using an IP address plus public key combination for secure identification and mutual authentication.
    • Nodes establish encrypted channels (e.g., via TLS or an authenticated Diffie-Hellman key exchange) to share encrypted file blobs.
  • Data Redundancy and Backup:

    • Shared file blobs are replicated across nodes to provide data redundancy and backup.
    • Users can define replication policies, choosing which peers should store copies.
  • Dynamic Trust and Access Control:

    • Nodes employ certificate or public-key-based validation mechanisms to grant or restrict access.
    • A decentralized trust framework ensures that only authorized nodes participate in file sharing.
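
One possible shape for the "IP address plus public key" identity is an allowlist keyed by public-key fingerprint, sketched below. The `Peer` and `TrustStore` names and the example address are hypothetical. The check only matches a presented key against known peers; proving possession of the matching private key is assumed to happen in the connection handshake.

```python
import hashlib
from dataclasses import dataclass, field


def fingerprint(public_key: bytes) -> str:
    """Comparable identifier for a peer: SHA-256 of its public key bytes."""
    return hashlib.sha256(public_key).hexdigest()


@dataclass(frozen=True)
class Peer:
    address: str       # e.g. "203.0.113.7:4433" (hypothetical)
    public_key: bytes  # public key bytes in whatever encoding the node uses


@dataclass
class TrustStore:
    # Maps fingerprint -> Peer for every peer the user has approved.
    trusted: dict[str, Peer] = field(default_factory=dict)

    def add(self, peer: Peer) -> None:
        self.trusted[fingerprint(peer.public_key)] = peer

    def is_trusted(self, presented_key: bytes) -> bool:
        """Accept a connection only if the presented key belongs to a known peer."""
        return fingerprint(presented_key) in self.trusted
```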

3. Architecture & Component Overview

3.1. Client Application

  • User Interaction:

    • Initially, a command-line interface (CLI) is used for file management; a web interface is planned for future development.
  • Processing & Encryption:

    • Files are partitioned into chunks, each of which is encrypted individually with an authenticated encryption algorithm (e.g., AES-GCM or ChaCha20-Poly1305).
    • The client maintains an immutable, append-only manifest that records all operations such as additions, modifications, and deletions.
  • Synchronization:

    • The manifest is used for multi-device synchronization, allowing devices to merge their change logs seamlessly.
    • The manifest incorporates periodic snapshots to prune old entries while keeping a cryptographic chain for integrity verification.

3.2. Server/P2P Node

  • Self-Hosting Friendly:

    • Designed to run on user-managed servers with simple configuration.
    • Containerized deployment (e.g., Docker) ensures consistent and easy installation.
  • API & Data Handling:

    • The node provides a secure, stateless API (RESTful or gRPC) for interactions with clients.
    • Encrypted blobs (chunks) are stored and indexed using content addressing to support deduplication and versioning.
  • P2P Functionality & Node Connectivity:

    • Nodes establish direct secure connections using an IP address and public key identifier.
    • Support for NAT traversal techniques (e.g., UPnP, STUN, TURN) is built in to facilitate peer connections.
    • A decentralized discovery mechanism (potentially via a DHT or bootstrap nodes) helps nodes locate one another.
  • Redundancy & Data Synchronization:

    • Each node maintains information in the manifest regarding which peers store which file chunks.
    • Health checks and heartbeat signals keep replication state current and trigger rebalancing across the network.
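
A rough sketch of how a node might combine the chunk-location records with heartbeats to find chunks that need re-replication. The class name, the 30-second timeout, and the two-replica minimum are assumptions for illustration only.

```python
import time
from collections import defaultdict
from typing import Optional


class ReplicaTracker:
    def __init__(self, heartbeat_timeout: float = 30.0, min_replicas: int = 2) -> None:
        self.heartbeat_timeout = heartbeat_timeout
        self.min_replicas = min_replicas
        self.last_seen: dict[str, float] = {}                    # peer id -> last heartbeat time
        self.holders: dict[str, set[str]] = defaultdict(set)     # chunk hash -> peer ids

    def heartbeat(self, peer_id: str, now: Optional[float] = None) -> None:
        self.last_seen[peer_id] = time.monotonic() if now is None else now

    def record_chunk(self, chunk_hash: str, peer_id: str) -> None:
        self.holders[chunk_hash].add(peer_id)

    def live_peers(self, now: Optional[float] = None) -> set[str]:
        now = time.monotonic() if now is None else now
        return {p for p, seen in self.last_seen.items()
                if now - seen <= self.heartbeat_timeout}

    def under_replicated(self, now: Optional[float] = None) -> list[str]:
        """Chunks held by fewer live peers than the replication policy requires."""
        live = self.live_peers(now)
        return sorted(c for c, peers in self.holders.items()
                      if len(peers & live) < self.min_replicas)
```

The explicit `now` parameter exists only to make the logic easy to test; a running node would rely on the monotonic clock.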

4. Detailed Design Considerations

4.1. File Partitioning and Deduplication

  • Chunking Strategy:

    • Fixed-size chunks provide predictability, while content-defined chunking (e.g., using Rabin fingerprinting) may help reduce unnecessary re-uploads when file content shifts.
  • Content Addressing:

    • SHA-256 is used to hash each chunk before storage, allowing duplicate chunks to be recognized and only stored once across nodes.
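
To illustrate why content-defined chunking survives shifted content, the sketch below cuts chunks wherever a rolling hash of the most recent 64 bytes matches a bit pattern, then stores chunks under their SHA-256 digest. Note that it uses a simple Gear-style rolling hash rather than Rabin fingerprinting, and it omits the minimum and maximum chunk sizes a production chunker would enforce. All names and parameters are hypothetical.

```python
import hashlib

# 256 pseudo-random 64-bit values, one per byte value, derived deterministically.
GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:8], "big") for b in range(256)]
MASK64 = (1 << 64) - 1


def cdc_chunks(data: bytes, avg_bits: int = 8) -> list[bytes]:
    """Split data at content-defined boundaries, roughly every 2**avg_bits bytes."""
    boundary_mask = (1 << avg_bits) - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Each left shift ages older bytes out of the 64-bit state, so the
        # hash depends only on the most recent 64 bytes.
        h = ((h << 1) + GEAR[byte]) & MASK64
        if ((h >> 32) & boundary_mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


class ChunkStore:
    """Content-addressed store: identical chunks share one SHA-256 key."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def put(self, chunk: bytes) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        self.blobs.setdefault(digest, chunk)  # no-op if the chunk is already stored
        return digest
```

Because boundaries depend only on nearby bytes, inserting a byte at the start of a file changes the first chunk or two and leaves later chunks, and therefore their hashes, intact. With fixed-size chunks the same insertion would shift every chunk.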

4.2. Manifest (Append-Only Log)

  • Structure & Integrity:

    • The manifest is an append-only log in which each entry records a discrete change (addition, deletion, or modification of chunks), along with metadata such as timestamps, device identifiers, and operation details.
    • Cryptographic chaining (each log entry containing a hash of the previous one) ensures tamper-evident history.
  • Snapshot Mechanism:

    • Periodic snapshots capture the full state of the manifest at a point in time and include an aggregate hash of the log up to that point.
    • Future log entries reference the last snapshot, allowing previous data to be pruned while maintaining verifiability.
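
The chaining and snapshot ideas above can be sketched as a small hash-chained log. The entry layout, the JSON serialization, and the all-zero genesis value are assumptions made for this example. The snapshot here records only the head hash; materializing the full file state that a real snapshot would carry is omitted.

```python
import hashlib
import json
from dataclasses import dataclass

GENESIS = "0" * 64  # placeholder "previous hash" for the very first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Entry:
    prev_hash: str
    payload: dict
    hash: str


class Manifest:
    def __init__(self, base_hash: str = GENESIS) -> None:
        self.base_hash = base_hash  # genesis, or the hash recorded by the last snapshot
        self.entries: list[Entry] = []

    def head(self) -> str:
        return self.entries[-1].hash if self.entries else self.base_hash

    def append(self, payload: dict) -> Entry:
        prev = self.head()
        entry = Entry(prev, payload, entry_hash(prev, payload))
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain from the base hash; any edited entry breaks it."""
        prev = self.base_hash
        for e in self.entries:
            if e.prev_hash != prev or e.hash != entry_hash(prev, e.payload):
                return False
            prev = e.hash
        return True

    def snapshot_and_prune(self) -> str:
        """Record the current head as the new base and drop the entries before it."""
        self.base_hash = self.head()
        self.entries.clear()
        return self.base_hash
```

After pruning, new entries chain from the snapshot hash, so a verifier that trusts the snapshot can still check everything appended since.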

4.3. Multi-Device & P2P Synchronization

  • Manifest Merging:

    • Devices (including those on separate nodes) merge their manifest logs using ordering methods such as vector clocks or Lamport timestamps to resolve concurrent updates.
  • Node-to-Node Sharing:

    • Secure node connections based on IP address and public key identifiers facilitate the sharing of encrypted blobs between trusted peers.
    • A decentralized model ensures that file updates, redundancy, and replication policies are distributed and maintained across the network.
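
As one way to picture manifest merging, the following sketch orders operations with Lamport timestamps and breaks ties by device identifier, so every device arrives at the same sequence. The `Op` and `Device` types and the action strings are hypothetical; vector clocks would additionally detect which operations were truly concurrent.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Op:
    lamport: int    # logical clock value
    device_id: str  # tie-breaker when two devices use the same clock value
    action: str     # e.g. "add x" (hypothetical operation format)


class Device:
    def __init__(self, device_id: str) -> None:
        self.device_id = device_id
        self.clock = 0
        self.log: list[Op] = []

    def record(self, action: str) -> Op:
        """Create a local operation stamped with the next clock value."""
        self.clock += 1
        op = Op(self.clock, self.device_id, action)
        self.log.append(op)
        return op

    def merge(self, remote_log: list[Op]) -> None:
        """Union both logs, sort deterministically, and advance the local clock."""
        self.log = sorted(set(self.log) | set(remote_log))
        self.clock = max([self.clock] + [op.lamport for op in remote_log])
```

Sorting by `(lamport, device_id)` gives a total order that is identical on every device, which is what lets two logs converge after exchanging entries in either direction.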

4.4. Security and Access Control

  • Connection Security:

    • All node-to-node communications are encrypted in transit over channels established with TLS, an authenticated Diffie-Hellman key exchange, or an equivalent protocol.
  • Trust & Authentication:

    • Nodes exchange public keys during initial handshake for mutual authentication.
    • Certificate-based or signed permission systems safeguard against unauthorized access, ensuring only trusted peers participate.
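
If TLS is the chosen transport, mutual authentication could be configured roughly as below using Python's standard `ssl` module. The function name and file-path parameters are hypothetical, and the TLS 1.2 floor is an assumption; the document does not specify a minimum version.

```python
import ssl
from typing import Optional


def make_node_server_context(
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
    trusted_peers_ca: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a server-side TLS context that requires a certificate from the peer."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.verify_mode = ssl.CERT_REQUIRED  # reject peers that present no valid certificate
    if cert_file and key_file:
        # This node's own identity, presented to connecting peers.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if trusted_peers_ca:
        # Certificates (or a CA) that trusted peers must chain to.
        ctx.load_verify_locations(cafile=trusted_peers_ca)
    return ctx
```

With `CERT_REQUIRED` on the listening side, a peer that cannot present a certificate chaining to the trusted set fails the handshake before any blob is exchanged.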

4.5. Scalability, Resilience, and Management

  • NAT Traversal & Discovery:

    • Implementation of NAT traversal techniques (UPnP, STUN, TURN) and decentralized peer discovery ensures reliable connectivity, even behind firewalls.
  • Monitoring & Conflict Resolution:

    • Systems to monitor node availability (heartbeats, health checks) are essential for maintaining replication and redundancy.
    • Conflict resolution protocols are implemented in the manifest for consistent state management across devices and nodes.

5. Threat Model & Security Audit

  • Threat Model Perspective:

    • The system must protect against threats including unauthorized access, tampering with the manifest, compromised nodes, and interception of communications.
    • All sensitive operations (encryption, key management, and manifest maintenance) are performed client-side or within a trusted node environment.
  • Audit & Transparency:

    • An immutable, cryptographically chained manifest offers auditability.
    • The design encourages regular security audits and community reviews to validate the security framework.

6. Roadmap & Future Enhancements

  • Initial Development:

    • Build a robust CLI client for file operations, including adding files, updating versions, and restoring files, backed by an append-only manifest.
    • Develop the server/P2P node with secure API endpoints and replication functionality.
  • P2P Network Expansion:

    • Implement automated peer discovery, NAT traversal support, and dynamic trust mechanisms.
    • Create user-friendly tools for configuring peer connections and managing replication policies.
  • Advanced Features:

    • Extend the system with a web interface.
    • Explore integration with existing distributed systems or version control systems.
    • Consider plugins or additional tools to automate backup management and network health monitoring.

7. Conclusion

This discovery document establishes the fundamental framework for a secure, self-hosted encrypted file storage system that integrates efficient versioning, a verifiable append-only manifest, and a P2P network for file sharing and redundancy. By combining client-side encryption, content-addressable storage, robust manifest management, and secure node connectivity, the product aims to deliver high security, privacy, and resilience in a distributed self-hosting environment.