The File Format
===============

The file format is block oriented, similar to common image formats (e.g. PNG). It uses a magic header, and blocks with
a four-byte identifier and a 64-bit size field. Blocks can have a static size, or use a "chunked format" where the size is
determined by following a sequence of smaller "data-chunks".

This modular framework would allow blocks of different types to be added and mixed freely, yet this format is very
strict. Not only is the number of blocks defined, the exact sequence of blocks is also fixed and must not be changed.
A strict format speeds up verification and processing of the data. It also reduces the risk of attacks using
manipulated files.

Overall Structure
-----------------

- 8 bytes with magic ``0xfe``, ``FFE``, ``0x0d``, ``0x0a``, ``0x1a``, ``0x0a``
- n blocks, each in one of the two formats described below.

Static Blocks
^^^^^^^^^^^^^

Static blocks are the default. They allow data to be read from and written to random-access media efficiently:

- 4 bytes with the block type.
- 8 bytes with the size of the block. Big-endian, unsigned, 64-bit. A zero value indicates an empty block. No data bytes
  follow such a block. The size value has to be less than 0xffff_0000_0000_0000.
- n bytes with the data of the block.

Size values equal to or greater than 0xffff_0000_0000_0000 are reserved for extensions of the format and are therefore
invalid as regular size values.
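
The static block layout can be sketched in a few lines of Python. This is a minimal illustration, not a reference
implementation: the helper names ``read_exact``, ``write_static_block`` and ``read_static_block`` are made up for this
example, and reading a whole block into memory is only reasonable for the small blocks.

.. code-block:: python

    import io
    import struct

    MAGIC = b"\xfeFFE\r\n\x1a\n"  # the 8 magic bytes listed under "Overall Structure"
    RESERVED_SIZES_START = 0xFFFF_0000_0000_0000
    BLOCK_HEADER = struct.Struct(">4sQ")  # 4-byte type + big-endian unsigned 64-bit size


    def read_exact(stream, count):
        """Read exactly ``count`` bytes, or raise ValueError on a short read."""
        data = stream.read(count)
        if len(data) != count:
            raise ValueError("unexpected end of file")
        return data


    def write_static_block(stream, block_type, data):
        """Write the block type, the size and the data of one static block."""
        stream.write(BLOCK_HEADER.pack(block_type, len(data)))
        stream.write(data)


    def read_static_block(stream):
        """Read one static block and return ``(block_type, data)``."""
        block_type, size = BLOCK_HEADER.unpack(read_exact(stream, BLOCK_HEADER.size))
        if size >= RESERVED_SIZES_START:
            raise ValueError("reserved size value, not a static block")
        return block_type, read_exact(stream, size)


    # Round trip through an in-memory file.
    buffer = io.BytesIO()
    buffer.write(MAGIC)
    write_static_block(buffer, b"CONF", b"example")
    write_static_block(buffer, b"META", b"")  # empty block: size 0, no data bytes
    buffer.seek(0)
    assert read_exact(buffer, len(MAGIC)) == MAGIC
    print(read_static_block(buffer))
    print(read_static_block(buffer))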

Chunked Blocks
^^^^^^^^^^^^^^

To allow encrypting large streams, there is an alternative chunked data format. This format uses the value
``0xffff800000000000`` as the size of the block, to indicate the data size is defined by chunks. The data of the block
consists of a sequence of chunks, each made up of a 16-bit size followed by that number of bytes. A size value
of zero indicates the end of the sequence.

This format is currently only allowed for the ``DATA`` block.

- 4 bytes with the block type. (``DATA``)
- 8 bytes indicating the stream format. Value == ``0xffff800000000000``
- n data chunks with the following format:

  - 2 bytes with the size of the data chunk. Big-endian, unsigned, 16-bit. Value 0 means: end of stream.
  - n bytes as specified with the data chunk size.

All chunks except the last one *shall* use the maximum size of 0xffff bytes. The decryptor *may* check if this is the
case and reject files with arbitrary chunk sizes.
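
As an illustration of the chunked layout, the following sketch writes and reads such a block. The function names are
hypothetical, and the writer assumes that ``source.read(n)`` returns ``n`` bytes unless the end of the input is reached
(true for ``io.BytesIO`` and buffered files). The reader applies the optional check on chunk sizes.

.. code-block:: python

    import io
    import struct

    CHUNKED_SIZE_MARKER = 0xFFFF_8000_0000_0000
    MAX_CHUNK_SIZE = 0xFFFF


    def write_chunked_block(stream, block_type, source):
        """Copy ``source`` (a binary file object) into ``stream`` as a chunked block."""
        stream.write(struct.pack(">4sQ", block_type, CHUNKED_SIZE_MARKER))
        while True:
            chunk = source.read(MAX_CHUNK_SIZE)
            if not chunk:
                break
            stream.write(struct.pack(">H", len(chunk)))
            stream.write(chunk)
        stream.write(struct.pack(">H", 0))  # size 0 marks the end of the stream


    def read_chunks(stream):
        """Yield the chunks of a chunked block; the 12 header bytes are already read."""
        previous_size = MAX_CHUNK_SIZE
        while True:
            size_field = stream.read(2)
            if len(size_field) != 2:
                raise ValueError("unexpected end of file")
            (size,) = struct.unpack(">H", size_field)
            if size == 0:
                return
            if previous_size != MAX_CHUNK_SIZE:
                raise ValueError("only the last chunk may be shorter than 0xffff bytes")
            chunk = stream.read(size)
            if len(chunk) != size:
                raise ValueError("unexpected end of file")
            previous_size = size
            yield chunk


    data = bytes(range(256)) * 600  # 153600 bytes, needs three chunks
    buffer = io.BytesIO()
    write_chunked_block(buffer, b"DATA", io.BytesIO(data))
    buffer.seek(12)  # skip the block type and the size marker
    print([len(chunk) for chunk in read_chunks(buffer)])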

Block Types and Correct Order
-----------------------------

``CONF`` Encryption configuration (maximum size 128 bytes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

``k:RSA-4096,e:AES-256,b:CBC,h:SHA3-512,v:1``

Comma-separated fields "<key>:<value>". The result must be ASCII encoded and only contain the following limited range of
characters (regular expression): ``[-_,:A-Za-z0-9]+``

- ``k`` key algorithm
- ``e`` encryption algorithm
- ``b`` block cipher mode
- ``h`` hash algorithm
- ``v`` file format version

There is no need to allow other algorithms. If the configuration does not exactly match the string shown above, it is
safe to stop decoding with an error.
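
A decoder could validate the block along these lines. The sketch is only an illustration: ``parse_conf`` is a
hypothetical helper, and the character check is redundant once the exact comparison is made, but it is kept to show
both rules.

.. code-block:: python

    import re

    EXPECTED_CONF = "k:RSA-4096,e:AES-256,b:CBC,h:SHA3-512,v:1"
    CONF_PATTERN = re.compile(r"[-_,:A-Za-z0-9]+")
    CONF_MAX_SIZE = 128


    def parse_conf(raw):
        """Validate the raw ``CONF`` block and return its fields as a dict."""
        if len(raw) > CONF_MAX_SIZE:
            raise ValueError("CONF block is too large")
        text = raw.decode("ascii")  # raises UnicodeDecodeError, a ValueError subclass
        if not CONF_PATTERN.fullmatch(text):
            raise ValueError("CONF block contains invalid characters")
        if text != EXPECTED_CONF:
            raise ValueError("unsupported encryption configuration")
        return dict(field.split(":", 1) for field in text.split(","))


    print(parse_conf(b"k:RSA-4096,e:AES-256,b:CBC,h:SHA3-512,v:1"))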

This string is part of the file format to describe the algorithms used in a form that can be read and understood by
humans. It shall allow the data to be reconstructed if knowledge of this format is lost, but the keys and files are
still available.

``EPUB`` Encryption Key Hash (maximum size 1k)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The hash of the public key used to encrypt the symmetric key. This hash is generated by applying SHA3-512 to the
DER-encoded public key in SubjectPublicKeyInfo format.

This block allows a quick verification of which key was used to encrypt the file, without trial and error. The hash itself
is required for the decryption process.
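
The lookup this block enables might look like the sketch below. The helper names and the key material are
placeholders. With the ``cryptography`` package, the DER bytes would typically come from
``public_key.public_bytes(Encoding.DER, PublicFormat.SubjectPublicKeyInfo)``.

.. code-block:: python

    import hashlib


    def public_key_hash(der_spki):
        """SHA3-512 over the DER-encoded SubjectPublicKeyInfo bytes (64 bytes)."""
        return hashlib.sha3_512(der_spki).digest()


    def find_matching_key(epub_block, known_keys):
        """Return the name of the key whose hash equals the ``EPUB`` block, or None.

        ``known_keys`` maps a key name to its DER-encoded public key bytes.
        """
        for name, der_spki in known_keys.items():
            if public_key_hash(der_spki) == epub_block:
                return name
        return None


    # Placeholder byte strings stand in for real DER-encoded keys.
    known_keys = {"backup-2023": b"placeholder key A", "backup-2024": b"placeholder key B"}
    epub_block = public_key_hash(b"placeholder key B")
    print(find_matching_key(epub_block, known_keys))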

``ESYM`` Encrypted Symmetric Key (maximum size 1k)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For RSA-4096: The symmetric key is encrypted using OAEP padding, with the given hash size and no label.

This block holds the key used for the symmetric encryption of the data. The 32 bytes of the AES-256 key are encrypted
using the public RSA key. This is done using the *optimal asymmetric encryption padding* for RSA, also called RSA-OAEP
and standardized in PKCS#1 and RFC 3447. The hash used in this encryption is SHA-256, the mask generation function is
MGF1.

.. code-block:: python

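    import os
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric.padding import MGF1, OAEP

    AES_KEY_LENGTH_BYTES = 32  # AES-256 uses a 256-bit key

    # `public_key` is the recipient's RSA public key, loaded elsewhere.
    encryption_key = os.urandom(AES_KEY_LENGTH_BYTES)
    encrypted_encryption_key = public_key.encrypt(
        encryption_key,
        OAEP(
            mgf=MGF1(algorithm=hashes.SHA256()),
            algorithm=hashes.SHA256(),
            label=None))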

All encrypted blocks in the file are encrypted using this AES-256 key, but with different initialization vectors. See
section "Encrypted Data Block Format" for all details.

``META`` Encrypted metadata (maximum size 10k)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The block with the encrypted metadata. If no metadata is given, this block and the hash block are empty.

See section "Metadata Format" for all details.

``MDHA`` Hash for metadata (maximum size 1k)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The hash of the decrypted metadata. The hash is stored encrypted.

``DATA`` Encrypted file data (no maximum size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The encrypted file data.

For empty files with a size of zero, this block and the hash block are empty.

``DTHA`` Hash for the decrypted data.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The hash of the decrypted file data. The hash is stored encrypted.

``ENDH`` = End of file with hash
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Marks the end of the file and contains a SHA3-512 hash of the whole file up to this block, excluding the bytes of the
block type. The bytes following the block type are therefore always 0,0,0,0,0,0,0,0x40 (the big-endian size of a 64-byte
block), followed by the 64-byte hash.

In order to quickly check the integrity of a file, you can create a digest of the file data up to
``file_size - 76 bytes``, then skip 12 bytes, read the next 64 and compare the digest.
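
The quick check described above can be sketched as follows. ``quick_integrity_check`` is a hypothetical helper, not
part of any existing tool. It only compares the stored hash and does not validate the individual blocks.

.. code-block:: python

    import hashlib

    END_BLOCK_LENGTH = 4 + 8 + 64  # block type + size field + SHA3-512 hash = 76 bytes


    def quick_integrity_check(stream, file_size):
        """Return True if the hash stored at the end matches the preceding bytes."""
        if file_size < END_BLOCK_LENGTH:
            return False
        digest = hashlib.sha3_512()
        remaining = file_size - END_BLOCK_LENGTH
        while remaining > 0:
            part = stream.read(min(remaining, 64 * 1024))
            if not part:
                return False  # the file is shorter than announced
            digest.update(part)
            remaining -= len(part)
        stream.read(12)  # skip the block type and the size field
        return stream.read(64) == digest.digest()

For a file on disk, the function would be called with a file object opened in binary mode and the value of
``os.path.getsize(path)``.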

Block Order
^^^^^^^^^^^

The blocks must appear in the order shown above.

Encrypted Data Block Format
---------------------------

Static Blocks
^^^^^^^^^^^^^

The data format in an encrypted data block:

- 8 bytes, big-endian, unsigned, with the size of the decrypted data.

  - If the encrypted file is empty, the *block* is empty and this size is not given.
  - If this value is greater than zero, it is the size of the decrypted data in bytes.

- 16 bytes (for AES-256/CBC) with the IV for the encryption.
- The encrypted data, aligned to the cipher block size.

If the encrypted data is empty, this block is empty.
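
Splitting such a block into its parts could look like this. The sketch covers the layout only and leaves out the
actual AES decryption. ``split_encrypted_block`` is a hypothetical helper, and rejecting a stated size larger than
the ciphertext is an assumption of this example.

.. code-block:: python

    import struct

    SIZE_FIELD_LENGTH = 8
    IV_LENGTH = 16  # IV size for AES-256/CBC
    CIPHER_BLOCK_SIZE = 16


    def split_encrypted_block(payload):
        """Split a static encrypted block into ``(plain_size, iv, ciphertext)``.

        Returns None for an empty block, which stands for empty data.
        """
        if not payload:
            return None
        if len(payload) < SIZE_FIELD_LENGTH + IV_LENGTH:
            raise ValueError("encrypted block is too short")
        (plain_size,) = struct.unpack(">Q", payload[:SIZE_FIELD_LENGTH])
        iv = payload[SIZE_FIELD_LENGTH:SIZE_FIELD_LENGTH + IV_LENGTH]
        ciphertext = payload[SIZE_FIELD_LENGTH + IV_LENGTH:]
        if len(ciphertext) % CIPHER_BLOCK_SIZE != 0:
            raise ValueError("ciphertext is not aligned to the cipher block size")
        if plain_size > len(ciphertext):
            raise ValueError("stated size exceeds the encrypted data")
        return plain_size, iv, ciphertext


    # Dummy payload: size 20, an example IV and two cipher blocks of zero bytes.
    payload = struct.pack(">Q", 20) + bytes(range(16)) + bytes(32)
    plain_size, iv, ciphertext = split_encrypted_block(payload)
    print(plain_size, iv.hex(), len(ciphertext))

After decryption, a decoder would presumably cut the result down to ``plain_size`` bytes to drop the alignment bytes.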

Streamed Blocks
^^^^^^^^^^^^^^^

As the size of the encrypted data is not known in advance, the encryption format for streamed blocks is different.

- 16 bytes (for AES-256/CBC) with the IV for the encryption.
- The encrypted data, padded with ISO/IEC 9797-1 padding method 2.

This padding appends the byte 0x80 and then fills the remainder of the last cipher block with zero bytes.
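
A minimal sketch of this padding scheme, assuming the 16-byte block size of AES. The function names are chosen for
this example only.

.. code-block:: python

    BLOCK_SIZE = 16  # AES block size in bytes


    def pad_iso9797_m2(data, block_size=BLOCK_SIZE):
        """Append 0x80, then zero bytes up to the next block boundary."""
        padded = data + b"\x80"
        return padded + b"\x00" * (-len(padded) % block_size)


    def unpad_iso9797_m2(padded, block_size=BLOCK_SIZE):
        """Remove the padding again; raise ValueError if it is malformed."""
        if not padded or len(padded) % block_size != 0:
            raise ValueError("padded data is not aligned to the block size")
        stripped = padded.rstrip(b"\x00")
        zero_count = len(padded) - len(stripped)
        if not stripped.endswith(b"\x80") or zero_count >= block_size:
            raise ValueError("invalid padding")
        return stripped[:-1]


    padded = pad_iso9797_m2(b"abc")
    print(padded.hex())
    print(unpad_iso9797_m2(padded))

Input that already fills a whole block still receives padding, so the padded data grows by one full block in that
case.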


Metadata Format
---------------

- Each file can contain a custom block with metadata.
- The metadata is stored as a UTF-8 encoded block in JSON format.
- The JSON block must be stored in compact form, without pretty-printing.
- The JSON block must encode a top level object like this:

.. code-block:: json

    {
      "attribute1": "data1",
      "attribute2": "data2",
      "attribute3": "data3"
    }

- The top level must be an object with attributes (no top-level list, etc.), but the format of the attribute values is
  user-defined and can contain lists and nested objects.
- The size of the encrypted metadata must not exceed 100k.
- Field names only consist of lowercase letters and the underscore character. They must be shorter than 64 characters.
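
The rules above could be enforced with a small helper like the one below. It is a sketch under assumptions: the
function names are hypothetical, the name check is applied to top-level fields only, and the size limit is left to
the caller.

.. code-block:: python

    import json
    import re

    # Lowercase letters and underscore only, shorter than 64 characters.
    FIELD_NAME = re.compile(r"[a-z_]{1,63}")


    def check_metadata(metadata):
        """Raise ValueError unless ``metadata`` is an object with valid field names."""
        if not isinstance(metadata, dict):
            raise ValueError("metadata must be a top-level object")
        for name in metadata:
            if not isinstance(name, str) or not FIELD_NAME.fullmatch(name):
                raise ValueError(f"invalid field name: {name!r}")


    def encode_metadata(metadata):
        """Return the compact, UTF-8 encoded JSON form of ``metadata``."""
        check_metadata(metadata)
        text = json.dumps(metadata, separators=(",", ":"), ensure_ascii=False)
        return text.encode("utf-8")


    def decode_metadata(raw):
        """Parse and validate a decrypted metadata block."""
        metadata = json.loads(raw.decode("utf-8"))
        check_metadata(metadata)
        return metadata


    print(encode_metadata({"file_name": "report.txt", "file_size": 1234}))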

Predefined Metadata Fields
--------------------------

- ``file_path`` The original absolute path of the encrypted file.
- ``file_name`` The original filename of the encrypted file.
- ``file_size`` The original size of the encrypted file.
- ``created`` The original creation date/time in UTC, in ISO format (yyyy-mm-ddThh:mm:ss)
- ``modified`` The original modification date/time in UTC, in ISO format (yyyy-mm-ddThh:mm:ss)
- ``mime_type`` The MIME type of the file contents.
- ``version`` A version of the file, free format.
- ``encryptor`` The name of the application which encrypted the file.

Error Handling on Decoding
--------------------------

If there is a problem, decoding shall simply stop. Do not try to recover from the problem.

- If the file is smaller than 256 bytes, it is invalid, stop decoding.
- If an unknown block type field is read, stop decoding.
- If an expected block is missing, stop decoding.
- If the size of a block exceeds the specified maximum size, stop decoding.
- If the ``CONF`` field does not exactly match the encryption specification, stop decoding.
- If the hash does not match the decoded data, stop decoding.
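
The checks on block type, order and size can be expressed as one table-driven function. The sketch below is an
illustration with several assumptions: "1k" is read as 1024 bytes, ``META`` uses the 10k limit from its section
heading, and blocks without a stated limit are not size-checked. A real decoder would run these checks while reading
and stop at the first failure.

.. code-block:: python

    CHUNKED_SIZE_MARKER = 0xFFFF_8000_0000_0000
    RESERVED_SIZES_START = 0xFFFF_0000_0000_0000

    # (block type, maximum size in bytes); None means no maximum is stated.
    BLOCK_ORDER = [
        (b"CONF", 128),
        (b"EPUB", 1024),
        (b"ESYM", 1024),
        (b"META", 10 * 1024),
        (b"MDHA", 1024),
        (b"DATA", None),
        (b"DTHA", None),
        (b"ENDH", 64),
    ]


    def check_block_headers(headers):
        """Validate ``(block_type, size)`` pairs; raise ValueError on any problem."""
        headers = list(headers)
        if len(headers) != len(BLOCK_ORDER):
            raise ValueError("missing or additional block")
        for (block_type, size), (expected_type, max_size) in zip(headers, BLOCK_ORDER):
            if block_type != expected_type:
                raise ValueError(f"expected {expected_type!r}, found {block_type!r}")
            if size == CHUNKED_SIZE_MARKER and block_type == b"DATA":
                continue  # the chunked format is only allowed for DATA
            if size >= RESERVED_SIZES_START:
                raise ValueError("reserved size value")
            if max_size is not None and size > max_size:
                raise ValueError(f"block {block_type!r} exceeds {max_size} bytes")


    headers = [
        (b"CONF", 41), (b"EPUB", 64), (b"ESYM", 512), (b"META", 0),
        (b"MDHA", 0), (b"DATA", CHUNKED_SIZE_MARKER), (b"DTHA", 80), (b"ENDH", 64),
    ]
    check_block_headers(headers)  # passes silently for a valid sequence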