The goal of the proposed framework is to develop a database model capable of storing encrypted records and applying relational algebra primitives on it without the knowledge of any cryptographic keys or the need for decryption. A trade-off between performance and security is desirable, however we completely discard deterministic encryption whenever possible for security reasons. The only exception are contexts with unique records, which avoid by definition weaknesses intrinsic to deterministic encryption. The applicability of this framework goes beyond SQL databases. Besides the relational algebra hereby used to describe the framework, it can be extended to key-value, document-oriented, full text and several other databases classes that keep the same attribute structure.
The three main operations needed to build a useful database are insertion, selection and update. Once data is loaded, being able to select only those pieces that correspond to an arbitrary predicate is the basic block to construct more complex operations, such as grouping and equality joins. This functionality is fundamental when there is a physical separation between the database and the data owner, otherwise high demand for bandwidth is incurred to transmit large fractions of the database records. Furthermore, real data is frequently mutable and thus the database must support updates to remain useful.
We define as secure a system model that guarantees that the data owner is the only entity capable of revealing data, which can be achieved by his exclusive possession of the cryptographic keys. Thus, a fundamental aspect of our proposal is the scenario in which the database and the application server handle data with minimum knowledge.
Lastly, the framework does not ensure integrity, freshness or completeness of results returned to the application or the user, since an adversary that compromises the database in some way can delete or alter records. We consider this threat to be outside the scope of this framework.
5.1 Classes of attributes
Records in an encrypted database are composed by attributes. These consist of a name and a value, that can be an integer, float, string or even a binary blob. Values of attributes are classified according to their purpose:
-
An immutable value only used for storage. It is not expected to be evaluated with any function, so there is no special requirement for its encryption.
-
Used for building a single or multivalued searchable index. It should enable one to verify if an arbitrary term is contained in a set without the need to acquire knowledge of its content.
The implementation of each attribute must satisfy the requirements without leaking any vital information beyond those related directly with the attribute objective (e.g.: order for index attributes). Since the name of an attribute reveals information, it may need to be protected as well. However, the acknowledgement of an attribute is done using its name, so even anonymous attributes must be traceable in a query. An option for anonymizing the attribute name is to treat it as an index.
The aforementioned cryptosystems are natural suggestions to be applied within these classes. Since static is a class for storage only, which has no other requirements, any scheme with appropriate security level and performance may be used, as AES. On the other hand, index and computable attributes are immediate applications of ORE and HE schemes. Particularly, the latter defines the HE scheme according to the required operations. Attributes that require only one operation can be implemented with a PHE scheme, which provides good performance; while those that require arbitrary additions and multiplications must use FHE and deal with the performance issues.
Definition 4
(Secure ORE) Let E and C be, respectively, an encryption and a comparison function. The pair (E,C) forms an encryption scheme with the order-revealing property defined as “secure” if and only if it satisfies Definition 1; the encryption of a message m can be written as E(m)=(c
L
,c
R
)=(E
L
(m),E
R
(m)), where E
L
and E
R
are complementary encryption functions; and the comparison between two ciphertexts c1 and c2 is done by C(cL1,cR2). This way, C may be applied without the complete knowledge of the ciphertexts.
In order to build a secure and efficient index, an ORE scheme that corresponds to Definition 4 should be used. We define the search framework as in Definition 5.
Definition 5
(Encrypted search framework) Let S be a set of words, sk a secret key, and an ORE scheme (ENC, CMP) that satisfies Definition 4. The operations required for an encrypted search over S are defined as follows: BUILDINDEX
s
k
(S): Outputs the set
$$S^{*} = \left\{c_{R}\text{}\mid \left(c_{L}, c_{R}\right) = \text{\textsc{Enc}}_{{sk}}(w), \forall w \in S\right\}. $$
TRAPDOOR
s
k
(w): Outputs the trapdoor
$$\mathcal{T}_{w} = \left(c_{L}\mid \left(c_{L}, c_{R}\right) = \text{\textsc{Enc}}_{{sk}}(w)\right). $$
\(\text {\textsc {Search}}_{S^{*},r}(\mathcal {T}_{w})\): To select all records in S∗ with the relation r∈{LOWER, EQUAL, GREATER } to word w, one computes the trapdoor \(\mathcal {T}_{w}\) and iterates through S∗ looking for the records w∗∈S∗ that satisfy
$$\text{\textsc{Cmp}}\left(\mathcal{T}_{w}, w^{*}\right) = r. $$
The set \(\hat {S}\) with all the elements in S∗ that satisfy this equation is returned.
5.2 Database operations
Let us consider a model composed by an encrypted dataset stored in a remote server and a user that possesses the secret cryptographic keys. The latter would like to perform queries on data without revealing sensitive information to the server, as defined in Section 3.1.
In 1970, Codd proposed the use of a relational algebra as a model for SQL [50]. This consists of a small set of operators that can be combined to execute complex queries over the data.
Through the functions defined in Definition 5, a relational algebra for encrypted database operations can be built. The basic operators for such algebra are defined as follows.
-
1.
Projection (π
A
): Returns a subset A of attributes from selected records. This subset may be defined by attribute names that may or may not be encrypted.
-
(a)
encrypted: If encrypted, a deterministic scheme is used or they are treated as index values.
deterministic scheme: The user computes A∗={Enc
Det
(a)|a∈A}. A∗ is sent to the server, which picks the projected attributes using a standard algorithm.
index attributes: The user computes A∗={Trapdoor
s
k
(a)|a∈A}. A∗ is sent to the server, which picks the projected attributes using Search.
-
(b)
unencrypted: Unencrypted selectors are sent to and selected by the server using a standard algorithm.
-
2.
Selection (σ
φ
): Given a predicate φ, returns only the records satisfying it.
-
Handles exclusively index, hence φ must be equivalent to a combination of comparative operators supported by Search.
-
Let w◇x←φ, where ◇ is a compatible comparative operator, w an index attribute, and x the operand to be compared (e.g.: σage>30 signals for records which the attribute named “age” value is greater than 30). The trapdoor \(\mathcal {T}_{\varphi } = Trapdoor_{{sk}}(\varphi)\) is sent to the server that executes Search.
-
3.
Cartesian product (×): The Cartesian product of two datasets is executed using a standard algorithm.
-
4.
Difference (−): The difference between two datasets A and B encrypted with the same keys is defined as A−B={x∣x∈A and x∉B}.
-
5.
Union (∪): The union of two datasets A and B encrypted with the same keys is defined as
A∪B={x∣x∈A or x∈B}.
Union and difference are defined over datasets with the same set of attributes. The opposite is expected for Cartesian product, so that no attribute may be shared between operands.
Ramakrishnan calls these “basic operators” in the sense that they are essential and sufficient to execute relational operations [51]. Additional useful operators can be built over those. For instance: rename, join-like, and division. The same observation applies in the encrypted domain, and complex operators can be constructed given basic operators defined over the encrypted domain.
-
6.
Rename (ρa,b): Renames attributes. Their names may or may not be encrypted.
-
(a)
encrypted: Encryption shall be executed using a deterministic cryptosystem or names treated as index values.
deterministic scheme: Let a be an attribute name to be replaced by b. The user computes a∗←Enc
D
e
t
(a) and b∗←Enc
D
e
t
(b), and sends the output to the server, which applies a standard replacement algorithm.
index attributes: Let a be an attribute name to be replaced by b. The user computes a∗←Trapdoor(a) and b∗←c
R
∣(c
L
,c
R
)=Enc
i
n
d
e
x
(b) and sends the output to to the server, which selects attributes related to a∗ as equal through the operation Search and renames the result to b∗.
-
(b)
unencrypted: Unencrypted attribute names may be renamed by the server using a standard algorithm.
-
7.
Natural join (⋈): Let A and B be datasets with a common subset of attributes. The natural join between A and B is defined as the selection of all elements that lies in A and B and match all the values in those attributes. More formally, let c1,c2,…,c
n
be attributes common to A and B; x1,x2,…,x
n
attributes not contained in A or in B; a1,a2,…,a
m
be attributes unique to A; b1,b2,…,b
k
be attributes unique to B; and \(\mathbb {K} =\mathbb {N}_{n+1}^{*}\). We have that,
$$A\bowtie B \equiv \sigma_{c_{i} = x_{i}} \left(\rho_{\left(c_{i}, x_{i}\right)}(A) \times B\right), \forall i \in \mathbb{K}. $$
-
8.
Equi-join (⋈
θ
): Let A and B be datasets. The equi-join between A and B is defined as the selection of all elements that lie in A and B and satisfy a condition θ. More formally, A⋈B=σ
θ
(A×B).
-
9.
Division (/): Let A and B be datasets and C the subset of attributes unique to A. The division operator joins the operands by common attributes but projects only those unique to the dividend. Hence, A/B=π
C
(A⋈B).
Finally, it is important to define data insertion and update despite these cannot be properly defined as relational operators.
-
Insert: Encrypted data is provided and inserted into the database using a standard algorithm.
-
Update: An update operation is defined as a selection followed by the evaluation of a computable attribute by a supported homomorphic operation.
This set of operators enables operating over an encrypted database without the knowledge of cryptographic keys or acquiring sensitive information from user queries.
5.3 Security analysis
We assume the scenario in which the data owner has exclusive possession of cryptographic keys. This way, insertions to the database must be locally encrypted before being sent to the server. The database or the application never deal with plaintext data. Our framework thus has the advantage over CryptDB of preserving privacy even in the outcome of a compromised database or application server.
Despite being conceptually similar to OPE, ORE is able to address several of its security limitations. ORE does not necessarily generate ciphertexts that reveal their order by design, but allows someone to protect this information by only revealing it through specific functions. ORE is able to achieve the IND-OCPA security notion and adds randomization to ciphertexts. Those characteristics make it much safer against inference attacks [37]. The proposal of Lewi and Wu goes even beyond that and is capable of limiting the use of the comparison function [34]. Their scheme generates a ciphertext that can be decomposed into left and right components such that a comparison between two ciphertexts requires only a left component of one ciphertext and the right component of the other. This way, the authors argue that robustness against such attacks is ensured since the database dump may only contain the right component, that is encrypted using semantically-secure encryption. Their scheme satisfies our notion of a Secure ORE and, therefore, provides strong defenses against Snapshot attackers.
An eavesdropper (Active or Persistent passive attacker) is not capable of executing comparisons by himself in a Secure ORE. However he may learn the result of these and recognize repeated queries by observing the outcome of a selection. This weakness may still be used for inference attacks, that can breach confidentiality from related attributes. This issue can get worse if the trapdoor is deterministic, when there is no other solution than implementing a key refreshment algorithm. Besides that, the knowledge of the numerical order between every pair of elements in a sequence may leak information depending on the application. This problem manifests itself in our proposal on the σ primitive if it uses a weak index structure, like a naive sequential index. A balanced-tree-based structure, on the other hand, obscures the numerical order of elements in different branches. This way, an attacker is capable of recovering the order of up to O(logn) database elements and infer about the others, in a database with n elements.
Schemes used with computable attributes are limited to IND-CCA1, and typically reach only IND-CPA. Moreover, homomorphic ciphertexts are malleable by design. Thus, an attacker that acquires knowledge about a ciphertext can use it to predictably manipulate others.
Finally, BUILDINDEX is not able to hide the quantity of records that share the same index. This way, one is able to make inferences about those by the number of records. There is also no built-in protection for the number of entries in the database. A workaround is to fix the size of each static attribute value and round the quantity of records in the database using padding. This approach increases secrecy but also the storage overhead.
5.4 Performance analysis
The application of ORE as the main approach to build a database index provides an extremely important contribution to selection queries. SEARCH does not require walking through all the records testing a trapdoor, but only a logarithmic subset of it when implemented over an optimal index structure, such as an AVL tree or B-tree based structure [52]. This characteristic is highlighted on union, intersection and difference operations, which work by comparing and selecting elements in different groups. Moreover, current proposals in the state of the art of ORE enjoy good performance provided by symmetric primitives and does not require more expensive approaches such as public-key cryptography [33, 34, 36]. In particular, although fully homomorphic cryptosystems promise to fulfill this task and progress has been made with new cryptographic constructions [53], it is still prohibitively expensive for real-world deployments [54].
Space consumption is also affected. Ciphertexts are computed as a combination of the plaintext with random data. This way, a non-trivial expansion rate is expected. Differently from speed overheads which are affected by a single attribute type, all attributes suffer with the expansion rate of encryption.
5.5 Capabilities and limitations
Our framework is capable of providing an always-encrypted database that preserves secrecy as long as the data owner keeps the cryptographic keys secure. One is able to select records through index and apply arbitrary operations on attributes defined as computable. Furthermore, it increases the security of data but maintaining the computational complexity of standard relational primitives, achieving a fair trade-off between security and performance.
Although the framework has no constraints about attributes classified as both index and computable, there is no known encryption scheme in the literature capable of satisfying all the requirements. This way, the relational model of the database must be as precise as possible when assigning attributes to each class, specially because the costs of a model refactor can be prohibitive.
Some scenarios appear to be more compatible with an encrypted database as described than others. An e-mail service, for example, can be trivially adapted. The e-mails received by a user are stored in encrypted form as static and some heuristic is applied on its content to generate a set of keywords to be used on BUILDINDEX. This heuristic may use all unique words in the e-mail, for example. The sender address may be an important value for querying as well, so it may be stored as an index. To optimize common queries, a secondary collection of records may be instantiated with, for example, counters. The quantity of e-mails received from a particular sender, how often a term appears or how many messages are received in a time frame. Storing this metadata information in a secondary data collection avoids some of the high costs of searching in the main dataset.
However, our proposal fails when the user wants to search for something that was not previously expected. For example, regular expressions. Suppose a query that searches for all the sentences that start with “Attack” and end with “dawn”, or all the e-mails on the domain “mail.com”. If these patterns were not foreseen when the keyword index was built, then no one will be able to correctly execute this selection without the decryption of the entire database. Since the format of the strings is lost on encryption, this kind of search is impossible in our proposal.
Lastly, relational integrity is a desired property for a relational database. It connects two or more sets using same-value attributes in both sets (e.g.: every value of a column in a table A exists in a column in table B), and establishes a primary-foreign key relationship. This way, the existence of a record in an attribute classified as foreign key depends on the existence of the related record on the other set, in which the primary key is equal to that foreign key. To implement such feature one must provide to the DBMS capabilities to reinforce relational integrity rules. In other words, the server must be able to recognize such a relationship to guarantee it will be respected by issued queries.
An example of the applicability of this concept is an e-commerce database. Best practices dictate that user data should be stored separately from products and orders. Thus, one may model it as in Fig. 3. When a new order arrives, it is clear that a user chose some product and informed the store about his intent to buy it. Users and products are concrete elements. However, a sale is an abstract object and cannot happen without a buyer and a product. This way, to maintain the consistency of the database the DBMS must assure that no sale record will exist without relating user and product. This can be achieved by constructing the sales table such that records contain foreign keys for the user and product tables (implying that these contain attributes classified as primary keys). By definition this feature imposes an inherent requirement that the DBMS has knowledge about this relationship between records on different tables. Any approach to protect the attributes against third parties will affect the DBMS itself and will never really achieve the needed protection. Thus, any effort on implementing secure relational integrity is at best security through obscurityFootnote 5.