SimAttack: private web search under fire
© Petit et al. 2016
Received: 23 July 2015
Accepted: 23 March 2016
Published: 18 April 2016
Web search engines have become an indispensable online service for retrieving content on the Internet. However, using search engines raises serious privacy issues, as they gather large amounts of data about individuals through their search queries. Two main techniques have been proposed to privately query search engines. A first category of approaches, called unlinkability, aims at disassociating the query from the identity of its requester. A second category of approaches, called indistinguishability, aims at hiding users’ queries or interests by either obfuscating the queries or forging fake ones. This paper presents a study of the level of protection offered by three popular solutions: Tor-based, TrackMeNot, and GooPIR. For this purpose, we present an efficient and scalable attack, SimAttack, which leverages a similarity metric to capture the distance between preliminary information about the users (i.e., their query history) and a new query. SimAttack de-anonymizes up to 36.7 % of queries protected by an unlinkability solution (i.e., Tor-based), and identifies up to 45.3 and 51.6 % of queries protected by indistinguishability solutions (i.e., TrackMeNot and GooPIR, respectively). In addition, SimAttack de-anonymizes 6.7 % more queries than state-of-the-art attacks and improves the performance of the attack on TrackMeNot by 23.6 %, while running two orders of magnitude faster.
1 Introduction
Search engines (e.g., Google, Bing, Yahoo!) have become the preferred way for users to find content on the Internet. However, by repeatedly querying for a large number of topics and websites, users disclose a large amount of personal data to these search engines. Consequently, the latter are able to build accurate knowledge about users by extracting their personal interests from their queries. Even though not all user queries relate to sensitive topics, this automated processing of data about individuals raises a serious privacy issue, as users cannot control the use of their personal data and have no right to be forgotten. To deal with this issue, many solutions have been proposed to enforce private Web search. These solutions can be mainly classified into two categories. The first one, called unlinkability, consists in hiding the user’s identity (typically her IP address) from the search engine. Anonymous communication protocols (e.g., Onion Routing, Tor, Dissent [3, 4], RAC) are the main solutions enforcing this property. The second type of solutions, called indistinguishability, aims at either altering the user’s queries or hiding the user’s interests. For instance, GooPIR adds extra queries to the original query, while TrackMeNot periodically sends fake queries.
Although these solutions improve user privacy, a previous study using a machine learning algorithm and preliminary information about the user (i.e., part of her query history) shows that an adversary is able to break both categories of solutions. However, this study was conducted using only 60 specific users (i.e., users who issued queries with a given number of keywords or queries considered as sensitive by the authors) and considering non-active users (called “other user” in the study). Consequently, it is not clear whether an adversary is still able to break these unlinkability and indistinguishability solutions for active users. As active users can expose more information to the adversary, they represent the most difficult category of users to protect.
To better understand the limits of unlinkability and indistinguishability solutions with respect to individuals’ privacy, we present in this paper a study of private Web search solutions focusing on active users. This study is conducted with SimAttack, an efficient attack that leverages a similarity metric to capture the distance between a query and user profiles. These user profiles gather preliminary information about the users collected by the adversary. While the original version of SimAttack was designed for a specific target, this paper presents a generalization of this attack for unlinkability and indistinguishability solutions.
We exhaustively evaluated our new SimAttack on three popular solutions: Tor-based, TrackMeNot, and GooPIR. Our experiments used real-world Web search datasets involving up to 15,000 users. We show that SimAttack scales particularly well with respect to the number of users considered in the system. More precisely, compared to the previous machine learning attacks, SimAttack reduces the execution time by factors of 158 and 100 when considering, respectively, 1,000 users protected by an unlinkability solution and 100 users protected by TrackMeNot. Moreover, SimAttack succeeds in de-anonymizing as many user queries as the machine learning attack for unlinkability solutions, and identifies up to 45.3 % of initial queries for TrackMeNot.
Finally, the generic nature of SimAttack, based on a similarity distance between pre-built user profiles and a query, allows an adversary to design attacks for other private Web search solutions.
For instance, we leverage SimAttack to evaluate the privacy protection offered by GooPIR. We succeed in identifying at least 50.6 % of initial queries protected by this solution, even when they are protected by 7 fake queries. Last but not least, as we show in our study that the aforementioned solutions (i.e., Tor-based, TrackMeNot, and GooPIR) do not properly protect user privacy, we also analyze hybrid private Web search solutions: GooPIR over an unlinkability solution and TrackMeNot over an unlinkability solution.
The remainder of the paper is organized as follows. In Section 2, we present the state-of-the-art approaches. In Section 3, we describe the considered adversary model. In Section 4, we detail SimAttack and how it is able to break unlinkability solutions, indistinguishability solutions and their combinations. We then present our experimental set-up in Section 5 before evaluating the robustness of unlinkability solutions, TrackMeNot, GooPIR and hybrid solutions in Sections 6, 7, 8 and 9, respectively. Finally, Section 10 concludes the paper.
2 Related work
The main solutions to privately query search engines can be classified in two categories: (i) systems ensuring unlinkability between requesters and their queries, and (ii) systems guaranteeing indistinguishability of user interests. Privacy-aware mechanisms can also be implemented directly on the search engine side through Private Information Retrieval (PIR) protocols.
2.1 Unlinkability solutions
One approach to protect user privacy from an overly curious search engine is to prevent the latter from identifying the real identity of users. The identity of users is tracked through multiple techniques such as the IP address, quasi-identifiers (e.g., cookies), or fingerprints (e.g., HTTP headers, set of browser plugins). While quasi-identifiers can be removed as suggested in , a basic solution to hide the IP address consists in leveraging a proxy or a VPN server as a relay. This distant server forwards user queries to the search engine on behalf of the user and returns results to the user. Unfortunately, this mechanism only shifts the privacy problem from the search engine to the relay, which can collect and analyze queries from users.
Anonymous networks (e.g., Onion Routing, Tor, Dissent [3, 4], RAC) represent a more complex approach to prevent a third party from mapping a user identity to a query. Indeed, anonymous networks leverage onion routing and path forwarding to route user queries through multiple nodes before reaching the search engine. However, this approach relies on either a high number of cryptographic operations (e.g., Tor-based solutions) or all-to-all communication (e.g., RAC and Dissent), which generate a costly overhead in terms of latency and network traffic. These significant overheads make anonymous networks impractical for interactive tasks such as Web search.
Other techniques try to achieve the same goal using a fully decentralized architecture. For instance,  and  proposed a protocol in which users exchange their queries in a privacy-preserving way (i.e., users do not know who issued which queries) and send them on behalf of each other. As the identity of the initial requester is unknown to the search engine and the other users, the results must be broadcast to all the users. Therefore, these solutions generate significant overheads in terms of traffic and latency.
2.2 Indistinguishability solutions
Indistinguishability solutions consist in ensuring that the search engine can only collect inaccurate user queries and interests. Consequently, as the users’ interests cannot be truly discovered, the privacy of users is preserved. A popular solution in this category, TrackMeNot, periodically sends fake queries on behalf of the user. The challenge in this approach is to create fake queries that cannot be distinguished from the real ones. To do so, TrackMeNot (TMN) bases the generation of fake queries on RSS feeds. However, as these RSS feeds are set up by default or manually by the user, an adversary could be able to distinguish real queries from fake ones. For instance, in  authors present a simple clustering attack over small time windows that enables an adversary to retrieve fake queries. Other solutions adopt a similar technique: Plausibly Deniable Search (PDS) generates k plausibly deniable queries which are similar to previous user queries but on different topics. Optimized Query Forgery (OQF) provides a theoretical approach to generate fake queries by measuring the Kullback-Leibler divergence between the user profile and the population distribution. Noise Injection for Search Privacy Protection (NISSP), an approach similar to OQF, gives a theoretical property that optimal fake queries should respect. However, these four solutions (TMN, PDS, OQF and NISSP) might overload the network by generating a large number of fake queries.
Another possibility to achieve indistinguishability is to modify the initial query. For instance, GooPIR adds (k−1) fake queries (generated using a dictionary) to the initial query, all of these queries being separated by the logical OR operator in a new obfuscated query. As GooPIR’s authors consider that an adversary has no background knowledge about the user, this adversary can only guess the initial query with a probability equal to 1/k. However, we present in Section 4 an efficient attack that is able to retrieve the initial query with a high probability using user profiles previously built from the users’ past queries. Another technique called Query Scrambler (QS) protects the user by sending, instead of the initial query, a set of new queries built by generalizing the concepts of the initial query. Then, by filtering all the results, QS retrieves potential results related to the initial query. However, despite the similar queries and the proposed filtering approach, the accuracy of the results remains low compared to the results obtained by issuing the original query.
2.3 Private information retrieval
Search engines can implement Private Information Retrieval (PIR) protocols to offer a privacy-preserving query service to users. For instance,  presents a system in which the query is broken down into multiple buckets of words and then, the user uses homomorphic encryption to retrieve search results without revealing her initial query. However, this scheme faces several obstacles to adoption in practice: homomorphic encryption is costly, and the scheme requires a specific implementation on the search engine side.
3 Adversary model
Users are more and more concerned about the privacy risks of querying search engines. In this paper, we analyze the robustness of popular private Web search solutions. We considered three categories of solutions: unlinkability solutions, indistinguishability solutions, and indistinguishability solutions over unlinkability solutions. In our approach, we assumed an adversary who aims to retrieve, for each protected query, both the content of the initial query and the identity of the associated user. Moreover, we assumed that this adversary was able to collect preliminary information about the interests of each user in the system. This preliminary information is stored in user profile structures. Preliminary information about users can be collected in different manners: from their social network activity, from their posts on blogs, or from their discussions on forums. In this paper, we considered as preliminary information a part of the users’ query history.
In practice, our adversary model can be seen as a search engine receiving protected queries from users who have just started to adopt a private Web search solution. In this use case, the preliminary information represents the non-protected queries sent by the users to the search engine before they started using a private Web search solution. Consequently, the most active users have exposed more preliminary information to the search engine through their past querying activity.
4 SimAttack
In this section, we present SimAttack, an attack against private Web search solutions. SimAttack computes a distance between an incoming query and the preliminary information collected by the adversary (i.e., user profiles). As a consequence, according to this similarity distance, the adversary is able to de-anonymize the query or differentiate the fake queries from real ones. SimAttack is a user-centric attack which tries to compromise the privacy of each user independently. In this paper, we generalized the original version of SimAttack for unlinkability and indistinguishability solutions.
Compared to existing attacks, SimAttack is generic and can be adapted against all types of private Web search solutions. Indeed, by defining the user profile and the considered similarity metric, an adversary can tailor SimAttack to any type of protection.
The next sections explain how the similarity between a user profile and a query is computed, and detail how SimAttack is able to break several types of protection mechanisms based on unlinkability, indistinguishability, and indistinguishability over unlinkability.
4.1 Similarity metric between a query and a user profile
We create a similarity metric $\mathrm{sim}(q, P_u)$ to characterize the proximity between a query $q$ and a user profile $P_u$. As mentioned in , the vector space model is widely used for text representation. Thus, we model the query $q$ as a vector where each dimension corresponds to a separate term. For each dimension, the value of the vector is either 0 or 1 (i.e., 0 means that the keyword is not used in the query while 1 means that the keyword is used in the query). Let us define a user profile $P_u$ as a set of queries (i.e., a set of word vectors). The similarity metric $\mathrm{sim}(q, P_u)$ returns a value between 0 and 1, where greater values indicate that the query is close to the user’s profile. It is computed as presented in Algorithm 1.
It first computes the value $\mathrm{coef}[i]$ corresponding to the Dice’s coefficient between the query $q$ and the query $q_i$ stored in $P_u$, the profile of user $u$ (line 2). As defined in Section 3, this profile contains part of the history of queries already issued by the user and previously collected by the adversary. The coefficients $\mathrm{coef}[i]$ are then ranked in ascending order (line 3). The similarity metric $\mathrm{sim}(q, P_u)$ is finally computed as the exponential smoothing of these coefficients (lines 4 to 6). Consequently, this similarity depends on the smoothing factor α, which controls the weight given to each coefficient. This parameter α takes its value between 0 and 1. In practice, the value of α does not strongly impact the results, as shown in Section 6.1. Furthermore, we consider the Dice’s coefficient because it gives slightly better results than other similarity metrics (e.g., cosine similarity, Jaccard index). As shown in our evaluations, although SimAttack is faster than concurrent approaches, the time required to perform the attack must remain as short as possible. The Dice’s coefficient provides a good trade-off between attack performance and execution time compared to edit-based and more complex token-based metrics.
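Algorithm 1 itself is not reproduced in this text, so the following Python sketch only illustrates the steps described above: Dice's coefficient against every past query, ascending sort, then exponential smoothing. The exact recurrence (starting from the smallest coefficient and updating with `alpha * coef + (1 - alpha) * score`), the default `alpha=0.5`, and the representation of queries as sets of keywords are assumptions made for illustration.

```python
def dice(a, b):
    """Dice's coefficient between two keyword sets: 2|A n B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def sim(query, profile, alpha=0.5):
    """Similarity in [0, 1] between a query and a profile (a list of past queries).

    Assumed recurrence: coefficients are sorted in ascending order and then
    exponentially smoothed, so the largest coefficients are folded in last
    and therefore weigh the most.
    """
    coefs = sorted(dice(query, past) for past in profile)
    if not coefs:
        return 0.0
    score = coefs[0]
    for coef in coefs[1:]:
        score = alpha * coef + (1 - alpha) * score
    return score


# Hypothetical profile made of three past queries.
profile = [{"cheap", "flight", "paris"}, {"hotel", "paris"}, {"python", "tutorial"}]
print(sim({"flight", "paris"}, profile))
```

Because each update is a convex combination of values in [0, 1], the result stays in [0, 1], and it equals 0 only when the query shares no keyword with any past query.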
4.2 Unlinkability attack
The de-anonymization attack consists in finding the identity of the requester of a specific query. Algorithm 2 describes this attack. For each user profile P u previously collected by the adversary, it computes its similarity with the query q (line 3). It then returns the identity id corresponding to the profile with the highest similarity. If the highest similarity equals 0 (i.e., all similarities equal 0), the identity of the requester remains unknown and the attack is unsuccessful. Otherwise, the algorithm considers the user, id, as the issuer of the query q.
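As a minimal sketch of the de-anonymization step described above (not the authors' Algorithm 2 verbatim), the function below returns the identity whose profile is most similar to the query, or `None` when every similarity equals 0. The name `deanonymize` and the `sim` argument (any callable implementing the metric of Section 4.1) are hypothetical.

```python
def deanonymize(query, profiles, sim):
    """Return the user id with the most similar profile, or None if all scores are 0.

    profiles: dict mapping a user id to that user's profile.
    sim:      callable (query, profile) -> similarity in [0, 1] (assumed interface).
    """
    best_id, best_score = None, 0.0
    for user_id, profile in profiles.items():
        score = sim(query, profile)
        # Strict comparison: a profile with similarity 0 is never selected.
        if score > best_score:
            best_id, best_score = user_id, score
    return best_id
```

The loop computes one similarity per profile, so the cost of attacking one query grows linearly with the number of profiles held by the adversary.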
4.3 Indistinguishability attack
The attack against indistinguishability solutions aims to identify initial queries among the fake or obfuscated queries received by the search engine. Contrary to the previous attack, the adversary knows the identity of the user and thus tries to pinpoint fake queries by analyzing the similarity between queries and the user profile. The attack detailed in Algorithm 3 proceeds as follows. It first determines which obfuscation mechanism is being used. More precisely, it checks whether the obfuscated query $q^+$ contains several fake queries separated by the logical OR operator (line 1) (i.e., the behavior of GooPIR). The logical OR operator might have been introduced by the user in her query (and not by the obfuscation mechanism). Nevertheless, as the user query and all fake queries have the same number of keywords, it is easy in most cases to detect whether the logical OR was introduced by the user or by the obfuscation mechanism.
Let us consider the first case, in which the query $q^+$ is composed of $k+1$ queries (i.e., the initial query and $k$ fake queries). The algorithm extracts each aggregated query $q_i$ from $q^+$ and computes the similarity metric between these aggregated queries $q_i$ and the user profile $P_u$ (lines 3 and 4). Then it stores the query with the highest similarity in the variable $q'$. If the similarity $\mathrm{sim}(q', P_u)$ is different from 0, the algorithm returns $q'$ as the initial request. Otherwise, the attack fails and the initial query is not retrieved, as none of the $(k+1)$ queries is similar to the user profile.
In the second case (i.e., the query does not contain the logical OR operator), the algorithm distinguishes two situations, depending on whether or not the adversary has prior knowledge about the RSS feeds used by the user to generate the fake queries. If the adversary does not have this external knowledge, the algorithm evaluates whether the similarity between the query $q^+$ and the user profile $P_u$ is greater than a given threshold δ. If so, then $q^+$ is considered as a real query, and is therefore returned (line 8). Otherwise, the query is considered to be a fake query (line 11).
Conversely, if the adversary knows the RSS feeds used by the user to generate the fake queries, the adversary generates fake queries using these predefined RSS feeds. These fake queries are stored in a profile $P_{FQ}$ (same structure as a user profile $P_u$). Then, the adversary uses this external knowledge to distinguish fake queries (line 10). The algorithm compares the similarity between the query $q^+$ and the user profile $P_u$ (i.e., $\mathrm{sim}(q^+, P_u)$) against the similarity between the query $q^+$ and the profile of fake queries $P_{FQ}$ (i.e., $\mathrm{sim}(q^+, P_{FQ})$). If $\mathrm{sim}(q^+, P_u)$ is greater than $\mathrm{sim}(q^+, P_{FQ})$, $q^+$ is closer to the user profile than to the profile of fake queries. Consequently, $q^+$ is considered as a real query, and is then returned. Otherwise, the query is considered to be a fake query (line 11).
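The three branches described above (OR-separated queries, threshold δ without external knowledge, comparison against the fake-query profile with external knowledge) can be sketched as follows. This is an illustration under assumptions rather than Algorithm 3 itself: queries are plain strings split on whitespace, the OR detection is a simple substring test, the default `delta=0.5` is arbitrary, and `sim` is any callable implementing the metric of Section 4.1.

```python
def identify_real_query(obfuscated, profile, sim, delta=0.5, fake_profile=None):
    """Return the keyword set judged to be the user's real query, or None.

    obfuscated:   query string received by the search engine.
    profile:      profile of the (known) user.
    fake_profile: optional profile built by the adversary from the user's RSS feeds.
    delta:        similarity threshold; the default value is an assumption.
    """
    if " OR " in obfuscated:
        # GooPIR-like case: keep the sub-query closest to the user profile.
        candidates = [set(part.split()) for part in obfuscated.split(" OR ")]
        best = max(candidates, key=lambda q: sim(q, profile))
        return best if sim(best, profile) > 0 else None

    # TrackMeNot-like case: a single query that is either real or fake.
    query = set(obfuscated.split())
    score = sim(query, profile)
    if fake_profile is None:
        return query if score > delta else None
    return query if score > sim(query, fake_profile) else None
```

Returning `None` covers both outcomes the text distinguishes: a failed attack in the OR case, and a query classified as fake in the other two cases.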
4.4 Indistinguishability over an unlinkability solution attack
The attack that breaks an indistinguishability solution over an unlinkability solution combines the two previous attacks. The attack aims at identifying both the initial requester and the initial query. To achieve that, it follows Algorithm 4. Like Algorithm 3, Algorithm 4 first determines which obfuscation mechanism is being used by looking for logical OR operators (line 1). If the query contains such operators, it first extracts the $(k+1)$ queries $q_i$ from $q^+$ and then retrieves, for each query $q_i$, its potential requester $\mathrm{id}[i]$ by invoking Algorithm 2 (lines 2 to 3). Then, it removes queries which are not associated with a potential requester (lines 5 to 6), i.e., queries for which Algorithm 2 was unsuccessful. We denote the set of indexes corresponding to the remaining queries by $I$. Finally, if $I$ contains one element $a$ (i.e., only one query is associated with a potential requester), it returns the pair $(q_a, \mathrm{id}[a])$ corresponding to the initial query $q_a$ and to the initial requester $\mathrm{id}[a]$ (lines 7 to 9).
However, if $I$ contains at least two elements, it retrieves the pairs $(q_a, \mathrm{id}[a])$ and $(q_b, \mathrm{id}[b])$ which have the highest similarity over $I$ and evaluates the difference between them (i.e., $\mathrm{sim}(q_a, P_{\mathrm{id}[a]}) - \mathrm{sim}(q_b, P_{\mathrm{id}[b]})$). To ensure a certain confidence in the results, the attack is considered unsuccessful if this difference is too small: the algorithm has retrieved at least two pairs of query and requester and is not able to clearly identify the real one. However, if the difference is greater than a threshold (initialized at 0.01 by default), it returns the pair $(q_a, \mathrm{id}[a])$ corresponding to the initial query $q_a$ and to the initial requester $\mathrm{id}[a]$ which maximizes $\mathrm{sim}(q_a, P_{\mathrm{id}[a]})$ over $I$ (lines 10 to 14).
When queries do not contain OR operators, the algorithm first retrieves the potential requester id by calling Algorithm 2 (line 16). If this id is not empty (i.e., if the attack made by Algorithm 2 is successful), it distinguishes two cases depending on whether the adversary has prior knowledge about the RSS feeds used by the user. As mentioned in the previous section, if the adversary is able to generate fake queries, she creates a profile $P_{FQ}$ (similar to a user profile $P_u$) that contains a set of fake queries. Let us consider the first case, in which the adversary does not have this knowledge (lines 18 to 19). The adversary is able to distinguish between fake queries and real ones by comparing the similarity between the query $q^+$ and the user profile $P_{id}$ (i.e., $\mathrm{sim}(q^+, P_{id})$) with the threshold δ. If $\mathrm{sim}(q^+, P_{id})$ is greater than δ, the query is considered as a real query sent by the user id and thus, the pair $(q^+, \mathrm{id})$ is returned.
Consider now the case in which the adversary is able to generate a set of fake queries (lines 20 to 22). The algorithm determines whether the similarity between the query $q^+$ and the user profile $P_{id}$ (i.e., $\mathrm{sim}(q^+, P_{id})$) is greater than the similarity between the query $q^+$ and the profile of fake queries $P_{FQ}$ (i.e., $\mathrm{sim}(q^+, P_{FQ})$). If so, the algorithm returns the pair $(q^+, \mathrm{id})$, considered as the initial query and the initial requester, respectively. Otherwise, as no pair has been returned, either the attack is unsuccessful or the query is considered as a fake query (line 23).
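A compact sketch of the combined attack described in this section is given below. It is an illustration under assumptions, not Algorithm 4 verbatim: queries are whitespace-split strings, `sim` is any callable implementing the metric of Section 4.1, the default `delta` is arbitrary, and only the 0.01 confidence margin comes from the text.

```python
def attack_hybrid(obfuscated, profiles, sim, delta=0.5, margin=0.01, fake_profile=None):
    """Return (query, user_id) for the presumed initial query and requester, or None."""

    def requester(q):
        # De-anonymization step: most similar profile, None if all similarities are 0.
        best = max(profiles, key=lambda u: sim(q, profiles[u]), default=None)
        if best is None or sim(q, profiles[best]) == 0:
            return None
        return best

    if " OR " in obfuscated:
        # Keep only the sub-queries that can be linked to some requester (set I).
        scored = []
        for part in obfuscated.split(" OR "):
            q = set(part.split())
            uid = requester(q)
            if uid is not None:
                scored.append((sim(q, profiles[uid]), q, uid))
        if not scored:
            return None
        scored.sort(key=lambda item: item[0], reverse=True)
        # With several candidates, require a clear gap between the two best ones.
        if len(scored) > 1 and scored[0][0] - scored[1][0] <= margin:
            return None
        return scored[0][1], scored[0][2]

    q = set(obfuscated.split())
    uid = requester(q)
    if uid is None:
        return None
    score = sim(q, profiles[uid])
    if fake_profile is None:
        return (q, uid) if score > delta else None
    return (q, uid) if score > sim(q, fake_profile) else None
```

The margin test is what makes the attack abstain rather than guess when two (query, requester) pairs are almost equally plausible.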
5 Experimental set-up
In this section, we provide the experimental set-up of our evaluation: the datasets, an overview of the considered indistinguishability solutions (i.e., TrackMeNot and GooPIR), and both the evaluation metrics and the concurrent approaches we use to assess the performance of SimAttack. All our experiments were conducted on a commodity desktop workstation with a 2.2 GHz quad core processor with 8 GB of memory.
5.1 Web search dataset
To evaluate the robustness of private Web search solutions, we use a real-world Web search dataset from the AOL Web search logs published in 2006. The AOL dataset contains approximately 21 million queries formulated by 650,000 users over three months (March, April and May of 2006). As this dataset contains many inactive users (i.e., users that issued too few queries), we first filtered the whole dataset to target active users. More precisely, we selected users that: (i) sent queries on at least 45 different days (i.e., half of the dataset period), and (ii) issued queries over a period of at least 61 days (i.e., two-thirds of the dataset period). After this filtering phase, our dataset gathers 18,164 users who issued from 62 to 3,156 queries over the dataset period.
In addition, to assess private Web search solutions with a larger number of users, we create 3 extra datasets containing the top 5,000, 10,000 and 15,000 users: AOL5000, AOL10000 and AOL15000. Figure 1 shows that these 3 datasets do not follow the previous distribution of queries per user due to the lack of highly active users in the AOL dataset. However, results obtained with these datasets give a lower bound, as having more queries in the user profile would likely increase the efficiency of the attack.
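The two activity criteria can be expressed as a small predicate. The sketch below assumes that each user's queries are available as a list of calendar dates and that the 61-day period is measured inclusively between the first and the last query; both points, as well as the function name, are assumptions.

```python
from datetime import date, timedelta


def is_active(query_dates, min_days=45, min_span=61):
    """True if queries fall on >= min_days distinct days spread over >= min_span days."""
    days = set(query_dates)
    if not days:
        return False
    span = (max(days) - min(days)).days + 1  # inclusive span (assumption)
    return len(days) >= min_days and span >= min_span


# Hypothetical user querying every other day from 1 March 2006.
dates = [date(2006, 3, 1) + timedelta(days=2 * i) for i in range(45)]
print(is_active(dates))
```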
Finally, we pre-process and filter the queries of users to remove the irrelevant keywords. To achieve that, we leverage the Stanford CoreNLP library. Using the tokenizer, we split queries into string vectors and then remove stop words (i.e., articles and short function words) and irrelevant keywords. Irrelevant keywords are identified with the Named Entity Tagger and the Part-Of-Speech Tagger. The former enables the recognition of names or numerical and temporal entities, while the latter recognizes the function of the word. As a consequence, numbers, dates and pronouns are removed. Lastly, we stem each keyword by eliminating or replacing the suffix using the Porter algorithm.
As mentioned in Section 3, we considered that the adversary has already built a user profile for each user. Consequently, we split each dataset in two parts: a training set used to build the user profiles, and a testing set used to assess the robustness of the considered privacy-preserving mechanism. We used two thirds of the user queries to create the training set and the remaining third to create the testing set.
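As an illustration of this split, the sketch below assigns the first two thirds of a user's queries to the training set and the rest to the testing set. The text does not state whether the split is chronological or random; the chronological choice here, like the function name, is an assumption.

```python
def split_history(queries):
    """Split one user's query list into (training, testing) with a 2/3 - 1/3 ratio.

    Assumes `queries` is ordered by time, so the profile is built from the
    oldest queries and tested on the most recent ones.
    """
    cut = (2 * len(queries)) // 3  # integer arithmetic avoids rounding surprises
    return queries[:cut], queries[cut:]


train, test = split_history([f"query {i}" for i in range(9)])
print(len(train), len(test))
```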
5.2 TrackMeNot
TrackMeNot, called TMN in the rest of the paper, is a Firefox plugin which periodically generates fake queries to hide user queries in a stream of related queries. After the installation of TMN, the user can define different settings to select the desired level of protection. Two main parameters impact the user protection: the RSS feed list and the delay between two fake queries. The RSS feed list is composed by default of four RSS feeds coming from: cnn.com, nytimes.com, msnbc.com and theregister.co.uk. The user can modify this list to remove or add extra RSS feeds. Modifying this setting is crucial, as keeping the initial list might help an adversary to distinguish between real queries and fake ones. However, it is not trivial for users to find good RSS feeds, as they should find RSS feeds that cover all their ever-changing interests. Moreover, the user can customize the protection by choosing the time between two fake queries. TMN offers several possibilities: from 10 fake queries per minute to 1 fake query per hour. Consequently, the users are able to choose the quantity of noise they want to introduce in their queries. Also, the user can activate the “burst mode”. In that case, when the user issues a query, TMN sends multiple fake queries at the same time to cover it.
To generate these fake queries, TMN transforms titles of articles listed in RSS feeds into queries. To do so, it randomly extracts keywords from a title and aggregates them into a fake query. The number of keywords is randomly chosen between 1 and 6. As a direct consequence, for a given title, this algorithm is able to create multiple fake queries and thus, two TMN users using the same RSS feeds do not systematically create the same fake queries.
Finally, to simulate users using TMN, we need to add fake queries to the datasets created in Section 5.1. To do that, we create our own implementation of TMN to generate the fake queries. We thus collected RSS feeds from the TMN default setting during one and a half months (from August 28th, 2014 to October 9th, 2014), and we generate fake queries from the 13,878 news titles that we extracted. Additionally, we need to specify the number of fake queries that we want to generate. To do so, we consider that users used their computers 8 hours a day and have set up 60 queries per hour. Consequently, we generate 14,880 fake queries per user (i.e., 60 queries × 8 hours × 31 days). We call TMN100 this new dataset that contains the queries of AOL100 plus 1,488,000 fake queries.
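The generation step described above can be sketched in a few lines. This is not TMN's actual code: the function name is hypothetical, titles are split on whitespace, and the order of the sampled keywords is left random, all of which are assumptions.

```python
import random


def tmn_fake_query(title, rng=random):
    """Build one fake query from 1 to 6 keywords drawn at random from an RSS title."""
    words = title.split()
    # Never ask for more keywords than the title contains.
    count = min(rng.randint(1, 6), len(words))
    return " ".join(rng.sample(words, count))


# A seeded generator makes the sketch reproducible; the title is made up.
rng = random.Random(42)
title = "Stocks rally as markets shrug off rate fears"
print([tmn_fake_query(title, rng) for _ in range(3)])
```

Calling the function repeatedly on the same title yields different fake queries, which is the property the paragraph above relies on.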
To ensure that the generated fake queries (built from RSS feeds captured in 2014) use terms similar to those that users searched for in 2006, we compute the overlap between the words used in fake queries and the words used in the whole AOL dataset. We found that 85.6 % of the words used in fake queries are also contained in the AOL dataset (6,918 words out of 8,082).
Furthermore, we also generate fake queries for the adversary (i.e., the profile of fake queries $P_{FQ}$ defined in Section 4.3). To do that, we generate the same number of fake queries for the adversary as for users (i.e., 14,880 fake queries).
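The overlap is the share of the fake-query vocabulary that also appears in the AOL vocabulary, i.e., 6,918 / 8,082 ≈ 0.856. A minimal sketch of this computation follows; the function name and the whitespace tokenization are assumptions, and whether the words were pre-processed first is not stated in the text.

```python
def vocabulary_overlap(fake_queries, real_queries):
    """Fraction of distinct words in fake_queries that also occur in real_queries."""
    fake_vocab = {word for query in fake_queries for word in query.split()}
    real_vocab = {word for query in real_queries for word in query.split()}
    if not fake_vocab:
        return 0.0
    return len(fake_vocab & real_vocab) / len(fake_vocab)


# Checking the figure reported in the text.
print(round(100 * 6918 / 8082, 1))
```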
5.3 GooPIR
GooPIR (Google Private Information Retrieval) is a Java program to query Google in a privacy-preserving way. This protection mechanism can also be used with other search engines, but only Google is supported by the application. GooPIR obfuscates user queries by adding extra fake queries separated by the logical OR operator. GooPIR uses a dictionary to generate these fake queries. It can exploit any type of dictionary: in the current implementation, news articles from WikiNews are used, but GooPIR’s authors mention that query logs can also be used. By default, GooPIR creates three fake queries, but users can manually set this number.
To generate k fake queries, GooPIR selects, for each keyword of the initial query, k words from the dictionary. All these k selected words have a usage frequency similar to that of the keyword in the initial query. Consequently, if the initial query is composed of n keywords, GooPIR selects k×n words and then creates k fake queries of n words (i.e., fake queries and the user’s query have the same number of keywords).
Query answers returned by Google contain results related to the initial query but also to the fake ones. As a consequence, GooPIR implements a filtering phase that tries to remove irrelevant results introduced by fake queries. This algorithm tests, for each result, whether its title or its description contains keywords of the initial query. If so, the result is displayed; otherwise, it is discarded.
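The construction above can be sketched as follows. This is not GooPIR's implementation: "similar usage frequency" is interpreted here as the k dictionary words whose frequency is numerically closest to that of the keyword, and the position of the real query among the OR-separated parts is shuffled; both choices, the function name, and the toy dictionary are assumptions. The dictionary must contain at least k words besides each keyword.

```python
import random


def goopir_obfuscate(query, frequencies, k=3, rng=random):
    """Return `query` hidden among k fake queries, all joined by the OR operator.

    frequencies: dict mapping each dictionary word to its usage frequency.
    """
    fakes = [[] for _ in range(k)]
    for word in query.split():
        target = frequencies.get(word, 0)
        # k dictionary words (other than the keyword) with the closest frequency.
        nearest = sorted(
            (w for w in frequencies if w != word),
            key=lambda w: abs(frequencies[w] - target),
        )[:k]
        rng.shuffle(nearest)
        for fake, replacement in zip(fakes, nearest):
            fake.append(replacement)
    parts = [query] + [" ".join(fake) for fake in fakes]
    rng.shuffle(parts)  # do not leak the real query through its position
    return " OR ".join(parts)


# Hypothetical dictionary with made-up frequencies.
freq = {"flight": 100, "paris": 90, "hotel": 95, "train": 105,
        "london": 88, "berlin": 92, "car": 110, "rome": 85}
print(goopir_obfuscate("flight paris", freq, k=3, rng=random.Random(1)))
```

Every part of the output has the same number of keywords as the initial query, so the query length alone does not reveal which part is real.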
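A minimal sketch of this filtering phase is shown below. Results are represented as dictionaries with hypothetical `title` and `description` fields, and a result is kept as soon as it contains at least one keyword of the initial query; the text does not say whether one keyword or all of them are required, so this is an assumption.

```python
import re


def filter_results(results, initial_query):
    """Keep results whose title or description mentions a keyword of the initial query."""
    keywords = {word.lower() for word in initial_query.split()}
    kept = []
    for result in results:
        text = f"{result['title']} {result['description']}".lower()
        # Compare whole words so that punctuation does not hide a match.
        if keywords & set(re.findall(r"\w+", text)):
            kept.append(result)
    return kept


# Made-up results for the obfuscated query "flight paris OR hotel london".
results = [
    {"title": "Cheap flights", "description": "Book a flight to Paris, France."},
    {"title": "London hotels", "description": "Best hotel deals in London."},
]
print(filter_results(results, "flight paris"))
```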
In our experiments, to implement the behavior of GooPIR, we created the dictionary from the AOL dataset by extracting all keywords and their usage frequency from the 20 million AOL Web search queries.
5.4 Evaluation metrics
5.5 Concurrent approach
To assess the performance of SimAttack, we consider a recent attack using machine learning algorithms as the comparison baseline. This attack targets both unlinkability solutions and TMN, and uses Weka as its machine learning framework. In both cases, this attack is based on two steps: it first builds and trains a model for each user from her query history (and for TMN, it builds and trains a model from fake queries), and then it leverages these models to de-anonymize anonymous queries or to distinguish fake queries from real ones.
To de-anonymize anonymous queries, the concurrent attack uses the Support Vector Machine (SVM) classifier. This choice is motivated by a previous study that shows that the SVM classifier gives better results for text classification. To implement this attack, we reproduce the same conditions as reported by the authors: using LibSVM (i.e., an efficient implementation of SVM), the same algorithm (i.e., C-SVC), and the same type of kernel (i.e., linear). We also left the parameter Epsilon (i.e., tolerance of the termination criterion) at its default value (i.e., 0.001). However, for the parameter C (i.e., cost), Weka offers a specific option (CVParameterSelection) to find the value that maximizes the performance of the classification. Using this option, we found that the best value for C is 1.1.
To distinguish fake queries from real ones, the concurrent attack considers several machine learning algorithms: Logistic Regression, Alternating Decision Trees, Random Forest, Random Tree and ZeroR. For the sake of simplicity, we only use the Logistic Regression classifier (reported by the authors of the attack as the classifier which produces the best performance), and the SVM classifier (which was not considered in the previous study).
6 Evaluation of unlinkability solutions
In this section, we evaluate the capacity of SimAttack to compromise the anonymity of users’ queries protected by an unlinkability solution. More precisely, we assess the sensitivity of SimAttack on unlinkability solutions to various parameters. Finally, we compare the performance of SimAttack against that of the concurrent machine learning approach.
6.1 Impact of smoothing factor α
6.2 Impact of the number of users in the system
6.3 Impact of targeting p users with the highest similarity instead of the highest one
6.4 Impact of the number of user profiles
6.5 Impact of the size of the user profiles
6.6 Privacy protection
SimAttack is able to de-anonymize a large number of queries protected by unlinkability solutions. Nevertheless, its capacity to de-anonymize queries depends on the number of users in the system, and on both the number and the quality of the preliminary user profiles collected by the adversary. While SimAttack provides performance similar to that of the concurrent machine learning attack, it is much faster.
7 Evaluation of TrackMeNot
In this section, we evaluate the capacity of SimAttack to distinguish fake queries sent by TrackMeNot from the real queries sent by users. More precisely, we assess the sensitivity of SimAttack for TMN over various parameters. Finally, we compare the performance provided by SimAttack against the performance provided by the concurrent machine learning approach.
7.1 Impact of smoothing factor δ
7.2 Impact of the external knowledge
Performance of SimAttack considering an adversary with prior knowledge about RSS feeds
7.3 Impact of the number of fake queries
7.4 Impact of the size of the user profiles
The number of queries stored in the user profiles impacts the efficiency of SimAttack. To measure this impact, Fig. 11 depicts the precision and the recall of SimAttack for TMN100 when the user profiles pre-built by the adversary contain only a subset of the users’ query history (from 0 to 100 %). The results show that smaller user profiles make it harder for SimAttack to identify user queries. For instance, with 100 % of the user profiles, SimAttack identifies 36.8 % or 45.3 % of queries (depending on whether the prior knowledge is exploited), while this number drops to 12.6 % or 15.3 % with only 5 % of the user profiles. However, the quality of the attack (i.e., the precision) increases. For instance, decreasing the share of queries considered in the user profiles from 100 to 5 % raises the precision from 21.7 to 92 %. Indeed, with less accurate user profiles, SimAttack does not have enough information to correctly retrieve the users. Consequently, increasing the size of user profiles increases the recall of SimAttack, but also decreases the precision as more queries get misclassified.
7.5 Privacy protection
Performance of the machine learning classifiers on queries protected by TrackMeNot
SimAttack succeeds in distinguishing a high ratio of the fake queries sent by TrackMeNot. Nevertheless, this ratio depends on the number of fake queries generated by TMN, and on both the number and the quality of the preliminary user profiles collected by the adversary. Finally, SimAttack outperforms machine learning attacks and is faster.
8 Evaluation of GooPIR
In this section, we evaluate the capacity of SimAttack to distinguish fake queries generated by GooPIR. More precisely, we assess the sensitivity of SimAttack for GooPIR over different parameters.
8.1 Impact of the number of fake queries
We then study why some queries are not identified by SimAttack (i.e., Misclassified and Unknown queries in Fig. 13). Results show that the proportion of queries in these two categories changes according to the number of fake queries. For instance, for 1 fake query, unknown queries represent 78.9 % of non-identified queries while misclassified queries represent 21.1 %. If we consider 7 fake queries, these percentages change to 40.8 and 59.2 %, respectively.
8.2 Impact of the size of the user profiles
Furthermore, changing the number of fake queries significantly impacts the percentage of identified queries only if the adversary considers enough queries in the user profiles. For instance, adding 6 fake queries decreases the percentage of identified queries by 9.6 % when 100 % of the query history is taken into account in the user profile. This decrease drops to 3.4 % when only 10 % of the query history is considered.
SimAttack breaks GooPIR protection for more than half of the queries. In addition, the protection of users’ queries is impacted by the number of fake queries: increasing the number of fake queries offers better protection. Moreover, the size of the user profile has an impact on the performance, as inaccurate user profiles make SimAttack less efficient.
9 Evaluation of indistinguishability over an unlinkability solution
As shown in the three previous sections, both unlinkability and indistinguishability approaches fail to properly protect user queries. Therefore, we carried out two further experiments which combine these two approaches (i.e., TrackMeNot and GooPIR over an unlinkability solution).
9.1 TrackMeNot over an unlinkability solution
In this section, we evaluate a solution composed of TrackMeNot over an unlinkability solution. Consequently, both the users’ queries and the fake ones generated by TMN are sent anonymously. The remainder of this section presents a sensitivity analysis of the considered solution over various parameters.
9.1.1 Without prior knowledge on RSS feeds
9.1.2 With prior knowledge on RSS feeds
Performance of SimAttack considering an adversary with prior knowledge on RSS feeds
Furthermore, compared to the results obtained by SimAttack when no prior knowledge is considered (i.e., Section 9.1.1), SimAttack with prior knowledge increases the recall by 14.9 % but decreases the precision by 16 %. Overall, the F-Measure without prior knowledge is higher than the one with prior knowledge (27.3 versus 20.7 %). Interestingly, on TMN alone, SimAttack performs better with prior knowledge than without (the F-Measures are 61.0 and 46.3 %, respectively).
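To see how a gain in recall can still lower the F-Measure, the sketch below computes the balanced F-Measure as the harmonic mean of precision and recall. This standard definition is an assumption here, since the definition of the metrics is not reproduced in this part of the text, and the sample values are purely illustrative, not results from the experiments.

```python
def f_measure(precision, recall):
    """Balanced F-Measure: harmonic mean of precision and recall (both in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values only: recall rises but precision falls more sharply.
before = f_measure(0.50, 0.20)
after = f_measure(0.15, 0.35)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a large drop in precision can outweigh a moderate increase in recall.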
9.1.3 Impact of the number of fake queries
9.1.4 Impact of the size of the user profiles
9.2 GooPIR over an unlinkability solution
In this section, we assess a solution combining the obfuscation of GooPIR and an unlinkability solution. The remainder of this section presents a sensitivity analysis of the considered solution over various parameters.
9.2.1 Impact of the number of fake queries
9.2.2 Impact of the size of the user profiles
9.2.3 Impact of considering more than one pair (query,user)
Combining an indistinguishability technique (i.e., TrackMeNot or GooPIR) with an unlinkability solution gives better protection to users’ queries, especially if the adversary is not able to collect a large quantity of information about the user or if the user configures the indistinguishability solution to send a high number of fake queries. Nevertheless, in most cases, the adversary is still able to retrieve a non-negligible proportion of user queries.
This paper presents SimAttack, a generic attack that targets popular private Web search solutions. SimAttack leverages a similarity metric to capture the distance between a query and pre-built user profiles gathering preliminary information about the users’ interests. We exhaustively evaluate SimAttack using a real-world Web search dataset. We show that SimAttack succeeds in de-anonymizing, or retrieving among fake queries, a high ratio of the users’ initial queries.
Our analysis shows that neither unlinkability solutions nor TrackMeNot and GooPIR properly protect users. In addition, we study the combination of TrackMeNot and GooPIR over an unlinkability solution. The first combination (i.e., TrackMeNot over an unlinkability solution) gives satisfactory protection when enough fake queries are periodically sent. However, this solution generates a significant overhead in terms of messages on the network. The second combination (i.e., GooPIR over an unlinkability solution) still suffers from a high ratio of initial queries identified by SimAttack.
Dynamically evaluating protected queries in order to measure their level of protection over time represents an interesting research agenda for future work. For instance, thanks to this dynamic assessment, it will be possible to adapt the protection of queries before sending them, and to reinforce user awareness.
1 How the adversary collects preliminary information remains outside the scope of this paper.
The presented work was supported by the EEXCESS project funded by the EU Seventh Framework Programme FP7/2007-2013 under grant agreement number 600601. Research reported in this publication has been carried out as part of the International Research and Innovation Centre in Intelligent Digital Systems (IRIXYS).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- Goldschlag D, Reed M, Syverson P. Onion routing. Commun ACM. 1999; 42(2):39–41.
- Dingledine R, Mathewson N, Syverson P. Tor: The second-generation onion router. In: Proceedings of the 13th Conference on USENIX Security Symposium - Volume 13. San Diego, CA: USENIX Association: 2004. p. 21–1.
- Corrigan-Gibbs H, Ford B. Dissent: accountable anonymous group messaging. In: Proceedings of the 17th ACM Conference on Computer and Communications Security. Chicago, Illinois, USA: ACM: 2010. p. 340–50.
- Wolinsky DI, Corrigan-Gibbs H, Ford B. Dissent in numbers: Making strong anonymity scale. In: Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation. Hollywood, CA, USA: USENIX Association: 2012. p. 179–92.
- Ben Mokhtar S, Berthou G, Diarra A, Quéma V, Shoker A. Rac: A freerider-resilient, scalable, anonymous communication protocol. In: Proceedings of the 33rd IEEE International Conference on Distributed Computing Systems. Philadelphia, PA, USA: IEEE Computer Society: 2013. p. 520–9.
- Domingo-Ferrer J, Solanas A, Castellà-Roca J. h(k)-private information retrieval from privacy-uncooperative queryable databases. Online Inform Rev. 2009; 33(4):720–44.
- Toubiana V, Subramanian L, Nissenbaum H. Trackmenot: Enhancing the privacy of web search. CoRR. 2011. http://arxiv.org/abs/1109.4677.
- Peddinti ST, Saxena N. Web search query privacy: Evaluating query obfuscation and anonymizing networks. J Comput Secur. 2014; 22(1):155–99.
- Petit A, Cerqueus T, Ben Mokhtar S, Brunie L, Kosch H. Peas: Private, efficient and accurate web search. In: Proceedings of the 14th IEEE International Conference on Trust, Security and Privacy in Computing and Communications. Helsinki, Finland: IEEE Computer Society: 2015. p. 571–80.
- Eckersley P. How unique is your web browser? In: Proceedings of the 10th International Conference on Privacy Enhancing Technologies. Berlin, Germany: Springer-Verlag: 2010. p. 1–18.
- Saint-Jean F, Johnson A, Boneh D, Feigenbaum J. Private web search. In: Proceedings of the 2007 ACM Workshop on Privacy in Electronic Society. Alexandria, VA, USA: ACM: 2007. p. 84–90.
- Shapiro M. Structure and Encapsulation in Distributed Systems: the Proxy Principle. In: Proceedings of the IEEE 6th International Conference on Distributed Computing Systems. Cambridge, MA, USA: IEEE Computer Society: 1986. p. 198–204.
- Seid HA, Lespagnol AL. Virtual private network. Google Patents. US Patent 5,768,271. 1998. https://www.google.com/patents/US5768271.
- Castellà-Roca J, Viejo A, Herrera-Joancomartí J. Preserving user’s privacy in web search engines. Comput Commun. 2009; 32(13–14):1541–51.
- Lindell Y, Waisbard E. Private web search with malicious adversaries. In: Proceedings of the 10th International Conference on Privacy Enhancing Technologies. Berlin, Germany: Springer-Verlag: 2010. p. 220–35.
- Al-Rfou R, Jannen W, Patwardhan N. Trackmenot-so-good-after-all. arXiv preprint arXiv:1211.0320. 2012.
- Murugesan M, Clifton C. Providing Privacy through Plausibly Deniable Search. Sparks, NV, USA: Society for Industrial and Applied Mathematics; 2009, pp. 768–79. Chap. 65.
- Rebollo-Monedero D, Forné J. Optimized query forgery for private information retrieval. IEEE Trans Inf Theory. 2010; 56(9):4631–42.
- Ye S, Wu F, Pandey R, Chen H. Noise injection for search privacy protection. In: Proceedings of the IEEE 12th International Conference on Computational Science and Engineering. Vancouver, Canada: IEEE Computer Society: 2009. p. 1–8.
- Arampatzis A, Efraimidis P, Drosatos G. A query scrambler for search privacy on the internet. Inform Retriev. 2013; 16(6):657–79.
- Pang H, Ding X, Xiao X. Embellishing text search queries to protect user privacy. VLDB’10. 2010; 3(1–2):598–607.
- Singhal A. Modern information retrieval: A brief overview. IEEE Data Eng Bull. 2001; 24(4):35–43.
- Dice LR. Measures of the amount of ecologic association between species. Ecology. 1945; 26(3):297–302.
- Jaccard P. The distribution of the flora in the alpine zone. New Phytologist. 1912; 11(2):37–50.
- Cohen W, Ravikumar P, Fienberg S. A comparison of string metrics for matching names and records. In: Proceedings of the KDD-03 Workshop on Data Cleaning and Object Consolidation. Washington, DC, USA: ACM: 2003. p. 73–8.
- Pass G, Chowdhury A, Torgeson C. A picture of search. In: Proceedings of the 1st International Conference on Scalable Information Systems. Hong Kong: ACM: 2006. p. 1.
- Manning CD, Surdeanu M, Bauer J, Finkel J, Bethard SJ, McClosky D. The Stanford CoreNLP natural language processing toolkit. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Baltimore, MD, USA: Association for Computational Linguistics: 2014. p. 55–60.
- Porter MF. An algorithm for suffix stripping. Program. 1980; 14(3):130–7.
- Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The weka data mining software: an update. ACM SIGKDD Explor Newsletter. 2009; 11(1):10–18.
- Hearst MA, Dumais ST, Osman E, Platt J, Scholkopf B. Support vector machines. IEEE Intell Syst. 1998; 13(4):18–28.