Wfinger: A Joint-decoder for very short tardos fingerprinting codes

Fazzinga, Bettina; Furfaro, Filippo; Flesca, Sergio; Masciari, Elio

doi:10.1145/3105831.3105860

Protection of copyrighted contents is a crucial activity for digital content producers in order to avoid unauthorized use of the artifacts or worse in order to prevent sensible information to be stealth (e.g. private documents of an administrative board). A common solution is to uniquely identify each copy by embedding some distinguishing features into it. This activity is usually known as fingerprinting (a.k.a. watermarking) and the embedded content is referred as fingerprinting code. In order to make this process robust against possible malicious users attacks, it is mandatory to hide the positions where the code is embedded. However, even in the case that the positions where the code is embedded are hidden to the users, a group of malicious users (referred in what follows as pirates) may establish a coalition in order to compare their copies and identify the positions where they differ as a positions where the fingerprinting code has been embedded. This kind of attack is named coalition attack. If the coalition succeeds in this identification process, pirates can then arbitrarily change the fingerprint code embedded in the distributed copy so that it does not correspond to any of the original fingerprinting codes of the pirates. However, for the purpose of designing proper protection strategies, it can be assumed that they do not know the positions of the hidden code where the bits of their codes agreed and therefore they cannot alter these positions. This assumption is referred as the marking condition. A (collusion resistant) fingerprinting code can be built by a randomized procedure to choose codewords (the code generation) and a tracing algorithm tailored for tracing one of the pirates based on all these codewords and the forged codeword read from the unauthorized copy distributed by the pirates. Obviously, we should avoid two type of errors: 1) accusing an innocent user and 2) not accusing a pirate. In this respect, the tracing algorithm fails if it falsely accuses an innocent user or outputs no accused user at all. The above mentioned errors should occur with small probability. This problem have been largely investigated in the literature and all the approaches proposed so far shares a common terminology that we introduce here in order to ease the reading of next sections. More in detail, we briefly recall the following key terms: â¢ Alphabet size. The codewords are sequences over a fixed alphabet Î£. Usually, fingerprinting codes are built by leveraging the binary alphabet Î£ = 0,1, however larger alphabets can be used thus the size Î£ of the alphabet is an important parameter; â¢ Codelength. This parameter refers to the length of the codewords, usually denoted by n; â¢ Number of users. Usually denoted by N, it coincides with the number of codewords. â¢ Pirate Coalition Size. This parameter takes into account the actual size of the coalition that could be lower than the expected one (say it c), in such a case, the accusation algorithm should achieve a small error probability1; â¢ Error probability. A code is â-secure against a coalition of c pirates if the probability of the error of the accusation algorithm is at most e for any set of at most c pirates performing an arbitrary pirate strategy to produce the forged codeword under the marking assumption; â¢ Code rate. The rate R of a fingerprinting code is computed as r = log(N)/n, where the logarithm is binary. The goal of fingerprinting schemes is to find efficient and secure fingerprinting codes while taking into account the high cost of embedding every single digit of the code. In several real world applications fingerprinting codes may be short, such as the case of fingerprinting text documents due to the intrinsic difficulty of embedding information in text. However, in literature many proposal have been defined based on the seminal work of Tardos[11, 12] that state many interesting theoretical results on the generation of short codes that under some conditions guarantees the possibility of finding guilty users. Tardos fingerprinting scheme is optimum as it generates fingerprinting codes sufficient to deal with n users and c pirates with the guarantee that the probability of accusing an innocent is bounded by a constant e, and having a length which is asymptotically minimum. Moreover, the accusation algorithm allows to detect a traitor by looking only at the code they have been assigned to him, disregarding both the codes assigned to other users and the type of attack that have been performed. It is worth noticing that in literature have been defined many other approaches that slightly outperforms Tardos scheme while having the same asymptotic complexity [9, 10]. Unfortunately, Tardos based fingerprinting are not effective in accusation processes when the leveraged code is too short [1, 13]. For instance, in the case that the code has to be embedded in a textual document by applying some modification of words, phrases or generally speaking tokens appearing in the text of the document as described in [3, 5, 6], and the document is about 20 pages long it is expected that the longest fingerprint that can be embedded is at most 200 bit long. In such a case, the Tardos accusation algorithm fails in accusing any user with a suitable probability of being guilty. For instance, in the case that the maximum coalition size is 2 and the desired probability of being guilty is 90% it requires a code of length at least 800 bits to accuse an user. This code length could be impractical in many scenarios. In order to overcome the above mentioned limitations, new approaches have been proposed and one of the most interesting is joint-decoding. Joint-decoders compute the guilty probability for a set of users instead of a single one. A first proposal has been made in [7, 8], however those algorithms are tailored for small coalition and do not scale-up properly. This drawback occurs as the search space computation grows up exponentially w.r.t. the number of users (or the maximum expected number of users that we may conjecture that could form a coalition for spreading the pirated copy). To ameliorate this problem, joint-decoding has been investigated from the theoretical viewpoint in order to define efficient approaches that work properly for real life situations. In this respect, a Markov Chain Monte Carlo (MCMC) based approach has been proposed in [2]. The proposed approach leverages Gibbs sampling for estimating the marginal probability that a user joined a coalition for generating a pirated copy. However, this approach turns to be ineffective for code length greater than 1024 bit due to the low quality of the probability estimation (as noted by the authors themselves). The main limitation of the above mentioned Tardos based approaches is that they perform satisfactorily when dealing with image or video fingerprinting[4] while their use for textual documents turns to be ineffective. More in detail, textual document watermarking is prone to several types of attacks, even very simple ones like the so called cut & paste attack. This attack allows to completely strip the watermark and the corresponding fingerprinting code by simply extracting the text in the source document and inserting it in a brand new document. The latter cause the deletion of eventual watermark inserted in the source text. This type of attack causes the fingerprint of the pirated copy to be empty, thus avoiding any accuse to users by using Tardos based schemes. In order to overcome such limitations, we propose a simpler but still effective accusation model based on Metropolis-Hastings (MH) sampling. Next sections are devoted to describe our proposal.