A brief explanation of the Shingles algorithm
http://en.wikipedia.org/wiki/W-shingling

The document "a rose is a rose is a rose" can be tokenized as follows:

(a,rose,is,a,rose,is,a,rose)

The set of all contiguous sequences of 4 tokens (N-grams, here: 4-grams) is

{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose) }

By removing duplicate elements from this set, a 4-shingling is obtained:

{ (a,rose,is,a), (rose..
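
The construction quoted above can be sketched in a few lines of Python. This is only a minimal illustration, not code from the Wikipedia article: the function name `shingles` and the parameter `w` are hypothetical names chosen here, and tokenization is assumed to be a plain whitespace split.

```python
def shingles(tokens, w):
    """Return the set of distinct contiguous w-token sequences (the w-shingling).

    `shingles` and `w` are illustrative names, not taken from the source.
    """
    # Slide a window of length w over the token list; collecting the windows
    # into a set removes the duplicates automatically.
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


# Assumption: tokens are obtained by splitting on whitespace.
tokens = "a rose is a rose is a rose".split()
result = shingles(tokens, 4)

# Sets are unordered, so sort before printing to get a stable listing.
for shingle in sorted(result):
    print(shingle)
```

The eight tokens produce five 4-grams, two of which are repeats, so the resulting set holds three distinct shingles. If the document has fewer than `w` tokens, the range is empty and the function returns an empty set.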