To further verify our hypothesis that online discussions can serve as a source of information describing the entities and domains they belong to, we developed a simple classification task that checks how easily comments can be attributed to their source. This task serves as a “sanity check” for our work: its goal is to determine whether comments carry discriminative information with respect to their associated entities and domains.
6.1 Comment representation
To perform this classification task on comments, we first need structured representations for them. A simple way to obtain these is a bag-of-words representation, i.e., each comment c is represented by a Boolean vector:
$$\mathbf{c} = \left[w_{1}^{c}, \ldots, w_{|\mathcal{V}|}^{c}\right], $$
where \(|\mathcal{V}|\) is the size of the dictionary \(\mathcal{V}\) (number of words), and \(w_{j}^{c}\) indicates whether the word \(v_{j}\) of the dictionary appears in comment c: \(w_{j}^{c} = 1\) if so and \(w_{j}^{c} = 0\) otherwise.
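This representation can be sketched in a few lines of Python; the dictionary and comment below are purely illustrative, not taken from the data set:

```python
def bow_vector(comment, dictionary):
    """Return w_j^c = 1 if dictionary word v_j appears in comment c, else 0."""
    tokens = set(comment.lower().split())
    return [1 if word in tokens else 0 for word in dictionary]

# Toy dictionary and comment, purely illustrative
dictionary = ["plot", "finale", "actor", "twist", "episode"]
c = bow_vector("What a twist in the finale", dictionary)
# c == [0, 1, 0, 1, 0]
```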
A general problem with this approach is that the vector \(\mathbf{c}\) can become very large and sparse, especially if \(\mathcal{V}\) is the set of all words present in the data set [30]. In our case, however, the analysis in Section 5 shows that the proportion of words capable of discriminating one series from another is small (see Fig. 3). We therefore leverage this observation when generating comment representations, using only the K most informative words of each entity in the vector \(\mathbf{c}\). To quantify the importance of each word for a series or episode, we use the TF-IDF metric, calculated as follows:
$$tfidf(v,p,\mathcal{P}) = tf(v,p) \cdot idf(v,\mathcal{P}). $$
The term tf(v,p) is the Term Frequency, i.e., the number of times the word v appears in document p. The term \(idf(v,\mathcal {P})\) is the Inverse Document Frequency, which measures how common or rare the word v is across the set of documents \(\mathcal {P}\), and is calculated as:
$$idf(v,\mathcal{P}) = \log \frac{|\mathcal{P}|}{ \left|\left\{ p \in \mathcal{P} : v \in \mathcal{V}^{p} \right\}\right| }. $$
In this work, we calculate the TF-IDF metric in two different contexts: words used in series, and words used in episodes of a given series. For the first, we use the concatenation of the comments in each set \(\mathcal {C}^{d}\) as a document p, with all series \(d \in \mathcal {D}\) from the data set considered in \(\mathcal {P}\) for the IDF calculation. For the second, each document p is built from the concatenation of the comments in each set \(\mathcal {C}^{t}\) belonging to an episode \(t \in \mathcal {T}^{d}\), for a given series d. The set of documents \(\mathcal {P}\) in this case is built only from the episodes of this single series d.
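As an illustrative sketch of these definitions (the token lists below are invented stand-ins for the concatenated comments of a series or episode, and we assume the queried word occurs in at least one document):

```python
import math

def tf(v, p):
    """Term frequency: number of occurrences of word v in document p (a token list)."""
    return p.count(v)

def idf(v, P):
    """Inverse document frequency of word v over the document collection P.
    Assumes v appears in at least one document of P."""
    df = sum(1 for p in P if v in p)
    return math.log(len(P) / df)

def tfidf(v, p, P):
    return tf(v, p) * idf(v, P)

# Each "document" stands in for the concatenated comments of one series (or episode)
P = [["great", "plot", "plot"], ["boring", "plot"], ["great", "acting"]]
score = tfidf("plot", P[0], P)  # tf = 2, idf = log(3/2)
```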
Thus, we denote by \(\mathcal {V}^{K_{p}}\) the set of words of document p (either an episode t or a series d) selected by this process as the K most important, hereafter also referred to as the “top-K words” of the document. From the top-K words of each of the \(|\mathcal {P}|\) documents (ranked by TF-IDF value), we define the set \(\mathcal {V}^{K}\) of relevant words across all documents as:
$$\mathcal{V}^{K} = \bigcup_{p \in \mathcal{P}} \mathcal{V}^{K_{p}}. $$
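A minimal sketch of this selection step, assuming the per-document TF-IDF scores have already been computed (the scores and document names below are hypothetical):

```python
def top_k_words(tfidf_scores, K):
    """Select the K words with the highest TF-IDF in one document p."""
    return set(sorted(tfidf_scores, key=tfidf_scores.get, reverse=True)[:K])

# Hypothetical per-document TF-IDF scores
docs_scores = {
    "d1": {"plot": 0.8, "twist": 0.5, "the": 0.0},
    "d2": {"acting": 0.9, "plot": 0.1, "the": 0.0},
}
# Union of the top-K sets over all documents in P
V_K = set().union(*(top_k_words(scores, K=2) for scores in docs_scores.values()))
# V_K == {"plot", "twist", "acting"}  -- at most K * |P| = 4 words
```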
Thus, the vector representing each comment is now given by:
$$\mathbf{c} = \left[w_{1}^{c}, \ldots, w_{\left|\mathcal{V}^{K}\right|}^{c}\right], $$
where \(w_{i}^{c} = 1\) if the word \(v_{i} \in \mathcal {V}^{K}\) appears in comment c, and \(w_{i}^{c} = 0\) otherwise. Note that \(|\mathcal {V}^{K}|\) is at most \(K \times |\mathcal {P}|\).
With this, each comment is represented by a Boolean vector indicating, for the top-K words of each episode or series, whether that word is present in the comment.
6.2 Comment filter
Through preliminary analysis, we noticed that a large number of comments do not contain even a single word from the set of top-K words that discriminate their series or episode. For example, series d_{457} contains 1719 comments (out of a total of 2833) that do not contain any word from the top-K, for K=10, which equates to approximately 60% of the published comments about the series. Our hypothesis is that such comments are less relevant and descriptive of the series or episode with which they are associated than comments containing words from the top-K set. Therefore, we introduce a second parameter, α, which indicates the minimum number of relevant words (i.e., words from the top-K set) that a comment c must contain in order not to be discarded. In other words, we discard all comments for which:
$$\sum\limits_{j=1}^{\left|\mathcal{V}^{K}\right|} w_{j}^{c} < \alpha. $$
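The filter amounts to a simple threshold on the number of top-K words a comment contains; a minimal sketch, with an invented vocabulary and comments:

```python
def keep_comment(comment, V_K, alpha):
    """Keep a comment only if it contains at least alpha words from V_K."""
    tokens = set(comment.lower().split())
    relevant = sum(1 for v in V_K if v in tokens)  # equals sum_j w_j^c
    return relevant >= alpha

# Hypothetical top-K vocabulary and comments
V_K = {"plot", "twist", "finale"}
comments = ["loved the twist in the finale", "first!!!", "great plot"]
selected = [c for c in comments if keep_comment(c, V_K, alpha=2)]
# selected == ["loved the twist in the finale"]
```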
Thus, given specific values for the K and α parameters, we wish to know how easily the selected comments can be identified as being associated with a certain episode or series. In doing so, we aim to determine a way to select a relevant and descriptive subset of comments for the series or episode in question. Being able to select such a subset would be useful for finding good comments to explain the summaries for those entities.
As K increases, so does the number of words considered relevant for the comment representations. A longer vector representation carries more information about each comment, but the usefulness of this information depends on how discriminative the words used as features are. Greater values of K also allow more comments into the selected subset, due to how the K and α parameters interact, i.e., a longer vector is more likely to have a nonzero norm. If we take, for example, \(K = |\mathcal {V}|\), that is, all words in the data set as the set of top-K words, then all comments of length at least α would be selected as “relevant”, for any value of α.
On the other hand, as α increases, more comments are considered “irrelevant” regardless of the document. Comments with few words from the topK set would be considered to have low descriptive utility for an entity or domain. It also becomes harder for shorter comments to meet this parameter’s requirements, in general, likely increasing the average length of the selected comments.
Thus, our goal in this section is to find a representation that can serve as input for classification algorithms so that they can accurately identify those comments that are clearly associated with their series or episode and, consequently, good candidates to describe them.
6.3 Comment classification
For this purpose, we define two classification tasks to identify how well comments represent a given series and episode, respectively, as K and α vary:

In the first task, the comments are grouped by series and the objective is to assign each comment to the series with which it is associated, among the \(|\mathcal {D}|\) series of our data set. For the TF-IDF calculation, a document p is the collection of comments \(\mathcal {C}^{d}\) associated with each series d.

In the second task, the comments of a given series d are grouped by episode and the objective is to assign each comment to the episode with which it is associated, among the \(\left|\mathcal {T}^{d}\right|\) episodes of that series. In this case, for the TF-IDF calculation, a document p is the collection of comments \(\mathcal {C}^{t}\) associated with each episode t.
To perform these tasks, we used the Naive Bayes [31] classifier from the Weka tool collection [32]. Other classifiers were also tested, with similar results. The default parameter values of the Naive Bayes implementation in Weka were used, and the results were obtained with 10-fold cross-validation. Our goal is thus to find values of K and α that offer a good compromise between classification accuracy and the number of comments correctly classified in each task. Recall that if K is too large, many words will be used in the classification task, and probably many of them will be less discriminative; at the same time, if α is too large, few comments will be considered in the classification task.
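Our experiments used Weka's implementation; purely as an illustrative analogue (not the actual experimental code), a minimal Bernoulli Naive Bayes over the Boolean comment vectors can be sketched as follows, with toy data in place of the real comment representations:

```python
import math

def train_bernoulli_nb(X, y):
    """Train a Bernoulli Naive Bayes model on Boolean feature vectors X with labels y."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior = math.log(len(rows) / len(X))
        # Laplace-smoothed probability that feature j equals 1 in class c
        theta = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(len(X[0]))]
        model[c] = (log_prior, theta)
    return model

def predict(model, x):
    def log_posterior(c):
        log_prior, theta = model[c]
        return log_prior + sum(math.log(t) if xi else math.log(1 - t)
                               for xi, t in zip(x, theta))
    return max(model, key=log_posterior)

# Toy data: two "series", each with one characteristic top-K word as a feature
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = ["d1", "d1", "d2", "d2"]
model = train_bernoulli_nb(X, y)
# predict(model, [1, 0]) == "d1"
```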
Results of the classification tasks are illustrated in Fig. 6. First, observe that identifying which series each comment refers to is very easy. Second, note that including comments with no words from the top-K set in the representation causes a considerable loss of accuracy. This matches the expected behavior, since such comments have no discriminative information in their representations (their vectors consist entirely of zeros, regardless of domain).
When we begin to classify the comments by episode, given a series, the classification accuracy decreases significantly. This is expected, as the number of possible episodes a comment can belong to is far greater than the number of possible series, as seen in Table 1 (13 series, and 369 episodes in total). Note in Fig. 7 that as we consider more words (greater value of K), it becomes more difficult to identify which episode a comment refers to. In addition, disregarding comments with low informational value (relative to the number of relevant words) has a significant positive impact on the results, especially for lower values of K, although removing too many comments also worsens the accuracy.
On the other hand, in Fig. 8, we can verify that the number of comments considered in the evaluation increases with greater values of K, and decreases with greater values of α. This follows the expected behavior, and shows that we can get a more concise set of comments and with better explanatory capacity for the episode if we choose an appropriate parameter configuration.
Specifically for the data set used in this work, selecting K=10 not only considers a smaller number of comments (due to the interaction with α) but also creates more discriminative representations for the comments (higher classification accuracy). For K=10, the best accuracy in the entity classification task is obtained with α=2, with α=3 achieving an accuracy less than 1% lower. As for the domain classification task, classification accuracy is high for any value of α greater than 0, with only marginal improvements as α increases beyond that. This creates a trade-off in the selection of comments between descriptive potential (we want comments that clearly identify their series or episode) and succinctness (we want few comments).