Comparisons in Product Reviews

Annotations of comparisons in sentences from camera reviews, e.g., "A has a better zoom than B". A comparison consists of a comparative predicate which can have up to four arguments: the two entities that are compared, the aspect under discussion and in some cases a scale part of the predicate. Each sentence has been manually annotated by one of three student annotators.

Licence

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

You are free to:

Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material

for any purpose, even commercially.

Under the following terms:

Attribution — You must give appropriate credit (see below), provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

As attribution, please cite this paper:

Wiltrud Kessler and Jonas Kuhn (2014)
A Corpus of Comparisons in Product Reviews.
In Proceedings of the 9th Language Resources and Evaluation Conference (LREC 2014),
Reykjavik, Iceland, 28.-30. May 2014,
pages 2242-2248.

Persistent identifier (PID) for citation of the data:

Metadata: http://hdl.handle.net/11022/1007-0000-0000-8E6B-9
Landing page (this page): http://hdl.handle.net/11022/1007-0000-0000-8E72-0

Download

Part 1 (1700 sentences):

imscam.annotationsonly.v1.txt (version from 16.5.2014 -- data published with the LREC 2014 paper)
imscam.annotationsonly.v2.txt (version from 18.11.2014 -- small corrections of typos)
imscam.annotationsonly.v3.txt (version from 24.11.2015 -- review of annotations)

Part 2 (an additional 500 sentences):

imscam.annotationsonly.v1.txt (version from 24.11.2015)

For the sake of completeness:

sentences_agreement.txt - IDs of the 100 sentences used for the calculation of agreement scores (not all of them are comparisons).
sentences_train.txt - IDs of the 30 sentences used for annotator training (not all of them are comparisons).

You might want to download the README as well (which contains basically the information given on this page).

If you have any corrections or suggestions, please contact me.

Annotation format

Annotations are sentence-based: every line corresponds to one sentence, and the parts of a line are separated by tabs. The first part is the ID of the sentence as assigned by the extraction tool; it consists of the review ID defined in the original XML file and a sentence number. Sentences are split automatically with the Stanford CoreNLP sentence splitter, so there might be errors. The second part is a 1, indicating that the sentence is a comparison sentence. The remaining parts are the one or more comparisons contained in the sentence.


<sentence id> \t 1 \t <comparison> \t <comparison>*
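The line layout above can be read with a plain tab split. The following is a minimal Python sketch, not part of the distributed tools; the function name split_annotation_line is made up for illustration, and the input is the example annotation line given further down this page.

```python
def split_annotation_line(line):
    """Split one annotation line into (sentence_id, flag, comparisons).

    Hypothetical helper: assumes the tab-separated layout described above,
    i.e. sentence ID, the flag "1", then one or more comparison strings.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least three tab-separated parts")
    sentence_id, flag = fields[0], fields[1]
    # Skip empty fields in case a line ends with a trailing tab.
    comparisons = [field for field in fields[2:] if field.strip()]
    return sentence_id, flag, comparisons


example_line = (
    "22-122\t1\t[RANKED_PG_> [10_the 11_SD800]; "
    "[24_previous 25_Canon 26_digital 27_cameras]; "
    "[19_movie 20_capture 21_mode]; [15_powerful]; 14_more]"
)

sentence_id, flag, comparisons = split_annotation_line(example_line)
print(sentence_id, flag, len(comparisons))  # prints: 22-122 1 1
```

The comparison strings are returned unparsed, so the same helper works regardless of how many comparisons a sentence contains.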

The anchor of any comparison is a comparative predicate which can have up to four arguments: entity 1, entity 2, aspect and scale. In addition, each comparison carries a type, which encodes the kind of comparison and, where applicable, the types of the entities and the direction:


[<type> [<entity 1 tokens>]; [<entity 2 tokens>]; [<aspect tokens>]; [<scale tokens>]; <predicate tokens>]

Tokens have the format <token number>_<token form>. Token numbers start at 1 and follow the tokenization given by the Stanford CoreNLP tokenizer in the extraction tool. Every argument can occur more than once; separate items are split with <space>,<space>, e.g., [4_buttons , 6_dials] as two aspects in "D80 has more buttons and dials". The predicate can only occur once.
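One possible way to take a single comparison apart is sketched below in Python. It is an illustration under assumptions, not the authors' code: the names parse_tokens, parse_argument and parse_comparison are made up, an unfilled argument is assumed to appear as empty brackets, and token forms are assumed not to contain the sequence "]; [".

```python
import re

# Pattern for: [<type> [<e1>]; [<e2>]; [<aspect>]; [<scale>]; <predicate>]
COMPARISON_RE = re.compile(
    r"^\[(?P<type>\S+) "
    r"\[(?P<entity1>.*?)\]; \[(?P<entity2>.*?)\]; "
    r"\[(?P<aspect>.*?)\]; \[(?P<scale>.*?)\]; "
    r"(?P<predicate>.*)\]$"
)


def parse_tokens(text):
    """Turn '10_the 11_SD800' into [(10, 'the'), (11, 'SD800')]."""
    tokens = []
    for token in text.split():
        # Split on the first underscore only; the form may contain more.
        number, form = token.split("_", 1)
        tokens.append((int(number), form))
    return tokens


def parse_argument(text):
    """Return one token list per item; items are separated by ' , '."""
    text = text.strip()
    if not text:
        return []  # assumption: an unfilled argument is written as []
    return [parse_tokens(item) for item in text.split(" , ")]


def parse_comparison(text):
    """Parse one comparison string into a dictionary."""
    match = COMPARISON_RE.match(text.strip())
    if match is None:
        raise ValueError("not a comparison: " + text)
    result = {"type": match.group("type")}
    for name in ("entity1", "entity2", "aspect", "scale"):
        result[name] = parse_argument(match.group(name))
    result["predicate"] = parse_tokens(match.group("predicate"))
    return result


example = (
    "[RANKED_PG_> [10_the 11_SD800]; "
    "[24_previous 25_Canon 26_digital 27_cameras]; "
    "[19_movie 20_capture 21_mode]; [15_powerful]; 14_more]"
)
parsed = parse_comparison(example)
print(parsed["entity1"])    # prints: [[(10, 'the'), (11, 'SD800')]]
print(parsed["predicate"])  # prints: [(14, 'more')]
```

Each argument is returned as a list of items because an argument can occur more than once; the predicate is a single token list because it occurs only once.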

The type is composed of three parts separated by underscores: <comparison type>_<entity types>_<direction>. Possible values for comparison type are:

RANKED
SUPERLATIVE
EQUATIVE
DIFFERENCE

Possible values for entity types are:

P (for product)
G (for group of products)
S (for standard)
F (for aspect)
C (for company)
O (for other)

The entity type values are only present when there is an entity given in the comparison. Possible values for direction are:

> (E1 better than E2 for RANKED comparisons)
< (E2 better than E1 for RANKED comparisons)
+ (positive sentiment expression for SUPERLATIVE and EQUATIVE comparisons)
- (negative sentiment expression for SUPERLATIVE and EQUATIVE comparisons)
o (objective sentiment expression for SUPERLATIVE and EQUATIVE comparisons)
x (unknown or unclear direction)
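The type string can be split into its parts as in the following Python sketch. The function name parse_type is made up for illustration. Two points are assumptions rather than documented behaviour: the entity types are read as one letter per entity, and a type without entities or without a direction is assumed to simply omit that part.

```python
COMPARISON_TYPES = {"RANKED", "SUPERLATIVE", "EQUATIVE", "DIFFERENCE"}
ENTITY_TYPES = set("PGSFCO")
DIRECTIONS = {">", "<", "+", "-", "o", "x"}


def parse_type(type_string):
    """Split a type such as 'RANKED_PG_>' into its three parts.

    Returns (comparison_type, entity_types, direction). entity_types is ""
    and direction is None when the corresponding part is absent.
    """
    parts = type_string.split("_")
    comparison_type = parts[0]
    if comparison_type not in COMPARISON_TYPES:
        raise ValueError("unknown comparison type: " + comparison_type)
    rest = parts[1:]
    # The direction, if present, is the last part. The check is
    # case-sensitive, so direction 'o' and entity type 'O' stay distinct.
    direction = rest.pop() if rest and rest[-1] in DIRECTIONS else None
    entity_types = "".join(rest)
    unknown = set(entity_types) - ENTITY_TYPES
    if unknown:
        raise ValueError("unknown entity type(s): " + "".join(sorted(unknown)))
    return comparison_type, entity_types, direction


print(parse_type("RANKED_PG_>"))  # prints: ('RANKED', 'PG', '>')
```

In the example, "PG" would then mean that entity 1 is a product and entity 2 is a group of products.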

Example annotation line:


22-122	1	[RANKED_PG_> [10_the 11_SD800]; [24_previous 25_Canon 26_digital 27_cameras]; [19_movie 20_capture 21_mode]; [15_powerful]; 14_more]

Corresponding sentence extracted by the tool:


 22-122	Movie Mode Due to the DIGIC III processor , the SD800 has a more powerful and overall flexible movie capture mode that surpasses previous Canon digital cameras with a movie record feature .
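Once the sentences have been extracted, the token numbers in an annotation can be checked against the sentence text. The Python sketch below is only an illustration: the function names are made up, and it assumes that the extracted sentence is already tokenized with single spaces between tokens, as the example above suggests.

```python
import re


def annotated_tokens(comparison):
    """Collect (number, form) pairs from a raw comparison string.

    Rough heuristic: brackets and semicolons at the chunk edges are
    stripped, so token forms that are themselves brackets or semicolons
    are not handled.
    """
    tokens = []
    for chunk in comparison.split():
        chunk = chunk.lstrip("[").rstrip("];")
        match = re.fullmatch(r"(\d+)_(.+)", chunk)
        if match:
            tokens.append((int(match.group(1)), match.group(2)))
    return tokens


def find_mismatches(comparison, sentence):
    """Return annotated tokens that do not match the sentence tokens."""
    words = sentence.split()
    return [
        (number, form)
        for number, form in annotated_tokens(comparison)
        if number < 1 or number > len(words) or words[number - 1] != form
    ]


comparison = (
    "[RANKED_PG_> [10_the 11_SD800]; "
    "[24_previous 25_Canon 26_digital 27_cameras]; "
    "[19_movie 20_capture 21_mode]; [15_powerful]; 14_more]"
)
sentence = (
    "Movie Mode Due to the DIGIC III processor , the SD800 has a more "
    "powerful and overall flexible movie capture mode that surpasses "
    "previous Canon digital cameras with a movie record feature ."
)
print(find_mismatches(comparison, sentence))  # prints: []
```

An empty list means that every annotated token number points at the expected word, which is a quick sanity check that the extracted sentences line up with the annotations.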

Annotation guidelines

The outline of the annotation is explained in our LREC paper. For a more detailed explanation, please read the annotation guidelines (version 4, May 16th 2014).

Actual text of sentences

The files above contain only the annotations, as we cannot share the actual review text for legal reasons. The file GetComparisonSentences.zip contains code to extract the sentences from the dataset provided by Branavan et al. (2009). You will find more information in the README file included in the zip file.

If you have any problems obtaining the data, please contact me.