Comparisons in Product Reviews

This dataset contains annotations of comparisons in sentences from camera reviews, e.g., "A has a better zoom than B". A comparison consists of a comparative predicate that can have up to four arguments: the two entities being compared, the aspect under discussion, and, in some cases, a scale that is part of the predicate. All sentences have been manually annotated by one of three student annotators.


This work is licensed under a Creative Commons Attribution 3.0 Unported License.

You are free to share and adapt this work for any purpose, even commercially, provided you give appropriate attribution.

As attribution, please cite this paper:

Wiltrud Kessler and Jonas Kuhn (2014)
A Corpus of Comparisons in Product Reviews.
In Proceedings of the 9th Language Resources and Evaluation Conference (LREC 2014),
Reykjavik, Iceland, 28-30 May 2014,
pages 2242-2248.

Persistent identifier (PID) for citation of the data:


Part 1 (1700 sentences):

Part 2 (an additional 500 sentences):

For the sake of completeness:

You may also want to download the README, which contains essentially the information given on this page.

If you have any corrections or suggestions, please contact me.

Annotation format

Annotations are sentence-based: each line corresponds to one sentence, and the parts of the line are separated by tabs. The first part is the sentence ID assigned by the extraction tool; it consists of the review ID defined in the original XML file and a sentence number. Sentences are split automatically with the Stanford CoreNLP sentence splitter, so there may be errors. The second part is a 1, indicating that the sentence is a comparison sentence. These are followed by one or more comparisons contained in the sentence.

<sentence id> \t 1 \t <comparison> \t <comparison>*

The anchor of any comparison is a comparative predicate, which can have up to four arguments: entity 1, entity 2, aspect, and scale. In addition, each comparison is labeled with its type and, where applicable, the types of the entities and the direction:

[<type> [<entity 1 tokens>]; [<entity 2 tokens>]; [<aspect tokens>]; [<scale tokens>]; <predicate tokens>]

Tokens have the format <token number>_<token form>; token numbers start at 1 and follow the tokenization produced by the Stanford CoreNLP tokenizer in the extraction tool. All arguments can occur more than once; separate items are split with <space>,<space>, e.g., [4_buttons , 6_dials] as two aspects in "D80 has more buttons and dials". The predicate can only occur once.

The type is composed of three parts separated by underscores: <comparison type>_<entity types>_<direction>. Possible values for the comparison type are:

Possible values for entity types are:

The entity type values are only present when an entity is given in the comparison. Possible values for the direction are:

Example annotation line:

22-122 1 [RANKED_PG_> [10_the 11_SD800]; [24_previous 25_Canon 26_digital 27_cameras]; [19_movie 20_capture 21_mode]; [15_powerful]; 14_more]

Corresponding sentence extracted by the tool:

22-122 Movie Mode Due to the DIGIC III processor , the SD800 has a more powerful and overall flexible movie capture mode that surpasses previous Canon digital cameras with a movie record feature .
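To make the format concrete, here is a minimal Python sketch of a parser for such an annotation line. The function names are my own, and the sketch assumes exactly the separators described above (tabs between parts, "; " between arguments, " , " between items of one argument); it is an illustration, not one of the released tools.

```python
def parse_comparison(comp):
    """Parse one bracketed comparison of the form
    [<type> [e1 tokens]; [e2 tokens]; [aspect tokens]; [scale tokens]; <predicate tokens>]."""
    inner = comp.strip()[1:-1]            # drop the outer [ ]
    ctype, rest = inner.split(' ', 1)     # the type label comes first
    e1, e2, aspect, scale, pred = rest.split('; ')

    def tokens(field):
        # A field like "[10_the 11_SD800]"; multiple items are joined by " , ",
        # e.g. "[4_buttons , 6_dials]". An empty field "[]" yields an empty list.
        return [tuple(tok.split('_', 1))
                for item in field.strip('[]').split(' , ')
                for tok in item.split()]

    return {'type': ctype,
            'entity1': tokens(e1), 'entity2': tokens(e2),
            'aspect': tokens(aspect), 'scale': tokens(scale),
            'predicate': [tuple(t.split('_', 1)) for t in pred.split()]}

def parse_line(line):
    """Split an annotation line into the sentence ID and its comparisons."""
    sent_id, flag, *comps = line.rstrip('\n').split('\t')
    assert flag == '1', 'comparison sentences are marked with a 1'
    return sent_id, [parse_comparison(c) for c in comps]
```

Applied to the example line above, this yields the type RANKED_PG_>, [('10', 'the'), ('11', 'SD800')] as entity 1, and [('14', 'more')] as the predicate.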

Annotation guidelines

The outline of the annotation is explained in our LREC paper. For a more detailed explanation, please read the annotation guidelines (version 4, May 16th 2014).

Actual text of sentences

The file above contains only the annotations, as we cannot share the actual review text for legal reasons. The zip file contains code to extract the sentences from the dataset provided by Branavan et al. (2009); you will find more information in the README file included in the zip file.

If you have any problems obtaining the data, please contact me.