SemEval-2010 Task 8: Multi-Way Classification
of Semantic Relations Between Pairs of Nominals
Iris Hendrickx∗, Su Nam Kim†, Zornitsa Kozareva‡, Preslav Nakov§,
Diarmuid Ó Séaghdha¶, Sebastian Padó‖, Marco Pennacchiotti∗∗,
Lorenza Romano††, Stan Szpakowicz‡‡

† University of Melbourne, snkim@csse.unimelb.edu.au
‡ Information Sciences Institute / University of Southern California
§ National University of Singapore, nakov@comp.nus.edu.sg
¶ University of Cambridge, do242@cl.cam.ac.uk
‖ University of Stuttgart, pado@ims.uni-stuttgart.de
∗∗ Yahoo! Inc., pennacc@yahoo-inc.com
‡‡ University of Ottawa and Polish Academy of Sciences
SemEval-2010 Task 8 focuses on multi-way classification of semantic relations between pairs of nominals. The task was designed to compare different approaches to semantic relation classification and to provide a standard testbed for future research. This paper defines the task, describes the training and test data and the process of their creation, lists the participating systems (10 teams, 28 runs), and discusses their results.
SemEval-2010 Task 8 focused on semantic relations between pairs of nominals. For example, tea and ginseng are in an ENTITY-ORIGIN relation in "The cup contained tea from dried ginseng.". The automatic recognition of semantic relations has many applications, such as information extraction, document summarization, machine translation, or construction of thesauri and semantic networks. It can also facilitate auxiliary tasks such as word sense disambiguation, language modeling, paraphrasing, and recognizing textual entailment.

Our goal was to create a testbed for automatic classification of semantic relations. In developing the task we met several challenges: selecting a suitable set of relations, specifying the annotation procedure, and deciding on the details of the task itself. They are discussed briefly in Section 2; see also Hendrickx et al. (2009), which includes a survey of related work. The direct predecessor of Task 8 was Classification of semantic relations between nominals, Task 4 at SemEval-1 (Girju et al., 2009), which had a separate binary-labeled dataset for each of seven relations. We have defined SemEval-2010 Task 8 as a multi-way classification task in which the label for each example must be chosen from the complete set of ten relations and the mapping from nouns to argument slots is not provided in advance. We also provide more data: 10,717 annotated examples, compared to 1,529 in SemEval-1 Task 4.

We first decided on an inventory of semantic relations. Ideally, it should be exhaustive (enable the description of relations between any pair of nominals) and mutually exclusive (each pair of nominals in context should map onto only one relation). The literature, however, suggests that no relation inventory satisfies both needs, and, in practice, some trade-off between them must be accepted.

As a pragmatic compromise, we selected nine relations with coverage sufficiently broad to be of general and practical interest. We aimed at avoiding semantic overlap as much as possible. We included, however, two groups of strongly related relations (ENTITY-ORIGIN / ENTITY-DESTINATION and COMPONENT-WHOLE / MEMBER-COLLECTION) to assess models' ability to make such fine-grained distinctions. Our inventory is given below. The first four relations were also used in SemEval-1 Task 4, but the annotation guidelines have been revised, and thus no complete continuity with the earlier annotations can be assumed.
Cause-Effect (CE). An event or object leads to an effect. Example: those cancers were caused by radiation exposures

Instrument-Agency (IA). An agent uses an instrument. Example: phone operator

Product-Producer (PP). A producer causes a product to exist. Example: a factory manufactures suits

Content-Container (CC). An object is physically stored in a delineated area of space. Example: a bottle full of honey was weighed

Entity-Origin (EO). An entity is coming or is derived from an origin (e.g., position or material). Example: letters from foreign countries

Entity-Destination (ED). An entity is moving towards a destination. Example: the boy went to bed

Component-Whole (CW). An object is a component of a larger whole. Example: my apartment has a large kitchen

Member-Collection (MC). A member forms a nonfunctional part of a collection. Example: there are many trees in the forest

Message-Topic (MT). A message, written or spoken, is about a topic. Example: the lecture was about semantics
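For concreteness, the inventory above can be written down as a small Python constant. The abbreviations follow the list above; the directed label strings (e.g. "Cause-Effect(e1,e2)") mirror the CC(e1, e2) notation used later in the paper and are an assumption about surface form, not a claim about the released data files.

```python
# The nine relation types listed above, keyed by their abbreviations,
# plus the catch-all OTHER label used during annotation.
RELATIONS = {
    "CE": "Cause-Effect",
    "IA": "Instrument-Agency",
    "PP": "Product-Producer",
    "CC": "Content-Container",
    "EO": "Entity-Origin",
    "ED": "Entity-Destination",
    "CW": "Component-Whole",
    "MC": "Member-Collection",
    "MT": "Message-Topic",
}
OTHER = "Other"

# Each relation is directed, so with argument order taken into account the
# full label space has 9 * 2 + 1 = 19 classes (illustrative label strings).
DIRECTED_LABELS = [f"{r}(e1,e2)" for r in RELATIONS.values()] + \
                  [f"{r}(e2,e1)" for r in RELATIONS.values()] + [OTHER]
assert len(DIRECTED_LABELS) == 19
```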
We defined a set of general annotation guidelines as well as detailed guidelines for each semantic relation.¹ Here, we describe the general guidelines, which delineate the scope of the data to be collected and state general principles relevant to the annotation.

¹ The full task guidelines are available at http://docs.

Our objective is to annotate instances of semantic relations which are true in the sense of holding in the most plausible truth-conditional interpretation of the sentence. This is in the tradition of the Textual Entailment or Information Validation paradigm (Dagan et al., 2009), and in contrast to "aboutness" annotation such as semantic roles (Carreras and Màrquez, 2004) or the BioNLP 2009 task (Kim et al., 2009), where negated relations are also labelled as positive. Similarly, we exclude instances of semantic relations which hold only in speculative or counterfactual scenarios. In practice, this means disallowing annotations within the scope of modals or negations, e.g., "Smoking may/may not have caused cancer in this case."

We accept as relation arguments only noun phrases with common-noun heads. This distinguishes our task from much work in Information Extraction, which tends to focus on specific classes of named entities and on considerably more fine-grained relations than we do. Named entities are a specific category of nominal expressions best dealt with using techniques which do not apply to common nouns. We only mark up the semantic heads of nominals, which usually span a single word, except for lexicalized terms such as science fiction.

We also impose a syntactic locality requirement on example candidates, thus excluding instances where the relation arguments occur in separate sentential clauses. Permissible syntactic patterns include simple and relative clauses, compounds, and pre- and post-nominal modification. In addition, we did not annotate examples whose interpretation relied on discourse knowledge, which led to the exclusion of pronouns as arguments. Please see the guidelines for details on other issues, including noun compounds and aspectual phenomena.

The annotation took place in three rounds. First, we manually collected around 1,200 sentences for each relation through pattern-based Web search. In order to ensure a wide variety of example sentences, we used a substantial number of patterns for each relation, typically between one hundred and several hundred. Importantly, in the first round, the relation itself was not annotated: the goal was merely to collect positive and near-miss candidate instances. A rough aim was to have 90% of candidates which instantiate the target relation ("positive instances").

In the second round, the collected candidates for each relation went to two independent annotators for labeling. Since we have a multi-way classification task, the annotators used the full inventory of nine relations plus OTHER. The annotation was made easier by the fact that the cases of overlap were largely systematic, arising from general phenomena like metaphorical use and situations where more than one relation holds. For example, there is a systematic potential overlap between CONTENT-CONTAINER and ENTITY-DESTINATION depending on whether the situation described in the sentence is static or dynamic, e.g., "When I came, the <e1>apples</e1> were already put in the <e2>basket</e2>." is CC(e1, e2), while "Then, the <e1>apples</e1> were quickly put in the <e2>basket</e2>." is ED(e1, e2).

In the third round, the remaining disagreements were resolved, and, if no consensus could be achieved, the examples were removed. Finally, we merged all nine datasets to create a set of 10,717 instances.² We released 8,000 for training and kept the rest for testing.

² This set includes 891 examples from SemEval-1 Task 4. We re-annotated them and assigned them as the last examples of our training dataset to ensure that the test set was unseen.
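The argument markup in the examples above wraps the two annotated nominal heads in <e1>…</e1> and <e2>…</e2> tags. A minimal sketch of how such a marked-up sentence could be split into plain text plus the two arguments is given below; the function name and return convention are illustrative, and this is not the official reader for the released files.

```python
import re

# Matches the <e1>...</e1> and <e2>...</e2> argument markup shown in the
# annotation examples above (e.g. "the <e1>apples</e1> ... <e2>basket</e2>").
_TAG = re.compile(r"<e([12])>(.*?)</e\1>")

def parse_marked_sentence(marked: str):
    """Return (plain_text, e1, e2) for a sentence with <e1>/<e2> markup."""
    args = {}

    def strip_tag(match):
        # Remember which argument slot this span fills, keep only the text.
        args["e" + match.group(1)] = match.group(2)
        return match.group(2)

    plain = _TAG.sub(strip_tag, marked)
    return plain, args.get("e1"), args.get("e2")

if __name__ == "__main__":
    text, e1, e2 = parse_marked_sentence(
        "Then, the <e1>apples</e1> were quickly put in the <e2>basket</e2>."
    )
    print(text)    # Then, the apples were quickly put in the basket.
    print(e1, e2)  # apples basket
```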
Table 1 shows some statistics about the dataset. The first column (Freq) shows the absolute and relative frequencies of each relation. The second column (Pos) shows that the average share of positive instances was closer to 75% than to 90%, indicating that the patterns catch a substantial amount of "near-miss" cases. However, this effect varies a lot across relations, causing the non-uniform relation distribution in the dataset (first column).³

³ To what extent our candidate selection produces a biased sample is a question that we cannot address within this paper.

Table 1: Annotation statistics. Freq: absolute and relative frequency in the dataset; Pos: percentage of "positive" relation instances in the candidate set; IAA: inter-annotator agreement.

After the second round, we also computed inter-annotator agreement (third column, IAA). Inter-annotator agreement was computed on the sentence level, as the percentage of sentences for which the two annotations were identical. That is, these figures can be interpreted as exact-match accuracies. We do not report Kappa, since chance agreement on preselected candidates is difficult to estimate.⁴ IAA is between 60% and 95%, again with large relation-dependent variation. Some of the relations were particularly easy to annotate, notably CONTENT-CONTAINER, which can be resolved through relatively clear criteria, despite the systematic ambiguity mentioned above. ENTITY-ORIGIN was the hardest relation to annotate. We encountered ontological difficulties in defining both Entity (e.g., in contrast to Effect) and Origin (as opposed to Cause).

⁴ We do not report Pos or IAA for OTHER, since OTHER is a pseudo-relation that was not annotated in its own right. The numbers would therefore not be comparable to other relations.
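Since IAA is defined here simply as the proportion of sentences on which the two annotators chose exactly the same label, it can be computed in a few lines. The sketch below assumes the two annotators' labels are given as lists aligned by sentence; the names and the toy data are illustrative.

```python
def exact_match_iaa(labels_a, labels_b):
    """Inter-annotator agreement as the fraction of sentences on which the
    two annotators assigned exactly the same label (exact-match accuracy)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("annotations must cover the same sentences")
    agree = sum(a == b for a, b in zip(labels_a, labels_b))
    return agree / len(labels_a)

# Toy example: 3 of 4 candidate sentences get identical labels -> IAA = 0.75.
print(exact_match_iaa(
    ["Content-Container", "Other", "Entity-Origin", "Cause-Effect"],
    ["Content-Container", "Entity-Destination", "Entity-Origin", "Cause-Effect"],
))
```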
Our numbers are on average around 10% higher than those reported by Girju et al. (2009). This may be a side effect of our data collection method. To gather 1,200 examples in realistic time, we had to seek productive search query patterns, which invited a certain homogeneity. For example, many queries for CONTENT-CONTAINER centered on "usual suspects" such as box or suitcase. Many instances of MEMBER-COLLECTION were collected on the basis of available lists of collective names.

The participating systems had to solve the following task: given a sentence and two tagged nominals, predict the relation between those nominals and the direction of that relation.

We released a detailed scorer which outputs (1) a confusion matrix, (2) accuracy and coverage, (3) precision (P), recall (R), and F1-Score for each relation, (4) micro-averaged P, R, F1, and (5) macro-averaged P, R, F1. For (4) and (5), the calculations ignored the OTHER relation. Our official scoring metric is macro-averaged F1-Score for (9+1)-way classification, taking directionality into account.
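The official metric, macro-averaged F1 over the nine relations with OTHER ignored and directionality taken into account, can be re-implemented roughly as follows. This is an illustrative sketch of the metric as described in the text, not the released scorer, whose exact bookkeeping may differ; the directed label strings (e.g. "Cause-Effect(e2,e1)") are assumed.

```python
from collections import Counter

def relation_type(label: str) -> str:
    """Strip the direction, e.g. "Cause-Effect(e2,e1)" -> "Cause-Effect"."""
    return label.split("(")[0]

def macro_f1_excluding_other(gold, pred, other="Other"):
    """Macro-averaged F1 over the nine relation types, ignoring OTHER.

    A prediction counts as correct only if both relation and direction match,
    which is one straightforward reading of "taking directionality into
    account".
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[relation_type(g)] += 1
        else:
            fp[relation_type(p)] += 1
            fn[relation_type(g)] += 1
    types = {relation_type(l) for l in set(gold) | set(pred)} - {other}
    f1s = []
    for t in types:
        prec = tp[t] / (tp[t] + fp[t]) if tp[t] + fp[t] else 0.0
        rec = tp[t] / (tp[t] + fn[t]) if tp[t] + fn[t] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0
```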
The teams were asked to submit test data predictions for varying fractions of the training data. Specifically, we requested results for the first 1000, 2000, 4000, and 8000 training instances, called TD1 through TD4. TD4 was the full training set.

Table 2 lists the participants and provides a rough overview of the system features. Table 3 shows the results. Unless noted otherwise, all quoted numbers are F1-Scores. We rank the teams by the performance of their best system on TD4, since a per-system ranking would favor teams with many submitted runs. UTD submitted the best system, with a performance of over 82%, more than 4% better than the second-best system. FBK IRST places second, with 77.62%, a tiny margin ahead of ISI (77.57%). Notably, the ISI system outperforms the FBK IRST system for TD1 to TD3, where it was second-best. The accuracy numbers for TD4 (Acc TD4) lead to the same overall ranking: micro- versus macro-averaging does not appear to make much difference either.

A random baseline gives an uninteresting score of 6%. Our competitive baseline system is a simple Naive Bayes classifier which relies on words in the sentential context only; two systems scored below this baseline.
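A baseline of the kind described above can be sketched in a few lines with scikit-learn: a bag-of-words model over the sentence, with the argument tags stripped, fed to a multinomial Naive Bayes classifier. This is a sketch in the spirit of the description, not the organizers' implementation; the helper names and preprocessing choices are assumptions.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def strip_markup(sentence: str) -> str:
    """Remove the <e1>/<e2> argument tags, keeping only the words."""
    return re.sub(r"</?e[12]>", "", sentence)

def train_nb_baseline(train_sentences, train_labels):
    """Bag-of-words Naive Bayes over the sentential context only."""
    model = make_pipeline(CountVectorizer(lowercase=True), MultinomialNB())
    model.fit([strip_markup(s) for s in train_sentences], train_labels)
    return model

# Usage (with hypothetical data):
# model = train_nb_baseline(train_sentences, train_labels)
# predictions = model.predict([strip_markup(s) for s in test_sentences])
```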
Table 2: Participants of SemEval-2010 Task 8. Res: resources used (WN: WordNet data; WP: Wikipedia data; S: syntax; LC: Levin classes; G: Google n-grams; RT: Roget's Thesaurus; PB/NB: PropBank/NomBank). Class: classification style (ME: Maximum Entropy; BN: Bayes Net; DR: Decision Rules/Trees; CRF: Conditional Random Fields; 2S: two-step classification).

Table 3: F1-Score of all submitted systems on the test dataset as a function of training data: TD1=1000, TD2=2000, TD3=4000, TD4=8000 training examples. Official results are calculated on TD4. The results marked with ∗ were submitted after the deadline. The best-performing run for each participant is italicized.
As for the amount of training data, we see a substantial improvement for all systems between TD1 and TD4, with diminishing returns for the transition between TD3 and TD4 for many, but not all, systems. Overall, the differences between systems are smaller for TD4 than they are for TD1. The spread between the top three systems is around 10% at TD1, but below 5% at TD4. Still, there are clear differences in the influence of training data size even among systems with the same overall architecture. Notably, ECNU-SR-4 is the second-best system at TD1 (67.95%), but gains only 7% from the eightfold increase of the size of the training data. At the same time, ECNU-SR-3 improves from less than 40% to almost 69%. The difference between the systems is that ECNU-SR-4 uses a multi-way classifier including the class OTHER, while ECNU-SR-3 uses binary classifiers and assigns OTHER if no other relation was assigned with p>0.5. It appears that these probability estimates for classes are only reliable enough for TD3 and TD4.

The Influence of System Architecture. Almost all systems used either MaxEnt or SVM classifiers, with no clear advantage for either. Similarly, two systems, UTD and ISTI (rank 1 and 6), split the task into two classification steps (relation and direction), but the 2nd- and 3rd-ranked systems do not. The use of a sequence model such as a CRF did not show a clear advantage either.
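The two-step decomposition mentioned above can be pictured as follows: one classifier chooses the relation type, a second chooses the argument order, and the two decisions are concatenated into a directed label. The sketch leaves the classifiers abstract; the interfaces, feature dictionary, and label strings are illustrative assumptions, not a description of the UTD or ISTI systems.

```python
from typing import Callable

def two_step_predict(
    features: dict,
    predict_relation: Callable[[dict], str],   # e.g. "Cause-Effect" or "Other"
    predict_direction: Callable[[dict], str],  # "(e1,e2)" or "(e2,e1)"
) -> str:
    """Combine a relation classifier and a direction classifier."""
    relation = predict_relation(features)
    if relation == "Other":
        return relation          # OTHER carries no direction
    return relation + predict_direction(features)

# Example with trivial stand-in classifiers:
print(two_step_predict(
    {"head_e1": "cancers", "head_e2": "exposures"},
    predict_relation=lambda f: "Cause-Effect",
    predict_direction=lambda f: "(e2,e1)",
))  # -> Cause-Effect(e2,e1)
```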
The systems use a variety of resources. Generally, richer feature sets lead to better performance (although the differences are often small; compare the different FBK IRST systems). This improvement can be explained by the need for semantic generalization from training to test data. This need can be addressed using WordNet (contrast ECNU-1 to -3 with ECNU-4 to -6), the Google n-gram collection (see ISI and UTD), or a "deep" semantic resource (FBK IRST uses Cyc). Yet, most of these resources are also included in the less successful systems, so beneficial integration of knowledge sources into semantic relation classification seems to be difficult.

The differences between the systems suggest that it might be possible to achieve improvements by building an ensemble system. When we combine the top three systems (UTD, FBK IRST-12VBCA, and ISI) by predicting their majority vote, or OTHER if there was none, we obtain a small improvement over the UTD system, with an F1-Score of 82.79%. A combination of the top five systems using the same method shows a worse performance, however (80.42%). This suggests that the best system outperforms the rest by a margin that cannot be compensated with system combination, at least not with a crude majority vote. We see a similar pattern among the ECNU systems, where the ECNU-SR-7 combination system is outperformed by ECNU-SR-5, presumably since it incorporates the inferior ECNU-SR-1 system.
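The combination rule used for the ensemble experiment, majority vote with a back-off to OTHER when no label wins a majority, is easy to state in code. The sketch below is one reading of that rule; it is not the exact script behind the numbers reported above.

```python
from collections import Counter

def majority_vote(predictions, other="Other"):
    """Combine one example's predictions from several systems by majority
    vote, backing off to OTHER when no label is predicted by more than half
    of the systems."""
    votes = Counter(predictions)
    label, count = votes.most_common(1)[0]
    return label if count > len(predictions) / 2 else other

# Toy example with three systems:
print(majority_vote(["Cause-Effect(e1,e2)", "Cause-Effect(e1,e2)", "Other"]))
# -> Cause-Effect(e1,e2)
print(majority_vote(["Cause-Effect(e1,e2)", "Message-Topic(e1,e2)", "Other"]))
# -> Other
```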
We also looked at performance on individual relations, especially the extremes. There are very stable patterns across all systems. The best relation (presumably the easiest to classify) is CE, far ahead of ED and MC. Notably, the performance for the best relation is 75% or above for almost all systems, with comparatively small differences between the systems. The hardest relation is generally IA, followed by PP.⁵ Here, the spread among the systems is much larger: the highest-ranking systems outperform the others on the difficult relations. Recall was the main problem for both IA and PP: many examples of these two relations are misclassified, most frequently as OTHER. Even at TD4, these datasets seem to be less homogeneous than the others.
Intriguingly, PP shows a very high inter-annotator agreement (Table 1). Its difficulty may therefore be due not to questionable annotation, but to genuine variability, or at least the selection of difficult patterns by the dataset creator. Conversely, MC, among the easiest relations to model, shows only a modest IAA.

⁵ The relation OTHER, which we ignore in the overall F1-score, does even worse, often below 40%. This is to be expected, since the OTHER examples in our datasets are near misses for other relations, thus making a very incoherent class.

We also examined the examples that are classified incorrectly by all systems. We analyzed them, looking for sources of errors. In addition to a handful of annotation errors and some borderline cases, they are made up of instances which illustrate the limits of current shallow modeling approaches in that they require more lexical knowledge and complex reasoning. A case in point: The bottle carrier converts your <e1>bottle</e1> into a <e2>canteen</e2>. Here, OTHER is misclassified either as CC (due to the nominals) or as ED (because of the preposition into). Another example: [...] <e1>Rudders</e1> are used by <e2>towboats</e2> and other vessels that require a high degree of manoeuvrability. This is an instance of CW misclassified as IA, probably on account of the verb use, which is a frequent indicator of IA.

There is little doubt that 19-way classification is a non-trivial challenge. It is even harder when the domain is lexical semantics, with its idiosyncrasies, and when the classes are not necessarily disjoint, despite our best intentions. It speaks to the success of the exercise that the participating systems' performance was generally high, well over an order of magnitude above random guessing. This may be due to the impressive array of tools and lexical-semantic resources deployed by the participants.

Section 4 suggests a few ways of interpreting and analyzing the results. Long-term lessons will undoubtedly emerge from the workshop discussion. One optimistic-pessimistic conclusion concerns the size of the training data. The notable gain TD3 → TD4 suggests that even more data would be helpful, but that is so much easier said than done: it took the organizers well in excess of 1000 person-hours to pin down the problem, hone the guidelines and relation definitions, construct sufficient amounts of trustworthy training data, and run the task.

X. Carreras and L. Màrquez. 2004. Introduction to the CoNLL-2004 shared task: Semantic role labeling. In Proc. CoNLL-2004.

I. Dagan, B. Dolan, B. Magnini, and D. Roth. 2009. Recognizing textual entailment: Rational, evaluation and approaches. Natural Language Engineering, 15(4).

R. Girju, P. Nakov, V. Nastase, S. Szpakowicz, P. Turney, and D. Yuret. 2009. Classification of semantic relations between nominals. Language Resources and Evaluation.

I. Hendrickx, S. Kim, Z. Kozareva, P. Nakov, D. Ó Séaghdha, S. Padó, M. Pennacchiotti, L. Romano, and S. Szpakowicz. 2009. SemEval-2010 Task 8: Multi-way classification of semantic relations between pairs of nominals. In Proc. NAACL Workshop on Semantic Evaluations, Boulder, CO.

J. Kim, T. Ohta, S. Pyysalo, Y. Kano, and J. Tsujii. 2009. Overview of BioNLP'09 shared task on event extraction. In Proc. BioNLP-09, Boulder, CO.