SemEval-2010 Task 8: Multi-Way Classification
of Semantic Relations Between Pairs of Nominals
Iris Hendrickx∗, Su Nam Kim†, Zornitsa Kozareva‡, Preslav Nakov§,
Diarmuid Ó Séaghdha¶, Sebastian Padó‖, Marco Pennacchiotti∗∗,
Lorenza Romano††, Stan Szpakowicz‡‡

† University of Melbourne, snkim@csse.unimelb.edu.au
‡ Information Sciences Institute / University of Southern California
§ National University of Singapore, nakov@comp.nus.edu.sg
¶ University of Cambridge, do242@cl.cam.ac.uk
‖ University of Stuttgart, pado@ims.uni-stuttgart.de
∗∗ Yahoo! Inc., pennacc@yahoo-inc.com
‡‡ University of Ottawa and Polish Academy of Sciences
SemEval-2010 Task 8 focuses on multi-way classification of semantic relations between pairs of nominals. The task was designed to compare different approaches to semantic relation classification and to provide a standard testbed for future research. This paper defines the task, describes the training and test data and the process of their creation, lists the participating systems (10 teams, 28 runs), and discusses their results.
SemEval-2010 Task 8 focused on semantic relations between pairs of nominals. For example, tea and ginseng are in an ENTITY-ORIGIN relation in "The cup contained tea from dried ginseng.". The automatic recognition of semantic relations has many applications, such as information extraction, document summarization, machine translation, or construction of thesauri and semantic networks. It can also facilitate auxiliary tasks such as word sense disambiguation, language modeling, paraphrasing, and recognizing textual entailment.

Our goal was to create a testbed for automatic classification of semantic relations. In developing the task we met several challenges: selecting a suitable set of relations, specifying the annotation procedure, and deciding on the details of the task itself. They are discussed briefly in Section 2; see also Hendrickx et al. (2009), which includes a survey of related work. The direct predecessor of Task 8 was Classification of semantic relations between nominals, Task 4 at SemEval-1 (Girju et al., 2009), which had a separate binary-labeled dataset for each of seven relations. We have defined SemEval-2010 Task 8 as a multi-way classification task in which the label for each example must be chosen from the complete set of ten relations and the mapping from nouns to argument slots is not provided in advance. We also provide more data: 10,717 annotated examples, compared to 1,529 in SemEval-1 Task 4.

We first decided on an inventory of semantic relations. Ideally, it should be exhaustive (enable the description of relations between any pair of nominals) and mutually exclusive (each pair of nominals in context should map onto only one relation). The literature, however, suggests that no relation inventory satisfies both needs, and, in practice, some trade-off between them must be accepted.

As a pragmatic compromise, we selected nine relations with coverage sufficiently broad to be of general and practical interest. We aimed at avoiding semantic overlap as much as possible. We included, however, two groups of strongly related relations (ENTITY-ORIGIN / ENTITY-DESTINATION and COMPONENT-WHOLE / MEMBER-COLLECTION) to assess models' ability to make such fine-grained distinctions. Our inventory is given below. The first four relations were also used in SemEval-1 Task 4, but the annotation guidelines have been revised, and thus no complete continuity with the earlier annotations can be assumed.
Cause-Effect (CE). An event or object leads to an effect. Example: those cancers were caused by radiation exposures

Instrument-Agency (IA). An agent uses an instrument. Example: phone operator

Product-Producer (PP). A producer causes a product to exist. Example: a factory manufactures suits

Content-Container (CC). An object is physically stored in a delineated area of space. Example: a bottle full of honey was weighed

Entity-Origin (EO). An entity is coming or is derived from an origin (e.g., position or material). Example: letters from foreign countries

Entity-Destination (ED). An entity is moving towards a destination. Example: the boy went to bed

Component-Whole (CW). An object is a component of a larger whole. Example: my apartment has a large kitchen

Member-Collection (MC). A member forms a nonfunctional part of a collection. Example: there are many trees in the forest

Message-Topic (MT). A message, written or spoken, is about a topic. Example: the lecture was about semantics
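For concreteness, the inventory above can be written down as a small Python constant. The abbreviations follow the list above; the directed label strings (e.g. "Cause-Effect(e1,e2)") mirror the CC(e1, e2) notation used later in the paper and are an assumption about surface form, not a claim about the released data files.

```python
# The nine relation types listed above, keyed by their abbreviations,
# plus the catch-all OTHER label used during annotation.
RELATIONS = {
    "CE": "Cause-Effect",
    "IA": "Instrument-Agency",
    "PP": "Product-Producer",
    "CC": "Content-Container",
    "EO": "Entity-Origin",
    "ED": "Entity-Destination",
    "CW": "Component-Whole",
    "MC": "Member-Collection",
    "MT": "Message-Topic",
}
OTHER = "Other"

# Each relation is directed, so with argument order taken into account the
# full label space has 9 * 2 + 1 = 19 classes (illustrative label strings).
DIRECTED_LABELS = [f"{r}(e1,e2)" for r in RELATIONS.values()] + \
                  [f"{r}(e2,e1)" for r in RELATIONS.values()] + [OTHER]
assert len(DIRECTED_LABELS) == 19
```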
We defined a set of general annotation guidelines as well as detailed guidelines for each semantic relation.¹ Here, we describe the general guidelines, which delineate the scope of the data to be collected and state general principles relevant to the annotation.

¹ The full task guidelines are available at http://docs.

Our objective is to annotate instances of semantic relations which are true in the sense of holding in the most plausible truth-conditional interpretation of the sentence. This is in the tradition of the Textual Entailment or Information Validation paradigm (Dagan et al., 2009), and in contrast to "aboutness" annotation such as semantic roles (Carreras and Màrquez, 2004) or the BioNLP 2009 task (Kim et al., 2009), where negated relations are also labelled as positive. Similarly, we exclude instances of semantic relations which hold only in speculative or counterfactual scenarios. In practice, this means disallowing annotations within the scope of modals or negations, e.g., "Smoking may/may not have caused cancer in this case."

We accept as relation arguments only noun phrases with common-noun heads. This distinguishes our task from much work in Information Extraction, which tends to focus on specific classes of named entities and on considerably more fine-grained relations than we do. Named entities are a specific category of nominal expressions best dealt with using techniques which do not apply to common nouns. We only mark up the semantic heads of nominals, which usually span a single word, except for lexicalized terms such as science fiction.

We also impose a syntactic locality requirement on example candidates, thus excluding instances where the relation arguments occur in separate sentential clauses. Permissible syntactic patterns include simple and relative clauses, compounds, and pre- and post-nominal modification. In addition, we did not annotate examples whose interpretation relied on discourse knowledge, which led to the exclusion of pronouns as arguments. Please see the guidelines for details on other issues, including noun compounds and aspectual phenomena.

The annotation took place in three rounds. First, we manually collected around 1,200 sentences for each relation through pattern-based Web search. In order to ensure a wide variety of example sentences, we used a substantial number of patterns for each relation, typically between one hundred and several hundred. Importantly, in the first round, the relation itself was not annotated: the goal was merely to collect positive and near-miss candidate instances. A rough aim was to have 90% of candidates which instantiate the target relation ("positive instances").

In the second round, the collected candidates for each relation went to two independent annotators for labeling. Since we have a multi-way classification task, the annotators used the full inventory of nine relations plus OTHER. The annotation was made easier by the fact that the cases of overlap were largely systematic, arising from general phenomena like metaphorical use and situations where more than one relation holds. For example, there is a systematic potential overlap between CONTENT-CONTAINER and ENTITY-DESTINATION depending on whether the situation described in the sentence is static or dynamic, e.g., "When I came, the <e1>apples</e1> were already put in the <e2>basket</e2>." is CC(e1, e2), while "Then, the <e1>apples</e1> were quickly put in the <e2>basket</e2>." is ED(e1, e2).

In the third round, the remaining disagreements were resolved, and, if no consensus could be achieved, the examples were removed. Finally, we merged all nine datasets to create a set of 10,717 instances.² We released 8,000 for training and kept the rest for testing.

² This set includes 891 examples from SemEval-1 Task 4. We re-annotated them and assigned them as the last examples of our training dataset to ensure that the test set was unseen.
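The argument markup in the examples above wraps the two annotated nominal heads in <e1>…</e1> and <e2>…</e2> tags. A minimal sketch of how such a marked-up sentence could be split into plain text plus the two arguments is given below; the function name and return convention are illustrative, and this is not the official reader for the released files.

```python
import re

# Matches the <e1>...</e1> and <e2>...</e2> argument markup shown in the
# annotation examples above (e.g. "the <e1>apples</e1> ... <e2>basket</e2>").
_TAG = re.compile(r"<e([12])>(.*?)</e\1>")

def parse_marked_sentence(marked: str):
    """Return (plain_text, e1, e2) for a sentence with <e1>/<e2> markup."""
    args = {}

    def strip_tag(match):
        # Remember which argument slot this span fills, keep only the text.
        args["e" + match.group(1)] = match.group(2)
        return match.group(2)

    plain = _TAG.sub(strip_tag, marked)
    return plain, args.get("e1"), args.get("e2")

if __name__ == "__main__":
    text, e1, e2 = parse_marked_sentence(
        "Then, the <e1>apples</e1> were quickly put in the <e2>basket</e2>."
    )
    print(text)    # Then, the apples were quickly put in the basket.
    print(e1, e2)  # apples basket
```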
Table 1 shows some statistics about the dataset. The first column (Freq) shows the absolute and relative frequencies of each relation. The second column (Pos) shows that the average share of positive instances was closer to 75% than to 90%, indicating that the patterns catch a substantial amount of "near-miss" cases. However, this effect varies a lot across relations, causing the non-uniform relation distribution in the dataset (first column).³

³ To what extent our candidate selection produces a biased sample is a question that we cannot address within this paper.

Table 1: Annotation statistics. Freq: absolute and relative frequency in the dataset; Pos: percentage of "positive" relation instances in the candidate set; IAA: inter-annotator agreement.

After the second round, we also computed inter-annotator agreement (third column, IAA). Inter-annotator agreement was computed on the sentence level, as the percentage of sentences for which the two annotations were identical. That is, these figures can be interpreted as exact-match accuracies. We do not report Kappa, since chance agreement on preselected candidates is difficult to estimate.⁴ IAA is between 60% and 95%, again with large relation-dependent variation. Some of the relations were particularly easy to annotate, notably CONTENT-CONTAINER, which can be resolved through relatively clear criteria, despite the systematic ambiguity mentioned above. ENTITY-ORIGIN was the hardest relation to annotate. We encountered ontological difficulties in defining both Entity (e.g., in contrast to Effect) and Origin (as opposed to Cause).

⁴ We do not report Pos or IAA for OTHER, since OTHER is a pseudo-relation that was not annotated in its own right. The numbers would therefore not be comparable to other relations.
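Since IAA is defined here simply as the proportion of sentences on which the two annotators chose exactly the same label, it can be computed in a few lines. The sketch below assumes the two annotators' labels are given as lists aligned by sentence; the names and the toy data are illustrative.

```python
def exact_match_iaa(labels_a, labels_b):
    """Inter-annotator agreement as the fraction of sentences on which the
    two annotators assigned exactly the same label (exact-match accuracy)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("annotations must cover the same sentences")
    agree = sum(a == b for a, b in zip(labels_a, labels_b))
    return agree / len(labels_a)

# Toy example: 3 of 4 candidate sentences get identical labels -> IAA = 0.75.
print(exact_match_iaa(
    ["Content-Container", "Other", "Entity-Origin", "Cause-Effect"],
    ["Content-Container", "Entity-Destination", "Entity-Origin", "Cause-Effect"],
))
```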
Our numbers are on average around 10% higher than those reported by Girju et al. (2009). This may be a side effect of our data collection method. To gather 1,200 examples in realistic time, we had to seek productive search query patterns, which invited a certain homogeneity. For example, many queries for CONTENT-CONTAINER centered on "usual suspects" such as box or suitcase. Many instances of MEMBER-COLLECTION were collected on the basis of available lists of collective names.

The participating systems had to solve the following task: given a sentence and two tagged nominals, predict the relation between those nominals and the direction of that relation.

We released a detailed scorer which outputs (1) a confusion matrix, (2) accuracy and coverage, (3) precision (P), recall (R), and F1-Score for each relation, (4) micro-averaged P, R, F1, and (5) macro-averaged P, R, F1. For (4) and (5), the calculations ignored the OTHER relation. Our official scoring metric is macro-averaged F1-Score for (9+1)-way classification, taking directionality into account.
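The official metric, macro-averaged F1 over the nine relations with OTHER ignored and directionality taken into account, can be re-implemented roughly as follows. This is an illustrative sketch of the metric as described in the text, not the released scorer, whose exact bookkeeping may differ; the directed label strings (e.g. "Cause-Effect(e2,e1)") are assumed.

```python
from collections import Counter

def relation_type(label: str) -> str:
    """Strip the direction, e.g. "Cause-Effect(e2,e1)" -> "Cause-Effect"."""
    return label.split("(")[0]

def macro_f1_excluding_other(gold, pred, other="Other"):
    """Macro-averaged F1 over the nine relation types, ignoring OTHER.

    A prediction counts as correct only if both relation and direction match,
    which is one straightforward reading of "taking directionality into
    account".
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[relation_type(g)] += 1
        else:
            fp[relation_type(p)] += 1
            fn[relation_type(g)] += 1
    types = {relation_type(l) for l in set(gold) | set(pred)} - {other}
    f1s = []
    for t in types:
        prec = tp[t] / (tp[t] + fp[t]) if tp[t] + fp[t] else 0.0
        rec = tp[t] / (tp[t] + fn[t]) if tp[t] + fn[t] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0
```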
The teams were asked to submit test data predictions for varying fractions of the training data. Specifically, we requested results for the first 1000, 2000, 4000, and 8000 training instances, called TD1 through TD4. TD4 was the full training set.

Table 2 lists the participants and provides a rough overview of the system features. Table 3 shows the results. Unless noted otherwise, all quoted numbers are F1-Scores. We rank the teams by the performance of their best system on TD4, since a per-system ranking would favor teams with many submitted runs. UTD submitted the best system, with a performance of over 82%, more than 4% better than the second-best system. FBK IRST places second, with 77.62%, a tiny margin ahead of ISI (77.57%). Notably, the ISI system outperforms the FBK IRST system for TD1 to TD3, where it was second-best. The accuracy numbers for TD4 (Acc TD4) lead to the same overall ranking: micro- versus macro-averaging does not appear to make much difference either.

A random baseline gives an uninteresting score of 6%. Our competitive baseline system is a simple Naive Bayes classifier which relies on words in the sentential context only; two systems scored below this baseline.
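A baseline of the kind described above can be sketched in a few lines with scikit-learn: a bag-of-words model over the sentence, with the argument tags stripped, fed to a multinomial Naive Bayes classifier. This is a sketch in the spirit of the description, not the organizers' implementation; the helper names and preprocessing choices are assumptions.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def strip_markup(sentence: str) -> str:
    """Remove the <e1>/<e2> argument tags, keeping only the words."""
    return re.sub(r"</?e[12]>", "", sentence)

def train_nb_baseline(train_sentences, train_labels):
    """Bag-of-words Naive Bayes over the sentential context only."""
    model = make_pipeline(CountVectorizer(lowercase=True), MultinomialNB())
    model.fit([strip_markup(s) for s in train_sentences], train_labels)
    return model

# Usage (with hypothetical data):
# model = train_nb_baseline(train_sentences, train_labels)
# predictions = model.predict([strip_markup(s) for s in test_sentences])
```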
Table 2: Participants of SemEval-2010 Task 8. Res: resources used (WN: WordNet data; WP: Wikipedia data; S: syntax; LC: Levin classes; G: Google n-grams; RT: Roget's Thesaurus; PB/NB: PropBank/NomBank). Class: classification style (ME: Maximum Entropy; BN: Bayes Net; DR: Decision Rules/Trees; CRF: Conditional Random Fields; 2S: two-step classification).

Table 3: F1-Score of all submitted systems on the test dataset as a function of training data: TD1=1000, TD2=2000, TD3=4000, TD4=8000 training examples. Official results are calculated on TD4. The results marked with ∗ were submitted after the deadline. The best-performing run for each participant is italicized.
As for the amount of training data, we see a substantial improvement for all systems between TD1 and TD4, with diminishing returns for the transition between TD3 and TD4 for many, but not all, systems. Overall, the differences between systems are smaller for TD4 than they are for TD1. The spread between the top three systems is around 10% at TD1, but below 5% at TD4. Still, there are clear differences in the influence of training data size even among systems with the same overall architecture. Notably, ECNU-SR-4 is the second-best system at TD1 (67.95%), but gains only 7% from the eightfold increase of the size of the training data. At the same time, ECNU-SR-3 improves from less than 40% to almost 69%. The difference between the systems is that ECNU-SR-4 uses a multi-way classifier including the class OTHER, while ECNU-SR-3 uses binary classifiers and assigns OTHER if no other relation was assigned with p>0.5. It appears that these probability estimates for classes are only reliable enough for TD3 and TD4.

The Influence of System Architecture. Almost all systems used either MaxEnt or SVM classifiers, with no clear advantage for either. Similarly, two systems, UTD and ISTI (rank 1 and 6), split the task into two classification steps (relation and direction), but the 2nd- and 3rd-ranked systems do not. The use of a sequence model such as a CRF did not show a clear advantage either.
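The two-step decomposition mentioned above can be pictured as follows: one classifier chooses the relation type, a second chooses the argument order, and the two decisions are concatenated into a directed label. The sketch leaves the classifiers abstract; the interfaces, feature dictionary, and label strings are illustrative assumptions, not a description of the UTD or ISTI systems.

```python
from typing import Callable

def two_step_predict(
    features: dict,
    predict_relation: Callable[[dict], str],   # e.g. "Cause-Effect" or "Other"
    predict_direction: Callable[[dict], str],  # "(e1,e2)" or "(e2,e1)"
) -> str:
    """Combine a relation classifier and a direction classifier."""
    relation = predict_relation(features)
    if relation == "Other":
        return relation          # OTHER carries no direction
    return relation + predict_direction(features)

# Example with trivial stand-in classifiers:
print(two_step_predict(
    {"head_e1": "cancers", "head_e2": "exposures"},
    predict_relation=lambda f: "Cause-Effect",
    predict_direction=lambda f: "(e2,e1)",
))  # -> Cause-Effect(e2,e1)
```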
The systems use a variety of resources. Generally, richer feature sets lead to better performance (although the differences are often small; compare the different FBK IRST systems). This improvement can be explained by the need for semantic generalization from training to test data. This need can be addressed using WordNet (contrast ECNU-1 to -3 with ECNU-4 to -6), the Google n-gram collection (see ISI and UTD), or a "deep" semantic resource (FBK IRST uses Cyc). Yet, most of these resources are also included in the less successful systems, so beneficial integration of knowledge sources into semantic relation classification seems to be difficult.

The differences between the systems suggest that it might be possible to achieve improvements by building an ensemble system. When we combine the top three systems (UTD, FBK IRST-12VBCA, and ISI) by predicting their majority vote, or OTHER if there was none, we obtain a small improvement over the UTD system, with an F1-Score of 82.79%. A combination of the top five systems using the same method shows a worse performance, however (80.42%). This suggests that the best system outperforms the rest by a margin that cannot be compensated with system combination, at least not with a crude majority vote. We see a similar pattern among the ECNU systems, where the ECNU-SR-7 combination system is outperformed by ECNU-SR-5, presumably since it incorporates the inferior ECNU-SR-1 system.
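The combination rule used for the ensemble experiment, majority vote with a back-off to OTHER when no label wins a majority, is easy to state in code. The sketch below is one reading of that rule; it is not the exact script behind the numbers reported above.

```python
from collections import Counter

def majority_vote(predictions, other="Other"):
    """Combine one example's predictions from several systems by majority
    vote, backing off to OTHER when no label is predicted by more than half
    of the systems."""
    votes = Counter(predictions)
    label, count = votes.most_common(1)[0]
    return label if count > len(predictions) / 2 else other

# Toy example with three systems:
print(majority_vote(["Cause-Effect(e1,e2)", "Cause-Effect(e1,e2)", "Other"]))
# -> Cause-Effect(e1,e2)
print(majority_vote(["Cause-Effect(e1,e2)", "Message-Topic(e1,e2)", "Other"]))
# -> Other
```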
We also looked at performance on individual relations, especially the extremes. There are very stable patterns across all systems. The best relation (presumably the easiest to classify) is CE, far ahead of ED and MC. Notably, the performance for the best relation is 75% or above for almost all systems, with comparatively small differences between the systems. The hardest relation is generally IA, followed by PP.⁵ Here, the spread among the systems is much larger: the highest-ranking systems outperform the others on the difficult relations. Recall was the main problem for both IA and PP: many examples of these two relations are misclassified, most frequently as OTHER. Even at TD4, these datasets seem to be less homogeneous than the others.
Intriguingly, PP shows a very high inter-annotator agreement (Table 1). Its difficulty may therefore be due not to questionable annotation, but to genuine variability, or at least the selection of difficult patterns by the dataset creator. Conversely, MC, among the easiest relations to model, shows only a modest IAA.

⁵ The relation OTHER, which we ignore in the overall F1-score, does even worse, often below 40%. This is to be expected, since the OTHER examples in our datasets are near misses for other relations, thus making a very incoherent class.

We also examined the examples that are classified incorrectly by all systems. We analyzed them, looking for sources of errors. In addition to a handful of annotation errors and some borderline cases, they are made up of instances which illustrate the limits of current shallow modeling approaches in that they require more lexical knowledge and complex reasoning. A case in point: The bottle carrier converts your <e1>bottle</e1> into a <e2>canteen</e2>. Here, OTHER is misclassified either as CC (due to the nominals) or as ED (because of the preposition into). Another example: [...] <e1>Rudders</e1> are used by <e2>towboats</e2> and other vessels that require a high degree of manoeuvrability. This is an instance of CW misclassified as IA, probably on account of the verb use, which is a frequent indicator of IA.

There is little doubt that 19-way classification is a non-trivial challenge. It is even harder when the domain is lexical semantics, with its idiosyncrasies, and when the classes are not necessarily disjoint, despite our best intentions. It speaks to the success of the exercise that the participating systems' performance was generally high, well over an order of magnitude above random guessing. This may be due to the impressive array of tools and lexical-semantic resources deployed by the participants.

Section 4 suggests a few ways of interpreting and analyzing the results. Long-term lessons will undoubtedly emerge from the workshop discussion. One optimistic-pessimistic conclusion concerns the size of the training data. The notable gain TD3 → TD4 suggests that even more data would be helpful, but that is so much easier said than done: it took the organizers well in excess of 1000 person-hours to pin down the problem, hone the guidelines and relation definitions, construct sufficient amounts of trustworthy training data, and run the task.

X. Carreras and L. Màrquez. 2004. Introduction to the CoNLL-2004 shared task: Semantic role labeling. In Proc. CoNLL-2004.

I. Dagan, B. Dolan, B. Magnini, and D. Roth. 2009. Recognizing textual entailment: Rational, evaluation and approaches. Natural Language Engineering, 15(4).

R. Girju, P. Nakov, V. Nastase, S. Szpakowicz, P. Turney, and D. Yuret. 2009. Classification of semantic relations between nominals. Language Resources and Evaluation.

I. Hendrickx, S. Kim, Z. Kozareva, P. Nakov, D. Ó Séaghdha, S. Padó, M. Pennacchiotti, L. Romano, and S. Szpakowicz. 2009. SemEval-2010 Task 8: Multi-way classification of semantic relations between pairs of nominals. In Proc. NAACL Workshop on Semantic Evaluations, Boulder, CO.

J. Kim, T. Ohta, S. Pyysalo, Y. Kano, and J. Tsujii. 2009. Overview of BioNLP'09 shared task on event extraction. In Proc. BioNLP-09, Boulder, CO.