Semantic role clustering: an empirical assessment of semantic role types in non-default case assignment

REVISED AND ACCEPTED VERSION – January 31, 2014 (to appear in Studies in Language)

Balthasar Bickel*, Taras Zakharko*, Lennart Bierkandt+ & Alena Witzlack-Makarevich‡
* University of Zürich, + University of Jena, ‡ University of Kiel

* Versions of this paper were presented at the workshop on role complexes in Zurich, 4–5 April 2011, and the conference on non-canonically case-marked subjects in Reykjavík, 4–8 June 2012. We thank both audiences for helpful comments and questions. Author contributions: BB, AWM, and TZ conceived and designed the study. BB and TZ conducted the statistical analysis. LB and AWM did most of the data analysis and database coding. TZ did most of the data aggregation work. BB and AWM wrote the paper. We thank Kevin Bätscher for help in data collection, Lukas Wiget for help in preparing the graphs and Sergey Say for comments on an earlier draft. We also thank three anonymous reviewers and an anonymous editorial board member for helpful suggestions and requests for clarification.

This paper seeks to determine to what extent there is cross-linguistic evidence for postulating clusters of predicate-specific semantic roles such as experiencer, cognizer, possessor, etc. For this, we survey non-default case assignments in a sample of 141 languages and annotate the associated predicates for cross-linguistically recurrent semantic roles, such as ‘the one who feels cold’, ‘the one who eats sth.’, ‘the thing that is being eaten’. We then determine to what extent these roles are treated alike across languages, i.e. repeatedly grouped together under the same non-default case marker or under the same specific alternation with a non-default marker. Applying fuzzy cluster and NeighborNet algorithms to these data reveals cross-linguistic evidence for role clusters around experiencers, undergoers of body processes and cognizers/perceivers in one- and two-place predicates; and around sources and transmitted speech in three-place predicates. No support emerges from non-default case assignment for any other role clusters that are traditionally assumed (e.g. for any distinctions among objects of two-argument predicates, or for distinctions between themes and instruments).

1 Introduction

Apart from default or canonical case assignments, such as the assignment of accusative case to the most patient-like argument of transitives, many, perhaps most, languages show alternatives in the form of non-default or non-canonical assignments for specific sets of predicates, e.g. the accusative on arguments of experience-denoting intransitives (as in German mich friert ‘I am cold’). It is often assumed that non-default marking of this kind does not occur at random. Indeed, several hypotheses and theories have been put forward that seek to predict the way in which semantic types of predicates associate with non-default case assignments across languages (e.g. Tsunoda 1985, 2004, Onishi 2001, Haspelmath 2001, Malchukov 2005). However, all these hypotheses assume that predicates with non-default case assignments fall into natural semantic types such as experience, motion, uncontrolled event, etc. In other words, it is assumed that all lexical tokens of predicates, i.e. items like the German verb frieren or the English predicate expression be cold, can be successfully mapped onto more general and more abstract classes like experience. A prominent correlate of this assumption is that predicate-specific argument roles cluster into more general and more abstract argument types (“role complexes”, as the editors of this special issue call them) such as experiencer, theme, or instrument, and that these clusters are significantly similar across languages.

This assumption is controversial. It is usually debated in terms of a choice between theoretical frameworks, e.g.
by appeal to efficiency and elegance in describing general patterns of case assignment (e.g. capturing which intransitive verbs assign accusative rather than nominative case in German) or constructional constraints beyond case (e.g. in auxiliary choice or participle formation). Some theories assume role types (e.g. Lexical-Functional Grammar; Bresnan & Kanerva 1989, Butt 2008, Dalrymple & Nikolaeva 2011), others reject them (e.g. Role and Reference Grammar; Van Valin & Wilkins 1996, Van Valin 2005). In this paper we want to turn the debate into an empirical one. Based on a typological database, we assess the empirical evidence for role clusters, asking to what extent non-default case assignment suggests natural and cross-linguistically relevant clusters: is there cross-linguistic evidence that non-default case assignment indeed systematically carves out, say, experiencers and themes as general types among the sole argument of one-place predicates? Is there evidence for carving out, say, perceiver and agent types among the most agent-like argument of two-place predicates (such as see vs. hit, etc.)? Is there any evidence for such type distinctions as recipients vs. spatial goals among the non-moving argument of three-place predicates (e.g. give vs. put)?

We start by annotating case frames for predicate-specific roles (e.g. ‘the one who feels cold’, ‘the one who sees sth.’, ‘the one that gets hit by so.’, etc.) that recur across languages and that can be reasonably identified by translational approximation. We limit our attention to non-default case assignment (and non-default case alternations), assuming that defaults have no semantic specification of their own and cover everything that is not covered by non-default cases (or alternations).
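To make the intended comparison concrete, the following is a minimal sketch of how co-marking of predicate-specific roles could be turned into a pairwise dissimilarity. Everything in it is an assumption made for illustration: the language names, the role labels and marker assignments are invented toy data, and the measure (one minus the proportion of languages that put two roles under the same non-default marker, among the languages that give both roles non-default marking) is a simple stand-in, not the measure used in the study.

```python
from itertools import combinations

# Hypothetical toy annotations: language -> {role: non-default case marker}.
# Roles that only receive default marking in a language are simply absent.
toy_data = {
    "LangA": {"feel cold": "DAT", "be hot": "DAT", "cough": "ERG"},
    "LangB": {"feel cold": "ACC", "be hot": "ACC", "cough": "ACC"},
    "LangC": {"feel cold": "DAT", "cough": "GEN"},
}


def role_dissimilarity(data):
    """Return {(role_a, role_b): dissimilarity} for all role pairs.

    Assumed measure: 1 - (languages marking both roles identically) /
    (languages giving both roles any non-default marker).
    Pairs that never co-occur in a language get None.
    """
    roles = sorted({role for marking in data.values() for role in marking})
    result = {}
    for a, b in combinations(roles, 2):
        informative = [m for m in data.values() if a in m and b in m]
        if not informative:
            result[(a, b)] = None  # no language bears on this pair
            continue
        same = sum(1 for m in informative if m[a] == m[b])
        result[(a, b)] = 1 - same / len(informative)
    return result


dissimilarity = role_dissimilarity(toy_data)
for pair, value in dissimilarity.items():
    print(pair, value)
```

A matrix built this way could then be passed to any clustering routine that accepts precomputed dissimilarities. How to weight languages and how to treat pairs without evidence are design choices that this sketch leaves open.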
We then examine in a typological database to what extent predicate-specific semantic roles are grouped together by the same non-default cases (or alternations) in each language and derive from this a measure of dissimilarity of the roles across languages. The resulting dissimilarity matrix is then mined for statistical clusters, applying algorithms for fuzzy cluster (Kaufman & Rousseeuw 1990) and NeighborNet (Bryant & Moulton 2004, Huson & Bryant 2006) analysis. Any resulting cluster of predicate-specific roles is potentially indicative of cross-linguistically relevant role complexes.

In the following, we first explain our notions of non-default case assignment and generalized argument classes (Section 2). Section 3 explains our database and the way we developed the cross-linguistic annotations of predicate-specific semantic roles. Section 4 describes the data-mining algorithms we used. Results of these are then presented in Section 5 and discussed in Section 6 in the light of expectations from the literature. The final section summarizes our findings.

2 Non-default case assignment and generalized argument classes

Many languages exhibit diverse possibilities of case assignment.¹ This can be illustrated with the following examples from Chechen (ISO 639-3: che; Nakh-Daghestanian; Zarina Molochieva, p.c.). The clauses in (1) show that the sole argument of a one-place predicate can be in the absolutive, the dative, the ergative or the allative case. (2) shows a selection of possibilities available for the arguments of two-argument predicates:²

(1) a. so      ohw-v-uuzhu-u.
       1sABS   down-V-fall-PRS
       ‘I fall down.’
    b. suuna   j-ouxa   j-u.
       1sDAT   J-hot    J-be.PRS
       ‘I am hot.’
    c. as      jouxarsh   tyyxi-ra.
       1sERG   cough      hit-WITNESSED.PST
       ‘I was coughing.’
    d. soega   nir        qiett-a.
       1sALL   diarrhea   strike-PRF
       ‘I’ve got diarrhea.’

(2) a. as      wazh           b-u’-u.
       1sERG   apple(B).ABS   B-eat-PRS
       ‘I eat apples.’
    b. so      hwo-x    taxana   qiet-a.
       1sABS   2s-LAT   today    meet-PRS
       ‘I meet you today.’
    c. suuna   Zaara         j-iez-a.
       1sDAT   Zara(J).ABS   J-love-PRS
       ‘I love Zara.’

¹ We use the term ‘case assignment’ in the broad sense of a paradigmatic contrast in the shape of noun phrases that distinguishes their roles as arguments of a predicate, including affixes, tone oppositions, adpositions, particles, morphological zeros in opposition to overt devices, etc.
² Glossing follows the Leipzig Glossing Rules; ‘V’, ‘J’, and ‘B’ denote genders.

Obviously, case assignment is sometimes not an isolated phenomenon but part of a larger constructional choice. In (1d), for example, the allative is conditioned by the fact that the predicate is not expressed by a simple stem but instead by a complex lexicalized expression that involves the allative-assigning verb stem qiett- ‘strike’. Strictly speaking, then, the lexical entry ‘have diarrhea’ associates with the entire complex construction ‘allative + qiett-’ and not just with the allative declension form. In this paper, we gloss over this complication for the following reason: our interest is in whether the roles that are licensed by various cross-linguistically identifiable lexical meanings (such as the role of the single argument of ‘have diarrhea’) are treated alike in a language, and whether there are systematic patterns behind this treatment across languages. For this question, it does not matter if a specific predicate meaning associates with a simple case choice or with a complex constructional choice of case and complex predicate structure at the same time. This difference is as irrelevant to our question as the difference between a case choice that affects only a simple suffix and one that involves some complex expression consisting of, say, a declension form and an adposition.
³ As we will explain further below, we base our analysis of roles exclusively on the semantics of lexical entries (where Chechen nir qiett- licenses a single argument S just like English ‘have diarrhea’) and not on the formal shape of these entries (where one can argue about the transitivity of the expressions).

It is commonly assumed that some types of case assignment represent the basic or canonical choice and others a non-basic, non-canonical choice. In Chechen, for example, one would consider the absolutive in (1a) and the ergative-absolutive frame in (2a) to be the basic choices. The range of individual predicates in basic case frames is typically open-ended, with no specified semantic limits. Open-ended classes of this kind are difficult to survey across languages because sufficiently rich dictionaries are scarce.

One way out of this problem is to proceed with an a priori list of universal predicate meanings (like ‘eat’, ‘have diarrhea’, etc.) whose case assignments can then be catalogued for every language.⁴ Like all onomasiological (denotation-based, stimuli-based) approaches, this procedure allows easy comparison, but the pre-selection of predicate meanings brings with it the risk that the results are in part pre-determined. For example, it makes a difference for role clusters among intransitives how many different verbs of body functions (e.g. ‘belch’) there are, how many experience-related verbs (e.g. ‘be cold’) there are, etc.: if there are more experience-related meanings than body-function meanings in a list, evidence from case assignment patterns related to experiences weighs more than evidence related to body functions in cluster analyses, and this artificially favors the detection of experiencer clusters over role clusters related to body functions.

³ In addition, we note that it can be very difficult to decide whether a specific case assignment is motivated by some sub-structure of the lexical predicate. The answer will often depend on the precise etymology of the expression and on the question to what extent speakers still have access to this sub-structure.
⁴ This is the approach taken by the Leipzig Valency Class Project (Comrie & Malchukov in press) and the Valency Project at the Russian Academy of Sciences, St. Petersburg (Say 2011).

While these problems can be kept at bay to some extent by enlarging lists of surveyed meanings and by trying to avoid Eurocentrism when compiling them, we explore here an alternative approach. We concentrate exclusively on non-basic case assignment. The predicates associated with non-basic case assignment have the advantage that they are positively characterized by lists of verbs (which are also typically retrievable in grammars because one needs to say when the relevant non-basic cases show up). Basic case assignment patterns, by contrast, can be expected to apply to open-ended lists of verbs, with an equally open diversity of meanings. Lists of verbs assigning non-basic cases can be readily surveyed and compared without any a priori assumptions about what to expect. But this approach is not without problems either. The most pressing one is how one can in fact distinguish basic from non-basic case frames.

There are basically two classes of approaches to this, each of them replacing the intuitive notion of a basic choice in case assignment by more concrete concepts that can be better operationalized. In one approach, the notion of a basic choice is replaced by that of canonical arguments, or, more precisely, notions of canonically intransitive, canonically transitive and canonically ditransitive argument frames.
Canonicity is in turn established on the basis of a range of morphosyntactic, or semantically-grounded morphosyntactic, criteria so that, for instance, only the frames with accusatively-marked objects, or only with accusatively-marked and affected patient objects, or only with objects which can be promoted to subjects through passivization and which denote affected patients, are considered canonically transitive (cf. e.g. Onishi 2001, relying on Dixon 1994, but also more generally any research relying on notions like “quirky subjects”, “oblique objects”, etc.).

In another approach, the notion of basic choice in case assignment patterns is grounded in prototypical predicate meanings. The choice is then based on pre-established notions of what would be prototypical representatives of one-, two-, and three-argument predicates, e.g. one would define predicates meaning something like ‘kill’ or ‘break’ as the prototypical representatives of two-place predicates (in the spirit of Comrie 1981) and take the case frame of these predicates to be basic.

The first approach has been criticized for mixing semantic and syntactic criteria that are not strictly comparable across languages (Haspelmath 2011), and we do not adopt this approach here for this reason. The second approach does not suit the purpose of our investigation because it builds into a theoretical assumption what we want to explore empirically: the approach assumes a priori that across languages, predicates fall into at least two basic semantic types or clusters, a prototypical one including (in the case of two-argument predicates) meanings like ‘kill’ or ‘break’, and a non-prototypical one including meanings like ‘love’ or ‘see’.
This may well be the case, but if so, we expect it to emerge empirically from a cluster analysis.

Therefore, we need an alternative approach: we approach the notion of basic case frames through the idea of default case frames, i.e. case frames that are assigned when there is no other case specification in the lexical entry of a predicate. Default case frames in this sense are expected to be licensed by predicates that form an open class, and thereby by whatever class has the largest number of members in the lexicon and that is most
