Documentation for SweFN



1. Background and motivation

Access to multi-layered lexical, grammatical and semantic information representing text content is a prerequisite for lexicological and linguistic research, as well as for many language technology (LT) applications. Information about the types of lexical frames of the words of a language, and about the frame elements of each frame type, described in terms of their semantic roles (semantic valence) and their syntactic manifestations (syntactic valence), is arguably a necessary component of a full-fledged modern computational lexical resource. The earliest and best-known such resource is the Berkeley FrameNet (Ruppenhofer et al., 2006; Fillmore, 2008). Dictionary compilation, text understanding and natural language generation are some of the applications that can benefit from the information provided by a framenet.

Currently, FrameNet-like resources exist for a few languages, including some domain-specific and multilingual initiatives (Boas, 2009; Uematsu et al., 2009; Venturi et al., 2009), but they are still unavailable for most languages. This includes Swedish, apart from some pilot studies exploring the semi-automatic acquisition of Swedish frames (Johansson & Nugues, 2006; Borin et al., 2007). At the University of Gothenburg, we are now embarking on a project to build a Swedish FrameNet-like resource. It is intended to be a free, full-scale, multi-functional resource covering the morphological, syntactic and semantic description of 50,000 lexical units, with information accessible to both human users and LT systems. To make the work cost- and time-effective, we intend to reuse freely available digital resources and software. A novel feature of this project is that the Swedish FrameNet will be an integral part of a larger, many-faceted lexical resource, hence the name Swedish FrameNet++ (SweFN++). Besides information on modern Swedish, this larger resource will encompass lexical data on 19th-century Swedish, and eventually on Old Swedish. For information on the ongoing work on the Old Swedish strand, see Borin & Forsberg (2009a).

The theoretical approach underlying SweFN is based on frame semantics, the brainchild of Charles J. Fillmore. The English FrameNet (henceforth FN), developed by the Berkeley research group and documented in FrameNet II: Extended Theory and Practice (Ruppenhofer et al., 2010; available at http://framenet.icsi.berkeley.edu), provides valuable guidelines for the construction of SweFN: it contains more than 10,000 lexical units and nearly 800 hierarchically related frames, exemplified in more than 135,000 sentences. According to FN, a lexical unit (LU) is a pairing of a simplex word or multiword expression with its meaning. Each sense of a polysemous word or multiword expression evokes a different semantic frame, a script-like conceptual structure which describes a particular type of situation, object or event along with its typical participants and props. The participants of a frame are described in terms of semantic roles called frame elements (FEs).

In our work on frames, we follow the Berkeley frame specifications concerning (i) the name of the frame, (ii) its definition, which points out the semantic relations between the core elements, and (iii) the specification of frame elements, including their definitions. We also take advantage of the meta-information provided on the types of semantic relations between frames. In the initial phase of the SweFN project, we focus on the semantic description of frames for verbs, nouns and adjectives and on populating these frames with lexical units. A syntactic description will be added at a later stage of the project.

As Swedish is a compounding language, special attention is paid to semantic relations within one-word compounds populating a frame. The implicit semantic relation between the components of a compound is made explicit by indicating the type of semantic relation that holds between the head of the compound, which evokes the frame, and its co-component, which represents one of the frame elements. This information is of particular relevance for understanding the meaning of a compound and for capturing the range of semantic and syntactic alternations that arise in texts through compounding.

The work on SweFN is in progress, which means that the information may be subject to further refinement and modification. The lexical data is available for inspection at http://spraakbanken.gu.se/swe/forskning/swefn/utvecklingsversion. The information provided there is updated four times a day. It is released under an open content licence.

2. About SweFN on the SweFN++ website

In the menu part on the website for SweFN++, there are several items that provide access to the information on frames in SweFN.

  • Search SweFN++ gives an overview of information on a searched word in all the linked lexical resources in which the word is present, including search in SweFN.
  • Documentation links to this document.
  • Development version displays a current web version of the SweFN frames which are elaborated in an editing tool. (See below for examples concerning encoding conventions used in the analysis and the editing process.)
  • Statistics presents data about the number of (i) frames in SweFN, (ii) unique core elements, (iii) unique peripheral elements, (iv) lexical units populating the elaborated frames, and (v) lexical units evoking the respective frames, but lacking description in the SALDO resource.
  • List of elaborated frames for Swedish with links to each frame.
  • History is a register of changes.
  • Error report provides the results of control tests performed on the SweFN data. These tests check some aspects of consistency, for example the unique assignment of lexical units to frames, or failures to follow the technical encoding conventions.

3. Overview of the content provided in the development version of SweFN

3.1 Content template

As already hinted at, the very first phase of the project is focused on elaborating the semantic description of frames and populating the frames with Swedish lexical units. The approach used so far has been that of extension, as most of the meta-data on the English frames have been re-used in the process of creating Swedish frames. To make this meta-connection clear, we use the English names for frames and frame elements in SweFN. This makes linking to the corresponding English frames more straightforward and provides direct access to the original definitions of frames and frame elements in FN.

The Swedish frames are presented on the website in tables with the following content fields:

  • Name of a frame, in most cases identical to the corresponding one in FN. See section 3.2 for a list of new or modified frames.
  • Domain. Including domain information makes it possible to create sub-framenets for special vocabularies, e.g. art and medicine, in contrast to the general language domain.
  • Semantic type, referring to ontological classification taken from the SIMPLE ontology.
  • List of core frame elements, whose names are in most cases identical to the corresponding ones in the English FN. The name of a FE is matched with a colour visualising its type.
  • List of peripheral frame elements, whose names are in most cases identical to the corresponding ones in the English FN. The name of a FE is matched with a colour visualising its type.
  • Examples, a set of semantically annotated examples taken from corpus texts. FEs are in colours matching the corresponding FE name and the LU is always put in red.
  • List of instantiated compound patterns, defined by the type of frame element preceding the head of a compound. The heads of compounds are lexical units which can evoke the frame.
  • Compound patterns with frame relevant examples.
  • List of Swedish lexical units populating a frame with equivalent senses in SALDO. SALDO equivalents add information on semantic associative relations and morphology.
  • List of Swedish lexical units populating a frame but without equivalents in SALDO.
  • Note field is reserved for comments. New or modified frames are usually provided with explanations.

3.2 Frames in SweFN which deviate from the English ones

Since we intend to create a resource which suits Swedish LT applications, we reserve the right to modify the repository of English frames and to add new frames to it. Frames are modified or added in order to: (i) make the original frames more semantically homogeneous, for example by splitting Medical_conditions into Health_status and Diseases, and further splitting Cure into Cure_mod and Medical_treatment; (ii) make them more specific as to their content, e.g. Change_position_on_scale with the subvariants ...Increase…/Decrease…/Fluctuation; (iii) make them less specific as to their content, e.g. changing Jury_deliberation to Deliberation (the frames Dimension and Position_on_scale are being reconsidered); (iv) cover the 50,000 units from SALDO, for which new frames such as Social_care_scenario are needed.

The list of new or modified frames in SweFN:
  • Active_substance_medical
  • Active_substance_mod
  • Animals
  • Cause_change_position_on_a_scale_decrease
  • Cause_change_position_on_a_scale_fluctuation
  • Cause_change_position_on_a_scale_increase
  • Cause_contraction
  • Cause_expansion_mod
  • Change_position_on_a_scale_decrease
  • Change_position_on_a_scale_fluctuation
  • Change_position_on_a_scale_increase
  • Contraction
  • Cure_mod
  • Deliberation
  • Entity_specific_modes_of_being
  • Expansion_mod
  • Expertise_negative
  • Expertise_positive
  • Falling_ill
  • Furniture
  • Health_status
  • Inner_parts_of_body
  • Languages
  • Medical_disorders
  • Medical_interaction_scenario
  • Medical_treatment
  • Musical_instruments
  • Observable_bodyparts
  • People_by_disease
  • People_by_morality_negative
  • People_by_morality_positive
  • Plants
  • Social_care_scenario
  • Sound_makers
  • Stimulus_focus_negative
  • Stimulus_focus_positive

4. Encoding conventions for annotation purposes

The coding conventions described below are meant to provide technical guidance in the annotation process, which is performed with the help of an editing tool. They may also be useful for interpreting the colour-based annotations in the web version of SweFN.

4.1 Encoding of frame elements and examples in the editing tool

Lists of elements. In the lists of core and peripheral frame elements, each frame element must be followed by a letter code in brackets that is unique within the frame. For example, for the Communication frame the following frame elements are listed and letter-coded:

(i) Core FEs: Communicator (C), Medium (M), Message (ME), Topic (T); (ii) Non-core FEs: Addressee (A), Amount_of_information (AI), Depictive (D), Duration (DU), Frequency (F), Manner (MA), Means (MS), Place (P), Purpose (PU), Time (TI).
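
The uniqueness requirement on letter codes lends itself to an automatic check. The following Python sketch is only an illustration and is not part of the SweFN editing tool: the function names parse_fe_list and duplicate_codes are hypothetical, and the input is assumed to follow the "Name (CODE)" layout shown above.

```python
import re

def parse_fe_list(text):
    """Parse 'Name (CODE), Name (CODE), ...' into a list of (name, code) pairs."""
    return re.findall(r"(\w+)\s*\((\w+)\)", text)

def duplicate_codes(*fe_lists):
    """Return the letter codes that occur more than once across the given lists."""
    seen, dupes = set(), set()
    for fe_list in fe_lists:
        for _name, code in fe_list:
            if code in seen:
                dupes.add(code)
            seen.add(code)
    return sorted(dupes)

# The Communication frame lists quoted in the text above.
core = parse_fe_list("Communicator (C), Medium (M), Message (ME), Topic (T)")
non_core = parse_fe_list(
    "Addressee (A), Amount_of_information (AI), Depictive (D), Duration (DU), "
    "Frequency (F), Manner (MA), Means (MS), Place (P), Purpose (PU), Time (TI)"
)

# An empty result means every letter code is unique within the frame.
clashes = duplicate_codes(core, non_core)
```

For the Communication frame the check finds no clashes, since two-letter codes such as ME, MA and MS keep Message, Manner and Means apart from Medium (M).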

Examples. In the example field, the parts of an example sentence, clause or phrase which match semantic roles are marked with the respective letter codes. The letter code precedes the word or expression, and the extent of the expression is indicated by square brackets.

[C EG-domstolen] [LU meddelade] [ME sin dom i Laval-målet].

[MS Språket] [LU säger] [AI mycket] [T om människan]!

LU marks a target word evoking the frame.
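
As an illustration of how such annotated examples could be read by a program, the Python sketch below extracts the (letter code, text) pairs from an example and reports codes that are not declared for the frame. It is a minimal sketch under stated assumptions, not the tool used in the project: the names SPAN, annotated_spans and unknown_codes are hypothetical, and the regular expression only handles brackets that contain no further brackets.

```python
import re

# One innermost bracket: '[', a letter code, a space, then text without brackets.
SPAN = re.compile(r"\[(\S+) ([^\[\]]*)\]")

def annotated_spans(example):
    """Return the (code, text) pairs of all innermost bracketed spans."""
    return SPAN.findall(example)

def unknown_codes(example, frame_codes):
    """Return the codes used in the example that the frame does not declare.

    LU is always allowed, since it marks the target word evoking the frame.
    """
    allowed = set(frame_codes) | {"LU"}
    used = {code for code, _text in annotated_spans(example)}
    return sorted(used - allowed)

# Letter codes of the Communication frame, as listed in the text above.
COMMUNICATION_CODES = {"C", "M", "ME", "T", "A", "AI", "D", "DU",
                       "F", "MA", "MS", "P", "PU", "TI"}

spans = annotated_spans("[C EG-domstolen] [LU meddelade] [ME sin dom i Laval-målet].")
# spans == [("C", "EG-domstolen"), ("LU", "meddelade"), ("ME", "sin dom i Laval-målet")]
```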

4.2 Encoding of verbs

COP is used for annotating copula verbs such as 'vara' or 'bli'. SUPP is used for support verbs in collocational expressions where the noun is the semantically dominant element, e.g. 'ta' in the collocation 'ta beslut'. Auxiliary verb forms like 'ha', 'skola', 'komma' and the infinitive marker 'att' are not marked as part of the LU:

Han har [LU återhämtat sig] efter en depression.

4.3 Encoding of nouns and prepositional phrases

When encoding nouns, the indefinite article is left outside the brackets of the LU.

For the encoding of prepositional phrases in coordinated conjunctive expressions, the range of a coordinated construction is marked by additional square brackets and the FE code. It is assumed that the preposition used in the first expression applies also to the other expressions.

[W Han] [COP var] [LU klädd] [C [C i jeans], [C jacka] och [C mössa]].

Där stod [W mannen] [C [C i skinnjacka], [C jeans], [C cowboyhatt] och [C cowboystövlar]] invid en knallröd Porsche.
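
Because coordinated constructions nest brackets inside brackets, reading them requires keeping track of which brackets are currently open. The Python sketch below shows one possible way of doing this with a stack. The function name parse_nested is hypothetical, and the sketch assumes that every opening bracket is immediately followed by a letter code and a space, as in the examples above.

```python
def parse_nested(example):
    """Return a (code, text, depth) tuple for every bracket, in opening order.

    The text of an outer bracket includes the text of the brackets inside it,
    with all annotation markup removed. Depth 0 is the outermost level.
    """
    results = []   # one [code, text, depth] entry per bracket
    stack = []     # indices into results for the brackets still open
    i = 0
    while i < len(example):
        ch = example[i]
        if ch == "[":
            end = example.index(" ", i)          # the code runs up to the first space
            results.append([example[i + 1:end], "", len(stack)])
            stack.append(len(results) - 1)
            i = end + 1
            continue
        if ch == "]":
            if not stack:
                raise ValueError("unbalanced ']' at position %d" % i)
            stack.pop()
        else:
            for idx in stack:                    # plain text belongs to every open bracket
                results[idx][1] += ch
        i += 1
    if stack:
        raise ValueError("unclosed '['")
    return [tuple(entry) for entry in results]

spans = parse_nested("[W Han] [COP var] [LU klädd] [C [C i jeans], [C jacka] och [C mössa]].")
# spans[3] is the outer bracket: ("C", "i jeans, jacka och mössa", 0)
# spans[4:] are the three coordinated phrases, each at depth 1
```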

4.4 Encoding of syntactically complex frame elements

Frame elements whose syntactic manifestations are discontinuous, elliptic or coordinated require more intricate encoding. Here are some conventions used for annotation of such constructions.

Discontinuity. Cases of discontinuity within an LU are coded by using internal brackets, with a frame element code for the interposed expression. In cases where the interposed expression does not match any frame element, X is used to mark it.

[LU gick [TI på morgonen] ut].

[LU steg [X varken] in] eller [LU {steg} ut].
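
To recover the words of a discontinuous LU, the interposed expression has to be lifted out of the surrounding bracket. The Python sketch below illustrates this for a single bracketed span. The names INNER and split_discontinuous are hypothetical, and the sketch assumes one level of interposed brackets, as in the two examples above.

```python
import re

# An interposed expression: a bracket that contains no further brackets.
INNER = re.compile(r"\[(\S+) ([^\[\]]*)\]")

def split_discontinuous(span):
    """Split one bracketed span into its code, its own words and interposed spans.

    `span` must start with '[' and end with the matching ']'.
    """
    code, body = span[1:-1].split(" ", 1)
    interposed = INNER.findall(body)
    # Remove the interposed brackets and normalise the remaining whitespace.
    own_words = " ".join(INNER.sub(" ", body).split())
    return code, own_words, interposed

result = split_discontinuous("[LU gick [TI på morgonen] ut]")
# result == ("LU", "gick ut", [("TI", "på morgonen")])
```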

Coordinated FEs. Double encoding is used: an outer pair of square brackets carrying the FE code marks the extent of the whole coordinated construction, which consists of a number of coordinated phrases representing the same frame element. Each of the coordinated phrases is also annotated separately.

[C Hanna] [LU kom till världen] [D [D med ett allvarligt hjärtfel], [D avbruten aortabåge] och [D med hål i kammarskiljeväggen]]. (The example is from the Birth frame; D stands for a peripheral frame element called Depictive.)

Elliptical constructions. In sentences with elliptical constructions, the missing element is added in braces { }.

[R Vägen] [LU slingrar sig upp] och [LU {slingrar sig} ned] [P längs sjöar och åkrar].
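
Since the material in braces is added by the annotator, an annotated example corresponds to two plain sentences: the sentence as it occurs in the corpus, and the sentence with the elided words restored. The Python sketch below derives both. The function names strip_markup, expanded and surface are hypothetical, and the sketch assumes that braces are used only for restored material.

```python
import re

def strip_markup(text):
    """Remove square brackets and the letter codes that follow opening brackets."""
    text = re.sub(r"\[\S+ ", "", text)   # opening bracket plus letter code
    return text.replace("]", "")

def expanded(annotated):
    """Return the sentence with the elided material restored."""
    return strip_markup(annotated).replace("{", "").replace("}", "")

def surface(annotated):
    """Return the sentence as written, without the material added in braces."""
    without_added = re.sub(r"\{[^{}]*\}\s*", "", annotated)
    return strip_markup(without_added)

example = "[R Vägen] [LU slingrar sig upp] och [LU {slingrar sig} ned] [P längs sjöar och åkrar]."
# surface(example)  == "Vägen slingrar sig upp och ned längs sjöar och åkrar."
# expanded(example) == "Vägen slingrar sig upp och slingrar sig ned längs sjöar och åkrar."
```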

4.5 Encoding of semantically complex frame elements

Conflated frame elements. Conflated FEs are marked with double letter codes separated by a slash. In the sentence below, taken from the Cure frame, Patient and Affliction are conflated frame elements.

[T Transplantation] kan [LU ha botat] [P/A hiv-smittad].
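
A program reading the annotations can resolve a conflated code by splitting it at the slash. The Python sketch below is a minimal illustration: the function name conflated_roles and the table CURE_CODES are hypothetical, and the table contains only the two codes discussed above, on the assumption that P stands for Patient and A for Affliction.

```python
# Hypothetical excerpt of a code table for the Cure frame. It assumes that
# P stands for Patient and A for Affliction, following the example above.
CURE_CODES = {"P": "Patient", "A": "Affliction"}

def conflated_roles(code, code_table):
    """Return the frame elements named by a plain or conflated letter code."""
    return [code_table[part] for part in code.split("/")]

roles = conflated_roles("P/A", CURE_CODES)
# roles == ["Patient", "Affliction"]
```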

4.6 Other technical encoding

End of line. Double semicolons mark the end of a line in all cases except the last line in a field.
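
The end-of-line convention can be illustrated with a small Python sketch that splits the raw content of a field into lines. The function name split_field is hypothetical, and the exact storage format of the editing tool is not documented here, so the sketch simply assumes that a field is available as one string.

```python
def split_field(raw):
    """Split the raw content of a field into lines at double semicolons."""
    if raw.strip() == "":
        return []
    lines = [line.strip() for line in raw.split(";;")]
    # By convention the last line is not followed by ';;'.
    if lines[-1] == "":
        raise ValueError("the last line of a field must not end with ';;'")
    return lines

field = ("[C EG-domstolen] [LU meddelade] [ME sin dom i Laval-målet].;;\n"
         "[MS Språket] [LU säger] [AI mycket] [T om människan]!")
lines = split_field(field)   # two example sentences
```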

Domain coding. MED medicine, ART art, GEN general. In the case of mixed LU lists, the dominant domain is given first. Thus ART/GEN implies that most of the LUs belong to the art domain, but that general words also occur there.
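
A domain field can be interpreted by splitting it at the slash and treating the first code as the dominant domain. The Python sketch below illustrates this with the three codes listed above. The names DOMAINS and parse_domain are hypothetical.

```python
# The three domain codes listed in the text above.
DOMAINS = {"MED": "medicine", "ART": "art", "GEN": "general"}

def parse_domain(field):
    """Return (dominant domain, other domains) for a field such as 'ART/GEN'."""
    codes = field.split("/")
    unknown = [code for code in codes if code not in DOMAINS]
    if unknown:
        raise ValueError("unknown domain code(s): " + ", ".join(unknown))
    names = [DOMAINS[code] for code in codes]
    return names[0], names[1:]

dominant, others = parse_domain("ART/GEN")
# dominant == "art", others == ["general"]
```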

© University of Gothenburg 2009, Box 100, 405 30 Gothenburg, Sweden
Tel +46 31 786 0000, Contact
