Rosoka Toolkit 4.1

Welcome to Toolkit

Rosoka Toolkit 4.1 is a development tool for the data scientist.  A Data Scientist can modify and create entity types, relationship definitions, lexicons, character-based regex rules, semantic vector regex rules, and maintain quality control with regression testing.  Rosoka Toolkit provides the data scientist with an ability to create domain-specific document sets (corpora) and corpora baselines through an integrated results store.  The output of Rosoka Toolkit is a Rosoka LxBase.

This tutorial expects that you have already read the Rosoka Toolkit 4.1 User Manual, and have completed the following training modules:
 

  • Data Collection and Emacs
  • Scripting Morphology
  • News Sources for Docs

 

If you haven't already set a Favorite User Directory for the quick addition of corpora, please do so by referring to page 6 of the Rosoka Toolkit 4.1 User Manual.

Part 1: Corpus Overview

Adding a Corpus

Please add the SampleDocs corpus to Toolkit.

After clicking on the "View corpus list" icon from the left menu bar, the Create New Corpus page will appear.  Once you have uploaded a corpus, this icon will direct you to the Corpus Management page.

Once the Create a New Corpus page appears, you will be able to tell Toolkit which corpus to add.   I have saved a folder on my Desktop titled "ToolkitTestDocs," which is where I save all of my corpora.  I previously set that path as my favorite directory and it will automatically appear every time I go to this page.

Processing a Corpus

Once I click on "Submit," the SampleDocs corpus should appear as the entry.  Clicking on the "Process all documents" icon will process this corpus.

*Note:  Since the last release of Rosoka Toolkit, we have modified the entity GENERIC.  Your version will not reflect those changes until the next release.  For the purposes of this tutorial we are going to ignore it in its entirety.

The first thing you should notice is the list of documents on the left.  Clicking on a document will make it appear on the right.  The list of ENTITY types found in the document appears in the middle.

Anytime you would like to return to this active view of your corpus, clicking on the "View active corpus" icon from the left navigation bar will bring you back.

Entity Values

Each entity view can be expanded.  Likewise, each individual entity can be expanded further to provide the user with additional information.  Here, the entity breast cancer awareness month has been expanded.  Going down the list in order, we learn the following:

  • The entity type is an EVENT.
  • Its Salience is 25, which rates how important the entity is to the document on a scale from 0 to 100.
  • Its Polarity is 0, which rates how positive or negative the entity is on a scale from -3 to 3 (0 is neutral).
  • Its Aspect is 3, meaning that the author is persuading the reader.  Aspect is defined on a scale from -3 to 3, ranging from the author dominating the reader to the author persuading the reader.
  • Its Intensity is 1, meaning that on a scale of 0 to 3 the language is not causing a high level of activation.
  • Its Mood is 2, meaning that on a scale of -3 to 3 the reader should be moderately happy.

Finally, under the last instance of the entity breast cancer awareness month there are two additional pieces of information:

  • The Rule trace states whether the entity was found in the lexicon, or which rules fired to make the token an entity.
  • The Norm states whether the extracted form is the normalized form or a variant form that has a normalized value in the lexicon.

The rule trace contains the sequence of processing steps that fired against a particular output when a string or document was processed.  The following provides a summary of how to interpret the rule trace.

  • : = Delimiter between rule actions
  • i = Initial tokenization
  • LL = Lexical Lookup
  • C = The token was combined
  • [...] = The contents of the brackets after the C contain the rule trace of the combined tokens
  • MT = MultiToken rule
  • IT = IntraToken rule
  • GR = General rules
  • HC = Hard coded rule
  • BCE = Back Chaining Entity detection
  • att = Attribute portion of a rule assigning attribution to a token
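
For illustration only (the exact strings in your Toolkit may differ), a hypothetical trace of i:LL would mean the token was initially tokenized and then found by lexical lookup, while i:C[i:LL : i:LL]:GR would mean two looked-up tokens were combined into one and a general rule then fired on the result.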

Therefore, reading through the codes, you can see that after Toolkit did not find a match for breast cancer awareness month in the lexicon, the General Rule labeled event_nc-0001 fired and decided that this entry should be an EVENT entity.

This entity is the norm value, not a variant form.

You may have noticed that the colors for your entities are different than the ones in the example.  This is because they are customizable.  Simply click on the color wheel located next to the "Arrange" drop-down and select the entity and its custom color.

 

Rosoka Toolkit extracts quite a few different entities.  A list of entities and their corresponding definitions has been provided in Appendix A.  It is recommended that you familiarize yourself with it and use it for reference.

Aggregate Entities

Sometimes you will want to generate a list of all entities in a corpus.  This can be done by clicking on the "Aggregate view" icon located above the list of individual documents.

Each entity can be expanded, just as it can be in the text view to reflect its entity values.

Additionally, the number of instances of each entity found throughout the corpus appears beside it in parentheses; up to 300 instances are shown, selected by how salient they are.  If an entity lists (300+), the entity can be right-clicked, which brings up a pop-up menu with the option to "Collect all instances."

Interpreting the Corpus

In Toolkit, place your mouse over different words in the Text tab.  You will notice that some words are grouped together, while others are not.  This is how Toolkit is "tokenizing."  In linguistics, a token refers to an individual instance of something.  Therefore, in the document you will see individual instances of words, entities and phrases being tokenized.

Below, all instances of organizations are highlighted in dark blue, such as AFLAC, NewsUSA and the Association of Rehabilitation Nurses.  The event Breast Cancer Awareness Month appears in a rust color.  Meanwhile, different medical procedures such as rehabilitation, rehab and joint replacement appear in pink.  You will also notice many instances of health-related issues.  These qualify as DISEASE entities and appear in turquoise blue.  People's names and their pronouns appear in red, and URLs are in periwinkle.  Again, your entities may be set to appear in different colors; this is only meant to serve as a guide.

A list of tokens in a document can be generated from the View drop-down menu.

Each instance in this list of tokens is also expandable.  When expanded, you will see all of the same entity value information that you previously saw when you expanded the entity values for breast cancer awareness month.

Document View: Text Tab

From the Document view, an entity can be right-clicked, bringing up a menu of options.

From the right-click menu, you can search both Wikipedia and Google for the entity.  You can also click on the "Contexts" option, which opens a new page to display a list of documents where the entity is found in your corpus.  This page also provides a snapshot of the context in which the entity is found for each document.

Selecting the next option from the right-click menu, "Explore in documents," automatically populates a list of documents on the left that contain the entity.  After selecting a document, the entity view in the middle will automatically expand that particular entity to display its corresponding entity values.

The next option in the right-click menu, "Explore connections," automatically opens a Relationships page.  This page will show the connections between the entity you have selected and other entities.  You can also further constrain the connections by selecting from the check-boxes on the left.  For example, maybe you only want to see the connections between the entity rehab and PERSON entities.

The nodes representing the entities are also right-clickable, as shown in Figure 16.  The right-click menu that appears allows you to search both Wikipedia and Google for the entity, as well as "Explore in documents."  This instance of "Explore in documents" functions the same as previously discussed.

Right-clicking on an entity and selecting "Look up in lexicon" opens a pop-up window which will allow you to modify the lexical entry in the lexicon.

Clicking on the "View relational links" icon from the left navigation menu will also bring you to this page.  You can see the relationships between all entities, or just those of a particular entity.

From here, you can Update the entity (or lexical entry).  Below, medical has been searched and Toolkit automatically populated a list of semantic vectors that are related to medical.  If I felt that rehab were a disease, I could click on the semantic vector entity DISEASE and attach that value to the term.

Likewise, if I felt that the entry was not a noun, or qualified as the entity MEDICAL_PROCEDURE, I could click on the appropriate instance on the left and remove that semantic vector from the term.

Many times you will find a term in a document that is not appearing as an entity as it should be.  If it is something that will not be matched by the terms of a rule, then you may want to add it to the lexicon.  This is a quick way to do so.

The last option in the right-click menu is to "Show rule match detail."  This will pop-up a new window displaying each instance that a rule matched to the token.

Below, the entity Susan Wirt is shown.  This is a PERSON entity and Toolkit assigned it this entity type by matching the tokens Susan and Wirt to a rule that combines given_names and surnames.  You can scroll to the right within the pop-up window to see further information regarding the rule.

Each instance of a rule is expandable.  Expanding a rule will display the actual rule, which can be viewed by scrolling down within the window.

Document View: Entities Tab

Clicking on the Entities tab will display a new page with clusters of entities.  The size of each entity is directly correlated to the entity's salience.  Each entity is right-clickable and will display the same right-click menu as previously discussed, with the same functionality.

Aggregate View: Clusters Tab

If you are in the Aggregate View instead of the Document View, a Clusters tab is available.  This tab automatically opens a new page and displays a cluster of entities found in the entire corpus.  This page functions like the Entities tab, allowing the same right-click menu options.

Document View: Relationships Tab

From the Document View, the Relationships tab will open a new page displaying each entity in the document mapped to its related entities.  Hovering over an entity will display an information balloon, which includes the entity name and type, and prompts you to right-click the node if the entity can be found in Wikipedia.  Right-clicking on an entity node will also bring up the same right-click menu as previously discussed.

Aggregate View: Relationships Tab

From the Aggregate View, the Relationships tab will open a new page displaying each entity in the entire corpus mapped to its related entities.  Hovering over an entity will display an information balloon, which includes the entity name and type, and prompts you to right-click the node if the entity can be found in Wikipedia.  Right-clicking on an entity node will also bring up the same right-click menu as previously discussed.

Document View: Concepts Tab

From the Document view, clicking on the Concepts tab will display a visualization of all entities in the document, as well as all other lexical entries.  Hovering over a token will pop up an information balloon, which indicates whether the token is an entity or a word, shows the selection itself and its salience, and prompts you to right-click to search for the entry in Wikipedia.

Right-clicking on an entry will pop-up the same right-click menu as previously discussed.

Searching the Corpus

From the left navigation menu, the "Search" icon will open a Search page.  When searching, if an entity or word if found, each instance will appear in the search results along with corresponding context.

There is also a search bar in the upper right corner of the Document view page and Relationships page.

Corpus Assessment 1

Please read the following document and click the entities you feel Toolkit should extract.  Use the list of Entities found in Appendix A for assistance.  If you would like to see more in Toolkit first, feel free to create corpora on your own and add them to Toolkit to get a feel for the additional entities.   Please do not load these documents into Toolkit. 

Don't forget, we are ignoring the entity GENERIC.

Corpus Assessment 2

Please read the following document and click the entities you feel Toolkit should extract.  Use the list of Entities found in Appendix A for assistance.  If you would like to see more in Toolkit first, feel free to create corpora on your own and add them to Toolkit to get a feel for the additional entities.   Please do not load these documents into Toolkit. 

Don't forget, we are ignoring the entity GENERIC.

Part 2: Corpus for New Entity

Preparation

In order to create the entity SPORT_STAT, we need to have an idea of what types of terms are used in the discussion of sports statistics and a list of documents to test our entity against.  Creating a corpus of general sports news will help us get started.

Please add the corpus titled SportCorpus to Toolkit.

As you begin to read through the documents, you will need to start making a list of words that you feel are important to the entity SPORT_STAT.  Toolkit may already be extracting some of these terms, but not others.  There are some key things to keep in mind:
 

  • We are avoiding the use of the entity GENERIC - so let's ignore entries such as player, coach and team.
     
  • We don't want to add something to the lexicon that could potentially over-generalize on other types of documents - so let's not add terms like Team USA to the lexicon.
     
  • We also need to add variant forms.  For example, the Philadelphia 76ers are also known as the 6ers.  It is doubtful that we will find 6ers in lots of other document types, so we can add both of those to the lexicon if they aren't already there.  However, we need to pick one to be the norm.  Philadelphia 76ers should be the norm, so 6ers should be the variant form.
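
As a sketch only, a norm/variant pair of entries might look like the following.  The <norm> element shown here is hypothetical, so check your LxBase documentation for the exact markup Rosoka uses for normalized values:

<lex><word>philadelphia 76ers</word><sv><ORG/></sv></lex>
<lex><word>6ers</word><norm>philadelphia 76ers</norm><sv><ORG/></sv></lex>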

Part 3: LxBase Overview

Preparation

We need to make sure that your userEnglish Dictionary is blank before starting.

Clicking on the "Edit lexicon" icon from the left navigation menu will open the Dictionary Editor page. 

Each dictionary is expandable and comes with multiple lexicons.  Under dictionary, you will find an XML file titled userEnglish.xml.  Whenever you make changes to lexical entries from the Text tab in Toolkit, those changes are added to this file.  You will need to routinely move those entries to another, more permanent dictionary, as this file is frequently cleared automatically.

Use the "Plus sign" icon for the addition of a "new dictionary file."  Here you can create new dictionaries specific to your needs.

Clicking on userEnglish.xml opens the Editing page for this file.  You will notice that I only have one entry:

<lex><word>will gray</word><sv><PERSON/></sv></lex>

This is the first name listed in the document 2016 us open.txt in the SportCorpus.  Will Gray was not previously appearing as a PERSON entity.  This is most likely because the surname Gray was not matching a rule to combine with the given_name Will, since Gray has many semantic vectors associated with it.

Go ahead and look at that document in Toolkit and notice how it does not match as a PERSON entity.

Right-click on both Will and Gray and look at the lexical entries for each.  Take note that each has multiple semantic vectors associated with it.  There is most likely a rule that states that if an entry has a combination of certain semantic vectors, such as parts of speech (like noun), then it should not match as a PERSON.  If this is a really important person and you want them to be extracted, you can hardcode the given_name and surname together as PERSON, which is what I have done.  See if you can do this yourself.

One trick is to highlight the entire text with your mouse.  You will notice that even though Toolkit tokenized both words together, they are still treated as individual words when you right-click them unless you highlight your selection as shown below.

I selected "Look up in lexicon" from the right-click menu, which opened up a pop-up.  In the search bar I searched for the entity PERSON, selected it and at the bottom click the button to "Add SV."

Once the SV PERSON appears on the left, and I'm satisfied that I have added all necessary SVs, I can then click the "Add" button at the bottom.  This is what creates the entry Will Gray in the userEnglish dictionary previously discussed.

Navigate back to your userEnglish dictionary.  If you have any entries, go ahead and delete them so that your dictionary looks like the image below. 

Before leaving the page, click on the "Validate the current file" icon, and then save your changes by clicking on the "Validate and save the current file" icon.

To make sure any changes do not re-appear, you need to go back to the Corpus Management page and clear the processing results from your corpus.

 

Click on the "View corpus list" icon from the left navigation menu. 

Above your corpus entry click on the "Clear processing results" icon.  This will clear all results from your corpus. 

Then click on the "Process all documents" icon to re-process your corpus.

Entity Creation

The following few sections are going to take you through the creation process.  You will be creating rules, adding SVs and adding to the lexicon.  Before we begin, we need to go ahead and create our new entity SPORT_STAT.

From the left navigation menu, click on the "Edit token definitions" icon. 

This opens the Token Definition Editor page.  From here new entities and semantic vectors can be created.  Appendix B contains a list of all Part of Speech and Pragmatic semantic vectors with their corresponding definitions for reference.

As shown below, there are several categories to choose from:

  • Entities:  Used for the addition or removal of an entity.  Those with a lock icon cannot be edited.
  • No Output:  Used for the addition or removal of entities that do not appear in the output.
  • Part of Speech:  Used for the addition or removal of part-of-speech semantic vectors, such as verb or noun.
  • Pragmatics:  Used for the addition or removal of semantic vectors that represent pragmatic categories, such as given_name or surname.
  • Relational:  Used for the addition or removal of semantic vectors that represent Predicate, Subject, Object (PSO) relationships, such as traveling_to.

From the Token Definition Editor page, click on the "Add Definition" button and name the entity SPORT_STAT.  You can give it a brief description as well.  Then click on the "Apply" button and "Save" button.

You will now get a notice across the top of the screen informing you that Toolkit must be restarted for changes to take effect.  For future use, if you know that you need to add multiple semantic vectors/entities, you can keep adding them and click on the "Apply" button after each one.  Then you will only need to restart Toolkit once.  Go ahead and restart Toolkit.

You should now find your new entity SPORT_STAT in the list of entities.

Tuning Documents

One of the best ways to figure out what kinds of rules you need to write, and what semantic vectors you need to create, is to begin reading through the documents in the corpus.  While doing so, you should create a list of terms that you would like Toolkit to extract.  You can then start mapping out your rules.

Start with the first document in the list titled 2016 us open.txt and make a list of everything you think should be extracted.  Below, I have added my list with notes for each entry.

Add to Lexicon

 

Will Gray

As previously discussed, this name doesn't match to a rule.  Let's add the SV PERSON to the entire name.

Ernie Els

Using the same rationale, add SV PERSON to the entire name.

Golf Channel Fantasy Challenge

Golf Channel is already being matched as an ORG, because it is hardcoded in the lexicon as an ORG.  We can add the entire entry as an EVENT.
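
As a sketch, following the format of the will gray entry shown earlier (entries in userEnglish.xml appear lowercased), these hardcoded entries might look like:

<lex><word>ernie els</word><sv><PERSON/></sv></lex>
<lex><word>golf channel fantasy challenge</word><sv><EVENT/></sv></lex>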

Louis Oothuizen

Oothuizen is already being matched as a city name.  We want to keep that city entry intact, so we will hardcode the entire name as PERSON instead.

Justin Rose

In this case, Justin is matching as a city_name.  It is more likely that we want it always to match as a given_name, so let's remove the SV city_name and see what happens.

U.S. Open

In this case, U.S. is matching to a place.  We don't want to change that, so let's hardcode U.S. Open as an EVENT.

   

*Congressional

While in this case congressional is referring to an event, hardcoding it would greatly over-generate in other documents.  We do not want to hardcode it as an EVENT, so we will not do anything.

*Memorial

Using the same rationale, we will not do anything.

   
 

*After re-processing the document, Justin Rose still did not match to PERSON from a rule, so we need to hardcode that name as PERSON and re-process again.

   
   

Extract by Rule

 

third-place, fifth-place, runner-up, etc.

We can associate a semantic vector with these terms and write a rule that makes that semantic vector match our entity.

second, fifth, ninth, 12th, etc.

We can associate these terms with a semantic vector that gets matched by a rule.

156 players

We can write a rule that matches a number + a semantic vector.  So we will create a semantic vector and associate terms like player, coach and team with it.

T-4, T-12, T-9, etc.

We can write a regular expression to match all instances of this because it is a nice pattern.

No. 1

This is also a predictable pattern so we can write a regular expression for it as well.

top-10, won, win

I feel that these have a high likelihood of over-generating and will be difficult to constrain by writing a rule.

At this point, any lexical changes you have made will be extracted after the document is re-processed.  So now it is time to start writing rules.

Rule: stat_cf-0001

This rule is going to extract lexical entries such as third-place, fifth-place, runner-up, etc.

Since we know we want to write a rule that matches a semantic vector that is associated with these terms, we first need to create the semantic vector.

Go back to the Token Definition Editor page and click on the "Pragmatics" button.  We will create a new semantic vector titled placement and give it a description.

From the left navigation menu click on the "Edit rules" icon.  This will open the Rule Editing page.

Since these rules concern the extraction of numbers, I have decided to add them to NumberRules.xml.  You can find a list of rule files on the left. 

Clicking on the "Insert a new rule" icon at the top of the page will automatically populate a blank rule template at the beginning of the rule file.

Sometimes using this blank template will be the most efficient way to create a new rule, while other times it might be faster to copy and paste a similar rule, just making any necessary changes.  Please keep in mind the rule ordering.  In this case, we will just copy this new rule and paste it at the bottom of the rule file.  You should do that now, so that you don't forget.

There may be times when one rule needs to fire before another can.  Say, for instance, you wrote a basic rule to extract a 3-part name, another rule to extract a 2-part name, and another to extract a 1-part name.  You don't want the 1-part name rule to fire first, because it would match inside the text that the other two rules should capture.  Whereas if the 3-part name rule fires first, it can only match 3-part names, and once it has done so, the other rules still have the potential to fire.

As you can see below, I have moved my new rule, titled stat_cf-0001, to the bottom of the rule file and filled in the missing information.  You may already be familiar with XML; if not, you should be able to notice a pattern here.

In line 1767 the beginning of the rule is denoted by <Rule..., and line 1781 marks the end of the rule with </Rule>.

Likewise, line 1772 shows the addition of an SV: <sv> opens the element, the SV itself is entered, and </sv> signifies that all SVs have been entered.

Let's walk through the rule, line by line in the table below.

Line 1767:  <Rule ID="stat_cf-0001">
Beginning of the rule.  The ID informs us that it is a statistics rule, canonical form, and assigns it a unique number.

Line 1768:  <description>Finds sport results placement, such as third-place, fifth-place, etc.</description>
A description of what the rule does goes between the description tags.

Line 1769:  <order>10</order>
Ignore this; it is almost always 10.

Line 1770:  <result>
Begins the result statement.  Everything between this tag and its corresponding </result> on line 1775 defines the actual result.

Line 1771:  <combine>0</combine>
The token offset that Toolkit stops on when it decides what gets extracted.  The number can be negative or positive.  Here we just want the one term, but if we wanted to include lexical entries before or after it, we would change this number.

Line 1772:  <sv><SPORT_STAT/></sv>
This is where we add the entity that we want the lexical entries to be labeled as.  We only want one entity.

Line 1773:  <attributes></attributes>
We can assign attributes to an entity here.  If we wanted the rule to be more specific, such as labeling the statistic as football-related or tennis-related, we could add that.

Line 1774:  <nolonger></nolonger>
Sometimes we need to tell Toolkit that an entry is no longer another entity.  If third-place were being matched to an ORG, we would state here that it is no longer an ORG.

Line 1775:  </result>
Closes our result statement.

Line 1776:  <when>
Opens our when statement: everything we say happens in the result statement happens "when" the following conditions are met.

Line 1777:  <T offset="0">
When the token is the zero token...

Line 1778:  <IS><sv><placement/></sv></IS>
...the token has the semantic vector placement.

Line 1779:  </T>
We are done setting the parameters for the 0 token.  We could add more for additional tokens.

Line 1780:  </when>
Closes the when statement.

Line 1781:  </Rule>
Closes the rule.

Line 1782:  (blank line)

Line 1783:  </RuleSet>
Notice that this is not part of the rule.  It should always be the last line of the file; it closes the rule file.
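
Putting the lines together, the complete rule as walked through above looks like this (indentation is cosmetic):

<Rule ID="stat_cf-0001">
    <description>Finds sport results placement, such as third-place, fifth-place, etc.</description>
    <order>10</order>
    <result>
        <combine>0</combine>
        <sv><SPORT_STAT/></sv>
        <attributes></attributes>
        <nolonger></nolonger>
    </result>
    <when>
        <T offset="0">
            <IS><sv><placement/></sv></IS>
        </T>
    </when>
</Rule>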

List of typical rule naming conventions:

  • lc = left context (not part of the match)
  • rc = right context (not part of the match)
  • cf = canonical form
  • nc = no context
  • pf = prefix indicator (context is part of the match)
  • sf = suffix indicator (context is part of the match)
  • sv = rule creates an SV, not an ENTITY
  • sc = special case (i.e., a weird syntax form)
  • mm = rule creates multiple SVs and/or ENTITIES
  • md = rule modifies an existing ENTITY for attributes or scope
  • nl = rule that unsets an SV (no longer)
  • rc = recursive rule
  • rr = rule generates a lot for recall; remove to improve precision
  • [a-z] after a number indicates rules that are variations or related
  • A language digraph can be used as the first element to denote language-specific rules

 

Rule naming convention:

(projectDescriptor_)?descriptor_type-idnumber(variant)?

 

default rule name:  address_cf-0001

rule name w/ project:  acme_address_cf-0002

variant rule:  vessel_pf-2304a

Lexical Tuning Part 1

OK, so now that our rule is set, we need to add entries to our lexicon for it to match.  This rule is going to extract instances of terms like third-place as SPORT_STAT based on those lexical entries having the semantic vector placement.  Therefore, we need to check our lexicon, add <placement/> to any of these types of entries that are already there, and add the ones that aren't.  Keep in mind that these entries have a hyphen.  You want to account for all possible forms that could appear in a document, such as third place, third - place, third-place, 3rd place, 3rd - place and 3rd-place.

It is usually easiest to make a list, then check the lexicon and add the SV to existing entries, then add the missing entries last.  For this tutorial we will just do 1st through 10th place.

first place, first - place, first-place, 1st place, 1st - place, 1st-place

second place, second - place, second-place, 2nd place, 2nd - place, 2nd-place

third place, third - place, third-place, 3rd place, 3rd - place, 3rd-place

fourth place, fourth - place, fourth-place, 4th place, 4th - place, 4th-place

fifth place, fifth - place, fifth-place, 5th place, 5th - place, 5th-place

sixth place, sixth - place, sixth-place, 6th place, 6th - place, 6th-place

seventh place, seventh - place, seventh-place, 7th place, 7th - place, 7th-place

eighth place, eighth - place, eighth-place, 8th place, 8th - place, 8th-place

ninth place, ninth - place, ninth-place, 9th place, 9th - place, 9th-place

tenth place, tenth - place, tenth-place, 10th place, 10th - place, 10th-place

From the Dictionary Editor page you can perform a search in the search bar at the top for each entry. 

Clicking on the "Edit lexicon" icon from the left navigation menu takes you to the Dictionary Editor page.

I need to add all of the lexical terms to the lexicon with the semantic vector <placement/>.  They can be added individually or imported as a list.  We will go over a few different ways to add entries.

If you want to add an entry to an existing dictionary, the process is simple.  Simply click on the name of the dictionary, and Toolkit will populate most of the XML for you.  In this case, noun.xml would be the closest dictionary to add these entries to.

I have selected noun.xml, and then in the window on the right I have inserted a blank line where I want my new entry to be.

Clicking on the "Insert new dictionary entry" icon will insert a blank XML entry at your cursor.

We need to enter the word and also the semantic vector.
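
For example, following the format of the will gray entry shown earlier, the third-place entry might look like this (a minimal sketch; your entries may also carry part-of-speech SVs such as noun):

<lex><word>third-place</word><sv><placement/></sv></lex>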

Make sure you click the "Validate" and "Save" icons  at the top.

When you have a list of entries, like we do, this method is not the most efficient.  You could instead take a list of words, use an Emacs macro to add the XML to each, and copy and paste them into the same place.

Another option is to add a new dictionary.  When you have a large list of entries that are seemingly not related to the existing dictionaries, this is the best option.

Clicking on the "Plus" icon above the list of dictionaries opens up a new pop-up allowing the creation of a new dictionary. 

I have decided where I want my new dictionary to be and named it.  Once I click on the "Submit" button, the list refreshes and my dictionary will be listed.

From here, I have the same options of adding individual entries, or copying and pasting a list of XML entries directly to the dictionary.

Another, even more efficient option, is to use the "Word Import Tool." 

Click on the "Import word list" icon from the left navigation menu. This opens the Word Import page. 

From this page, you can select the file you would like your word list to be added to, and either "Append" your list to it, or "Overwrite" the entire file with your new list.  From this page, you also have the option to "Upload," which will add your new file directly as a new dictionary file.

You then scroll down the list on the right and select the semantic vectors that you would like associated with each entry.  We will scroll down to the Pragmatics section and select  placement.

Once the word import is complete, you can navigate back to your new SPORT_STAT dictionary and see a list of around 60 entries with the appropriate XML and semantic vector.

Rule Testing

You can now navigate back to the Corpus Management page and click on the "Clear processing results" icon to clear your previous results. 

Now click on the "Process all documents" icon to re-process your corpus.  You should now see the results of the rule you just created extracting entries like third-place and fifth-place as the new entity SPORT_STAT.

In the first document 2016 us open.txt right click on the SPORT_STAT entity third-place from the Text tab view.  From the right-click menu select "Show rule match detail."  You will now see the list of rules that fired, showing General Rule stat_cf-0001, the rule you just wrote.

Lexical Tuning Part 2

Next on our list of rules, we want to extract placement results such as second, fifth, ninth and 12th.

Since we just added the <placement/> semantic vector, we know that these terms are not associated with it yet.  

Even though many of them are already in the CORE dictionary, you cannot edit the CORE, so you will need to add these terms to a new dictionary with the appropriate semantic vector.

Then you will need to write a rule that matches that semantic vector to the entity SPORT_STAT.

Take the list in the table below and add it to the SPORT_STAT dictionary with the appropriate semantic vector.

first, 1st

second, 2nd

third, 3rd

fourth, 4th

fifth, 5th

sixth, 6th

seventh, 7th

eighth, 8th

ninth, 9th

tenth, 10th

eleventh, 11th

twelfth, 12th

Lexical Tuning Assessment

Please copy and paste your entire SPORT_STAT dictionary here.

Rule: stat_lc-0002

Next in our list of entries that we would like to extract as SPORT_STAT are examples like 156 players.  This time we need to create a new semantic vector for general people terms like players, coaches, teams, etc.  We will then write a rule that matches a number followed by that semantic vector and assigns the result to SPORT_STAT.

Go ahead and create a new semantic vector under pragmatics called participant.

Next, you need to create a list of terms that you would like to be associated with this semantic vector.  Use the list found in the table below.  Then add them to the SPORT_STAT dictionary.

player, players

official, officials

team, teams

member, members

coach, coaches
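
As with the placement terms, each of these needs a lexical entry carrying the new semantic vector.  A sketch of the first two entries:

<lex><word>player</word><sv><participant/></sv></lex>
<lex><word>players</word><sv><participant/></sv></lex>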

Go back to the NumberRules.xml file and scroll all the way to the bottom of the rule file.  Copy the rule you previously created and paste it directly under itself.  This time when we make a new rule, we are going to just edit a few things.

We want to make the following changes to the rule at the bottom:

  • The name of the rule.
  • Its description.
  • The combine value.
  • Add another token entry.
  • Change the semantic vector in the current token entry.

Line 1782:  <Rule ID="stat_lc-0002">
The rule name, using lc to denote that the match is based on left context.

Line 1783:  <description>Finds sports statistics based on number and general person terms, such as 156 players</description>
Description of the rule.

Line 1786:  <combine>1</combine>
This time we are combining two tokens, so we need to combine the 0 token (156) and the 1 token (players).

Lines 1792-1794:  <T offset="0"><IS><sv><NUMBER/></sv></IS></T>
We need the 0 token to be a number, so we gave it the semantic vector <NUMBER/>.

Lines 1795-1797:  <T offset="1"><IS><sv><participant/></sv></IS></T>
We need the 1 token to be one of the general person terms we just added, so we gave it the semantic vector <participant/>.
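
Assembled, the edited rule at the bottom of NumberRules.xml should look something like this (a sketch based on the table above, with the unchanged lines carried over from stat_cf-0001):

<Rule ID="stat_lc-0002">
    <description>Finds sports statistics based on number and general person terms, such as 156 players</description>
    <order>10</order>
    <result>
        <combine>1</combine>
        <sv><SPORT_STAT/></sv>
        <attributes></attributes>
        <nolonger></nolonger>
    </result>
    <when>
        <T offset="0">
            <IS><sv><NUMBER/></sv></IS>
        </T>
        <T offset="1">
            <IS><sv><participant/></sv></IS>
        </T>
    </when>
</Rule>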

After validating and saving the rule, you should now be able to go back to your Corpus Management page, clear the processing results, and re-process the corpus.  Now you should see additional results being extracted as SPORT_STAT, such as 156 players and 10 players.

Browse through the documents in your corpus and notice that there is some over-generation of SPORT_STAT.  We have allowed our rules to extract a very broad set of terms: any number paired with a term that is tagged <participant/>.  Going forward, you need to use your judgment.  This task is here because we will never really extract SPORT_STAT as an entity, and because the process covers a wide range of tasks within Toolkit.  When you create entities and semantic vectors and write rules, you must decide whether what you are doing will over-generalize and extract too much.

Rule Assessment

Please copy and paste stat_lc-0002 here.

Regular Expressions

A regular expression is a special string of text that matches a particular search pattern.  For example, we can't possibly list every email address in our lexicon, but we can write a regular expression that can look for strings of text that match its parameters.  Then we can associate those results with the entity EMAIL.
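
For instance, a deliberately simplified, illustrative-only email pattern (real-world email matching needs more care) might be:

[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}

This matches one or more address characters, an @, a domain, a dot, and a top-level domain of at least two letters.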

Appendix C contains a Regular Expression Tutorial.  Please read through it and try the practice exercises contained within.  Then move on to the next task of writing our first regular expression.

Rule: stat_lc-0003

Next in our list of desired rules is to extract instances like T-4, T-9 and T-12.  These are references to being tied in golf.  While a golf tournament could have a significant number of players and potential places, we are just going to write this rule to match tie placements between 1 and 12.

There are many useful regex tutorials online, as well as sites that will check your regex.  Feel free to use those for assistance.  The table below will walk you through this regex: 

 

\b[a-zA-Z]-[1-9][1-2]?\b

  • \b = Indicates that we are starting at a new word boundary.
  • [ = Opens a character class; one character from inside the brackets will be matched.
  • a-zA-Z = Any character in the range a-z, lowercase or uppercase.  (*Note: if the rule in the rule file is marked case insensitive you only need a-z, but having both doesn't hurt.)
  • ] = Closes the character class.
  • - = The hyphen itself must now be matched.
  • [ = Opens a new character class.
  • 1-9 = Any character in the range 1-9 will be matched.
  • ] = Closes the character class.  (*Note: no quantifier follows this bracket.  You can add one to say how many times the pattern should repeat, or that it is optional.)
  • [ = Opens a character class.
  • 1-2 = A character in the range 1-2 can be matched.
  • ] = Closes the character class.
  • ? = A question mark makes whatever immediately precedes it optional.  Here a character class immediately precedes it, so a 1 or 2 is optionally allowed.
  • \b = Indicates that the pattern stops at a word boundary.
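
One quirk to be aware of: as written, [1-9][1-2]? never matches T-10 (0 is not in the range [1-2]) and does match other two-digit endings such as T-21 or T-92.  If you want exactly ties 1 through 12, a tighter sketch is \b[a-zA-Z]-(1[0-2]|[1-9])\b, which tries the two-digit range 10-12 first and falls back to a single digit 1-9.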

This type of rule includes two types of tokens, letters and numbers.  Therefore, we will include it in our Multitoken Rule File.  

Navigate back to the Rule Editor page and select the MultiTokenRules.xml file.

You will notice that the rule structure is the same as the rules we have previously seen. 

Go ahead and create a blank rule template and move it to the bottom of the rule file to edit.  You will need to complete the following, as shown in Figure 48.  Then validate and save the rule.

 

  • Rule name
  • Description
  • Regex
  • SV
  • 2 Examples

Now go back and clear the processing results and re-process your corpus.  

When viewing 2016 us open.txt in the text tab view, you should notice that it is now extracting golf ties matching the regular expression.  However, if you expand the WEAPON entity from the list of entities in the middle, you should see that T-12 appears.  This is because T-12 is hardcoded in the lexicon as a WEAPON.  

This presents a problem.  When these instances appear, you need to use your judgment and decide if the entry needs to be hardcoded in the lexicon or not.  In this case, it is an example of over-generation.  I can imagine other instances of T-12 in documents that I do not want to be matched to a WEAPON.  Therefore, in this case, the best thing to do is remove the semantic vectors indicating that this is a WEAPON from the lexical entry.

Go ahead and remove those semantic vectors and re-process the document.  You should now see T-12 extracted as SPORT_STAT.

Rule: stat_lc-0004

The last rule that we want to create needs to be able to extract instances of No. 1, No. 2, etc.  This is a very predictable pattern and perfect for a regex.  Go ahead and write a regex with the following parameters:

Find only instances of the characters n & o, followed by a period, then a space and a number.  For brevity, let's cap the number at 5.

Once your regex is complete, add it to the MultiTokenRules.xml file, clear your processing results and re-process your corpus.  You should now see No. 1 being extracted as SPORT_STAT in the 2016 us open.txt document.

Regular Expression Assessment

Please copy and paste your regular expression for rule stat_lc-0004 here.