Information extraction


AnHai's notes on IE


Modern information extraction is generally credited to the MUC (Message Understanding Conference) evaluations, which were established by DARPA. The field has evolved considerably, and present-day evaluations are conducted within the ACE (Automatic Content Extraction) evaluation program run by NIST.

  1. Information Extraction, Andrew McCallum, University of Massachusetts, ACM Queue November 2005 [1] A non-technical overview of information extraction that presents a high-level five-step pipeline consisting of
    1. Segmentation: essentially tokenization of text
    2. Classification: classify each segmented piece as one of several classes (Person, Organization etc.)
    3. Association: essentially relationship detection
    4. Normalization: different surface forms are normalized to the same canonical one (e.g., 3-3:30 and 1500-1530 may both map to the same ISO-standard time).
    5. Deduplication: essentially coreference resolution.
    6. Discusses the different higher-level approaches to information extraction and their applicability
      1. Simple regular expressions: for simple extraction tasks
      2. Rules: more complex extraction tasks whose semantics are still clearly definable
      3. Machine learning algorithms: learn subtle rules for very complex tasks
    7. Uncertainty is an integral part of information extraction and needs to be managed appropriately. Training also needs to become easier (defining large numbers of labeled examples is not easy), which motivates semi-supervised methods and interactive extraction.
    8. Note: A very lightweight tutorial and good light reading.
  2. Automatic Information Extraction, Hamish Cunningham, University of Sheffield [2] An extensive overview of different IE tasks along with nice examples. Starts with the claim that IE tasks face a specificity-complexity tradeoff; i.e., the more complex the IE task, the more specific the domain from which the information is extracted should be. Several applications are listed, such as marketing, PR, and media analysis. IE tasks are broadly divided into 5 categories.
    1. Named Entity
    2. Coreference Resolution
    3. Template Element Construction: Constructs templates by adding descriptions to extracted information (primarily using coreference resolution)
    4. Template Relation Construction: Essentially relationship identification
    5. Scenario Extraction: Tie together the elements and relations into a single complex event such as Person A was replaced by Person B on Date C at Organization Y.
    6. Summary: Modern information extraction pretty much started with MUC (Message Understanding Conference), and the aforementioned tasks were the basis for this conference. The newer information extraction evaluation is known as ACE (Automatic Content Extraction) [3] and is significantly harder, with tasks 1 and 2 merged into a single task, 3 and 4 into another, and 5 kept separate.
    7. Note: A fairly vanilla tutorial that basically follows the MUC and ACE tasks in most of its description.
  3. Introduction to Information Extraction, Doug Appelt and David Israel, SRI, Tutorial at IJCAI 1999 [4] Very detailed introduction to information extraction from a rule-based linguist's perspective. After some introduction, the tutorial discusses two main approaches to building extraction systems: (a) the knowledge engineering approach and (b) automatically trainable systems. Several examples of the pros and cons of the two approaches are discussed. The different components of an information extraction system are described as
    1. Tokenization: Straightforward
    2. Morphological Processing: (a) Identify inflectional variants (b) Lexical lookup of tokens (c) Part-of-speech tagging (d) Names and structured items: identification of structured items such as dates, times, telephone numbers, and proper names. There is a somewhat detailed discussion of both knowledge-based and machine learning approaches to named-entity extraction. A generic approach to building rule-based named-entity recognizers is given, along with a discussion of trainable named-entity taggers using HMMs etc. There are also pointers to several tools for building named-entity taggers.
    3. Syntactic Analysis: shallow and full-parsing. Both knowledge-based and trainable parsers are discussed.
    4. Domain Analysis: (a) Coreference analysis with a detailed description of a coreference algorithm (b) Merging of partial results
    5. Note: Incomplete. This is a very comprehensive tutorial but not very well organized. Toward the end (Domain Analysis) it becomes fairly opaque, and several issues are conflated.
  4. Information Extraction and Integration: an Overview, William Cohen [5] An excellent overview of statistical methods for information extraction. Detailed explanation of various algorithms, starting with simple ones like a sliding window approach used to capture sequence information. This is followed by more powerful Markovian algorithms in increasing order of power -- Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields.
  5. Empirical Methods in Information Extraction, Claire Cardie, AI Magazine Another overview of several of the MUC tasks. [6]
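McCallum's five-step pipeline from the first reference above can be sketched end to end in a few lines. This is a toy illustration only: the gazetteer, the WorksFor association heuristic, and all function names are invented here, not taken from the paper.

```python
import re

# Toy sketch of the five-step IE pipeline: segmentation, classification,
# association, normalization, deduplication. Names and data are invented.

PERSON_NAMES = {"Alice", "Bob"}   # hypothetical gazetteer
ORG_NAMES = {"Acme"}

def segment(text):
    """Step 1: Segmentation -- tokenize the raw text."""
    return re.findall(r"\w+|\S", text)

def classify(tokens):
    """Step 2: Classification -- label each token with an entity class."""
    labels = []
    for tok in tokens:
        if tok in PERSON_NAMES:
            labels.append("PERSON")
        elif tok in ORG_NAMES:
            labels.append("ORG")
        else:
            labels.append("O")
    return list(zip(tokens, labels))

def associate(labeled):
    """Step 3: Association -- pair each person with each organization seen."""
    persons = [t for t, l in labeled if l == "PERSON"]
    orgs = [t for t, l in labeled if l == "ORG"]
    return [(p, "WorksFor", o) for p in persons for o in orgs]

def normalize(time_str):
    """Step 4: Normalization -- map a '3-3:30' style range to a 24h form."""
    m = re.match(r"(\d{1,2})-(\d{1,2}):(\d{2})$", time_str)
    if m:
        h1, h2, mm = m.groups()
        return f"{int(h1) + 12:02d}00-{int(h2) + 12:02d}{mm}"
    return time_str

def deduplicate(triples):
    """Step 5: Deduplication -- drop repeated extractions."""
    return sorted(set(triples))

tokens = segment("Alice joined Acme . Bob joined Acme .")
print(deduplicate(associate(classify(tokens))))
print(normalize("3-3:30"))
```

The normalization step reproduces the article's example of mapping "3-3:30" to "1500-1530" (assuming an afternoon reading); a real system would carry timezone and date context instead of a fixed 12-hour shift.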

Machine Learning

(a) Relational Markov Networks for Collective Information Extraction, Bunescu and Mooney, ACL 2004[7]

(b) Statistical Information Extraction Course offered by Andrew McCallum at UMass. I will look through it, collect appropriate papers, and summarize them here. [8]

In general, machine learning approaches to information extraction are becoming more popular. Several points to note (I will expand on them later):

  • Mainly supervised techniques. Unfortunately, labeled data is very sparse (ACE, for example, has a total labeled corpus of roughly 300,000 for training across all tasks)
  • Modeling sequence is important (more so for named-entity recognition)
  • Semi-supervised methods are becoming more popular. Recent innovation is to use other related problems to help the problem at hand.

Named-Entity Recognition

Until recently, the basic plot for named-entity recognition followed these two steps:

  • Train classifiers (from labeled data) to learn 3 classes: start of named entity, contains named entity, and other.
  • Use the resulting outputs to form a huge lattice; prune it, then find the sequence with the highest probability.
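The two steps above can be sketched with a toy Viterbi search: the per-token class probabilities stand in for a trained classifier's outputs, and zero-probability transitions prune the lattice. All of the numbers below are invented for illustration.

```python
import math

CLASSES = ["start", "contains", "other"]

# P(class | token) for a toy three-token sentence -- stand-ins for a
# trained classifier's outputs (e.g. for "New", "York", "is").
EMISSIONS = [
    {"start": 0.7, "contains": 0.1, "other": 0.2},
    {"start": 0.1, "contains": 0.7, "other": 0.2},
    {"start": 0.1, "contains": 0.1, "other": 0.8},
]

# P(next class | previous class); 0.0 prunes an impossible transition,
# e.g. jumping straight from "other" into "contains".
TRANSITIONS = {
    ("start", "contains"): 0.6, ("start", "other"): 0.3, ("start", "start"): 0.1,
    ("contains", "contains"): 0.4, ("contains", "other"): 0.4, ("contains", "start"): 0.2,
    ("other", "other"): 0.6, ("other", "start"): 0.4, ("other", "contains"): 0.0,
}

def log_or_neg_inf(p):
    return math.log(p) if p else float("-inf")

def best_sequence(emissions):
    """Viterbi search for the highest-probability path through the lattice."""
    # Initialize with the first token's class scores.
    paths = {c: (log_or_neg_inf(p), [c]) for c, p in emissions[0].items()}
    for em in emissions[1:]:
        new_paths = {}
        for c in CLASSES:
            # Pick the best predecessor class for c at this position.
            score, path = max(
                (paths[p][0]
                 + log_or_neg_inf(TRANSITIONS[(p, c)])
                 + log_or_neg_inf(em[c]),
                 paths[p][1])
                for p in CLASSES)
            new_paths[c] = (score, path + [c])
        paths = new_paths
    return max(paths.values())[1]

print(best_sequence(EMISSIONS))  # ['start', 'contains', 'other']
```

In a real tagger the emission scores come from the trained step-(a) classifiers and the lattice covers every sentence position, but the search is the same.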

A Statistical Model for Multilingual Entity Detection and Tracking [9] This paper follows the basic plot above and compares the performance of two algorithms for the first step, namely Robust Risk Minimization (Text Chunking based on a Generalization of Winnow [10]) and Maximum Entropy (A Maximum Entropy Approach to Natural Language Processing [11]). A paper closely related to this work is Maximum Entropy Markov Models for Information Extraction and Segmentation [12].

An easy-to-read paper on applying maximum entropy to text classification is Using Maximum Entropy for Text Classification [13].
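The maximum entropy classifier form p(y|x) ∝ exp(Σᵢ λᵢ fᵢ(x, y)) can be hand-wired in a few lines. The features and weights below are invented for illustration; a real system would learn the weights (e.g. by GIS or L-BFGS) from labeled data.

```python
import math

LABELS = ["sports", "politics"]

# Binary features f_i(x, y): "word w appears in x AND the label is y",
# each with a hand-set weight lambda_i. These values are invented.
WEIGHTS = {
    ("goal", "sports"): 1.5,
    ("election", "politics"): 1.8,
    ("vote", "politics"): 1.2,
}

def score(words, label):
    """Sum of lambda_i * f_i(x, y) for the features that fire."""
    return sum(WEIGHTS.get((w, label), 0.0) for w in words)

def classify(text):
    """Return the most probable label and the full p(y|x) distribution."""
    words = text.lower().split()
    exps = {y: math.exp(score(words, y)) for y in LABELS}
    z = sum(exps.values())                      # normalizer Z(x)
    probs = {y: e / z for y, e in exps.items()}
    return max(probs, key=probs.get), probs

label, probs = classify("the election vote was close")
print(label, round(probs[label], 3))
```

The exponential-of-a-weighted-feature-sum form is the same one used by the MEMM papers cited above; an MEMM simply conditions each such distribution on the previous state as well.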

Other references

Scalability Considerations in Information Extraction

This area does not seem to have been well addressed, but some preliminary papers are starting to creep in. One particular paper of interest is the one from the University of Washington [17]. This paper describes a scalable architecture centered on a binding engine ....TO BE COMPLETED

Recent work on using inverted indices to speed up named-entity recognition (Pointer to come shortly).


First Shot At Characterizing Information Extraction

In this section I will start to characterize the area of information extraction along several axes that will allow us to start defining the boundaries of the tutorial.

  1. Task-Based Classification: Broadly, this can be classified into three categories
    1. Entity Identification: This covers named-entity extraction and coreference resolution. Examples of named-entity types are persons, organizations, rock-and-roll bands, restaurants, software names, etc.
    2. Relationship Extraction: This involves extracting relationships between the entities identified in the task above. Relationships can be of different arities: unary, binary, etc.
      1. Example of a Unary relationship is ....
      2. Example of a Binary relationship is Person WorksFor Organization.
    3. Business Event Identification: This needs a broader definition of events than has been used in conventional IE. More discussion is needed here, which will be done over the next few days.
  2. Approach-Based Classification
    1. Knowledge-based Approach: Here I will describe general recipes for knowledge-based approaches to different information extraction tasks. A lot of information extraction tasks are accomplished by matching regular expressions over lexical features of input symbols. CPSL (Common Pattern Specification Language) was designed as a language for specifying such finite-state grammars, allowing the specification of application-independent IE rules.[22]
      1. named-entity identification
      2. Relationship identification
      3. Business event identification
    2. Machine-Learning Supervised Approaches: Here I will describe general recipes for machine-learning approaches to
      1. named-entity identification: The goal here is to model sequences. Initial approaches tried to model the sequence by building classifiers that used sliding windows over text as features. More powerful statistical approaches that model sequences directly, such as HMMs and MEMMs, have since been proposed. The state-of-the-art approach today is the CRF.
      2. relationship identification
      3. business event identification
    3. Machine Learning Semi-supervised
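The knowledge-based recipe above -- finite-state patterns over lexical features, in the spirit of CPSL -- can be sketched with plain regular expressions. The rules, entity types, and the WorksFor pattern below are illustrative assumptions, not actual CPSL syntax.

```python
import re

# Building blocks for the patterns: a title and a capitalized word.
TITLE = r"(?:Mr\.|Ms\.|Dr\.)"
CAP = r"[A-Z][a-z]+"

RULES = [
    # Named-entity rule: a title followed by one or two capitalized words.
    ("PERSON", re.compile(rf"{TITLE} {CAP}(?: {CAP})?")),
    # Named-entity rule: capitalized words ending in a company suffix.
    ("ORG", re.compile(rf"{CAP}(?: {CAP})* (?:Inc\.|Corp\.|Ltd\.)")),
]

# Relationship rule: PERSON "works for" ORG captured in one pattern.
WORKS_FOR = re.compile(rf"({TITLE} {CAP}) works for ({CAP}(?: {CAP})* Corp\.)")

def extract_entities(text):
    """Run every named-entity rule over the text, collecting matches."""
    found = []
    for label, pattern in RULES:
        for m in pattern.finditer(text):
            found.append((label, m.group(0)))
    return found

def extract_relations(text):
    """Run the relationship rule, yielding (subject, relation, object)."""
    return [(m.group(1), "WorksFor", m.group(2))
            for m in WORKS_FOR.finditer(text)]

text = "Dr. Smith works for Acme Corp."
print(extract_entities(text))
print(extract_relations(text))
```

A system like CPSL adds phases, typed annotations, and priorities on top of this idea, but the core mechanism is the same: cascaded finite-state patterns over lexical features.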
