COMP61332: Text Mining (2011-2012)

This is an archived syllabus from 2011-2012

Text Mining
Level: 6
Credit rating: 15
Pre-requisites: COMP61011 (ML&DM), COMP60421 (Onto Eng for SW)
Co-requisites: No Co-requisites
Duration: 5 weeks, 1 day per week
Lectures: 20
Labs: 20
Lecturers: Jock McNaught
Course lecturer: Jock McNaught

Timetable
Semester  Event    Location  Day  Time           Group
Sem 2 P4  Lecture  2.15      Mon  09:00 - 11:00  -
Sem 2 P4  Lab      2.25abcd  Mon  11:00 - 13:00  -
Sem 2 P4  Lecture  2.15      Mon  14:00 - 16:00  -
Sem 2 P4  Lab      2.25abcd  Mon  16:00 - 18:00  -
Assessment Breakdown
Exam: 50%
Coursework: 50%
Lab: 0%

Themes to which this unit belongs
  • Text Mining

Introduction

We naturally record and communicate much of our knowledge in textual form. However, for many years now, textual information has grown at such a rate that individuals struggle to keep up to date in their fields of interest. Well before the advent of the Web, people suffered from information overload and information overlook, and today the ease of electronic publication has only exacerbated these problems. It has also long been realised that the vast archives of text at our disposal contain hidden, unsuspected, potentially valuable information: nowhere explicitly stated, and until recently discoverable only through serendipity or through painstaking manual identification and linking of often disparate chunks of knowledge.

Text mining has evolved in recent years as a way of mitigating information overload and information overlook, and of helping us discover new knowledge from old. To do this, it employs a battery of techniques from information retrieval, natural language processing and data mining. Although the holy grail of text mining is the discovery of previously unsuspected knowledge, text mining techniques find application in a wide range of areas, essentially concerned with the organising, selecting, filtering, combining, association and exploitation of information. Text mining goes far beyond, and is not to be confused with, classic information retrieval (conventional search engine technology).

What makes text mining challenging is the combination of: the core problems of natural language processing (how to make sense of unstructured text, and how to deal with the ambiguity inherent in language that humans naturally cope with); the problems that arise when dealing with very large amounts of electronic text and the even larger volume of representations derived from it during processing; the problems of integrating different components, with different input/output specifications and different intermediate representations, in text mining workflows to accomplish sophisticated tasks; and the non-trivial problem of matching actual system capabilities to user expectations and requirements.

Applications of text mining are many and varied: systems to find promising targets for drug discovery, to support systematic reviews, to match CVs to job profiles, to carry out business news analysis for competitive intelligence, to aid discovery of disease-gene associations, to monitor reports of terrorist activity, to help generate hypotheses for scientific research, to direct customer queries to appropriate support staff, to discover positive and negative opinions on topics of interest, to discover hot topics and trends, ...

The School's Text Mining Research Group is closely linked with the National Centre for Text Mining, also hosted by the School, and this course unit is unique in benefitting from both the expertise and the technology available via the National Centre.

Aims

This course unit aims to provide students with an understanding of principles, issues, techniques and solutions connected with text mining, and to enable them to gain knowledge of how recent advances in text mining relate to innovative approaches to organising, characterising, finding and exploiting large scale textual information in the search for new knowledge.

Programme outcome: Unit learning outcome (assessment listed beneath each outcome)
A1: Demonstrate a requisite understanding of selected concepts, terminology and issues related to text mining
  • Examination
A2: Demonstrate a requisite understanding of the fundamental techniques for text mining
  • Examination
A1: Demonstrate a requisite understanding of the relationship between text mining techniques and those of related areas (information retrieval, data mining)
  • Examination
A1: Demonstrate a requisite understanding of relevant (de facto) standards supporting text mining
  • Examination
B3: Explain the general principles of text mining and discuss the content and role of relevant key publications and (de facto) standards
  • Examination
B2, B3: Explain the difficulty of analysing different types of content in relation to user needs
  • Examination
B1, B2, B3: Explain how techniques for characterising the meaning of content and for semantic search are applied
  • Examination
B3: Discuss, critically analyse and evaluate current approaches in the field
  • Examination
  • Individual coursework
C1, C3, C4: Be able to use the power of text mining for content analysis, search, personalisation and enterprise applications
G4: Appreciate issues of communication of information and knowledge discovery.
G4: Ability to support enterprise knowledge management activities.

Syllabus

Introduction: background, motivation, dealing with information overload and information overlook, unstructured vs. (semi-)structured data, evolving information needs and knowledge management issues, enhancing user experience of information provision and seeking, the business case for text mining.

The text mining pipeline: information retrieval, information extraction and data mining.

Fundamentals of natural language processing: linguistic foundations, levels of linguistic analysis.

Approaches to text mining: rule-based vs. machine learning based vs. hybrid; generic vs. domain specific; domain adaptation.

Dealing with real text: text types, document formats and conversion, character encodings, markup, low-level processes (sentence splitting, tokenisation, part of speech tagging, chunking).
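
By way of illustration only, two of the low-level processes named above (sentence splitting and tokenisation) can be sketched in a few lines of Python. This is a minimal sketch under simplifying assumptions: the function names and the regular-expression rules are hypothetical, not course material, and real text (abbreviations, numbers, URLs) needs considerably more care.

```python
import re

def split_sentences(text):
    # Naive rule (an assumption): a sentence ends at ., ! or ? when
    # followed by whitespace and an uppercase letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

def tokenise(sentence):
    # A token is either a word (allowing internal hyphens or apostrophes)
    # or a single punctuation character.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

if __name__ == "__main__":
    text = "Text mining is useful. It finds new knowledge!"
    for sentence in split_sentences(text):
        print(tokenise(sentence))
```

A rule of this kind will wrongly split after abbreviations such as "Dr." when a capitalised word follows, which is one reason these steps are treated as problems in their own right.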

Information extraction: term extraction, named entity recognition, relation extraction, fact and event extraction; partial analysis vs. full analysis.

Data mining and visualisation of results from text mining.

Evaluation of text mining systems: evaluation measures, role of evaluation challenges, usability evaluation, the U-Compare initiative.
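
As a hedged illustration of what an evaluation measure can look like, the sketch below computes precision, recall and F1 for a set of extracted items against a gold-standard set. The choice of these three measures, the function name and the example data are assumptions made for illustration; the syllabus does not specify which measures are covered.

```python
def precision_recall_f1(predicted, gold):
    # Compare the items a system extracted with a gold-standard annotation.
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

if __name__ == "__main__":
    # Hypothetical output of a named entity recogniser vs. a gold standard.
    system_output = {"BRCA1", "TP53", "insulin"}
    gold_standard = {"BRCA1", "TP53", "EGFR", "KRAS"}
    print(precision_recall_f1(system_output, gold_standard))
```

Here two of the three extracted items are correct (precision 2/3) and two of the four gold items are found (recall 1/2).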

Resources for text mining: annotated corpora, computational lexica, ontologies, computational grammars; design, construction and use issues.

Issues in large scale processing of text: distributed text mining, scalable text mining systems.

A sampler of text mining applications and services; case studies.

Reading List

Core Text
Title: Text mining handbook: advanced approaches in analyzing unstructured data
Author: Feldman, Ronen and James Sanger
ISBN: 9780521836579
Publisher: CUP
Edition:
Year: 2008


Supplementary Text
Title: Text mining: classification, clustering and applications
Author: Srivastava, Ashok and Mehran Sahami (eds.)
ISBN: 9781420059403
Publisher: Chapman & Hall
Edition:
Year: 2009


Supplementary Text
Title: Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition (2nd edition)
Author: Jurafsky, Daniel and James H. Martin
ISBN: 9780135041963
Publisher: Pearson International
Edition:
Year: 2009


Supplementary Text
Title: Text mining for biology and biomedicine
Author: Ananiadou, Sophia and John McNaught (eds.)
ISBN: 158053984X
Publisher: Artech House
Edition:
Year: 2006


Supplementary Text
Title: Introduction to information retrieval
Author: Manning, Christopher D., Prabhakar Raghavan and Hinrich Schütze
ISBN: 9780521865715
Publisher: Cambridge University Press
Edition:
Year: 2008
Available online at: http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html