Multi-document summarization

Multi-document summarization is an automatic procedure aimed at extraction of information from multiple texts written about the same topic. The resulting summary report allows individual users, such as professional information consumers, to quickly familiarize themselves with information contained in a large cluster of documents. In such a way, multi-document summarization systems are complementing the news aggregators performing the next step down the road of coping with information overload.

Key benefits

Multi-document summarization creates information reports that are both concise and comprehensive. With different opinions being put together & outlined, every topic is described from multiple perspectives within a single document. While the goal of a brief summary is to simplify information search and cut the time by pointing to the most relevant source documents, comprehensive multi-document summary should itself contain the required information, hence limiting the need for accessing original files to cases when refinement is required. Automatic summaries present information extracted from multiple sources algorithmically, without any editorial touch or subjective human intervention, thus making it completely unbiased.

Technological challenges

The multi-document summarization task has turned out to be much more complex than summarizing a single document, even a very large one. This difficulty arises from inevitable thematic diversity within a large set of documents. A good summarization technology aims to combine the main themes with completeness, readability, and conciseness. Document Understanding Conferences,^[1] conducted annually by NIST, have developed sophisticated evaluation criteria for techniques accepting the multi-document summarization challenge.

An ideal multi-document summarization system does not simply shorten the source texts but presents information organized around the key aspects to represent a wider diversity of views on the topic. When such quality is achieved, an automatic multi-document summary is perceived more like an overview of a given topic. The latter implies that such text compilations should also meet other basic requirements for an overview text compiled by a human. The multi-document summary quality criteria are as follows:

clear structure, including an outline of the main content, from which it is easy to navigate to the full text sections
text within sections is divided into meaningful paragraphs
gradual transition from more general to more specific thematic aspects
good readability

The latter point deserves additional note - special care is taken in order to ensure that the automatic overview shows:

no paper-unrelated "information noise" from the respective documents (e.g., web pages)
no dangling references to what is not mentioned or explained in the overview
no text breaks across a sentence
no semantic redundancy.

Real-life systems

The multi-document summarization technology is now coming of age - a view supported by a choice of advanced web-based systems that are currently available.

Ultimate Research Assistant^[2] - performs text mining on Internet search results to help summarize and organize them and make it easier for the user to perform online research. Specific text mining techniques used by the tool include concept extraction, text summarization, hierarchical concept clustering (e.g., automated taxonomy generation), and various visualization techniques, including tag clouds and mind maps.
iResearch Reporter^[3] - Commercial Text Extraction and Text Summarization system, free demo site accepts user-entered query, passes it on to Google search engine, retrieves multiple relevant documents, produces categorized, easily readable natural language summary reports covering multiple documents in retrieved set, all extracts linked to original documents on the Web, post-processing, entity extraction, event and relationship extraction, text extraction, extract clustering, linguistic analysis, multi-document, full text, natural language processing, categorization rules, clustering, linguistic analysis, text summary construction tool set.
Newsblaster^[4] is a system that helps users find news that is of the most interest to them. The system automatically collects, clusters, categorizes, and summarizes news from several sites on the web (CNN, Reuters, Fox News, etc.) on a daily basis, and it provides users an interface to browse the results.
NewsInEssence^[5] may be used to retrieve and summarize a cluster of articles from the web. It can start from a URL and retrieve documents that are similar, or it can retrieve documents that match a given set of keywords. NewsInEssence also downloads news articles daily and produces news clusters from them.
NewsFeed Researcher^[6] is a news portal performing continuous automatic summarization of documents initially clustered by the news aggregators (e.g., Google News). NewsFeed Researcher is backed by a free online engine covering major events related to business, technology, U.S. and international news. This tool is also available in on-demand mode allowing a user to build a summaries on selected topics.
Scrape This^[7] is like a search engine, but instead of providing links to the most relevant websites based on a query, it scrapes the pertinent information off of the relevant websites and provides the user with a consolidated multi-document summary, along with dictionary definitions, images, and videos.
JistWeb^[8] is a query specific multiple document summariser.

As auto-generated multi-document summaries increasingly resemble the overviews written by a human, their use of extracted text snippets may one day face copyright issues in relation to the fair use copyright concept.

Bibliography

Günes Erkan and Dragomir R. Radev. Lexrank: Graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research (JAIR), 2004.
Dragomir R. Radev, Hongyan Jing, Malgorzata Styś, and Daniel Tam. Centroid-based summarization of multiple documents. Information Processing and Management, 40:919–938, December 2004.
Kathleen R. McKeown and Dragomir R. Radev. Generating summaries of multiple news articles. In Proceedings, ACM Conference on Research and Development in Information Retrieval SIGIR'95, pages 74–82, Seattle, Washington, July 1995.
C.-Y. Lin, E. Hovy, "From single to multi-document summarization: A prototype system and its evaluation", In "Proceedings of the ACL", pp. 457–464, 2002
Kathleen McKeown, Rebecca J. Passonneau, David K. Elson, Ani Nenkova, Julia Hirschberg, "Do Summaries Help? A Task-Based Evaluation of Multi-Document Summarization", SIGIR’05, Salvador, Brazil, August 15–19, 2005
R. Barzilay, N. Elhadad, K. R. McKeown, "Inferring strategies for sentence ordering in multidocument news summarization", Journal of Artificial Intelligence Research, v. 17, pp. 35–55, 2002
M. Soubbotin, S. Soubbotin, "Trade-Off Between Factors Influencing Quality of the Summary", Document Understanding Workshop (DUC), Vancouver, B.C., Canada, October 9–10, 2005
C Ravindranath Chowdary, and P. Sreenivasa Kumar. "Esum: an efficient system for query-specific multi-document summarization." In ECIR (Advances in Information Retrieval), pp. 724–728. Springer Berlin Heidelberg, 2009.

References

↑ "Document Understanding Conferences". Nlpir.nist.gov. 2014-09-09. Retrieved 2016-01-10.
↑ "Generate Research Report". Ultimate Research Assistant. Retrieved 2016-01-10.
↑ "iResearch Reporter service". Iresearch-reporter.com. Retrieved 2016-01-10.
↑ Archived April 16, 2013, at the Wayback Machine.
↑ Archived April 11, 2011, at the Wayback Machine.
↑ "News Feed Researcher | General Stuff". Newsfeedresearcher.com. Retrieved 2016-01-10.
↑ Archived September 19, 2009, at the Wayback Machine.
↑ Archived May 29, 2013, at the Wayback Machine.

External links

Natural language processing

General terms	Text corpus Speech corpus Stopwords Bag-of-words AI-complete n-gram (Bigram, Trigram)

Text analysis	Text segmentation Part-of-speech tagging Text chunking Compound term processing Collocation extraction Stemming Lemmatisation Named-entity recognition Coreference resolution Sentiment analysis Concept mining Parsing Word sense disambiguation Terminology extraction Truecasing

Automatic summarization	Multi-document summarization Sentence extraction Text simplification

Machine translation	Computer-assisted Example-based Rule-based

Automatic identification and data capture	Speech recognition Speech synthesis Optical character recognition Natural language generation

Topic model	Pachinko allocation Latent Dirichlet allocation Latent semantic indexing

Computer-assisted reviewing	Automated essay scoring Concordancer Grammar checker Predictive text Spell checker Syntax guessing

Natural language user interface	Automated online assistant Chatterbot Interactive fiction Question answering

This article is issued from Wikipedia - version of the Saturday, April 23, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.