Text Mining

Published: August 2009  Publisher: Posts & Telecom Press  Authors: Feldman (Israel), Sanger (USA)  Pages: 410  Word count: 506,000

Preface

The information age has made it easy to store large amounts of data. The proliferation of documents available on the Web, on corporate intranets, on news wires, and elsewhere is overwhelming. However, although the amount of data available to us is constantly increasing, our ability to absorb and process this information remains constant. Search engines only exacerbate the problem by making more and more documents available in a matter of a few key strokes.

Text mining is a new and exciting research area that tries to solve the information overload problem by using techniques from data mining, machine learning, natural language processing (NLP), information retrieval (IR), and knowledge management. Text mining involves the preprocessing of document collections (text categorization, information extraction, term extraction), the storage of the intermediate representations, the techniques to analyze these intermediate representations (such as distribution analysis, clustering, trend analysis, and association rules), and visualization of the results.

This book presents a general theory of text mining along with the main techniques behind it. We offer a generalized architecture for text mining and outline the algorithms and data structures typically used by text mining systems.

The book is aimed at the advanced undergraduate students, graduate students, academic researchers, and professional practitioners interested in complete coverage of the text mining field. We have included all the topics critical to people who plan to develop text mining systems or to use them. In particular, we have covered preprocessing techniques such as text categorization, text clustering, and information extraction and analysis techniques such as association rules and link analysis.

The book tries to blend together theory and practice; we have attempted to provide many real-life scenarios that show how the different techniques are used in practice. When writing the book we tried to make it as self-contained as possible and have compiled a comprehensive bibliography for each topic so that the reader can expand his or her knowledge accordingly.

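Synopsis

This book is a classic in the text mining field, written by world-renowned authorities. It covers core text mining operations, text mining preprocessing techniques, categorization, clustering, information extraction, probabilistic models for information extraction, preprocessing applications, visualization approaches, link analysis, and text mining applications, and it combines the theory and practice of text mining well.

The book is well suited to researchers and practitioners in text mining and information retrieval. It can also serve as a textbook for graduate courses such as data mining and knowledge discovery in computer science and related programs.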

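About the Author

Ronen Feldman is a pioneer in machine learning, data mining, and unstructured data management. He is a senior lecturer in the Department of Mathematics and Computer Science at Bar-Ilan University in Israel and director of its Data Mining Laboratory. He is cofounder and chairman of Clearforest, which mainly develops next-generation text mining applications for enterprises and government agencies, and is currently also an associate professor at the Stern School of Business, New York University.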

Table of Contents

Ⅰ. Introduction to Text Mining
  Ⅰ.1 Defining Text Mining
  Ⅰ.2 General Architecture of Text Mining Systems
Ⅱ. Core Text Mining Operations
  Ⅱ.1 Core Text Mining Operations
  Ⅱ.2 Using Background Knowledge for Text Mining
  Ⅱ.3 Text Mining Query Languages
Ⅲ. Text Mining Preprocessing Techniques
  Ⅲ.1 Task-Oriented Approaches
  Ⅲ.2 Further Reading
Ⅳ. Categorization
  Ⅳ.1 Applications of Text Categorization
  Ⅳ.2 Definition of the Problem
  Ⅳ.3 Document Representation
  Ⅳ.4 Knowledge Engineering Approach to TC
  Ⅳ.5 Machine Learning Approach to TC
  Ⅳ.6 Using Unlabeled Data to Improve Classification
  Ⅳ.7 Evaluation of Text Classifiers
  Ⅳ.8 Citations and Notes
Ⅴ. Clustering
  Ⅴ.1 Clustering Tasks in Text Analysis
  Ⅴ.2 The General Clustering Problem
  Ⅴ.3 Clustering Algorithms
  Ⅴ.4 Clustering of Textual Data
  Ⅴ.5 Citations and Notes
Ⅵ. Information Extraction
  Ⅵ.1 Introduction to Information Extraction
  Ⅵ.2 Historical Evolution of IE: The Message Understanding Conferences and Tipster
  Ⅵ.3 IE Examples
  Ⅵ.4 Architecture of IE Systems
  Ⅵ.5 Anaphora Resolution
  Ⅵ.6 Inductive Algorithms for IE
  Ⅵ.7 Structural IE
  Ⅵ.8 Further Reading
Ⅶ. Probabilistic Models for Information Extraction
  Ⅶ.1 Hidden Markov Models
  Ⅶ.2 Stochastic Context-Free Grammars
  Ⅶ.3 Maximal Entropy Modeling
  Ⅶ.4 Maximal Entropy Markov Models
  Ⅶ.5 Conditional Random Fields
  Ⅶ.6 Further Reading
Ⅷ. Preprocessing Applications Using Probabilistic and Hybrid Approaches
  Ⅷ.1 Applications of HMM to Textual Analysis
  Ⅷ.2 Using MEMM for Information Extraction
  Ⅷ.3 Applications of CRFs to Textual Analysis
  Ⅷ.4 TEG: Using SCFG Rules for Hybrid Statistical–Knowledge-Based IE
  Ⅷ.5 Bootstrapping
  Ⅷ.6 Further Reading
Ⅸ. Presentation-Layer Considerations for Browsing and Query Refinement
  Ⅸ.1 Browsing
  Ⅸ.2 Accessing Constraints and Simple Specification Filters at the Presentation Layer
  Ⅸ.3 Accessing the Underlying Query Language
  Ⅸ.4 Citations and Notes
Ⅹ. Visualization Approaches
  Ⅹ.1 Introduction
  Ⅹ.2 Architectural Considerations
  Ⅹ.3 Common Visualization Approaches for Text Mining
  Ⅹ.4 Visualization Techniques in Link Analysis
  Ⅹ.5 Real-World Example: The Document Explorer System
Ⅺ. Link Analysis
  Ⅺ.1 Preliminaries
  Ⅺ.2 Automatic Layout of Networks
  Ⅺ.3 Paths and Cycles in Graphs
  Ⅺ.4 Centrality
  Ⅺ.5 Partitioning of Networks
  Ⅺ.6 Pattern Matching in Networks
  Ⅺ.7 Software Packages for Link Analysis
  Ⅺ.8 Citations and Notes
Ⅻ. Text Mining Applications
  Ⅻ.1 General Considerations
  Ⅻ.2 Corporate Finance: Mining Industry Literature for Business Intelligence
  Ⅻ.3 A “Horizontal” Text Mining Application: Patent Analysis Solution Leveraging a Commercial Text Analytics Platform
  Ⅻ.4 Life Sciences Research: Mining Biological Pathway Information with GeneWays
Appendix A: DIAL: A Dedicated Information Extraction Language for Text Mining
  A.1 What Is the DIAL Language?
  A.2 Information Extraction in the DIAL Environment
  A.3 Text Tokenization
  A.4 Concept and Rule Structure
  A.5 Pattern Matching
  A.6 Pattern Elements
  A.7 Rule Constraints
  A.8 Concept Guards
  A.9 Complete DIAL Examples
Bibliography
Index

Chapter Excerpt

Similarity Functions for Simple Concept Association Graphs

Similarity functions often form an essential part of working with simple concept association graphs, allowing a user to view relations between concepts according to differing weighting measures. Association rules involving sets (or concepts) A and B that have been described in detail in Chapter II are often introduced into a graph format in an undirected way and specified by a support and a confidence threshold. A fixed confidence threshold is often not very reasonable because it is independent of the support from the RHS of the rule. As a result, an association should have a significantly higher confidence than the share of the RHS in the whole context to be considered as interesting. Significance is measured by a statistical test (e.g., t-test or chi-square).

With this addition, the relation given by an association rule is undirected. An association between two sets A and B in the direction A ⇒ B implies also the association B ⇒ A. This equivalence can be explained by the fact that the construct of a statistically significant association is different from implication (which might be suggested by the notation A ⇒ B). It can easily be derived that if B is overproportionally represented in A, then A is also overproportionally represented in B.

As an example of differences of similarity functions, one can compare the undirected connection graphs given by statistically significant association rules with the graphs based on the cosine function. The latter relies on the cosine of two vectors and is efficiently applied for continuous, ordinal, and also binary attributes. In case of documents and concept sets, a binary vector is associated to a concept set with the vector elements corresponding to documents. An element holds the value 1 if all the concepts of the set appear in the document. Table X.1 (Feldman, Kloesgen, and Zilberstein 1997b), which offers a quick summary of some common similarity functions, shows that the cosine similarity function in this binary case reduces to the fraction built by the support of the union of the two concept sets and the geometrical mean of the support of the two sets.

A connection between two sets of concepts is related to a threshold for the cosine similarity (e.g., 10%). This means that the two concept sets are connected if the support of the document subset that holds all the concepts of both sets is larger than 10 percent of the geometrical mean of the support values of the two concept sets.
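
The binary-case cosine measure described in the excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the book's implementation: the function names (`support`, `cosine_similarity`, `connected`) and the toy document collection are invented for this example, and each document is modeled simply as a set of concept strings.

```python
from math import sqrt

def support(docs, concepts):
    """Count the documents that contain every concept in `concepts`."""
    wanted = set(concepts)
    return sum(1 for doc in docs if wanted <= doc)

def cosine_similarity(docs, a, b):
    """Binary-case cosine: support(A and B together) divided by the
    geometric mean of support(A) and support(B)."""
    sup_a, sup_b = support(docs, a), support(docs, b)
    if sup_a == 0 or sup_b == 0:
        return 0.0  # a concept set that never occurs has no connections
    joint = support(docs, set(a) | set(b))  # docs holding all concepts of both sets
    return joint / sqrt(sup_a * sup_b)

def connected(docs, a, b, threshold=0.10):
    """Two concept sets are linked if their cosine exceeds the threshold."""
    return cosine_similarity(docs, a, b) > threshold

# Hypothetical toy collection: each document is a set of concepts.
docs = [
    {"merger", "bank"},
    {"merger", "bank", "oil"},
    {"merger"},
    {"bank"},
    {"oil"},
    {"gold"},
]

print(f"{cosine_similarity(docs, {'merger'}, {'bank'}):.3f}")  # 0.667
print(connected(docs, {"merger"}, {"bank"}))   # True
print(connected(docs, {"merger"}, {"gold"}))   # False
```

Because the measure depends only on the joint support and the two individual supports, it is symmetric in A and B, which is why it yields an undirected graph.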

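Media Attention and Reviews

"... I bought this book. It is definitely a reference book well worth owning." (L. Venkata Subramaniam, IBM India Research Lab)

"An introduction to text mining written by the leading experts in the field. The book is very well written. It perfectly combines the theory and practice of text mining and suits both researchers and practitioners ... Highly recommended for those who have no background in computational linguistics and want to delve into the text mining field." (Rada Mihalcea, University of North Texas)

Text mining has become an exciting new research area. Written by world-renowned authorities, this book explains core text mining and link detection algorithms and techniques, introduces advanced preprocessing techniques, and considers knowledge representation issues and visualization methods. It also explores how these techniques are applied in practice, balancing the theory and practice of text mining well.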

Reviews, Ratings, Reading, and Downloads

Not yet read (63)
Barely readable (457)
Average (780)
Rich in content (3235)
Highly recommended (265)

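User Reviews (15 total)

The book is of good quality; it is a textbook for in-depth study of data mining in computer science.
A very good book, well worth owning; a fine book in the text mining field.
This book has been very helpful to my research. The theories and methods it introduces gave me many ideas and pointed the way for the next stage of my work. The authors write in plain, accessible language without much obscure jargon, so even I, with only average English, read it with great interest. In short, this is a book worth keeping.
A very good book; I have recently started studying it.
Bought it for my daughter. She says the original edition is too expensive, and the print quality of this reprint is very good, no worse than the original.
I first borrowed the original edition from the library. Since it can serve as a reference book, I then bought a reprint copy. Posts & Telecom Press has published many good reprint editions in recent years, and in a timely manner.
This book is good, exactly what I need.
I first borrowed a colleague's copy and bought my own after finding it good.
I have not read it carefully yet, but I expect it to be very good. Judging by the quality and appearance of the book alone, it is excellent!
The packaging and the book are both fairly good.
It is in the original English. Somewhat difficult to read.
A question: my main work is information retrieval and analysis. Do I need to study both Introduction to Information Retrieval (by Manning (USA), Raghavan (USA), and Schütze (Germany); translated by Wang Bin) and Text Mining?
Our lab bought several copies, and everyone is reading it!
The content is fairly easy to understand and not especially difficult. But my English is not very good, so it still takes some effort. The content is not deep and can serve as an introductory textbook.
The content is good, but the organization is not great, and it reads a bit dry.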
