SEO Best Practices

Website optimization for the search engine experience

Ruby Gems Collection

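Mechanize: HTTP client
Parallel: multithreading tool
rmmseg-cpp: Chinese word segmentation tool
w3c_validators: W3C validation tool (fetches the official validator results)
PageRankr: PageRank lookup tool
pismo: web page text analysis tool
nokogiri: HTML/XML parsing tool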

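Vim Configuration

Installation:
http://spf13.com/project/spf13-vim

After installation:
Add set t_Co=256 as the second non-comment line of ~/.vimrc

Recommendations:
Delete the spell line from ~/.vimrc; commenting it out has no effect, and the feature is unpleasant to use. If that does not work, add set nospell to ~/.vimrc. That may not work either... good luck. Apart from this, everything else works great.
Add color molokai to ~/.vimrc.local
Search for tree, find C-e, and change it to C-t; otherwise it conflicts with the original key binding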

SEO System

A Ruby-based "SEO-oriented website optimization analysis system", or "SEO system" for short.
Related tools:
http://www.ruby-lang.org/
https://github.com/peterc/pismo
https://github.com/tenderlove/mechanize
https://github.com/pluskid/rmmseg-cpp
https://github.com/alexdunae/w3c_validators
https://github.com/seoaqua/ruby-baidu
https://github.com/seoaqua/ruby-local-cache
https://github.com/seoaqua/ruby-website
https://github.com/blatyo/page_rankr

Project repositories:
https://github.com/seoaqua/seo-unit-test
https://github.com/seoaqua/seosys

SEO Book

Resources on Information Retrieval (A Guide to Information Retrieval)
Organized by Hongfei Yan
Last updated on Sept. 16, 2009

---------------------
Contents
    Books
        + Finding Out About: Search Engine Technology from a cognitive 
            Perspective (Belew, R.K., 2000)
            http://www-cse.ucsd.edu/~rik/foa/
        + Foundations of Statistical Natural Language Processing
            (C. Manning and H. Schutze, 1999)
        + Information Retrieval, 2nd edition (C.J. van Rijsbergen, 1979)
            (full text)
            http://www.dcs.gla.ac.uk/Keith/Preface.html
        + Information Retrieval: A Survey (Ed Greengrass, 2000)
            http://www.csee.umbc.edu/cadip/readings/IR.report.120600.book.pdf
        + Information Retrieval: Data Structures & Algorithms
            (Frakes, W. and Baeza-Yates, R., 1992)
            http://www.dcc.uchile.cl/~rbaeza/iradsbook/irbook.html
        + Information Retrieval Interaction (Ingwersen, P., Taylor Graham, 1992)
            http://www.db.dk/pi/iri/
        + Introduction to Information Retrieval
            (Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze, 2008)
            http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html
        + Managing Gigabytes: Compressing and Indexing Documents and Images,
            2nd edition (Ian H. Witten, Alistair Moffat, and Timothy Bell, 1999)
        + Mining the Web: Discovering Knowledge from Hypertext Data 
            (Soumen Chakrabarti, 2003)
        + Modeling the Internet and the Web:
            Probabilistic Methods and Algorithms
            (Pierre Baldi, Paolo Frasconi and Padhraic Smyth, 2003)
        + Modern Information Retrieval 
            (Ricardo Baeza-Yates and Berthier Ribeiro-Neto, 2000)
        + Readings in Information Retrieval. 
            (Sparck-Jones, K. and Willett, P., 1997)
        + Search Engines: Information Retrieval in Practice
            (B. Croft, D. Metzler, T. Strohman, 2009)
            http://www.pearsonhighered.com/croft1epreview/samples.html
        + Search Engine: Principle, Technology and Systems
            搜索引擎-原理、技术与系统
            (Xiaoming Li et al., 2005), (full text)
            http://sewm.pku.edu.cn/book/dlbook.html
        + The Geometry of Information Retrieval 
            (C.J. van Rijsbergen, 2004)
            http://ir.dcs.gla.ac.uk/GeometryOfIR/
        + The Turn: Integration of Information Seeking and Retrieval in Context
            (Ingwersen, P., and Jarvelin, K., 2005)
        + TREC: Experiment and Evaluation in Information Retrieval 
            (Voorhees, E.M., and Harman, D.K., 2005)
            http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=10667

    Conferences and Workshops
        + CIKM: Conference on Information and Knowledge Management
            http://www.csee.umbc.edu/cikm/
        + SIGIR: Special Interest Group on Information Retrieval
            http://www.sigir.org/
        + SIGKDD: Knowledge Discovery and Data Mining
            http://www.kdd.org/
        + World Wide Web
            http://www.iw3c2.org/
        + SEWM: Symposium of Search Engine and Web Mining
            全国搜索引擎和网上信息挖掘学术研讨会
            http://net.pku.edu.cn/~sewm/

    Courses
        + CMU Information Retrieval
            http://nyc.lti.cs.cmu.edu/classes/11-741/ (Spring 2006)
            Instructors: Jamie Callan and Yiming Yang 
        + Cornell University The Structure of Information Networks (Spring 2006)
            http://www.cs.cornell.edu/courses/cs685/2006sp/
            Instructor: Jon Kleinberg
        + Peking University Web Based Information Architectures (Fall 2006)
            http://net.pku.edu.cn/~wbia/
            Instructors: Xiaoming Li, Jimin Wang and Bo Peng
        + Stanford Univ. Text Information Retrieval and Web Mining (Autumn 2005)
            http://www.stanford.edu/class/cs276/
            Instructors: Christopher Manning and Prabhakar Raghavan
        + UIUC Introduction to Text Information Systems (Spring 2007)
            http://sifaka.cs.uiuc.edu/course/410s07/
            Instructor: ChengXiang Zhai
        + UMass Univ. Information retrieval course (Spring 2005)
            http://ciir.cs.umass.edu/cmpsci646/
            Instructor: James Allan
        + Washington Univ. Search Engines course
            http://courses.washington.edu/lis544/

    Evaluation Resources
        + CLEF: Cross-Language Evaluation Forum
            http://clef.iei.pi.cnr.it/
        + CWIRF: Chinese Web Information Retrieval Forum
            http://www.cwirf.org/
        + DUC: Document Understanding Conferences
            http://duc.nist.gov/
        + INEX: INitiative for the Evaluation of XML Retrieval
            http://inex.is.informatik.uni-duisburg.de/
        + NTCIR: NII-NACSIS Test Collection for IR Systems
            http://research.nii.ac.jp/ntcir/
        + TREC: Text REtrieval Conference 
            http://trec.nist.gov/

    Journals
        + Briefings in Bioinformatics (full text)
            http://bib.oxfordjournals.org/archive/
        + Computational Linguistics, The MIT Press
            http://mitpress.mit.edu/catalog/item/default.asp?ttype=4&tid=10
        + Data & Knowledge Engineering (DKE), Elsevier
            http://www.elsevier.com/wps/find/journaldescription.cws_home/505608/description?navopenmenu=-2
        + D-Lib Magazine
            http://www.dlib.org/
        + Information Processing Letters, Elsevier
            http://www.elsevier.com/locate/issn/00200190
        + Information Processing and Management (IP&M), Elsevier
            http://www.elsevier.com/locate/infoproman
        + Information Retrieval, Springer
            http://www.springer.com/sgw/cda/frontpage/0,11855,3-0-70-35744790-detailsPage%253Djournal%257Cdescription%257Cdescription,00.html
        + Information Research
            http://informationr.net/ir
        + International Journal on Digital Libraries, Springer
            http://link.springer.de/link/service/journals/00799/index.htm
        + International Journal of Cooperative Information Systems (IJCIS), 
            World Scientific
            http://ejournals.wspc.com.sg/ijcis/ijcis.shtml
        + International Journal on Document Analysis and Recognition, Springer
            http://link.springer.de/link/service/journals/10032/index.htm
        + International Journal of Intelligent Systems, Wiley
            http://www3.interscience.wiley.com/cgi-bin/jhome/36062
        + International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems (IJUFKS), World Scientific
            http://ejournals.wspc.com.sg/ijufks/ijufks.shtml
        + Journal of the American Society for Information Science and Technology (JASIST), Wiley
            http://www3.interscience.wiley.com/cgi-bin/jhome/76501873
        + Journal of Documentation (JDoc), Emerald
            http://www.emeraldinsight.com/0022-0418.htm
        + Journal of Intelligent Information Systems (JIIS), Springer
            http://www.wkap.nl/journalhome.htm/0925-9902
        + Knowledge and Information Systems (KAIS), Springer
            http://link.springer.de/link/service/journals/10115/index.htm
        + Natural Language Engineering, Cambridge University Press
            http://www.cambridge.org/journals/journal_catalogue.asp?mnemonic=NLE
        + Transactions On Information Systems (TOIS), ACM
            http://www.acm.org/tois/
        + Transactions on Knowledge and Data Engineering (TKDE), IEEE 
            http://www.computer.org/tkde/

    List Archives
        + SIG-IRList, http://www.sigir.org/sigirlist/index.html

    Organizations and Special Interest Groups
        + Cambridge NLIP, http://www.cl.cam.ac.uk/Research/NL/
        + CMU LTI, http://www.lti.cs.cmu.edu/
        + DEC laboratories in Palo Alto, Calif.
        + Glasgow Information Retrieval Group, http://www.dcs.gla.ac.uk/ir/
        + Google Labs, http://labs.google.com/
        + Massachusetts CIIR, http://ciir.cs.umass.edu/
        + MSR Asia, Web Search & Data Mining Group
            http://research.microsoft.com/wsm/
        + Stanford InfoLab, http://infolab.stanford.edu/
        + UIUC Information Retrieval Group, http://sifaka.cs.uiuc.edu/ir/
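        + PKU Tianwang Group, http://sewm.pku.edu.cn/
        + Institute of Computational Linguistics, Peking University,
            http://icl.pku.edu.cn/
        + Fudan University Information Retrieval and Natural Language
            Processing Group,
            http://www.cs.fudan.edu.cn/mcwil/irnlp/
        + HIT Information Retrieval Group, http://ir.hit.edu.cn/
        + State Key Laboratory of Intelligent Technology and Systems,
            Tsinghua University
            http://www.csai.tsinghua.edu.cn/
        #+ CAS Large-Scale Content Computing Group, http://159.226.40.18/ (site unreachable)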

    Researchers
        + Andrew McCallum,
            http://www.cs.umass.edu/~mccallum/
        + ChengXiang Zhai, developing Lemur
            http://www-faculty.cs.uiuc.edu/~czhai/
        + Gerard Salton
            http://www.cs.cornell.edu/Info/Department/Annual95/Faculty/Salton.html
        + Karen Sparck Jones, developed IDF
            http://www.cl.cam.ac.uk/users/ksj/
        + Keith van Rijsbergen
            http://www.dcs.gla.ac.uk/~keith/
        + Jamie Callan, 
            http://www.cs.cmu.edu/~callan/
        + Jon Kleinberg, developed HITS
            http://www.cs.cornell.edu/home/kleinber/
        + Li Xiaoming, developing Tianwang & Infomall
        + Nick Craswell, developing Terabyte Track
            http://research.microsoft.com/~nickcr
        + Susan Dumais, developing LSI
            http://research.microsoft.com/~sdumais/
        + Yiming Yang, developing text categorization
            http://www.cs.cmu.edu/~yiming/
        + Stephen Robertson, 
            http://research.microsoft.com/users/robertson/
        + Tefko Saracevic
            http://www.scils.rutgers.edu/~tefko/
        + W. Bruce Croft
            http://ciir.cs.umass.edu/personnel/croft.html

    Research-related Resources
        + http://www-faculty.cs.uiuc.edu/~czhai/research.html

    Software
        + Apache Lucene: a full-featured text search engine library
            http://lucene.apache.org/java/docs/index.html
        + Gate: a general architecture for text engineering
            http://gate.ac.uk/
        + Lemur: A full-text search engine
            http://www.lemurproject.org/
        + MG: A full-text search engine
            http://www.math.utah.edu/pub/mg/
        + Porter Stemmer: English stemming algorithm
            http://www.tartarus.org/martin/PorterStemmer/
        + Nutch: an open source web search engine
            http://sourceforge.net/projects/nutch/
        + TSE: A Tiny Search Engine
            http://sewm.pku.edu.cn/src/TSE/

---------------------
References: 
[1] Information Retrieval Resources, http://www.sigir.org/resources.html
[2] http://ir.dcs.gla.ac.uk/resources.html
[3] http://www.cs.cmu.edu/~callan/Teaching/Resources.html
[4] Diekemar, Information Retrieval Links, Jan. 28, 1999. 
    http://web.syr.edu/~diekemar/ir.html
[5] Chen Hongbiao, Studying Information Retrieval Online, Nov. 1999.
    http://159.226.40.18/freshman/resources/网上研习信息检索.doc
[6] Data Mining Research Institute, http://www.dmresearch.net/
[7] Speech and Natural Language Online, http://www.snlpinfo.com/index.php
[8] PKU SEWM Group, http://sewm.pku.edu.cn/
[9] http://www.cs.cmu.edu/~callan/Teaching/Resources.html
[10] http://icl.pku.edu.cn/member/lisujian/maincontent.htm
[11] http://www.cs.fudan.edu.cn/mcwil/irnlp/link.htm
[12] Robert Krovetz, A Guide to the Literature of Information Retrieval,
    http://159.226.40.18/freshman/resources/guide-to-ir-lit.ps
[13] ACM Digital Library, 
    http://portal.acm.org/portal.cfm
    http://acm.lib.tsinghua.edu.cn/acm/
[14] http://www.sigir.org/proceedings/Proc-Browse.html
[15] SIGIR,
    http://portal.acm.org/browse_dl.cfm?linked=1&part=series&idx=SERIES278&coll=portal&dl=ACM&CFID=72474811&CFTOKEN=69288563
[16] WWW, International World Wide Web Conference
    http://portal.acm.org/browse_dl.cfm?linked=1&part=series&idx=SERIES968&coll=portal&dl=ACM&CFID=72474811&CFTOKEN=69288563
[17] China Digital Journal Community, http://wanfang.calis.edu.cn/wf/szhqk/index.html



---------------------

More details are listed as follows
====================
CIIR
(The Center for Intelligent Information Retrieval,
University of Massachusetts, USA)
http://ciir.cs.umass.edu/

The Center for Intelligent Information Retrieval, a National Science 
Foundation-created S/IUCRC Center, is one of the leading information retrieval 
research labs in the world. The CIIR develops tools that provide effective 
and efficient access to large, heterogeneous, distributed, text and 
multimedia databases.

CIIR accomplishments include significant research advances in the areas of 
distributed information retrieval, information filtering, topic detection, 
multimedia indexing and retrieval, document image processing, terabyte 
collections, data mining, summarization, resource discovery, interfaces 
and visualization, and cross-lingual information retrieval.

The Center for Intelligent Information Retrieval continues to support the 
emerging information infrastructure, both through research and technology 
transfer. The goal of the CIIR is to develop tools that provide effective 
and efficient access to large, heterogeneous, distributed, text and 
multimedia databases. 

====================
Glasgow Information Retrieval Group
http://www.dcs.gla.ac.uk/ir/
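The information retrieval research group at the University of Glasgow, UK,
led by Keith van Rijsbergen. The group gives equal weight to theory and
practice, and aims to build efficient, novel, and successful multimedia
information retrieval systems that serve end users.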

The Information Retrieval Group led by Professor Keith van Rijsbergen has a 
vigorous programme of research, based on both theory and experiment, aimed at 
giving end-users novel, effective, and efficient access to the world of 
multi-media information. The group, part of the Department of Computing Science, 
University of Glasgow, has a strong research history in a wide area of 
information retrieval research from theoretical modelling of the retrieval 
process to advanced system building and to the user-oriented evaluation of 
information retrieval systems. The group's interests also include many areas 
of Web information retrieval such as link analysis, summarisation and the 
development of novel interaction techniques (e.g., ostension, implicit feedback 
and graphical visualisation). Our research preserves a strong emphasis on 
the evaluation of interactive IR systems, and the group maintains strong links 
with researchers in Human-Computer Interaction and Psychology.

------
Keith van Rijsbergen, http://www.dcs.gla.ac.uk/~keith/
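University of Glasgow, UK. A leading figure of the logical-inference school
of probabilistic IR, and author of the classic IR textbook
Information Retrieval, which focuses on probabilistic approaches to
information retrieval.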

=====================
Cambridge NLIP Group 
(Natural Language and Information Processing Group)
http://www.cl.cam.ac.uk/Research/NL/

Research in NLIP has been done in the Computer Laboratory for nearly fifty years. 
The earliest work, by Roger Needham and Karen Sparck Jones, was on automatic 
thesaurus construction, in the context of document retrieval and machine translation. 
Subsequent research by Karen Sparck Jones during the 1960s and 70s focused on 
statistical approaches to retrieval and included innovative work on term 
weighting. From the later 1970s research in language processing developed,
with work on syntax, semantics and discourse processing.

------
Karen Sparck Jones, http://www.cl.cam.ac.uk/users/ksj/
Karen Sparck Jones has been one of the most influential figures in Computing
since the 1950s. Her work on Information Retrieval and Natural Language Processing
has never been as central as it is today, with its implications for
search engine technology, the semantic web and even bioinformatics.

In 1972, Karen Sparck Jones published in the Journal of Documentation the paper 
which defined the term weighting scheme now known as inverse document frequency (IDF).
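
As a rough illustration of the term weighting idea mentioned above, here is a
minimal Ruby sketch of inverse document frequency. It assumes the plain
log(N / df) form, where N is the number of documents and df is the number of
documents containing the term; other variants add smoothing. The toy corpus
DOCS and the helper names document_frequency and idf are made up for this
example and do not come from the 1972 paper or from any library.

```ruby
# Minimal IDF sketch using only the Ruby standard library.
# Assumption: idf(term) = ln(N / df). Smoothed variants also exist.

# Number of documents that contain the term at least once.
def document_frequency(docs, term)
  docs.count { |doc| doc.include?(term) }
end

# Inverse document frequency; returns 0.0 for terms that never occur,
# which avoids dividing by zero.
def idf(docs, term)
  df = document_frequency(docs, term)
  return 0.0 if df.zero?

  Math.log(docs.size.to_f / df)
end

# Hypothetical toy corpus: each document is an array of tokens.
DOCS = [
  %w[information retrieval system],
  %w[information theory],
  %w[search engine retrieval],
  %w[web information mining]
].freeze

puts idf(DOCS, "information") # in 3 of 4 documents, so a low weight
puts idf(DOCS, "engine")      # in 1 of 4 documents, so a high weight
```

The point of the sketch is only the direction of the effect: a term that
appears in few documents gets a larger weight than one that appears in most.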

Karen Sparck Jones is emeritus Professor of Computers and Information at the 
Computer Laboratory, University of Cambridge. She has worked in automatic 
language and information processing research since the late fifties, 
and has many publications including several books, most recently `Evaluating 
Natural Language Processing Systems' with Julia Galliers, and `Readings in 
Information Retrieval', edited with Peter Willett. 

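Winner of the 1988 Salton Award. Another founder of the modern probabilistic
IR model. She has made substantial contributions to NLP, IR, and related
fields, and has done a great deal of organizational work. She currently works
at the Computer Laboratory, University of Cambridge, UK.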

====================
LTI
CMU (Carnegie Mellon University) Language Technologies Institute,
http://www.lti.cs.cmu.edu/

The Language Technologies Institute (LTI) of the School of Computer Science at
Carnegie Mellon University conducts research and provides graduate education
in all aspects of language technology and information management. The LTI was
established in 1996, as an expansion of the Center for Machine Translation
(CMT).

The Center for Machine Translation (CMT) was a research branch of the School
of Computer Science devoted to basic and applied research in all aspects of
natural language processing, with a primary focus on machine translation,
speech processing, and information retrieval. Containing a unique mix of
academic and industrial researchers specializing in various aspects of
computer science, artificial intelligence, computational linguistics and
theoretical linguistics, the CMT provided a rich and diverse environment for
collaboration among faculty, staff, visiting scholars, and qualified students.

------
Lemur Toolkit
Lemur is a collection of search engine algorithms and information retrieval
applications used for IR research, development and education. Lemur provides a
rich query language that supports search against simple texts, structured
(XML) texts, and texts annotated with part-of-speech, named-entity, and other
annotations used in NLP and text-mining applications. Lemur's search engines
comfortably support collections ranging from a few gigabytes to a few
terabytes of text. The software is distributed under open-source license, and
is used widely in the IR research community.

====================
Stanford InfoLab
http://infolab.stanford.edu/

The Stanford WebBase Project
http://dbpubs.stanford.edu:8091/~testbed/doc2/WebBase/

The Stanford WebBase project is investigating various issues in crawling,
storage, indexing, and querying of large collections of Web pages. The project
builds on the previous Google activity that was part of the DLI1 initiative.
The DLI2 WebBase project aims to build the necessary infrastructure to
facilitate the development and testing of new algorithms for clustering,
searching, mining, and classification of Web content.
====================
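PKU Tianwang Group, http://sewm.pku.edu.cn/

    The Peking University Network Laboratory has been engaged in search
engine research and system development since 1997. It has deep technical
expertise, and its overall strength and academic influence have consistently
been among the leading ones in China. The "Tianwang" search engine system we
developed is the most influential campus-born search engine in the country
and has been running continuously since October 1997. Tianwang has relatively
strong advantages in incremental search, fast retrieval, and massive-scale
information storage. Its continued development has trained successive cohorts
of students with hands-on experience in processing massive volumes of web
text, and they are widely welcomed by Chinese and foreign IT companies.
    Since 2001, building on its search engine technology, the group has been
collecting and archiving the history of information on China's Internet,
forming the "China Internet Information Museum". It now holds 2 billion
Chinese web pages that appeared at different times and is currently the
largest historical web page archive and replay system in the country. We have
also experimented with interdisciplinary research built on this archive.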

====================
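CAS Large-Scale Content Computing Group
http://159.226.40.18/

    The information retrieval team focuses on research into text retrieval.
It has participated in TREC many times and achieved very good research
results. The Tianluo retrieval system developed by the team has been widely
deployed in many important national information departments. Its main current
research directions include web information acquisition and web information
retrieval.
    The information analysis team concentrates on the analysis and mining of
large-scale, multi-source, heterogeneous information, mainly including text
classification and clustering, information filtering, personalized services,
natural language question answering, and shallow natural language processing.
The team has developed a series of experimental platforms for text processing,
which can currently be tried out through the demo section of its home page.
Notably, the team runs an open-source program, whose high-performance word
segmentation system ICTCLAS has been widely recognized and used by
researchers.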

====================
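Fudan University Information Retrieval and Natural Language Processing Group,
http://www.cs.fudan.edu.cn/mcwil/irnlp/

Large-scale text processing mainly studies techniques and methods for
processing natural language (especially Chinese). It covers two areas. The
first is foundational work, mainly basic theory and algorithms, including
automatic word segmentation, out-of-vocabulary word recognition,
part-of-speech and concept tagging, syntactic analysis, and semantic analysis,
as well as corpus collection and curation. The second is applied technology
for Chinese information processing, including automatic indexing, text
retrieval, text summarization, text classification, and text filtering, and
in particular the application of these techniques in networked environments.
This part of the work is the research focus of the text direction.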

====================
HIT-IRLab, http://ir.hit.edu.cn/

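    The HIT Information Retrieval Laboratory (HIT-IRLab) was founded in
March 2001. Its research directions include text retrieval, question
answering, automatic summarization, text mining, and language analysis. The
lab treats language analysis as its basic research and text filtering as its
applied research. It treats information extraction as the extension of
language analysis from sentence understanding to discourse understanding, and
sentence retrieval as intelligent, precise retrieval supported by language
analysis and discourse understanding.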

====================
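SIGIR (ACM Special Interest Group on Information Retrieval)
TREC (Text REtrieval Conference)
MUC (Message Understanding Conference)
TIPSTER (IR testbed of the U.S. Defense Advanced Research Projects Agency, DARPA)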

====================
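Institute of Computational Linguistics, Peking University
http://icl.pku.edu.cn/

    The Institute of Computational Linguistics at Peking University was
founded in 1986. It is dedicated to research in three areas: computational
linguistics theory, foundational resources for language information
processing, and applied technology.
    Centered on computational linguistics and natural language processing,
its work covers three main directions. The first is research on and
construction of foundational resources: computational lexicography and machine
dictionaries, comprehensive language knowledge bases, corpus linguistics and
corpus processing technology, and terminology, automatic term extraction, and
term standardization. The second is basic theory and NLP models and methods:
foundations of computational linguistics, core NLP technology, modern Chinese
grammar, lexical/syntactic/semantic analysis of Chinese, statistical NLP
models, and information-theoretic methods for language processing. The third
is applied technology: methods, techniques, and system implementation for
machine translation, information retrieval and extraction, evaluation methods
and techniques for natural language information processing systems,
controlled Chinese and its writing-assistance systems, and computer-aided
research on classical Chinese poetry.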

====================
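State Key Laboratory of Intelligent Technology and Systems, Tsinghua University
http://www.csai.tsinghua.edu.cn/ 

    The State Key Laboratory of Intelligent Technology and Systems is hosted
by Tsinghua University. The laboratory opened to outside researchers and began
operating in February 1990. It mainly conducts basic and applied basic
research on the fundamental principles and methods of artificial
intelligence, including intelligent information processing, machine learning,
intelligent control, and neural network theory. It also studies AI-related
applied technology and system integration, mainly intelligent robots and the
processing of sound, graphics, images, text, and language.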

================
Susan Dumais, 
http://research.microsoft.com/~sdumais/

I am interested in algorithms and interfaces for improved information
retrieval, as well as general issues in human-computer interaction. I
joined Microsoft Research in July 1997. I work on a wide variety of
information access and management issues, including: personal information
management, web search, question answering, information retrieval, text
categorization, collaborative filtering, interfaces for improved search and
navigation, and user/task modeling.

Prior to coming to Microsoft, I worked on a statistical method for
concept-based retrieval known as Latent Semantic Indexing. You can find
pointers to this work on the Bellcore (now Telcordia) LSI page. 

===============
UIUC Information Retrieval Group
http://sifaka.cs.uiuc.edu/ir/

The Information Retrieval (IR) group is part of the Database and Information
Systems (DAIS) Lab of the Computer Science Department at University of
Illinois at Urbana-Champaign. We work on a wide spectrum of problems in the
general area of text information management, including retrieval,
organization, filtering, and mining of textual information, aiming at
developing advanced text information management techniques and systems that
help people make better use of text information.

------
ChengXiang Zhai, 
http://www-faculty.cs.uiuc.edu/~czhai/

Research Interests: Information Retrieval, Text Mining, Natural Language
Processing, Bioinformatics

ChengXiang Zhai, University of Illinois at Urbana-Champaign, is recognized for
his work on user-centered, adaptive intelligent information access. His
techniques are expected to improve search-engine performance, support better
information organization and enable understanding of large volumes of
information. Zhai's work in information retrieval is expected to enhance
curricula and provide new educational tools for the growing information
technology workforce.

===============
Stephen Robertson, 
http://research.microsoft.com/users/robertson/

Stephen Robertson joined Microsoft Research Cambridge in April 1998.

In 1998, he was awarded the Tony Kent STRIX award by the Institute of
Information Scientists. In 2000, he was awarded the Salton Award by ACM SIGIR.
He is a Fellow of Girton College, Cambridge.

At Microsoft, he runs a group called Information Retrieval and Analysis, which
is concerned with core search processes such as term weighting, document
scoring and ranking algorithms, and combination of evidence from different
sources. These are studied theoretically through the use of formal models,
mainly statistical, and statistical methods including machine learning
methods, and experimentally, through activities such as the Text Retrieval
Conference (TREC) and with internally generated evaluation sets. The group
(with its Keenbow evaluation environment) has had some excellent results at
TREC. The group works closely with product groups to transfer ideas and
techniques.

His main research interests are in the design and evaluation of retrieval
systems. He is the author, jointly with Karen Sparck Jones, of a probabilistic
theory of information retrieval, which has been moderately influential. A
further development of that model, with Stephen Walker, led to the term
weighting and document ranking function known as Okapi BM25, which is used in
many experimental text retrieval systems.
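
To make the BM25 ranking function mentioned above more concrete, here is a
small Ruby sketch of one commonly quoted form of it. Everything specific in it
is an assumption for illustration: the parameter values K1 = 1.2 and B = 0.75
are typical defaults rather than anything stated here, the IDF factor uses a
non-negative variant (with 1 added inside the logarithm), and the corpus,
constant names, and function names are hypothetical. It is not the Okapi
system's own code.

```ruby
# BM25 scoring sketch using only the Ruby standard library.
# Assumptions: K1 and B are commonly quoted defaults; the IDF factor is a
# non-negative variant, ln((N - df + 0.5) / (df + 0.5) + 1).
K1 = 1.2
B = 0.75

# IDF factor for a term over a corpus of tokenized documents.
def bm25_idf(docs, term)
  n = docs.size
  df = docs.count { |doc| doc.include?(term) }
  Math.log((n - df + 0.5) / (df + 0.5) + 1.0)
end

# BM25 score of one tokenized document for a list of query terms.
def bm25_score(docs, doc, query_terms)
  avg_len = docs.sum(&:size).to_f / docs.size
  query_terms.sum do |term|
    tf = doc.count(term)
    next 0.0 if tf.zero?

    # Term frequency saturates with K1; B controls length normalization.
    norm = tf + K1 * (1 - B + B * doc.size / avg_len)
    bm25_idf(docs, term) * tf * (K1 + 1) / norm
  end
end

# Hypothetical toy corpus: each document is an array of tokens.
CORPUS = [
  %w[okapi text retrieval system],
  %w[text retrieval conference],
  %w[machine learning methods for document ranking and scoring],
  %w[term weighting and document ranking]
].freeze

query = %w[document ranking]
ranked = CORPUS.sort_by { |doc| -bm25_score(CORPUS, doc, query) }
ranked.each do |doc|
  puts format("%.3f  %s", bm25_score(CORPUS, doc, query), doc.join(" "))
end
```

In this toy corpus the two documents containing both query terms score above
zero, and the shorter of the two ranks first because of length normalization.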

Prior to joining Microsoft, he was at City University London, where he retains
a part-time position as Professor of Information Systems in the Department of
Information Science (homepage). He was Head of Department for eight years,
during which time it achieved the highest possible rating in two successive
research assessment exercises. He also started the Centre for Interactive
Systems Research, the main research vehicle of which is the Okapi text
retrieval system, which has also done well at TREC.

Before joining City, he was a research fellow at University College London,
where he took his PhD in the School of Library Archive and Information
Studies. Before that he was in the research department at Aslib. He has an MSc
in Information Science from City and a first degree in mathematics from
Cambridge. 

===================
Nick Craswell
http://research.microsoft.com/~nickcr

I am an associate researcher at Microsoft Research Cambridge, in the
Information Retrieval and Analysis Group.

Research Overview

I am interested in Web search evaluation, mostly on enterprise-scale webs but
also the World Wide Web. I built the VLC, VLC2, WT2g and .GOV test
collections, which have been made available to research groups around the
world. David Hawking and I coordinated the TREC Web Track experiments. I am
currently involved in the TREC Terabyte Track and Enterprise Track. Some
publications: Book chapter preprint (pdf), IR'01 (citeseer) and CSIRO'01
(pdf).

I also work on effective Web search, which means making use of information in
pages, link structure and URL structure to generate more useful Web search
results. Some papers: SIGIR'05 (pdf), SIGIR'01 (pdf), TOIS'03 (pdf) (copying
is by permission of ACM, Inc.) and ADCS'03 (pdf).

My PhD was in distributed information retrieval (thesis pdf) which means
building a system on top of multiple engines/databases that already exist. My
recent work in the area has considered whether (or when) DIR is really
practical. Some papers: ADC'99 (ps), DL'00 (pdf), ADC'03 (pdf) and ADC'04
(pdf). 

===============
Web Search & Data Mining Group of MSR Asia
http://research.microsoft.com/wsm/

The goal of the Web Search & Data Mining Group of MSR Asia is to drive the
next generation of Web search by leveraging data mining, machine learning, and
knowledge discovery techniques for information analysis, organization,
retrieval, and visualization. In addition, in contrast with current Web search
methods, which essentially do document-level ranking and retrieval, the Web
Search & Data Mining Group has created search at the object level to bring
increased knowledge and intelligence to users.

A Glimpse at Several Core Innovations:

Large-scale Experimental Web Search Platform

The Web Search & Data Mining Group is creating a large scale search platform
to efficiently store, parse, index and search billions of Web pages and other
types of documents. The search platform is flexible enough to allow for
testing of various state-of-the-art search techniques that have been created
at the lab using new technologies.

Structuralizing the Web

The biggest challenge facing both users and search engines over the next
several decades is the continued unstructured growth of the Internet. As such,
search functions that can effectively and efficiently dig out
machine-understandable information and knowledge layers from unorganized and
unstructured Web data will be the key to supporting relevant search results.
To meet this challenge, the group is exploring technologies, namely Web
information extraction, deep Web mining, and Web structure mining that can
automatically classify structures and extract objects from the Web. The
information and knowledge gathered using these new techniques greatly improves
the performance of current Web search and even facilitates the creation of
more sophisticated next generation search technologies.

Vertical Search

Today's conventional search engines can be described as page-level search
engines whose main function is to rank web pages according to their relevance
to a given query. Driving the future of the search industry are functions that
delve deeper into vertical domains to provide knowledge and intelligence to
query results. At MSR Asia, the Web Search & Data Mining Group is addressing
the greatest challenges faced by vertical search including large scale web
classification, object-level information extraction, object identification and
integration, and object relationship mining and ranking. The results of these
efforts are leading to more advanced search engines that deliver intelligence
and insight to search results.

Mobile Search

The explosive growth of new computing devices such as handheld computers,
Windows Mobile-based PocketPCs, and SmartPhones is driving demand for greater
and more efficient information access. These devices, which leverage the power
of the Web and allow greater access to information than ever before, are still
not capable of performing at the level of a desktop PC. At MSR Asia, the Web
Search & Data Mining Group is inventing new technologies to improve the mobile
search and browsing experience and deliver the capabilities of a PC to users
of these new devices. Project initiatives include developing innovative
presentation schemes and user interfaces to facilitate search and browsing
tasks on mobile devices and developing context aware search technologies to
address the special information needs of mobile users.

Multimedia Search

The Web Search & Data Mining Group is conducting research into new
technologies that index multimedia content such as images, videos, and audio.
Through content analysis and advanced visualization techniques, the group is
transforming today's conventional text based search engines to include
multimedia content thus delivering more intelligent search results to users.
For example, the group recently developed a new multimedia news reader which
mines large archival news databases presenting text, map information, images,
and background music within a unique user interface providing readers with a
more efficient news search engine and a more enjoyable reading experience.

------
Wei-Ying Ma
http://research.microsoft.com/users/wyma/

Senior Researcher, Research Manager, Microsoft Research Asia

Dr. Wei-Ying Ma received the B.S. degree in electrical engineering from the
National Tsing Hua University in Taiwan in 1990, and the M.S. and Ph.D.
degrees in electrical and computer engineering from the University of
California at Santa Barbara in 1994 and 1997, respectively. From 1994 to 1997
he was engaged in the Alexandria Digital Library (ADL) project in UCSB while
completing his Ph.D. He developed a web-based image retrieval system called
Netra which has been frequently cited by other researchers and is regarded as
one of the most representative image retrieval systems. From 1997 to 2001, he
was with HP Labs where he worked in the field of multimedia adaptation and
distributed media services infrastructure. He joined Microsoft Research Asia
in 2001. Since then, he has been leading a research group to conduct research
in the areas of information retrieval, web search, data mining, mobile
browsing, and multimedia management. He currently serves as an Editor for the
ACM/Springer Multimedia Systems Journal and Associate Editor for ACM
Transactions on Information Systems (TOIS). He has served on the organizing and
program committees of many international conferences including ACM Multimedia,
ACM SIGIR, ACM CIKM, WWW, ICME, CVPR, SPIE Multimedia Storage and Archiving
Systems, SPIE Multimedia Communication and Networking, etc. He is also the
general co-chair of International Multimedia Modeling (MMM) Conference 2005
and International Conference on Image and Video Retrieval (CIVR) 2005. He has
published 5 book chapters and over 100 international journal and conference
papers.

====================
Google Labs
http://labs.google.com/

Google Labs is a playground for Google engineers and adventurous Google users.
Google staffers with wild and crazy ideas post their prototypes on Google Labs
and solicit feedback on how the technology could be used or improved. None of
these experiments are guaranteed to make it onto Google.com, as this is really
the first phase in the development process. Google users with a desire to jump
over the cutting edge are invited to check out any or all of the posted
prototypes and send their comments directly to the Googlers who developed
them. Please, remember to wear your safety goggles while using this site.

Labs.google.com, Google's technology playground.
Google labs showcases a few of our favorite ideas that aren't quite ready for
prime time. Your feedback can help us improve them. Please play with these
prototypes and send your comments directly to the Googlers who developed them. 

Want to learn more about Google technology? Here are some papers.
http://labs.google.com/papers/index.html

Passionate about these topics? You should work at Google.
algorithms, artificial intelligence, compiler optimization,
computer architecture, computer graphics,
data compression, data mining, file system design,
genetic algorithms, information retrieval,
machine learning, natural language processing, operating systems,
profiling, robotics, 
text processing, user interface design,
web information retrieval, and more! 

http://www.google.com/press/podium.html
Google Press Center: The Google Podium
 Here you'll find a selection of public presentations made by Google
executives. From time to time, we will continue to add transcripts, audio or
video clips and links to presentations hosted elsewhere.

====================
Jon Kleinberg
http://www.cs.cornell.edu/home/kleinber/

Professor of Computer Science, Cornell University

My research is concerned with algorithms that exploit the combinatorial
structure of networks and information. My recent work has included
* link analysis and modeling of the World Wide Web and related information networks;
* discrete optimization and network algorithms; and
* algorithmic approaches to clustering, indexing, and data mining. 
====================

Reposted from sewm.pku.edu.cn/IR-Guide.txt

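Scraping Baidu search results (Ruby code)

Parsing a page with XPath is convenient and easy to maintain. Until I found this fix, I had been handling the page with regular expressions.
The key to the problem is removing the following special character:
=============================
▼
Unicode code point: U+25BC
Wikipedia description: Black down-pointing triangle
=============================
Otherwise nokogiri and Mechanize hit a bug when parsing the page with XPath. See the code below.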

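require 'mechanize'
require 'nokogiri'

a = Mechanize.new { |agent| agent.user_agent_alias = 'Linux Mozilla' }
page = a.get(url)  # url: the Baidu results page to fetch
# Assumes the page is GBK-encoded; transcode to UTF-8 and drop unconvertible bytes.
body = page.body.dup.force_encoding('GBK').encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
# Strip U+25BC (and the rest of U+0080..U+2C77) before parsing with XPath.
body.gsub!(/[\u0080-\u2C77]+/, '')
p Nokogiri::HTML(body).search('//table')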
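
The stripping step can be tried on its own, without a network request. Below is a minimal sketch in plain Ruby. The strip_symbols helper and the sample fragment are made up for illustration, and the input is assumed to be GBK-encoded as in the code above.

```ruby
# strip_symbols is a hypothetical helper, not part of Mechanize or nokogiri.
# It transcodes GBK bytes to UTF-8, then removes every character in the
# range U+0080..U+2C77, which includes the U+25BC triangle.
def strip_symbols(gbk_bytes)
  utf8 = gbk_bytes.dup.force_encoding('GBK')
                  .encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
  utf8.gsub(/[\u0080-\u2C77]+/, '')
end

# Made-up sample: a table cell containing the triangle between Chinese words.
sample = "<td>百度▼快照</td>".encode('GBK')
cleaned = strip_symbols(sample)
puts cleaned
```

Chinese characters start well above U+2C77, so they pass through the filter untouched while the triangle is removed.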

Update: I recently wrote a small Ruby gem that scrapes Baidu search results, rankings, and indexed-page counts. It is available at

http://rubygems.org/gems/baidu

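Fetching a single column from a table in Rails

Replace column with the column name you want and Model with your model name.
Model.find(:all, :select => "column").map{|x| x.column}
You can also add a method to the model:
class Model < ActiveRecord::Base
  def self.names
    find(:all, :select => "column").map{|x| x.column}
  end
end
find(:all, ...) is the Rails 2 form. On Rails 3.2 and later, Model.pluck(:column) returns the same values in one call.

Source: http://snippets.dzone.com/posts/show/3901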

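Using custom classes in a Rails project (custom methods, custom classes)

I had been running background jobs with rake and noticed that many rake files repeated the same code, so a shared method was clearly needed. After some digging, it turned out to be this simple:
1. Create a Ruby file in the /lib/ directory: /lib/testclass.rb
2. Edit testclass.rb
class Testclass
  def testmsg
    "testmsg"
  end
end

3. Call Testclass from the project

require 'testclass'
t = Testclass.new
puts t.testmsg

BTW, recent versions of Rails discourage the RAILS_ROOT constant and recommend Rails.root.to_s instead.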
        

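SEO Principles

General principles

Keep things concise, standardized, and consistent; reduce maintenance cost; put user experience (UE) first; avoid duplication, ambiguity, and blind copying. If any rule below conflicts with these general principles, please correct it.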

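Link conventions

  1. How to block a link
    • For internal links, add the attribute rel="nofollow"
    • For external links, add the attribute rel="external nofollow"
  2. When the same URL appears two or more times, keep one link and block all the others
  3. External links are blocked unless stated otherwise.
  4. Block the following kinds of internal links
    • "Change skin", "Log in", "About us", "Contact", "Fill in the survey", and other low-value pages that are linked frequently
    • Links whose target page is unrelated to the topic of the current page
  5. Do not use <a> as a button; use another tag (e.g. <div>). The most extreme approach is to replace low-value links with <div> as well and disguise them as <a> with JS and CSS
  6. Breadcrumbs start at the channel home page, not the site home page. For example, NetEase Home > NetEase Auto > Channel > Article should become: NetEase Auto > Channel > Article index page > Article
  7. When the last breadcrumb item points to the current page, do not link it, and wrap it in <strong>
  8. Keep the anchor text consistent for all links to the same address. Where UE or layout prevents this, put the uniform anchor text in the title attribute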
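
Rules 1 to 3 can be sketched as a small decision function. This is only an illustration under stated assumptions: rel_for, its arguments, and the sample links are hypothetical, and the "unless stated otherwise" exemption for external links is not modeled.

```ruby
require 'uri'
require 'set'

# rel_for is a hypothetical helper. It returns the rel value a link should
# carry under rules 1-3, or nil when the link should stay followed.
def rel_for(href, site_host, seen)
  host = URI.parse(href).host
  # Rule 3: external links are blocked.
  return 'external nofollow' if host && host != site_host
  # Rule 2: only the first occurrence of a URL stays followed.
  # Set#add? returns nil when the URL was already in the set.
  return 'nofollow' unless seen.add?(href)
  nil
end

seen = Set.new
links = ['http://auto.163.com/bj/', 'http://auto.163.com/bj/', 'http://www.ruby-lang.org/']
links.each do |href|
  puts "#{href} -> #{rel_for(href, 'auto.163.com', seen).inspect}"
end
```

Relative links have no host and are treated as internal here.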
 

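Link consistency

  1. Once URLs have been rewritten to pseudo-static form, replace every dynamic URL in the page with its static form
  2. Unify the header/footer, using the header/footer of each product's home page as the reference
    • The header should contain a link to this product's home page (Web, Images, Hot News, Shopping, Music, Video, Dictionary, Translate, More)
  3. When URLs carry tracking codes, add <link rel="canonical" href="${URL without any tracking or invalid parameters}"> inside the <head> tag
  4. URI convention: when both http://auto.163.com/bj/ and http://auto.163.com/bj are reachable, always use the former. If opening the URL in a browser redirects, use the redirect target
  5. URI convention: when both http://dict.youdao.com/map/index.html and http://dict.youdao.com/map/ are reachable, always use the former (to avoid conflicts among the front-end server's rewrite rules)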
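
Rule 3 could be implemented roughly as follows. This is a sketch, not a definitive implementation: canonical_url is a made-up helper, and the TRACKING_PARAMS list is an assumption that should be replaced with the parameters your own site actually uses.

```ruby
require 'uri'

# Hypothetical list of tracking parameters; adjust it for your site.
TRACKING_PARAMS = %w[utm_source utm_medium utm_campaign from].freeze

# canonical_url is an illustrative helper, not a library call.
# It drops the tracking parameters and keeps everything else in order.
def canonical_url(url)
  uri = URI.parse(url)
  if uri.query
    kept = URI.decode_www_form(uri.query).reject { |key, _| TRACKING_PARAMS.include?(key) }
    uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  end
  uri.to_s
end

href = canonical_url('http://dict.youdao.com/map/index.html?utm_source=mail&q=ruby')
puts %(<link rel="canonical" href="#{href}">)
```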
 

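Page size

  • Keep JS and CSS code out of <head> as far as possible and load it from external files. Merge the parts shared between projects where you can
  • Before the page is output, strip useless whitespace (carriage returns, line feeds, spaces, tabs) and useless comments
  • Use the HTML5 doctype throughout: <!DOCTYPE html>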
 

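Hidden content

  1. Except where necessary (e.g. tab switching), never deliberately hide large blocks of text in any form
  2. Avoid putting long passages of text inside images. If it is unavoidable, copy the text into the alt="" attribute of the image. Images sliced with CSS are exempt
  3. When content is displayed via Ajax, the replaced content should not exceed 10% of the page. If more than 50% of the content is replaced, use an ordinary hyperlink instead.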
 

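Program

  1. When HTTP GET parameters prevent a page from returning its content normally, always return response code 404. The 404 page should ideally include relevant recommended links to guide the user in a useful direction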
 

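Keyword usage

  1. Choosing keywords
    1. Using the figures from index.baidu.com, pick the related term with the highest average index
    2. Using the Google AdWords keyword tool, pick terms with the highest search volume and reasonable competition
  2. Placing keywords: without harming UI/UE, have keywords appear in
    1. <title>: the most condensed text on the page, equivalent to the title of a paper. Each URL can have only one <title>, so it matters a great deal.
      • Titles must not repeat across URLs. Articles with identical titles have traditionally been suspected of plagiarism.
      • Do not fill the title with large amounts of promotional or off-topic text.
    2. <h1>: short for heading (or headline). It is the summary of the document's top-level section. It is now commonly used for news and article titles, introducing the content of the page.
      • Do not use <h1> for the logo
    3. <h2>: the summary of content one level below h1, commonly used for sub-section names. It introduces detailed content, software features, and so on.
    4. <strong>: emphasizes text. Use it only when necessary; do not overuse it.
    5. <a>: the equivalent of a paper's references. It should link to pages related to the current page.
    6. <a title="">: works like a button tooltip. When the anchor text does not describe the topic of the target page, add a title to remove the user's doubt.
      • <a title="More about xxx">More</a>
      • <a title="xxx">Next page</a>
    7. <img alt="">: alt is short for alternative text. The browser shows it in place of the image when the image fails to load. It can be used to raise or dilute keyword density.
    8. <img title="">: the hover tooltip for an image. It can be used to raise or dilute keyword density
    9. <meta name="keywords" content="">: the equivalent of a paper's keywords, which help search engines index and retrieve it. Do not add keywords unrelated to the topic of the article. It also helps search engines with word segmentation.
    10. <meta name="description" content="">: the equivalent of a paper's abstract. It lets the reader grasp the content within 10 seconds and decide whether to read on. It is shown on search engine results pages. Do not use one site-wide promotional blurb; always generate informative text so that the click-through rate improves.
    11. Any visually emphasized text area
  3. Keep keywords from blending into the surrounding text. Separators such as _ | 《 》 " ' 【 】 can be used
  4. When placing or splitting keywords within a sentence or phrase, always put UE and guiding user behavior first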
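Additional notes (unsorted)

  1. Avoid using focus as a name in CSS.
  2. Do not use too many <strong> tags on a page.
  3. Each page has exactly one h1, assigned to the most important title on the page (confirm with the project manager).
  4. Write h2 to h6 headings in order of level.
  5. img tags must not lack the alt attribute
  6. For titles rendered as images, use the image as a background and hide the text with an indent, so that search engines can still crawl the keywords.
  7. Truncate text with CSS so that it stays crawlable by search engines (confirm with the page publishing engineer).
  8. Add a title attribute to links where this does not harm user experience
  9. Add a title attribute to images where this does not harm user experience
  10. On product pages, every sub-section name must be text, preferably an <h2>; if that conflicts, demote it (<h3> and so on)
  11. On product pages, there must be a text area below each image
  12. Load JS from external files. If JS must sit in the page, place it below the main content
  13. Add breadcrumbs above the body text of every detail page
  14. Code conforms to the XHTML standard

=============== The following is from Jingwei ===============

  1. Number of DNS round-robin entries
  2. Image dimensions that match the declared size
  3. Combined scripts
  4. Images that could be compressed further at their fixed size
  5. Whether CSS is placed in the head
  6. One consistent path for the same file
  7. Cookies on the static-asset pool
  8. Compressed transfer
  9. Minified CSS
  10. Minified HTML
  11. Minified JavaScript
  12. Fewer DOM nodes (500, 1000, 1500, >2000)
  13. No 404 errors
  14. Fewer concurrent connections
  15. CSS does not use expression
  16. CSS does not use filter
  17. Cacheable favicon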
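
Two of the notes above (a single h1 per page, and no img without alt) are easy to check mechanically. The following is a rough sketch only: check_page is a hypothetical helper, and it uses regular expressions for brevity where a real checker would parse the page with an HTML parser such as nokogiri.

```ruby
# check_page is a hypothetical helper that returns a list of problems
# found in an HTML string. Regex matching is an approximation.
def check_page(html)
  problems = []
  # Count opening <h1> tags only; closing tags start with "</".
  h1_count = html.scan(/<h1[\s>]/i).size
  problems << "expected exactly one <h1>, found #{h1_count}" unless h1_count == 1
  # Report every <img> tag that has no alt attribute at all.
  html.scan(/<img\b[^>]*>/i).each do |tag|
    problems << "missing alt: #{tag}" unless tag =~ /\balt\s*=/i
  end
  problems
end

page = '<h1>Title</h1><h1>Logo</h1><img src="a.png"><img src="b.png" alt="chart">'
puts check_page(page)
```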

Finding over-crawled pages

Filter the access logs of the last month, keeping only the requests whose user-agent is Baiduspider or Googlebot

cat * | awk '{print $7}'| sort |uniq -c|sort -n|tac >/tmp/result

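Then look at the result file. The number at the start of each line is how many times that URL was visited. Ideally each page is fetched once a day, which is about 30 times a month. My guess, though, is that for the sake of efficiency and simple design, crawlers do not fully check whether a target page has already been fetched, so mild over-crawling is acceptable. Even an average of 30 fetches a day adds up to only about 900 a month.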
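
The filtering and counting can also be done in one pass in Ruby. This is a minimal sketch under assumptions: count_crawler_hits is a made-up helper, the sample log lines are invented, and the log is assumed to use the common combined format, where the request path is the 7th whitespace-separated field (the same field the awk command prints).

```ruby
# Crawlers of interest, matched anywhere in the log line.
CRAWLERS = /Baiduspider|Googlebot/

# count_crawler_hits is a hypothetical helper. It returns [path, count]
# pairs for crawler requests, most-visited first.
def count_crawler_hits(lines)
  counts = Hash.new(0)
  lines.each do |line|
    next unless line =~ CRAWLERS
    path = line.split[6]  # 7th field, like awk's $7
    counts[path] += 1 if path
  end
  counts.sort_by { |_, n| -n }
end

# Invented sample lines for illustration.
sample = [
  '1.2.3.4 - - [01/Jan/2012:00:00:01 +0800] "GET /a.html HTTP/1.1" 200 512 "-" "Baiduspider"',
  '1.2.3.4 - - [01/Jan/2012:00:00:02 +0800] "GET /a.html HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
  '1.2.3.5 - - [01/Jan/2012:00:00:03 +0800] "GET /b.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
  '1.2.3.4 - - [01/Jan/2012:00:00:04 +0800] "GET /c.html HTTP/1.1" 200 512 "-" "Baiduspider"'
]
count_crawler_hits(sample).each { |path, n| puts "#{n} #{path}" }
```

For real logs, pass File.foreach(path) instead of the sample array so the file is read line by line.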


SEO is not just what beginners take it to be. It is the optimization of a website's overall quality and efficiency.

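Transcript of a US Navy radio conversation [repost]

Surely made up by someone online:

This is the transcript of an actual radio conversation between Canadian authorities off the coast of Newfoundland and a US Navy ship in October 1995.

US Navy headquarters released this transcript on October 10, 1995.
Canadians:
Please divert your course 15 degrees to the south to avoid a collision.
Americans:
Recommend you divert your course 15 degrees to the north to avoid a collision.
Canadians:
Negative. You must divert your course 15 degrees to the south to avoid a collision.
Americans:
This is the captain of a US Navy ship. I say again, divert your course.
Canadians:
No. I say again, you divert your course.
Americans:
This is the aircraft carrier USS Lincoln, the second largest ship in the United States Atlantic Fleet. We are accompanied by three destroyers, three cruisers, and numerous support vessels. I demand that you change your course 15 degrees north.
I say again, 15 degrees north, or countermeasures will be taken to ensure the safety of this ship.

Canadians: This is a lighthouse,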