Liu's blog: 唯有仰望是真实的 (Only Looking Up Is Real)

Blog entries


2006/7/31

The Art of Bootstrapping(zz)

Sorry to the OS and language geeks out there: this is about bootstrapping a business. Since everyone wants to be the next whatever, here are some insights from someone who succeeded.

from http://blog.guykawasaki.com/2006/01/the_art_of_boot.html

Signum sine tinnitu--by Guy Kawasaki

Blogger. n. Someone with nothing to say writing for someone with nothing to do.

 

January 26, 2006

The Art of Bootstrapping

Someone once told me that the probability of an entrepreneur getting venture capital is the same as getting struck by lightning while standing at the bottom of a swimming pool on a sunny day. This may be too optimistic.

Let's say that you can't raise money for whatever reason: You're not a “proven” team with “proven” technology in a “proven” market. Or, your company may simply not be a “VC deal”--that is, something that will go public or be acquired for a zillion dollars. Finally, your organization may be a not-for-profit with a cause like the ministry or the environment. Does this mean you should give up? Not at all.

I could build a case that too much money is worse than too little for most organizations--not that I wouldn't like to run a Super Bowl commercial someday. Until that day comes, the key to success is bootstrapping. The term comes from the German legend of Baron Münchhausen pulling himself out of the sea by pulling on his own bootstraps. Here is the art of bootstrapping.

  1. Focus on cash flow, not profitability. The theory is that profits are the key to survival. If you could pay the bills with theories, this would be fine. The reality is that you pay bills with cash, so focus on cash flow. If you know you are going to bootstrap, you should start a business with a small up-front capital requirement, short sales cycles, short payment terms, and recurring revenue. It means passing up the big sale that takes twelve months to close, deliver, and collect. Cash is not only king, it's queen and prince too for a bootstrapper.
  2. Forecast from the bottom up. Most entrepreneurs do a top-down forecast: “There are 150 million cars in America. It sure seems reasonable that we can get a mere 1% of car owners to install our satellite radio systems. That's 1.5 million systems in the first year.” The bottom-up forecast goes like this: “We can open up ten installation facilities in the first year. On an average day, they can install ten systems. So our first year sales will be 10 facilities x 10 systems x 240 days = 24,000 satellite radio systems.” 24,000 is a long way from the conservative 1.5 million systems in the top-down approach. Guess which number is more likely to happen.
  3. Ship, then test. I can feel the comments coming in already: How can you recommend shipping stuff that isn't perfect? Blah blah blah. “Perfect” is the enemy of “good enough.” When your product or service is “good enough,” get it out because cash flows when you start shipping. Besides, perfection doesn't necessarily come with time--more unwanted features do. By shipping, you'll also learn what your customers truly want you to fix. It's definitely a tradeoff: your reputation versus cash flow, so you can't ship pure crap. But you can't wait for perfection either. (Nota bene: life science companies, please ignore this recommendation.)
  4. Forget the “proven” team. Proven teams are over-rated--especially when most people define proven teams as people who worked for a billion dollar company for the past ten years. These folks are accustomed to a certain lifestyle, and it's not the bootstrapping lifestyle. Hire young, cheap, and hungry people. People with fast chips, but not necessarily a fully functional instruction set. Once you achieve significant cash flow, you can hire adult supervision. Until then, hire what you can afford and make them into great employees.
  5. Start as a service business. Let's say that you ultimately want to be a software company: people download your software or you send them CDs, and they pay you. That's a nice, clean business with a proven business model. However, until you finish the software, you could provide consulting and services based on your work-in-process software. This has two advantages: immediate revenue and true customer testing of your software. Once the software is field-tested and battle-hardened, flip the switch and become a product company.
  6. Focus on function, not form. Mea culpa: I love good “form.” MacBooks. Audis. Graf skates. Bauer sticks. Breitling watches. You name it. But bootstrappers focus on function, not form, when they are buying things. The function is computing, getting from point A to point B, skating, shooting, and knowing the time of day. These functions do not require the more expensive form that I like. All the chair has to do is hold your butt. It doesn't have to look like it belongs in the Museum of Modern Art. Design great stuff, but buy cheap stuff.
  7. Pick your battles. Bootstrappers pick their battles. They don't fight on all fronts because they cannot afford to fight on all fronts. If you were starting a new church, do you really need the $100,000 multimedia audio visual system? Or just a great message from the pulpit? If you're creating a content web site based on the advertising model, do you have to write your own customer ad-serving software? I don't think so.
  8. Understaff. Many entrepreneurs staff up for what could happen, best case. “Our conservative (albeit top-down) forecast for first year satellite radio sales is 1.5 million units. We'd better create a 24 x 7 customer support center to handle this.” Guess what? You sell nowhere near 1.5 million units, but you do have 200 people hired, trained, and sitting in a 50,000 square foot telemarketing center. Bootstrappers understaff knowing that all hell might break loose. But this would be, as we say in Silicon Valley, a “high quality problem.” Trust me, every venture capitalist fantasizes about an entrepreneur calling up and asking for additional capital because sales are exploding. Also trust me when I tell you that fantasies are fantasies because they seldom happen.
  9. Go direct. The optimal number of mouths (or hands) between a bootstrapper and her customer is zero. Sure, stores provide great customer reach, and wholesalers provide distribution. But God invented ecommerce so that you could sell direct and reap greater margins. And God was doubly smart because She knew that by going direct, you'd also learn more about your customer's needs. Stores and wholesalers fill demand, they don't create it. If you create enough demand, you can always get other organizations to fill it later. If you don't create demand, all the distribution in the world will get you bupkis.
  10. Position against the leader. Don't have the money to explain your story starting from scratch? Then don't try. Instead position against the leader. Toyota introduced Lexus as good as a Mercedes but at half the price--Toyota didn't have to explain what “good as a Mercedes” meant. How much do you think that saved them? “Cheap iPod” and “poor man's Bose noise-cancelling headphones,” would work too.
  11. Take the “red pill.” This refers to the choice that Neo made in The Matrix. The red pill led to learning the whole truth. The blue pill meant waking up wondering if you had a bad dream. Bootstrappers don't have the luxury to take the blue pill. They take the red pill--every day--to find out how deep the rabbit hole really is. And the deepest rabbit hole for a bootstrapper is a simple calculation: Amount of cash divided by cash burn per month because this will tell you how much longer you can live. And as my friend Craig Johnson likes to say, “The leading cause of failure of startups is death, and death happens when you run out of money.” As long as you have money, you're still in the game.
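
The arithmetic behind points 2 and 11 can be sketched in a few lines of Python. This is only an illustration and not part of Guy Kawasaki's article: the forecast figures are the ones quoted in point 2, while the cash and burn figures and all function names are made up for the example.

```python
# Figures quoted in the forecasting example (point 2).
CAR_OWNERS = 150_000_000
ASSUMED_MARKET_SHARE = 0.01  # "a mere 1%"

FACILITIES = 10
INSTALLS_PER_FACILITY_PER_DAY = 10
WORKING_DAYS_PER_YEAR = 240


def top_down_forecast(market_size: int, share: float) -> int:
    """Units sold if we simply assume a share of the whole market."""
    return round(market_size * share)


def bottom_up_forecast(facilities: int, per_day: int, days: int) -> int:
    """Units sold as limited by what the business can physically install."""
    return facilities * per_day * days


def runway_months(cash_on_hand: float, monthly_burn: float) -> float:
    """Point 11: amount of cash divided by cash burn per month."""
    if monthly_burn <= 0:
        raise ValueError("monthly_burn must be positive")
    return cash_on_hand / monthly_burn


top_down = top_down_forecast(CAR_OWNERS, ASSUMED_MARKET_SHARE)
bottom_up = bottom_up_forecast(
    FACILITIES, INSTALLS_PER_FACILITY_PER_DAY, WORKING_DAYS_PER_YEAR
)

# Hypothetical numbers, not from the article:
# $120,000 in the bank and $15,000 burned per month.
months_left = runway_months(120_000, 15_000)

print(f"top-down:  {top_down:,} systems")
print(f"bottom-up: {bottom_up:,} systems")
print(f"runway:    {months_left:.1f} months")
```

The bottom-up number is driven entirely by capacity you control (facilities, installs per day, working days), which is why it is the more believable of the two.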

Written at: Atherton, California.

An Incomplete Reading Guide to Ruby (zz)

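I have finished the 1st Edition of Programming Ruby and moved on to the 2nd. I am reading both editions of AWDWR in parallel, along with Recipes and Ruby for Rails (I read a good chunk of that one over the past couple of days). The later books on this list need to go into my plan too; so far I have only flipped through them. I need to work harder: Ruby still does not feel natural to me.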

Excerpted from 中华读书报 (China Reading Weekly): http://www.gmw.cn/01ds/2006-07/19/content_452687.htm

Programming Ruby(2nd Edition)

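    It no longer seems strange that the classic textbook on a programming language is written by someone other than the language's creator. Like Stan Lippman for C++, Joshua Bloch for Java, and Martin Fowler for UML, Dave Thomas may be the best person in the world at explaining Ruby to others; he is certainly better at it than Matsumoto. Perhaps because they themselves went through the journey from not understanding to understanding, “bystanders” sometimes know better than “creators” what learners need.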

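  So this book is the classic Ruby textbook. Parts I and II cover Ruby's basic syntax and common tools in detail. Part III, “Ruby Crystallized,” goes further into the details and design philosophy of the language. Chapter 23, “Duck Typing,” is a must-read for anyone coming from Java or .NET, because how one understands types and contracts, and classes versus types, is one of the biggest differences between a dynamic language like Ruby and static languages like Java and C#. Part IV then provides a quick reference to Ruby's core class library.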
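
As a rough illustration of what “duck typing” means, here is a minimal sketch. It is written in Python rather than Ruby, it is not taken from the book, and the class and function names are invented for the example. The point is only that the caller depends on what an object can do (the “contract”), not on which class it belongs to.

```python
class Duck:
    def speak(self) -> str:
        return "Quack"


class Robot:
    # Unrelated to Duck: no shared base class or declared interface.
    def speak(self) -> str:
        return "Beep"


def announce(thing) -> str:
    # No isinstance check: any object with a speak() method that
    # returns a string satisfies the implicit contract.
    return thing.speak().upper()


results = [announce(Duck()), announce(Robot())]
print(results)

# An object that does not honor the contract fails only when it is used.
try:
    announce(42)
except AttributeError as exc:
    print("not a duck:", exc)
```

In a statically typed language such as Java, the same function would normally require both classes to implement a declared interface before the code compiles.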

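  Dave Thomas and Andy Hunt, the two “Pragmatic Programmers,” have earned their reputation: although Programming Ruby is not an adequate reference manual, it is enough to lead a beginner into the Ruby world without going astray, and it can help experienced Ruby programmers in the rare cases when, say, they forget how yield works. In my view, that is enough to secure its place as the classic textbook. Because of the pickaxe on its cover, the book is nicknamed the “Pickaxe book”: it is just the tool you need to dig for the treasure of the “ruby.”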

Agile Web Development with Rails

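    David Heinemeier Hansson, the author of Rails, once said something very honest: “I never learn a language for the sake of learning a language.” Most people, most of the time, learn a new language not to compare the merits of languages but because some tool built on it helps them in their work. In the Ruby world, that “killer app,” the tool that put Ruby in the spotlight within a single year, is Rails.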

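  This is the first book about Rails, co-written by DHH, the author of Rails, and the aforementioned Dave Thomas, so its value goes without saying. Perhaps because the two authors had so much solid material to pass on, the first edition ran, unfortunately, to 558 pages. The book first walks through a small online shopping site so that readers can experience agile development with Rails first-hand; it then discusses in depth each component of the Rails framework and extended topics such as security and deployment. The breadth of the content and the depth of the discussion are remarkable. Unlike Matsumoto, it seems, DHH knows exactly how to present his own work, whether as a gentle introduction or as a deep dive.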

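  Happily for Chinese readers, the first edition has been translated by 林芷薰 (Lin Zhixun) and published by 电子工业出版社 (Publishing House of Electronics Industry). Rails is still developing rapidly and has changed considerably since the first edition went to press, so the Chinese translation was already out of date in many places the moment it appeared. But the book is not a reference manual, after all; the authors mostly use it to explain the design philosophy and best practices of Rails. For readers who cannot read English at full speed, this translation can still serve as a competent guide.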

Two Helpers for Rails Developers

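  It is easy to imagine many impatient programmers who, like me, skim the Ruby syntax and then dive straight into the gorgeous palace of Rails. They enjoy the sense of achievement that comes from building web applications quickly, yet are regularly puzzled because they do not know Ruby deeply: this class is empty, so why does it work? What does the code over there mean? Learning Ruby thoroughly and systematically, however, is daunting. Fortunately, we have Ruby for Rails. It covers a selection of Ruby language features, both ordinary and advanced, all of which are used in Rails. In short, it is a Ruby guide written specifically for Rails application developers. More interesting still, it devotes a whole chapter (Chapter 17) to “how to explore the Rails source code,” a model of teaching people to fish.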

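  The other “helper” is Rails Recipes by Chad Fowler, who is also a co-author of Programming Ruby. Like any cookbook, it will not teach you how to handle a knife and a wok or how to slice vegetables; you can learn those skills in many other places. What Rails Recipes teaches is how to put together a feature you need in Rails in a hurry. Take “user login and authentication,” something every site and every developer has built more than once: the book gives a simple, reliable solution that readers can copy and adapt to finish the feature in minutes. For newcomers to Rails (and to Web 2.0) who do not yet know where to start on many problems, this book really can help solve practical problems.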

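  But the book's limitations are just as obvious: if the dish you need is not in this cookbook, it cannot help you. And because it gives only the code that solves each problem, with no accompanying unit tests, readers used to TDD may feel somewhat uneasy. In my view, this book's focus on “giving people a fish” forms a “fearful symmetry” with Ruby for Rails, which is reason enough for both books to share a Rails developer's desk.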

 

Ruby In A Nutshell

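  That Yukihiro Matsumoto, the creator of Ruby, could only write a “Nutshell” book is itself food for thought. O’Reilly's “In a Nutshell” series has always drawn mixed reviews: some find the books shallow, others find them a good way to get started quickly. Matsumoto's biggest problem, though, is that he created Ruby without fully realizing how powerful the language is; he later became active in the Ruby on Rails discussion group and picked up some ingenious Ruby idioms there. The result is unsurprising: Ruby In A Nutshell is a solid, conventional language reference, but it is vague about Ruby's strengths in real-world use, such as building DSLs. It also targets Ruby 1.6, a somewhat dated version, which leaves the book in a slightly awkward position.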

  Two Unusual Ruby Books

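  I call them “unusual” because their subjects are so far off the beaten path. First, Enterprise Integration with Ruby: scripting languages are often called “glue,” but how many people would seriously think of using Ruby for enterprise application integration? On closer inspection the title oversells the book somewhat, since what it really covers is fairly low-level techniques: how to access databases, how to manipulate XML, how to communicate over sockets, and the like. It is obscure content under an obscure title. The content is interesting enough, but I would still tell Ruby programmers who have not read it: you have not missed much, even though the book is not what you imagine.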

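  The last book is even more controversial: some praise it as “the only road to Ruby mastery,” while others criticize it for indulging in clever tricks of little practical value. Praise or blame aside, more and more readers are working through its puzzles one by one. The book is Best of Ruby Quiz by James Edward Gray. It (only Volume 1 has been published so far) presents 25 problems. Most readers can come up with one way to solve each, and with some thought and refactoring can often find a second, more elegant design; but the book then shows a third and a fourth, truly ingenious solution that is reachable only by making full use of Ruby technique. The final solutions are so clever that they often make you slap the table in admiration (or curse out loud). These “clever tricks” are not useless, either: many of the solutions use regular expressions, and understanding them is a real help in learning regular expressions in depth.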

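  Hastily and with some effort, we have skimmed every Ruby book published before June 2006. So far, all of them address either the Ruby language as a whole or the Rails framework, differing only in angle. As interest in Ruby and Rails keeps rising, we can expect more technical books on specific topics to appear soon, along with collections of experience and patterns.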





2006/7/30

Experiencing 第三极 (The Third Pole)

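When I went out to eat today I stopped by the recently opened 第三极 (Third Pole) bookstore and bought two books, Introduction to Data Mining and An Introduction to the Analysis of Algorithms, both very well written. There are actually quite a few good titles among its still rather messy shelves, but my money is limited, haha. Besides, I can leave some of the interesting ones to read there: there is free air conditioning, and those rocking chairs are great fun.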
 
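About 第三极 itself: it is a huge building. Apart from the name, which I do not care for, the architecture is very good, and the water-drop sculpture right in front fits it perfectly. The one flaw is the 第三极 lettering on top of the building, which rather spoils the view. As for the surroundings, the towers running from the Carrefour side, 鼎好 (Dinghao), 海龙 (Hailong), 理想国际 (Ideal International), and 辉煌时代 (Huihuang Shidai), give the whole area a very modern feel :)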
 
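It faces 图书大厦 (the Book Building) across the street, so there is real competition now. I do not expect AW, Mining, O'Reilly, or 电力出版社 (China Electric Power Press) to cut prices, but service should at least improve; I suspect the Book Building's little stools are a response to the new store. For someone like me who is always reading books in the store for free, nothing could be better.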
 
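As for the books: both the number and the range of titles are substantial, and I saw some interesting ones. The weakness is that the math, physics, and computer sections cannot yet compare with the Book Building's; most of the science and technology area is primary and secondary school material and textbooks. I found quite a few good reads in the IT section, but there are still few reprint editions of foreign titles and few books on newer fields. There is, however, a whole row of interesting game programming books. Today I sat there reading the one on MUD game programming, and learned from it how to call Python from C++ and how to use SWIG to use C++ code from Python.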
 
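When I paid, I gave the supervisor some suggestions, hehe. He took them warmly and explained that the store had only just opened, that a large part of the stock was still in the warehouse or had been shelved in a hurry, and that ideas were welcome. So I gave my views: for popular bestsellers, entertainment, and the like, a fairly loose arrangement gives people the fun of exploring and discovering, but for specialized science and technology books, careful centralized classification and clearly marked sections are far more useful. I also mentioned that the CDs bundled with books are not handled and so easily get lost or mixed up, and that their computer search system is very poor. I hope this improves over time. It is a new store, after all, and for its size the overall impression is good. If they really know how to run a bookstore, I think they could take a lot of business from the Book Building.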
2006/7/28

Watched 离骚 II (Li Sao II)

I cried...
 
People laugh and people cry
Some give up some always try
Some say hi while some say bye
Some will forget you but never will I

Write down tears or write down smile
Wanna a sun or a kiss beyond night
Wave to all in noise or just a quiet
It is time or the mind which can fly

Congratulations to you let us celebrate the graduation
Drama will be present let us send the invitations
Someday the words can not say out all of the passion
Someday your eyes can answer all the questions

All the stars (have) fallen in to your eyes
and there are where the moon was waken
All the promises (have) fallen into my heart and never be taken
Someday the words just can not say out all of the reasons
Someday my face changes so do seasons

Find out who you are and don't be afraid of it
Drama someday let's try  write down some memories of it
Someday the story is over it's just over then
Hold this breath listen to the song of it

(Piano Solo)

I laugh I cry and I realize
Some will forget you but never will I

People laugh and people cry
Some give up some always try
Some say hi while some say bye
Some will forget you but never will I
2006/7/27

Stiff asks, great programmers answer(zz)

very interesting read

Stiff asks, great programmers answer

Sunday, 23 July 2006, in categories: Programming, Linux, Ruby, Emacs, Rails

On a hot, boring afternoon I got an _Idea_. Using publicly available e-mail addresses, I sent 10 questions to a bunch of programmers whom I consider very interesting people and respect for the various things they have created. Coming up with the questions was a 5-minute job for me: these are the things I would ask about if I could speak with them in person for, let’s say, 10 minutes, and did not have much time to think. The last two questions have nothing to do with programming; they are simply something I like to know about everyone I talk to, let’s say that’s my hobby. Not everyone wanted to answer them, and that’s fine. It was the first „interview” I ever did, so I also made some mistakes, which came out as people started answering… Despite this, I learnt a lot of interesting stuff, so it was definitely a valuable experience.

Not everyone responded to my e-mail, and not everyone agreed to answer the questions. I may still get some answers after publishing this; I didn’t have the patience to wait longer, so new things may appear here over time.

Finally, here we go:

Starring:

Linus Torvalds - The Linux kernel author

Dave Thomas - Author of the „Pragmatic Programmer”, „Programming Ruby” and other great books about programming. One can read his mainly programming-related thoughts here.

David Heinemeier Hansson - Author of the Rails Framework - the new hot web development framework. He has a weblog here.

Steve Yegge - Probably the least known of the guys here, but he also gave some of the most interesting answers; he has a popular weblog about programming. He is also the author of a game called „Wyvern”.

Peter Norvig - Research Director at Google, a well known Lisper, author of famous (in some circles at least) books about AI. See his homepage.

Guido Van Rossum - The Python language creator

James Gosling - The Java language creator

Tim Bray - One of the authors of the XML and Atom specifications, and a blogger too.

And here comes the main content:

- How did you learn programming? Was school of any use? Or maybe you didn’t even bother finishing school :) ?

Steve Yegge:

I taught myself to program on an HP calculator using their RPN stack language when I was 17 years old. I’d tried to learn programming a few times before that but never really „got” it. The HP 28c and 48g scientific calculators were pretty powerful and had great docs. I wrote a 3D wireframe viewer for the 48g — I got a book on 3D graphics and painstakingly translated an example program in Pascal into the RPN stack language. It was pretty sweet when I got it running. After that I bought a PC and Turbo Pascal, and started studying programming in earnest. I was a decently good programmer by the time I went into the CS program in college.

I went to the University of Washington and got an undergrad degree in CS. It was definitely worthwhile, and I recommend that all programmers should try to get a CS degree if possible.

Linus Torvalds:

I didn’t learn programming in school, but mostly on my own reading books and just doing it (initially on a Commodore VIC-20, later on a Sinclair QL).

That said, I think especially University was very useful. Rather than go to an engineering school, I went to Helsinki University, which is pretty theoretical, so there the teaching concentrated not so much on programming (which was just a small part, and which I ended up doing more of „on the side” anyway), but most of the courses tended to be on fundamental concepts and things like complexity analysis.

Which can seem boring and even a waste of effort at times, but I think it was useful, and I mostly enjoyed it. And I think I’m probably a better programmer for it.

David Heinemeier Hansson:

I learned programming by starting to put together my first web page in HTML. Then I wanted to make some dynamic pieces and picked up first ASP then PHP. After I already knew how to program, I then started on a joint computer science and business administration degree.

Peter Norvig:

I took courses in high school and college, but always felt I learned more on my own.

Dave Thomas:

During my secondary schooling I took a class in a local technical college on computers. It got me totally hooked: I fell in love with programming, and looked around for colleges offering courses in software. Eventually I went to Imperial College, part of London University. It was only the second year they’d offered a course in software, and it was absolutely marvelous: the staff and students worked together to make the materials better, and everyone learned a lot. The undergraduate course there gave me an incredibly strong background in software development. I stayed on to start a PhD, but got lured away by a startup.

But the overall question is „how did you learn programming?” The real answer to that is „I’m still learning programming.” I think any good developer continues to learn throughout their careers. It isn’t just a question of picking up new languages and libraries: good developers also refine their techniques and practices over the years.

Guido Van Rossum:

I went to university where they had a big mainframe and there were various computer courses. This was very important for me.

James Gosling:

Initially, I was self-taught. I got my first programming job before I went to college. But I’m glad I did. I had a lot of fun. I kept going until I had a PhD.

Tim Bray:

I thought I was going to be a math teacher. The math program at University required a few computer science courses.

- What do you think is the most important skill every programmer should possess?

Steve Yegge:

Written and verbal communication skills. You’ll never make it very far as a programmer in any field unless you can get your ideas across to people effectively. Programmers should read voraciously, practice writing, take writing courses, and even practice at public speaking.

Linus Torvalds:

It’s a thing I call „taste”.

I tend to judge the people I work with not by how proficient they are: some people can churn out a _lot_ of code, but more by how they react to other people’s code, and then obviously by what their own code _looks_ like, and what approaches they chose. That tells me whether they have „good taste” or not, and the thing is, a person without „good taste” often is not very good at judging other people’s code, but his own code often ends up not being wonderfully good.

But hey, it’s not the only thing. One thing that is very useful, especially in an open source project, is simply the ability to communicate well what you want to do, and how you are going to do it. The ability to explain to others _why_ you do something a certain way is very important, and not everybody has that ability.

That said, in the end there are also the people who just churn out good code. They may not be good at explaining it, and they may not even have great taste, but the code works well. Sometimes you need another person (one that _does_ have that hard-to-define „taste”) to maybe massage the code into a form where it’s useful in the bigger picture, but just the ability to write clear code for difficult problems is obviously a fairly fundamental part of any programmer.

David Heinemeier Hansson:

A strong sense of value. The ability to ask yourself the question: Is it worth doing what I’m doing right now? So many programmers seem to waste oceans of time on stuff that just doesn’t matter. And not enough on the stuff that does.

Peter Norvig:

I don’t think there’s one, but let’s say concentration.

Dave Thomas:

Passion.

Guido Van Rossum:

Your questions are rather general and hard to answer. :-) I guess being able to cook an egg for breakfast is invaluable.

James Gosling:

To be self motivated. To be really good, you have to be in love with what you do.

Tim Bray:

Ability to prefer evidence to intuition.

- Do you think mathematics and/or physics are important skills for a programmer? Why?

Steve Yegge:

There is a large branch of mathematics that’s very important for programmers called „discrete math” or „concrete math”. It includes disciplines such as probability, combinatorics, graph theory, induction proofs, and other useful tools. I would encourage all programmers to study discrete mathematics to whatever extent they can. Even a little is better than none at all.

As for more traditional math, well, I don’t use it as often, but it comes in very handy when I need it. For instance, I’ve only used calculus once in the past year as part of my job. I had to estimate loads for the peak traffic hour of the day for a service whose load „follows the sun” in an approximate sine curve. The simplest way to make the estimate was to integrate over 1/24th of the curve at a specific time. If I hadn’t known calculus, I would not have known how to make a reasonably accurate estimate.

When I was writing my game, Wyvern, having a solid working knowledge of basic planar geometry was incredibly helpful. And it’s quite common to use algebra and linear algebra on a regular basis. But I rarely use trigonometry or differential equations on the job, and not much calculus either.

I’d say my basic math foundation has made me maybe 5% to 10% better as a programmer. If I knew a lot more math, I’d undoubtedly be a much better programmer than I am today, so I study and practice math several hours a week.

I love physics and I have an ongoing, lifelong quest to try to understand the underpinnings of quantum mechanics. But I’ve never personally found any physics very useful towards my job as a programmer. That would, of course, be different if I were doing something in a physics domain, such as 3D game programming, or certain types of simulation.
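
The peak-hour estimate Steve Yegge describes can be sketched as follows. This is only a guess at the shape of the calculation, not his actual method: it assumes the request rate is a daily mean plus a single cosine term, and the function names, the traffic volume, the `swing` amplitude, and the peak time are all hypothetical. Integrating the rate over the one hour (1/24th of the day) centered on the peak gives the closed form used below, with a numeric integration alongside as a sanity check.

```python
import math

HOURS_PER_DAY = 24.0


def request_rate(t, daily_total, swing, peak_hour):
    """Requests per hour at time t (in hours): daily mean plus a cosine swing."""
    mean_rate = daily_total / HOURS_PER_DAY
    phase = 2.0 * math.pi * (t - peak_hour) / HOURS_PER_DAY
    return mean_rate * (1.0 + swing * math.cos(phase))


def peak_hour_load_exact(daily_total, swing):
    """Closed-form integral of request_rate over the hour centered on the peak."""
    mean_rate = daily_total / HOURS_PER_DAY
    cosine_area = (HOURS_PER_DAY / math.pi) * math.sin(math.pi / HOURS_PER_DAY)
    return mean_rate * (1.0 + swing * cosine_area)


def peak_hour_load_numeric(daily_total, swing, peak_hour, steps=10_000):
    """The same integral by the midpoint rule, as a sanity check."""
    start = peak_hour - 0.5
    width = 1.0 / steps
    total = sum(
        request_rate(start + (i + 0.5) * width, daily_total, swing, peak_hour)
        for i in range(steps)
    )
    return total * width


# Hypothetical service: 2.4 million requests per day, rate swinging
# 60% around the mean, peaking at 14:00.
daily_total = 2_400_000
swing = 0.6
peak_hour = 14.0

flat = daily_total / HOURS_PER_DAY  # naive estimate that ignores the curve
exact = peak_hour_load_exact(daily_total, swing)
numeric = peak_hour_load_numeric(daily_total, swing, peak_hour)

print(f"flat estimate:     {flat:,.0f} requests")
print(f"peak hour (exact): {exact:,.0f} requests")
print(f"peak hour (num.):  {numeric:,.0f} requests")
```

Dividing the daily total by 24 would underestimate the busiest hour, which is the gap the integral closes.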

Linus Torvalds:

I personally think a fairly strong math background is a good thing. I’m not as sure about the physics side, but I’m convinced that understanding math and having a good background in it helps you to be a better programmer. If only because the mental models are similar - you can build up any kind of set of rules you want, but it should be self-consistent.

David Heinemeier Hansson:

Not at all. At least not for the kind of business programming needed for web applications. I consider it much more important that someone is a good writer.

Peter Norvig:

Yes. Many ideas are inherently mathematical: induction, recursion, logic, etc.

Dave Thomas:

Maybe. But, to be honest, I haven’t seen much of a correlation either way between these types of discipline and good software developers.

However, I _have_ seen a strong correlation between people who have some music in their background and programming skills. I have no idea why, but I suspect that some of the areas of the brain that make someone musical also make them good at software development.

Guido Van Rossum:

Math, yes (for some parts; I don’t care for differential equations, but algebra and logic are important). Physics, I don’t think so except it’s always useful to be interested in many different things.

James Gosling:

Yes! They teach you logic & deduction…. To have an analytical eye. And there’s no replacement for mathematics when it comes to analyzing algorithms.

Tim Bray:

In my case, I’ve almost never used my university-level math to support my programming.

- What do you think will be the next big thing in computer programming? X-oriented programming, y language, quantum computers, what?

Steve Yegge:

I think web application programming is gradually going to become the most important client-side programming out there. I think it will mostly obsolete all other client-side toolkits: GTK, Java Swing/SWT, Qt, and of course all the platform-specific ones like Cocoa and Win32/MFC/etc.

It’s not going to happen overnight. It’s very slowly been going that direction for ten years, and it could well be another ten years before web apps „win”. The tools, languages, APIs, protocols, and browser technology will all have to improve far beyond what you can accomplish with them today. But each year they get a little closer, and I’ve finally decided to switch all my own app development over to browser-based programming from now on.

Microsoft and Apple definitely don’t want this to happen, so a necessary first step will be for an open-source browser such as Firefox to achieve a dominant market position, which will in turn require some sort of Firefox-only killer app. (A killer app would be something like iTunes, something that everyone in the world wants to use, badly enough to download Firefox for it.)

Linus Torvalds:

I don’t think we’ll see a „big jump”. We’ve seen a lot of tools to help make all the everyday drudgery easier - with high-level languages and perhaps the integration of simple databases into the language being the main ones. But most of the buzz-words have been of pretty limited use.

For example, I personally believe that „Visual Basic” did more for programming than „Object-Oriented Languages” did. Yet people laugh at VB and say it’s a bad language, and they’ve been talking about OO languages for decades.

And no, Visual Basic wasn’t a great language, but I think the easy DB interfaces in VB were fundamentally more important than object orientation is, for example.

So I think there will be a lot of incremental improvements, and the hardware improvements will make programming easier, but I don’t expect any _huge_ productivity help or revolutions in how people do things.

At least not until you start approaching real AI, and I don’t think real AI is going to be anything you will ever „program”.

David Heinemeier Hansson:

I try not to predict the future. I’m not a big believer in fortune telling. The best way to predict the future is to implement it.

Peter Norvig:

Large-scale distributed processing.

Dave Thomas:

The next big thing in computer programming will be eclipsed by the next–next big thing in programming, and so on, and so on. I’m kinda tired of the endless search for the big things, because while doing it people tend to forget about the real issues: getting the fundamentals right. We need to get a whole lot better at talking with our customers, focussing on delivering value, and taking pride in what we do. A developer who can do these things can deliver great software with any tool set, and won’t need to worry about tracking the fads and fashions.

Guido Van Rossum:

Sorry, I’m not much of a crystal ball person. I predicted CGI about 5 years after it had been invented. :-)

James Gosling:

The two issues I’m most concerned about now are coping with parallelism and complexity.

Tim Bray:

No idea.

- If you had three months to learn one relatively new technology, which one would you choose?

Steve Yegge:

I do happen to have 3 months (part-time), and I’m spending it learning Dojo (http://dojotoolkit.org) and advanced AJAX and DHTML. I’m learning it by writing a fairly ambitious web application. Dojo’s really cool, and I’m sure it will only improve with time.

Linus Torvalds:

Hmm. I’d really love to do FPGA’s, but I’ve always been too busy to really sit down and start learning. I love the notion of playing with hardware: it’s obviously one of the reasons I ended up doing operating systems, since that (along with compilers) is about as close as you can get to playing with the hardware, without actually designing or building it yourself.

David Heinemeier Hansson:

Cocoa programming for the Mac.

Peter Norvig:

I’d like to know Javascript better. Also flash.

Dave Thomas:

If „new” means „new to Dave Thomas” then I think I’d take intensive piano lessons.

If „new” means technology stuff, then I guess I’d choose technologies related to accessibility for people with disabilities.

Guido Van Rossum:

Snowboarding.

James Gosling:

For fun, I’d catch up on the latest in 3D rendering. I’d probably write a photon-map renderer.

Tim Bray:

Security, encryption, digital signatures, identity, etc. It’s a big problem for me that I’ve never learned this stuff.

- What do you think makes some programmers 10 or 100 times more productive than others?

Steve Yegge:

I think if you pause to consider why not all athletes are equally good, you’ll have your answer(s). Thomas Edison has a relevant quote about genius that might also provide you some clues.

Linus Torvalds:

I really have no idea. I think some people are just better able to concentrate on the things that matter, and I think a lot of it is just doing it. Most of the really good programmers I know started doing it fairly young.

David Heinemeier Hansson:

The ability to restate hard problems as easy ones.

Peter Norvig:

The ability to fit the whole problem into their heads at one time.

Dave Thomas:

They care about what they do.

Guido Van Rossum:

Genetically different brain structure.

James Gosling:

They think about what they do. They don’t rush in and slap things together. They have a holistic picture of what is to be built.

Tim Bray:

The surprising variability of the human mind.

- What are your favourite tools (operating system, programming/scripting language, text editor, version control system, shell, database engine, other tools you can’t live without) and why do you like them more than others?

Steve Yegge:

OS: Unix! I use linux, cygwin, and darwin all about equally often now. You just can’t beat it for productivity tools. Every programmer should learn how to use every tool in /bin and /usr/bin.

Scripting language: Ruby. I’m proficient with just about every major scripting language out there: Perl, Python, Tcl, Lua, Awk, Bash, and others I’m forgetting. But I’m really lazy, and Ruby’s by far the easiest, so it’s a match made in heaven.

Programming language: I don’t have a favorite; I think they all suck. I tend to prefer Java because it’s a strong, portable platform with good tools and good libraries. But the Java language will evolve or die; it’s not good enough as-is to hold the lead indefinitely.

Text editor: Emacs, because it’s the best thing out there today.

Version control: SVN. Perforce is better, but it’s very expensive.

Shell: Bash, because I’m too lazy to learn a better one.

Database engine: MySQL, of course. Nothing else comes close.

Others: I find the GIMP invaluable, and also maddeningly unintuitive. I’ve been using it for years and can still barely do anything with it. But I couldn’t live without it, ironically enough.

Firefox is becoming an increasingly critical part of my tools lineup. I feel suffocated when I’m forced to use IE or Safari.

Note that all these tools (Unix, Emacs, Firefox, GIMP, MySQL, Bash, SVN, Perforce) have something in common: they’re extensible; i.e., they all have programming APIs. Great programmers learn how to program their tools, not just use them.

Linus Torvalds:

I actually don’t end up having that many tools I work with, and for many of them I have spent some time of my own to just make them work for me. The OS part is clearly the biggest one, but I’ve obviously also written my own version control system (git), and the text editor I use (micro-emacs) I’ve ended up customizing and extending too.

Other than those three parts, the only thing I care deeply about is my email reader. I use “pine” - not because it’s necessarily the greatest email reader ever, but because I’m used to it, and it does what I need it to do with a minimum of fuss.

David Heinemeier Hansson:

OS X, TextMate, Ruby, Subversion, MySQL. That’s the combo currently keeping me happy. I like tools that exhibit good taste and a focus on the stuff that matters.

Peter Norvig:

I dislike all three major OS - Windows, Mac, Linux. I like Python and Lisp. Emacs.

Dave Thomas:

I switched to Macs a couple of years ago after being a Linux person for more than 10 years. The tools are not necessarily better, but they don’t have to be sharpened or maintained as often, which lets me concentrate on just using them.

I’m not a great believer in single tools: I tend to switch around quite frequently just so I can get experience with as many tools as possible. Right now I’m using OSX, Emacs, TextMate, Rails, Ruby, SVN, CVS, Rake, make, xsltproc, TeX, MySQL, Postgres, and a whole lot of small productivity aids. Who knows what I’ll be using next year.

Guido van Rossum:

Unix/Linux, Python, vi+emacs, Firefox.

James Gosling:

These days I live in NetBeans. It does everything I want, very cleanly, simply, and efficiently. It’s the nicest environment I’ve ever lived in.

Tim Bray:

I like Unix-like operating systems, dynamic languages like Python and Ruby and statically-typed languages like Java (in particular the Java APIs), Emacs, whatever, bash, whatever, NetBeans.

- What is your favourite book related to computer programming?

Steve Yegge:

Man, that’s a tough one. Maybe Gödel, Escher, Bach: an Eternal Golden Braid (Hofstadter)? Although it’s not strictly about programming. If you specifically mean “favorite book about programming”, then maybe SICP (mitpress.mit.edu/sicp/).

Linus Torvalds:

Heh. When I read these days, I tend to either read fiction, or non-computer-related stuff (oldie but goodie: “The Selfish Gene” by Richard Dawkins).

When it comes to programming, the only real programming book that comes to mind is actually the classic Kernighan & Ritchie “The C Programming Language” book, because it’s such an incredibly useful book while being so very readable and _short_. Considering that you can basically learn one of the most important programming languages of our times from it, the fact that it’s thin and readable is just a wonder.

That said, many other books I enjoyed a lot were not about programming per se, but about computer architecture and hardware. There’s obviously Patterson & Hennessy’s computer architecture book, but for me personally perhaps even more Crawford & Gelsinger’s “Programming the 80386”, which was what I used when I started with Linux.

For similar reasons, I have a soft spot for Andrew Tanenbaum’s “Operating Systems: Design and Implementation”.

David Heinemeier Hansson:

I like Extreme Programming Explained for its rejection of common thinking about programming practices and Patterns of Enterprise Application Architecture for striking the right balance of abstract and concrete.

Peter Norvig:

Structure and Interpretation of Computer Programs

Dave Thomas:

It depends on what you mean by “favorite.” Probably the best written book I’ve read in the area is IBM’s “IBM/360 Principles of Operation.”

Guido van Rossum:

Neal Stephenson’s Quicksilver.

James Gosling:

Programming Pearls by Jon Bentley.

Tim Bray:

Bentley’s Programming Pearls

- What is your favourite book NOT related to computer programming?

Steve Yegge:

Just one book? You’re asking for the impossible. There are too many great books out there to choose just one.

My favorite books that I’ve read this month are “Stardust” (Neil Gaiman) and “The Mind’s I” (Hofstadter/Dennett).

My favorite writers are Kurt Vonnegut, Jr. and Jack Vance.

Linus Torvalds:

Well, I already mentioned The Selfish Gene by Dawkins. On the fictional side, there’s just a lot of books I’ve read and enjoyed, but few I’d say were my “favourite” one. I tend to not often re-read the books, and the selection would change over time. It’s mostly science fiction and fantasy, e.g. “Stranger in a Strange Land” by Heinlein was my favourite one as a teenager, but it’s a bit less clear-cut for me these days.

David Heinemeier Hansson:

1984, George Orwell.

Guido van Rossum:

Neal Stephenson’s Quicksilver.

James Gosling:

Guns, Germs & Steel by Jared Diamond

Tim Bray:

One Day in the Life of Ivan Denisovich

- What are your favourite music bands/performers/composers?

Steve Yegge:

Favorite genres: classical, anime soundtracks, video-game music
Favorite composers: Rachmaninoff, Chopin, Bach
Favorite performers: David Russell (classical guitar), Sviatoslav Richter (piano)
Favorite anime OSTs: Last Exile, Haibane Renmei

Linus Torvalds:

I’m actually not very much into music, but when I listen to it, I tend to listen to various classic-rockish things, ranging from Pink Floyd to the Beatles to Queen and The Who.

David Heinemeier Hansson:

I like a lot of genres. Beth Orton, Aimee Mann, Jewel, Lauryn Hill. Actually, all those examples would fit under Girls with Guitars ;).

Guido van Rossum:

Philip Glass.

James Gosling:

I tend to like folk musicians: Christine Lavin, Woody Guthrie, Pete Seeger…

Tim Bray:

Read my blog.

2006/7/26

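Being Party B just means getting pushed around

I've been working here for more than half a year, and today I finally got to see the contract. Never mind that it states in black and white that it is made out in duplicate, yet they hand you a single blank copy and tell you to sign. The main contract contains only the routine terms required by national law, while the confidentiality agreement, the handling of breaches, and a whole pile of Party B's obligations are all tucked into the supplementary provisions, followed by a string of "shall never do this, shall never do that".
Isn't this just bullying people who don't know the law? I'm beyond angry, but however much I flail about I can't jump out of this, and I don't know what to do.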
2006/7/25

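The Furthest Distance in the World

No need to say this one is a repost, I suppose; I just don't know who translated it, heh.

    The Furthest Distance in the World

             -- Tagore

  The furthest distance in the world
  is not the distance between life and death
  but when I stand in front of you
  and you don't know that I love you

  The furthest distance in the world
  is not when I stand in front of you
  and you don't know that I love you
  but when I love you to the point of obsession
  yet cannot say I love you

  The furthest distance in the world
  is not that I cannot say I love you
  but that missing you pierces my heart
  yet I can only bury it deep inside

  The furthest distance in the world
  is not that I cannot say I miss you
  but that we love each other
  yet cannot be together

  The furthest distance in the world
  is not that we love each other
  yet cannot be together
  but that we plainly cannot resist this pull
  yet must pretend not to care at all

  The furthest distance in the world
  is not that we plainly cannot resist this pull
  yet must pretend not to care at all
  but using an indifferent heart
  to dig, between you and the one who loves you,
  a ditch that cannot be crossed

  The furthest distance in the world
  is not the distance between tree and tree
  but branches grown from the same root
  that cannot lean on each other in the wind

  The furthest distance in the world
  is not that the branches cannot lean on each other
  but stars that gaze at one another
  whose paths never meet

  The furthest distance in the world
  is not the paths between the stars
  but that even when the paths cross
  they vanish without a trace in an instant

  The furthest distance in the world
  is not vanishing without a trace in an instant
  but being destined, before ever meeting,
  never to come together

  The furthest distance in the world
  is the distance between fish and bird
  one in the sky
  the other deep beneath the sea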

rank of my msn space buddies

I used the simplest implementation of the PageRank algorithm: first remove all the rank-leak nodes, then include a rank source with a factor of 0.2. The result could be presented better the way the Google Toolbar does it, dividing the range of values into ten parts and giving each page a score according to which part it falls in. I'm too lazy to get that done and work out who the hottie among my buddies' spaces is (which could be very wrong anyway, since the crawl is partial and the ranking algorithm is simplistic), not to mention how meaningless the result is.
I almost forgot to mention: the crawl touched 15,000+ nodes in total, and after filtering out the rank-leak nodes only 1,025 were left. It took 13 iterations to get ε (the maximum difference in any node's rank value between two iterations) down to 10^-9, 40 iterations to reach 10^-16, 43 for 10^-17, 48 for 10^-18, and it finally came to rest at 10^-19 after 51 iterations.
So here is the distribution of rank values on a graph. It may not be the right way to present the data; any suggestions are welcome ;-)
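
For anyone who wants to try the same thing, the procedure can be sketched roughly as follows. This is a minimal Python illustration, not the script actually used for the numbers above: the function names (`prune_leaks`, `pagerank`), the toy graph, and the reading of the 0.2 factor as a uniform rank source (that is, a damping factor of 0.8) are all assumptions.

```python
def prune_leaks(graph):
    """Repeatedly drop nodes with no outgoing links (rank leaks)."""
    graph = {node: set(links) for node, links in graph.items()}
    while True:
        # Keep only links that point at nodes still in the graph.
        for node in graph:
            graph[node].intersection_update(graph)
        leaks = [node for node, links in graph.items() if not links]
        if not leaks:
            return graph
        for node in leaks:
            del graph[node]


def pagerank(graph, source=0.2, eps=1e-9, max_iter=1000):
    """Power iteration; stops once the largest per-node change is below eps."""
    if not graph:
        return {}, 0
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    for iteration in range(1, max_iter + 1):
        # Every node receives an equal slice of the rank source...
        new = {node: source / n for node in graph}
        # ...plus a share of the rank of each node that links to it.
        for node, links in graph.items():
            share = (1.0 - source) * ranks[node] / len(links)
            for target in links:
                new[target] += share
        delta = max(abs(new[node] - ranks[node]) for node in graph)
        ranks = new
        if delta < eps:
            break
    return ranks, iteration


# Hypothetical toy graph: "c" links nowhere, so it is pruned as a rank leak.
toy = prune_leaks({"a": ["b"], "b": ["a", "c"], "c": [], "d": ["a"]})
ranks, iterations = pagerank(toy)
print(sorted(ranks, key=ranks.get, reverse=True), iterations)
```

Lowering `eps` reproduces the effect described above: each extra digit of precision costs a few more iterations.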

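I Am a Tiny, Tiny Little Bird

I want to fly and fly, but I can never fly high...
I swear to myself: no more complaining; I will face things positively.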
2006/7/24

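Used to Getting Rained On

Once again I got soaked like a drowned cat (the idiom says "drowned chicken", but I've never seen one of those, only cats). On the way back I could have sheltered somewhere, but that felt like a waste of time and I've been rained on plenty of times already, so I just rode my bike alone, the water under the wheels ankle-deep, my eyes forced shut by the water caught on my eyelashes, and I still haven't managed to get dinner.
This is the third time I've been caught in rain this heavy, and I'm getting used to it. Facing a downpour, I at least know who I'm fighting against; facing life, maybe, I don't even know who to fight, so I just complain all day long. Pathetic... I despise myself.
After getting drenched, a hot shower back home makes it all go away. But life? It feels like an endless nightmare.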

Zed on Ruby, Rails, Mongrel, and More(zz)

http://www.oreillynet.com/ruby/blog/2006/05/post.html

Zed on Ruby, Rails, Mongrel, and More

Wednesday May 17, 2006 9:22AM
by Steve Mallett in Articles
Interview by Pat Eyler

Zed Shaw is an up and coming programmer in the Ruby world. He’s the creator of the popular Mongrel webserver, and is building a reputation for fast, solid, secure code. In this interview, he discusses Mongrel, Ruby, and his path to better code.

How did you come to Ruby?

Zed I actually came to Ruby years back while developing the first version of a weird revision control tool I was playing with called FastCST. I tried Ruby out but didn’t quite see the point and so went back to using C. Then I read Curt Hibbs’ article and realized, “Hey, they do domain specific languages with that thing!” Right after that I started a Ruby version of FastCST and then became distracted with work and several other weird projects related to Ruby on Rails.

Ah, there’s the Rails link that most people expect when someone says they use Ruby. How much of your work is in Ruby vs Ruby on Rails?

Zed I’d say the majority of my work with Mongrel is in Ruby in order to support Rails and the other frameworks that run with Mongrel, but I work with Rails at my day job.

Zed What I like about Ruby is how you can express statements succinctly but still clearly so that other people can read your code. It has warts but the speed that I code in Ruby is incredible. The language is just amazing for how it mixes domain specific language abilities with object oriented design to let me crank out fully functioning applications at prototyping speeds with production quality.

A lot of people are saying that Ruby and Rails aren’t ready for the enterprise. What’s your take on this?

Zed Before answering this I’ll have to clarify the term “enterprise” into something people can talk about. Right now “enterprise” seems to mean three general things:

  • “Big and expensive for running real businesses.”
  • “Scales and performs well enough to meet my service demands.”
  • “Has legally enforceable commercial support options to cover potential losses.”

Okay, let’s talk about each of those concepts. Ruby and Rails are both free, how does this square with spending a lot of money?

Zed The definition that “enterprise” means you always have to spend millions on hardware and software to run your business is just wrong. For years companies have been pushing this idea because they stand to benefit if you buy more of their products. The reality is that your solution needs to be tailored to the problem at hand and simply saying that you always need a giant solution means you aren’t really evaluating your needs. Rails demonstrates that this kind of “enterprise” has set bogus expectations for architectures, features, and such that aren’t needed and have had crappy ROI.

What Rails seems to be doing is proving that you can run large operations without spending tons of money on “enterprise” solutions. Yes Rails can and does run real businesses. It actually is making some businesses lots of money and is being used by many entrepreneurs to kick-start their ideas with little capital investment. I know of several small shops in New York that got their first sites up and running with only a few developers and started making money with only a minor investment. The fact that companies can reduce their initial risk of investment like this should be reason enough to use Rails to power “real businesses”.

Alright, what about the scalability issue?

Zed The folks who mean “scalable” when they say “enterprise” do have valid claims, many of which I’m trying to address with Mongrel. First off, there needs to be a redefinition of the term “scalable” away from “high performance” and back to “resource expandable”. Once you start to talk about performance and scalability separately you can give a more concrete answer to both concerns.

Rails scales (meaning expands to meet needs) just like any other web application framework technology. Mongrel makes this even easier since it is fast and HTTP based. If you were using Tomcat, Resin, WebLogic, or Apache+PHP before, then Mongrel running Rails pretty much just drops into the existing infrastructure.

I’ll be honest right away though and say that Ruby is slow. The Ruby community has been ignoring the huge “performance” elephant standing in the room and they need to start talking about it so it goes away. Elephants hate being talked about. There are a few efforts to make Ruby faster, but I see a lot less action than is needed to solve the problem. One solution in the works is a real virtual machine called Rite (or YARV depending on who you talk to) which is showing some real promise and seems to be speed competitive with the fastest Java implementations.

Ruby’s advantage though is not in its blazing execution speed but its blazing coding speed. I’ll put it to you this way: I wrote Mongrel in about 3 months. That’s a full featured stable web server that can run four Ruby web application frameworks and is already powering many Ruby web sites. This wouldn’t be possible without Ruby the language.

Rails has a different situation from Ruby’s. Rails has this wonderful caching system that compensates for Ruby’s slow execution speed called “page caching” and “fragment caching”. Rails uses this to transfer the actual web traffic from Rails to the web server itself. This means that with careful planning of how you’ll cache parts of your Rails application you can get the same performance as your static file serving web server. Because of this a Rails application many times can outperform similar applications in Java or PHP.
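
The page-caching idea described above can be sketched in a few lines. This is a rough Python illustration of the concept rather than Rails code: `cached_page`, the `public_dir` layout, and the `render` callback are hypothetical names, and a real deployment would let the web server serve the saved file directly instead of reading it back in application code. The sketch also skips any sanitizing of the URL path.

```python
from pathlib import Path


def cached_page(public_dir, url_path, render):
    """Render a page once, save it as a static file, and reuse the file afterwards."""
    name = url_path.strip("/") or "index"
    target = Path(public_dir) / (name + ".html")
    if not target.exists():
        # Cache miss: run the (slow) application code and store the result.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(render(url_path), encoding="utf-8")
    # Cache hit: only the static file is touched.
    return target.read_text(encoding="utf-8")
```

After the first request the application code never runs again for that URL, which is why the cached pages can be served at static-file speed.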

And, what about that commercial support question?

Zed It’s a total no-deal for Rails right now. I agree with these folks that many organizations need a commercial support option and SLA before they invest in a technology. I really think the first company to do a serious good job at Rails commercial support will make a mint, but until then these organizations are just out of luck. If you look at PHP it really wasn’t until Zend started offering commercial support that big companies considered PHP a serious platform. Nothing really changed about PHP, it was just the perception that there was a company you could take to court now, so it was safer to use.

Before starting on Mongrel, you were working on SCGI. Can you explain a bit about their respective places in the web framework and how you see them playing together (or not).

Zed SCGI was my first attempt at doing a simple alternative to FastCGI. The main goal for SCGI was fast Rails hosting with only pure Ruby. This worked pretty well, but the reality is that SCGI has limited support in most web servers and doesn’t seem to be on the radar for future development. Lighttpd’s support was originally a bolt on modification of its mod_fastcgi. Apache’s module comes from outside the Apache project and the Apache project just announced heavier support for FastCGI rather than SCGI. Throw in the fact that many people have problems getting SCGI in Apache to talk to multiple backends and it’s not looking good for SCGI.

Mongrel originally started as an SCGI proxy designed to solve this problem. I wrote the HTTP parser and then started working on a C only proxy that’d answer HTTP requests and translate them to SCGI. About half way through I realized that the parser I wrote was good enough to just skip the middle-man and write a Web server directly in Ruby. About a day later Mongrel was born.

My plans for SCGI right now are to simplify it down to the absolute minimum necessary to run the protocol. SCGI currently has lots of DRb management code and stuff that some folks use (and abuse) but in general doesn’t help people who want to use SCGI. In order to keep the current crop of SCGI users supported I’ll “back port” some innovations from Mongrel–such as the thread model–and then simplify the whole package to more what SCGI was like in the earlier days.

Given that, I would like to stress that my future work will be with Mongrel and that I really think it is a much more capable way to support Ruby web applications. HTTP is just an easier protocol to deal with in terms of support and deployment tools.

Is there anyone interested in taking over support for SCGI, maybe one of those companies that took it and started their next big product on it? It has a RubyForge project and I’d be ready to hand over the keys to anyone who’s interested and capable.

How does your work on Mongrel affect your Rails work, and vice versa?

Zed When working on Mongrel itself the Rails code I have to use is very minimal. This is done on purpose to keep Mongrel loosely coupled from Rails in case they change something in a release and break everything.

When I’m working with Rails, I use Mongrel actively and take notes for later improvement. This helps keep Mongrel real and keeps me from going into space inventing things for purely academic reasons. For example, I use Windows at work for development and it’s incredibly painful. Not so much because of Windows but because I seriously think nobody developing Rails actually even gets within five feet of a Windows computer. A lot of the enhancements that Luis Lavena made were touched up and improved simply because I wanted to make using Ruby on Rails easier for me and other poor slobs forced to use Windows.

I know that you’ve worked with the Rails and Nitro camps (at least) to make Mongrel work with them. What have been the biggest obstacles?

Zed The best thing about making Mongrel “framework agnostic” is that other people take it and do unexpected things with it. I’m really the only person working to make Mongrel coexist with Rails. Some folks on the core Rails team mostly test it and use it when they do development, but myself, Luis Lavena, and a few others ended up doing all the work to make it production ready. Since Luis and myself are the only ones who need it to work with Rails in production that’s to be expected.

The Nitro, Camping, and IOWA teams however mostly did the work for me on their frameworks. They took Mongrel, read the documentation, bugged me for initial help, but for the most part it’s been hands off. I think I helped Camping the most, but why (the lucky stiff) actually manages the Mongrel code related to Camping. He’s also just contributed back a nice changeset implementing solid large file uploads/downloads which I’ll put in the 0.3.12.5 release. Why says he’s doing DVD uploads/downloads off his ParkPlace project.

What does Mongrel give back to the projects that use it?

Zed The two biggest things that the projects all should start getting from Mongrel are security enhancements and win32 support.

Mongrel’s design is based around the idea that most of the security problems in HTTP servers come from hand-coded parsers that are too “loose” with the protocol. Mongrel uses a generated parser (using Ragel) that is very strict and seems to block a huge number of attack attempts simply because it is so exacting. Since this protection comes at the HTTP level, any framework using Mongrel gets it for free.

In the EastMedia/VeriSign project we were seeing a bunch of attack attempts from a “security company”. I won’t name the company since I don’t want to give them any extra press, but they were running some kind of security scanning software against our machines (without asking first) that we hadn’t announced yet.

The beautiful part is that Mongrel blocked all of the attacks immediately at the HTTP protocol level and kicked them out without wasting any time. Meanwhile, Apache let the traffic right on through the proxy without even a warning.

After they ran the automated scans we saw a few “hand coded” attacks which probably means someone at this “security company” was very intrigued by what Mongrel was doing.

The funniest part of this is that all Mongrel does is use a correctly coded parser based on a real grammar and a parser generator (Ragel). Other web servers use hand coded HTTP parsers that turn out to be vulnerable, difficult to compare to the real HTTP 1.1 RFC grammar, and are just a pain to manage. Using Ragel makes Mongrel robust against many of these attacks without actually having to create specific logic for detecting “attacks”.
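
To give a feel for what “strict” means here, the following is a minimal Python sketch that accepts an HTTP request line only if the whole line matches one anchored pattern and rejects everything else. It is an illustration only, not Mongrel’s Ragel grammar: the pattern covers a simplified subset of the HTTP/1.1 request line, and `parse_request_line` and the length limits are hypothetical choices.

```python
import re

# Simplified request-line grammar: METHOD SP request-target SP HTTP/x.y CRLF
# This is an illustrative subset, not the full HTTP/1.1 grammar.
REQUEST_LINE = re.compile(
    r"(?P<method>[A-Z]{1,20}) "
    r"(?P<target>/[!-~]{0,2047}) "
    r"HTTP/(?P<major>[0-9])\.(?P<minor>[0-9])\r\n"
)


def parse_request_line(line):
    """Return the parsed fields, or raise ValueError on any deviation."""
    match = REQUEST_LINE.fullmatch(line)
    if match is None:
        raise ValueError("malformed request line")
    return match.groupdict()


print(parse_request_line("GET /index.html HTTP/1.1\r\n"))
```

There is no “attack detection” logic anywhere in the sketch. A doubled space, a lowercase method, a bare `\n` line ending, or an over-long target simply fails to match the grammar and is rejected.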

The second benefit other projects are getting from Mongrel is win32 support from Luis Lavena. After Mongrel’s success on the win32 platform I started seeing messages saying that Luis was helping other projects get solid win32 capabilities. The rumors suggest that Luis and friends might actually open up a whole Ruby world for win32 users. I’m hoping that this brings some help to Daniel Berger’s win32utils project as well.

What I’d like to see on the win32 front is the Ruby One-Click Installer pick up all the win32 support natively and include it by default. Better yet, I’d like to see Ruby do what Python does and include the win32utils stuff as a platform specific add-on or a gem folks can download.

One of the big drivers behind Mongrel is that it’s fast and mostly native Ruby. What’s your process for optimizing? What tools are you using?

Zed My main tools when trying to optimize (and also validate) C code are valgrind and kcachegrind. Both are fantastic for free tools, but sadly Ruby does not run well under valgrind. In fact, valgrind dies on even “hello world” with 30k errors before the program has started. What I did initially with the HTTP parser was write a little harness that let me run the parser under valgrind and then tune it with kcachegrind.

The rest of my performance analysis comes from setting up a series of test applications that I then hit with httperf to measure their speed. I keep a log while I’m working on it and make sure that the performance doesn’t drop. If I make a change with an expected performance boost and it doesn’t do anything then I evaluate it again and try something else that might work.

The whole process is really just the scientific method. Since I have limited information from Ruby about performance I have to just test, evaluate, adjust, and repeat until the measurements improve. What really helps is using statistical tests to confirm that each change made a difference, or at least didn’t hurt things. Without these tests I could make changes that seemed to improve things but actually made no difference.
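
As a rough illustration of confirming that a change made a difference, here is a minimal Python sketch. The interview does not say which statistical tests were used, so a permutation test on the difference of means is swapped in as a simple stand-in; the function name and the requests-per-second samples are made up for the example.

```python
import random
from statistics import mean


def permutation_p_value(before, after, rounds=10_000, seed=0):
    """Two-sided permutation test on the difference of the means."""
    rng = random.Random(seed)
    observed = abs(mean(after) - mean(before))
    pooled = list(before) + list(after)
    hits = 0
    for _ in range(rounds):
        # Shuffle the labels: if the change did nothing, any split is as likely.
        rng.shuffle(pooled)
        left, right = pooled[:len(before)], pooled[len(before):]
        if abs(mean(right) - mean(left)) >= observed:
            hits += 1
    return hits / rounds


# Hypothetical requests/sec measurements from repeated benchmark runs.
before = [512, 498, 505, 520, 509, 501]
after = [540, 551, 538, 547, 560, 544]
print(permutation_p_value(before, after))
```

A small p-value suggests the speedup is unlikely to be run-to-run noise; a large one means the change may have made no measurable difference.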

I also use Ruby’s profiling library, but I can only do that in very limited tests where only Mongrel is running. When Mongrel runs the other application frameworks the framework code drowns out any Mongrel related performance data and doesn’t give me any decent information.

A good example of this is in a simple test I have that returns HTTP request parameters as a YAML dump. I can’t use this test for profiling because the YAML library is such a pig that all of the profiling information is about YAML. Mongrel is just a little blip. Rails or Camping does the same thing so profiling turns out to be more about them than about Mongrel.

When I get really serious about performance I use R and run planned measured tests with statistical evaluations. This involves more planning than most people are familiar with (as I’ve ranted about before) and I usually only do it if money is on the line since it takes a large amount of time to get right.

Recently, you’ve been putting a lot of work (beyond unit tests) into making Mongrel stable and secure. Would you explain your methodology and the tools you’re using to make it work?

Zed I agree with the OpenBSD group’s assertion that security holes come from defects in general, not from some specific “security hole” that you look for in the source code. This means that I think if I fix all the defects I can find, and try to be proactive about potential errors then I’ll prevent a lot of security holes in the process.

With any of my projects I try desperately to do the following:

  1. Keep the code as incredibly simple as possible. I call this “The Shibumi School of Software Structure” because I like the letter ‘S’ and because it’s the exact inverse of what most programmers do when they structure software.
  2. Code reviews of my own code before releasing, constantly trying to find:
    1. “missed assertions” — Unstated assumptions about inputs and outputs.
    2. “missed else” — Logical branches that don’t cover all test domains.
    3. “will it stop” — Looping errors that will cause classic infinite loops or short loops.
    4. “check that return” — Return values that aren’t dealt with properly (which are really assumptions about other inputs and outputs).
    5. “unexpected exceptions” — Exceptions are pretty darn evil since they’re rarely documented.
    6. “simply readable” — Replacing clever code with readable simple code where possible, and documenting complex code so it can be reviewed by others.
  3. Unit testing as much as possible. When writing networking software unit tests become really difficult since you can only actually test it over a network of some kind.
  4. External thrashing and performance tests trying to break the system with unexpected inputs. Techniques I use are fuzzing, heavy loads, stopping interactions violently mid-stream, ripping out resources at random, and trying to think about ways someone could attack the system.
  5. Usability reviews from potential or current users. My motto here is “If I KMFU (Know My F*ing Users) they won’t have to RTFM.” I really think if a system is easy to use then the security concerns are lower, but I don’t have much evidence to support this claim.

Since I’m just one guy doing all this — and since this is supposed to be fun for me — I don’t follow all of these as religiously as I would in a professional setting. When Mongrel got some real funding I took the above steps more seriously. For about 3 weeks I was increasing the number of unit tests, slowing down releases so I could do code reviews, and I grabbed Peach Fuzzer to get some simple thrashing and fuzzing tests going.

The end result was that Mongrel ended up being able to stop a large number of attacks directly at the protocol level. This doesn’t mean Mongrel is impenetrable, but I think it’s on the road to being one of the most secure web servers out there. Of course now every hax0r will try to break Mongrel but I predict any future attacks will exploit flaws in Ruby or in the application frameworks rather than in Mongrel. Not much I can do about that right now though.

What do you think Mongrel needs to take it to the next level?

Zed A big component of my Mongrel work in the near future will be simply improving the deployment documentation people use. So far there’s just a document on setting up lighttpd, but what we really need is some solid documentation on deploying production Mongrel clusters on various platforms. Once this kind of documentation is available people should start to get more comfortable with deploying Mongrel, especially if they are in an environment that already runs other application servers like Tomcat.

In fact, I tried to meet as many people as possible at Canada on Rails to convince them to try their applications on Mongrel and to sort out what kinds of deployment scenarios people are facing with their applications.

How did that go? How will you use that feedback in Mongrel?

Zed The specific Mongrel feedback I received was very positive, and the majority of it was not to me directly but from people recommending it to others. Actually some of it was pretty embarrassingly glowing which is fantastic since it means I’m on the right track. I am a little worried that people just haven’t run into the big problems yet, but I’m guessing that Mongrel is really hitting the right spots.

The serious questions were the most valuable though. Many people asked questions about deployment that I’m hoping to address in a series of nice documents covering various deployment scenarios. Others asked about cluster management which I’m hoping Bradley Taylor from RailsMachine.com will solve with his upcoming cluster plugins for Mongrel. A few others asked about licenses which I’ll address in some FAQs.

The real tough questions seemed to be about how best to handle caching and distribute load for complex dynamic web sites. I really didn’t have an answer for these folks but I took their complaints and started formulating the base idea for my next project. I’m hoping this next effort will be another solution to the supposedly solved “caching problem”. The feedback I received from people about my ideas around caching was very enthusiastic so I think I’m on the right track there.

What’s holding Mongrel back?

Zed I’d say the biggest obstacle has been getting Mongrel accepted as a production platform. Developing Mongrel has been great. I get love letters nearly every day from the community saying how much they think Mongrel rocks. The only missing piece is a few good huge production deployments using Mongrel. I’m thinking that these will start happening in the next few months as people start to deploy the applications they’ve been developing.

What are your 5 favorite libraries/frameworks for Ruby (whether in the standard library, or off the ‘Net)?

Zed I really dig why the lucky stiff’s Camping framework. It’s amazing how much voodoo why put into that thing in such a small space. Mongrel has more than a few bits of code or ideas borrowed from it.

I also use webgen quite extensively to manage the Mongrel website. It’s a great way to generate static sites from a small set of pages written in wiki format.

I also really like this tiny fast little Java webserver called Simple. When I started Mongrel I studied Simple and adopted its Handler setup. Simple’s got some other odd features–like parsing responses to correct them–but still remains remarkably small and fast. If I were ever going to do a Rails competitor in Java, Simple would be at the center.

The only web performance measurement tool I’ll advocate to people these days is httperf. It’s the only one that gives accurate statistics, breaks the entire request/response chain down, accurately reports socket errors, has exact definitions of what each measurement means and doesn’t claim to measure “users”.

I also really like Lua as a light alternative to Ruby. It’s fast, real tiny, embeds into other programs well, and has a syntax that’s close enough to Ruby to not seem entirely foreign. I’ve been looking to use Lua in a couple C only projects I have planned as the extension language.

What’s next for Ruby, Rails, Mongrel, and Zed?

Zed You’ll have to ask Matz about Ruby and David about Rails. What I can say is what I’d like to be next for Ruby and Rails.

For Ruby I’d like to see two efforts. First that it always runs clean under valgrind. This would go a long way toward improving its stability and keeping it clean. The second is for all these people working on different “make Ruby faster” projects to pour their collective talents into making the Ruby 1.9 virtual machine fast and perfect.

For Rails I’d like to see a lot of the fat go away and for ActiveRecord to finally get a decent connection pooling system. By “fat” I mean stuff that I believe DHH is already planning on moving out into plugins like Active Web Service. For ActiveRecord there needs to be a solid effort to refactor it so that database connections are pooled in much the same way that Hibernate does. This is especially important for people using commercial databases that license based on the connection count.

Mongrel’s future is looking pretty bright (so bright I gotta wear shades). I’m making the push toward the first official production release, dubbed “Mongrel 0.4 Enterprise Edition 1.2” since tacking “Enterprise Edition” on everything worked so well for Java. I’m also working with more companies to either provide services around Mongrel or to include Mongrel in potential products.

My next big project will be a special caching proxy server that I’m aiming at making any dynamic web applications much faster. While building and using Mongrel I’ve found that the whole caching situation with HTTP 1.1 uses very 1996 technology. I think I’ve got an idea that could solve the problem and potentially give many of these web applications huge performance and scalability boosts.


Zed A. Shaw is a professional software developer who’s been writing software for close to 13 years in industries ranging from government and academia to commercial software, and on applications ranging from security products to network protocols and web applications. He’s also dabbled in system administration, product development, usability engineering, and customer service. In his spare time he likes to write biographies so people think he’s super cool.

Pat Eyler is an Infrastructure Engineer for the LDS Church by profession, a Ruby geek by choice, and a writer by night. He enjoys reading, cooking, spending time with his family, and helping to build the Ruby community.

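Demystifying Regular Expressions (zz)

The title is embarrassingly grand and most of the content is familiar by now, but the piece is carefully laid out and the examples clearly took real effort, so I'm reposting it. Some of the Chinese terminology still reads a bit oddly. That said, I only learned what negative lookbehind is good for a few days ago on the Shuimu (SMTH) BBS, and it is quite handy.

[Original article; when reposting, please keep or cite the source: http://www.regexlab.com/zh/regref.htm]

Introduction

A regular expression describes a pattern for matching strings. It can be used to (1) check whether a string contains a substring that fits some rule, and retrieve that substring; and (2) perform flexible replacements on a string according to matching rules.

Regular expressions are actually easy to learn, and the few more abstract concepts are easy to grasp too. Many people find them complicated for two reasons. First, most documentation does not build up from simple to advanced and pays no attention to the order in which concepts are introduced, which makes it hard to follow. Second, the documentation shipped with each engine tends to describe that engine's own special features, and those are not what we need to understand first.

In the original article, every example can be clicked to open a test page. Enough preamble; let's begin.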


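1. Regular expression rules

1.1 Ordinary characters

Letters, digits, Chinese characters, the underscore, and any punctuation not given a special meaning in later sections are all "ordinary characters". An ordinary character in an expression matches one identical character in the string.

Example 1: the expression "c" against the string "abcde": the match succeeds; the matched text is "c"; the match starts at 2 and ends at 3. (Note: whether indices start at 0 or at 1 depends on the programming language.)

Example 2: the expression "bcd" against the string "abcde": the match succeeds; the matched text is "bcd"; the match starts at 1 and ends at 4.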
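The two examples above can be checked with a small sketch in Python's `re` module. Python is my choice here, not the article's; the helper name `first_match` is made up for illustration. Python counts from 0 and reports the end position as exclusive, which agrees with the positions quoted above.

```python
import re

def first_match(pattern, text):
    """Return (matched_text, start, end) for the first match, or None."""
    m = re.search(pattern, text)
    return None if m is None else (m.group(0), m.start(), m.end())

print(first_match("c", "abcde"))    # ('c', 2, 3)
print(first_match("bcd", "abcde"))  # ('bcd', 1, 4)
```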


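1.2 Simple escape characters

Characters that are awkward to write directly are written with a leading "\". We already know these well.

Expression    Matches
\r, \n        carriage return and line feed
\t            tab
\\            the "\" character itself

Some other punctuation marks have special uses in later sections; putting "\" in front of one makes it stand for the symbol itself. For example, ^ and $ both have special meanings, so to match a literal "^" or "$" in a string the expression must be written "\^" or "\$".

Expression    Matches
\^            the ^ symbol itself
\$            the $ symbol itself
\.            the dot (.) itself

These escaped characters match the same way ordinary characters do: each matches one identical character.

Example 1: the expression "\$d" against the string "abc$de": the match succeeds; the matched text is "$d"; the match starts at 3 and ends at 5.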
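A quick sketch of the escaping example in Python (my choice of engine, not the article's). It also shows what happens when the backslash is forgotten: "$" is then read as the end-of-string anchor, so the pattern cannot match.

```python
import re

# Escaped: "\$" is a literal dollar sign.
m = re.search(r"\$d", "abc$de")
escaped = (m.group(0), m.start(), m.end())
print(escaped)  # ('$d', 3, 5)

# Unescaped: "$" anchors at the end of the string, and nothing can follow it.
unescaped = re.search(r"$d", "abc$de")
print(unescaped)  # None
```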


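1.3 Expressions that match "any one of several characters"

Some notations in a regular expression match any one character out of a set. For example, the expression "\d" matches any single digit. It can match any character in the set, but only one of them, not several. It is like the jokers in a card game: a joker can stand in for any card, but only for one card.

Expression    Matches
\d            any single digit, 0 to 9
\w            any single letter, digit, or underscore, i.e. one of A-Z, a-z, 0-9, _
\s            any single whitespace character, such as a space, tab, or form feed
.             the dot matches any single character except the newline (\n)

Example 1: the expression "\d\d" against "abc123": the match succeeds; the matched text is "12"; the match starts at 3 and ends at 5.

Example 2: the expression "a.\d" against "aaa100": the match succeeds; the matched text is "aa1"; the match starts at 1 and ends at 4.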
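The same two examples, sketched in Python (an assumption on my part; the article is engine-neutral). One Python-specific caveat is shown at the end: on `str` patterns, "\w" and "\d" are Unicode-aware by default, so they accept more than the ASCII ranges listed in the table unless `re.ASCII` is passed.

```python
import re

examples = [(r"\d\d", "abc123"), (r"a.\d", "aaa100")]
results = []
for pattern, text in examples:
    m = re.search(pattern, text)
    results.append((m.group(0), m.start(), m.end()))
print(results)  # [('12', 3, 5), ('aa1', 1, 4)]

# "." does not match a newline by default.
dot_newline = re.search(r"a.b", "a\nb")
print(dot_newline)  # None

# Python caveat: \w is Unicode-aware unless re.ASCII is given.
unicode_word = re.fullmatch(r"\w", "中") is not None
ascii_word = re.fullmatch(r"\w", "中", re.ASCII) is not None
print(unicode_word, ascii_word)  # True False
```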


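1.4 Defining your own "any one of several characters" expressions

A set of characters enclosed in square brackets [ ] matches any one of those characters. A set enclosed in [^ ] matches any one character that is not in the set. As before, it can match any character in the set, but only one, not several.

Expression    Matches
[ab5@]        "a", "b", "5", or "@"
[^abc]        any single character other than "a", "b", "c"
[f-k]         any single letter from "f" to "k"
[^A-F0-3]     any single character other than "A" to "F" and "0" to "3"

Example 1: the expression "[bcd][bcd]" against "abc123": the match succeeds; the matched text is "bc"; the match starts at 1 and ends at 3.

Example 2: the expression "[^abc]" against "abc123": the match succeeds; the matched text is "1"; the match starts at 3 and ends at 4.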
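A minimal Python sketch of the bracket examples (Python is assumed here for illustration only). The last line adds a range check for "[f-k]" on a test string of my own.

```python
import re

m1 = re.search(r"[bcd][bcd]", "abc123")
pair = (m1.group(0), m1.start(), m1.end())
print(pair)  # ('bc', 1, 3)

m2 = re.search(r"[^abc]", "abc123")
negated = (m2.group(0), m2.start(), m2.end())
print(negated)  # ('1', 3, 4)

# A range matches every character between its endpoints, inclusive.
in_range = re.findall(r"[f-k]", "efgklm")
print(in_range)  # ['f', 'g', 'k']
```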


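1.5 Special symbols that control the number of repetitions

The expressions covered so far, whether they match one kind of character or any one of several, match exactly once. Adding a repetition symbol to an expression lets it match repeatedly without the expression being written out again.

The usage is: the repetition modifier goes after the expression it modifies. For example, "[bcd][bcd]" can be written "[bcd]{2}".

Expression    Effect
{n}           repeats the expression n times; e.g. "\w{2}" is equivalent to "\w\w", and "a{5}" to "aaaaa"
{m,n}         repeats at least m and at most n times; e.g. "ba{1,3}" matches "ba", "baa", or "baaa"
{m,}          repeats at least m times; e.g. "\w\d{2,}" matches "a12", "_456", "M12344", ...
?             matches 0 or 1 times, equivalent to {0,1}; e.g. "a[cd]?" matches "a", "ac", "ad"
+             matches at least once, equivalent to {1,}; e.g. "a+b" matches "ab", "aab", "aaab", ...
*             matches zero or more times, equivalent to {0,}; e.g. "\^*b" matches "b", "^^^b", ...

Example 1: the expression "\d+\.?\d*" against "It costs $12.5": the match succeeds; the matched text is "12.5"; the match starts at 10 and ends at 14.

Example 2: the expression "go{2,8}gle" against "Ads by goooooogle": the match succeeds; the matched text is "goooooogle"; the match starts at 7 and ends at 17.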
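The quantifier examples can be tried with this short Python sketch (Python is an assumption; any engine with the same syntax should behave alike). `re.fullmatch` is used for the "{1,3}" row so that only whole-string matches count.

```python
import re

m1 = re.search(r"\d+\.?\d*", "It costs $12.5")
price = (m1.group(0), m1.start(), m1.end())
print(price)  # ('12.5', 10, 14)

m2 = re.search(r"go{2,8}gle", "Ads by goooooogle")
google = (m2.group(0), m2.start(), m2.end())
print(google)  # ('goooooogle', 7, 17)

# "ba{1,3}" accepts one to three "a"s after the "b", no more and no fewer.
candidates = ["b", "ba", "baa", "baaa", "baaaa"]
accepted = [s for s in candidates if re.fullmatch(r"ba{1,3}", s)]
print(accepted)  # ['ba', 'baa', 'baaa']
```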


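1.6 Other special symbols with abstract meanings

Some symbols carry an abstract, special meaning in an expression:

Expression    Effect
^             matches at the start of the string; consumes no characters
$             matches at the end of the string; consumes no characters
\b            matches a word boundary, such as the position between a word and a space; consumes no characters

Describing these further in words would still be rather abstract, so here are some examples.

Example 1: the expression "^aaa" against "xxx aaa xxx": the match fails. "^" must match at the start of the string, so "^aaa" can match only when "aaa" is at the beginning of the string, as in "aaa xxx xxx".

Example 2: the expression "aaa$" against "xxx aaa xxx": the match fails. "$" must match at the end of the string, so "aaa$" can match only when "aaa" is at the end of the string, as in "xxx xxx aaa".

Example 3: the expression ".\b." against "@@@abc": the match succeeds; the matched text is "@a"; the match starts at 2 and ends at 4.
Further note: like "^" and "$", "\b" consumes no characters itself, but it requires that, at its position in the match, one side is in the "\w" range and the other side is not.

Example 4: the expression "\bend\b" against "weekend,endfor,end": the match succeeds; the matched text is "end"; the match starts at 15 and ends at 18.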
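The four anchor examples, sketched in Python (assumed here purely for illustration):

```python
import re

# "^" and "$" pin the match to the ends of the string.
start_fail = re.search(r"^aaa", "xxx aaa xxx")
start_ok = re.search(r"^aaa", "aaa xxx xxx")
end_fail = re.search(r"aaa$", "xxx aaa xxx")
end_ok = re.search(r"aaa$", "xxx xxx aaa")
print(start_fail, end_fail)  # None None

# "\b" sits between a \w character and a non-\w character.
m = re.search(r".\b.", "@@@abc")
boundary = (m.group(0), m.start(), m.end())
print(boundary)  # ('@a', 2, 4)

whole_word = re.search(r"\bend\b", "weekend,endfor,end").span()
print(whole_word)  # (15, 18)
```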

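Some symbols affect the relationship between sub-expressions inside an expression:

Expression    Effect
|             an "or" between the expressions on its left and right; matches either the left or the right
( )           (1) when a repetition modifier is applied, the expression in parentheses is modified as a whole
              (2) when reading the match result, the text matched by the expression in parentheses can be retrieved separately

Example 5: the expression "Tom|Jack" against the string "I'm Tom, he is Jack": the match succeeds; the matched text is "Tom"; the match starts at 4 and ends at 7. Matching again for the next one: the match succeeds; the matched text is "Jack"; the match starts at 15 and ends at 19.

Example 6: the expression "(go\s*)+" against "Let's go go go!": the match succeeds; the matched text is "go go go"; the match starts at 6 and ends at 14.

Example 7: the expression "¥(\d+\.?\d*)" against "$10.9,¥20.5": the match succeeds; the matched text is "¥20.5"; the match starts at 6 and ends at 11. The text matched inside the parentheses, retrieved separately, is "20.5".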
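A Python sketch of alternation and grouping (Python assumed for illustration). `finditer` plays the role of "matching again for the next one", and `group(1)` retrieves what the parentheses captured. Note that when a group repeats, Python keeps only its last repetition.

```python
import re

names = [(m.group(0), m.span()) for m in re.finditer(r"Tom|Jack", "I'm Tom, he is Jack")]
print(names)  # [('Tom', (4, 7)), ('Jack', (15, 19))]

m = re.search(r"(go\s*)+", "Let's go go go!")
repeated = (m.group(0), m.span(), m.group(1))
print(repeated)  # ('go go go', (6, 14), 'go')

m = re.search(r"¥(\d+\.?\d*)", "$10.9,¥20.5")
yen = (m.group(0), m.span(), m.group(1))
print(yen)  # ('¥20.5', (6, 11), '20.5')
```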


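2. Some advanced rules

2.1 Greedy and non-greedy repetition

Among the repetition symbols, several let the same expression match a varying number of times, for example "{m,n}", "{m,}", "?", "*", and "+"; the actual count depends on the string being matched. An expression with such a variable repetition always matches as much as it can. For example, against the text "dxxxdxxxd":

Expression      Result
(d)(\w+)        "\w+" matches everything after the first "d": "xxxdxxxd"
(d)(\w+)(d)     "\w+" matches everything between the first "d" and the last "d": "xxxdxxx". Although "\w+" could also match the last "d", it "gives up" that last "d" so that the whole expression can match

So "\w+" always matches as many characters as fit its rule. In the second example it did not take the last "d", but only so that the whole expression could succeed. Likewise, expressions with "*" and "{m,n}" match as much as possible, and an expression with "?" prefers to match when it may either match or not. This matching principle is called "greedy" mode.

Non-greedy mode:

Adding a "?" after a repetition symbol makes a variable-count expression match as little as possible, and makes an optional expression prefer not to match. This principle is called "non-greedy" mode, also known as "reluctant" mode. If matching less would make the whole expression fail, then, much as in greedy mode, the non-greedy expression matches the minimum extra amount needed for the whole expression to succeed. For example, against the text "dxxxdxxxd":

Expression      Result
(d)(\w+?)       "\w+?" matches as few characters after the first "d" as it can; the result is that "\w+?" matches a single "x"
(d)(\w+?)(d)    for the whole expression to succeed, "\w+?" has to match "xxx" so that the following "d" can match; the result is that "\w+?" matches "xxx"

More cases:

Example 1: when the expression "<td>(.*)</td>" is matched against the string "<td><p>aa</p></td> <td><p>bb</p></td>", the match succeeds; the matched text is the entire string "<td><p>aa</p></td> <td><p>bb</p></td>", and the "</td>" in the expression matches the last "</td>" in the string.

Example 2: by contrast, the expression "<td>(.*?)</td>" against the same string yields only "<td><p>aa</p></td>"; matching again for the next one yields the second, "<td><p>bb</p></td>".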
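The greedy and non-greedy cases above can be compared side by side with a short Python sketch (Python is my assumption; the syntax is the same in most backtracking engines). With one capturing group, `re.findall` returns just the captured text.

```python
import re

text = "dxxxdxxxd"
greedy = re.search(r"(d)(\w+)", text).group(2)
greedy_bounded = re.search(r"(d)(\w+)(d)", text).group(2)
lazy = re.search(r"(d)(\w+?)", text).group(2)
lazy_bounded = re.search(r"(d)(\w+?)(d)", text).group(2)
print(greedy, greedy_bounded, lazy, lazy_bounded)  # xxxdxxxd xxxdxxx x xxx

html = "<td><p>aa</p></td> <td><p>bb</p></td>"
cells_greedy = re.findall(r"<td>(.*)</td>", html)
cells_lazy = re.findall(r"<td>(.*?)</td>", html)
print(cells_greedy)  # ['<p>aa</p></td> <td><p>bb</p>']
print(cells_lazy)    # ['<p>aa</p>', '<p>bb</p>']
```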


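2.2 Backreferences: \1, \2, ...

While matching, the engine records the string matched by each expression enclosed in parentheses "( )". When the result is read, the string matched by a parenthesized expression can be retrieved on its own, as several earlier examples have shown. In practice, when you search using some kind of boundary but the content you want does not include the boundary, you must use parentheses to mark the range you want, as in the earlier "<td>(.*?)</td>".

In fact, "the string matched by a parenthesized expression" is available not only after matching finishes but also during matching. A later part of the expression can refer to "the string already matched by an earlier parenthesized sub-match". The reference is written as "\" followed by a number: "\1" refers to the string matched by the first pair of parentheses, "\2" to the second, and so on. If one pair of parentheses contains another, the outer pair is numbered first. In other words, pairs are numbered in the order of their opening parenthesis "(".

Examples:

Example 1: the expression "('|")(.*?)(\1)" against " 'Hello', "World" ": the match succeeds; the matched text is " 'Hello' ". Matching again for the next one yields " "World" ".

Example 2: the expression "(\w)\1{4,}" against "aa bbbb abcdefg ccccc 111121111 999999999": the match succeeds; the matched text is "ccccc". Matching again for the next one yields 999999999. This expression requires the same "\w" character to repeat at least 5 times; note the difference from "\w{5,}".

Example 3: the expression "<(\w+)\s*(\w+(=('|").*?\4)?\s*)*>.*?</\1>" against "<td id='td1' style="bgcolor:white"></td>": the match succeeds. If "<td>" and "</td>" do not pair up, the match fails; if both are changed to some other matching pair, the match succeeds again.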
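A Python sketch of the three backreference examples (Python is assumed; the mismatched-tag input is a test string of my own). Raw triple-quoted strings are used so that both quote characters can appear in the pattern unescaped.

```python
import re

quoted = re.compile(r"""('|")(.*?)\1""")
quotes = [m.group(0) for m in quoted.finditer("""'Hello', "World" """)]
print(quotes)  # ["'Hello'", '"World"']

text = "aa bbbb abcdefg ccccc 111121111 999999999"
same_char_runs = [m.group(0) for m in re.finditer(r"(\w)\1{4,}", text)]
any_word_runs = re.findall(r"\w{5,}", text)
print(same_char_runs)  # ['ccccc', '999999999']
print(any_word_runs)   # ['abcdefg', 'ccccc', '111121111', '999999999']

tag = re.compile(r"""<(\w+)\s*(\w+(=('|").*?\4)?\s*)*>.*?</\1>""")
paired = tag.search("""<td id='td1' style="bgcolor:white"></td>""")
mismatched = tag.search("<td id='td1'></tr>")
print(paired is not None, mismatched)  # True None
```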


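2.3 Lookahead and lookbehind (both consume no characters)

Earlier sections covered several special symbols with abstract meanings: "^", "$", and "\b". They share one property: they consume no characters themselves and only attach a condition to "the ends of the string" or to "a gap between characters". With that idea in place, this section introduces another, more flexible way to attach a condition to an "end" or a "gap".

Lookahead: "(?=xxxxx)", "(?!xxxxx)"

Form "(?=xxxxx)": the condition it attaches to its "gap" or "end" in the string is that the text to the right of the gap must match the expression xxxxx. Because it is only a condition attached to the gap, it does not stop the rest of the expression from actually matching the characters after the gap. This is like "\b", which consumes no characters itself: "\b" merely inspects the characters before and after the gap and does not affect what the following expression really matches.

Example 1: the expression "Windows (?=NT|XP)" against "Windows 98, Windows NT, Windows 2000" matches only the "Windows " in "Windows NT"; the other occurrences of "Windows " are not matched.

Example 2: the expression "(\w)((?=\1\1\1)(\1))+" against the string "aaa ffffff 999999999" matches the first 4 of the 6 "f"s and the first 7 of the 9 "9"s. The expression can be read as: for a letter or digit repeated 4 or more times, match everything before its last 2 characters. The expression need not be written this way, of course; it is here for demonstration.

Form "(?!xxxxx)": the text to the right of the gap must not match the expression xxxxx.

Example 3: the expression "((?!\bstop\b).)+" against "fdjka ljfdl stop fjdsla fdj" matches from the start up to the position just before "stop"; if the string contains no "stop", it matches the whole string.

Example 4: the expression "do(?!\w)" against the string "done, do, dog" matches only the standalone "do". In this example, putting "(?!\w)" after "do" has the same effect as using "\b".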
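The four lookahead examples, sketched in Python (assumed for illustration). The spans show that the lookahead text itself is not part of the match.

```python
import re

text = "Windows 98, Windows NT, Windows 2000"
nt_spans = [m.span() for m in re.finditer(r"Windows (?=NT|XP)", text)]
print(nt_spans)  # [(12, 20)]

runs = [m.group(0) for m in re.finditer(r"(\w)((?=\1\1\1)(\1))+", "aaa ffffff 999999999")]
print(runs)  # ['ffff', '9999999']

before_stop = re.search(r"((?!\bstop\b).)+", "fdjka ljfdl stop fjdsla fdj").group(0)
print(repr(before_stop))  # 'fdjka ljfdl '

do_spans = [m.span() for m in re.finditer(r"do(?!\w)", "done, do, dog")]
print(do_spans)  # [(6, 8)]
```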

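Lookbehind: "(?<=xxxxx)", "(?<!xxxxx)"

These two forms work like lookahead, except that the condition applies to the left side of the gap rather than the right: the first requires that the text to the left matches the given expression, the second that it does not. As with lookahead, both are conditions attached to the gap and consume no characters themselves.

Example 5: the expression "(?<=\d{4})\d+(?=\d{4})" against "1234567890123456" matches the middle 8 digits, leaving out the first 4 and the last 4. Because JScript.RegExp does not support lookbehind, this example cannot be demonstrated on the test page. Many other engines do support lookbehind, for example the java.util.regex package in Java 1.4 and later, the System.Text.RegularExpressions namespace in .NET, boost::regex, and the GRETA regular expression library.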
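Python's `re` module is another engine that supports lookbehind, so the example can be sketched there (Python is my addition, not one of the engines the article lists). One restriction specific to Python's `re`: the lookbehind pattern must have a fixed width, so "\d{4}" is accepted but "\d+" is rejected at compile time.

```python
import re

middle = re.search(r"(?<=\d{4})\d+(?=\d{4})", "1234567890123456").group(0)
print(middle)  # 56789012

# Variable-width lookbehind is rejected by Python's re module.
try:
    re.compile(r"(?<=\d+)x")
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)  # False
```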


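3. Other common rules

A few more rules are fairly common across regular expression engines but were not mentioned above.

3.1 In an expression, "\xXX" and "\uXXXX" can each stand for one character ("X" is a hexadecimal digit)

Form       Character range
\xXX       characters numbered 0 to 255; e.g. a space can be written "\x20"
\uXXXX     any character can be written as "\u" followed by its 4-digit hexadecimal code; e.g. "\u4E2D"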
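A brief Python sketch of both escapes (Python assumed; its `re` module accepts "\uXXXX" in `str` patterns). Raw strings are used so that the regex engine, not the Python string literal, interprets the escape.

```python
import re

space_at = re.search(r"\x20", "a b").start()
letter_a = re.fullmatch(r"\x41", "A") is not None
zhong = re.fullmatch(r"\u4E2D", "中") is not None
print(space_at, letter_a, zhong)  # 1 True True
```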

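3.2 While "\s", "\d", "\w", and "\b" have special meanings, the corresponding uppercase letters mean the opposite

Expression    Matches
\S            any non-whitespace character ("\s" matches the whitespace characters)
\D            any non-digit character
\W            any character other than a letter, digit, or underscore
\B            a non-word-boundary, i.e. a gap whose two sides are both in the "\w" range or both outside it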
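A small Python sketch of the uppercase classes (Python assumed; the test strings are my own). The "\B" case finds "end" only where it is glued to a preceding word character.

```python
import re

non_digits = re.findall(r"\D+", "ab12cd")
non_spaces = re.findall(r"\S+", "a b\tc")
non_word = re.findall(r"\W", "a_b-c!")
print(non_digits, non_spaces, non_word)  # ['ab', 'cd'] ['a', 'b', 'c'] ['-', '!']

# "\Bend" skips the standalone "end" because a word boundary precedes it.
inner_end = [m.start() for m in re.finditer(r"\Bend", "weekend end bend")]
print(inner_end)  # [4, 13]
```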

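3.3 Summary of characters that have a special meaning in an expression and need a "\" to match the character itself

Character    Description
^            matches the start of the input string. To match "^" itself, use "\^"
$            matches the end of the input string. To match "$" itself, use "\$"
( )          mark the start and end of a sub-expression. To match parentheses, use "\(" and "\)"
[ ]          define a custom "any one of several characters" expression. To match square brackets, use "\[" and "\]"
{ }          repetition modifier. To match braces, use "\{" and "\}"
.            matches any single character except the newline (\n). To match the dot itself, use "\."
?            repeat 0 or 1 times. To match "?" itself, use "\?"
+            repeat at least once. To match "+" itself, use "\+"
*            repeat zero or more times. To match "*" itself, use "\*"
|            "or" between the expressions on its left and right. To match "|" itself, use "\|"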
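When the text to be matched literally comes from elsewhere, escaping by hand is error-prone. As a sketch in Python (my choice of language; the sample text is made up), the standard `re.escape` function adds the backslashes for you:

```python
import re

literal = "1+1=2?"
text = "is 1+1=2? yes"

# Used as-is, "+" and "?" act as quantifiers, so the pattern does not find the text.
raw_result = re.search(literal, text)
print(raw_result)  # None

# re.escape backslash-escapes the special characters so they match themselves.
escaped_result = re.search(re.escape(literal), text).group(0)
print(escaped_result)  # 1+1=2?
```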

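3.4 If the match of a sub-expression in parentheses "( )" should not be recorded for later use, write it as "(?:xxxxx)"

Example 1: the expression "(?:(\w)\1)+" against "a bbccdd efg" yields "bbccdd". The match of the "(?:)" parentheses is not recorded, so "(\w)" is the first recorded group and is referred to with "\1".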
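A Python sketch of the non-capturing group (Python assumed). The compiled pattern's `groups` attribute shows how many groups are recorded, and the capturing variant is included for contrast.

```python
import re

m = re.search(r"(?:(\w)\1)+", "a bbccdd efg")
doubled = m.group(0)
recorded = m.groups()
print(doubled, recorded)  # bbccdd ('d',)

# Only the inner (\w) is numbered; the capturing variant has two groups.
non_capturing_count = re.compile(r"(?:(\w)\1)+").groups
capturing_count = re.compile(r"((\w)\2)+").groups
print(non_capturing_count, capturing_count)  # 1 2
```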

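3.5 A brief look at common expression options: Ignorecase, Singleline, Multiline, Global

Ignorecase: By default, letters in an expression are case-sensitive. Setting Ignorecase makes matching case-insensitive. Some engines extend the notion of "case" to the whole Unicode range.

Singleline: By default, the dot "." matches any character except the newline (\n). Setting Singleline makes the dot match every character, newlines included.

Multiline: By default, "^" and "$" match only the start ① and the end ④ of the string. For example:

①xxxxxxxxx②\n
③xxxxxxxxx④

Setting Multiline makes "^" match not only ① but also the position ③ after a newline, before the next line begins, and makes "$" match not only ④ but also the position ② before a newline, where a line ends.

Global: Mainly relevant when the expression is used for replacement; setting Global replaces every match.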
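The option names above vary by engine. As a sketch of how they map onto Python's `re` module (my choice of engine): Ignorecase is `re.IGNORECASE`, Singleline is `re.DOTALL`, Multiline is `re.MULTILINE`, and there is no Global flag because `re.sub` replaces every match unless `count` limits it.

```python
import re

ignorecase = re.search(r"hello", "HeLLo", re.IGNORECASE) is not None

# Singleline: "." also matches the newline.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)

# Multiline: "^" and "$" also match at line breaks.
text = "one\ntwo"
starts_default = re.findall(r"^\w+", text)
starts_multi = re.findall(r"^\w+", text, re.MULTILINE)
ends_default = re.findall(r"\w+$", text)
ends_multi = re.findall(r"\w+$", text, re.MULTILINE)
print(starts_default, starts_multi)  # ['one'] ['one', 'two']
print(ends_default, ends_multi)      # ['two'] ['one', 'two']

# Global: replace every match, or only the first with count=1.
replace_all = re.sub(r"a", "x", "banana")
replace_first = re.sub(r"a", "x", "banana", count=1)
print(replace_all, replace_first)  # bxnxnx bxnana
```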


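4. General tips

4.1 If the expression must match the entire string rather than some part of it, put "^" and "$" at its two ends. For example, "^\d+$" requires the whole string to consist of digits only.

4.2 If the match must be a complete word rather than part of a word, put "\b" at both ends of the expression. For example, use "\b(if|while|else|void|int……)\b" to match keywords in a program.

4.3 Do not let the expression match the empty string. Otherwise it will keep reporting success while matching nothing. For example, when writing an expression for the forms "123", "123.", "123.5", and ".5", the integer part, the decimal point, and the fractional digits can each be omitted, but do not write the expression as "\d*\.?\d*", because it also succeeds when nothing is there at all. A better way to write it is "\d+\.?\d*|\.\d+".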
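Tips 4.1 to 4.3 can be sketched in Python (an assumption on my part; the sample strings are my own). One Python-specific detail: "$" also matches just before a trailing newline, so `re.fullmatch` is the stricter way to require that the whole string match.

```python
import re

# 4.1: anchor both ends to require a whole-string match.
all_digits = re.search(r"^\d+$", "12345") is not None
mixed = re.search(r"^\d+$", "12a45") is not None
trailing_newline = re.search(r"^\d+$", "123\n") is not None  # "$" tolerates a final "\n"
strict = re.fullmatch(r"\d+", "123\n") is not None
print(all_digits, mixed, trailing_newline, strict)  # True False True False

# 4.2: "\b" on both sides keeps keywords from matching inside longer words.
keywords = re.findall(r"\b(?:if|while|else|void|int)\b", "int x; if (print) while_loop else")
print(keywords)  # ['int', 'if', 'else']

# 4.3: a pattern that can match nothing "succeeds" at every position.
empties = re.findall(r"\d*\.?\d*", "ab")
numbers = re.findall(r"\d+\.?\d*|\.\d+", "123 123. 123.5 .5 abc")
print(empties)  # ['', '', '']
print(numbers)  # ['123', '123.', '123.5', '.5']
```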

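4.4 Do not repeat without limit a sub-match that can match the empty string. If every part of a parenthesized sub-expression can match 0 times and the parentheses as a whole can repeat without limit, the result may be worse than in the previous tip: the matching process may loop forever. Some engines, such as the .NET one, now have ways to avoid the infinite loop, but we should still steer clear of this situation. If an expression you are writing hangs, this is a good place to start looking for the cause.

4.5 Choose sensibly between greedy and non-greedy mode; see the topic discussion.

4.6 For "|", it is best if a given character can be matched by only one side. That way the result does not change when the two sides of "|" swap places.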
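Tip 4.6 is easiest to see when both sides of "|" can start with the same character. A minimal sketch in Python (assumed for illustration; it tries alternatives left to right, as backtracking engines generally do):

```python
import re

# The left alternative wins as soon as it lets the overall match succeed.
short_first = re.search(r"a|ab", "ab").group(0)
long_first = re.search(r"ab|a", "ab").group(0)
print(short_first, long_first)  # a ab
```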


5. More regular expression topics

Visit "Regular Expression Topics" for further discussion of how to apply regular expressions.


RailsConf 2006 some presentation files(zz)

Really worth reading. I found these really helpful: the performance talk, the deployment talk, hell, all of them are great.

Source: do |r| Ruby & Rails end

Lightly reorganized here.



Interviewing the JRuby Developers (zz)

A nice interview on O'Reilly Ruby. There has always been buzz about Jython, Python.NET, and PyPy (which, by the way, released some video of the project not long ago; maybe there is some inspiration to be found there). Now Ruby is on its way too. After reading the article I skimmed through these developers' blogs; nice work, I have to say.
 

Interviewing the JRuby Developers

Monday July 17, 2006 9:32AM
by pat eyler

Alternative Ruby implementations seem to be on the move throughout the Ruby community. JRuby is the furthest along at this point, so I decided to talk to Charles Nutter and Thomas Enebo, two of the principal programmers on the project. Read on to hear what they have to say about Ruby, JRuby, and the art of re-implementing Ruby.


How did you find your way to Ruby?

Charles Nutter: I had begun to hear more about Ruby from friends in the fall of 2004 and was just starting to learn it. I discovered that RubyConf 2004 would be held only a few miles from Ventera’s home office, in Reston VA, so I planned a trip out east to attend. I was excited about the language’s potential and loved the enthusiasm and small size of the conference. I wondered to myself whether there might be a Ruby implementation for Java. While sitting at the conference, on day one of my ongoing Ruby experience, I discovered the JRuby project being led at the time by my friend and former co-worker Thomas Enebo. That was the beginning.

Thomas Enebo: I had been using Perl since the early nineties and I was fairly happy in that space. I like looking at new things and I happened across someone’s web page showing a re-implementation of a Perl script in Ruby (I cannot remember what or where). I was gobsmacked. I immediately ordered Dave Thomas’s Pickaxe book and I have been hooked ever since.

What benefits do you think JRuby will bring to the Ruby community?

Charles: Java has been in use for over a decade, and the language, runtimes, and tools have been freely available all that time. The result of this openness is that Java applications and libraries exist for just about every purpose under the sun. Ruby has really only existed in its current form for a few years, even though its full lifespan is older than Java’s. Because Ruby has evolved over a long period of time and only recently come under broad public scrutiny, there are not nearly as many libraries or projects written in Ruby. I hear people frequently asking for feature X in Ruby that they’ve used as library Y in Java. The usual answer from Rubyists is to wrap that library in a web service (on a Java server) or call it over a Ruby-Java bridge. Neither option scales well, however, and they don’t address the fact that the Ruby world is very young and is missing many libraries people want. Enabling Ruby developers to call all those same libraries without a bridge and without a web service puts the power of ten years of Java into their hands.

Thomas: The largest benefit is the additional libraries and options that the Java ecosystem brings to a Ruby developer. You need to leverage some library to parse GPS data? You can bet there is a Java library somewhere that does it. Even for Ruby libraries that already solve a particular problem, the Java equivalent may scratch that itch better.

Charles: Ruby also suffers from a slower-than-average implementation. The current version of Ruby proper is written in C and is a pure interpreter; i.e. it does no compilation and little optimization of the incoming code. This has resulted in Ruby running a fair bit slower than comparable languages like Python and Perl and far slower than Java. Meanwhile the JVM is an astounding piece of technology that has enabled Java code to match or exceed compiled C code in some cases. I believe that JRuby could soon become a highly optimized Ruby platform, and that a Ruby-to-bytecode compiler could conceivably make JRuby faster than “C Ruby” for many applications. We hope to match C Ruby’s speed first, and then focus on compilation to take Ruby and JRuby to the next level.

Thomas: Another benefit is an easier path to adoption in an “Enterprise” environment. I quote “Enterprise”, because I am talking about it in the nebulous “this is a production environment and we need real software for real problems” business mentality that gets married to the term “Enterprise”. Java has passed this hurdle. Ruby still has not made it there yet in most places. Deploying Ruby-embedded Java server apps is an easier sell than getting an IT shop to deploy a Rails application. Embedding Ruby in Java also helps raise awareness of Ruby as a language. Over time, I see Ruby-embedded Java applications growing the Ruby community. A bigger community will open more doors and gain “Enterprise” acceptance faster.

The last benefit is that by having alternate implementations of the Ruby programming language there will be a stabilizing force on the main C implementation and/or language specification. By saying this, I largely mean that when we implement an aspect of Ruby and get a weird result, then we fling an email asking what the behavior should be. This can help expose strange undiscovered corner cases. For sure, it provides a sanity check to implement the same behavior twice.

What about the Java community?

Charles: Ruby is a beautiful, elegant language. Java is a very utilitarian, practical language, but few would call it beautiful. Folks that prefer Ruby love how easy it is to read and write, and how quickly they can implement even complex code with it. Ruby also provides other language features that are very programmer-friendly like dynamic typing, closures, lightweight threading, and many more that Java does not have. Ruby has often been described as a great language in which to write “glue” code, which in the Unix world often means tying together other scripts and programs. However, much of what we do in the Java world is simply gluing together libraries and applications to perform a new function. I believe that Ruby is a perfect language to replace Java in those areas, as the “logic” that makes Java applications go. I also believe that “the Ruby way” of writing software, exemplified by applications like Ruby on Rails, can empower the Java platform to do great things most Javaists may never have thought possible.

Thomas: Java is a decent statically-typed language and the JVM is a great runtime. The Java language really is not a one-size-fits-all solution. Ruby is a very good complement to Java. It allows you to leverage its open definitions and dynamic typing to solve some problems very elegantly. Ruby’s syntax in particular really helps manipulate the static Java code in much less space. Embedding Ruby into a Java project can dramatically reduce the complexity and size of the project.

Besides embedding, just the ability to script Java and make quick tools has been a boon for me. You can create a little test case in 10-12 lines or make an administrative tool. You can mix and match Ruby and Java libraries. I wrote a simple xslt transformer using Ruby’s ‘optparse’ and Java’s built-in XSLT transformer. Use the best of both worlds.

Finally, I think it cannot be overstated that the JVM needs to become a friendly multi-lingual virtual machine. Projects like JRuby, Jython, Groovy, and the many others help to keep the JVM an innovative space. This will end up yielding more specialized tools for Java and more options for Java developers. I believe the future of Java is in its JVM and less in its language.

A lot of other folks are working on Ruby implementations, what sets JRuby apart?

Thomas: I believe we are the closest to being a complete alternative to what Ruby is today. We have re-implemented the important C external libraries and have gotten to the point that Rake, Ruby on Rails, WEBrick, and RubyGems are mostly working. This distinction will be lost over time, since I hope the other implementations succeed and provide additional decent Ruby implementations. It is a clear distinction right now though.

Charles: JRuby is by far the furthest along. We have a working interpreter that is very close to “compatible” with C Ruby version 1.8.4. We have a number of popular Ruby applications working under JRuby including IRB, RubyGems, Rake, and Rails. Rails is especially indicative of our progress; it is arguably the most complex Ruby application available today. With our compatibility and “correctness” approaching C Ruby, we are also in a better position to start advancing JRuby’s core with a compiler and new optimizations. Because we have no C extensions to support (as does C Ruby) we have the flexibility to make sweeping architecture changes within JRuby to better support the needs of the Ruby language. JRuby has actually followed the same general course C Ruby has; first an interpreter, then various optimizations, then a new VM and compiler. We are now approaching some level of completion on the interpreter and have started into optimization. We have also started looking at what would be required to compile Ruby code into Java bytecode.

Many of the other alternative Ruby implementations have taken the approach of writing a compiler first. This allows them to get a speed boost from the beginning, but also ties them to certain implementation and design decisions that might later prove to be an issue. Then there’s the fact that much Ruby code and many Ruby libraries are not yet ready for compilation; they expect .rb files to be loose on the filesystem, and frequently go looking for them. Those sorts of issues will bite the early compilers pretty hard, since they will have to contend not only with Ruby’s quirkier features but with its own libraries’ lack of readiness for next-generation VM and compiler designs.

Thomas: We are also an implementation on the JVM. This allows tight integration with Java and other JVM-backed languages. Living in a single language environment gives a nice sense of purity, but I largely think most developers are realizing this purity is out-moded. Some languages solve some problems really well and others not so much. Plugging into a functional language then giving the result to a Java class that is glued together by Ruby may seem scary; but I look at this as having choices to pick the right tools for the right jobs. An environment that provides you with choices is only a good thing in my eyes. I see the JVM as being this environment.

How do you think the various projects working on Ruby implementations can work together to help improve Ruby as a whole?

Charles: I’d say there’s many ways.

  • We could work together to start creating more formal specifications of the Ruby language and libraries. The C Ruby folks obviously have the best grasp of what would be in such a spec, but it’s mostly wrapped up in their heads and in the C code. We alternative implementations have to glean from Q&A and code inspection what is the intended behavior for various aspects of Ruby in order to correctly implement that behavior. In the process, we could be assembling a specification as we go; but I don’t believe any of us are.
  • We could work together to create libraries of tests that exercise various aspects of Ruby. Currently available tests for Ruby are rather sparse. There’s the Rubicon project, created by the Pragmatic Programmers, but it tests only a small portion of Ruby and has only been partially updated to Ruby 1.8 semantics (having been based originally on Ruby 1.6). There’s the test cases within Ruby’s own source, but those tests only seem to exercise peculiar edge cases reported as bugs; they come nowhere near a complete unit test suite. There are test cases within various applications like Rails or RubyGems; however, those also are not complete Ruby unit test cases since they test only what is relevant to those applications. The MetaRuby guys claim to have a large library of tests, but they have never been released or made public. There is, however, a renewed interest in creating an extensive suite of test cases being pushed by Dr. Wayne Kelly of Queensland University in Australia, the lead of a Microsoft-funded effort to create a “Ruby.NET” compiler. He, like the rest of us alternative implementers, very much wants to have a complete unit test suite to verify his implementation. Hopefully his efforts will bear fruit.
  • We could share pure-ruby implementations of particular libraries. We have taken the approach that if Ruby uses a native library or extension for some functionality, we first implement it in pure Ruby to get things running. Implementing in Ruby goes very fast, and allows us to continue on with other things. We eventually circle back around and reimplement those libraries in Java for performance, but the original pure-Ruby versions could be useful to other projects.
  • Documentation, documentation, documentation. We’ve discovered many quirks about how Ruby works that are not documented anywhere. We should be creating a central store of such facts for all to use.

Thomas: I think multiple implementations of Ruby will end up expanding the visibility of Ruby as a language. Just having reasonable JVM and .NET CLR implementations has great potential to get many new Ruby enthusiasts.

Also hopefully it may clarify multi-platform inconsistencies in the C Ruby implementations and perhaps have an influence on them. The Windows port of Ruby, for example, does not obey the same exact semantics as the linux/unix versions. Dir.glob with backslashes is an example. Java as its own platform needs to behave on both these systems sanely. Java deals with the cross-platform issue by a lowest common denominator approach which I think is bad; and I don’t think we want to encourage Ruby in that direction too much. I do think we want to help identify issues and come up with elegant cross-platform ways of doing things. For example, eating output from executed commands is generally some strange application code and platform check to dump to /dev/null or whatever the thing is on windows. What is wrong with this picture? Having more implementations/platforms of Ruby will help highlight issues like this.

I would really like to see multiple implementations of Ruby the language end up generating a consistent definition of Ruby that is not determined by someone changing a line of C code. I give great credit to the C Ruby developers in designing a great language, but the lack of a formal language specification will eventually become a liability. The more implementations hopefully will highlight the need for a formal language specification.

My last comment is a prediction. I predict that alternate implementations will give the main C implementors and language designers new ideas. Perhaps the idea will be architectural/implementation specific. Could even be an idea that creeps in because of some language integration feature from one of the alternate implementations. I think fresh perspectives always improve any creative process.

What are your 5 favorite libraries/frameworks for Ruby (whether in the standard library, or off the ‘Net)?

Charles:

  • Rails has to be up there somewhere just for pure ingenuity; they’ve taken the “ruby way” to its logical end and built a more dynamic framework than I’ve ever seen before.
  • RubyGems is a packaging system on par with the best options under Python, Perl, or Linux.
  • Rake is a great build tool; I’d like to create tasks for it to allow replacing Ant scripts.
  • Mongrel is very intriguing, and I’ve been trying to think of ways to implement the same under JRuby (Mongrel uses a native C library we can’t replicate). It has the potential to really expand Ruby’s use as a web platform.
  • Our own support for Java integration has to be in my top five as well; it’s so seamless and works so well after recent fixes, updates, and optimizations that you truly can write Ruby code calling Java libraries and never feel like you’re using Java at all.

Thomas: I also included Ruby applications in this:

  • JRuby’s ‘java’ package for java integration. I write Ruby scripts out of Java libraries quite a bit.
  • Ruby on Rails. I have been learning Rails at the same time as I have been debugging problems with it in JRuby. I am sure I have a warped view of it. It is very powerful and I like it.
  • RubyGems. Pretty slick. Every language needs dependency-based package management for libraries.
  • IRB. Interactive debugging is great. I use it a lot.
  • DRb. I do not use it, but I love stuff like this and I used up a slot because I think it deserves a mention.

    What’s next for Ruby?

    Thomas: I don’t work on the C Ruby project, but it seems YARV and Ruby 2.0 are the next big thing. It will be cool to see how it is received and how it ends up performing. It sounds like they have some significant speedups.

    It would also be nice for a consistent GUI framework to emerge in the Ruby world. Java taught me that even a single crufty GUI API has a bigger impact than thirty others (think X11 here). I guess this is just a re-iteration of my cross-platform desire for Ruby.

    Charles: I have heard folks semi-familiar with Ruby comment that Ruby the language seems to be one of the best, but Ruby the C implementation seems to be one of the worst. I would not go that far; matz and company have done some amazing things with the C implementation. However, I believe Ruby desperately needs to take a leap to the next level, from a simple scripting “glue” language to a real application platform. The potential is there if they can clean up various reliability and performance issues. With the rapid uptake of Ruby on all platforms, it could quickly become the standard among dynamically typed languages, more so than Python or PHP. On the JVM, it could easily surpass all other scripting languages due to its vibrant community and wealth of applications (especially considering many of those applications will actually run in JRuby). We’re just now seeing the early upswing in Ruby acceptance…but it’s moving very fast.

    What’s next for JRuby?

    Charles: Performance, performance, performance. JRuby is still far slower than we’d like it to be and quite a bit slower than C Ruby. There’s no reason that we couldn’t match or exceed Ruby 1.8’s performance, and with a compiler we should be able to exceed it. There are still compatibility issues to be worked out, but that may always be the case. With so many great apps already starting to work in JRuby, improving performance has become my main priority.

    Other than performance, I want to make Ruby apps work better within a Java world. For example, Rake could be an excellent replacement for Ant. By building Rake tasks that can call javac, etc, like Ant does, Rake would easily replace Ant for many, many builds. I’m looking forward to the day I can replace my 5000-line Ant script with a far shorter Rake script and have everything work like it does today. Another example is Rails…we already have an ActiveRecord to JDBC adapter, but it’s not complete yet. It should enable total database independence, but that will take a bit more work. We’d also like to be able to use ActiveRecord as a facade for entity beans or other persistence frameworks like Hibernate. Continuing to evolve and adapt Ruby frameworks to take advantage of Java under JRuby will be very important over the next several months.

    Thomas: Where to start:

    • Performance. Charlie is determined to make JRuby fly without wings or rocket fuel. We want things fast enough where people can consider it for Ruby on Rails work.
    • Correctness. We still have bugs and we have been rapidly fixing them. It is an ongoing task that hopefully will start to diminish more soon.
    • Java integration. I really want to be able to extend concrete and abstract Java classes from within Ruby and have Java consumers see that behavior. Right now we only support extending (implementing) Java interfaces. In addition to this, I want to create a type-mapping system so that JRubyists can define their own type conversion mapping across the Java/Ruby boundary.
    • Domain Helpers. If you look at packages like Spring or Groovy they provide a lot of domain-specific helpers to make a developer’s life simpler. I think we will be starting and encouraging projects in that spirit. Antbuilder on Rubyforge is an example of this.
    • JEE web container support for Ruby on Rails. We will be making a Servlet capable of running Rails that integrates well. This will yield tons of interesting side projects like calling EJBs or integration with other persistence libraries. Use whatever part of Rails you want with whatever Java part you want.

    What’s next for you two?

    Thomas: JRuby is consuming a lot of my spare time. I have a day job making Java web applications. We will continue to improve JRuby and probably write a book. We will even evangelize more and do another conference at some point this year (we got the opportunity to talk at JavaOne this year and it was great to meet people interested in JRuby).

    Charles: Well, we’re obviously neck-deep in this JRuby thing! I want to continue working to improve JRuby, but also to start expanding into Ruby apps and how they work within the Java world as mentioned above. There’s also going to be a need for support and development services around JRuby. In fact, I believe there’s enough momentum behind Ruby and JRuby to warrant full-time resources on these projects. C Ruby has matz full-time plus a number of other folks that are probably half-time or better. JRuby only has Tom and me working off-hours and weekends, and that includes work outside the core interpreter like Rails, RubyGems, Rake, and so on. It would greatly increase JRuby’s progress if we had dedicated, full-time resources. Considering what we’ve done part-time over the past six months, imagine what could be possible!

    At any rate, Tom and I plan to start working on a JRuby book around the end of this summer, depending on how the JRuby 1.0 release goes. We’ve had many people ask about a book, and we know there’s a mass of Java developers out there that really want or need Ruby in their lives. JRuby the book will help make that possible.

    I am also interested in helping the other Ruby implementations in any way I can. I want to see Ruby succeed everywhere, and I’d be happy as a clam dedicating all my waking hours to making that happen. The Ruby.NET project could certainly use a few extra hands, and the Cardinal project (Ruby on Parrot) seems to have stalled recently. Ruby is a hard language to implement, but the rewards are great; I like hard problems, so I’ll do everything I can to move Ruby forward.


    Charles Nutter has been a Java developer since 1996, recently working as the senior Java architect at Ventera Corp. He led the open-source LiteStep project in the late 90s and came to Ruby in the fall of 2004. Since then he has been a member of the JRuby team, helping to make it a true alternative Ruby platform. Charles presented JRuby at RubyConf 2005 and co-presented at JavaOne 2006 with Thomas Enebo. He hopes to co-write a JRuby book this fall with Thomas to follow up a planned JRuby 1.0 release. Charles currently works on a Ventera contract for the USDA’s Food and Nutrition Service at their office in Minneapolis. Charles blogs on Ruby and Java at headius.blogspot.com

    Thomas Enebo is project manager and a developer of the open source project JRuby. He is a developer at the University of Minnesota and a consultant with Aandtech Inc. Tom has been using Java in some fashion since its first public beta release. He became interested in Ruby after seeing an elegant re-implementation of some Perl code. Tom joined the JRuby project some time in late 2002. He blogs about JRuby at www.bloglines.com/blog/ThomasEEnebo

    Pat is an Infrastructure Engineer for the LDS Church by profession, a Ruby geek by choice, and a writer by night. He enjoys reading, cooking, spending time with his family, and helping to build the Ruby community. Pat currently writes for: O’Reilly, APress, Linux Journal, and on-ruby.blogspot.com (his own blog).

2006/7/23

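Weekend notes

Last night I went shopping at Carrefour and spotted a very eye-catching book in the Bertelsmann shop window, The Parents Answer Book, so I bought it and mailed it to my brother today, hehe. It looks like a pretty good book.
They actually make you join the Bertelsmann club before you buy and then give you a discount, sigh. Thinking back, I was never a member, but the few really good books I read from there were all bought by friends who were members and lent to me: Sophie's World, The Diary of Anne Frank, and what else, hehe...
Today I went out for a meal and dropped by the Book Building to browse, then got trapped inside by heavy rain, so I read a good chunk of Head First Java. I quite like the "for dummies" feel of this series. I focused on the section about threads, which are used much the same way as threads in Python and Ruby, though perhaps both simply learned from Java. Back home I found that someone had proposed PEP 319 to bring synchronize into Python, and that it was rejected in favor of PEP 343. Kind of interesting.
The Book Building has put lots of very cute little stools inside, which is quite thoughtful.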
2006/7/20

ASM algorithm and Levenshtein Distance(edit-distance)

A problem at work, plus this humanreadablediff.py
(which doesn't pass its own test_2; I will see what I can do about that, but on the whole it is nice code),
got me reading about the approximate string matching (ASM) problem. There is a paper:

AN EXTENSION OF UKKONEN'S ENHANCED DYNAMIC PROGRAMMING ASM ALGORITHM

Hal Berghel, University of Arkansas

David Roach, Acxiom Corporation

 

on http://berghel.net/publications/asm/asm.php

There is some discussion here: http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit/

It covers the O(|s1|*|s2|) time complexity and the O(|s1|*|s2|), or O(n^2), space complexity when the whole matrix is kept for a trace-back to find an optimal alignment. If only the value of the edit distance is needed, only two rows of the matrix need be allocated; they can be "recycled", and the space complexity is then O(|s1|) or O(n). It also lists some applications: Unix diff, remote screen update, spelling correction, plagiarism detection, molecular biology, speech recognition, etc.
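To make the trace-back idea concrete, here is a minimal Python 3 sketch that keeps the whole matrix and then walks back from the bottom-right cell to recover one optimal edit script. The function name `edit_script` and the operation labels are my own choices for illustration; they do not come from the paper or the page linked above.

```python
def edit_script(s1, s2):
    """Return (distance, ops), where ops is one optimal way to turn s1 into s2."""
    n, m = len(s1), len(s2)
    # d[i][j] is the edit distance between s1[:i] and s2[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete from s1
                          d[i][j - 1] + 1,         # insert from s2
                          d[i - 1][j - 1] + cost)  # keep or substitute

    # Trace back from the bottom-right cell to the top-left cell.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        cost = 0 if (i > 0 and j > 0 and s1[i - 1] == s2[j - 1]) else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost:
            ops.append(("keep" if cost == 0 else "substitute", s1[i - 1], s2[j - 1]))
            i -= 1
            j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("delete", s1[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", s2[j - 1]))
            j -= 1
    ops.reverse()
    return d[n][m], ops


if __name__ == "__main__":
    distance, ops = edit_script("kitten", "sitting")
    print(distance)
    for op in ops:
        print(op)
```

The trace-back is the part that needs the full matrix; the two-row trick only gives the number, not the script.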

And here is all the code you can learn/steal from:

http://www.merriampark.com/ld.htm

and from Wikipedia.

I copy the ones I like here.

 

Levenshtein Distance Algorithm: Perl Implementation

by Eli Bendersky

 
#!/usr/local/bin/perl -w

use strict;

($#ARGV == 1) or die "Usage: $0 <string1> <string2>\n";

my ($s1, $s2) = (@ARGV);

print "The Levenshtein distance between $s1 and $s2 is: " . levenshtein($s1, $s2) . "\n";



# Return the Levenshtein distance (also called Edit distance) 
# between two strings
#
# The Levenshtein distance (LD) is a measure of similarity between two
# strings, denoted here by s1 and s2. The distance is the number of
# deletions, insertions or substitutions required to transform s1 into
# s2. The greater the distance, the more different the strings are.
#
# The algorithm employs a proximity matrix, which denotes the distances
# between substrings of the two given strings. Read the embedded comments
# for more info. If you want a deep understanding of the algorithm, print
# the matrix for some test strings and study it
#
# The beauty of this system is that nothing is magical - the distance
# is intuitively understandable by humans
#
# The distance is named after the Russian scientist Vladimir
# Levenshtein, who devised the algorithm in 1965
#
sub levenshtein
{
    # $s1 and $s2 are the two strings
    # $len1 and $len2 are their respective lengths
    #
    my ($s1, $s2) = @_;
    my ($len1, $len2) = (length $s1, length $s2);

    # If one of the strings is empty, the distance is the length
    # of the other string
    #
    return $len2 if ($len1 == 0);
    return $len1 if ($len2 == 0);

    my %mat;

    # Init the distance matrix
    #
    # The first row to 0..$len1
    # The first column to 0..$len2
    # The rest to 0
    #
    # The first row and column are initialized so to denote distance
    # from the empty string
    #
    for (my $i = 0; $i <= $len1; ++$i)
    {
        for (my $j = 0; $j <= $len2; ++$j)
        {
            $mat{$i}{$j} = 0;
            $mat{0}{$j} = $j;
        }

        $mat{$i}{0} = $i;
    }

    # Some char-by-char processing is ahead, so prepare
    # array of chars from the strings
    #
    my @ar1 = split(//, $s1);
    my @ar2 = split(//, $s2);

    for (my $i = 1; $i <= $len1; ++$i)
    {
        for (my $j = 1; $j <= $len2; ++$j)
        {
            # Set the cost to 1 iff the ith char of $s1
            # differs from the jth of $s2
            # 
            # Denotes a substitution cost. When the char are equal
            # there is no need to substitute, so the cost is 0
            #
            my $cost = ($ar1[$i-1] eq $ar2[$j-1]) ? 0 : 1;

            # Cell $mat{$i}{$j} equals the minimum of:
            #
            # - The cell immediately above plus 1
            # - The cell immediately to the left plus 1
            # - The cell diagonally above and to the left plus the cost
            #
            # We can either insert a new char, delete a char or
            # substitute an existing char (with an associated cost)
            #
            $mat{$i}{$j} = min([$mat{$i-1}{$j} + 1,
                                $mat{$i}{$j-1} + 1,
                                $mat{$i-1}{$j-1} + $cost]);
        }
    }

    # Finally, the Levenshtein distance equals the rightmost bottom cell
    # of the matrix
    #
    # Note that $mat{$x}{$y} denotes the distance between the substrings
    # 1..$x and 1..$y
    #
    return $mat{$len1}{$len2};
}


# minimal element of a list
#
sub min
{
    my @list = @{$_[0]};
    my $min = $list[0];

    foreach my $i (@list)
    {
        $min = $i if ($i < $min);
    }

    return $min;
}

Levenshtein Distance Algorithm: Objective-C Implementation

by Rick Bourner

------------------------------------------------------------------------ 

//
//  NSString-Levenshtein.h
//
//  Created by Rick Bourner on Sat Aug 09 2003.
//  rick@bourner.com

@interface NSString(Levenshtein)

// calculate the mean, over the words in stringA, of the smallest
// distance to any word in stringB
- (float) compareWithString: (NSString *) stringB;

// calculate the distance between two string treating them each as a
// single word
- (float) compareWithWord: (NSString *) stringB;

// return the minimum of a, b and c
- (int) smallestOf: (int) a andOf: (int) b andOf: (int) c;

@end

--------------------------------------------------------------------

//
//  NSString-Levenshtein.m
//
//  Created by Rick Bourner on Sat Aug 09 2003.
//  Rick@Bourner.com

#import "NSString-Levenshtein.h"


@implementation NSString(Levenshtein)

// calculate the mean distance between all words in stringA and stringB
- (float) compareWithString: (NSString *) stringB
{
     float averageSmallestDistance = 0.0;
     float smallestDistance;
     float distance;

     NSMutableString * mStringA = [[NSMutableString alloc]  initWithString: self];
     NSMutableString * mStringB = [[NSMutableString alloc]  initWithString: stringB];


     // normalize
     [mStringA replaceOccurrencesOfString:@"\n"
                              withString: @" "
                                 options: NSLiteralSearch
                                   range: NSMakeRange(0, [mStringA  length])];

     [mStringB replaceOccurrencesOfString:@"\n"
                              withString: @" "
                                 options: NSLiteralSearch
                                   range: NSMakeRange(0, [mStringB  length])];

     NSArray * arrayA = [mStringA componentsSeparatedByString: @" "];
     NSArray * arrayB = [mStringB componentsSeparatedByString: @" "];

     NSEnumerator * emuA = [arrayA objectEnumerator];
     NSEnumerator * emuB;

     NSString * tokenA = NULL;
     NSString * tokenB = NULL;

     // O(n*m) but is there another way ?!?
     while ( tokenA = [emuA nextObject] ) {

         emuB = [arrayB objectEnumerator];
         smallestDistance = 99999999.0;

         while ( tokenB = [emuB nextObject] )
             if ( (distance = [tokenA compareWithWord: tokenB] ) <  smallestDistance )
                 smallestDistance = distance;

         averageSmallestDistance += smallestDistance;

     }

     [mStringA release];
     [mStringB release];

     return averageSmallestDistance / [arrayA count];
}


// calculate the distance between two strings treating them each as a
// single word
- (float) compareWithWord: (NSString *) stringB
{
     // normalize strings
     NSString * stringA = [NSString stringWithString: self];
     // stringByTrimmingCharactersInSet: returns a new string,
     // so the result has to be assigned back
     stringA = [stringA stringByTrimmingCharactersInSet:
               [NSCharacterSet whitespaceAndNewlineCharacterSet]];
     stringB = [stringB stringByTrimmingCharactersInSet:
               [NSCharacterSet whitespaceAndNewlineCharacterSet]];
     stringA = [stringA lowercaseString];
     stringB = [stringB lowercaseString];


     // Step 1
     int k, i, j, cost, * d, distance;

     int n = [stringA length];
     int m = [stringB length];

     // If one string is empty, the distance is the length of the other
     if( n == 0 )
         return m;
     if( m == 0 )
         return n;

     // One extra row and column for the empty prefix
     n++;
     m++;

     d = malloc( sizeof(int) * m * n );

     // Step 2
     for( k = 0; k < n; k++)
         d[k] = k;

     for( k = 0; k < m; k++)
         d[ k * n ] = k;

     // Step 3 and 4
     for( i = 1; i < n; i++ )
         for( j = 1; j < m; j++ ) {

             // Step 5
             if( [stringA characterAtIndex: i-1] ==
                  [stringB characterAtIndex: j-1] )
                 cost = 0;
             else
                 cost = 1;

             // Step 6
             d[ j * n + i ] = [self smallestOf: d [ (j - 1) * n + i ] + 1
                                         andOf: d[ j * n + i - 1 ] +  1
                                         andOf: d[ (j - 1) * n + i -1 ] + cost ];
         }

     // The bottom-right cell holds the distance between the full strings
     distance = d[ n * m - 1 ];

     free( d );

     return distance;
}


// return the minimum of a, b and c
- (int) smallestOf: (int) a andOf: (int) b andOf: (int) c
{
     int min = a;
     if ( b < min )
         min = b;

     if( c < min )
         min = c;

     return min;
}

@end

 

Levenshtein Distance Algorithm: Java Implementation

by Chas Emerick

From an email from Chas Emerick to Michael Gilleland, 22 October 2003: 
Mr. Gilleland,

As you may know, the Apache Jakarta Commons project had appropriated 
your sample implementation of the Levenshtein Distance algorithm for 
its commons-lang Java library.  While attempting to use it with two 
very large strings, I encountered an OutOfMemoryError, due to the fact 
that a matrix is created with the dimensions of the two strings' 
lengths.  I know you created the implementation to go with your 
(excellent) illustration of the algorithm, so this matrix approach 
translates that illustration and tutorial perfectly.

However, as I said, the matrix approach doesn't lend itself to getting 
the edit distance of two large strings.  For this purpose, I modified 
your implementation to use two single-dimensional arrays; this is 
clearly more memory-friendly (although it probably results in some very 
slight performance degradation when comparing smaller strings).

I've submitted the modification to the maintainers of the commons-lang 
project, and I've appended the relevant method below.

Thanks!

Chas Emerick  
public static int getLevenshteinDistance (String s, String t) {
  if (s == null || t == null) {
    throw new IllegalArgumentException("Strings must not be null");
  }
		
  /*
    The difference between this impl. and the previous is that, rather 
     than creating and retaining a matrix of size s.length()+1 by t.length()+1, 
     we maintain two single-dimensional arrays of length s.length()+1.  The first, d,
     is the 'current working' distance array that maintains the newest distance cost
     counts as we iterate through the characters of String s.  Each time we increment
     the index of String t we are comparing, d is copied to p, the second int[].  Doing so
     allows us to retain the previous cost counts as required by the algorithm (taking 
     the minimum of the cost count to the left, up one, and diagonally up and to the left
     of the current cost count being calculated).  (Note that the arrays aren't really 
     copied anymore, just switched...this is clearly much better than cloning an array 
     or doing a System.arraycopy() each time  through the outer loop.)

     Effectively, the difference between the two implementations is this one does not 
     cause an out of memory condition when calculating the LD over two very large strings.  		
  */		
		
  int n = s.length(); // length of s
  int m = t.length(); // length of t
		
  if (n == 0) {
    return m;
  } else if (m == 0) {
    return n;
  }

  int p[] = new int[n+1]; //'previous' cost array, horizontally
  int d[] = new int[n+1]; // cost array, horizontally
  int _d[]; //placeholder to assist in swapping p and d

  // indexes into strings s and t
  int i; // iterates through s
  int j; // iterates through t

  char t_j; // jth character of t

  int cost; // cost

  for (i = 0; i<=n; i++) {
     p[i] = i;
  }
		
  for (j = 1; j<=m; j++) {
     t_j = t.charAt(j-1);
     d[0] = j;
		
     for (i=1; i<=n; i++) {
        cost = s.charAt(i-1)==t_j ? 0 : 1;
        // minimum of cell to the left+1, to the top+1, diagonally left and up +cost				
        d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);  
     }

     // copy current distance counts to 'previous row' distance counts
     _d = p;
     p = d;
     d = _d;
  } 
		
  // our last action in the above loop was to switch d and p, so p now 
  // actually has the most recent cost counts
  return p[n];
}

Levenshtein Distance Algorithm: C++ Implementation

by Anders Sewerin Johansen

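// Include STL string type

#include <string>

// Include STL vector type (dynamic array)

#include <vector>

// Include std::min, which the distance function below calls as min()

#include <algorithm>

using std::min;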

int distance(const std::string source, const std::string target) {

  // Step 1

  const int n = source.length();
  const int m = target.length();
  if (n == 0) {
    return m;
  }
  if (m == 0) {
    return n;
  }

  // Good form to declare a TYPEDEF

  typedef std::vector< std::vector<int> > Tmatrix; 

  Tmatrix matrix(n+1);

  // Size the vectors in the 2.nd dimension. Unfortunately C++ doesn't
  // allow for allocation on declaration of 2.nd dimension of vec of vec

  for (int i = 0; i <= n; i++) {
    matrix[i].resize(m+1);
  }

  // Step 2

  for (int i = 0; i <= n; i++) {
    matrix[i][0]=i;
  }

  for (int j = 0; j <= m; j++) {
    matrix[0][j]=j;
  }

  // Step 3

  for (int i = 1; i <= n; i++) {

    const char s_i = source[i-1];

    // Step 4

    for (int j = 1; j <= m; j++) {

      const char t_j = target[j-1];

      // Step 5

      int cost;
      if (s_i == t_j) {
        cost = 0;
      }
      else {
        cost = 1;
      }

      // Step 6

      const int above = matrix[i-1][j];
      const int left = matrix[i][j-1];
      const int diag = matrix[i-1][j-1];
      // Not const: Step 6A below may lower the value
      int cell = min( above + 1, min(left + 1, diag + cost));

      // Step 6A: Cover transposition, in addition to deletion,
      // insertion and substitution. This step is taken from:
      // Berghel, Hal ; Roach, David : "An Extension of Ukkonen's 
      // Enhanced Dynamic Programming ASM Algorithm"
      // (http://www.acm.org/~hlb/publications/asm/asm.html)

      if (i>2 && j>2) {
        int trans=matrix[i-2][j-2]+1;
        if (source[i-2]!=t_j) trans++;
        if (s_i!=target[j-2]) trans++;
        if (cell>trans) cell=trans;
      }

      matrix[i][j]=cell;
    }
  }

  // Step 7

  return matrix[n][m];
}
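
For comparison with Step 6A above, here is a minimal Python 3 sketch of the commonly described "optimal string alignment" variant, where swapping two adjacent characters counts as a single edit. This is an assumption on my side about what a transposition step usually looks like; it is not a line-by-line port of the C++ code or of the Berghel and Roach paper, and the name `osa_distance` is made up for this example. Note that the C++ code only tries a transposition when i>2 and j>2, while this sketch tries it as soon as two characters are available on each side.

```python
def osa_distance(a, b):
    """Edit distance where an adjacent transposition costs 1 (optimal string alignment)."""
    n, m = len(a), len(b)
    # d[i][j] is the distance between a[:i] and b[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
            # Transposition: the last two characters of each prefix are swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]


if __name__ == "__main__":
    print(osa_distance("ab", "ba"))
    print(osa_distance("kitten", "sitting"))
```

With plain Levenshtein, "ab" to "ba" costs 2; with the transposition rule it costs 1.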

Levenshtein Distance Algorithm: Perl Implementation

by Jorge Mas Trullenque

 
sub levenshtein($$){
  my @A=split //, lc shift;
  my @B=split //, lc shift;
  my @W=(0..@B);
  my ($i, $j, $cur, $next);
  for $i (0..$#A){
	$cur=$i+1;
	for $j (0..$#B){
		$next=min(
			$W[$j+1]+1,
			$cur+1,
			($A[$i] ne $B[$j])+$W[$j]
		);
		$W[$j]=$cur;
		$cur=$next;
	}
	$W[@B]=$next;
  }
  return $next;
}

sub min($$$){
  if ($_[0] < $_[2]){ pop @_; } else { shift @_; }
  return $_[0] < $_[1]? $_[0]:$_[1];
}

print levenshtein("gambol","gumbo"), "\n";
print levenshtein("gumbo", "gambol"), "\n";
print levenshtein("gumbo", "bumble"), "\n";

A Python implementation by Magnus Lie Hetland.
#!/usr/bin/env python
def distance(a,b):
    "Calculates the Levenshtein distance between a and b."
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a,b = b,a
        n,m = m,n
       
    current = range(n+1)
    for i in range(1,m+1):
        previous, current = current, [i]+[0]*n
        for j in range(1,n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)
           
    return current[n]
if __name__=="__main__":
    from sys import argv
    print distance(argv[1],argv[2])
And this in Tcl by Richard Suchenwirth
 

similarity

It strikes me that "similarity scoring" is the sort of gadget that attracts/inspires RS, Arjen, ... If I leave a mention of [1] here, will they exhibit Tcl examples?

[LV that google URL doesn't seem to point to anything for me...]

Note Tcllib FR [ 514486 ] request textutil snd-like comparisons [2]

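Levenshtein's algorithm for edit distance (not to be confused with Hamming distance, which only compares strings of equal length position by position) is also the foundation of diff in Tcl, which compares files line-by-line instead of comparing strings character-by-character.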

RS can't withstand a challenge... Indeed, I have often been wishing for such a measuring device - thanks for the link! Here's a plump translation to Tcl of the Python version of the Levenshtein algorithm given there (where it hurt to have to do all index arithmetics with expr, so I introduced a helper subtractor), plus an application of stringDistance to compute stringSimilarity, where the only little gag is that we have to determine the sum of the string lengths only once, as they're concatenated:

 proc stringDistance {a b} {
        set n [string length $a]
        set m [string length $b]
        for {set i 0} {$i<=$n} {incr i} {set c($i,0) $i}
        for {set j 0} {$j<=$m} {incr j} {set c(0,$j) $j}
        for {set i 1} {$i<=$n} {incr i} {
           for {set j 1} {$j<=$m} {incr j} {
                set x [expr {$c([- $i 1],$j)+1}]
                set y [expr {$c($i,[- $j 1])+1}]
                set z $c([- $i 1],[- $j 1])
                if {[string index $a [- $i 1]]!=[string index $b [- $j 1]]} {
                        incr z
                }
                set c($i,$j) [min $x $y $z]
            }
        }
        set c($n,$m)
 }
 # some little helpers:
 proc min args {lindex [lsort -real $args] 0}
 proc max args {lindex [lsort -real $args] end}
 proc - {p q} {expr {$p-$q}}

 proc stringSimilarity {a b} {
        set totalLength [string length $a$b]
        max [expr {double($totalLength-2*[stringDistance $a $b])/$totalLength}] 0.0
 }

# Testing...

 % stringSimilarity hello hello  ;# identity implies perfect similarity
 1.0
 % stringSimilarity hello hallo  ;# changed one out of five letters
 0.8
 % stringSimilarity hello Hallo  ;# case matters
 0.6
 % stringSimilarity hello world  ;# one match of five (l or o)
 0.2
 % stringSimilarity hello helplo ;# insert costs slightly less
 0.818181818182
 % stringSimilarity hello again  ;# total dissimilarity
 0.0

[Nice work, of course; I particularly applaud the example evaluations.]


Both string* functions may be tuned to better fit the needs of the application. In stringDistance, the cost for inequality (presently constant 1, done by the incr z) could be derived from the characters in question, e.g. 0/O or I/1 could cost only 0.1, etc.; in stringSimilarity one could, if the strings are qualified as being either standard (like from a dictionary) or (possible) deviation, divide the distance by the length of the standard string (this would prevent the above effect that an insert costs slightly less, because it increases the total length).
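
Both tuning ideas can be sketched in a few lines of Python 3. This is only an illustration under my own assumptions: the names `weighted_distance`, `lookalike_cost`, `LOOKALIKES` and `similarity_to_standard` are made up here, and the 0.1 cost for 0/O and I/1 is simply the example value from the paragraph above.

```python
def weighted_distance(a, b, sub_cost=None):
    """Levenshtein distance with a pluggable substitution cost (insert/delete cost 1)."""
    if sub_cost is None:
        sub_cost = lambda x, y: 0.0 if x == y else 1.0
    # previous[j] is the distance between the processed prefix of a and b[:j]
    previous = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [float(i)]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1.0,                   # delete ca
                               current[j - 1] + 1.0,                # insert cb
                               previous[j - 1] + sub_cost(ca, cb)))  # substitute
        previous = current
    return previous[-1]


# Hypothetical table of easily confused character pairs.
LOOKALIKES = {frozenset("0O"), frozenset("I1")}


def lookalike_cost(x, y):
    """Substitution cost: 0 for equal chars, 0.1 for look-alikes, 1 otherwise."""
    if x == y:
        return 0.0
    if frozenset((x, y)) in LOOKALIKES:
        return 0.1
    return 1.0


def similarity_to_standard(standard, candidate, sub_cost=None):
    """Similarity in [0, 1], normalized by the length of the standard string only."""
    if not standard:
        return 1.0 if not candidate else 0.0
    d = weighted_distance(standard, candidate, sub_cost)
    return max(1.0 - d / len(standard), 0.0)


if __name__ == "__main__":
    print(weighted_distance("R2D0", "R2DO", lookalike_cost))
    print(similarity_to_standard("hello", "hallo"))
    print(similarity_to_standard("hello", "helplo"))
```

Because the score is divided by the length of the standard string only, a one-letter insert and a one-letter change into "hello" now get the same score.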


Speed tuning: The subtractor helper "-" above makes the code look nicer than if an explicit expr were thrown in for each occurrence; however, keeping a second iterator whose value is one less than the original iterator brings runtime from 5.7 ms down to 4.2 ms (Sun Solaris; tested both "hello hello" and "hello again"):

 proc stringDistance2 {a b} {
     set n [string length $a]
     set m [string length $b]
     for {set i 0} {$i<=$n} {incr i} {set c($i,0) $i}
     for {set j 0} {$j<=$m} {incr j} {set c(0,$j) $j}
     for {set i 1; set i0 0} {$i<=$n} {incr i; incr i0} {
         for {set j 1; set j0 0} {$j<=$m} {incr j; incr j0} {
                set x [expr {$c($i0,$j)+1}]
                set y [expr {$c($i,$j0)+1}]
                set z $c($i0,$j0)
                if {[string index $a $i0]!=[string index $b $j0]} {
                        incr z
                }
                set c($i,$j) [min $x $y $z]
            }
        }
        set c($n,$m)
 } ;# RS

Artur Trzewik 2006-03-31: There is another implementation, which I found in the OmegaT Java program and rewrote in Tcl; it seems to be a little bit faster (30%).

  # author Vladimir Levenshtein
  # author Michael Gilleland, Merriam Park Software
  # author Chas Emerick, Apache Software Foundation
  # author Maxym Mykhalchuk
  proc levenshteinDistance {s t} {
    set n [string length $s]
    set m [string length $t]

    if {$n==0} {
        return $m
    } elseif {$m==0} {
        return $n
    }

    for {set i 0} {$i<=$n} {incr i} {
        lappend d 0
    }

    # indexes into strings s and t
    # int i; // iterates through s
    # int j; // iterates through t
    # int cost; // cost

    for {set i 0} {$i<=$n} {incr i} {
        lappend p $i
    }

    for {set j 1} {$j<=$m} {incr j} {
        set t_j [string index $t [expr {$j-1}]]
        lset d 0 $j

        for {set i 1} {$i<=$n} {incr i} {
            set s_i [string index $s [expr {$i-1}]]
            if {$s_i eq $t_j} {
                set cost 0
            } else {
                set cost 1
            }
            # minimum of cell to the left+1, to the top+1, diagonally left and up +cost
            lset d $i [min [expr {[lindex $d [expr {$i-1}]]+1}]  [expr {[lindex $p $i]+1}] [expr {[lindex $p [expr {$i-1}]]+$cost}]]
        }

        # copy current distance counts to 'previous row' distance counts
        set _d $p
        set p $d
        set d $_d
    }

    # our last action in the above loop was to switch d and p, so p now
    # actually has the most recent cost counts
    lindex $p $n
  }

IL This is a topic I've been dealing with a lot lately, and I'm finding (of course) that the nature of the math really depends on the data you're trying to score. In the field of genetic sequence matching you might be looking for common subsequences, in editorial fields you might be looking for misspellings. (escargo - Is that a joke?) IL-no, i'm not that corny :(, thanks for the headsup

I've found that under a certain number of characters in a string, any percentage measurement really doesn't do much good. CAT and CAR (especially in an editorial context) are entirely different concepts but you can make the argument it only differs by one char, in this case that difference just ends up being 33%. It raised the question to me, can you ever really assign a percentage of relevancy by sheer numbers alone? Probably not, in most cases I regard string matching as highly domain specific.

Also there is the notion that A has a relevancy towards B, but the reverse might not be true. ie. you can say THE has a 100% match against THE FLYING CIRCUS.

[3] Here is a good summary I found on the topic of genetic sequence matching algorithms. I was using a variation of the Smith-Waterman algorithm to run tests against some supposedly related datasets.


For faster (though not as exact) string comparisons, see Fuzzy string search


To compare how similar sounding two strings are, try the soundex page.


Additional string functions - Arts and crafts of Tcl-Tk programming



NubyGems: Inject is Functional(zz)

To be honest, coming from Python, I didn't really get inject at first. It seems like a cool thing borrowed from Smalltalk, and this article sure helps.

http://www.oreillynet.com/ruby/blog/2006/07/nubygems_inject_is_functional.html

NubyGems : Inject is Functional!

Tuesday July 11, 2006 9:04AM
by Gregory Brown

When I first found the inject method, it became a new favorite tool.

It was great to be able to turn something like this:


a = [7,8,9]
b = [1,2,3,4]
b.each { |e| a << e + 1}

into something like this:


b = [1,2,3,4]
a = b.inject([7,8,9]) { |s,e| s << e + 1 }

It was cool because it was even possible to build hashes this way, if you used a little trick.


b = [1,2,3,4]
a = b.inject({}) { |s,e| s[e.to_s] = e; s }

This should have given me the red flag though. Why should I need to pass the hash as the last value in the block?

go ahead, try something like this:

b = [1,2,3,4]
a = b.inject([]) { |s,e| s << e if e < 3 }

Now what I would *want* this to do is give me something like a #=> [1,2], but instead it gives us a #=> nil

So we have to do that annoying trick again:

b = [1,2,3,4]
a = b.inject([]) { |s,e| s << e if e < 3; s }

Now, I'm no nuby, but I suppose there is a little bit of nubyism in all of us. I just learned last week why inject was surprising me so much. It's not really just a shortcut for each that gives you a base value to build up, it's a functional method.

For those who aren't familiar with functional programming, there is always Wikipedia, but the relevant part for this particular problem is that inject isn't designed to do anything destructive. That means things like << and += are bad and things like + are good.

That’s why you see the typical sum example of inject using just a + operator


sum = [1,2,3,4].inject(0) { |s,e| s + e } #=> 10

And if you want to build a hash, you can do so without the hack I showed before

hash = [1,2,3,4].inject({}) { |s,e| s.merge( { e.to_s => e } ) }

The reason these bits of code work is because the return value of the block is what becomes the new s, NOT the original parameter you passed to inject.
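
For readers coming from Python, the same point can be sketched with functools.reduce, which plays a role similar to inject. This is an analogy added here, not part of the original article: the return value of the function becomes the next accumulator, so a mutating call that returns None (like list.append) breaks the chain just as the `if` modifier did in the Ruby example. The variable names are made up for illustration.

```python
from functools import reduce

b = [1, 2, 3, 4]

# Non-destructive style: each step returns a new accumulator.
total = reduce(lambda s, e: s + e, b, 0)
by_name = reduce(lambda s, e: {**s, str(e): e}, b, {})
small = reduce(lambda s, e: s + [e] if e < 3 else s, b, [])


def broken():
    # list.append mutates the list and returns None, so after the first
    # step the accumulator is None and the second step fails.
    return reduce(lambda s, e: s.append(e), b, [])


if __name__ == "__main__":
    print(total, by_name, small)
    try:
        broken()
    except AttributeError as exc:
        print("broken:", exc)
```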

Now you might say ‘hey, this is probably pretty slow, building all these new references and passing them around’. I sort of thought the same thing myself, and am not quite sure how I feel about that.

But the idea is, you’re really fighting the function when you use destructive methods. If you find yourself needing to modify the original object rather than build a result set functionally, each isn’t THAT ugly. :)

Comments


inject does look cool, I'll have to start using that:)

Not quite sure what you mean about destructive methods. If you try something like this:

b = [1,2,3,4]
a = b.inject([]) { |s,e| s << e if e < 3; puts s.object_id; s }

it shows that s is the same object passed around. It's probably designed this way for immediate values such as summing with a Fixnum, since you can't pass them by reference.

cheers
Tim

<< is a destructive method. You are modifying the object assigned to s.

the only reason why the object id is the same in your example is because you are using inject as if it were not a functional method

You can rewrite this code to be non-destructive:
a = b.inject([]) { |s,e| e < 3 ? s + [e] : s }

For that first example, wouldn't the logical thing be

[7,8,9]+[1,2,3,4].map { |x| x+1 }

Sure, that's much nicer. I mostly was just pointing the functional stuff about inject, not trying to show the handy ways to use it.

However, the code you just showed is also functional. :)

2006/7/18

ajax on rails chatroom

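Sob, only last night did I discover DHH's April 30 talk "Beyond the Rocket Surgeons with AJAX on Rails"
and 37signals' Campfire, and found that my chatroom is almost the same as Campfire, hoho. If I tweak it a bit more and make it prettier, I could open
a service and charge a monthly fee too (daydreaming ;-)
There is not much to say about the program: a simple database manages rooms, sessions, and messages, while User is just a structure inside the program.
Everything else is a pile of rhtml in the views plus rjs effects. Sigh, I really want to read that RJS Templates for Rails,
a 56-page PDF, $9.9...
Here is a screenshot so you can see how ugly it looks; the icons and CSS are stolen from ajchat.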
 
 
2006/7/14

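Let me write something

Today is France's national day and, more importantly, kaka, my birthday. I was in no mood for anything at work and hardly got anything done; I just spent the whole day posting on the Shuimu BBS.
Back home I went on writing code. This thing has been going on for almost two weeks and I still can't get it right, which is annoying; I will try to finish it tonight.
Oh, and thanks to everyone who cares, whether they remembered or not.
Another year older, my sad days... (^H^H^H^H^H^H^H^H no grumbling, no more grumbling)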
2006/7/3

twisted version of msn social relations build (cont.)

Mike was kind enough to spend a lot of time teaching me the basic ideas of Twisted; thanks, Mike. He brought the DeferredQueue and a hop limit into this program, and his code is far better looking than mine. (btw, I really like pretty code and girls, those are my goals, haha ;-)
 
import re, sys, socket
import urllib
from twisted.internet import reactor, defer
from twisted.python import log
from twisted.web import client
from sets import Set
socket.setdefaulttimeout(10)  # call the function; assigning to it would silently replace it
re_buddy_pre = re.compile("href=\"(?:http://)?([^\.\"]+)\.spaces.msn.com",
re.IGNORECASE)  # http://nick.spaces.msn.com
re_buddy_post = re.compile("spaces.msn.com/members/([^\"\/]+)",
re.IGNORECASE) # http://spaces.msn.com/members/nick/
 
class MSNMapper(object):
    """Build a map of the MSN buddy space
    """
    def __init__(self, origin, radius=10, max_requests=4):
        """Start at the buddy name given in 'origin'.  Find all buddies
        within 'radius' clicks.  Use up to 'max_requests' clients at a
        time.
        """
        self.waiting = 0
        self.all_buddies = Set()
        self.relation_dict = {}
        self.queue = defer.DeferredQueue()
        self.clients = [MSNMapperClient(self) for i in range(max_requests)]
        self.queueBuddy(origin, radius)
        [client.start() for client in self.clients]
       
    def quit(self):
        """Tell the clients to quit
        """
        [d.callback((None, 0)) for d in self.queue.waiting]
        self.queue.waiting = []
        self.queue.pending = [(None, 0)] * len(self.clients)
    def queueBuddy(self, buddy_name, ttl):
        if buddy_name in self.all_buddies:
            return
        print "New buddy:", buddy_name
        self.queue.put((buddy_name, ttl))
    def getNextBuddy(self):
        self.waiting += 1
        if self.waiting == len(self.clients) and not self.queue.pending:
            # The queue is empty and all the clients are looking for
            # more; we are done!
            self.quit()
        d = self.queue.get()
        d.addCallback(self.cb_decWaiting)
        return d
    def cb_decWaiting(self, data):
        self.waiting -= 1
        return data
    def clientQuit(self, client):
        self.clients.remove(client)
        if not self.clients:
            print self.all_buddies
            reactor.stop()

class MSNMapperClient(object):
    def __init__(self, mapper):
        self.mapper = mapper
    def start(self):
        print "%r waiting for next buddy"%self
        d = self.mapper.getNextBuddy()
        d.addCallbacks(self.cb_gotNewName, self.eb_die)
    def eb_die(self, fail):
        fail.printTraceback()
        self.mapper.clientQuit(self)
    def cb_gotNewName(self, (buddy_name, ttl)):
        print "Getting %r with ttl %d"%(buddy_name,ttl)
       
        self.buddy_name = buddy_name
        self.ttl = ttl
       
        if buddy_name == None:
            self.mapper.clientQuit(self)
            return
       
        if buddy_name in self.mapper.all_buddies:
            # Already processed this one; skip it
            return self.start()
        self.mapper.all_buddies.add(buddy_name)
        url = "http://"+buddy_name+".spaces.msn.com/"
        d = client.getPage(url)
        d.addCallbacks(self.cb_gotPage, self.eb_die)
    def cb_gotPage(self, data):
        buddies = Set()
        for re_buddy in [re_buddy_pre, re_buddy_post]:
            for m in re_buddy.finditer(data):
                buddies.add(m.group(1))
        self.mapper.relation_dict[self.buddy_name] = buddies
        if self.ttl:
            [self.mapper.queueBuddy(buddy_name, self.ttl-1)
             for buddy_name in buddies]
        self.start()
       

log.startLogging(sys.stdout)
mapper = MSNMapper("ayueer", radius=1)
reactor.run()
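
Stripped of Twisted, the core of the program is a work queue plus a hop limit (the ttl). Here is a minimal single-threaded Python 3 sketch of just that logic, using only the standard library. The names `crawl`, `get_buddies` and `FAKE_SPACE` are made up for illustration, and the fake graph stands in for fetching and parsing real pages; it says nothing about how DeferredQueue works internally.

```python
from collections import deque


def crawl(origin, get_buddies, radius=10):
    """Breadth-first walk from origin, following links at most `radius` hops.

    get_buddies(name) must return an iterable of buddy names for that page.
    Returns a dict mapping each visited name to the set of its buddies.
    """
    seen = set()
    relations = {}
    queue = deque([(origin, radius)])
    while queue:
        name, ttl = queue.popleft()
        if name in seen:
            # Already processed this one; skip it
            continue
        seen.add(name)
        buddies = set(get_buddies(name))
        relations[name] = buddies
        if ttl:
            # Only pages with hops left may enqueue their buddies.
            for buddy in buddies:
                if buddy not in seen:
                    queue.append((buddy, ttl - 1))
    return relations


# Hypothetical buddy graph used instead of real HTTP requests.
FAKE_SPACE = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": [],
    "d": ["e"],
    "e": [],
}


if __name__ == "__main__":
    result = crawl("a", lambda name: FAKE_SPACE.get(name, []), radius=1)
    for name in sorted(result):
        print(name, sorted(result[name]))
```

As in the Twisted version, a page that arrives with ttl 0 is still fetched and recorded, but its buddies are not queued.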