by Miki Tebeka

Friday, January 18, 2008

Simple Text Summarizer

  • About 50 lines of code
  • Gives reasonable results (try it out)
  • tokenize needs much more work (better word detection, stop words, ...)
  • split_to_sentences needs much more work (it should handle "3.2", "Mr. Smith", ...)
  • In real life you'll need to "clean" the text first (ads, credits, ...)
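
The original code is not reproduced here, so the following is only a minimal sketch of how a frequency-based summarizer of this size might look. The names tokenize and split_to_sentences come from the list above; everything else (summarize, STOP_WORDS, scoring a sentence by the average frequency of its words) is an assumption made for illustration, not necessarily what the original script did.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real list is much longer).
STOP_WORDS = frozenset(
    "a an and are as at be by for in is it of on or that the to was with".split()
)


def tokenize(text):
    """Return lowercase words from text, dropping stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def split_to_sentences(text):
    """Naive split: break after '.', '!' or '?' when followed by whitespace.

    "3.2" survives because no whitespace follows the dot, but
    "Mr. Smith" is still (wrongly) split in two.
    """
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def summarize(text, n=2):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = split_to_sentences(text)
    freq = Counter(tokenize(text))

    def score(sentence):
        # Average frequency of the sentence's words across the whole text.
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(
        range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True
    )
    return [sentences[i] for i in sorted(ranked[:n])]


if __name__ == "__main__":
    sample = (
        "Python is great. Python code is short and Python is readable. "
        "I like tea."
    )
    for sentence in summarize(sample, n=1):
        print(sentence)
```

Averaging the word frequencies, rather than summing them, keeps long sentences from winning just because they contain more words. That is a design choice of this sketch and easy to swap out.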


Chris said...

You could improve the sentence splitting by using NLTK's Punkt tokenizer.

Anonymous said...

Hey, another person reads your blog! Oh my, it's getting crowded :)

Anonymous said...

I read it too :)

Anonymous said...

Here you go !

