- About 50 lines of code
- Gives reasonable results (try it out)
- tokenize needs much more improvement (better word detection, stop-word removal, ...)
- split_to_sentences needs much more improvement (handle "3.2", "Mr. Smith", ...); a rough sketch of both follows this list
- In real life you'll need to "clean" the text (ads, credits, ...)
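
To give an idea of what those improvements could look like, here is a small sketch. The function names tokenize and split_to_sentences come from the post, but the stop-word list and the abbreviation list below are just illustrative placeholders, not what the actual code uses:

```python
import re

# Tiny illustrative stop-word list; a real implementation would use a
# fuller list (e.g. NLTK's) or a frequency cutoff.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it"}

def tokenize(text):
    """Lowercase, keep only word-like tokens, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def split_to_sentences(text):
    """Naive splitter that avoids breaking on decimals (3.2) and on a
    few common abbreviations (Mr., Dr., ...)."""
    protected = re.sub(r"\b(Mr|Mrs|Dr|Prof|St)\.", r"\1<DOT>", text)
    protected = re.sub(r"(\d)\.(\d)", r"\1<DOT>\2", protected)
    parts = re.split(r"(?<=[.!?])\s+", protected)
    return [p.replace("<DOT>", ".").strip() for p in parts if p.strip()]
```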
You could improve the sentence splitting by using NLTK's Punkt tokenizer.
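A minimal sketch of that suggestion, assuming NLTK is installed and its pre-trained "punkt" model has been downloaded:

```python
import nltk
from nltk.tokenize import sent_tokenize

# One-time download of the pre-trained Punkt sentence model.
nltk.download("punkt")

text = "Mr. Smith paid 3.2 dollars. He left."
# Punkt knows about common abbreviations and decimal numbers,
# so 'Mr.' and '3.2' do not trigger spurious sentence breaks.
print(sent_tokenize(text))
# ['Mr. Smith paid 3.2 dollars.', 'He left.']
```

This would replace split_to_sentences wholesale rather than patching it with more regex rules.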