- About 50 lines of code
- Gives reasonable results (try it out)
- tokenize needs much more work (better word detection, stop words, ...)
- split_to_sentences needs much more work (handling "3.2", "Mr. Smith", ...)
- In real life you'll need to "clean" the text (Ads, credits, ...)
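The post's own code is not reproduced here, so the following is only a minimal sketch of what a summarizer of this kind might look like, assuming a simple word-frequency scoring scheme. The names tokenize and split_to_sentences come from the list above; summarize, num_sentences and the scoring rule are assumptions made for illustration.

```python
import re
from collections import Counter


def split_to_sentences(text):
    # Naive: split after ".", "!" or "?" when followed by whitespace.
    # "3.2" survives (no whitespace after the dot), "Mr. Smith" does not.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    # Lowercased runs of letters, digits and apostrophes; no stop words.
    return re.findall(r"[a-z0-9']+", text.lower())


def summarize(text, num_sentences=2):
    # Assumed scoring: a sentence is worth the sum of the document-wide
    # frequencies of its words. This favors long sentences.
    sentences = split_to_sentences(text)
    freq = Counter(tokenize(text))

    def score(sentence):
        return sum(freq[word] for word in tokenize(sentence))

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    # Return the best sentences in their original order.
    return [sentences[i] for i in sorted(ranked[:num_sentences])]
```

Calling summarize(text, 3) returns a list of up to three sentences. The weaknesses listed above (no stop words, abbreviations splitting sentences, uncleaned input) apply to this sketch as well.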
If it won't be simple, it simply won't be. [Hire me, source code] by Miki Tebeka, CEO, 353Solutions
Friday, January 18, 2008
Simple Text Summarizer
Tuesday, January 15, 2008
attrgetter is fast
#!/usr/bin/env python
from operator import attrgetter
from random import shuffle

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

def sort1(points):
    points.sort(key = lambda p: p.x)

def sort2(points):
    points.sort(key = attrgetter("x"))

if __name__ == "__main__":
    from timeit import Timer

    points1 = [Point(x, 2 * x) for x in range(100)]
    points2 = points1[:]
    num_times = 10000

    t1 = Timer("sort1(points1)", "from __main__ import sort1, points1")
    print t1.timeit(num_times)

    t2 = Timer("sort2(points2)", "from __main__ import sort2, points2")
    print t2.timeit(num_times)
$ ./attr.py
0.492087125778
0.29891705513
$
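The script above is Python 2 (print statement). For readers on Python 3, here is a rough equivalent; the function names sort_with_lambda and sort_with_attrgetter are made up for this sketch, and it uses sorted() on a reversed list so that each run does real work. Timings depend on the interpreter and machine, so no numbers are claimed.

```python
from operator import attrgetter
from timeit import timeit


class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y


def sort_with_lambda(points):
    # Key function written as a lambda
    return sorted(points, key=lambda p: p.x)


def sort_with_attrgetter(points):
    # Same key, built by operator.attrgetter
    return sorted(points, key=attrgetter("x"))


# Points in descending x order, so sorting has to reorder them
points = [Point(x, 2 * x) for x in range(99, -1, -1)]

if __name__ == "__main__":
    for func in (sort_with_lambda, sort_with_attrgetter):
        seconds = timeit(lambda: func(points), number=10000)
        print(func.__name__, seconds)
```

Both functions return the same ordering; only the way the key is built differs.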
Friday, January 04, 2008
Faster and Shorter "dot" using itertools
Let's calculate the dot product of two vectors:
dot2 is faster and shorter, while dot1 is more readable. My vote goes to dot2.
from itertools import starmap, izip
from operator import mul

def dot1(v1, v2):
    result = 0
    for i, value in enumerate(v1):
        result += value * v2[i]
    return result

def dot2(v1, v2):
    return sum(starmap(mul, izip(v1, v2)))

if __name__ == "__main__":
    from timeit import Timer

    num_times = 1000
    v1 = range(100)
    v2 = range(100)

    t1 = Timer("dot1(%s, %s)" % (v1, v2), "from __main__ import dot1")
    print t1.timeit(num_times) # 0.038722038269

    t2 = Timer("dot2(%s, %s)" % (v1, v2), "from __main__ import dot2")
    print t2.timeit(num_times) # 0.0260770320892
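The code above is Python 2: itertools.izip no longer exists in Python 3, where the built-in zip is already lazy. A minimal Python 3 sketch of the same comparison could look like this; the timings quoted in the post were measured on the Python 2 version and are not expected to carry over.

```python
from itertools import starmap
from operator import mul
from timeit import timeit


def dot1(v1, v2):
    # Explicit loop, indexing into the second vector
    result = 0
    for i, value in enumerate(v1):
        result += value * v2[i]
    return result


def dot2(v1, v2):
    # zip is lazy in Python 3, so it takes the place of izip
    return sum(starmap(mul, zip(v1, v2)))


if __name__ == "__main__":
    v1 = list(range(100))
    v2 = list(range(100))
    for func in (dot1, dot2):
        seconds = timeit(lambda: func(v1, v2), number=1000)
        print(func.__name__, seconds)
```

Passing the vectors through a lambda avoids formatting them into the timed statement as text, which the original does with "%s".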