Reducing text to its components

This short Python program takes the source of a web page as input and reduces it to its components, that is, the words on the page. You can use it as is or customize it to fit your purpose. The code can be applied in web crawlers, text analytics, and other fields. For example, if you want to leave out stop words, you would define a dictionary of these words and check each word against it with another if statement. This is useful, for instance, if you want to reduce patent data to its components and leave out generic terms like 'a', 'this', 'innovation', etc. You would do this because such words carry no information value.

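[sourcecode language="python"]
def remove_tags(source):
    """Return the list of words in an HTML string, skipping the tags."""
    output = []
    atsplit = True  # True when the next character starts a new word
    splitlist = [' ', '>', '<', '\n']  # characters that separate words
    i = 0
    while i < len(source):
        if source[i] == '<':
            # Jump to the end of the tag
            end = source.find('>', i + 1)
            if end == -1:
                break  # unclosed tag: ignore the rest of the input
            i = end
        if source[i] in splitlist:
            atsplit = True
        elif atsplit:
            output.append(source[i])  # first character of a new word
            atsplit = False
        else:
            output[-1] = output[-1] + source[i]  # extend the current word
        i = i + 1
    return output
[/sourcecode]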
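
The stop-word filtering mentioned in the introduction could be sketched as a separate step applied to the word list that remove_tags returns. This is only a minimal sketch: the names STOP_WORDS and remove_stop_words, the contents of the stop-word set, and the case-insensitive comparison are assumptions of mine and not part of the original program.

```python
# Hypothetical stop-word set (an assumption); extend it to suit your data.
STOP_WORDS = {"a", "this", "innovation"}


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Return the words that are not stop words (case-insensitive)."""
    return [word for word in words if word.lower() not in stop_words]


# Example input shaped like the list returned by remove_tags.
words = ["This", "patent", "describes", "a", "valve", "innovation"]
print(remove_stop_words(words))  # prints ['patent', 'describes', 'valve']
```

Using a set keeps each lookup cheap, which starts to matter once the stop-word list grows or many pages are processed.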

 
