Python: parse HTML and produce a tabular text file


The problem: I want to parse an HTML page and write a tabular text file such as this:
East Counties
Babergh, http://ratings.food.gov.uk/OpenDataFiles/FHRS297en-GB.xml, 876
Basildon, http://ratings.food.gov.uk/OpenDataFiles/FHRS109en-GB.xml, 1134
...
...
What I get instead:
Only East Counties appears in the txt file, so the for loop fails to write the rows and the remaining regions. My attempt is shown after the HTML.
HTML code:
The HTML comes from the page at http://ratings.food.gov.uk/open-data/ (the URL used in my code). This is the excerpt that corresponds to the table above:
<h2>
East Counties</h2>
<table>
<thead>
<tr>
<th>
<span id="listRegions_lvFiles_0_titleLAName_0">Local authority</span>
</th>
<th>
<span id="listRegions_lvFiles_0_titleUpdate_0">Last update</span>
</th>
<th>
<span id="listRegions_lvFiles_0_titleEstablishments_0">Number of businesses</span>
</th>
<th>
<span id="listRegions_lvFiles_0_titleCulture_0">Download</span>
</th>
</tr>
</thead>
<tr>
<td>
<span id="listRegions_lvFiles_0_laNameLabel_0">Babergh</span>
</td>
<td>
<span id="listRegions_lvFiles_0_updatedLabel_0">04/05/2017 </span>
at
<span id="listRegions_lvFiles_0_updatedTime_0"> 12:00</span>
</td>
<td>
<span id="listRegions_lvFiles_0_establishmentsLabel_0">876</span>
</td>
<td>
<a id="listRegions_lvFiles_0_fileURLLabel_0" title="Babergh: English language" href="http://ratings.food.gov.uk/OpenDataFiles/FHRS297en-GB.xml">English language</a>
</td>
</tr>
<tr>
<td>
<span id="listRegions_lvFiles_0_laNameLabel_1">Basildon</span>
</td>
<td>
<span id="listRegions_lvFiles_0_updatedLabel_1">06/05/2017 </span>
at
<span id="listRegions_lvFiles_0_updatedTime_1"> 12:00</span>
</td>
<td>
<span id="listRegions_lvFiles_0_establishmentsLabel_1">1,134</span>
</td>
<td>
<a id="listRegions_lvFiles_0_fileURLLabel_1" title="Basildon: English language" href="http://ratings.food.gov.uk/OpenDataFiles/FHRS109en-GB.xml">English language</a>
</td>
</tr>
My attempt:
from xml.dom import minidom
import urllib2
from bs4 import BeautifulSoup

url = 'http://ratings.food.gov.uk/open-data/'
f = urllib2.urlopen(url)
mainpage = f.read()
soup = BeautifulSoup(mainpage, 'html.parser')
regions = []
with open('Regions_and_files.txt', 'w') as f:
    for h2 in soup.find_all('h2')[6:]:  # Skip 6 h2 lines
        region = h2.text.strip()  # Get the text of each h2 without the white spaces
        regions.append(str(region))
        f.write(region + '\n')
        for tr in soup.find_all('tr')[1:]:  # Skip headers
            tds = tr.find_all('td')
            if len(tds) == 0:
                continue
            else:
                a = tr.find_all('a')
                link = str(a)[10:67]
                span = tr.find_all('span')
                places = int(str(span[3].text).replace(',', ''))
                f.write("%s,%s,%s" % \
                        (str(tds[0].text)[1:-1], link, places) + '\n')
How can I fix this?
I'm not familiar with the Beautiful Soup library, but judging from the code, each iteration of the h2 loop traverses every tr element in the document. Instead, you should traverse only the rows of the table that belongs to that specific h2 element.
Edit:
After a quick look at the Beautiful Soup docs, it looks like you can use .next_sibling, since each h2 is always followed by its table: table = h2.next_sibling.next_sibling (called twice because the first sibling is a string containing only whitespace). From the table you can then traverse all of its rows.
You are getting duplicates for Wales because the source page itself contains duplicates.
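For illustration, here is a minimal sketch of the per-table traversal described above. It is not a tested replacement for the original script. It assumes Python 3 and bs4, and it parses an inline sample instead of fetching the live page. It also uses find_next_sibling('table') rather than calling .next_sibling twice, so the whitespace string between the h2 and the table does not need special handling. The names SAMPLE_HTML, regions_and_files and format_report are made up for this example, and "Example Region" is invented data that only shows that rows stay grouped under their own heading.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the excerpt from the question. The second region
# ("Example Region") is invented for this sketch and is not on the real page.
SAMPLE_HTML = """
<h2>
East Counties</h2>
<table>
<thead>
<tr><th>Local authority</th><th>Last update</th>
<th>Number of businesses</th><th>Download</th></tr>
</thead>
<tr>
<td><span>Babergh</span></td>
<td><span>04/05/2017 </span> at <span> 12:00</span></td>
<td><span>876</span></td>
<td><a href="http://ratings.food.gov.uk/OpenDataFiles/FHRS297en-GB.xml">English language</a></td>
</tr>
<tr>
<td><span>Basildon</span></td>
<td><span>06/05/2017 </span> at <span> 12:00</span></td>
<td><span>1,134</span></td>
<td><a href="http://ratings.food.gov.uk/OpenDataFiles/FHRS109en-GB.xml">English language</a></td>
</tr>
</table>
<h2>Example Region</h2>
<table>
<tr>
<td><span>Exampleton</span></td>
<td><span>01/01/2017 </span> at <span> 12:00</span></td>
<td><span>42</span></td>
<td><a href="http://example.com/file.xml">English language</a></td>
</tr>
</table>
"""


def regions_and_files(html):
    """Return a list of (region, [(authority, url, count), ...]) tuples."""
    soup = BeautifulSoup(html, 'html.parser')
    result = []
    # On the real page the question skips the first six h2 elements
    # (soup.find_all('h2')[6:]); the sample has no such leading headings.
    for h2 in soup.find_all('h2'):
        # Look only at the table that follows this heading, not at every
        # tr in the document.
        table = h2.find_next_sibling('table')
        if table is None:
            continue
        rows = []
        for tr in table.find_all('tr'):
            tds = tr.find_all('td')
            link = tr.find('a', href=True)
            if len(tds) < 3 or link is None:
                continue  # header row (th cells only) or incomplete row
            name = tds[0].get_text(strip=True)
            count = int(tds[2].get_text(strip=True).replace(',', ''))
            rows.append((name, link['href'], count))
        result.append((h2.get_text(strip=True), rows))
    return result


def format_report(regions):
    """Format the parsed data in the layout requested in the question."""
    lines = []
    for region, rows in regions:
        lines.append(region)
        for name, url, count in rows:
            lines.append("%s, %s, %d" % (name, url, count))
    return "\n".join(lines) + "\n"


if __name__ == '__main__':
    print(format_report(regions_and_files(SAMPLE_HTML)), end='')
```

To use this against the live page, you would pass the downloaded HTML to regions_and_files and write the result of format_report to Regions_and_files.txt. Reading the href attribute directly also avoids slicing str(a) at fixed character positions, which breaks as soon as a URL has a different length.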
