Python: parse HTML and produce a tabular text file


The problem: I want to parse an HTML page and write a tabular text file such as this:
East Counties
Babergh, http://ratings.food.gov.uk/OpenDataFiles/FHRS297en-GB.xml, 876
Basildon, http://ratings.food.gov.uk/OpenDataFiles/FHRS109en-GB.xml, 1134
...
...
What I get instead:
Only East Counties appears in the text file, so the for loop fails to print each new region. My attempt follows the HTML excerpt.
HTML code:
The full markup is on this HTML page; the excerpt below is the part that corresponds to the table above:
<h2>
East Counties</h2>
<table>
<thead>
<tr>
<th>
<span id="listRegions_lvFiles_0_titleLAName_0">Local authority</span>
</th>
<th>
<span id="listRegions_lvFiles_0_titleUpdate_0">Last update</span>
</th>
<th>
<span id="listRegions_lvFiles_0_titleEstablishments_0">Number of businesses</span>
</th>
<th>
<span id="listRegions_lvFiles_0_titleCulture_0">Download</span>
</th>
</tr>
</thead>
<tr>
<td>
<span id="listRegions_lvFiles_0_laNameLabel_0">Babergh</span>
</td>
<td>
<span id="listRegions_lvFiles_0_updatedLabel_0">04/05/2017 </span>
at
<span id="listRegions_lvFiles_0_updatedTime_0"> 12:00</span>
</td>
<td>
<span id="listRegions_lvFiles_0_establishmentsLabel_0">876</span>
</td>
<td>
<a id="listRegions_lvFiles_0_fileURLLabel_0" title="Babergh: English language" href="http://ratings.food.gov.uk/OpenDataFiles/FHRS297en-GB.xml">English language</a>
</td>
</tr>
<tr>
<td>
<span id="listRegions_lvFiles_0_laNameLabel_1">Basildon</span>
</td>
<td>
<span id="listRegions_lvFiles_0_updatedLabel_1">06/05/2017 </span>
at
<span id="listRegions_lvFiles_0_updatedTime_1"> 12:00</span>
</td>
<td>
<span id="listRegions_lvFiles_0_establishmentsLabel_1">1,134</span>
</td>
<td>
<a id="listRegions_lvFiles_0_fileURLLabel_1" title="Basildon: English language" href="http://ratings.food.gov.uk/OpenDataFiles/FHRS109en-GB.xml">English language</a>
</td>
</tr>
My attempt:
from xml.dom import minidom
import urllib2
from bs4 import BeautifulSoup
url='http://ratings.food.gov.uk/open-data/'
f = urllib2.urlopen(url)
mainpage = f.read()
soup = BeautifulSoup(mainpage, 'html.parser')
regions=[]
with open('Regions_and_files.txt', 'w') as f:
    for h2 in soup.find_all('h2')[6:]: #Skip 6 h2 lines
        region=h2.text.strip() #Get the text of each h2 without the white spaces
        regions.append(str(region))
        f.write(region+'\n')
        for tr in soup.find_all('tr')[1:]: # Skip headers
            tds = tr.find_all('td')
            if len(tds)==0:
                continue
            else:
                a = tr.find_all('a')
                link = str(a)[10:67]
                span = tr.find_all('span')
                places = int(str(span[3].text).replace(',', ''))
                f.write("%s,%s,%s" % \
                        (str(tds[0].text)[1:-1], link, places)+'\n')
How can I fix this?
I'm not familiar with the Beautiful Soup library, but judging from the code, each iteration of the h2 loop traverses all the tr elements in the whole document. You should instead traverse only the rows of the table that belongs to that specific h2 element.
Edited:
After a quick look at the Beautiful Soup docs, it looks like you can use .next_sibling, since each h2 is always followed by its table, i.e. table = h2.next_sibling.next_sibling (called twice because the first sibling is a string containing whitespace). From the table you can then traverse all of its rows.
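The sibling idea above might look roughly like the sketch below. It is only an illustration under some assumptions: it is written for Python 3, it parses a small inline sample instead of downloading the page, and the second region ("Example Region", "Exampleton", the example.com link) is made up so that there is more than one h2 to loop over. The function name region_lines is hypothetical. Instead of calling .next_sibling twice, it uses h2.find_next_sibling(True), which skips the whitespace string and returns the next sibling tag, and then checks that this tag really is a table.

```python
from bs4 import BeautifulSoup

# Small stand-in for the real page. The first region mirrors the excerpt in
# the question; the second region is invented purely for illustration.
SAMPLE_HTML = """
<h2>
East Counties</h2>
<table>
<thead><tr><th>Local authority</th><th>Last update</th>
<th>Number of businesses</th><th>Download</th></tr></thead>
<tr>
<td><span>Babergh</span></td>
<td><span>04/05/2017 </span> at <span> 12:00</span></td>
<td><span>876</span></td>
<td><a href="http://ratings.food.gov.uk/OpenDataFiles/FHRS297en-GB.xml">English language</a></td>
</tr>
<tr>
<td><span>Basildon</span></td>
<td><span>06/05/2017 </span> at <span> 12:00</span></td>
<td><span>1,134</span></td>
<td><a href="http://ratings.food.gov.uk/OpenDataFiles/FHRS109en-GB.xml">English language</a></td>
</tr>
</table>
<h2>Example Region</h2>
<table>
<tr>
<td><span>Exampleton</span></td>
<td><span>01/01/2017 </span> at <span> 12:00</span></td>
<td><span>42</span></td>
<td><a href="http://example.com/placeholder.xml">English language</a></td>
</tr>
</table>
"""


def region_lines(html):
    """Return the output lines: a region name, then one line per table row."""
    soup = BeautifulSoup(html, 'html.parser')
    lines = []
    for h2 in soup.find_all('h2'):
        # Next sibling that is a tag (whitespace strings are skipped).
        table = h2.find_next_sibling(True)
        if table is None or table.name != 'table':
            continue  # this h2 is not directly followed by a table
        lines.append(h2.get_text(strip=True))
        # Only the rows of this h2's own table, not every tr in the document.
        for tr in table.find_all('tr'):
            tds = tr.find_all('td')
            link = tr.find('a', href=True)
            if len(tds) < 3 or link is None:
                continue  # header row or malformed row
            name = tds[0].get_text(strip=True)
            count = int(tds[2].get_text(strip=True).replace(',', ''))
            lines.append('%s, %s, %d' % (name, link['href'], count))
    return lines


if __name__ == '__main__':
    for line in region_lines(SAMPLE_HTML):
        print(line)
```

Reading the href attribute directly also avoids slicing str(a) at fixed positions, which would break as soon as a link has a different length. To use this on the real page you would pass the downloaded markup to region_lines and write the returned lines to the file; whether the first six h2 elements still need skipping depends on the page, so treat that part as untested.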
You are getting duplicates for Wales because the source itself contains duplicates.
