Python: parse HTML and produce a tabular text file


The problem: I want to parse an HTML page and produce a tabular text file like this:
East Counties
Babergh, http://ratings.food.gov.uk/OpenDataFiles/FHRS297en-GB.xml, 876
Basildon, http://ratings.food.gov.uk/OpenDataFiles/FHRS109en-GB.xml, 1134
...
...
What I get instead:
Only East Counties appears in the text file, so the for loop fails to write each new region. My attempt follows the HTML excerpt.
HTML code:
The full markup is on this HTML page; the excerpt below is the part that corresponds to the table above:
<h2>
East Counties</h2>
<table>
<thead>
<tr>
<th>
<span id="listRegions_lvFiles_0_titleLAName_0">Local authority</span>
</th>
<th>
<span id="listRegions_lvFiles_0_titleUpdate_0">Last update</span>
</th>
<th>
<span id="listRegions_lvFiles_0_titleEstablishments_0">Number of businesses</span>
</th>
<th>
<span id="listRegions_lvFiles_0_titleCulture_0">Download</span>
</th>
</tr>
</thead>
<tr>
<td>
<span id="listRegions_lvFiles_0_laNameLabel_0">Babergh</span>
</td>
<td>
<span id="listRegions_lvFiles_0_updatedLabel_0">04/05/2017 </span>
at
<span id="listRegions_lvFiles_0_updatedTime_0"> 12:00</span>
</td>
<td>
<span id="listRegions_lvFiles_0_establishmentsLabel_0">876</span>
</td>
<td>
<a id="listRegions_lvFiles_0_fileURLLabel_0" title="Babergh: English language" href="http://ratings.food.gov.uk/OpenDataFiles/FHRS297en-GB.xml">English language</a>
</td>
</tr>
<tr>
<td>
<span id="listRegions_lvFiles_0_laNameLabel_1">Basildon</span>
</td>
<td>
<span id="listRegions_lvFiles_0_updatedLabel_1">06/05/2017 </span>
at
<span id="listRegions_lvFiles_0_updatedTime_1"> 12:00</span>
</td>
<td>
<span id="listRegions_lvFiles_0_establishmentsLabel_1">1,134</span>
</td>
<td>
<a id="listRegions_lvFiles_0_fileURLLabel_1" title="Basildon: English language" href="http://ratings.food.gov.uk/OpenDataFiles/FHRS109en-GB.xml">English language</a>
</td>
</tr>
My attempt:
from xml.dom import minidom
import urllib2
from bs4 import BeautifulSoup

url='http://ratings.food.gov.uk/open-data/'
f = urllib2.urlopen(url)
mainpage = f.read()
soup = BeautifulSoup(mainpage, 'html.parser')
regions=[]
with open('Regions_and_files.txt', 'w') as f:
    for h2 in soup.find_all('h2')[6:]: #Skip 6 h2 lines
        region=h2.text.strip() #Get the text of each h2 without the white spaces
        regions.append(str(region))
        f.write(region+'\n')
        for tr in soup.find_all('tr')[1:]: # Skip headers
            tds = tr.find_all('td')
            if len(tds)==0:
                continue
            else:
                a = tr.find_all('a')
                link = str(a)[10:67]
                span = tr.find_all('span')
                places = int(str(span[3].text).replace(',', ''))
                f.write("%s,%s,%s" % \
                        (str(tds[0].text)[1:-1], link, places)+'\n')
How can I fix this?

I'm not familiar with the Beautiful Soup library, but judging from the code, on each iteration of the h2 loop you traverse every tr element in the document. You should instead traverse only the rows of the table that belongs to that specific h2 element.
Edited:
After a quick look at the Beautiful Soup docs, it looks like you can use .next_sibling, since each h2 is always followed by its table, i.e. table = h2.next_sibling.next_sibling (called twice because the first sibling is a string containing whitespace). From the table you can then traverse all of its rows.
The reason you are getting duplicates for Wales is that there actually are duplicates in the source.
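To make the sibling idea concrete, here is a minimal sketch of how it could look. It is not tested against the live page: it runs on an inline sample modelled on the excerpt in the question, and the second region in that sample is a made-up placeholder added only to show that rows no longer leak between regions. The names SAMPLE_HTML, table_after and region_lines are my own, not part of Beautiful Soup. Instead of calling .next_sibling twice, the helper walks .next_siblings and skips whitespace strings, which assumes the table is the first tag after its h2.

```python
from bs4 import BeautifulSoup, Tag

# Inline sample modelled on the question's excerpt.
# "Placeholder Region" and its row are hypothetical, not real data.
SAMPLE_HTML = """
<h2>
East Counties</h2>
<table>
<thead>
<tr><th><span>Local authority</span></th><th><span>Last update</span></th>
<th><span>Number of businesses</span></th><th><span>Download</span></th></tr>
</thead>
<tr>
<td><span>Babergh</span></td>
<td><span>04/05/2017 </span> at <span> 12:00</span></td>
<td><span>876</span></td>
<td><a href="http://ratings.food.gov.uk/OpenDataFiles/FHRS297en-GB.xml">English language</a></td>
</tr>
<tr>
<td><span>Basildon</span></td>
<td><span>06/05/2017 </span> at <span> 12:00</span></td>
<td><span>1,134</span></td>
<td><a href="http://ratings.food.gov.uk/OpenDataFiles/FHRS109en-GB.xml">English language</a></td>
</tr>
</table>
<h2>Placeholder Region</h2>
<table>
<tr>
<td><span>Placeholder Council</span></td>
<td><span>01/01/2017 </span> at <span> 12:00</span></td>
<td><span>1,000</span></td>
<td><a href="http://example.com/placeholder.xml">English language</a></td>
</tr>
</table>
"""


def table_after(h2):
    """Return the table that is the first tag after h2, or None."""
    for sibling in h2.next_siblings:
        if isinstance(sibling, Tag):  # skip whitespace strings
            return sibling if sibling.name == 'table' else None
    return None


def region_lines(html):
    """Build the output lines: a region name, then one line per row."""
    soup = BeautifulSoup(html, 'html.parser')
    lines = []
    for h2 in soup.find_all('h2'):
        table = table_after(h2)
        if table is None:
            continue  # heading that is not followed by a table
        lines.append(h2.get_text(strip=True))
        # Only the rows of this region's table, not every tr in the page.
        for tr in table.find_all('tr'):
            tds = tr.find_all('td')
            link = tr.find('a')
            if len(tds) < 3 or link is None:
                continue  # header row or unexpected layout
            name = tds[0].get_text(strip=True)
            count = int(tds[2].get_text(strip=True).replace(',', ''))
            # Read the href attribute instead of slicing str(a).
            lines.append('%s, %s, %d' % (name, link['href'], count))
    return lines


if __name__ == '__main__':
    for line in region_lines(SAMPLE_HTML):
        print(line)
```

To produce the file from the question, you would pass the downloaded page to region_lines and write each returned line followed by '\n' to Regions_and_files.txt. Because headings without a table right after them are skipped, the [6:] slice on the h2 list may no longer be needed, but I have not checked that against the real page.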

